Operations team & Sites

Europe/London
EVO - GridPP Operations team meeting

Description
- This is the weekly GridPP ops & sites meeting.
- The intention is to run the meeting in Vidyo: https://vidyoportal.cern.ch/flex.html?roomdirect.html&key=zXhsqAxVnaT6
-- The PIN is 1234.
- To join via phone see http://information-technology.web.cern.ch/services/fe/howto/users-join-vidyo-meeting-phone for dial in numbers.
-- The London (UK) service is on +442030510622
-- The meeting extension is 9308582.
Apologies:

GridPP Operations Minutes

Approximate Attendance List
Alessandra Forti
Andrew Lahiff
Andrew McNab
Brian
Daniela
David C
Duncan Rand
Elena
Ewan McM
Federico Melaccio
Gang Q
Gareth R
Gareth Smith
Govind
hepvc (Chris W)
Ian Loader
Jeremy Coles
John Bland
John Hill
Matt D
Matt RB
Oliver S
Raja Nandakumar
Raul
Rob H
Rob F
Robert Frank
Sam S
Steve Jones
Tom Whyntie


Agenda interpolated with minutes

Tuesday, 1 July 2014
11:00 - 11:20 Experiment problems/issues 20' 
Review of weekly issues by experiment/VO

- LHCb

No update. Nothing to report.

- CMS
 (some update in chat)
Daniela: UK CMS meeting on Thurs/Fri at Imperial (Physics, not computing). But if enough computing people turn up, it could split off ;)
CMS issues - Bristol just has Winnie atm (and just a bit of her); tickets are waiting for Luke to return to deal with them.


- ATLAS

(report in attached material in Indico)

Elena:
Multicloud 
Production system under validation, work in progress.
Site availability for analysis: BHAM and RALPP are having problems.
Glasgow Frontier-Squid needs to be added to AGIS (working with Alastair Dewhurst on this).
Lancs, Liverpool and Glasgow saw peak load on their squids from lots of short analysis jobs.

David Crooks: load has indeed dropped off a bit, but we're still interested in having a second frontier squid.

(There was some discussion of how usual it was to have more than one frontier - Chris Walker noted that it would be difficult to distinguish them if they were DNS load-balanced sets)

Alessandra noted that there will be more multicore activity while we wait for the next release. (50 million events on old releases.)

Other VOs:

Chris W:
HyperK has requested additional sites to support them (triggered by QMUL being their only site in the UK). ECDF and RHUL are the obvious other sites, outwith the T2K-supporting sites that would be the default. Data will be staged out to iRODS (@QMUL, not Grid), but there will be a need for "a little local storage, perhaps 2TB".
Need to discuss spacetokening.
Need CVMFS (from RAL repo).

cern@school have run jobs and are experimenting with WebDAV access to storage. (If DIRAC supports WebDAV Federation, we'd like to do that.)

Tom W: running cern@school jobs with DIRAC & CVMFS has been very good, as has the data access via WebDAV. (Really good for schools, as a web interface is natural.) We have a news item on progress on the GridPP website.

(There was some discussion about how HyperK should be using data: is the stage-out model really best suited to them, versus writing directly across the network? It depends on their growth rate.)


11:20 - 11:50 Meetings & updates 30' 
With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest

- General updates

Michel Jouvin has confirmed that the Sept GDB will be on Clouds.
EGI Ops meeting: Security updates talk of interest. EGI review happening 2/3 July. GFAL + lcg_utils go out of support later this year. Operational level agreement defined for Core Services. NGI updates: most NGIs seemed unclear as to what they were doing and how they were arranging resources (Cloud v Grid). Resource provisioning via e-Grant system (4 active pools, including one from Brunel). Review of SAM tests, reconsolidating of monitoring infrastructure.

Weekly T1 Liaison meeting. 

CA Notification: Legacy OpenCA interface will be closed mid-July.

High load on Squid/Frontier discussed already.

WLCG Workshop.

Brian: asked about GFAL/lcg_utils deprecation. (It'll be replaced by GFAL2, which has 90% of the functionality but is weaker re LFCs.)


- WLCG ops coordination

Multicore Task Force Meeting this afternoon.
Alessandra: (this will be on the experience of the few sites who've run test CMS workloads for the past month or so)

Middleware Readiness Working Group tomorrow, discussion of testing site packages (would be nice to have sysadmins present - developed a non-Pakiti solution...)

 

- Tier-1 status

Gareth S: declared a warning on the site for tomorrow: Generator/UPS load test, 10am-11am.
Working through CASTOR upgrades - discussing with ATLAS for 8th July. 

T2K + tail-end of CMS are the only ones using FTS2 currently. (T2K intend to move to FTS3, no reason to stay.)

- Storage and data management

- Accounting

Still no HEPSPEC06 figures for EFDA-JET or UCL. Could they be prodded?

Steve Lloyd's Metrics table - is it still accurate? Are there any issues?

APEL - RHUL, MAN, DURHAM publishing delays? Alessandra noted that they're migrating to EMI3 at Manchester.

- Documentation

Currently deciding how to reconfigure the KeyDocs.


- Interoperation

Ops meeting today cancelled because of EGI overview. T2 EMI2 decommissioning update - reminder: as of today, any site which is still on EMI2 and not in downtime should not be in that state. (There might have been an issue with UCL, which is now fixed?)

David C: nothing more to add.

- Monitoring

David C: consolidation update, there's a meeting on Friday. Will report back after that.
In the vein of the wiki page on Batch system plans etc, will be setting up a wiki page to collect site monitoring configuration (what packages do you use - Nagios, Ganglia, etc...)
This is just in the way of a survey to see what the situation currently is.

- On-duty

Quiet last week. EMI2 ticket with Sussex still?
UCL ticket re storage.

- Rollout

Early Adopter Table needs updated.

- Security

Debrief last week of Security Challenge.
Ewan noted that it was warmly received by everyone involved, and definitely of use in future.

Update from Sven in OMB on Challenges run against NGIs. 
Talk about doing something similar and site focussed. (should we review our site contact details on GOCDB? Everyone do this!)

Discussion about moving ARGUS instances to Regional ARGUS.

EGI-AMBER report last week regarding incident at IN2P3 (this was the final report).

Security Team Meeting 16th July.

Duncan noted the perfSONAR vulnerability needs to be dealt with by upgrading. We've been asked to report back to confirm when people have done so.
(Review next meeting)
Discussion about need for more GOCDB fields for perfSONAR to report more details for tracking.

- Services

No important events (other than the previously mentioned T1 UPS test and the CA closing OpenCA)


- Tickets

Monthly Tickets Update (early).
See Matt's email.

Sussex - closed EMI3 ticket, space tokens; only SNO+ ticket remains.
Bristol - 4 CMS tickets. Seems to be some confusion.
ECDF: glExec ticket continues
Man: EMI2 APEL (in progress)
Lancs: LHCb issue with old cluster (probably just going to turn them off). glExec. perfSONAR issues (not a bottleneck, as iPerf is fine; a mysterious issue)
UCL: 3 tickets - Nagios probe fails, glExec (no updates for a while), perfSONAR (hardware failure, but no update since)
RHUL: dead pool node (Govind replacing motherboard), Biomed asking for GridFTP access for reading namespace (no word since Govind's initial attempt). EMI2 APEL ticket. Publishing issue (probably due to dead pool node).
QMUL: Biomed issues with HTTPS (this is waiting for StoRM update)
Imperial: Biomed wanting GridFTP access for metadata. Being rather insistent. Of course, it's up to Imperial what they do (are "thinking about that one").
New Cloud Site: ticket about VMs using proxies going straight to stratum 0 - shoal at Oxford sees accesses from Imperial machines (so maybe it now works). Ewan noted that the ticket was Steve Traylen at CERN commenting on their services being slammed (this should not be happening), but it would still be useful for Imperial to have shoal so it can do its own.
EFDA-JET: very old LHCb problem. JET have probably given up.
T1: Vidyo router fw ticket. Inconsistent BDII/SRM storage (for LHCb), with some debate over technicalities. CMS pilots losing connection to submit hosts at RAL - looks similar to one of the Bristol tickets (106325). Publishing - RAL Castor is "not publishing a sane version" (Brian suspects a rogue colon).


- Tools

Monitoring issue last week on 26th regarding ARC CEs (was actually due to an update on one of the monitoring boxes' default storage endpoints, now fixed).

- VOs

 

- Site updates

 

11:50 - 12:00 Discussion 10' 
- Items to cover at the WLCG workshop: https://indico.cern.ch/event/305362/other-view?view=standard.

First session: Update on WLCG overall status and operations. Update on T0 (probably re: efficiency issues they've been having, and the development of the "agile" infrastructure.) Multicore jobs, database services, the usual GDB topics.

Tues: medium-term evolution (Cloud resources, future of networking, IPv6, network monitoring and metrics (the new WLCG co-ordination group), data and Storage evolution, new processor architectures)

Experiment session. Plans for Run2. Request to have computing models described for Runs 3/4 unlikely to bear fruit.

Weds: Future evolution. Request to build common ground for everyone outside HEP (bonds with Astro and Astrophysics). Future of EGI update. Future of WLCG Update.

 


- General updates.

12:00 - 12:01 AOB 1' 

Chris W asked about the banning of ARGUS services. Ewan noted that the UKI NGI ARGUS sends only what the NGI group knows about (and anything upstream). Discuss out of band.

- Dissemination updates
- There will be no ops meeting next Tuesday due to the Barcelona workshop.

Actions Review:

"Use of Robot Certificates", "Renewing Server Certs without Browser", "Backup LFC at T2?" (possibly not needed anymore; does this apply to the Dirac File Catalogue as well?), "Future of CEs/Batch system integration" (needs closed), "LFC / SE consistency" (update to checking the Dirac File Catalogue).
Rest of tickets closed.

Gareth - some changes to our procedures, discussed the "national services thing". Discussed internally if we would change aliases if we brought a single service down.
(action closed)

Puppet Config to make CVMFS changes (closed?) Chris noted that there was much complaining in Ops meetings, and that after prodding Steve Traylen an update appeared. 


-- AOB

Dissemination: Tom W - some followup items on CVMFS and DIRAC based on the cern@school work. 
If anyone has any news they want reported...


Reminder: Registration open for GridPP33.

No meeting next week thanks to WLCG Workshop.


-
Chat Log
-

Jeremy Coles: (01/07/2014 10:56)
Sam is taking minutes today.
Brian; Daniela; David C; Elena; Federico; Gareth R; Jeremy C; Matt RB; Rob F; Sam; Alessandra; rf; Ewan M
Daniela Bauer: (11:00 AM)
Is anyone talking ?
Jeremy Coles: (11:00 AM)
Yes
Daniela Bauer: (11:01 AM)
Uh oh
Ha, I can hear Elena now
OK CMS Collaboration meeting
https://indico.cern.ch/event/318319/
Jeremy Coles: (11:02 AM)
Andrew L, Gang, hepvc, John B, John H, Chris W, Duncan R, Matt D, Ian L
Christopher John Walker: (11:02 AM)
Only available to CMS people that indico page
Jeremy Coles: (11:02 AM)
What are the highlights on it?
Daniela Bauer: (11:04 AM)
It's mainly Physics, but we might have a computing chat if enough computing people turn up

Alessandra Forti: (11:08 AM)
in AGIS
Elena Korolkova: (11:10 AM)
https://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteviewhistory?columnid=10102#time=24&start_date=&end_date=&values=false&spline=false&debug=false&resample=false&sites=all&clouds=UK
If Bham and RALPP want to recalculate their availability for atlas, please send email to cloud support atlas-suppor-cloud-uk@cern.ch
Tom Whyntie: (11:13 AM)
Yes
Elena Korolkova: (11:14 AM)
it's availability based on atlas analysis jobs running (site is online), which affects data distribution (not Site availability based on SAM tests)
Tom Whyntie: (11:15 AM)
CERN@school simulations run with CVMFS and DIRAC: http://www.gridpp.ac.uk/news/?p=3332
Jeremy Coles: (11:18 AM)
Andrew M, Govind, Oliver, Steve J, Tom
Govind: (11:20 AM)
I will discuss in group and let you know
Steve Jones: (11:21 AM)
We support most Approved VOs. We'll add HyperK.
Ewan Mac Mahon: (11:22 AM)
I'm completely happy in principle to add HyperK to the CPU resources. I won't be happy about adding them to the storage until or unless we've got them nailed into a space token.
I wouldn't mind them running on the CPUs and staging out directly to someone else's storage though.
Matt Doidge: (11:23 AM)
Our pipes certainly are big enough.
Ewan Mac Mahon: (11:25 AM)
And it makes the management easier, and if all they're trying to do is get the stuff into QMUL iRODS anyway, they might as well save themselves the indirection.
Jeremy Coles: (11:31 AM)
http://pprc.qmul.ac.uk/~lloyd/gridpp/metrics.html
Alessandra Forti: (11:32 AM)
I was kicked out. is it only me? 
Ewan Mac Mahon: (11:33 AM)
Of the meeting? I think it probably was just you; there's only been one recent 'joining bing'.
Alessandra Forti: (11:34 AM)
sorry
it happened again
APEL: we are still moving to EMI3 we have all the pieces now
I just have to republish everything.
Jeremy Coles: (11:36 AM)
Thanks
Matt Raso-Barnett: (11:38 AM)
ok, we should be up to date though
Christopher John Walker: (11:39 AM)
We SHOULD make a security workshop a regular event. 
Ewan Mac Mahon: (11:39 AM)
Harder to justify the spend on the grounds of 'fun' though, I'd have thought :-)
Matt Doidge: (11:40 AM)
Blag it as team building?
Ewan Mac Mahon: (11:40 AM)
If we pitch 'team building' to the PMB they'll want to do it on the top of some godforsaken mountain though.
In a tent.
In the rain.
Or snow.
Christopher John Walker: (11:41 AM)
I like being taken to the top of mountains. 
Ewan Mac Mahon: (11:43 AM)
So, that's a plan then - next year's security workshop in the Snowdon cafe.
Daniela Bauer: (11:49 AM)
The CMS date issue derives from the fact that in a CMS ticket you can set the time/date of the problem by hand and I assume the shifters either put in garbage and/or submitted them from somewhere where GGUS couldn't pick up the correct time and date (it defaults to the current time in UTC as far as I can tell)
Govind: (11:52 AM)
looks like mic not working for me..
Jeremy Coles: (11:52 AM)
Hi Govind - do you have any comments on the RHUL tickets please? In particular the APEL update? Thanks.
Govind: (11:53 AM)
biomed ticket - Frank has to check again
APEL - I am setting up a new VM for EMI-3 and have asked the network guy to open firewall ports
So hopefully it will be ready this week
Jeremy Coles: (11:54 AM)
Thanks.
https://indico.cern.ch/event/305362/other-view?view=standard

https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items
Raja: (12:12 PM)
From DIRAC chats, apparently there is also going to be some effort to do some extra developments on the Dirac File Catalog over the next 6 months or so.
Jeremy Coles: (12:17 PM)
http://www.gridpp.ac.uk/gridpp33/registration.html
Tom Whyntie: (12:17 PM)
Thanks, bye 
