
Operations team & Sites

Europe/London
EVO - GridPP Operations team meeting

Description
This is the weekly GridPP ops & sites meeting. The intention is to run the meeting in Vidyo: https://vidyoportal.cern.ch/flex.html?roomdirect.html&key=zXhsqAxVnaT6 (the PIN is 1234). To join via phone see http://information-technology.web.cern.ch/services/fe/howto/users-join-vidyo-meeting-phone for dial-in numbers; the London (UK) number is +442030510622 and the meeting extension is 9308582.

Apologies:

Attendance:

Andrew Washbrook, Brian Davies, Daniela Bauer, David Crooks (minutes), Elena Korolkova, Ewan MacMahon, Federico Melaccio, Gang Qin, Gareth Smith, Gordon Stewart, Govind  Songara, Ian Loader, Ian Neilson, Jens Jensen, Jeremy Coles (chair), John Bland, John Hill, Kashif Mohammed, Matt Doidge, Oliver Smith, Pete Clarke, Raja Nandakumar, Raul Lopes, Robert Fay, Robert Frank, Sam Skipsey, Steve Jones, Terry Froy


LHCb, Raja: Have some jobs running on the grid. Most UK sites fine. There is a major problem at RALPP, discussed in another section. The only other site with possible problems (said with care) is Birmingham: we submit pilots regularly, the pilots do not run, then we delete them and resubmit fresh ones. Fairshare issue? Other than that, the T1 had one user (Andrei) who could not submit jobs; now fixed. That is pretty much all; DiRAC update in 10-14 days.

Jeremy: Can follow up with Birmingham; it is probably a fairshare issue.

Raja: Will talk to Mark Slater later. 

Jeremy: Andrei? What was the underlying reason?

Raja: Argus. 2 issues; can Gareth pick up on that? My tentative feeling was: older versions of software, and a really hard flushing of the ARGUS servers.

Gareth S: Was this the unbanning? (Yes.) The user had been banned some time ago. The problem was from submitting jobs from a different VO. The unbanning didn't go as easily as hoped. Pretty sure this was the site ARGUS.

Jeremy: Details of RALPP issue?

Raja: I think it's something to do with VM hardware underlying ARC CEs at T2.

Jeremy: Followup on resolution.

CMS, Daniela: As far as I'm aware we're fine.

ATLAS, Elena: Tickets: Andy just closed the deletion-errors ticket at ECDF; services were restarted, which solved the problem. There is a ticket for MAN-HEP with deletion errors: a known problem with one disk server which is down, Alessandra is following this. News from the last ADC meeting: a report on distributed analysis status, including an analysis of failed analysis jobs; most wallclock consumption comes from cancelled jobs and lost heartbeats. There was a report on user jobs using high memory. Most successful jobs consume a reasonable amount of memory, < 2GB. Some successful jobs use > 4GB and are not killed because the site hasn't set a max memory limit. User jobs can have different requirements, some genuinely needing more than 2GB, some due to memory leaks; it is difficult to educate users. The suggestion is to raise the memory limit; jobs which consume high memory will then be killed automatically. There were slides on the new FTS. There are new Athena and ROOT analysis versions, and the automatic procedure will be to replicate one (small) dataset to all sites that run the tests; UK sites are OK with these analysis tests. There was also discussion about sites' storage consistency checks; Alastair sent slides about consistency checks to the storage list, to be discussed at tomorrow's storage meeting.

Jeremy: Could you put a link in the chat to the slides you're referring to? What are the consequences of the memory limit discussion? It sounded like different issues were developing in terms of what limits sites were using and what issues jobs were having.

DiRAC, Jens: Spent some time debugging expiring proxies. With all the files being transferred, the proxy would expire before transfer takes place. Now have proxies without VOMS attributes, can have longer proxies (a week or something). Main priority is to get as much data ingested as possible. Then we can sort out any other details after. 70-80TB which is not a lot. Lydia had a user with millions of files. 

Brian: In total there are 88 million files, of which one user accounts for 4.5 million.


Jens: In batches of 2000, could take some time - encouraging Lydia to submit as much as possible, at least have sorted out the proxy problem. 
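
A minimal sketch (not the DiRAC ingest tooling itself) of the kind of check that avoids the problem Jens describes: read the remaining lifetime of the current proxy before submitting a batch, so a long transfer is not started with a proxy about to expire. The default proxy path and the python-cryptography dependency are assumptions, not taken from the minutes.

    import os
    from datetime import datetime, timezone
    from cryptography import x509

    # Default location of a grid proxy; adjust if X509_USER_PROXY is set.
    proxy_path = "/tmp/x509up_u%d" % os.getuid()
    with open(proxy_path, "rb") as f:
        proxy_cert = x509.load_pem_x509_certificate(f.read())  # first cert in the file

    remaining = proxy_cert.not_valid_after.replace(tzinfo=timezone.utc) - datetime.now(timezone.utc)
    hours = remaining.total_seconds() / 3600.0
    print("Proxy valid for another %.1f hours" % hours)
    if hours < 24:
        raise SystemExit("Proxy too short-lived for a long transfer batch; renew it first")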

Jeremy: Not a huge amount of data.

Jens: No - should have been ticking along while other things were happening. 

Jeremy: Lydia's document, what's the availability? Is this DiRAC work heading in the right direction?

Jens: Document: was going to circulate to other DIRAC sites, could ask if willing to share more generally. Brian was surprised that the recipe that worked for T2K didn't work for this. Some of the proxies had 60 times shorter lifetime than expected from the documentation. Some things wrong in the FTS documentation. We know about this now and can compensate, but we should have discovered this much sooner. 

Jeremy: Plan is to continue with Durham then look at other sites?

Jens: Wait until we have all the data, 2PB, then look at other sites. It also depends on the number of files; a batch of 2000 files may not saturate the bandwidth.

Brian: In general we have a 0.25 GB/sec rate, so can do 20TB/day. 2PB is going to take 100 days running at maximum. Assuming we run at 40-50% of that, assume 6 months.

Jeremy: I was expecting it to be a bit quicker than that. I guess the question is, how will others perceive this? A lot of data. How confident are you in 20TB/day? 

Brian: 20TB/day is 250MB/sec. Will look up achieved speeds.
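
For reference, the rates quoted can be sanity-checked with a back-of-envelope sketch; it reproduces the 20TB/day, ~100 day and ~6 month figures (the 45% duty cycle is an illustrative middle of Brian's 40-50% range).

    rate_mb_s = 250.0                        # MB/s, the FTS rate Brian quotes
    per_day_tb = rate_mb_s * 86400 / 1e6     # ~21.6 TB/day, i.e. roughly "20 TB/day"
    total_tb = 2000.0                        # 2 PB to ingest
    days_at_full_rate = total_tb / per_day_tb             # ~93 days, close to the 100-day estimate
    days_at_45_percent = total_tb / (per_day_tb * 0.45)   # ~206 days, roughly 6-7 months
    print(per_day_tb, days_at_full_rate, days_at_45_percent)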

LIGO, Catalin: (Gareth checked in with him: no news to report)

LOFAR, George: (Gareth: there's a meeting this afternoon)

LSST, Alessandra: 

Pete Clarke: LSST jobs started running at ECDF. 

Andrew W: Running at multiple sites, Manchester, maybe Liverpool?

Steve Jones: Several hundred successful jobs, looking at tuning issues, fairshare.

Jeremy: Good, progress. 

LZ, Dave Colling: 

Daniela: (Dave's away) Enterprising user submitted a couple of jobs through WMS, can't add them to DiRAC for some reason. Also trying to register in the EGI Ops portal like any other VO, and this has been difficult as they can't register their site as it's US based. There's a GGUS ticket. 

Jeremy: Is it the age of VOMS servers or the location?

Daniela: Location. To register in the EGI ops portal you have to register your VOMS server in the GOC DB, to register in GOC DB you have to have a site, they haven't got a site, and it's all downhill from there. It has turned Kafka-esque. [see transcript]

Pete: Is there a quick resolution, or is there no known solution, so we need to invent case law?

Daniela: For the VOMS server we need to invent case law; the other things can be resolved more quickly.

Jeremy: I'll look at the ticket after the meeting, surprised EGI aren't more engaged on this. 

Ewan: Could we cheat and register it against a different site? (RAL eg)

Jeremy: Under the NGI we have NGI services.

Ewan: I think the NGI services have to be resident at a site as well?

Daniela: I wouldn't put a server at my site that I didn't have access to; that doesn't feel right.

Elena: With 2 VOMS servers it will be a mess.

Jeremy: It wouldn't be 2 VOMS servers, it would be 1; it's just that we can't register it because it's in the States. Follow up offline and look into it as a suggestion.

Ewan: NGI service not registered at a site would be interesting. 

UKQCD: Some communications with Craig but haven't had an update recently.

GalDyn: No update

PRaVDA: Follow up offline.

Jeremy: Note Brian's comment (in the chat) that 250MB/sec was achieved for a 2 hour period.

DIRAC/Cloud: (Robert: Andrew McNab on holiday) Jeremy: Delays look like fairshare issues, no particular problems. 

Meetings and updates (see https://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest)
====================

ROOT CA, Jens: The bottom line was that I deployed a root CRL at 0919 BST, one that was not yet valid. The impact of that would probably not be noticeable because fetch-crl refuses to install a CRL which is not yet valid; in theory sites using fetch-crl should not have installed it. An error was reported by RAL-T1 at around 1130, when I noticed that the CRL was not yet valid; it was then rolled back to the previous CRL and I sent a notice to TB-SUPPORT and IGTF general. After ~2pm, once the new CRL became valid, I redeployed it. The root CRL is issued once a year and the root CA is offline, which means we need to get a laptop, normally locked away somewhere, booted up with a read-only live OS on a CD. We take a box out of the safe, take it and the laptop to my office, normally with two people working at the same time with cross checks. We knew that we had to sign a new CRL, with only a month left, which is too little; we normally issue once a year with a two-year lifetime on the root. I had signed one earlier but gave it an MD5 signature rather than SHA-1.

After coming back from holiday, I made some mistakes in the signing, all of which contributed. I didn't check the time on the laptop; we don't check it often. I checked the CRL after signing, but didn't check the start time (which would have revealed the problem). When installing it I couldn't remember the exact command for checking its validity against the CA certificate (the same check that fetch-crl runs), which would have helped.

Basically I should have waited a few days, but with only a month left it was fairly urgent. There shouldn't have been much impact, but some people did notice, so we might want to know why they saw problems. I also want to improve the checklist, and get a script that checks manually deployed CRLs. The final thing is that I checked the logs: 36100 downloads, coming from 9000 IPs, most once or twice, a few ~1000 times. If these were running fetch-crl they wouldn't have installed the invalid CRL, and would have installed it once it became valid. The T1 saw the CRL errors; not sure why.
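
For the minutes, a sketch of the kind of check Jens mentions (the same validity and signature test that fetch-crl applies), using the python-cryptography package; the file names are illustrative assumptions.

    from datetime import datetime, timezone
    from cryptography import x509

    with open("root-ca.crl", "rb") as f:
        crl = x509.load_pem_x509_crl(f.read())
    with open("root-ca.pem", "rb") as f:
        ca = x509.load_pem_x509_certificate(f.read())

    now = datetime.now(timezone.utc)
    not_yet_valid = now < crl.last_update.replace(tzinfo=timezone.utc)   # the failure mode described above
    expired = now > crl.next_update.replace(tzinfo=timezone.utc)
    signed_by_ca = crl.is_signature_valid(ca.public_key())

    print("lastUpdate:", crl.last_update, "nextUpdate:", crl.next_update)
    print("not yet valid:", not_yet_valid, "expired:", expired, "signature OK:", signed_by_ca)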

Steve Jones: Most important point from my perspective is the checklist. If the list is sufficiently detailed and followed, can incrementally eliminate errors and improve process - thanks for mentioning that.

Jeremy: Mon/Thurs WLCG Ops. 

Matt: DPM became a bit flaky, load issue? Sam pointed me to DPM tuning page, packed with new tuning tips, recommended. 

Developers@CERN: Jeremy gave a link advertising the forum for sharing ideas and best practice.

Multicore inefficiencies:

Andrew (RAL): CMS slots not well used. Gareth/Glasgow: concern around CPU previously used for ATLAS not being used because of mismatches.

Jeremy: Any approach recommended?

Steve: They are aware of that issue with sites and a multicore/single core imbalance.

Jeremy: Seems to affect some sites but not others?

Steve: It is a problem if node core counts are not an integer multiple of 8.

Jeremy: Should be homogeneous across sites?

Steve: An 8-core node: one multicore job plus no singles. A 12-core node: one multicore job plus 4 single-core slots. Not all CPUs are a multiple of 8.
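
To illustrate the packing Steve describes, a small sketch (not any site's actual scheduler configuration); the 12- and 20-core cases are the ones flagged as at risk in the chat below.

    def pack(node_cores, mc_width=8):
        # Number of 8-core multicore jobs that fit, and leftover single-core slots.
        mc_jobs, singles = divmod(node_cores, mc_width)
        return mc_jobs, singles

    for cores in (8, 12, 16, 20, 24):
        mc, single = pack(cores)
        print("%2d-core node: %d x 8-core job(s) + %d single-core slot(s)" % (cores, mc, single))
    # 8 -> 1 + 0, 12 -> 1 + 4, 16 -> 2 + 0, 20 -> 2 + 4, 24 -> 3 + 0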

Sam: This is also complicated by differences in memory limits. Understood by Multicore TF. 

Jeremy: Outcome was request to keep many single core jobs in flow?

Matt/Sam: Not sure what the outcome was (speak to Alessandra).

Jeremy: What do we want the outcome to be?

Sam: Explicit understanding of what needs to be done to improve efficiency in filling sites? (Don't want to speak for Alessandra) 

Jeremy: Follow up another week.

Jeremy: RIPE Atlas probes: there are experience pages on how they are used in different contexts. Might be of interest to some people. We might look at what tests we want to run. We should come back to it in September?

Ewan: Do we have any probes in stock?

Jeremy: I think we have 2? But I can get more stock if I need to circulate them.

Ewan: Worth having a couple for JISC end to end initiative group?

Jeremy: UK oversubscribed, but maybe not on the academic side?

Jeremy: Recommending GitBook:

David C: Also recommend mkdocs as another lightweight Markdown static site documentation generator: http://www.mkdocs.org

Tom: Tried GitBook, it's beautiful, you can link it directly to a GitHub repo; may well trial moving some of the DiRAC/CERNVM documentation into that format, watch this space.
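
For the minutes, a minimal mkdocs configuration sketch showing how such a documentation site might be laid out; the site name and page names are hypothetical, and older mkdocs releases use 'pages:' where newer ones use 'nav:'.

    # mkdocs.yml - pages live as Markdown files under docs/
    site_name: GridPP Documentation
    pages:                      # called 'nav' in newer mkdocs releases
      - Home: index.md
      - FTS3 for DiRAC: fts3-dirac.md
    theme: readthedocs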

T-1 updates
===========

Gareth: Flag that this morning we announced an at-risk. Having run 2 weeks with the fix in for the pair of routers we have had a lot more stability; the exercise was that the primary router had been borrowed from PPD and we needed to put it back. We have a detailed set of network changes we need to make now that we have that sorted out; I'll put in a set of notes that people can read. Weekend issue: a separate problem - the primary router found that it couldn't connect upwards into the site, the second took over; we fell back to the primary and waited for a fix. The timing was bad, happening a few days after the change.

"Ongoing testing of worker nodes running a new configuration where they obtain the grid middleware via cvmfs. Will now move to rollout across more nodes."

A change other people might be interested in: the change of Linux kernel I/O scheduler on the CMS disk servers, which has improved the case where CASTOR was behaving badly for CMS pileup jobs.
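
For sites interested in trying the same thing, a sketch of how the per-device I/O scheduler can be inspected and changed via sysfs; the device name and the 'deadline' choice are illustrative, as the minutes do not record which scheduler RAL selected.

    # Requires root; the change is not persistent across reboots unless also
    # set via the kernel command line or a udev/tuned rule.
    SCHED = "/sys/block/sdb/queue/scheduler"

    with open(SCHED) as f:
        print("current:", f.read().strip())   # e.g. "noop deadline [cfq]"

    with open(SCHED, "w") as f:
        f.write("deadline")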

Documentation: Jeremy: Lydia's FTS3 doc - how do we save this? Is there a place we can put this - Jens?

Jens: GridPP website, wiki?

Jeremy: Want to make sure they are readily findable/searchable.

ACTION Steve talk to Jens about documentation storage.

David C: Advert for release of grafana 2.1 with interesting new features (including templating).

On duty: Jeremy: Handover report? (Andrew's away this week)

Rollout: Jeremy: Ewan and Raul, machine/job features work. Is there anything I need to push?

Ewan: Probably not. Had a look at the instructions; it looks like we're a big leap further forward in that there are now some instructions. Haven't deployed it in practice, but there's something we can have a go with. It's waiting for me to get on with it (as with Raul).

Jeremy: IC have deployed this, but it wasn't being used.
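
For reference, a sketch of how a payload might read machine/job features values once a site publishes them; the key names follow the draft WLCG machine/job features proposal (directories pointed to by $MACHINEFEATURES and $JOBFEATURES containing small key files) and are assumptions, not taken from the instructions discussed here.

    import os

    def read_feature(base_env, key):
        base = os.environ.get(base_env)
        if base is None:
            return None                      # feature not provided by the site
        try:
            with open(os.path.join(base, key)) as f:
                return f.read().strip()
        except OSError:
            return None

    print("HS06 per slot:", read_feature("MACHINEFEATURES", "hs06"))
    print("Wall limit (s):", read_feature("JOBFEATURES", "wall_limit_secs"))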

Security: Ian: General summary - reminder of work for security team meeting and review document.

Services: Jeremy: LHCOPN and LHCONE joint meeting in October

Tickets: Matt:

TIER 1
114992 Looks like it can be closed. The original problem was solved. The other problems are not RAL's problem per se.

EFDA-JET
115613 115448 115496 

Looking alright, working through these (the last not being a site issue)

CAMBRIDGE
115655 

Jeremy: How should these alarms be handled? 

Set on hold.

LIVERPOOL
114248 

Hasn't had an update for a while; Steve pointed out that the underlying problem is still there, wondering what to do next.

Steve: Shared file areas filling up. How do we deal with that? I could close the ticket, we need a policy that people need to write to their own areas. 

Matt: Might be useful for a site to have more power to clean up files based on a policy.

Steve: We should think about this.

Tools, Kashif: Things look OK.

Discussion
==========

Janet end to end initiative workshop, contribution from GridPP.

Pete: Just in case, the background. This new end-to-end SIG is a reincarnation of one from 2 years ago that I chaired one meeting for; no other meetings took place. Tim Chown has been employed to reincarnate it. He is an IPv6 expert from Southampton CS, seconded into Janet to run end-to-end performance things. The general idea is that he goes out to talk to communities to discuss their end-to-end performance issues and problems; he's a good guy.

He sent a message asking for talks. My suggestion is that we send in 2; I don't want to inhibit others. Talks might be ~15 minutes. It might be good to stand up and remind people that the LHC community has been doing this for a while. Could we say 2 talks are coming from GridPP rather than leave it to individuals? Two things I was thinking of: formal end-to-end performance, perfSONAR, technical monitoring; in addition, the scale of data management that we do, and practical applications like availability tests, not like perfSONAR but high level - the important message is to say what the applications see for their jobs.

Brian: That sounds good - possibly work on RIPE and IPv6?

Pete: 19th October, UCL. RIPE - yes. IPv6 interesting, but is that end to end performance? 

Jeremy: Put in chat the para about the purpose of this group (see transcript)

Pete: Tim is coming to GridPP - we could put in placeholders and leave it to GridPP35 to decide it exactly, while it would be good to decide who will go earlier - taking on board Ewan's suggestion of Duncan for perfsonar.

Jeremy: Issues and factors affecting end-to-end network performance. At what point does the network end and data start?

Pete: I would interpret it as: we are probably the most advanced, or if not, equal to only a few others in the country, at getting end-to-end between the applications.

Jeremy: Talking about end to end includes tuning.

Pete: I would say that should be our message - we care about what the applications see, we don't care per se what perfsonar sees.

Ewan: That might fit into the data-centric talk. How do we tune our TCP stacks? In a 15 minute talk it's going to be at a high level.

Pete: Not sure about the 15 minute, might be longer/shorter

Jeremy: Terry gives in chat a note about IPv6 performance (see transcript)

Brian: IPv6 WG are doing transfer tests over IPv6, maybe just a comment that we're looking at both.

Ewan: Might fit into "Duncan's talk" - have been interesting practical results, including in principle vs in practice IPv6 performance. 

Pete: Really good point, worth pointing out to people. I'll suggest 2 outline talks from GridPP, for which we can make detailed plans at GridPP35 - probably Duncan for one, and we need someone for the application-centric view of data transfer.

Pete: On IPv6 - Ewan, I want to contact you and Duncan, I've extended the IPv6 session, Dave Kelsey is down to talk about the high level stuff, could you and Duncan lead discussion on where we are with IPv6? All the Janet people will be there. Useful to have discussion from the floor. Is that OK?

Ewan: That's fine in principle.

Jeremy: Follow up on application talk offline. 

Actions
=======

LIGO strategy, tests, timeline:
-------------------------------

Wiki page created - useful to follow up on, action on each of these areas

Documents to ensure GridPP contacts in VO (and general MoU issues):
-------------------------------------------------------------------

There were several issues with implementing the new policy so that minor VOs could have GridPP members as part of them; a more general question is whether the AUP is the right place for this.

Ewan: The problem is that what we were aiming for was a bilateral agreement - we have an AUP, but we don't have the rights and responsibilities of both parties. Needs everyone's involvement to sort out.

AOB
===

Matt: SL5 wiki page? (see transcript)

Jeremy: We did ask people to fill in that page - was that just a general reminder?

Matt: Wasn't sure if that was something that people should be filling in?

Jeremy/David: Yes, it's something that people should be filling in - notably sites with SL5 DPM nodes.

Transcript
==========

Ewan Mac Mahon: (18/08/2015 11:05)
Can't see anyone from Birmingham here; might need to email/ticket them.
Federico Melaccio: (11:10 AM)
the problem at RALPP is that the network attached storage backend to ARC-CEs and Condor VMs is very slow. At the moment we do not know the actual cause for this behavior, but network is our main suspect since everything started with a network glitch on Sunday at 5 am
Ewan Mac Mahon: (11:13 AM)
It's not just got itself confused and negotiated down to 100Mbit on the interface or something like that?
Federico Melaccio: (11:15 AM)
it is hard to say because we have no log entries on the affected systems and the aggregated network bandwidth are just fine
bandwidths
Tom Whyntie: (11:21 AM)
No updates from GalDyn, to aid with flow
:-)
Daniela Bauer: (11:23 AM)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=115539
Matt Doidge: (11:25 AM)
Or could we mirror it?
Daniela Bauer: (11:25 AM)
now my audio has given up
I'll reconnect
Brian Davies: (11:26 AM)
Regarding vo.dirac.ac.uk FTS rates, have achieved ~250MB/s for a two hour period.
http://dashb-fts-transfers.cern.ch/ui/#date.from=201507200000&date.interval=0&date.to=201507220000&grouping.dst=%28country,site%29&grouping.src=%28country,site,host%29&p.bin.size=h&p.grouping=dst&server=%28lcgfts3.gridpp.rl.ac.uk,0%29&tab=transfer_plots&vo=%28nil,vo.dirac.ac.uk%29
Ewan Mac Mahon: (11:28 AM)
Just on dirac - if the problem is bajillions of small files, should we take a traditional backup approach of tar-ing things up and then FTSing the tarball?
This /is/ a backup, after all.
Daniela Bauer: (11:29 AM)
I have to admit "tarball" was the first thing to come to my mind as well...
Ewan Mac Mahon: (11:35 AM)
Not sure if the lesson here is more holidays for Jens, or no more at all.
Jens Jensen: (11:36 AM)
:-)
... or :-( ?
Ewan Mac Mahon: (11:37 AM)
TBH, I'm inclined to file something like that under 'shit happens' and just not worry about trying to change too much - whatever the systems are there's always the possibility to just do things wrong, and sometimes making things more complex just makes smaller failures more likely.
Matt Doidge: (11:37 AM)
https://svnweb.cern.ch/trac/lcgdm/wiki/Dpm/Admin/TuningHints 
<-- DPM tuning page, for the minutes
12 and 20 core nodes are the ones at risk here.
Ewan Mac Mahon: (11:41 AM)
The other option for a 12 core node is to run two eight-way jobs and just let the scheduler sort them out.
We don't have to give them dedicated cores if it's more harmful than helpful.
(memory requirements allowing)
Matt Doidge: (11:41 AM)
there goes my audio...
David Crooks: (11:45 AM)
http://www.mkdocs.org
Terry Froy [QM]: (11:45 AM)
There is a UKNOF meeting in Sheffield in September (https://indico.uknof.org.uk/conferenceOtherViews.py?view=standard&confId=34) where a RIPE probe ambassador will be present and dishing out probes.

John Bland: (11:48 AM)
"Ambassador, with these RIPEs you're really spoiling us!"
David Crooks: (11:51 AM)
http://docs.grafana.org/v2.1/guides/whats-new-in-v2-1/
raul: (11:53 AM)
I intend to work on it this week
I'll contact Ewan
Daniela Bauer: (11:56 AM)
If in doubt definitely deploy glexec even if only for your own peace of mind.
Tom Whyntie: (11:56 AM)
I have to go now but just to say thanks again to Steve J for going through the DIRAC docs on GitHub and for the feedback :-)
Ewan Mac Mahon: (11:57 AM)
Quite - if you can deploy glExec, I definitely think that you're better off as a site admin having it rather than not. It's very much a 'pro site' thing, not an awkward imposition.
David Crooks: (11:59 AM)
For the minutes, could I check who (Guest) 1234 is?
Jeremy Coles: (12:01 PM)
My guess is Steve J.
1234: (12:03 PM)
Yes - 1234 is my name for now.
Vidyo crashed and I hurried to type in the coordinates.
Jeremy Coles: (12:04 PM)
We will identify, document and share best practices on high performance networking, to raise awareness amongst Janet network connected communities of the issues and factors affecting end-to-end network performance
Ewan Mac Mahon: (12:07 PM)
I think Duncan on perfSonar is an obvious choice.
And someone (Brian/Jens/Sam?) on the more data centric what we're actually using it for stuff?
Terry Froy [QM]: (12:09 PM)
IPv6 typically gives better end-to-end performance over the Internet as fragmentation is not supported in IPv6 - it gives an incentive to ISPs to fix MTU issues rather than 'fixing' them by turning on 'ip fragment' - it works for IPv4 but not for IPv6.
Ewan Mac Mahon: (12:15 PM)
Sam seems to be keeping his head down :-(
Jens Jensen: (12:15 PM)
Sounds like you're a good person to do it brian
ooo
Ewan Mac Mahon: (12:17 PM)
Er. Nothing to add, I think.
Just that we need to have an agreement for new VOs and we don't.
Jens Jensen: (12:19 PM)
I won't be at gridpp... (I was on leave so missed the reg'n deadline)
Matt Doidge: (12:19 PM)
Were sites asked to fill in https://www.gridpp.ac.uk/wiki/SL5_status ?
Terry Froy [QM]: (12:20 PM)
Re: SL5 status - will chat with Dan Traynor when he gets back - but I understand that once we get Lustre 2.x up and running on SL6, we will no longer have any need to run any SL5 boxen.
Ewan Mac Mahon: (12:21 PM)
Ideally everyone would fill it in, even if just to say "we're SL5 free already".

Agenda
======

• 11:00-11:01 Ops meeting minutes (1m)
  * This is a reminder that this is an important task. The minute taker gives access to the discussions for those not present and provides a reference for others to refer back to afterwards. * The team composition has been changing. If everybody contributes then the task comes around less often. * From the start of GridPP4+ those in fully funded GridPP positions will be expected to contribute. Others are welcome to volunteer! * The minutes should contain a list of who attended; apologies; note who took the minutes and highlight actions. * A count is maintained at https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items. * After uploading minutes to the agenda page the minute taker is expected to: ** Update the list of ops actions. ** Update their 'count' so the task can be shared fairly. Thank you for your support!

• 11:01-11:20 Experiment problems/issues (19m)
  Review of weekly issues by experiment/VO - LHCb - CMS https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T2_UK_London_Brunel See comment last week: https://twiki.cern.ch/twiki/bin/view/CMSPublic/SpaceMonSiteAdmin - ATLAS - Other -- DiRAC: Jens -- LIGO: Catalin -- LOFAR: George -- LSST: Alessandra -- LZ: David -- UKQCD: Jeremy -- UCLan/GalDyn: Tom -- PRaVDA: Mark/Matt - DIRAC status -- http://www.gridpp.ac.uk/php/gridpp-dirac-sam.php?action=view

• 11:20-11:40 Meetings & updates (20m)
  With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest - General updates - WLCG ops coordination - Tier-1 status - Storage and data management - Accounting - Documentation - Interoperation - Monitoring - On-duty - Rollout - Security - Services - Tickets - Tools - VOs - Site updates

• 11:40-12:00 Discussion (20m)
  1. Janet End-to-End Performance Initiative Workshop - talks discussion. In relation to https://www.jisc.ac.uk/rd/projects/janet-end-to-end-performance-initiative. "This end-to-end performance workshop seeks to bring together stakeholders who have a common interest in developing and extending best practices to allow communities to make optimal use of the Janet network for high-performance network applications." GridPP may wish to submit 1-2 talks. PerfSONAR may make a good topic for example. Other ideas?

• 12:00-12:05 Actions & AOB (5m)
  * https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items