Operations team & Sites
EVO - GridPP Operations team meeting
Attendees:
Alessandra; Andrew L; Brian; Catalin; Dan; Daniela; Elena; Ewan; Gareth; Jeremy (chair, notes); John H; Kashif; Mark S; Matt D; Matt W; Winni; Raja; Sam; Steve J; Terry Froy.
Experiment updates:
Raja – LHCb having issues, but not site related. Major issues: not too many jobs waiting, and DIRAC had problems before and over Christmas, when the site directors that submit pilots to sites started getting authentication errors. Also some problems in accounting, where an int32 value hit its maximum size.
ATLAS – Issues at QMUL. Wrong storage being used by production team. Jobs running now. Metrics for December available. Several sites below 90% availability. Cambridge – unscheduled DT. Glasgow – set into test from 28th and then in DT. Lancaster – flooding led to unscheduled DT. Can negotiate. RHUL – need to check. QMUL – some scheduled DT and ATLAS issue at the end of December.
RHUL – one storage server down at the end of the month.
Glasgow – temporary power cut that affected cooling. Machine room came back up without cooling. Security staff turned off power.
Lancaster submitted a ticket for ATLAS recalculation. https://its.cern.ch/jira/browse/ADCMONITOR-412
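On the int32 accounting limit mentioned in the LHCb update: the minutes do not say which counter was affected, so the following is only a minimal Python sketch of what hitting the signed 32-bit maximum looks like. The `counter` value and the helper `store_as_int32` are hypothetical and are not taken from the DIRAC accounting code.

```python
import ctypes

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def store_as_int32(value: int) -> int:
    """Return the value as it would read back from a signed 32-bit field."""
    # ctypes integer types do no overflow checking, so the value simply wraps.
    return ctypes.c_int32(value).value


counter = INT32_MAX                  # hypothetical counter sitting at the limit
print(store_as_int32(counter))       # still representable
print(store_as_int32(counter + 1))   # wraps around to a negative number
```

Storing such a counter in a 64-bit field is the usual way to avoid the wrap-around, though whether that was the fix applied here is not recorded.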
Other VOs:
US LSST modified ops portal to add both VOMS instances. Action is now closed. Vomssnooper should get the right information now. Sites looking to upgrade should get the right information too.
DIRAC – 120TB moved over the holiday period. Moving files of 250GB in size. 350MB/s in the transfer part. See blog post.
For main Bulletin updates read: https://www.gridpp.ac.uk/wiki/Operations_Bulletin_110116
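As a rough check on the DIRAC transfer figures above, the arithmetic can be sketched in Python. This assumes decimal units (1 TB = 10^12 bytes) and that 350MB/s was sustained throughout, neither of which the minutes state.

```python
TB = 10**12
GB = 10**9
MB = 10**6


def transfer_seconds(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Time to move size_bytes at a constant rate."""
    return size_bytes / rate_bytes_per_s


rate = 350 * MB                                          # assumed sustained rate
total_hours = transfer_seconds(120 * TB, rate) / 3600    # whole 120TB
file_minutes = transfer_seconds(250 * GB, rate) / 60     # one 250GB file

print(f"120 TB at 350 MB/s: about {total_hours:.0f} hours")
print(f"250 GB file at 350 MB/s: about {file_minutes:.0f} minutes")
```

Under those assumptions the full 120TB corresponds to roughly 95 hours (about four days) of transfer time, and each 250GB file to about 12 minutes.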
Tier-1:
Holiday. Doubled link around bypass link. Double bandwidth to T2s. Some ACLs not working. Dropping back to 1x10Gb/s. Some batch issues just before Christmas, resolved just after but before the New Year.
Quiet leading up to Christmas. Long-term unavailability for Lancaster. Last weekend lot of alerts but did not ticket for them. Glasgow being down was affected.
Minor issue with GridPP nagios. Glasgow server in the list for replication. Winnie posted on TB-SUPPORT.
Tickets: Sussex tickets – Jeremy Maris to have a look. Glasgow ticket updated for task force monitoring: https://ggus.eu/?mode=ticket_info&ticket_id=118052
Site updates from holiday period:
Manchester – did okay. Still catching up. Submitted procurement so chasing order.
ECDF – Andy was away. No major problems.
QMUL – ATLAS stopped sending jobs around Christmas. Site was up but unusable.
Imperial – all okay
Sheffield – no major problems. Running ATLAS jobs. Moved servers to SL6.
Oxford – CRON on pool node. CRLs not updated. Fixed quickly.
RHUL – no issue.
Cambridge – quiet. A few days with no jobs due to an ATLAS issue.
Birmingham – some funny issue with WNs not rebooted before period. Small number.
Lancaster – Smooth. Just before New Year's Eve a few HC tests failed. Fixed itself.
Glasgow – evolving situation. There was a power cut for the building. Most services off. Aircon for downstairs did not come back due to a dodgy valve. Dave/Gareth/Gordon working on it as it affects the department. A lot of stuff okay. Some PDUs lost. Site will not be brought up until the impacts are understood.
Liverpool – low number of jobs. Nothing significant.
Bristol update: No Problems, clusters busy & healthy.
Mark: In the process of adding stuff to the wiki.
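The Oxford CRL problem reported above (a cron job on a pool node not refreshing CRLs) is the kind of thing a simple file-age check can catch. A minimal sketch, assuming CRLs are kept as `*.r0` files in a single directory; the directory path, the 24-hour threshold and the function name `stale_crls` are assumptions for illustration, not Oxford's configuration.

```python
import time
from pathlib import Path


def stale_crls(directory, max_age_hours=24.0, now=None):
    """Return names of *.r0 files not modified within max_age_hours."""
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    return sorted(
        p.name for p in Path(directory).glob("*.r0")
        if p.stat().st_mtime < cutoff
    )


if __name__ == "__main__":
    # Assumed location; adjust for the node being checked.
    crl_dir = Path("/etc/grid-security/certificates")
    if crl_dir.is_dir():
        for name in stale_crls(crl_dir):
            print("stale CRL:", name)
```

Run from a monitoring check, a non-empty result would flag that the refresh job has stopped before authentication starts failing.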
GANGA workshop/HEPSYSMAN:
https://github.com/ganga-devs/ganga/wiki/Full-Tutorial
Current plan. If people want to do something else happy to make changes.
CERNvm might be nice to include. Gives you CVMFS and that gives you GANGA. Move to DIRAC job submission.
Ganga can run whatever – pure python. Need to think about how to run stuff.
Several parts. Practicalities for us but also get closer to the new stack of stuff so we can use it. Most have proper UIs. As an exercise perhaps use Tom’s CERN VM approach. Should work nicely. Valuable to run through as an exercise, so people just need to be able to run a VM to start CERNvm. Mark will make sure this will run this way – some idiosyncrasies in config.
Min requirement laptop should be able to run VM. Having access to UI/remote cluster an advantage.
Must not waste too much time. Only the afternoon. Prerequisite: bring a laptop that already has CERNvm running. Any problems then plan B institute.
Action on Mark to nudge Tom.
HEPSYSMAN:
Other topics without speaker yet (could be just discussion but an introduction would be useful).
* Htcondor progress
* puppet/foreman
* configuration sharing HEP-puppet if it updated?
* how foreman will affect this
* WLCG workshop related topics discussion
* https://indico.cern.ch/event/433164/timetable/#20160201
* ganga mini workshop summary (though now that it is a tutorial may be more difficult).
Ganga/DIRAC day -> what doing for new user communities. Thursday may teach some useful lessons. May be short. Good to keep a slot.
HEP puppet not updated much. Only Lucas. HTC evolving quickest. AL patch. Nobody updated HEP puppet stuff which creates a diverging way of things.
Kashif can put together some slides on github/puppet. To make modules generic then becomes a lot more effort. Oxford often update locally.
A: Some of the modules are used by CERN.
AOB:
Sean – leaving and moving to Dublin. May impact services.
Ewan – heading off to another bioinformatics/gene sequencing. Revamp DPM. Kashif will be looking after both systems.
IPv6 will be going away, perhaps move to Glasgow. Future impacts unclear. In short-term ease off on T2C testing but TBD.
VOMS/gridppnagios Kashif.
Security team.
Local cluster.
Ewan moving to Oxford Computational biology.
Chat window contents:
Matt Doidge: (05/01/2016 11:08)
I submitted a JIRA request to have Lancaster's ASAP recalculated.
https://its.cern.ch/jira/browse/ADCMONITOR-412
Not sure if I was jumping the gun though.
Ewan Mac Mahon: (11:16 AM)
Are you using a headset Jeremy, or are you using loudspeakers? There's a HUGE feedback/echo thing whenever anyone else talks and you're not muted.
Yup.
Everyone with hardware money should spend some on headsets.
Someone should spend some on one for Jeremy.
Alessandra Forti: (11:17 AM)
@matt I've assigned it.
Matt Doidge: (11:18 AM)
Thanks, I thought I might have done something wrong.
Alessandra Forti: (11:27 AM)
I've updated the agenda with some more topics for hepsysman that we discussed in November, which we should discuss and possibly find a speaker for. The WLCG workshop topics are about evolution of the WLCG computing, so the aim would be not to have a presentation but more a brainstorming so we have the sites' view represented as usual.
Samuel Cadellin Skipsey: (11:28 AM)
On the workshop, I might be presenting the Site Storage perspective on that evolution topic, if people want to give me their views.
Alessandra Forti: (11:29 AM)
same goes for htcondor and puppet there were pending questions on how to proceed and I believe the presentations could be short and aimed at generating discussion
ganga summary may need re-evaluation now that ganga workshop is organised as hands on tutorial
Ewan Mac Mahon: (11:30 AM)
Nothing springs immediately to mind,
(on security)
Alessandra Forti: (11:30 AM)
but could still be interesting to have highlights
@sam: that would be interesting. I'd like to hear also for my session accounting information system and benchmarking
Paige Winslowe Lacesso: (11:38 AM)
Must dash (sorry); Bristol update: No Problems, clusters busy & healthy.
Ewan Mac Mahon: (11:43 AM)
Modern kit is remarkably heat resistant.
Alessandra Forti: (11:45 AM)
https://indico.cern.ch/event/474233/
I put the link in agenda
Mark William Slater: (11:45 AM)
https://github.com/ganga-devs/ganga/wiki/Full-Tutorial
Alessandra Forti: (11:47 AM)
I did suggest it in case there were big differences. Since there are not, I do prefer a more uniform tutorial with highlights.
Ewan Mac Mahon: (11:48 AM)
I think we can play this time by ear a bit, but it's safe to say that the objective is to be able to run through the 'standard stack' of ganga submission to the Imperial dirac.
We might want to maybe do that from CernVM instances too, to get the complete set.
I think we have all the bits, but there's a bit of joining up to do.
Alessandra Forti: (11:49 AM)
LSST user could move from direct job submission to dirac changing two lines
Ewan Mac Mahon: (11:50 AM)
Using cernvm might actually be a good way to run the tutorial too - it gives the nice stable platform regardless of what people have on their laptops etc.
So i would suggest that you (Mark) have a go at starting a gridpp cernvm and making sure you can run ganga on it, ideally, if you can.
Alessandra Forti: (11:50 AM)
I usually connect to my UI.
Matt Doidge: (11:51 AM)
If you have cvmfs you have access to a UI
:-)
Alessandra Forti: (11:52 AM)
we only have 4-5 hours
Ewan Mac Mahon: (11:53 AM)
Indeed; if we're going to do that then everyone should make sure they can run a cernvm in advance of the meeting.
The advantage would be that we'd then all be using the exact same environment on the day.
So we wouldn't lose time faffing with people's UIs being slightly different etc.
Alessandra Forti: (11:54 AM)
lxplus works
Ewan Mac Mahon: (11:55 AM)
Yup, what he said.
I'm fine with either.
Mark William Slater: (12:15 PM)
I think we've got one of those :)