Operations team & Sites

EVO - GridPP Operations team meeting

- This is the weekly GridPP ops & sites meeting.
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the Janet(UK) Community area. Direct link: http://evo.caltech.edu/evoNext/koala.jnlp?meeting=MDMaM82v2nD2Du999sD99D
- The phone bridge number is +44 (0)161 306 6802. The phone bridge ID is 1001002 with code: 4880.
Apologies: Mark M
Experiment problems/issues (20')
Review of weekly issues by experiment/VO
- LHCb
Raja: Nothing too much to report. Monte Carlo and a few user jobs. Most UK sites seem to have no problems. 2 problems from last week: EFDA-JET and SHEFFIELD. [Elena, transcript: In Sheffield we are looking into network problem which is not the fastest problem to resolve. We excluded problem related to sl6 move]
- CMS
Daniela: No real problem. Looking into enabling xrootd at Glasgow, not urgent. Sam: [see transcript]. T2s are working fine. CMS is discussing what to do over the Christmas holidays.
Jeremy: Changes made as far as Sam's aware. 
- ATLAS (see slides attached by Elena)
Elena: Attached availability/reliability report. Hammercloud tests were down for a bit. Cambridge is at 100% availability, well done. Glasgow <90%, problem with LFC; please write a couple of lines of explanation. Not much production running at the moment. There was a peak of jobs during the weekend, but not much now. ATLAS is planning to revise things and announce their plans for Christmas. There were a number of requests for multicore jobs. There was an issue on Friday between 9 and 1: all prod queues were set to test because job output couldn't be written to storage. Queues were put online manually, some went back to test. The cause was a problem with one server at CERN. Rucio renaming is ongoing. Problems at QMUL, Lancs, Manchester and RALPP, talk about it offline.
Brian: With regard to QMUL, there's an upgrade to StoRM. Chris?
Chris: I'm aware of it, they haven't released the new version yet. I may not upgrade before Christmas. 
- Other
Chris: Steve Jones made progress with instant UI, pretty much have bare bones of one, could do with instructions. Backup VOMS server, most VOs have upgraded VOMS, Southgrid and Scotgrid haven't. T2K are only VO to test resources with exception of John Hill at Southgrid. Encourage VO managers to test resources, unless it's felt this isn't necessary.
Jeremy: Made some progress, would be useful to test against other VOs to check against general problems. Kashif configured Nagios to use backup VOMS, so generic test done against all sites. 
Chris: Tests I've done haven't suggested general problems. Have list of people to nag; David Crooks or Mark Mitchell for Scotgrid.
David: I'm aware, I'm on it. 
Chris: Northgrid, Alessandra and Robert. Southgrid Pete Gronbech; could you test or delegate the testing of the Southgrid VO?
Pete: It has been mentioned.
Alessandra: I'll put it on the list.
Chris: Londongrid, Duncan, Dave Colling or Daniela.
Jeremy: Could Duncan or Daniela take an action?
Duncan: Yep.
Chris: That's progress, let me know of progress and fill in the wiki. 
Jeremy: Should start ticketing VO managers in a week.
Chris: Hoping that GridPP VOs could lead this. 
Meetings & updates (20')  
With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest
Although ready, the UK CA will wait to move to default SHA-2 certificates in January (WLCG overall has not confirmed readiness).
There is an EGI push for ARGUS deployment - a central server is being configured at RAL.
Jeremy: for T2s, maybe half of sites have ARGUS installed, not sure about configuration. (Come back to it). 
Minutes from Monday's regular WLCG ops call are available. Generally quiet.
Alessandra: Multi core TF, looking at participation, have a few names from the UK but others are welcome. Need to follow up with T1 for dynamic queue, are encouraged to participate. 
Jeremy: Gareth, could you check with RAL about an intention to participate? Gareth: Discuss it in liaison meeting. Alessandra: will attend as well.
- Tier-1 status
Gareth: Haven't made final plans for Christmas: services should stay up, support may be reduced on critical days.
- Accounting
- Documentation
- Interoperation
Update on EGI ops meeting on 2nd December, see updates in bulletin.
Jeremy: Any concerns about the difficulty of the UK CA issuing certs using both SHA-1 and SHA-2?
Chris: Not sure I've seen confirmation that things work.
Jeremy: Jens and John Kewley have been testing most things, the only thing that doesn't work is the website; haven't had a tag meeting for a while, might try to have one before Christmas. 
Chris: Undesirable to have changes like this over Christmas. 
Jeremy: glexec tests are critical for those that are advertising it.
- Monitoring
- On-duty
Alessandra: Last update on UCL was that there were some issues.
Daniela: Haven't been getting notifications by email, has anyone else seen anything?
Jeremy/Pete Gronbech/Daniela: Kashif has updated things but is away so can't do anything. Manchester should have received 2 emails but didn't see anything.
- Rollout
- Security
- Services
- Tickets (Matt)
Sussex: glexec ticket, perfsonar (Emyr: this will be the new person's first job).
RALPP: Problem with dCache info provider, solved by Chris. Jeremy: is this solved? Dave Kelsey mentioned this on Friday. Matt: Solved, info provider wasn't publishing the right information. 
Oxford: Ticket about removing ngs VO from backup VOMS server. 
Bristol: Nagios ticket, they're on it, ARC problem. Jeremy: 2 sites mentioned in tickets but not T1. Do we not see this at T1? Matt: There's no ticket for it. Daniela says her ARC CE is working. 
Glasgow: perfsonar, need to look into that again. hyperk, waiting for testing. CMS ticket, getting CMS to work, good reference for other sites wanting to get CMS glide-ins to work. 
ECDF: glexec ticket, Matt apologises for the tarball install. Publishing tickets: clean up some bad values. At Lancaster we had to upgrade the site BDII to fix similar issues. Even with up-to-date CEs, an up-to-date site BDII may still be needed to fix this.
Durham: glexec, being worked on.
Sheffield: glexec, hoping to get it done in mid-November, looking at new timeline. Matt: What's the deadline on this? Jeremy: Became critical before December, for sites that advertise. Experiments still need to say they're happy with it. (Elena updates in transcript). LHCb problem, lots of discussion. Biomed has problems with dynamic publishing. Elena: The problem arises because we are running both local and grid jobs submitted to the Torque server. LCG-rollout had a link to a ticket on this. Will try to upgrade Torque. Matt: What are people running? We installed the Nikhef version. Maybe look at Torque versions. Look at alternatives to Torque [see transcript]. Alessandra: Torque has not been maintained for 2 years. Not sustainable. Adaptive Computing is not maintaining Maui, but Moab instead. Matt: Especially as we're looking at MPI, it might be easier with a newer batch system.
Manchester: perfsonar, will upgrade soon. Request to remove ngs VO from VOMS, handled by Robert.
Lancaster: glexec, effort needed by Matt, haven't been able to commit time to this.
UCL: Slow progress, don't have any dedicated effort for this. Ben working on SL6 WNs, might come to Londongrid guys for help. Chris: Could he use xrootd access to data? Matt: Good suggestion. Wahid: Not sure there's an option for a diskless site for ATLAS. Not sure you'll get credit. 
QMUL: problem with the SE publishing that they support biomed while they don't. Chris needs the time to figure out how to fix it. Chris: Can fix it, need to write a YAIM post config script to permanently fix it (to stop YAIM blatting the fix). Ticket from Brian about spacetokens that has been taken care of.
Brunel: APEL publishing ticket, Raul has a ticket open with devs. Might be an upgrade issue. APEL might be a little understaffed at the moment. Jeremy: Stuart should be up to date with APEL now, might be working through issues. Matt: It's a busy time. They might be snowed under.
EFDA-JET: LHCb job failures, putting a lot of effort into fixing this. glexec, installed but not quite working. Submitted ticket to ARGUS devs.
T1: One last 2012 ticket, correlated packet loss on perfsonar. Gareth: We have done a little work on this and need to do more; nothing much has changed recently. The key point is we haven't installed the new one yet.
4 CVMFS request tickets: T2K got back on theirs. Running out of time, might be the same for snoplus and cern@school. On the 3rd reminder the ticket is automatically closed. Out-of-band email might be useful. Jeremy: If reminders aren't going out, I'll follow up with GGUS. RAL MyProxy: Gareth: that needs some work. Publishing defaults: closed, then reopened. Might just be YAIMing. WebDAV: Chris said his tests worked well and closed the ticket.
- Tools
- VOs
- Site updates
Monitoring discussion (15')  
- Led by David. (Notes from Jeremy!)
Two main things to cover.
First – monitoring consolidation group update. The main purpose of the group has been to consolidate backends and the transport. Also provide suitable APIs.
Other VOs had monitoring in addition to Nagios. Proposed to make Nagios optional. Nevertheless there would be a unified transport between them. Now moving on to deployment – using the prototype developed.
Will now deploy on experiment side over the next 6 months. Want to make sure we are engaged at the site level.
The link is http://wlcg-mon.cern.ch/dashboard/request.py/siteviewhome.
Request: Please provide David any more feedback on this by Thursday.
Second – to start a conversation on one of the things they are clear about. As well as the prototype there should be APIs. From our perspective this is useful as the monitoring provided may not cover our use-cases fully.
At the moment we generally use Nagios and Ganglia. At Glasgow graphite has been working well as a high-level aggregator. So are there any parts of the monitoring you currently use that might inform how we look at and provide tests of the data. Any comments?
[12:08:49] Christopher Walker QMUL is using OpenNMS - replacing zenoss - Dan has details of why he chose it.
[12:08:57] Alessandra Forti one of the questions that occurred to me is that now that the experiments sam tests become critical we cannot maintain all the experiments in the BDII if they don't use the site
AF: In the dashboard… shows all the experiments even those we do not properly support like ALICE and CMS.  Concern that this may affect site availability.
DC: So if you advertise for a VO you only provide spare CPU for… then the concern is that you may be impacted by the availability for that VO.
CW: Also availability per VO is okay. Otherwise the site looks down or poor … want an OR not an AND… or the latter only for the supported experiments!
PG: Class of test. Primary and opportunistic and only OR those.
Will feedback on Friday.
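The AND/OR point raised in the availability discussion above can be illustrated with a minimal sketch. This is not how the WLCG dashboard computes availability; the function names, the VO names in the example, and the "primary" set are all hypothetical and only show how the three combination rules would differ for a site that passes tests for its supported VOs but fails for an opportunistic one.

```python
from typing import Dict, Set


def available_and(results: Dict[str, bool]) -> bool:
    """AND rule: the site is up only if every advertised VO passes."""
    return all(results.values())


def available_or(results: Dict[str, bool]) -> bool:
    """OR rule: the site is up if at least one advertised VO passes."""
    return any(results.values())


def available_primary(results: Dict[str, bool], primary: Set[str]) -> bool:
    """AND rule restricted to primary VOs; opportunistic VOs are ignored.

    Note: with no primary VOs in `results`, all() returns True.
    """
    return all(ok for vo, ok in results.items() if vo in primary)


# Hypothetical per-VO test results for one site (True = tests passing).
results = {"atlas": True, "lhcb": True, "cms": False}
primary = {"atlas", "lhcb"}  # assumed set of properly supported VOs

print("AND over all VOs:     ", available_and(results))
print("OR over all VOs:      ", available_or(results))
print("AND over primary only:", available_primary(results, primary))
```

With these assumed inputs, the plain AND marks the site as down because of the opportunistic VO, while the OR and the primary-only AND both mark it as up.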
Working on graphite stuff in wiki. Planning to package up the collector scripts for others to use.
Q: If people are doing things that are interesting or non-standard then it would be good to hear from you!
ARGUS/glexec (10')  
- Deployment timetable
AOB (1')  
glexec/roundtable has been delayed for a few weeks, we'll definitely get to it next week. 
Chris: QMUL is hosting a course on ARCHER on December 11th if anyone is interested. Will send the details to TB-SUPPORT. 
Meeting ends.
[10:51:27] RECORDING David joined
[10:52:29] Alessandra Forti joined
[10:56:52] Lukasz Kreczko joined
[10:57:28] Sam Skipsey joined
[10:58:18] Jeremy Coles joined
[10:58:24] Wahid Bhimji joined
[10:58:53] IPPP1 UofDurham joined
[10:58:56] Elena Korolkova joined
[10:59:19] Jeremy Coles Thanks for agreeing to take minutes David.
[10:59:53] Govind Songara joined
[11:00:53] Daniela Bauer joined
[11:01:18] Raja Nandakumar joined
[11:01:28] Jeremy Coles Will start in 1 minute
[11:01:59] Matt Doidge joined
[11:02:43] John Hill joined
[11:03:22] Wahid Bhimji no ggus ticket for ECDF
[11:03:25] Wahid Bhimji sorry 
[11:03:41] Christopher Walker joined
[11:03:55] Jeremy Coles It was EFDA
[11:04:12] Elena Korolkova In Sheffield we are looking into network problem which is not the fastest problem to resolve
[11:04:46] Elena Korolkova We excluded problem related to sl6 move
[11:05:17] Brian Davies joined
[11:05:17] Brian Davies left
[11:05:20] Daniela Bauer I don't know why
[11:05:26] Daniela Bauer there's nothing to report
[11:05:27] Brian Davies joined
[11:06:44] Sam Skipsey (As far as I'm concerned, I've made the various changes that seem to be required.)
[11:07:13] Daniela Bauer But the nagios tests are failing (though that might not be xrootd related)
[11:07:16] Sam Skipsey (Although I've prodded the CMS redir on the DPM a bit, so I'm not quite sure why it isn't working)
[11:07:45] Robert Frank joined
[11:07:47] Sam Skipsey (It's definitely xrootd related, it's just that I need to work out what the CMS file lookup isn't finding files where it expects them.)
[11:08:12] Daniela Bauer Just update the GGUS ticket with more questions... Even if you have access to the documentation it's not helpful.
[11:10:13] Duncan Rand joined
[11:10:45] Gareth Smith joined
[11:15:09] Pete Gronbech joined
[11:15:09] Pete Gronbech left
[11:15:26] Andrew Washbrook joined
[11:17:19] Elena Korolkova A link for Multicore TF (Alessandra gave it last week)
[11:17:21] Elena Korolkova https://twiki.cern.ch/twiki/bin/view/LCG/DeployMultiCore
[11:18:54] Alessandra Forti yes, I'm surveying participation now. It will be officially approved this thursday
[11:20:23] Alessandra Forti I have already few names from the UK, let me know if you want to join
[11:21:18] Alessandra Forti https://e-groups.cern.ch/e-groups/Egroup.do?egroupId=10111467
[11:23:17] Sam Skipsey I just added myself (since we're going to be looking at this kind of thing in the very near future here - although Gareth Roy'd be the best person, he's still on holiday until Friday)
[11:23:44] Alessandra Forti I know I contacted him but got an automatic reply
[11:24:08] Alessandra Forti problem is we have to cover all the experiments and all the batch systems it's not going to be easy
[11:24:33] Sam Skipsey Aye, I'll follow along for the next while until he's back. We actually have working multicore (and MPI) support here now, so I think we're happy to test stuff.
[11:28:19] Pete Gronbech Hi Chris, backup VOMS servers now added to SouthGrid VO ID
[11:29:01] Wahid Bhimji when is that glexec test becoming "critical " ... 
[11:29:24] Wahid Bhimji ah ok 
[11:29:30] Wahid Bhimji perfect ! 
[11:31:31] David Crooks Just for clarity, I was talking about SHA-2 support, not glexec.
[11:31:47] Gareth Smith For the record the ROD duty was done by Kashif. I did very little. 
[11:32:43] Wahid Bhimji yes I did realise david, sorry for asking about an item that passed... it was just percolating to the front of my fingers... 
[11:33:06] David Crooks  
[11:34:32] Robert Frank left
[11:34:38] Robert Frank joined
[11:38:29] Daniela Bauer You can say that loudly !!!
[11:38:48] Daniela Bauer Though our ARCCE pass :-D
[11:39:08] Lukasz Kreczko we are trying out the differences, Andrew is helping me out
[11:39:14] Jeremy Coles Thanks
[11:40:42] Wahid Bhimji we don't care about the tarball glexec - please never develop it. 
[11:41:50] Steve Jones joined
[11:41:52] Jeremy Coles Others do care though... 
[11:42:05] Alessandra Forti sorry
[11:42:11] Wahid Bhimji do you know specifc others who care.. 
[11:42:47] Jeremy Coles Most of them wear security hats.....
[11:42:49] Elena Korolkova I'll try to finish glexec this week
[11:43:27] Alessandra Forti they are also drawing manpower from other fields
[11:43:59] Alessandra Forti one of the dashboards developers has been reassigned to get panda to work with glexec
[11:44:29] Wahid Bhimji yet more real physics down the glexec drain... sigh 
[11:45:27] Jeremy Coles Did you lobby Dave B ahead of the MB decision?
[11:45:35] Elena Korolkova https://ggus.eu/ws/ticket_info.php?ticket=98748
[11:45:48] Steve Jones I'll check at Liv
[11:45:55] Elena Korolkova which is on hold
[11:46:03] IPPP1 UofDurham 2.5.7
[11:46:33] Steve Jones torque-2.5.7-9.el6.x86_64
[11:46:36] Steve Jones at liv
[11:47:07] David Crooks We're looking at Condor
[11:47:31] Steve Jones We stick with torque because it is well integrated. If other things work as well or better, we'd switch.
[11:47:33] Jeremy Coles Sounds like a good hepsysman topic as Matt said.
[11:47:43] Sam Skipsey Maui is so unmaintained that a long standing MPI bug took *ages* to be fixed (as I just discovered recently)
[11:48:12] Steve Jones Maui has several problems, e.g. backfilling is useless.
[11:48:24] Lukasz Kreczko left
[11:48:34] Steve Jones It causes problems even if you don't want to use it
[11:48:38] Sam Skipsey Backfilling works perfectly well, if you can tell maui accurately how long your jobs will last.
[11:48:44] David Crooks For Glasgow Perfsonar: I will look at this Wednesday/Thursday
[11:48:57] Steve Jones No - it has very serious bugs in backfilling.
[11:49:12] Daniela Bauer UCL is a lost cause.
[11:49:15] Sam Skipsey The MPI bugs are things like "if you specify both nodes and processes-per-node, then maui will actually only schedule things on one node"
[11:49:57] Steve Jones The backfilling bugs are things like "if I try backfilling anything, and fail, I won't schedule anything at all"!
[11:50:38] Wahid Bhimji um... I don't think you can (for ATLAS) setup as only using remote xrootd
[11:50:59] Sam Skipsey So, at least we're agreed that maui is janky and unreliable, Steve  
[11:51:05] Steve Jones Spot on!
[11:51:06] Duncan Rand I am here but no sound
[11:51:40] Jeremy Coles Ok thanks. The general point for LT2 is that there are now quite a lot of open tickets at UCL.
[11:51:48] Steve Jones I am the opposite, Duncan - sound, but no mic.
[11:53:02] Daniela Bauer just write yourself a shell script...
[11:53:10] Daniela Bauer That's what I do !!
[11:54:01] Christopher Walker Problem is working out exactly what the shell script needs to do...
[11:54:09] Christopher Walker But yes
[12:00:18] IPPP1 UofDurham we have the same WWT problem but there improved tips seem more helpful
[12:01:44] Alessandra Forti sure
[12:05:06] Alessandra Forti http://wlcg-mon.cern.ch/dashboard/request.py/siteviewhome
[12:08:49] Christopher Walker QMUL is using OpenNMS - replacing zenoss - Dan has details of why he chose it. 
[12:08:56] Alessandra Forti one of the questions that occurred to me is that now that the experiments sam tests become critical we cannot maintain all the experiments in the BDII if they don't use the site
[12:09:10] Gareth Smith left
[12:09:29] Alessandra Forti for example I'd rather not have the availability conditioned by CMS in manchester
[12:09:52] Alessandra Forti http://wlcg-mon.cern.ch/dashboard/request.py/siteview#currentView=default&search_0=UKI-NORTHGRID-MAN-HEP
[12:10:32] Christopher Walker http://dashb-ai-548.cern.ch/dashboard/request.py/getWLCGNavigationLink?columnid=181
[12:10:34] Christopher Walker dashboard.common.InvalidRequestException: This request of type 'GET' is unknown to the service
[12:12:53] Alessandra Forti http://wlcg-mon.cern.ch/dashboard/request.py/sitehistory?site=UKI-NORTHGRID-MAN-HEP#currentView=default
[12:15:27] Wahid Bhimji left
[12:19:40] Andrew Washbrook left
[12:19:41] IPPP1 UofDurham left
[12:19:42] Raja Nandakumar left
[12:19:43] Elena Korolkova left
[12:19:44] Daniela Bauer left
[12:19:44] John Hill left
[12:19:46] Matt Doidge left
[12:19:48] Alessandra Forti left
[12:19:52] Christopher Walker left
[12:19:54] Robert Frank left
[12:19:58] Brian Davies left
[12:20:06] Duncan Rand left
[12:20:32] Sam Skipsey left