Date/Time: Tuesday, April 24, 2012 - 11:00 (Europe/London)
Location: EVO - GridPP Operations team meeting
Attendance, see transcript
Tuesday, April 24, 2012
Experiment problems/issues (20')
Review of weekly issues by experiment/VO
Raja: Nothing much to report. Things are running smoothly. We are going to run a few small-scale tests of new code. These will also run at T2s, with one UK T2 involved. Then we will soon launch a new reprocessing of all data from this year. The CVMFS update from last week fixes cache corruption on WNs.
Stuart: Nothing much here either. Brunel will be trialling CVMFS this week, which will be interesting. RALPP is doing OK with it.
Alessandra: One of the problems has already been discussed: UCL (see later). Second thing: ATLAS has started to run reco jobs at T2s more extensively. These require bigger input data sets, which should be copied to the DATADISK space token. Some are copying to PRODDISK instead. If you see PRODDISK filling up you should take action.
Brian: Issue with how ATLAS recovers files from a lost disk server at a T2. There may be an issue regarding how they get the files back to the T2 which needs to be looked at in more depth. Will discuss with ATLAS UK before coming back here.
- Wider VOs
-- Known issues
-- Site support of these VOs (the 10% target)
Chris: T2K are having problems with WMS proxy renewal, and some WMSes are advertising support but don't actually work. Could be user error, but it will need investigation.
Jeremy: Is there an open ticket?
Chris: Not yet. Well, for the WMS not working there is; for proxy renewal not yet - we need to check it's not a problem at their end.
Jeremy: GridPP vs CIC portal pages, Network monitoring.
Santanu: We are failing Nagios tests, but are confused as to how many VOMS server certificates are needed for Ops.
Chris: Answer should be on CIC portal.
Jeremy: see link in transcript.
Meetings & updates (20')
- ROD team update
Stuart: Hands up - I was at GridPP saying this shouldn't happen, and was then off on Friday. Mostly quiet; people at GridPP were not fiddling. COD raised tickets. Kashif?
Kashif: Alarms were left over from Friday; the Birmingham ticket was 23 days old, and some alarms were 72/73 hours old. Closed those tickets. Decided that on a Friday evening we will stick to closing the alarm or raising a ticket, even if it is only 2-3 hours old.
-- Also note: UKI ROC is being decommissioned. dteam groups/roles should be under NGI_UK.
- Nagios status
- EGI ops
There was a meeting on 16th April; an update from Stuart went to core ops, covering some interesting things.
Stuart: Nothing to add beyond report.
Jeremy: Note the new WMS release. The plan for EMI-2 is a release around 7 May. A call for testing is in place. If you have an interest in other platforms, make yourself known.
Stuart: There is /a/ WN tarball available, but with some problems. We couldn't post a ticket against EMI for this (though see the transcript: one has now been opened).
Jeremy: BDII, question of instability.
- GridPP middleware status [placeholder]
Daniela: This is a bit out of date, but there also hasn't been much testing; will update this week.
- Tier-1 update
1. On Thursday 12th April we had a series of ATLAS disk servers lose network connectivity. Although not confirmed, we believe this problem is fixed by a newer kernel (and network driver); this was rolled out to the affected ATLAS disk servers (those with a particular 10Gbit network card) that afternoon and the following morning. We have just (yesterday evening) seen what looks like a similar thing on an LHCb disk server, and are planning a further rollout of the newer kernel.
2. We had a problem on one of the ATLAS Castor head nodes caused by time drift. We had been checking that the ntp daemon was running, but that was not sufficient. We have now rolled out a Nagios test for time drift, which has picked up a number of systems that were out by some seconds.
3. We had a problem with xrootd access to the AtlasStripDeg service class - traced to a configuration problem.
4. We found an unnecessary restriction on our 4GB batch queue - a limit that we have raised.
5. We have added two new FTS front end systems on virtual machines. We backed out of this change at first as a number of problems were encountered. (One of these was that sites that had not updated their CA certificates since the new UK one was released were unable to submit FTS transfers). We have since re-applied the update (i.e. we do now have the two new FTS front ends in the alias).
Jeremy: What effect will this have on Tier-2 transfers?
Brian: There's a known issue with FTS for a couple of particular CMS sites; it is intermittent and being looked into. I don't think it is associated with the front-end changes. There is also a new spike of failures for ATLAS which will be investigated; it could be related to the front ends, but I don't think that is the case. We've seen issues this week, but I don't think they are due to the front ends.
Jeremy: Performance increase?
Brian: Don't know; I wasn't expecting any. More front ends could decrease load, but this change was for resilience.
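The time-drift check described in item 2 of the Tier-1 report can be sketched along these lines. This is a hedged illustration, not the Tier-1's actual probe: the `ntpq -pn` column layout is standard, but the thresholds, function names, and use of the selected peer are assumptions.

```python
# Illustrative sketch of a Nagios-style time-drift check, similar in
# spirit to the Tier-1 test described above. Thresholds and names are
# hypothetical; only the idea (check drift, not just that ntpd runs)
# comes from the minutes.

# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def parse_offset_ms(ntpq_output):
    """Return the clock offset (ms) of the selected ('*') peer from
    `ntpq -pn` output, or None if no peer is selected."""
    for line in ntpq_output.splitlines():
        if line.startswith('*'):
            fields = line.split()
            # columns: remote refid st t when poll reach delay offset jitter
            return float(fields[8])
    return None

def check_drift(offset_ms, warn_ms=500.0, crit_ms=2000.0):
    """Map an offset to a Nagios state. Merely verifying the ntp
    daemon is alive is not enough: it can run yet be seconds adrift."""
    if offset_ms is None:
        return UNKNOWN
    if abs(offset_ms) >= crit_ms:
        return CRITICAL
    if abs(offset_ms) >= warn_ms:
        return WARNING
    return OK
```

In a real plugin the `ntpq -pn` output would come from a subprocess call on the monitored host, and the exit code would be returned to Nagios.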
- Security update [to be placed as first item]
Mingchao is leaving in a few weeks; there are discussions as to how to bring this into the core tasks more fully.
Ewan: Security discussion.
- T2 issues
- General notes.
There was a GDB last week: https://indico.cern.ch/conferenceDisplay.py?confId=155067. A summary for the next ops meeting is being put together.
Jeremy: John is still really running this one.
- Documentation review [placeholder this week]
Jeremy: Planning to review this every other week or so. This will become a standing item; there are some names assigned to items marked in red.
Sage Matt says: " I'd appreciate it if everyone checked to see if their site has any crusty looking tickets that need a spring clean. I'll be chasing you from next week otherwise."
The new neuro science VO nearly has a name. Nearly. The devil's in the details (as always).
This got sent to Liverpool by accident. John relayed it to the right place, but it may have slipped under the radar. The ticket is from LHCb; it sounds like CVMFS problems causing job failures.
Alessandra: Being looked into. Due to changes made last week.
Biomed are complaining about negative space advertised by the CE.
Matt: DPM shouldn't be reporting negative space.
A: There is free common space.
Matt: Maybe Storage Group can look at this.
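As a toy illustration of Matt's point: a published free-space figure goes negative whenever used space exceeds the reserved total and the publisher does not clamp at zero. The numbers and function below are hypothetical, not DPM code.

```python
# Hypothetical sketch (not DPM internals): free = reserved - used,
# with no clamping, drops below zero once usage overruns the
# reservation - which is what an information system should not
# advertise.
def published_free(reserved_total_gb, used_gb, clamp=False):
    """Free space a naive publisher would advertise for a space
    token; `clamp=True` shows the obvious fix of flooring at zero."""
    free = reserved_total_gb - used_gb
    return max(free, 0) if clamp else free
```

Whether the real fix belongs in the accounting or the publishing layer is exactly the sort of question the Storage Group could take up.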
This ticket can be put to bed; the user doesn't see the problems anymore. I'm not sure what Santanu did to fix things, though.
Looks like this old ticket can be closed too (with the appropriate saga recorded in the solution).
Jeremy: How were these resolved?
Santanu: 80732: Couldn't replicate. Nothing in particular fixed the issue.
Has the heavy load on the WMS evened itself out?
If the WMS has started to behave, will you be able to look at enabling SNO+ soon?
Matt: WMS is poorly.
Stuart: We keep poking and finding things that aren't right. Need ~3 days of clean health to pronounce it fixed.
https://ggus.eu/ws/ticket_info.php?ticket=80527 - CE stability
https://ggus.eu/ws/ticket_info.php?ticket=81434 - CVMFS?
Has a couple of tickets, likely caused (or not helped) by the major transition going on at Birmingham. It might help Mark to put these On Hold if they can't be solved yet.
Is there anything anybody can do to help get your SE back up? We stand ready to assist.
There could be useful information here (if your problem is similar to Lancaster's and other crashing sites):
Or it could be easier to upgrade (1.8.3 should be out soon; I'm not sure if the storage group has a stance on this).
Alessandra: Ben has been on holiday and is now on medical leave. Duncan has extended the downtime to the end of the month.
Jeremy: There is a change in policy coming: if a site is underperforming it will be suspended from the queues.
Daniela: We will hit the 30-day downtime limit for UCL soon.
Jeremy: Following the process, it would be suspended.
Alessandra: The experiments and EGI have parallel suspension policies.
From the Solved Case pile:
The only one that jumps out at me is:
Another case where the renewal of a VO Admin's certificate under the new CA certificate causes shenanigans (no other word for it). One for other UK people to watch out for over the coming months as they renew their certs.
Network monitoring (GridMon & Perfsonar) (20')
- Discussion on situation
- Clarification of plans
Mark: After feedback on what's going on: the initial plan was to run GridMon internally and use perfSONAR for external monitoring, but perfSONAR has superseded GridMon. I propose that we move to the perfSONAR platform, with sites that have only one GridMon box using it as a perfSONAR bandwidth box, converting the collaboration to perfSONAR. The caveat is that it is an unknown platform, but we can always fall back to GridMon. I'm fairly agnostic about what we use; if we're comfortable with perfSONAR then let's go with that. There would be a lot of work involved in kickstarting GridMon.
Chris: Could use one box for bandwidth & latency, results might not be quite as good but would still give us numbers.
Alessandra: I think we agree.
Mark: We do have documentation on how to set this up; I'll reissue the docs. We can still hand data over to Janet, which was part of the plan, and we also have a standardised platform. For sites with one box, I would consider doing both on one box - not ideal, but start with bandwidth first, then extend to latency if all is well. If by Sept/Oct there are no issues with the platform, we can step back from GridMon.
Jeremy: Action Timeline?
Mark: Deadline of June/July for perfSONAR: I think we should stick with this. Some installation is still required - shall we say end of July?
Jeremy: I'll put a link to the GDB talk on perfSONAR into the transcript. We don't and won't have a full mesh of sites. Duncan took responsibility for sending requests to the Dashboard.
Jeremy: There are plans to evolve the dashboard; the developers are looking for input before making progress.
Chris: Worth noting that there are two perfSONAR projects: a US and a Euro version. RAL, Oxford and QMUL are using the US version. The Euro version has some monitoring which might be worth looking at in the future. GEANT are going the Euro route; the US version is the de facto Grid version. I will look up the details of the contact on this.
Brian: My transcript comments were about configuration.
Jeremy: Ewan, was the config page related to the US version?
Ewan: Yes. Does anyone at the T1 know if that's what they have?
Chris: Ian said yes.
Ewan: So we're OK to install US version?
Ewan: Wahid is concerned about the provenance of the image, and Brian about which communities we should use. You can tag your servers with as many arbitrary community tags as you like, which are then communicated up to the central server. If you set up your server to be in the GridPP community you can set up tests to others in the UK. Might be worth sticking with GridPP unless there are particular tests which specific sites need to run. Are people going to have problems with "install this image"?
Jeremy: We did have some things like this with GridMon, e.g. at Imperial.
Wahid: Just speculating - it could be anything we're installing. If it can be isolated we might get round this; I haven't spoken to the systems team about it.
Ewan: It does need to be reasonably representative of your production environment. It doesn't need any access to other machines; it's a freestanding box.
Chris: The recommendation was that it be connected as close as possible to the GridFTP servers for diagnostics.
Wahid: Is there remote access to this box?
Ewan/Mark: No. Mark: The advantage of perfSONAR over GridMon here is that its footprint is much smaller. GridMon also had degrees of remote access - something we want to get away from.
Jeremy: Would be useful if some other sites could work through the documentation.
To be completed:
Jeremy went through actions.
O-110516-01 Organise reviewers for web pages and wiki sections : ongoing
O-110524-07 Timeline required for relocatable GLEXEC tarball : being followed up
O-110816-02 Check out state of DPM-LFC checker and make available for testing. : still open
O-110830-04 Solve APEL/LSF parser mismatch, allowing Lanc. accounting to be published.
Matt: Had a go at the latest version; it still didn't work. Need to follow up with APEL support. Not sure if there is something different about Lancaster, but the parser doesn't parse. Have passed scripted accounting to Alessandra.
O-110906-04 Cream jobwrapper loses job exit status : Chris B: still open
O-110927-03 Develop the https://www.gridpp.ac.uk/wiki/Documentation page.: still progressing
O-120312-01 Report to Andrew the three Key Docs for your area. : need to revisit adding 3 pages for KeyDocs.
O-111122-01 Find out how to get EGI to distribute RPMs containing LSC files. : Looking for follow-up from the UK.
O-120313-01 Investigate getting YAIM extracts for VO config via CIC portal. &
O-120320-05 discuss poss. automation of VOID transfer
: Steve and Chris - still open
O-120313-02 Email everyone on how to hack the publishing system to avoid publishing an incorrect GlueSubClusterWNTmpDir. : Stuart. Closed?
Stuart: I think so. YAIM defaults to publishing something even for sites which aren't defining it, which means what is published might not be accurate. I need to double-check this : still open.
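One plausible shape for the publishing hack above is rewriting the YAIM-generated attribute in the static LDIF before the BDII serves it. This is a hedged sketch: only the attribute name GlueSubClusterWNTmpDir comes from the minutes; the function and any file paths it would be applied to are hypothetical.

```python
# Hypothetical sketch: override an inaccurate YAIM-published Glue
# attribute by rewriting its value in the LDIF text before
# publication. The attribute name is real; the mechanism here is
# illustrative, not the documented procedure.
def override_attribute(ldif_text, attr, new_value):
    """Replace every `attr: ...` line in an LDIF dump with the
    site's real value, leaving all other lines untouched."""
    out = []
    for line in ldif_text.splitlines():
        if line.startswith(attr + ': '):
            out.append('%s: %s' % (attr, new_value))
        else:
            out.append(line)
    return '\n'.join(out)
```

In practice a site would run something like this over the static LDIF file (or in a GIP provider) so that only sites that actually define the value publish it accurately.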
O-120313-03 Investigate state of Cambridge SE now that it has been upgraded. : Went to Storage Group
O-120313-04 Put link on Wiki to EGI procedure for correcting availability/reliability stats. : Chris is following up on this.
Chris: I suspect that Feb 29th was ignored in the stats. We expected to go over 90% but didn't; not of general interest.
O-120320-04 Please add storage key docs : should be closed and grouped with the other KeyDoc action
O-120410-01 Check up on those UCL/VPN errors (for Atlas) : still open
O-120410-02 Research backup VOMS server ideas: open. Manchester has good resilience in place.
O-120410-03 Look at how to improve and document Voms Admin policies and procedures. : ongoing.
- Topics for next week: Review of GridPP28 discussion outputs (actions); Review of April GDB updates
- Reminder: HEPSYSMAN 10th/11th May: http://hepwww.rl.ac.uk/sysman/May2012/main_as_before_meeting.html. The agenda focus: site reports; storage & MySQL.
Gareth: Registration is largely closed, in that we can't guarantee accommodation. We haven't sorted out putting it on EVO.
- Next core tasks review meeting Friday 27th (follow-up from GridPP28). There will be a security team meeting in early May (those concerned will have received a poll request from DK).
[10:59:29] Rob Harper joined
[10:59:50] Govind Songara joined
[11:00:19] Andrew McNab joined
[11:01:13] RECORDING David joined
[11:01:49] Brian Davies joined
[11:02:29] Santanu Das joined
[11:02:31] Jeremy Coles David is taking minutes.
[11:02:42] Queen Mary, U London London, U.K. joined
[11:02:48] Sam Skipsey joined
[11:02:53] Alessandra Forti joined
[11:03:17] Daniela Bauer joined
[11:03:25] Mark Slater joined
[11:03:55] Jeremy Coles We've started with Meetings & Updates
[11:04:00] raul lopes joined
[11:04:10] Ewan Mac Mahon joined
[11:05:13] Stephen Jones joined
[11:06:12] Raja Nandakumar joined
[11:09:31] Stuart Purdie Ah, as of yesterday, new ticket (_not_ in the walled garden, for WN/UI tarball: https://ggus.eu/ws/ticket_info.php?ticket=81496)
[11:15:11] Alessandra Forti did we skip exp updates?
[11:15:22] Jeremy Coles Yes - 11:03
[11:16:42] Gareth Smith joined
[11:19:33] Wahid Bhimji joined
[11:24:17] Stuart Wakefield joined
[11:26:44] Ewan Mac Mahon Is suspending it in EGI actually that big a deal if it's actually broken? Presumably it gets un-suspended when it comes back?
[11:29:17] Matthew Doidge I think there has to be some kind of recertification
[11:30:38] Jeremy Coles There is a procedure to go through before the site can be put back in production. As we just discussed though the actual impact of suspending a site needs further investigation since the expts. do not always use the information.
[11:31:32] Raja Nandakumar Sorry!
[11:36:58] Stuart Purdie https://wiki.egi.eu/wiki/OPS_vo
[11:38:43] Stuart Purdie And from the Ops-portal, the ops VO has 2 voms servers, voms.cern.ch and lcg-voms.cern.ch
[11:40:05] Ewan Mac Mahon I think everyone knows this, but I strongly support the new plan.
[11:41:29] Ewan Mac Mahon There's a wiki page at: https://www.gridpp.ac.uk/wiki/PerfSonarInstall that a) should be useful as-is, and b) can be built on.
[11:42:08] Wahid Bhimji is there any other option other than using this image
[11:42:15] Jeremy Coles From last week's GDB: https://indico.cern.ch/materialDisplay.py?contribId=6&materialId=slides&confId=155067
[11:42:32] Wahid Bhimji I can see ecdf systems team being nervous about the image
[11:43:08] Ewan Mac Mahon Really? Why?
[11:43:36] Queen Mary, U London London, U.K. Someone is typing in the background - can they mute please.
[11:43:45] Ewan Mac Mahon Also, there's not much option - it uses a custom kernel.
[11:43:55] Wahid Bhimji well how can they trust it security wise - could be anything
[11:44:03] Ewan Mac Mahon So could SL.
[11:44:12] Ewan Mac Mahon And especially EPEL.
[11:44:31] Wahid Bhimji hah well epel is only used for certain rpms
[11:44:32] Brian Davies is GridPP the expected community which we expect all UK sites at least to associate themselves to?
[11:45:06] Ewan Mac Mahon And who cares anyway? It's going to have privileged access to anything at ECDF, so as far as the rest of the system is concerned, it's just another random box in an internet full of them.
[11:45:29] Ewan Mac Mahon Sorry - it's NOT going to have special access to anything.
[11:45:43] Brian Davies oh and LHC?
[11:46:00] Ewan Mac Mahon ^ Maybe not the LHC one in the first instance.
[11:46:37] Ewan Mac Mahon But that's just based on a generally conservative approach rather than a specific objection.
[11:47:03] Santanu Das is the OPS thing right here: https://www.gridpp.ac.uk/wiki/GridPP_approved_VOs#CERN_VOMS_server_VOs
[11:52:09] Ewan Mac Mahon Also, Jeremy - can you see the chat window now?
[11:54:21] Jeremy Coles Yes
[12:04:43] Matthew Doidge Friday would be difficult for me.
[12:04:57] Ewan Mac Mahon A Wednesday (after the storage meeting) might be easier.
[12:05:06] Matthew Doidge Agreed
[12:05:30] Stuart Wakefield left
[12:05:41] raul lopes left
[12:05:42] Brian Davies left
[12:05:42] Raja Nandakumar left
[12:05:43] Mark Slater left
[12:05:45] Mark Mitchell left
[12:05:45] Rob Harper left
[12:05:45] Andrew McNab left
[12:05:46] Matthew Doidge left
[12:05:46] Govind Songara left
[12:05:48] Gareth Smith left
[12:05:48] Ewan Mac Mahon left
[12:05:51] Gareth Roy left
[12:05:53] Sam Skipsey left
[12:05:54] Mohammad kashif left
[12:05:55] Wahid Bhimji left