Operations team & Sites

Europe/London
EVO - GridPP Operations team meeting

Description
- This is the biweekly ops & sites meeting.
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area.
- The phone bridge number is +44 (0)161 306 6802 (CERN number +41 22 76 71400). The phone bridge ID is 78425 with code 4880.

Apologies: Pete, Matt, Kashif
 Tuesday 26 July 2011
 11:00         
Meetings & updates (20') [attachment: MB Tech Groups (pdf)]

- ROD team update

Only issue from John Walsh last week was QMUL, mostly SE issues.

- Nagios status
Steve Lloyd's tests now have a Nagios version: http://pprc.qmul.ac.uk/~lloyd/gridpp/nagios.html. There is no experiment SAM equivalent, but do we need one?

Steve Lloyd has revamped his page to use Nagios tests, not CERN SAM pages. Only for Ops VO.

Jeremy: Do we need experiment Nagios tests on these pages? Do people find them useful? If not, they will expire once the SAM tests close at the end of August.

Chris: I hadn't realised that there were experiment Nagios pages. The SAM pages have been useful in debugging site issues.

Wahid: They are useful.

Jeremy: We can give feedback to say that we'd like dashboard pages like Steve's.

- Tier-1 update

Gareth: Three things to report. On Friday, an ATLAS LFC failure caused test failures due to too many directories in one directory. The site was offlined by ATLAS as a result, but the issue is now resolved.

Over the weekend there were two disk servers which failed and were taken out of production, one LHCb and one CMS. Those are being worked on.

Finally, failures on some FTS channels; we are trying to track down the cause. Andrew Lahiff and Brian have done a lot of work tracking down the issue. We have a specific date for the start of the problem, July 14th, but we don't yet have the cause or a fix.

Brian: Simon at IC was worried that this was caused by his upgrade to the new dCache (1.9.12). We don't think it's this. It is affecting Imperial, but not due to the upgrade.

- Security update

Mingchao: Since last week, not much update from the incident sites. We are trying to get more information and asking for a final report. One problem is that one site is missing logs due to misconfiguration, which makes incident investigation very difficult.


-- T2 issues
Related to the T2K requests. The PMB response is that on the technical side the storage group advice should be followed. The GridPP commitment to "other" VOs is 10%. If requests cannot be managed then comment to this effect in response to the tickets and make Jeremy aware of the status.

Jeremy: People should follow the strategy from the Storage Group. Our commitment to other VOs is 10% of storage. Action on Glen/Jeremy to make sure that this is documented somewhere.

The request is to respond to the tickets if you can't do it. Dave talked about the issues at Imperial; the response there should be to reply to the ticket saying that it isn't possible.

Daniela: I have spoken to Dave; he had misunderstood the issue. I can shuffle space, but needed a final word on how much space. We've agreed a number and will borrow space.

Jeremy: from other sites, Oxford had asked how much space was needed.

-- General notes.
There is no August GDB. There are MBs, with the next one on 9th August, where the mandate of the "technical working groups" will be discussed (see attached file).

Jeremy: The idea is to have a number of groups, covering Data Management, Storage Management, Workload Management, Databases, Security and Operations. Ian was asking for feedback on the proposed groups and for suggestions for chairs of the different groups and their constituent members.

Mingchao is interested in the security model working group.
Chris noted that one thing missing from the security model, from the UK side, is Moonshot (see chat transcript for link).
Mingchao: this would bring single sign-on for application services as well as network services.

Brian will make sure that the relevant sections get brought up at the Storage Group.
Operations: something that Jeremy is interested in.

- Ticket status: http://tinyurl.com/3uo5get

TEAM

https://ggus.org/ws/ticket_info.php?ticket=

72768 - Glasgow CVMFS -> NFS for LHCb

Sam: CVMFS broke catastrophically for Glasgow in a weird and unrepeatable way. There were errors in files that were cached on WNs in a semi-random way, which meant that some jobs were fine while other jobs failed unpredictably depending on which files they were accessing. We turned off CVMFS. Since then we have had quite a lot of support from the CVMFS devs, and we have a pre-release version which may fix the issue. We have made changes to squid.conf to make it more resilient. We will probably test CVMFS today; however, of note, the problem didn't show until we had rolled it out to the full cluster, so we will be rolling it out across the cluster very carefully.

Mark: Also going to look at QoS settings around the squid. Not sure if that will improve things, but it will be run as a separate project.
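
[The specific squid.conf changes made at Glasgow are not recorded in these minutes. As a rough illustration only, the sort of resilience-oriented settings commonly recommended for a site squid in front of CVMFS look like the sketch below; the directive values and the WN subnet are illustrative, not Glasgow's actual configuration.]

    # /etc/squid/squid.conf -- illustrative excerpt, not the actual Glasgow config
    http_port 3128
    # Restrict the proxy to the local worker nodes (hypothetical subnet)
    acl wn_subnet src 10.0.0.0/16
    http_access allow wn_subnet
    http_access deny all
    # Generous caching so repeated CVMFS catalog/file requests are served locally
    cache_mem 128 MB
    maximum_object_size 1024 MB
    maximum_object_size_in_memory 128 KB
    cache_dir ufs /var/spool/squid 50000 16 256
    # Resilience: merge identical concurrent requests and do not cache failures
    collapsed_forwarding on
    negative_ttl 0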

Alessandra: Since the CVMFS problem is only at Glasgow, it seems to be a local issue.

Sam: Probably related to configuration at Glasgow. However, I disagree about the exact cause of the issue. The patched version should fix the underlying problem. I do agree that people should not shy away from installing CVMFS, but they should do so carefully.

72075 - QMUL ATLAS transfer issue with SE. On hold

Chris: StoRM has been crashing for 2 weeks; it got bad enough that I was checking it every minute. Have upgraded StoRM to 1.7, which didn't go smoothly. Upgraded on Friday; it crashed 3 more times. The devs sent a version which they said would fix it. Better, but still not perfect. The backend storage is still heavily loaded at times. Have put a cap on the number of concurrent analysis jobs.

The external network team have upgraded a router, which has unfortunately made things worse. There have been problems across QMUL. Now hitting the limits of the external link.

Elena (transcript): Should be closed from Atlas perspective.

Jeremy: Can someone look at closing this?

Chris: I'll do this.

71640 - Cambridge - biomed copying issue. VO no longer supported [What is the process to clear a VO]

Santanu: They have not confirmed the status of the ticket and have not answered for a long time. Not sure of the steps needed to stop supporting a VO.

Jeremy: Should be documented, not sure that it is. Is anything documented in the Storage Group?

Brian: Not that I'm aware of. We do need to come up with a recommendation. We do have recommendations for when you close an SE.

Jeremy: Probably similar. Are the people you contact then the same as for a VO?

REGULAR

72728 - ATLAS UCL s/w area full. Move to CVMFS or delete older (but still production) releases.

Duncan: We have 2 clusters with 2 different software areas. Discussed yesterday rationalising that, thus doubling the space available to one software area. CVMFS is the answer in the long term.

Jeremy: What is the recommended space for software areas?

Duncan: It used to be 250 GB but might have gone up.

Jeremy: Problem due to larger number of releases.

Duncan: Might be different for production vs analysis.

Alessandra: Should remove the older releases. Jobs targeted at releases that have been deleted will fail for a short time until the site BDII updates with the release information. Panda does not currently update directly but goes through the BDII, which has a delay. The ATLAS requirement for space is 250 GB. It is a good idea to unify the two NFS areas.

72359 - T2K proxy delegation failed at IC. Reassigned to RAL. Status?

Jeremy: will follow up offline.

T2KORGDISK
72161 - IC (comments from user)
72160 - OX How much space?
72156 - QMUL No site response

72031 - Brunel. EMI CREAM CE trustmanager issue. hone jobs abort. Put ticket on hold?

Raul: Try to upgrade the CE and WN and see what happens.

Jeremy: It's not actually a production release?

Raul: The EMI CREAM CE has been patched so far; mostly working OK.

71903 - sno+ on RAL LFC. Working so closed ticket.

71294 - pheno issue was Glasgow. Now assigned to SARA (in progress).

Jeremy: They're looking at it; in progress.

Retirement of SL4 32-bit headnodes
68865 - UCL
68859 - Durham
68858 - Glasgow
68863 - RAL

Brian: A case of following up in the Storage Group. There is debate on how to handle multiple upgrades with EMI and UMD. The urgency is under debate.

68077 - RAL Info publishing. Needs an update from Jens.
64995 - RAL No GlueSACapability defined for WLCG Storage Areas. on hold

Jeremy: Gareth, could you ask Jens for an update on these?

Gareth:  Will do.

57746 - Cambridge logging info issue. Needs outside support. Following up.

Jeremy: Followed up with Martin this morning

 11:20         
Experiment problems/issues (20')

Review of weekly issues by experiment/VO

- LHCb

Raja: Nothing significant to report. A disk server at RAL failed yesterday; I believe it is back.
Jeremy: Do you know what proportion of sites have moved to CVMFS?
Raja: Not sure; not all T1s have moved to CVMFS.
Chris: I believe I have enabled it for LHCb.
Raja: There is nothing for LHCb to do, it's automatic. Sites just need to set the environment variable correctly; it should point to the same head location.

[Environment variables:  see transcript]
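
[For reference, a minimal sketch of what this looks like on a worker node, assuming the standard CVMFS client configuration file and the usual VO_LHCB_SW_DIR variable. The proxy hostname and cache size below are illustrative; the only detail confirmed in the transcript is the /cvmfs/lhcb.cern.ch path.]

    # /etc/cvmfs/default.local -- illustrative excerpt
    CVMFS_REPOSITORIES=lhcb.cern.ch                       # repositories to mount on the WN
    CVMFS_HTTP_PROXY="http://squid.example.ac.uk:3128"    # hypothetical site squid
    CVMFS_QUOTA_LIMIT=10000                               # local cache quota in MB

    # Worker-node environment: point the LHCb software area at the CVMFS mount
    export VO_LHCB_SW_DIR=/cvmfs/lhcb.cern.ch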

- CMS

Stuart: Things are going OK, less busy than last week. Imperial is having a few issues with CMS tests, and some issues with CREAM, which is flagging up more test errors than previously; a mix of errors, and it seems generally less reliable. CMS has flagged Imperial as having a problem; the site did have dCache issues 2 weeks ago and seems to have got somewhere with that. There seems to be a mismatch between what CMS thinks is at the site and what is actually there, which is causing an error flag. Hopefully CMS will update their records, which should fix this. (Dave Colling notes in the transcript that Bristol/PPD are having minor issues.)

- ATLAS

Alessandra refers to the uploaded report: QMUL had storage problems for the week. Chris upgraded on Saturday; this was tested over the weekend and the site was put back online. UCL has a software area issue. Manchester has had a "black" weekend, with servers going down one after another. Production was put back online, with most of the infrastructure on CVMFS; analysis is offline.

Glasgow had problem with HC tests due to CVMFS, under investigation.

Problems generated by ATLAS: the LFC at RAL reached a limit for these tests. To solve the problem, the entire cloud was turned off and each site was manually whitelisted. DDM started a clean-up and things are back to normal. ATLAS is devising a procedure for this problem.

Jeremy: How long were we out for?

Alessandra: Just a few hours; Alistair was quick to note the issue.

Manchester had 3500 job failures, a new mode of failure because we're multi-cloud. Transfers timed out because CERN thought that the jobs were running locally at CERN. This, plus Castor sickness over the weekend, caused the failures. Parameters have been changed in Panda. Glasgow seems to have a similar problem with the French cloud, to a lesser extent. The settings for the UK, CERN and French clouds are now all the same, extended from 1 to 2 days.

Sheffield and RAL are suffering from one user's jobs, which ran for 4 hours and then failed. There is worry from Sheffield about wasted CPU.

Jeremy: Primarily a VO issue?

Alessandra: Concerns over accounting.

Lancs, ECDF and Ox are now T2D. Liverpool should be added as a candidate soon.

Jeremy: A request has been made that all T2Ds become multi-cloud.
Alessandra: This will proceed slowly; a multi-cloud site has to be very stable.

- Other

- Experiment blacklisted sites

- Experiment known events affecting job slot requirements

- Site performance/accounting issues

Jeremy: We'll come to Lancs in a moment.

- Metrics review

 11:40         
Site roundtable (10')         

Jeremy: John Gordon has asked about whether sites have installed EMI ARGUS: "EMI have asked to end standard support for glite 3.2 Argus immediately. I argued to keep to the agreed end of glite 3.2 in October but perhaps to bring forward the end of security support from April 2012. Has anyone tried EMI-1 Argus yet? How many UK sites run glite 3.2 Argus?"

[see transcript]

Brunel: Running both EMI 1 and glite 3.2

Jeremy: Did you see any advantage in either?

Brunel: Both stable, problems with proxy delegation with EMI 1

Jeremy: Please send me any tickets you generate with respect to EMI 1.

Chris: These issues, are they specific to EMI?

Raul: Not sure, only seeing it on EMI 1

Jeremy: I'll feed this back to John.

Lancaster: "...confirmed last week that we haven't been publishing accounting data for one of our CEs. The version of apel we had couldn't parse the lsf log files. We've upgraded apel on our CE to a version that speaks lsf fluently, but pushing out the vast backlog of data is proving troublesome. We're looking at that, as well as trying to hack together a tool that will give us an idea of the CPU hours we're being unaccounted for." Plan to install glexec in August.

Chris: Relocatable install: problems with the SE have meant I haven't done much more than have a brief go at compiling it. It fails to compile due to lack of the lcmaps library.

Jeremy: Ewan, can you comment? You mentioned that UK consensus hadn't been given to developers.

Ewan: That's a fair summary; we're going to install from source. We'd like a binary tarball, like the old glite-WN tarball but for glexec. We could work round this in the UK, but it's best done upstream.

Jeremy: Comments probably best coming from sites.

Chris: If running configure/make/make install works then I don't have a problem with that.

Jeremy: Odd that we have to configure/make software for ourselves. Guessing that other sites (ECDF) are waiting to see what happens?

Andy: That's correct.

Elena: Considering installing CVMFS in Sheffield. There is an issue with the network, around becoming a T2D for ATLAS.
Daniela: Trouble with the CREAM CE using SGE. Talking to the developers but to no avail. I have the code and will go through it to see if I can fix it myself. Will install the EMI WMS.
Govind: [see transcript]
Brunel: Nothing to add
David: Nothing to add for Glasgow
Mark Slater: Glexec ARGUS passing tests now. Not much else to report
Sam: Nothing to add
Rob Harper: A couple of dCache issues, now sorted. Had an LDAP issue: the LDAP server ran out of file handles. Had nscd issues on the WNs; the maximum number of file handles is now set higher. About to migrate one LCG-CE to CREAM over the next week.
Chris: The CREAM CE seems to be failing from time to time, definitely an issue. I've already mentioned the SE network issues; some downtime this week. Need to fix accounting.
Duncan: Nothing to add
Stephen Jones: Had to use the disaster recovery plan: installed a new CREAM CE, and the new APEL install looks in a different place for files.
Andy: Nothing much to report from ECDF. Needed to replace a RAID card in a pool server; this was done by an engineer who didn't get it signed off, and Wahid is looking at this. Issues with the CREAM CEs with SGE. Have an advance copy of one of the main processes causing issues and are evaluating it now.
Santanu: Will try glexec/ARGUS this week or next week, and the EMI version after that. Upgrading DPM.
Robert Harrington: Nothing to add for ECDF
Stuart: Nothing huge. The EMI WMS is about to enter staged rollout very soon. The UMD stuff is there; the general idea from EMI is that people should be using UMD, though maybe not quite yet for critical services.
Mark: Finally received the GridMon configuration; having a look at this with Andy Pickford this week.
Jeremy: We need to be careful about how much effort we put into GridMon.
Ewan: Software: not changing very much, so quite stable. Having difficulties on the hardware side: the new switches stopped switching traffic, which has taken us offline for chunks of time. Currently cabled with gigabit links everywhere; the only problems have been with the 10G kit. The network should be stable now. Will put the 10G kit back in once the problems are resolved.

 11:50         
Status of glexec & ARGUS deployment (05')         

See transcript


 11:55         
Actions (05')         

- http://www.gridpp.ac.uk/wiki/Deployment_Team_Action_items

 12:00         
AOB (01')         
http://indico.cern.ch/event/147927


[11:00:17] Mark Slater joined
[11:01:10] Raja Nandakumar joined
[11:01:11] Sam Skipsey joined
[11:01:32] Rob Harper joined
[11:02:08] Queen Mary, U London London, U.K. joined
[11:02:25] Duncan Rand joined
[11:02:39] Stephen Jones joined
[11:02:39] Jeremy Coles David is taking minutes today
[11:02:51] Gareth Smith joined
[11:02:51] RECORDING David joined
[11:03:26] Alessandra Forti joined
[11:03:59] Andrew Washbrook joined
[11:04:58] Wahid Bhimji Kashif replied
[11:04:59] Wahid Bhimji https://sam-atlas.cern.ch/nagios/

[11:06:05] Santanu Das joined
[11:06:10] Robert Harrington joined
[11:07:54] Phone Bridge joined
[11:08:21] Elena Korolkova UK wasn't offline for ATLAS. HC exclusion was off
[11:09:16] Stuart Wakefield joined
[11:09:39] Stuart Purdie joined
[11:10:37] Phone Bridge left
[11:10:40] Andrew McNab joined
[11:12:42] Mingchao Ma joined
[11:13:03] Mark Mitchell joined
[11:13:22] David Colling joined
[11:15:56] Stuart Purdie http://www.project-moonshot.org/
[11:23:18] Wahid Bhimji left
[11:23:31] Alessandra Forti left
[11:24:01] Elena Korolkova 72075 for QM - I think it should be close from ATLAS point of view
[11:24:29] Alessandra Forti joined
[11:26:00] Elena Korolkova 72728 is assign to ATLAS, not to UCL
[11:26:50] Mingchao Ma got to go, need to join the OMB meeting again
[11:27:04] Jeremy Coles Thanks
[11:27:05] Mingchao Ma left
[11:27:12] Andrew McNab I'm going to lose the connection at 11:30 due to planned emergency power test here. If I'm not back by the glexec section, then we have it installed and working with ARGUS on all the WNs of one cluster here and are about to request the Nagios testing of it
[11:27:43] Jeremy Coles Thanks
[11:27:49] Andrew McNab left
[11:30:50] Elena Korolkova if they open the ticket and do not answer to it I think you can close the ticket.
[11:32:30] Alessandra Forti 250gb
[11:32:43] Alessandra Forti yes
[11:32:43] Elena Korolkova 400 GB, I think
[11:39:27] Ewan Mac Mahon joined
[11:42:02] Alessandra Forti lhcb is automatic
[11:42:49] Alessandra Forti /cvmfs/lhcb.cern.ch
[11:44:49] David Colling so there are problems at the other T2 sites
[11:45:09] David Colling so has (PPD and Bristol)
[11:45:15] David Colling so has PPD (and Bristol)
[11:45:32] David Colling both minbor
[11:45:34] David Colling sorry
[11:45:38] David Colling need a new mike
[11:45:47] Alessandra Forti Chris, you need to add /cvmfs/lhcb.cern.ch to the cvmfs configuration too obviously
[11:52:46] Elena Korolkova it's analysis jobs
[11:58:01] Elena Korolkova we run glite3.2 Argus
[11:58:17] Mark Slater I think Bham are as well
[11:58:23] Mark Slater just making sure
[11:58:28] Queen Mary, U London London, U.K. Not running either - but clearly need to install one of them
[11:58:30] Govind Songara rhul also glite3.2
[11:58:45] Mark Slater BHAM confirmed - glite 3.2 ARGUS
[11:58:47] Stephen Jones glite ARGUS 3.2.4-2, no EMI
[11:58:49] Ewan Mac Mahon Oxford is running glite-32 argus.
[11:59:15] Stuart Purdie Gasgow is with gLite 3.2 SCAS, planning to move to EMI-argus at some point. I don't know if SCAS is implicated in this case, however
[11:59:33] Brian Davies left
[11:59:46] Rob Harper RALPP is on glite 3.2 ARGUS
[12:02:44] Gareth Smith left
[12:10:03] Stuart Purdie Technically, EMI-WMS is in verification / staged rollout; and not a production release as yet ... (as I'm sure Daniela is aware, but for the record....)
[12:12:48] Govind Songara RHUL having problem with DPM 1.8.x segfault for last one month (7-8 time at least), in touch with developer, but still no breakthrough,
[12:15:31] Rob Harper Have to go, sorry.
[12:15:34] Rob Harper left
[12:17:08] Alessandra Forti Manchester biggest problem right now is the storage and also top-bdii has periodic freshness problems (I have to go)
[12:17:23] Daniela Bauer To Stuart: I am on the early adopters list, hence my low expectations...
[12:17:31] Alessandra Forti left
[12:20:46] Jeremy Coles reminder: http://www.gridpp.ac.uk/wiki/RAL_Tier1_Experiments_Liaison_Meeting for those who might be interested in hearing more from the Tier-1.
[12:22:12] Robert Harrington left
[12:22:13] Mark Slater left
[12:22:17] Andrew Washbrook left
[12:22:17] Stephen Jones left
[12:22:18] Raja Nandakumar left
[12:22:18] Mark Mitchell left
[12:22:19] Duncan Rand left
[12:22:19] Ewan Mac Mahon left
[12:22:19] Elena Korolkova left
[12:22:25] Stuart Wakefield left
[12:22:50] Daniela Bauer left
[12:22:50] raul lopes left
[12:23:01] Govind Songara left
[12:26:20] Queen Mary, U London London, U.K. left