Operations team & Sites

Europe/London
EVO - GridPP Operations team meeting

Description
- This is the biweekly ops & sites meeting.
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area.
- The phone bridge number is +44 (0)161 306 6802 (CERN number +41 22 76 71400). The phone bridge ID is 108203 with code: 4880.

Apologies: Raul, Tier-1 team (Gareth, Catalin, Brian), Mingchao
2011-08-30 GridPP Ops Team

Minutes by Steve Jones

Present: Stuart Purdie, Chris Walker, Jeremy Coles, Matthew Doidge, Ewan MacMahon, Andrew McNab, David Crooks, Alessandra Forti, Rob Fay, John Bland, Mark Mitchell, Mark Slater, Sam Skipsey, Duncan Rand, Elena Korolkova, Daniela Bauer, Andrew Washbrook, Steve Jones.

11:00           
Meetings & updates (20')                

- ROD team update

Rota: https://www.gridpp.ac.uk/wiki/ROD_rota . It needs to be updated - will/could be sent directly (action JC).

Glasgow switched to the NGI_UK section last week. The on-duty team registered as supporters. It went well. David Crooks (Glasgow) noticed from Steve Lloyd's test pages that SAM tests were not arriving, possibly because of the NGI switch.

Discussion on state of test pages. CW opines that status is unclear. Result of discussion is that status needs to be summarised and reported back (action JC). One known problem is that GOCDB contains no sub-ranges - no T2 pages before site detail is presented. This can be fixed, according to John Gordon.

- Nagios status 
Nothing to report.

- Tier-1 update
No-one available to comment - RAL closed for "discretionary day".

- Security update
No-one available to comment - RAL closed for "discretionary day". However, discussion took place of whether network topology details should be published freely online in a Twiki, i.e. http://www.gridpp.ac.uk/wiki/Site_networking

We don't want to make it too easy for wrong-doers. Final requirements from that discussion were:

A) The pages are required for several purposes (Robin Tasker, Janet, PMB etc.)
B) We don't want it open to the world - just those identified by (say) grid certificates.
C) Must be editable by admins, who will revise and maintain published info on subnet topology and monitoring software.

Questions remaining are the technical means (pages may make use of a key/prefix such as "Protected"); material may be split out/reclassified for diverse purposes (TBD). Should the policy be extended to HEPSYSMAN? Data may be duplicated in GOCDB (EM). Actions (AM, JC)

-- T2 issues
No issues reported.

-- General notes.
GDB and GridPP27 meetings coming up in September.

- Tickets

73878 - Manchester. LHCb. Dirac SAM jobs failed at worker node.
Comment (AM): Node problem. Fix in progress.

73773(T) - Durham. ATLAS. Jobs failing with "Expected output file does not exist".
Comment (EK): Still awaiting a full explanation. Action (EK): contact shift persons.

73644(T) - ATLAS. RAL T1. DATADISK has destination error. Solved (time-out) then reopened. 
Comment: In progress.

72160: Oxford. T2K space-token. 
Comment (EM): Will close, despite niggles. Action EM.

73280(T) - Brunel. Biomed. SE dc2-grid-65.brunel.ac.uk has errors. Old lsc file. Solved then reopened. WMS?
Comment: Still no definitive resolution.
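
The ticket numbers above can be looked up in GGUS by appending the number to the ticket_info URL quoted in the agenda (https://ggus.eu/ws/ticket_info.php?ticket=). A minimal Python sketch, assuming that URL pattern still holds; the helper name ggus_ticket_url is hypothetical and not part of any GGUS tooling:

```python
from urllib.parse import urlencode

# Base URL as quoted in the meeting agenda (assumed to be still valid).
GGUS_TICKET_INFO = "https://ggus.eu/ws/ticket_info.php"


def ggus_ticket_url(ticket: int) -> str:
    """Return the ticket_info URL for one GGUS ticket number (hypothetical helper)."""
    if ticket <= 0:
        raise ValueError("ticket number must be positive")
    return f"{GGUS_TICKET_INFO}?{urlencode({'ticket': ticket})}"


if __name__ == "__main__":
    # Ticket numbers taken from the list in these minutes.
    for number in (73878, 73773, 73644, 72160, 73280):
        print(number, ggus_ticket_url(number))
```

This only builds the links; it does not query GGUS or check that a ticket exists.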

 11:20          
Experiment problems/issues (20') 

- LHCb

Chris Walker is to ticket LHCb because some of the SAM monitoring links are broken (Action CW). The links are reached from this page: http://www.gridpp.ac.uk/wiki/Links_Monitoring_pages#Lhcb

- ATLAS

A discussion took place on ATLAS disk space usage. An email describing the current thinking has been sent by Alistair Dewhurst. German and French sites have a higher allocation than UK sites; the UK has the lowest. Many factors are involved in the decisions, and a review will be delivered soon (EK).

EK described some outstanding ATLAS tickets, including transfer problems between Manchester, Lancaster and CERN. As German and Italian sites have seen similar faults, it is considered to be a CERN problem.

There has been a full disk issue at Edinburgh, somehow related to software installation - test jobs have been requested.

Birmingham is sporadically falling below the HammerCloud threshold. Action (MS) to look into this. It may be a time-out condition.

Queen Mary (CW) had some trouble with storage that was fixed by a new release of the StoRM software.

- CMS
Nothing reported.

- Site performance/accounting issues

MD: Lancaster accounting (APEL) is broken. Matt suspects this is due to tiny format differences in the log files, which are inconsistent with what his version of APEL expects. CERN uses a different release. Plug-ins offer poor support. Development is slow. Good assistance but, so far, no joy. AF: use of LSF has risks - only CERN and INFN use customised releases.

A discussion took place on SGE (Sun Grid Engine). Several sites use it. Maintenance is now done by Univa, who will attempt to remove the Oracle dependency.

- Metrics review

 11:50         
Site news/issues/updates (05')         

Some VOs (biomed) have sporadic issues with CREAM, maybe due to proxy time-outs. Developers are aware.

 11:55         
Actions (05')         

- See http://www.gridpp.ac.uk/wiki/Operations_Team_Action_items

 12:00         
AOB (01')         

Mark Mitchell reports that the transition to IPv6 is made easier via STACK software, see:
http://scotgrid.blogspot.com/2011/08/two-stacks-are-better-than-one.html 

Chat Window...

[10:52:09] Alessandra Forti joined 
... 
[11:02:40] Jeremy Coles Stephen is taking minutes 
[11:04:24] David Crooks http://pprc.qmul.ac.uk/~lloyd/gridpp/atlas_samtest.html 
[11:06:48] Andrew Washbrook left 
[11:17:40] Queen Mary, U London London, U.K. http://www.gridpp.ac.uk/wiki/Links_Monitoring_pages 
[11:20:14] Ewan Mac Mahon Is someone filming a new Quatermass behind Elena? There are some very odd noises off. 
[11:21:10] RECORDING Stephen joined 
[11:22:28] Jeremy Coles http://panda.cern.ch:25980/server/pandamon/query?mode=pd2p&type=T2&period=30 
[11:24:23] Matthew Doidge thanks for clearing that up Elena, I'll stop scratching my head! 
[11:27:16] Andrew Washbrook ECDF: Dell are going to be on-site today to investigate our broken disk server 
[11:27:20] Alessandra Forti I hear hiccups 
[11:27:24] Elena Korolkova I'm sorry about the noise. It's construction work. 
[11:27:30] Ewan Mac Mahon Just on the data distribution; I don't want to disappear too much of this meeting into this discussion again, but I do think we've got more problems than just the total volume of data in the UK. 
[11:27:49] Jeremy Coles Yes. I am trying to have it discussed at GridPP27 
[11:27:52] Ewan Mac Mahon There are fundamental problems with some of the principles involved in the current setup. 
[11:28:04] Elena Korolkova Apparently, people think that nobody works in summer and we have such a noise every summer 
[11:28:06] Ewan Mac Mahon ^ Yes; looking forward to that 
[11:28:16] Ewan Mac Mahon (The meeting, not the building work) 
[11:28:29] Elena Korolkova which new release? 
[11:28:47] Elena Korolkova which was suppose to be in the end of August 
[11:29:19] Elena Korolkova I meant storm release? 
[11:31:52] Queen Mary, U London London, U.K. storm-frontend-server-1.7.1-5.sl5.dbg is the frontend release we are using. It's not been released to staged rollout yet AIUI. 
[11:33:52] Ewan Mac Mahon I thought there was a recent non-open release from Oracle, and the open source ones were essentially forks from the previous release(s). 
[11:34:03] Queen Mary, U London London, U.K. possibly. 
[11:35:03] Ewan Mac Mahon Though I think it's safe to say that we'd want to be using an open version of one sort or another. That's probably something worth nudging the EMI (?) folks that do the batch system integration about and seeding which grid engine they're going to support. 
[11:36:50] Stuart Purdie EMI _don't_ do batch system integration. It's the black hole in the whole middleware stack. 
[11:38:15] Ewan Mac Mahon Ah, couldn't quite remember where it lived at the moment, hence the question mark. No-where, by the sound of it. 
[11:38:22] Alessandra Forti @Stuart: and it is not the only problem. 
[11:39:00] Queen Mary, U London London, U.K. Indeed. Giving some money to univa to develop gridengine might be a good thing. 
[11:44:17] Alessandra Forti protected 
[11:45:35] Ewan Mac Mahon Sounds fine. We just need to know how to define the wildcard of who gets to see it, i.e. all certs, all UK certs, all dteam, all UK dteam. It rather depends what's easy/feasible to implement. 
[11:49:20] Queen Mary, U London London, U.K. Whoever can edit the wiki at the moment? 
[11:52:38] Ewan Mac Mahon We should probably take this to tb-support, but there are some scripts on the regional nagios that get lists of DNs that are allowed to access it. We might be able to reuse some of those. 
[11:57:25] Ewan Mac Mahon Sussex is/will be SGE 
[12:01:03] Jeremy Coles http://scotgrid.blogspot.com/2011/08/two-stacks-are-better-than-one.html 
[12:02:07] Alessandra Forti I can access it 
[12:02:28] Alessandra Forti http://planet.gridpp.ac.uk/ 
[12:03:42] Andrew McNab left
....
    • 11:00 11:20
      Meetings & updates 20m
      - ROD team update
        Reminder to the ROD team to respond to the rota dates. Now urgent!
        Glasgow has run under NGI_UK for a week - issues?
      - Nagios status
      - Tier-1 update
      - Security update
      -- T2 issues
      -- General notes.
      Direct link: http://tinyurl.com/3jjnvca
      If not working, indirect link: https://ggus.eu/ws/ticket_search.php (select support unit 'ROC_UK/Ireland' and Creation Date 'Any') or paste https://ggus.eu/ws/ticket_info.php?ticket= and type a ticket number for the URL.
      Below (T) means Team ticket.
      73878 - Manchester. LHCb. Dirac SAM jobs failed at worker node
      73773(T) - Durham. ATLAS. Jobs failing with "Expected output file does not exist". Coincident with powercut but issue continues.
      73644(T) - ATLAS. RAL T1. DATADISK has destination error. Solved (timeout) then reopened.
      73280(T) - Brunel. Biomed. SE dc2-grid-65.brunel.ac.uk has errors. Old lsc file. Solved then reopened. WMS?
      73280 - Brunel. Biomed - reopened SE issue (from Nagios tests).
      72359: RAL myproxy for T2K. Cross-ref 72358. Close now?
      72161: IC-HEP. T2K. 3TB spacetoken created. - marked solved today.
      72160: Oxford. T2K spacetoken. Ewan is this in progress or solved!?
      68865: UCL-HEP. Retirement of SL4 and 32bit DPM Head nodes and Servers. On hold
      68859: Durham. Retirement of SL4 and 32bit DPM Head nodes and Servers. On hold
      68858: Glasgow. Retirement of SL4 and 32bit DPM Head nodes and Servers. On hold
      68853: RAL T1. Master ticket. Brian reviewing recommended versions.
      68077: RAL T1: Mandatory WLCG InstalledOnlineCapacity not published. Expect test version this month.
      64995: RAL T1: No GlueSACapability defined for WLCG Storage Areas. Should have something you can test this month (August).
      57746: Cambridge. Karl has tested again (in August) and still sees problems!
    • 11:20 11:40
      Experiment problems/issues 20m
      Review of weekly issues by experiment/VO
      - LHCb
      - CMS
      - ATLAS
      - Other
      - Experiment blacklisted sites
      - Experiment known events affecting job slot requirements
      - Site performance/accounting issues
      - Metrics review
      ATLAS-report
      ATLAS-report-doc
    • 11:40 11:50
      Subnets 10m
      - Check through details here: http://www.gridpp.ac.uk/wiki/Site_networking
    • 11:50 11:55
      Site news/issues/updates 5m
      News from Brunel:
      - small issues biomed jobs in our EMI CreamCE. I believe another error with proxy delegation. I've emailed their developers. No reply.
      - plan to move Atlas to CVMFS in September, if I feel that it is stable
    • 11:55 12:00
      Actions 5m
      - Last 4: http://www.gridpp.ac.uk/wiki/Operations_Team_Action_items
    • 12:00 12:01
      AOB 1m