2011-08-30 GridPP Ops Team

Minutes by Steve Jones

Present: Stuart Purdie, Chris Walker, Jeremy Coles, Matthew Doidge, Ewan MacMahon, Andrew McNab, David Crooks, Alessandra Forti, Rob Fay, John Bland, Mark Mitchell, Mark Slater, Sam Skipsey, Duncan Rand, Elana Korolkova, Daniela Bauer, Andrew Washbrook, Steve Jones.

Meetings & updates (20')                

- ROD team update

Rota: https://www.gridpp.ac.uk/wiki/ROD_rota . It needs to be updated - will/could be sent directly (action JC).

Glasgow switched to NGI UK section last week. On duty team registered as supporters. Went well. David Crooks (Glasgow) noticed from Steve Lloyd test pages that SAM tests not arriving, maybe due to NGI switch?

Discussion on state of test pages. CW opines that status is unclear. Result of discussion is that status needs to be summarised and reported back (action JC). One known problem is that GOCDB contains no sub-ranges - no T2 pages before site detail is presented. This can be fixed, according to John Gordon.

- Nagios status 
Nothing to report.

- Tier-1 update
No-one available to comment - RAL closed for "discretionary day".

- Security update
No-one available to comment - RAL closed for "discretionary day". However, discussion took place of whether network topology details should be published freely online in a Twiki, i.e. http://www.gridpp.ac.uk/wiki/Site_networking

We don't want to make it too easy for wrong-doers. Final requirements from that discussion were:

A) The pages are required for several purposes (Robin Tasker, Janet, PMB etc.)
B) We don't want it open to the world - just those identified by (say) grid certificates.
C) Must be editable by admins, who will revise and maintain published info on subnet topology and monitoring software.

Questions remaining are the technical means (pages may make use of a key/prefix such as "Protected"); material may be split out/reclassified for diverse purposes (TBD). Should the policy be extended to HEPSYSMAN? Data may be duplicated in GOCDB (EM). Actions (AM, JC)

-- T2 issues
No issues reported.

-- General notes.
GDB and GridPP27 meetings coming up in September.

- Tickets

73878 - Manchester. LHCb. Dirac SAM jobs failed at worker node.
Comment (AM): Node problem. Fix in progress.

73773(T) - Durham. ATLAS. Jobs failing with "Expected output file does not exist".
Comment (EK): Still awaiting a full explanation. Actions (EK) to contact shift persons.

73644(T) - ATLAS. RAL T1. DATADISK has destination error. Solved (time-out) then reopened. 
Comment: In progress.

72160 - Oxford. T2K space-token.
Comment (EM): Will close, despite niggles. Action EM.

73280(T) - Brunel. Biomed. SE dc2-grid-65.brunel.ac.uk has errors. Old lsc file. Solved then reopened. WMS?
Comment: Still no definitive resolution.

Experiment problems/issues (20') 

- LHCb

Chris Walker is to ticket LHCb because some of the SAM monitoring links are broken (Action CW). The links are reached from this page: http://www.gridpp.ac.uk/wiki/Links_Monitoring_pages#Lhcb


- ATLAS

A discussion took place on ATLAS disk space usage. An email describing the current thinking has been sent by Alistair Dewhurst. German and French sites have a higher allocation than UK sites. UK has the lowest. There are many factors involved in the decisions and a review will be delivered soon (EK).

EK described some outstanding ATLAS tickets, including transfer problems between Manchester/Lancaster/CERN. As German and Italian sites have seen similar faults, it is considered to be a CERN problem.

There has been a full disk issue at Edinburgh, somehow related to software installation - test jobs have been requested.

Birmingham sporadically falling below HammerCloud threshold. Action (MS) to look into this. It may be a time-out condition.

Queen Mary (CW) had some trouble with storage that was fixed by a new release of StoRM software.

Nothing reported.

- Site performance/accounting issues

MD: Lancaster accounting (APEL) is broken. Matt suspects this is due to tiny format differences in the log files, which are inconsistent with his version of APEL. CERN uses a different release. Plug-ins offer poor support. Development is slow. Good assistance but, so far, no joy. AF: use of LSF has risks - only CERN and INFN use customised releases.

A discussion took place on SGE (Sun Grid Engine). Several sites use it. Maintenance is now done by Univa, who will attempt to remove the Oracle dependency.

- Metrics review

Site news/issues/updates (05')         

Some VOs (biomed) have sporadic issues with CREAM, maybe due to proxy time-outs. Developers are aware.

Actions (05')         

- See http://www.gridpp.ac.uk/wiki/Operations_Team_Action_items

AOB (01')         

Mark Mitchell reports that the transition to IPv6 is made easier by STACK software, see: 

