Operations team & Sites

Europe/London
EVO - GridPP Operations team meeting

Description
- This is the biweekly ops & sites meeting.
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area.
- The phone bridge number is +44 (0)161 306 6802 (CERN number +41 22 76 71400). The phone bridge ID is 77907 with code: 4880.
Apologies:
Slides
ops meeting 2 August 2011
=====================


https://indico.cern.ch/conferenceDisplay.py?confId=148738

Meetings and updates
=================

ROD team update
==============

no update

Nagios status
===========
Steve's tests have started failing after yesterday's update? Not sure why this is - otherwise it looks OK

EGI
===
Stuart attended the meeting yesterday - see his email - which mostly concerned new versions of software. UMD repositories are now the source. Survey on SL4 and glite-3.1.
Everything in SL4/glite-3.1 apart from the WMS is becoming unsupported.
Please update the wiki page: https://www.gridpp.ac.uk:443/wiki/SL4_Survey%2C_August_2011.
Chris: Am using lcg-ce and it seems more reliable than CREAM at the moment.

Tier-1 update
==========

Not much to report. Slight problem with the batch system not scheduling jobs on Friday. Tape migration problem for ATLAS. A couple of disk servers failing with memory problems. The site will lose external connectivity for an hour, 8am-9am on 9th August, for a firewall reboot. Minor problems with the power supply are being investigated - might need an interruption to the power supply one weekend (perhaps after the bank holiday).
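
Note on the firewall reboot time: the 8am-9am quoted here is local time, while the chat log gives the same window as 07:00 to 08:00 UTC in GOC DB. The sketch below checks that the two agree. It assumes the Europe/London time zone shown in the page header and Python 3.9+ with time zone data installed; the variable names are illustrative only.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library from Python 3.9

# Downtime window as declared in UTC (taken from the chat log).
start_utc = datetime(2011, 8, 9, 7, 0, tzinfo=timezone.utc)
end_utc = datetime(2011, 8, 9, 8, 0, tzinfo=timezone.utc)

# Convert to UK local time (assumption: Europe/London, as in the page header).
london = ZoneInfo("Europe/London")
start_local = start_utc.astimezone(london)
end_local = end_utc.astimezone(london)

print(start_local.strftime("%Y-%m-%d %H:%M %Z"), "to", end_local.strftime("%H:%M %Z"))
```

In August the UK is on BST (UTC+1), so the UTC window corresponds to 08:00 to 09:00 local time.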


Security update
============

Not much news


Tier-2 issues
==========

No news

Tickets
=====

Tickets: http://tinyurl.com/3uo5get

Will include someone from the FTS developers on the T2K myproxy tickets. Gareth: not sure who the appropriate person is, but he will try again.

T2K space tokens - Oxford waiting on size of token, QMUL - new token being tested - should be OK.

Biomed issue at Cambridge. Not clear how to stop supporting a VO. There is apparently some info on the storage wiki page.

SL4/DPM/32-bit - tickets on hold

Brian: re Cambridge: what happens if the user sets to be notified on solution (Kashif: which is the default) - difficult to discuss problem with them.


Experiment problems and issues
==========================

LHCb - No Raja
CMS - No Stuart

ATLAS - 

Tier-2: still problems with QMUL. Chris has updated 10 Gbps card driver. UCL has storage problem. RHUL had jobs failing but failures have now disappeared. Manchester had storage hardware failures - now recovered without loss of data. Glasgow had power-cut - now recovered. 

ATLAS news: CVMFS will be preferred software area in autumn - sites should start to look into it and a definitive timeline will be announced in next 2 weeks.

Brian: a lot of job failures in the UK are due to stage-out failures - trying to resurrect the job recovery mechanism.

Other VO issues
=============

Perhaps we need to review what we're doing in our support for small VOs. Any other issues that are known about? Is this meeting known about? Smaller VOs lack manpower. For the T2K problem it was unclear where the problem lay and therefore who was responsible.

No blacklisted sites. No known events affecting job slots. What happened with the skimming of the ATLAS conference agenda? Stuart is converting it into an ical feed. It works.

Accounting issues - Liverpool behind?

Any other topic needing discussion? Santanu: how do I separate torque server and CE? Suggestion to look at CREAM documentation.

Actions  
======

http://www.gridpp.ac.uk/wiki/Operations_Team_Action_items

O-110524-07: Glexec tarball status? Seems to be stalled.

O-112806-01: Fallback options for SE - Kashif now using 2 SEs for fallback (total 3). Quite robust now.

There was no other business

[11:00:54] Jeremy Coles Will wait 1 minute more.....
[11:02:23] Brian Davies is anyone talking?
[11:03:30] Jeremy Coles Yes
[11:03:48] Andrew McNab I can't hear anything either
[11:04:21] Andrew Washbrook i can hear you
[11:04:23] Jeremy Coles Most people seem connected ok.
[11:04:39] Andrew McNab joined
[11:04:55] John Bland left
[11:05:29] Andrew McNab left
[11:06:08] Mingchao Ma joined
[11:06:17] Andrew McNab joined
[11:06:23] Duncan Rand meeting URL?
[11:06:44] Alessandra Forti joined
[11:07:27] raul lopes joined
[11:07:54] Andrew Washbrook https://indico.cern.ch/conferenceDisplay.py?confId=148738
[11:08:24] Stuart Purdie https://www.gridpp.ac.uk:443/wiki/SL4_Survey%2C_August_2011
[11:08:29] Brian Davies phone bridge info on indico page os oncorrect, does someone ( jeremy) have correct details. ( Gareth Smith is trying to connect...)
[11:09:07] Gareth Smith joined
[11:10:37] Gareth Smith Can anyone confirm the phone bridge ID & code for this meeting? It seems incorrect on indico page. (I have no audio capability...)
[11:11:12] Jeremy Coles Yes I updated it. Try reloading the agenda page in case it is the old value.
[11:11:19] Jeremy Coles It is 77907
[11:12:27] Phone Bridge joined
[11:15:07] Elena Korolkova is RAL declaring DT for this time?
[11:15:27] John Bland joined
[11:15:41] Elena Korolkova on the (TH?
[11:15:41] Elena Korolkova 9th?
[11:15:45] Santanu Das joined
[11:18:20] Gareth Smith Just to confirm: RAL Site downtime on Tuesday 9th August (07:00 to 08:00 UTC) - declared in GOC DB. For reboot of site firewall.
[11:18:55] Elena Korolkova thanks. Gareth
[11:22:54] Queen Mary, U London London, U.K. ggus 72359 and 72358
[11:25:13] Jeremy Coles use https://ggus.org/ws/ticket_info.php?ticket=
[11:26:18] Elena Korolkova On t2k spacetoken: They don't use it. The fill ed our storage and sam tests were failing because of that.
[11:26:46] Elena Korolkova I 've decrease their spacetoken by 1 TB.
[11:30:27] Elena Korolkova Close the ticket. Say if you have more question please re-open
[11:30:36] John Bland elena: t2k have been filling up their pool here as well and are just starting to spill over into shared storage
[11:30:43] John Bland no space token usage that I can see, either
[11:39:04] Brian Davies http://dashb-atlas-job.cern.ch/dashboard/request.py/failedjobsstatus_individual?sites=UK&sitesSort=8&start=null&end=null&timeRange=lastMonth&sortBy=0&granularity=Daily&generic=0&type=aadp
[11:39:29] Stephen Jones oK
[11:47:58] Elena Korolkova Jon Perkin sometimes come to storage meetings
[11:54:12] Duncan Rand NoQueue in analysis activity since Jul 4 08:00
[11:56:35] Queen Mary, U London London, U.K. Can the conferences list end up on the wiki somewhere please.
[11:58:25] Jeremy Coles We might end up with pointers to existing pages.
[11:59:50] Elena Korolkova Can we run with one cream CE?
[12:00:01] Jeremy Coles http://www.gridpp.ac.uk/wiki/Deployment_Team_Action_items
[12:00:07] Alessandra Forti yes but it's less redundant
[12:00:32] Alessandra Forti i.e. if it goes down the whole site is down
[12:00:33] Elena Korolkova So we should have 2 cream ce for the same cluster?
[12:00:52] Jeremy Coles http://www.gridpp.ac.uk/wiki/Operations_Team_Action_items
[12:00:54] Alessandra Forti it's not compulsory it's safer if you have the resources I'd do it
[12:01:36] Phone Bridge left
[12:01:37] Elena Korolkova It's my understanding but we have local disagreement on this issue
[12:02:02] Stuart Purdie https://svr001.gla.scotgrid.ac.uk/cgi-bin/atlas.py is an ical feed of all the conferences that have many conferences - tuned to be about 4 to 6 a year. There's also https://svr001.gla.scotgrid.ac.uk/cgi-bin/ukidowntime.py which mixes it with all UK downtimes.
[12:02:36] Stuart Purdie (So, for example, that RAL outage Garath mentioned was already in my calendar)
[12:04:43] Gareth Smith left
[12:06:23] Andrew McNab left
[12:06:42] Jeremy Coles http://www.gridpp.ac.uk/wiki/Operations_Team_Action_items
[12:06:59] Gareth Smith joined
[12:09:22] Gareth Smith left
[12:09:29] Mark Slater left
[12:09:31] Robert Harrington left
[12:09:31] Stephen Jones left
[12:09:31] Elena Korolkova left
[12:09:31] Brian Davies left
[12:09:31] Mark Mitchell left
[12:09:32] Alessandra Forti left
[12:09:33] John Bland left
[12:09:33] David Crooks left
[12:09:33] Mingchao Ma left
[12:09:33] Chris Brew left
[12:09:35] Mohammad kashif left
[12:09:35] Andrew Washbrook left
[12:09:36] Santanu Das left
[12:09:36] Govind Songara left
[12:09:37] Jeremy Coles Duncan took minutes
[12:09:38] Daniela Bauer left
[12:09:39] Sam Skipsey left
[12:09:39] Rob Harper left
[12:09:43] Matthew Doidge left
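
Note on the ticket links: during the chat, two GGUS ticket numbers (72359 and 72358) were posted, followed by the URL prefix to use with them. A minimal sketch of how the two combine is below. The helper name ggus_ticket_url is hypothetical, and the sketch assumes the ticket number is simply appended to the prefix exactly as given in the chat.

```python
# URL prefix as posted in the chat; the ticket number is appended to it.
GGUS_TICKET_PREFIX = "https://ggus.org/ws/ticket_info.php?ticket="


def ggus_ticket_url(ticket_id: int) -> str:
    """Build a ticket link (hypothetical helper, not part of any GGUS tool)."""
    return f"{GGUS_TICKET_PREFIX}{ticket_id}"


# Ticket numbers mentioned in the chat.
ticket_urls = [ggus_ticket_url(t) for t in (72359, 72358)]
for url in ticket_urls:
    print(url)
```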



    • 11:00 11:20
      Meetings & updates 20m
      - ROD team update
      - Nagios status -- Steve's Nagios-based page (http://pprc.qmul.ac.uk/~lloyd/gridpp/nagios.html) shows errors since a Nagios update yesterday.
      - EGI update (thanks to Stuart P):
        "SGE-CREAM in SR - assuming it goes well, should be a new stable CREAM for SGE soon that should relieve some of the problems.
        gLite 3.1 survey: Can one person from each site send me a note on what/how many services they run on gLite3.1, and note if they can't upgrade the hardware to run SL5. (I think this is going to be mostly storage stuff, so if there's already a storage survey, point me at it.)
        Notes below:
        1.1 EMI-1 update
        Lots of details on slides: https://www.egi.eu/indico/getFile.py/access?contribId=0&resId=0&materialId=slides&confId=555
        Update 4 (21 July): BDII, WMS, ARGUS, trustmanager, CREAM and gridsite.
        Update 5, on Thursday, 4 August: ARC, BDII and VOMS Java API.
        Future (possibly 1st September): More BDII, FTS, HYDRA, GLExec and STORM.
        WMS showstopper found on Friday - so still with the devs.
        Looks like things are stabilising. We won't know if all the problems have been found, but it looks like the rate of bug finding is slowing (except maybe with the BDII - a lot of the work there is to replace the core of the gLite BDII with the one from ARC, because consolidation is good, and the ARC BDII implementation is better).
        1.2 Staged Rollout
        gLite 3.2
        CREAM 1.6.7 for SGE. Memory leak fix, in SR with LIP.
        UMD
        UMD release 1.1 yesterday: https://wiki.egi.eu/wiki/UMD-1:UMD-1.1.0
        UMD 1.2 planned for 12 September, expected to contain (amongst others): MPI, VOMS Oracle, dCache, Globus 5 packages (from IGE), WMS, ARC, BDII and Storm.
        2 Operational issues
        gLite 3.1
        A survey is desired to determine the usage of gLite 3.1 services. There's a few exclusions - but it's probably simpler if I collate a list of all 3.1 services, and whether the hardware could run SL5, or not. I'll post the summary back to gridpp-ops before submitting it."
      - Tier-1 update
      - Security update
      - T2 issues
      - General notes. A reminder that there is no GDB this month. The next one is on 14th September. At the moment no pre-GDB has been arranged (it would clash with our joint PMB team meeting). The GridPP27 related PMB-ops team F2F meeting is on Wednesday 14th 16:00-18:00 (http://www.gridpp.ac.uk/gridpp27/programme.html). If you have topics for discussion please let Jeremy know.
      Tickets: http://tinyurl.com/3uo5get
    • 11:20 11:40
      Experiment problems/issues 20m
      Review of weekly issues by experiment/VO
      - LHCb
      - CMS
      - ATLAS
      - Other
        T2K were still struggling with their proxy renewal issues yesterday. A few tickets have been opened but have progressed slowly. Please remind users that they are welcome to attend this meeting to discuss their issues in this agenda slot! The Tier-1 liaison meeting is also held to help them (as well as the LHC VOs) resolve any problems. Details for the Tier-1 meeting are:
        1) The agenda: http://www.gridpp.ac.uk/wiki/RAL_Tier1_Experiments_Liaison_Meeting
        2) The joining details:
           * Meeting URL: http://evo.caltech.edu/evoNext/koala.jnlp?meeting=eleBevvaveanaIaeIl
           * Phone bridge ID: 83566
        On the T2K issue....
      - Experiment blacklisted sites
      - Experiment known events affecting job slot requirements
      - Site performance/accounting issues
        Liverpool slightly behind on accounting compared to other sites?
      - Metrics review
      Atlas Report
    • 11:40 11:55
      Discussion 15m
    • 11:55 12:00
      Actions 5m
      - http://www.gridpp.ac.uk/wiki/Deployment_Team_Action_items
    • 12:00 12:01
      AOB 1m