ops meeting 2 August 2011
=========================

https://indico.cern.ch/conferenceDisplay.py?confId=148738

Meetings and updates
====================

ROD team update
===============

No update.

Nagios status
=============

Steve's tests have started failing after yesterday's update? Not sure why this is - otherwise it looks OK.

EGI
===

Stuart attended the meeting yesterday - see his email - which mostly concerned new versions of software. UMD repositories are now the source. Survey on SL4 and glite-3.1: everything in SL4/glite-3.1 apart from the WMS is becoming unsupported. Please update the wiki page: https://www.gridpp.ac.uk:443/wiki/SL4_Survey%2C_August_2011.

Chris: Am using lcg-ce and it seems more reliable than CREAM at the moment.

Tier-1 update
=============

Not much to report. Slight problem with batch system not scheduling jobs on Friday. Tape migration problem for ATLAS. Couple of disk servers failing with memory problems. Site will lose external connectivity 8am-9am 9th August for an hour - firewall reboot. Minor problems with power supply being investigated - might need an interruption to the power supply one weekend (perhaps after bank holiday).

Security update
===============

Not much news.

Tier-2 issues
=============

No news.

Tickets
=======

Tickets: http://tinyurl.com/3uo5get

Will include someone from FTS developers on the T2K myproxy tickets. Gareth: not sure who the appropriate person is, but he will try again.

T2K space tokens - Oxford waiting on size of token; QMUL - new token being tested - should be OK.

Biomed issue at Cambridge. Not clear how to stop supporting a VO. There is apparently some info on the storage wiki page.

SL4/DPM/32-bit - tickets on hold.

Brian: re Cambridge: what happens if the user sets to be notified on solution (Kashif: which is the default) - difficult to discuss problem with them.

Experiment problems and issues
==============================

LHCb - No Raja

CMS - No Stuart

ATLAS - Tier-2: still problems with QMUL. Chris has updated 10 Gbps card driver. UCL has storage problem. RHUL had jobs failing but failures have now disappeared. Manchester had storage hardware failures - now recovered without loss of data. Glasgow had power-cut - now recovered.

ATLAS news: CVMFS will be preferred software area in autumn - sites should start to look into it and a definitive timeline will be announced in next 2 weeks.

Brian: a lot of job failures due to stage-out failures in the UK - trying to resurrect job recovery mechanism.

Other VO issues
===============

Perhaps we need to review what we're doing in our support for small VOs. Any other issues that are known about? Is this meeting known about? Smaller VOs lack man-power. T2K problem: unclear where the problem was and therefore who was responsible.

No blacklisted sites. No known events affecting job slots.

What happened for the skimming of ATLAS conference agenda? Stuart is converting it into an ical feed. It works.

Accounting issues - Liverpool behind?

Any other topics needing discussion? Santanu: how do I separate torque server and CE? Suggestion to look at CREAM documentation.

Actions
=======

http://www.gridpp.ac.uk/wiki/Operations_Team_Action_items

O-110524-07: Glexec tarball status? Seems to be stalled.

O-112806-01: Fallback options for SE - Kashif now using 2 SEs for fallback (total 3). Quite robust now.

There was no other business.

[11:00:54] Jeremy Coles Will wait 1 minute more.....
[11:02:23] Brian Davies is anyone talking?
[11:03:30] Jeremy Coles Yes
[11:03:48] Andrew McNab I can't hear anything either
[11:04:21] Andrew Washbrook i can hear you
[11:04:23] Jeremy Coles Most people seem connected ok.
[11:04:39] Andrew McNab joined
[11:04:55] John Bland left
[11:05:29] Andrew McNab left
[11:06:08] Mingchao Ma joined
[11:06:17] Andrew McNab joined
[11:06:23] Duncan Rand meeting URL?
[11:06:44] Alessandra Forti joined
[11:07:27] raul lopes joined
[11:07:54] Andrew Washbrook https://indico.cern.ch/conferenceDisplay.py?confId=148738
[11:08:24] Stuart Purdie https://www.gridpp.ac.uk:443/wiki/SL4_Survey%2C_August_2011
[11:08:29] Brian Davies phone bridge info on indico page is incorrect, does someone ( jeremy) have correct details. ( Gareth Smith is trying to connect...)
[11:09:07] Gareth Smith joined
[11:10:37] Gareth Smith Can anyone confirm the phone bridge ID & code for this meeting? It seems incorrect on indico page. (I have no audio capability...)
[11:11:12] Jeremy Coles Yes I updated it. Try reloading the agenda page in case it is the old value.
[11:11:19] Jeremy Coles It is 77907
[11:12:27] Phone Bridge joined
[11:15:07] Elena Korolkova is RAL declaring DT for this time?
[11:15:27] John Bland joined
[11:15:41] Elena Korolkova on the (TH?
[11:15:41] Elena Korolkova 9th?
[11:15:45] Santanu Das joined
[11:18:20] Gareth Smith Just to confirm: RAL Site downtime on Tuesday 9th August (07:00 to 08:00 UTC) - declared in GOC DB. For reboot of site firewall.
[11:18:55] Elena Korolkova thanks. Gareth
[11:22:54] Queen Mary, U London London, U.K. ggus 72359 and 72358
[11:25:13] Jeremy Coles use https://ggus.org/ws/ticket_info.php?ticket=
[11:26:18] Elena Korolkova On t2k spacetoken: They don't use it. They filled our storage and sam tests were failing because of that.
[11:26:46] Elena Korolkova I've decreased their spacetoken by 1 TB.
[11:30:27] Elena Korolkova Close the ticket. Say if you have more questions please re-open
[11:30:36] John Bland elena: t2k have been filling up their pool here as well and are just starting to spill over into shared storage
[11:30:43] John Bland no space token usage that I can see, either
[11:39:04] Brian Davies http://dashb-atlas-job.cern.ch/dashboard/request.py/failedjobsstatus_individual?sites=UK&sitesSort=8&start=null&end=null&timeRange=lastMonth&sortBy=0&granularity=Daily&generic=0&type=aadp
[11:39:29] Stephen Jones oK
[11:47:58] Elena Korolkova Jon Perkin sometimes comes to storage meetings
[11:54:12] Duncan Rand NoQueue in analysis activity since Jul 4 08:00
[11:56:35] Queen Mary, U London London, U.K. Can the conferences list end up on the wiki somewhere please.
[11:58:25] Jeremy Coles We might end up with pointers to existing pages.
[11:59:50] Elena Korolkova Can we run with one cream CE?
[12:00:01] Jeremy Coles http://www.gridpp.ac.uk/wiki/Deployment_Team_Action_items
[12:00:07] Alessandra Forti yes but it's less redundant
[12:00:32] Alessandra Forti i.e. if it goes down the whole site is down
[12:00:33] Elena Korolkova So we should have 2 cream ce for the same cluster?
[12:00:52] Jeremy Coles http://www.gridpp.ac.uk/wiki/Operations_Team_Action_items
[12:00:54] Alessandra Forti it's not compulsory it's safer if you have the resources I'd do it
[12:01:36] Phone Bridge left
[12:01:37] Elena Korolkova It's my understanding but we have local disagreement on this issue
[12:02:02] Stuart Purdie https://svr001.gla.scotgrid.ac.uk/cgi-bin/atlas.py is an ical feed of all the conferences that have many conferences - tuned to be about 4 to 6 a year. There's also https://svr001.gla.scotgrid.ac.uk/cgi-bin/ukidowntime.py which mixes it with all UK downtimes.
[12:02:36] Stuart Purdie (So, for example, that RAL outage Gareth mentioned was already in my calendar)
[12:04:43] Gareth Smith left
[12:06:23] Andrew McNab left
[12:06:42] Jeremy Coles http://www.gridpp.ac.uk/wiki/Operations_Team_Action_items
[12:06:59] Gareth Smith joined
[12:09:22] Gareth Smith left
[12:09:29] Mark Slater left
[12:09:31] Robert Harrington left
[12:09:31] Stephen Jones left
[12:09:31] Elena Korolkova left
[12:09:31] Brian Davies left
[12:09:31] Mark Mitchell left
[12:09:32] Alessandra Forti left
[12:09:33] John Bland left
[12:09:33] David Crooks left
[12:09:33] Mingchao Ma left
[12:09:33] Chris Brew left
[12:09:35] Mohammad kashif left
[12:09:35] Andrew Washbrook left
[12:09:36] Santanu Das left
[12:09:36] Govind Songara left
[12:09:37] Jeremy Coles Duncan took minutes
[12:09:38] Daniela Bauer left
[12:09:39] Sam Skipsey left
[12:09:39] Rob Harper left
[12:09:43] Matthew Doidge left