Minutes: Chris=1 Ewan=0 Duncan=1 Alessandra=1 Stuart=1 David=1 Stephen=1 Catalin=1 Rob=1

Present:
-------
Alessandra Forti
Andrew McNab
Andrew Washbrook
Brian Davies
Christopher Walker
Daniela Bauer
David Colling
David Crooks
Dan Traynor
Elena Korolkova
Govind Songara
John Bland
John Kelly
Matthew Doidge
Mingchao Ma
Raja Nandakumar
Raul Lopes
Rob Fay
Rob Harper
Sam Skipsey
Santanu Das
Stephen Jones
Stuart Purdie

Apologies:
---------
Mark M, Kashif, Catalin

ROD team update
---------------
Nagios affected for a couple of hours on Sunday night due to site-wide network problems with Oxford's Begbroke site. This is not related to the problems they are having with the Dell network switches.
Nagios has been updated to the latest release.
Incident 2 weeks ago where the Asia Pacific ROC tested UK sites and caused alarms. Should have no impact on availability and reliability of UK sites.

Tier-1 update: John Kelly
-------------
A couple of disk server failures Thursday and Friday (pretty routine), both Atlas.
Garbage collection policies changed to give better performance.
FTS - Atlas alarm ticket - very high load on SRMs. Stopping a machine from actively draining reduced that load.

Security: Mingchao Ma
---------------------
Incident ongoing. No GridPP sites have reported issues, though a few sites including the Tier-1 have reported scans. All info should be in the e-mails sent to security contacts. Will update all sites if there are new developments - including more details of the incident.
Andy McNab: We have been asked to block an IP. Has this been fed back to the ISP of the site in question?
Full details of the compromise are not yet known. It is believed that a user account was compromised, then used to log in to a development host which has a root vulnerability. Whilst it isn't known if this was the vulnerability used, sites should not assume internal systems are only visible to themselves, and are reminded to keep internal systems patched.

Tier-2 Issues
--------------
* Publishing - QMUL not publishing - but aware of this.
* NGI - In process of setting up NGI. Created in GOCDB. Will be some process of moving sites across to this.
* QMUL: SE issues - QMUL will upgrade to StoRM 1.7 - in the hope it might solve the problem.
* T2K space tokens
  - T2K are using a lot of space compared to other small VOs.
  - Storage group has recommended T2K move to space tokens so they can account for space, and not take all the space for other small VOs.
  - Imperial don't currently have enough space free to allow them to transfer all their data into a space token.
* Proxy delegation at Imperial - Daniela has no idea what the ticket is about and said so in the ticket.
* Hone jobs - Brunel waiting for update from EMI Cream CE.
* Snoplus: Tier-1 - waiting for Catalin to return.
* Pheno VO: high number of job failures. David Crooks - all pheno jobs going to SARA seem to fail, but dteam jobs work. Perhaps ticket should be reassigned to SARA.

Experiment problems and issues
-------------------------------
LHCb - Raja: Running fine, no particular problems.
Atlas:
Problem: Glasgow and QMUL not receiving database releases. Put offline for quite a while. Solution to avoid these problems is to move to latest CVMFS release. Also reduces load on storage element. Glasgow and Manchester have installed CVMFS.
QMUL: storage problem under load; WAN network bandwidth also saturated.
Oxford: Passing storage problem, presumably due to network break.
UCL: Software area full.
BHAM: downtime until yesterday. Still testing.
Squid changes: Cambridge - ticket now closed.
WMS: Steve Lloyd's tests: lots of failures due to RAL and Glasgow WMS overloaded. RAL fixed number of gridftp slots.
Manchester: CVMFS upgrade accidentally unset DEFAULT_SE environment variable, which caused Steve Lloyd's tests to fail.
Discussion over whether WMS was useful for Atlas.
Multi cloud operation: Roger Jones believed that all T2D sites are registered for multi cloud operation.
Alessandra didn't think that was the case.
CMS:
[11:38:57] David Colling Nothing too much to report
[11:39:16] David Colling Minor problems at Imperial meaning that we fell below 80% ... bad
[11:39:49] Jeremy Coles What was the underlying problem?
[11:39:54] David Colling On a positive note Bristol moved off 0% readiness! ... V. Good!
[11:40:34] David Colling Tape 0 isk 1 trials - make me nervous - as people know
[11:41:21] David Colling that should be Tape0 colon disk 1 ( not smiley face)

Events affecting job slot requirements:
------------------------------------
Summer conferences in August.

Site performance and Accounting
-------------------------------

Metrics
-------
PMB: T2 accounting periods not discrete - no plan to stop monitoring metrics - expectation is that it will be continuous - so don't hold off upgrades just because it is an accounting period.

EGI service operations security policy draft
--------------------------------------
https://wiki.egi.eu/wiki/Talk:SPG:Drafts:Operations_Policy
Main point is on page 6 of the document.
JC: Major changes are to terminology, and whilst there is no intention to make substantive changes, sites are advised to check this document while it is still in draft.

WLCG workshop
-------------
See Ian's summary slides and Jeremy's notes.
Things have gone well in first year of running.
Contention expected next year.
Some efficiency improvements possible at sites.
Concern that EGI and WLCG goals may not align.
Concerns over use of EMI 1.
How quickly do we move to SL6 and SL7?
Aligning computing models - focus on improving commonality between different groups and experiments.
Tier-3s: how independent should they be?
Storage and Data: LHCb plan to start doing some reprocessing at Tier-2s. Chaotic transfer of input files to Tier-2s. Plan is to transfer data over the WAN direct to WN. Only Manchester involved at present. Some concern was expressed over WAN link saturation.
Raja: LHCb are currently doing the throttling and plan to continue to do so.
LHCOne: Feeling that we are not as involved as we should be. Mark Mitchell will help drive the discussion about where we should be focussing in this area. Perfsonar network.
[12:03:09] David Colling I see this issue, but I think that it is very unlikely
[12:03:55] David Colling We made a policy decision not to be involved
[12:04:02] David Colling this was discussed at the PMB
[12:04:47] David Colling As this would take hardware money
[12:04:55] David Colling so may be monitoring
[12:06:08] David Colling We may end up needing to have a connection to the LHCONE backbone somwhere
[12:06:21] David Colling from Janet
Whole node scheduling
Memory usage
Middleware:
Opening talk gave lots of useful facts and figures. Each fill now provides more data than taken in 2010!!! Trigger rates increased by experiments, so taking more data.
Some issues with CREAM - divergence between WLCG, EMI and EGI.
Pileup becoming a problem.
Move from MONARC model to equal based architecture.
How are batch systems holding up? Some concerns about whether Torque/Maui are holding up. Some interest in SLURM - as a rewrite from the ground up.
Users want more grid stability. General failure rate around 10%. Lots of those failures in the UK are IO failures.
FTS - monitoring and FTS3.
CMS to replace jobrobot with their version of hammercloud.
Cloud usage: Some concern about sending proxies to commercial clouds.
Lots of storage talks mentioning http support.
Moving away from the strict MONARC model.
Talk about injecting pilot jobs directly into batch system rather than using grid submission (very ALICE like says Dave Colling).
SL5/SL6/SL7. What hardware support?
Dell discussion about new hardware that is coming out. It isn't clear that HEPSPEC will accurately represent performance of HEP jobs on the next generation of hardware.
CVMFS: Sites invited to install it - it solves many of the NFS problems. On lxplus and lxbatch since autumn 2010.
http://northgrid-tech.blogspot.com/2011/07/cvmfs-installation.html
[12:28:19] Alessandra Forti for who's interested
See also http://hepwww.rl.ac.uk/sysman/Nov2010/agenda.html

AOB
---
Tier-2 reports - hopefully everything in this week.
GridPP 27 meeting open for registration. The earlier you book the better. Please be as economical as possible.
Let Jeremy know if you have any topics that need to be discussed in the PMB/ops meeting.
EGI technical forum starting 19 September https://www.egi.eu/indico/conferenceTimeTable.py?confId=452#all - the week after the GridPP meeting.

Chat window
-----------
[11:01:28] Jeremy Coles Chris is taking minutes today.
[11:05:11] Mingchao Ma joined
[11:05:11] Elena Korolkova joined
[11:05:11] Sam Skipsey joined
[11:05:12] Alessandra Forti joined
[11:05:12] Stephen Jones joined
[11:05:12] Brian Davies joined
[11:05:12] Raja Nandakumar joined
[11:05:12] Stuart Purdie joined
[11:05:13] Rob Harper joined
[11:05:14] Andrew McNab joined
[11:05:14] Rob Fay joined
[11:05:14] Daniela Bauer joined
[11:05:16] John Kelly joined
[11:05:17] John Bland joined
[11:05:19] Matthew Doidge joined
[11:08:11] Christopher Walker joined
[11:08:40] RECORDING Christopher joined
[11:13:47] Stuart Purdie left
[11:14:25] Stuart Purdie joined
[11:18:23] Govind Songara joined
[11:20:28] Mingchao Ma CVE-2010-3847
[11:20:42] Mingchao Ma https://wiki.egi.eu/wiki/EGI_CSIRT:Alerts/liblinker-2010-10-18
[11:24:24] Santanu Das joined
[11:26:28] David Colling joined
[11:29:39] Elena Korolkova We have T2K space token in Sheffield.
[11:29:56] Elena Korolkova It was tested by T2K. It workes
[11:29:57] Andrew Washbrook joined
[11:31:55] Elena Korolkova I think You cab put 2 TB in spacetoken and then it's up to T2K to move data to the spacetoken
[11:32:33] David Colling sorry my microphone is jnot working
[11:32:41] David Colling why do they need space tokens
[11:32:42] David Colling ?
[11:33:00] Daniela Bauer So they don't suck up all the space they can get ....
[11:33:13] David Colling But they don't actually have that much data - surely?
[11:33:54] David Colling What is their request size at Resource Board?
[11:34:09] David Colling This is set by Glenn's
[11:37:53] Jeremy Coles I will check.
[11:38:49] David Colling Sorry ...
[11:38:57] David Colling Nothing too much to report
[11:39:16] David Colling Minor problems at Imperial meaning that we fell below 80% ... bad
[11:39:49] Jeremy Coles What was the underlying problem?
[11:39:54] David Colling On a positive note Bristol moved off 0% readiness! ... V. Good!
[11:40:27] Andrew Washbrook left
[11:40:34] David Colling Tape 0 isk 1 trials - make me nervous - as people know
[11:41:21] David Colling that should be Tape0 colon disk 1 ( not smiley face)
[11:47:34] David Colling So does CMS !
[11:49:06] David Colling That is what Roger said quite clearly that this was thew case
[11:49:33] David Colling He said this in Hamburg as well in another conversation
[11:49:49] David Colling Bristol!
[11:51:17] David Colling The next start date is the first day after the end of the current stop date
[11:54:18] Andrew McNab left
[11:59:00] Andrew McNab joined
[12:03:09] David Colling I see this issue, but I think that it is very unlikely
[12:03:55] David Colling We made a policy decision not to be involved
[12:04:02] David Colling this was discussed at the PMB
[12:04:47] David Colling As this would take hardware money
[12:04:55] David Colling so may be monitoring
[12:05:09] David Colling exactly
[12:06:08] David Colling We may end up needing to have a connection to the LHCONE backbone somwhere
[12:06:21] David Colling from Janet
[12:09:45] Raja Nandakumar Apologies - got to go.
[12:11:00] David Colling The comment about reserving machines for specific VOs was to do with whole node scheduling and somebody claimed that they were the same thing
[12:11:11] Raja Nandakumar left
[12:11:23] Jeremy Coles https://computing.llnl.gov/linux/slurm/
[12:17:22] David Colling Very ALICE like
[12:18:11] David Colling Certainly CMS will not!
[12:18:16] David Colling Certainly CMS will not!
[12:18:38] David Colling T3s yes
[12:19:13] David Colling Centos 6 is out (couple of weeks ago) and we are looking to move
[12:25:07] Christopher Walker Can someone buy Alessandra a headset!!!
[12:26:35] Alessandra Forti
[12:27:08] Alessandra Forti I had a loud speaker but suddenly it isn't working anymore. according to my mac it sucks too much power.
[12:28:10] Alessandra Forti http://northgrid-tech.blogspot.com/2011/07/cvmfs-installation.html
[12:28:19] Alessandra Forti for who's interested
[12:29:16] Jeremy Coles On monitoring overall (and the Site Status Board) https://indico.desy.de/materialDisplay.py?contribId=37&sessionId=3&materialId=slides&confId=4019
[12:29:25] Govind Songara RHUL having segfault with dpm 1.8.1
[12:30:01] David Colling I know that we are wrapping up but I to go. Byee
[12:30:09] David Colling left
[12:31:39] Stephen Jones Get we get 25 pence per mile?
[12:31:56] Sam Skipsey Only if you cycle it, Stephen.
[12:32:19] Mingchao Ma https://www.egi.eu/indico/conferenceTimeTable.py?confId=452#all