11:00 Experiment problems/issues (20') Review of weekly issues by experiment/VO.

- LHCb (Raja): Problem with Lancaster, otherwise no major issues.
- CMS (Stuart): The Imperial CEs are straining a bit under the extra 1000 job slots they are serving.
- ATLAS (Alessandra, report linked from the agenda):
  UKI-SOUTHGRID-OX-HEP: a data server has been down for a while; the data on it have been declared lost.
  UKI-SOUTHGRID-BHAM-HEP: the site is decommissioning the shared cluster and has moved to CREAM-only CEs. New queues have been added in Panda and the old one removed. There are still what look like NFS problems to be solved.
  UKI-SOUTHGRID-RALPP: problems with a storage failure.
  UKI-LT2-UCL-HEP: Ben restored DPM to a working state, but there are problems with GOCDB: the downtime entry has been deleted from the UCL page, yet the downtime object accessed by AGIS persists, so AGIS keeps thinking the site is in downtime. A ticket has been opened for GOCDB to delete the object. The site is manually on in DDM, but the shifter hasn't put the production queue online. The analysis queue will remain offline until the site has been stable in production for a few weeks.
  UKI-LT2-RHUL-HEP: a problem with certificate renewal on some data servers was affecting jobs. It has been solved.
  UKI-SCOTGRID-GLASGOW: problem with storage during the weekend.
  UKI-SCOTGRID-ECDF: the DATADISK space token is too full. Likely reasons: a) DATADISK is now being used for production input files; b) the site is running more ATLAS jobs than the storage can support. Data aren't being, or cannot be, deleted fast enough, with the consequence that writing to the space token might get blacklisted more often than usual. Jobs can keep on running, though.
  CMTSITE timeouts in CVMFS: many sites are seeing this problem. It seems to be caused by a slow cache update. It comes in bursts, is not continuous, and is independent of the release. The last time it happened in Manchester - last Friday night - it coincided with a ramp-up of ATLAS jobs. I've installed a new CVMFS version (not yet in production) with better logging in Manchester. Although many sites see it, it is not a dominant error compared to others which are causing many more failures. (A client-configuration sketch follows after this section.)
  Birmingham: ditch the shared cluster, bring it back when the networking gets upgraded.
  Ewan: some production jobs seem to use 3-3.5 GB of memory on the WNs.
- Other
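For reference on the CMTSITE/CVMFS item above, here is a minimal sketch of the client-side cache and logging settings a site might check when chasing these timeouts. The values and paths below are illustrative assumptions only, not the configuration actually deployed at Manchester or anywhere else:

    # /etc/cvmfs/default.local -- illustrative values only
    CVMFS_REPOSITORIES=atlas.cern.ch,atlas-condb.cern.ch
    CVMFS_HTTP_PROXY="http://squid.example.ac.uk:3128"   # hypothetical site squid
    CVMFS_QUOTA_LIMIT=20000              # local cache quota in MB; an undersized cache forces frequent eviction
    CVMFS_CACHE_BASE=/var/lib/cvmfs      # keep the cache on fast local disk, never on NFS
    CVMFS_SYSLOG_LEVEL=1                 # more verbose messages to syslog
    CVMFS_DEBUGLOG=/tmp/cvmfs-debug.log  # full debug log; enable only while investigating

Depending on the client version, "cvmfs_config probe" and "cvmfs_talk -i atlas.cern.ch cache size" can then be run on a worker node to confirm the mounts and the cache occupancy.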
11:20 Networking update (MM) (15')
- Feedback from the LHCONE meeting: https://indico.cern.ch/conferenceDisplay.py?confId=179710
- Deploy perfsonar-ps (for LHC compatibility), but run a Perfsonar-MDM portal to collate the information? For discussion. Please email Mark if you want to help with testing.
Chris: How does LHCONE interact with Janet? Mark: This has been discussed; two different circuits read as two different networks. Tests with Glasgow are underway, to report back to Janet. Lengthy discussion, no real conclusion yet.
Mark needs network figures from March/April - if you haven't sent them, please do so.
PerfSonar: deploy this by the end of July (instead of gridmon).

11:35 Meetings & updates (20')
EMI 2 is on its way (Jeremy).
With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest
- Tier-1 status (Jeremy): Extra disk servers for ALICE; problems with networking. Chris would like to know what the networking problems actually were. Gareth says the network problem triggered further problems on the VMs; outbound traffic through the default router stopped working, and the root cause is not fully known.
- Accounting: Some mumbling from Chris (sorry, bad acoustics).
- Documentation
- Interoperation: Reminder of the monthly NGI discussion next Tuesday.
- Monitoring (David Crooks): Will ask people what they are using.
- On-duty: Daniela forgot to file a handover due to the holiday. No issues.
- Rollout: Nothing happening (Daniela).
- Security: Further details about MingChao's leaving. Two security challenges are coming up !!! You are warned !!!
- Services: See PerfSonar above.
- Tickets (Matt). Tickets discussed:
  NGI/Sussex: https://ggus.eu/ws/ticket_info.php?ticket=81784 - This ticket is tracking the certification of the Sussex site, which could turn into an interesting saga. Currently waiting on a bug in an existing ticket to get itself sorted (https://ggus.eu/ws/ticket_info.php?ticket=81792).
  UCL/GOCDB: https://ggus.eu/ws/ticket_info.php?ticket=81878 - UCL are having trouble taking themselves out of downtime due to the GOCDB being awkward after an accidental deletion. This is keeping the existing ticket open (https://ggus.eu/ws/ticket_info.php?ticket=80989).
  RALPP: https://ggus.eu/ws/ticket_info.php?ticket=81862 - One of the RALPP CREAMs is misbehaving for Nagios tests; submitted on Friday, so it may have been missed. https://ggus.eu/ws/ticket_info.php?ticket=81891 - Just a heads-up: it looks like some dCache problems have cropped up over the weekend.
  EFDA-JET: https://ggus.eu/ws/ticket_info.php?ticket=81886 - Another heads-up: LHCb are having disk-full type errors at JET.
  Glasgow: https://ggus.eu/ws/ticket_info.php?ticket=81728 - A third and final heads-up: I don't trust sites to be notified when tickets get reopened. It looks like some job stage-in problems for ATLAS have cropped up again. All in hand.
- Tools: Report from Kashif.
- VOs: Chris looked at Ewan's 'how to get on the grid' documentation and gave it to some VOs. Proxy renewal problems solved.
- Site updates: The next GDB is this Wednesday: https://indico.cern.ch/conferenceDisplay.py?confId=155068. Nothing exciting on the agenda.

11:55 Actions (5')
Jeremy and webpages. Jeremy to talk to Marten about tarballs. Sam: DPM LFC checker - still ongoing. Matt: parsing mismatch - ongoing. Documentation: ongoing (ops team; 3 documents). RPM issues - resolved, won't be done. Reminder for Duncan, but even Jeremy can't quite remember what Duncan was meant to be doing.
To be completed: https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items
Completed: https://www.gridpp.ac.uk/wiki/Operations_Team_Completed_Actions

12:00 AOB (1')

[11:01:06] Govind Songara joined
[11:01:15] Stephen Jones joined
[11:01:28] Andrew McNab joined
[11:01:52] Jeremy Coles Daniela will take minutes today.
[11:02:00] Rob Harper joined
[11:02:58] Stuart Purdie joined
[11:03:01] Mark Slater joined
[11:03:23] Duncan Rand joined
[11:03:26] Alessandra Forti joined
[11:04:27] Gareth Smith joined
[11:05:14] Santanu Das joined
[11:06:08] Andrew Washbrook joined
[11:09:09] Ewan Mac Mahon joined
[11:11:34] Jeremy Coles We will skip back to the experiment updates... and networking discussion!
[11:11:54] Andrew Washbrook left
[11:15:45] Alessandra Forti UKNGI-SECURITY@JISCMAIL.AC.UK
[11:15:55] Mark Mitchell joined
[11:19:44] Stuart Wakefield joined
[11:21:03] Andrew Washbrook joined
[11:23:31] Pete Gronbech joined
[11:28:05] Gareth Roy joined
[11:29:33] Mark Slater Chris: Just to say, I'm hopefully going to write up the Ganga section today or tomorrow. I haven't forgotten.
[11:29:37] Jeremy Coles https://indico.cern.ch/conferenceDisplay.py?confId=155068
[11:31:06] Queen Mary, U London London, U.K. Thanks Mark
[11:31:34] Jeremy Coles https://indico.cern.ch/conferenceDisplay.py?confId=179710
[11:31:39] Queen Mary, U London London, U.K. Ben says their problem is bookkeeping, bookkeeping, bookkeeping
[11:32:35] Mark Slater I'm working on that as well. I'll have a bit more of a think and get back in touch with him.
[11:37:28] Jeremy Coles Thanks MS.
[11:40:17] Ewan Mac Mahon Just a note - last I heard, a Janet 'lightpath' was essentially a traffic-managed VLAN, not literal fibre you can put light down.
[11:40:30] Ewan Mac Mahon So they can be deployed (more or less) anywhere.
[11:44:59] Jeremy Coles We'll move back to the experiment updates in a few minutes...
[11:47:24] Ewan Mac Mahon OK; I think this makes sense, but I do still think that on general principles we don't have the same problems that the US folks do, and anything that looks like a separate network isn't likely to be worth it for us.
[11:47:55] Ewan Mac Mahon I don't think we need bandwidth reservations over Janet - we should (and, I think, can) just expect them to give us enough bandwidth anyway.
[11:48:26] Queen Mary, U London London, U.K. Can you use the MDM to display results from perfsonar-ps?
[11:48:45] Stuart Purdie left
[11:49:32] Ewan Mac Mahon Based on what I've seen so far, Chris' idea of PS at the sites and an MDM portal sounds good if we can do it.
[11:50:27] Ewan Mac Mahon And handily Glasgow can run that on the kit they were going to use for the gridmon central portal.
[11:50:46] Mark Mitchell left
[11:54:04] Jeremy Coles https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=188784
[11:55:10] Sam Skipsey (Technically, what happened at Glasgow was that the tiny bit of local site setup that ATLAS still needs became inaccessible - the CVMFS was absolutely fine, but because the pilots couldn't source the setup script in the site-local area, they couldn't correctly set up their environment.)
[11:58:05] Ewan Mac Mahon ^ This is just the little NFS shared area?
[11:58:13] Sam Skipsey Yep.
[11:58:30] Sam Skipsey Ironically, just before our plan to move it to a newer server some time this week, too.
[11:58:37] Ewan Mac Mahon Do we know why it was inaccessible - load or just a server outage?
[11:59:19] Sam Skipsey We do; it doesn't seem to be load so much as that server having become a little unreliable.
[11:59:54] Alessandra Forti ok, so it might be the same problem as at RALPP
[12:00:04] Alessandra Forti nfs not available
[12:00:08] Sam Skipsey right
[12:00:28] Govind Songara Hi Sam, we are seeing a similar issue: the squid cache uses all the RAM and it crashes nfsd
[12:00:51] Sam Skipsey ...you're running your squid cache on the NFS server?
[12:01:57] Govind Songara Our squid is on the same server that serves the software area, so yes, it serves software for the other VOs
[12:02:48] Govind Songara our squid cache is on RAID, which is not good
[12:02:54] Ewan Mac Mahon We're doing that too as it happens, but of course CVMFS takes most of the load off the NFS.
[12:02:56] Sam Skipsey Ah, so we don't do that. NFS servers aren't necessarily the most reliable of beasts, though, and I have some ideas as to why our particular problem happens (which are moot, since we're moving to new hardware).
[12:03:20] Jeremy Coles Actions: http://www.gridpp.ac.uk/wiki/Operations_Team_Action_items
[12:03:27] Ewan Mac Mahon Ours has an overspecced array of 15k disks etc. though, so it can pretty much take it.
[12:03:33] Sam Skipsey RAID0 would be good for a squid cache, but I suspect you mean RAID6?
[12:03:38] Pete Gronbech 10k
[12:03:56] Alessandra Forti we have RAID1
[12:04:04] Govind Songara Is your squid cache on RAID? I just found that ATLAS do not recommend using RAID, NFS or AFS
[12:04:11] Ewan Mac Mahon Er. I'm clearly going to have to check...
[12:04:23] Govind Songara we use RAID6
[12:04:48] Pete Gronbech My mistake, it's 15k on the squid I think and 10k on the new WNs
[12:05:15] Sam Skipsey Our squid cache is just on a disk
[12:05:47] Alessandra Forti mine too, two disks in RAID1
[12:05:55] Sam Skipsey Generally, you want it to be fast for data and metadata writing and reading, so no, AFS and NFS would not be good mount options
[12:06:20] Stephen Jones I have a cunning plan for VomSnooper
[12:09:38] Stephen Jones Duncan: re action: there was some talk of VPN errors at UCL in a meeting last month, and it was suggested that you would follow it up when your hols were over.
[12:12:04] Mark Slater left
[12:12:04] Stuart Wakefield left
[12:12:05] David Crooks left
[12:12:05] Govind Songara left
[12:12:05] Gareth Roy left
[12:12:06] Gareth Smith left
[12:12:07] Sam Skipsey left
[12:12:07] John Hill left
[12:12:08] Raja Nandakumar left
[12:12:08] Stephen Jones bye
[12:12:08] Matthew Doidge left
[12:12:11] Mohammad kashif left
[12:12:12] Elena Korolkova left
[12:12:12] raul lopes left
[12:12:13] Rob Harper left
[12:12:16] Duncan Rand left
[12:12:18] Andrew McNab left
[12:12:20] Mark Norman left
[12:12:24] Queen Mary, U London London, U.K. left
[12:12:37] Alessandra Forti left
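Following up the squid-cache placement discussion in the chat above (roughly 12:00-12:06), here is a minimal squid.conf sketch of the kind of setup Sam and Ewan describe: the cache on a plain local disk rather than on the NFS/RAID volume that also serves the software area. The paths and sizes are illustrative assumptions, not any site's actual configuration:

    # /etc/squid/squid.conf (fragment) -- illustrative values only
    cache_mem 256 MB                              # in-memory cache; keep well below total RAM so squid cannot starve nfsd
    cache_dir ufs /var/cache/squid 50000 16 256   # ~50 GB cache on a plain local disk, not on the NFS export
    maximum_object_size 1024 MB                   # allow large objects (e.g. conditions data) to be cached on disk
    cache_swap_low 90                             # start trimming the disk cache at 90% full
    cache_swap_high 95                            # trim aggressively above 95% so the disk never fills

As noted in the chat, a single fast disk (or a RAID0/RAID1 pair) is generally preferable to RAID6 or a network filesystem for this cache, since squid is sensitive to write latency and the cache contents are disposable.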