Attending: Chris Brew, Andrew McNab, Raul Lopes, Dan Traynor, David Crooks, Elena Korolkova, Ewan Steele, Gang Qin, Ian Collier, John Bland, Gareth Smith, John Hill, Mark Mitchell, Daniela Bauer, Matt Raso-Barnett, Mohammad Kashif, Rob Fay, Robert Frank, Sam Skipsey, Steve Jones, Wahid Bhimji, Ewan MacMahon, Andrew Lahiff (over phone bridge), Matt Williams, Govind Songara, Pete Gronbech.
Chair: Jeremy
Minutes: Matt
Apologies: Chris W, Raja, Alessandra

Experiment problems/issues (20')

- LHCb
Nothing from Raja.
-- Update on ARC-DIRAC issues: Andrew has tweaked some of the job environment setup at the Tier 1; it should work for LHCb now.

- CMS
Daniela - Fermilab can't handle SHA-2 certificates, so users are advised to stick with SHA-1. Imperial is having problems with jobs being held; it looks to be a CMS problem (and they are almost admitting to it).
Jeremy - The Tier 1 CMS share is moving back to 5% for analysis work. On SHA-2, most issues seem to be resolved, with a few remaining problems at CERN. France has already moved.

- ATLAS
Elena - Not much to report; production work is reduced because large output files are filling the Tier 1 datadisks. ATLAS experts are working on it.

- Other
-- ILC is moving to CVMFS. Please see https://ggus.eu/ws/ticket_info.php?ticket=101502

Meetings & updates (20')
With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest (Monday 24th February)

There is a test GridPP website for SHA-2. Jeremy - it might move mid-March; Jens has been doing testing.
The final WLCG Tier-2 availability/reliability reports for January 2014 are available.
Alessandra noted an FR cloud report on January's VO test results; the suggestion was to do something similar for UK sites.
We need to revisit our plans for RIPE Atlas probes.
- Likely to be scaled down from our previous grand plans for dozens of probes to maybe one at each site.
Janet is moving away from SeeVogh/EVO; support ends in August. Our meetings will migrate to Vidyo.
- Testing over the coming weeks; people are encouraged to try it out.

- Tier-1 status
From Gareth:
There were problems with the FTS3 service last Tuesday when difficulties were encountered moving the VMs around. Since then the service has run successfully and is being used for an extensive test by ATLAS and CMS.
- One VM was lost entirely in the process - under investigation.
The software server used by the small VOs will be withdrawn from service (aiming for June).
- Contacting the VOs and surveying the use cases still to be done.
A replacement MyProxy server is being put into production (to resolve the MyProxy issues raised in GGUS ticket 97025). VOs will need to make the appropriate reconfigurations to use it.
It is most likely that the Tier 1, along with some other non-GridPP services at RAL, will move to the new site firewall on Monday 17th March, and there may be some disruption around this change.
We do not have a date yet for the other significant network change: the installation of the new routing layer and changes to the way the Tier 1 connects to the RAL network.
- Date to be confirmed. The break should be short, but the operation is non-trivial. It will be declared in the GOCDB.
- Jeremy - A lot of traffic bypasses the firewall?
- Gareth - Yes, but there will be a drop in outside connectivity when the new firewall goes in (although everything will stay up internally). There is a risk of firewall rules being set incorrectly, so there will be an at-risk period for some time afterwards. The new firewall is a different make, so transferring the rules is non-trivial, but it has good debugging tools.
Jeremy - Expected throughput of the new firewall?
Gareth - Will find out. Data flows avoid the firewall, but control flows don't.

- Accounting
No updates.

- Documentation
Keydocs owners need to take some action! Just under half need updating - some just need a header fixing.
Naughty step: Mark, Pete, Jens, Rob, Alessandra, Wahid, David, Raul, Matt.
Jeremy reminds us of the percentage-complete column, which can be used to gauge accuracy if time is too constrained to do a "thorough" job.
- May need to review which key docs really are key docs.

- Interoperation [from David]
Meeting agenda: https://wiki.egi.eu/wiki/Agenda-24-02-2014
URT news: ARC, WMS, SAM probes. UMD 3.5 was released last week: StoRM 1.11.3 and other updates for OpenSSL.
SR: IGE.globus-rls v5.2.5, no EA.
DMSU: WMS-ARGUS connection errors, GGUS ticket https://ggus.eu/ws/ticket_info.php?ticket=101486
EMI-2 decommissioning: sites will start to receive alarms on Monday 3rd March; probes are deployed in midmon and ready to be checked by 27-02-2014; interest is requested for an EMI-3 tarball "informal" SR.
GLUE 2 validation: possible timeline is a broadcast to ROD and sites on 3rd March, the probe set OPERATIONAL on 10th March, and sites then having a further two weeks to fix their site BDII before receiving alarms.
- David advises reading the agenda for more information.
- EMI-3 WN tarball staged rollout to start next month.

- Monitoring
Next meeting of the consolidation group is this Friday; the agenda looks at HammerCloud functional tests.
- Looking at folding HammerCloud into site availability monitoring.

- On-duty
- The rota needs updating; Jeremy is looking into it.

- Rollout
Will start ticketing sites not meeting the baselines at some point in the near future.

- Security
Let Orlin know if you wish to try connections with the NGI ARGUS server. Tested working for EM, including national banning. There are some setup docs: http://wiki.nikhef.nl/grid/Argus_Global_Banning_Setup_Overview#NGI_Argus
Steve - I will review the documentation.
Ian C reminds us we don't really have a choice: universal banning is something all sites have to be able to implement somehow.
Ewan - There are alternatives to ARGUS. For those wanting to test the central ARGUS, Ewan has a special banned DN which he can use to test sites. Contact him if you want to try it out.

- Services
Reminder: perfSONAR is a production service!

- Tickets
- ILC-supporting sites (most of the UK) need to review the instructions to implement the ILC CVMFS and roll it out (or stop supporting ILC). The best way to track this is in the ticket itself: https://ggus.eu/ws/ticket_info.php?ticket=101502 The status will be reviewed in next week's meeting, after which sites may be ticketed individually.
- Durham asked for some help with their perfSONAR problems and were encouraged to post details to TB-SUPPORT.
- There was a sideline debate about the relative priority perfSONAR should be given, and even whether it should be classed as a production service. Matt: "Not having a working perfSONAR is kind of a black mark against your site." Wahid: "Surely it should be more of a tiny smudge." See the chat window for more exchanges.

- VOs
WMSs are now updated, so upgrading OpenSSL can go ahead if you haven't already.

Housekeeping! (20')
- Check of updates in different areas:

HEPSPEC06 - https://www.gridpp.ac.uk/wiki/HEPSPEC06
Jeremy is looking through the quarterly reports for this. Nothing for UCL. Durham has had trouble running HEPSPEC06 (having not done it before). ECDF only have SL5 and have jobs running to test HEPSPEC06. Sussex isn't actually on the page. JET is missing completely.
RALPP is onto it; it used to be Rob's job. Tier 1 - the work has been done, the page just needs updating.

perfSONAR - http://netmon02.grid.hep.ph.ic.ac.uk:8080/maddash-webui/index.cgi
Edinburgh - tried an upgrade; the backup failed, but Wahid wanted a backup. Ewan M reiterates that you really don't need a backup: Duncan's mesh configs do all the work for you. Nuke the box from orbit, then configure it from Duncan's mesh. Wahid asks which instructions to use; Jeremy will circulate them.
Sheffield - next week.
Brunel - almost done.
RALPP - when we have time.

IPv6 status - https://www.gridpp.ac.uk/wiki/IPv6_site_status
The table is looking pretty complete. Imperial needs to update - IC can actually accept IPv6 jobs now.

ARGUS deployment - https://www.gridpp.ac.uk/wiki/ARGUS_deployment
Birmingham TBC, as well as EFDA-JET and Sussex. Matt RB - Sussex is done, just needs the table updating.

Batch systems - https://www.gridpp.ac.uk/wiki/Batch_system_status
Quite a few holes in the table; they need to be filled in over the coming days. No comments.
Jeremy - Reminder of the multicore taskforce, meeting Tuesday afternoons.

Resilient VOMS
Leave this until next week now.

- 12:55 Further checks (10')

Sites with EMI-2 services (services to be decommissioned by April) - http://www.hep.ph.ic.ac.uk/~dbauer/grid/state_of_the_nation.html
A number of sites are still on EMI-2 APEL.
Tier 1 - reasonably confident it will have migrated away from EMI-2 in time.
Brunel - EMI-2 free.
Imperial - APEL and a couple of CREAM CEs.
Liverpool - EMI-2 free.
QMUL - no EMI-2 here either.
Cambridge - ditto.
RHUL - BDII and APEL still to go.
Lancaster - need to make the EMI-3 tarball, which is coming along well; it will be tested at Lancaster first. The new VOMS tools are the biggest pain. Also asked how the DPM EMI generation is worked out - only YAIM comes from the EMI repositories, so check /etc/emi-version on the head node (a short sketch of such a check follows the chat log below).
Oxford - no EMI-2 in production; will need to check the test boxes.
Manchester - APEL, BDII and VOMS need upgrading.
Durham - have a few services left to go.
ECDF - waiting on the tarball; need to update a CE.
Sussex - ARGUS, CREAM, APEL and BDII need upgrading.
Glasgow - have a few things left, working on them.
Birmingham and Bristol need to be checked, but Ewan M reckons Bristol is looking good.
RALPP - BDII and ARGUS to go.

Sites below baseline (WLCG will start monitoring in the coming months) - https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions
Sites are asked to review against this table. Jeremy - maybe we can improve our monitoring on this and automate the checking?

Sites with GLUE 2 issues
No time to go through it; there is a GLUE 2 validator and sites are encouraged to look at it. The probe becomes operational on 10th March, and sites with problems will then have two weeks to fix them. Daniela notes that IC has an issue by virtue of running an ARC CE, but it is a problem with the CE middleware itself that hasn't been addressed.

13:05 AOB (1')
No AOB.

Chat Window:
[11:01:20] Jeremy Coles Matt is taking minutes today.
[11:09:27] Steve Jones Instruction in resource table of approved VOs
[11:09:35] Steve Jones for CVMFS for ILC
[11:10:50] Steve Jones https://www.gridpp.ac.uk/wiki/GridPP_approved_VOs#VO_Resource_Requirements
[11:22:31] Daniela Bauer mea culpa, will do it right now ...
[11:23:10] Mark Mitchell Core Services was my bad hadn't updated the date on the doc, sorry.
[11:23:17] Mark Mitchell Had added to it
[11:26:00] David Crooks Sorry if I was a bit noisy, we have a cleaning lorry right outside our window
[11:26:24] Jeremy Coles It was fine for me David.
[11:32:07] Elena Korolkova which ticket?
[11:32:25] Daniela Bauer we've already installed it (cvmfs for ilc)
[11:32:36] John Hill So have we
[11:34:45] Ewan Mac Mahon @Elena - this one: https://ggus.eu/ws/ticket_info.php?ticket=101502
[11:35:58] Chris Brew can't talk for some reason
[11:36:44] Chris Brew Only issue on the LHCb ticket was it's effectively for a new service just coming in as an urgent GGUS ticket with no warning/discussion with us
[11:37:22] Matt Raso-Barnett i've marked the sussex perfsonar ticket as solved now
[11:37:39] Wahid Bhimji "production" service - actually needed for any jobs or workflows?
[11:37:54] Ewan Steele anyone got any bright ideas for fixing mine at durham?
[11:39:39] Ewan Mac Mahon It's a production service as much as a CE is. One offers network monitoring, one runs jobs, both are things you're supposed to be doing.
[11:39:43] Matt Doidge None I'm afraid Ewan. Anyone else? https://ggus.eu/ws/ticket_info.php?ticket=100968
[11:41:19] Ewan Mac Mahon Clearly the CE is offering a more valuable service, but a site with a broken perfsonar is a site with one of its services down.
[11:41:35] Ewan Mac Mahon It's not fully working at that point.
[11:42:47] Sam Skipsey but a *production* service is one which is necessary for *production*, surely?
[11:43:02] Sam Skipsey (I agree that Perfsonar is a service.)
[11:43:22] Ewan Mac Mahon There's also a chicken/egg issue here that we can't use perfsonar to test the network if most of the failures are down to badly configured endpoints.
[11:44:09] Ewan Mac Mahon Or even just enough of the failures are that no-one can have confidence in the results.
[11:44:12] Ian Collier And the network is necessary. We'd better have a way of monitoring that. Perfsonar is the chosen mechanism.
[11:45:13] Wahid Bhimji As usual with this thing - any new supposedly "production" services to the list just distracts short manpower sites from actually making jobs work
[11:46:02] Sam Skipsey You haven't demonstrated that perfsonar is a production service, Ian, just that we need a working network. Which is not a function of perfsonar working.
[11:46:14] Wahid Bhimji if physics is being done then site is working.
[11:46:50] Ian Collier But it is the mechanism that the collaboration has chosen to monitor the network. That can only be done effectively if it is treated as a production service itself.
[11:47:18] Matt Raso-Barnett Sussex isn't actually on the HEPSPEC page -- I'll try to get updated figures for us on that page this week
[11:47:55] Sam Skipsey Of course, perfsonar doesn't really monitor "the network" - it does latency and bandwidth tests that give point-to-point measures of transfers. Strictly, a proper network monitor would look more like the RIPE probes that Ewan keeps talking about.
[11:48:28] Mark Mitchell However, we do need a reporting mechanism for network connectivity per tier-2. I wouldn't have said that this is a production service in a specific UK instance. It is a service which we need to monitor it. Ah the joys of semantics
[11:48:53] Sam Skipsey (In any case, I do accept the pragmatic point that the Collaboration has decided to declare Perfsonar a Production Service.)
[11:49:31] Ewan Mac Mahon Point-to-point measurements between all the points we care about monitors the network as we see it though, which on one level is what we directly care about.
[11:50:00] Ewan Mac Mahon And on a practical level, perfsonar nodes are really simple - you install off the disk image, point at Duncan's config files, and you're done.
[11:50:22] Ewan Mac Mahon And then you pretty much forget about it.
[11:50:24] Wahid Bhimji everything is supposedly "really simple" but it adds up
[11:50:37] Sam Skipsey Except when they break, Ewan, which apparently keeps happening for more than one site?
[11:50:39] Wahid Bhimji and I'm not aware of a real experiment issue fixed by perfsonar
[11:50:45] Wahid Bhimji e.g. our T2D issues
[11:51:09] Elena Korolkova I plan to do this first week of March
[11:51:18] Ewan Mac Mahon No-one's going to be able to fix any _real_ issues until we have a reliable test bed.
[11:51:50] Sam Skipsey So, the only *real* network issue we had over the last year was caused by a deep routing issue that perfsonar would never (and didn't) detect.
[11:52:26] Mark Mitchell Which was picked up by Chris as he was looking at an increase in latency with file transfers if I remember correctly.
[11:53:03] Mark Mitchell The interesting outcome of that was that the escalation process between CERN and GEANT was manual at that point.
[11:54:27] Raul Lopes Brunel pretty much done
[11:54:56] Mark Mitchell This may change with the RIPE probes but what is evident is that a carrier configuration issue went undetected
[11:55:04] Jeremy Coles https://www.gridpp.ac.uk/wiki/IPv6_site_status
[11:56:50] Jeremy Coles https://www.gridpp.ac.uk/wiki/ARGUS_deployment
[11:56:57] Ewan Mac Mahon perfSonar does do both routing and latency measurements. With a fully working perfSonar setup QMUL might have been able to diagnose their issue by noting that the route between the endpoints had changed, and changed to a silly route.
[11:57:36] Ewan Mac Mahon But that would require everyone to have their perfSonar boxes working, and available to the internet.
[11:58:18] Mark Mitchell I agree, the deployment is vital to this as the route change caused an increase which on the surface was minor, unless you were in the north west of Europe
[11:58:19] Wahid Bhimji yeah with all these network problems that's the reply - that if perfsonar was made up to the point it was useful then it would be useful
[11:58:31] Wahid Bhimji but it also has to be maintained
[11:58:41] Wahid Bhimji and you have to know that it's being maintained
[11:59:54] Wahid Bhimji etc...etc. no grid system should rely on the sites' setup and configuration that much.
[12:03:50] Elena Korolkova Was the deadline for moving to EMI3 changed to end of May?
[12:03:59] Daniela Bauer Imperial has a couple of EMI2 creamce + APEL, but will update before the deadline
[12:04:22] Mark Mitchell Also, the Perfsonar box doesn't really fix our network user status other than being able to supply a lot of information to JANET. Which is one of its advantages, I suppose.
[12:05:03] David Crooks Elena: So the end of support is end of April, but effectively there's an additional month before the decommissioning deadline.
[12:05:17] Raul Lopes No EMI-2 here
[12:05:43] Steve Jones Liverpool: No EMI2 here, either.
[12:05:56] Steve Jones Mostly EMI3 (some UMD3)
[12:05:58] Daniel Traynor qmul - no emi2
[12:06:18] John Hill Cambridge: no EMI-2
[12:06:27] Govind Songara May be apel and site bdii
[12:06:52] Chris Brew site bdii, apel (to be retired) CreamCEs (to be retired)
[12:07:13] Wahid Bhimji what is the difference (for the WN?)
[12:07:41] Elena Korolkova Sheffield: ce's, apel and bdii will be moved to VM and EMI3
[12:07:47] Wahid Bhimji um oh
[12:08:14] Elena Korolkova in March, I don't see a problem here
[12:08:31] Wahid Bhimji test it at Lancs first !
[12:08:49] Ewan Mac Mahon And for Oxford we think we're EMI3 everywhere, but we need to check some of the oddball test boxes just to be sure. So, nothing important is EMI2, probably nothing at all.
[12:12:07] Wahid Bhimji I have to go in a min. I can imagine ECDF might have EMI2 (the storage is EMI3) - certainly the CE is oldish which has to wait for a new (sl6) CE
[12:12:33] Chris Brew and argus
[12:12:48] Matt Raso-Barnett Sussex: argus, cream, apel, bdii all are still EMI2
[12:13:25] Ewan Mac Mahon I think Bristol are in pretty good shape.
[12:14:32] Jeremy Coles https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions
[12:15:35] Daniela Bauer It's not doing anything special right now
[12:17:10] Matt Raso-Barnett i'm not familiar
[12:17:14] Wahid Bhimji well obviously I am unfamiliar with it
[12:17:47] Daniela Bauer we have an issue due to having an ARC-CE
[12:17:59] Daniela Bauer but this is not fixed in EMI3, so I don't consider it my problem
[12:18:32] Wahid Bhimji bye
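
Note on the /etc/emi-version check mentioned under the EMI-2 review: the sketch below simply reads that file to report a node's EMI generation, as a starting point for sites auditing which services still need upgrading. It is a minimal, illustrative example - the file path /etc/emi-version is the only detail taken from the discussion; the parsing and the messages printed are assumptions, not part of any official tool.

#!/usr/bin/env python
# Minimal sketch: report the EMI generation of the local node by reading
# /etc/emi-version (the file mentioned in the EMI-2 discussion above).
# Everything other than the file path is illustrative.

import os

EMI_VERSION_FILE = "/etc/emi-version"

def emi_generation(path=EMI_VERSION_FILE):
    """Return the major EMI generation (e.g. 2 or 3), or None if unreadable."""
    if not os.path.isfile(path):
        return None
    with open(path) as handle:
        version = handle.read().strip()  # e.g. "3.7.0-1"
    try:
        return int(version.split(".")[0])
    except ValueError:
        return None

if __name__ == "__main__":
    generation = emi_generation()
    if generation is None:
        print("No readable /etc/emi-version - probably not an EMI node")
    elif generation < 3:
        print("EMI-%d node - needs upgrading before the decommissioning deadline" % generation)
    else:
        print("EMI-%d node - OK" % generation)

Run on a service node (for example a DPM head node) it just prints the generation; a site could wrap it with ssh or its configuration management system to sweep all service hosts before the April deadline.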