GridPP Operations meeting 2013 12 17
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

VOs
=====

LHCb:
-------

LHCb's status in the UK remains unchanged since last week - the workload is
mostly Monte Carlo plus a few user analysis jobs, and is generally running OK.
A couple of problems are being worked on: an upload problem at Sheffield
(https://ggus.eu/ws/ticket_info.php?ticket=98594) and a problem picking up
jobs at (ECDF?). There is also reprocessing work going on, but it is taking
place at CERN and GridKa, not in the UK.

Raja also noted that LHCb is now fully CVMFS-based for its software and no
longer requires a traditional NFS software area at all; any sites that still
have one may remove it.

CMS:
------

There is nothing to report for CMS (as always).

ATLAS:
--------

Last week Alastair Dewhurst discussed the CVMFS problem seen at RAL, where
inode values go over the 32-bit limit when CVMFS filesystems have been
mounted for an extended period. His slides include a recipe
(https://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=slides&confId=289284)
that other sites can use to check whether their WNs are showing the same
effect now, and a fix is expected in the CVMFS 2.1.16 release. Jeremy Coles
noted that affected sites could try to avoid the problem (particularly over
the holiday period) by restarting their nodes before the break. A rough
illustration of the kind of check involved is sketched further down, just
before the Tier One report.

ATLAS' current and planned production work is somewhat short of single-core
simulation jobs, so ATLAS are requesting that more multicore queues be made
available. Ideally the VO would like these to be assigned dynamically by the
local batch systems, but where fixed allocations are necessary the
recommendation is to dedicate 10-20% of the total ATLAS share to multicore
work.

The move to Rucio continues, and ATLAS expect shortly to have deadlines by
which DPM and dCache sites will be expected to have enabled webdav access and
to have had the files on their storage renamed (StoRM sites are temporarily
exempted because the webdav rename process is slow there). The deadline has
not yet been set, but will probably be in January.

Other VOs:
------------

Chris Walker reported that several of the GridPP VOs have now tested services
using VOMS proxies issued by the 'new' backup VOMS instances, and several
problems have been found and fixed. Chris suggested that we should be clear
in recommending that UI configurations be switched to include them from
January - we could probably do it now with minimal breakage, but it is safer
to wait. CW also noted that this process has taken three months, and that we
should hold a post-mortem review to consider how it could have been done
better.

Updates:
==========

There was a GDB last week; no summary is yet available, but the agenda is at:
http://indico.cern.ch/conferenceDisplay.py?confId=251192

The DPM workshop took place in Edinburgh; more coverage will be available in
the GridPP storage meeting tomorrow.

There will be a meeting of the middleware readiness group on Thursday:
https://indico.cern.ch/conferenceDisplay.py?confId=285681
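Referring back to the ATLAS CVMFS item above: Alastair's slides contain the
actual recipe, but as a rough, unofficial illustration of the kind of check a
site might run on a WN, the sketch below walks a couple of mounted CVMFS
repositories and reports the largest inode number it sees. The repository
names, the sample size and the warning threshold are all assumptions, not
taken from the slides; restarting the node, as suggested above, resets the
effect anyway.

    #!/usr/bin/env python
    # Rough illustration only: walk mounted /cvmfs repositories and report any
    # inode numbers close to (or past) the 32-bit limit. This is NOT the recipe
    # from the slides; repository names, sample size and threshold are assumed.
    import os

    LIMIT = 2 ** 32          # 32-bit inode boundary
    WARN = int(LIMIT * 0.9)  # warn when within 10% of the limit (arbitrary)

    def max_inode(root, sample=1000):
        """Return the largest inode seen in the first `sample` entries under root."""
        largest = 0
        count = 0
        for dirpath, dirnames, filenames in os.walk(root):
            for name in filenames + dirnames:
                try:
                    largest = max(largest,
                                  os.lstat(os.path.join(dirpath, name)).st_ino)
                except OSError:
                    continue
                count += 1
                if count >= sample:
                    return largest
        return largest

    # Assumed repository names - adjust to whatever the WN actually mounts.
    for repo in ("atlas.cern.ch", "atlas-condb.cern.ch", "lhcb.cern.ch"):
        root = os.path.join("/cvmfs", repo)
        if not os.path.isdir(root):
            continue
        top = max_inode(root)
        status = ("OVER 32-bit limit!" if top >= LIMIT else
                  "approaching limit" if top >= WARN else "ok")
        print("%-25s max inode seen: %12d  %s" % (repo, top, status))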
Tier One:
-----------

Not much has been changing at RAL; work is mostly focussed on 'battening
down' ready for the holiday period. Gareth noted, though, that the Tier 1
will still have its usual degree of on-call coverage in case of major
incidents.

One change is that an NGI ARGUS instance has been (somewhat) set up. It is
currently not in a usable state, and is waiting on some changes at CERN (to
give it access to the Emergency User Suspension lists) before it can
progress.

Documentation
---------------

Jeremy warned that many of our KeyDocs are due for review and that some have
no current owner, and requested that people check the state of the monitoring
at: https://www.gridpp.ac.uk/php/KeyDocs.php

On duty:
----------

It has been pretty quiet; there have been a few tickets. Some are still open
where sites are waiting for information or assistance, and several that
covered short-lived problems have now been closed.

Matt's World of GGUS:
-----------------------

Matt posted a link to the current ticket list and noted that there has not
been much recent change: http://tinyurl.com/cblj3ab

Jeremy Coles pointed out that some of the tickets in the list have been open
long enough to go red, and expressed particular concern about outstanding
'urgent' tickets from VOs and about any tickets that will hit their automatic
escalation dates over the holidays. Matt offered to do a final pre-holiday
review on Thursday, so sites were asked to check on their tickets and tidy up
where possible - e.g. post updates, close tickets, or place things on hold if
appropriate.

GDB highlights review:
========================

Jeremy Coles went through some of the more notable points from the recent
GDB, while suggesting that sites might like to look at the slides linked from
that meeting's agenda: http://indico.cern.ch/conferenceDisplay.py?confId=251192

He picked out:

Identity Federation: discussions are currently centring on the possibility of
hiding some of the fundamental X509 infrastructure behind friendlier
interfaces (for example, portals). This is at an early stage, mostly
consisting of working out what WLCG would require of any such thing.

EGI have been rearranging and redistributing work within the project with the
aim of not running out of money. The notional effort dedicated to some areas
has been reduced, and other things have been picked up elsewhere - e.g. some
things of interest to WLCG have been picked (back?) up by CERN.

SHA2: the main remaining areas of concern are dCache and StoRM SEs, but
compliant versions are available or expected soon. Maarten Litmaath believes
that WLCG should be sufficiently ready by about mid-January. The UK e-Science
CA has decided to postpone its switch to issuing SHA2 certificates by default
until (probably) March. In contrast, the French CA is expected to move soon
(possibly next week), so there is a chance that their users could have
problems with any non-compliant services.

New DM clients: Jeremy recommended that people look at the slides:
http://indico.cern.ch/getFile.py/access?contribId=13&sessionId=2&resId=1&materialId=slides&confId=251192
The key point is that the current/old gfal/lcg-utils tools are in strict
maintenance mode and will not receive feature updates (e.g. further IPv6
support). A short sketch of the newer clients follows at the end of this
section.

CERN expect to have a general IPv6 service deployed and available to their
whole site from Q1 2014.

There was a report from HEPiX that a new SPEC CPU benchmark is likely to
arrive in October 2014; there are already HEPiX people on the working group.
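As mentioned under 'New DM clients' above, the replacement for the old
gfal/lcg-utils tools is the gfal2 family. The following is a minimal sketch
only, assuming the gfal2 Python bindings (gfal2-python) are installed and a
valid grid proxy is in place; the SRM URL is a placeholder, not a real
endpoint.

    #!/usr/bin/env python
    # Minimal sketch of the newer gfal2 Python bindings. Assumes gfal2-python
    # is installed and a valid proxy exists; the URL below is a placeholder.
    import gfal2

    ctx = gfal2.creat_context()   # note the library's spelling of 'creat'

    url = "srm://some-se.example.ac.uk/dpm/example.ac.uk/home/dteam/"

    # List a directory and stat each entry - roughly what the old
    # 'lcg-ls -l' used to do.
    for name in ctx.listdir(url):
        info = ctx.stat(url + name)
        print("%10d  %s" % (info.st_size, name))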
Central ARGUS discussion:
===========================

EGI have requested that the UK (and other) NGIs create an NGI ARGUS instance
as part of the central Emergency User Suspension infrastructure:
https://ggus.eu/ws/ticket_info.php?ticket=99556

The most likely way for it to be used by sites is for them to deploy a site
ARGUS, use it for authorisation decisions, and have it sync from the UK NGI
one. To this end, Jeremy has created a wiki table to track the status of
ARGUS (and the related glExec) deployments in the UK:
https://www.gridpp.ac.uk/wiki/ARGUS_deployment
(A loose sketch of a basic reachability check that might help when filling in
the table is appended after these notes.)

There has been some discussion (on tb-support) about the policy and the
practicalities of implementing it, since some sites are reluctant to deploy
ARGUS, or have more complex authentication and authorisation setups that they
are unsure will integrate well with it. In particular, Matt Doidge raised the
question of Lancaster's setup (two clusters: one for Physics, one shared with
the University) which uses different pool account mappings - it was pointed
out that this currently uses two gridmapdir areas, and could likely just as
well use two ARGUS servers. Chris Walker queried whether the system has any
built-in assumption that ARGUS servers exist one-per-site; Ewan MacMahon said
that he thought not, but it was agreed that this point would need to be
checked. Andrew McNab asked whether there would be automated testing of the
Emergency Suspension system in future, and if so, how it might work - no-one
knows.

AOB:
=====

Kashif highlighted the recent warnings about the new version of OpenSSL
released in RHEL/SL, which appears to cause problems that break CREAM CEs
under some circumstances. There has been a discussion of this on lcg-rollout,
which included a recommendation from Maarten Litmaath to hold the update
back; Chris Walker queried whether the security team thought that was safe -
the team did not have a view on that, but did have a (regular) meeting
scheduled for later in the day. Chris Walker and Daniela Bauer also pointed
out that there is a new release of gridsite which appears to address the
problem in some fashion, but Kashif pointed out that the new gridsite release
has not been thoroughly tested, and that it is not clear whether or not it
solves the whole problem. Daniela explained that Imperial had seen failures
after updating OpenSSL, and that these appeared to have gone away after
installing the gridsite update, but that that had happened only a couple of
hours previously, so it was too early to tell for sure.

Next meeting:
===============

Will be at 11am on Tuesday the 7th of January, 2014.
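As referenced in the Central ARGUS discussion above, the following is a loose
sketch of a basic check that the Argus services on a host are at least
listening, which might be handy when filling in the deployment wiki table.
The hostname is hypothetical and the ports are the assumed Argus defaults
(PAP 8150, PDP 8152, PEPd 8154); this only tests reachability, not whether
the policies are correct or syncing from the NGI instance.

    #!/usr/bin/env python
    # Loose illustration: check that the Argus services on a site (or NGI)
    # instance are listening. The hostname is hypothetical and the port
    # numbers are the assumed Argus defaults - adjust to local configuration.
    import socket

    HOST = "argus.example.ac.uk"   # hypothetical host
    SERVICES = {"PAP": 8150, "PDP": 8152, "PEPd": 8154}

    for name, port in sorted(SERVICES.items()):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(5)
        try:
            s.connect((HOST, port))
            print("%-5s port %d: reachable" % (name, port))
        except (socket.error, socket.timeout) as err:
            print("%-5s port %d: NOT reachable (%s)" % (name, port, err))
        finally:
            s.close()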