GridPP Operations meeting 2013 12 17
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

VOs
=====

LHCb:
-------

LHCb's status in the UK remains unchanged since last week - the workload is
mostly Monte Carlo plus a few user analysis jobs, and is generally running OK.
A couple of problems are being worked on: an upload problem at Sheffield
(https://ggus.eu/ws/ticket_info.php?ticket=98594) and a problem picking up
jobs at (ECDF?). There is also reprocessing work going on, but it is taking
place at CERN and GridKa, not in the UK.

Raja also noted that LHCb is now fully CVMFS-based for its software and no
longer requires a traditional NFS software area at all; any sites that still
have one may remove it.

CMS:
------

There is nothing to report for CMS (as always).

ATLAS:
--------

Last week Alastair Dewhurst discussed the CVMFS problem seen at RAL, where
inode values go over the 32-bit limit when CVMFS filesystems have been
mounted for an extended period. His slides include a recipe
(https://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=slides&confId=289284)
that other sites can use to check whether their WNs are showing the same
effect now, and a fix is expected in the CVMFS 2.1.16 release. Jeremy Coles
noted that affected sites could try to avoid the problem (particularly over
the holiday period) by restarting their nodes before the break. A rough
illustration of the kind of check involved is sketched further down, just
before the Tier One report.

ATLAS' current and planned production work is somewhat short of single-core
simulation jobs, so ATLAS are requesting that more multicore queues be made
available. Ideally the VO would like these to be assigned dynamically by the
local batch systems, but where fixed allocations are necessary the
recommendation is to dedicate 10-20% of the total ATLAS share to multicore
work.

The move to Rucio continues, and ATLAS expect shortly to have deadlines by
which DPM and dCache sites will be expected to have enabled webdav access and
to have had the files on their storage renamed (StoRM sites are temporarily
exempted because the webdav rename process is slow there). The deadline has
not yet been set, but will probably be in January.

Other VOs:
------------

Chris Walker reported that several of the GridPP VOs have now tested services
using VOMS proxies issued by the 'new' backup VOMS instances, and several
problems have been found and fixed. Chris suggested that we should be clear
in recommending that UI configurations be switched to include them from
January - we could probably do it now with minimal breakage, but it is safer
to wait. CW also noted that this process has taken three months, and that we
should hold a post-mortem review to consider how it could have been done
better.

Updates:
==========

There was a GDB last week; no summary is yet available, but the agenda is at:
http://indico.cern.ch/conferenceDisplay.py?confId=251192

The DPM workshop took place in Edinburgh; more coverage will be available in
the GridPP storage meeting tomorrow.

There will be a meeting of the middleware readiness group on Thursday:
https://indico.cern.ch/conferenceDisplay.py?confId=285681
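Referring back to the ATLAS CVMFS item above: Alastair's slides contain the
actual recipe, but as a rough, unofficial illustration of the kind of check a
site might run on a WN, the sketch below walks a couple of mounted CVMFS
repositories and reports the largest inode number it sees. The repository
names, the sample size and the warning threshold are all assumptions, not
taken from the slides; restarting the node, as suggested above, resets the
effect anyway.

    #!/usr/bin/env python
    # Rough illustration only: walk mounted /cvmfs repositories and report any
    # inode numbers close to (or past) the 32-bit limit. This is NOT the recipe
    # from the slides; repository names, sample size and threshold are assumed.
    import os

    LIMIT = 2 ** 32          # 32-bit inode boundary
    WARN = int(LIMIT * 0.9)  # warn when within 10% of the limit (arbitrary)

    def max_inode(root, sample=1000):
        """Return the largest inode seen in the first `sample` entries under root."""
        largest = 0
        count = 0
        for dirpath, dirnames, filenames in os.walk(root):
            for name in filenames + dirnames:
                try:
                    largest = max(largest,
                                  os.lstat(os.path.join(dirpath, name)).st_ino)
                except OSError:
                    continue
                count += 1
                if count >= sample:
                    return largest
        return largest

    # Assumed repository names - adjust to whatever the WN actually mounts.
    for repo in ("atlas.cern.ch", "atlas-condb.cern.ch", "lhcb.cern.ch"):
        root = os.path.join("/cvmfs", repo)
        if not os.path.isdir(root):
            continue
        top = max_inode(root)
        status = ("OVER 32-bit limit!" if top >= LIMIT else
                  "approaching limit" if top >= WARN else "ok")
        print("%-25s max inode seen: %12d  %s" % (repo, top, status))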
Tier One:
-----------

Not much has been changing at RAL; work is mostly focussed on 'battening
down' ready for the holiday period. Gareth noted, though, that the Tier 1
will still have its usual degree of on-call coverage in case of major
incidents.

One change is that an NGI ARGUS instance has been (somewhat) set up. It is
currently not in a usable state, and is waiting on some changes at CERN (to
give it access to the Emergency User Suspension lists) before it can
progress.

Documentation
---------------

Jeremy warned that many of our KeyDocs are due for review and that some have
no current owner, and requested that people check the state of the monitoring
at: https://www.gridpp.ac.uk/php/KeyDocs.php

On duty:
----------

It has been pretty quiet; there have been a few tickets. Some are still open
where sites are waiting for information or assistance, and several that
covered short-lived problems have now been closed.

Matt's World of GGUS:
-----------------------

Matt posted a link to the current ticket list and noted that there has not
been much recent change: http://tinyurl.com/cblj3ab

Jeremy Coles pointed out that some of the tickets in the list have been open
long enough to go red, and expressed particular concern about outstanding
'urgent' tickets from VOs and about any tickets that will hit their automatic
escalation dates over the holidays. Matt offered to do a final pre-holiday
review on Thursday, so sites were asked to check on their tickets and tidy up
where possible - e.g. post updates, close tickets, or place things on hold if
appropriate.

GDB highlights review:
========================

Jeremy Coles went through some of the more notable points from the recent
GDB, while suggesting that sites might like to look at the slides linked from
that meeting's agenda: http://indico.cern.ch/conferenceDisplay.py?confId=251192

He picked out:

Identity Federation: discussions are currently centring on the possibility of
hiding some of the fundamental X509 infrastructure behind friendlier
interfaces (for example, portals). This is at an early stage, mostly
consisting of working out what WLCG would require of any such thing.

EGI have been rearranging and redistributing work within the project with the
aim of not running out of money. The notional effort dedicated to some areas
has been reduced, and other things have been picked up elsewhere - e.g. some
things of interest to WLCG have been picked (back?) up by CERN.

SHA2: the main remaining areas of concern are dCache and StoRM SEs, but
compliant versions are available or expected soon. Maarten Litmaath believes
that WLCG should be sufficiently ready by about mid-January. The UK e-Science
CA has decided to postpone its switch to issuing SHA2 certificates by default
until (probably) March. In contrast, the French CA is expected to move soon
(possibly next week), so there is a chance that their users could have
problems with any non-compliant services.

New DM clients: Jeremy recommended that people look at the slides:
http://indico.cern.ch/getFile.py/access?contribId=13&sessionId=2&resId=1&materialId=slides&confId=251192
The key point is that the current/old gfal/lcg-utils tools are in strict
maintenance mode and will not receive feature updates (e.g. further IPv6
support). A short sketch of the newer clients follows at the end of this
section.

CERN expect to have a general IPv6 service deployed and available to their
whole site from Q1 2014.

There was a report from HEPiX that a new SPEC CPU benchmark is likely to
arrive in October 2014; there are already HEPiX people on the working group.
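As mentioned under 'New DM clients' above, the replacement for the old
gfal/lcg-utils tools is the gfal2 family. The following is a minimal sketch
only, assuming the gfal2 Python bindings (gfal2-python) are installed and a
valid grid proxy is in place; the SRM URL is a placeholder, not a real
endpoint.

    #!/usr/bin/env python
    # Minimal sketch of the newer gfal2 Python bindings. Assumes gfal2-python
    # is installed and a valid proxy exists; the URL below is a placeholder.
    import gfal2

    ctx = gfal2.creat_context()   # note the library's spelling of 'creat'

    url = "srm://some-se.example.ac.uk/dpm/example.ac.uk/home/dteam/"

    # List a directory and stat each entry - roughly what the old
    # 'lcg-ls -l' used to do.
    for name in ctx.listdir(url):
        info = ctx.stat(url + name)
        print("%10d  %s" % (info.st_size, name))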
Central ARGUS discussion:
===========================

EGI have requested that the UK (and other) NGIs create an NGI ARGUS instance
as part of the central Emergency User Suspension infrastructure:
https://ggus.eu/ws/ticket_info.php?ticket=99556

The most likely way for it to be used by sites is for them to deploy a site
ARGUS, use it for authorisation decisions, and have it sync from the UK NGI
one. To this end, Jeremy has created a wiki table to track the status of
ARGUS (and the related glExec) deployments in the UK:
https://www.gridpp.ac.uk/wiki/ARGUS_deployment
(A loose sketch of a basic reachability check that might help when filling in
the table is appended after these notes.)

There has been some discussion (on tb-support) about the policy and the
practicalities of implementing it, since some sites are reluctant to deploy
ARGUS, or have more complex authentication and authorisation setups that they
are unsure will integrate well with it. In particular, Matt Doidge raised the
question of Lancaster's setup (two clusters: one for Physics, one shared with
the University) which uses different pool account mappings - it was pointed
out that this currently uses two gridmapdir areas, and could likely just as
well use two ARGUS servers. Chris Walker queried whether the system has any
built-in assumption that ARGUS servers exist one-per-site; Ewan MacMahon said
that he thought not, but it was agreed that this point would need to be
checked. Andrew McNab asked whether there would be automated testing of the
Emergency Suspension system in future, and if so, how it might work - no-one
knows.

AOB:
=====

Kashif highlighted the recent warnings about the new version of OpenSSL
released in RHEL/SL, which appears to cause problems that break CREAM CEs
under some circumstances. There has been a discussion of this on lcg-rollout,
which included a recommendation from Maarten Litmaath to hold the update
back; Chris Walker queried whether the security team thought that was safe -
the team did not have a view on that, but did have a (regular) meeting
scheduled for later in the day. Chris Walker and Daniela Bauer also pointed
out that there is a new release of gridsite which appears to address the
problem in some fashion, but Kashif pointed out that the new gridsite release
has not been thoroughly tested, and that it is not clear whether or not it
solves the whole problem. Daniela explained that Imperial had seen failures
after updating OpenSSL, and that these appeared to have gone away after
installing the gridsite update, but that that had happened only a couple of
hours previously, so it was too early to tell for sure.

Next meeting:
===============

Will be at 11am on Tuesday the 7th of January, 2014.
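As referenced in the Central ARGUS discussion above, the following is a loose
sketch of a basic check that the Argus services on a host are at least
listening, which might be handy when filling in the deployment wiki table.
The hostname is hypothetical and the ports are the assumed Argus defaults
(PAP 8150, PDP 8152, PEPd 8154); this only tests reachability, not whether
the policies are correct or syncing from the NGI instance.

    #!/usr/bin/env python
    # Loose illustration: check that the Argus services on a site (or NGI)
    # instance are listening. The hostname is hypothetical and the port
    # numbers are the assumed Argus defaults - adjust to local configuration.
    import socket

    HOST = "argus.example.ac.uk"   # hypothetical host
    SERVICES = {"PAP": 8150, "PDP": 8152, "PEPd": 8154}

    for name, port in sorted(SERVICES.items()):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(5)
        try:
            s.connect((HOST, port))
            print("%-5s port %d: reachable" % (name, port))
        except (socket.error, socket.timeout) as err:
            print("%-5s port %d: NOT reachable (%s)" % (name, port, err))
        finally:
            s.close()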