ATLAS UK Cloud Support meeting minutes, 15 November 2018
Present: Brian Davies, Dan Traynor, Elena Korolkova, Matt Doidge, Peter Love, Sam Skipsey, Tim Adye
Outstanding tickets:
* ggus 138233 UKI-NORTHGRID-LANCS-HEP transfer failure with "Communication error on send"
Matt: The problem from the ticket is now solved. Changed the xrootd endpoint in AGIS, but this broke Rucio transfers on other queues.
Peter: configured the xrootd mover for all queues and added LAN rights.
Matt: All jobs are coming through UCORE, with a high failure rate. Will dig into the logs to investigate.
Will close the ticket later today; if another issue appears, a new ticket can be opened.
Elena checked the settings in AGIS; they look OK.
Brian: is there a difference between the write_wan and write_lan activities? Changing xrootd to write_wan has caused trouble before (see the sketch below for checking this in AGIS).
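As a reference for the write_wan/write_lan question, here is a minimal sketch of dumping the activity-to-protocol mapping for one DDM endpoint from the AGIS JSON export. The API URL, the 'arprotocols' field name and the Lancaster endpoint name are assumptions for illustration, not taken from these minutes.
```python
# Minimal sketch (assumptions noted above): list which protocols AGIS maps to
# each activity (read_lan, write_lan, read_wan, write_wan, ...) for one endpoint.
import requests

AGIS_DDM_API = "https://atlas-agis-api.cern.ch/request/ddmendpoint/query/list/?json"
ENDPOINT = "UKI-NORTHGRID-LANCS-HEP_DATADISK"   # hypothetical example endpoint

for ep in requests.get(AGIS_DDM_API, timeout=60).json():
    if ep.get("name") != ENDPOINT:
        continue
    # 'arprotocols' (assumed field name) maps activity -> prioritised protocol list.
    for activity, protocols in sorted(ep.get("arprotocols", {}).items()):
        print(activity)
        for proto in protocols:
            print("   ", proto)
```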
* ggus 138277 High failure rate in UKI-NORTHGRID-LIV-HEP_MCORE
Peter will follow up.
* ggus 137112 UKI-NORTHGRID-MAN-HEP srm space reporting broken
No news, but this will take time to resolve.
* ggus 138033 RAL-LCG2 singularity jobs failing at RAL
Peter suggested changing the fields to obviously bogus values to see where they get picked up. Ask AGIS support to make the change in the schedconfig table.
* ggus 138281 Low efficiency and many failures in transfers from UKI-SOUTHGRID-RALPP
The site has switched to IPv4/IPv6 dual stack.
Tim switched the FTS servers from RAL to CERN at Chris's request (see the sketch below).
Elena will check it's OK.
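For reference, switching an RSE between FTS instances is normally done through the 'fts' RSE attribute in Rucio; below is a minimal sketch with the Rucio Python client, assuming that convention. The RSE name and FTS URL are illustrative only.
```python
# Minimal sketch: show which FTS server Rucio will use for an RSE, assuming the
# usual 'fts' RSE attribute convention.  RSE name and FTS URL are illustrative.
from rucio.client import Client

client = Client()
rse = "UKI-SOUTHGRID-RALPP_DATADISK"   # hypothetical RSE name

attrs = client.list_rse_attributes(rse)
print("current fts attribute:", attrs.get("fts"))

# Switching from the RAL to the CERN FTS instance would then look like:
# client.add_rse_attribute(rse, "fts", "https://fts3.cern.ch:8446")
```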
Ongoing issues (new comments are prefixed with the date, e.g. 8/11, 15/11):
1. Configuring ECDF’s new storage (Elena)
ggus 136391 is now closed. We will continue to discuss ECDF RDF storage in future meetings.
The queue is now online. Mostly HC tests, plus some jobs, have completed successfully. Not many jobs are being sent to RDF, presumably because the queue is single-core only.
Alessandra: ATLAS policy is to move all sites to UCORE if possible. Keep ECDF and RDF separate.
Andy agreed. Elena will set this up in AGIS; we can then see if it gets more jobs/data to RDF.
8/11: Elena will add the UCORE queue today.
15/11: Elena set up UCORE for RDF and asked for it to be added to Harvester. The Harvester experts said there was a problem. Elena will discuss with ECDF.
2. LocalGroupDisk consolidation (Tim)
* Most datasets more than 2 years old were removed on 4 July; this should free 488 TB. Tim will check the rest.
* Tim: R2D2 rules have a default lifetime of 180 days (this can be changed when making the request; see the sketch at the end of this item).
At S&C Week, Cedric suggested a "subscribe to dataset" option to get email notifications. I think it would be more useful to do this when creating the rule, and to send a notification when the LOCALGROUPDISK copy is the last replica.
* Brian: do we really need to keep 10% free space? It prevents disks filling up and leaves room for unexpected increases, so it is not necessary for sites to also provide their own safety margin.
* Tim has asked Roger if we should rebalance more T2 disk from LOCALGROUPDISK to DATADISK. Will prod again.
* Alessandra: Manchester LOCALGROUPDISK is full again.
* Tim will see if another cleanup can help. Procedure is here.
Elena: Oxford LOCALGROUPDISK is full and has been blacklisted for writing.
Peter: AGIS has Oxford LOCALGROUPDISK at higher priority than DATADISK. Will fix this so jobs stop trying to write there.
Tim will see if he can free up space on Oxford LOCALGROUPDISK.
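A minimal sketch of the lifetime point above: creating a LOCALGROUPDISK rule with an explicit lifetime via the Rucio Python client rather than relying on the R2D2 default. The dataset scope/name and RSE are hypothetical.
```python
# Minimal sketch: request a LOCALGROUPDISK replica with an explicit lifetime
# (in seconds) instead of the R2D2 default of 180 days.  Names are hypothetical.
from rucio.client import Client

client = Client()
rule_ids = client.add_replication_rule(
    dids=[{"scope": "user.someone", "name": "user.someone.example_dataset"}],
    copies=1,
    rse_expression="UKI-NORTHGRID-MAN-HEP_LOCALGROUPDISK",
    lifetime=365 * 24 * 3600,     # one year; None would mean no expiry
    comment="example rule with an explicit lifetime",
)
print("created rule(s):", rule_ids)
```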
3. Sussex becoming a CPU-only site. Will use T1 disk or maybe RHUL. (Tim)
Alastair has promised to let us know his ideas.
Dan discussed with Leo at Sussex: they are happy to get rid of StoRM and move to ARC, with Lustre as a cache. This setup is proven to work at Durham and other European sites.
Dan+Sam agreed to pursue this option with Sussex.
8/11: Dan: Leo is moving to a new job, so we have to wait for his replacement.
4. BHAM migration to EOS, Mark requests to move now (Elena/Brian)
see ADCINFR-87
Discussed whether to restore ATLAS disk at Birmingham with the new EOS instance (when Mark is ready, in 5 weeks), or to remain diskless, accessing data at Manchester.
Alessandra: diskless access has saturated the Manchester network. We will need to add storage or a cache at Birmingham.
The only cache option currently supported for ATLAS is the ARC cache. It seems to work well, but it is not ready to install everywhere.
ATLAS needs time to develop a strategy. As a cloud squad we would like to restore disk at Birmingham until this has been done.
In the meantime, we can favour simulation production at Birmingham to reduce the load on the Manchester network.
Discussed in GridPP Storage Group meeting (see minutes for 31 October 2018). They said we can decide what we want for ATLAS.
For the moment, we'll keep it using Manchester and monitor.
8/11: Sam: discussed in Ops meeting. Request to gather more data. Alessandra continuing to collect data. Birmingham should also collect data.
15/11: Sam will email about this.
5. Moving UK Cloud to Harvester (Peter/Alessandra/Elena)
See spreadsheet.
Oxford C7 UCORE queue is short of jobs.
Glasgow had its wallclock and CPU time limits set to 0; Harvester treats this as an actual limit of 0 rather than as unlimited (see the sketch at the end of this item).
Liverpool switched to Harvester and jobs dried up.
Peter supports the idea of integrating Harvester with APFmon. Fernando and Tadashi have agreed, but the PanDA group are working on something else.
Alessandra would like to push for better organisation of the Harvester migration. We need to check the PanDA parameters, not just replicate the MCORE queue; e.g. HC tests should be switched to add SCORE, not just MCORE tests.
Elena: will check UK queues for these issues and email Harvester experts to ask about problems and APFmon.
Alessandra is reviewing Singularity configuration for the documentation. When done, will send Elena a link.
Elena: problem with Sussex and Liverpool. Alessandra: they are adding more worker nodes.
ECDF doesn't run as many job slots as before the switch.
8/11: Elena will check the settings. Many queues have only MCORE HC tests; they also need SCORE.
Peter: Initial integration of APFmon monitoring with Harvester is done. Please contact Peter if anything is missing.
Eventually we need to decommission APF. Non-UCORE queues can be serviced by Harvester, but that switch will come later.
Harvester ops meeting on Fri 9:30 (CERN time)
15/11: Elena: Lancaster have a UCORE queue. Should we "rename" UCORE to UCORE_SL7?
Peter: unless it's causing a problem, leave it.
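On the Glasgow limits issue above: the underlying question is how an unset limit is encoded. A purely illustrative sketch (not Harvester code) of the guard that treats 0/None as "no limit" rather than a hard zero:
```python
# Illustrative only, not Harvester code: interpret a queue limit of 0 or None
# as "not set" and fall back to a default, rather than treating it as 0 seconds.
def effective_limit(value, default):
    """Return a usable limit in seconds; 0 or None means the field is unset."""
    return default if value in (0, None) else value

# Hypothetical schedconfig-style record with the wallclock limit left at 0.
queue = {"maxtime": 0}
print(effective_limit(queue["maxtime"], default=4 * 24 * 3600))  # -> 345600 (4 days)
```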
6. RHUL choice of storage to replace DPM (Brian/Sam)
RHUL have 1.3 PB of storage in DPM that is getting old, and are asking for advice on what type of storage to replace it with.
Alessandra and Sam suggest POSIX (HDFS) + xrootd. This allows both xrootd and HTTP access, which will be needed in future (see the sketch below).
15/11: Brian freed up space at RHUL. They then have space to investigate HDFS and xrootd caching.
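As a reference for the POSIX (HDFS) + xrootd suggestion, a minimal sketch of checking that one namespace is reachable over both xrootd and HTTPS. The host, ports and path are hypothetical, and it assumes the XRootD Python bindings plus an HTTP door on the same storage.
```python
# Minimal sketch: stat the same file over xrootd and over HTTPS, as wanted for
# a POSIX (HDFS) + xrootd setup.  Host, ports and path are hypothetical.
import requests
from XRootD import client

HOST = "se.example.ac.uk"
PATH = "/atlas/atlasdatadisk/rucio/mc16_13TeV/some.file.root"
HTTPS_PORT = 1094   # the HTTP door may sit on a different port at a real site

# xrootd: stat the file through the door/redirector.
fs = client.FileSystem("root://%s:1094" % HOST)
status, info = fs.stat(PATH)
print("xrootd:", "OK, %d bytes" % info.size if status.ok else status.message)

# HTTPS/WebDAV: a HEAD request against the same namespace (X.509/token auth
# and certificate verification are skipped here to keep the sketch short).
resp = requests.head("https://%s:%d%s" % (HOST, HTTPS_PORT, PATH), verify=False)
print("https :", resp.status_code, resp.headers.get("Content-Length"))
```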
News:
Brian: FTS problems with IPv6. Rome transfers are broken over IPv6. Need to fix RAL first, then look at the others. The RAL FTS is on the same version as the CERN FTS. Some sites enable IPv6 on LHCONE and then hit problems because RAL is on the OPN.
Dan: NTR
Elena: NTR
Matt: NTR
Peter: NTR
Sam: NTR
Tim: RAL T1 all jobs now running in C7 containers.