ATLAS UK Cloud Support meeting minutes, 6 December 2018

Present: Alessandra Forti, Dan Traynor, Gareth Roy, Matt Doidge, Peter Love, Sam Skipsey, Vip Davda

Outstanding tickets:

* GGUS 137112 UKI-NORTHGRID-MAN-HEP: SRM space reporting broken
    Alessandra: was waiting for feedback, but will close now.
* GGUS 138033 RAL-LCG2: Singularity jobs failing at RAL
    Alessandra is modifying how the containers are run; she just needs to set an environment variable (see the sketch after this list).
    She talked with Rohini (SKA), who will look at doing the same.
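
For reference, a minimal sketch (in Python) of one way to pass an environment variable into a Singularity container from a wrapper script. The variable name, value, and image path below are placeholders for illustration, not the actual settings from the ticket:

    import os
    import subprocess

    # Variables prefixed with SINGULARITYENV_ are exported inside the
    # container by Singularity itself (with the prefix stripped).
    env = dict(os.environ)
    env["SINGULARITYENV_MY_VARIABLE"] = "some-value"  # placeholder name and value

    # Placeholder image path; running "env" inside the container shows
    # whether the variable is visible.
    subprocess.run(
        ["singularity", "exec", "/path/to/atlas-image.img", "env"],
        env=env,
        check=True,
    )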

Ongoing issues (new comments marked in green):

1. LocalGroupDisk consolidation (Tim)
  * Most datasets >2 years old were removed on 4 July; this should free 488 TB. Tim will check the rest.
  * Tim: R2D2 has a default lifetime of 180 days (it can be changed when making the request); see the Rucio sketch after the ongoing-issues list.
        At S&C Week, Cedric suggested a "subscribe to dataset" option to get email notifications. I think it would be more useful to do this when creating the rule, with a notification when LOCALGROUPDISK holds the last replica.
  * Brian: do we really need to keep 10% free space? It prevents the disk from filling up and leaves room for unexpected increases, so sites do not also need to provide a safety margin of their own.
  * Tim has asked Roger if we should rebalance more T2 disk from LOCALGROUPDISK to DATADISK. Will prod again.
  * Alessandra: Manchester LOCALGROUPDISK is full again.
  * Tim will see if another cleanup can help. Procedure is here.
  * Elena: Oxford LOCALGROUPDISK is full and has been blacklisted for writing.
  * Peter: AGIS has Oxford LOCALGROUPDISK at a higher priority than DATADISK. He will fix this so jobs stop trying to write there.
  * Tim will see if he can free up space on Oxford LOCALGROUPDISK.
2. Sussex becoming a CPU-only site. Will use T1 disk or maybe RHUL. (Tim)
    Alastair has promised to let us know his ideas.
    Dan discussed with Leo at Sussex: they are happy to get rid of StoRM and move to ARC, with Lustre as a cache. This setup has been proven to work at Durham and other European sites.
    Dan+Sam agreed to pursue this option with Sussex.
    8/11: Dan: Leo is moving to a new job, so we have to wait for his replacement.
3. BHAM migration to EOS, Mark requests to move now (Elena/Brian)
    see ADCINFR-87
    Discussed whether to restore ATLAS disk at Birmingham with the new EOS instance (when Mark is ready in 5 weeks), or to remain diskless, accessing data at Manchester.
    Alessandra: the diskless setup has saturated the Manchester network. Birmingham will need to add storage or a cache.
    The only cache option currently supported for ATLAS is the ARC cache. It seems to work well, but it is not ready to be installed everywhere.
    ATLAS needs time to develop a strategy. As a cloud squad we would like to restore disk at Birmingham until this has been done.
    In the meantime, we can favour simulation production at Birmingham to reduce the load on the Manchester network.
    Discussed in GridPP Storage Group meeting (see minutes for 31 October 2018). They said we can decide what we want for ATLAS.
    For the moment, we'll keep it using Manchester and monitor.
    8/11: Sam: discussed in Ops meeting. Request to gather more data. Alessandra continuing to collect data. Birmingham should also collect data.
    15/11: Sam will email about this.
    22/11: Alessandra: bandwidth is still full. This is not just due to Birmingham but also to FTS transfers from other sites. Asking for 40 Gb/s to JANET.
        Later we can change the Birmingham fairshares to see the effect.
        The second biggest bottleneck is the headnode: increasing the number of threads, etc., and looking at SSDs for the database.
        Sam: we could use memcache for the DB instead of SSDs (memcache is the default in 8.10, but Manchester has an older version).
        It will be interesting to see Birmingham's effect by comparing the change when we switched to using Manchester, and later when we change the fairshares.
    29/11: Alessandra reported no problems in the last 2-3 weeks.
4. Harvester issues (Peter/Alessandra/Elena)
    06/12: Nicolo, Ivan, and Fernando declared the migration of the UK T2 UCORE queues complete, but there are still problems at Manchester (at least), so we keep this as an ongoing issue.
    Manchester UCORE has been half empty for almost a week: ce01 is empty, ce02 is OK.
    Peter is looking at improving the monitoring. Harvester obscures the differences between CEs, whereas APF treated them separately.
    http://apfmon.lancs.ac.uk/CERN_central_B:UKI-NORTHGRID-MAN-HEP_SL7_UCORE-ce01
    http://apfmon.lancs.ac.uk/CERN_central_B:UKI-NORTHGRID-MAN-HEP_SL7_UCORE-ce02
    Alessandra can't find the Harvester "factory" for the GPU queue.
    Peter asked Fernando to improve the messages from Harvester.
    Peter will continue looking into it with the monitoring.
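
On the R2D2 lifetime and notification discussion in item 1: a minimal sketch, assuming the Rucio Python client, of creating a replication rule with an explicit lifetime instead of relying on the 180-day default, and with email notification enabled. The scope, dataset name, and RSE below are placeholders, and rule-level notification is only an approximation of the "last replica on LOCALGROUPDISK" alert discussed above:

    from rucio.client import Client

    client = Client()

    # 180 days expressed in seconds, matching the current R2D2 default.
    lifetime = 180 * 24 * 3600

    client.add_replication_rule(
        dids=[{"scope": "user.jdoe", "name": "user.jdoe.example_dataset"}],  # placeholder DID
        copies=1,
        rse_expression="UKI-NORTHGRID-MAN-HEP_LOCALGROUPDISK",  # placeholder RSE
        lifetime=lifetime,   # seconds; omit to use the default
        notify="Y",          # email notification on rule state changes ('N' is the default)
    )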

News:

Tim:
    ANALY_RAL_ECHO has been blacklisted since Tuesday morning, and had been having problems with jobs created after 8am on Monday.
    Peter saw that jobs from last Thursday/Friday had truncated stdout from the CE. http://apfmon.lancs.ac.uk/dash/queue/ANALY_RAL_ECHO-14660
    Andre was messing around with something called "ACT Harvester".
Alessandra: NTR
Dan: Very few ATLAS jobs at QMUL. Jobs are going to the CE but not getting submitted; it may be an internal problem. Peter said it looked like network problems (IPv6?) a couple of days ago (https://aipanda024.cern.ch/condor_logs_1/18-12-03_22/grid.6007155.0.err).
Gareth: NTR
Matt: NTR
Peter: NTR
Sam: NTR
Vip: NTR

Next week is ATLAS Software & Computing Week. Peter, Alessandra, and Tim will be at CERN and probably can't attend next week's ATLAS UK Cloud Support meeting. Elena will be available to chair the meeting.