ATLAS UK Cloud Support meeting minutes, 17 January 2019
Vidyo (Europe/London)
Tim Adye (Science and Technology Facilities Council STFC (GB))

Present: Alessandra Forti, Brian Davies, Dan Traynor, Elena Korolkova, Gareth Roy, Kashif Mohammad, Sam Skipsey, Tim Adye, Vip Davda

Outstanding tickets:

* GGUS 138033: singularity jobs failing at RAL
    Alessandra: the ATLAS workaround is not yet in production; it is still in the dev pilot. The plan is to move to using unpacked images in CVMFS.
    For standalone containers, the image is unpacked in the worker-node directory tree, which works around the home-directory problem.
    With the production pilot it still won't work: the home-directory problem is masked by the need for a loop device.

Ongoing issues (new comments marked in green):

1. LocalGroupDisk consolidation (Tim)
  * Most datasets more than 2 years old were removed on 4 July. This should free 488 TB. Tim will check the rest.
  * Tim: R2D2 has default lifetime of 180 days (can be changed when making request).
        At S&C Week, Cedric suggested a "subscribe to dataset" option to get email notifications. I think it would be more useful to offer this when creating the rule, with a notification when the LOCALGROUPDISK copy is the last replica.
  * Brian: do we really need to keep 10% free space? This prevents the disk from filling up and leaves room for unexpected increases, so it is not necessary for sites to also provide their own safety margin.
  * Tim has asked Roger if we should rebalance more T2 disk from LOCALGROUPDISK to DATADISK. Will prod again.
    10/01: Cleanup run is due sooner rather than later. RHUL increased space recently.
    17/01: Is the 180-day lifetime still the default limit? We got mixed results when trying it during the meeting.
    Kashif: How can we tell users which of their files are on the site's LOCALGROUPDISK? Currently using dmlite-shell.
    Tim will provide a list from Rucio for Oxford. If useful, can we do this automatically for all UK LOCALGROUPDISKs?
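    One possible way to automate such a listing for all UK LOCALGROUPDISKs would be to query Rucio for the replication rules pinning data to each LOCALGROUPDISK endpoint and group them by account. The sketch below uses the Rucio Python client; the RSE name, the choice of rule fields, and the per-account grouping are illustrative assumptions rather than an agreed procedure, and field names can differ between Rucio versions.

        #!/usr/bin/env python
        # Sketch only: list the contents of a LOCALGROUPDISK per Rucio account.
        # Assumes a working Rucio client environment (rucio.cfg + grid proxy).
        from collections import defaultdict
        from rucio.client import Client

        def rules_by_account(rse):
            """Group the replication rules on one RSE by the account that owns them."""
            client = Client()
            per_account = defaultdict(list)
            # Filter rules by RSE expression; each rule records the requesting account.
            for rule in client.list_replication_rules(filters={'rse_expression': rse}):
                per_account[rule['account']].append(rule)
            return per_account

        if __name__ == '__main__':
            # Example RSE name (assumption): the Oxford LOCALGROUPDISK endpoint.
            rse = 'UKI-SOUTHGRID-OX-HEP_LOCALGROUPDISK'
            for account, rules in sorted(rules_by_account(rse).items()):
                print('%s (%d rules)' % (account, len(rules)))
                for rule in rules:
                    # 'expires_at' is None for rules without a lifetime.
                    print('  %s:%s  expires %s'
                          % (rule['scope'], rule['name'], rule['expires_at']))

    Run periodically for each UK _LOCALGROUPDISK RSE and mailed to the rule owners, something like this could provide the automatic per-user notification discussed above; the 180-day R2D2 lifetime would then show up in the expires field.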

2. Sussex becoming a CPU-only site. Will use T1 disk, or maybe RHUL. (Tim)
    Dan discussed this with Leo at Sussex: they are happy to get rid of StoRM and move to ARC, with Lustre as a cache. This setup has been proven to work at Durham and other European sites.
    8/11: Dan: Leo is moving to a new job, so we have to wait for his replacement.
    17/01: Kashif: the Sussex grid person has moved to central IT and is now looking after the Grid on a voluntary basis.
3. BHAM migration to EOS, Mark requests to move now (Elena/Brian)
    see ADCINFR-87
    Discussed whether to restore ATLAS disk at Birmingham with a new EOS instance (when Mark is ready, in 5 weeks), or to remain diskless, accessing data at Manchester.
    Alessandra: Birmingham has saturated the Manchester network. We will need to add storage or a cache at Birmingham.
    The only cache option currently supported for ATLAS is the ARC cache. It seems to work well, but it is not ready to install everywhere.
    ATLAS needs time to develop a strategy. As a cloud squad we would like to restore disk at Birmingham until this has been done.
    In the meantime, we can favour simulation production at Birmingham to reduce the load on the Manchester network.
    Discussed in GridPP Storage Group meeting (see minutes for 31 October 2018). They said we can decide what we want for ATLAS.
    For the moment, we'll keep it using Manchester and monitor.
    8/11: Sam: discussed in Ops meeting. Request to gather more data. Alessandra continuing to collect data. Birmingham should also collect data.
    15/11: Sam will email about this.
    22/11: Alessandra: bandwidth still full. Not just due to Birmingham, but also other sites transferring with FTS. Asking for 40 Gb/s to JANET.
                Later can change Birmingham fairshares to see effect.
                2nd biggest bottleneck is headnode. Increasing #threads etc. Looking at SSD disks for database.
                Sam: could use memcache for db, instead of SSD. (memcache is default in 8.10, but Manchester has older version.)
                It will be interesting to see Birmingham's effect by comparing the change when we switched it to use Manchester with the change when we later adjust the fairshares.
    29/11: Alessandra reported no problems in last 2-3 weeks.
    10/01: Decision needed on whether ATLAS to have storage at BHAM.
    17/01: Elena: discussed on Tuesday. Mark is ready to install storage and can provide 500 TB. He says it's not a big deal to set up, at least from the Bham side.
    Alessandra: No analysis queue, so no LOCALGROUPDISK; just DATADISK.
    Elena: For the ATLAS setup, we don't have experience with EOS. Alessandra: it would not be a bad thing to learn.
    Alessandra: the network is more or less fine, but we didn't check how efficiently jobs run; no production jobs have run there for at least a week.

    Elena will check jobs at Bham and discuss with Mark. Will ask him to call into a future meeting.
4. Harvester issues (Peter/Alessandra/Elena)
    06/12: Nicolo, Ivan, and Fernando declared the migration of UK T2 UCORE queues complete, but there are still problems at Manchester (at least), so we keep this as an ongoing issue.
    Manchester UCORE has been half empty for almost a week: ce01 empty, ce02 OK.
    Peter is looking at improving monitoring. Harvester obscures the differences between CEs, whereas APF treated them separately.
    http://apfmon.lancs.ac.uk/CERN_central_B:UKI-NORTHGRID-MAN-HEP_SL7_UCORE-ce01
    http://apfmon.lancs.ac.uk/CERN_central_B:UKI-NORTHGRID-MAN-HEP_SL7_UCORE-ce02
    Alessandra can't find Harvester "factory" for GPU queue.
    Peter asked Fernando to improve messages from Harvester. Peter will continue looking into it with monitoring.
    13/12: Elena has set up PanDA site UKI-LT2-QMUL_SL7_UCORE and queue UKI-LT2-QMUL_SL7_UCORE, and sent an email to the Harvester experts asking them to include it in Harvester.
    10/01: Gareth asked about the approach to debugging issues at Durham. Elena will follow up on the Harvester status at Durham and get an update on procedures. Peter to check apfmon in relation to Harvester. QMUL_UCORE is running OK under Harvester.
    17/01: Elena: Ivan set Durham UCORE PQ offline in September (currently just score). Elena mailed him twice. Will set back online. [Subsequent mails clarified that Durham Harvester migrated separately because it uses aCT.]
    Brian confirmed that he has declared missing files lost. Procedure with new DDM emails seems to work fine.


News:
Alessandra: NTR
Brian: NTR
Sam: NTR
Vip: Asked on atlas-support-cloud-uk@cern.ch about multithreaded analysis jobs that ran at Oxford. Sam suggested using cgroups to control single-core (score) jobs. Alessandra will take a look.
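    As an illustration of Sam's suggestion, the kernel's cgroup CPU controller can cap how much CPU a set of processes may use, which would stop a nominally single-core (score) job from spreading over several cores. The fragment below is a minimal sketch using the cgroup v1 cpu controller directly; the group name and paths are assumptions, it needs root, and in practice the batch system's own cgroup support would probably be the cleaner way to enforce this.

        #!/usr/bin/env python3
        # Minimal sketch (not the agreed Oxford setup): confine a process tree to
        # one CPU core using the cgroup v1 "cpu" controller. Requires root and a
        # kernel with the controller mounted at the conventional path.
        import os
        import sys

        CG_ROOT = '/sys/fs/cgroup/cpu'     # standard cgroup v1 mount point
        CG_NAME = 'atlas_score_jobs'       # illustrative group name (assumption)

        def write(path, value):
            with open(path, 'w') as f:
                f.write(str(value))

        def limit_to_one_core(pid):
            """Put process `pid` (and its future children) in a 1-core cgroup."""
            cg = os.path.join(CG_ROOT, CG_NAME)
            os.makedirs(cg, exist_ok=True)
            # quota/period gives the number of cores: 100000/100000 = 1 core.
            write(os.path.join(cg, 'cpu.cfs_period_us'), 100000)
            write(os.path.join(cg, 'cpu.cfs_quota_us'), 100000)
            write(os.path.join(cg, 'cgroup.procs'), pid)

        if __name__ == '__main__':
            limit_to_one_core(int(sys.argv[1]))

    Batch systems such as HTCondor or SLURM can impose equivalent per-job limits through their own cgroup integration, which is likely the more maintainable option for a production queue.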
Tim:
* Cleaning up squids in AGIS for all sites: https://its.cern.ch/jira/browse/FTOPSDEVEL-218 . ATLAS cloud squads will be asked to check and apply changes. Checked if this is OK for UK sites.
* Missing files in RAL Echo. FTS bug when a file is first written to RAL disk: FTS starts two transfers for the same request, which causes the transfer to fail but appear to succeed. Being investigated by the FTS team. Sam saw something similar and will send Tim the details.
* New Frontier server added. Part of a migration from HyperV to VMWare+OpenStack.
* Bad checksum on a file, caused by 2 identical jobs run simultaneously on different WNs by aCT. Rod is investigating.

AOB:
ATLAS Sites Jamboree, 5-8 March. Encourage ATLAS site admins to attend. Maybe we could have a face-to-face meeting.
Tim, Alessandra, and Peter will probably be there.
Alessandra will publicise for site admins on TB-SUPPORT@JISCMAIL.AC.UK. Will GridPP fund travel?
Agenda:
    10:00-10:10  Outstanding tickets (10m)
    10:10-10:30  Ongoing issues (20m)
    10:30-10:50  News round-table (20m)
    10:50-11:00  AOB (10m)