ATLAS UK Cloud Support meeting minutes, 31 January 2019
Convener: Tim Adye (Science and Technology Facilities Council STFC (GB))

Present: Alessandra Forti, Dan Traynor, Elena Korolkova, Gareth Roy, Matt Doidge, Peter Love, Tim Adye, Vip Davda

Outstanding tickets:

* ggus 139375 RAL-LCG2 transfers fail with "the server responded with an error 500"
    The issue was atl16mcHITS tape migrations getting stuck, which filled the tape cache. Now fixed.
* ggus 138033 singularity jobs failing at RAL
    Home directory fix is being tested and should soon be rolled out more widely across the farm.
    Alessandra: Rohina could run from Docker, but not from Singularity. Will move to Docker.
    The pilot will use unpacked containers from CVMFS on the RAL-LCG2_TEST queue.
    Yesterday tried Singularity standalone images, removing the -c option so that it doesn't mount the home directory (a sketch of this kind of invocation follows after this list).
    This needs to be tested with different setups.
    WLCG will not recommend using user namespaces: they only work in Singularity 2.6.1, and won't work if we move to Singularity 3.
    User namespaces and setuid modes each have their own problems: https://docs.google.com/spreadsheets/d/1SGKyja47Veu_8IUXlXWOOEferuFoD62O4m64pgTNgSk/
    Tried Podman, which can do this, but it doesn't yet work on CentOS7.
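
    For reference, a minimal sketch of how such a standalone-image invocation might look when driven from Python. The CVMFS image path and payload command are illustrative assumptions, not the actual pilot configuration, which uses its own wrappers:

        #!/usr/bin/env python3
        """Sketch: run a payload in an unpacked CVMFS container image.
        The image path is a hypothetical example of the
        /cvmfs/unpacked.cern.ch layout; the payload is a placeholder."""
        import subprocess

        IMAGE = "/cvmfs/unpacked.cern.ch/registry.hub.docker.com/atlas/athanalysis:latest"

        # Per the discussion above, the -c option is omitted; only explicit
        # -B bind mounts are added (here just CVMFS itself).
        cmd = [
            "singularity", "exec",
            "-B", "/cvmfs",
            IMAGE,
            "/bin/sh", "-c", "echo payload would run here",
        ]

        result = subprocess.run(cmd, capture_output=True, text=True)
        print(result.stdout, end="")
        if result.returncode != 0:
            print("singularity exec failed:", result.stderr)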

Ongoing issues (new comments marked in green):

1. LocalGroupDisk consolidation (Tim)
  * Most >2-year-old datasets were removed on 4 July. This should free 488 TB. Tim will check the rest.
  * Tim: R2D2 has a default lifetime of 180 days (this can be changed when making the request).
        At S&C Week, Cedric suggested a "subscribe to dataset" option to get email notifications. I think it would be more useful to do this when creating the rule, with a notification when LOCALGROUPDISK holds the last replica.
  * Brian: do we really need to keep 10% free space? This prevents the disk filling up and leaves room for unexpected increases; it is not necessary for sites to also provide a safety margin.
  * Tim has asked Roger if we should rebalance more T2 disk from LOCALGROUPDISK to DATADISK. Will prod him again.
    10/01: A cleanup run is due sooner rather than later. RHUL increased space recently.
    17/01: Is 180 days still the limit? We had mixed results when trying it in the meeting.
    Kashif: How can we tell users about their files on a site's LOCALGROUPDISK? Currently using dmlite-shell.
    24/01: Tim sent instructions last week to the atlas-support-cloud-uk list for checking this in Rucio (you don't need to be an ATLAS user); the sketch after this item shows one approach.
    Kashif: many users have left Oxford, so they are difficult to contact. Will contact the ATLAS Oxford group leader.
    Dan asked about user quotas on LOCALGROUPDISK. Some users seem to have individual quotas. They shouldn't.
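
    As an illustration of the kind of check Tim described (not his exact instructions), the Rucio Python client can summarise what is occupying a LOCALGROUPDISK. The RSE name, the filter key, and the record fields below are assumptions based on the standard client, so treat this as a sketch:

        """Sketch: per-account summary of rules on a LOCALGROUPDISK RSE.
        Assumes a configured Rucio client (rucio.cfg + X.509 proxy).
        The RSE name is an example; substitute the site in question."""
        from collections import defaultdict
        from rucio.client import Client

        RSE = "UKI-SOUTHGRID-OX-HEP_LOCALGROUPDISK"  # example RSE name
        client = Client()

        # Space accounting as reported by the RSE (may list several sources).
        for usage in client.get_rse_usage(RSE):
            print(usage["source"], usage["used"], "bytes used")

        # Count locked replicas per owning account, to see whose data it is.
        per_account = defaultdict(int)
        for rule in client.list_replication_rules(filters={"rse_expression": RSE}):
            per_account[rule["account"]] += rule.get("locks_ok_cnt", 0)

        for account, locks in sorted(per_account.items(), key=lambda kv: -kv[1]):
            print(account, locks, "locked replicas")

    On the lifetime point above: when creating a rule programmatically, add_replication_rule() takes a lifetime argument in seconds, which is presumably what R2D2's 180-day default maps onto.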

2. Sussex becoming a CPU-only site. Will use T1 disk or maybe RHUL. (Tim)
    Dan discussed with Leo at Sussex: they are happy to get rid of StoRM and move to ARC, with Lustre as a cache. This is proven to work at Durham and other European sites.
    8/11: Dan: Leo is moving to a new job, so we have to wait for his replacement.
    17/01: Kashif: The Sussex grid person has moved to central IT and is looking after the Grid on a voluntary basis.

3. BHAM migration to EOS, Mark requests to move now (Elena/Brian)
    see ADCINFR-87
    Discussed whether to restore ATLAS disk at Birmingham with the new EOS instance (when Mark is ready, in 5 weeks), or to remain diskless, accessing data at Manchester.
    Alessandra: Birmingham has saturated the Manchester network. We will need to add storage or a cache at Birmingham.
    The only cache option currently supported for ATLAS is the ARC cache. It seems to work well, but is not ready to install everywhere.
    ATLAS needs time to develop a strategy. As a cloud squad we would like to restore disk at Birmingham until this has been done.
    In the meantime, we can favour simulation production at Birmingham (which has low I/O) to reduce the load on the Manchester network.
    Discussed in GridPP Storage Group meeting (see minutes for 31 October 2018). They said we can decide what we want for ATLAS.
    For the moment, we'll keep it using Manchester and monitor.
    8/11: Sam: discussed in the Ops meeting. There was a request to gather more data. Alessandra is continuing to collect data; Birmingham should also collect data.
    15/11: Sam will email about this.
    22/11: Alessandra: bandwidth is still full. This is not just due to Birmingham, but also other sites with FTS. Asking for 40 Gb/s to JANET.
                Later we can change the Birmingham fairshares to see the effect.
                The second biggest bottleneck is the head node. Increasing the number of threads etc. Looking at SSDs for the database.
                Sam: could use memcache for the DB instead of SSDs. (memcache is the default in 8.10, but Manchester has an older version.)
                It will be interesting to see Birmingham's effect from the change when we switched to using Manchester, and later when we change the fairshares.
    29/11: Alessandra reported no problems in the last 2-3 weeks.
    10/01: A decision is needed on whether ATLAS should have storage at BHAM.
    17/01: Elena: discussed on Tuesday. Mark is ready to install storage and can provide 500 TB. He says it's not a big deal to set up, at least from the Bham side.
    Alessandra: No analysis queue, so no LOCALGROUPDISK; just DATADISK.
    Elena: For the ATLAS setup, we don't have experience with EOS. Alessandra: not a bad thing to know.
    Alessandra: the network is more or less fine, but we didn't check how efficiently jobs run. No production jobs have run for at least a week.
    Elena will check jobs at Bham and discuss with Mark. Will ask him to call into a future meeting.
    24/01: Elena: no jobs have run in the last 48 hours. Could it be a Vac problem? We should sort this out before contacting Mark.
    Gareth: no ATLAS Vac jobs running at Glasgow either: http://vacmon.gridpp.ac.uk/ . Peter will check.
    Peter: Vac will need to be integrated into Harvester; a new JDL needs to be developed.
    The monitoring is very slow at the moment. Peter is looking at it; Gareth will look too.
    29/01: Discussion in ADC Weekly of CPU-only sites:
    https://indico.cern.ch/event/793522/contributions/3300712/attachments/1787032/2909957/Diskless_Jan19.pdf

    Elena: Jobs are now running OK: 220 jobs in the last 12 hours.
    Peter: There were previously issues with Vac, but it is still running on APF.
    Efficiency is OK so far. Decided to keep Birmingham diskless and monitor further. We probably won't need extra network monitoring.


4. Harvester issues (Peter/Alessandra/Elena)
    06/12: Nicolo, Ivan, and Fernando declared the migration of the UK T2 UCORE queues complete, but there are still problems at Manchester (at least), so we keep this as an ongoing issue.
    Manchester UCORE has been half empty for almost a week: ce01 empty, ce02 OK.
    Peter is looking at improving the monitoring. Harvester obscures the differences between CEs, whereas APF treated them separately.
    http://apfmon.lancs.ac.uk/CERN_central_B:UKI-NORTHGRID-MAN-HEP_SL7_UCORE-ce01
    http://apfmon.lancs.ac.uk/CERN_central_B:UKI-NORTHGRID-MAN-HEP_SL7_UCORE-ce02
    Alessandra can't find the Harvester "factory" for the GPU queue.
    Peter asked Fernando to improve the messages from Harvester, and will continue looking into it with the monitoring.
    13/12: Elena has set up the PanDA site UKI-LT2-QMUL_SL7_UCORE and queue UKI-LT2-QMUL_SL7_UCORE, and emailed the Harvester experts asking them to include it in Harvester.
    10/01: Gareth asked about the approach to debugging issues at Durham. Elena will follow up on the Harvester status at Durham and get an update on procedures. Peter will check apfmon in relation to Harvester. QMUL_UCORE is running OK under Harvester.
    17/01: Elena: Ivan set the Durham UCORE PQ offline in September (currently just score). Elena has mailed him twice. Will set it back online.
    Brian confirmed that he has declared the missing files lost. The procedure with the new DDM emails seems to work fine.
    24/01: Elena confirmed that Durham is not in Harvester yet. It will be migrated separately because it uses aCT.
    Elena: Removed secondary Squids from the UK T2s in AGIS. Three sites don't have local squids: IC uses QMUL, Durham uses Glasgow, Sussex uses RALPP. These should be left as-is.
    Elena: Raoul from Brunel asked about moving away from SRM. It should be possible to use gsiftp instead. Elena will check in AGIS.
    31/01: Elena: the problem with Durham is solved. When they move to CentOS7 they will be migrated to Harvester and UCORE.
    Brunel asked about doing transfers without SRM. It is configured the same as Lancaster, which is xrootd-only. Peter will check Brunel's settings (Elena will forward him the email); the sketch after this item shows one way to inspect queue configurations in AGIS.
    Alessandra set the QMUL CentOS7 queue to test mode, because it doesn't yet have enough slots.
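
    As a possible aid for checks like the Brunel one, AGIS publishes a JSON dump of the PanDA queue configurations. The endpoint URL and field names below are from memory and may have changed, so this is only a sketch:

        """Sketch: fetch the AGIS PanDA-queue dump and list UK queues with
        their status, e.g. to verify settings after a protocol change.
        Endpoint URL and JSON layout are assumed, not verified."""
        import json
        import urllib.request

        URL = "http://atlas-agis-api.cern.ch/request/pandaqueue/query/list/?json"

        with urllib.request.urlopen(URL) as resp:
            queues = json.load(resp)  # assumed: dict of queue name -> config

        for name, cfg in sorted(queues.items()):
            if name.startswith("UKI-"):
                print(name, cfg.get("status", "?"))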


5. ATLAS Sites Jamboree, 5-8 March. Encourage ATLAS site admins to attend. Maybe we could have a face-to-face meeting.
    Alessandra, Peter, Elena, and Tim will probably be there.
    Alessandra publicised it to site admins on TB-SUPPORT@JISCMAIL.AC.UK. Will GridPP fund travel?

6. CentOS7 migration (Alessandra)
    The UK is quite advanced. Sites with PBS will have to switch to a different batch system.
    Alessandra will follow progress for UK (also coordinating for the rest of ATLAS).
    31/01: RALPP is moving to CentOS7-only. Elena will check queues.

News:
Alessandra: NTR
Dan: There was an issue with a StoRM data flood. It is now fixed and working fine.
Elena: NTR
Gareth: NTR
Matt: The Lancaster SL6 local queue will be retired soon. The B-physics group shouldn't need it any more.
Peter: Working on Harvester monitoring. We should be able to see the "inner thoughts" of Harvester, and see queued/running/activated jobs in apfmon.
Vip: Moved most WNs to CentOS7. Now have 2400 CentOS7 cores and 700 SL6 cores for legacy VOs. Will discuss whether to retire the ATLAS SL6 queues.
Tim:
* One file had a checksum mismatch at RAL, which I tracked down to a problem in the creating job's pilot. The problem was quickly diagnosed by Paul Nilsson, who is working on a fix, so this issue should be detected if it occurs again in future. (The sketch after these items shows the basic checksum comparison involved.)
* Investigating slightly low job efficiency at RAL.
* Waiting for a repeat of duplicate file writes in order to debug FTS.
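
For reference, the basic check involved in the checksum investigation looks like the sketch below: compute the file's adler32 (the checksum ATLAS DDM uses) and compare it with the catalogue value. The file path and expected checksum are hypothetical examples:

    """Sketch: compute a file's adler32 in the zero-padded hex form used
    by Rucio, and compare it with a catalogue value."""
    import zlib

    def adler32_of(path, chunk_size=1024 * 1024):
        """Stream the file through zlib.adler32; return 8 lowercase hex digits."""
        value = 1  # standard adler32 seed
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                value = zlib.adler32(chunk, value)
        return "%08x" % (value & 0xffffffff)

    if __name__ == "__main__":
        local = adler32_of("example.data")   # hypothetical local file
        catalogue = "0a1b2c3d"               # hypothetical catalogue value
        print("match" if local == catalogue
              else "MISMATCH: %s != %s" % (local, catalogue))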
 