ATLAS UK Cloud Support meeting minutes, 24 January 2019
Tim Adye (Science and Technology Facilities Council STFC (GB))
Vidyo (Europe/London)

Present: Alessandra Forti, Dan Traynor, Elena Korolkova, Gareth Roy, Kashif Mohammad, Matt Doidge, Peter Love, Sam Skipsey, Tim Adye, Vip Davda

Outstanding tickets:

* ggus 139293 Deletion errors at UKI-SOUTHGRID-CAM-HEP
    New ticket, under investigation.
* ggus 139282 Deletion errors in site UKI-NORTHGRID-LANCS-HEP
    New ticket. One disk server currently down following two power losses yesterday.
* ggus 138033 singularity jobs failing at RAL
    Home directory fix is being tested and should soon be rolled out more widely across the farm.

Ongoing issues (new comments since last week, originally marked in green, appear under the 24/01 entries):

1. LocalGroupDisk consolidation (Tim)
  * Most datasets older than 2 years were removed on 4 July. This should free 488 TB. Tim will check the rest.
  * Tim: R2D2 has a default lifetime of 180 days (can be changed when making the request).
        At S&C Week, Cedric suggested a "subscribe to dataset" option to get email notifications. It would be more useful to do this when creating the rule, with a notification when LOCALGROUPDISK holds the last replica.
  * Brian: do we really need to keep 10% free space? Yes: it prevents the disk filling up and leaves room for unexpected increases, so it is not necessary for sites to also provide a safety margin.
  * Tim has asked Roger if we should rebalance more T2 disk from LOCALGROUPDISK to DATADISK. Will prod again.
    10/01: Cleanup run is due sooner rather than later. RHUL increased space recently.
    17/01: Is 180 days still the default lifetime? We had mixed results when trying it in the meeting.
    Kashif: how can we find out which users own files on the site's LOCALGROUPDISK? Currently using dmlite-shell.
    24/01: Tim sent instructions last week to the atlas-support-cloud-uk list for checking Rucio (you don't need to be an ATLAS user); a sketch of the relevant commands follows this item.
    Kashif: many users have left Oxford, so they are difficult to contact. Will contact the ATLAS Oxford group leader.
    Dan asked about user quotas on LOCALGROUPDISK. Some users seem to have individual quotas. They shouldn't.
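
    For reference, a minimal sketch of the kind of Rucio client commands involved (this assumes a valid grid proxy and the rucio client environment; the RSE name is only an example, the account name is hypothetical, and options may differ between client versions):

        rucio list-datasets-rse UKI-SOUTHGRID-OX-HEP_LOCALGROUPDISK   # list datasets held at the LOCALGROUPDISK endpoint
        rucio list-rse-usage UKI-SOUTHGRID-OX-HEP_LOCALGROUPDISK      # show how much space is used at that endpoint
        rucio list-rules --account someuser                           # list the replication rules owned by a given user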


2. Sussex becoming a CPU-only site. Will use T1 disk or maybe RHUL. (Tim)
    Dan discussed this with Leo at Sussex: he is happy to get rid of StoRM and move to ARC, with Lustre as a cache. This setup is proven to work at Durham and other European sites.
    8/11: Dan: Leo is moving to a new job, so we have to wait for his replacement.
    17/01: Kashif: the Sussex grid person has moved to central IT and is looking after the Grid on a voluntary basis.

3. BHAM migration to EOS, Mark requests to move now (Elena/Brian)
    see ADCINFR-87
    Discussed whether to restore ATLAS disk at Birmingham with a new EOS instance (when Mark is ready, in 5 weeks), or to remain diskless, accessing data at Manchester.
    Alessandra: this has saturated the Manchester network. We will need to add storage or a cache at Birmingham.
    The only cache option currently supported for ATLAS is the ARC cache. It seems to work well, but is not ready to install everywhere.
    ATLAS needs time to develop a strategy. As a cloud squad we would like to restore disk at Birmingham until this has been done.
    In the meantime, we can favour simulation production at Birmingham to reduce the load on the Manchester network.
    Discussed in GridPP Storage Group meeting (see minutes for 31 October 2018). They said we can decide what we want for ATLAS.
    For the moment, we'll keep it using Manchester and monitor.
    8/11: Sam: discussed in Ops meeting. Request to gather more data. Alessandra continuing to collect data. Birmingham should also collect data.
    15/11: Sam will email about this.
    22/11: Alessandra: bandwidth still full. Not just due to Birmingham but also other sites with FTS. Asking for 40 Gb/s to JANET.
                Later we can change the Birmingham fairshares to see the effect.
                The second biggest bottleneck is the headnode. Increasing the number of threads etc. Looking at SSDs for the database.
                Sam: could use memcache for the database instead of SSDs. (memcache is the default in 8.10, but Manchester has an older version.)
                It will be interesting to see the effect of Birmingham by comparing the change when we switched to using Manchester, and later when we change the fairshares.
    29/11: Alessandra reported no problems in last 2-3 weeks.
    10/01: Decision needed on whether ATLAS should have storage at BHAM.
    17/01: Elena: discussed on Tuesday. Mark is ready to install storage and can provide 500 TB. He says it's not a big deal to set up, at least from the Bham side.
    Alessandra: No analysis queue, so no LOCALGROUPDISK. Just DATADISK.
    Elena: For ATLAS setup, we don't have experience with EOS. Alessandra: not a bad thing to know.
    Alessandra: the network is sort of fine, but we didn't check how efficiently jobs run. No production jobs have run there for at least a week.
    Elena will check jobs at Bham and discuss with Mark. She will ask him to call into a future meeting.
    24/01: Elena: no jobs running in the last 48 hours. Could it be a Vac problem? We should sort this out before contacting Mark.
    Gareth: no ATLAS Vac jobs running at Glasgow either: http://vacmon.gridpp.ac.uk/ . Peter will check.
    Peter: Will need to integrate Vac into Harvester. New JDL needs to be developed.
    Monitoring very slow at the moment. Peter looking at it. Gareth will look too.


4. Harvester issues (Peter/Alessandra/Elena)
    06/12: Nicolo, Ivan, and Fernando declared the migration of the UK T2 UCORE queues complete, but there are still problems at Manchester (at least), so we keep this as an ongoing issue.
    Manchester UCORE has been half empty for almost a week. ce01 is empty, ce02 is OK.
    Peter is looking at improving monitoring. Harvester obscures the differences between CEs, whereas APF treated them separately.
    http://apfmon.lancs.ac.uk/CERN_central_B:UKI-NORTHGRID-MAN-HEP_SL7_UCORE-ce01
    http://apfmon.lancs.ac.uk/CERN_central_B:UKI-NORTHGRID-MAN-HEP_SL7_UCORE-ce02
    Alessandra can't find Harvester "factory" for GPU queue.
    Peter asked Fernando to improve messages from Harvester. Peter will continue looking into it with monitoring.
    13/12: Elena has set up the PanDA site UKI-LT2-QMUL_SL7_UCORE and queue UKI-LT2-QMUL_SL7_UCORE, and emailed the Harvester experts asking them to include it in Harvester.
    10/01: Gareth asked about the approach to debugging issues at Durham. Elena will follow up on the Harvester status at Durham and get an update on procedures. Peter to check apfmon in relation to Harvester. QMUL_UCORE is running OK under Harvester.
    17/01: Elena: Ivan set the Durham UCORE PQ offline in September (currently just SCORE). Elena mailed him twice. It will be set back online.
    Brian confirmed that he has declared missing files lost. Procedure with new DDM emails seems to work fine.
    24/01: Elena confirmed that Durham is not in Harvester yet. It will be migrated separately because it uses aCT.
    Elena: Removed secondary Squids from UK T2s in AGIS. Three sites don't have local squids: IC uses QMUL, Durham uses Glasgow, Sussex uses RALPP. These should be left as-is.
    Elena: Raoul from Brunel asked about moving away from SRM. Should be possible to use gsiftp instead. Elena will check in AGIS.
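
    For illustration only (the hostname and path below are hypothetical), moving away from SRM roughly means that the endpoints defined in AGIS, and the transfer URLs used, change from SRM-style to plain GridFTP-style, e.g. for a DPM-like storage element:

        srm://se.example.ac.uk:8446/srm/managerv2?SFN=/dpm/example.ac.uk/home/atlas/atlasdatadisk/path/to/file
        gsiftp://se.example.ac.uk:2811/dpm/example.ac.uk/home/atlas/atlasdatadisk/path/to/file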


5. ATLAS Sites Jamboree, 5-8 March. Encourage ATLAS site admins to attend. Maybe we could have a face-to-face meeting.
    Alessandra, Peter, Elena, and Tim will probably be there.
    Alessandra publicised it to site admins on TB-SUPPORT@JISCMAIL.AC.UK. Will GridPP fund travel?

6. CentOS7 migration (Alessandra)
    The UK is quite advanced. Sites with PBS will have to switch to another batch system.
    Alessandra will follow progress for UK (also coordinating for the rest of ATLAS).


News:
Alessandra: NTR
Dan: NTR
Elena: NTR
Gareth: Just sent Peter an email with vacd logs. Daemons not talking to HTCondor.
Kashif: asked about plans for sites like Oxford which haven't been allocated new disk. ATLAS is working on solutions, but there is nothing yet.
Matt:
* Two power losses yesterday due to a UPS failure. One disk server didn't react well, so we are working on recovering it. Files may have been lost. It currently shows lost ZFS pools. Gareth suggested "zpool import" (see the sketch below). Rob Currie is the expert.
* Following Ste Jones' success, tempted to try out HTCondor-CE in the next weeks/months to move away from CREAM.
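  A minimal sketch of the kind of recovery commands Gareth was suggesting (the pool name "tank" is hypothetical; -f forces the import of a pool that was not cleanly exported, so use with care):
      zpool import            # list pools ZFS can see that are not currently imported
      zpool import tank       # import the pool named "tank"
      zpool import -f tank    # force the import if the pool was not cleanly exported
      zpool status -v tank    # check the pool's health and list any files with known errors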
Peter: Vac issue looks like a server down. Hopefully a temporary problem. If not, push to move to Harvester. Gareth: maybe due to CentOS7 migrations.
Sam: can also help with ZFS.
Vip: moving more workers to CentOS7.
Tim: Still investigating FTS losing files as they are written into RAL Tier-1 Echo. Waiting for it to happen again so FTS experts can look in more detail.
Agenda:
    10:00-10:10  Outstanding tickets
    10:10-10:30  Ongoing issues
    10:30-10:50  News round-table
    10:50-11:00  AOB