ATLAS UK Cloud Support

Tim Adye (Science and Technology Facilities Council STFC (GB))
ATLAS UK Cloud Support meeting minutes, 14 February 2019

Present: Alessandra Forti, Brian Davies, Dan Traynor, Elena Korolkova, Gareth Roy,  Kashif Mohammad, Matt Doidge, Sam Skipsey, Tim Adye, Vip Davda

Outstanding tickets:

* ggus 139663 UKI-SCOTGRID-ECDF: two analysis queues are active but not working
    Elena is looking into it.
* ggus 139587 All jobs at UKI-NORTHGRID-LANCS-HEP_ES failed during STAGEOUT
    Peter fixed the problem in AGIS by switching the Pilot copy tool to use xrdcp. Reopened to discuss possible xrootd checksum issues (see the checksum sketch after this list).
* ggus 138033 singularity jobs failing at RAL
    Pilot2 production test jobs run OK. Alessandra will set up an Analysis test queue.
    Rohini (SKA) managed to run from Docker; they are moving away from Singularity Hub.
* ggus 139675 UKI-NORTHGRID-SHEF-HEP Transfer errors with TRANSFER globus_ftp_client
    Elena is looking into it.
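
A minimal checksum cross-check sketch, referenced from ggus 139587 above (and relevant to Matt's report under News): ATLAS/Rucio bookkeeping records adler32 checksums, so the adler32 of a locally retrieved copy can be compared with the catalogue value when chasing mismatches. This is an illustrative sketch only; the file path is a placeholder.

    # Compute the adler32 of a local file in the zero-padded hex form used by
    # Rucio, for comparison against the checksum recorded in the catalogue.
    import sys
    import zlib

    def adler32_of(path, blocksize=64 * 1024 * 1024):
        checksum = 1  # adler32 seed value
        with open(path, "rb") as f:
            while True:
                block = f.read(blocksize)
                if not block:
                    break
                checksum = zlib.adler32(block, checksum)
        return "%08x" % (checksum & 0xFFFFFFFF)

    if __name__ == "__main__":
        # Placeholder usage: pass the path of the downloaded replica to check.
        print(adler32_of(sys.argv[1]))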

Ongoing issues (new comments marked in green):

1. LocalGroupDisk consolidation (Tim)
  * Most datasets more than 2 years old were removed on 4 July; this should free 488 TB. Tim will check the rest.
  * Tim: R2D2 has a default lifetime of 180 days (can be changed when making the request).
        At S&C Week, Cedric suggested a "subscribe to dataset" option to get email notifications. I think it would be more useful to do this when creating the rule, with a notification when the LOCALGROUPDISK copy is the last replica.
  * Brian: do we really need to keep 10% free space? It prevents the disk from filling up and leaves room for unexpected increases; it is not necessary for sites to also provide a safety margin on top.
  * Tim has asked Roger if we should rebalance more T2 disk from LOCALGROUPDISK to DATADISK. Will prod again.
    10/01: Cleanup run is due sooner rather than later. RHUL increased space recently.
    17/01: Is 180 days still the limit? We got mixed results when trying it in the meeting.
    Kashif: how can we tell users about their files on a site's LOCALGROUPDISK? Currently using dmlite-shell.
    24/01: Tim sent instructions last week to the atlas-support-cloud-uk list for checking this in Rucio (you don't need to be an ATLAS user); see the sketch after this item.
    Kashif: many users have left Oxford, so they are difficult to contact. He will contact the ATLAS Oxford group leader.
    Dan asked about user quotas on LOCALGROUPDISK. Some users seem to have individual quotas. They shouldn't.
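    A minimal sketch of the Rucio check mentioned above (24/01), assuming the standard Rucio Python client with a working configuration and proxy; the account name is a placeholder, and the field names are those normally returned for replication rules. The same information is available from the rucio command-line client.

        # List a user's replication rules that target any LOCALGROUPDISK RSE,
        # e.g. to see which users still have data on a site's LOCALGROUPDISK.
        from rucio.client import Client

        def localgroupdisk_rules(account):
            client = Client()
            for rule in client.list_replication_rules(filters={"account": account}):
                if "LOCALGROUPDISK" in rule["rse_expression"]:
                    yield rule

        if __name__ == "__main__":
            for rule in localgroupdisk_rules("someuser"):  # placeholder account name
                print(rule["scope"], rule["name"], rule["rse_expression"], rule["expires_at"])

    In principle the same client could also extend a rule's lifetime (update_replication_rule with a new lifetime), but whether that fits the R2D2 workflow is an assumption to be checked, not something decided in the meeting.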

2. Sussex becoming a CPU-only site. Will use T1 disk or maybe RHUL. (Tim)
    Dan discussed with Leo at Sussex: they are happy to get rid of StoRM and move to ARC, with Lustre as a cache. This setup is proven to work at Durham and other European sites.
    8/11: Dan: Leo is moving to a new job, so we have to wait for his replacement.
    17/01: Kashif: the Sussex grid person has moved to central IT and is looking after the Grid on a voluntary basis.

3. BHAM migration to EOS, Mark requests to move now (Elena/Brian)
    see ADCINFR-87
    Discussed whether to restore ATLAS disk at Birmingham with a new EOS instance (when Mark is ready in 5 weeks), or to remain diskless, accessing data at Manchester.
    Alessandra: diskless Birmingham has saturated the Manchester network. Will need to add storage or a cache at Birmingham.
    The only cache option currently supported for ATLAS is the ARC cache. It seems to work well, but is not ready to install everywhere.
    ATLAS needs time to develop a strategy. As a cloud squad we would like to restore disk at Birmingham until this has been done.
    In the meantime, we can favour simulation production at Birmingham to reduce load on the Manchester network.
    Discussed in GridPP Storage Group meeting (see minutes for 31 October 2018). They said we can decide what we want for ATLAS.
    For the moment, we'll keep it using Manchester and monitor.
    8/11: Sam: discussed in Ops meeting. Request to gather more data. Alessandra continuing to collect data. Birmingham should also collect data.
    15/11: Sam will email about this.
    22/11: Alessandra: bandwidth is still full. Not just due to Birmingham, but also other sites transferring via FTS. Asking for 40 Gb/s to JANET.
                Later we can change the Birmingham fairshares to see the effect.
                The second-biggest bottleneck is the head node. Increasing the number of threads etc. Looking at SSDs for the database.
                Sam: could use memcache for the database instead of SSDs. (memcache is the default in 8.10, but Manchester has an older version.)
                It will be interesting to see the effect of Birmingham by comparing the change when we switched to using Manchester, and later when we change the fairshares.
    29/11: Alessandra reported no problems in last 2-3 weeks.
    10/01: Decision needed on whether ATLAS to have storage at BHAM.
    17/01: Elena: discussed on Tuesday. Mark is ready to install storage and can provide 500 TB. He says it's not a big deal to set up, at least from the Bham side.
    Alessandra: No analysis queue, so no LOCALGROUPDISK. Just DATADISK.
    Elena: For ATLAS setup, we don't have experience with EOS. Alessandra: not a bad thing to know.
    Alessandra: the network is sort of fine, but she didn't check how efficiently jobs run. Birmingham did not run any production jobs for at least a week.
    Elena will check jobs at Bham and discuss with Mark. Will ask him to call into a future meeting.
    24/01: Elena: no jobs have run in the last 48 hours. Could this be a Vac problem? We should sort this out before contacting Mark.
    Gareth: no ATLAS Vac jobs running at Glasgow either: http://vacmon.gridpp.ac.uk/ . Peter will check.
    Peter: Will need to integrate Vac into Harvester. New JDL needs to be developed.
    The monitoring is very slow at the moment. Peter is looking at it; Gareth will look too.
    29/01: Discussion in ADC Weekly of CPU-only sites:
    https://indico.cern.ch/event/793522/contributions/3300712/attachments/1787032/2909957/Diskless_Jan19.pdf
    Elena: Jobs now running OK. 220 jobs in last 12 hours.
    Peter: there were previously issues with Vac, which is still running on APF.
    Efficiency is OK so far. Decided to keep Birmingham diskless and monitor further. Probably won't need extra network monitoring.
    07/02: Mark commented in JIRA ADCINFR-87: he is ready to set up EOS instance at Birmingham. Tim will reply to Mark: we don't need this now/yet.

4. Harvester issues (Peter/Alessandra/Elena)
    06/12: Nicolo, Ivan, and Fernando declared the migration of UK T2 UCORE queues complete, but there are still problems at Manchester (at least), so we keep this as an ongoing issue.
    Manchester UCORE has been half empty for almost a week: ce01 empty, ce02 OK.
    Peter is looking at improving monitoring. Harvester obscures differences between CEs, whereas APF treated them separately.
    http://apfmon.lancs.ac.uk/CERN_central_B:UKI-NORTHGRID-MAN-HEP_SL7_UCORE-ce01
    http://apfmon.lancs.ac.uk/CERN_central_B:UKI-NORTHGRID-MAN-HEP_SL7_UCORE-ce02
    Alessandra can't find Harvester "factory" for GPU queue.
    Peter asked Fernando to improve messages from Harvester. Peter will continue looking into it with monitoring.
    13/12: Elena has set up PanDA site UKI-LT2-QMUL_SL7_UCORE and queue UKI-LT2-QMUL_SL7_UCORE and sent an email to the Harvester experts asking them to include it in Harvester.
    10/01: Gareth asked about the approach to debugging issues at Durham. Elena will follow up on the Harvester status at Durham and get an update on procedures. Peter to check apfmon in relation to Harvester. QMUL_UCORE is running OK under Harvester.
    17/01: Elena: Ivan set the Durham UCORE PQ offline in September (currently just SCORE). Elena mailed him twice. He will set it back online.
    Brian confirmed that he has declared missing files lost. Procedure with new DDM emails seems to work fine.
    24/01: Elena confirmed that Durham is not in Harvester yet. It will be migrated separately because it uses aCT.
    Elena: Removed secondary Squids from UK T2s in AGIS. Three sites don't have local squids: IC uses QMUL, Durham uses Glasgow, Sussex uses RALPP. These should be left as-is.
    Elena: Raoul from Brunel asked about moving away from SRM. Should be possible to use gsiftp instead. Elena will check in AGIS.
    31/01: Elena: problem with Durham solved. When they move to CentOS7 they will be moved to Harvester and UCORE.
    Brunel asked about doing transfers without SRM. It would be configured the same as Lancaster, which is xrootd-only. Peter will check Brunel's settings (Elena will forward him the email).
    Alessandra set QMUL CentOS7 queue to test mode, because it doesn't yet have enough slots.

5. ATLAS Sites Jamboree, 5-8 March. Encourage ATLAS site admins to attend. Maybe we could have a face-to-face meeting.
    Alessandra, Peter, Elena, and Tim will probably be there.
    Alessandra publicised it to site admins on TB-SUPPORT@JISCMAIL.AC.UK. Will GridPP fund travel?
    14/02: ATLAS is collecting information about site statuses as input for the Jamboree discussion. A request from Andrej Filipčič and others was circulated to sites on 24 January, and forwarded to TB-SUPPORT@JISCMAIL.AC.UK by Alessandra. The deadline is 22 February.
    So far Birmingham and Oxford have responded. Alessandra suggested that it would also be interesting to include the status of Singularity installation (a sketch of a simple per-node check follows this item).
    Alessandra will remind other sites to respond with a post to atlas-support-cloud-uk@cern.ch. Replies can go to the original posters (Andrej and/or Alessandra).
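    As a sketch of what the Singularity part of a site-status reply might include (see 14/02 above): report the installed Singularity version and whether unprivileged user namespaces are enabled on the worker nodes. This is only an illustration of the kind of check meant, not part of the official questionnaire.

        # Report the Singularity version and whether unprivileged user namespaces
        # are enabled (relevant for running Singularity without setuid).
        import subprocess

        def singularity_version():
            try:
                return subprocess.check_output(["singularity", "--version"]).decode().strip()
            except (OSError, subprocess.CalledProcessError):
                return "not installed"

        def user_namespaces_enabled():
            try:
                # This sysctl only exists on kernels with user-namespace support.
                with open("/proc/sys/user/max_user_namespaces") as f:
                    return int(f.read().strip()) > 0
            except (OSError, ValueError):
                return False

        if __name__ == "__main__":
            print("singularity:", singularity_version())
            print("user namespaces enabled:", user_namespaces_enabled())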


6. CentOS7 migration (Alessandra)
    The UK is quite advanced. Sites with PBS will have to switch to something else.
    Alessandra will follow progress for UK (also coordinating for the rest of ATLAS).
    31/01: RALPP is moving to CentOS7-only. Elena will check queues.
    14/02: Elena switched RALPP UCORE PanDA queue from SL6 to CentOS7. Analysis C7 queue still has problems.
    Kashif: Oxford has moved two thirds of its worker nodes to the C7 queue. The SL6 queues can be deleted in a few weeks' time.
    Vip: Jeremy has a web page tracking progress, https://www.gridpp.ac.uk/wiki/Batch_system_status . We should keep that up to date, but we also need to monitor progress of the ATLAS migration.


News:
Alessandra: Working on GPU support. ATLAS might want to use GPUs for production work. Also looking at an analysis example on GPUs, working at Manchester and QMUL. Need to do job splitting etc.
Brian: There has been some progress getting DPM bugs sorted (mainly seen at Brunel) for CSM 1.11.X. It may soon be suggested as an upgrade.
Dan: There is an issue with systemd mounting; once resolved, could move to a sensibly sized CentOS7 queue. Would be interesting to hear about any CentOS7 tuning advice.
Elena: Discussed SAM tests for Vac sites. ATLAS experts are working on SAM tests via Harvester. Won't be ready next week!
Gareth: No time until after GridPP6 proposal has gone out! Have equipment for Hypervisor cluster, but need to set it up. Chasing IPv6.
Kashif: NTR
Matt: Sympathises with Gareth.
    Has problems with NFS on C7 accessing new storage.
    Investigating checksum mismatch problems in the Pilot (see the checksum sketch under Outstanding tickets). He would normally contact atlas-support-cloud-uk@cern.ch for help, but since we have already discussed it, Matt will contact the experts directly (Tim will send contacts).
Sam: Found another file lost due to duplicate transfers in FTS. Logs were sent to the RAL experts, who are working with the FTS developers. They have applied a possible fix.
Vip: NTR
Tim: Have a new CE, arc-ce05, installed with the latest aCT version in Aquilon. Testing with PanDA queue RAL-LCG2_TEST gave some "lost heartbeat" failures, which have been fixed on arc-ce05. Continuing tests show no more errors, so it can soon move into production.

AOB:
We should have a discussion about what we (ATLAS UK) want to do about diskless sites. This will come up at the Sites Jamboree, and should later be discussed with GridPP, but first we need to decide what we want. Unfortunately we ran out of time, so we will discuss it next week and put it first on the agenda.