ATLAS UK Cloud Support

Europe/London
Vidyo

Tim Adye (Science and Technology Facilities Council STFC (GB))
https://indico.cern.ch/event/787083/

ATLAS UK Cloud Support meeting minutes, 28 February 2019

Present: Alessandra Forti, Brian Davies, Dan Traynor, Elena Korolkova, Gareth Roy, Matt Doidge, Peter Love, Rob Currie, Sam Skipsey, Tim Adye, Vip Davda

Outstanding tickets:

* ggus 139663 UKI-SCOTGRID-ECDF: two analysis queues active but not working
    Rob: can we turn off the SL6 queue and just run CentOS7? Elena will deactivate the SL6 PanDA analysis queue.
    Vip asked if this could also be done for Oxford, for all SL6 queues (UCORE and analysis) - see below.
* ggus 139723 RAL-LCG2 permissions on scratchdisk
    Development is ongoing to get the rucio mover working with RAL Echo.
* ggus 138033 RAL-LCG2 singularity jobs failing at RAL
* ggus 139915 UKI-SOUTHGRID-RALPP IPv6 transfer problems
    Restarting the network on affected servers fixed the issue. But still trying to understand why this keeps happening.
* UKI-NORTHGRID-LANCS-HEP is in downtime.
    Elena noted https://its.cern.ch/jira/browse/ADCSITEEXC-1723 . The SRM storage had to be blacklisted manually. We should check that it comes back when downtime is complete.

Ongoing issues (last comments marked in green):

1. LocalGroupDisk consolidation (Tim)
  * Most >2-year-old datasets were removed on 4 July; this should free 488 TB. Tim will check the rest.
  * Tim: R2D2 has default lifetime of 180 days (can be changed when making request).
        At S&C Week, Cedric suggested a "subscribe to dataset" option to get email notifications. I think it would be more useful to do this when creating the rule, with a notification when LOCALGROUPDISK holds the last replica.
  * Brian: do we really need to keep 10% free space? Yes: it prevents the disk filling up and leaves room for unexpected increases. It is not necessary for sites to also provide their own safety margin on top.
  * Tim has asked Roger if we should rebalance more T2 disk from LOCALGROUPDISK to DATADISK. Will prod again.
    10/01: Cleanup run is due sooner rather than later. RHUL increased space recently.
    17/01: Is 180 days still the default lifetime? We had mixed results when trying it in the meeting.
    Kashif: how do we tell users about their files on a site's LOCALGROUPDISK? Currently using dmlite-shell.
    24/01: Tim sent instructions last week to the atlas-support-cloud-uk list for checking Rucio (don't need to be an ATLAS user).
    Kashif: many users have left Oxford, so difficult to contact. Will contact ATLAS Oxford group leader.
    Dan asked about user quotas on LOCALGROUPDISK. Some users seem to have individual quotas. They shouldn't.
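The Rucio checks Tim circulated can be done from any machine with an ATLAS Rucio client set up and a valid grid proxy (no need to be the file owner). A minimal sketch; the account name and RSE below are illustrative examples, not values taken from these minutes:

```shell
# List the replication rules held by a given account (the datasets
# a user has pinned on LOCALGROUPDISK and elsewhere).
# "jbloggs" is a hypothetical account name.
rucio list-rules --account jbloggs

# Show that account's usage against its quota, broken down per RSE.
rucio list-account-usage jbloggs

# List the datasets resident on a site's LOCALGROUPDISK RSE.
# The RSE name here is an illustrative example.
rucio list-datasets-rse UKI-NORTHGRID-MAN-HEP_LOCALGROUPDISK
```

These read-only queries need no special privileges, which is what makes them usable for chasing departed users' data.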

2. Harvester issues (Peter/Alessandra/Elena)
    06/12: Nicolo, Ivan, and Fernando declared the migration of UK T2 UCORE queues complete, but problems remain at Manchester (at least), so we keep this as an ongoing issue.
    Manchester UCORE half empty for almost a week. ce01 empty, ce02 OK.
    Peter is looking at improving monitoring. Harvester hides the differences between CEs, whereas APF treated them separately.
    http://apfmon.lancs.ac.uk/CERN_central_B:UKI-NORTHGRID-MAN-HEP_SL7_UCORE-ce01
    http://apfmon.lancs.ac.uk/CERN_central_B:UKI-NORTHGRID-MAN-HEP_SL7_UCORE-ce02
    Alessandra can't find Harvester "factory" for GPU queue.
    Peter asked Fernando to improve messages from Harvester. Peter will continue looking into it with monitoring.
    13/12: Elena has set up PanDA site UKI-LT2-QMUL_SL7_UCORE and queue UKI-LT2-QMUL_SL7_UCORE, and emailed the Harvester experts asking them to include it in Harvester.
    10/01: Gareth asked about the approach to debugging issues at Durham. Elena will follow up on Harvester status at Durham and get an update on procedures. Peter to check apfmon in relation to Harvester. QMUL_UCORE is running OK under Harvester.
    17/01: Elena: Ivan set the Durham UCORE PQ offline in September (currently just the SCORE queue is running). Elena mailed him twice; it will be set back online.
    Brian confirmed that he has declared missing files lost. Procedure with new DDM emails seems to work fine.
    24/01: Elena confirmed that Durham not in Harvester yet. Will be migrated separately because it uses aCT.
    Elena: Removed secondary Squids from UK T2s in AGIS. Three sites don't have local squids: IC uses QMUL, Durham uses Glasgow, Sussex uses RALPP. These should be left as-is.
    Elena: Raoul from Brunel asked about moving away from SRM. Should be possible to use gsiftp instead. Elena will check in AGIS.
    31/01: Elena: problem with Durham solved. When they move to CentOS7 they will be moved to Harvester and UCORE.
    Brunel asked about doing transfers without SRM. They are configured the same as Lancaster, which is xrootd-only. Peter will check Brunel's settings (Elena will forward him the email).
    Alessandra set QMUL CentOS7 queue to test mode, because it doesn't yet have enough slots.
    28/02: Migration almost complete. SL6 queues were kept on APF because they will be retired soon anyway.
    Remaining APF factories listed here: http://apfmon.lancs.ac.uk/aipanda157 .
    In the end only analysis queues should remain on APF.
    We should also think about moving the test queues; ideally there would be a separate Harvester instance to allow testing.


3. ATLAS Sites Jamboree, 5-8 March. Encourage ATLAS site admins to attend. Maybe we could have a face-to-face meeting.
    Alessandra, Peter, Elena, and Tim will probably be there.
    Alessandra publicised for site admins on TB-SUPPORT@JISCMAIL.AC.UK. Will GridPP fund travel?
    14/02: ATLAS is collecting information about site statuses as input for the Jamboree discussion. A request from Andrej Filipčič and others was circulated to sites on 24 January, and forwarded to TB-SUPPORT@JISCMAIL.AC.UK by Alessandra. The deadline is 22 February.
    So far Birmingham and Oxford have responded. Alessandra suggested that it is also interesting to include status of Singularity installation.
    Alessandra will remind other sites to respond with a post to atlas-support-cloud-uk@cern.ch. Replies can go to the original posters (Andrej and/or Alessandra).
    28/02: Alessandra said the survey generated enough responses from the UK.

4. CentOS7 migration (Alessandra)
    UK quite advanced. Sites with PBS will have to switch to something else.
    Alessandra will follow progress for UK (also coordinating for the rest of ATLAS).
    31/01: RALPP is moving to CentOS7-only. Elena will check queues.
    14/02: Elena switched RALPP UCORE PanDA queue from SL6 to CentOS7. Analysis C7 queue still has problems.
    Kashif: Oxford has moved 2/3 of its worker nodes to the C7 queue, and can delete the SL6 queues in a few weeks' time.
    Vip: Jeremy has a web page with progress, https://www.gridpp.ac.uk/wiki/Batch_system_status . We should keep that up to date, but also need to monitor progress for ATLAS migration.
    28/02: At Vip's request, during the meeting Elena disabled UKI-SOUTHGRID-OX-HEP_CLOUD, UKI-SOUTHGRID-OX-HEP_UCORE, ANALY_OX_TEST, and UKI-SOUTHGRID-OX-HEP_TEST. Just have CentOS7 queues, UKI-SOUTHGRID-OX-HEP_SL7_UCORE and ANALY_OX_SL7.

5. Diskless sites
  21/02: Sheffield will have to reduce to 150TB, but has 800 job slots. This is a good candidate for going diskless (no buffer to start with), accessing data from Lancaster and/or Manchester.
    Birmingham should have 600 TB (700-800TB next year). We can use this! It is large enough for a DATADISK.
    Agreed to discuss with ATLAS at the Sites Jamboree (5-8 March). We can then bring it back to GridPP.
    28/02: Brian: We need more experience from Birmingham and Manchester monitoring.

News:
Alessandra:
    Manchester is failing SAM tests, but we haven't yet understood why; possibly related to the switch-off of the SL6 queues. Will review the AGIS setup.
    Frontier is no longer defined for the SAM tests. Other jobs are fine, perhaps because they fall back to something else.
Brian:
    FTS upgrade at Tier-1 should fix double write issue.
    CMS is seeing an issue on Brunel storage that might be related: a file is visible in the namespace before it is complete.
Dan:
    Email from Rod about the core power setting at QMUL. It is set to 290, but should be 8 or 10. REBUS shows the site with 1.3M HS06, so something may be wrong with the site's publishing. Dan will look into it.
    QMUL now ready to go with CentOS7. Elena has made UCORE queue, but needs to be set to right batch queue. Dan will email Elena details.
Elena: Asked Dan which batch queues to attach to UKI-LT2-QMUL_SL7_UCORE PanDA queue (ce09.esc.qmul.ac.uk has centos7_analysis, centos7_gpu, centos7_long, centos7_preempt, and a bunch of SL6 queues). Suggested also to create CentOS7 analysis queue attached to centos7_analysis. [Ed. was this resolved?]
Gareth: NTR
Matt:
    Lancaster is in downtime; we hope to be back tomorrow morning. Work is going slower than expected due to incorrect electrical diagrams, so there is a small chance of a delay through the weekend.
    DPM 1.2 out soon. XRootD 4.9.0 out. Hopefully these fix checksumming. Would like to switch back to xrootd for read/write on LAN.
Rob: NTR
Sam: NTR
Vip: NTR
Tim:
    New FTS version 3.8.3 at RAL. Will be updated at CERN next week. BNL and FNAL soon.
    The new Frontier server (1 of 3) has had problems with its disk filling up, caused by a logrotate problem. As soon as this is fixed, the plan is to migrate the other two servers.
    Performing rolling upgrade on ARC CEs.

AOB:
No meeting next week. ATLAS Cloud Squad will all be at the ATLAS Sites Jamboree.
Agenda:
    • 10:00–10:10  Outstanding tickets (10m)
    • 10:10–10:30  Ongoing issues (20m)
    • 10:30–10:50  News round-table (20m)
    • 10:50–11:00  AOB (10m)