ATLAS UK Cloud Support

Europe/London
Vidyo

Tim Adye (Science and Technology Facilities Council STFC (GB)), Stewart Martin-Haugh (Science and Technology Facilities Council STFC (GB))
ATLAS UK Cloud Support meeting minutes, 4 April 2019

Present: Alessandra Forti (Manchester), Brian Davies (RAL), Dan Traynor (QMUL), Elena Korolkova (Sheffield), Matt Doidge (Lancaster), Peter Love (Lancaster), Sam Skipsey (Glasgow), Stewart Martin-Haugh (RAL), Tim Adye (RAL), Vip Davda (Oxford)

Outstanding tickets:

* UKI-SCOTGRID-GLASGOW_LOCALGROUPDISK blacklisted in DDM: OFF for uw (DISKSPACE).
    * jira [ADCSUPPORT-5159] "UKI-SCOTGRID-GLASGOW LOCALGROUPDISK transfer errors due to NO FREE SPACE"
    Brian said that Rucio "greedy cleanup" should be ongoing.
    User duncan's data is marked as no longer needed, but so far the files have not been removed.
    Brian will continue to chase with the DDM experts, perhaps by running a new Rucio consistency check to remove dark data.
    Tim will check that the user's data has been deleted in Rucio (see the command sketch after these tickets).
* ggus 140134 UKI-SOUTHGRID-OX-HEP Unspecified gridmanager error: Code 0 Subcode 0
     This ticket initially concerned an SE pool node crash, but seems to have moved on to a discussion of worker-node memory. People should be encouraged to open a new ticket for a new issue.
    Vip increased worker node memory to 4GB/slot.
    Alessandra suggested we could increase maxrss from 16000 to 32000 in AGIS.
    Elena said 5% of jobs were still failing. Should check this before changing maxrss.
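
For reference, a minimal sketch of the Rucio checks mentioned for the Glasgow LOCALGROUPDISK ticket (standard Rucio client commands; the account name is taken from the ticket, everything else is illustrative):

    rucio list-account-usage duncan        # total usage charged to the user's Rucio account
    rucio list-rules --account duncan      # rules the user still holds; data only expires once these are gone
    rucio list-datasets-rse UKI-SCOTGRID-GLASGOW_LOCALGROUPDISK    # datasets still resident on the endpoint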

Ongoing issues (new comments marked in green):

1. LocalGroupDisk consolidation (Tim)
  * Most >2-year-old datasets were removed on 4 July. Should free 488 TB. Tim will check the rest.
  * Tim: R2D2 has default lifetime of 180 days (can be changed when making request).
        At S&C Week, Cedric suggested a "subscribe to dataset" option to get email notifications. I think it would be more useful to do this when creating the rule, and to get a notification when the LOCALGROUPDISK copy is the last replica.
  * Brian: do we really need to keep 10% free space? This prevents the disk filling up and leaves room for unexpected increases, so it is not necessary for sites to also provide a safety margin.
  * Tim has asked Roger if we should rebalance more T2 disk from LOCALGROUPDISK to DATADISK. Will prod again.
    10/01: Cleanup run is due sooner rather than later. RHUL increased space recently.
    17/01: Is 180 days still the limit? We had mixed results when trying it in the meeting.
    Kashif: How can users be told about their files on a site's LOCALGROUPDISK? Currently using dmlite-shell.
    24/01: Tim sent instructions last week to the atlas-support-cloud-uk list for checking Rucio (you don't need to be an ATLAS user); see the sketch at the end of this item.
    Kashif: many users have left Oxford, so difficult to contact. Will contact ATLAS Oxford group leader.
    Dan asked about user quotas on LOCALGROUPDISK. Some users seem to have individual quotas. They shouldn't.
    04/04: Tim cleaned out some very old data that should have been removed last year from ATLAS UK LOCALGROUPDISKs. Freed up 65 TB. He's now ready to initiate another mass cleanup of more recent (still >2 years old) data.
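
    For reference, a minimal sketch of the Rucio commands involved in checking and managing LOCALGROUPDISK contents (the dataset and RSE names are placeholders; 15552000 s corresponds to the 180-day R2D2 default mentioned above):

        # list what is currently resident on a site's LOCALGROUPDISK
        rucio list-datasets-rse UKI-SOUTHGRID-OX-HEP_LOCALGROUPDISK

        # create a replication rule with an explicit lifetime instead of the R2D2 default
        rucio add-rule user.someuser:user.someuser.mydataset 1 UKI-SOUTHGRID-OX-HEP_LOCALGROUPDISK --lifetime 15552000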

2. Harvester issues (Peter/Alessandra/Elena)
    28/02: Migration almost complete. Kept the SL6 queues on APF because they will be retired anyway.
    Remaining APF factories listed here: http://apfmon.lancs.ac.uk/aipanda157 .
    In the end should only have analysis queues on APF.
    Also think about moving test queues. Should have a separate Harvester instance to allow testing.
    14/03:
    Lancs: Peter configured Lancs in AGIS to use GridFTP
    Brunel: Elena made changes in AGIS and Brunel is now set online. She will see how Lancaster is getting on with the GridFTP setup, then try to add GridFTP to Brunel and disable SRM.
    QMUL: Elena removed a queue for the Event Service. QMUL is OK now. Elena will contact the Harvester experts about how to use high-memory queues.
    21/03: Elena (from email attached to Indico):
        1. I'll configure Sussex in AGIS to use QMUL  <--- Ale will do it
        2. Raul  will think about deployment of XCache in Brunel.  <----- NO don't need xcache for Brunel
        4. Nicolo says "If the QMUL EventService queue is a different queue in the local batch system, then yes, it will need a new PanDAQueue in AGIS. [... but] the Lancaster EventService queue UKI-NORTHGRID-LANCS-HEP_ES is actually not configured to run EventService anymore."
    Lancaster was just a development queue. Peter will deactivate it now.
    QMUL has a pre-emption queue setup for the event service, so we should set it up with a separate PanDA queue.
    QMUL provides 4 GB / job slot, so used to have separate high memory PanDA queue. Now Harvester handles it in the UCORE queue. Dan will ask Elena to ask Harvester experts to check settings.
    28/03: Elena: RALPP asked to disable the SL6 analysis queue. There also seems to be a problem with the AGIS storage setup. Elena asked Chris to check the endpoints (screenshot from AGIS).
    Elena: request to disable Sussex analysis queue. Done.
    Alessandra: Manchester still have problems with Harvester. If we want to maintain the current setup we will have to enable extra PanDA queues.
    04/04: Manchester problem resolved: workers full again.
    This was fixed by creating one PanDA queue for each cluster, since Harvester doesn't seem able to cope with two different batch systems. It is disappointing that Harvester's promised unified queues are not always possible.
    Sussex work won't happen before May. Fabrizio said they are hiring someone.


3. Bham XCache (Mark/Elena)
    14/03: Following discussion at the Jamboree, Mark agreed to install XCache at Birmingham. Mark asked for support, and Ilija and Rod will help him install it. Mark can either deploy it himself (on bare metal or in a Singularity/Docker container), or deploy a Kubernetes cluster with SLATE, with Elena doing the rest: keeping it up and running, updated, and monitored, and setting it up in AGIS.
    Sam suggested discussing the Birmingham XCache setup at the Storage meeting next week. Elena will send an email to Mark.
    21/03: Sam: Mark has a lot of experience setting up xrootd, so the XCache setup should be simple for him. He will set up XCache at Birmingham and has agreed to be on-call. Mark can then advise other sites, e.g. with Puppet. We won't use SLATE.
    Brian: ATLAS UK should do any needed AGIS setup.
    Sam: will need to understand how ATLAS uses XCache.
    04/04: Birmingham has XCache set up. Still need changes in AGIS, e.g. to create an RSE; Alessandra will do this. (An illustrative configuration fragment follows this item.)
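
    For orientation, an illustrative XRootD proxy-file-cache (XCache) configuration fragment of the kind being deployed; the origin host, local path, port and sizes are placeholders rather than the agreed Birmingham settings, which will follow from the AGIS/RSE setup:

        all.export /
        ofs.osslib    libXrdPss.so          # proxy cache misses through to the origin
        pss.cachelib  libXrdFileCache.so    # enable the file cache
        pss.origin    atlas-origin.example.org:1094
        oss.localroot /data/xcache          # local disk area for cached data
        pfc.ram 8g
        pfc.diskusage 0.90 0.95             # start/stop purging at these fill fractions
        xrd.port 1094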

4. CentOS7 migration (Alessandra)
    UK quite advanced. Sites with PBS will have to switch to something else.
    Alessandra will follow progress for UK (also coordinating for the rest of ATLAS).
    31/01: RALPP is moving to CentOS7-only. Elena will check queues.
    14/02: Elena switched RALPP UCORE PanDA queue from SL6 to CentOS7. Analysis C7 queue still has problems.
    Kashif: Oxford has 2/3 of worker nodes moved to C7 queue. Can delete SL6 queues in a few weeks' time.
    Vip: Jeremy has a web page with progress, https://www.gridpp.ac.uk/wiki/Batch_system_status . We should keep that up to date, but also need to monitor progress for ATLAS migration.
    28/02: At Vip's request, during the meeting Elena disabled UKI-SOUTHGRID-OX-HEP_CLOUD, UKI-SOUTHGRID-OX-HEP_UCORE, ANALY_OX_TEST, and UKI-SOUTHGRID-OX-HEP_TEST. Just have CentOS7 queues, UKI-SOUTHGRID-OX-HEP_SL7_UCORE and ANALY_OX_SL7.
    21/03: Glasgow still to do.
    28/03: Sheffield aims to meet the deadline, which Alessandra confirmed is 1st June.
    04/04: Alessandra: about to send an email to all ATLAS sites requesting status updates on the TWiki. The tool to identify CentOS7 sites doesn't work, so a manual list is kept.
    Status known for all UK sites except RHUL.
    Andy McNab needs to change the image for VAC sites to CentOS7 and make sure Singularity is accessible.


5. Diskless sites
  21/02: Sheffield will have to reduce to 150TB, but has 800 job slots. This is a good candidate for going diskless (no buffer to start with), accessing data from Lancaster and/or Manchester.
    Birmingham should have 600 TB (700-800TB next year). We can use this! It is large enough for a DATADISK.
    Agreed to discuss with ATLAS at the Sites Jamboree (5-8 March). We can then bring it back to GridPP.
    28/02: Brian: We need more experience from Birmingham and Manchester monitoring.
    14/03: From Jamboree:
    Bham: Mark has agreed to setup XCache in Bham
    Shef: When Mark switches to XCache Sheffield can use Manchester storage
    Cam and Sussex: Dan is happy for Cam and Sussex to use QM storage
    Brunel: Dan is not sure about Brunel. Elena will ask Raul
    LOCALGROUPDISK at diskless sites: Elena will send email to ATLAS
    21/03: Elena (email):
        3. Cedric says "diskless sites can keep LOCALGROUPDISK [...] if the UK cloud takes care of contacting the sites if they have problems."
    Decided we want to keep LOCALGROUPDISK for the time being.
    Dan: will this go away? No GridPP funding, so it will only be kept long-term if sites choose to fund it.
    Sam: LOCALGROUPDISK is supposed to be owned by ATLAS UK, but is mostly actually used by local users.
    Elena: should ask John from Cambridge when he wants to try using QM storage.
    Sam: is this what we want? John was interested in seeing how Mark gets on at Birmingham.
    28/03: Alessandra will do Sussex: CentOS7 upgrade and switching to diskless.
    Alessandra: Brunel still has storage for CMS, but just 33 TB for ATLAS. Don't use XCache.
    Brunel has now been set to run low-IO jobs and reduce data consolidation.
    Decided to keep as is (with 33 TB ATLAS disk) for the moment.
    Brian: wait to see how Birmingham gets on with XCache before trying to use it at Cambridge. Alternative would be to use another site so as not to overload QMUL.
    04/04: Mark will write up a procedure for installing XCache. Wait to test ATLAS setup at Birmingham before trying other sites (eg. Cambridge).
    Mario said there were two modes to test. The transparent cache mode puts a proxy in front of the storage, so it is not useful for our case. We want the buffer cache mode (Mario calls it a "Volatile RSE").


6. Dumps of ATLAS namespace (Brian)
    14/03: Brian noticed that some sites have stopped providing dumps of the ATLAS namespace. Their scripts stopped working after a DPM upgrade, and he has contacted the sites. Brian will open a JIRA ticket.
    ATLAS documentation here: https://twiki.cern.ch/twiki/bin/view/AtlasComputing/DDMDarkDataAndLostFiles#Automated_checks_Site_Responsibi
    Sites are now making the dumps. Working on RAL Tier1 Echo, which needs a Rucio change.
    28/03: Sam: needs to fine-tune how to automate the dumps.
    Matt: Lancaster still to do. Brian will prod him before next week.
    04/04: Brian will check which sites now have dumps set up.
    The dump format is described here: https://twiki.cern.ch/twiki/bin/view/AtlasComputing/DDMDarkDataAndLostFiles#Storage_Element_Dumps . Paths should be relative to the "rucio" directory; see the illustrative sketch below.
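
    Purely as an illustration of the expected dump contents (one path per line, relative to the "rucio" directory); the path below is a placeholder, and DPM sites would normally generate the list from the DPM database rather than walking a POSIX tree:

        find /dpm/example.ac.uk/home/atlas/atlasdatadisk/rucio -type f \
            | sed 's|^.*/rucio/||' \
            > dump_20190404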


News:

Alessandra:
    Manchester is upgrading its storage headnode, so there will be a downtime next week. This is a straight DPM upgrade, using mixed legacy + DOME mode, not pure DOME. It was tested on the TPC testbed. Once the main storage nodes are updated, the testbed can be upgraded to pure DOME.
    Matt advised that everything is good in DPM 1.12 (not previous versions). However, with DOME, SRM can easily be overwhelmed. He moved to xrootd for local access and GridFTP for transfers.
    Brian asked whether the storage group should think about getting rid of SRM. Alessandra said this has been discussed for several years.
Brian: NTR.
Elena: NTR
Matt: Trying to put new storage online. Dealing with IPv6 problems.
Peter: NTR
Sam: NTR
Stewart: NTR
Vip: Installing some new worker nodes, but there are problems with the whole batch from Dell, all different. Dell have a new policy of getting customers to do all the work themselves, so IPMI is totally vital.
Tim: On Monday, increased the RAL-LCG2-ECHO_DATADISK quota by 0.88 PB to 10.33 PB, to meet our 2019 pledge.
Agenda:
    10:00-10:10  Outstanding tickets
    10:10-10:30  Ongoing issues
    10:30-10:50  News round-table
    10:50-11:00  AOB