ATLAS UK Cloud Support meeting minutes, 18 April 2019

Vidyo (Europe/London)
Conveners: Tim Adye (Science and Technology Facilities Council STFC (GB)), Stewart Martin-Haugh (Science and Technology Facilities Council STFC (GB))

Present: Alessandra Forti (Manchester), Brian Davies (RAL), Gareth Roy (Glasgow), Sam Skipsey (Glasgow), Tim Adye (RAL), Vip Davda (Oxford)

Outstanding tickets:

* ggus 140671 UKI-LT2-Brunel: Incorrect dump format
    After some miscommunication, Raul now has instructions.
* ggus 140745 UKI-NORTHGRID-LIV-HEP Removed by SYSTEM_PERIODIC_REMOVE due to job restarted undesirably
    Condor pool config fixed. Waiting for confirmation from submitter.
* ggus 140729 UKI-LT2-QMUL Could not open SRM connection
    See next QMUL ticket.
* ggus 140719 UKI-LT2-QMUL transfer and deletion fail
    Looks like transfers are now working again. Brian will update ticket.
* UKI-LT2-QMUL_LOCALGROUPDISK blacklisted in DDM: OFF for uw (DISKSPACE).
    Brian sent a helpful listing of the LOCALGROUPDISK users at QMUL. Alessandra said the Manchester list was useful to send to her local users.
    It is particularly useful to group users (in colour!) by DN associated with the site, DN from CERN, and other UK DNs.
* ggus 138033 RAL-LCG2 singularity jobs failing at RAL
    Underlay enabled. Alessandra has submitted a job to test it.
* ggus 140134 UKI-SOUTHGRID-OX-HEP Unspecified gridmanager error: Code 0 Subcode 0
    There are currently some central Rucio server timeouts, but the site looks OK. Alessandra will close the ticket.

Ongoing issues (comments from 2 weeks ago marked in green):

1. LocalGroupDisk consolidation (Tim)
  * most >2-year-old datasets removed 4 July. Should free 488 TB. Tim will check the rest.
  * Tim: R2D2 has default lifetime of 180 days (can be changed when making request).
        At S&C Week, Cedric suggested a "subscribe to dataset" option to get email notifications. I think it would be more useful to do this when creating the rule, with a notification when the LOCALGROUPDISK copy is the last replica.
  * Brian: do we really need to keep 10% free space? It prevents the disk filling up and leaves room for unexpected increases, so it is not necessary for sites to also provide a safety margin.
  * Tim has asked Roger if we should rebalance more T2 disk from LOCALGROUPDISK to DATADISK. Will prod again.
    10/01: Cleanup run is due sooner rather than later. RHUL increased space recently.
    17/01: Is 180 days still the limit? We had mixed results when trying it in the meeting.
    Kashif: how can we tell users what files they have on the site's LOCALGROUPDISK? Currently using dmlite-shell.
    24/01: Tim sent instructions last week to the atlas-support-cloud-uk list for checking Rucio (you don't need to be an ATLAS user); a rough sketch follows at the end of this item.
    Kashif: many users have left Oxford, so they are difficult to contact. He will contact the ATLAS Oxford group leader.
    Dan asked about user quotas on LOCALGROUPDISK. Some users seem to have individual quotas. They shouldn't.
    04/04: Tim cleaned out some very old data that should have been removed last year from ATLAS UK LOCALGROUPDISKs. Freed up 65 TB. He's now ready to initiate another mass cleanup of more recent (still >2 years old) data.
    18/04: Tim found 127.7 TB of old LOCALGROUPDISK data. Will send list to ATLAS UK users.
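    A rough illustration (not Tim's actual instructions): with a configured Rucio client, a user could list their own usage and rules on a LOCALGROUPDISK endpoint along the lines of the Python sketch below. The RSE and account names are placeholders, and the exact method and filter names may differ between Rucio versions.

        # Sketch only: list one account's usage and rules on a LOCALGROUPDISK RSE.
        # Assumes a configured Rucio client (rucio.cfg plus a valid X.509 proxy).
        from rucio.client import Client

        RSE = "UKI-SOUTHGRID-OX-HEP_LOCALGROUPDISK"   # placeholder RSE name
        ACCOUNT = "someuser"                          # placeholder Rucio account

        client = Client()

        # Space used by this account on the RSE (older Rucio versions may call
        # this get_account_usage() instead -- the method name is an assumption).
        for usage in client.get_local_account_usage(account=ACCOUNT, rse=RSE):
            print(usage["rse"], "%.2f TB used" % (usage["bytes"] / 1e12))

        # Replication rules held by this account; old rules on LOCALGROUPDISK
        # are candidates for cleanup or a lifetime extension.
        for rule in client.list_replication_rules(filters={"account": ACCOUNT}):
            if rule.get("rse_expression") != RSE:
                continue
            print(rule["created_at"], "%s:%s" % (rule["scope"], rule["name"]),
                  "expires:", rule["expires_at"])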

2. Harvester issues (Peter/Alessandra/Elena)
    28/02: Migration almost complete. Kept SL6 queues on APF because they will go away soon anyway.
    Remaining APF factories are listed here: http://apfmon.lancs.ac.uk/aipanda157
    In the end only analysis queues should remain on APF.
    Also think about moving test queues. Should have a separate Harvester instance to allow testing.
    14/03:
    Lancs: Peter configured Lancs in AGIS to use GridFTP
    Brunel: Elena made changes in AGIS and Brunel is now set online. She will look at how Lancaster is getting on with the GridFTP setup, then try to add GridFTP to Brunel and disable SRM.
    QMUL: Elena removed a queue for ES. QMUL is OK now. Elena will contact the Harvester experts about how to use the high-memory queues (see the AGIS query sketch after this item).
    21/03: Elena (from email attached to Indico):
        1. I'll configure Sussex in AGIS to use QMUL  <--- Ale will do it
        2. Raul  will think about deployment of XCache in Brunel.  <----- NO don't need xcache for Brunel
        4. Nicolo says "If the QMUL EventService queue is a different queue in the local batch system, then yes, it will need a new PanDAQueue in AGIS. [... but] the Lancaster EventService queue UKI-NORTHGRID-LANCS-HEP_ES is actually not configured to run EventService anymore."
    Lancaster was just a development queue. Peter will deactivate it now.
    QMUL has a pre-emption queue setup for the event service, so we should set it up with a separate PanDA queue.
    QMUL provides 4 GB per job slot, so it used to have a separate high-memory PanDA queue. Now Harvester handles this in the UCORE queue. Dan will ask Elena to ask the Harvester experts to check the settings.
    28/03: Elena: RALPP asked to disable the SL6 analysis queue. There also seems to be a problem with the AGIS storage setup. Elena asked Chris to check the endpoints (screenshot from AGIS).
    Elena: request to disable Sussex analysis queue. Done.
    Alessandra: Manchester still have problems with Harvester. If we want to maintain the current setup we will have to enable extra PanDA queues.
    04/04: Manchester problem resolved: workers full again.
    It was fixed by creating one PanDA queue for each cluster, since Harvester doesn't seem able to cope with two different batch systems behind one queue. It is disappointing that Harvester's promise of unified queues is not always achievable.
    Sussex work won't happen before May. Fabrizio said they are hiring someone.
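    As an aside on checking queue settings: AGIS exposes the PanDA queue configuration as JSON, so a quick cross-check of the UK queues can be scripted roughly as below. The endpoint URL, query parameters and field names are assumptions based on common usage, not something agreed in the meeting.

        # Sketch only: dump the UK PanDA queue definitions from the AGIS JSON API.
        # The URL and the field names ('cloud', 'type', 'resource_type', 'status')
        # are assumptions and may need adjusting.
        import json
        import urllib.request

        AGIS_URL = ("http://atlas-agis-api.cern.ch/request/pandaqueue/"
                    "query/list/?json&preset=schedconf.all")   # assumed endpoint

        with urllib.request.urlopen(AGIS_URL, timeout=30) as resp:
            queues = json.loads(resp.read().decode())   # assumed: dict keyed by PanDA queue name

        for name, cfg in sorted(queues.items()):
            if cfg.get("cloud") != "UK":
                continue
            print(name, cfg.get("type"), cfg.get("resource_type"), cfg.get("status"))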

3. Bham XCache (Mark/Elena)
    14/03: Following discussion at the Jamboree, Mark agreed to install XCache at Birmingham. Mark asked for support, and Ilija and Rod will help him install it. Mark can either deploy it himself (on bare metal or in a Singularity/Docker container), or deploy a Kubernetes cluster with SLATE and have Elena do the rest: keep it up and running, updated, monitored, and set up in AGIS.
    Sam suggested discussing the Birmingham XCache setup at the Storage meeting next week. Elena will email Mark.
    21/03: Sam: Mark has a lot of experience setting up xrootd, so the XCache setup should be simple for him. He will set up XCache at Birmingham and has agreed to be on call. Mark can then advise other sites, e.g. with Puppet. We won't use SLATE.
    Brian: ATLAS UK should do any needed AGIS setup.
    Sam: will need to understand how ATLAS uses XCache.
    04/04: Birmingham has XCache setup. Still need changes in AGIS, eg. to create an RSE. Alessandra will do it.
    18/04: Alessandra has been looking at how to create the RSE. She will need to contact the DDM experts to understand what to change in AGIS. Perhaps there is already an XCache RSE defined at ECDF that could be used as a model; a rough sketch of the Rucio side follows below.
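    As a rough sketch of what the Rucio side of that might involve (to be confirmed with the DDM experts; the RSE name, attribute and keyword arguments below are assumptions), registering an XCache endpoint as a volatile RSE could look something like this:

        # Sketch only: register an XCache endpoint as a volatile RSE.
        # All names are placeholders and the accepted keyword arguments of
        # add_rse() are an assumption; AGIS changes are not covered here.
        from rucio.client import Client

        client = Client()
        RSE = "UKI-SOUTHGRID-BHAM-HEP_XCACHE"   # placeholder RSE name

        # A volatile RSE holds cached copies that Rucio does not treat as
        # custodial replicas, matching the "buffer cache" mode discussed.
        client.add_rse(RSE, deterministic=True, volatile=True)

        # Tag it so it can be selected or excluded in RSE expressions.
        client.add_rse_attribute(rse=RSE, key="cloud", value="UK")

        # A root:// protocol pointing at the XCache host would still need to be
        # registered (client.add_protocol() or via AGIS) before it is usable.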

4. CentOS7 migration (Alessandra)
    UK quite advanced. Sites with PBS will have to switch to something else.
    Alessandra will follow progress for UK (also coordinating for the rest of ATLAS).
    31/01: RALPP is moving to CentOS7-only. Elena will check queues.
    14/02: Elena switched RALPP UCORE PanDA queue from SL6 to CentOS7. Analysis C7 queue still has problems.
    Kashif: Oxford has 2/3 of worker nodes moved to C7 queue. Can delete SL6 queues in a few weeks' time.
    Vip: Jeremy has a web page with progress, https://www.gridpp.ac.uk/wiki/Batch_system_status . We should keep that up to date, but also need to monitor progress for ATLAS migration.
    28/02: At Vip's request, during the meeting Elena disabled UKI-SOUTHGRID-OX-HEP_CLOUD, UKI-SOUTHGRID-OX-HEP_UCORE, ANALY_OX_TEST, and UKI-SOUTHGRID-OX-HEP_TEST. Just have CentOS7 queues, UKI-SOUTHGRID-OX-HEP_SL7_UCORE and ANALY_OX_SL7.
    21/03: Glasgow still to do.
    28/03: Sheffield aims to meet the deadline, which Alessandra confirmed is 1st June.
    04/04: Alessandra: about to send an email to all ATLAS sites requesting status updates on the TWiki. The tool to identify CentOS7 sites doesn't work, so a manual list is kept.
    Status known for all UK sites except RHUL.
    Andy McNab needs to change the image for VAC sites to CentOS7 and make sure Singularity is accessible (a simple check is sketched after this item).
    18/04: Vip said that Oxford has now completed migration of all WNs. Alessandra will delete the SL6 PanDA queues.
    Alessandra: we should press sites to put the CentOS7 image live for VAC.
    Gareth: Glasgow is using VAC to build a new, ATLAS-only CentOS7 site. It will then turn off SL6 VAC. This is part of a lot of work for the new data centre.
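    A trivial illustration (not an official validation tool) of the two checks above, runnable on a worker node or VAC VM:

        # Report the OS release and whether a singularity binary is on PATH.
        # Illustrative only; ATLAS has its own validation jobs for this.
        import shutil
        import subprocess

        def os_release():
            """Return PRETTY_NAME from /etc/os-release, or 'unknown'."""
            try:
                with open("/etc/os-release") as f:
                    for line in f:
                        if line.startswith("PRETTY_NAME="):
                            return line.split("=", 1)[1].strip().strip('"')
            except OSError:
                pass
            return "unknown"

        print("OS:", os_release())

        singularity = shutil.which("singularity")
        if singularity:
            version = subprocess.run([singularity, "--version"],
                                     capture_output=True, text=True).stdout.strip()
            print("Singularity:", singularity, version)
        else:
            print("Singularity: NOT found on PATH")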


5. Diskless sites
  21/02: Sheffield will have to reduce to 150TB, but has 800 job slots. This is a good candidate for going diskless (no buffer to start with), accessing data from Lancaster and/or Manchester.
    Birmingham should have 600 TB (700-800TB next year). We can use this! It is large enough for a DATADISK.
    Agreed to discuss with ATLAS at the Sites Jamboree (5-8 March). We can then bring it back to GridPP.
    28/02: Brian: We need more experience from Birmingham and Manchester monitoring.
    14/03: From Jamboree:
    Bham: Mark has agreed to set up XCache at Bham.
    Shef: when Mark switches to XCache, Sheffield can use Manchester storage.
    Cam and Sussex: Dan is happy for Cam and Sussex to use QM storage.
    Brunel: Dan is not sure about Brunel. Elena will ask Raul.
    LOCALGROUPDISK at diskless sites: Elena will send an email to ATLAS.
    21/03: Elena (email):
        3. Cedric says "diskless sites can keep LOCALGROUPDISK [...] if the UK cloud takes care of contacting the sites if they have problems."
    Decided we want to keep LOCALGROUPDISK for the time being.
    Dan: will this go away? There is no GridPP funding, so it will only be kept long-term if sites choose to fund it.
    Sam: LOCALGROUPDISK is supposed to be owned by ATLAS UK, but it is mostly actually used by local users.
    Elena: should ask John from Cambridge when he wants to try using QM storage.
    Sam: is this what we want? John was interested in seeing how Mark got on at Birmingham.
    28/03: Alessandra will do Sussex: CentOS7 upgrade and switching to diskless.
    Alessandra: Brunel still has storage for CMS, but just 33 TB for ATLAS. XCache won't be used there.
    Brunel has now been set to run low-IO jobs and reduce data consolidation.
    Decided to keep as is (with 33 TB ATLAS disk) for the moment.
    Brian: wait to see how Birmingham gets on with XCache before trying to use it at Cambridge. Alternative would be to use another site so as not to overload QMUL.
    04/04: Mark will write up a procedure for installing XCache. Wait to test ATLAS setup at Birmingham before trying other sites (eg. Cambridge).
    Mario said there were two modes to test. The transparent cache puts a proxy in front of the storage, so it is not useful for our case. We want the buffer cache mode (Mario calls it a "Volatile RSE").

6. Dumps of ATLAS namespace (Brian)
    14/03: Brian noticed that some sites stopped providing dumps of the ATLAS namespace: scripts stopped working after the DPM upgrade. He has contacted the sites and will open a JIRA ticket.
    ATLAS documentation here: https://twiki.cern.ch/twiki/bin/view/AtlasComputing/DDMDarkDataAndLostFiles#Automated_checks_Site_Responsibi
    Sites are now making the dumps. Working on RAL Tier1 Echo, which needs a Rucio change.
    28/03: Sam: needs to fine-tune how to automate the dumps.
    Matt: Lancaster still to do. Brian will prod him before next week.
    04/04: Brian will check what sites now have setup.
    The dump format is described here: https://twiki.cern.ch/twiki/bin/view/AtlasComputing/DDMDarkDataAndLostFiles#Storage_Element_Dumps . Paths should be relative to the "rucio" directory (see the sketch after this item).
    18/04: Brian: still working on Glasgow and Lancaster.
    Sam says Glasgow now has a dump. There have been problems getting it to work as a cron job, due to xrootd permissions problems.
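    For reference, the expected dump is essentially one file path per line, relative to the site's "rucio" directory (see the TWiki above). A minimal sketch for a POSIX-visible namespace is below; the base and output paths are placeholders, and DPM/Echo sites produce the equivalent list with their own tooling (dmlite-shell, Rucio) instead.

        # Sketch only: write a storage dump with one path per line, relative to
        # the "rucio" directory, as described on the DDM TWiki.
        import os

        BASE = "/dpm/example.ac.uk/home/atlas/atlasdatadisk/rucio"   # placeholder
        OUTPUT = "/tmp/dump_20190418"                                # placeholder

        with open(OUTPUT, "w") as out:
            for dirpath, _dirnames, filenames in os.walk(BASE):
                for name in filenames:
                    full = os.path.join(dirpath, name)
                    out.write(os.path.relpath(full, BASE) + "\n")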


News:
Alessandra: NTR
Brian: NTR
Gareth: will start moving equipment into the new data centre in July-August. They will test Ceph, or think about a new SE.
Sam: NTR
Vip: NTR
Tim:
    We occasionally get a corrupted file in the worker node's XCache for Echo. There are some ideas for how to clean these up automatically; a rough sketch follows below.
    Stewart created the RAL-LCG2_MCORE_TEMP production queue. It will be kept in test, but will eventually replace RAL-LCG2-ECHO_MCORE. The RAL-LCG2-ECHO site can then be removed.
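    A rough sketch of one possible cleanup (not an agreed tool; the cache path, and the use of Rucio's recorded adler32 via get_metadata(), are assumptions): recompute each cached file's adler32, compare it with the catalogue value, and delete mismatches.

        # Sketch only: remove a cached file if its adler32 does not match the
        # value recorded in Rucio. Mapping a cache path to scope/name is site
        # specific and left to the caller.
        import os
        import zlib
        from rucio.client import Client

        client = Client()

        def adler32(path, blocksize=1 << 20):
            """Compute the adler32 of a file as an 8-character hex string."""
            value = 1
            with open(path, "rb") as f:
                while True:
                    block = f.read(blocksize)
                    if not block:
                        break
                    value = zlib.adler32(block, value)
            return "%08x" % (value & 0xFFFFFFFF)

        def check_and_clean(path, scope, name):
            """Delete 'path' from the cache if its checksum differs from Rucio's."""
            expected = client.get_metadata(scope, name).get("adler32")
            if expected and adler32(path) != expected:
                print("checksum mismatch, removing", path)
                os.remove(path)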

AOB:
We will all meet at GridPP42 at Cosener's House next week. The next ATLAS UK Cloud Support meeting is the week after, on 2nd May.
Agenda:
    • 10:00-10:10 Outstanding tickets (10m)
    • 10:10-10:30 Ongoing issues (20m)
    • 10:30-10:50 News round-table (20m)
    • 10:50-11:00 AOB (10m)