ATLAS UK Cloud Support

Europe/London
Vidyo

Tim Adye (Science and Technology Facilities Council STFC (GB)), Stewart Martin-Haugh (Science and Technology Facilities Council STFC (GB))
ATLAS UK Cloud Support meeting minutes, 6 June 2019

Present: Alessandra Forti (Manchester), Brian Davies (RAL), Dan Traynor (QMUL), Elena Korolkova (Sheffield), Kashif Mohammad (Oxford), Matt Doidge (Lancaster), Sam Skipsey (Glasgow), Tim Adye (RAL), Vip Davda (Oxford)

Outstanding tickets:

* UKI-SCOTGRID-ECDF_LOCALGROUPDISK blacklisted in DDM: OFF for uw (DISKSPACE).
     Leave it unless users complain.
* ggus 141234 UKI-SCOTGRID-DURHAM Lost heartbeats
    Elena: Seemed better on Tuesday. Will check again now and, if good, close the ticket.
* ggus 141167 High number of hits on ATLAS backup proxies from UKI-SCOTGRID-DURHAM WNs
    Sam: Durham changed its IP address range, which necessitated Glasgow Frontier Squid ACL change, and Durham firewall settings. Now seems to be resolved.
    Elena will close the ticket.
* ggus 141611 Missing file at UKI-NORTHGRID-MAN-HEP_DATADISK
    Alessandra said the file was not there. Will declare lost in Rucio.
* ggus 141549 ATLAS-RAL-Frontier and some of Lpad-RAL-LCG2 squid degraded
    2 of the 3 RAL Frontier servers filled their disks with a logfile on Sunday.
    Service was restored quickly, and logrotate has now been fixed to stop it happening again. Can close the ticket.
* ggus 141571 UKI-SOUTHGRID-OX-HEP "could not open connection" transfer errors
    Kashif enabled GridFTP redirect on DOME.
    Sam said that GridFTP is fine, only SRM is broken.
    ATLAS doesn't need SRM, but it is still used at Oxford for other VOs and Nagios.
    Proposed to switch off ATLAS use of SRM, so ATLAS can keep going while Kashif fixes SRM for others.
    Alessandra will switch off SRM in AGIS for Oxford and Lancaster.
* ggus 141579 Problem with access to some files on UKI-SOUTHGRID-RALPP
    In the ticket, Chris says this is load related and should recover when load drops.
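As a side note on the RAL Frontier disk-full incident above, a size-based logrotate rule is one way to stop a runaway logfile filling a disk. A minimal sketch (the path, size limits, and squid rotate command are illustrative, not the actual RAL configuration):

```
# Hypothetical logrotate rule for a Frontier squid access log
# (illustrative paths and limits; adapt to the actual installation)
/var/log/squid/access.log {
    size 10G          # rotate as soon as the log reaches 10 GB
    rotate 4          # keep at most four rotated copies
    compress
    missingok
    notifempty
    postrotate
        /usr/sbin/squid -k rotate
    endscript
}
```

With a `size` trigger, rotation can run from an hourly cron entry rather than the default daily one, so a sudden burst of logging is caught before the disk fills.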

Ongoing issues (new comments marked in green):

1. Bham XCache (Mark/Elena)
    14/3: Following discussion at the Jamboree, Mark agreed to install XCache at Bham. Mark asked for support, and Ilija and Rod will help him with the installation. Mark can either deploy it himself (on bare metal, or using a Singularity/Docker container) or deploy a Kubernetes cluster with SLATE; Elena will do the rest: keep it up and running, updated, monitored, and set up in AGIS.
    Sam suggested discussing the Bham XCache setup at the Storage meeting next week. Elena will send an email to Mark.
    21/03: Sam: Mark has a lot of experience setting up XRootD, so XCache setup should be simple for him. He will set up XCache at Birmingham and agreed to be on-call. Mark can then advise other sites, eg. with Puppet. We won't use SLATE.
    Brian: ATLAS UK should do any needed AGIS setup.
    Sam: will need to understand how ATLAS uses XCache.
    04/04: Birmingham has XCache setup. Still need changes in AGIS, eg. to create an RSE. Alessandra will do it.
    18/04: Alessandra has been looking at how to create the RSE. She will need to contact the DDM experts to understand what to change in AGIS. Perhaps there is already an XCache RSE defined at ECDF that could be used as a model.
    02/05: Sam reported on the work, mainly by Mark and Alessandra. Some movement on configuring AGIS, but it broke things. Mario is a bit reluctant because it needs new Rucio features.
    There are two modes we'd like to try: in one, the cache acts as a proxy on site and ATLAS knows nothing about it; the other (beta) uses Rucio volatile storage. There was some confusion as to which one we wanted to test. We (UK) would be happy with the simple proxy, but ATLAS wants to test volatile storage.
    09/05: Brian is debugging XCache with xrdcp. He was confused by a "no such file" message in the debug log; this is normal for a successful transfer.
    16/05: Brian: can xrdcp via Birmingham XCache from other sites, but problem with RAL Echo. Have a fix for RAL Echo, verified with a test gateway.
    Alessandra: RAL Echo via XCache will be needed eventually. Brian will put in a JIRA ticket.
    Alessandra: Rucio not yet ready to use XCache RSE. See https://its.cern.ch/jira/browse/ADCINFR-121 (and linked https://github.com/rucio/rucio/issues/2511)
    30/05: Brian will check after Alessandra's storage update.
    06/06: Sam said that Teng will set up XCache monitoring for Mark, based on what he did at ECDF.
    Alessandra checked a few days ago. Stefan had changed the fairshare. Should be running more jobs now. These jobs use the XCache, so we can start to see how they do (especially once the XCache monitoring is available).


2. Diskless sites
  21/02: Sheffield will have to reduce to 150TB, but has 800 job slots. This is a good candidate for going diskless (no buffer to start with), accessing data from Lancaster and/or Manchester.
    Birmingham should have 600 TB (700-800TB next year). We can use this! It is large enough for a DATADISK.
    Agreed to discuss with ATLAS at the Sites Jamboree (5-8 March). We can then bring it back to GridPP.
    28/02: Brian: We need more experience from Birmingham and Manchester monitoring.
    14/03: From Jamboree:
    Bham: Mark has agreed to set up XCache at Bham
    Shef: When Mark switches to XCache, Sheffield can use Manchester storage
    Cam and Sussex: Dan is happy for Cam and Sussex to use QM storage
    Brunel: Dan is not sure about Brunel. Elena will ask Raul
    LOCALGROUPDISK at diskless sites: Elena will send email to ATLAS
    21/03: Elena (email):
        3. Cedric says "diskless sites can keep LOCALGROUPDISK [...] if the UK cloud takes care of contacting the sites if they have problems."
    Decided we want to keep LOCALGROUPDISK for the time being.
    Dan: will this go away? No Gridpp funding, so only keep long term if sites choose to fund it.
    Sam: LOCALGROUPDISK is supposed to be owned by ATLAS UK, but is mostly actually used by local users.
    Elena: should ask John from Cambridge when he wants to try using QM storage.
    Sam: is this what we want? John was interested in seeing how Mark got on at Bham.
    28/03: Alessandra will do Sussex: CentOS7 upgrade and switching to diskless.
    Alessandra: Brunel still has storage for CMS, but just 33 TB for ATLAS. They don't use XCache.
    Brunel has now been set to run low-IO jobs and reduce data consolidation.
    Decided to keep as is (with 33 TB ATLAS disk) for the moment.
    Brian: wait to see how Birmingham gets on with XCache before trying to use it at Cambridge. Alternative would be to use another site so as not to overload QMUL.
    04/04: Mark will write up a procedure for installing XCache. Wait to test ATLAS setup at Birmingham before trying other sites (eg. Cambridge).
    Mario said there were two modes to test. Transparent cache puts proxy in front of storage, so not useful for our case. We want the Buffer cache mode (Mario calls it "Volatile RSE").
    02/05: Lincoln Bryant was at RAL and demonstrated XCache setup using SLATE to Alastair, Brian, Tim, et al.
    16/05: Brian will try out SLATE at RAL.
    06/06: Elena said Sheffield was waiting. Need to finish with CentOS7 upgrade first.

3. CentOS7 migration (Alessandra)
    UK quite advanced. Sites with PBS will have to switch to something else.
    Alessandra will follow progress for UK (also coordinating for the rest of ATLAS).
    31/01: RALPP is moving to CentOS7-only. Elena will check queues.
    14/02: Elena switched RALPP UCORE PanDA queue from SL6 to CentOS7. Analysis C7 queue still has problems.
    Kashif: Oxford has 2/3 of worker nodes moved to C7 queue. Can delete SL6 queues in a few weeks' time.
    Vip: Jeremy has a web page with progress, https://www.gridpp.ac.uk/wiki/Batch_system_status . We should keep that up to date, but also need to monitor progress for ATLAS migration.
    28/02: At Vip's request, during the meeting Elena disabled UKI-SOUTHGRID-OX-HEP_CLOUD, UKI-SOUTHGRID-OX-HEP_UCORE, ANALY_OX_TEST, and UKI-SOUTHGRID-OX-HEP_TEST. Just have CentOS7 queues, UKI-SOUTHGRID-OX-HEP_SL7_UCORE and ANALY_OX_SL7.
    21/03: Glasgow still to do.
    28/03: Sheffield aims for the deadline, which Alessandra confirmed is 1st June.
    04/04: Alessandra: about to send out email to all ATLAS sites to request status updates on TWiki. Tool to identify CentOS7 sites doesn't work, so keep manual list.
    Status known for all UK sites except RHUL.
    Andy McNab needs to change image for VAC sites for CentOS7 and make sure Singularity is accessible.
    18/04: Vip said that Oxford has now completed migration of all WNs. Alessandra will delete the SL6 PanDA queues.
    Alessandra: we should press sites to put CentOS image live for VAC.
    Gareth: Glasgow using VAC as a method of building a new site with CentOS7 for ATLAS-only. Will then turn off SL6 VAC. Part of lots of work for the new data centre.
    02/05: Elena is working on new ARC-CE connected to new Condor farm at Sheffield (upgrading all in one go). Oxford provided example config for AGIS.
    Vip: will switch off SL6 for other VOs (ATLAS was done months ago).
    09/05: Elena: CentOS7 PanDA queues setup for Durham and Sheffield. Durham works, but Sheffield has problems with the ARC-CE.
    Vip: suggested this might be due to a problem with the certificates in grid-security. He will advise Elena.
    Elena disabled an old SL6 MCORE PanDA queue for Liverpool, which has no more SL6 workers.
    16/05: Alessandra: Waiting for news from Andrew McNab for VAC sites: Birmingham, Cambridge, UCL. Liverpool has switched VAC off.
    To do: RHUL, QMUL, Durham (Sam said they were working on it last week).
    Have two weeks till the 1st June deadline. Alessandra prefers not to have to open lots of tickets for UK sites.
    23/05: Glasgow: the VAC queue can go over to CC7 whenever needed; production queues are still CC6.
        Issues observed with HTCondor-CE when trying to move production queues to CC7.
        Unlikely to make the June deadline.
        VAC upgrade to CC7 ready at Manchester.
    30/05: Sheffield arcproxy crashing with the new ARC-CE and CentOS7 (understood and fixed after the meeting).
        Glasgow will move VAC when new VMs are ready
    06/06: Alessandra had sent a mail with link to table on progress: https://twiki.cern.ch/twiki/bin/view/AtlasComputing/CentOS7Deployment
        Deadline was 1st June. Elena asked sites to update at the Ops meeting. No one has responded so far.
        Alessandra changed the status of 5 UK sites to "On Hold". 3 are VAC sites. She said Andrew is trying it at Manchester, but it doesn't seem to work yet. It seems to be a race condition, but we don't know where it comes from. She emailed Peter to look into it.
        Get VAC working at Manchester then roll it out at other sites.
        Alessandra asked Roger what to do about UCL, but didn't hear back after PMB.
        VAC still running on APF (and a special instance). Needs to move to Harvester. Peter should do it.
        Elena asked about problems with the ARC-CE at Sheffield. Alessandra will look at it.
        Elena is planning to make a list of PanDA queues for UK sites to see which old SL6 queues can be removed.


4. Harvester issues (Peter/Alessandra/Elena)
    28/02: Migration almost complete. Kept SL6 queues on APF because they will go away anyway.
    Remaining APF factories listed here: http://apfmon.lancs.ac.uk/aipanda157 .
    In the end should only have analysis queues on APF.
    Also think about moving test queues. Should have a separate Harvester instance to allow testing.
    14/03:
    Lancs: Peter configured Lancs in AGIS to use GridFTP
    Brunel: Elena made changes in AGIS and Brunel is now set online. She will look at how Lancaster is doing with the GridFTP setup, then try to add GridFTP to Brunel and disable SRM.
    QMUL: Elena removed a queue for ES. QMUL is OK now. Elena will contact the Harvester experts about how to use the high-memory queues.
    21/03: Elena (from email attached to Indico):
        1. I'll configure Sussex in AGIS to use QMUL  <--- Ale will do it
        2. Raul  will think about deployment of XCache in Brunel.  <----- NO don't need xcache for Brunel
        4. Nicolo says "If the QMUL EventService queue is a different queue in the local batch system, then yes, it will need a new PanDAQueue in AGIS. [... but] the Lancaster EventService queue UKI-NORTHGRID-LANCS-HEP_ES is actually not configured to run EventService anymore."
    Lancaster was just a development queue. Peter will deactivate it now.
    QMUL has a pre-emption queue setup for the event service, so we should set it up with a separate PanDA queue.
    QMUL provides 4 GB / job slot, so used to have separate high memory PanDA queue. Now Harvester handles it in the UCORE queue. Dan will ask Elena to ask Harvester experts to check settings.
    28/03: Elena: RALPP asked to disable SL6 analysis queue. There also seems to be a problem with AGIS storage setup. Elena asked Chris to check endpoints (screendump from AGIS).
    Elena: request to disable Sussex analysis queue. Done.
    Alessandra: Manchester still have problems with Harvester. If we want to maintain the current setup we will have to enable extra PanDA queues.
    04/04: Manchester problem resolved: workers full again.
    Was fixed by creating one PanDA queue for each cluster, since Harvester doesn't seem able to cope with two different batch systems. Disappointing that Harvester's promise to allow unified queues is not always possible.
    Sussex work won't happen before May. Fabrizio said they are hiring someone.
    02/05: Sheffield Harvester migration at same time as CentOS7 etc.
    16/05: Fabrizio at European Strategy for Particle Physics meeting in Granada this week. Should have an update next week.
    30/05: Only Sussex pending; RHUL needs to be checked.
    06/06: Alessandra talked to Fabrizio about Sussex. The new person starts in July. They would like to go disk-less, probably using QMUL.

5. LocalGroupDisk consolidation (Tim)
  * Most >2-year-old datasets were removed on 4 July. Should free 488 TB. Tim will check the rest.
  * Tim: R2D2 has default lifetime of 180 days (can be changed when making request).
        At S&C Week, Cedric suggested a "subscribe to dataset" option to get email notifications. I think it would be more useful to do this when creating the rule, with a notification when LOCALGROUPDISK holds the last replica.
  * Brian: do we really need to keep 10% free space? This prevents disk filling up, and leaves room for unexpected increases. Not necessary for sites to also provide a safety margin.
  * Tim has asked Roger if we should rebalance more T2 disk from LOCALGROUPDISK to DATADISK. Will prod again.
    10/01: Cleanup run is due sooner rather than later. RHUL increased space recently.
    17/01: Is 180 days still the limit? We had mixed results when trying it in the meeting.
    Kashif: How can we tell users about their files on the site's LOCALGROUPDISK? Currently using dmlite-shell.
    24/01: Tim sent instructions last week to the atlas-support-cloud-uk list for checking Rucio (don't need to be an ATLAS user).
    Kashif: many users have left Oxford, so difficult to contact. Will contact ATLAS Oxford group leader.
    Dan asked about user quotas on LOCALGROUPDISK. Some users seem to have individual quotas. They shouldn't.
    04/04: Tim cleaned out some very old data that should have been removed last year from ATLAS UK LOCALGROUPDISKs. Freed up 65 TB. He's now ready to initiate another mass cleanup of more recent (still >2 years old) data.
    18/04: Tim found 127.7 TB of old LOCALGROUPDISK data. Will send list to ATLAS UK users.
    02/05: UK users mailed last week. Will give them 3 weeks.
    09/05: Vip is chasing local users and freed up some space at Oxford.
    Tim: should be able to delete old LOCALGROUPDISK data next week.
    16/05: Tim will delete data on Monday.
    23/05: Data deleted today.
    06/06: That's the end of this long-standing issue.
        We can deal with individual LOCALGROUPDISKs as they cause problems (eg. user complaints).
        Will remove this item until it's time to do another big cleanup.


6. Dumps of ATLAS namespace (Brian)
    14/03: Brian noticed that some sites stopped providing dumps of the ATLAS namespace: scripts stopped working after the DPM upgrade. He contacted the sites and will open a JIRA ticket.
    ATLAS documentation here: https://twiki.cern.ch/twiki/bin/view/AtlasComputing/DDMDarkDataAndLostFiles#Automated_checks_Site_Responsibi
    Sites are now making the dumps. Working on RAL Tier1 Echo, which needs a Rucio change.
    28/03: Sam: needs to fine-tune how to automate the dumps.
    Matt: Lancaster still to do. Brian will prod him before next week.
    04/04: Brian will check what sites now have setup.
    The dump format is described here: https://twiki.cern.ch/twiki/bin/view/AtlasComputing/DDMDarkDataAndLostFiles#Storage_Element_Dumps . Paths should be relative to the "rucio" directory.
    18/04: Brian: still working on Glasgow and Lancaster.
    Sam says Glasgow now has a dump. There have been problems getting it to work as a cron job, due to xrootd permissions problems.
    02/05: Lancaster dump is in the wrong place (root://fal-pygrid-30.lancs.ac.uk/dpm/lancs.ac.uk/home/atlas/atlaslocalgroupdisk/rucio/dump_20190429). Should be root://fal-pygrid-30.lancs.ac.uk/dpm/lancs.ac.uk/home/atlas/atlaslocalgroupdisk/dumps/dump_20190429. Matt will fix it.
    Glasgow dump now available.
    09/05: All Tier-2s now have dump files.
        RAL Tier-1 still to fix automatic dump (previously done by hand).
    06/06: As well as fixing the RAL Tier-1 dump format, we also need the aforementioned change in the Rucio Consistency checker.
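For illustration, the convention described above, that entries in a storage dump should be given relative to the "rucio" directory, can be expressed as a small helper. This is only a sketch of the path rule; the function name and the example file path are made up:

```python
# Sketch: normalise a full storage path into the dump format ATLAS expects,
# i.e. the path relative to the "rucio" directory.
# (Illustrative helper; names and the example path are not from a real site.)

def rucio_relative(path: str) -> str:
    """Return the part of a storage path after the 'rucio/' directory."""
    marker = "/rucio/"
    idx = path.find(marker)
    if idx < 0:
        raise ValueError(f"no 'rucio' directory in {path!r}")
    return path[idx + len(marker):]

# Example with a DPM-style path (made up):
full = ("/dpm/example.ac.uk/home/atlas/atlaslocalgroupdisk/rucio/"
        "data18_13TeV/ab/cd/AOD.12345._000001.pool.root.1")
print(rucio_relative(full))
# -> data18_13TeV/ab/cd/AOD.12345._000001.pool.root.1
```

A dump that lists full namespace paths instead of rucio-relative ones will cause the consistency checker to flag every file as dark data, which is why the format matters.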

7. XRootD/HTTP transfers (Alessandra/Tim)
    16/05: DOMA TPC group looking for ATLAS sites to enable XRootD/HTTP for test-production transfers.
        Alessandra reported that Lancaster and Brunel are ready for this. Could also include RAL using a test RSE.
    06/06: Alessandra reported that Mario is working on configuring functional tests for XRootD TPC.
        Had Lancaster running, but it broke. Also need monitoring.
        Concerning the fix needed to get it working with AGIS/Rucio, Cedric and Martin say it isn't easy. We would like to understand what the problem is.


News:

Alessandra: NTR
Brian: Brian is taking on more work for the RAL Tier 1 (monitoring production from John Kelley and FTS from Catalin, who are both leaving). Brian will have to drop his ATLAS work. He'll come to the UK Cloud Support meetings for the rest of the month, and thereafter liaise with Tim and Stewart.
Dan: Identified the reason for recent storage outages (failover does not help). He will try to put in a fix.
Elena: NTR
Kashif is leaving Oxford at the end of the month, so wants to finish storage fixes and improvements first.
Matt:
    Reported 2 analysis jobs using too many threads. Would have started killing them if there were many more. Tim will look into getting ATLAS DAST to deal with this sort of issue in future.
    One disk server is at its limit of broken disks, so keeping fingers crossed while it rebuilds (readonly). If it breaks before rebuild is done, will have 100 TB data loss.
    Working on HTCondor-CE in the background.
Sam is ever closer to having test Ceph storage ready.
Vip: NTR
Tim: 1 of 7 XRootD gateways upgraded to support TPC. Will do the rest if all is OK. Brian commented that (when complete) this should also fix access to RAL Echo via XCache.

AOB:
None
Timetable:
    • 10:00-10:10 Outstanding tickets (10m)
    • 10:10-10:30 Ongoing issues (20m)
    • 10:30-10:50 News round-table (20m)
    • 10:50-11:00 AOB (10m)