ATLAS UK Cloud Support

Name: ATLAS UK Cloud Support
Start: 2019-06-20T10:00:00+01:00
End: 2019-06-20T11:00:00+01:00
Location: Vidyo

Tim Adye (Science and Technology Facilities Council STFC (GB)), Stewart Martin-Haugh (Science and Technology Facilities Council STFC (GB))

ATLAS UK Cloud Support meeting minutes, 20 June 2019

Present: Alessandra Forti (Manchester), Brian Davies (RAL), Dan Traynor (QMUL), Elena Korolkova (Sheffield), Gareth Roy (Glasgow), John Hill (Cambridge), Matt Doidge (Lancaster), Peter Love (Lancaster), Sam Skipsey (Glasgow), Tim Adye (RAL)
Apologies: Stewart Martin-Haugh (RAL), Vip Davda (Oxford)

Outstanding tickets:

* ggus 141675 UKI-SCOTGRID-ECDF Multiple evicted jobs
    Elena will poke Edinburgh
* ggus 141712 Macaroon issuing fails for svr018.gla.scotgrid.ac.uk / UKI-SCOTGRID-GLASGOW_SCRATCHDISK
    Sam: didn't realise that ATLAS is testing Macaroons for http. ATLAS isn't ready for commissioning. Glasgow not ready for this either.
    Alessandra will close the ticket.
* ggus 141571 UKI-SOUTHGRID-OX-HEP "could not open connection" transfer errors
    Oxford is down today - AC failure.
    Elena: this ticket is a longstanding problem. Working on it, but so far not much luck.
* ggus 141743 UKI-NORTHGRID-SHEF-HEP: Transfer errors with "TRANSFER Operation timed out"
    Elena fixed a couple of problems. Need to rebalance disk servers to reduce load on some old servers that don't have enough RAM. Elena will update the ticket.

Ongoing issues (new comments marked in green):

1. Birmingham XCache (Mark/Elena)
    14/3: Following discussion at the Jamboree, Mark agreed to install XCache at Bham. Mark asked for support, and Ilija and Rod will help him to install XCache. Mark can deploy it by himself (install on bare metal or using a Singularity/Docker container), or deploy a Kubernetes cluster with SLATE and Elena does the rest: keep it up and running, updated, monitored, and set up in AGIS.
    Sam suggested discussing the Bham XCache setup at the Storage meeting next week. Elena will send an email to Mark.
    21/03: Sam: Mark has a lot of experience setting up xrootd, so XCache setup should be simple for him. He will set up XCache at Birmingham and agreed to be on-call. Mark can then advise other sites, eg. with Puppet. We won't use SLATE.
    Brian: ATLAS UK should do any needed AGIS setup.
    Sam: will need to understand how ATLAS uses XCache.
    04/04: Birmingham has XCache setup. Still need changes in AGIS, eg. to create an RSE. Alessandra will do it.
    18/04: Alessandra has been looking at how to create an RSE. Will need to contact DDM experts to understand what to change in AGIS. Perhaps there is already an XCache RSE defined at ECDF that could be used as a model.
    02/05: Sam reported on the work, mainly by Mark and Alessandra. Some movement on configuring AGIS, but it broke things. Mario a bit reluctant because needs new Rucio features.
    Two modes we'd like to try: one cache and proxy on site and ATLAS knows nothing about the proxy. Other mode (beta) uses Rucio volatile storage. Some confusion as to which one we wanted to test. We (UK) would be happy with simple proxy, but ATLAS wants to test volatile storage.
    09/05: Brian is debugging XCache with xrdcp. Was confused by a "no such file" message in the debug log. This is normal for a successful transfer.
    16/05: Brian: can xrdcp via Birmingham XCache from other sites, but problem with RAL Echo. Have a fix for RAL Echo, verified with a test gateway.
    Alessandra: RAL Echo via XCache will be needed eventually. Brian will put in a JIRA ticket.
    Alessandra: Rucio not yet ready to use XCache RSE. See https://its.cern.ch/jira/browse/ADCINFR-121 (and linked https://github.com/rucio/rucio/issues/2511)
    30/05: Brian will check after Alessandra's storage update.
    06/06: Sam said that Teng will set up XCache monitoring for Mark. Based on stuff he did at ECDF.
    Alessandra checked a few days ago. Stefan had changed the fairshare. Should be running more jobs now. These jobs use the XCache, so we can start to see how they do (especially once the XCache monitoring is available).
    13/06: Jobs are running fine; 2.5% are failing. Keep watching.
    20/06: Mark reported success at last week's ADC Weekly. Very useful slides: https://indico.cern.ch/event/827071/contributions/3463355/attachments/1859714/3055943/AtlasDDMOpsMeeting_Jun2019.pdf
    Alessandra: still can't separate traffic. Manchester network is full, but probably not Birmingham. Some WNs outside UK using Manchester storage. Difficult to add monitoring to separate out Birmingham traffic without logging every connection.
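For reference, the kind of xrdcp test Brian describes can be sketched as below. This is only an illustration: it assumes the cache is addressed in the forwarding style, with the origin URL appended to the cache URL, which may not match the actual Birmingham configuration. The host names and file path are hypothetical.

```python
def via_xcache(cache_host: str, origin_url: str, port: int = 1094) -> str:
    """Build a root:// URL that reads origin_url through a forwarding cache.

    Assumption: the cache accepts URLs of the form
    root://<cache>:<port>//root://<origin>//<path>.
    """
    if not origin_url.startswith("root://"):
        raise ValueError(f"not an xrootd URL: {origin_url}")
    return f"root://{cache_host}:{port}//{origin_url}"


def xrdcp_command(cache_host: str, origin_url: str, dest: str = "/dev/null") -> list:
    """Return the xrdcp argument list for a read test via the cache.

    -f overwrites the destination; the result can be passed to subprocess.run().
    """
    return ["xrdcp", "-f", via_xcache(cache_host, origin_url), dest]


if __name__ == "__main__":
    # Hypothetical cache and origin hosts, for illustration only.
    cmd = xrdcp_command(
        "xcache.example.ac.uk",
        "root://origin.example.ac.uk:1094//atlas/rucio/some/file.root",
    )
    print(" ".join(cmd))
```

Running the printed command from a worker node and then repeating it gives a quick check that the second read is served from the cache.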

2. Diskless sites
    21/02: Sheffield will have to reduce to 150TB, but has 800 job slots. This is a good candidate for going diskless (no buffer to start with), accessing data from Lancaster and/or Manchester.
    Birmingham should have 600 TB (700-800TB next year). We can use this! It is large enough for a DATADISK.
    Agreed to discuss with ATLAS at the Sites Jamboree (5-8 March). We can then bring it back to GridPP.
    28/02: Brian: We need more experience from Birmingham and Manchester monitoring.
    14/03: From Jamboree:
    Bham: Mark has agreed to set up XCache in Bham
    Shef: When Mark switches to XCache Sheffield can use Manchester storage
    Cam and Sussex: Dan is happy for Cam and Sussex to use QM storage
    Brunel: Dan is not sure about Brunel. Elena will ask Raul
    LOCALGROUPDISK at diskless sites: Elena will send email to ATLAS
    21/03: Elena (email):
        3. Cedric says "diskless sites can keep LOCALGROUPDISK [...] if the UK cloud takes care of contacting the sites if they have problems."
    Decided we want to keep LOCALGROUPDISK for the time being.
    Dan: will this go away? No Gridpp funding, so only keep long term if sites choose to fund it.
    Sam: LOCALGROUPDISK is supposed to be owned by ATLAS UK, but is mostly used by local users.
    Elena: should ask John from Cambridge when he wants to try to use QM storage.
    Sam: is this what we want? John was interested in seeing how Mark got on at Bham.
    28/03: Alessandra will do Sussex: CentOS7 upgrade and switching to diskless.
    Alessandra: Brunel still has storage for CMS, but just 33 TB for ATLAS. Don't use XCache.
    Brunel has now been set to run low-IO jobs and reduce data consolidation.
    Decided to keep as is (with 33 TB ATLAS disk) for the moment.
    Brian: wait to see how Birmingham gets on with XCache before trying to use it at Cambridge. Alternative would be to use another site so as not to overload QMUL.
    04/04: Mark will write up a procedure for installing XCache. Wait to test ATLAS setup at Birmingham before trying other sites (eg. Cambridge).
    Mario said there were two modes to test. Transparent cache puts proxy in front of storage, so not useful for our case. We want the Buffer cache mode (Mario calls it "Volatile RSE").
    02/05: Lincoln Bryant was at RAL and demonstrated XCache setup using SLATE to Alastair, Brian, Tim, et al.
    16/05: Brian will try out SLATE at RAL.
    06/06: Elena said Sheffield was waiting. Need to finish with CentOS7 upgrade first.
    13/06: Cambridge (John Hill) was thinking about installing XCache. Elena will contact John to confirm the plans.
    20/06: John: Cambridge has 900 job slots, but only 220TB of DPM storage, all out of warranty (5.5-9 years old). No money for new disk. John is retiring in September, with no full-time replacement. Would like to minimise effort after that.
    Discussed four options:
        1) Leave everything as is. Still would need work now to upgrade DPM (currently on 1.10.0) and maybe go DOME. Difficult to maintain storage especially recovering after a disk crash.
        2) Run diskless with low-I/O jobs (MC sim). This is easiest, but worry about too many ATLAS sites like this.
        3) Install XCache manually using one pool node. This would take a bit to set up, but less long-term maintenance. If there are disk problems don't care about lost data, and can run diskless while being fixed.
        4) Install XCache with SLATE. Need to install Kubernetes instead. Mark set up XCache in a day, so maybe we don't gain much using SLATE.
    Decided to install XCache manually (option 3) to access QMUL storage. John will drain one pool node (40 TB) and put XCache on it. Reduce spacetoken size and ATLAS DDM will free up the space.
    Alessandra will open a ticket: ADCINFR-129

3. CentOS7 migration (Alessandra)
    UK quite advanced. Sites with PBS will have to switch to something else.
    Alessandra will follow progress for UK (also coordinating for the rest of ATLAS).
    31/01: RALPP is moving to CentOS7-only. Elena will check queues.
    14/02: Elena switched RALPP UCORE PanDA queue from SL6 to CentOS7. Analysis C7 queue still has problems.
    Kashif: Oxford has 2/3 of worker nodes moved to C7 queue. Can delete SL6 queues in a few weeks' time.
    Vip: Jeremy has a web page with progress, https://www.gridpp.ac.uk/wiki/Batch_system_status . We should keep that up to date, but also need to monitor progress for ATLAS migration.
    28/02: At Vip's request, during the meeting Elena disabled UKI-SOUTHGRID-OX-HEP_CLOUD, UKI-SOUTHGRID-OX-HEP_UCORE, ANALY_OX_TEST, and UKI-SOUTHGRID-OX-HEP_TEST. Just have CentOS7 queues, UKI-SOUTHGRID-OX-HEP_SL7_UCORE and ANALY_OX_SL7.
    21/03: Glasgow still to do.
    28/03: Sheffield aim for deadline, which Alessandra confirmed was 1st June.
    04/04: Alessandra: about to send out email to all ATLAS sites to request status updates on TWiki. Tool to identify CentOS7 sites doesn't work, so keep manual list.
    Status known for all UK sites except RHUL.
    Andy McNab needs to change image for VAC sites for CentOS7 and make sure Singularity is accessible.
    18/04: Vip said that Oxford has now completed migration of all WNs. Alessandra will delete the SL6 PanDA queues.
    Alessandra: we should press sites to put CentOS image live for VAC.
    Gareth: Glasgow using VAC as a method of building a new site with CentOS7 for ATLAS-only. Will then turn off SL6 VAC. Part of lots of work for the new data centre.
    02/05: Elena is working on new ARC-CE connected to new Condor farm at Sheffield (upgrading all in one go). Oxford provided example config for AGIS.
    Vip: will switch off SL6 for other VOs (ATLAS was done months ago).
    09/05: Elena: CentOS7 PanDA queues set up for Durham and Sheffield. Durham works, but Sheffield has problems with the ARC-CE.
    Vip: suggested this might be due to a problem with the certificates in grid-security. He will advise Elena.
    Elena disabled an old SL6 MCORE PanDA queue for Liverpool, which has no more SL6 workers.
    16/05: Alessandra: Waiting for news from Andrew McNab for VAC sites: Birmingham, Cambridge, UCL. Liverpool has switched VAC off.
    To do: RHUL, QMUL, Durham (Sam said they were working on it last week).
    Have two weeks till the 1st June deadline. Alessandra prefers not to have to open lots of tickets for UK sites.
    23/05: Glasgow: VAC queue can go over to CC7 whenever needed; production queues are still CC6.
        Issues observed with HTCondor-CE when trying to move production queues to CC7
        Unlikely to make June deadline.
        VAC upgrade for cc7 ready at Manchester
    30/05: Sheffield arcproxy crashing with new ARC-CE and CentOS7 (understood + fixed after the meeting)
        Glasgow will move VAC when new VMs are ready
    06/06: Alessandra had sent a mail with link to table on progress: https://twiki.cern.ch/twiki/bin/view/AtlasComputing/CentOS7Deployment
        Deadline was 1st June. Elena asked sites to update at Ops meeting. No one responded so far.
        Alessandra changed status of 5 UK sites to "On Hold". 3 are VAC sites. She said Andrew is trying it at Manchester, but it doesn't seem to work yet. Seems to be a race condition, but don't know where it comes from. Emailed Peter to look into it.
        Get VAC working at Manchester then roll it out at other sites.
        Alessandra asked Roger what to do about UCL, but didn't hear back after PMB.
        VAC still running on APF (and a special instance). Needs to move to Harvester. Peter should do it.
        Elena asked about problems with the ARC-CE at Sheffield. Alessandra will look at it.
        Elena is planning to make a list of PanDA queues for UK sites to see which old SL6 queues can be removed.
    13/06: RHUL is lacking staff and it's unknown when they will move to CentOS7.
    20/06: Alessandra: VAC now working at Manchester. Andrew needs to write instructions, and will send to Birmingham and Cambridge.
        Gareth: please send also to Glasgow. Still have 1000 cores VAC.
        Alessandra: RHUL and UCL on hold. Sheffield still in progress (Elena will work on this; aim for 30 June completion).
        Glasgow close to a working system so probably do better than previous 31 July estimate.
        Sussex on hold till sysadmin available.
        All sites can upgrade to Singularity 3.2.1.

4. Harvester issues (Peter/Alessandra/Elena)
    28/02: Migration almost complete. Kept SL6 queues on APF because they'd go away.
    Remaining APF factories listed here: http://apfmon.lancs.ac.uk/aipanda157 .
    In the end should only have analysis queues on APF.
    Also think about moving test queues. Should have a separate Harvester instance to allow testing.
    14/03:
    Lancs: Peter configured Lancs in AGIS to use GridFTP
    Brunel: Elena made changes in AGIS and now Brunel is set online. Will look at how Lancaster is doing with its GridFTP setup, then try to add GridFTP to Brunel and disable SRM.
    QMUL: Elena removed a queue for ES. QMUL is OK now. Elena will ask the Harvester experts how to use high-memory queues.
    21/03: Elena (from email attached to Indico):
        1. I'll configure Sussex in AGIS to use QMUL <--- Ale will do it
        2. Raul will think about deployment of XCache in Brunel. <----- NO don't need xcache for Brunel
        4. Nicolo says "If the QMUL EventService queue is a different queue in the local batch system, then yes, it will need a new PanDAQueue in AGIS. [... but] the Lancaster EventService queue UKI-NORTHGRID-LANCS-HEP_ES is actually not configured to run EventService anymore."
    Lancaster was just a development queue. Peter will deactivate it now.
    QMUL has a pre-emption queue setup for the event service, so we should set it up with a separate PanDA queue.
    QMUL provides 4 GB / job slot, so used to have separate high memory PanDA queue. Now Harvester handles it in the UCORE queue. Dan will ask Elena to ask Harvester experts to check settings.
    28/03: Elena: RALPP asked to disable SL6 analysis queue. There also seems to be a problem with AGIS storage setup. Elena asked Chris to check endpoints (screendump from AGIS).
    Elena: request to disable Sussex analysis queue. Done.
    Alessandra: Manchester still have problems with Harvester. If we want to maintain the current setup we will have to enable extra PanDA queues.
    04/04: Manchester problem resolved: workers full again.
    Was fixed by creating one PanDA queue for each cluster, since Harvester doesn't seem able to cope with two different batch systems. Disappointing that Harvester's promise to allow unified queues is not always possible.
    Sussex work won't happen before May. Fabrizio said they are hiring someone.
    02/05: Sheffield Harvester migration at same time as CentOS7 etc.
    16/05: Fabrizio at European Strategy for Particle Physics meeting in Granada this week. Should have an update next week.
    30/05: Sussex only pending: RHUL needs to be checked
    06/06: Alessandra talked to Fabrizio about Sussex. The new person starts in July. They would like to go disk-less, probably using QMUL.
    13/06: Elena revisited queues for LT2 sites, Lancs and Liverpool. QMUL has a mixture of SL6 and CentOS7 queues. Dan wants to keep them until he moves all the resources to CentOS7. QMUL queues will be revisited after that. Elena will continue to revisit queues.
    20/06: Elena looked at ECDF-RDF in AGIS. Uses same CPU as ECDF, but with different storage.
    Alessandra: should we discuss ECDF going diskless?
    Gareth: also, do we still need ECDF-Cloud, which uses resources at UVic?
    Elena will mail Teng to ask what they want to do.

5. Dumps of ATLAS namespace (Brian)
    14/03: Brian noticed that some sites had stopped providing dumps of the ATLAS namespace. The scripts stopped working after the DPM upgrade. Brian contacted the sites and will open a JIRA ticket.
    ATLAS documentation here: https://twiki.cern.ch/twiki/bin/view/AtlasComputing/DDMDarkDataAndLostFiles#Automated_checks_Site_Responsibi
    Sites are now making the dumps. Working on RAL Tier1 Echo, which needs a Rucio change.
    28/03: Sam: needs to fine tune how to automate dumps.
    Matt: Lancaster still to do. Brian will prod him before next week.
    04/04: Brian will check what sites now have setup.
    The dump format is described here: https://twiki.cern.ch/twiki/bin/view/AtlasComputing/DDMDarkDataAndLostFiles#Storage_Element_Dumps . Paths should be relative to the "rucio" directory.
    18/04: Brian: still working on Glasgow and Lancaster.
    Sam says Glasgow now has a dump. There have been problems getting it to work as a cron job, due to xrootd permissions problems.
    02/05: Lancaster dump is in the wrong place (root://fal-pygrid-30.lancs.ac.uk/dpm/lancs.ac.uk/home/atlas/atlaslocalgroupdisk/rucio/dump_20190429). Should be root://fal-pygrid-30.lancs.ac.uk/dpm/lancs.ac.uk/home/atlas/atlaslocalgroupdisk/dumps/dump_20190429. Matt will fix it.
    Glasgow dump now available.
    09/05: All Tier-2s now have dump files.
        RAL Tier-1 still to fix automatic dump (previously done by hand).
    06/06: As well as fixing the RAL Tier-1 dump format, we also need the aforementioned change in the Rucio Consistency checker.
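The two conventions noted above (dump files live under "dumps/" with a dump_YYYYMMDD name, and paths inside the dump are relative to the "rucio" directory) can be sketched as follows. This is a minimal illustration, not the script any site actually runs; the function names and the example file path are hypothetical.

```python
from datetime import date

RUCIO_DIR = "rucio"  # listed paths are given relative to this directory


def dump_url(spacetoken_url: str, day: date) -> str:
    """Expected dump location: <spacetoken>/dumps/dump_YYYYMMDD."""
    return f"{spacetoken_url.rstrip('/')}/dumps/dump_{day:%Y%m%d}"


def relative_to_rucio(path: str) -> str:
    """Strip everything up to and including the first '/rucio/' component."""
    _head, sep, tail = path.partition(f"/{RUCIO_DIR}/")
    if not sep:
        raise ValueError(f"path is not under a '{RUCIO_DIR}' directory: {path}")
    return tail


if __name__ == "__main__":
    base = "root://fal-pygrid-30.lancs.ac.uk/dpm/lancs.ac.uk/home/atlas/atlaslocalgroupdisk"
    print(dump_url(base, date(2019, 4, 29)))
    # Hypothetical file path, for illustration only.
    print(relative_to_rucio(
        "/dpm/lancs.ac.uk/home/atlas/atlaslocalgroupdisk/rucio/scope/ab/cd/file.root"
    ))
```

With the Lancaster space token from 02/05, dump_url gives the corrected location under "dumps/" rather than under "rucio/".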

7. XRootD/HTTP transfers (Alessandra/Tim)
    16/05: DOMA TPC group looking for ATLAS sites to enable XRootD/HTTP for test-production transfers.
        Alessandra reported that Lancaster and Brunel ready for this. Could also include RAL using a test RSE.
    06/06: Alessandra reported that Mario is working on configuring functional tests for XRootD TPC.
        Had Lancaster running, but it broke. Also need monitoring.
        Concerning the fix needed to get it working with AGIS/Rucio, Cedric and Martin say it isn't easy. We would like to understand what the problem is.
    13/06: Alessandra made changes in AGIS for Man, Lancs and Ox last week. This week she will change storage settings for Brunel.
    20/06: To get TPC working, we need sites to upgrade to latest storage. DPM needs DOME. dCache also needs upgrade. also have problems.
    Dan: reported on StoRM at QMUL: made changes to a server, so should be more reliable. WebDAV running on a single server. Need to move to a bigger server. Will do after ARC-CEs upgraded to CentOS7 (ie. mid-summer).
    Manchester, Lancaster, Oxford, Brunel mostly OK. Will add RAL Tier-1.
    XRootD/HTTP support needs to be added to RAL FTS for use by ATLAS UK sites. This could be done once the gfal update is in EPEL production (~2 weeks). Andrea will create a JIRA.
    Tim will discuss FTS update with Alastair, who is reluctant.

News:

Alessandra: Will add Stewart to the people who can approve ATLAS UK LOCALGROUPDISK R2D2 transfer requests. This is defined in VOMS. Reminded us to pay attention to these requests.
Brian: NTR
Elena: NTR
Gareth: NTR
Matt: NTR
Sam: NTR
Tim: Stewart removed some old RAL Tier-1 PanDA queues from AGIS - previously DISABLED or OFFLINE.

AOB:

Next week is the ATLAS Software and Computing Week in New York. Alessandra and Tim will be there. Stewart, Elena, and Peter still in the UK, so we will hold a meeting next week, chaired by Stewart.
