ATLAS UK Cloud Support

Europe/London
Vidyo

Tim Adye (Science and Technology Facilities Council STFC (GB)), Stewart Martin-Haugh (Science and Technology Facilities Council STFC (GB))
 
 
Topics to discuss... will be updated after the meeting.
Present: AF, BD, DT, EK, GR, MD, SS, VD
Apologies: TA, SMH
 
 
Outstanding tickets:

* ggus 141675 ECDF: Multiple evicted jobs
   Rob is working on it
* ggus 141651 Atlas jobs are failing at UKI-SCOTGRID-GLASGOW_UCORE [MCORE]
 2% of jobs are failing due to the problem; can be closed

* ggus 141706 UKI-NORTHGRID-LANCS-HEP: deletion errors with "The requested service is not available at the moment": a load spike on DPM yesterday evening seemed to affect connectivity.
* ggus 141707 UKI-NORTHGRID-LIV-HEP: deletion errors with "The requested service is not available at the moment":  A spike in activity caused one of the servers to be overloaded for some time. Deletion requests were timing out.
* ggus 141571 UKI-SOUTHGRID-OX-HEP "could not open connection" transfer errors : The problem is still there

* ggus 141571 UKI-SOUTHGRID-OX-HEP Failed job from task. Solved

Ongoing issues (new comments marked in green):

1. Bham XCache (Mark/Elena)
    14/3: Following discussion at the Jamboree, Mark agreed to install XCache at Bham. Mark asked for support; Ilija and Rod will help him install XCache. Mark can either deploy it himself (on bare metal or in a Singularity/Docker container) or deploy a Kubernetes cluster with SLATE and Elena does the rest: keeping it up and running, updated, monitored, and set up in AGIS.
    Sam suggested discussing the Bham XCache setup at the Storage meeting next week. Elena will send an email to Mark.
    21/03: Sam: Mark has a lot of experience setting up xrootd, so XCache setup should be simple for him. He will set up XCache at Birmingham and agreed to be on-call. Mark can then advise other sites, eg. with Puppet. We won't use SLATE.
    Brian: ATLAS UK should do any needed AGIS setup.
    Sam: will need to understand how ATLAS uses XCache.
    04/04: Birmingham has XCache set up. Still need changes in AGIS, eg. to create an RSE. Alessandra will do it.
    18/04: Alessandra has been looking at how to create the RSE. Will need to contact the DDM experts to understand what to change in AGIS. Perhaps there is already an XCache RSE defined at ECDF that could be used as a model.
    02/05: Sam reported on the work, mainly by Mark and Alessandra. Some movement on configuring AGIS, but it broke things. Mario is a bit reluctant because it needs new Rucio features.
    There are two modes we'd like to try: in one, the cache is a proxy on site and ATLAS knows nothing about it; the other mode (beta) uses Rucio volatile storage. Some confusion as to which one we wanted to test. We (UK) would be happy with a simple proxy, but ATLAS wants to test volatile storage.
    09/05: Brian is debugging XCache with xrdcp (see the sketch after this item). He was confused by a "no such file" message in the debug log; this is normal for a successful transfer.
    16/05: Brian: can xrdcp via Birmingham XCache from other sites, but problem with RAL Echo. Have a fix for RAL Echo, verified with a test gateway.
    Alessandra: RAL Echo via XCache will be needed eventually. Brian will put in a JIRA ticket.
    Alessandra: Rucio not yet ready to use XCache RSE. See https://its.cern.ch/jira/browse/ADCINFR-121 (and linked https://github.com/rucio/rucio/issues/2511)
    30/05: Brian will check after Alessandra's storage update.
    06/06: Sam said that Teng will set up XCache monitoring for Mark, based on what he did at ECDF.
    Alessandra checked a few days ago. Stefan had changed the fairshare. Should be running more jobs now. These jobs use the XCache, so we can start to see how they do (especially once the XCache monitoring is available).

13/06: Jobs are running fine, with 2.5% failing. Keep watching.
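
For reference, a minimal sketch (in Python, shelling out to xrdcp) of the kind of read test described in the 09/05 and 16/05 entries above. The host names, port, and file path are placeholders, not the real Birmingham or origin endpoints; the debug setting is just the standard XRootD client log level.

#!/usr/bin/env python3
"""Sketch: read test through an XCache proxy.

All host names and the file path are placeholders, not the real
Birmingham/origin endpoints.
"""
import os
import subprocess

CACHE = "xcache.example-bham.ac.uk:1094"   # hypothetical XCache endpoint
ORIGIN = "se.example-origin.ac.uk:1094"    # hypothetical origin storage
PATH = "/atlas/rucio/tests/example.root"   # placeholder file path

# Direct-mode XCache URL: ask the cache host to fetch the file from the origin.
src = f"root://{CACHE}//root://{ORIGIN}/{PATH}"

# Turn on verbose client logging so the cache's local-disk probe is visible
# (the "no such file" message mentioned above appears here even on success).
env = dict(os.environ, XRD_LOGLEVEL="Debug")

result = subprocess.run(
    ["xrdcp", "--force", src, "/tmp/xcache_test.root"],
    env=env, capture_output=True, text=True,
)
print(result.stderr)
print("exit code:", result.returncode)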

2. Diskless sites
  21/02: Sheffield will have to reduce to 150TB, but has 800 job slots. This is a good candidate for going diskless (no buffer to start with), accessing data from Lancaster and/or Manchester.
    Birmingham should have 600 TB (700-800TB next year). We can use this! It is large enough for a DATADISK.
    Agreed to discuss with ATLAS at the Sites Jamboree (5-8 March). We can then bring it back to GridPP.
    28/02: Brian: We need more experience from Birmingham and Manchester monitoring.
    14/03: From Jamboree:
    Bham: Mark has agreed to set up XCache in Bham
    Shef: When Mark switches to XCache Sheffield can use Manchester storage
    Cam and Sussex: Dan is happy for Cam and Sussex to use QM storage
    Brunel: Dan is not sure about Brunel. Elena will ask Raul
    LOCALGROUPDISK at diskless sites: Elena will send email to ATLAS
    21/03: Elena (email):
        3. Cedric says "diskless sites can keep LOCALGROUPDISK [...] if the UK cloud takes care of contacting the sites if they have problems."
    Decided we want to keep LOCALGROUPDISK for the time being.
    Dan: will this go away? No GridPP funding, so only keep it long term if sites choose to fund it.
    Sam: LOCALGROUPDISK is supposed to be owned by ATLAS UK, but is mostly actually used by local users.
    Elena: should ask John from Cambridge when he wants to try to use QM storage.
    Sam: is this what we want? John was interested in seeing how Mark got on at Bham.
    28/03: Alessandra will do Sussex: CentOS7 upgrade and switching to diskless.
    Alessandra: Brunel still has storage for CMS, but just 33 TB for ATLAS. Don't use XCache.
    Brunel has now been set to run low-IO jobs and reduce data consolidation.
    Decided to keep as is (with 33 TB ATLAS disk) for the moment.
    Brian: wait to see how Birmingham gets on with XCache before trying to use it at Cambridge. Alternative would be to use another site so as not to overload QMUL.
    04/04: Mark will write up a procedure for installing XCache. Wait to test ATLAS setup at Birmingham before trying other sites (eg. Cambridge).
    Mario said there were two modes to test. Transparent cache puts a proxy in front of storage, so is not useful for our case. We want the buffer cache mode (Mario calls it "Volatile RSE"); see the sketch after this item.
    02/05: Lincoln Bryant was at RAL and demonstrated XCache setup using SLATE to Alastair, Brian, Tim, et al.
    16/05: Brian will try out SLATE at RAL.
    06/06: Elena said Sheffield was waiting. Need to finish with CentOS7 upgrade first.

   13/06: Cambridge (John Hill) was thinking about installing XCache. Elena will contact John to confirm the plans.

Reply from John:

 I had intended to raise this with ATLAS early next week, so you beat me to it :-) We certainly need to migrate away from an SE - all our storage (and the SE itself) is out of warranty and so can't be maintained for much longer. Also, we're down to ~200TB of usable storage, which makes the SE essentially useless to ATLAS. From the point of view of support, I'd prefer to go diskless. I mainly raised the possibility of XCache because of concerns about the excessive load which the storage host (i.e. QMUL) might experience if we went completely diskless.
   I am more than happy to follow whichever route looks best for ATLAS. However, as I am retiring at the end of September we need to find a solution which keeps the support effort as low as reasonably possible, as I am not at all clear what the support for the site will look like from October.
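
As background to the "Volatile RSE" (buffer cache) mode Mario mentions in the 04/04 entry above, a minimal sketch of how such an endpoint might be declared with the Rucio client API. The RSE name, protocol parameters, and the "is_cache" attribute are illustrative assumptions; the real AGIS/Rucio configuration is being worked out by the DDM experts (see item 1 and ADCINFR-121).

#!/usr/bin/env python3
"""Sketch: declare a volatile (cache) RSE with the Rucio client API.

All names and protocol parameters are placeholders; the real setup is
decided by the DDM experts via AGIS/Rucio.
"""
from rucio.client import Client

client = Client()

RSE = "UKI-EXAMPLE-BHAM_XCACHE"   # hypothetical RSE name

# 'volatile=True' marks the endpoint as a cache whose contents Rucio cannot rely on.
client.add_rse(RSE, volatile=True, deterministic=True)

# Attach a root protocol pointing at the cache host (placeholder values).
client.add_protocol(RSE, {
    "scheme": "root",
    "hostname": "xcache.example-bham.ac.uk",
    "port": 1094,
    "prefix": "/atlas/rucio/",
    "impl": "rucio.rse.protocols.xrootd.Default",
    "domains": {
        "lan": {"read": 1, "write": 0, "delete": 0},
        "wan": {"read": 1, "write": 0, "delete": 0},
    },
})

# Hypothetical attribute so workflows can tell cache endpoints from DATADISKs.
client.add_rse_attribute(RSE, "is_cache", True)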

 

 

3. CentOS7 migration (Alessandra)
    UK quite advanced. Sites with PBS will have to switch to something else.
    Alessandra will follow progress for UK (also coordinating for the rest of ATLAS).
    31/01: RALPP is moving to CentOS7-only. Elena will check queues.
    14/02: Elena switched RALPP UCORE PanDA queue from SL6 to CentOS7. Analysis C7 queue still has problems.
    Kashif: Oxford has 2/3 of worker nodes moved to C7 queue. Can delete SL6 queues in a few weeks' time.
    Vip: Jeremy has a web page with progress, https://www.gridpp.ac.uk/wiki/Batch_system_status . We should keep that up to date, but also need to monitor progress for ATLAS migration.
    28/02: At Vip's request, during the meeting Elena disabled UKI-SOUTHGRID-OX-HEP_CLOUD, UKI-SOUTHGRID-OX-HEP_UCORE, ANALY_OX_TEST, and UKI-SOUTHGRID-OX-HEP_TEST. Just have CentOS7 queues, UKI-SOUTHGRID-OX-HEP_SL7_UCORE and ANALY_OX_SL7.
    21/03: Glasgow still to do.
    28/03: Sheffield aim for deadline, which Alessandra confirmed was 1st June.
    04/04: Alessandra: about to send out email to all ATLAS sites to request status updates on TWiki. Tool to identify CentOS7 sites doesn't work, so keep manual list.
    Status known for all UK sites except RHUL.
    Andy McNab needs to change image for VAC sites for CentOS7 and make sure Singularity is accessible.
    18/04: Vip said that Oxford has now completed migration of all WNs. Alessandra will delete the SL6 PanDA queues.
    Alessandra: we should press sites to put CentOS image live for VAC.
    Gareth: Glasgow using VAC as a method of building a new site with CentOS7 for ATLAS-only. Will then turn off SL6 VAC. Part of lots of work for the new data centre.
    02/05: Elena is working on new ARC-CE connected to new Condor farm at Sheffield (upgrading all in one go). Oxford provided example config for AGIS.
    Vip: will switch off SL6 for other VOs (ATLAS was done months ago).
    09/05: Elena: CentOS7 PanDA queues setup for Durham and Sheffield. Durham works, but Sheffield has problems with the ARC-CE.
    Vip: suggested this might be due to a problem with the certificates in grid-security. He will advise Elena.
    Elena disabled an old SL6 MCORE PanDA queue for Liverpool, which has no more SL6 workers.
    16/05: Alessandra: Waiting for news from Andrew McNab for VAC sites: Birmingham, Cambridge, UCL. Liverpool has switched VAC off.
    To do: RHUL, QMUL, Durham (Sam said they were working on it last week).
    Have two weeks till the 1st June deadline. Alessandra prefers not to have to open lots of tickets for UK sites.
    23/05: Glasgow: VAC queue can go over to CC7 whenever needed; production queues are still CC6.
        Issues observed with HTCondor-CE when trying to move production queues to CC7.
        Unlikely to make the June deadline.
        VAC upgrade for CC7 ready at Manchester.
    30/05: Sheffield arcproxy crashing with the new ARC-CE and CentOS7 (understood and fixed after the meeting).
        Glasgow will move VAC when new VMs are ready
    06/06: Alessandra had sent a mail with link to table on progress:
https://twiki.cern.ch/twiki/bin/view/AtlasComputing/CentOS7Deployment
        Deadline was 1st June. Elena asked sites to update at Ops meeting. No one responded so far.
        Alessandra changed the status of 5 UK sites to "On Hold". 3 are VAC sites. She said Andrew is trying it at Manchester, but it doesn't seem to work yet. It seems to be a race condition, but they don't know where it comes from. She emailed Peter to look into it.
        Get VAC working at Manchester then roll it out at other sites.
        Alessandra asked Roger what to do about UCL, but didn't hear back after PMB.
        VAC still running on APF (and a special instance). Needs to move to Harvester. Peter should do it.
        Elena asked about problems with the ARC-CE at Sheffield. Alessandra will look at it.
        Elena is planning to make a list of PanDA queues for UK sites to see which old SL6 queues can be removed.

 13/06: RHUL is lacking staff and it is unknown when they will move to CentOS7.

 
4. Harvester issues (Peter/Alessandra/Elena)
    28/02: Migration almost complete. Kept SL6 queues on APF because they'd go away.
    Remaining APF factories listed here: http://apfmon.lancs.ac.uk/aipanda157 .
    In the end should only have analysis queues on APF.
    Also think about moving test queues. Should have a separate Harvester instance to allow testing.
    14/03:
    Lancs: Peter configured Lancs in AGIS to use GridFTP
    Brunel: Elena made changes in AGIS and now Brunel is set online. Will look at how Lancaster is doing with the GridFTP setup and try to add GridFTP to Brunel and disable SRM.
    QMUL: Elena removed a queue for ES. QMUL is OK now. Elena will contact the Harvester experts about how to use the high-memory queues.
    21/03: Elena (from email attached to Indico):
        1. I'll configure Sussex in AGIS to use QMUL  <--- Ale will do it
        2. Raul  will think about deployment of XCache in Brunel.  <----- NO don't need xcache for Brunel
        4. Nicolo says "If the QMUL EventService queue is a different queue in the local batch system, then yes, it will need a new PanDAQueue in AGIS. [... but] the Lancaster EventService queue UKI-NORTHGRID-LANCS-HEP_ES is actually not configured to run EventService anymore."
    Lancaster was just a development queue. Peter will deactivate it now.
    QMUL has a pre-emption queue setup for the event service, so we should set it up with a separate PanDA queue.
    QMUL provides 4 GB / job slot, so used to have separate high memory PanDA queue. Now Harvester handles it in the UCORE queue. Dan will ask Elena to ask Harvester experts to check settings.
    28/03: Elena: RALPP asked to disable SL6 analysis queue. There also seems to be a problem with AGIS storage setup. Elena asked Chris to check endpoints (screendump from AGIS).
    Elena: request to disable Sussex analysis queue. Done.
    Alessandra: Manchester still have problems with Harvester. If we want to maintain the current setup we will have to enable extra PanDA queues.
    04/04: Manchester problem resolved: workers full again.
    Was fixed by creating one PanDA queue for each cluster, since Harvester doesn't seem able to cope with two different batch systems. Disappointing that Harvester's promised unified queues are not always possible.
    Sussex work won't happen before May. Fabrizio said they are hiring someone.
    02/05: Sheffield Harvester migration at same time as CentOS7 etc.
    16/05: Fabrizio at European Strategy for Particle Physics meeting in Granada this week. Should have an update next week.
    30/05: Only Sussex still pending; RHUL needs to be checked.
    06/06: Alessandra talked to Fabrizio about Sussex. The new person starts in July. They would like to go disk-less, probably using QMUL.

13/06: Elena reviewed queues for the LT2 sites, Lancs, and Liverpool. QMUL has a mixture of SL6 and CentOS7 queues; Dan wants to keep them until he moves all the resources to CentOS7, so the QMUL queues will be revisited after that. Elena will continue to revisit the queues (see the sketch below).
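
To go with the queue review mentioned above (and Elena's plan under item 3 to list PanDA queues so old SL6 queues can be removed), a small sketch that pulls the PanDA queue list from the AGIS JSON API and prints the UK entries. The query preset and the field names used for filtering are assumptions and would need checking against the actual AGIS schema.

#!/usr/bin/env python3
"""Sketch: list UK PanDA queues from the AGIS JSON API.

The preset and the field names ('cloud', 'atlas_site', 'status') are
assumptions to be checked against the real AGIS schema.
"""
import json
from urllib.request import urlopen

AGIS_URL = ("http://atlas-agis-api.cern.ch/request/pandaqueue/"
            "query/list/?json&preset=schedconf.all")

with urlopen(AGIS_URL, timeout=60) as resp:
    queues = json.load(resp)

# Assumed response shape: a dict keyed by PanDA queue name.
for name, q in sorted(queues.items()):
    if q.get("cloud") != "UK":
        continue
    print(f"{name}\t{q.get('atlas_site')}\t{q.get('status')}")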

5. Dumps of ATLAS namespace (Brian)
    14/03: Brian noticed that some sites have stopped providing dumps of the ATLAS namespace. Scripts stopped working after the DPM upgrade; Brian contacted the sites and will open a JIRA ticket.
    ATLAS documentation here: https://twiki.cern.ch/twiki/bin/view/AtlasComputing/DDMDarkDataAndLostFiles#Automated_checks_Site_Responsibi
    Sites are now making the dumps. Working on RAL Tier1 Echo, which needs a Rucio change.
    28/03: Sam: needs to fine tune how to automate dumps.
    Matt: Lancaster still to do. Brian will prod him before next week.
    04/04: Brian will check what sites now have setup.
    The dump format is described here: https://twiki.cern.ch/twiki/bin/view/AtlasComputing/DDMDarkDataAndLostFiles#Storage_Element_Dumps . Paths should be relative to the "rucio" directory.
    18/04: Brian: still working on Glasgow and Lancaster.
    Sam says Glasgow now has a dump. There have been problems getting it to work as a cron job, due to xrootd permissions problems.
    02/05: Lancaster dump is in the wrong place (root://fal-pygrid-30.lancs.ac.uk/dpm/lancs.ac.uk/home/atlas/atlaslocalgroupdisk/rucio/dump_20190429). Should be root://fal-pygrid-30.lancs.ac.uk/dpm/lancs.ac.uk/home/atlas/atlaslocalgroupdisk/dumps/dump_20190429. Matt will fix it.
    Glasgow dump now available.
    09/05: All Tier-2s now have dump files.
        RAL Tier-1 still to fix automatic dump (previously done by hand).
    06/06: As well as fixing the RAL Tier-1 dump format, we also need the aforementioned change in the Rucio Consistency checker. (A sketch of a basic dump script follows below.)
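
For sites producing the dumps by hand, a minimal sketch of the kind of script involved, assuming the namespace is POSIX-mounted (the paths below are placeholders). It writes one path per line, relative to the "rucio" directory, into a dated dump file, following the format described on the TWiki linked above. Real sites may instead query the DPM database or use site-specific tooling.

#!/usr/bin/env python3
"""Sketch: produce a storage dump with paths relative to the 'rucio' directory.

Assumes a POSIX-mounted namespace (placeholder paths below).
"""
import os
from datetime import date

# Placeholders: the local path of the space token's rucio directory and the dump area.
RUCIO_ROOT = "/dpm/example.ac.uk/home/atlas/atlasdatadisk/rucio"
DUMP_DIR = "/dpm/example.ac.uk/home/atlas/atlasdatadisk/dumps"

dump_name = f"dump_{date.today():%Y%m%d}"
os.makedirs(DUMP_DIR, exist_ok=True)

with open(os.path.join(DUMP_DIR, dump_name), "w") as out:
    for dirpath, _dirnames, filenames in os.walk(RUCIO_ROOT):
        for fname in filenames:
            full = os.path.join(dirpath, fname)
            # Paths in the dump are relative to the 'rucio' directory.
            out.write(os.path.relpath(full, RUCIO_ROOT) + "\n")

print("wrote", os.path.join(DUMP_DIR, dump_name))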

6. XRootD/HTTP transfers (Alessandra/Tim)
    16/05: DOMA TPC group looking for ATLAS sites to enable XRootD/HTTP for test-production transfers.
        Alessandra reported that Lancaster and Brunel ready for this. Could also include RAL using a test RSE.
    06/06: Alessandra reported that Mario is working on configuring functional tests for XRootD TPC.
        Had Lancaster running, but it broke. Also need monitoring.
        Concerning the fix needed to get it working with AGIS/Rucio, Cedric and Martin say it isn't easy. We would like to understand what the problem is.

13/06: Alessandra made changes in AGIS for Man, Lancs, and Ox last week. This week she will change the storage settings for Brunel. (A sketch of a basic TPC test follows below.)
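
For context on what the functional tests exercise, a minimal sketch of a third-party copy between two XRootD endpoints using xrdcp's --tpc option. The host names and paths are placeholders, and the real functional tests Mario is configuring are driven centrally (Rucio/FTS), not by a standalone script like this.

#!/usr/bin/env python3
"""Sketch: trigger an XRootD third-party copy (TPC) between two endpoints.

Endpoints and paths are placeholders.
"""
import subprocess

SRC = "root://se.example-lancs.ac.uk:1094//atlas/rucio/tests/tpc_source.root"
DST = "root://se.example-man.ac.uk:1094//atlas/rucio/tests/tpc_dest.root"

# '--tpc only' asks the two storage endpoints to move the data directly,
# failing (rather than falling back to a copy via the client) if TPC is unsupported.
result = subprocess.run(
    ["xrdcp", "--tpc", "only", SRC, DST],
    capture_output=True, text=True,
)
print(result.stdout or result.stderr)
print("exit code:", result.returncode)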



News:

DT: GPU resources should be available via CentOS7 queues

GR: Glasgow is moving their servers in August. Working on a test CentOS7 system.

SS: Xrootd tests in Glasgow are under way

AOB:

    • 10:00-10:10  Outstanding tickets (10m)
    • 10:10-10:30  Ongoing issues (20m)
    • 10:30-10:50  News round-table (20m)
    • 10:50-11:00  AOB (10m)