ATLAS UK Cloud Support

Tim Adye (Science and Technology Facilities Council STFC (GB))
ATLAS UK Cloud Support meeting minutes, 21 March 2019

Present: Brian Davies, Dan Traynor, Duncan Rand, Gareth Roy, Matt Doidge, Peter Love, Sam Skipsey, Vip Davda

Outstanding tickets:

* ggus 140305 UKI-SCOTGRID-GLASGOW: Transfer errors with "Operation timed out"
    Disk full, so all transfers go to the same disk, which puts too many gridftp and xrootd transfers on one server. Weights have been adjusted, and eventually the xrootd transfers will finish. DPM cannot throttle xrootd or gridftp separately, but gridftp can be turned off for a while until xrootd recovers.
    Removing secondary data won't help: the disk is mostly primary data, and we don't want to increase the load with more data management.
* ggus 140103 UK UKI-NORTHGRID-LANCS-HEP: transfer and staging error with "DESTINATION OVERWRITE srm-ifce err"
    Matt fixed a discrepancy in the disk server's gridftp configs. Disabled the "nucleus" setting; will ask Elena to re-enable it when ready.
* ggus 139723 permissions on scratchdisk
    Ongoing development to get rucio mover working with RAL Echo.
* ggus 138033 singularity jobs failing at RAL
    A HOME directory is now set for Condor jobs. The ticket will be closed if Alessandra is happy.
* ggus 140134 UKI-SOUTHGRID-OX-HEP Unspecified gridmanager error: Code 0 Subcode 0
    One of the SE pool nodes crashed. Back up again, and no more errors.

Ongoing issues (new comments are prefixed with the date they were added):

1. LocalGroupDisk consolidation (Tim)
  * Most datasets older than 2 years were removed on 4 July; this should free 488 TB. Tim will check the rest.
  * Tim: R2D2 rules have a default lifetime of 180 days (this can be changed when making the request).
        At S&C Week, Cedric suggested a "subscribe to dataset" option to get email notifications. I think it would be more useful to do this when creating the rule, with a notification when the LOCALGROUPDISK copy is the last replica.
  * Brian: do we really need to keep 10% free space? It prevents the disk from filling up and leaves room for unexpected increases, so it is not necessary for sites to also provide a safety margin.
  * Tim has asked Roger if we should rebalance more T2 disk from LOCALGROUPDISK to DATADISK. Will prod again.
    10/01: A cleanup run is due sooner rather than later. RHUL increased its space recently.
    17/01: Is 180 days still the limit? We had mixed results when trying it in the meeting.
    Kashif: how can we tell users about their files on a site's LOCALGROUPDISK? Currently using dmlite-shell.
    24/01: Tim sent instructions last week to the atlas-support-cloud-uk list for checking this in Rucio (you don't need to be an ATLAS user); see the sketch after this item for one way to do it with the Rucio client.
    Kashif: many users have left Oxford, so they are difficult to contact. He will contact the ATLAS Oxford group leader.
    Dan asked about user quotas on LOCALGROUPDISK. Some users seem to have individual quotas; they shouldn't.
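    A minimal sketch of one way to do this check with the Rucio Python client (not necessarily what Tim's instructions use): list the datasets held on a LOCALGROUPDISK endpoint together with the Rucio accounts whose rules keep them there, so the owners can be contacted. It assumes a valid grid proxy and a configured ATLAS Rucio client; the RSE name is only an example. The equivalent CLI commands are "rucio list-datasets-rse" and "rucio list-rules".

        # Sketch only: list datasets on a LOCALGROUPDISK and the accounts that
        # own rules for them there.  The RSE name below is an example.
        from rucio.client import Client

        RSE = "UKI-SOUTHGRID-OX-HEP_LOCALGROUPDISK"   # example endpoint

        client = Client()
        for ds in client.list_datasets_per_rse(RSE):
            # Rules pinning the dataset to this RSE tell us which account to contact.
            owners = {rule["account"]
                      for rule in client.list_did_rules(ds["scope"], ds["name"])
                      if RSE in (rule.get("rse_expression") or "")}
            print("%s:%s  owners: %s" % (ds["scope"], ds["name"],
                                         ", ".join(sorted(owners)) or "none"))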

2. Harvester issues (Peter/Alessandra/Elena)
    28/02: Migration almost complete. The SL6 queues were kept on APF because they will be going away anyway.
    Remaining APF factories: http://apfmon.lancs.ac.uk/aipanda157
    In the end only analysis queues should remain on APF.
    Also think about moving the test queues; we should have a separate Harvester instance to allow testing.
    14/03:
    Lancs: Peter configured Lancs in AGIS to use GridFTP.
    Brunel: Elena made changes in AGIS and Brunel is now set online. She will see how Lancaster is getting on with its GridFTP setup, then try to add GridFTP to Brunel and disable SRM.
    QMUL: Elena removed a queue for the Event Service (ES). QMUL is OK now. Elena will contact the Harvester experts about how to use a high-memory queue.
    21/03: Elena (from the email attached to the Indico agenda, reproduced at the end of these minutes):
        1. I'll configure Sussex in AGIS to use QMUL
        2. Raul will think about deployment of XCache in Brunel.
        4. Nicolo says "If the QMUL EventService queue is a different queue in the local batch system, then yes, it will need a new PanDAQueue in AGIS. [... but] the Lancaster EventService queue UKI-NORTHGRID-LANCS-HEP_ES is actually not configured to run EventService anymore."
    The Lancaster ES queue was only a development queue; Peter will deactivate it now.
    QMUL has a pre-emption queue set up for the Event Service, so we should give it a separate PanDA queue.
    QMUL provides 4 GB per job slot, so it used to have a separate high-memory PanDA queue; Harvester now handles this within the UCORE queue. Dan will ask Elena to ask the Harvester experts to check the settings (see the sketch below for one way to inspect a queue's AGIS parameters).
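    A minimal sketch, not an agreed procedure, of one way to inspect those queue parameters: fetch the AGIS PanDA-queue JSON export and print a few scheduling fields for the QMUL queues. The endpoint URL, the shape of the response, and the field names are recalled from memory and should be treated as assumptions to check against the AGIS documentation.

        # Sketch only: print selected AGIS scheduling parameters for QMUL queues.
        # The URL and field names below are assumptions, not verified values.
        import requests

        AGIS_URL = "http://atlas-agis-api.cern.ch/request/pandaqueue/query/list/?json"  # assumed endpoint

        data = requests.get(AGIS_URL, timeout=60).json()
        # The export may be a dict keyed by PanDA queue name, or a list of queue records.
        records = data.items() if isinstance(data, dict) else ((q.get("name"), q) for q in data)
        for name, cfg in records:
            if name and "QMUL" in name:
                print(name,
                      "corecount:", cfg.get("corecount"),
                      "maxrss:", cfg.get("maxrss"),
                      "resource type:", cfg.get("capability") or cfg.get("resource_type"))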


3. Bham XCache (Mark/Elena)
    14/03: Following discussion at the Jamboree, Mark agreed to install XCache at Bham. Mark asked for support, and Ilija and Rod will help him install it. Mark can either deploy it himself (on bare metal or in a Singularity/Docker container), or deploy a Kubernetes cluster with SLATE and Elena does the rest: keeping it up and running, updated, monitored and set up in AGIS.
    Sam suggested discussing the Bham XCache setup at the Storage meeting next week. Elena will send an email to Mark.
    21/03: Sam: Mark has a lot of experience setting up xrootd, so the XCache setup should be simple for him. He will set up XCache at Birmingham and has agreed to be on-call. Mark can then advise other sites, e.g. with Puppet. We won't use SLATE.
    Brian: ATLAS UK should do any needed AGIS setup.
    Sam: will need to understand how ATLAS uses XCache.


4. CentOS7 migration (Alessandra)
    The UK is quite advanced. Sites with PBS will have to switch to something else.
    Alessandra will follow progress for UK (also coordinating for the rest of ATLAS).
    31/01: RALPP is moving to CentOS7-only. Elena will check queues.
    14/02: Elena switched RALPP UCORE PanDA queue from SL6 to CentOS7. Analysis C7 queue still has problems.
    Kashif: Oxford has moved 2/3 of its worker nodes to the C7 queue. The SL6 queues can be deleted in a few weeks' time.
    Vip: Jeremy has a web page tracking progress: https://www.gridpp.ac.uk/wiki/Batch_system_status . We should keep that up to date, but we also need to monitor progress of the ATLAS migration.
    28/02: At Vip's request, during the meeting Elena disabled UKI-SOUTHGRID-OX-HEP_CLOUD, UKI-SOUTHGRID-OX-HEP_UCORE, ANALY_OX_TEST, and UKI-SOUTHGRID-OX-HEP_TEST. Just have CentOS7 queues, UKI-SOUTHGRID-OX-HEP_SL7_UCORE and ANALY_OX_SL7.
    21/03: Glasgow still to do.

5. Diskless sites
  21/02: Sheffield will have to reduce to 150 TB, but has 800 job slots. This is a good candidate for going diskless (no buffer to start with), accessing data from Lancaster and/or Manchester.
    Birmingham should have 600 TB (700-800 TB next year). We can use this! It is large enough for a DATADISK.
    Agreed to discuss with ATLAS at the Sites Jamboree (5-8 March). We can then bring it back to GridPP.
    28/02: Brian: We need more experience from Birmingham and Manchester monitoring.
    14/03: From Jamboree:
    Bham: Mark has agreed to set up XCache at Bham.
    Shef: when Mark switches to XCache, Sheffield can use Manchester storage.
    Cam and Sussex: Dan is happy for Cam and Sussex to use QM storage.
    Brunel: Dan is not sure about Brunel. Elena will ask Raul.
    LOCALGROUPDISK at diskless sites: Elena will send an email to ATLAS.
    21/03: Elena (email):
        3. Cedric says "diskless sites can keep LOCALGROUPDISK [...] if the UK cloud takes care of contacting the sites if they have problems."
    We decided that we want to keep LOCALGROUPDISK for the time being.
    Dan: will this go away? There is no GridPP funding, so it will only be kept long term if sites choose to fund it.
    Sam: LOCALGROUPDISK is supposed to be owned by ATLAS UK, but it is mostly actually used by local users.


6. Dumps of ATLAS namespace (Brian)
    14/03: Brian noticed that some sites have stopped providing dumps of the ATLAS namespace. The scripts stopped working after the DPM upgrade, and he has contacted the sites. Brian will open a JIRA ticket.
    ATLAS documentation here: https://twiki.cern.ch/twiki/bin/view/AtlasComputing/DDMDarkDataAndLostFiles#Automated_checks_Site_Responsibi
    Sites are now making the dumps. Work is ongoing for RAL Tier1 Echo, which needs a Rucio change. (A sketch of the flat dump format follows below.)
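    A minimal illustrative sketch of the kind of flat storage dump the DDM dark-data/lost-file checks consume: essentially one file path per line, produced periodically and made available to ATLAS. DPM sites would normally generate this from the DPM database or dmlite tooling rather than a filesystem walk, and the namespace root, output location and filename pattern below are assumptions rather than the agreed conventions (see the twiki above for those).

        # Sketch only: write one storage path per line into a dated, gzipped dump.
        # Paths and naming below are placeholders, not the agreed ATLAS conventions.
        import gzip
        import os
        from datetime import date

        STORAGE_ROOT = "/dpm/example.ac.uk/home/atlas"                # hypothetical namespace root
        DUMP_FILE = "dump_%s.gz" % date.today().strftime("%Y%m%d")    # hypothetical naming

        with gzip.open(DUMP_FILE, "wt") as out:
            for dirpath, _dirnames, filenames in os.walk(STORAGE_ROOT):
                for name in filenames:
                    out.write(os.path.join(dirpath, name) + "\n")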


News:
Brian: it would be nice to have an update from the WLCG/OSG meeting.
Dan:
    Enabled the production SE for third-party copies via WebDAV. Now doing stress tests.
    Moving to C7.
    Issues with data transfers from CNAF (LHCb and ATLAS).
Gareth:
    IPv6 problems.
    Plans for CentOS7.
    New equipment needs commissioning.
    Want to get rid of DPM. Testing Ceph (1.5-2 PB), with help from RAL.
Peter:
    Switched off all production APF factories; only fringe test cases remain.
    The Glasgow analysis queue has faults (http://apfmon.lancs.ac.uk/q/ANALY_GLASGOW_SL6); some information is missing.
    Please let Peter know if there are any issues with Harvester monitoring.
Sam: NTR
Tim:
    Added 1.25 PB to RAL DATADISK quota. Now 9.45 PB. Additional 1 PB on 1 April.
    So far no more files have been lost to the FTS double-copy bug. Following a Rucio consistency check, 73 files from before were marked lost.
    Changed the DDM site name in AGIS from RAL-LCG2-ECHO to RAL-LCG. Moved the Nucleus.
Attachment: News from Elena
        From: Elena Korolkova <e.korolkova@sheffield.ac.uk>
        Date: 20/03/2019, 09:17

         

        News:
        
        1. I'll configure Sussex in AGIS to use QMUL
        
        2. Raul  will think about deployment of XCache in Brunel.
        
        3. On 19 Mar 2019, at 15:27, Cedric Serfon <cedric.serfon@cern.ch> wrote:
        
        
        Hi,
        
        Diskless sites can keep LOCALGROUPDISK if they want as long as somebody takes care of them, which is often not the case. c.f. Peter's CRC report, last week the errors plots were dominated by errors due to 2 T3s (one that got full and one that had his storage that crashed for more than 2 days). So if the UK cloud takes care of contacting the sites if they have problems, that's fine on DDM side.
        
        Cheers,
        Cedric
        
        ________________________________________
        From: Elena Korolkova [e.korolkova@sheffield.ac.uk]
        Sent: 19 March 2019 14:53
        To: atlas-adc-expert (Atlas Distributed Computing Manager on Duty and Experts)
        Cc: atlas-support-cloud-uk (ATLAS support contact for UK cloud)
        Subject: ATLAS Diskless sites and LOCALGROUPDISK
        
        Dear experts,
        
        some of the UK sites that are on the list of ATLAS Diskless site candidates have  LOCALGROUPDISK >30 TB. The LOCALGROUPDISKs are actively used by UK atlas users.  LOCALGROUPDISK in the UK are full in most sites and getting rid of LOCALGROUPDISK  in the Diskless site candidates make the situation with storage more complicated.
        
        What is your position on the LOCALGROUPDISK in this situation.
        
        Thanks
        Elena
        
        4. On 19 Mar 2019, at 16:46, Nicolo Magini <Nicolo.Magini@cern.ch> wrote:
        
        
        Hi Elena
        
        Il 19/03/2019 15:33, Elena Korolkova ha scritto:
        
        Dear harvest expert,
        
        UKI-LT2-QMUL has a special local queue to support ES. Do we need to setup a special ES queue for QMUL (like https://atlas-agis.cern.ch/agis/pandaqueue/detail/UKI-NORTHGRID-LANCS-HEP_ES)?
        
        If the QMUL EventService queue is a different queue in the local batch
        system, then yes, it will need a new PanDAQueue in AGIS. Please let us
        know if you need help with the configuration.
        
        Regards
        
        Nicolò
        
        P.S. on the other hand - from what I can tell, the Lancaster
        EventService queue UKI-NORTHGRID-LANCS-HEP_ES is actually not configured
        to run EventService anymore. This queue is only running ordinary jobs -
        could you please check with Lancaster admins if this is intended? Maybe
        it's a config mistake, or maybe it's just a historical leftover and we
        can merge the queue into the new Lancaster UCORE queue?
        