ATLAS UK Cloud Support meeting minutes, 21 March 2019
Present: Brian Davies, Dan Traynor, Duncan Rand, Gareth Roy, Matt Doidge, Peter Love, Sam Skipsey, Vip Davda
Outstanding tickets:
* ggus
140305 UKI-SCOTGRID-GLASGOW: Transfer errors with "Operation timed out"
A disk filled up, so all transfers are going to the same disk, putting too many gridftp and xrootd transfers on one server. Weights have been adjusted, and the xrootd transfers will eventually finish. DPM can't throttle xrootd or gridftp, but gridftp can be turned off for a while until xrootd recovers.
Removing secondary data won't help: the disk holds mostly primary data, and we don't want to increase the load with more data management.
* ggus
140103 UK UKI-NORTHGRID-LANCS-HEP: transfer and staging errors with "DESTINATION OVERWRITE srm-ifce err"
Matt fixed a discrepancy in the disk server's gridftp configs. Disabled the "nucleus" setting; will ask Elena to re-enable it when ready.
* ggus
139723 permissions on scratchdisk
Ongoing development to get rucio mover working with RAL Echo.
* ggus
138033 singularity jobs failing at RAL
A HOME directory is now set for Condor jobs. Will close the ticket if Alessandra is happy.
* ggus
140134 UKI-SOUTHGRID-OX-HEP Unspecified gridmanager error: Code 0 Subcode 0
One of the SE pool nodes crashed. Back up again, and no more errors.
Ongoing issues (new comments marked in green):
1. LocalGroupDisk consolidation (Tim)
* most >2-year-old datasets were removed on 4 July. This should free 488 TB. Tim will check the rest.
* Tim: R2D2 has default lifetime of 180 days (can be changed when making request).
At S&C Week, Cedric suggested a "subscribe to dataset" option to get email notifications. I think it would be more useful to do this when creating a rule, with a notification when LOCALGROUPDISK holds the last replica.
* Brian: do we really need to keep 10% free space? Yes: it prevents the disk filling up and leaves room for unexpected increases. It is not necessary for sites to also provide a safety margin.
* Tim has asked Roger if we should rebalance more T2 disk from LOCALGROUPDISK to DATADISK. Will prod again.
10/01: Cleanup run is due sooner rather than later. RHUL increased space recently.
17/01: Is 180 days still the limit? We had mixed results when trying it in the meeting.
Kashif: how can a site tell which users own files on its LOCALGROUPDISK? Currently using dmlite-shell.
24/01: Tim sent instructions last week to the atlas-support-cloud-uk list for checking Rucio (don't need to be an ATLAS user).
Kashif: many users have left Oxford, so they are difficult to contact. He will contact the ATLAS Oxford group leader.
Dan asked about user quotas on LOCALGROUPDISK. Some users seem to have individual quotas; they shouldn't.
2. Harvester issues (Peter/Alessandra/Elena)
28/02: Migration almost complete. SL6 queues were kept on APF because they will go away soon anyway.
Remaining APF factories are listed here: http://apfmon.lancs.ac.uk/aipanda157 .
In the end only analysis queues should remain on APF.
Also think about moving test queues; there should be a separate Harvester instance to allow testing.
14/03:
Lancs: Peter configured Lancs in AGIS to use GridFTP
Brunel: Elena made changes in AGIS and Brunel is now set online. She will see how Lancaster's GridFTP setup is going, then try to add GridFTP to Brunel and disable SRM.
QMUL: Elena removed a queue for the Event Service; QMUL is OK now. Elena will ask a Harvester expert how to use high-memory queues.
21/03: Elena (from email attached to Indico):
1. I'll configure Sussex in AGIS to use QMUL storage
2. Raul will think about deployment of XCache in Brunel.
4. Nicolo says "If the QMUL EventService queue is a different queue in the local batch system, then yes, it will need a new PanDAQueue in AGIS. [... but] the Lancaster EventService queue UKI-NORTHGRID-LANCS-HEP_ES is actually not configured to run EventService anymore."
Lancaster's was just a development queue; Peter will deactivate it now.
QMUL has a pre-emption queue set up for the event service, so we should give it a separate PanDA queue.
QMUL provides 4 GB per job slot, so it used to have a separate high-memory PanDA queue; now Harvester handles this in the UCORE queue. Dan will ask Elena to have the Harvester experts check the settings.
3. Bham XCache (Mark/Elena)
14/3: Following discussion at the Jamboree, Mark agreed to install XCache in Bham. Mark asked for support, and Ilija and Rod will help him install it. Mark can either deploy it himself (on bare metal or in a Singularity/Docker container) or deploy a Kubernetes cluster with SLATE, with Elena doing the rest: keeping it up and running, updated, monitored, and set up in AGIS.
Sam suggested discussing the Bham XCache setup at the Storage meeting next week. Elena will email Mark.
21/03: Sam: Mark has a lot of experience setting up xrootd, so the XCache setup should be simple for him. He will set up XCache at Birmingham and has agreed to be on-call. Mark can then advise other sites, e.g. with Puppet. We won't use SLATE.
Brian: ATLAS UK should do any needed AGIS setup.
Sam: will need to understand how ATLAS uses XCache.
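For orientation, an XCache instance is an ordinary xrootd server with the proxy file cache plugin loaded, so the core of the setup is only a handful of config directives. The sketch below is purely illustrative and not Birmingham's actual configuration: the origin host, cache path, and disk-usage watermarks are placeholders.

```
# Illustrative XCache sketch only -- NOT Birmingham's actual config.
# Origin host, cache path, and watermarks below are placeholders.
all.export /
ofs.osslib    libXrdPss.so        # run as a proxy server
pss.cachelib  libXrdFileCache.so  # enable the file cache plugin
pss.origin    atlas-redirector.example.org:1094
oss.localroot /data/xcache        # where cached blocks live on disk
pfc.diskusage 0.90 0.95           # start/stop purging at these fill levels
```

The real deployment would follow the ATLAS/xrootd XCache documentation, which is part of what Sam means by understanding how ATLAS uses XCache.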
4. CentOS7 migration (Alessandra)
The UK is quite advanced. Sites with PBS will have to switch to something else.
Alessandra will follow progress for UK (also coordinating for the rest of ATLAS).
31/01: RALPP is moving to CentOS7-only. Elena will check queues.
14/02: Elena switched RALPP UCORE PanDA queue from SL6 to CentOS7. Analysis C7 queue still has problems.
Kashif: Oxford has moved 2/3 of its worker nodes to the C7 queue. The SL6 queues can be deleted in a few weeks' time.
Vip: Jeremy has a web page with progress: https://www.gridpp.ac.uk/wiki/Batch_system_status . We should keep that up to date, but also need to monitor progress for the ATLAS migration.
28/02: At Vip's request, during the meeting Elena disabled UKI-SOUTHGRID-OX-HEP_CLOUD, UKI-SOUTHGRID-OX-HEP_UCORE, ANALY_OX_TEST, and UKI-SOUTHGRID-OX-HEP_TEST. Just have CentOS7 queues, UKI-SOUTHGRID-OX-HEP_SL7_UCORE and ANALY_OX_SL7.
21/03: Glasgow still to do.
5. Diskless sites
21/02: Sheffield will have to reduce to 150TB, but has 800 job slots. This is a good candidate for going diskless (no buffer to start with), accessing data from Lancaster and/or Manchester.
Birmingham should have 600 TB (700-800TB next year). We can use this! It is large enough for a DATADISK.
Agreed to discuss with ATLAS at the Sites Jamboree (5-8 March). We can then bring it back to GridPP.
28/02: Brian: We need more experience from Birmingham and Manchester monitoring.
14/03: From Jamboree:
Bham: Mark has agreed to set up XCache in Bham
Shef: when Mark switches to XCache, Sheffield can use Manchester storage
Cam and Sussex: Dan is happy for Cam and Sussex to use QM storage
Brunel: Dan is not sure about Brunel. Elena will ask Raul
LOCALGROUPDISK at diskless sites: Elena will send email to ATLAS
21/03: Elena (email):
3. Cedric says "diskless sites can keep LOCALGROUPDISK [...] if the UK cloud takes care of contacting the sites if they have problems."
Decided we want to keep LOCALGROUPDISK for the time being.
Dan: will this go away? There is no GridPP funding, so it will only be kept long term if sites choose to fund it.
Sam: LOCALGROUPDISK is supposed to be owned by ATLAS UK, but is mostly actually used by local users.
6. Dumps of ATLAS namespace (Brian)
14/03: Brian noticed that some sites have stopped providing dumps of the ATLAS namespace: their scripts stopped working after a DPM upgrade. He has contacted the sites and will open a JIRA ticket.
ATLAS documentation here: https://twiki.cern.ch/twiki/bin/view/AtlasComputing/DDMDarkDataAndLostFiles#Automated_checks_Site_Responsibi
Sites are now making the dumps. Work is ongoing for RAL Tier1 Echo, which needs a Rucio change.
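Conceptually such a dump is just a flat list of every file the storage knows about, which ATLAS then compares against the Rucio catalogue to find dark or lost data. A hedged sketch for a POSIX-mounted namespace follows; the paths are hypothetical, and real DPM or Echo sites use their own storage-specific dump tools rather than a filesystem walk.

```python
"""Minimal sketch of a storage namespace dump: walk a POSIX-mounted
namespace and write one "path size" line per file. Illustrative only --
DPM and Echo sites use dedicated, storage-specific dump tools."""
import os
import sys

def dump_namespace(root, out):
    # One line per file: full path and size in bytes.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            out.write("%s %d\n" % (path, os.path.getsize(path)))

# Usage (hypothetical mount point):
#   python dump.py /dpm/example.org/home/atlas > dump.txt
if len(sys.argv) > 1:
    dump_namespace(sys.argv[1], sys.stdout)
```

The resulting flat file is what the consistency checks in the TWiki page above consume.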
News:
Brian: Would be nice to have an update from WLCG/OSG meeting.
Dan:
Enabled the production SE for third-party copies via WebDAV. Now doing stress tests.
Moving to C7.
Issues with data transfers from CNAF (LHCb + ATLAS)
Gareth:
IPv6 problems.
Plans for CentOS7.
New equipment needs commissioning.
Want to get rid of DPM. Testing Ceph (1.5-2PB), with help from RAL.
Peter:
Switched off all production APF; only fringe test cases remain.
Glasgow analysis queue has faults: http://apfmon.lancs.ac.uk/q/ANALY_GLASGOW_SL6 . Missing some information.
Please let Peter know if there are any issues with Harvester monitoring.
Sam: NTR
Tim:
Added 1.25 PB to RAL DATADISK quota; now 9.45 PB. An additional 1 PB on 1 April.
So far no more files lost to the FTS double-copy bug. Following a Rucio consistency check, 73 files from before were marked lost.
Changed the DDM site name in AGIS from RAL-LCG2-ECHO to RAL-LCG. Moved the Nucleus.