RAL Tier1 Experiments Liaison Meeting

Europe/London
Access Grid (RAL R89)

Videoconference
Zoom Meeting ID
66811541532
Host
Alastair Dewhurst
    • 12:38 12:39
      Major Incidents Changes 1m
    • 12:39 12:40
      Summary of Operational Status and Issues 1m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB))
    • 12:40 12:41
      GGUS/RT Tickets 1m

      https://tinyurl.com/T1-GGUS-Open
      https://tinyurl.com/T1-GGUS-Closed

    • 12:41 12:42
      Site Availability 1m

      https://lcgwww.gridpp.rl.ac.uk/utils/availchart/

      https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T1_UK_RAL

      http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=RAL-LCG2&startTime=2020-01-29&endTime=2020-02-06&templateType=isGolden

    • 12:42 12:43
      Experiment Operational Issues 1m
    • 12:44 12:45
      VO-Liaison ATLAS 1m
      Speakers: James William Walder (Science and Technology Facilities Council STFC (GB)), Dr Tim Adye (Science and Technology Facilities Council STFC (GB))

      2022 Pledges:
      - TAPE already 'provided';
      - DISK awaiting hardware;
      - CPU awaiting TB.

      Tape challenge: a date for the T0 Export repeat test is still to be confirmed.

      Antares:

      - antares-tpc01 does not appear to work with TPC transfers (reason unknown), resulting in "Operation Expired" errors when archiving to Antares;
        - ~75% overall transfer efficiency;

      - xrootd gateways can trigger "Operation Expired" errors for recalls from Antares to Echo;

      - Updates to BNL FTS to force through more transfers (and try to reduce the pre-transfer staging eviction states).

    • 12:45 12:46
      VO Liaison CMS 1m
      Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))
      WebDAV tests are still failing when load is higher. There have been a few green days recently, when the load was light.

      I talked to the Rucio developer who works on the multihop transfers, and also had 2.5 hours of discussion with CERN CTA and FTS colleagues (including Steve Murray), some of which was about Antares specifically.

      For Rucio, multihop has been significantly redesigned in the forthcoming version 1.28 (CMS is currently using 1.27.x). The new design keeps the two parts of the transfer together throughout the lifetime of the job, even after a resubmit. Previously, once Rucio had determined a multihop path and submitted it to FTS, it no longer remembered that this was a multihop transfer.
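
      As an illustration only of the design difference described above, the following Python sketch models a transfer whose hops stay linked under one record across a resubmit. All class names, field names and site labels are hypothetical; this is not Rucio's actual data model or API.

```python
from dataclasses import dataclass, field, replace
from typing import List


@dataclass(frozen=True)
class Hop:
    """One leg of a transfer (hypothetical model, not Rucio's schema)."""
    source: str
    destination: str
    attempt: int = 1


@dataclass
class MultihopTransfer:
    """Keeps every hop of one logical transfer under a single record."""
    file_name: str
    hops: List[Hop] = field(default_factory=list)

    def resubmit(self, index: int) -> None:
        # Retry one hop in place. The hop stays in the same chain, so the
        # transfer is still known to be multihop after the resubmit.
        hop = self.hops[index]
        self.hops[index] = replace(hop, attempt=hop.attempt + 1)

    @property
    def is_multihop(self) -> bool:
        return len(self.hops) > 1


# Hypothetical site labels: tape source, intermediate buffer, final disk.
transfer = MultihopTransfer(
    "example.root",
    [Hop("TAPE_SITE", "BUFFER_SITE"), Hop("BUFFER_SITE", "DISK_SITE")],
)
transfer.resubmit(0)
```

      In this toy model the first hop is retried while the second hop and the link between them are untouched, which is the property the redesign is described as providing.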

      Between us we also figured out that the credential CMS-Rucio uses for FTS cancellations had expired, and this is why nothing ever got cancelled. This did affect the staging, as the same files had multiple FTS jobs associated with them. The FTS developer told me that having multiple jobs for the same file used to be prohibited by design, but the restriction caused a problem for database queries, so it was removed last year.

      I was already aware that an update to the EOS version would fix another problem, in which a bulk FTS request containing one genuinely missing file causes the entire bulk request to fail with a ‘file missing’ type error. I probably mentioned this in my GridPP talk, and I have now written an internal ticket to try to encourage this update to happen ASAP.
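
      A minimal sketch of the failure mode described above, contrasted with per-file error handling. The function names and the toy catalogue are hypothetical, and the per-file behaviour is an assumption about what a fix might look like, not a description of the EOS or FTS implementation.

```python
from typing import Dict, Iterable, List, Set, Tuple


class FileMissingError(Exception):
    """Raised when a requested file is not in the toy catalogue."""


def stage_all_or_nothing(files: Iterable[str], catalogue: Set[str]) -> List[str]:
    # Behaviour described above: a single missing file fails the whole request.
    requested = list(files)
    for name in requested:
        if name not in catalogue:
            raise FileMissingError(name)
    return requested


def stage_per_file(
    files: Iterable[str], catalogue: Set[str]
) -> Tuple[List[str], Dict[str, str]]:
    # Assumed alternative: record the missing file and stage the rest.
    staged: List[str] = []
    errors: Dict[str, str] = {}
    for name in files:
        if name in catalogue:
            staged.append(name)
        else:
            errors[name] = "file missing"
    return staged, errors
```

      With a catalogue holding two of three requested files, the first function raises and stages nothing, while the second stages the two present files and reports the third individually.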

      There is also an update coming for CTA itself: v4.6.0-1 is recommended. I forget exactly what the change was for, but I have the release notes to read.

      I observed a monitoring problem in CMS-Rucio during the tape challenge, and I can see that Eric Vaandering is following up on that.

      I think these are significant changes which should help the system run much more smoothly under load or when things go wrong. A lot of them are the result of all the work we did last summer, when CMS was attempting to recall 8 PB of ‘B-parking’ data from CERN-CTA. I was planning to give my Oct 2021 CMS Computing Week talk, which explained the B-parking problems, to the T1 storage guys next week, and I will now be able to update it with the information in this email.

      In my opinion, these changes will fix a lot of the errors we were seeing during staging (particularly after the Rucio clusters were deleted by CERN-IT). They will benefit ATLAS too. My last worry would be on the Echo side, when it is busy.
       
    • 12:50 12:51
      VO Liaison LHCb 1m
      Speaker: Raja Nandakumar (Science and Technology Facilities Council STFC (GB))
    • 12:55 12:58
      VO Liaison LSST 3m
      Speaker: Timothy John Noble (Science and Technology Facilities Council STFC (GB))
    • 13:00 13:01
      VO Liaison Others 1m
    • 13:05 13:06
      Experiment Planning 1m
    • 13:10 13:11
      Euclid 1m
    • 13:15 13:16
      SKA 1m
    • 13:20 13:30
      Dune/protoDune 10m
    • 13:30 13:31
      AOB 1m
    • 13:35 13:36
      Any Other Business 1m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB))