RAL Tier1 Experiments Liaison Meeting

Europe/London
Access Grid (RAL R89)

Zoom Meeting ID
66811541532
Host
Alastair Dewhurst
    • 13:30 13:31
      Experiment Operational Issues 1m
    • 13:35 13:40
      ATLAS Operations Report 5m
      Speakers: Brij Kishor Jashal (Rutherford Appleton Laboratory), Jyoti Prakash Biswal (Rutherford Appleton Laboratory)
    • 13:40 13:45
      CMS Operations Report 5m
      Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))

      Still confused why the Antares tests are in yellow warning mode. I was told that if 50% of transfers are failing on 50% of links then we get this state. But this is unfair, because Antares only has two links (CERN T0 and Echo) and one of them (CERN T0) currently has no traffic! Actual transfers to Antares appear to have no problems.
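
A minimal sketch of the warning rule as it was described above, to show why a site with only two links is so easily flagged. The function names, the use of `>=` for both 50% thresholds, and in particular the treatment of a link with no traffic as "failing" are assumptions made for illustration; the real monitoring logic may differ.

```python
from typing import Mapping, Optional


def link_is_bad(failure_rate: Optional[float], threshold: float = 0.5) -> bool:
    """Return True if a link counts as failing.

    ASSUMPTION: a link with no traffic (failure_rate is None) is counted
    as failing. This is a guess at the behaviour, not confirmed.
    """
    if failure_rate is None:
        return True
    return failure_rate >= threshold


def site_in_warning(links: Mapping[str, Optional[float]],
                    link_fraction: float = 0.5) -> bool:
    """Warning state if at least `link_fraction` of the links are failing."""
    if not links:
        return False
    bad_links = sum(link_is_bad(rate) for rate in links.values())
    return bad_links / len(links) >= link_fraction


# Antares: two links, one idle (CERN T0) and one healthy (Echo).
antares = {"CERN T0": None, "Echo": 0.0}

# Hypothetical site with ten links, one of them idle.
many_links = {f"link{i}": 0.0 for i in range(9)}
many_links["idle"] = None

print(site_in_warning(antares))     # one idle link out of two is already 50%
print(site_in_warning(many_links))  # one idle link out of ten is only 10%
```

Under these assumptions a single idle link is enough to put a two-link site into warning, while the same idle link on a site with many links has no effect.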

      New data from CMS is arriving, marked 2025 Commissioning. Have requested new tape families to be created.

      Period of low efficiency last Thursday was tracked to a large number of very low efficiency workflow jobs (I/O only).

      Mini-DC for Echo was successful - see notes in later section. 

    • 13:45 13:50
      LHCb Operations Report 5m
      Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)

      Issues:

      • Pilot failures due to xrootd-set nproc limit
        • Fully mitigated now
        • Turned out xrootd client also sets max open files limit when imported :-(
          • Has not affected anything (yet)
      • A lot of rescheduled and failed jobs at all T1 sites, due to a problem with EOS-pilot SE at CERN
        • Not our fault
      • Lots of failed transfers to SARA from all T1 sites
        • Again, not our fault
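
The limit changes on import mentioned above can be checked with a small helper that snapshots the process resource limits before and after importing a module. This is only a sketch: the helper names are hypothetical, it uses the standard Unix-only `resource` module, and it watches just the open-files and nproc limits.

```python
import importlib
import resource

# Limits of interest: max open files and max number of processes.
WATCHED = {
    "nofile": resource.RLIMIT_NOFILE,
    "nproc": resource.RLIMIT_NPROC,
}


def snapshot_limits():
    """Return the current (soft, hard) pair for each watched limit."""
    return {name: resource.getrlimit(res) for name, res in WATCHED.items()}


def limits_changed_by_import(module_name):
    """Import `module_name` and report which watched limits changed.

    Returns a dict mapping limit name to (before, after). Note that if the
    module was already imported, import side effects will not run again,
    so this should be called in a fresh interpreter.
    """
    before = snapshot_limits()
    importlib.import_module(module_name)
    after = snapshot_limits()
    return {
        name: (before[name], after[name])
        for name in WATCHED
        if before[name] != after[name]
    }


# A standard-library module is not expected to touch the limits.
print(limits_changed_by_import("json"))
```

Running the same call with the name of the XRootD Python client module (assumed here to be installed on the node) in a fresh interpreter should show whether the limits are being altered in a given environment.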

       

      Data Challenge (sorry, cannot edit the appropriate section):

      • Very good throughput on Monday and Tuesday, well above the target
        • Much better than during DC24
        • The rate dropped a bit during the OPN cut, but was still close to the target
      • Rate was not so good on Wednesday, when the gateways were struggling
      • Read test on Thursday (RAL -> CERN) failed completely due to storage problems at CERN
      • Checksum tests on Friday worked well (at least from the client perspective)
      • Simultaneously with the challenge, LHCb started a new Sprucing workflow in which jobs running at RAL download data directly from CERN (eospilot SE). This contributed to the OPN load (and a bit to LHCONE, due to old WN gens).
    • 13:50 13:55
      ALICE Operations Report 5m
      Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)

      Raw 2024 data distribution started. It was delayed because RAL has the smallest share among T1s, and therefore started last (not because of the storage issues).

    • 13:55 14:00
      LSST Operations Report 5m
      Speaker: Timothy John Noble (Science and Technology Facilities Council STFC (GB))

      Data has been moved to RAL. Ingestion into Butler has to be done manually, because the data movement also had to be done manually - complete this week for DP testing.

      VOMS access changes to Echo (Jyothish is helping with this, testing on ceph-svc16): data moved to sites is not to be widely available, so VOMS roles need to be specified for certain areas of lsst:datadisk, giving different roles access to different areas and data.

      Data movement to RAL:

      Data movement from RAL:


    • 14:00 14:01
      Tier-1 Projects 1m
    • 14:05 14:15
      Actions from mini UK Data Challenge 10m
      Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))

      Echo mini-DC was last week, 3-7 March. Analysis is in progress. Talks at the LHCONE/OPN meeting next week, then more at GridPP53.

       

      Monday - planned to only half-fill the OPN but almost filled it (200Gbps).

      Tuesday - cut of the OPN and fallback to LHCONE. A slight drop was observed, but it actually worked pretty well.

      Wednesday - CMS injections crashed on Tuesday evening at 8pm. All VOs restarted the injections in the morning, drawing data from CERN and Tier 1s. The rate was increased to saturate the link and study T1->T1 links. Gateways started to struggle as they were saturated at 25Gbps with some 

      Thursday - read tests from Echo. Didn't really intend to push it hard but it also ~saturated on the OPN.

      Friday - Tier 2 tests cancelled/postponed due to major sites Lancaster and Manchester not being fully functional. Instead we did checksum-on-the-fly testing. This went well in terms of stress-testing the code, but there were errors with the checksums themselves (many of them came from trying to checksum a file on read instead of only on write, but there were possibly also a few genuine errors).
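
For readers unfamiliar with the term, checksum-on-the-fly means updating the checksum as each chunk is written, rather than re-reading the file afterwards. The sketch below illustrates the idea only; it is not the gateway code. The choice of Adler-32 and the function name are assumptions made for the example.

```python
import io
import zlib


def copy_with_adler32(src, dst, chunk_size=1024 * 1024):
    """Copy src to dst in chunks, updating an Adler-32 checksum per chunk.

    ASSUMPTION: Adler-32 is used here purely as an example algorithm.
    Returns the checksum as an 8-character lowercase hex string.
    """
    checksum = 1  # Adler-32 starting value
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        # Running update: no second pass over the data is needed.
        checksum = zlib.adler32(chunk, checksum)
    return f"{checksum & 0xFFFFFFFF:08x}"


# Usage with in-memory streams standing in for the client and the storage.
payload = b"mini-DC test payload " * 1000
source = io.BytesIO(payload)
target = io.BytesIO()
on_the_fly = copy_with_adler32(source, target, chunk_size=4096)
print(on_the_fly == f"{zlib.adler32(payload):08x}")
```

The point of the design is that the checksum is available as soon as the write completes, which is why a checksum request on read does not fit the same path.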

       

      At times, even though FTS traffic dropped to 100Gbps, we still saw nearly 200Gbps in the network plots, so we suspect streaming from CERN by CMS and LHCb (investigating). James A created new plots showing traffic to/from storage and WNs (but the WNs show internal data movements too). Various monitoring from CMS and LHCb show 

      Deletions: CMS was behind in deletions before the mini-DC started (and had gone over pledge). So deletions were going to struggle during the mini-DC. Katy got additional Rucio reaper pods started up to focus purely on RAL. Although the rate-per-file of deletions is clearly slower than some other sites, RAL seemed able to keep up with the higher rate of deletions once it was being targeted with continuous deletion requests. 

    • 14:15 14:25
      Antares Upgrade 10m

      New EOS nodes
      Tape Robotics downtime

      Speakers: George Patargias, Thomas Byrne
    • 14:25 14:35
      XRootD Development 10m
      Speakers: Alexander Rogovskiy (Rutherford Appleton Laboratory), Jyothish Thomas (STFC)

      Writable WN gateways:

      • The change is being deployed this week
        • LHCb jobs from 2018 and 2020 gens are already writing their output data to ECHO via root (i.e. local gateways)
          • Mostly successful, there is a problem with one WN (new sandbox has not been deployed there yet)
        • Extension to 2019 and 2021 gens is planned for tomorrow.
    • 14:35 14:45
      Utilizing GPUs 10m
      Speakers: Jyoti Prakash Biswal (Rutherford Appleton Laboratory), Thomas Birkett
    • 14:45 14:46
      AOB 1m
    • 14:46 14:55
      Summary of Operational Status and Issues 9m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore
    • 14:55 15:00
      Any other Business 5m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore