RAL Tier1 Experiments Liaison Meeting

Europe/London
Access Grid (RAL R89)

Zoom Meeting ID
66811541532
Host
Alastair Dewhurst
    • 13:30 13:31
      Experiment Operational Issues 1m
    • 13:35 13:40
      ATLAS Operations Report 5m
      Speakers: Brij Kishor Jashal (Rutherford Appleton Laboratory), Jyoti Prakash Biswal (Rutherford Appleton Laboratory)
    • 13:40 13:45
      CMS Operations Report 5m
      Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))

      CMS had a large spike in job failures; these were bad jobs and were killed. At the same time, Tom Birkett saw a large number of errors mentioning CVMFS. I'm curious whether these are still appearing now that the failing jobs have subsided. The CVMFS errors pointed to cms-ib (Integration Build), which shouldn't be used for production jobs.

      Deletions - 3.5PB removed from tape. Another batch to be finalised in mid-March.

      SAM tests are OK, but we are still seeing the yellow warning on tape transfers. These are failing functional tests, which dominate when there is no production activity. The tape functional tests are no longer supposed to be running; will follow up.

    • 13:45 13:50
      LHCb Operations Report 5m
      Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)
      • Pilot failures due to nproc limit
        • The issue is not present any more, a few mitigations are in place:
          • Patched xrootd version deployed to cvmfs by LHCb
          • Tom B developed a blocker for the setrlimit syscall for the job containers; it is being tested on the LHCb-only WN
            • There were some incompatibilities with the LHCb pilot code, but they are fixed now
      • New LHCb workflow arrived
        • Jobs are downloading data from CERN to RAL WNs
      • Data Challenge (is it the right section?)
        • Good performance for LHCb during the first two days
          • Some throughput decrease during the OPN cut
        • Dropped this morning
          • CERN SE is very loaded with production data movement.
    • 13:50 13:55
      ALICE Operations Report 5m
      Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)

      Missing tape activity, to be investigated. Reported free space looks strange.

    • 13:55 14:00
      LSST Operations Report 5m
      Speaker: Timothy John Noble (Science and Technology Facilities Council STFC (GB))

      Moved 12 TB of data to RAL from Lancs, IN2P3 and SLAC last week and on Monday and Tuesday, to try to ensure that RAL has all the data needed for various pipeline tests.

      New dataset collected, moved and registered with the metadata database. Working with the Campaign Management team at Fermilab to ensure all data is registered correctly and ingested as needed.

      Will work with Matt Doidge to relieve pressure from LSST on Lancs storage by pointing their data pipelines at our storage instead for a time, as well as ensuring we are using our own storage now that we have the data.

      Messaging stack being updated this week to version 1.5 to ensure it is up to date and ready for end-to-end data testing.

      Job efficiency is expected to change as the data source has changed to RAL and new pipeline testing is done at RAL.

    • 14:00 14:01
      Tier-1 Projects 1m
    • 14:05 14:15
      Preparing for mini UK Data Challenge 10m
      Speakers: Mr James Adams, Katy Ellis (Science and Technology Facilities Council STFC (GB))

      So far: 

      Monday - planned to only half-fill the OPN but almost filled it (200Gbps).

      Tuesday - the OPN was cut, with fallback to LHCONE. A slight drop was observed, but it actually worked pretty well.

      Wednesday - CMS injections crashed on Tuesday evening at 8pm. All VOs restarted the injections this morning, drawing data from CERN and Tier 1s. Increased the rate to saturate the link and study T1->T1 links.

      Even though FTS traffic dropped to 100Gbps, we still see nearly 200Gbps in the network plots, so we suspect streaming from CERN by CMS and LHCb; investigating. James A created new plots showing traffic to/from storage and WNs (but the WN plots show internal data movements too).

      CMS injections were interrupted by actual production traffic. There were also DB issues which probably affected the submission of rules in Rucio.

      Deletions: CMS was behind in deletions before the mini-DC started (and had gone over pledge), so deletions were going to struggle during the mini-DC. Katy got additional Rucio reaper pods started up to focus purely on RAL. Although the per-file deletion rate is clearly slower than at some other sites, RAL seemed able to keep up with the higher rate of deletions once it was being targeted with continuous deletion requests.

      To come:

      Thursday - read tests from Echo 

      Friday - probably cancelled/postponed (Tier 2 testing) due to major sites Lancaster and Manchester not being fully functional. 

    • 14:15 14:25
      Antares Upgrade 10m

      New EOS nodes
      Tape Robotics downtime

      Speakers: George Patargias, Thomas Byrne
    • 14:25 14:35
      XRootD Development 10m
      Speakers: Alexander Rogovskiy (Rutherford Appleton Laboratory), Jyothish Thomas (STFC)
      • Writeable WN gateways
        • Sandbox deployed to preprod farm
        • Amendments made to CC document
    • 14:35 14:45
      Utilizing GPUs 10m
      Speakers: Jyoti Prakash Biswal (Rutherford Appleton Laboratory), Thomas Birkett
    • 14:45 14:46
      AOB 1m
    • 14:46 14:55
      Summary of Operational Status and Issues 9m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore
    • 14:55 15:00
      Any other Business 5m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore