RAL Tier1 Experiments Liaison Meeting

Europe/London
Access Grid (RAL R89)

Zoom Meeting ID
66811541532
Host
Alastair Dewhurst
    • 13:30 13:31
      Experiment Operational Issues 1m
    • 13:35 13:40
      ATLAS Operations Report 5m
      Speakers: Brij Kishor Jashal (Rutherford Appleton Laboratory), Jyoti Prakash Biswal (Rutherford Appleton Laboratory)
    • 13:40 13:45
      CMS Operations Report 5m
      Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))

      Red days for SAM on Tuesday and Wednesday, due to the continuation of the network issues and interventions reported last week.

      Observed a large discrepancy between CMS (monit) and RAL (Vande) monitoring of running cores. This turned out to be a scheduling inefficiency on the CMS side, with slow ramp-up of scheduling agents at FNAL.

      Tape downtime caused CMS to go into drain several times. The Rucio status (for Echo) was overridden by Data Management. Katy overrode the status to keep jobs running. CMS needs to handle this better and not send sites into drain just because tape is down.

      A few spikes in job failures and low efficiency, which may be related to network blips (long read times).

      RAL FTS removed entirely from CMS-Rucio operations.

    • 13:45 13:50
      LHCb Operations Report 5m
      Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)

      Issues:

      • Network problems on the 23rd of April
        • Affected LHCb jobs
        • Looks OK since the morning of the 23rd
      • LHCb drained on Saturday
        • Lack of jobs
    • 13:50 13:55
      ALICE Operations Report 5m
      Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)
    • 13:55 14:00
      LSST Operations Report 5m
      Speakers: Mathew Sims, Timothy Noble (Science and Technology Facilities Council STFC (GB))
      • DC2 pipeline complete, but working with CM team to standardise the data return process
        • The script to return data is hard-coded in places, all of which rely on the URI, which is currently incorrect
          • Working on moving data to the correct location now
          • Approx. 4 million files
          • Writing a Python script that uses multiprocessing to move the data
      • Will run DC2 again from Friday (1st May) with the weekly code from week 18
        • Will then continue to run it every other week to assist with site and code base testing and troubleshooting
      • After the DC2 data return process is complete, will work on getting the RC2 data registered in a butler repository for a smaller test, but with precursor data
      • LSST want to use a UK FTS for European transfers, so will want to use either the SKA FTS or LCGFTS, or at least use it for failover
        • Not using an FTS for a time and then wanting to use it for failover could lead to further failures
        • Do ATLAS and CMS use a single central FTS for all transfers or do they also use geographically closest ones?
          • From talking to other devs, Rucio currently uses only the FTS related to the destination
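
      The multiprocessing move script mentioned above could look roughly like the sketch below. This is not the script being written for DC2: it assumes the files are reachable as ordinary filesystem paths (the notes do not say how the data is accessed), and the names plan_moves, move_one, move_all, wrong_root and correct_root are hypothetical.

      ```python
      import os
      import shutil
      from multiprocessing import Pool


      def plan_moves(paths, wrong_root, correct_root):
          """Pair each file under wrong_root with its target under correct_root."""
          moves = []
          for src in paths:
              rel = os.path.relpath(src, wrong_root)
              # Refuse anything that is not actually inside wrong_root.
              if rel == os.pardir or rel.startswith(os.pardir + os.sep):
                  raise ValueError(f"{src} is not under {wrong_root}")
              moves.append((src, os.path.join(correct_root, rel)))
          return moves


      def move_one(move):
          """Move a single file; return (src, error), with error None on success."""
          src, dst = move
          try:
              os.makedirs(os.path.dirname(dst), exist_ok=True)
              shutil.move(src, dst)
              return src, None
          except OSError as exc:
              # Report the failure instead of raising, so one bad file
              # does not stop the rest of the batch.
              return src, str(exc)


      def move_all(moves, workers=8, chunksize=1000):
          """Run the moves in a worker pool and return the list of failures."""
          failures = []
          with Pool(processes=workers) as pool:
              # A large chunksize keeps per-task overhead low for millions of files.
              for src, error in pool.imap_unordered(move_one, moves, chunksize=chunksize):
                  if error is not None:
                      failures.append((src, error))
          return failures
      ```

      With around 4 million files, returning failures rather than raising lets the run finish and the failed subset be retried afterwards. Calls to move_all should sit under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.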
    • 14:00 14:01
      Tier-1 Projects 1m
    • 14:15 14:25
      Antares Upgrade 10m

      New EOS nodes
      Tape Robotics downtime

      Speakers: George Patargias, Thomas Byrne
    • 14:25 14:35
      XRootD Development 10m
      Speakers: Alexander Rogovskiy (Rutherford Appleton Laboratory), Jyothish Thomas (STFC)
    • 14:35 14:45
      Varnish For ATLAS 10m
      Speaker: Brij Kishor Jashal (Rutherford Appleton Laboratory)
    • 14:45 14:46
      AOB 1m
    • 14:46 14:55
      Summary of Operational Status and Issues 9m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore
    • 14:55 15:00
      Any other Business 5m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore