RAL Tier1 Experiments Liaison Meeting

Europe/London
Access Grid (RAL R89)

Zoom Meeting ID
66811541532
Host
Alastair Dewhurst
    • 13:30 13:31
      Experiment Operational Issues 1m
    • 13:35 13:40
      ATLAS Operations Report 5m
      Speakers: Brij Kishor Jashal (Rutherford Appleton Laboratory), Jyoti Prakash Biswal (Rutherford Appleton Laboratory)
    • 13:40 13:45
      CMS Operations Report 5m
      Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))

      Another update on previously reported SAM test issues:

      1. svc20 (AAA server) not showing problems since being added with new memory limits. 
      2. Network problems - a few hiccups last week but not significant for CMS. Problem hopefully fixed on Friday/today. 
      3. 'Connection' test for Antares endpoints in warning due to no IPv6 - how are the tests for the new EOS nodes going? 

      Another good week in production for CMS, with lots of cores and good performance. 

      Transfers:

      The large number of "file exists" errors seen last week was likely due to the network disruption. The automated clean-up mechanism was very effective, and the errors went to zero without Katy having to do anything.

      Errors with Echo as source mentioned last week were mostly transfers failing to Antares as mentioned above. 

      I observe that this week we are running a lot of data Processing jobs, and much of the data is likely sourced from CERN. However, at a glance, jobs reading Offsite are performing better than those reading Onsite! N.B. Update - those jobs failing with Onsite reads are almost entirely User Analysis.

    • 13:45 13:50
      LHCb Operations Report 5m
      Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)

      News:

      • LHCb drained last Sunday due to a problem with one of the LHCbDIRAC VMs
        • Recovered by Monday morning


      Issues:

      • Network outages [GGUS ticket]:
        • All looks ~OK since Friday afternoon
        • The issue was fully fixed today, so we can close the ticket?
      • ceph-svc24 was intermittently crashing yesterday
        • Due to a bug in the new checksum dump code
          • Fixed now
        • That caused a (minor) increase in upload failures.
      • Spike of deletion failures this morning
        • All failures were due to timeouts.


      CVMFS:
      Stratum-1 servers were rebooted yesterday. The reboot removed the /run/cvmfs.local directory, which stopped snapshots. The directory was recreated manually yesterday evening, which enabled snapshots again. A proper fix is being prepared.
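
      As an illustration only, the manual workaround could be scripted roughly as below. This is a sketch, not the proper fix being prepared: /run is typically a tmpfs that is emptied at reboot, so anything expected there has to be recreated after boot. The function name ensure_directory and the 0o755 mode are assumptions; the real ownership and permissions required by the snapshot process are not given in these notes.

```python
from pathlib import Path


def ensure_directory(path: str, mode: int = 0o755) -> bool:
    """Create `path` if it is missing.

    Returns True if the directory had to be created, False if it already
    existed. The default mode is an assumption, not taken from the notes.
    """
    directory = Path(path)
    if directory.is_dir():
        return False
    # parents=True also creates any missing parent directories.
    directory.mkdir(parents=True, mode=mode)
    return True


# Hypothetical usage on a Stratum-1 host (needs sufficient privileges):
#   ensure_directory("/run/cvmfs.local")
```

      Calling the function again when the directory already exists is harmless, so it could be run after every reboot.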

    • 13:50 13:55
      ALICE Operations Report 5m
      Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)
    • 13:55 14:00
      LSST Operations Report 5m
      Speakers: Mathew Sims, Timothy Noble (Science and Technology Facilities Council STFC (GB))
      • RC2 data movement is progressing. Some blockers at the US site have been overcome, and the data is now mostly registered in Rucio; work on the rest is ongoing.
      • IngestD version increased to v20 today by Mat.
      • DC2 w18 is now complete, with only a minor issue that was resolved and reported:
        • A job requested 4 GB and used around 14-15 GB.
          • It didn't fail at Lancs or IN2P3 because they don't have the same sort of limits set (3x the request, then kill).
          • This has been reported back to the CM team.
        • Will now run w22.
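
      The memory-limit arithmetic behind the DC2 w18 issue can be sketched as follows. This assumes "3x to request then kill" means a job is killed once its usage exceeds three times its requested memory; the function name and the strict "greater than" comparison are illustrative assumptions, not the batch system's actual policy expression.

```python
def exceeds_memory_limit(requested_gb: float, used_gb: float,
                         kill_factor: float = 3.0) -> bool:
    """Return True if usage is above kill_factor times the request.

    kill_factor=3.0 reflects the "3x" limit mentioned in the notes; the
    exact comparison used by the real system is an assumption.
    """
    return used_gb > kill_factor * requested_gb


requested_gb = 4.0  # memory requested by the job, from the notes
for used_gb in (14.0, 15.0):  # approximate observed usage, from the notes
    limit_gb = 3.0 * requested_gb
    killed = exceeds_memory_limit(requested_gb, used_gb)
    print(f"request={requested_gb} GB, limit={limit_gb} GB, "
          f"used={used_gb} GB, killed={killed}")
```

      With a 4 GB request the limit is 12 GB, so usage of 14-15 GB is over the threshold. A site without such a limit would let the same job run to completion.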
    • 14:00 14:01
      Tier-1 Projects 1m
    • 14:15 14:25
      Antares Upgrade 10m

      New EOS nodes
      Repack Progress

      Speakers: George Patargias, Thomas Byrne
    • 14:25 14:35
      XRootD Development 10m
      Speakers: Alexander Rogovskiy (Rutherford Appleton Laboratory), Jyothish Thomas (STFC)
    • 14:35 14:55
      Echo SN24 Install 20m
    • 14:55 15:05
      Utilizing GPUs 10m
      Speakers: Jyoti Prakash Biswal (Rutherford Appleton Laboratory), Thomas Birkett
    • 15:05 15:25
      SWIFT-HEP SSD storage 20m
    • 15:25 15:26
      AOB 1m
    • 15:26 15:35
      Summary of Operational Status and Issues 9m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore
    • 15:35 15:40
      Any other Business 5m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore