RAL Tier1 Experiments Liaison Meeting

Europe/London
Access Grid (RAL R89)

Zoom Meeting ID
66811541532
Host
Alastair Dewhurst
    • 13:30 13:31
      Experiment Operational Issues 1m
    • 13:35 13:40
      ATLAS Operations Report 5m
      Speakers: Brij Kishor Jashal (Rutherford Appleton Laboratory), Jyoti Prakash Biswal (Rutherford Appleton Laboratory)
    • 13:40 13:45
      CMS Operations Report 5m
      Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))

       

      CMS plans to switch the ETF tests over from pre-production this week:

      • ETF pre-production was updated with the latest HTCondor client and the patches were undone one by one; everything looks good
      • an issue with the /Role=production proxy was discovered last week in EL9 instances
      • the switch will give us stable job submission to ARC-CEs; old worker node tests will stay in production; switch to new worker node tests next/in a few weeks?

       

      Spotted again that 'running cores' did not match between CMS and RAL monitoring. An operator is checking the FNAL schedulers again, but it looks better today.

      '500' transfer errors reported by Brij are also seen by CMS at RAL, but not only at RAL. This may be quite a generic error, as it is seen at several other sites. It is mentioned in this FTS ticket: https://cern.service-now.com/service-portal?id=ticket&table=incident&n=INC4460338

      The mystery of the 'mc' SAM tests failing with a stageout error 2 weeks ago is explained: the failures coincided with downtime of RALPP storage. RALPP is our fallback site for stage out. The problem was that the test was always failing to stage out to Echo because the port was missing in the config, which only became visible once the fallback was unavailable. This is now corrected. Production jobs were never affected.
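
      The missing-port problem described above is the kind of thing a small configuration check can catch before a fallback hides it. The following is a minimal sketch only: the endpoint names, hostnames and port number are hypothetical placeholders, not the real CMS or RAL configuration, and the helper `endpoints_missing_port` is invented for illustration.

```python
from urllib.parse import urlsplit


def endpoints_missing_port(endpoints):
    """Return the names of endpoints whose URL has no explicit port."""
    missing = []
    for name, url in endpoints.items():
        # urlsplit(...).port is None when the URL carries no ":port" part.
        if urlsplit(url).port is None:
            missing.append(name)
    return missing


# Hypothetical stage-out configuration (placeholder hosts and port).
stageout = {
    "primary": "root://echo.example.org//store/test",
    "fallback": "root://ralpp.example.org:1094//store/test",
}

print(endpoints_missing_port(stageout))
```

      In this made-up configuration only the primary endpoint is flagged, mirroring the situation where stage-out succeeded solely through the fallback.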

    • 13:45 13:50
      LHCb Operations Report 5m
      Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)

      Issues:

      • ECHO problems (a faulty disk on the 30th of April causing some slow IOPs, plus slow IOPs for an unknown reason on the 7th of May) caused file loss and corruption (GGUS 683184)
        • Lost files restored; xrootd uploads disabled to avoid race conditions
          • around 1k files lost, all deleted or restored now
        • Full list of corrupted files is still to be identified.
          • So far only one was found
          • Would it be possible to get a list of all files with their sizes and checksums?
      • DIRAC issues on Tuesday (night + early morning)
        • Resulted in an increase in completed and rescheduled jobs
      • Spikes of failed WGProduction jobs
        • Buggy xrootd client version used (5.3); cannot be helped
          • This version is unable to execute any vector read request if it has more than two chunks in it
            • Therefore such jobs will not work at any site with Xrootd storage
      • Failed uploads from HLTFarm
        • Expected, due to lack of network connectivity
      • LHCb certification pilots may overload our CEs
        • Feel free to block the certification DN if it is too annoying.
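
      On the request above for a list of all files with their sizes and checksums: given such a dump, corrupted or missing files could be identified by comparing it with the catalogue. The sketch below assumes both sides are available as mappings from path to (size, checksum) and that the checksum is Adler-32 in hexadecimal; the paths, the data and the helper names are hypothetical, and the real dump format may differ.

```python
import zlib


def adler32_hex(data: bytes) -> str:
    """Adler-32 of a byte string as 8 lowercase hex digits."""
    return f"{zlib.adler32(data) & 0xFFFFFFFF:08x}"


def find_suspect_files(catalogue, storage_dump):
    """Compare catalogue entries with a storage dump.

    Both arguments map path -> (size, checksum). Returns a dict of
    path -> reason for every catalogue entry that is missing from the
    dump or whose size or checksum disagrees with it.
    """
    suspects = {}
    for path, (size, checksum) in catalogue.items():
        if path not in storage_dump:
            suspects[path] = "missing"
            continue
        dump_size, dump_checksum = storage_dump[path]
        if dump_size != size:
            suspects[path] = "size mismatch"
        elif dump_checksum.lower() != checksum.lower():
            suspects[path] = "checksum mismatch"
    return suspects


# Hypothetical example data: one intact file, one corrupted, one lost.
catalogue = {
    "/lhcb/a": (5, adler32_hex(b"hello")),
    "/lhcb/b": (5, adler32_hex(b"world")),
    "/lhcb/c": (3, adler32_hex(b"abc")),
}
storage_dump = {
    "/lhcb/a": (5, adler32_hex(b"hello")),
    "/lhcb/b": (5, adler32_hex(b"w0rld")),  # same size, different content
}

print(find_suspect_files(catalogue, storage_dump))
```

      Size is compared before the checksum so that truncated files are reported as such; files with the right size but wrong content are only caught by the checksum comparison.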
    • 13:50 13:55
      ALICE Operations Report 5m
      Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)

      SE tests for Antares preprod have been failing since the addition of the new EOS node. This probably needs to be resolved before the nodes are added to Antares prod.

    • 13:55 14:00
      LSST Operations Report 5m
      Speakers: Mathew Sims, Timothy Noble (Science and Technology Facilities Council STFC (GB))
      • Noticed some slow responses from ARC-CEs
        • Could be due to jobs from a VO failing and being resubmitted
      • Data movement still in progress
        • about 50% done
      • The newly requested location has different VOMS requirements (as requested), and jobs have not been running since this pointer file was moved
        • Working with the CM team to update their VOMS roles to match what the data security team requested
    • 14:00 14:01
      Tier-1 Projects 1m
    • 14:15 14:25
      Antares Upgrade 10m

      New EOS nodes
      Repack Progress

      Speakers: George Patargias, Thomas Byrne
    • 14:25 14:35
      XRootD Development 10m
      Speakers: Alexander Rogovskiy (Rutherford Appleton Laboratory), Jyothish Thomas (STFC)
    • 14:35 14:45
      Utilizing GPUs 10m
      Speakers: Jyoti Prakash Biswal (Rutherford Appleton Laboratory), Thomas Birkett
    • 14:45 14:46
      AOB 1m
    • 14:46 14:55
      Summary of Operational Status and Issues 9m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore
    • 14:55 15:00
      Any other Business 5m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore