RAL Tier1 Experiments Liaison Meeting

Europe/London
Access Grid (RAL R89)

Zoom Meeting ID: 66811541532
Host: Alastair Dewhurst
    • 14:00 14:01
      Major Incidents Changes 1m
    • 14:05 14:06
      Summary of Operational Status and Issues 1m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore, Kieran Howlett (STFC RAL)
    • 14:10 14:11
      Experiment Operational Issues 1m
    • 14:15 14:16
      VO Liaison CMS 1m
      Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))

      The AAA machines lost the ability to authenticate with certificates; it is unclear why this happened. Jyothish fixed it today. Certificate SAM tests failed for a couple of days, while token tests remained green.

      The Echo gateways have had problems since last Wednesday, including over the weekend. Particular gateways were seen to be failing and timing out, and were removed to mitigate the problem. SAM tests had a red overall status on Wednesday, Friday and Saturday.

      CMS went into production drain. However, we kept our slots due to 'Tier 0' jobs that are (still) not respecting the site status. In this case everything was fine: performance of the jobs was excellent.

      I believe the IPv6 inaccessibility problem with AAA was fixed by DI. This also affected other machines not using LHCONE or LHCOPN. 

      A 'glitch' is seen most days in SAM tests, affecting CE tests and sometimes others as well. Possibly a network disconnection? The error is:

      "Job completed but failed to get job output"

    • 14:20 14:21
      VO Liaison ATLAS 1m
      Speakers: Brij Kishor Jashal (RAL, TIFR and IFIC), Jyoti Prakash Biswal (Rutherford Appleton Laboratory)
    • 14:25 14:26
      VO Liaison LHCb 1m
      Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)

      Tickets:

      • Failed uploads to ECHO (GGUS 167781)
        • Gateways were unstable throughout the weekend
        • Some files are lost as a result
          • 219 files were identified during the consistency check; most of them had only 1 replica at RAL, so they are lost for good
        • Some files are corrupted
          • Need to retrieve checksums of all LHCb files on ECHO
            • Is there anything to consider from an operational perspective before doing this?
      • Failed downloads/direct access requests from ECHO (GGUS 167617)
        • New restart script was deployed to preprod farm last week
        • Sometimes jobs are still failing with "Cannot allocate memory" error
          • Makes sense, since turning on pgRead does not affect direct access requests
          • A GitHub issue is to be opened
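
      A minimal Python sketch of how the checksum comparison for GGUS 167781 might be organised. It assumes Adler-32 checksums and two hypothetical inputs: a catalogue mapping of file name to expected checksum, and a mapping of file name to the checksum retrieved from storage. The function names, file names and data layout are illustrative assumptions, not the actual LHCb or Echo tooling.

```python
import zlib
from typing import Dict, Iterable, List, Tuple


def adler32_hex(chunks: Iterable[bytes]) -> str:
    """Return the Adler-32 checksum of a byte stream as 8 lowercase hex digits."""
    value = 1  # Adler-32 starting value
    for chunk in chunks:
        # Feeding the running value back in lets large files be read in pieces.
        value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"


def classify_files(
    catalogue: Dict[str, str], storage: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Compare expected checksums (catalogue) with those found on storage.

    Returns (missing, corrupted): files absent from storage, and files whose
    stored checksum differs from the catalogue value (case-insensitive).
    """
    missing = sorted(name for name in catalogue if name not in storage)
    corrupted = sorted(
        name
        for name, expected in catalogue.items()
        if name in storage and storage[name].lower() != expected.lower()
    )
    return missing, corrupted


if __name__ == "__main__":
    # Hypothetical file names and checksums, for illustration only.
    catalogue = {
        "fileA": adler32_hex([b"Wikipedia"]),
        "fileB": "0badc0de",
        "fileC": "00000001",
    }
    storage = {"fileA": "11E60398", "fileB": "deadbeef"}
    missing, corrupted = classify_files(catalogue, storage)
    print("missing:", missing)
    print("corrupted:", corrupted)
```

      Separating missing files from checksum mismatches keeps the two cases in the ticket (lost files and corrupted files) apart, so each list can be followed up separately.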


      Operational issues:

      • Xrootd bug follow-up?
    • 14:30 14:33
      VO Liaison LSST 3m
      Speaker: Timothy John Noble (Science and Technology Facilities Council STFC (GB))
    • 14:35 14:36
      VO Liaison APEL 1m
      Speaker: Thomas Dack
    • 14:39 14:40
      VO Liaison Others 1m
      Speakers: Alexander Rogovskiy (Rutherford Appleton Laboratory), Brij Kishor Jashal (RAL, TIFR and IFIC), Jyoti Prakash Biswal (Rutherford Appleton Laboratory), Katy Ellis (Science and Technology Facilities Council STFC (GB))

      ALICE:

      • ETF tests for ALICE were failing because of an ARC CE configuration issue (incorrect mapping)
        • Fixed last week

      SNO+:

      • (Katy) Messaged our contacts to inform them that they are streaming through our FTS due to a protocol mismatch. This is causing the RAL FTS disk to fill up.
    • 14:45 14:46
      AOB 1m
    • 14:50 14:51
      Any Other Business 1m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore