RAL Tier1 Experiments Liaison Meeting

Europe/London
Access Grid (RAL R89)

Zoom Meeting ID: 66811541532
Host: Alastair Dewhurst
    • 13:30 13:31
      Experiment Operational Issues 1m
    • 13:35 13:45
      VO Liaison ATLAS 10m
      Speakers: Brij Kishor Jashal (TIFR, RAL, IFIC), Jyoti Prakash Biswal (Rutherford Appleton Laboratory)
    • 13:45 13:55
      VO Liaison CMS 10m
      Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))

      On Monday, all AAA machines went from completely green SAM tests to constantly red. The 'federation' test was failing, with an error showing that it was unable to connect to the global redirector. ARC-CE tests also failed constantly in the same period, because the associated 'xrootd-access' test, which uses the AAA machines, was failing. This came on top of the intermittent ARC-CE SAM test failures at submission, which have been ongoing for some months but became worse over Christmas / New Year / January.

      Tom Birkett has been following up a suspected network/firewall problem with DI. The suspicion was that this could be causing the intermittent ARC-CE test failures, along with many other observed problems, such as a variable number of CMS jobs running despite work being available, a lack of ATLAS jobs running, and general slowness on Tier 1 machines.

      On Tuesday morning, at around 10:30, DI made a change by removing one port from a network component. After this, many or all of the above problems seem to have been miraculously fixed or improved immediately!

      AAA tests went green; the ARC-CE xrootd-access test went green; intermittent submission failures are looking much better.

      UPDATE, Wed morning: AAA tests went red again last night. Jyothish did some cleanup and restarts, and the tests are going green again.

      Note that where CMS jobs have run, performance has in general been good. The exception was Monday night into Tuesday, when there was a spike of Production job failures: jobs attempting to read remotely were getting a FileOpen error.

      AAA OOM errors under high load are still to be followed up, as is the problem with svc20 continuously dropping its monitoring in Vande.

      CMS / CERN IT jumbo frames testing ongoing all week.

    • 13:55 14:05
      VO Liaison LHCb 10m
      Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)

      /* Sorry, I am on leave on 15.01, so the data below may be outdated; it was valid as of the evening of 14.01 */

      News:

      • The data reprocessing campaign will hopefully end in mid-February (except for CNAF, which had many problems, though that is not relevant for us)
      • Therefore the UK DC at the end of February/beginning of March should be fine

       

      Operational issues:

      • The nproc limit issue is fixed
        • The pilot now restores the original limit after gfal context creation
      • Lots of failed WGProduction (direct access) jobs on Tuesday morning (see job plots)
        • The jobs used xrootd-5.3.1 for streaming; this version has a bug that causes all vector reads with more than one chunk in the request to fail (see this ticket for details)
        • So, not our fault; LHCb has been informed and will update the application linkage to a newer xrootd version
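
      The nproc fix noted above (restoring the original limit after gfal context creation) can be illustrated with a minimal sketch. This is not the actual pilot code: the helper name preserve_nproc_limit is hypothetical, and the sketch assumes a Unix host, uses only Python's standard resource module, and only assumes that some library call inside the block may change the soft limit.

```python
import contextlib
import resource


@contextlib.contextmanager
def preserve_nproc_limit():
    """Hypothetical helper: save RLIMIT_NPROC on entry, restore it on exit."""
    # getrlimit returns a (soft, hard) tuple for the current process.
    original = resource.getrlimit(resource.RLIMIT_NPROC)
    try:
        yield original
    finally:
        # Put back whatever was in force before the block ran. An unprivileged
        # process can only do this if the hard limit was not lowered meanwhile.
        resource.setrlimit(resource.RLIMIT_NPROC, original)


if __name__ == "__main__":
    print("before:", resource.getrlimit(resource.RLIMIT_NPROC))
    with preserve_nproc_limit():
        # Placeholder for the call that changes the limit as a side effect
        # (in the case above, gfal context creation).
        pass
    print("after: ", resource.getrlimit(resource.RLIMIT_NPROC))
```

      Wrapping the call in a context manager means the limit is restored even if the wrapped call raises an exception.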

       

    • 14:10 14:20
      VO Liaison ALICE 10m
      Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)
    • 14:20 14:30
      VO Liaison LSST 10m
      Speaker: Timothy John Noble (Science and Technology Facilities Council STFC (GB))

      Low activity over Christmas.

      Networking between the Butler DB and the BatchFarm is an issue. As changes are being made to the network soon, LSST was moved last night to the 2020 BF nodes on the new network, so that its jobs can have access to the Butler.

      After draining old jobs this morning and disabling job types that were not affected, only the "DC2" jobs remain. These are currently long-running and none have finished at this time (despite running for nearly 2 hours), because they remain on older nodes rather than the new ones specified.

    • 14:30 14:40
      VO Liaison APEL 10m
      Speaker: Thomas Dack
    • 14:45 14:55
      WP-D - GPU, Data Management, Other 10m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore
    • 15:00 15:01
      Major Incidents Changes 1m
    • 15:05 15:15
      Summary of Operational Status and Issues 10m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore
    • 15:20 15:21
      AOB 1m
    • 15:22 15:32
      Any Other Business 10m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore