RAL Tier1 Experiments Liaison Meeting

Europe/London
Access Grid (RAL R89)

Videoconference
Zoom Meeting ID: 66811541532
Host: Alastair Dewhurst
    • 14:00 14:01
      Major Incidents Changes 1m
    • 14:05 14:06
      Summary of Operational Status and Issues 1m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore, Kieran Howlett (STFC RAL)
    • 14:10 14:11
      Experiment Operational Issues 1m
    • 14:15 14:16
      VO Liaison CMS 1m
      Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))

      It would be good to get the token tests green for CMS. These SAM tests have been running for a while, and although they do not affect the 'site status' yet, it would be good to start working on this now that the Tape REST API is in place. You can see the tests here: https://cmssst.web.cern.ch/siteStatus/detail.html?site=T1_UK_RAL

      The 'federation' SAM test for AAA machines has been failing since early Tuesday morning. This is a CMS-wide problem, affecting many European sites. It is being investigated on the CMS side.

      Also for AAA, the EU collector has been turned off, and we have been warned that any machine with this config will fail if it attempts a restart. The collector was turned off because its owner did not want to update the OS. Shoveler will replace it in time. The RAL-based AAA proxies already had this monitoring commented out. Jyothish and I removed the monitoring from the various redirectors under our control; the change is committed in Aquilon.

      Katy is attempting to test Shoveler further and validate it on behalf of CMS. Alessandra will also do some work on this for ATLAS, but perhaps later in the year. Jyothish already had a new VM which will be the 'production' Shoveler instance, but it hasn't been sending any monitoring information. We suspect the firewall is not open to this VM and have raised a ticket with DI requesting this. Hopefully we will then immediately see data in the plots.

      We also need Shoveler to run on the WN gateways. This would add a line to the Xrootd config on each WN gateway. Katy to test on the CMS test WN and report back. Once it is working, request a roll-out on the batch farm.

      The new AAA proxy machine (svc20) is now being monitored in Vande. It seems to show the same number of xrootd connections as the other machines but the throughput is higher. My assumption is this is expected due to it being a newer, better machine. Jyothish confirmed that the number of xrootd connections being the same is expected due to the round-robin assignment of requests.
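
      As a rough illustration of the round-robin point above, the sketch below hands requests to a set of proxies in strict rotation. Every host ends up with the same connection count, so any difference in throughput must come from the per-connection rate. The host names other than svc20 and all of the rates are hypothetical values chosen for the example, not measurements from Vande.

```python
from collections import Counter
from itertools import cycle


def assign_round_robin(proxies, n_requests):
    """Hand out requests to proxies in strict rotation and count them."""
    counts = Counter({p: 0 for p in proxies})
    rotation = cycle(proxies)
    for _ in range(n_requests):
        counts[next(rotation)] += 1
    return counts


# Hypothetical host names (apart from svc20) and made-up relative rates.
proxies = ["svc20", "proxy-a", "proxy-b"]
rate_per_connection = {"svc20": 2.0, "proxy-a": 1.0, "proxy-b": 1.0}

connections = assign_round_robin(proxies, 300)

# Equal connection counts, but throughput scales with each host's rate.
throughput = {p: connections[p] * rate_per_connection[p] for p in proxies}

for p in proxies:
    print(p, connections[p], throughput[p])
```

      With these assumed numbers each proxy receives 100 connections, while the faster machine shows twice the throughput of the others.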

      There was discussion this week at RAL in the #networking Slack channel about the IPv6 connectivity of the AAA machines. A ticket has been sent to DI.

      Job performance variable again - further issues being investigated on the CMS side for particular campaigns with very low efficiency.

      CMS job submission to use EL9 queue only - this seems to be mostly working. CMS think only EL9 is being used. However, at RAL we see that a few jobs are still EL7 - Jose pointed out that they are coming from one particular CMS machine. Katy has requested more information about this.

      Transfer failures to Antares from both CERN and Echo since last Tuesday evening were caused by an upgrade issue and the xrootd version not being 'pinned'. The Antares team fixed this. CMS has not had a lot of tape activity this week so was not affected by the issue over garbage collection. Katy is still following up the handful of production transfers that have been failing for several weeks - CMS DM has a ticket and we are currently a bit confused about Rucio's behaviour.

    • 14:20 14:21
      VO-Liaison ATLAS 1m
      Speakers: Brij Kishor Jashal (RAL, TIFR and IFIC), Jyoti Prakash Biswal (Rutherford Appleton Laboratory)
    • 14:25 14:26
      VO Liaison LHCb 1m
      Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)

      Operational issues:

      • Glasgow file access issue is affecting RAL as well
        • Bug in xrootd client, so we cannot do much
        • LHCb is working on a mitigation.
      • ETF tests are still failing (EL7 queue is not decommissioned)
        • Warnings are due to lack of openssl
        • Critical errors are probably due to ARC bug ("Job completed but failed to get job output")
      • A lot of LHCb jobs killed recently for exceeding memory limits
        • Buggy production from LHCb
        • Work in progress on pilot limit enforcing
      • Xrootd Bug follow-up?
    • 14:30 14:33
      VO Liaison LSST 3m
      Speaker: Timothy John Noble (Science and Technology Facilities Council STFC (GB))
    • 14:35 14:36
      VO Liaison APEL 1m
      Speaker: Thomas Dack
    • 14:39 14:40
      VO Liaison Others 1m
      Speakers: Alexander Rogovskiy (Rutherford Appleton Laboratory), Brij Kishor Jashal (RAL, TIFR and IFIC), Jyoti Prakash Biswal (Rutherford Appleton Laboratory), Katy Ellis (Science and Technology Facilities Council STFC (GB))
    • 14:45 14:46
      AOB 1m
    • 14:50 14:51
      Any other Business 1m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore