RAL Tier1 Experiments Liaison Meeting

Europe/London
Access Grid (RAL R89)

Videoconference
Zoom Meeting ID: 66811541532
Host: Alastair Dewhurst
    • 13:00 13:01
      Major Incidents Changes 1m
    • 13:01 13:02
      Summary of Operational Status and Issues 1m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB))
    • 13:02 13:03
      GGUS/RT Tickets 1m

      https://tinyurl.com/T1-GGUS-Open
      https://tinyurl.com/T1-GGUS-Closed

    • 13:04 13:05
      Site Availability 1m

      https://lcgwww.gridpp.rl.ac.uk/utils/availchart/

      https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T1_UK_RAL

      http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=RAL-LCG2&startTime=2020-01-29&endTime=2020-02-06&templateType=isGolden

    • 13:05 13:06
      Experiment Operational Issues 1m
    • 13:15 13:16
      VO Liaison CMS 1m
      Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))

      Images show new tables I have made in Kibana/OpenSearch to show the number of failures per worker node (WN) over the last few days. I added one image for a period of 15 days and another for 16 days, because there seems to have been a huge number of failures on one WN 16 days ago. Looking at just the last 15 days, there is no particular problem with any one WN. A few WNs show more than 20% error. A few show more than 50% error, but these are running relatively few jobs, possibly all SAM test jobs running on ML cores rather than multicore.

      Another occurrence of the DNS issue this morning (third apparent appearance in 2 weeks). However, today's could be attributed to some work being done by DI, e.g. https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=477872

      SAM tests are failing due to the above, but have been going green in the last few hours. Likewise, prod transfer efficiency is coming back.

      Farm is low on capacity due to WN firewall updates.

      Failure rate and efficiency of jobs have been good in the last week.
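
As a rough illustration of the per-WN failure tables described above, the sketch below computes the failure fraction per worker node and flags nodes above a threshold, while ignoring nodes that ran too few jobs to be meaningful. It is not the actual Kibana/OpenSearch query: the input format (pairs of node name and success flag), the function names, and the `min_jobs` cut are all assumptions made for the example.

```python
from collections import defaultdict


def failure_fractions(jobs):
    """Return {worker_node: (n_jobs, fraction_failed)}.

    `jobs` is an iterable of (worker_node, succeeded) pairs; this record
    layout is hypothetical and chosen only for illustration.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for worker_node, succeeded in jobs:
        totals[worker_node] += 1
        if not succeeded:
            failures[worker_node] += 1
    return {wn: (n, failures[wn] / n) for wn, n in totals.items()}


def flag_nodes(fractions, threshold=0.2, min_jobs=10):
    """List WNs whose failure fraction exceeds `threshold`.

    Nodes with fewer than `min_jobs` jobs are skipped, since a high
    error fraction on a handful of jobs says little about the node.
    """
    return sorted(
        wn
        for wn, (n_jobs, frac) in fractions.items()
        if n_jobs >= min_jobs and frac > threshold
    )


if __name__ == "__main__":
    # Made-up sample data, not taken from the farm.
    sample = (
        [("wn1", True)] * 8 + [("wn1", False)] * 2
        + [("wn2", False)] * 2 + [("wn2", True)]
        + [("wn3", False)] * 5 + [("wn3", True)] * 5
    )
    print(flag_nodes(failure_fractions(sample)))
```

Lowering `min_jobs` brings the low-statistics nodes back into view, which is the case where the 50% entries in the tables should be read with caution.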

    • 13:16 13:17
      VO-Liaison ATLAS 1m
      Speakers: James William Walder (Science and Technology Facilities Council STFC (GB)), Jyoti Prakash Biswal (Rutherford Appleton Laboratory)
      • ATLAS-wide Rucio issues on Monday affected job stage-in and stage-out. Believed to be due to overloading of servers with presigning of URLs on non-standard storage.
      • DNS name resolution failed on Wednesday morning ~ 30
    • 13:20 13:21
      VO Liaison LHCb 1m
      Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)

      Very low number of running jobs, due to lack of production requests from LHCb.

      Vector read patch production testing has started. The patch does not resolve the problem.

    • 13:25 13:28
      VO Liaison LSST 3m
      Speaker: Timothy John Noble (Science and Technology Facilities Council STFC (GB))
    • 13:30 13:31
      VO Liaison Others 1m
    • 13:31 13:32
      AOB 1m
    • 13:32 13:33
      Any other Business 1m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB))