RAL Tier1 Experiments Liaison Meeting

Europe/London
Access Grid (RAL R89)

    • 13:38 13:39
      Major Incidents / Changes 1m
    • 13:39 13:40
      Summary of Operational Status and Issues 1m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB))
    • 13:40 13:41
      GGUS/RT Tickets 1m

      https://tinyurl.com/T1-GGUS-Open
      https://tinyurl.com/T1-GGUS-Closed

    • 13:41 13:42
      Site Availability 1m

      https://lcgwww.gridpp.rl.ac.uk/utils/availchart/

      https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T1_UK_RAL

      http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=RAL-LCG2&startTime=2020-01-29&endTime=2020-02-06&templateType=isGolden

    • 13:42 13:43
      Experiment Operational Issues 1m
    • 13:44 13:45
      VO Liaison ATLAS 1m
      Speakers: James William Walder (Science and Technology Facilities Council STFC (GB)), Dr Tim Adye (Science and Technology Facilities Council STFC (GB))

      ATLAS back to 98% fairshare (from Vande).

       

    • 13:46 13:47
      VO Liaison CMS 1m
      Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))

      Echo downtime yesterday seemed to go fine: SAM tests and PhEDEx transfers are green. CMS never dropped below its normal number of cores on the batch farm, and in fact picked up more than normal when cores were released by e.g. ATLAS. I speculate this was for two reasons:

      1. Some CMS jobs (pilots) last a really long time, and the downtime was only ~6 hours; the drain was started 24 hours in advance. Some input data may have been accessed via AAA where necessary.

      2. It appeared that only 'Production' jobs were put into drain. RAL picked up a lot of User Analysis jobs, many of which failed, but by no means all. 

      More failures than normal were seen in the FileOpen category. In normal running we do see some of these, but usually far more fall into the FileRead category. I don't believe there was any problem staging out data from completed jobs - none of those errors appeared. I didn't check explicitly whether this happened, but the design is to stage out to RALPP if the local storage is unavailable (a rough sketch of this fallback idea follows).
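
      The same try-local-then-fall-back pattern underlies both the AAA read access mentioned above and the stage-out-to-RALPP design. Purely as an illustrative sketch, not the actual CMS machinery (which is driven by the site configuration): the local endpoint name below is a made-up placeholder, while cms-xrd-global.cern.ch is the well-known AAA global redirector.

      import subprocess

      # Placeholder for the site's local xrootd door; NOT a real hostname.
      LOCAL_PREFIX = "root://local-echo-gateway.example//"
      # AAA global redirector (the usual CMS federation entry point).
      AAA_PREFIX = "root://cms-xrd-global.cern.ch//"

      def fetch(lfn, dest):
          """Try local storage first; fall back to AAA if the local copy fails."""
          for prefix in (LOCAL_PREFIX, AAA_PREFIX):
              result = subprocess.run(["xrdcp", prefix + lfn.lstrip("/"), dest])
              if result.returncode == 0:
                  return prefix  # report which source actually served the file
          raise RuntimeError("could not fetch %s from local storage or AAA" % lfn)

      # e.g. fetch("/store/data/example/file.root", "/tmp/file.root")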

      Job efficiency actually reacted positively at times! I believe this was due to all the Analysis jobs running, which typically have lower I/O requirements.

      I am making a document to provide evidence of all of this.

       

      Other stuff: IPv4 slowness is being investigated by DI. SAM tests for AAA have been failing intermittently - possibly network-related.

    • 13:48 13:49
      VO Liaison LHCb 1m
      Speaker: Raja Nandakumar (Science and Technology Facilities Council STFC (GB))

      LHCb

      1. Normal operations resumed automatically after yesterday's downtime
      2. Streaming issue from ECHO
        • Waiting for a fix to xrootd vector reads (see the sketch after this list for what a vector read involves)
        • Some fine-tuning of the proxy configuration in the production system has been done, based on our improved understanding
          • I understand this has helped ATLAS error rates
          • No improvement seen in LHCb jobs
        • Next: working on understanding the remaining failures and their breakdown by generation
          • For example, the "file name too long" error which has been seen
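
      For context on the vector-read item above: a vector read bundles several (offset, length) chunks into a single request instead of issuing one round trip per chunk. A minimal sketch with the XRootD Python bindings might look like the following; the URL is a placeholder and the exact binding names should be checked against the installed pyxrootd version.

      from XRootD import client

      f = client.File()
      status, _ = f.open("root://example-gateway//some/file.root")  # placeholder URL
      if not status.ok:
          raise RuntimeError(status.message)

      # One vectored request for three scattered byte ranges, as (offset, length) pairs.
      status, response = f.vector_read(chunks=[(0, 4096), (1 << 20, 4096), (10 << 20, 4096)])
      if status.ok:
          for chunk in response.chunks:
              print(chunk.offset, len(chunk.buffer))
      f.close()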

      DUNE

      1. Moved the ETF tests to using CRIC
      2. Still a low level of running jobs on the grid
    • 13:52 13:53
      VO Liaison Others 1m
    • 13:53 13:54
      Experiment Planning 1m
    • 13:54 13:55
      DUNE/ProtoDUNE 1m
    • 13:55 13:56
      Euclid 1m
    • 13:56 13:57
      SKA 1m
    • 13:57 13:58
      AOB 1m
    • 13:58 13:59
      Any Other Business 1m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB))