RAL Tier1 Experiments Liaison Meeting

Europe/London
Access Grid (RAL R89)

Description

Please attend via the following Zoom meeting:

https://ukri.zoom.us/j/98562731547?pwd=UU9Wb2xCL05tWmROT1h6SUlWdUJ3dz09

 

    • 13:38 13:39
      Major Incidents Changes 1m
    • 13:39 13:40
      Summary of Operational Status and Issues 1m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB))
    • 13:40 13:41
      GGUS/RT Tickets 1m

      https://tinyurl.com/T1-GGUS-Open
      https://tinyurl.com/T1-GGUS-Closed

    • 13:41 13:42
      Site Availability 1m

      https://lcgwww.gridpp.rl.ac.uk/utils/availchart/

      https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T1_UK_RAL

      http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=RAL-LCG2&startTime=2020-01-29&endTime=2020-02-06&templateType=isGolden

    • 13:42 13:43
      Experiment Operational Issues 1m
    • 13:44 13:45
      VO-Liaison ATLAS 1m
      Speakers: James William Walder (Science and Technology Facilities Council STFC (GB)), Dr Tim Adye (Science and Technology Facilities Council STFC (GB))

      Updated Echo allocations for FY 21/22
       - https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=406939 
       - No issues apparent; the ATLAS space JSON will be updated today.


      GGUS-Ticket-ID: #151098 "IN PROGRESS" "NGI_UK" "High failure rate at RAL-LCG2_TEST"
      * An interaction between Docker and the pilot may be causing unexpected termination of the Docker job.
      i.e. after job 1, the pilot tries to remove any orphaned processes with a kill signal,
      and may be killing 'something' that terminates the Docker job (HTCondor receives ExitReason = "died on signal 9 (Killed)").
       - If confirmed, ... ?
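
      As a rough illustration of the symptom described in the ticket (not the pilot's or HTCondor's actual code), the sketch below starts a child process, sends it SIGKILL, and shows how the parent then sees the exit status. The helper names `run_and_kill` and `describe` are hypothetical, and the wording of the message only imitates the HTCondor ExitReason quoted above. It assumes a POSIX system.

```python
import signal
import subprocess
import sys


def run_and_kill(sig=signal.SIGKILL):
    """Start a sleeping child process, send it `sig`, and return its exit status."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    child.send_signal(sig)  # SIGKILL cannot be caught or ignored by the child
    return child.wait(timeout=10)


def describe(returncode):
    """Render a subprocess return code in wording similar to an ExitReason (illustrative)."""
    if returncode < 0:
        # On POSIX, subprocess reports death by signal N as return code -N.
        sig = signal.Signals(-returncode)
        return f"died on signal {sig.value} ({sig.name})"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    print(describe(run_and_kill()))
```

      A process killed this way gets no chance to clean up or report why it stopped, which fits the picture of a job that simply disappears from the batch system's point of view.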

    • 13:46 13:47
      VO Liaison CMS 1m
      Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))

      Still running on 3k cores; hoping to increase that number this week. Job failure rates are acceptable, and efficiency is 40-60% this week.

      SAM tests look better, with fewer 'missing' tests. ARC-CE01 had no test results for 24 hours after a reboot of that machine, but a second reboot seems to have fixed that.

      Talked to James A during the meeting, and he agreed to increase the number of CMS jobs running on the newest software (Dell19 tranche). This number may have dropped recently (since the end of February) because single-core jobs have been taking over, whereas CMS runs only multicore jobs.

    • 13:48 13:49
      VO Liaison LHCb 1m
      Speaker: Raja Nandakumar (Science and Technology Facilities Council STFC (GB))
    • 13:52 13:53
      VO Liaison Others 1m
    • 13:53 13:54
      Experiment Planning 1m
    • 13:54 13:55
      Dune/protoDune 1m
    • 13:55 13:56
      Euclid 1m
    • 13:56 13:57
      SKA 1m
    • 13:57 13:58
      AOB 1m
    • 13:58 13:59
      Any Other Business 1m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB))