RAL Tier1 Experiments Liaison Meeting

Europe/London
Access Grid (RAL R89)

Description

Please attend via the following Zoom meeting:

https://ukri.zoom.us/j/98562731547?pwd=UU9Wb2xCL05tWmROT1h6SUlWdUJ3dz09

 

    • 13:38 13:39
      Major Incidents Changes 1m
    • 13:39 13:40
      Summary of Operational Status and Issues 1m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB))
    • 13:40 13:41
      GGUS/RT Tickets 1m

      https://tinyurl.com/T1-GGUS-Open
      https://tinyurl.com/T1-GGUS-Closed

    • 13:41 13:42
      Site Availability 1m

      https://lcgwww.gridpp.rl.ac.uk/utils/availchart/

      https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T1_UK_RAL

      http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=RAL-LCG2&startTime=2020-01-29&endTime=2020-02-06&templateType=isGolden

    • 13:42 13:43
      Experiment Operational Issues 1m
    • 13:44 13:45
      VO-Liaison ATLAS 1m
      Speakers: James William Walder (Science and Technology Facilities Council STFC (GB)), Dr Tim Adye (Science and Technology Facilities Council STFC (GB))

      ATLAS needs to run more single-core analysis jobs
      - https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=397775 

      - These jobs will use direct IO, so vectored reads are needed

      Noticed that 100% on Vande no longer corresponds to 100% × 11.7/10 on ATLAS monitoring (the factor accounts for the corepower difference). This is obscured by the current changes.
      - Has there been a recent change to the batch workers?
      - Has there been a change to the absolute fairshare values?
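
      The expected relationship quoted above can be written as a one-line conversion. This is only a sketch: the function and parameter names are hypothetical, and reading 11.7 and 10 as the site and reference corepower values is an assumption based on the note.

```python
def expected_atlas_percent(vande_percent, site_corepower=11.7, reference_corepower=10.0):
    """Scale a Vande usage percentage to the value expected on ATLAS monitoring.

    Assumption: the two monitors differ only by the corepower ratio
    (11.7/10 in the note above); the names here are illustrative.
    """
    return vande_percent * site_corepower / reference_corepower


# 100% on Vande would be expected to show as about 117% on ATLAS monitoring.
print(f"{expected_atlas_percent(100.0):.1f}%")
```

      A sustained difference from this scaled value would point to something other than corepower, such as the batch worker or fairshare changes asked about above.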

      Echo Read access for Oxford ATLAS XCache
      - https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=397191

      TPC-http
       - Bespoke checksum script on Test Gateway to return checksum
       - Return of the '//' macaroon path normalisation issue.

       

    • 13:46 13:47
      VO Liaison CMS 1m
      Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))

      CMS is running 'at pledge' because it was limited so that the LHCb problem could be fixed; LHCb is now running at 200% of its pledge. Most CMS-only nodes are empty.

      SAM tests are looking much better this week, although no change or fix was applied. However, I see a large number of job failures and very low efficiency. The failures are mostly FileOpen or FileRead. I have an example 'step chain' job (i.e. a multi-step job) that I want to try on one of the empty CMS-only nodes, hopefully this week.

      After talking to Chris Brew, we think there is a problem with the /etc/hosts file in the CMS docker config. He says the same IP address cannot be repeated on separate lines, as it is at present:

      172.28.1.1 xrootd.echo.stfc.ac.uk
      172.28.1.1 ceph-gw10.gridpp.rl.ac.uk
      172.28.1.1 ceph-gw11.gridpp.rl.ac.uk

      He suggested asking for this to be changed to a single line:

      172.28.1.1 xrootd.echo.stfc.ac.uk ceph-gw10.gridpp.rl.ac.uk ceph-gw11.gridpp.rl.ac.uk
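
      The requested change amounts to merging all hostnames that share an IP address onto one line. A minimal sketch of that transformation follows; the function name is hypothetical and this is not how the docker config is actually generated, only an illustration of the intended result.

```python
def merge_hosts_lines(text):
    """Collapse hosts-file entries so each IP address appears on one line.

    Comments and blank lines are dropped; hostnames keep their original order.
    """
    merged = {}  # ip -> list of hostnames, in first-seen order
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # strip trailing comments
        if not line:
            continue
        ip, *names = line.split()
        bucket = merged.setdefault(ip, [])
        for name in names:
            if name not in bucket:
                bucket.append(name)
    return "\n".join(f"{ip} {' '.join(names)}" for ip, names in merged.items())


current = """\
172.28.1.1 xrootd.echo.stfc.ac.uk
172.28.1.1 ceph-gw10.gridpp.rl.ac.uk
172.28.1.1 ceph-gw11.gridpp.rl.ac.uk
"""

print(merge_hosts_lines(current))
```

      Run on the three current entries, this prints the single merged line proposed above.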

    • 13:48 13:49
      VO Liaison LHCb 1m
      Speaker: Raja Nandakumar (Science and Technology Facilities Council STFC (GB))

      LHCb

      1. Low number of running jobs
        • https://ggus.eu/?mode=ticket_info&ticket_id=150679
        • Seems fixed after limits were put on CMS and ATLAS
        • Not a permanent solution, but it seems to have allowed LHCb jobs to be picked up by the batch system (???)
      2. ECHO streaming issue
        • Waiting for release of fix to vector reads
        • Timescale?
      3. Trying to understand the discrepancy between the storage usage reported by RAL and by DIRAC
        • Currently a 20% discrepancy; it has been large since 2019 (LHCb's move to ECHO)
        • Date        DIRAC (PB)  RAL, Grafana (PB)
          31/12/2020  5.61        6.46
          31/12/2019  5.62        6.42
          31/12/2018  4.55        4.54
          08/02/2018  4.13        4.10
          31/12/2016  3.61        3.09
          18/08/2016  3.17        3.22
          31/12/2015  3.13        3.12
          25/08/2015  2.30        2.34
          18/01/2015  2.23        2.28
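
      To make the trend easier to see, the relative difference can be computed per date. The sketch below uses the figures listed above; taking DIRAC as the baseline is an assumption, and the names are illustrative. On that baseline the 31/12/2020 row works out at roughly 15%, so the quoted 20% may rest on a different baseline or on more recent numbers.

```python
# (date, DIRAC usage in PB, RAL/Grafana usage in PB), copied from the list above
rows = [
    ("31/12/2020", 5.61, 6.46),
    ("31/12/2019", 5.62, 6.42),
    ("31/12/2018", 4.55, 4.54),
    ("08/02/2018", 4.13, 4.10),
    ("31/12/2016", 3.61, 3.09),
    ("18/08/2016", 3.17, 3.22),
    ("31/12/2015", 3.13, 3.12),
    ("25/08/2015", 2.30, 2.34),
    ("18/01/2015", 2.23, 2.28),
]


def relative_discrepancy(dirac_pb, ral_pb):
    """Fractional difference of the RAL figure relative to DIRAC (assumed baseline)."""
    return (ral_pb - dirac_pb) / dirac_pb


for date, dirac_pb, ral_pb in rows:
    pct = 100 * relative_discrepancy(dirac_pb, ral_pb)
    print(f"{date}: DIRAC {dirac_pb:.2f} PB, RAL {ral_pb:.2f} PB, difference {pct:+.1f}%")
```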

      DUNE

      1. Normal operations
      2. Testing Dynafed access to RAL storage to transfer data between RAL and Fermilab
        • Is Dynafed supported?
        • If not, are other protocols supporting HTTP(S) available?
    • 13:52 13:53
      VO Liaison Others 1m
    • 13:53 13:54
      Experiment Planning 1m
    • 13:54 13:55
      Dune/protoDune 1m
    • 13:55 13:56
      Euclid 1m
    • 13:56 13:57
      SKA 1m
    • 13:57 13:58
      AOB 1m
    • 13:58 13:59
      Any other Business 1m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB))