RAL Tier1 Experiments Liaison Meeting

Europe/London
Access Grid (RAL R89)

Description

Please attend via the following Zoom meeting:

https://ukri.zoom.us/j/98562731547?pwd=UU9Wb2xCL05tWmROT1h6SUlWdUJ3dz09

 

    • 12:38
      Major Incidents Changes
    • 1
      Summary of Operational Status and Issues
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB))
    • 2
      GGUS/RT Tickets

      https://tinyurl.com/T1-GGUS-Open
      https://tinyurl.com/T1-GGUS-Closed

    • 3
      Site Availability

      https://lcgwww.gridpp.rl.ac.uk/utils/availchart/

      https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T1_UK_RAL

      http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=RAL-LCG2&startTime=2020-01-29&endTime=2020-02-06&templateType=isGolden

    • 12:42
      Experiment Operational Issues
    • 4
      VO-Liaison ATLAS
      Speakers: James William Walder (Science and Technology Facilities Council STFC (GB)), Dr Tim Adye (Science and Technology Facilities Council STFC (GB))

      O(3k) job failures overnight from 9 problematic WNs (no running xrootd containers, but the nodes were still accepting jobs).
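      About 3k failures spread over 9 nodes works out to roughly 330 failed jobs per node overnight. This is typical of a node that fails jobs quickly and so keeps pulling in new ones (sometimes called a black-hole node). The minutes do not say how these nodes were identified. As a minimal sketch, assuming per-job records of (worker node, succeeded) are available, such nodes can be flagged by their failure fraction. The function name, record format, thresholds and node names below are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def find_black_hole_nodes(
    jobs: Iterable[Tuple[str, bool]],
    min_jobs: int = 20,
    max_failure_fraction: float = 0.9,
) -> List[str]:
    """Return worker nodes whose job failure fraction exceeds a threshold.

    Hypothetical input: `jobs` yields (worker_node, succeeded) pairs.
    Nodes with fewer than `min_jobs` jobs are ignored so that a handful
    of unlucky failures on a quiet node does not trigger a flag.
    """
    total: Counter = Counter()
    failed: Counter = Counter()
    for node, ok in jobs:
        total[node] += 1
        if not ok:
            failed[node] += 1
    # Sorted for stable, readable output.
    return sorted(
        node
        for node, n in total.items()
        if n >= min_jobs and failed[node] / n > max_failure_fraction
    )


if __name__ == "__main__":
    # Illustrative data only: one healthy node, one failing node,
    # and one node with too few jobs to judge.
    jobs = (
        [("wn-good", True)] * 50
        + [("wn-bad", False)] * 300
        + [("wn-small", False)] * 5
    )
    print(find_black_hole_nodes(jobs))
```

      Failure fraction alone does not explain why a node is failing. In the case above, the root cause was the missing xrootd container, so a node-level health check would be the more direct fix; this sketch only shows how the symptom could be spotted from job outcomes.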

      ATLAS and NA62 expressed interest in the expected CTA timeline.

      A bug fix for HTTP TPC (pull mode) is available, but not in time for 5.1.0.
       - Running through normal tests now.

    • 5
      VO Liaison CMS
      Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))

      AAA-related SAM tests were completely broken for a couple of days. This turned out to be because the manager came back without an IPv6 address after it was rebooted. I will go back to watching for intermittent failures related to Echo access (including via AAA), and I will try running SAM tests by hand. Ian J also asked me for ways to test vector reads, which I suspect may be related to these SAM test failures (and likely to many other failures and inefficiencies in CMS jobs at RAL). I'm hoping we can try this on a test machine in the next 1-2 weeks.
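      A vector read asks for several (offset, length) chunks of a file in a single request, rather than issuing one request per chunk. The minutes do not describe how the vector-read test will be done. As a minimal sketch of the semantics a by-hand test could check, the function below serves a chunk list from a local file-like object and verifies that each chunk is returned in full. This is not the XRootD client API; `vector_read` and the `Chunk` type are hypothetical names used for illustration.

```python
from typing import BinaryIO, List, Sequence, Tuple

Chunk = Tuple[int, int]  # hypothetical: (offset, length) in bytes


def vector_read(f: BinaryIO, chunks: Sequence[Chunk]) -> List[bytes]:
    """Serve a batch of (offset, length) chunks from one open file.

    Local stand-in for a vector read: the whole chunk list arrives as one
    request and the results come back in the same order as requested.
    """
    out: List[bytes] = []
    for offset, length in chunks:
        if offset < 0 or length < 0:
            raise ValueError(f"bad chunk {(offset, length)}")
        f.seek(offset)
        data = f.read(length)
        # A short read means the chunk ran past end of file.
        if len(data) != length:
            raise EOFError(
                f"short read at offset {offset}: got {len(data)} of {length}"
            )
        out.append(data)
    return out


if __name__ == "__main__":
    import io

    # A 256-byte "file" whose byte at offset i has value i, so every
    # returned chunk can be checked against its offset.
    buf = io.BytesIO(bytes(range(256)))
    for (offset, length), data in zip(
        [(0, 4), (100, 2), (250, 6)],
        vector_read(buf, [(0, 4), (100, 2), (250, 6)]),
    ):
        assert data == bytes(range(offset, offset + length))
    print("all chunks match")
```

      Using a file with a known byte pattern makes it easy to detect chunks that come back misaligned, truncated or out of order. These are the kinds of errors a real vector-read test against storage would need to catch.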

      I also need to continue following up the repeated queries appearing in the redirector logs. These are probably causing a problem, and they are certainly filling up the machine with logs within ~20 days.

       

      The IPv6 side of the firewall change was done yesterday. I did a quick test of transfers from Nebraska and Florida before and after the change, and the transfer rate was better afterwards.

      270 MB file from Florida: 42 s before, 12 s after.

      4 GB file from Nebraska: 201 s before, 141 s after.
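      The timings above can be turned into average throughputs and speed-up factors. The short calculation below does this, assuming decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes), which the minutes do not state. The function names are illustrative.

```python
# Transfer timings quoted above (seconds), before and after the IPv6
# firewall change. Assumption: decimal units, 1 MB = 1e6 bytes and
# 1 GB = 1e9 bytes.
TRANSFERS = {
    "Florida (270 MB)": (270e6, 42.0, 12.0),
    "Nebraska (4 GB)": (4e9, 201.0, 141.0),
}


def throughput_mb_s(size_bytes: float, seconds: float) -> float:
    """Average throughput in MB/s (decimal megabytes per second)."""
    return size_bytes / seconds / 1e6


def summarise(transfers):
    """Return {name: (MB/s before, MB/s after, speed-up factor)}."""
    out = {}
    for name, (size, before_s, after_s) in transfers.items():
        before = throughput_mb_s(size, before_s)
        after = throughput_mb_s(size, after_s)
        # Same file size, so the speed-up is just the ratio of times.
        out[name] = (before, after, before_s / after_s)
    return out


if __name__ == "__main__":
    for name, (before, after, speedup) in summarise(TRANSFERS).items():
        print(f"{name}: {before:.1f} -> {after:.1f} MB/s (x{speedup:.2f})")
```

      This gives roughly 6.4 to 22.5 MB/s (a factor of 3.5) for the Florida file and 19.9 to 28.4 MB/s (a factor of about 1.43) for the Nebraska file. Both are single quick tests, so the numbers indicate direction rather than a measured average.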

      The proof of the pudding will be the IPv4 change (Monday 8th March).

       

      With Darren, I have moved the AAA Vande dashboard into the new area. I am in touch with Christos and plan to add a new plot there to monitor requests to the redirector machine based at RAL. This will give an incomplete picture, but it is better than nothing.

    • 6
      VO Liaison LHCb
      Speaker: Raja Nandakumar (Science and Technology Facilities Council STFC (GB))

      LHCb

      1. Streaming from Echo issue
        • https://ggus.eu/?mode=ticket_info&ticket_id=142350
        • Waiting for information on the development of a fix for vector reads
        • Testing other mitigation measures
          • Issues with job statuses in DIRAC - being investigated
      2. Low number of running jobs
        • https://ggus.eu/?mode=ticket_info&ticket_id=150679
        • Under investigation
        • Number of running jobs went down from 9.5K to ~3K over the last few weeks
        • No known obvious reason
      3. FTS library missing
        • https://ggus.eu/?mode=ticket_info&ticket_id=150653
        • RAL FTS does not support macaroons
        • RAL FTS to be upgraded at some point

      DUNE

      • Nothing special operationally to report
      • From meeting on Monday
        • Confirmed that jobs that run in the UK (not many so far) do read input data from UK storage when it is available
        • Planning changes to the data composition of UK storage on the basis of this.
    • 7
      VO Liaison Others
    • 12:53
      Experiment Planning
    • 8
      Dune/protoDune
    • 9
      Euclid
    • 10
      SKA
    • 12:57
      AOB
    • 11
      Any other Business
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB))