RAL Tier1 Experiments Liaison Meeting

Europe/London
Access Grid (RAL R89)

Zoom Meeting ID: 66811541532
Host: Alastair Dewhurst
    • 13:00 13:01
      Major Incidents Changes 1m
    • 13:01 13:02
      Summary of Operational Status and Issues 1m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB))
    • 13:02 13:03
      GGUS /RT Tickets 1m

      https://tinyurl.com/T1-GGUS-Open
      https://tinyurl.com/T1-GGUS-Closed

    • 13:04 13:05
      Site Availability 1m

      https://lcgwww.gridpp.rl.ac.uk/utils/availchart/

      https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T1_UK_RAL

      http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=RAL-LCG2&startTime=2020-01-29&endTime=2020-02-06&templateType=isGolden

    • 13:05 13:06
      Experiment Operational Issues 1m
    • 13:15 13:16
      VO Liaison CMS 1m
      Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))

      Katy is planning to remove the gsiftp tests from the SAM status calculation.

      CMS saw spikes in write failures to Echo on the 18th and 23rd, during the DNS issues. SAM status also failed on those days because of the storage tests (gsiftp and WebDAV).

      Large numbers of Processing-type jobs are failing, but the same is seen at other sites.

    • 13:16 13:17
      VO-Liaison ATLAS 1m
      Speakers: James William Walder (Science and Technology Facilities Council STFC (GB)), Jyoti Prakash Biswal (Rutherford Appleton Laboratory)

      GGUS: 160156 

      The last DNS outage 'reset' most transfers. As of this morning, 33k transfers are submitted to write into Echo, plus 12k files are being recalled from Antares (via Echo).

      Very few failures (O(100)) in the last 24 hours, in cases where the source file had been evicted prior to transfer. We can (hopefully) resolve the ticket this afternoon if no further issues arise.

      DNS: failed name resolution from external hosts:

        • Last Wednesday morning and Monday evening. ~220k transfers failed (not started).
        • GOCDB was also affected (any other ancillary services?).

    • 13:20 13:21
      VO Liaison LHCb 1m
      Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)
      1. The reading part of the tape challenge finished last week. Results look promising: expected throughput was 1.93 GB/s, and we achieved ~1 GB/s more than this.
      2. There was a major LHCb DIRAC update on Monday, which introduced some issues. These were resolved within several hours, but a lot of jobs failed as a result.
      3. Low number of running LHCb jobs due to an insufficient number of production requests.
      4. A consistency check identified some dark and lost data. Dark data was removed and lost files were re-replicated (all data operations were done by the LHCb Computing team).

      Tickets:

      1. Slow checksums (stats):
        • Still waiting
      2. Deletion problems
        • Solved
      3. Problems with simultaneous access to the same file on ECHO
        • On hold, tests are ongoing at Glasgow
      4. Vector read.
        • One more test: what happens to the LHCb applications if a vector read request returns "wrong" data (i.e. not the data that was requested)? This was tested using the same patch, modified so that once in every 1000 vector reads it shifts one of the requested chunks by 1 byte, and the application appears to crash.
        • Dedicated patched WN for production LHCb jobs is being prepared.
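
      For illustration only, the fault injection described under ticket 4 could be sketched roughly as below. This is not the actual patch: the names `vector_read`, `FaultyVectorReader` and `count_wrong_reads` are hypothetical, the file is modelled as an in-memory buffer, and it is assumed that exactly every 1000th call is corrupted (the notes only say "once in a 1000").

```python
import random
from typing import List, Sequence, Tuple

Chunk = Tuple[int, int]  # (offset, length) of one requested byte range


def vector_read(data: bytes, chunks: Sequence[Chunk]) -> List[bytes]:
    """Return the requested byte ranges from an in-memory buffer."""
    return [data[off:off + length] for off, length in chunks]


class FaultyVectorReader:
    """Hypothetical stand-in for a patched vector read.

    Assumption: every `period`-th call shifts one randomly chosen chunk
    by 1 byte, so the caller receives data it did not ask for.
    """

    def __init__(self, data: bytes, period: int = 1000, seed: int = 0) -> None:
        self.data = data
        self.period = period
        self.calls = 0
        self._rng = random.Random(seed)

    def read(self, chunks: Sequence[Chunk]) -> List[bytes]:
        self.calls += 1
        requested = list(chunks)
        if requested and self.calls % self.period == 0:
            # Corrupt this call: move one chunk's offset forward by 1 byte.
            i = self._rng.randrange(len(requested))
            off, length = requested[i]
            requested[i] = (off + 1, length)
        return vector_read(self.data, requested)


def count_wrong_reads(reader: FaultyVectorReader,
                      chunks: Sequence[Chunk],
                      n_reads: int) -> int:
    """Count how many of n_reads calls return data that differs from the request."""
    expected = vector_read(reader.data, chunks)
    return sum(reader.read(chunks) != expected for _ in range(n_reads))
```

      A client that validates what it reads (for example via checksums) would detect these calls; per the notes above, the LHCb application instead appears to crash.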
    • 13:25 13:28
      VO Liaison LSST 3m
      Speaker: Timothy John Noble (Science and Technology Facilities Council STFC (GB))
    • 13:30 13:31
      VO Liaison Others 1m
    • 13:31 13:32
      AOB 1m
    • 13:32 13:33
      Any other Business 1m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB))