RAL Tier1 Experiments Liaison Meeting

Europe/London
Access Grid (RAL R89)

Videoconference
Zoom Meeting ID: 66811541532
Host: Alastair Dewhurst
    • 13:00 13:01
      Major Incidents Changes 1m
    • 13:01 13:02
      Summary of Operational Status and Issues 1m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB)), Kieran Howlett (STFC RAL)
    • 13:02 13:03
      GGUS /RT Tickets 1m

      https://tinyurl.com/T1-GGUS-Open
      https://tinyurl.com/T1-GGUS-Closed

    • 13:04 13:05
      Site Availability 1m

      https://lcgwww.gridpp.rl.ac.uk/utils/availchart/

      https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T1_UK_RAL

      http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=RAL-LCG2&startTime=2020-01-29&endTime=2020-02-06&templateType=isGolden

    • 13:05 13:06
      Experiment Operational Issues 1m
    • 13:15 13:16
      VO Liaison CMS 1m
      Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))

      A lot of problems on the CMS side this week, following the expiry of top-level certificates at CERN on Saturday. 
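
      Not from the report, but as a rough illustration of the failure mode: a minimal sketch that flags already-expired CA certificate files in the local grid trust store. The directory path and "*.pem" pattern are conventional assumptions and may differ per site.

      ```python
      # Hypothetical check: flag CA certificate files that have already expired.
      import glob
      import subprocess

      CA_DIR = "/etc/grid-security/certificates"   # conventional trust-store location (assumption)

      expired = []
      for pem in glob.glob(f"{CA_DIR}/*.pem"):
          # 'openssl x509 -checkend 0' exits non-zero if the certificate has
          # already expired (or if the file is not a certificate at all).
          result = subprocess.run(
              ["openssl", "x509", "-checkend", "0", "-noout", "-in", pem],
              capture_output=True,
          )
          if result.returncode != 0:
              expired.append(pem)

      print(f"{len(expired)} expired (or unreadable) certificate file(s)")
      for pem in expired:
          print("  ", pem)
      ```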

       

      Rucio did not have a valid CMS certificate, so no transfers completed successfully between Saturday lunchtime and midnight on Monday/Tuesday.

       

      RAL Tier 1 went into drain for production work on Sunday. SAM tests were green and I could not see a good reason for the drain, so I forced us back into production (other T1s were running jobs at normal capacity). Failure rates were good; efficiency was variable.

       

      HammerCloud tests stopped across the whole of the CMS grid from Saturday to Tuesday.

       

      On Friday I saw a high failure rate of Analysis jobs again (as reported a couple of times now). These were (again) reading mostly across the trans-Atlantic link. They all came from one user who was running large numbers of CRAB jobs across the grid with the 'IgnoreLocality' option, which does not match jobs to the location of their input data. I wrote a polite email; he has now killed those jobs and will hopefully let the system assign jobs to a better location.
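
      For reference, a minimal fragment of a CRAB3 configuration (Python) showing the setting in question; the dataset and site names below are placeholders, not the user's actual configuration.

      ```python
      # Illustrative CRAB3 configuration fragment (placeholders only).
      from CRABClient.UserUtilities import config

      config = config()

      config.Data.inputDataset = "/SomeDataset/SomeCampaign/MINIAODSIM"  # placeholder

      # ignoreLocality = True lets jobs run at sites that do not host the input
      # data, so events are read remotely (e.g. across the trans-Atlantic link).
      # The default (False) keeps jobs at sites where the data actually resides.
      config.Data.ignoreLocality = True

      # When locality is ignored, CRAB expects an explicit site whitelist.
      config.Site.whitelist = ["T2_US_*", "T2_UK_*"]  # placeholder patterns
      ```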

       

      Failure rates on particular WNs at the RAL T1 batch farm: the liaisons have been investigating significantly elevated failure rates split by worker node. My analysis showed a surprising number of the most recent (2022) nodes having 'significant' failure rates (my definition being >10% failures, against a farm-wide average of 4%, during the period 10-20th April).
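
      A rough sketch of the per-worker-node calculation described above; the input file and its columns ('wn', 'status') are assumptions for illustration, not the liaisons' actual job dump.

      ```python
      # Flag worker nodes whose failure rate exceeds the 'significant' threshold.
      import pandas as pd

      FARM_AVERAGE = 0.04   # ~4% failures across the farm (figure from the report)
      SIGNIFICANT = 0.10    # >10% failures counted as 'significant' (from the report)

      # Hypothetical input: one row per job with worker node name and final status.
      jobs = pd.read_csv("jobs_10-20_april.csv")   # assumed columns: 'wn', 'status'

      per_wn = (
          jobs.assign(failed=jobs["status"].eq("failed"))
              .groupby("wn")["failed"]
              .agg(total="size", failures="sum")
      )
      per_wn["failure_rate"] = per_wn["failures"] / per_wn["total"]

      flagged = per_wn[per_wn["failure_rate"] > SIGNIFICANT].sort_values(
          "failure_rate", ascending=False
      )
      print(f"farm average for reference: {FARM_AVERAGE:.0%}")
      print(flagged)
      ```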

       

      AAA at RAL T1: there was an Echo downtime for reboots on Thursday; the RAL-based redirector was also rebooted. After the downtime the AAA manager continued to fail SAM tests, so we got a red status for the day. I fixed it by the end of the working day.
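
      As an aside, a minimal liveness probe against an xrootd redirector, roughly the kind of thing the SAM test exercises; the hostname is a placeholder and the 'xrdfs ... query config version' call is an assumed convenience check, not the SAM test itself.

      ```python
      # Hypothetical probe: does the redirector answer a basic xrdfs query?
      import subprocess

      REDIRECTOR = "xrootd-redirector.example.ac.uk:1094"   # placeholder hostname

      def redirector_alive(host: str, timeout: int = 30) -> bool:
          """Return True if the redirector answers a basic config query."""
          try:
              result = subprocess.run(
                  ["xrdfs", host, "query", "config", "version"],
                  capture_output=True, text=True, timeout=timeout,
              )
              return result.returncode == 0
          except (subprocess.TimeoutExpired, FileNotFoundError):
              return False

      if __name__ == "__main__":
          status = "responding" if redirector_alive(REDIRECTOR) else "NOT responding"
          print(f"{REDIRECTOR}: {status}")
      ```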

       

      The 2023 data tape families have been created, with an adjustment to the requested families based on the 2022 experience.

       

      Tape deletions (1.5 PB) were running yesterday and appear to have finished. ~2600 files were 'not found', which is usually fine.

       

      Looking at monitoring job read rates:
      https://monit-grafana.cern.ch/d/BZfBLpE4k/user-kellis-average-data-input-over-read-time?orgId=11&from=now-4y&to=now&viewPanel=3&editPanel=3
      You can see RAL is very low in the first years, but is closer to other sites in recent months (still not amazing, but perhaps that is not surprising given we are further from e.g. CERN than IN2P3, CNAF…?)
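
      For clarity, the panel's metric is essentially total bytes read divided by time spent reading, aggregated per site. A toy calculation with made-up numbers (the record format is an assumption, not the monitoring schema):

      ```python
      # Toy per-site 'average data input over read time' calculation.
      from collections import defaultdict

      # Hypothetical job records: (site, bytes read, time spent reading in seconds).
      records = [
          ("T1_UK_RAL", 2.1e9, 1400.0),
          ("T1_FR_CCIN2P3", 2.3e9, 600.0),
          ("T1_IT_CNAF", 1.9e9, 550.0),
      ]

      totals = defaultdict(lambda: [0.0, 0.0])   # site -> [bytes, seconds]
      for site, nbytes, read_time in records:
          totals[site][0] += nbytes
          totals[site][1] += read_time

      for site, (nbytes, seconds) in sorted(totals.items()):
          rate_mb_per_s = (nbytes / seconds) / 1e6 if seconds else 0.0
          print(f"{site:16s} {rate_mb_per_s:8.2f} MB/s")
      ```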

    • 13:16 13:17
      VO-Liaison ATLAS 1m
      Speakers: James William Walder (Science and Technology Facilities Council STFC (GB)), Jyoti Prakash Biswal (Rutherford Appleton Laboratory)
      • Over the weekend: issues related to the expiration of the CERN CA intermediate certificate
        • A variety of different problems surfaced.
        • Manual interventions in ATLAS brought most services back (permanent fixes done this week)

       

      • Observed a number of job failures (piloterrorcode 1368) with timeouts at the point of setting up the ATLAS software via CVMFS:
        • Errors appeared to stop early in the morning on the 25th (?)
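
      A small sketch (not the pilot's actual check) of how one might probe for this failure mode on a worker node: timing access to the ATLAS CVMFS repository. The 60 s timeout is an arbitrary assumption.

      ```python
      # Hypothetical probe: can the ATLAS CVMFS repository be listed in time?
      import subprocess

      REPO = "/cvmfs/atlas.cern.ch"   # ATLAS software repository on CVMFS

      def cvmfs_responds(path: str = REPO, timeout: int = 60) -> bool:
          """Return True if the CVMFS repository can be listed within the timeout."""
          try:
              # Listing via a subprocess so a hung autofs mount hits the timeout
              # rather than blocking the caller indefinitely.
              subprocess.run(
                  ["ls", path],
                  check=True, timeout=timeout,
                  stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
              )
              return True
          except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
              return False

      if __name__ == "__main__":
          print(f"{REPO}: {'OK' if cvmfs_responds() else 'timed out or missing'}")
      ```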

       

      • Today's issue:
        • Harvester configuration issue: it was using GridFTP to submit jobs via HTCondor 10 to the ARC-CEs, which cannot work, resulting in pilot faults:
          • Now fixed, and hopefully this was the source of the reduced number of jobs running at RAL

       

      FTS: there is currently a backlog of 154k transfers from RAL. Likely related to the CERN certificate issues and to a number of CERN FTS hosts that stopped working.

       

       

    • 13:20 13:21
      VO Liaison LHCb 1m
      Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)

      Tickets:

      • Transfer failures to IHEP
        • IHEP is involved in the ticket and has acknowledged problems on their side
        • The site is suspended in GGUS, so a ticket cannot be opened against it
      • Env variable removal request
        • LHCb confirmed that XrdSecGSIDELEGPROXY is set in their environment (a minimal check is sketched after this list)
        • Waiting for them to proceed with its deletion
      • Vector read
        • See slides
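
      A trivial, purely illustrative sketch of the check behind the environment-variable request above:

      ```python
      # Report whether XrdSecGSIDELEGPROXY is present in the job environment.
      import os

      value = os.environ.get("XrdSecGSIDELEGPROXY")
      if value is None:
          print("XrdSecGSIDELEGPROXY is not set")
      else:
          print(f"XrdSecGSIDELEGPROXY is still set (value {value!r})")
      ```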

      Operational issues:

      • Failed FTS transfers from IN2P3 to Antares
        • Corresponds to the outage of the French NREN
        • Is Antares on LHCONE?
    • 13:25 13:28
      VO Liaison LSST 3m
      Speaker: Timothy John Noble (Science and Technology Facilities Council STFC (GB))
    • 13:30 13:31
      VO Liaison Others 1m
    • 13:31 13:32
      AOB 1m
    • 13:32 13:33
      Any other Business 1m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB))