RAL Tier1 Experiments Liaison Meeting
Access Grid
RAL R89
13:00 → 13:01  Major Incidents Changes (1m)
13:01 → 13:02  Summary of Operational Status and Issues (1m)
Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB)), Kieran Howlett (STFC RAL)
13:02 → 13:03  GGUS / RT Tickets (1m)
https://tinyurl.com/T1-GGUS-Open
https://tinyurl.com/T1-GGUS-Closed
13:04 → 13:05  Site Availability (1m)
https://lcgwww.gridpp.rl.ac.uk/utils/availchart/
https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T1_UK_RAL
http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=RAL-LCG2&startTime=2020-01-29&endTime=2020-02-06&templateType=isGolden
13:05 → 13:06  Experiment Operational Issues (1m)
13:15 → 13:16  VO Liaison CMS (1m)
Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))
A lot of problems on the CMS side this week, following the expiry of top-level certificates at CERN on Saturday.
Rucio did not have a valid CMS certificate and did not complete a successful transfer between Saturday lunchtime and Monday/Tuesday midnight.
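(Not part of the original notes: a minimal sketch of the kind of certificate-expiry check that would flag this sort of problem ahead of time. The certificate path and the 7-day window are placeholders, not the actual Rucio/CMS setup.)

# Hypothetical sketch: warn if a PEM certificate expires within the next 7 days.
# The path below is a placeholder, not the real Rucio/CMS certificate location.
import subprocess
import sys

CERT_PATH = "/etc/grid-security/hostcert.pem"  # placeholder path
WARN_SECONDS = 7 * 24 * 3600                   # arbitrary 7-day warning window

# `openssl x509 -checkend N` exits non-zero if the certificate expires within N seconds.
result = subprocess.run(
    ["openssl", "x509", "-in", CERT_PATH, "-noout", "-enddate",
     "-checkend", str(WARN_SECONDS)],
    capture_output=True, text=True,
)
print(result.stdout.strip())  # prints the notAfter date plus the checkend verdict
if result.returncode != 0:
    sys.exit(f"WARNING: {CERT_PATH} expires within 7 days (or is already expired)")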
RAL Tier 1 went into drain for production work on Sunday. SAM tests were green and I could not see a good reason for the drain, so I forced us back into production (other T1s were running jobs at normal capacity). Failure rates were good; efficiency was variable.
HammerCloud tests stopped (across the whole of the CMS grid) from Saturday to Tuesday.
On Friday I again saw a high failure rate for Analysis jobs (as reported a couple of times now). These were, again, mostly reading across the trans-Atlantic link. They all belonged to one user who was running large numbers of CRAB jobs across the grid with the 'IgnoreLocality' option, which does not match jobs to the location of their input data. I sent a polite email; he has now killed those jobs and will hopefully let the system assign jobs to a better location.
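(For context, a minimal sketch of the CRAB3 configuration fragment involved, assuming the standard CRABClient; the request and dataset names are placeholders. Data.ignoreLocality = True is the option referred to above: it lets jobs run at sites that do not host the input data, so reads go over the WAN.)

# Hypothetical CRAB3 configuration fragment; names are placeholders.
from CRABClient.UserUtilities import config

config = config()
config.General.requestName = "example_task"                  # placeholder
config.Data.inputDataset = "/SomePrimary/SomeProcessed/AOD"  # placeholder
config.Data.ignoreLocality = True   # jobs are NOT matched to the data location
config.Site.whitelist = ["T2_*"]    # a whitelist is normally expected with ignoreLocality

Leaving ignoreLocality at its default (False) lets the submission infrastructure place jobs close to the input data.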
Failure rates on particular WNs at the RAL T1 batch farm: the liaisons have been investigating significantly elevated failure rates broken down by worker node. My analysis showed a surprising number of the most recent (2022) nodes with 'significant' failures (my definition being >10% failures, against a farm-wide average of 4%, over the period 10-20 April).
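(A rough sketch of the per-worker-node breakdown described above; the job records, node names and counts are illustrative only, and the real numbers come from the batch/monitoring system.)

# Hypothetical sketch: flag worker nodes whose failure rate is 'significant'
# (>10%) compared with the farm-wide average. The data below is illustrative.
from collections import defaultdict

# (worker_node, job_succeeded) pairs, as might be extracted from batch logs
jobs = [
    ("wn-2022-001", False), ("wn-2022-001", True), ("wn-2022-001", False),
    ("wn-2019-042", True), ("wn-2019-042", True), ("wn-2019-042", True),
]

SIGNIFICANT = 0.10  # the >10% threshold used in the note above

totals = defaultdict(lambda: [0, 0])  # node -> [failed, total]
for node, ok in jobs:
    totals[node][1] += 1
    if not ok:
        totals[node][0] += 1

farm_failed = sum(f for f, _ in totals.values())
farm_total = sum(t for _, t in totals.values())
print(f"farm-wide failure rate: {farm_failed / farm_total:.1%}")

for node, (failed, total) in sorted(totals.items()):
    rate = failed / total
    flag = "  <-- significant" if rate > SIGNIFICANT else ""
    print(f"{node}: {rate:.1%} of {total} jobs failed{flag}")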
AAA at RAL T1: there was an Echo downtime for reboots on Thursday; the RAL-based redirector was also rebooted. After the downtime the AAA manager continued to fail SAM tests, so we got a red status for the day. I fixed it by the end of the working day.
Had the 2023 data tape families created. Made an adjustment to the requested families based on the 2022 experience.
Tape deletions (1.5 PB) were happening yesterday and appear to be finished. ~2600 files were 'not found', which is usually fine.
Looking at monitoring job read rates:
https://monit-grafana.cern.ch/d/BZfBLpE4k/user-kellis-average-data-input-over-read-time?orgId=11&from=now-4y&to=now&viewPanel=3&editPanel=3
You can see RAL is very low in the first years, but closer to other sites in recent months (still not amazing, but perhaps that is not surprising given we are further from e.g. CERN than IN2P3, CNAF…?).
13:16 → 13:17  VO-Liaison ATLAS (1m)
Speakers: James William Walder (Science and Technology Facilities Council STFC (GB)), Jyoti Prakash Biswal (Rutherford Appleton Laboratory)
- Over the weekend: issue related to the expiration of the CERN CA intermediate certificate.
- A variety of different problems surfaced.
- Manual interventions in ATLAS to bring most services back (permanent fixes done this week).
- Observed a number of job failures (pilot error code 1368) with timeouts at the point of setting up ATLAS software via CVMFS (see the probe sketch after this list):
- Errors appeared to stop early a.m. on the 25th (?)
- Today's issue:
- Harvester configuration issue: using GridFTP to submit jobs via HTCondor 10 to the ARC-CEs, which cannot work, resulting in pilot faults.
- Now fixed, and hopefully this was the source of the reduced number of jobs running at RAL.
- FTS: currently a backlog of 154k transfers from RAL, likely related to the CERN certificate issues and a number of CERN FTS hosts that stopped working.
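(Not from the meeting itself: a small sketch of how the CVMFS symptom above might be probed on a worker node, assuming the standard cvmfs_config client tool is installed; the repository name and timeout are placeholders.)

# Hypothetical sketch: probe a CVMFS repository and treat a slow or failed
# mount as an error, matching the 'timeout while setting up software' symptom.
import subprocess

REPOSITORY = "atlas.cern.ch"  # placeholder repository name
TIMEOUT_S = 60                # arbitrary timeout for this sketch

try:
    # `cvmfs_config probe <repo>` mounts the repository and reports OK/FAILED.
    result = subprocess.run(
        ["cvmfs_config", "probe", REPOSITORY],
        capture_output=True, text=True, timeout=TIMEOUT_S,
    )
    print(result.stdout.strip())
    if result.returncode != 0:
        print(f"CVMFS probe of {REPOSITORY} FAILED")
except subprocess.TimeoutExpired:
    print(f"CVMFS probe of {REPOSITORY} timed out after {TIMEOUT_S}s")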
13:20 → 13:21  VO Liaison LHCb (1m)
Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)
Tickets:
- Transfer failures to IHEP
- IHEP is involved in the ticket and acknowledged problems on their side
- The site is suspended in GGUS, so a ticket cannot be opened against them
- Environment variable removal request (a small check sketch follows this list)
- LHCb confirmed that XrdSecGSIDELEGPROXY is set in their environment
- Waiting for them to proceed with its removal
- Vector read
- See slides
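(As an aside, a trivial sketch of checking for the variable in question; purely illustrative.)

# Hypothetical sketch: report whether XrdSecGSIDELEGPROXY is set in the
# current environment, as LHCb confirmed it is in theirs.
import os

value = os.environ.get("XrdSecGSIDELEGPROXY")
if value is None:
    print("XrdSecGSIDELEGPROXY is not set")
else:
    print(f"XrdSecGSIDELEGPROXY={value} (set; pending removal per the request above)")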
Operational issues:
- Failed FTS transfers from IN2P3 to Antares
- Corresponds to the outage of the French NREN
- Is Antares on LHCONE?
13:25 → 13:28  VO Liaison LSST (3m)
Speaker: Timothy John Noble (Science and Technology Facilities Council STFC (GB))
13:30 → 13:31  VO Liaison Others (1m)
13:31 → 13:32  AOB (1m)
13:32 → 13:33  Any Other Business (1m)
Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB))