RAL Tier1 Experiments Liaison Meeting
Access Grid
RAL R89
13:00 → 13:01  Major Incidents Changes (1m)
13:01 → 13:02  Summary of Operational Status and Issues (1m)
Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB)), Kieran Howlett (STFC RAL)
13:02 → 13:03  GGUS / RT Tickets (1m)
https://tinyurl.com/T1-GGUS-Open
https://tinyurl.com/T1-GGUS-Closed
13:04 → 13:05  Site Availability (1m)
https://lcgwww.gridpp.rl.ac.uk/utils/availchart/
https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T1_UK_RAL
http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=RAL-LCG2&startTime=2020-01-29&endTime=2020-02-06&templateType=isGolden
13:05 → 13:06  Experiment Operational Issues (1m)
13:15 → 13:16  VO Liaison CMS (1m)
Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))
A lot of problems on the CMS side this week, following the expiry of top-level certificates at CERN on Saturday.
Rucio did not have a valid CMS certificate and did not complete a successful transfer between Saturday lunchtime and Monday/Tuesday midnight.
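(Not part of the original notes: a minimal sketch of the kind of certificate-expiry check that would flag this sort of problem ahead of time. The certificate path and the 7-day window are placeholders, not the actual Rucio/CMS setup.)

# Hypothetical sketch: warn if a PEM certificate expires within the next 7 days.
# The path below is a placeholder, not the real Rucio/CMS certificate location.
import subprocess
import sys

CERT_PATH = "/etc/grid-security/hostcert.pem"  # placeholder path
WARN_SECONDS = 7 * 24 * 3600                   # arbitrary 7-day warning window

# `openssl x509 -checkend N` exits non-zero if the certificate expires within N seconds.
result = subprocess.run(
    ["openssl", "x509", "-in", CERT_PATH, "-noout", "-enddate",
     "-checkend", str(WARN_SECONDS)],
    capture_output=True, text=True,
)
print(result.stdout.strip())  # prints the notAfter date plus the checkend verdict
if result.returncode != 0:
    sys.exit(f"WARNING: {CERT_PATH} expires within 7 days (or is already expired)")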
RAL Tier 1 went into drain for production work on Sunday. SAM tests were green and I could not see a good reason for the drain, so I forced us back into production (other T1s were running jobs at normal capacity). Failure rates were good; efficiency was variable.
HammerCloud tests stopped (across the whole of the CMS grid) from Saturday to Tuesday.
On Friday I again saw a high failure rate for Analysis jobs (as reported a couple of times now). These were, again, mostly reading across the trans-Atlantic link. They all belonged to one user who was running large numbers of CRAB jobs across the grid with the 'IgnoreLocality' option, which does not match jobs to the location of their input data. I sent a polite email; he has now killed those jobs and will hopefully let the system assign jobs to a better location.
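(For context, a minimal sketch of the CRAB3 configuration fragment involved, assuming the standard CRABClient; the request and dataset names are placeholders. Data.ignoreLocality = True is the option referred to above: it lets jobs run at sites that do not host the input data, so reads go over the WAN.)

# Hypothetical CRAB3 configuration fragment; names are placeholders.
from CRABClient.UserUtilities import config

config = config()
config.General.requestName = "example_task"                  # placeholder
config.Data.inputDataset = "/SomePrimary/SomeProcessed/AOD"  # placeholder
config.Data.ignoreLocality = True   # jobs are NOT matched to the data location
config.Site.whitelist = ["T2_*"]    # a whitelist is normally expected with ignoreLocality

Leaving ignoreLocality at its default (False) lets the submission infrastructure place jobs close to the input data.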
Failure rates on particular WNs at the RAL T1 batch farm: the liaisons have been investigating significantly elevated failure rates broken down by worker node. My analysis showed a surprising number of the most recent (2022) nodes with 'significant' failures (my definition being >10% failures, against a farm-wide average of 4%, over the period 10-20 April).
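(A rough sketch of the per-worker-node breakdown described above; the job records, node names and counts are illustrative only, and the real numbers come from the batch/monitoring system.)

# Hypothetical sketch: flag worker nodes whose failure rate is 'significant'
# (>10%) compared with the farm-wide average. The data below is illustrative.
from collections import defaultdict

# (worker_node, job_succeeded) pairs, as might be extracted from batch logs
jobs = [
    ("wn-2022-001", False), ("wn-2022-001", True), ("wn-2022-001", False),
    ("wn-2019-042", True), ("wn-2019-042", True), ("wn-2019-042", True),
]

SIGNIFICANT = 0.10  # the >10% threshold used in the note above

totals = defaultdict(lambda: [0, 0])  # node -> [failed, total]
for node, ok in jobs:
    totals[node][1] += 1
    if not ok:
        totals[node][0] += 1

farm_failed = sum(f for f, _ in totals.values())
farm_total = sum(t for _, t in totals.values())
print(f"farm-wide failure rate: {farm_failed / farm_total:.1%}")

for node, (failed, total) in sorted(totals.items()):
    rate = failed / total
    flag = "  <-- significant" if rate > SIGNIFICANT else ""
    print(f"{node}: {rate:.1%} of {total} jobs failed{flag}")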
AAA at RAL T1: there was an Echo downtime for reboots on Thursday; the RAL-based redirector was also rebooted. After the downtime the AAA manager continued to fail SAM tests, so we got a red status for the day. I fixed it by the end of the working day.
Had the 2023 data tape families created. Made an adjustment to the requested families based on the 2022 experience.
Tape deletions (1.5 PB) were happening yesterday and appear to be finished. ~2600 files were 'not found', which is usually fine.
Looking at monitoring job read rates:
https://monit-grafana.cern.ch/d/BZfBLpE4k/user-kellis-average-data-input-over-read-time?orgId=11&from=now-4y&to=now&viewPanel=3&editPanel=3
You can see RAL is very low in the first years, but closer to other sites in recent months (still not amazing, but perhaps that is not surprising given we are further from e.g. CERN than IN2P3, CNAF…?).
13:16 → 13:17  VO-Liaison ATLAS (1m)
Speakers: James William Walder (Science and Technology Facilities Council STFC (GB)), Jyoti Prakash Biswal (Rutherford Appleton Laboratory)
- Over the weekend: issue related to the expiration of the CERN CA intermediate certificate.
- A variety of different problems surfaced.
- Manual interventions in ATLAS to bring most services back (permanent fixes done this week).
- Observed a number of job failures (pilot error code 1368) with timeouts at the point of setting up ATLAS software via CVMFS (see the probe sketch after this list):
- Errors appeared to stop early a.m. on the 25th (?)
- Today's issue:
- Harvester configuration issue: using GridFTP to submit jobs via HTCondor 10 to the ARC-CEs, which cannot work, resulting in pilot faults.
- Now fixed, and hopefully this was the source of the reduced number of jobs running at RAL.
- FTS: currently a backlog of 154k transfers from RAL, likely related to the CERN certificate issues and a number of CERN FTS hosts that stopped working.
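(Not from the meeting itself: a small sketch of how the CVMFS symptom above might be probed on a worker node, assuming the standard cvmfs_config client tool is installed; the repository name and timeout are placeholders.)

# Hypothetical sketch: probe a CVMFS repository and treat a slow or failed
# mount as an error, matching the 'timeout while setting up software' symptom.
import subprocess

REPOSITORY = "atlas.cern.ch"  # placeholder repository name
TIMEOUT_S = 60                # arbitrary timeout for this sketch

try:
    # `cvmfs_config probe <repo>` mounts the repository and reports OK/FAILED.
    result = subprocess.run(
        ["cvmfs_config", "probe", REPOSITORY],
        capture_output=True, text=True, timeout=TIMEOUT_S,
    )
    print(result.stdout.strip())
    if result.returncode != 0:
        print(f"CVMFS probe of {REPOSITORY} FAILED")
except subprocess.TimeoutExpired:
    print(f"CVMFS probe of {REPOSITORY} timed out after {TIMEOUT_S}s")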
13:20 → 13:21  VO Liaison LHCb (1m)
Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)
Tickets:
- Transfer failures to IHEP
- IHEP is involved in the ticket and acknowledged problems on their side
- The site is suspended in GGUS, so a ticket cannot be opened against them
- Environment variable removal request (a small check sketch follows this list)
- LHCb confirmed that XrdSecGSIDELEGPROXY is set in their environment
- Waiting for them to proceed with its removal
- Vector read
- See slides
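(As an aside, a trivial sketch of checking for the variable in question; purely illustrative.)

# Hypothetical sketch: report whether XrdSecGSIDELEGPROXY is set in the
# current environment, as LHCb confirmed it is in theirs.
import os

value = os.environ.get("XrdSecGSIDELEGPROXY")
if value is None:
    print("XrdSecGSIDELEGPROXY is not set")
else:
    print(f"XrdSecGSIDELEGPROXY={value} (set; pending removal per the request above)")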
Operational issues:
- Failed FTS transfers from IN2P3 to Antares
- Corresponds to the outage of the French NREN
- Is Antares on LHCONE?
13:25 → 13:28  VO Liaison LSST (3m)
Speaker: Timothy John Noble (Science and Technology Facilities Council STFC (GB))
13:30 → 13:31  VO Liaison Others (1m)
13:31 → 13:32  AOB (1m)
13:32 → 13:33  Any Other Business (1m)
Speakers: Brian Davies (Lancaster University (GB)), Darren Moore (Science and Technology Facilities Council STFC (GB))