US ATLAS Computing Facility Capacity Spreadsheet: https://bit.ly/usatlas-capacity
Through March 2020 (FY20Q2):
Updates on US Tier-2 centers
For all sites that see a small percentage of jobs fail with timeouts on input/output:
We are investigating the interaction between the rucio mover, gfal2 and xrootd. In a number of cases the actual transfer was not even attempted, and the reason seems to be the way the rucio mover tries to stat the file and get its checksum. Hopefully a fix will come soon; once it is ready we will try to get it tested and deployed promptly. This does not exclude the possibility that other issues are lurking there.
It was an OK week for production.
On 21 April, one of our new R740xd2 dCache servers died; the daughterboard was burnt. Dell sent an onsite technician and it was replaced within 48 hours. Before that, we submitted a JIRA ticket to declare the files unavailable.
We still see jobs killed due to OOM, 200 jobs per 2 weeks. This mostly happens on worker nodes with less than 2 GB/core. We are in the process of 1) adding more memory to worker nodes using retired parts and 2) disabling HT on worker nodes without spare DIMM parts.
We see 60% of the cluster being used by analysis jobs. This might be caused by our recent reconfiguration of condor and the gatekeeper, which was meant to balance giving enough cores to COVID-19 jobs against reducing fragmentation of condor cores. Too many analysis jobs seem to increase the job failure rate at the site.
Condor was updated to 8.8.8.
Retired 20 TB of usable space from dCache to obtain spare parts for the storage enclosures that are no longer under warranty.
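The memory-per-core reasoning behind the two OOM mitigations above can be sketched as a short calculation. This is only an illustration: the node size (16 physical cores, 48 GB) is hypothetical, and it assumes one job slot per logical core and two hardware threads per physical core when HT is enabled.

```python
def memory_per_slot_gb(total_memory_gb: float, physical_cores: int, ht_enabled: bool) -> float:
    """Memory available per job slot, assuming one slot per logical core."""
    # Assumption: HT exposes 2 logical cores per physical core.
    logical_cores = physical_cores * (2 if ht_enabled else 1)
    return total_memory_gb / logical_cores


def extra_memory_needed_gb(total_memory_gb: float, physical_cores: int,
                           ht_enabled: bool, target_gb_per_slot: float = 2.0) -> float:
    """Additional memory required to reach the target per-slot memory."""
    logical_cores = physical_cores * (2 if ht_enabled else 1)
    return max(0.0, target_gb_per_slot * logical_cores - total_memory_gb)


# Hypothetical worker node: 16 physical cores, 48 GB of RAM.
with_ht = memory_per_slot_gb(48, 16, ht_enabled=True)      # 1.5 GB per slot
without_ht = memory_per_slot_gb(48, 16, ht_enabled=False)  # 3.0 GB per slot
to_add = extra_memory_needed_gb(48, 16, ht_enabled=True)   # 16.0 GB to reach 2 GB per slot
```

Under these assumptions, either option moves the example node above the 2 GB/core threshold: adding 16 GB of DIMMs keeps all slots, while disabling HT halves the slot count.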
- Not much, all running well
- Upgraded xrootd to 4.11.3, which fixed space reporting and logging, and we were able to delete some old data from OU_OSCER_ATLAS_LOCALGROUPDISK
Slowly ramping up with XCaches and VP.
AGLT2 - replaced their node with a new one with more storage. Changed them to direct access.
Prague - running smoothly. Will upgrade further this or next week.
LRZ - issue with the cleanup; managed to cross the HWM.
A ROOT TChain bug was discovered and fixed. Waiting for the LCG build to get it into production.
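For context on the LRZ cleanup issue above, here is a minimal sketch of how watermark-based cache cleanup generally works: once usage exceeds the high-water mark (HWM), the least recently accessed files are removed until usage falls to the low-water mark. This is a generic illustration, not XCache's actual purge code; the `CachedFile` type, the function name and the thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CachedFile:
    name: str
    size_gb: float
    last_access: float  # e.g. a Unix timestamp


def files_to_purge(files: List[CachedFile], capacity_gb: float,
                   high_wm: float, low_wm: float) -> List[str]:
    """Return names of files to delete, oldest access first.

    Nothing is purged until usage exceeds high_wm * capacity; then files
    are selected until usage drops to low_wm * capacity or below.
    """
    used = sum(f.size_gb for f in files)
    if used <= high_wm * capacity_gb:
        return []
    target = low_wm * capacity_gb
    selected = []
    for f in sorted(files, key=lambda f: f.last_access):
        if used <= target:
            break
        selected.append(f.name)
        used -= f.size_gb
    return selected


# Hypothetical 100 GB cache with HWM at 90% and LWM at 80%.
cache = [
    CachedFile("a.root", 40, last_access=1),
    CachedFile("b.root", 30, last_access=2),
    CachedFile("c.root", 25, last_access=3),
]
purged = files_to_purge(cache, capacity_gb=100, high_wm=0.9, low_wm=0.8)
```

If cleanup stalls or runs slower than data arrives, usage stays above the HWM, which is the kind of situation reported for LRZ.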