US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
13:00 → 13:05
WBS 2.3 Facility Management News (5m) - Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
13:05 → 13:10
OSG-LHC (5m) - Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
DOMA/Token Transition
The plot below shows the current breakdown of token-vs-certificate transfers (Monit link; filtered on destination=USA, grouped by auth_method). During DC24, token-based transfers peaked at more than 50% of transfer volume.
Many thanks to all the sites for their hard work!
Post-DC24 retrospectives are currently ongoing. Request: can sites please send us any issues they observed with tokens during the DC24 period? We would like to sort through the issues and make sure we upstream or work on them.
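For context, a minimal sketch of the kind of aggregation behind that plot, assuming a simplified transfer-record format (the field names are illustrative, not the actual Monit schema):

```python
# Hedged sketch, not the actual Monit query: group transfer records by
# auth_method and compute each method's share of the transferred volume,
# analogous to the plot filtered on destination=USA, grouped by auth_method.
from collections import defaultdict

def share_by_auth_method(transfers):
    """transfers: iterable of dicts with 'auth_method' and 'bytes' keys (assumed schema)."""
    volume = defaultdict(int)
    for t in transfers:
        volume[t["auth_method"]] += t["bytes"]
    total = sum(volume.values())
    return {method: vol / total for method, vol in volume.items()} if total else {}

# Made-up numbers illustrating >50% of volume via tokens, as seen during DC24.
sample = [
    {"auth_method": "token", "bytes": 6 * 10**12},
    {"auth_method": "x509", "bytes": 4 * 10**12},
]
print(share_by_auth_method(sample))  # {'token': 0.6, 'x509': 0.4}
```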
Pacing items for this year to watch out for:
- CERN IAM services to be migrated to a new infrastructure.
- Mature / release the FTS version that supports tokens.
- Working with WLCG to update a community timeline.
Software
- Release
  - XRootD 5.6.8 expected within the week
Kubernetes Accounting
- How flexible is the wording of the milestone “Deploy monitoring, alerting and APEL accounting for UTA k8s cluster using Prometheus”?
- Effort is beginning.
- Met with others working on the same things (AUDITOR, KAPEL). There are certainly differences:
  - Existing KAPEL only uses summarized data, not the per-job data that GRACC expects (see the sketch after this list).
  - But we can certainly build off their existing code.
- ATLAS k8s access for the Software Team
- Working on access to NET2
- Need to verify access to SWT2/UTA Google k8s
- Status on UTA creds?
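To illustrate the difference noted above between per-job and summarized accounting data, a minimal sketch with hypothetical field and site names (not the real GRACC or APEL schemas):

```python
# Hedged sketch: per-job records (the granularity GRACC expects) collapsed
# into per-site, per-month summaries (the granularity existing KAPEL reports).
# Field names and the site name are illustrative only.
from collections import defaultdict

def summarize(per_job_records):
    """Collapse per-job records into (site, YYYY-MM) summaries."""
    summary = defaultdict(lambda: {"njobs": 0, "wall_hours": 0.0, "cpu_hours": 0.0})
    for job in per_job_records:
        key = (job["site"], job["end_time"][:7])  # group by site and month
        s = summary[key]
        s["njobs"] += 1
        s["wall_hours"] += job["wall_hours"]
        s["cpu_hours"] += job["cpu_hours"]
    return dict(summary)

jobs = [
    {"site": "UTA_k8s", "end_time": "2024-03-01T10:00:00Z", "wall_hours": 2.0, "cpu_hours": 7.5},
    {"site": "UTA_k8s", "end_time": "2024-03-02T11:00:00Z", "wall_hours": 3.0, "cpu_hours": 11.0},
]
print(summarize(jobs))
# {('UTA_k8s', '2024-03'): {'njobs': 2, 'wall_hours': 5.0, 'cpu_hours': 18.5}}
```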
13:10 → 13:25
WBS 2.3.1: Tier1 Center - Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
13:10
Infrastructure and Compute Farm (5m) - Speaker: Thomas Smith
13:15
Storage (5m) - Speaker: Vincent Garonne (Brookhaven National Laboratory (US))
13:20
Tier1 Services (5m) - Speaker: Ivan Glushkov (University of Texas at Arlington (US))
- GGUS:165414 - Staging failures at BNL-OSG2_MCTAPE
  - Due to the NET2 congestion
  - Can be solved with multi-hop, but that would occupy space at the source DATADISK.
- IPv6, ALMA9 and ARM tests: ongoing
- Blacklisted over the weekend due to missing CVMFS on some nodes
  - Looking to upgrade the CVMFS client
13:25 → 13:35
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- Running affected by DC24 - see section 2.3.5 for details. In general, over the past 30 days the sites have been pretty stable.
- AGLT2 is fighting cvmfs problems where cvmfs hangs servers in a way that requires a reboot. They also had some minor squid/varnish issues.
- MWT2 is working on a network upgrade at IU. Some production loss was caused by two FTS incidents. A dCache parameter that had been tuned to allow a lot of movers proved to be set too high during DC24 on the older, dense storage servers.
- NET2 is working on bringing the remaining compute servers online.
- OU had various transfer issues. Some of the tickets received were marked as "won't fix".
- CPB needs to implement tokens. CPB got a ticket for data transfer issues.
- Held Tier 2 Technical Meeting last week
  - Lots of discussion about stuck/queued transfers (overran the end of the meeting).
    - CPB got blocked from receiving new work when the queue of pending transfers got too big.
    - Considerable follow-up in an email thread started by Ivan.
  - Saw that the CPB site is now running high-memory jobs in the Google cloud.
  - Sites are preparing for the update to EL9 by June, with some sites further along than others.
  - Still chasing issues with the current version of cvmfs.
  - The zombie pilot situation is improved by a zombie-pilot killer built into the pilot wrapper, but the underlying cause is still not understood.
13:35 → 13:40
WBS 2.3.3 HPC Operations (5m) - Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Rui Wang (Argonne National Laboratory (US))
- TACC running started.
- NERSC throughput increased but is now held up by a poor transfer success rate between BNL and Glasgow.
- With the help of NERSC staff, now measuring HS23 values for various configurations using HEPscore from CVMFS:
  - Running with 256 threads (entire machine): HS23 result 1592.4074, or 6.2/logical core.
  - Running with 8 threads (still whole-node scheduling): HS23 result 145.7546, or 18.2/logical core.
  - Measuring other configurations to determine the optimal configuration and values.
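As a quick check, the per-logical-core numbers above are just the quoted HS23 totals divided by the thread counts:

```python
# Per-logical-core HS23 from the totals quoted in the notes above.
for threads, hs23 in [(256, 1592.4074), (8, 145.7546)]:
    print(f"{threads} threads: {hs23 / threads:.1f} HS23 per logical core")
# 256 threads: 6.2 HS23 per logical core
# 8 threads: 18.2 HS23 per logical core
```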
13:40 → 13:55
WBS 2.3.4 Analysis Facilities - Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
13:40
Analysis Facilities - BNL (5m) - Speaker: Ofer Rind (Brookhaven National Laboratory)
- IRIS-HEP AGC Demo Day #4 this Friday (link)
13:45
Analysis Facilities - SLAC (5m) - Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
13:50
Analysis Facilities - Chicago (5m) - Speakers: Fengping Hu (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
- Updating HTCondor queue configs. Previously we had a short and a long queue running on separate sets of nodes, which led to resource underutilization when only certain queues got picked. We are now updating the queues/deployments so that both are configured with an HPA. The deployments have affinity to node partitions but can be scheduled across partitions. Also updated the HPA metrics so that scaling happens in a more controlled fashion (a sketch of the kind of HPA object involved follows below).
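A minimal sketch of such an HPA, assuming a hypothetical HTCondor worker Deployment and CPU-utilization scaling (names, namespace and thresholds are illustrative, not the actual UChicago AF configuration):

```python
# Hedged sketch of an autoscaling/v2 HorizontalPodAutoscaler manifest for a
# hypothetical HTCondor worker Deployment. Dump to JSON and apply with
# `kubectl apply -f`. All names and thresholds here are illustrative.
import json

hpa = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "condor-workers-short", "namespace": "af-condor"},
    "spec": {
        "scaleTargetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "condor-workers-short",  # hypothetical worker Deployment
        },
        "minReplicas": 1,
        "maxReplicas": 50,
        "metrics": [
            {
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {"type": "Utilization", "averageUtilization": 80},
                },
            }
        ],
    },
}

print(json.dumps(hpa, indent=2))
```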
13:55 → 14:10
WBS 2.3.5 Continuous Operations - Convener: Ofer Rind (Brookhaven National Laboratory)
13:55
ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News (5m) - Speaker: Ivan Glushkov (University of Texas at Arlington (US))
- ADC:
  - Missing CVMFS repos still showing up in HC exclusions at BNL and SWT2_CPB
  - Trying to create a “US News of the day” summary mail, but it is too much work for one person. Feel free to add your observations to it (Link)
- NET2
  - Overwhelmed with transfers / stage-outs
  - 10 Gbit link (will be upgraded to 100 Gbit in the summer)
  - This should not happen
  - On the DDM to-do list: find a way to take into account the queue length at the destination (ideally also at the source and per link) when proposing destination storage for staging (see the sketch after this list).
- SWT2
  - Blacklisted due to missing CVMFS on some nodes. The CVMFS check should solve that.
  - Slow deletions (Monit)
- OU_OSCER
  - Removed PQ.environ:"XRD_LOGLEVEL=Debug" from the CRIC settings. It was filling the Harvester discs over the weekend.
  - Slowest deletions in the US cloud (Monit)
- All transfers failing at WT2 (GGUS)
- iut2-slate squid not reporting after switch maintenance (GGUS)
- Still seeing notable XCache bypass levels at VP sites, including BNL after lowering storage watermarks (link)
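Not actual DDM/Rucio code, just a hedged sketch of what taking the destination queue length into account could look like when ranking candidate destination storages (field names, weights and numbers are illustrative):

```python
# Hedged sketch: rank candidate destination RSEs for staging by free space,
# penalized by the length of their pending transfer queue. Not Rucio/DDM code;
# field names, weights and example values are illustrative only.
def rank_destinations(candidates, queue_weight=1.0):
    """candidates: list of dicts with 'rse', 'free_tb' and 'queued_transfers' keys."""
    def score(c):
        # More free space is better; a long pending queue pushes the score down.
        return c["free_tb"] - queue_weight * c["queued_transfers"] / 1000.0
    return sorted(candidates, key=score, reverse=True)

candidates = [
    {"rse": "NET2_DATADISK", "free_tb": 900, "queued_transfers": 250_000},
    {"rse": "MWT2_DATADISK", "free_tb": 700, "queued_transfers": 20_000},
]
print([c["rse"] for c in rank_destinations(candidates)])
# ['MWT2_DATADISK', 'NET2_DATADISK'] -- the congested site drops below the less-busy one
```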
14:00
Service Development & Deployment (5m) - Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
14:05
Facility R&D (5m) - Speakers: Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
- Joint ATLAS / IRIS-HEP Kubernetes Hackathon coming up April 24-26 at UChicago
  - Recent presentation here
  - Please sign up if you are interested in attending.
  - Training will be remote-friendly; the rest of the workshop will be in person.
  - https://indico.cern.ch/event/1384683/
  - MWT2, AGLT2 and UVic (Canadian cloud) have already provided hardware and logins for the stretched platform.
  - Someone should come up with a clever name for it :)
14:10 → 14:20
AOB (10m)