US ATLAS Computing Facility (Possible Topical)
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 993 2967 7148
Meeting password: 452400
Invite link: https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09
-
-
13:00
→
13:05
WBS 2.3 Facility Management News 5mSpeakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))
- 13:05 → 13:10
-
13:10
→
13:30
WBS 2.3.1: Tier1 Center
Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
-
13:10
Tier-1 Infrastructure 5mSpeaker: Jason Smith
-
13:15
Compute Farm 5mSpeaker: Thomas Smith
-
13:20
Storage 5mSpeaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
NFS components supporting the staging workflow (dCache to/from HPSS) migrated from NFSv3 to NFSv4
- Transition completed successfully on 01/14/26 between 10:00–11:00 AM.
- Migration was transparent to users.
Reviewed Network/TCP kernel parameters for dCache dual-home pools and doors:
- Doors were using legacy settings; dual-home pools were not optimized for WAN access.
- Network/TCP kernel parameters were identified based on ESnet Fasterdata tuning guidelines.
- TCP tuning has been applied to dcdoorsX and dual-home pools since 01/12/26.
- Per-file transfer performance improvements observed during testing:
TCP pull from CERN EOS to a BNL dual-home pool (16 GB file):
Baseline (as-is): 34.23 MiB/s
After WAN tuning: 167.08 MiB/s
Performance improvement: ~5× throughput increase
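The ~5× figure follows directly from the two measured rates. As a quick check, here is a minimal sketch that computes the speedup and the implied per-file transfer time. It assumes the "16 GB" file is 16 GiB (16384 MiB) and that the rate is constant over the transfer; the notes state neither, and the function names are illustrative only.

```python
# Measured per-file rates quoted in the notes (MiB/s).
BASELINE_MIB_S = 34.23
TUNED_MIB_S = 167.08

# Assumption: "16 GB" is read as 16 GiB = 16 * 1024 MiB.
FILE_SIZE_MIB = 16 * 1024


def speedup(before_mib_s: float, after_mib_s: float) -> float:
    """Ratio of tuned throughput to baseline throughput."""
    return after_mib_s / before_mib_s


def transfer_seconds(size_mib: float, rate_mib_s: float) -> float:
    """Time to move size_mib at a constant rate (an idealization)."""
    return size_mib / rate_mib_s


if __name__ == "__main__":
    ratio = speedup(BASELINE_MIB_S, TUNED_MIB_S)
    t_before = transfer_seconds(FILE_SIZE_MIB, BASELINE_MIB_S)
    t_after = transfer_seconds(FILE_SIZE_MIB, TUNED_MIB_S)
    print(f"speedup: {ratio:.2f}x")
    print(f"baseline transfer time: {t_before:.0f} s")
    print(f"tuned transfer time:    {t_after:.0f} s")
```

Under these assumptions the ratio comes out just under 5, consistent with the rounded figure above.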
-
13:25
Tier1 Operations and Monitoring 5mSpeaker: Ofer Rind (Brookhaven National Laboratory)
-
13:30
→
13:40
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
- The holiday period was quiet and there was a high production level.
- There were two notable outages over the holiday:
- At AGLT2, a dCache issue caused a one-day outage.
- At CPB, the annual power outage once again caused issues. Some sort of DNS table corruption caused by the power outage took time to trace.
- Since people have returned to work, there have been various small reductions in production.
- The PIs need to meet in early February to discuss procurement.
- The funding outlook has improved but the story is not complete yet.
- As Shawn said, we are still working on getting a good story about how essential the Tier 2 sites are.
- Get your quarterly report in.
-
13:40
→
13:50
WBS 2.3.3 Heterogeneous Integration and Operations
HIOPS
Convener: Rui Wang (Argonne National Laboratory (US))
-
13:40
HPC Operations 5mSpeaker: Rui Wang (Argonne National Laboratory (US))
Quarterly report submitted
Perlmutter: ~38k/15k CPU/GPU hours remain
- 50K CPU hours added by NERSC on Monday
- After we ran out of time, Doug contacted Wahid Bhimji (NERSC) for additional CPU time.
- If we run out of time again, should we ask for more? The allocation is supposed to last until 21-Jan-26.
- AY26 will start on Jan 23 after the maintenance
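To judge whether the remaining Perlmutter hours can last until the stated end date, a rough daily budget can be computed. The sketch below is illustrative only: the function name is hypothetical, and the "today" date used in the example is an assumption, since the notes do not state the meeting date.

```python
from datetime import date


def daily_budget(hours_remaining: float, today: date, end: date) -> float:
    """Average hours per day that can be spent so the allocation lasts until `end`."""
    days_left = (end - today).days
    if days_left <= 0:
        raise ValueError("end date must be after today")
    return hours_remaining / days_left


if __name__ == "__main__":
    # Assumption: a hypothetical "today"; replace with the actual date.
    today = date(2026, 1, 14)
    end = date(2026, 1, 21)  # end date quoted in the notes

    # Remaining hours quoted in the notes (approximate values).
    cpu_hours, gpu_hours = 38_000, 15_000

    print(f"CPU budget: {daily_budget(cpu_hours, today, end):.0f} hours/day")
    print(f"GPU budget: {daily_budget(gpu_hours, today, end):.0f} hours/day")
```

Comparing these budgets with the recent daily usage would indicate whether another request to NERSC is likely to be needed.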
-
13:45
Integration of Complex Workflows on Heterogeneous Resources 5mSpeaker: Doug Benjamin (Brookhaven National Laboratory (US))
-
13:50
→
14:10
WBS 2.3.4 Analysis Facilities
Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
- IRIS-HEP/AGC Demo Day #11 this Friday, 11am ET
-
13:50
Analysis Facilities - BNL 5mSpeaker: Qiulan Huang (Brookhaven National Laboratory (US))
- The work on integrating COmanager for Jupyter federated authentication is ongoing
- User space management updates
- No updates on the user space policy from Viviana and Hector
- User quota testing reached a pause point
- One issue observed: inconsistent return codes from the WebDAV protocol
- Addressed in release 9.2.46 or later
-
13:55
Analysis Facilities - SLAC 5mSpeaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:00
Analysis Facilities - Chicago 5mSpeaker: Fengping Hu (University of Chicago (US))
SENSE Deployment Update
- VLAN trunking has been configured on switch ports to allow proxy components to run on additional servers, freeing the ConnectX-7 card for exclusive use by the software router.
- Follow up with Diego on the current status and confirm whether the setup will be ready in time for the mini challenges.
-
14:10
→
14:30
WBS 2.3.5 Continuous Operations
Conveners: Ivan Glushkov (Brookhaven National Laboratory (US)), Ofer Rind (Brookhaven National Laboratory)
-
14:10
ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5mSpeakers: Armen Vartapetian (University of Texas at Arlington (US)), Kaushik De (University of Texas at Arlington (US))
- AGLT2 S3 LocalGroupDisk service issues (GGUS)
- FTS concurrency reduced but doesn't seem to be respected by CERN service?
- Transfer failures from TW-FTT, Yi-Ru is investigating (GGUS)
- One BNL shared pool CE VM migrated and updated, to be returned to service this afternoon
- Observed that deactivating the CE in CRIC did not stop jobs from being scheduled; it had to be detached from all PQs
- Thank you to Armen who has been chairing the daily ops meetings in Ivan's absence!
-
14:15
Services DevOps 5mSpeaker: Ilija Vukotic (University of Chicago (US))
- Caches
- everything works fine
- Varnish for CVMFS at SFU being reconfigured so it uses their stratum-0
- Analytics
- Moved the rest of the Alarms crons to local GitHub Actions.
- AF Assistant
- Subtle changes in agents.
- It now "knows" the user.
- Working on integrating Glance data
- Got DGX Sparks, now installed and being benchmarked. These are to be used primarily for development, for running evals, and, if fast enough, for inference (responding to users).
- ServiceX/Y
- NTR
-
14:20
Facility R&D 5mSpeaker: Robert William Gardner Jr (University of Chicago (US))
-
14:25
Cybersecurity plan(s) 5mSpeakers: Robert William Gardner Jr (University of Chicago (US)), Shigeki Misawa (Brookhaven National Laboratory (US))
-
14:30
→
14:40
AOB 10m