US ATLAS Computing Facility (Possible Topical)
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 993 2967 7148
Meeting password: 452400
Invite link: https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09
-
13:00
→
13:05
WBS 2.3 Facility Management News 5m Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))
Unfortunately the IB conflicts with this meeting https://indico.cern.ch/event/1558310/ (as well as Supercomputing 2025).
- Alexei will miss today because he is giving a talk at SC25
The NSF next CA proposal was successfully submitted on Monday a little before 2 PM Eastern. Now we wait to hear...
We still have to confirm the USATLAS presenter for https://indico.cern.ch/event/1526077/timetable/
-
13:05
→
13:10
OSG-LHC 5m Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
-
13:10
→
13:30
WBS 2.3.1: Tier1 Center Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
-
13:10
Tier-1 Infrastructure 5m Speaker: Jason Smith
-
13:15
Compute Farm 5m Speaker: Thomas Smith
-
13:20
Storage 5m Speakers: Carlos Fernando Gamboa (Department of Physics-Brookhaven National Laboratory (BNL)-Unkno), Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
No major issues were encountered this week.
Ongoing work continues with the BNL dCache storage team on user analysis accounts, involving additional SDFC members regarding policy and dCache user space management.
A verification of dCache user-management features is also in progress with the dCache developers (see: https://github.com/dCache/dcache/issues/7947).
-
13:25
Tier1 Operations and Monitoring 5m Speaker: Ofer Rind (Brookhaven National Laboratory)
- T1 farm restored to full capacity yesterday after replacement of PDU circuit breaker
- Some job failures last week due to 8 black-hole nodes that resulted from the HTCondor upgrade procedure (cleared by reboot)
- Ongoing issues with ARM queue (exclusions last week, draining this week)
- Discussing GPU queue reconfiguration to allow for 16 CPU/1 GPU jobs
-
13:30
→
13:40
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
- Reasonable running over the last 2 weeks
- NET2 is in downtime to support a demo at SC25
- CPB had another power failure.
- New HTCondor and CVMFS versions installed at AGLT2 and MWT2
- The number of running slots was reduced somewhat while the underlying rolling updates were done.
- MWT2 also updated AlmaLinux to version 9.7 at UC & IU. UIUC is still running RHEL 9.4.
- SWT2 CPB continues to work on the EL9 updates.
- TW-FTT / Yi-Ru continue to make good progress.
- Still having transfer failures, but apparently only to certain sites.
- The mean transfer success rate has been ~90% but the failures are concentrated at certain sites.
- For now the site is being restricted to simulation jobs to reduce the amount of data that is transferred.
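As an illustration of the TW-FTT note above (a mean transfer success rate of ~90% with the failures concentrated at certain sites), the following Python sketch groups transfer outcomes by site. The record format, site names, and counts are hypothetical and are not taken from the actual transfer monitoring.

```python
from collections import defaultdict


def success_rate_by_site(transfers):
    """Return {site: fraction of successful transfers}.

    `transfers` is an iterable of (site, succeeded) pairs. This record
    format is an assumption made for illustration only.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for site, succeeded in transfers:
        totals[site] += 1
        if succeeded:
            successes[site] += 1
    return {site: successes[site] / totals[site] for site in totals}


def overall_success_rate(transfers):
    """Return the fraction of successful transfers across all sites."""
    transfers = list(transfers)
    return sum(1 for _, ok in transfers if ok) / len(transfers)


# Hypothetical sample: 90 of 100 transfers succeed overall, but every
# failure involves a single site.
sample = (
    [("SITE_A", True)] * 50
    + [("SITE_B", True)] * 40
    + [("SITE_C", False)] * 10
)

rates = success_rate_by_site(sample)
problem_sites = sorted(site for site, rate in rates.items() if rate < 0.5)
```

In this made-up sample the overall rate is 90% while one site fails every transfer, which is the pattern the per-site breakdown is meant to expose.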
-
13:40
→
13:50
WBS 2.3.3 Heterogeneous Integration and Operations
HIOPS
Convener: Rui Wang (Argonne National Laboratory (US))
-
13:40
HPC Operations 5m Speaker: Rui Wang (Argonne National Laboratory (US))
Perlmutter: possible extra CPU hours from NERSC
- Eric: DOE has no extra allocation left. NERSC Management said probably, but they’re at SC25 so will get back to us after that.
- Multi-year proposal: Doudna CPU performance is estimated at 2-4x that of a Perlmutter node (assuming an NVIDIA CPU). Projection for 2028-2030.
-
13:45
Integration of Complex Workflows on Heterogeneous Resources 5m Speaker: Doug Benjamin (Brookhaven National Laboratory (US))
-
13:50
→
14:10
WBS 2.3.4 Analysis Facilities Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
-
13:50
Analysis Facilities - BNL 5m Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
- Executed first step in review of local user storage: zeroed storage tokens for users with deactivated SDCC accounts and dedicated, but empty, dCache directories. This recovered ~209 TB. Next steps are under discussion.
- Also analyzed and reviewed the other two categories of users:
- 41 deactivated users with space usage (~79.01 TB used): move the data to archival storage and then release those storage tokens?
- 23 active users with zero usage
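The user categories in the storage review above can be expressed as a small classification step. This is only a sketch: the input fields (an account-active flag and a usage figure in bytes), the category labels, and the sample records are hypothetical and do not reflect how the actual account or dCache data is stored.

```python
from collections import Counter


def classify_user(account_active, used_bytes):
    """Assign a user to one of the review categories.

    Both arguments are assumed inputs for illustration; the real review
    would take them from the account database and the storage system.
    """
    if not account_active and used_bytes == 0:
        return "deactivated_empty"      # tokens already zeroed in step one
    if not account_active:
        return "deactivated_with_data"  # candidates for archiving
    if used_bytes == 0:
        return "active_empty"
    return "active_with_data"


def summarize(users):
    """Count users per category from (account_active, used_bytes) pairs."""
    return Counter(classify_user(active, used) for active, used in users)


# Hypothetical records, one per category.
sample_users = [
    (False, 0),
    (False, 5 * 10**12),
    (True, 0),
    (True, 2 * 10**12),
]
summary = summarize(sample_users)
```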
-
13:55
Analysis Facilities - SLAC 5m Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:00
Analysis Facilities - Chicago 5m Speaker: Fengping Hu (University of Chicago (US))
GitHub Actions Runner Controller Added
- Giordon is leading the deployment effort. This new controller will replace the existing cron jobs used to manage AF benchmarks through GitHub Actions.
- The system automatically scales the runners based on demand.
- It includes a self-maintained runner image and configuration setup to provide an execution environment tailored for benchmarking and other workflow needs.
-
14:10
→
14:30
WBS 2.3.5 Continuous Operations Conveners: Ivan Glushkov (Brookhaven National Laboratory (US)), Ofer Rind (Brookhaven National Laboratory)
-
14:10
ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m Speaker: Kaushik De (University of Texas at Arlington (US))
- Functional testing of WAN transfers to dual-homed dCache pools was successful; however, the configuration changes were completely reverted due to some misconfiguration that led to staging problems. We would like to redo the functional test asap in preparation for Dec. capability testing.
- WLCG DOMA BDT meeting earlier today (link) - test tape+tokens at BNL early next year?
- Requested additional information from Pilot condor_chirp (jira)
- Invited Canadian and South American facility teams to join our daily ops meeting - not much response yet
- BNLHPC disk decommissioning is progressing - only several hundred files left on scratchdisk.
-
14:15
Services DevOps 5m Speaker: Ilija Vukotic (University of Chicago (US))
XCache
- lost one AF xcache, sent for repairs
- issues at BHAM
- there is a version 5.9.1 that should include some bugfixes. Will be testing it this week.
Varnish
- CERN local Varnish instances are being hammered with 1.9 GB requests 4 times per hour. Nurcan is checking who is responsible.
- Everything works fine despite CloudFlare issues.
- Remaining traffic on old Frontiers: CYFRONET, Mainz, GitLab cron jobs belonging to three people, NERSC
- ECDF is independently deploying Varnish for CVMFS
AI
- updated documentation for AF. GitHub Action is now doing most of it.
- Exported all the requests up to now and manually reviewed and classified them.
Analytics
- new alarms and alerts.
- updated A&A frontend packages
- More ES cleanup, moves to cold storage node
-
14:20
Facility R&D 5m Speaker: Robert William Gardner Jr (University of Chicago (US))
- ADC TCB meeting tomorrow to discuss PHYSLITE distribution (link)
- Facility R&D meeting notes from last week
- Final presentation from Armen re: SWT2 k8s cluster (MS427)
- Review of previous week’s integration challenge
- Updates on RP1 and Kuantifier for Jupyter notebooks
- Discussed impact of MkDocs EOL announcement (Zensical?)
- CERN is not making an immediate change, but will probably move to Zensical at some point in the future
-
14:25
Cybersecurity plan(s) 5m Speakers: Robert William Gardner Jr (University of Chicago (US)), Shigeki Misawa (Brookhaven National Laboratory (US))
-
14:10
-
14:30
→
14:40
AOB 10m
Brief discussion of mini-capacity challenge in early December. Update USATLAS sites similar to how USCMS results were obtained here https://docs.google.com/presentation/d/1JleJXPMjRyAqxRBcp4x6X_Ozt4ifDszjLz7bMPuY2-k/edit?slide=id.g276cf42bd13_0_98#slide=id.g276cf42bd13_0_98
Previous notes at https://docs.google.com/document/d/1zuHdDeMfp0lsFMphy0_WwnFAdihpSCX_PPe_yFTG2R8/edit?tab=t.0