US ATLAS Computing Facility (Possible Topical)
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 993 2967 7148
Meeting password: 452400
Invite link: https://umich.zoom.us/j/99329677148
13:00 → 13:05
WBS 2.3 Facility Management News (5m)
Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))
We are awaiting news of the end-of-CA funds so that we can begin spending
- Need to schedule a meeting as soon as the funds are in the pipeline, so we can discuss the process and plans
Check the Milestones at https://docs.google.com/spreadsheets/d/1z5Ud_hMKzogVkFm5lXM5GFpcFZl5Bu0Hkd9xkNagYfY/edit?gid=173778962#gid=173778962
HEPiX is this week (Board meeting is going on now) https://indico.cern.ch/event/1598655/
dCache topic
- AGLT2 and MWT2 are planning to upgrade to v11.2.4. AGLT2 nominally Apr 30, 9 AM - 2 PM; MWT2 May 4
- dCache workshop will have a USATLAS presentation by Eduardo https://indico.nikhef.nl/event/7562/
- Shawn will present on SciTags/Firefly work as well
GENESIS Phase I proposals due April 28th
Summer meetings
- USATLAS F2F at HTC26 in Madison June 9-10
- ATLAS S&C week at CERN June 29-July 2
- USATLAS Scrubbing July 13-15
- USATLAS Summer meeting July 27-29 (?)
Today we have a special guest, Megha Moncy, who will tell us about plans for OSG Security exercises.
13:05 → 13:10
OSG-LHC (5m)
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
- HTCondor 25.10.0 is undergoing stress testing in the CHTC this week, and in the OSPool next week. The headline feature is common-file reuse on the EP side. Release in ~2 weeks
- Still need to start the mass rebuild process for XRootD 6
- Newest version of Kuantifier adds support for tracking usage of Jupyter notebooks: https://osg-htc.org/docs/other/monitor-kubernetes-kuantifier/
- Working with the CRIC team to grab resources + contact info from Topology
13:10 → 13:30
WBS 2.3.1: Tier1 Center
Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
13:10
Tier-1 Infrastructure (5m)
Speaker: Jason Smith
13:15
Compute Farm (5m)
Speaker: Thomas Smith
gridgk03 and gridgk04 were drained on 4/21 by mistake. Ivan caught and corrected this. There was no interruption in jobs or throughput (gridgk06 and gridgk07 picked up the extra work), and things have rebalanced.
Preparations are being made to migrate the Tier 1 Condor nodes to the new configuration we have been working on. The process should be relatively seamless, apart from a brief spike in the failure rate as jobs are killed to rebuild the workers. We are targeting a phased migration in batches of ~25%, with a pause after the first batch to verify that jobs are flowing and completing successfully. Small-scale testing so far has been good! The service should remain up throughout, with only (very) brief periods at 75% capacity.
Targeting a start next week, pending success of all the prep work (a LOT of code to verify and merge).
13:20
Storage (5m)
Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
13:25
Tier1 Operations and Monitoring (5m)
Speaker: Ofer Rind (Brookhaven National Laboratory)
13:30 → 13:40
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
- Great running in the past couple of weeks.
- MWT2 Illinois site had its quarterly preventive maintenance on April 15
- A user submitted ~1M small derivation jobs, causing job failures at MWT2 on April 17-18.
- Some of the Monit plots were corrupted by an Oracle overload on April 16-19.
- CPB is nearly finished with the update to EL9.
- A handful of storage servers remain to be updated.
- The release of dCache version 11.2.4 will be next week.
- Shawn believes this version does Fireflies/SciTags correctly.
- AGLT2 and MWT2 will wait for this release before updating dCache.
- The amount of additional equipment funding is about $1.7 million per site.
- This is above and beyond your FY25 funding.
- Given the unexpectedly large amount of funding, I am asking people to submit new procurement plans by the end of May.
- I have access to the Dell Customer testing center and will be benchmarking 5th generation (Turin) EPYC processors.
- I will look at the list prices of various server configurations to identify the most cost-effective ones.
- One can follow the price of memory over the past 18 months at this web site.
- Still working on the quarterly report.
13:40 → 13:50
WBS 2.3.3 Heterogeneous Integration and Operations (HIOPS)
Convener: Rui Wang (Argonne National Laboratory (US))
13:40
HPC Operations (5m)
Speaker: Rui Wang (Argonne National Laboratory (US))
Perlmutter: Production job still pending
- (Doug) Pilot is not picking up the valid x509 User Proxy. Working with Asoka DeSilva to debug what has changed.
- (Doug) updated the pilot to the latest version
TACC: LRAC (large-scale) call for Horizon, starting in the summer of 2026 -- proposal deadline: May 15
- Large allocations from 125,000 to 500,000 SUs (Horizon) and up to 50,000 SUs (Vista), for a six-month duration
- Requires current peer-reviewed research funding to support the activities conducted on Horizon
- Proposals from or including junior researchers are encouraged
- Horizon: a mix of CPU and GPU computing resources, including 4,750 Dell/NVIDIA Vera CPU nodes, and 2,000 Dell/NVIDIA Grace-Blackwell nodes
- Vera: ~2x the performance of Grace, ~1x that of the AMD EPYC 7763 (Perlmutter)
- Vera Rubin (Doudna): ~10x Grace-Blackwell
13:45
Integration of Complex Workflows on Heterogeneous Resources (5m)
Speaker: Doug Benjamin (Brookhaven National Laboratory (US))
13:50 → 14:10
WBS 2.3.4 Analysis Facilities
Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
13:50
Analysis Facilities - BNL (5m)
Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
- User space token cleanup update
  - Notification email content is finalized and ban-file testing has been done
  - Will send notifications to inactive users until the production storage system is patched to enable the ban feature
- JupyterHub development & deployment updates
  - Improved frontend design
  - Going through the federated authentication workflow and resolving issues with CILogon integration
  - Integration testing of the federated JupyterHub workflow
13:55
Analysis Facilities - SLAC (5m)
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
14:00
Analysis Facilities - Chicago (5m)
Speaker: Fengping Hu (University of Chicago (US))
Containerd open file limit fix
A Coffea-Casa issue caused HTCondor workers to transition to “completed” shortly after startup. This was traced to the ingress controller exhausting available file descriptors.
The root cause was the removal of an explicit open file limit configuration for containerd some time ago. The limit has now been set in the default systemd configuration, and the fix has been deployed on the UC Analysis Facility cluster.
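As a hedged sketch of this kind of fix (the drop-in path and limit value are assumptions, not taken from the actual deployment), an explicit open-file limit for containerd can be restored with a systemd drop-in:

```ini
# /etc/systemd/system/containerd.service.d/override.conf  (path and value assumed)
[Service]
# Without an explicit limit, containerd inherits systemd's default NOFILE cap,
# which a busy ingress controller can exhaust.
LimitNOFILE=1048576
```

After placing the drop-in, `systemctl daemon-reload` followed by `systemctl restart containerd` applies it.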
14:10 → 14:30
WBS 2.3.5 Continuous Operations
Conveners: Ivan Glushkov (Brookhaven National Laboratory (US)), Ofer Rind (Brookhaven National Laboratory)
14:10
ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News (5m)
Speaker: Kaushik De (University of Texas at Arlington (US))
- LHC
- The LHC is delivering low-mu collisions, with runs up to 50 hours long producing 1 PB datasets. The low-mu run will be over by the end of the week
- ADC Ops:
- SAM tests are currently failing.
- BOINC submission is broken at the moment.
- Job monitoring shows artifacts due to an overload of the Monit filler; the affected plots are to be redrawn.
- There is an ongoing campaign to synchronize the SE protocol basepaths. This is needed since tokens are not per-protocol.
- The CERN CephFS problem was due to Micron 5200 SSDs with a power_on_hours SMART counter larger than 65536.
- If you have Micron 5* SSDs with power_on_hours > 65536 (i.e. older than 7 years) - please let us know.
- US Cloud Ops
- Armen has kindly agreed to help with daily issues for US sites: failures, problems, following up on issues, and summarizing still-open issues on Mondays.
- Shortening a NET2 CE downtime revealed a CRIC bug; still to be solved.
- MWT2 storage overload because of a misconfigured user workflow.
- Solved on the ADC side, but site storage protection should be put in place (reduce the number of connections per pool)
- TW increased its number of slots (to 4k) and also removed the FTS limit. It is running all ADC workloads now.
- Agreed to decommission NEVIC localgroupdisk
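The Micron SSD check requested above can be scripted. This is a minimal sketch (device naming, model-string matching, and smartctl's column layout are assumptions about the target hosts, so adapt before running fleet-wide): it flags drives whose power_on_hours counter exceeds 65536.

```shell
# Threshold check kept as a pure function so the logic is testable without disks.
over_threshold() { [ "$1" -gt 65536 ]; }

# Scan SATA-style block devices if smartmontools is installed.
if command -v smartctl >/dev/null 2>&1; then
  for dev in /dev/sd?; do
    [ -e "$dev" ] || continue
    model=$(smartctl -i "$dev" | awk -F': *' '/Device Model/ {print $2}')
    case "$model" in
      Micron_5*|*"Micron 5"*)
        # Field 10 of the SMART attribute table is the raw value.
        hours=$(smartctl -A "$dev" | awk '/Power_On_Hours/ {print $10}')
        if over_threshold "${hours:-0}"; then
          echo "$dev ($model): power_on_hours=$hours > 65536 -- report this drive"
        fi
        ;;
    esac
  done
fi
```

The 65536 cutoff corresponds to roughly 7.5 years of continuous power-on time, matching the "older than 7 years" note above.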
14:15
Services DevOps (5m)
Speaker: Ilija Vukotic (University of Chicago (US))
- XCaches: all OK
- Varnishes: all OK. The MWT2 CVMFS varnish moved to ingress.
- Frontiers: due to the CERN OpenStack retirement of nodes belonging to FRONTIER-A, I had to replace all the nodes. They also changed from m2 to m4 flavors.
- AI: small updates to most of the AI agents
14:20
Facility R&D (5m)
Speaker: Robert William Gardner Jr (University of Chicago (US))
14:25
Cybersecurity plan(s) (5m)
Speakers: Robert William Gardner Jr (University of Chicago (US)), Shigeki Misawa (Brookhaven National Laboratory (US))
14:30 → 14:40
AOB (10m)