US ATLAS Computing Facility (Possible Topical)
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 993 2967 7148
Meeting password: 452400
Invite link: https://umich.zoom.us/j/99329677148
13:00 → 13:05
WBS 2.3 Facility Management News (5m)
Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))
We are awaiting news of the end-of-CA funds so that we can begin spending
- Need to schedule a meeting as soon as the funds are in the pipeline, so we can discuss the process and plans
Check the Milestones at https://docs.google.com/spreadsheets/d/1z5Ud_hMKzogVkFm5lXM5GFpcFZl5Bu0Hkd9xkNagYfY/edit?gid=173778962#gid=173778962
HEPiX is this week (Board meeting is going on now) https://indico.cern.ch/event/1598655/
dCache topic
- AGLT2 and MWT2 are planning to upgrade to v11.2.4. AGLT2 nominally Apr 30, 9 AM - 2 PM; MWT2 May 4
- dCache workshop will have a USATLAS presentation by Eduardo https://indico.nikhef.nl/event/7562/
- Shawn will present on SciTags/Firefly work as well
GENESIS Phase I proposals due April 28th
Summer meetings
- USATLAS F2F at HTC26 in Madison June 9-10
- ATLAS S&C week at CERN June 29-July 2
- USATLAS Scrubbing July 13-15
- USATLAS Summer meeting July 27-29 (?)
Today we have a special guest, Megha Moncy, who will tell us about plans for OSG Security exercises.
13:05 → 13:10
OSG-LHC (5m)
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
- HTCondor 25.10.0 is undergoing stress testing in the CHTC this week, and in the OSPool next week. The headline feature is common-file reuse on the EP side. Release in ~2 weeks
- Still need to start the mass rebuild process for XRootD 6
- Newest version of Kuantifier adds support for tracking usage of Jupyter notebooks: https://osg-htc.org/docs/other/monitor-kubernetes-kuantifier/
- Working with the CRIC team to grab resources + contact info from Topology
13:10 → 13:30
WBS 2.3.1: Tier1 Center
Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
13:10
Tier-1 Infrastructure (5m)
Speaker: Jason Smith
13:15
Compute Farm (5m)
Speaker: Thomas Smith
gridgk03 and gridgk04 were drained on 4/21 by mistake. Ivan caught and corrected this. There was no interruption in jobs or throughput (gridgk06 and gridgk07 picked up the extra work), and things have rebalanced.
Preparations are being made to migrate the Tier 1 Condor nodes to the new configuration we have been working on. The process should be relatively seamless, apart from a brief spike in the failure rate as jobs are killed to rebuild the workers. We are targeting a phased migration in batches of ~25%, with a pause after the first batch to verify that jobs are flowing and completing successfully. Small-scale testing so far has been good! The service should remain up throughout, with only (very) brief periods at 75% capacity.
Targeting a start next week, pending success of all the prep work (a LOT of code to verify and merge).
13:20
Storage (5m)
Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
13:25
Tier1 Operations and Monitoring (5m)
Speaker: Ofer Rind (Brookhaven National Laboratory)
13:30 → 13:40
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
- Great running in the past couple of weeks.
- MWT2 Illinois site had its quarterly preventive maintenance on April 15
- A user submitted ~1M small derivation jobs, causing job failures at MWT2 on April 17-18.
- Some of the Monit plots were corrupted by an Oracle overload on April 16-19.
- CPB is nearly finished with the update to EL9.
- A handful of storage servers remain to be updated.
- The release of dCache version 11.2.4 will be next week.
- Shawn believes this version does Fireflies/SciTags correctly.
- AGLT2 and MWT2 will wait for this release before updating dCache.
- The amount of additional equipment funding is about $1.7 million per site.
- This is above and beyond your FY25 funding.
- Given the unexpectedly large amount of funding, I am asking people to submit new procurement plans by the end of May.
- I have access to the Dell Customer testing center and will be benchmarking 5th generation (Turin) EPYC processors.
- I will look at the list prices of various server configurations to identify the most cost-effective ones.
- One can follow the price of memory over the past 18 months at this web site.
- Still working on the quarterly report.
13:40 → 13:50
WBS 2.3.3 Heterogeneous Integration and Operations (HIOPS)
Convener: Rui Wang (Argonne National Laboratory (US))
13:40
HPC Operations (5m)
Speaker: Rui Wang (Argonne National Laboratory (US))
Perlmutter: Production job still pending
- (Doug) Pilot is not picking up the valid x509 User Proxy. Working with Asoka DeSilva to debug what has changed.
- (Doug) updated the pilot to the latest version
TACC: LRAC (large-scale) call for Horizon, starting in the summer of 2026 -- proposal deadline: May 15
- Large allocations from 125,000 to 500,000 SUs (Horizon) and up to 50,000 SUs (Vista), for a six-month duration
- Requires current peer-reviewed research funding to support the activities conducted on Horizon
- Proposals from or including junior researchers are encouraged
- Horizon: a mix of CPU and GPU computing resources, including 4,750 Dell/NVIDIA Vera CPU nodes, and 2,000 Dell/NVIDIA Grace-Blackwell nodes
- Vera: ~2x the performance of Grace, ~1x that of the AMD EPYC 7763 (Perlmutter)
- Vera Rubin (Doudna): ~10x Grace-Blackwell
13:45
Integration of Complex Workflows on Heterogeneous Resources (5m)
Speaker: Doug Benjamin (Brookhaven National Laboratory (US))
13:50 → 14:10
WBS 2.3.4 Analysis Facilities
Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
13:50
Analysis Facilities - BNL (5m)
Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
- User space token cleanup update
  - Notification email content is finalized and ban-file testing has been done
  - Will send notifications to inactive users until the production storage system is patched to enable the ban feature
- JupyterHub development & deployment updates
  - Improved frontend design
  - Going through the federated authentication workflow and resolving issues with CILogon integration
  - Integration testing of the federated JupyterHub workflow
13:55
Analysis Facilities - SLAC (5m)
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
14:00
Analysis Facilities - Chicago (5m)
Speaker: Fengping Hu (University of Chicago (US))
Containerd open file limit fix
A Coffea-Casa issue caused HTCondor workers to transition to “completed” shortly after startup. This was traced to the ingress controller exhausting available file descriptors.
The root cause was the removal of an explicit open file limit configuration for containerd some time ago. The limit has now been set in the default systemd configuration, and the fix has been deployed on the UC Analysis Facility cluster.
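As a hedged sketch of this kind of fix (the drop-in path and limit value are assumptions, not taken from the actual deployment), an explicit open-file limit for containerd can be restored with a systemd drop-in:

```ini
# /etc/systemd/system/containerd.service.d/override.conf  (path and value assumed)
[Service]
# Without an explicit limit, containerd inherits systemd's default NOFILE cap,
# which a busy ingress controller can exhaust.
LimitNOFILE=1048576
```

After placing the drop-in, `systemctl daemon-reload` followed by `systemctl restart containerd` applies it.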
14:10 → 14:30
WBS 2.3.5 Continuous Operations
Conveners: Ivan Glushkov (Brookhaven National Laboratory (US)), Ofer Rind (Brookhaven National Laboratory)
14:10
ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News (5m)
Speaker: Kaushik De (University of Texas at Arlington (US))
- LHC
- The LHC is delivering low-mu collisions, with runs up to 50 hours long producing 1 PB datasets. The low-mu run will be over by the end of the week
- ADC Ops:
- SAM tests are currently failing.
- BOINC submission is broken at the moment.
- Job monitoring shows artifacts due to an overload of the Monit filler; the affected plots are to be redrawn.
- There is an ongoing campaign to synchronize the SE protocol basepaths. This is needed since tokens are not per-protocol.
- The CERN CephFS problem was due to Micron 5200 SSDs with a power_on_hours SMART counter larger than 65536.
- If you have Micron 5* SSDs with power_on_hours > 65536 (i.e. older than 7 years) - please let us know.
- US Cloud Ops
- Armen has kindly agreed to help with daily issues for US sites: failures, problems, following up on issues, and summarizing still-open issues on Mondays.
- Shortening a NET2 CE downtime revealed a CRIC bug; still to be solved.
- MWT2 storage overload because of a misconfigured user workflow.
- Solved on the ADC side, but site storage protection should be put in place (reduce the number of connections per pool)
- TW increased its number of slots (to 4k) and also removed the FTS limit. It is running all ADC workloads now.
- Agreed to decommission NEVIC localgroupdisk
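The Micron SSD check requested above can be scripted. This is a minimal sketch (device naming, model-string matching, and smartctl's column layout are assumptions about the target hosts, so adapt before running fleet-wide): it flags drives whose power_on_hours counter exceeds 65536.

```shell
# Threshold check kept as a pure function so the logic is testable without disks.
over_threshold() { [ "$1" -gt 65536 ]; }

# Scan SATA-style block devices if smartmontools is installed.
if command -v smartctl >/dev/null 2>&1; then
  for dev in /dev/sd?; do
    [ -e "$dev" ] || continue
    model=$(smartctl -i "$dev" | awk -F': *' '/Device Model/ {print $2}')
    case "$model" in
      Micron_5*|*"Micron 5"*)
        # Field 10 of the SMART attribute table is the raw value.
        hours=$(smartctl -A "$dev" | awk '/Power_On_Hours/ {print $10}')
        if over_threshold "${hours:-0}"; then
          echo "$dev ($model): power_on_hours=$hours > 65536 -- report this drive"
        fi
        ;;
    esac
  done
fi
```

The 65536 cutoff corresponds to roughly 7.5 years of continuous power-on time, matching the "older than 7 years" note above.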
14:15
Services DevOps (5m)
Speaker: Ilija Vukotic (University of Chicago (US))
- XCaches: all OK
- Varnishes: all OK. The MWT2 CVMFS varnish moved to ingress.
- Frontiers: due to the CERN OpenStack retirement of nodes belonging to FRONTIER-A, I had to replace all the nodes. They also changed from m2 to m4 flavors.
- AI: small updates to most of the AI agents
14:20
Facility R&D (5m)
Speaker: Robert William Gardner Jr (University of Chicago (US))
14:25
Cybersecurity plan(s) (5m)
Speakers: Robert William Gardner Jr (University of Chicago (US)), Shigeki Misawa (Brookhaven National Laboratory (US))
14:30 → 14:40
AOB (10m)