US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 993 2967 7148
Meeting password: 452400
Invite link: https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09
WBS 2.3 Facility Management News
Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))
Thanks to everyone for getting in their WBS 2.3 quarterly reports.
WBS 2.3 top-level quarterly should be done soon.
WLCG/HSF meeting coming up in early May
Tier-2s need to work on finalizing procurement and ops plans (discuss in WBS 2.3.2)
- After procurement plans are ready, we need to work on 5-year estimator
Milestone updates are still needed for WBS 2.3: https://docs.google.com/spreadsheets/d/1Y0-KdvsRVCXYGd2t-SqCEFlppZn_PjvUUVDGp2vJjc4/edit?gid=1906829311#gid=1906829311
- #117 (Feb 2025, Delayed by SWT2): updates? WLCG site network monitoring is roughly two years delayed so far...
- #374 (Apr 2025, On Schedule, waiting on BNL?): needs an updated comment?
- #279 (Apr 2025, Delayed, Tier-1): needs an updated comment?
- #392 (Jan 2025, "On Schedule", Tier-1): needs update
- #393 (Jan 2025, "On Schedule", Tier-1): needs update
- #191 (Apr 2025, Delayed, Tier-1): update comment?
- #310 (Feb 2025, Delayed, SWT2): update estimated date and comment
- #316 (Mar 2025, Delayed, SWT2): update estimated date and comment
- #363 (Mar 2025, On Schedule): update status or estimated date/comment
- #410 (Apr 2025, Delayed, WBS 2.3.4): update comment?
- #414 (Apr 2025, On Schedule, WBS 2.3.4): but is this a real milestone?
- #328 (Apr 2025, Delayed, WBS 2.3.5.1): see comment, update estimated date
- #415 (Mar 2025, WBS 2.3.5.2): update estimated date and comment OR retire?
- #416 (Jun 2025, WBS 2.3.5.2): is the estimated date correct? Update comment?
- #419 (Mar 2025, On Schedule, WBS 2.3.5.2): new estimated date needed; change status to Delayed
- #428 (Mar 2025, Delayed, WBS 2.3.5.3): new estimated date, update comment
WBS 2.3.1: Tier1 Center
Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
Tier-1 Infrastructure
Speaker: Jason Smith
Compute Farm
Speaker: Thomas Smith
Storage
Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
Tier1 Operations and Monitoring
Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
WBS 2.3.1.2 Tier-1 Infrastructure - Jason
- NTR
WBS 2.3.1.3 Tier-1 Compute - Tom
- New compute racks were added; the Tier-1 CPU count is temporarily raised to ~45k CPUs
- Retirement of older equipment / donation to a Tier 3 will happen soon; the Tier-1 core count will then show a small net decrease, but there will still be a net gain in HEPscore23, since the new hardware is better/faster core for core (a worked example follows this subsection)
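To illustrate the core-count vs. HEPscore23 point above, here is a short worked example with purely made-up numbers; the real rack sizes and per-core scores were not quoted in the meeting:

```python
# Purely illustrative numbers (not the real BNL procurement figures):
# a replacement that loses cores can still raise total HEPscore23.
old_cores, old_score_per_core = 10_000, 10.0  # assumed retiring hardware
new_cores, new_score_per_core = 8_000, 15.0   # assumed replacement hardware

delta_cores = new_cores - old_cores                       # -2,000 cores
delta_score = (new_cores * new_score_per_core
               - old_cores * old_score_per_core)          # +20,000 HEPscore23

print(f"core change: {delta_cores:+d}, HEPscore23 change: {delta_score:+,.0f}")
```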
WBS 2.3.1.4 Tier-1 Storage - Carlos
- 5280 TB of DISK space added to the 2025 pledge
- 10 pool hosts commissioned into production
- 25030 TB of TAPE space added to the 2025 pledge
WBS 2.3.1.4 Tier-1 Operations & Monitoring - Ivan
- Emptying of the cluster today due to a user assigning all his jobs to BNL only (~100k jobs); mitigation steps below (a sketch of the logic follows this subsection):
- Killing all of the user's assigned jobs at BNL
- Unsetting the site for all his jobs
- Temporarily limiting the number of SCORE jobs at BNL
- The site started to recover in the last hour
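A minimal sketch of the mitigation logic above. The helper functions are hypothetical stand-ins, not the actual PanDA/Harvester admin tooling used at BNL; only the control flow mirrors the steps in the minutes:

```python
# Hypothetical helpers stand in for the real PanDA admin tooling; only the
# control flow mirrors the mitigation steps described in the minutes.

def list_assigned_jobs(site):
    """Return jobs currently assigned to `site` (stub)."""
    return []

def kill_job(job_id):
    """Kill one assigned job (stub)."""

def unset_site(job_id):
    """Clear the explicit site assignment so the job can be brokered elsewhere (stub)."""

def set_site_job_cap(site, job_type, cap):
    """Temporarily cap the number of jobs of `job_type` at `site` (stub)."""

def drain_pinned_user(site, user, score_cap):
    # 1. Kill the user's jobs already assigned to the site ...
    for job in list_assigned_jobs(site):
        if job["user"] == user:
            kill_job(job["id"])
            # 2. ... and clear the site pin so reassignment spreads the load.
            unset_site(job["id"])
    # 3. Temporarily cap single-core (SCORE) jobs at the site.
    set_site_job_cap(site, "SCORE", score_cap)

# Usage (cap value is illustrative): drain_pinned_user("BNL", "some_user", score_cap=10_000)
```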
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
- Reasonable running for the last two weeks.
- AGLT2 continued work on understanding why cvmfs hangs at their sites.
- Still trying to understand why AGLT2 does not seem to be able to run more than 6000 SCORE jobs at a time; this caused a small draining on one day.
- MWT2 had reduced production last week due to a rolling drain to remount the cvmfs repos.
- The drain/remount ended the cvmfs aborts and seems to have activated the fix for the bug causing the aborts.
- It also finally caused the increased number of file descriptors specified in the configuration file to take effect.
- I recommend that all sites update to cvmfs version 2.12.7 (a quick version/configuration check is sketched at the end of this list).
- OU had problems with their scratch area setup and had more failures than usual.
- Some issues were fixed, but the problem still occasionally appears on some servers.
- SWT2_CPB had trouble staying full.
- ADC tried submitting 16-core MCORE jobs.
- Set up a second gatekeeper.
- Seems better?
- Finished the quarterly reporting.
- Now focusing on the Operations and Procurement plans.
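As a follow-up to the cvmfs recommendation above, here is a sketch of a quick per-node check, assuming an RPM-based worker node with the standard client configuration file /etc/cvmfs/default.local (CVMFS_NFILES is the cvmfs client parameter controlling the file-descriptor limit):

```python
# Sketch: verify the cvmfs client version and CVMFS_NFILES on an RPM-based node.
# Assumes the standard client configuration file /etc/cvmfs/default.local.
import re
import subprocess

RECOMMENDED = (2, 12, 7)  # version recommended in the minutes

def installed_cvmfs_version():
    out = subprocess.run(
        ["rpm", "-q", "--queryformat", "%{VERSION}", "cvmfs"],
        capture_output=True, text=True, check=True,
    ).stdout
    return tuple(int(x) for x in out.strip().split("."))

def configured_nfiles(path="/etc/cvmfs/default.local"):
    try:
        with open(path) as f:
            for line in f:
                m = re.match(r"\s*CVMFS_NFILES\s*=\s*(\d+)", line)
                if m:
                    return int(m.group(1))
    except FileNotFoundError:
        pass
    return None

if __name__ == "__main__":
    version = installed_cvmfs_version()
    status = "OK" if version >= RECOMMENDED else "older than 2.12.7, please update"
    print("cvmfs", ".".join(map(str, version)), status)
    print("CVMFS_NFILES:", configured_nfiles())
```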
WBS 2.3.3 Heterogeneous Integration and Operations
HIOPS
Convener: Rui Wang (Argonne National Laboratory (US))
HPC Operations
Speaker: Rui Wang (Argonne National Laboratory (US))
- TACC: job submission was suspended during the weekend; the Harvester instance is stopped for now. ~1.5k SUs.
- Perlmutter: maintenance last week. CPU usage is slightly below expectation; the MCORE job rate is quite stable (not Premium). Suggestion from NERSC (on Rucio): reduce the number of jobs in the queue to improve throughput (a throttling sketch follows below).
- ACCESS: need to discuss the details with Doug.
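A minimal sketch of the queued-job throttling suggested above, under the assumption that the goal is to cap idle jobs in the batch queue rather than total submissions; count_queued_jobs() and submit_job() are hypothetical stand-ins, not Harvester's actual interface:

```python
# Hypothetical throttle: keep at most MAX_QUEUED idle jobs at the HPC site.
# count_queued_jobs() and submit_job() stand in for real batch/Harvester calls.
import time

MAX_QUEUED = 50       # assumed cap on idle jobs in the batch queue
POLL_SECONDS = 60

def count_queued_jobs():
    """Return the number of idle (not yet running) jobs (stub)."""
    return 0

def submit_job():
    """Submit one pilot/job to the batch system (stub)."""

def submission_loop(pending_work):
    while pending_work > 0:
        headroom = MAX_QUEUED - count_queued_jobs()
        for _ in range(max(0, min(headroom, pending_work))):
            submit_job()
            pending_work -= 1
        time.sleep(POLL_SECONDS)  # re-poll instead of flooding the scheduler
```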
Integration of Complex Workflows on Heterogeneous Resources
Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))
WBS 2.3.4 Analysis Facilities
Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
Analysis Facilities - SLAC
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News
Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
- Rucio DB overload on Wednesday due to multiple hanging ART job queries
- The problem has been mitigated
- Ongoing:
- DB experts are working on DB optimization (ATDBOPS-406)
- ART workflow should be optimized (ATLINFR-5755)
- HC: starting tomorrow, PFT_MCORE tests will be able to auto-exclude Production-only PQs.
- Working on automatic storage blacklisting based on functional test transfers (the idea is sketched after this list).
- A campaign is underway to verify that all pledged compute resources allow 96-hour jobs.
- Fred found some problem tasks:
- leaky Exotics derivations, which triggered a discussion on automatic stopping of leaky tasks
- failing evgen
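A minimal sketch of the functional-test-based storage blacklisting mentioned above: compute a recent success rate per storage endpoint from functional-test transfers and exclude endpoints that fall below a threshold. The threshold, sample window, and data layout are assumptions, not the actual implementation:

```python
# Sketch: blacklist storage endpoints whose functional-test transfer success
# rate drops below a threshold. Threshold/window/data layout are assumptions.
from collections import defaultdict

SUCCESS_THRESHOLD = 0.5   # assumed: blacklist below 50% success
MIN_SAMPLES = 20          # assumed: require enough recent tests to judge

def endpoints_to_blacklist(transfers):
    """transfers: [{'endpoint': str, 'ok': bool}, ...] from recent functional tests."""
    ok = defaultdict(int)
    total = defaultdict(int)
    for t in transfers:
        total[t["endpoint"]] += 1
        ok[t["endpoint"]] += t["ok"]
    return [ep for ep, n in total.items()
            if n >= MIN_SAMPLES and ok[ep] / n < SUCCESS_THRESHOLD]
```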
Facility R&D
Speaker: Lincoln Bryant (University of Chicago (US))
- Armada work continues on stretched k8s; there are some deficiencies in how to securely store the postgres password in the deployment (see the sketch after this list).
- Ticket for clarification / request for improvement will be filed
- Coffea Casa deployment work continues; debugging a 'client not found' issue between JupyterHub and Keycloak.
- Moving various AF/K8S services to keycloak-prod, deprecating keycloak-dev, and syncing AF users into Keycloak periodically.
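On the postgres-password point above: one conventional Kubernetes pattern is to mount the credential as a Secret and have the application read it from a file (or an environment variable) at startup, so it never lands in the image or the deployment manifest. A minimal sketch of the consuming side; the mount path and variable name are illustrative, not Armada's actual configuration:

```python
# Sketch: read a postgres password from a mounted Kubernetes Secret file,
# falling back to an env var. Path and env-var name are illustrative assumptions.
import os
from pathlib import Path

SECRET_FILE = Path("/etc/secrets/postgres-password")  # assumed Secret mount point

def postgres_password():
    if SECRET_FILE.exists():
        # File-mounted Secrets can be rotated without rebuilding or redeploying.
        return SECRET_FILE.read_text().strip()
    try:
        return os.environ["POSTGRES_PASSWORD"]  # assumed env-var name
    except KeyError:
        raise RuntimeError("no postgres credential provided") from None
```

Reading from a mounted file is generally preferred over an environment variable, since env vars are easier to leak via process inspection or debug logs.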
AOB