US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  993 2967 7148

Meeting password: 452400

Invite link: https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09

    • 13:00 13:05
      WBS 2.3 Facility Management News 5m
      Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))

      We need to prepare for (pre-)scrubbing. A WBS 2.3 L3 slide template has been shared: https://docs.google.com/presentation/d/1mU1eDQQxIE3Lm6qqZsLFPZ-EtmjrgbxXxggr61gJau4/edit?usp=sharing

        - The target is to have draft slides by June 9th (to be confirmed)

      The 5-year evolution spreadsheets for the Tier-2 facilities are in place but still need updates and final numbers

        - Each Tier-2 should be working on spending plans for a possible end-of-CA distribution (see the Tier-2 Spending Proposal)

        - Tier-2 managers will meet on Friday to discuss this

      HTC25 is fast approaching. We have a draft agenda started at https://agenda.hep.wisc.edu/event/2297/timetable/#20250605.detailed

        - Comments are welcome

      The LHCOPN/LHCONE meeting proposed shutting off IPv4 on the LHCOPN

        - The HEPiX IPv6 working group discussed this today; we want to see if ATLAS/BNL and CMS/FNAL are willing to try it, with the expectation that any IPv4 traffic fails over to LHCONE

        - Phil Demar is asking CMS and FNAL if they are willing to try this in the next month or two. Shawn is tasked with doing the same for ATLAS and BNL.
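
Before any IPv4 shutoff test, each site will want to confirm that its key endpoints are reachable over IPv6 alone. A minimal sketch in Python; the hostname below is a placeholder, not a real LHCOPN endpoint:

```python
#!/usr/bin/env python3
"""Check that a host is reachable over IPv6 only (no IPv4 fallback)."""
import socket

def ipv6_reachable(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Attempt a TCP connect using only AAAA records; IPv4 is never tried."""
    try:
        infos = socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # no AAAA record published for this host
    for family, socktype, proto, _canon, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as s:
                s.settimeout(timeout)
                s.connect(sockaddr)
                return True
        except OSError:
            continue  # try the next address, if any
    return False

if __name__ == "__main__":
    host = "example-se.bnl.gov"  # placeholder; substitute a real endpoint
    print(f"{host}: IPv6 {'OK' if ipv6_reachable(host) else 'UNREACHABLE'}")
```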

    • 13:05 13:10
      OSG-LHC 5m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
      • frontier-squid-5.10.1
        • Various security fixes (see https://frontier.cern.ch/dist/rpms/frontier-squidRELEASE_NOTES)
        • Log rotation fix
      • XRootD 5.8.2
        • Fixes one cause of failing HTTP GETs that show up in the logs as "close does not refer to an open file"
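
To gauge whether a site was affected before updating, one can count that message in the local XRootD logs. A minimal sketch; the log path is an assumed default and may differ per deployment:

```python
#!/usr/bin/env python3
"""Count the XRootD error fixed in 5.8.2 in a local log file."""
import re
from collections import Counter

LOG = "/var/log/xrootd/xrootd.log"  # assumed location; adjust per site
PATTERN = re.compile(r"close does not refer to an open file")

def count_errors(path: str) -> Counter:
    hits: Counter = Counter()
    with open(path, errors="replace") as fh:
        for line in fh:
            if PATTERN.search(line):
                day = line.split()[0]  # assumes a leading timestamp field
                hits[day] += 1
    return hits

if __name__ == "__main__":
    for day, n in sorted(count_errors(LOG).items()):
        print(f"{day}: {n} occurrence(s)")
```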
    • 13:10 13:30
      WBS 2.3.1: Tier1 Center
      Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
      • 13:10
        Tier-1 Infrastructure 5m
        Speaker: Jason Smith
      • 13:15
        Compute Farm 5m
        Speaker: Thomas Smith
      • 13:20
        Storage 5m
        Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
      • 13:25
        Tier1 Operations and Monitoring 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
    • 13:30 13:40
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
      • A lack of work caused significant disruption over the past week.
        • Over the weekend there were only SCORE_HIMEM jobs, which would not broker to a site unless meanRSS was set to 3000 MB. We set this at AGLT2 and MWT2 (and, I believe, at SWT2_CPB), and these sites refilled (see the queue-query sketch after this list).
        • There were also a large number of exotics group jobs that failed at all Tier-2 sites due to looping.
      • EL9 updates/FY24 equipment installs continue at MSU and UTA.
        • MSU believes that deploying Satellite will allow them to finish.
      • CPB has been struggling with zombie HTCondor entries.
        • There is a ticket open with the HTCondor team about the issue.
      • Tier-2 PIs will meet on Friday to discuss procurement, covering both FY25 funds and end-of-grant special funds.
      • I need to meet with Rafael about the pre-scrubbing slides.
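
For reference, the RSS attribute used in brokerage can be checked per queue against CRIC. A minimal sketch using the public ATLAS CRIC pandaqueue JSON endpoint; the attribute name 'meanrss' is an assumption and may not match the actual CRIC schema:

```python
#!/usr/bin/env python3
"""List PanDA queues whose RSS attribute admits 3000 MB SCORE_HIMEM jobs."""
import requests

CRIC_URL = "https://atlas-cric.cern.ch/api/atlas/pandaqueue/query/?json"

def himem_capable(threshold_mb: int = 3000) -> list:
    queues = requests.get(CRIC_URL, timeout=30).json()
    capable = []
    for name, queue in queues.items():
        try:
            meanrss = int(queue.get("meanrss"))  # assumed attribute name
        except (TypeError, ValueError):
            continue  # attribute missing or not set for this queue
        if meanrss >= threshold_mb:
            capable.append(name)
    return sorted(capable)

if __name__ == "__main__":
    for q in himem_capable():
        print(q)
```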
    • 13:40 13:50
      WBS 2.3.3 Heterogeneous Integration and Operations

      HIOPS

      Convener: Rui Wang (Argonne National Laboratory (US))
      • 13:40
        HPC Operations 5m
        Speaker: Rui Wang (Argonne National Laboratory (US))

        Perlmutter: both CPU and GPU usage stay above expectations

        TACC: bringing UCORE back online to finish the remaining allocation (in the flex queue)

        ACCESS: scheduling a chat with experts on setting up an HTCondor overlay cluster (a submit-side sketch follows)
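
From the submit side, an overlay boils down to pilot jobs that start a condor startd reporting back to a central pool. A minimal sketch with the HTCondor Python bindings; the wrapper script, pool host, and project name are hypothetical placeholders:

```python
#!/usr/bin/env python3
"""Submit one overlay pilot job via the HTCondor Python bindings."""
import htcondor

pilot = htcondor.Submit({
    "executable": "start_pilot.sh",  # hypothetical wrapper running condor_startd
    "arguments": "--central-manager cm.example.org",  # placeholder pool host
    "request_cpus": "32",
    "request_memory": "64GB",
    "+ProjectName": '"ACCESS-ALLOCATION"',  # placeholder accounting string
    "output": "pilot.$(Cluster).out",
    "error": "pilot.$(Cluster).err",
    "log": "pilot.log",
})

schedd = htcondor.Schedd()
result = schedd.submit(pilot, count=1)  # one pilot; scale count as needed
print("Submitted pilot cluster", result.cluster())
```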

      • 13:45
        Integration of Complex Workflows on Heterogeneous Resources 5m
        Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))
    • 13:50 14:10
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 13:50
        Analysis Facilities - BNL 5m
        Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
        • Managed to load the user name and group name while creating the pod, using an init container (see the sketch after this list)
          • The dCache NFS client still needs NFSv4 identity mapping to be configured properly
          • Further work is needed on idmap on the OpenShift worker node or pod
        • Testing of image push/pull to/from the SDCC Quay service is done
          • Customized the Alma9 base image and built and registered it on SDCC Quay.
        • Tom Smith is deploying the accounting monitoring that was missing from the A9 Tier-3 pool and interactive hosts
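
A minimal sketch of the init-container approach with the Kubernetes Python client, assuming an OpenShift-compatible context: the init container stages a passwd entry onto a shared emptyDir that the main container can merge at startup. Image names and the user entry are placeholders:

```python
#!/usr/bin/env python3
"""Create a pod whose init container stages user/group identity files."""
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster

init = client.V1Container(
    name="load-ids",
    image="registry.example/alma9:latest",  # placeholder image
    command=["sh", "-c",
             "echo 'alice:x:12345:6789::/home/alice:/bin/bash' > /mnt/ids/passwd"],
    volume_mounts=[client.V1VolumeMount(name="ids", mount_path="/mnt/ids")],
)

main = client.V1Container(
    name="app",
    image="registry.example/alma9:latest",  # placeholder image
    command=["sleep", "3600"],
    # The main container sees the staged files here and can merge them
    # into /etc/passwd at startup (e.g. via an entrypoint script).
    volume_mounts=[client.V1VolumeMount(name="ids", mount_path="/etc/extra-ids")],
)

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="idmap-test"),
    spec=client.V1PodSpec(
        init_containers=[init],
        containers=[main],
        volumes=[client.V1Volume(name="ids",
                                 empty_dir=client.V1EmptyDirVolumeSource())],
        restart_policy="Never",
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```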
      • 13:55
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:00
        Analysis Facilities - Chicago 5m
        Speaker: Fengping Hu (University of Chicago (US))
        • BinderHub service is now available to users (a link is up on the portal)
        • Started to look into Dask-Gateway/HTCondor queue integration (do we need to reinvent the wheel? see the sketch below)
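
On the reinvent-the-wheel question: dask-jobqueue already ships an HTCondorCluster that submits Dask workers as HTCondor jobs, which could serve as the backend piece under Dask-Gateway. A minimal sketch; resource numbers are placeholders:

```python
#!/usr/bin/env python3
"""Spin up Dask workers as HTCondor jobs via dask-jobqueue."""
from dask.distributed import Client
from dask_jobqueue import HTCondorCluster

cluster = HTCondorCluster(
    cores=4,        # cores per worker job
    memory="8GB",   # memory per worker job
    disk="10GB",    # scratch space per worker job
)
cluster.scale(jobs=5)  # submit five HTCondor worker jobs

client = Client(cluster)
# Trivial smoke test executed on the workers.
print(client.submit(sum, range(100)).result())  # -> 4950
```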
    • 14:10 14:25
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • 14:10
        ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
        • Preparing for deployment of FTS update at BNL (v14.0.1 to be released next week) - will allow for token testing during data challenge
        • Varnish at BNL now functional on OpenShift with Quay image; still some network routing to deploy
        • DDM moved the BNL VP queue XCache to the ESnet server
        • Ongoing discussions of Varnish deployment and management
        • CRIC permissions were updated
        • BNL-OSG2_DATADISK protocol priorities to be changed from 0 to null.
      • 14:15
        Services DevOps 5m
        Speaker: Ilija Vukotic (University of Chicago (US))
        • XCaches
          • Multiple issues in the UK cloud.
          • The ESnet XCache is operational, but no monitoring is coming from it.
          • Will try to build a new image this week.
        • VP
          • BNL_VP is trying to use the ESnet XCache
        • Varnish
          • Started building the neo_frontier infrastructure on the OpenStack k8s cluster at CERN
          • Asked SWT2 to deploy their own Varnish
          • All Varnishes were removed from WLCG monitoring; a dedicated Varnish monitoring meeting is set for Friday at 9:30 AM CST (a cache-probe sketch appears after this list)
        • CREST
          • NTR
        • ServiceX/Y
          • Updated all the components to 1.6.1
          • Testing the RDataFrame code generator and transformers
        • AF
          • Cleaned up images and their naming.
          • Added Python 3.12 to the login nodes.
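
For the Varnish work above, a lightweight health probe can confirm caching behavior from standard HTTP headers without relying on WLCG monitoring. A minimal sketch; the endpoint URL is a placeholder:

```python
#!/usr/bin/env python3
"""Probe a Varnish endpoint twice and report HTTP caching headers."""
import requests

URL = "http://varnish.example.org:6081/"  # placeholder endpoint

def probe(url: str) -> None:
    for attempt in (1, 2):
        r = requests.get(url, timeout=10)
        age = r.headers.get("Age", "0")    # nonzero Age implies a cache hit
        via = r.headers.get("Via", "-")    # proxies add themselves here
        print(f"attempt {attempt}: status={r.status_code} Age={age} Via={via}")

if __name__ == "__main__":
    probe(URL)
```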
      • 14:20
        Facility R&D 5m
        Speaker: Lincoln Bryant (University of Chicago (US))
        • Armada seems to be working locally on the stretched k8s, and we are investigating the auth components needed to send tasks to another cluster
        • We are actively debugging/trying to understand EOS user authentication. 
          • Kerberos is a nonstarter; X.509 might be tricky because the EOS containers are all EL7 (!), and we're trying to understand the CA/certificate situation
          • "Plain" OAuth2 is deprecated, with support shifting to SciTokens-based auth
          • It is not yet clear how to bridge the gap from Keycloak to SciTokens; still working on it (see the token-inspection sketch after this list)
        • Coffea Casa JupyterHub should be working at https://coffea-casa.hl-lhc.io/, with caveats:
          • You must already have a UChicago AF account to get your /home, /data, and access to HTCondor
          • Still working on:
            • General ATLAS users coming from IAM without a UChicago AF account
              • Only get Jupyter, no persistence
              • Probably will crash right now if you try it
            • HTCondor pool on the stretched cluster
            • Mounting NFS/Ceph over the WireGuard interface within K8S
              • Jupyter limited to UChicago nodes at the moment, where we can mount locally
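
On the Keycloak-to-SciTokens gap above: a first diagnostic is to fetch a token from Keycloak via the client-credentials grant and inspect whether its claims resemble the SciTokens profile (a 'ver' claim such as 'scitoken:2.0' and storage scopes). A minimal sketch; the realm URL, client ID, and secret are placeholders:

```python
#!/usr/bin/env python3
"""Fetch a Keycloak token and inspect its claims for SciTokens-style fields."""
import jwt  # PyJWT
import requests

TOKEN_URL = "https://keycloak.example.org/realms/eos/protocol/openid-connect/token"

resp = requests.post(TOKEN_URL, data={
    "grant_type": "client_credentials",
    "client_id": "eos-client",      # placeholder
    "client_secret": "CHANGE_ME",   # placeholder
}, timeout=10)
resp.raise_for_status()
access_token = resp.json()["access_token"]

# Decode without signature verification, purely to inspect the claims.
claims = jwt.decode(access_token, options={"verify_signature": False})

print("ver:  ", claims.get("ver", "<missing>"))    # SciTokens uses e.g. 'scitoken:2.0'
print("scope:", claims.get("scope", "<missing>"))  # expect e.g. 'storage.read:/...'
print("aud:  ", claims.get("aud", "<missing>"))
```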
    • 14:25 14:35
      AOB 10m