US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 993 2967 7148
Meeting password: 452400
Invite link: https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09
-
13:00
→
13:05
WBS 2.3 Facility Management News 5m
Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))
Working on WBS 2.3.2 budget for next 5-year CA (due Friday)
Capacity mini-challenges are targeted for next week and the following week: https://twiki.cern.ch/twiki/bin/view/LCG/DomaMiniChallenges and https://docs.google.com/document/d/1RiTDBMR2xRnjLa2tGT_kvGLfTaDBUfHPUXpnoPftnjc/edit?tab=t.0#heading=h.fej4ky3z75a2
Haven't heard final results from the scrubbing process but expect to have something by September. Alexei and Shawn will send out details for each WBS 2.3.x area.
-
13:05
→
13:10
OSG-LHC 5m
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
Release
- XRootD 5.8.4
- cvmfs-2.13.2-2.1
- Includes important bug fixes that prevent client hangs and crashes and avoid multiple concurrent server snapshots. Everyone running CVMFS client 2.12 or newer is especially encouraged to upgrade promptly.
- Frontier Squid 6.13-1.6 (restricted to upcoming)
Other
- Kuantifier: investigating support for pod names that don't change between runs, e.g. Jupyter
- BrianL needs admin access to Marian Babik's networking GitHub repos to move them to the osg-htc organization
-
13:10
→
13:30
WBS 2.3.1: Tier1 Center
Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
13:10 → 13:15
Compute Farm 5m
Speaker: Thomas Smith
- New Grafana dashboard for viewing HTCondor adstash data is a work in progress; a query sketch follows this list:
- Can view historical job completion rates and memory usage vs. request
- Can view historical job failures and vacate reason codes
- Can filter by user or accounting group
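For reference, adstash pushes job ClassAds into Elasticsearch, so panels like these boil down to queries of the following shape. A minimal sketch assuming a typical adstash setup; the endpoint, index pattern, and field names are assumptions, not the dashboard's actual configuration:

    # Hypothetical adstash query sketch (Python, elasticsearch 8.x client)
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # assumed endpoint
    resp = es.search(
        index="htcondor-jobs-*",  # assumed index pattern
        query={"bool": {"filter": [
            # AccountingGroup value is a placeholder
            {"term": {"AccountingGroup.keyword": "group_atlas.prod"}},
            # CompletionDate is epoch seconds in the job ad; mapping-dependent
            {"range": {"CompletionDate": {"gte": 1754006400}}},
        ]}},
        aggs={"by_owner": {"terms": {"field": "Owner.keyword"}}},
        size=0,
    )
    for bucket in resp["aggregations"]["by_owner"]["buckets"]:
        print(bucket["key"], bucket["doc_count"])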
- Created a new Python script (htctl), a command-line wrapper for common actions performed on the Linux farm: starting/stopping daemons, draining, enabling/disabling workers. More to come. A sketch of the idea follows.
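A minimal sketch of what such a wrapper might look like. The subcommand names, host handling, and daemon-control choices here are illustrative assumptions, not the actual htctl interface:

    #!/usr/bin/env python3
    """Hypothetical htctl-style wrapper sketch; not the actual script."""
    import argparse
    import subprocess

    def run(cmd):
        # echo the command before running it, and fail loudly on errors
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    def main():
        parser = argparse.ArgumentParser(prog="htctl")
        sub = parser.add_subparsers(dest="action", required=True)
        for action in ("start", "stop", "drain", "enable", "disable"):
            sub.add_parser(action).add_argument("host")
        args = parser.parse_args()

        if args.action in ("start", "stop"):
            # start/stop the HTCondor daemons on the worker via systemd
            run(["ssh", args.host, "systemctl", args.action, "condor"])
        elif args.action == "drain":
            # let running jobs finish while accepting no new ones
            run(["condor_drain", "-graceful", args.host])
        elif args.action == "enable":
            run(["condor_on", "-name", args.host, "-startd"])
        elif args.action == "disable":
            run(["condor_off", "-peaceful", "-name", args.host, "-startd"])

    if __name__ == "__main__":
        main()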
-
Tier3:
- Updated attsub01-04 submission nodes (AlmaLinux 9.6)
- Jobs now request locally 40% more than what Harvester requested, to allow the pilot to kill jobs earlier (done with a post-route transform); a sketch follows.
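A minimal sketch of how such a post-route transform might look in the HTCondor job router configuration. The transform name and exact knob usage are illustrative assumptions, not the production config:

    # Hypothetical job-router transform sketch
    JOB_ROUTER_TRANSFORM_ScaleMemory @=end
       # bump the local memory request 40% above what Harvester asked for
       EVALSET RequestMemory = int(RequestMemory * 1.4)
    @end
    # run the transform after every route
    JOB_ROUTER_POST_ROUTE_TRANSFORM_NAMES = $(JOB_ROUTER_POST_ROUTE_TRANSFORM_NAMES) ScaleMemory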
-
13:20 → 13:25
Tier1 Operations and Monitoring 5m
Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
WBS 2.3.1.4 Tier-1 Operations & Monitoring - Ivan
- Slightly reduced occupancy (by 2%) due to a user submitting a lot of VHIMEM jobs.
- Limited the number of VHIMEM jobs
- Temporarily shut down the BNL Varnish server until the experts are back
- No operational effect on the overall ADC Varnish infrastructure
- BNL FTS interruption for several hours (8/6, 18:30 - 01:40 CEST) (INC4612171). No operational effect for ADC.
-
13:30
→
13:40
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
Tier-2s running well. No Tier-2 technical meeting this week.
AGLT2:
S3 storage endpoints with intermittent failures. Wei increased memory and CPU cores for the service, which seems to have helped.
MSU nodes with CVMFS 2.13.1 had problems; the same was not seen at UM, where nodes run CVMFS 2.12.6. Upgrading MSU to 2.13.2 improved the issue, but the team suspects it may be related to Varnish. Investigations ongoing.
MWT2:
Tuning cgroup configuration. The current pilot version (3.10.5.57) is not setting memory limits for the child/payload cgroups.
NET2:
Still suffering constant failures of the ESnet XCache for VP queue operations. When working, the VP queue only gets ~30-35% of its data from the cache. Investigating whether this can be improved.
SWT2:
Continuing to evaluate the Varnish migration (now in position 0).
Waiting for central deletion of dark data.
UTA completed deployment of all network monitoring (BGP community and SNMP-based). EL9 migration still ongoing.
Jobs in the OU GPU queue are using all the GPU instances of a node. Checking whether Slurm is receiving the correct information to run only one instance per job.
From Kaushik:
- The Varnish installation at SWT2 exposed a weakness in monitoring that was not foreseen: Varnish failovers are not recorded in the DB or log files, which can make Frontier access slow and inefficient. SWT2 therefore decided not to set up another Varnish for failover, instead using Squid as the failover. This revealed ~1% failovers, since Squid is monitored; the failovers were then rectified on the local nodes. This process should be followed by all sites for a clean migration (see the proxy-ordering sketch after this list).
- The Varnish installation also identified a software/trf bug: when debug mode was turned on for Frontier, jobs failed due to failover warnings in the trf. This was reported to the Frontier team, since the jobs should succeed. The careful migration at SWT2 is already paying dividends in improving the rollout process for all sites.
- After SWT2 pointed out that sites should not locally delete dark data on ATLAS-managed storage, the DDM ops team discovered an internal issue in trying to understand the dark data at SWT2. There is a safety feature in DDM that stops dark-data deletion if the amount is above an arbitrary threshold. This stopped dark-data cleanup at SWT2 for a long time without DDM realizing the problem. They have fixed the problem and are now working to delete the 290 TB of dark data that had built up at SWT2.
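For context, the proxy ordering discussed above lives in the Frontier client configuration. A minimal sketch, with hypothetical hostnames, showing Varnish in position 0 and a monitored Squid as the failover:

    # Hypothetical Frontier client setting; all hostnames are placeholders
    export FRONTIER_SERVER="(serverurl=http://frontier.example.org:8000/atlr)(proxyurl=http://varnish.swt2.example:6081)(proxyurl=http://squid.swt2.example:3128)"

The client tries proxies left to right, so the Squid is only contacted when the Varnish in position 0 fails.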
-
13:40
→
13:50
WBS 2.3.3 Heterogeneous Integration and Operations
HIOPS
Convener: Rui Wang (Argonne National Laboratory (US))
13:40 → 13:45
Integration of Complex Workflows on Heterogeneous Resources 5m
Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
13:50
→
14:10
WBS 2.3.4 Analysis Facilities
Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
-
13:50
Analysis Facilities - BNL 5m
Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
-
13:55
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:00
Analysis Facilities - Chicago 5m
Speaker: Fengping Hu (University of Chicago (US))
-
14:10
→
14:25
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
-
14:10
ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
ADC Operations
- Low operational support. Many people on leave.
- Tokens:
- How to configure tokens: ADC documentation (thanks to Petr)
- Tier-0: 1/1, Tier-1: 10/10, Tier-2: 35/51. Commissioning stopped at the request of the IAM team.
- We have accumulated 15M active access tokens in the IAM DB. This bottleneck will be removed at the end of the year with the new IAM v1.13.
- CRIC: We lost Alexey. CERN IT is to provide support
- Biggest thing to look into: FTS configuration in CRIC
- BLACKLISTING should not be a separate role; it should be added to the privileges of every cloud member
- If you are unable to edit resources that you are supposed to be able to, log in to CRIC via CERN SSO from a clean browser. This should correctly synchronize your e-group membership with CRIC.
- FTS issues noted, transient and without operational impact, at both CERN and BNL (INC4612171)
US Cloud Operations
Site Issues
- AGLT2:
- Investigation of high-memory jobs
- Looking into CVMFS issues. Upgrading to the latest CVMFS version (2.13.2) helped.
- NET2:
- Blacklisted several times in one night due to SCRATCHDISK issues. Solved.
- SWT2:
- Fabio is back and working on dark data
- OU:
- GPU jobs taking all GPUs on a node. Solved centrally for ADC in ALRB.
- Looking into SLURM cgroup plugin deployment; a sketch of the relevant knobs follows.
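For reference, a minimal sketch of the Slurm pieces involved in confining a job to its allocated GPU. Values are illustrative; the actual OU deployment may differ:

    # cgroup.conf (illustrative)
    ConstrainCores=yes
    ConstrainRAMSpace=yes
    ConstrainDevices=yes    # confine each job to only its allocated GPU devices

    # slurm.conf (relevant lines)
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup,task/affinity
    GresTypes=gpu

With ConstrainDevices=yes, a job submitted with --gres=gpu:1 sees only its single allocated GPU rather than every GPU on the node.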
Tickets
- AGLT2:
- GGUS:1000227: S3 storage. Looks fixed.
- GGUS:1000291: Closed as a duplicate.
- BNL:
- GGUS:1000316: Request for an update of “Activity Shares” at the BNL FTS instance
- GGUS:3795: Varnish installation. Experts are just back from vacation.
- NET2:
- GGUS:3255: Pilot using the disk space of the entire hypervisor.
- OU:
- GGUS:2096: Network Monitoring.
- GGUS:3559: Dual-stack support.
- GGUS:1000035: Storage token support. ETA: Ljubljana.
- SLAC:
- GGUS:3792: SITE_NAME is not set.
- SWT2:
- GGUS:3793: Varnish installation.
- GGUS:1000094: SCRATCHDISK space allocation. Waiting for dark-data cleanup.
- HPC:
- GGUS:1484: NERSC_LOCALGROUPDISK support line corrected.
-
14:15
Services DevOps 5m
Speaker: Ilija Vukotic (University of Chicago (US))
XCache
- issues at LRZ-LMU, BHAM
- one restart of the ESnet XCache
ServiceX/ServiceY
- no longer hardcoding XCaches; now using the VP service to discover them
- upgrades to uproot
Varnish
- All sites except BNL now use Varnish for conditions data
- Most Varnishes have moved to use the new Frontier. Remaining: LRZ-LMU, Wuppertal, CERN, and the Prague HPC center.
- Finding remaining corner cases; creating tickets for sites that overwrite CRIC configurations, have networking issues, or have batch queues with wrong settings.
- Varnish for CVMFS is in use at MWT2, AGLT2, and NET2, with Wenjing investigating unexpected Squid usage and performance.
AI
- many changes to the Elasticsearch MCP so that it correctly interprets aggregation data
- work starting on adding an MCP for email handling
- students working on an HTCondor MCP using Google ADK and LangChain; a sketch of the idea follows
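A minimal sketch of what exposing an HTCondor query as a tool for an LLM agent could look like. The tool name, constraint handling, and condor_q invocation are illustrative assumptions, not the students' actual implementation:

    # Hypothetical LangChain tool wrapping condor_q
    import json
    import subprocess

    from langchain_core.tools import tool

    @tool
    def condor_queue_summary(constraint: str = "true") -> str:
        """Summarize schedd jobs matching a ClassAd constraint."""
        # condor_q -json prints the matching job ads as a JSON array
        result = subprocess.run(
            ["condor_q", "-json", "-constraint", constraint],
            capture_output=True, text=True, check=True,
        )
        jobs = json.loads(result.stdout) if result.stdout.strip() else []
        return f"{len(jobs)} matching jobs; first few: " + json.dumps(jobs[:3])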
-
14:25
→
14:35
AOB 10m