US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 993 2967 7148
Meeting password: 452400
Invite link: https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09
-
13:00
→
13:05
WBS 2.3 Facility Management News 5m
Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))
Working on WBS 2.3.2 budget for next 5-year CA (due Friday)
Capacity mini-challenges are targeted for next week and the following week: https://twiki.cern.ch/twiki/bin/view/LCG/DomaMiniChallenges and https://docs.google.com/document/d/1RiTDBMR2xRnjLa2tGT_kvGLfTaDBUfHPUXpnoPftnjc/edit?tab=t.0#heading=h.fej4ky3z75a2
Haven't heard final results from the scrubbing process but expect to have something by September. Alexei and Shawn will send out details for each WBS 2.3.x area.
-
13:05
→
13:10
OSG-LHC 5m
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
Release
- XRootD 5.8.4
- cvmfs-2.13.2-2.1
- Includes important bug fixes that prevent client hangs and crashes and avoid multiple concurrent server snapshots. Everyone running CVMFS client 2.12 or newer is especially encouraged to upgrade promptly.
- Frontier Squid 6.13-1.6 (restricted to upcoming)
Other
- Kuantifier: investigating support for pod names that don't change between runs, e.g. Jupyter
- BrianL needs admin access to Marian Babik's networking GitHub repos to move them to the osg-htc organization
-
13:10
→
13:30
WBS 2.3.1: Tier1 Center
Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
13:10 → 13:15
Compute Farm 5m
Speaker: Thomas Smith
- New Grafana dashboard for viewing HTCondor adstash data is a work in progress; a query sketch follows this list:
- Can view historical job completion rates and memory usage vs. request
- Can view historical job failures and vacate reason codes
- Can filter by user or accounting group
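For reference, adstash pushes job ClassAds into Elasticsearch, so panels like these boil down to queries of the following shape. A minimal sketch assuming a typical adstash setup; the endpoint, index pattern, and field names are assumptions, not the dashboard's actual configuration:

    # Hypothetical adstash query sketch (Python, elasticsearch 8.x client)
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # assumed endpoint
    resp = es.search(
        index="htcondor-jobs-*",  # assumed index pattern
        query={"bool": {"filter": [
            # AccountingGroup value is a placeholder
            {"term": {"AccountingGroup.keyword": "group_atlas.prod"}},
            # CompletionDate is epoch seconds in the job ad; mapping-dependent
            {"range": {"CompletionDate": {"gte": 1754006400}}},
        ]}},
        aggs={"by_owner": {"terms": {"field": "Owner.keyword"}}},
        size=0,
    )
    for bucket in resp["aggregations"]["by_owner"]["buckets"]:
        print(bucket["key"], bucket["doc_count"])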
- Created a new Python script (htctl), a command-line wrapper for common actions performed on the Linux farm: starting/stopping daemons, draining, enabling/disabling workers. More to come. A sketch of the idea follows.
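A minimal sketch of what such a wrapper might look like. The subcommand names, host handling, and daemon-control choices here are illustrative assumptions, not the actual htctl interface:

    #!/usr/bin/env python3
    """Hypothetical htctl-style wrapper sketch; not the actual script."""
    import argparse
    import subprocess

    def run(cmd):
        # echo the command before running it, and fail loudly on errors
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    def main():
        parser = argparse.ArgumentParser(prog="htctl")
        sub = parser.add_subparsers(dest="action", required=True)
        for action in ("start", "stop", "drain", "enable", "disable"):
            sub.add_parser(action).add_argument("host")
        args = parser.parse_args()

        if args.action in ("start", "stop"):
            # start/stop the HTCondor daemons on the worker via systemd
            run(["ssh", args.host, "systemctl", args.action, "condor"])
        elif args.action == "drain":
            # let running jobs finish while accepting no new ones
            run(["condor_drain", "-graceful", args.host])
        elif args.action == "enable":
            run(["condor_on", "-name", args.host, "-startd"])
        elif args.action == "disable":
            run(["condor_off", "-peaceful", "-name", args.host, "-startd"])

    if __name__ == "__main__":
        main()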
-
Tier3:
- Updated attsub01-04 submission nodes (AlmaLinux 9.6)
- Jobs now request locally 40% more than what Harvester requested, to allow the pilot to kill jobs earlier (done with a post-route transform); a sketch follows.
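A minimal sketch of how such a post-route transform might look in the HTCondor job router configuration. The transform name and exact knob usage are illustrative assumptions, not the production config:

    # Hypothetical job-router transform sketch
    JOB_ROUTER_TRANSFORM_ScaleMemory @=end
       # bump the local memory request 40% above what Harvester asked for
       EVALSET RequestMemory = int(RequestMemory * 1.4)
    @end
    # run the transform after every route
    JOB_ROUTER_POST_ROUTE_TRANSFORM_NAMES = $(JOB_ROUTER_POST_ROUTE_TRANSFORM_NAMES) ScaleMemory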
-
13:20 → 13:25
Tier1 Operations and Monitoring 5m
Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
WBS 2.3.1.4 Tier-1 Operations & Monitoring - Ivan
- Slightly reduced occupancy (by 2%) due to a user submitting a lot of VHIMEM jobs.
- Limited the number of VHIMEM jobs
- Temporarily shut down the BNL Varnish server until the experts are back
- No operational effect on the overall ADC Varnish infrastructure
- BNL FTS interruption for several hours (8/6, 18:30 - 01:40 CEST) (INC4612171). No operational effect for ADC.
-
13:30
→
13:40
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
Tier-2s running well. No Tier-2 technical meeting this week.
AGLT2:
S3 storage endpoints with intermittent failures. Wei increased memory and CPU cores for the service, which seems to have helped.
MSU nodes with CVMFS 2.13.1 had problems; the same was not seen at UM, where nodes run CVMFS 2.12.6. Upgrading MSU to 2.13.2 improved the issue, but the team suspects it may be related to Varnish. Investigations ongoing.
MWT2:
Tuning cgroup configuration. The current pilot version (3.10.5.57) is not setting memory limits for the child/payload cgroups.
NET2:
Still suffering constant failures of the ESnet XCache for VP queue operations. When working, the VP queue only gets ~30-35% of its data from the cache. Investigating whether this can be improved.
SWT2:
Continuing to evaluate the Varnish migration (now in position 0).
Waiting for central deletion of dark data.
UTA completed deployment of all network monitoring (BGP community and SNMP-based). EL9 migration still ongoing.
Jobs in the OU GPU queue are using all the GPU instances of a node. Checking whether Slurm is receiving the correct information to run only one instance per job.
From Kaushik:
- The Varnish installation at SWT2 exposed a weakness in monitoring that was not foreseen: Varnish failovers are not recorded in the DB or log files, which can make Frontier access slow and inefficient. SWT2 therefore decided not to set up another Varnish for failover, instead using Squid as the failover. This revealed ~1% failovers, since Squid is monitored; the failovers were then rectified on the local nodes. This process should be followed by all sites for a clean migration (see the proxy-ordering sketch after this list).
- The Varnish installation also identified a software/trf bug: when debug mode was turned on for Frontier, jobs failed due to failover warnings in the trf. This was reported to the Frontier team, since the jobs should succeed. The careful migration at SWT2 is already paying dividends in improving the rollout process for all sites.
- After SWT2 pointed out that sites should not locally delete dark data on ATLAS-managed storage, the DDM ops team discovered an internal issue in trying to understand the dark data at SWT2. There is a safety feature in DDM that stops dark-data deletion if the amount is above an arbitrary threshold. This stopped dark-data cleanup at SWT2 for a long time without DDM realizing the problem. They have fixed the problem and are now working to delete the 290 TB of dark data that had built up at SWT2.
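For context, the proxy ordering discussed above lives in the Frontier client configuration. A minimal sketch, with hypothetical hostnames, showing Varnish in position 0 and a monitored Squid as the failover:

    # Hypothetical Frontier client setting; all hostnames are placeholders
    export FRONTIER_SERVER="(serverurl=http://frontier.example.org:8000/atlr)(proxyurl=http://varnish.swt2.example:6081)(proxyurl=http://squid.swt2.example:3128)"

The client tries proxies left to right, so the Squid is only contacted when the Varnish in position 0 fails.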
-
13:40
→
13:50
WBS 2.3.3 Heterogeneous Integration and Operations
HIOPS
Convener: Rui Wang (Argonne National Laboratory (US))
13:40 → 13:45
Integration of Complex Workflows on Heterogeneous Resources 5m
Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
13:50
→
14:10
WBS 2.3.4 Analysis Facilities
Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
-
13:50
Analysis Facilities - BNL 5m
Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
-
13:55
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:00
Analysis Facilities - Chicago 5m
Speaker: Fengping Hu (University of Chicago (US))
-
14:10
→
14:25
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
-
14:10
ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
ADC Operations
- Low operational support. Many people on leave.
- Tokens:
- How to configure tokens: ADC documentation (thanks to Petr)
- Tier-0: 1/1, Tier-1: 10/10, Tier-2: 35/51. Commissioning stopped at the request of the IAM team.
- We have accumulated 15M active access tokens in the IAM DB. This bottleneck will be removed at the end of the year with the new IAM v1.13.
- CRIC: We lost Alexey. CERN IT is to provide support
- Biggest thing to look into: FTS configuration in CRIC
- BLACKLISTING should not be a separate role; it should be added to the privileges of every cloud member
- If you are unable to edit resources that you are supposed to be able to, log in to CRIC via CERN SSO from a clean browser. This should correctly synchronize your e-group membership with CRIC.
- FTS issues noted, transient and without operational impact, at both CERN and BNL (INC4612171)
US Cloud Operations
Site Issues
- AGLT2:
- Investigation of high-memory jobs
- Looking into CVMFS issues. Upgrading to the latest CVMFS version (2.13.2) helped.
- NET2:
- Blacklisted several times in one night due to SCRATCHDISK issues. Solved.
- SWT2:
- Fabio is back and working on dark data
- OU:
- GPU jobs taking all GPUs on a node. Solved centrally for ADC in ALRB.
- Looking into SLURM cgroup plugin deployment; a sketch of the relevant knobs follows.
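For reference, a minimal sketch of the Slurm pieces involved in confining a job to its allocated GPU. Values are illustrative; the actual OU deployment may differ:

    # cgroup.conf (illustrative)
    ConstrainCores=yes
    ConstrainRAMSpace=yes
    ConstrainDevices=yes    # confine each job to only its allocated GPU devices

    # slurm.conf (relevant lines)
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup,task/affinity
    GresTypes=gpu

With ConstrainDevices=yes, a job submitted with --gres=gpu:1 sees only its single allocated GPU rather than every GPU on the node.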
Tickets
- AGLT2:
- GGUS:1000227: S3 storage. Looks fixed.
- GGUS:1000291: Closed as a duplicate.
- BNL:
- GGUS:1000316: Request for an update of “Activity Shares” at the BNL FTS instance
- GGUS:3795: Varnish installation. Experts are just back from vacation.
- NET2:
- GGUS:3255: Pilot using the disk space of the entire hypervisor.
- OU:
- GGUS:2096: Network Monitoring.
- GGUS:3559: Dual-stack support.
- GGUS:1000035: Storage token support. ETA: Ljubljana.
- SLAC:
- GGUS:3792: SITE_NAME is not set.
- SWT2:
- GGUS:3793: Varnish installation.
- GGUS:1000094: SCRATCHDISK space allocation. Waiting for dark-data cleanup.
- HPC:
- GGUS:1484: NERSC_LOCALGROUPDISK support line corrected.
-
14:15
Services DevOps 5m
Speaker: Ilija Vukotic (University of Chicago (US))
XCache
- issues at LRZ-LMU, BHAM
- one restart of the ESnet XCache
ServiceX/ServiceY
- no longer hardcoding XCaches; now using the VP service to discover them
- upgrades to uproot
Varnish
- All sites except BNL now use Varnish for conditions data
- Most Varnishes have moved to use the new Frontier. Remaining: LRZ-LMU, Wuppertal, CERN, and the Prague HPC center.
- Finding remaining corner cases; creating tickets for sites that overwrite CRIC configurations, have networking issues, or have batch queues with wrong settings.
- Varnish for CVMFS is in use at MWT2, AGLT2, and NET2, with Wenjing investigating unexpected Squid usage and performance.
AI
- many changes to the Elasticsearch MCP so that it correctly interprets aggregation data
- work starting on adding an MCP for email handling
- students working on an HTCondor MCP using Google ADK and LangChain; a sketch of the idea follows
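A minimal sketch of what exposing an HTCondor query as a tool for an LLM agent could look like. The tool name, constraint handling, and condor_q invocation are illustrative assumptions, not the students' actual implementation:

    # Hypothetical LangChain tool wrapping condor_q
    import json
    import subprocess

    from langchain_core.tools import tool

    @tool
    def condor_queue_summary(constraint: str = "true") -> str:
        """Summarize schedd jobs matching a ClassAd constraint."""
        # condor_q -json prints the matching job ads as a JSON array
        result = subprocess.run(
            ["condor_q", "-json", "-constraint", constraint],
            capture_output=True, text=True, check=True,
        )
        jobs = json.loads(result.stdout) if result.stdout.strip() else []
        return f"{len(jobs)} matching jobs; first few: " + json.dumps(jobs[:3])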
-
14:25
→
14:35
AOB 10m