US ATLAS Computing Facility (Possible Topical)
US/Eastern
Description
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 993 2967 7148
Meeting password: 452400
Invite link: https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09
-
-
13:00
→
13:05
WBS 2.3 Facility Management News 5m
Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))
-
13:05
→
13:10
OSG-LHC 5m
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
Release
- Newest HTCondor versions released yesterday resolve the issue that AGLT2 saw with HTCondor-CE 25
- We are expecting a release this week for CVMFS and osg-configure. The latter has an important fix for new CE installations
- N.B. HTCondor Python bindings v1 are not available in OSG 25
- We are discussing internally the root causes of the various undiscovered issues in the initial OSG 25 release
Miscellaneous
- OSG Hub is going down for maintenance on Nov 18
- We are working on migrating other OSG / PATh services from UChicago -> UW + NRP
-
13:15
→
13:30
Topical Presentation 15m
ATLAS dCache Zpool reservation from 10%/11% to 5%
Speaker: Robert Hancock
-
13:20
→
13:40
WBS 2.3.1: Tier1 Center
Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
-
13:20
Tier-1 Infrastructure 5m
Speaker: Jason Smith
- 13:25
- 13:30
- 13:35
-
13:30
→
13:40
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
- Pretty good running over the past two weeks.
- AGLT2 doing rolling upgrades
- MWT2 has condor_chirp running on the MWT2_TEST queue with the development pilot.
- NET2 had some storage troubles that were triggered by large numbers of transfers to the tape storage. There were also some failures in the past 24 hours.
- OU is working on using cgroups v2 to stop jobs using too much memory. This requires changes in Slurm.
- CPB is still updating some of their older storage servers to EL9, which is required to run recent versions of XRootD. In particular, the XRootD 5.9.x series is becoming available.
- TW-FTT has solved a storage accounting problem which overstated the amount of available storage by a factor of two. Network transfers are working well again.
- Several software updates are available. New versions of OSG 25 / HTCondor 25 (25.0.3) are ready to be installed. The new cvmfs version (2.13.3) should be installed soon at all sites as it has urgent bug fixes. As mentioned above, the XRootD team has released version 5.9.0.
- There will be a planning discussion on Friday at 10:00 am CST / 11:00 am EST about spending the FY25 equipment funding.
- Everyone is welcome, including system administrators – we will need their technical expertise.
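As background for the OU item on cgroups v2 memory enforcement, here is a minimal sketch of how a cgroup v2 memory limit can be inspected from Python. It relies only on the kernel's `memory.max` and `memory.current` interface files. The cgroup directory passed in is an assumption: where Slurm places a job's cgroup is site-specific and is not described in these notes. The helper names are hypothetical.

```python
from pathlib import Path


def read_memory_limit(cgroup_dir):
    """Return the cgroup v2 memory limit in bytes, or None if unlimited."""
    # memory.max holds either a byte count or the literal string "max".
    raw = (Path(cgroup_dir) / "memory.max").read_text().strip()
    return None if raw == "max" else int(raw)


def memory_usage_fraction(cgroup_dir):
    """Return current usage divided by the limit, or None if unlimited."""
    limit = read_memory_limit(cgroup_dir)
    if limit is None:
        return None
    # memory.current holds the current usage of the cgroup in bytes.
    current = int((Path(cgroup_dir) / "memory.current").read_text().strip())
    return current / limit
```

A monitoring script could call `memory_usage_fraction` on a job's cgroup directory to see how close the job is to the limit that the kernel will enforce.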
-
13:40
→
13:50
WBS 2.3.3 Heterogeneous Integration and Operations
HIOPS
Convener: Rui Wang (Argonne National Laboratory (US))
- 13:40
-
13:45
Integration of Complex Workflows on Heterogeneous Resources 5m
Speaker: Doug Benjamin (Brookhaven National Laboratory (US))
-
13:50
→
14:10
WBS 2.3.4 Analysis Facilities
Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
-
13:50
Analysis Facilities - BNL 5m
Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
-
13:55
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:00
Analysis Facilities - Chicago 5m
Speaker: Fengping Hu (University of Chicago (US))
Triton Service
- Productionized the service setup.
- Added CVMFS paths for the model repository.
- Configured explicit model loading for better control and resource management.
ServiceX
- Experienced an outage yesterday.
- Root cause: a process in the app pod stopped retrying after 100 failed attempts to connect to the RabbitMQ service, preventing transform tasks from being dispatched.
- Temporary fix: restarted the app pod.
- Permanent fix: developers are updating the code to allow infinite retry attempts.
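The ServiceX fix described above (retrying indefinitely instead of giving up after a fixed number of attempts) can be sketched roughly as follows. This is an illustration only, not the ServiceX code, which is not shown in these notes. The `connect` callable, the function name, and the delay parameters are hypothetical, and the capped exponential backoff with jitter is an assumption added so that endless retries do not hammer the broker.

```python
import random
import time


def connect_with_retry(connect, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call connect() until it succeeds, backing off between failures.

    There is deliberately no attempt limit: the loop only ends when
    connect() returns without raising ConnectionError.
    """
    attempt = 0
    while True:
        try:
            return connect()
        except ConnectionError:
            attempt += 1
            # Exponential backoff capped at max_delay; the exponent is
            # clamped so the intermediate value stays small.
            delay = min(max_delay, base_delay * 2 ** min(attempt - 1, 30))
            # Jitter spreads out reconnects from multiple clients.
            sleep(delay * random.uniform(0.5, 1.0))
```

In a real service, `connect` would wrap whatever call opens the broker connection, and each failure would typically be logged so that a long outage remains visible to operators.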
-
14:10
→
14:30
WBS 2.3.5 Continuous Operations
Conveners: Ivan Glushkov (Brookhaven National Laboratory (US)), Ofer Rind (Brookhaven National Laboratory)
-
14:10
ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
Speaker: Kaushik De (University of Texas at Arlington (US))
- condor_chirp is working on the MWT2 test queue with the new pilot; next to be tested at BNL
- HTCondor/OSG/CVMFS updates in progress
- Questions about VP queue operation, currently stopped at NET2
- ADC
- We rely on sites with limited network capacity to share the network fairly, and we have no way to control that (two independent FTS instances). We already have one such case (ROMA1).
- Working with the HI folks on the setup for HI data taking with as open a trigger as possible.
- SHA1 CAs - our software should not check the CA certificate (self-signed anyway). There is still a bug in dCache to be fixed (dcache#7927).
-
14:15
Services DevOps 5m
Speaker: Ilija Vukotic (University of Chicago (US))
- XCache, VP - NTR
- Varnish
- deprecated backup proxies at FNAL and CERN
- got resources for CERN local Varnishes.
- T0 nodes failing over. Operators informed.
- still problematic: MPPMU, CYFRONET, BNL
- TW is installing a CVMFS varnish proxy
- Frontier
- next week we will have a tutorial session on how to manage Frontiers.
- AF
- every change in documentation now triggers parsing to new MD files, their re-import into OpenAI, and reindexing in the ES vector store.
- testing OpenAI evaluations, grading, prompt optimization.
-
14:20
Facility R&D 5m
Speaker: Robert William Gardner Jr (University of Chicago (US))
-
14:25
Cybersecurity plan(s) 5m
Speakers: Robert William Gardner Jr (University of Chicago (US)), Shigeki Misawa (Brookhaven National Laboratory (US))
-
14:30
→
14:35
AOB 5m
-