US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 993 2967 7148
Meeting password: 452400
Invite link: https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09
-
1:00 PM
→
1:05 PM
WBS 2.3 Facility Management News 5m
Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))
A few top-of-the-meeting items:
- We meet tomorrow afternoon with the Tier-2 PIs and relevant managers to discuss possible shopping lists, with the goal of agreeing on a plan we can send to Chris and John
- Friday morning those of us working on the Trusted CI engagement will meet to discuss our feedback on homework #1 and initial responses for homework #2
- Please remember to track milestone progress (WBS 2.3 working copy at https://docs.google.com/spreadsheets/d/1Y0-KdvsRVCXYGd2t-SqCEFlppZn_PjvUUVDGp2vJjc4/edit?usp=sharing )
- BNL has new (stricter) rules for international travel
- personal days, travel justification, number of participants per conference/WS
Upcoming meetings: LHCONE/LHCOPN and HEPiX
- 1:05 PM → 1:10 PM
-
1:10 PM
→
1:30 PM
WBS 2.3.1: Tier1 Center
Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
-
1:10 PM
Tier-1 Infrastructure 5m
Speaker: Jason Smith
-
1:15 PM
Compute Farm 5m
Speaker: Thomas Smith
-
1:20 PM
Storage 5m
Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
-
1:25 PM
Tier1 Operations and Monitoring 5m
Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
WBS 2.3.1.2 Tier-1 Infrastructure - Jason
- NTR
WBS 2.3.1.3 Tier-1 Compute - Tom
- gridgk06 upgraded to Alma 9.5, Condor 24
- gridgk07 closed to jobs, upgrade pending (this week)
- This will conclude the upgrades to the ATLAS T1 farm production CE infrastructure
- BNL_ARM resource (480 slots) added to production
WBS 2.3.1.4 Tier-1 Storage - Carlos
- NTR
WBS 2.3.1.5 Tier-1 Operations & Monitoring - Ivan
- NTR
-
1:30 PM
→
1:40 PM
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
- Reasonable running in the past two weeks.
- MWT2 (1.5 days) and OU (1 day) took downtime.
- CPB had a DNS incident over the last weekend and was offline for about a day.
- Otherwise good running...
- Two sites still finishing EL9 updates: MSU and UTA.
- MSU is close to having their installation system working.
- UTA (SWT2_CPB) is done with all servers except the storage servers.
- We have decided to set a deadline of March 31 for submission of this year's Procurement and Operations plans.
- I will follow up on whether there are template milestones that we can adjust to the March 31 deadline.
-
1:40 PM
→
1:50 PM
WBS 2.3.3 Heterogeneous Integration and Operations
HIOPS
Convener: Rui Wang (Argonne National Laboratory (US))
- 1:40 PM
-
1:45 PM
Integration of Complex Workflows on Heterogeneous Resources 5m
Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))
Still working on a solution to make NVIDIA libraries available to ATLAS jobs running on the NERSC GPU queue. Testing a container created by merging the NVIDIA CUDA development RockyLinux 9 container with the Docker files for the Alma 9 ADC grid containers developed and maintained by Alessandro DeSalvo.
Still need to pass the needed environment variables into the container, create a work area, add mount points for /pscratch and /cvmfs, and then modify the pilot wrapper script, etc.
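As a rough sketch of those remaining steps, assuming Apptainer is the container runtime; the image name, variable name, and wrapper path are illustrative, not the actual NERSC configuration:

```shell
# Hypothetical sketch only: image, variable, and paths are made up.
# Create a work area, then launch the merged CUDA + Alma 9 ADC container
# with the /pscratch and /cvmfs mounts and the environment the job needs.
mkdir -p "$SCRATCH/atlas-work"
apptainer exec \
    --bind /pscratch:/pscratch \
    --bind /cvmfs:/cvmfs \
    --env ATLAS_WORK_AREA="$SCRATCH/atlas-work" \
    cuda-alma9-adc.sif \
    ./pilot-wrapper.sh
```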
-
1:50 PM
→
2:10 PM
WBS 2.3.4 Analysis Facilities
Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
- 1:50 PM
-
1:55 PM
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
- 2:00 PM
-
2:10 PM
→
2:25 PM
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
- WLCG DOMA BDT effort was restarted today - link to slides and minutes
-
2:10 PM
ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
- ADC Operations:
- Data Carousel: Including analysis
- A few test users already included
- Still a few things to clarify (ATLASPANDA-1129)
- MC Evgen
- default maxFailure set to 3 (was 10)
- ARC CE bug found and fixed.
- XRD and EOS turls need davs replacing with https
- fix should go into ARC7 and be backported to ARC6. Deployment timescale unknown.
- This was the reason for the failing SWT2-to-SWT2 transfers
- IAM to K8S switch scheduled for 3/10/25
- Anything using voms-atlas-auth.app.cern.ch as token/proxy issuer will start failing
- Still many tokens and proxies requested. Contacted all users (btw who is nathan.crawford@uci.edu?)
- A dedicated “Sites” section was started in the new/developing “ADC Documentation”
- First contribution: “How to add a remote_queue to your CE/gatekeeper”
- Feel free to contribute anything that might be of help to other site admins
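The turl fix noted above (davs replaced with https for XRootD and EOS turls) amounts to a scheme rewrite; a minimal sketch, with a made-up turl:

```shell
# Hypothetical illustration of the ARC CE turl fix: the CE emitted davs://
# turls for XRootD/EOS endpoints where https:// is needed. Turl is made up.
turl="davs://eos.example.cern.ch:443//eos/atlas/datafile.root"
fixed="https://${turl#davs://}"   # strip the davs:// scheme, prepend https://
echo "$fixed"
```

This prints `https://eos.example.cern.ch:443//eos/atlas/datafile.root`.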
- US Cloud Operations:
- Site Issues
- NET2:
- Now running all ATLAS workflows
- OU_OSCER_ATLAS
- Was still shown in monitoring as being in downtime after the downtime ended. Solved (ADCMONITOR-559)
- Others
- Due to Data Carousel configuration - a tape staging problem at TRIUMF was visible as high destination failure rate on all US sites. Solved (GGUS:2430)
- Tickets
- AGLT2:
- GGUS:2431: Bad CVMFS mounts. Solved.
- BNL:
- GGUS:2428: High failure rate from gridgk04. Solved.
- MWT2:
- GGUS:2099: BGP tagging.
- NET2:
- GGUS:2404: Squid degraded due to power outage. Solved.
- GGUS:2365: Failing transfers during Jumbo frames test. Solved.
- GGUS:2097: BGP tagging.
- ATLDDMOPS-5707: NET2 tape commissioning is advancing.
- SWT2:
- GGUS:2098: BGP tagging
-
2:15 PM
Services DevOps 5m
Speaker: Ilija Vukotic (University of Chicago (US))
- XCache
- issues with gStream monitoring are being debugged
- issues with tests of the 2 Oxford XCache nodes
- VP
- working fine
- Varnishes
- all working fine
- writing documentation on how to deploy it
- ServiceY
- writing documentation on how to deploy its Runner
- stress testing of AF ads nodes
- stress testing of FAB
- AF
- The Assistant can now run bash commands and scripts.
-
2:20 PM
Facility R&D 5m
Speaker: Lincoln Bryant (University of Chicago (US))
- Testing flocking from UChicago AF to MWT2 with Docker-based containers + HTCondor overlay
- Some early results during a MWT2 downtime, displacing OSG workloads with AF workloads.
- A bunch of parameters to tune, but for now we're submitting fairly non-aggressively, with each container set to 8 cores / 48 GB RAM (~VHIMEM equivalent)
- Already identified a few things to fix - Singularity, for example, seems broken
- Users simply add "ALLOW_MWT2=True" in their job ad
- Should be generalizable to run elsewhere, but currently requires privilege. Might be possible without privilege for the containers, TBD.
- Starting a document describing WireGuard implementation requirements
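The flocking opt-in described above ("ALLOW_MWT2=True" in the job ad) might look like this in an HTCondor submit file; this is a sketch, and everything except the ALLOW_MWT2 attribute is illustrative:

```
# Hypothetical submit file: only +ALLOW_MWT2 comes from the notes above.
universe       = vanilla
executable     = analysis.sh
request_cpus   = 8
request_memory = 48GB
+ALLOW_MWT2    = True
queue 1
```

Attributes prefixed with "+" in a submit file are inserted into the job ClassAd, where the flocking configuration can match on them.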
-
2:25 PM
→
2:35 PM
AOB 10m