US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 993 2967 7148
Meeting password: 452400
Invite link: https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09
-
-
13:00
→
13:05
WBS 2.3 Facility Management News 5m | Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))
A few top of the meeting items:
- We have a meeting tomorrow afternoon with the PIs and relevant managers to discuss possible Tier-2 shopping lists, with the goal of agreeing on a plan we can send to Chris and John
- Friday morning those of us working on the Trusted CI engagement will meet to discuss our feedback from homework #1 and initial responses for homework #2
- Please remember to track milestone progress (WBS 2.3 working copy at https://docs.google.com/spreadsheets/d/1Y0-KdvsRVCXYGd2t-SqCEFlppZn_PjvUUVDGp2vJjc4/edit?usp=sharing )
- BNL has new (more strict) rules for international travel
- personal days, travel justification, number of participants per conference/WS
Upcoming meetings: LHCONE/LHCOPN and HEPiX
- 13:05 → 13:10
-
13:10
→
13:30
WBS 2.3.1: Tier1 Center | Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
-
13:10
Tier-1 Infrastructure 5m | Speaker: Jason Smith
-
13:15
Compute Farm 5m | Speaker: Thomas Smith
-
13:20
Storage 5m | Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
-
13:25
Tier1 Operations and Monitoring 5m | Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
WBS 2.3.1.2 Tier-1 Infrastructure - Jason
- NTR
WBS 2.3.1.3 Tier-1 Compute - Tom
- gridgk06 upgraded to Alma 9.5, Condor 24
- gridgk07 closed to jobs, upgrade pending (this week)
- This will conclude the upgrades to the ATLAS T1 farm production CE infrastructure
- Added the BNL_ARM resource (480 slots) to production
WBS 2.3.1.4 Tier-1 Storage - Carlos
- NTR
WBS 2.3.1.5 Tier-1 Operations & Monitoring - Ivan
- NTR
-
13:10
-
13:30
→
13:40
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
- Reasonable running in the past two weeks.
- MWT2 (1.5 days) and OU (1 day) took downtime.
- CPB had a DNS incident over last weekend and was offline for about a day.
- Otherwise good running...
- Two sites still finishing EL9 updates: MSU and UTA.
- MSU is close to having their installation system working.
- UTA (SWT2_CPB) is done with all servers except the storage servers.
- We have decided to set a deadline of March 31 for submission of this year's Procurement and Operations plans.
- I will follow up on whether there are template milestones that we can adjust to the March 31 deadline.
-
13:40
→
13:50
WBS 2.3.3 Heterogeneous Integration and Operations
HIOPS
Convener: Rui Wang (Argonne National Laboratory (US))
-
13:40
-
13:45
Integration of Complex Workflows on Heterogeneous Resources 5m | Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))
Still working on a solution to make the NVIDIA libraries available to ATLAS jobs running on the NERSC GPU queue. Testing a container created by merging the NVIDIA CUDA development RockyLinux 9 container with the Docker files from the Alma 9 ADC grid containers developed and maintained by Alessandro DeSalvo.
Still need to pass the required environment variables into the container, create a work area, and add mount points for /pscratch and /cvmfs, then modify the pilot wrapper script.
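The remaining steps amount to a wrapper around the container runtime. A rough, non-authoritative sketch, assuming Apptainer is the runtime on the NERSC side; the image name, environment variable, and work-area path are hypothetical placeholders, only /pscratch and /cvmfs come from the note above:

```
#!/bin/bash
# Hypothetical wrapper sketch - not the actual NERSC/ATLAS configuration.

# Pass needed environment variables into the container (APPTAINERENV_*
# prefixed variables are exported into the container environment)
export APPTAINERENV_ATLAS_LOCAL_ROOT_BASE="${ATLAS_LOCAL_ROOT_BASE}"

# Create a work area for the job (path is a placeholder)
WORKDIR=$(mktemp -d /pscratch/sd/u/user/atlas-job.XXXXXX)

# Add mount points for /pscratch and /cvmfs, enable GPU support,
# then hand off to the (modified) pilot wrapper script
apptainer exec --nv \
    --bind /pscratch:/pscratch \
    --bind /cvmfs:/cvmfs \
    --pwd "${WORKDIR}" \
    cuda-rocky9-adc-alma9.sif ./pilot-wrapper.sh
```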
-
13:50
→
14:10
WBS 2.3.4 Analysis Facilities | Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
-
13:50
-
13:55
Analysis Facilities - SLAC 5m | Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:00
-
14:10
→
14:25
WBS 2.3.5 Continuous Operations | Convener: Ofer Rind (Brookhaven National Laboratory)
- WLCG DOMA BDT effort was restarted today - link to slides and minutes
-
14:10
ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m | Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
- ADC Operations:
- Data Carousel: Including analysis
- A few test users already included
- Still a few things to clarify (ATLASPANDA-1129)
- MC Evgen
- default maxFailure set to 3 (was 10)
- ARC CE bug found and fixed.
- XRootD and EOS TURLs need davs replaced with https
- The fix should go into ARC7 and be backported to ARC6. Deployment timescale unknown.
- This was the reason for the failing SWT2-to-SWT2 transfers
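For illustration, the kind of scheme rewrite involved can be sketched in a few lines of Python; the endpoint below is made up and this is not the actual ARC CE patch:

```python
from urllib.parse import urlsplit, urlunsplit

def fix_turl(turl: str) -> str:
    """Rewrite a davs:// TURL to use the https:// scheme.

    Sketch of the substitution needed for XRootD and EOS TURLs;
    non-davs TURLs are returned unchanged.
    """
    parts = urlsplit(turl)
    if parts.scheme == "davs":
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)

# Hypothetical endpoint, for illustration only
print(fix_turl("davs://gate01.example.edu:1094//atlas/rucio/data.root"))
```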
- IAM to K8S switch scheduled for 3/10/25
- Anything using voms-atlas-auth.app.cern.ch as token/proxy issuer will start failing
- Still many tokens and proxies being requested. All users contacted (btw, who is nathan.crawford@uci.edu?)
- A dedicated "Sites" section was started in the new/developing "ADC Documentation"
- First contribution: “How to add a remote_queue to your CE/gatekeeper”
- Feel free to contribute anything that might be of help to other site admins
- US Cloud Operations:
- Site Issues
- NET2:
- Now running all ATLAS workflows
- OU_OSCER_ATLAS
- Was still shown as in downtime in monitoring after the downtime ended. Solved (ADCMONITOR-559)
- Others
- Due to Data Carousel configuration - a tape staging problem at TRIUMF was visible as high destination failure rate on all US sites. Solved (GGUS:2430)
- Tickets
- AGLT2:
- GGUS:2431: Bad CVMFS mounts. Solved.
- BNL:
- GGUS:2428: High failure rate from gridgk04. Solved.
- MWT2:
- GGUS:2099: BGP tagging.
- NET2:
- GGUS:2404: Squid degraded due to power outage. Solved.
- GGUS:2365: Failing transfers during Jumbo frames test. Solved.
- GGUS:2097: BGP tagging.
- ATLDDMOPS-5707: NET2 tape commissioning is advancing.
- SWT2:
- GGUS:2098: BGP tagging
-
14:15
Services DevOps 5m | Speaker: Ilija Vukotic (University of Chicago (US))
- XCache
- issues with gStream monitoring are being debugged
- issues with tests on two Oxford XCache nodes
- VP
- working fine
- Varnishes
- all working fine
- writing documentation on how to deploy it
- ServiceY
- writing documentation on how to deploy its Runner
- stress testing of AF ads nodes
- stress testing of FAB
- AF
- The Assistant can now run bash commands and scripts.
-
14:20
Facility R&D 5m | Speaker: Lincoln Bryant (University of Chicago (US))
- Testing flocking from UChicago AF to MWT2 with Docker-based containers + HTCondor overlay
- Some early results during a MWT2 downtime, displacing OSG workloads with AF workloads.
- A bunch of parameters to tune, but for now we're submitting fairly non-aggressively, with each container set to 8 cores / 48 GB RAM (~VHIMEM equivalent)
- Already identified a few things to fix - Singularity, for example, seems broken
- Users simply add "ALLOW_MWT2=True" in their job ad
- Should be generalizable to run elsewhere, but currently requires privilege. Might be possible without privilege for the containers, TBD.
- Starting a document describing WireGuard implementation requirements
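The flocking opt-in described above implies a custom attribute in the user's submit description. A hypothetical sketch of such a job ad, assuming a vanilla-universe HTCondor submit file; only the ALLOW_MWT2 attribute and the 8-core / 48 GB sizing come from the note, the rest (executable name, universe) is illustrative:

```
# HTCondor submit description - hypothetical sketch
universe   = vanilla
executable = analysis.sh

# Match the per-container sizing mentioned above (~VHIMEM equivalent)
request_cpus   = 8
request_memory = 48 GB

# Opt this job into flocking from the UChicago AF to MWT2
+ALLOW_MWT2 = True

queue
```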
-
14:25
→
14:35
AOB 10m