US ATLAS Computing Facility (Possible Topical)
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 993 2967 7148
Meeting password: 452400
Invite link: https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09
-
-
13:00
→
13:05
WBS 2.3 Facility Management News 5m Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn McKee (University of Michigan (US))
OTP is due. Please check your entries and update ASAP:
- Cloud Operation and Management (please contact Ivan Glushkov or Ofer Rind with questions): https://docs.google.com/spreadsheets/d/1ybaZYrDN1TbcpOfLYnpZLmCF241TWYjp08u8B_0tlXs/edit?gid=1172799657#gid=1172799657
- Tier-1s (please contact Alexei Klimentov with questions): https://docs.google.com/spreadsheets/d/1Ffvq9sWcydDyKP2UL9prTVvFYP08OPG6C1H5W762WgI/edit?gid=13146214#gid=13146214
- Tier-2s (please contact Fred or Rafael with questions): https://docs.google.com/spreadsheets/d/1QIGMlC3S9DTU7HcrrSNfFW5xW2n_u5wFmflu7BeRlWk/edit?gid=481588459#gid=481588459
We are in the middle of a mini capacity challenge. Each site should capture notes, logs, and diagrams in its folder: https://drive.google.com/drive/folders/1E7Xiox_SniBsbHXeb5rLFkt8fvdK6q4p?usp=drive_link (see the Google Doc in this folder for info)
-
-
13:05
→
13:10
OSG-LHC 5m Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
-
13:10
→
13:30
WBS 2.3.1: Tier1 Center Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
-
13:10
Tier-1 Infrastructure 5m Speaker: Jason Smith
- 13:15
-
13:20
Storage 5m Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
-
13:25
Tier1 Operations and Monitoring 5m Speaker: Ofer Rind (Brookhaven National Laboratory)
-
13:30
→
13:40
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
- Good running over the last two weeks.
- AGLT2 & MWT2 did rolling updates of HTCondor, causing a small reduction in production.
- NET2 had some downtime to support activity at SC25.
- CPB had some minor draining ~10 days ago.
- TW-FTT recovered from another network outage.
- CPB has drained 5 of the 8 MD3640 servers that they will retire; they are fully updated to EL9.
- TW-FTT is in the process of putting all 2.5k job slots online after converting to HTCondor and Varnish.
- The site has been running much better recently: kudos to YiRu!
- Please file your operations plans before leaving for the holidays.
- I need to look at the Tier-2 OTP entries.
- I am meeting with Andrey, Mayuko, and Kaushik later today about using a script that Andrey wrote to dump a list of LOCALGROUPDISK files that have not been accessed in a "long" time.
-
13:40
→
13:50
WBS 2.3.3 Heterogeneous Integration and Operations
HIOPS
Convener: Rui Wang (Argonne National Laboratory (US))
13:40
HPC Operations 5m Speaker: Rui Wang (Argonne National Laboratory (US))
Perlmutter: running stably, at a slightly lower rate than before the pause.
ACCESS: the Stampede3 account needs Gordon (PI) to set up a TACC account.
-----
Fixed the expired X509 credential on Monday. The NERSC CPU queue is running at full steam (100 nodes per SLURM job), with 5 SLURM jobs in the queue at a time.
Restarted the NERSC GPU queue. Now debugging why HC jobs are failing.
BNLHPC_DATADISK and BNLHPC_SCRATCHDISK RSEs decommissioned and drained.
-
13:45
Integration of Complex Workflows on Heterogeneous Resources 5m Speaker: Doug Benjamin (Brookhaven National Laboratory (US))
-
13:50
→
14:10
WBS 2.3.4 Analysis Facilities Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
-
13:50
Analysis Facilities - BNL 5m Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
-
13:55
Analysis Facilities - SLAC 5m Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:00
Analysis Facilities - Chicago 5m Speaker: Fengping Hu (University of Chicago (US))
Maintenance
- Scheduled for December 10.
- Planned activities include routine updates to firmware, operating system, Kubernetes, Rook/Ceph, NVIDIA drivers, and other core components.
IaaS / Inference-as-a-Service
- Continuing work with Xiangyang Ju (LBL) on testing Inference-as-a-Service for DAOD production.
- Evaluation areas include memory footprint, Triton server capacity, and data throughput.
- A functional deployment is now running at UC AF, and testing is currently in progress.
-
14:10
→
14:30
WBS 2.3.5 Continuous Operations Conveners: Ivan Glushkov (Brookhaven National Laboratory (US)), Ofer Rind (Brookhaven National Laboratory)
-
14:10
ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m Speaker: Kaushik De (University of Texas at Arlington (US))
- Hiro is conducting the capacity challenge
- Progress in deploying IPv6 at OU
- Dev Pilot is working again at MWT2 with cgroups OOM management in HTCondor 25
- Performance issues with the XRootd 5.9 proxy server deployment on EL9 at SWT2
- Updates from DDM at ADC weekly - beginning to test FTS4, also news about tape archive metadata and monitoring
- CERN network outage on Sunday broke Rucio service and caused a mass exclusion event (link)
- Leslie Groer (Waterloo T2 manager) joined our cloud daily ops meeting today
- Heterogeneous Architectures meeting began today
-
14:15
Services DevOps 5m Speaker: Ilija Vukotic (University of Chicago (US))
- XCache
- upgraded to 5.9.0.
- will do 5.9.1 once it is in OSG, will test at UC AF
- some issues with BHAM
- AF
- a lot of throughput testing for the integration challenge
- Varnish
- BNL in operation now
- One brief issue over the weekend when the CERN load balancers went down and the new Frontiers were unavailable.
- Next week will give a tutorial on Varnish monitoring
- CREST
- completely reworked Dev documentation and redeployed it.
- fixed TLS on production clusters.
- AI
- NTR
-
14:20
Facility R&D 5m Speaker: Robert William Gardner Jr (University of Chicago (US))
-
14:25
Cybersecurity plan(s) 5m Speakers: Robert William Gardner Jr (University of Chicago (US)), Shigeki Misawa (Brookhaven National Laboratory (US))
-
14:10
-
14:30
→
14:40
AOB 10m