US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 993 2967 7148
Meeting password: 452400
Invite link: https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09
1:00 PM → 1:05 PM
WBS 2.3 Facility Management News 5m
Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))
Recent meetings
- Last week was LHCOPN/LHCONE meeting in Manchester, UK: https://indico.cern.ch/event/1479019/
- WLCG DOMA was today https://indico.cern.ch/event/1520247/
- Next week is HEPiX in Lugano, Switzerland: https://indico.cern.ch/event/1477299/
We are working on a 5-year estimator for our facilities, with the goal of understanding the resources we need to deliver US targets up to the start of HL-LHC.
Please consider attending HTC25 in Madison, Wisconsin, June 2-6. On June 4th we intend to have joint USATLAS-USCMS meetings: https://agenda.hep.wisc.edu/event/2297/
- 1:05 PM → 1:10 PM
1:10 PM → 1:30 PM
WBS 2.3.1: Tier1 Center
Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
- 1:10 PM: Tier-1 Infrastructure 5m. Speaker: Jason Smith
- 1:15 PM: Compute Farm 5m. Speaker: Thomas Smith
- 1:20 PM: Storage 5m. Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
- 1:25 PM: Tier1 Operations and Monitoring 5m. Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
WBS 2.3.1.2 Tier-1 Infrastructure - Jason
- 75 servers (3 racks) arriving at BNL this week. Expect them to be available to the Tier-1 in ~2 weeks
- RBT submitted to meet the WLCG request for 5 PB additional tape
WBS 2.3.1.3 Tier-1 Compute - Tom
- Gridgk04 and Gridgk06 rebooted unexpectedly over the weekend (cause under investigation)
- This caused a temporary dip in running jobs; service was restored Monday
- Security fix has been pushed out across the ATLAS T1 pool per HTCondor dev recommendation
- SEC_TOKEN_REQUEST_LIMITS = DENY
- SEC_ISSUED_TOKEN_EXPIRATION = 0
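One way to sanity-check that the two settings above are present on a node is to parse the configuration text and compare it with the expected values. The following is a minimal sketch, not an HTCondor tool: the helper names `parse_knobs` and `missing_or_wrong` are hypothetical, knob names are treated as case-insensitive, and the parser only understands plain `NAME = value` lines (no macros, includes, or conditionals).

```python
# Expected values, copied from the notes above.
EXPECTED = {
    "SEC_TOKEN_REQUEST_LIMITS": "DENY",
    "SEC_ISSUED_TOKEN_EXPIRATION": "0",
}


def parse_knobs(text):
    """Parse plain 'NAME = value' lines into a dict (hypothetical helper).

    Names are upper-cased so the comparison is case-insensitive;
    a later assignment overrides an earlier one.
    """
    knobs = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        knobs[name.strip().upper()] = value.strip()
    return knobs


def missing_or_wrong(text, expected=EXPECTED):
    """Return {name: found_value_or_None} for every knob that does not match."""
    knobs = parse_knobs(text)
    return {
        name: knobs.get(name)
        for name, want in expected.items()
        if knobs.get(name) != want
    }


if __name__ == "__main__":
    # Hypothetical local config snippet used only for illustration.
    sample = """
    SEC_TOKEN_REQUEST_LIMITS = DENY
    SEC_ISSUED_TOKEN_EXPIRATION = 0
    """
    print(missing_or_wrong(sample))
```

An empty result means both knobs match. On a live node the input text could come from the output of `condor_config_val -dump`, assuming it prints one `NAME = value` pair per line.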
WBS 2.3.1.4 Tier-1 Storage - Carlos
- ATLAS reprocessing started Monday 17
- 310K files restored so far.
- Target is to use BNL-OSG2_MCTAPE size: 5414.3TB datasets: 2073 files: 67548
- No major issues observed at dCache or HPSS
- Integration/test instance migrated to OpenShift
WBS 2.3.1.4 Tier-1 Operations & Monitoring - Ivan
- NTR
1:30 PM → 1:40 PM
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
- Pretty good running over the last couple of weeks.
- MWT2 back to full production after doing a rolling update.
- NET2 still working on repairs for the high core count servers.
- EL9
- MSU past all install system issues but still working to get installation parameters that work.
- UTA working on installing new storage servers so it can update its storage to EL9.
- The rest of CPB is at EL9.
- Operations and Procurement plans
- Sent out templates yesterday.
- We will need to define milestones to match the contents of the plans.
1:40 PM → 1:50 PM
WBS 2.3.3 Heterogeneous Integration and Operations
HIOPS
Convener: Rui Wang (Argonne National Laboratory (US))
- 1:40 PM: HPC Operations 5m. Speaker: Rui Wang (Argonne National Laboratory (US))
- 1:45 PM: Integration of Complex Workflows on Heterogeneous Resources 5m. Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))
Perlmutter in downtime today
Over the weekend we ran out of inodes (again!!!)
- asked to increase inode quota to 50M for 6 months
- reduced the number of running SCORE slurm jobs from 5 to 2 (i.e. workers in Harvester)
- reduced the number of nodes running SCORE slurm jobs from 20 to 10
- Net reduction of a factor of 5 in the number of SCORE jobs running on NERSC; MadGraph jobs caused havoc...
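The factor of 5 follows from multiplying the two reductions. A quick check, assuming (this is an interpretation, not stated in the notes) that the load scales with the number of Slurm jobs times the nodes per job; `reduction_factor` is a hypothetical helper name:

```python
from fractions import Fraction


def reduction_factor(jobs_before, jobs_after, nodes_before, nodes_after):
    """Ratio of (jobs x nodes) before the change to (jobs x nodes) after it."""
    before = jobs_before * nodes_before
    after = jobs_after * nodes_after
    return Fraction(before, after)


# Numbers from the notes: Slurm jobs 5 -> 2, nodes 20 -> 10.
factor = reduction_factor(5, 2, 20, 10)
print(factor)
```

With the numbers above this is (5 x 20) / (2 x 10) = 100 / 20, i.e. 2.5 from the job count times 2 from the node count.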
Success in HEP-CCE Globus Compute: the first PanDA validation jobs successfully started with the test Harvester and the Globus Compute submitter.
- Working on a monitor for Globus Compute; need to work with the PanDA team to come up with a working solution for a Globus Compute sweeper.
1:50 PM → 2:10 PM
WBS 2.3.4 Analysis Facilities
Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
- 1:50 PM: Analysis Facilities - BNL 5m. Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
- 1:55 PM: Analysis Facilities - SLAC 5m. Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
- 2:00 PM: Analysis Facilities - Chicago 5m. Speaker: Fengping Hu (University of Chicago (US))
2:10 PM → 2:25 PM
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
- 2:10 PM: ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m. Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
- 2:15 PM: Services DevOps 5m. Speaker: Ilija Vukotic (University of Chicago (US))
XCaches
- all moved to FluxCD or direct docker deployment
- Wuppertal needs to fix gStream monitoring
- Meeting next Monday at 8:30 CDT on how DE will use XCaches in the HTC-only era.
Varnishes
- All working fine
- Need Rod to change port
- Agreed to get PIC, IN2P3-CC and Roma to set up instances next
ServiceX/Y
- We had a meetup in UofW.
- A lot of new functionalities discussed: RDFrame support, Joins, ARM support, ServiceX-Local, new version of local cache, ...
- ServiceY will be continued as a demonstrator; its functionalities will be picked and reimplemented in ServiceX on the ServiceX timeline.
CREST
- Had one more HLT test
- Need to update the CERN OpenStack k8s cluster due to node retirement.
Analytics
- brand new logstash configs and templates for WLCG_WPAD and cms-frontier data
- 2:20 PM: Facility R&D 5m. Speaker: Lincoln Bryant (University of Chicago (US))
2:25 PM → 2:35 PM
AOB 10m