US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 993 2967 7148
Meeting password: 452400
Invite link: https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09
13:00 → 13:05
WBS 2.3 Facility Management News (5m)
Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))
Recent meetings
- Last week was LHCOPN/LHCONE meeting in Manchester, UK: https://indico.cern.ch/event/1479019/
- WLCG DOMA met today: https://indico.cern.ch/event/1520247/
- Next week is HEPiX in Lugano, Switzerland: https://indico.cern.ch/event/1477299/
We are working on a 5-year estimator for our facilities, with the goal of understanding the resources needed to deliver on US targets through the start of HL-LHC.
Please consider attending HTC25 in Madison, Wisconsin, June 2-6. On June 4th we intend to hold joint US ATLAS / US CMS meetings: https://agenda.hep.wisc.edu/event/2297/
13:05 → 13:10
13:10 → 13:30
WBS 2.3.1: Tier1 Center
Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
13:10 Tier-1 Infrastructure (5m)
Speaker: Jason Smith
13:15 Compute Farm (5m)
Speaker: Thomas Smith
13:20 Storage (5m)
Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
13:25 Tier1 Operations and Monitoring (5m)
Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
WBS 2.3.1.2 Tier-1 Infrastructure - Jason
- 75 servers (3 racks) arriving at BNL this week. Expect it to be available to Tier-1 in ~2 weeks
- RBT submitted to meet the WLCG request for 5 PB additional tape
WBS 2.3.1.3 Tier-1 Compute - Tom
- Gridgk04 and Gridgk06 rebooted unexpectedly over the weekend (cause under investigation)
- This caused a temporary dip in running jobs; service was restored Monday
- A security fix has been pushed out across the ATLAS T1 pool per HTCondor developer recommendation:
- SEC_TOKEN_REQUEST_LIMITS = DENY
- SEC_ISSUED_TOKEN_EXPIRATION = 0
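As a reference sketch, the two knobs above would typically be deployed as a small condor_config fragment like the one below; the file path is a placeholder, and sites should follow the actual HTCondor advisory text for their version.

```
# /etc/condor/config.d/99-token-security.conf  (placeholder path; any config.d file works)
# Settings recommended by the HTCondor developers:

# Deny requests made via the token-request workflow
SEC_TOKEN_REQUEST_LIMITS = DENY

# Cap the lifetime (seconds) of newly issued tokens
SEC_ISSUED_TOKEN_EXPIRATION = 0
```

After dropping the file in place, `condor_reconfig` picks it up, and `condor_config_val SEC_TOKEN_REQUEST_LIMITS` confirms the effective value.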
WBS 2.3.1.4 Tier-1 Storage - Carlos
- ATLAS reprocessing started Monday the 17th
- 310K+ files restored so far
- Target is to use BNL-OSG2_MCTAPE (size: 5414.3 TB, datasets: 2073, files: 67548)
- No major issues observed at dCache or HPSS
- Integration/test instance migrated to OpenShift
WBS 2.3.1.5 Tier-1 Operations & Monitoring - Ivan
- NTR
13:30 → 13:40
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
- Pretty good running over the last couple of weeks.
- MWT2 back to full production after doing a rolling update.
- NET2 still working on repairs for the high core count servers.
- EL9
- MSU is past all system-install issues but is still working to find installation parameters that work.
- UTA working on installing new storage servers so it can update its storage to EL9.
- The rest of CPB is at EL9.
- Operations and Procurement plans
- Sent out templates yesterday.
- We will need to define milestones to match the contents of the plans.
13:40 → 13:50
WBS 2.3.3 Heterogeneous Integration and Operations
HIOPS
Convener: Rui Wang (Argonne National Laboratory (US))
13:40 HPC Operations (5m)
Speaker: Rui Wang (Argonne National Laboratory (US))
13:45 Integration of Complex Workflows on Heterogeneous Resources (5m)
Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))
Perlmutter is in downtime today
Over the weekend we ran out of inodes (again!)
- Asked to increase the inode quota to 50M for 6 months
- Reduced the number of running SCORE Slurm jobs from 5 to 2 (i.e., workers in Harvester)
- Reduced the number of nodes running SCORE Slurm jobs from 20 to 10
- Net reduction of a factor of 5 in the number of SCORE jobs running on NERSC; MadGraph jobs caused havoc...
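Separate from the quota request itself, inode consumption on any POSIX filesystem (the same numbers `df -i` reports) can be watched with a few lines of Python via `os.statvfs`; the path below is a placeholder, not a NERSC-specific mount.

```python
import os

def inode_usage(path="."):
    """Return (used, total, fraction_used) inode counts for the filesystem at `path`."""
    st = os.statvfs(path)
    total = st.f_files   # total inodes on the filesystem (0 on some filesystems)
    free = st.f_ffree    # free inodes
    used = total - free
    frac = used / total if total else 0.0
    return used, total, frac

if __name__ == "__main__":
    used, total, frac = inode_usage(".")
    print(f"{used}/{total} inodes used ({frac:.1%})")
```

A cron job comparing `frac` against a threshold (say 0.9) could flag a filesystem before jobs start failing with "no space left on device" despite free blocks.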
Success with HEP-CCE Globus Compute: the first PanDA validation jobs successfully started with the test Harvester and the Globus Compute submitter.
- Working on monitoring for Globus Compute; need to work with the PanDA team to come up with a working solution for a Globus Compute sweeper.
13:50 → 14:10
WBS 2.3.4 Analysis Facilities
Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
13:50 Analysis Facilities - BNL (5m)
Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
13:55 Analysis Facilities - SLAC (5m)
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
14:00 Analysis Facilities - Chicago (5m)
Speaker: Fengping Hu (University of Chicago (US))
14:10 → 14:25
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
14:10 ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News (5m)
Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
14:15 Services DevOps (5m)
Speaker: Ilija Vukotic (University of Chicago (US))
XCaches
- All moved to FluxCD or direct Docker deployment
- Wuppertal needs to fix gStream monitoring
- Next Monday, 8:30 CDT: meeting on how DE will use XCaches in the HTC-only era
Varnishes
- All working fine
- Need Rod to change the port
- Agreed to get PIC, IN2P3-CC, and Roma to set up instances next
ServiceX/Y
- We had a meetup at UofW.
- Many new functionalities were discussed: RDataFrame support, joins, ARM support, ServiceX-Local, a new version of the local cache, ...
- ServiceY will continue as a demonstrator; its functionalities will be picked up and reimplemented in ServiceX on their own timeline.
CREST
- Had one more HLT test
- Need to update the CERN OpenStack k8s cluster due to node retirements.
Analytics
- Brand-new Logstash configs and templates for WLCG_WPAD and cms-frontier data
14:20 Facility R&D (5m)
Speaker: Lincoln Bryant (University of Chicago (US))
14:25 → 14:35
AOB (10m)