US ATLAS Computing Facility

Name: US ATLAS Computing Facility
Start: 2025-03-26T13:00:00-04:00
End: 2025-03-26T15:25:00-04:00
Location: No location set

Wednesday 26 Mar 2025, 13:00 → 15:25 US/Eastern

Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID: 993 2967 7148

Meeting password: 452400

Invite link: https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09

- 13:00 → 13:05
  WBS 2.3 Facility Management News 5m
  
  Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))
  Recent meetings
  
  Last week was the LHCOPN/LHCONE meeting in Manchester, UK: https://indico.cern.ch/event/1479019/
  
  WLCG DOMA was today https://indico.cern.ch/event/1520247/
  
  Next week is HEPiX in Lugano, Switzerland: https://indico.cern.ch/event/1477299/
  
  We are working on a 5-year estimator for our facilities, with the goal of understanding the resources needed to deliver US targets up to the start of HL-LHC.
  
  Please consider attending HTC25 in Madison, Wisconsin, June 2-6. On June 4th we intend to have joint USATLAS-USCMS meetings: https://agenda.hep.wisc.edu/event/2297/
- 13:05 → 13:10
  OSG-LHC 5m
  
  Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
  Release this week: XRootD 5.7.3-1.5 (gstream fixes, adding support for purge plugins)
  
  Kuantifier: verified access to test NET2 cluster, need Eduardo to set up unprivileged Prometheus
- 13:10 → 13:30
  WBS 2.3.1: Tier1 Center
  
  Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
  - 13:10
    
    Tier-1 Infrastructure 5m
    
    Speaker: Jason Smith
  - 13:15
    
    Compute Farm 5m
    
    Speaker: Thomas Smith
  - 13:20
    
    Storage 5m
    
    Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
  - 13:25
    Tier1 Operations and Monitoring 5m
    
    Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
    
    WBS 2.3.1.2 Tier-1 Infrastructure - Jason
    
    75 servers (3 racks) are arriving at BNL this week. Expect them to be available to the Tier-1 in ~2 weeks.
    
    RBT submitted to meet the WLCG request for 5 PB additional tape
    
    WBS 2.3.1.3 Tier-1 Compute - Tom
    
    Gridgk04 and Gridgk06 rebooted unexpectedly over the weekend (cause under investigation).
    
    This caused a temporary dip in running jobs; service was restored Monday.
    
    A security fix has been pushed out across the ATLAS T1 pool, per the HTCondor developers' recommendation:
    
    SEC_TOKEN_REQUEST_LIMITS = DENY
    
    SEC_ISSUED_TOKEN_EXPIRATION = 0
    
    WBS 2.3.1.4 Tier-1 Storage - Carlos
    
    ATLAS reprocessing started Monday, March 17
    
    + 310K files restored so far.
    
    Target is to use BNL-OSG2_MCTAPE (size: 5414.3 TB, datasets: 2073, files: 67548)
    
    No major issues observed at dCache or HPSS
    
    Integration/test instance migrated to OpenShift
    
    WBS 2.3.1.4 Tier-1 Operations & Monitoring - Ivan
    
    NTR
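
The two HTCondor settings listed under Tier-1 Compute above could be spot-checked on a node with a few lines of Python. This is only a sketch: it parses plain `KEY = VALUE` text (for example, a saved configuration dump) and does not handle HTCondor macros, includes, or conditionals. The helper names `parse_settings` and `missing_mitigations` are hypothetical and not part of HTCondor.

```python
# Expected values, copied from the Tier-1 Compute notes above.
EXPECTED = {
    "SEC_TOKEN_REQUEST_LIMITS": "DENY",
    "SEC_ISSUED_TOKEN_EXPIRATION": "0",
}


def parse_settings(text):
    """Parse simple 'KEY = VALUE' lines; later lines override earlier ones."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Knob names are treated as case-insensitive, so normalise to upper case.
        settings[key.strip().upper()] = value.strip()
    return settings


def missing_mitigations(settings):
    """Return {knob: (found, expected)} for every knob that does not match."""
    problems = {}
    for knob, expected in EXPECTED.items():
        found = settings.get(knob)
        if found is None or found.upper() != expected:
            problems[knob] = (found, expected)
    return problems


sample = """
SEC_TOKEN_REQUEST_LIMITS = DENY
SEC_ISSUED_TOKEN_EXPIRATION = 0
"""
print(missing_mitigations(parse_settings(sample)))  # {}
```

An empty result means both settings match the values reported in the minutes; any entry in the result names a knob that is unset or has a different value.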
- 13:30 → 13:40
  WBS 2.3.2 Tier2 Centers
  
  Updates on US Tier-2 centers
  
  Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
  Pretty good running over the last couple of weeks.
  
  MWT2 back to full production after doing a rolling update.
  
  NET2 still working on repairs for the high core count servers.
  
  EL9
  
  MSU is past all install-system issues but is still working to find installation parameters that work.
  
  UTA working on installing new storage servers so it can update its storage to EL9.
  
  The rest of CPB is at EL9.
  
  Operations and Procurement plans
  
  Sent out templates yesterday.
  
  We will need to define milestones to match the contents of the plans.
- 13:40 → 13:50
  WBS 2.3.3 Heterogeneous Integration and Operations
  
  HIOPS
  
  Convener: Rui Wang (Argonne National Laboratory (US))
  - 13:40
    HPC Operations 5m
    
    Speaker: Rui Wang (Argonne National Laboratory (US))
    
    TACC: a sequence of reboots of compute nodes and login nodes this week
    
    Perlmutter: following up with inode quota usage
    
    Doug requested that the inode quota be increased from 20M to 50M; SCORE is reduced to 2 workers with 10 nodes each per submission (a factor of 5 reduction)
  - 13:45
    Integration of Complex Workflows on Heterogeneous Resources 5m
    
    Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))
    
    Perlmutter in downtime today
    
    Over weekend ran out of inodes (again!!!)
    
    asked to increase inode quota to 50M for 6 months
    
    reduced the number of running SCORE Slurm jobs from 5 to 2 (i.e. workers in Harvester)
    
    reduced the number of nodes per SCORE Slurm job from 20 to 10
    
    Net reduction of a factor of 5 in the SCORE footprint at NERSC; MadGraph jobs caused havoc.
    
    Success in HEP-CCE Globus Compute: the first PanDA validation jobs started successfully with the test Harvester and the Globus Compute submitter.
    
    Working on a monitor for Globus Compute; need to work with the PanDA team to come up with a working solution for a Globus Compute sweeper.
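
The factor of 5 quoted for the SCORE reduction follows from the two changes reported above taken together (workers from 5 to 2, nodes per job from 20 to 10). A quick check, assuming the total footprint scales as workers times nodes per worker:

```python
from fractions import Fraction

# Values taken from the minutes above.
workers_before, workers_after = 5, 2
nodes_before, nodes_after = 20, 10

# Assumption: total node footprint = concurrent workers x nodes per worker.
total_before = workers_before * nodes_before  # 100 nodes
total_after = workers_after * nodes_after     # 20 nodes

reduction = Fraction(total_before, total_after)
print(reduction)  # 5
```

The worker change alone contributes a factor of 2.5 and the node change a factor of 2, which multiply to the quoted 5.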
- 13:50 → 14:10
  WBS 2.3.4 Analysis Facilities
  
  Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
  - 13:50
    
    Analysis Facilities - BNL 5m
    
    Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
  - 13:55
    
    Analysis Facilities - SLAC 5m
    
    Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
  - 14:00
    Analysis Facilities - Chicago 5m
    
    Speaker: Fengping Hu (University of Chicago (US))
    
    HTCondor update.
    
    Interruption will be brief. Login nodes and the scheduler are already updated. Will rebuild the worker image and deploy tomorrow.
    
    The update addresses a security issue and will also fix a bug affecting coffea-casa (job svc classad).
- 14:10 → 14:25
  WBS 2.3.5 Continuous Operations
  
  Convener: Ofer Rind (Brookhaven National Laboratory)
  - 14:10
    ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
    
    Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
    
    CRIC migrated to python3
    
    Led to HammerCloud not being able to whitelist sites
    
    Tasks were sent to ARM queues with a release for which merge is not available. All tasks were fixed.
  - 14:15
    Services DevOps 5m
    
    Speaker: Ilija Vukotic (University of Chicago (US))
    
    XCaches
    
    All moved to FluxCD or direct Docker deployment
    
    Wuppertal needs to fix gStream monitoring
    
    Meeting next Monday at 8:30 CDT on how DE will use XCaches in the HTC-only era.
    
    Varnishes
    
    All working fine
    
    Need Rod to change port
    
    Agreed to get PIC, IN2P3-CC and Roma to set up instances next
    
    ServiceX/Y
    
    We had a meetup in UofW.
    
    A lot of new functionality discussed: RDFrame support, Joins, ARM support, ServiceX-Local, new version of local cache, ...
    
    ServiceY will be continued as a demonstrator; its functionality will be picked up and reimplemented in ServiceX on their timeline.
    
    CREST
    
    Had one more HLT test
    
    Need to update the CERN OpenStack k8s cluster due to node retirement.
    
    Analytics
    
    Brand-new Logstash configs and templates for WLCG_WPAD and cms-frontier data
  - 14:20
    
    Facility R&D 5m
    
    Speaker: Lincoln Bryant (University of Chicago (US))
- 14:25 → 14:35
  
  AOB 10m
