US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 993 2967 7148
Meeting password: 452400
Invite link: https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09
-
-
13:00
→
13:05
WBS 2.3 Facility Management News 5m | Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))
A few top of the meeting items:
- We have a meeting tomorrow afternoon with the PIs and relevant managers to discuss possible Tier-2 shopping lists, with the goal of agreeing on a plan we can send to Chris and John
- Friday morning those of us working on the Trusted CI engagement will meet to discuss our feedback from homework #1 and initial responses for homework #2
- Please remember to track milestone progress (WBS 2.3 working copy at https://docs.google.com/spreadsheets/d/1Y0-KdvsRVCXYGd2t-SqCEFlppZn_PjvUUVDGp2vJjc4/edit?usp=sharing )
- BNL has new (more strict) rules for international travel
- personal days, travel justification, number of participants per conference/WS
Upcoming meetings: LHCONE/LHCOPN and HEPiX
- 13:05 → 13:10
-
13:10
→
13:30
WBS 2.3.1: Tier1 Center | Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
-
13:10
Tier-1 Infrastructure 5m | Speaker: Jason Smith
-
13:15
Compute Farm 5m | Speaker: Thomas Smith
-
13:20
Storage 5m | Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
-
13:25
Tier1 Operations and Monitoring 5m | Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
WBS 2.3.1.2 Tier-1 Infrastructure - Jason
- NTR
WBS 2.3.1.3 Tier-1 Compute - Tom
- gridgk06 upgraded to Alma 9.5, Condor 24
- gridgk07 closed to jobs, upgrade pending (this week)
- This will conclude the upgrades to the ATLAS T1 farm production CE infrastructure
- Added the BNL_ARM resource (480 slots) to production
WBS 2.3.1.4 Tier-1 Storage - Carlos
- NTR
WBS 2.3.1.5 Tier-1 Operations & Monitoring - Ivan
- NTR
-
13:10
-
13:30
→
13:40
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
- Reasonable running in the past two weeks.
- MWT2 (1.5 days) and OU (1 day) took downtime.
- CPB had a DNS incident over last weekend and was offline for about a day.
- Otherwise good running...
- Two sites still finishing EL9 updates: MSU and UTA.
- MSU is close to having their installation system working.
- UTA (SWT2_CPB) is done with all servers except the storage servers.
- We have decided to set a deadline of March 31 for submission of this year's Procurement and Operations plans.
- I will follow up on whether there are template milestones that we can adjust to the March 31 deadline.
-
13:40
→
13:50
WBS 2.3.3 Heterogeneous Integration and Operations
HIOPS
Convener: Rui Wang (Argonne National Laboratory (US))
-
13:40
-
13:45
Integration of Complex Workflows on Heterogeneous Resources 5m | Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))
Still working on a solution to make the NVIDIA libraries available to ATLAS jobs running on the NERSC GPU queue. Testing a container created by merging the NVIDIA CUDA development RockyLinux 9 container with the Docker files from the Alma 9 ADC grid containers developed and maintained by Alessandro DeSalvo.
Still need to pass the required environment variables into the container, create a work area, and add mount points for /pscratch and /cvmfs, then modify the pilot wrapper script.
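The remaining steps amount to a wrapper around the container runtime. A rough, non-authoritative sketch, assuming Apptainer is the runtime on the NERSC side; the image name, environment variable, and work-area path are hypothetical placeholders, only /pscratch and /cvmfs come from the note above:

```
#!/bin/bash
# Hypothetical wrapper sketch - not the actual NERSC/ATLAS configuration.

# Pass needed environment variables into the container (APPTAINERENV_*
# prefixed variables are exported into the container environment)
export APPTAINERENV_ATLAS_LOCAL_ROOT_BASE="${ATLAS_LOCAL_ROOT_BASE}"

# Create a work area for the job (path is a placeholder)
WORKDIR=$(mktemp -d /pscratch/sd/u/user/atlas-job.XXXXXX)

# Add mount points for /pscratch and /cvmfs, enable GPU support,
# then hand off to the (modified) pilot wrapper script
apptainer exec --nv \
    --bind /pscratch:/pscratch \
    --bind /cvmfs:/cvmfs \
    --pwd "${WORKDIR}" \
    cuda-rocky9-adc-alma9.sif ./pilot-wrapper.sh
```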
-
13:50
→
14:10
WBS 2.3.4 Analysis Facilities | Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
-
13:50
-
13:55
Analysis Facilities - SLAC 5m | Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:00
-
14:10
→
14:25
WBS 2.3.5 Continuous Operations | Convener: Ofer Rind (Brookhaven National Laboratory)
- WLCG DOMA BDT effort was restarted today - link to slides and minutes
-
14:10
ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m | Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
- ADC Operations:
- Data Carousel: Including analysis
- A few test users already included
- Still a few things to clarify (ATLASPANDA-1129)
- MC Evgen
- default maxFailure set to 3 (was 10)
- ARC CE bug found and fixed.
- XRootD and EOS TURLs need davs replaced with https
- The fix should go into ARC7 and be backported to ARC6. Deployment timescale unknown.
- This was the reason for the failing SWT2-to-SWT2 transfers
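For illustration, the kind of scheme rewrite involved can be sketched in a few lines of Python; the endpoint below is made up and this is not the actual ARC CE patch:

```python
from urllib.parse import urlsplit, urlunsplit

def fix_turl(turl: str) -> str:
    """Rewrite a davs:// TURL to use the https:// scheme.

    Sketch of the substitution needed for XRootD and EOS TURLs;
    non-davs TURLs are returned unchanged.
    """
    parts = urlsplit(turl)
    if parts.scheme == "davs":
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)

# Hypothetical endpoint, for illustration only
print(fix_turl("davs://gate01.example.edu:1094//atlas/rucio/data.root"))
```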
- IAM to K8S switch scheduled for 3/10/25
- Anything using voms-atlas-auth.app.cern.ch as token/proxy issuer will start failing
- Still many tokens and proxies being requested. All users contacted (btw, who is nathan.crawford@uci.edu?)
- A dedicated "Sites" section was started in the new/developing "ADC Documentation"
- First contribution: “How to add a remote_queue to your CE/gatekeeper”
- Feel free to contribute anything that might be of help to other site admins
- US Cloud Operations:
- Site Issues
- NET2:
- Now running all ATLAS workflows
- OU_OSCER_ATLAS
- Was still shown as in downtime in monitoring after the downtime ended. Solved (ADCMONITOR-559)
- Others
- Due to Data Carousel configuration - a tape staging problem at TRIUMF was visible as high destination failure rate on all US sites. Solved (GGUS:2430)
- Tickets
- AGLT2:
- GGUS:2431: Bad CVMFS mounts. Solved.
- BNL:
- GGUS:2428: High failure rate from gridgk04. Solved.
- MWT2:
- GGUS:2099: BGP tagging.
- NET2:
- GGUS:2404: Squid degraded due to power outage. Solved.
- GGUS:2365: Failing transfers during Jumbo frames test. Solved.
- GGUS:2097: BGP tagging.
- ATLDDMOPS-5707: NET2 tape commissioning is advancing.
- SWT2:
- GGUS:2098: BGP tagging
-
14:15
Services DevOps 5m | Speaker: Ilija Vukotic (University of Chicago (US))
- XCache
- issues with gStream monitoring are being debugged
- issues with tests on two Oxford XCache nodes
- VP
- working fine
- Varnishes
- all working fine
- writing documentation on how to deploy it
- ServiceY
- writing documentation on how to deploy its Runner
- stress testing of AF ads nodes
- stress testing of FAB
- AF
- The Assistant can now run bash commands and scripts.
-
14:20
Facility R&D 5m | Speaker: Lincoln Bryant (University of Chicago (US))
- Testing flocking from UChicago AF to MWT2 with Docker-based containers + HTCondor overlay
- Some early results during a MWT2 downtime, displacing OSG workloads with AF workloads.
- A bunch of parameters to tune, but for now we're submitting fairly non-aggressively, with each container set to 8 cores / 48 GB RAM (~VHIMEM equivalent)
- Already identified a few things to fix - Singularity, for example, seems broken
- Users simply add "ALLOW_MWT2=True" in their job ad
- Should be generalizable to run elsewhere, but currently requires privilege. Might be possible without privilege for the containers, TBD.
- Starting a document describing WireGuard implementation requirements
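The flocking opt-in described above implies a custom attribute in the user's submit description. A hypothetical sketch of such a job ad, assuming a vanilla-universe HTCondor submit file; only the ALLOW_MWT2 attribute and the 8-core / 48 GB sizing come from the note, the rest (executable name, universe) is illustrative:

```
# HTCondor submit description - hypothetical sketch
universe   = vanilla
executable = analysis.sh

# Match the per-container sizing mentioned above (~VHIMEM equivalent)
request_cpus   = 8
request_memory = 48 GB

# Opt this job into flocking from the UChicago AF to MWT2
+ALLOW_MWT2 = True

queue
```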
-
14:25
→
14:35
AOB 10m