US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  993 2967 7148

Meeting password: 452400

Invite link:  https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09

 

 

    • 1:00 PM 1:05 PM
      WBS 2.3 Facility Management News 5m
      Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))

      A few top of the meeting items:

       

      • We have a meeting tomorrow afternoon to discuss possible Tier-2 shopping lists with the PIs and relevant managers, with the goal of agreeing on a plan we can send to Chris and John
      • Friday morning, those of us working on the Trusted CI engagement will meet to discuss our feedback on homework #1 and our initial responses for homework #2
      • Please remember to track milestone progress (WBS 2.3 working copy at https://docs.google.com/spreadsheets/d/1Y0-KdvsRVCXYGd2t-SqCEFlppZn_PjvUUVDGp2vJjc4/edit?usp=sharing )
      • BNL has new (stricter) rules for international travel
        • personal days, travel justification, number of participants per conference/workshop

       

      Upcoming meetings: LHCONE/LHCOPN and HEPiX

    • 1:05 PM 1:10 PM
      OSG-LHC 5m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci

      Release (this week)

      • XRootD 5.7.3
      • cvmfs 2.12.6
      • IGTF 1.133
    • 1:10 PM 1:30 PM
      WBS 2.3.1: Tier1 Center
      Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
      • 1:10 PM
        Tier-1 Infrastructure 5m
        Speaker: Jason Smith
      • 1:15 PM
        Compute Farm 5m
        Speaker: Thomas Smith
      • 1:20 PM
        Storage 5m
        Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
      • 1:25 PM
        Tier1 Operations and Monitoring 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))

        WBS 2.3.1.2 Tier-1 Infrastructure - Jason

        • NTR

        WBS 2.3.1.3 Tier-1 Compute - Tom

        • gridgk06 upgraded to Alma 9.5 with Condor 24
        • gridgk07 closed to jobs, upgrade pending (this week)
        • This will conclude the upgrade of the ATLAS T1 farm production CE infrastructure
        • Added the BNL_ARM resource (480 slots) to production

         

        WBS 2.3.1.4 Tier-1 Storage - Carlos

        • NTR

        WBS 2.3.1.5 Tier-1 Operations & Monitoring - Ivan

        • NTR
    • 1:30 PM 1:40 PM
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
      • Reasonable running in the past two weeks.
        • MWT2 (1.5 days) and OU (1 day) took downtime. 
        • SWT2_CPB had a DNS incident last weekend and was offline for about a day.
        • Otherwise good running...
      • Two sites still finishing EL9 updates: MSU and UTA.
        • MSU is close to having their installation system working. 
        • UTA (SWT2_CPB) is done with all servers except the storage servers.
      • We have decided to set a deadline of March 31 for submission of this year's Procurement and Operations plans.
        • I will follow up on whether there are template milestones that we can adjust to the March 31 deadline.
    • 1:40 PM 1:50 PM
      WBS 2.3.3 Heterogeneous Integration and Operations

      HIOPS

      Convener: Rui Wang (Argonne National Laboratory (US))
      • 1:40 PM
        HPC Operations 5m
        Speaker: Rui Wang (Argonne National Laboratory (US))

        TACC: running smoothly. Fully moved to token-based communication. No more issues seen on the shared file system.

        Perlmutter: usage is 15% below expectation. May need to improve the throughput.

      • 1:45 PM
        Integration of Complex Workflows on Heterogeneous Resources 5m
        Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))

        Still working on a solution to make the NVIDIA libraries available to ATLAS jobs running on the NERSC GPU queue. Testing a container created by merging the NVIDIA CUDA development RockyLinux 9 container with the Docker files from the Alma 9 ADC grid containers developed and maintained by Alessandro DeSalvo.

         

        Still need to pass the required environment variables into the container, create a work area, and add mount points for /pscratch and /cvmfs, then modify the pilot wrapper script, etc.
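
        As a rough sketch of what such an invocation might eventually look like (assuming Apptainer is used on the Perlmutter side; the image name, environment variable, work area path, and wrapper script name below are placeholders, not the actual setup):

        import subprocess

        # Illustrative only: image, variable name, and work area path are assumptions.
        image = "alma9-adc-cuda.sif"                 # merged CUDA + ADC grid image (assumed name)
        workdir = "/pscratch/sd/u/user/atlas_work"   # hypothetical work area on /pscratch

        cmd = [
            "apptainer", "exec",
            "--nv",                                  # expose the NVIDIA driver and libraries
            "--bind", "/pscratch:/pscratch",         # mount point for /pscratch
            "--bind", "/cvmfs:/cvmfs",               # mount point for /cvmfs
            "--env", f"ATLAS_WORKDIR={workdir}",     # example of passing an environment variable
            image,
            "bash", "runpilot2-wrapper.sh",          # modified pilot wrapper script (name assumed)
        ]
        subprocess.run(cmd, check=True)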

    • 1:50 PM 2:10 PM
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 1:50 PM
        Analysis Facilities - BNL 5m
        Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
        1. Working with an OpenShift expert on mounting persistent storage such as NFS within containers.
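
        A minimal sketch of the kind of NFS mount being discussed, using the Kubernetes Python client against the OpenShift API (the NFS server, export path, namespace, and image are placeholders, not the BNL configuration):

        from kubernetes import client, config

        # Illustrative only: all names, the NFS server, and the export path are placeholders.
        config.load_kube_config()
        core = client.CoreV1Api()

        pod = client.V1Pod(
            metadata=client.V1ObjectMeta(name="nfs-mount-test"),
            spec=client.V1PodSpec(
                containers=[
                    client.V1Container(
                        name="analysis",
                        image="almalinux:9",
                        command=["sleep", "infinity"],
                        volume_mounts=[client.V1VolumeMount(name="af-nfs", mount_path="/data")],
                    )
                ],
                volumes=[
                    client.V1Volume(
                        name="af-nfs",
                        nfs=client.V1NFSVolumeSource(server="nfs.example.org", path="/export/af"),
                    )
                ],
            ),
        )

        core.create_namespaced_pod(namespace="atlas-af", body=pod)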
      • 1:55 PM
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 2:00 PM
        Analysis Facilities - Chicago 5m
        Speaker: Fengping Hu (University of Chicago (US))
        • Maintenance scheduled on Tuesday, March 4, 2025.
        • Pod eviction issues on some of the nodes: Condor scratch is on the root filesystem. We will put some mitigations in place.
    • 2:10 PM 2:25 PM
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • 2:10 PM
        ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
        • ADC Operations:
          • Data Carousel: Including analysis
            • A few test users already included
            • Still a few things to clarify (ATLASPANDA-1129)
          • MC Evgen
            • default maxFailure set to 3 (was 10)
          • ARC CE bug found and fixed.
            • XRD and EOS TURLs need davs replaced with https (see the sketch after this list)
            • The fix should go into ARC7 and be backported to ARC6; deployment timescale unknown.
            • This was the reason for the failing SWT2-to-SWT2 transfers
          • IAM to K8S switch scheduled for 3/10/25
          • Started a dedicated “Sites” section in the new/developing “ADC Documentation”
        • US Cloud Operations:
          • Site Issues
            • NET2:
              • Now running all ATLAS workflows
            • OU_OSCER_ATLAS
              • Was still shown as being in downtime in the monitoring after the downtime ended. Solved (ADCMONITOR-559)
            • Others
              • Due to the Data Carousel configuration, a tape staging problem at TRIUMF was visible as a high destination failure rate on all US sites. Solved (GGUS:2430)
          • Tickets
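
        For reference, a minimal illustration of the TURL fix mentioned under the ARC CE item above, assuming it amounts to a simple davs -> https scheme rewrite (the function and sample TURL are made up, not the actual ARC code):

        from urllib.parse import urlsplit, urlunsplit

        def fix_turl(turl: str) -> str:
            """Rewrite a davs:// TURL to https:// (illustration only)."""
            parts = urlsplit(turl)
            return urlunsplit(parts._replace(scheme="https")) if parts.scheme == "davs" else turl

        # Made-up example TURL
        print(fix_turl("davs://xrootd.example.org:1094//rucio/atlasdatadisk/testfile"))
        # -> https://xrootd.example.org:1094//rucio/atlasdatadisk/testfile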
      • 2:15 PM
        Services DevOps 5m
        Speaker: Ilija Vukotic (University of Chicago (US))
        • XCache
          • issues with gStream monitoring are being debugged
          • issues with tests of the two Oxford XCache nodes
        • VP
          • working fine
        • Varnishes
          • all working fine
          • writing documentation on how to deploy it
        • ServiceY
          • writing documentation on how to deploy its Runner
          • stress testing of AF ads nodes 
          • stress testing of FAB
        • AF
          • The Assistant can now run bash commands and scripts.
      • 2:20 PM
        Facility R&D 5m
        Speaker: Lincoln Bryant (University of Chicago (US))
        • Testing flocking from UChicago AF to MWT2 with Docker-based containers + HTCondor overlay
          • Some early results during a MWT2 downtime, displacing OSG workloads with AF workloads. 
            • A bunch of parameters to tune, but for now we're submitting fairly non-aggressively, with each container set to 8 cores / 48 GB RAM (~VHIMEM equivalent)
            • Already identified a few things to fix - Singularity, for example, seems broken 
            • Users simply add "ALLOW_MWT2=True" in their job ad (see the sketch at the end of this section)
          • Should be generalizable to run elsewhere, but currently requires privilege. Might be possible without privilege for the containers, TBD.
        • Starting a document describing WireGuard implementation requirements 
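
        A hedged sketch of the user-side opt-in via the HTCondor Python bindings; only the ALLOW_MWT2 attribute comes from the item above, while the executable, resource requests, and exact attribute syntax are assumptions:

        import htcondor

        # Illustrative user job: everything except the ALLOW_MWT2 opt-in is a placeholder.
        sub = htcondor.Submit({
            "executable": "run_analysis.sh",
            "request_cpus": "4",
            "request_memory": "8GB",
            "output": "job.out",
            "error": "job.err",
            "log": "job.log",
            "MY.ALLOW_MWT2": "True",   # adds ALLOW_MWT2 = True to the job ad (syntax assumed)
        })

        schedd = htcondor.Schedd()
        schedd.submit(sub)             # queue one job, now eligible to flock to MWT2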
    • 2:25 PM 2:35 PM
      AOB 10m