US ATLAS Computing Facility (Replaced Tech Presentation)

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  993 2967 7148

Meeting password: 452400

Invite link:  https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09


    • 1:00 PM → 1:05 PM
      WBS 2.3 Facility Management News 5m
      Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))

      Today is a regular facility meeting (we had no Topical Presentation lined up).   Please let us know if you have a topic you would like to present at a future meeting.

      There are a lot of things going on.

      • February 2025 is a "Capabilities" Testing and Demonstration month.   See current list of topics at https://drive.google.com/drive/folders/1Af7hWa0Zm30EuqsV1PbekSjb--gXAsVG?usp=drive_link 
        • Please consider participating in one or more and feel free to edit existing documents or add new ones
      • The Tier-2s need to come up with a plan for how to use extra funds during this calendar year.
        • Highest priority is ensuring each of our Tier-2s will have 400 Gbps links by the end of 2029 (but it may be too early to spend directly on that now)
        • Each Tier-2 should be engaging the relevant campus and regional networks to discuss their upgrade plans and timelines
        • Also consider using the funds to fix infrastructure issues (power, cooling)
        • The first version of a WBS 2.3.2 document is due by the end of this month, with details needed for the July scrubbing
      • Ongoing jumbo-frames testing is proceeding smoothly (a path-MTU probe sketch follows at the end of this list).
        • Today is the last "regular"-frame transfer test from CERN-PROD_PILOT to both NET2 and BNL; tomorrow and Friday will be jumbo-frame testing
      • Upcoming Meetings
        • For your calendar: we plan to have a US ATLAS facilities meeting as part of HTC25 in Madison, Wisconsin, June 2-6, 2025.
        • US ATLAS scrubbing dates are set for July 14/15 at Stony Brook (possibly moving to 15/16 to accommodate European travel).
          • While many of you won't need to attend, you may be asked for input or slides for the scrubbing.
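
      Returning to the jumbo-frames item above: a minimal way to check whether a path actually carries jumbo frames is to send a large UDP datagram with the don't-fragment flag set. The sketch below is illustrative only, not part of the official transfer tests; the target hostname is a placeholder, and the Linux socket constants are defined by hand since the Python socket module does not always expose them.

      ```python
      import socket

      # Linux constants for forcing path-MTU discovery (defined manually;
      # not always exposed by the Python socket module).
      IP_MTU_DISCOVER = 10
      IP_PMTUDISC_DO = 2  # set the DF bit; oversized sends then fail

      def probe(host: str, payload: int) -> bool:
          """Return True if a UDP datagram of `payload` bytes can go out
          toward `host` unfragmented (Linux only)."""
          s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
          s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
          try:
              s.sendto(b"\x00" * payload, (host, 9))  # port 9 = discard
              return True
          except OSError:  # EMSGSIZE once the (cached) path MTU is known
              return False
          finally:
              s.close()

      # A 1472 B payload fits a standard 1500 B MTU; 8972 B needs a 9000 B
      # jumbo MTU (20 B IP + 8 B UDP headers). Hostname is hypothetical.
      for size in (1472, 8972):
          print(size, "OK" if probe("dtn.example.net", size) else "blocked")
      ```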
    • 1:05 PM → 1:10 PM
      OSG-LHC 5m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci

      Release (this week)

      • vo-client
      • XRootD shoveler
      • xrdcl-pelican

      Release (aiming for next week)

      Other projects

      • ARM package integration testing: made some progress in getting ARM VMs started by HTCondor and are working through some minor invocation issues
      • Kuantifier: waiting on the NET2 authenticated Prometheus dev instance (a query sketch follows below)
        • Eduardo has nodes for this and is working on setting up the cluster
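
      Once that instance exists, Kuantifier-style accounting reduces to authenticated queries against the Prometheus HTTP API; a minimal sketch, in which the URL, token path, and metric are placeholders rather than the real NET2 configuration:

      ```python
      import requests

      # Hypothetical endpoint and token location; the real NET2 dev
      # instance is not yet available, per the note above.
      PROM_URL = "https://prometheus.net2.example.org"
      TOKEN = open("/etc/kuantifier/bearer.token").read().strip()

      resp = requests.get(
          f"{PROM_URL}/api/v1/query",
          params={"query": "sum(rate(container_cpu_usage_seconds_total[5m]))"},
          headers={"Authorization": f"Bearer {TOKEN}"},
          timeout=30,
      )
      resp.raise_for_status()
      for series in resp.json()["data"]["result"]:
          print(series["metric"], series["value"])
      ```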
    • 1:10 PM → 1:30 PM
      WBS 2.3.1: Tier1 Center
      Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
      • 1:10 PM
        Tier-1 Infrastructure 5m
        Speaker: Jason Smith
      • 1:15 PM
        Compute Farm 5m
        Speaker: Thomas Smith
      • 1:20 PM
        Storage 5m
        Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
      • 1:25 PM
        Tier1 Operations and Monitoring 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))

        WBS 2.3.1.2 Tier-1 Infrastructure - Jason

        • NTR

        WBS 2.3.1.3 Tier-1 Compute - Tom

        • Testing the HTCondor v24 LTS configuration on gridgk03
          • Some issues with jobs being evicted after 2 hours; the Condor developers have been contacted and are providing support
        • All WNs have been upgraded to HTCondor 24.0 LTS and AlmaLinux 9.5; worker operation has been smooth (a verification sketch follows)
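
        As a quick uniformity cross-check after such an upgrade, the HTCondor Python bindings can ask the collector what version and OS every startd advertises. A sketch, assuming the bindings and the local pool configuration are in place:

        ```python
        import htcondor
        from collections import Counter

        # Query the pool collector for each worker node's advertised
        # HTCondor version and OS, then summarize the combinations.
        coll = htcondor.Collector()
        ads = coll.query(
            htcondor.AdTypes.Startd,
            projection=["Machine", "CondorVersion", "OpSysAndVer"],
        )

        summary = Counter(
            (ad.get("CondorVersion"), ad.get("OpSysAndVer")) for ad in ads
        )
        for (version, os_ver), count in summary.most_common():
            print(f"{count:5d} slots  {os_ver}  {version}")
        ```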

        WBS 2.3.1.4 Tier-1 Storage - Carlos

        • Database hardware issue affecting Pinmanager, Bulk, TransferManager and SpaceManager services
          • Degradation of service mainly affecting WRITEs (02/01/25 5PM EST)
          • Service recovered 02/02/25
          • Work to synchronize the internal accounting (SpaceManager) tables after the service restore is ongoing
        • Enabling JumboFrames on all doors and storage servers for ongoing Capabilities testing
        • Bulk service restarted on 02/09/25
          • 130k staging requests stuck in QUEUE state
          • After the restart, the requests were submitted to HPSS and the entire workflow is working as expected. A follow-up ticket was filed with the dCache developers: https://github.com/dCache/dcache/issues/7746

        WBS 2.3.1.5 Tier-1 Operations & Monitoring - Ivan

        • Occupancy: 92%, A/R: 100%
          • Occupancy is lower than expected due to:
            • 2/5/25: Site was emptied for several hours due to Harvester DB lock timeouts.
            • 2/1/25: The problem mentioned in the storage section above
    • 1:30 PM → 1:40 PM
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
      • Some reduction in production in the last 30 days.
        • Two central outages:
          • 1/14/25-1/16/25: A change at CERN caused the BNL FTS to fail; sites drained until they were moved to the CERN FTS instance.
          • 2/6/25: One of the two Harvester instances at CERN had a database issue; US sites using HTCondor-CE drained.
            • This did not affect NET2 or the Kubernetes part of CPB.
        • For the month of January the Illinois site of MWT2 was offline, reducing MWT2 production by about 1/3.
          • Jan 2-15: the site was down for a move to a new building.
          • Jan 16-22 (approximately): authentication was not working.
          • Jan 23-31 (approximately): systems were rebuilt on RHEL9 using the new Puppet setup.
            • There were also various hardware and power balance issues.
        • NET2 had a couple of interruptions to get their 400G uplink working.
          • The good news is the 400G is in service and working well!
        • OU_OSCER_ATLAS has been generally stable, with lots of opportunistic jobs.
          • Some draining on 2/11/25.
        • SWT2_CPB spent most of January getting their site up and running on AlmaLinux 9.
          • Things stabilized on 2/3/25.
            • CPB did not refill for one whole day last week after the Harvester issue was fixed.
              • The cause of the slow refilling is under investigation.
      • Procurement Planning
        • By the end of February we need a list of the extra network gear on which to spend the $2-$4 million split between the Tier-2 sites.
        • Procurement plans will likely be due by the end of March now that the equipment funding levels are known.
      • Operations Planning
        • Now that we are past the EL9 updates (except MSU), we need to plan for what we do going forward.
          • Clearly, storage tokens will need to be supported at all sites.
          • Some sites need to update to OSG 24 / HTCondor 24.
          • All sites have all public-facing servers dual-stacked and supporting IPv6, except the CE at OU (a reachability sketch follows this list).
          • AGLT2 and CPB still need to go to jumbo frames.
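
        Dual-stack status can be spot-checked at the DNS level from anywhere: a host is IPv6-ready only if it resolves to an AAAA record. A small sketch; the hostnames are placeholders:

        ```python
        import socket

        # A host is dual-stacked (at the DNS level) if it resolves to both
        # A (IPv4) and AAAA (IPv6) records. Hostnames are placeholders.
        def stacks(host: str) -> str:
            found = []
            for family, label in ((socket.AF_INET, "IPv4"),
                                  (socket.AF_INET6, "IPv6")):
                try:
                    socket.getaddrinfo(host, None, family)
                    found.append(label)
                except socket.gaierror:
                    pass
            return "+".join(found) or "unresolvable"

        for host in ("gridftp.example.edu", "ce.example.edu"):
            print(host, "->", stacks(host))
        ```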
    • 1:40 PM → 1:50 PM
      WBS 2.3.3 Heterogeneous Integration and Operations

      HIOPS

      Convener: Rui Wang (Argonne National Laboratory (US))
      • 1:40 PM
        HPC Operations 5m
        Speaker: Rui Wang (Argonne National Laboratory (US))
        • Perlmutter: jobs are running fine.
          • Empty pilots: Xin added an interval between submissions to reduce the number of job requests sent at the same time (a spacing sketch follows this list)
        • TACC: shared file system failure
          • The scratch file system has been down since Saturday; the work file system failed on Monday
          • No detailed status information yet
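
        The pilot-spacing fix above is essentially rate limiting; a hedged illustration of the idea, where the submit call, interval, and jitter are hypothetical rather than the actual Harvester parameters:

        ```python
        import random
        import time

        def submit(request):
            """Hypothetical stand-in for the real pilot-submission call."""
            print("submitting", request)

        def submit_with_spacing(reqs, base_interval=60.0, jitter=30.0):
            # Space submissions out with a randomized interval so bursts
            # of simultaneous requests are avoided.
            for req in reqs:
                submit(req)
                time.sleep(base_interval + random.uniform(0.0, jitter))

        submit_with_spacing(["pilot-1", "pilot-2", "pilot-3"])
        ```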
      • 1:45 PM
        Integration of Complex Workflows on Heterogeneous Resources 5m
        Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))
    • 1:50 PM → 2:10 PM
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 1:50 PM
        Analysis Facilities - BNL 5m
        Speaker: Qiulan Huang (Brookhaven National Laboratory (US))

        1. Investigated storage technologies for user home areas to ensure correct storage ACLs for NFS and GPFS within a container, including solutions like GPFS CSI and NAPP CSI.

      • 1:55 PM
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 2:00 PM
        Analysis Facilities - Chicago 5m
        Speaker: Fengping Hu (University of Chicago (US))
        • ServiceX updated to 1.5.6. It’s expected to be reliable, and Ben is confident that it’s ready for broader use.
        • Added Dask-Gateway support to the AB image (currently in a branch). Since it requires JupyterHub for launching, we are prepping BinderHub as the launch platform.
        • The coffea-casa cull timeout was adjusted from 1 hour to 1 day, to support users launching computations from the terminal (a config sketch follows this list).
        • Maintenance is scheduled for late February or early March.
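
        The cull-timeout change corresponds to a JupyterHub idle-culler setting; a sketch of what such a configuration can look like (the service layout here is an assumption, not the actual coffea-casa deployment):

        ```python
        # jupyterhub_config.py sketch: run jupyterhub-idle-culler as a hub
        # service with the timeout raised from 1 hour (3600 s) to 1 day.
        c.JupyterHub.load_roles = [
            {
                "name": "idle-culler",
                "scopes": ["list:users", "read:users:activity",
                           "delete:servers"],
                "services": ["idle-culler"],
            }
        ]
        c.JupyterHub.services = [
            {
                "name": "idle-culler",
                "command": [
                    "python3", "-m", "jupyterhub_idle_culler",
                    "--timeout=86400",  # was 3600 (1 hour)
                ],
            }
        ]
        ```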
    • 2:10 PM → 2:25 PM
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • 2:10 PM
        ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
        • ADC Operations:
          • 05.02.2025: DB lock timeouts on one Harvester instance (out of two).
          • 29.01.2025: Panda issue due to token issuer change (ATLASPANDA-1291)
          • DDM Ops/US Ops: Fabio is back. His priorities were defined.
          • GPUs: Need CUDA > 12.8 on all PQs. Expect Helpdesk tickets (a version-check sketch follows this list).
          • SAM tests moved from python2@SL7 to python3@EL9.
        • US Cloud Operations
          • SWT2: Failed transfers due to ACT access problem. Ongoing.
          • Ongoing JumboFrames tests.
        • USATLAS Helpdesk Tickets (Link)
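
        For the CUDA requirement, sites can spot-check what their driver supports by parsing the nvidia-smi banner; a minimal sketch, assuming nvidia-smi is on the PATH:

        ```python
        import re
        import subprocess

        # The nvidia-smi banner reports the highest CUDA version the
        # installed driver supports, e.g. "CUDA Version: 12.8".
        out = subprocess.run(["nvidia-smi"], capture_output=True,
                             text=True, check=True).stdout
        m = re.search(r"CUDA Version:\s*([0-9]+)\.([0-9]+)", out)
        if m is None:
            raise SystemExit("could not parse CUDA version from nvidia-smi")
        version = (int(m.group(1)), int(m.group(2)))
        print("driver supports CUDA %d.%d:" % version,
              "OK" if version >= (12, 8) else "upgrade needed")
        ```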
      • 2:15 PM
        Services DevOps 5m
        Speaker: Ilija Vukotic (University of Chicago (US))
        • XCaches
          • a few issues to look at.
          • the gStream issue is still not debugged.
        • VP
          • working fine
          • need to follow up on NET2 VP queue mails.
        • Varnishes
          • all working fine
          • there was a discussion of a wholesale move from squid to varnish.
          • now adding instances at NRP in NL and CZ to serve Frontier data.
        • ServiceY
          • retesting FAB server-side delivery.
          • new datasets, new cluster
        • ServiceX
          • upgraded to 1.5.6
          • new code gen images.
        • AI
      • 2:20 PM
        Facility R&D 5m
        Speaker: Lincoln Bryant (University of Chicago (US))

        rp1 Ceph storage is bottlenecked on the WireGuard interface at IU. The equipment there is much older (R720?), and the CPU might not be fast enough to handle the encryption overhead. Two solutions were implemented (throughput measured with iperf; a scripting sketch follows the list):

        • Increasing the k8s MTU from 1280 to 8780 increased iperf throughput from 1 Gbps to 4 Gbps.
        • Adding a non-WireGuard backhaul network for Ceph increased performance to 10 Gbps (line rate).
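
        A sketch of scripting such a before/after comparison with iperf3's JSON output; the server hostname is a placeholder:

        ```python
        import json
        import subprocess

        def iperf_gbps(server: str, seconds: int = 10) -> float:
            """Run an iperf3 client against `server` and return the
            received throughput in Gbps, from iperf3's JSON output."""
            out = subprocess.run(
                ["iperf3", "-c", server, "-J", "-t", str(seconds)],
                capture_output=True, text=True, check=True,
            ).stdout
            bps = json.loads(out)["end"]["sum_received"]["bits_per_second"]
            return bps / 1e9

        # Placeholder hostname; run once per configuration (MTU 1280 vs
        # 8780, WireGuard vs. dedicated backhaul) and compare.
        print(f"{iperf_gbps('ceph-node.example.org'):.1f} Gbps")
        ```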


        Testing the feasibility of unprivileged WireGuard on a VM at UChicago: podman seems to let us create tunnel interfaces in containers without root privileges on current (EL9+) kernels. This might have interesting implications for jobs.


        Ongoing re-testing of ServiceY on FAB. Fengping will present at the KNIT10 conference in March.


        Flocking tests from the UChicago AF to MWT2 are ongoing; they will be exercised at large scale during the upcoming MWT2 storage downtime.

    • 2:25 PM → 2:35 PM
      AOB 10m