US ATLAS Computing Facility (Possible Topical)

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  993 2967 7148

Meeting password: 452400

Invite link:  https://umich.zoom.us/j/99329677148

 

 

    • 13:00 13:05
      WBS 2.3 Facility Management News 5m
      Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))

      We are waiting on news of the end-of-CA funds so that we can begin spending

        - Need to schedule a meeting as soon as the funds are in the pipeline, so we can discuss the process and plans

      Check the Milestones at https://docs.google.com/spreadsheets/d/1z5Ud_hMKzogVkFm5lXM5GFpcFZl5Bu0Hkd9xkNagYfY/edit?gid=173778962#gid=173778962

      HEPiX is this week (Board meeting is going on now)   https://indico.cern.ch/event/1598655/

      dCache topic

        - AGLT2 and MWT2 planning to upgrade to v11.2.4.   AGLT2 nominal Apr 30 9 AM - 2 PM, MWT2 May 4

        - dCache workshop will have a USATLAS presentation by Eduardo https://indico.nikhef.nl/event/7562/

          - Shawn will present on SciTags/Firefly work as well

      GENESIS Phase I proposals due April 28th

      Summer meetings

        - USATLAS F2F at HTC26 in Madison June 9-10

        - ATLAS S&C week at CERN June 29-July 2

        - USATLAS Scrubbing July 13-15

        - USATLAS Summer meeting July 27-29 (?)

      Today we have a special guest: Megha Moncy, who will let us know about plans for OSG Security exercises.

    • 13:05 13:10
      OSG-LHC 5m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
      • HTCondor 25.10.0 undergoing stress testing in the CHTC this week, OSPool next week. Headline feature is common file reuse on the EP-side. Release in ~2 weeks
      • Still need to start the mass rebuild process for XRootD 6
      • Newest version of Kuantifier adds support for tracking usage of Jupyter notebooks: https://osg-htc.org/docs/other/monitor-kubernetes-kuantifier/
      • Working with the CRIC team to grab resources + contact info from Topology
    • 13:10 13:30
      WBS 2.3.1: Tier1 Center
      Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
      • 13:10
        Tier-1 Infrastructure 5m
        Speaker: Jason Smith
      • 13:15
        Compute Farm 5m
        Speaker: Thomas Smith

        gridgk03,4 were drained on 4/21 by mistake. Ivan caught and corrected this. No interruption in jobs or throughput (gridgk06,7 picked up the extra work). Things have since rebalanced.

        Preparations are being made to migrate the Tier 1 condor nodes to use the new config we've been working on. This process should be relatively seamless. There will be a brief spike in failure rate as jobs are killed to rebuild the workers. Targeting a phased migration in batches of ~25%, with a pause after the first batch to verify jobs are flowing and completing successfully. Small scale testing so far has been good! Uptime during this whole process should remain 100% with (very) brief periods of 75% capacity

        Targeting to begin next week, pending success of all the prep work (a LOT of code to verify and merge)
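
The phased approach described above (batches of ~25%, with a pause after the first batch) can be sketched roughly as follows. This is only an illustration of the batching arithmetic, not the actual migration tooling: the worker names, the `plan_batches` and `capacity_during` helpers, and the pool size of 20 are all hypothetical.

```python
from typing import List, Sequence


def plan_batches(nodes: Sequence[str], fraction: float = 0.25) -> List[List[str]]:
    """Split nodes into consecutive batches of roughly `fraction` of the pool."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    size = max(1, round(len(nodes) * fraction))
    return [list(nodes[i:i + size]) for i in range(0, len(nodes), size)]


def capacity_during(batch: Sequence[str], total: int) -> float:
    """Fraction of the pool still serving jobs while `batch` is being rebuilt."""
    return 1.0 - len(batch) / total


# Hypothetical worker names, for illustration only.
workers = [f"worker{n:03d}" for n in range(1, 21)]
batches = plan_batches(workers)

for index, batch in enumerate(batches, start=1):
    pct = capacity_during(batch, len(workers)) * 100
    # Only the first batch is followed by a verification pause, as in the notes.
    note = " (pause and verify jobs before continuing)" if index == 1 else ""
    print(f"batch {index}: {len(batch)} nodes rebuilt, {pct:.0f}% capacity{note}")
```

With four equal batches, at most a quarter of the pool is out at any time, which matches the brief periods of 75% capacity mentioned above.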

      • 13:20
        Storage 5m
        Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
        • Tape staging backlog (due to staging timeouts) resolved while supporting ongoing ATLAS tape activities (1M files staged since 04/14)
        • Bulk staging service configured to support 200K active FTS staging requests
          • Data Carousel capped at ~190K requests
      • 13:25
        Tier1 Operations and Monitoring 5m
        Speaker: Ofer Rind (Brookhaven National Laboratory)
    • 13:30 13:40
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
      • Great running in the past couple of weeks.
        • MWT2 Illinois site had its quarterly preventive maintenance on April 15
        • A user sent ~1M small derivation jobs and caused job failures at MWT2 on April 17-18
        • Some of the monit plots were corrupted by an Oracle overload April 16-19.
      • CPB is nearly finished with the update to EL9.
        • A handful of storage servers remain to be updated.
      • The release of dCache version 11.2.4 will be next week.
        • Shawn believes this version does Fireflies/SciTags correctly.
        • AGLT2 and MWT2 will wait for this release before updating dCache.
      • The amount of additional equipment funding is about $1.7 million per site.
        • This is above and beyond your FY25 funding.
        • Given the unexpectedly large amount of funding I am asking people to submit new procurement plans by the end of May.
      • I have access to the Dell Customer testing center and will be benchmarking 5th generation (Turin) EPYC processors.
        • I will look at the list price of various server configurations to identify the most cost effective server configurations.
        • One can follow the price of memory over the past 18 months at this web site.
      • Still working on the quarterly report. 
    • 13:40 13:50
      WBS 2.3.3 Heterogeneous Integration and Operations

      HIOPS

      Convener: Rui Wang (Argonne National Laboratory (US))
      • 13:40
        HPC Operations 5m
        Speaker: Rui Wang (Argonne National Laboratory (US))

        Perlmutter: Production job still pending

        • (Doug) Pilot is not picking up the valid x509 User Proxy. Working with Asoka DeSilva to debug what has changed. 
        • (Doug) Updated the pilot to the latest version.

         

        TACC: LRAC (large scale) call for Horizon starting in the summer of 2026 -- proposal deadline: May 15

        • Large allocations from 125,000 to 500,000 SUs (Horizon) and up to 50,000 SUs (Vista) for a six-month duration
        • Requires current peer-reviewed research funding to support the activities conducted on Horizon
        • Proposals from or including junior researchers are encouraged
        • Horizon: a mix of CPU and GPU computing resources, including 4,750 Dell/NVIDIA Vera CPU nodes, and 2,000 Dell/NVIDIA Grace-Blackwell nodes
          • Vera: 2x of Grace, ~1x of AMD EPYC 7763 (Perlmutter)
          • Vera-Rubin (Doudna) ~10x of Grace-Blackwell
      • 13:45
        Integration of Complex Workflows on Heterogeneous Resources 5m
        Speaker: Doug Benjamin (Brookhaven National Laboratory (US))
    • 13:50 14:10
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 13:50
        Analysis Facilities - BNL 5m
        Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
        • User space token cleanup update
          • The notification email content is finalized and the ban file testing has been done
          • Notifications will be sent to inactive users in the meantime, until the production storage system is patched to enable the ban feature
        • JupyterHub Development & Deployment updates

          • Improved Frontend design
          • Go through the federated authentication workflow and resolve issues with CILogon integration

          • Integration testing of the federated JupyterHub workflow
      • 13:55
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:00
        Analysis Facilities - Chicago 5m
        Speaker: Fengping Hu (University of Chicago (US))

        Containerd open file limit fix

        A Coffea-Casa issue caused HTCondor workers to transition to “completed” shortly after startup. This was traced to the ingress controller exhausting available file descriptors.

        The root cause was the removal of an explicit open file limit configuration for containerd some time ago. The limit has now been set in the default systemd configuration, and the fix has been deployed on the UC Analysis Facility cluster.
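
As a rough illustration of the kind of check that can expose this class of problem, the sketch below reports a process's open-file limit and how many descriptors it currently holds. It is not the fix that was deployed (that was a systemd-level limit for containerd); the helper names are hypothetical, and the `/proc/self/fd` path assumes Linux, with `/dev/fd` as a fallback on other Unix systems.

```python
import os
import resource


def open_file_limits() -> tuple:
    """Return the (soft, hard) RLIMIT_NOFILE limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count() -> int:
    """Count file descriptors currently open in this process."""
    # /proc/self/fd is Linux-specific; /dev/fd is the fallback elsewhere.
    fd_dir = "/proc/self/fd" if os.path.isdir("/proc/self/fd") else "/dev/fd"
    return len(os.listdir(fd_dir))


def fd_headroom(used: int, soft_limit: int) -> float:
    """Descriptors still available before the soft limit is reached."""
    if soft_limit == resource.RLIM_INFINITY:
        return float("inf")
    return soft_limit - used


soft, hard = open_file_limits()
used = open_fd_count()
print(f"soft limit: {soft}, hard limit: {hard}, open now: {used}")
print(f"headroom before exhaustion: {fd_headroom(used, soft)}")
```

Run inside a container, a check like this shows the limit the workload actually inherited, which may differ from the limit set on the host.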

    • 14:10 14:30
      WBS 2.3.5 Continuous Operations
      Conveners: Ivan Glushkov (Brookhaven National Laboratory (US)), Ofer Rind (Brookhaven National Laboratory)
      • 14:10
        ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Kaushik De (University of Texas at Arlington (US))
        • LHC
          • The LHC is producing low-mu collisions, resulting in runs up to 50 hours long and 1 PB datasets. The low-mu run will be over by the end of the week.
        • ADC Ops:
          • SAM tests are currently failing.
          • BOINC submission is broken at the moment.
          • Job monitoring shows artifacts due to an overload of the Monit filler; the plots are to be repainted.
          • There is an ongoing campaign to synchronize the SE protocol basepaths. This is needed since tokens are not per protocol.
          • CERN CephFS problem was due to SSD Micron 5200 with power_on_hours SMART counter larger than 65536.
            • If you have Micron 5* SSDs with power_on_hours > 65536 (i.e. older than 7 years) - please let us know.
        • US Cloud Ops
          • Armen kindly agreed to help with daily issues for US sites (failures, problems, following up on issues) and to summarize still-open issues on Mondays.
          • NET2 CE downtime shortening revealed a CRIC bug. Still to be solved.
          • MWT2 storage was overloaded by a misconfigured user workflow.
            • Solved on the ADC side, but site storage protection should be put in place (number of connections per pool: reduced)
          • TW increased number of slots (to 4k) and also removed FTS limit. Running all ADC workloads now.
          • Agreed to decommission NEVIC localgroupdisk
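
For sites checking their own inventory against the Micron SSD note above, the threshold arithmetic can be sketched as follows. The inventory entries, the model-matching pattern, and the helper names are hypothetical; only the 65536-hour threshold and the "Micron 5*" family come from the notes. Note that 65536 equals 2**16, and that 65536 hours is about 7.5 years of continuous power-on time.

```python
import re

THRESHOLD_HOURS = 65536          # value quoted in the notes; equals 2**16
HOURS_PER_YEAR = 24 * 365.25     # average year length, including leap years


def power_on_years(hours: int) -> float:
    """Convert a power_on_hours SMART counter into years of power-on time."""
    return hours / HOURS_PER_YEAR


def is_affected(model: str, hours: int, threshold: int = THRESHOLD_HOURS) -> bool:
    """True for a Micron 5xxx-series model whose counter exceeds the threshold."""
    # Assumed naming convention: the model string contains "Micron" then "5...".
    is_micron_5 = re.search(r"micron[ _]?5\d", model, re.IGNORECASE) is not None
    return is_micron_5 and hours > threshold


# Hypothetical inventory: (device, model string, power_on_hours).
inventory = [
    ("sda", "Micron_5200_MTFDDAK960TDD", 70000),
    ("sdb", "Micron_5200_MTFDDAK960TDD", 41000),
    ("sdc", "OtherVendor_SSD", 70000),
]

for device, model, hours in inventory:
    status = "REPORT" if is_affected(model, hours) else "ok"
    print(f"{device}: {model}, {hours} h (~{power_on_years(hours):.1f} y): {status}")
```

The hour counts would in practice come from the drives' SMART data; how they are collected is left out here on purpose.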
      • 14:15
        Services DevOps 5m

        XCaches - all OK
        Varnishes - all OK. MWT2 CVMFS varnish moved to ingress
        Frontiers - due to CERN Openstack retirement of nodes belonging to FRONTIER-A, I had to change all the nodes. They also changed from m2 to m4 nodes.
        AI - small updates to most of the AI agents

        Speaker: Ilija Vukotic (University of Chicago (US))
      • 14:20
        Facility R&D 5m
        Speaker: Robert William Gardner Jr (University of Chicago (US))
      • 14:25
        Cybersecurity plan(s) 5m
        Speakers: Robert William Gardner Jr (University of Chicago (US)), Shigeki Misawa (Brookhaven National Laboratory (US))
    • 14:30 14:40
      AOB 10m