US ATLAS Computing Facility (Possible Topical)

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  993 2967 7148

Meeting password: 452400

Invite link:  https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09

 

 

    • 13:00 13:05
      WBS 2.3 Facility Management News 5m
      Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))
      • CHEP 2026 Abstracts due by COB Friday
      • Review coming up for next 5-year CA.  Need to clearly document the need for AF and Tier-2 plans
      • One person is approved to attend upcoming dCache workshop
      • Quarterly reports will be due soon.  Please check milestones and risk registry for updates as well.
    • 13:05 13:10
      OSG-LHC 5m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
      • FNAL vo-client updates targeted for release Jan 15 and 29
      • HTCondor 25.6.0 undergoing integration testing in the OSPool, expecting a full release in the next few weeks
    • 13:10 13:30
      WBS 2.3.1: Tier1 Center
      Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
      • 13:10
        Tier-1 Infrastructure 5m
        Speaker: Jason Smith
      • 13:15
        Compute Farm 5m
        Speaker: Thomas Smith
      • 13:20
        Storage 5m
        Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))

        NFS components supporting the staging workflow (dCache to/from HPSS) migrated from NFSv3 to NFSv4

        •     Transition completed successfully on 01/14/26 between 10:00–11:00 AM.
        •     Migration was transparent to users.

         

        Reviewed Network/TCP kernel parameters for dCache dual-home pools and doors:

        • Doors were using legacy settings; dual-home pools were not optimized for WAN access.
        • Network/TCP kernel parameters were identified based on ESnet Fasterdata tuning guidelines.
        • TCP tuning has been applied to dcdoorsX and dual-home pools since 01/12/26.
        • Per-file transfer performance improvements observed during testing: 

        TCP pull from CERN EOS to a BNL dual-home pool (16 GB file):

        • Baseline (as-is): 34.23 MiB/s
        • After WAN tuning: 167.08 MiB/s
        • Performance improvement: ~5× throughput increase
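To make the reported numbers concrete, the following is a minimal sketch of the arithmetic behind the ~5× figure. It also shows why buffer tuning matters on a long path: a single TCP stream can only sustain a rate if roughly rate × round-trip time of data can be in flight. The file size interpretation (16 GB read as 16 GiB) and the round-trip time `RTT_S` are assumptions for illustration only; neither was stated in the report.

```python
# Sketch only: reproduces the throughput arithmetic from the numbers in the notes.
MIB = 1024 ** 2

BASELINE_MIB_S = 34.23      # from the notes: baseline (as-is)
TUNED_MIB_S = 167.08        # from the notes: after WAN tuning
FILE_BYTES = 16 * 1024 ** 3 # assumption: "16 GB" interpreted as 16 GiB
RTT_S = 0.090               # hypothetical CERN-BNL round-trip time, not measured here


def speedup(baseline_mib_s: float, tuned_mib_s: float) -> float:
    """Ratio of tuned to baseline throughput."""
    return tuned_mib_s / baseline_mib_s


def transfer_seconds(file_bytes: int, rate_mib_s: float) -> float:
    """Time to move file_bytes at a steady rate given in MiB/s."""
    return file_bytes / (rate_mib_s * MIB)


def in_flight_mib(rate_mib_s: float, rtt_s: float) -> float:
    """Bandwidth-delay product: data that must be in flight to sustain the rate."""
    return rate_mib_s * rtt_s


if __name__ == "__main__":
    print(f"speedup: {speedup(BASELINE_MIB_S, TUNED_MIB_S):.2f}x")
    for label, rate in (("baseline", BASELINE_MIB_S), ("tuned", TUNED_MIB_S)):
        print(
            f"{label}: {transfer_seconds(FILE_BYTES, rate):.0f} s per file, "
            f"~{in_flight_mib(rate, RTT_S):.1f} MiB in flight at assumed RTT"
        )
```

Under these assumptions the per-file transfer time drops from roughly 8 minutes to under 2 minutes. The in-flight estimate is only as good as the assumed RTT and should be recomputed with a measured value.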

      • 13:25
        Tier1 Operations and Monitoring 5m
        Speaker: Ofer Rind (Brookhaven National Laboratory)
    • 13:30 13:40
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
      • The holiday period was quiet and there was a high production level.
        • There were two notable outages over the holidays:
          • At AGLT2, a dCache issue caused a one-day outage.
          • At CPB, the annual power outage once again caused issues. Some sort of DNS table corruption caused by the power outage took time to trace.
        • Since people have returned to work, there have been various small reductions in production.
      • The PIs need to meet in early February to discuss procurement.
        • The funding outlook has improved but the story is not complete yet.
      • As Shawn said, we are still working on a good story about how essential the Tier 2 sites are.
      • Get your quarterly report in.
    • 13:40 13:50
      WBS 2.3.3 Heterogeneous Integration and Operations

      HIOPS

      Convener: Rui Wang (Argonne National Laboratory (US))
      • 13:40
        HPC Operations 5m
        Speaker: Rui Wang (Argonne National Laboratory (US))

        Quarterly report submitted

        Perlmutter: ~38k CPU hours and ~15k GPU hours remain

        • 50K CPU hours were added by NERSC on Monday
          • After we ran out of time, Doug contacted Wahid Bhimji (NERSC) for additional CPU time.
          • If we run out of time again, should we ask for more? The allocation is supposed to last until 21-Jan-26.
        • AY26 will start on Jan 23 after the maintenance
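As a rough aid for the "should we ask for more" question, here is a minimal sketch of the sustainable daily usage rate given the remaining balance and the 21-Jan-26 end date. The "as of" date is an assumption (the notes do not state the meeting date), and the remaining hours are the approximate figures quoted above.

```python
from datetime import date

# Approximate balances quoted in the notes.
CPU_HOURS_LEFT = 38_000
GPU_HOURS_LEFT = 15_000

ALLOCATION_END = date(2026, 1, 21)  # from the notes
AS_OF = date(2026, 1, 14)           # assumption: date of this report, not stated


def sustainable_daily_rate(hours_left: float, as_of: date, end: date) -> float:
    """Hours per day that can be spent so the balance lasts until `end`."""
    days = (end - as_of).days
    if days <= 0:
        raise ValueError("allocation end must be after the as-of date")
    return hours_left / days


if __name__ == "__main__":
    cpu = sustainable_daily_rate(CPU_HOURS_LEFT, AS_OF, ALLOCATION_END)
    gpu = sustainable_daily_rate(GPU_HOURS_LEFT, AS_OF, ALLOCATION_END)
    print(f"CPU: ~{cpu:.0f} hours/day, GPU: ~{gpu:.0f} hours/day")
```

If the actual daily consumption is well above these rates, the balance would run out before the end date and another request would be needed.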
      • 13:45
        Integration of Complex Workflows on Heterogeneous Resources 5m
        Speaker: Doug Benjamin (Brookhaven National Laboratory (US))
    • 13:50 14:10
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • IRIS-HEP/AGC Demo Day #11 this Friday, 11am ET (link)
      • 13:50
        Analysis Facilities - BNL 5m
        Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
        • Work on integrating COmanager for Jupyter federated authentication is ongoing
        • User space management updates
          • No updates on the user space policy from Viviana and Hector
          • User quota testing reached a pause point
            • One issue observed: inconsistent return codes from the WebDAV protocol

              • Addressed in release 9.2.46 or later

      • 13:55
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:00
        Analysis Facilities - Chicago 5m
        Speaker: Fengping Hu (University of Chicago (US))

        SENSE Deployment Update

        • VLAN trunking has been configured on switch ports to allow proxy components to run on additional servers, freeing the ConnectX-7 card for exclusive use by the software router.

        • Follow up with Diego on the current status and confirm whether the setup will be ready in time for the mini challenges.

    • 14:10 14:30
      WBS 2.3.5 Continuous Operations
      Conveners: Ivan Glushkov (Brookhaven National Laboratory (US)), Ofer Rind (Brookhaven National Laboratory)
      • Sent out a call for quarterly reports
      • Host-Tuning Mini-Challenge meeting (notes)

        • Next meeting Friday, January 23rd from 2-3 PM Eastern
      • OTF #8 "Tape Evolution and Challenges" registration now open (agenda)
      • 14:10
        ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Armen Vartapetian (University of Texas at Arlington (US)), Kaushik De (University of Texas at Arlington (US))
        • AGLT2 S3 LocalGroupDisk service issues (GGUS)
          • FTS concurrency was reduced, but the limit doesn't seem to be respected by the CERN service?
        • Transfer failures from TW-FTT; Yi-Ru is investigating (GGUS)
        • One BNL shared pool CE VM migrated and updated, to be returned to service this afternoon
          • Observed that deactivating the CE in CRIC did not stop jobs from being scheduled; it had to be detached from all PQs
        • Thank you to Armen who has been chairing the daily ops meetings in Ivan's absence!
      • 14:15
        Services DevOps 5m
        Speaker: Ilija Vukotic (University of Chicago (US))
        • Caches
          • Everything works fine
          • Varnish for CVMFS at SFU is being reconfigured so that it uses their stratum-0
        • Analytics
          • Moved the rest of the Alarms cron jobs to local GitHub Actions.
        • AF Assistant
          • Subtle changes in agents.
          • It now "knows" the user.
          • Working on integrating Glance data
          • Received DGX Sparks; they are now installed and being benchmarked. They will be used primarily for development and running evals and, if fast enough, for inference (responding to users).
        • ServiceX/Y
          • NTR
      • 14:20
        Facility R&D 5m
        Speaker: Robert William Gardner Jr (University of Chicago (US))
      • 14:25
        Cybersecurity plan(s) 5m
        Speakers: Robert William Gardner Jr (University of Chicago (US)), Shigeki Misawa (Brookhaven National Laboratory (US))
    • 14:30 14:40
      AOB 10m