US ATLAS Computing Facility (Possible Topical)

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  993 2967 7148

Meeting password: 452400

Invite link:  https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09

 

 

    • 13:00 13:05
      WBS 2.3 Facility Management News 5m
      Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))

      We will have a WLCG mini-capability discussion on Friday at 2:30 PM Eastern/1:30 PM Central, including both USATLAS and USCMS (as well as ESnet). Please join if you are interested. Zoom info: https://unl.zoom.us/j/96665685001

       

      During today's updates it would be good to hear about plans for the facility regarding holiday coverage (if any).

      The HEPiX Techwatch WG had some discouraging news on disk and memory pricing and on supply chain delays. This will likely adversely impact our purchases at least until 2028.

       

       

    • 13:05 13:10
      OSG-LHC 5m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
      • No releases planned until next year
      • Google is disallowing the use of certificates for client authentication in Chrome (https://knowledge.digicert.com/alerts/sunsetting-client-authentication-eku-from-digicert-public-tls-certificates). IGTF CAs that are also trusted by Chrome by default will need to adjust or be dropped from Chrome.
      • OSG-LHC PEP
        • Blueprints for OSG Accounting future planning, XRootD monitoring architecture authentication
        • Integration tests of containers
        • Cybersecurity tabletop
        • Capability challenges as part of mini DC challenges
        • Support enabling SciTags on US ATLAS/CMS
        • Remove last vestiges of X.509
    • 13:10 13:20
      Rucio/SENSE at NET2: integration, demonstration, and next steps 10m

      This presentation will cover the NET2 experience at SC25 and discuss near term plans for SENSE/Rucio work in the facility.

      Speaker: Rafael Coelho Lopes De Sa (University of Massachusetts (US))
    • 13:20 13:40
      WBS 2.3.1: Tier1 Center
      Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
      • 13:20
        Tier-1 Infrastructure 5m
        Speaker: Jason Smith
      • 13:25
        Compute Farm 5m
        Speaker: Thomas Smith

        All condor workers have been upgraded to AlmaLinux 9.7.

        Need to schedule upgrade downtimes for the condor CEs.

        Gratia reporting was not functioning; packages were updated and reporting was restored. The gap in reporting was filled retroactively without further intervention. Thanks to Derek for reporting the issue.

      • 13:30
        Storage 5m
        Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
        • dc284 pool server commissioned into production (1 PB)
        • No major issues to report in operations
      • 13:35
        Tier1 Operations and Monitoring 5m
        Speaker: Ofer Rind (Brookhaven National Laboratory)
    • 13:40 13:50
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
      • Good running over the last two weeks
        • TW-FTT has had transfer troubles with certain European countries.
          • Today they have some sort of general network problem.
        • Otherwise a very quiet period with full production.
      • Please, please submit the Tier 2 site operations before you leave for the end-of-year holidays.
      • We will follow up with the sites on administrative coverage over the holidays.
        • My guess is this will look like it normally does: best effort.
      • Need to revisit the Tier 2 equipment projects ASAP because the CA will likely be reviewed during January.
        • The current federal budget continuing resolution expires at the end of January, and NSF wants to get the external reviews of the CA done in January.
        • The goal is to have the CA ready to be presented at the National Science Board summer meeting.
    • 13:50 14:00
      WBS 2.3.3 Heterogeneous Integration and Operations

      HIOPS

      Convener: Rui Wang (Argonne National Laboratory (US))
      • 13:50
        HPC Operations 5m
        Speaker: Rui Wang (Argonne National Laboratory (US))

        Perlmutter: ~141K CPU hours and ~18K GPU hours remain (the AY ends on Jan 22, 2026)

        Perlmutter (Doug): in monthly downtime today

        Perlmutter (Doug): trying to understand the magnitude of empty pilots (pilots that do not get any jobs after 5 attempts)
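
One way to quantify the empty pilots mentioned above is to compute the fraction of pilots that never received a job after the 5 attempts. The sketch below is illustrative only: the `PilotRecord` fields and the function names are hypothetical and do not correspond to any actual pilot, Harvester, or PanDA schema. Only the threshold of 5 attempts comes from the note.

```python
from dataclasses import dataclass

MAX_ATTEMPTS = 5  # threshold taken from the note above


@dataclass
class PilotRecord:
    """Hypothetical per-pilot summary; field names are assumptions."""
    pilot_id: str
    job_requests: int   # number of times the pilot asked for a job
    jobs_received: int  # number of jobs it actually got


def is_empty(pilot: PilotRecord, max_attempts: int = MAX_ATTEMPTS) -> bool:
    """A pilot counts as empty if it used up its attempts and got no job."""
    return pilot.jobs_received == 0 and pilot.job_requests >= max_attempts


def empty_pilot_fraction(pilots: list[PilotRecord]) -> float:
    """Fraction of pilots that were empty; 0.0 for an empty input list."""
    if not pilots:
        return 0.0
    empty = sum(1 for p in pilots if is_empty(p))
    return empty / len(pilots)
```

Multiplying the resulting fraction by the wall time each empty pilot holds a slot would give a rough estimate of the allocation hours lost; that per-pilot wall time would have to come from the batch system logs.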

         

      • 13:55
        Integration of Complex Workflows on Heterogeneous Resources 5m
        Speaker: Doug Benjamin (Brookhaven National Laboratory (US))

        Debugging an error seen on the GPU queue: HammerCloud jobs are recording an error in stage-out.  "File transfer timed out during stage-out: hc_test:ced04002-0d9e-4fab-b478-b8fb314d3e43_49036.1.job.log.tgz to BNL-OSG2_SCRATCHDISK, copy command timed out: TimeoutException: Unknown time-out related error, see batch log for more info, timeout=None seconds')]:failed to transfer files using copytools=['rucio'] "

        Trying to understand why user jobs are not starting.

    • 14:00 14:20
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:00
        Analysis Facilities - BNL 5m
        Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
        • Following up with users (inactive users with data and active users with no data) and waiting for the policy decision on handling users' storage areas.
          • Starting to work on user quota management
        • Testing the new federated JupyterHub services for FCC and DUNE; the same changes will be applied to ATLAS after they are puppetized.
      • 14:05
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:10
        Analysis Facilities - Chicago 5m
        Speaker: Fengping Hu (University of Chicago (US))

        Maintenance done on Dec 17.

        •  AlmaLinux: v9.6 → v9.7
        •  Kubernetes: v1.31 → v1.32
        •  Calico: v3.29 → v3.31
        •  NVIDIA Driver: 570.195.03 → 580.105.08
    • 14:20 14:40
      WBS 2.3.5 Continuous Operations
      Conveners: Ivan Glushkov (Brookhaven National Laboratory (US)), Ofer Rind (Brookhaven National Laboratory)
      • A bug was introduced in the latest version of the pilot: it divides maxwdir by a hard-coded 8 instead of by PQ.corecount (ATLASPANDA-1575). This mostly affects the score queues. To be fixed once Paul is back.
      • 14:20
        ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Kaushik De (University of Texas at Arlington (US))
      • 14:25
        Services DevOps 5m
        Speaker: Ilija Vukotic (University of Chicago (US))
      • 14:30
        Facility R&D 5m
        Speaker: Robert William Gardner Jr (University of Chicago (US))
      • 14:35
        Cybersecurity plan(s) 5m
        Speakers: Robert William Gardner Jr (University of Chicago (US)), Shigeki Misawa (Brookhaven National Laboratory (US))
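
For the pilot bug noted under WBS 2.3.5 (ATLASPANDA-1575), the arithmetic can be sketched as follows to show why single-core queues see the largest effect. This is not the pilot's code: the function names and the example value are hypothetical, and the sketch only assumes, as the note describes, that maxwdir is divided by a fixed 8 rather than by the queue's corecount.

```python
HARD_CODED_CORES = 8  # the fixed divisor described in the note


def workdir_limit_buggy(maxwdir: float, corecount: int) -> float:
    """Behaviour as described in the note: corecount is ignored."""
    return maxwdir / HARD_CODED_CORES


def workdir_limit_expected(maxwdir: float, corecount: int) -> float:
    """Assumed intended behaviour: divide by the queue's corecount."""
    return maxwdir / max(corecount, 1)  # guard against a zero corecount


def shrink_factor(corecount: int) -> float:
    """How much smaller the buggy limit is than the expected one."""
    maxwdir = 16000.0  # arbitrary example value, unit-agnostic
    expected = workdir_limit_expected(maxwdir, corecount)
    return expected / workdir_limit_buggy(maxwdir, corecount)
```

Under these assumptions, a queue with corecount 1 ends up with a limit 8 times smaller than expected, while a queue with corecount 8 is unaffected.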
    • 14:40 14:50
      AOB 10m

      Hope everyone has happy holidays! We will meet again in 2026.