US ATLAS Computing Facility (Possible Topical)

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  993 2967 7148

Meeting password: 452400

Invite link:  https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09

 

 

    • 13:00 13:05
      WBS 2.3 Facility Management News 5m
      Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))

      Working on finalizing next CA text and associated documents before COB Friday

      HEPiX is ongoing this week. See https://indico.cern.ch/event/1536836/

      Today is a short meeting...let's try to finish in 30 minutes (BNL has an AHM)

    • 13:05 13:10
      OSG-LHC 5m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci

      Release

      • Newest HTCondor versions released yesterday resolve the issue that AGLT2 saw with HTCondor-CE 25
      • We are expecting a release this week for CVMFS and osg-configure. The latter has an important fix for new CE installations
      • N.B. HTCondor Python bindings v1 are not available in OSG 25
      • We are discussing internally the root causes of the various undiscovered issues in the initial OSG 25 release

      Miscellaneous

      • OSG Hub is going down for maintenance on Nov 18
      • We are working on migrating other OSG / PATh services from UChicago -> UW + NRP
    • 13:15 13:30
      Topical Presentation 15m

      ATLAS dCache Zpool reservation from 10%/11% to 5%

      Speaker: Robert Hancock
    • 13:20 13:40
      WBS 2.3.1: Tier1 Center
      Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
      • 13:20
        Tier-1 Infrastructure 5m
        Speaker: Jason Smith
      • 13:25
        Compute Farm 5m
        Speaker: Thomas Smith
        • Smooth T1 operations over the last week
        • Rough estimate for restoration of power to the downed racks: ~mid-December
        • Targeting upgrade of AT1 pool to HTCondor 25.0.3 within the next few weeks
          • (AT3, which is part of the shared pool, will be updated at a later time)
      • 13:30
        Storage 5m
        Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))

        dCache NFS v3 door service decommissioned

        No major issues to report

      • 13:35
        Tier1 Operations and Monitoring 5m
        Speaker: Ofer Rind (Brookhaven National Laboratory)
        • Updated to CVMFS 2.13.3
        • Working on Varnish VM deployment - not as straightforward as was advertised
    • 13:30 13:40
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
      • Pretty good running over the past two weeks.
        • AGLT2 doing rolling upgrades
        • MWT2 has condor_chirp running on the MWT2_TEST queue with the development pilot.
        • NET2 had some storage troubles that were triggered by large numbers of transfers to the tape storage. Also some failures in the past 24 hours.
        • OU is working on using cgroups v2 to stop jobs using too much memory. This requires changes in Slurm.
        • CPB is still updating some of their older storage servers to EL9, which is required to run recent versions of XRootD. In particular, the XRootD 5.9.x series is becoming available.
        • TW-FTT has solved a storage accounting problem which overstated the amount of available storage by a factor of two. Network transfers are working well again.
      • Several software updates are available. New versions of OSG 25 / HTCondor 25 (25.0.3) are ready to be installed. The new CVMFS version (2.13.3) should be installed soon at all sites as it has urgent bug fixes. As mentioned above, the XRootD team has released version 5.9.0.
      • There will be a planning discussion on spending the FY25 equipment funding on Friday at 10:00 am CST / 11:00 am EST.
        • Everyone is welcome including system administrators – we will need their technical expertise.
    • 13:40 13:50
      WBS 2.3.3 Heterogeneous Integration and Operations

      HIOPS

      Convener: Rui Wang (Argonne National Laboratory (US))
      • 13:40
        HPC Operations 5m
        Speaker: Rui Wang (Argonne National Laboratory (US))

        Perlmutter: 12.5K/30% CPU/GPU hours remain; stable

        • Doug will reduce the CPU job rate to reserve time for tasks such as Globus testing (w/ HEPCCE), HTCondor overlay batch system testing, etc.
      • 13:45
        Integration of Complex Workflows on Heterogeneous Resources 5m
        Speaker: Doug Benjamin (Brookhaven National Laboratory (US))
    • 13:50 14:10
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 13:50
        Analysis Facilities - BNL 5m
        Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
        • Reviewing ATLAS T3 user storage allocation policy
          • Begin to remove directories associated with deactivated accounts
          • To test dCache user quotas
          • Do we need NFSv4 mounts on AF worker nodes?
        • Testing and debugging the federated JupyterHub
          • Noticed a missing error_target configuration; will fix it
      • 13:55
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:00
        Analysis Facilities - Chicago 5m
        Speaker: Fengping Hu (University of Chicago (US))

        Triton Service

        • Productionized the service setup.

        • Added CVMFS paths for the model repository.

        • Configured explicit model loading for better control and resource management.

        ServiceX

        • Experienced an outage yesterday.

        • Root cause: a process in the app pod stopped retrying after 100 failed attempts to connect to the RabbitMQ service, preventing transform tasks from being dispatched.

        • Temporary fix: restarted the app pod.

        • Permanent fix: developers are updating the code to allow infinite retry attempts.
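
        The retry behavior described above can be illustrated with a minimal sketch. This is not the ServiceX code: the function name connect_with_retry, its parameters, and the backoff values are assumptions chosen for illustration. It retries a caller-supplied connect() callable forever by default, doubling the wait after each failure up to a ceiling so that a long broker outage does not produce a tight reconnect loop.

```python
import itertools
import time


def connect_with_retry(connect, max_attempts=None, base_delay=1.0,
                       max_delay=60.0, sleep=time.sleep):
    """Call connect() until it succeeds and return its result.

    Hypothetical helper, not taken from ServiceX.
    max_attempts=None retries forever; an integer keeps a hard cap.
    The delay doubles after each failure, up to max_delay seconds.
    """
    if max_attempts is not None and max_attempts < 1:
        raise ValueError("max_attempts must be None or >= 1")

    if max_attempts is None:
        attempts = itertools.count(1)          # unbounded: 1, 2, 3, ...
    else:
        attempts = range(1, max_attempts + 1)  # bounded: 1..max_attempts

    delay = base_delay
    for attempt in attempts:
        try:
            return connect()
        except ConnectionError:
            if max_attempts is not None and attempt == max_attempts:
                raise  # cap reached: give up and surface the error
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

        With max_attempts=100 the helper gives up after the 100th failure, which mirrors the behavior that caused the outage; leaving max_attempts at None keeps retrying until the broker is reachable again.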

    • 14:10 14:30
      WBS 2.3.5 Continuous Operations
      Conveners: Ivan Glushkov (Brookhaven National Laboratory (US)), Ofer Rind (Brookhaven National Laboratory)
      • 14:10
        ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Kaushik De (University of Texas at Arlington (US))
        • Condor_chirp working at MWT2 test queue in new pilot; to test at BNL
        • HTCondor/OSG/CVMFS updates in progress
        • Questions about VP queue operation, currently stopped at NET2
        • ADC
          • We rely on sites with limited network capacity to share the network fairly, and we have no way to control that (two independent FTS instances). We already have one such case (ROMA1).
          • Working with the HI folks on the setup for HI data taking with as open a trigger as possible.
          • SHA1 CAs - our software should not check the CA certificate (self-signed anyway). Still a bug in dCache to be fixed (dcache#7927)
      • 14:15
        Services DevOps 5m
        Speaker: Ilija Vukotic (University of Chicago (US))
        • XCache, VP - NTR
        • Varnish 
          • deprecated backup proxies at FNAL and CERN
          • got resources for CERN local Varnishes.
          • T0 nodes failing over. Operators informed.
          • still problematic: MPPMU, CYFRONET, BNL
          • TW is installing a CVMFS varnish proxy
        • Frontier 
          • next week we will have a tutorial session on how to manage Frontiers.
        • AF
          • every change in the documentation now triggers parsing into new MD files, followed by their re-import into OpenAI and reindexing in the ES vector store.
          • testing OpenAI evaluations, grading, prompt optimization.
      • 14:20
        Facility R&D 5m
        Speaker: Robert William Gardner Jr (University of Chicago (US))
      • 14:25
        Cybersecurity plan(s) 5m
        Speakers: Robert William Gardner Jr (University of Chicago (US)), Shigeki Misawa (Brookhaven National Laboratory (US))
    • 14:30 14:35
      AOB 5m