US ATLAS Computing Facility (Possible Topical)

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  993 2967 7148

Meeting password: 452400

Invite link:  https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09

 

 

    • 13:00 13:05
      WBS 2.3 Facility Management News 5m
      Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))

      Facility is exploring purchasing options to determine costs and timelines

      Meetings coming up:

        - HEPiX

        - LHCOPN/LHCONE

        - CHEP

      Work is progressing to debug SciTags support in dCache

      Genesis NOFO may be out Friday

      HTC26 planning is underway (Jun 9-12 in Madison). Please put a hold on your calendar. We will have 2 of the 4 days for US ATLAS focused meetings and discussion (pre-scrubbing, data challenges, purchasing, etc.).

      =============AI NOTES======================

      Quick recap

      The meeting focused on facility coordination and updates across the teams. Shawn provided updates on HTC26 planning, noting uncertainty about meeting space availability on Monday and the potential cost of renting a room. Brian reported on OSG-LHC activities, including Frontier Squid testing and issues with Google's certificate changes affecting X.509 client authentication. The Tier-1 report from Carlos indicated smooth operations, with ongoing work on the HTCondor-related Puppet code refactor and new access restrictions on the dCache NFSv4 door. Fred provided updates on Tier-2 operations, including procurement efforts and network equipment pricing. Rui reported on HPC operations and data transfer requests, while Qiulan discussed work on the JupyterHub authentication workflow. Fengping shared plans for a maintenance window on March 23 for system updates. Ofer provided updates on continuous operations, including data25 reprocessing tests and FTS4 testing with CERN. Ilija presented on analytics and services operations, including work on AI assistants for monitoring and a new Varnish caching setup. The meeting ended with a discussion of the FTS4 deployment timeline and testing requirements for the upcoming DC27 data challenge.

      Next steps

      • Shawn: Ask Janet what it would cost to book a meeting room on Monday for the HTC26/US ATLAS meeting and report back to the group.
      • Brian: Send a brief email to Kaushik (and CC Ivan) about the Google certificate changes and their possible impact on Harvester and other services; Kaushik to forward to the Harvester team.
      • Brian: Contact the CRIC team (cricDevs) to coordinate on topology fetching and authentication changes, and offer a ticket for further discussion.
      • Horst: Send an email to help@osghc.org to investigate possible impacts of the new certificate usage rules on their XRootD service.
      • Judith: Double-check with Rob and, if approved, send the post-mortem write-up regarding the recent incident to Ofer and the relevant group.
      • Ilija: Complete work on restoring user branch analytics data for Attila within the next week or two, pending final input from Attila.
      • Carlos: Send Ilija the contact information for the dCache group working on AI for operations.
      • Ivan: Keep Shawn and the group informed about the status and readiness of FTS4 for DC27, especially regarding the timeline for deployment at BNL.
      • Hiro (and BNL team): Provide a rough estimate (e.g., one week) for the time needed to deploy FTS4 at BNL once it is released, for DC27 planning.
      • Shawn (and/or relevant planning group): Consider and communicate if DC27 schedule needs to be adjusted based on FTS4 release and deployment timelines.

      Summary

      Meeting Space Planning and NOFO

      The team discussed upcoming meetings (HTC26, HEPiX, LHCOPN/LHCONE, and CHEP) and meeting space options for HTC26. Shawn reported that while Monday space is unavailable, they could either pay for a room on Monday or meet on Tuesday and Wednesday, with the possibility of using hotel space. Alexei suggested booking a hotel room, and Rafael humorously proposed meeting at the pier with beer and sausage. Shawn mentioned that the Genesis NOFO might be released on Friday, requiring a quick response from multiple organizations.

      US ATLAS Meeting and Updates

      The meeting covered several key topics. Alexei discussed uncertainty around whether the upcoming funding opportunity would be a NOFO or a different type of call, with potential for around 100 awards. Kaushik and Shawn discussed the need for a US ATLAS meeting, with Shawn still awaiting confirmation from Paolo and Verena about hosting it. Brian provided updates on OSG-LHC, including testing of Frontier Squid and recent changes to Google host certificates, which could affect various services. The Tier-1 report, presented by Carlos, highlighted smooth operations, the ongoing HTCondor-related Puppet code refactor, and new restrictions on dCache NFSv4 door access, along with some storage issues that were being addressed.

      Operational Updates and AI Monitoring

      The meeting covered updates on various operational aspects, including procurement, funding profiles, and maintenance schedules. Fred discussed ongoing procurement efforts and shared a cost-effective Arista network equipment quote. Rui reported on HPC operations, highlighting stable job production and requests for group disk storage. The team discussed updates to the analysis facility, including authentication workflow improvements and potential simplifications to the JupyterHub frontend. Fengping outlined a scheduled maintenance window for system updates at Chicago. Ofer and Ilija discussed AI monitoring efforts, with Ilija presenting an AI assistant for monitoring various services and communications. The team also addressed concerns about Varnish caching at SWT2 and discussed the timeline for deploying FTS4 in preparation for Data Challenge 27 (DC27). Ivan noted that FTS4 testing is ongoing, with plans to move production transfers to FTS4 once ready, and Shawn emphasized the importance of ensuring FTS4 meets the requirements for DC27.
    • 13:05 13:10
      OSG-LHC 5m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
    • 13:10 13:30
      WBS 2.3.1: Tier1 Center
      Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
      • 13:10
        Tier-1 Infrastructure 5m
        Speaker: Jason Smith
      • 13:15
        Compute Farm 5m
        Speaker: Thomas Smith
        • Mostly smooth operations on T1 for the last week
          • currently looking into some job failures
        • Nearly finished with the HTCondor-related Puppet code refactor: scope complete, but untested
          • This includes changes to the way HTCondor-related config files are distributed, which will remove the dependency on NFS for some of these files
          • Other improvements to the general management of the pool infrastructure
      • 13:20
        Storage 5m
        Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
        • No major issue to report
        • dCache NFSv4 door access restricted to a subset of FARM subnets; Tier-1 access excluded.
        • Storage issue on pool server dc282; disks are currently resilvering.
          dCache pools set to read-only mode.
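
As a rough illustration of the subnet-based restriction mentioned above, the following is a minimal sketch of an allow-list check in Python. It does not reflect how dCache itself is configured; the subnets shown are reserved documentation ranges standing in for the real FARM subnets, and the name `is_allowed` is hypothetical.

```python
import ipaddress

# Hypothetical allow-list: documentation ranges standing in for the
# subset of FARM subnets permitted to reach the NFSv4 door.
ALLOWED_SUBNETS = [
    ipaddress.ip_network("192.0.2.0/25"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_allowed(client_ip, allowed=ALLOWED_SUBNETS):
    """Return True if client_ip falls inside any allowed subnet."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in allowed)


if __name__ == "__main__":
    for ip in ("192.0.2.10", "192.0.2.200", "203.0.113.5"):
        print(ip, "allowed" if is_allowed(ip) else "denied")
```

A check like this can be handy for confirming which worker nodes are expected to lose access before a restriction is rolled out.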
      • 13:25
        Tier1 Operations and Monitoring 5m
        Speaker: Ofer Rind (Brookhaven National Laboratory)
        • No operational issues
        • Investigating a short HC exclusion from last night, as well as a question raised about another terminated job.
    • 13:30 13:40
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
      • Really good running recently with only a few minor outages.
        • AGLT2 saw a small reduction after a dCache upgrade.
        • NET2 had short offline periods caused by disks that had slow readback speed.
        • OU had a maintenance downtime.
        • TW-FTT was down last weekend for an annual power maintenance.
      • Working on procurement.
        • Waiting for Dell to provide access to test systems to benchmark compute systems.
        • Now that Shawn and Alexei have given us the expected amount of funding, we need to put details into our purchasing plans.
          • There will be another meeting to discuss this "soon".
        • In my own experience, the pricing for Arista networking is slightly down over the last year.
          • So now could be a good time to buy networking gear.
    • 13:40 13:50
      WBS 2.3.3 Heterogeneous Integration and Operations

      HIOPS

      Convener: Rui Wang (Argonne National Laboratory (US))
      • 13:40
        HPC Operations 5m
        Speaker: Rui Wang (Argonne National Laboratory (US))
        • Perlmutter: both about as expected, stable
          • Doug assigned group disk allocation for askPanDA
        • TACC: data transfer
          • waiting for information on Globus endpoint connection to dCache storage at BNL
        • Preparing FastChain workflow local test (EVNT-->DAOD) on Perlmutter
      • 13:45
        Integration of Complex Workflows on Heterogeneous Resources 5m
        Speaker: Doug Benjamin (Brookhaven National Laboratory (US))
    • 13:50 14:10
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 13:50
        Analysis Facilities - BNL 5m
        Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
        • Debugging the JupyterHub OAuth authentication workflow with CILogon.
      • 13:55
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:00
        Analysis Facilities - Chicago 5m
        Speaker: Fengping Hu (University of Chicago (US))
        • A maintenance window is scheduled for March 23.

        • During this maintenance period we will perform routine system updates, including upgrades to firmware, operating systems, Kubernetes, Rook/Ceph, and NVIDIA drivers, along with other standard infrastructure maintenance tasks.

        • Services will be temporarily unavailable during the maintenance window and are expected to be restored once the updates are completed.

    • 14:10 14:30
      WBS 2.3.5 Continuous Operations
      Conveners: Ivan Glushkov (Brookhaven National Laboratory (US)), Ofer Rind (Brookhaven National Laboratory)
      • 14:10
        ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Kaushik De (University of Texas at Arlington (US))
        • Started first tests of data25 pp reprocessing (DATREP-381). The start of the full reprocessing will be announced at the ADC weekly meeting.

        • Continuing FTS4 tests.

          • Need planning to ensure readiness and assess performance improvement for DC27
          • Currently planning for a CY Q4 release
          • How long do we need for BNL deployment once it's released?  A: less than a month
        • Run4 ADC TDR work ongoing
        • AGLT2 has switched off IPv4 on LHCONE - no issues observed
        • OU re-raised concerns about SAM/ETF accounting of scheduled downtimes - to be followed up next month
      • 14:15
        Services DevOps 5m
        Speaker: Ilija Vukotic (University of Chicago (US))
        • Analytics
          • Working with Attila on getting Event-loop to report branch accesses.
        •  XCaches
          • stable
          • Oxford had downtime
        • Varnish for Conditions
          • added a second instance for SWT2 - the only site having two :)
        • Frontiers
          • Switched US, CA, ND, DE clouds to use new Frontier servers (k8s based).
          • The rest of the sites will be switched tomorrow.
        • AI Assistance
          • Now have a monitoring/alerting system that can monitor the k8s cluster and individual applications (ServiceX/Y, HTCondor, Ceph, PanDA queue, Varnishes, user emails, Mattermost, ...) and send email, Slack, and Mattermost messages. Will add Discourse integration. Still growing in scope, and it is almost trivial to add other apps. Who will help me define a good prompt for checking dCache? Powered by Claude Sonnet. The price is not trivial: roughly 4 cents/check.
          • Investigating OpenAI gpt-5.4 that now also has computer use and tool search options.
          • NVIDIA is expected to announce an open source NemoClaw. Will give it a test once it's out.
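
To put the quoted price of roughly 4 cents per check in context, here is a minimal back-of-the-envelope sketch. Only the per-check cost comes from the notes; the number of monitored targets and the check frequency are illustrative assumptions, and `monthly_cost_usd` is a hypothetical helper.

```python
# Per-check cost quoted in the notes ("roughly 4 cents/check").
COST_PER_CHECK_USD = 0.04


def monthly_cost_usd(num_targets, checks_per_day, days=30,
                     cost_per_check=COST_PER_CHECK_USD):
    """Estimate the monthly spend for periodic AI-driven checks.

    num_targets and checks_per_day are assumptions for illustration,
    not figures from the meeting.
    """
    total_checks = num_targets * checks_per_day * days
    return total_checks * cost_per_check


if __name__ == "__main__":
    # Assumed scenario: 10 monitored services, each checked hourly.
    print(f"${monthly_cost_usd(10, 24):.2f} per 30 days")
```

Under these assumed numbers the cost scales linearly with both the number of services and the check frequency, so reducing hourly checks to every four hours would cut the estimate by a factor of four.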

         

      • 14:20
        Facility R&D 5m
        Speaker: Robert William Gardner Jr (University of Chicago (US))
        • Facility R&D Biweekly meeting last week (notes)
          • Discussion of UC AF security incident - Judith preparing a post mortem

          • Discussion of potential infrastructure reconfiguration

          • Strong progress on RP1 development

      • 14:25
        Cybersecurity plan(s) 5m
        Speakers: Robert William Gardner Jr (University of Chicago (US)), Shigeki Misawa (Brookhaven National Laboratory (US))
    • 14:30 14:40
      AOB 10m