US ATLAS Computing Facility (Possible Topical)

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  993 2967 7148

Meeting password: 452400

Invite link:  https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09

 

 

    • 13:00 13:05
      WBS 2.3 Facility Management News 5m
      Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))

      We have a lot going on in our Facilities!

      • Procurement and ops documents for Tier-2
      • 5-year planning development
      • Detailed plans for use of $3-4M for Tier-2 facility
      • Milestones and quarterly reporting updates due soon
      • Ongoing mini-challenges (jumbo frames, cloud storage, Scitags, IPv6-only, capacity testing, etc.)

       

      We continue the engagement with Trusted CI and have some homework from the #3 set to complete.

      The LHCOPN/LHCONE and HEPiX meetings were held during the last 3 weeks; both were very interesting and relevant for our facilities.

       

      Upcoming meetings include

      • WLCG/HSF meeting in Lyon in early May
      • HTC25 with joint USATLAS-USCMS meeting in Madison (June 2-6)
      • ATLAS S&C in early July
      • USATLAS Scrubbing in mid July
      • USATLAS workshop at Michigan in late July
    • 13:05 13:10
      OSG-LHC 5m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
      • Release
        • This week: Frontier Squid for ARM, vo-client removing old OpenShift IAM instances
        • Holding off on XRootD 5.8.0 in favor of XRootD 5.8.1 due to reported stacktrace
        • There was a request for cvmfs-2.12.7?
      • ARM integration tests are complete now!
      • Waiting on NET2 to configure a test Prometheus for the test cluster / namespace
    • 13:10 13:30
      WBS 2.3.1: Tier1 Center
      Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
      • 13:10
        Tier-1 Infrastructure 5m
        Speaker: Jason Smith
      • 13:15
        Compute Farm 5m
        Speaker: Thomas Smith
      • 13:20
        Storage 5m
        Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
      • 13:25
        Tier1 Operations and Monitoring 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
    • 13:30 13:40
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
      • Good running over the last couple of weeks.
        • Some cvmfs issues.
        • Short outage at NET2.
          • A certificate expired over a weekend. The certificate belonged to the Harvester and was for the VM servicing Kubernetes. 
        • CPB has been having trouble staying full.
          • Apparently the current job mixture is causing trouble.
      • EL9/FY24 purchases:
        • MSU still working on installation of EL9
          • They will install their FY24 equipment after their EL9 installation is working.
        • CPB is still working on converting their storage to Alma Linux 9.
          • So their FY24 storage is caught up in this.
      • Rafael and I are working on an email about the following documents:
        • The Jan-Mar quarterly reporting
        • The site procurement plan
        • The site operations plan
        • The 5-year planning document for each site
        • The proposed milestones for each site
      • Once we get this information, we will have a dedicated meeting to make sure that the regular and infrastructure planning is sensible and consistent.
        • This will serve as the kickoff for writing the infrastructure proposal.
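A weekend certificate expiry like the one noted for NET2 above is the kind of problem a simple scheduled check can flag in advance. The following is only a minimal sketch, not the check any site actually runs: the function names, the 14-day warning window, and the example dates are all hypothetical. It assumes the certificate's `notAfter` string is already available in the format Python's `ssl` module reports.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the number of days between `now` and the certificate's notAfter time.

    `not_after` uses the format reported by the ssl module,
    e.g. "May 10 00:00:00 2025 GMT".
    """
    expiry_epoch = ssl.cert_time_to_seconds(not_after)
    return (expiry_epoch - now.timestamp()) / 86400.0


def needs_renewal(not_after: str, now: datetime, warn_days: int = 14) -> bool:
    """True if the certificate expires within `warn_days` (hypothetical threshold)."""
    return days_until_expiry(not_after, now) < warn_days


if __name__ == "__main__":
    # Hypothetical dates, for illustration only.
    check_time = datetime(2025, 5, 1, tzinfo=timezone.utc)
    print(needs_renewal("May 10 00:00:00 2025 GMT", check_time))
```

Run from a daily cron job or a monitoring probe, a check of this shape would raise a warning well before a weekend expiry.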
    • 13:40 13:50
      WBS 2.3.3 Heterogeneous Integration and Operations

      HIOPS

      Convener: Rui Wang (Argonne National Laboratory (US))
      • 13:40
        HPC Operations 5m
        Speaker: Rui Wang (Argonne National Laboratory (US))
        • 41k CPU hours on Perlmutter are at risk. Plan to switch NERSC_Perlmutter to Premium.
        • TACC is resolving a filesystem issue (login nodes have not been accessible since 11:30).
      • 13:45
        Integration of Complex Workflows on Heterogeneous Resources 5m
        Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))

        Continuing to make slow but steady progress on Harvester - Globus Compute. Talk tomorrow in the WFMS meeting.

        Several issues with Globus Compute uncovered.

        Rucio + Globus progress is slow but steady. Noticed an issue with bulk transfer submissions not arriving; now working on testing deletion.

    • 13:50 14:10
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 13:50
        Analysis Facilities - BNL 5m
        Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
        • Interactive nodes for the AF changed from the spar nodes to attsub[01-08]
        • Create dCache user space for one AF user
      • 13:55
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:00
        Analysis Facilities - Chicago 5m
        Speaker: Fengping Hu (University of Chicago (US))
        • notebook service reorg 
          • putting up a BinderHub service that can launch all the existing notebooks offered via the homegrown JupyterLab service
            • intuitive user interface that is easy to navigate
            • Keycloak auth with multiple upstream ID providers; runs as a local user if an AF account can be matched.
            • dask-gateway integration with Analysis base images. 
            • will run in parallel with existing svc and retire old svc if it's well received. 
    • 14:10 14:25
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • 14:10
        ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
        • 75 new T1 servers installed and undergoing burn-in/commissioning
        • Retry of Jumbo frame test CERN Pilot -> BNL

        • New BNL FTS online.  One BNL SE has been moved to the new FTS as an initial step.

        • CVMFS issues at AGLT2
        • Site draining issues at SWT2 - to deploy a second CE for resiliency
      • 14:15
        Services DevOps 5m
        Speaker: Ilija Vukotic (University of Chicago (US))
        • XCaches
          • UK nodes fixed again
          • everything else works fine
        • VP
          • working fine
        • Analytics
        • Varnish
          • Varnish for Frontier deployed at Rome1. Serving Roma and Milano, second choice for all of the Italian sites
          • Deployed at PIC. Serving PIC, second choice for all of the Iberian peninsula sites.
          • Waiting on IN2P3-CC to set up one for France.
          • LRZ installed a "private" varnish instance.
          • Discussed with ESnet a possibility to have an instance in Boston.
          • Waiting on BNL to get one there.
          • We should decide on US approach. 
          • Uni Victoria is now using CF Varnish.
        • AI
          • testing how we could use MCP (Model Context Protocol) to expose our analytics/accounting data to AI models/clients.
        • ServiceX/Y
          • rewriting part of the site to use Server-Sent Events (SSE) over HTTP.
          • will change the client in the same way so that it does not poll S3 all the time.
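For context on the SSE item above: with Server-Sent Events the server pushes updates over one long-lived HTTP response (`text/event-stream`), so a client can react to events instead of repeatedly polling storage. The sketch below is not ServiceX code. It is a minimal parser for the standard event-stream wire format, and the event name and payloads in the usage example are hypothetical.

```python
from typing import Dict, Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per event from an iterable of text/event-stream lines.

    Handles only the `event` and `data` fields; `id` and `retry` are ignored
    in this sketch.
    """
    event_type, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the buffered event, if it has any data.
            if data:
                yield {"event": event_type, "data": "\n".join(data)}
            event_type, data = "message", []
            continue
        if line.startswith(":"):
            # Comment line, commonly used as a keep-alive.
            continue
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # the format strips one leading space
        if field == "event":
            event_type = value
        elif field == "data":
            data.append(value)


if __name__ == "__main__":
    # Hypothetical stream content, for illustration only.
    stream = [
        ": keep-alive\n",
        "event: status\n",
        "data: 3 of 10 files ready\n",
        "\n",
    ]
    for event in parse_sse(stream):
        print(event["event"], event["data"])
```

In a real client the lines would come from a streaming HTTP response rather than a list; the parsing step stays the same.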
      • 14:20
        Facility R&D 5m
        Speaker: Lincoln Bryant (University of Chicago (US))
        • Debugging slow network setup time on stretched cluster. Symptoms are network timeouts for the first minute or so of the container being up.
          • Seems to be related to Calico's utilization of XDP (eXpress Data Path) ... Calico keeps trying and failing to clean up XDP programs on the loopback device?? 
          • Disabled it and things seem OK now, but not clear if disabling will degrade overall performance
        • Work with Armada continues
        • Work with Jupyter (via Coffea Casa? TBD) continues 
    • 14:25 14:35
      AOB 10m