US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  993 2967 7148

Meeting password: 452400

Invite link: https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09

    • 13:00 13:05
      WBS 2.3 Facility Management News 5m
      Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))

      We need to prepare for (pre-)scrubbing. A WBS 2.3 L3 slide template has been shared: https://docs.google.com/presentation/d/1mU1eDQQxIE3Lm6qqZsLFPZ-EtmjrgbxXxggr61gJau4/edit?usp=sharing

        - The target is to have draft slides by June 9th (to be confirmed)

      The 5-year evolution spreadsheets for the Tier-2 facilities are in place but still need updates and final numbers

        - Each Tier-2 should be working on spending plans for a possible end-of-CA distribution (see the Tier-2 Spending Proposal)

        - Tier-2 managers will meet on Friday to discuss this

      HTC25 is fast approaching. We have a draft agenda started at https://agenda.hep.wisc.edu/event/2297/timetable/#20250605.detailed

        - Comments are welcome

      The LHCOPN/LHCONE meeting proposed shutting off IPv4 on the LHCOPN

        - The HEPiX IPv6 working group discussed this today; we want to see if ATLAS/BNL and CMS/FNAL are willing to try it, with the expectation that any IPv4 traffic fails over to LHCONE

        - Phil Demar is asking CMS and FNAL if they are willing to try this in the next month or two. Shawn is tasked with doing the same for ATLAS and BNL.
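
Before any IPv4 shutoff test, each site will want to confirm that its key endpoints are reachable over IPv6 alone. A minimal sketch in Python; the hostname below is a placeholder, not a real LHCOPN endpoint:

```python
#!/usr/bin/env python3
"""Check that a host is reachable over IPv6 only (no IPv4 fallback)."""
import socket

def ipv6_reachable(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Attempt a TCP connect using only AAAA records; IPv4 is never tried."""
    try:
        infos = socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # no AAAA record published for this host
    for family, socktype, proto, _canon, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as s:
                s.settimeout(timeout)
                s.connect(sockaddr)
                return True
        except OSError:
            continue  # try the next address, if any
    return False

if __name__ == "__main__":
    host = "example-se.bnl.gov"  # placeholder; substitute a real endpoint
    print(f"{host}: IPv6 {'OK' if ipv6_reachable(host) else 'UNREACHABLE'}")
```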

    • 13:05 13:10
      OSG-LHC 5m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
      • frontier-squid-5.10.1
        • Various security fixes (see https://frontier.cern.ch/dist/rpms/frontier-squidRELEASE_NOTES)
        • Log rotation fix
      • XRootD 5.8.2
        • Fixes one cause of failing HTTP GETs that show up in the logs as "close does not refer to an open file"
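
To gauge whether a site was affected before updating, one can count that message in the local XRootD logs. A minimal sketch; the log path is an assumed default and may differ per deployment:

```python
#!/usr/bin/env python3
"""Count the XRootD error fixed in 5.8.2 in a local log file."""
import re
from collections import Counter

LOG = "/var/log/xrootd/xrootd.log"  # assumed location; adjust per site
PATTERN = re.compile(r"close does not refer to an open file")

def count_errors(path: str) -> Counter:
    hits: Counter = Counter()
    with open(path, errors="replace") as fh:
        for line in fh:
            if PATTERN.search(line):
                day = line.split()[0]  # assumes a leading timestamp field
                hits[day] += 1
    return hits

if __name__ == "__main__":
    for day, n in sorted(count_errors(LOG).items()):
        print(f"{day}: {n} occurrence(s)")
```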
    • 13:10 13:30
      WBS 2.3.1: Tier1 Center
      Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
      • 13:10
        Tier-1 Infrastructure 5m
        Speaker: Jason Smith
      • 13:15
        Compute Farm 5m
        Speaker: Thomas Smith
      • 13:20
        Storage 5m
        Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
      • 13:25
        Tier1 Operations and Monitoring 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
    • 13:30 13:40
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
      • A lack of work caused significant disruption over the past week.
        • Over the weekend there were only SCORE_HIMEM jobs, which would not broker to a site unless meanRSS was set to 3000 MB. We set this at AGLT2 and MWT2 (and, I believe, at SWT2_CPB), and these sites refilled (see the queue-query sketch after this list).
        • There were also a large number of exotics group jobs that failed at all Tier-2 sites due to looping.
      • EL9 updates/FY24 equipment installs continue at MSU and UTA.
        • MSU believes that deploying Satellite will allow them to finish.
      • CPB has been struggling with zombie HTCondor entries.
        • There is a ticket open with the HTCondor team about the issue.
      • Tier-2 PIs will meet on Friday to discuss procurement, covering both FY25 funds and end-of-grant special funds.
      • I need to meet with Rafael about the pre-scrubbing slides.
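
For reference, the RSS attribute used in brokerage can be checked per queue against CRIC. A minimal sketch using the public ATLAS CRIC pandaqueue JSON endpoint; the attribute name 'meanrss' is an assumption and may not match the actual CRIC schema:

```python
#!/usr/bin/env python3
"""List PanDA queues whose RSS attribute admits 3000 MB SCORE_HIMEM jobs."""
import requests

CRIC_URL = "https://atlas-cric.cern.ch/api/atlas/pandaqueue/query/?json"

def himem_capable(threshold_mb: int = 3000) -> list:
    queues = requests.get(CRIC_URL, timeout=30).json()
    capable = []
    for name, queue in queues.items():
        try:
            meanrss = int(queue.get("meanrss"))  # assumed attribute name
        except (TypeError, ValueError):
            continue  # attribute missing or not set for this queue
        if meanrss >= threshold_mb:
            capable.append(name)
    return sorted(capable)

if __name__ == "__main__":
    for q in himem_capable():
        print(q)
```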
    • 13:40 13:50
      WBS 2.3.3 Heterogeneous Integration and Operations

      HIOPS

      Convener: Rui Wang (Argonne National Laboratory (US))
      • 13:40
        HPC Operations 5m
        Speaker: Rui Wang (Argonne National Laboratory (US))

        Perlmutter: both CPU and GPU usage stay above expectations

        TACC: bringing UCORE back online to finish the remaining allocation (in the flex queue)

        ACCESS: scheduling a chat with experts on setting up an HTCondor overlay cluster (a submit-side sketch follows)
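
From the submit side, an overlay boils down to pilot jobs that start a condor startd reporting back to a central pool. A minimal sketch with the HTCondor Python bindings; the wrapper script, pool host, and project name are hypothetical placeholders:

```python
#!/usr/bin/env python3
"""Submit one overlay pilot job via the HTCondor Python bindings."""
import htcondor

pilot = htcondor.Submit({
    "executable": "start_pilot.sh",  # hypothetical wrapper running condor_startd
    "arguments": "--central-manager cm.example.org",  # placeholder pool host
    "request_cpus": "32",
    "request_memory": "64GB",
    "+ProjectName": '"ACCESS-ALLOCATION"',  # placeholder accounting string
    "output": "pilot.$(Cluster).out",
    "error": "pilot.$(Cluster).err",
    "log": "pilot.log",
})

schedd = htcondor.Schedd()
result = schedd.submit(pilot, count=1)  # one pilot; scale count as needed
print("Submitted pilot cluster", result.cluster())
```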

      • 13:45
        Integration of Complex Workflows on Heterogeneous Resources 5m
        Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))
    • 13:50 14:10
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 13:50
        Analysis Facilities - BNL 5m
        Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
        • Managed to load the user name and group name while creating the pod, using an init container (see the sketch after this list)
          • The dCache NFS client still needs NFSv4 identity mapping to be configured properly
          • Further work is needed on idmap on the OpenShift worker node or pod
        • Testing of image push/pull to/from the SDCC Quay service is done
          • Customized the Alma9 base image and built and registered it on SDCC Quay.
        • Tom Smith is deploying the accounting monitoring that was missing from the A9 Tier-3 pool and interactive hosts
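
A minimal sketch of the init-container approach with the Kubernetes Python client, assuming an OpenShift-compatible context: the init container stages a passwd entry onto a shared emptyDir that the main container can merge at startup. Image names and the user entry are placeholders:

```python
#!/usr/bin/env python3
"""Create a pod whose init container stages user/group identity files."""
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster

init = client.V1Container(
    name="load-ids",
    image="registry.example/alma9:latest",  # placeholder image
    command=["sh", "-c",
             "echo 'alice:x:12345:6789::/home/alice:/bin/bash' > /mnt/ids/passwd"],
    volume_mounts=[client.V1VolumeMount(name="ids", mount_path="/mnt/ids")],
)

main = client.V1Container(
    name="app",
    image="registry.example/alma9:latest",  # placeholder image
    command=["sleep", "3600"],
    # The main container sees the staged files here and can merge them
    # into /etc/passwd at startup (e.g. via an entrypoint script).
    volume_mounts=[client.V1VolumeMount(name="ids", mount_path="/etc/extra-ids")],
)

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="idmap-test"),
    spec=client.V1PodSpec(
        init_containers=[init],
        containers=[main],
        volumes=[client.V1Volume(name="ids",
                                 empty_dir=client.V1EmptyDirVolumeSource())],
        restart_policy="Never",
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```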
      • 13:55
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:00
        Analysis Facilities - Chicago 5m
        Speaker: Fengping Hu (University of Chicago (US))
        • BinderHub service is now available to users (a link is up on the portal)
        • Started to look into Dask-Gateway/HTCondor queue integration (do we need to reinvent the wheel? see the sketch below)
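
On the reinvent-the-wheel question: dask-jobqueue already ships an HTCondorCluster that submits Dask workers as HTCondor jobs, which could serve as the backend piece under Dask-Gateway. A minimal sketch; resource numbers are placeholders:

```python
#!/usr/bin/env python3
"""Spin up Dask workers as HTCondor jobs via dask-jobqueue."""
from dask.distributed import Client
from dask_jobqueue import HTCondorCluster

cluster = HTCondorCluster(
    cores=4,        # cores per worker job
    memory="8GB",   # memory per worker job
    disk="10GB",    # scratch space per worker job
)
cluster.scale(jobs=5)  # submit five HTCondor worker jobs

client = Client(cluster)
# Trivial smoke test executed on the workers.
print(client.submit(sum, range(100)).result())  # -> 4950
```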
    • 14:10 14:25
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • 14:10
        ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
        • Preparing for deployment of FTS update at BNL (v14.0.1 to be released next week) - will allow for token testing during data challenge
        • Varnish at BNL now functional on OpenShift with Quay image; still some network routing to deploy
        • DDM moved the BNL VP queue XCache to the ESnet server
        • Ongoing discussions of Varnish deployment and management
        • CRIC permissions were updated
        • BNL-OSG2_DATADISK protocol priorities to be changed from 0 to null.
      • 14:15
        Services DevOps 5m
        Speaker: Ilija Vukotic (University of Chicago (US))
        • XCaches
          • Multiple issues in the UK cloud.
          • The ESnet XCache is operational, but no monitoring is coming from it.
          • Will try to build a new image this week.
        • VP
          • BNL_VP is trying to use the ESnet XCache
        • Varnish
          • Started building the neo_frontier infrastructure on the OpenStack k8s cluster at CERN
          • Asked SWT2 to deploy their own Varnish
          • All Varnishes were removed from WLCG monitoring; a dedicated Varnish monitoring meeting is set for Friday at 9:30 AM CST (a cache-probe sketch appears after this list)
        • CREST
          • NTR
        • ServiceX/Y
          • Updated all the components to 1.6.1
          • Testing the RDataFrame code generator and transformers
        • AF
          • Cleaned up images and their naming.
          • Added Python 3.12 to the login nodes.
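
For the Varnish work above, a lightweight health probe can confirm caching behavior from standard HTTP headers without relying on WLCG monitoring. A minimal sketch; the endpoint URL is a placeholder:

```python
#!/usr/bin/env python3
"""Probe a Varnish endpoint twice and report HTTP caching headers."""
import requests

URL = "http://varnish.example.org:6081/"  # placeholder endpoint

def probe(url: str) -> None:
    for attempt in (1, 2):
        r = requests.get(url, timeout=10)
        age = r.headers.get("Age", "0")    # nonzero Age implies a cache hit
        via = r.headers.get("Via", "-")    # proxies add themselves here
        print(f"attempt {attempt}: status={r.status_code} Age={age} Via={via}")

if __name__ == "__main__":
    probe(URL)
```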
      • 14:20
        Facility R&D 5m
        Speaker: Lincoln Bryant (University of Chicago (US))
        • Armada seems to be working locally on the stretched k8s, and we are investigating the auth components needed to send tasks to another cluster
        • We are actively debugging/trying to understand EOS user authentication. 
          • Kerberos is a nonstarter; X.509 might be tricky because the EOS containers are all EL7 (!), and we're trying to understand the CA/certificate situation
          • "Plain" OAuth2 is deprecated, with support shifting to SciTokens-based auth
          • It is not yet clear how to bridge the gap from Keycloak to SciTokens; still working on it (see the token-inspection sketch after this list)
        • Coffea Casa JupyterHub should be working at https://coffea-casa.hl-lhc.io/, with caveats:
          • You must already have a UChicago AF account to get your /home, /data, and access to HTCondor
          • Still working on:
            • General ATLAS users coming from IAM without a UChicago AF account
              • Only get Jupyter, no persistence
              • Probably will crash right now if you try it
            • HTCondor pool on the stretched cluster
            • Mounting NFS/Ceph over the WireGuard interface within K8S
              • Jupyter limited to UChicago nodes at the moment, where we can mount locally
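
On the Keycloak-to-SciTokens gap above: a first diagnostic is to fetch a token from Keycloak via the client-credentials grant and inspect whether its claims resemble the SciTokens profile (a 'ver' claim such as 'scitoken:2.0' and storage scopes). A minimal sketch; the realm URL, client ID, and secret are placeholders:

```python
#!/usr/bin/env python3
"""Fetch a Keycloak token and inspect its claims for SciTokens-style fields."""
import jwt  # PyJWT
import requests

TOKEN_URL = "https://keycloak.example.org/realms/eos/protocol/openid-connect/token"

resp = requests.post(TOKEN_URL, data={
    "grant_type": "client_credentials",
    "client_id": "eos-client",      # placeholder
    "client_secret": "CHANGE_ME",   # placeholder
}, timeout=10)
resp.raise_for_status()
access_token = resp.json()["access_token"]

# Decode without signature verification, purely to inspect the claims.
claims = jwt.decode(access_token, options={"verify_signature": False})

print("ver:  ", claims.get("ver", "<missing>"))    # SciTokens uses e.g. 'scitoken:2.0'
print("scope:", claims.get("scope", "<missing>"))  # expect e.g. 'storage.read:/...'
print("aud:  ", claims.get("aud", "<missing>"))
```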
    • 14:25 14:35
      AOB 10m