US ATLAS Computing Facility (Possible Topical)

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  993 2967 7148

Meeting password: 452400

Invite link:  https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09

 

 

    • 13:00 13:05
      WBS 2.3 Facility Management News 5m
      Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))

      We need to complete our Tier-2 planning and use of possible end-of-CA funds before the end of this month

      As a facility, we should be thinking about how some of the work we do might be enhanced/improved by the use of AI/ML, since there may be funding options in the future

      Today is the ESnet blueprint meeting from 2:30-3:30 PM Eastern, with topics:

      • Tier-2 updates
      • IPv6-only LHCOPN?
      • System tuning work (capability challenge?)
      • DC27 plans

       

      Trusted CI engagement continuing (5th meeting tomorrow; US ATLAS one-on-one meeting next Wednesday)

    • 13:05 13:10
      OSG-LHC 5m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci

      Release (this week)

      • XRootD 5.8.1
      • IGTF 1.135
        • Updated SlovakGrid trust anchor with extended validity (SK)
        • Withdrew the discontinued HPCI CA (JP)
    • 13:10 13:30
      WBS 2.3.1: Tier1 Center
      Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
      • 13:10
        Tier-1 Infrastructure 5m
        Speaker: Jason Smith
      • 13:15
        Compute Farm 5m
        Speaker: Thomas Smith (BNL)
        • Weather-related power dip on May 2nd; approximately ⅓ of the Tier 1 (by core count) was affected

          • Lost power to 1 row of compute for a few minutes at ~02:30 (Eastern time)

          • Received notification; onsite work recovered the lost portion of the Condor pool

          • ~99% recovery completed by 05:00, 100% recovery by 10:00

        • Initial testing has begun on a revised HTCondor memory (cgroups) configuration that should better protect worker nodes (EPs) from having their memory completely exhausted (a hedged sketch follows below)

          • These changes don’t affect the Tier 1 (yet) but are on the horizon; they are currently being rolled out on one of our other pools
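          A minimal sketch of what such a configuration could look like, built from stock HTCondor knobs; the actual settings in use at BNL were not shown, and the file name and values below are assumptions:

            # Hypothetical /etc/condor/config.d/99-cgroup-memory.conf on an EP.
            # A hard cgroup limit lets the kernel kill a job that exceeds its
            # provisioned memory instead of letting it exhaust the whole node.
            CGROUP_MEMORY_LIMIT_POLICY = hard
            # Hold back some memory (in MB) for the OS and HTCondor daemons.
            RESERVED_MEMORY = 4096
            # Apply on the node with: condor_reconfig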

         

        Also Storage (added here since I don't have permissions to add to that slot):

        • Infrastructure

          • The local Puppet class for the OSG CAs has been improved, enhancing CRL updates and repository management.

        • Monitoring

          • Integration of various dCache components into the ELK infrastructure is underway.

            • Pools are currently being integrated to complete the Filebeat deployment (a hedged config sketch is below).
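          A hedged sketch of the Filebeat piece; the pool log path and output endpoint are illustrative assumptions, not the actual BNL deployment:

            # Hypothetical filebeat.yml fragment shipping dCache pool logs to ELK.
            filebeat.inputs:
              - type: filestream
                id: dcache-pool-logs
                paths:
                  - /var/log/dcache/*.log      # pool log location is an assumption
            output.logstash:
              hosts: ["logstash.example.bnl.gov:5044"]   # placeholder endpoint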

         

      • 13:20
        Storage 5m
        Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
      • 13:25
        Tier1 Operations and Monitoring 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
    • 13:30 13:40
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
      • Good running in the past two weeks.
        • There were small draining incidents at most sites.
      • Sites have mostly submitted Operations and Procurement plans.
        • One Procurement plan is still outstanding.
        • A few of us will discuss how to proceed later today and there will be a meeting on Friday with the sites.
      • EL9 upgrade / FY24 equipment install continues at MSU and UTA.
      • Discussed Varnish/NRP at today's daily meeting.
      • I will check with Valentin Voikl whether the recommended cvmfs version is 2.12.7.
        • The OSG repository only has cvmfs 2.12.6, which on the client side should behave about the same (a quick version-check sketch follows this list).
      • The NET2 tape system has been having difficulty keeping up with large bursts of requests submitted all at once.
        • Otherwise the tape system is working well.
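      As a quick cross-check of the versions mentioned above, something like the following can be run on a client (standard commands; the 'osg' repository id is an assumption about the local yum setup):

        rpm -q cvmfs                                  # version currently installed
        cvmfs_config stat atlas.cern.ch | head -n 2   # version the running client reports
        yum --disablerepo='*' --enablerepo='osg' list available 'cvmfs*'   # what OSG ships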
    • 13:40 13:50
      WBS 2.3.3 Heterogeneous Integration and Operations

      HIOPS

      Convener: Rui Wang (Argonne National Laboratory (US))
      • 13:40
        HPC Operations 5m
        Speaker: Rui Wang (Argonne National Laboratory (US))
        • Perlmutter: stable week
        • TACC: LSCP people suggested we use the 'flex' queue, which is charged at 0.8× the normal rate (instead of 1×)
      • 13:45
        Integration of Complex Workflows on Heterogeneous Resources 5m
        Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))
    • 13:50 14:10
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 13:50
        Analysis Facilities - BNL 5m
        Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
        • Worked on mounting GPFS and dCache storage in a pod with proper permissions on OpenShift (a hedged pod-spec sketch follows this list)

          • Successfully mounted both GPFS and dCache storage within a pod.

          • The pod is configured to run as a non-root user using securityContext with the assigned UID and GID, ensuring correct access control.

          • Read/write operations on both storage systems work as expected.
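          A hedged sketch of the pod-level settings described above; the UID/GID values, image, and volume definitions are illustrative assumptions, not the actual BNL configuration:

            # Hypothetical pod spec (apply with: oc apply -f pod.yaml)
            apiVersion: v1
            kind: Pod
            metadata:
              name: af-storage-test
            spec:
              securityContext:
                runAsUser: 12345       # assigned non-root UID (assumption)
                runAsGroup: 6789       # assigned GID (assumption)
                fsGroup: 6789          # group ownership applied to mounted volumes
              containers:
                - name: test
                  image: registry.access.redhat.com/ubi9/ubi
                  command: ["sleep", "infinity"]
                  volumeMounts:
                    - name: gpfs
                      mountPath: /gpfs
                    - name: dcache
                      mountPath: /dcache
              volumes:
                - name: gpfs
                  hostPath:
                    path: /gpfs                      # host path is an assumption
                - name: dcache
                  persistentVolumeClaim:
                    claimName: dcache-pvc            # claim name is an assumption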

      • 13:55
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:00
        Analysis Facilities - Chicago 5m
        Speaker: Fengping Hu (University of Chicago (US))
        • Thanos deployed for long-term metrics persistence (a hedged sidecar sketch follows this list)
        • Finalizing BinderHub configurations
        • Harbor proxy cache deployed
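        A hedged sketch of the Thanos piece; the Prometheus URL, TSDB path, and bucket config location are assumptions about the local setup:

          # Hypothetical Thanos sidecar invocation: uploads Prometheus TSDB blocks to
          # object storage for long-term retention and serves them to Thanos queriers.
          thanos sidecar \
            --prometheus.url=http://localhost:9090 \
            --tsdb.path=/prometheus \
            --objstore.config-file=/etc/thanos/bucket.yml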
    • 14:10 14:25
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • 14:10
        ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
        • All BNL transfers migrated to new FTS; old service to be decommissioned in a few weeks
        • Some squid and varnish issues
        • BGP tagging confirmed at NET2
        • HTCondor update ongoing at AGLT2
      • 14:15
        Services DevOps 5m
        Speaker: Ilija Vukotic (University of Chicago (US))
      • 14:20
        Facility R&D 5m
        Speaker: Lincoln Bryant (University of Chicago (US))
        • Diagnosing Coffea Casa deployment issues ongoing
          • Duplicate resource errors when spawning notebooks
        • EOS deployment ongoing
          • Cluster up with 2PB of storage across several 90TB arrays (retired MWT2 storage)
          • Need to understand options available for authentication. Don't really want to run Kerberos
          • Gathering a list of issues/tweaks/workarounds with the Helm charts; would like to meet with the developers at some point to discuss further
        • Experimenting with WireGuard 'routing node' features (a hedged routing sketch follows the tracepath output below)
          • Don't have to install WireGuard on all nodes, but a node can be a NAT between the WG network and a private LAN
            • Demonstrated connectivity from, e.g., umich001 to the UChicago AF NFS server via the WG network [1]
              • Was also able to mount /home and it seems to work; 200 MB/s read/write is not great, but probably due to MTU ~1500 as Aidan/Judith observed
        • Sent Wei a demonstration `podman` command to create a pod to join the WG network 
        • Tested an HTCondor glidein on NET2 (ostensibly to connect back to the UChicago AF); it caused HTCondor to segfault :)
        • Kuantifier discussion tomorrow, use Facility R&D link:

         

        [1]

        [root@umich001 ~]# tracepath 192.168.240.133
         1?: [LOCALHOST]                      pmtu 1280
         1:  100.81.190.82                                         6.515ms 
         1:  100.81.190.82                                         6.275ms 
         2:  192.168.240.133                                       6.750ms reached
             Resume: pmtu 1280 hops 2 back 2 
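        A minimal sketch of the routing-node idea, under stated assumptions: wg0 is the WG interface, 100.81.0.0/16 is the WG overlay (matching the addresses in the tracepath above), 192.168.240.0/24 is the private LAN, and eth0 is the LAN-facing NIC:

        # On the routing node: forward and NAT traffic from the WG overlay into the
        # private LAN, so LAN hosts need no WireGuard installation of their own.
        sysctl -w net.ipv4.ip_forward=1
        iptables -t nat -A POSTROUTING -s 100.81.0.0/16 -o eth0 -j MASQUERADE
        # On remote WG peers: widen AllowedIPs so the LAN is routed via this node, e.g.
        #   [Peer]
        #   PublicKey  = <routing-node public key>
        #   AllowedIPs = 100.81.190.82/32, 192.168.240.0/24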
    • 14:25 14:35
      AOB 10m