US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

 

 

    • 13:00 13:05
      WBS 2.3 Facility Management News 5m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))

      Very good meeting last week at BNL discussing the new Tier-1 organization and storage items with the dCache team: https://indico.bnl.gov/event/22078/

        - We will plan to start a new site-networking-focused meeting to bring in site/campus network people. There is an existing weekly meeting on Thursdays at 10 AM, and we can shift its focus to campus/site network information exchange once per month.

      Reminder that CHEP abstracts are due May 10 https://indico.cern.ch/event/1338689/page/31560-call-for-abstracts

      HEPiX is in a few weeks; consider attending and submitting an abstract:  https://indico.cern.ch/event/1377701/

      Quarterly reports are due Friday, April 19, 2024:  https://atlasreporting.bnl.gov/

        -  We need to review and update milestones as well.   

        - Please suggest any new milestones, or let Rob and me know if there are milestones to retire/remove

      Updates are needed for the upcoming IAM changes. Tickets were issued to non-US sites, presumably on the assumption that OSG would coordinate this for US sites. We should discuss plans with OSG.

        - VOMS configuration changes for LHC Experiments https://ggus.eu/index.php?mode=ticket_info&ticket_id=165668

        - Token configuration ticket https://ggus.eu/index.php?mode=ticket_info&ticket_id=165816

        - Timeline document at https://docs.google.com/document/d/1onp_qMOvE5s9byaDF9L2Fx1LIVd2smUtNHwKa7ejnJA/edit#heading=h.7vqi4tau13n6 

      We need to continue to look at the results and data from DC24, trying to identify issues that can be resolved by configuration, architecture, software and/or hardware changes.

       

    • 13:05 13:10
      OSG-LHC 5m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
    • 13:10 13:25
      WBS 2.3.1: Tier1 Center
      Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
      • 13:10
        Infrastructure and Compute Farm 5m
        Speaker: Thomas Smith
      • 13:15
        Storage 5m
        Speaker: Jason Smith
      • 13:20
        Tier1 Services 5m
        Speaker: Ivan Glushkov (University of Texas at Arlington (US))
        • Farm
          • Alma9 + Condor23 transition: testing the full job submission chain
          • IPv6 transition: Testing a script for automatic node conversion to IPv6 (a minimal readiness-check sketch follows this report)
          • CVMFS: No more errors at BNL. Waiting for the new pilot release to get better monitoring
          • Lower job efficiency at BNL due to more than half of the cluster being filled with user analysis jobs.
          • HammerCloud blacklisting event due to switch problem at CERN (OTG0149318) did not affect BNL
        • Storage
          • Filled tape pools detected today. Solved.
        • Misc
          • Confirmed pledged resource delivery for 2024
        • GGUS:
          • GGUS:165929: Transfer failures. Solved.
          • GGUS:165532: Post-DC24 test ticket
            • A saw-tooth pattern observed in the throughput is yet to be understood; the cause is not at BNL.
          • GGUS:164216: The CMS request for running test jobs on BNL T1 slots
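        As a minimal sketch of the kind of pre-conversion check such an IPv6 script might perform (the target host, port, and helper names below are illustrative assumptions, not the actual BNL script):

        # Sketch: check a node's IPv6 readiness before converting it.
        import socket

        def has_aaaa_record(hostname: str) -> bool:
            """True if the hostname resolves to at least one IPv6 address."""
            try:
                return bool(socket.getaddrinfo(hostname, None, socket.AF_INET6))
            except socket.gaierror:
                return False

        def can_connect_ipv6(host: str = "www.cern.ch", port: int = 443, timeout: float = 5.0) -> bool:
            """True if an outbound IPv6 TCP connection to host:port succeeds."""
            try:
                family, socktype, proto, _, sockaddr = socket.getaddrinfo(
                    host, port, socket.AF_INET6, socket.SOCK_STREAM)[0]
                with socket.socket(family, socktype, proto) as s:
                    s.settimeout(timeout)
                    s.connect(sockaddr)
                return True
            except OSError:
                return False

        if __name__ == "__main__":
            node = socket.getfqdn()
            print(f"{node}: AAAA={has_aaaa_record(node)} outbound_ipv6={can_connect_ipv6()}")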
    • 13:25 13:35
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • Reasonable running
        • MWT2 IU site had a painful network refresh that caused some loss of production
        • NET2 has been struggling with network errors.
        • CPB is being asked by ADC to retire LSM, but this is non-trivial. The issue is affecting production at the CPB Kubernetes site.
      • End of quarter reporting: please update the following (if needed):
        • Site Capacity sheet: https://docs.google.com/spreadsheets/d/1nZnL1kE_XCzQ2-PFpVk_8DheUqX2ZjETaUD9ynqlKs4
        • Site Evolution sheet: https://docs.google.com/spreadsheets/d/1YjDe4YdApHoB5_HbDnNwrG-ceJP3amNWMb_VzQEaxGI
        • Site Services sheet: https://docs.google.com/spreadsheets/d/1_fKB6GckfODTzEvOgRJu9sazxICM_RN95y039DZHF7U
      • At Rob's request, I have created a sheet to track progress on the open issues (most but not all are milestones):
        • https://docs.google.com/spreadsheets/d/1CHpVHqnLJz0dNfXh-v4GYSOq0ez9n6SPmF3hJ9iMflY
        • Please check your section. I will tend to the tier 2 items. Ofer will track the items for the tier 1.
        • If there are items that are delayed, we need to know. In particular, high-level milestones visible to the funding agencies need to be handled carefully. If you are delayed by something out of your control (e.g., you cannot order equipment before the funding agency delivers the funding), those delays will not count against your site.

       

    • 13:35 13:40
      WBS 2.3.3 HPC Operations 5m
      Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Rui Wang (Argonne National Laboratory (US))

      TACC

      • Harvester is up and running; the queue is set for testing
        • Jobs failed due to input file validation
          • Checksum matches between the local copy and the one in Rucio (a verification sketch follows this list). No issue was seen when reading the linked file locally or via the debug queue
          • Added the binding area of the local datadisk in CRIC
          • Requesting testing task
        • Updated the pilot version to 3.7.2.4
        • Very long queuing time (~4 days) before being set for testing
        • Need a TACC-specific test request.
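      A minimal sketch of the checksum comparison mentioned above, assuming Rucio's usual zero-padded lowercase adler32 hex format; the file path and checksum value are hypothetical:

      # Sketch: compare a local file's adler32 with the value recorded in Rucio.
      import zlib

      def local_adler32(path: str, chunk_size: int = 1024 * 1024) -> str:
          """Stream the file through zlib.adler32; return a zero-padded lowercase hex digest."""
          value = 1  # adler32 seed
          with open(path, "rb") as f:
              for chunk in iter(lambda: f.read(chunk_size), b""):
                  value = zlib.adler32(chunk, value)
          return f"{value & 0xFFFFFFFF:08x}"

      def matches_rucio(path: str, rucio_adler32: str) -> bool:
          """Case-insensitive comparison against the checksum reported by Rucio."""
          return local_adler32(path) == rucio_adler32.strip().lower()

      if __name__ == "__main__":
          # Hypothetical values; replace with the real input file and its Rucio metadata.
          print(matches_rucio("/scratch/datadisk/EVNT.12345._000001.pool.root.1", "3c8e21af"))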

       

      NERSC

      • Running Harvester with older pilot (3.7.2.4) - above the uniform usage line
      • Testing the latest pilot (version 3.7.3.9) - currently all jobs fail in production. Using a test queue.
      • Need to decide if we want to make the GPUs available to ATLAS
    • 13:40 13:55
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 13:40
        Analysis Facilities - BNL 5m
        Speaker: Ofer Rind (Brookhaven National Laboratory)
        • New disk storage for the Tier-3 GPFS update has arrived and is being installed; no downtime is expected and it should be transparent to users, except that they might notice fluctuations in available space
      • 13:45
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 13:50
        Analysis Facilities - Chicago 5m
        Speakers: Fengping Hu (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))

         

        The AB-stable and AB-dev images have been updated. AB-dev has the latest versions of uproot, awkward, and dask-awkward. Image building has been updated so that it always gets the correct dask workers.
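        For reference, a quick check that can be run inside the updated AB-dev image to confirm the stack versions (nothing image-specific is assumed beyond the packages named above):

        # Print the installed versions of the analysis-stack packages in the image.
        import awkward
        import dask_awkward
        import uproot

        for mod in (uproot, awkward, dask_awkward):
            print(f"{mod.__name__:15s} {mod.__version__}")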

        HTCondor worker auto-scaling has been configured for both the long- and short-queue workers; either queue can now scale out when needed. This should help avoid worker nodes in one queue sitting idle while jobs in the other are pending on resources. It does increase scaling activity, which we are trying to keep as lightweight as possible: worker startup currently includes a user-provisioning step that takes a few minutes, and in the future we hope to back user accounts with LDAP to avoid that cost.
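        As an illustration of the auto-scaling idea only, not the AF's actual implementation (the deployment and namespace names are hypothetical), a periodic job could size the worker Deployment from the idle-job count in the schedd:

        # Sketch: scale an HTCondor worker Deployment based on idle jobs in the pool.
        # Assumes the htcondor and kubernetes Python bindings and a local kubeconfig;
        # the names below are placeholders.
        import htcondor
        from kubernetes import client, config

        def idle_jobs() -> int:
            """Count idle (JobStatus == 1) jobs in the local schedd."""
            schedd = htcondor.Schedd()
            return len(schedd.query(constraint="JobStatus == 1", projection=["ClusterId"]))

        def scale_workers(replicas: int,
                          deployment: str = "htcondor-worker-short",  # hypothetical name
                          namespace: str = "af") -> None:             # hypothetical namespace
            """Resize the worker Deployment to the requested replica count."""
            config.load_kube_config()
            client.AppsV1Api().patch_namespaced_deployment_scale(
                name=deployment, namespace=namespace,
                body={"spec": {"replicas": replicas}},
            )

        if __name__ == "__main__":
            pending = idle_jobs()
            # Roughly one extra worker per few idle jobs, capped to keep scaling lightweight.
            scale_workers(min(max(pending // 4, 1), 20))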

    • 13:55 14:10
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • Fred and I are beginning to track OS and other updates as in previous years (spreadsheet)
      • CVMFS mount errors now identifiable using new wrapper message on Harvester dashboard; worker node information to be added in next pilot update
      • Hiro and Mark have updated and deployed the site-wide networking script (it corrects the traffic direction, which was previously flipped in/out)
      • XRootd 5.6.9 deployment for ATLAS production - held up by SWT2_CPB_K8S
      • SWT2_CPB, OU site network monitoring?  (GGUS,GGUS)
      • ATLAS considering site exclusions based on unavailability of a certain fraction of data
      • 13:55
        ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Ivan Glushkov (University of Texas at Arlington (US))
        • Tape downtimes - should be tested.
        • Proposing a policy for sites: all queues should be set down if a certain fraction of files is inaccessible (for example > 20%) (to be discussed at the WLCG coordination meeting)
        • CVMFS check
        • Post DC24 T0-T1 tests conducted
      • 14:00
        Service Development & Deployment 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
        • XCache 
          • still issues with several nodes.
          • restarts help temporarily 
          • will be testing the new version at MWT2 and AGLT2
        • VP - all working fine
        • Varnish caches - all working fine
        • analytics and monitoring
          • working on getting back FTS stream
          • some improvements to the Alarm And Alert Service
        • ServiceX
          • improvements in reliability, performance, logging, user interface
          • testing new client
        • ServiceXLite
          • now running full time at FAB, River, NRP.
      • 14:05
        Facility R&D 5m
        Speakers: Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        Kubernetes Tutorial/Hackathon - please sign up by Friday, April 5, especially if you plan to attend in person. We will send an email out to this effect.

         

        A multi-site stretched cluster has been assembled with Kubespray, using WireGuard as the fundamental network layer.

         

        WireGuard is a VPN technology. We can assemble a VPN mesh that encrypts all internal cluster traffic and requires a site to expose only one UDP port to the public internet for the most essential connectivity. WireGuard is built into the Linux kernel (v5.6 and above?) and creates a private interface on each node, so to Kubernetes everything appears to be on one private network. However, we need to understand what it looks like to expose public services. Public-facing services where we can, and tunneled private traffic where we have to?

         

        Wireguard config example:

        [Peer]
        PublicKey = xxjmp6WyT7IU/9hffUjyV0uj8sfYzR6G3C/I3yt+Qxk= # Elliptic curve public key
        AllowedIPs = 192.168.0.6/32 # INTERNAL IP assigned to the 'wg0' interface 
        Endpoint = 192.41.231.216:51820 # EXTERNAL IP and UDP port assigned for negotiating Wireguard traffic
        PersistentKeepalive = 30 # Periodic ping between nodes to keep the connection alive

        [Peer]
        PublicKey = oVVQuMR2hHCW+a5y0w4BS9ySOQK2pp8Tkba4RP5TByM=
        AllowedIPs = 192.168.0.7/32
        Endpoint = 192.41.237.213:51820
        PersistentKeepalive = 30

        [Peer]
        PublicKey = BFh6AaxOf8rmDE68BtRcdcEIrQRrx6TklfZozLm3d28=
        AllowedIPs = 192.168.0.8/32
        Endpoint = 206.12.98.227:51820

         

        Kubespray config sample - each node is labeled with its CRIC site as well as the institution where it sits (a label-selector usage sketch follows the kubectl output below):


        # ...
           uchicago005.hl-lhc.io:
             ansible_host: 192.168.0.5
             ip: 192.168.0.5
             access_ip: 192.168.0.5
             node_labels:
               site: mwt2
               institution: uchicago
           umich001.hl-lhc.io:
             ansible_host: 192.168.0.6
             ip: 192.168.0.6
             access_ip: 192.168.0.6
             node_labels:
               site: aglt2
               institution: umich
           msu001.hl-lhc.io:
             ansible_host: 192.168.0.7
             ip: 192.168.0.7
             access_ip: 192.168.0.7
             node_labels:
               site: aglt2
               institution: msu
           uvic001.hl-lhc.io:
             ansible_host: 192.168.0.8
             ip: 192.168.0.8
             access_ip: 192.168.0.8
             node_labels:
               site: uvic
               institution: uvic
        # ...

         

        Kubectl:

        [root@uchicago002 ~]# kubectl get nodes
        NAME                    STATUS   ROLES           AGE     VERSION
        msu001.hl-lhc.io        Ready    <none>          6d21h   v1.28.6
        uchicago002.hl-lhc.io   Ready    control-plane   6d21h   v1.28.6
        uchicago003.hl-lhc.io   Ready    control-plane   6d21h   v1.28.6
        uchicago004.hl-lhc.io   Ready    control-plane   6d21h   v1.28.6
        uchicago005.hl-lhc.io   Ready    <none>          6d20h   v1.28.6
        umich001.hl-lhc.io      Ready    <none>          6d21h   v1.28.6
        uvic001.hl-lhc.io       Ready    <none>          6d21h   v1.28.6
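        Illustrative use of the site/institution labels above (assumes a kubeconfig pointing at this cluster): tooling or workloads can select the nodes belonging to a given site.

        # List the nodes of one site by label, using the Kubernetes Python client.
        from kubernetes import client, config

        config.load_kube_config()
        v1 = client.CoreV1Api()

        for node in v1.list_node(label_selector="site=aglt2").items:
            labels = node.metadata.labels or {}
            print(node.metadata.name, labels.get("site"), labels.get("institution"))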

         

         

    • 14:10 14:20
      AOB 10m