
US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

 

 

    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
    • 13:20 13:50
      Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
    • 13:50 13:55
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))

      No major issue

      Preparing for the pre-scrubbing

    • 13:55 14:15
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • Pretty good two weeks.
        • MWT2 drained while testing the new ALRB release because of an issue with passing environment variables to the jobs. This is now fixed and the new ALRB seems good. Asoka wants to test for another few days.
        • There was an incident last night affecting all ATLAS sites except those on NorduGrid.
           
      • 13:55
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)

         

        The UM site had IPv6 issues after the hardware maintenance by Merit, so we had to take the UM condor cluster offline to avoid failing more jobs. Merit resolved the issue the next day.

        We found a subdirectory ownership issue on the condor-ce that had been causing about 20% of SAM test jobs to fail (the site had only 75% reliability and availability in May). The ownership issue was introduced in late April while we were trying to fix the ownership of the gratia directories.
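
        An ownership regression like the one described above can be caught with a recursive scan. The following is a minimal sketch, not the procedure AGLT2 used; the function name, the example directory, and the expected owner are all hypothetical, since the minutes do not name the affected subdirectory or its intended owner.

```python
from pathlib import Path


def find_wrong_owner(root, expected_uid):
    """Return sorted paths under root (root included) whose owner differs.

    root and expected_uid are supplied by the caller; nothing here is
    specific to condor-ce or gratia.
    """
    root = Path(root)
    candidates = [root, *root.rglob("*")]
    # lstat() inspects the entry itself rather than a symlink target.
    return sorted(str(p) for p in candidates if p.lstat().st_uid != expected_uid)


# Hypothetical usage (the path and the account name are assumptions):
#   import pwd
#   uid = pwd.getpwnam("condor").pw_uid
#   for path in find_wrong_owner("/path/to/ce/subdirectory", uid):
#       print(path)
```

        Running such a check after any bulk ownership change would flag directories that were altered unintentionally.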

         

      • 14:00
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
        • Still troubleshooting downed storage machine.
          • Declared some more lost files.
        • Retiring ANALY_MWT2_GPU
        • Waiting on updating to condor 9.0.13 until we work out an issue with the condor-externals RPM removal causing jobs to fail.
        • Working on fixing an issue where the IU squid restarts periodically.
        • Running ALRB testing version on condor
      • 14:05
        NET2 5m
        Speaker: Prof. Saul Youssef
      • 14:10
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        UTA:

        Deployed new storage (3.2 PB) and retired ~600 TB of the oldest storage.

        Cabled, installed, and characterized the R6525 nodes; now bringing them into the batch system. They provide 48 nodes x 96 slots (1455 HEPSpec per machine).

        IPv6 testing is beginning.
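
        For reference, the aggregate capacity implied by the R6525 figures above can be worked out directly. This is a sketch using only the three numbers quoted (48 nodes, 96 slots per node, 1455 HEPSpec per machine); the function and key names are illustrative.

```python
def capacity(nodes, slots_per_node, hs_per_node):
    """Derive totals from per-node figures."""
    return {
        "total_slots": nodes * slots_per_node,
        "total_hepspec": nodes * hs_per_node,
        "hepspec_per_slot": hs_per_node / slots_per_node,
    }


# Figures quoted in the UTA report.
r6525 = capacity(nodes=48, slots_per_node=96, hs_per_node=1455)
print(r6525)
# total_slots = 4608, total_hepspec = 69840, hepspec_per_slot = 15.15625
```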

         

        OU:

        Nothing to report, all running well.

         

    • 14:15 14:20
      WBS 2.3.3 HPC Operations 5m
      Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Argonne National Laboratory (US))

      NERSC

      • Cori running fine
      • Perlmutter integrated last week & running fine over the weekend
        • 1 job per node for right now
        • Issue with Globus 5 credentials in the last day; fixed now.
        • Not clear how to renew endpoint credentials with Globus 5 before they expire; investigating.
    • 14:20 14:35
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:20
        Analysis Facilities - BNL 5m
        Speaker: Ofer Rind (Brookhaven National Laboratory)
        • Preparing for pre-scrubbing
        • Doug, Lincoln, Ofer met with EOS team at CERN
          • Okay to expand usage of fuse mounts
        • Doug, Ofer met with Oksana Shadura and Alex Held at CERN
          • Will support efforts for development standards to support OKD
          • Will collaborate to get demo analyses running at BNL for testing/benchmarking
          • Oksana has already gotten access using FNAL login to federated jupyterhub
        • Doug, Ofer met with Elena Gazzarini at CERN (taking over for Riccardo)
          • Discussed collaborating on development of DLAAS tools
      • 14:25
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:30
        Analysis Facilities - Chicago 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
        1. We upgraded Kubernetes by two minor releases, from 1.21 to 1.23
        2. We upgraded HTCondor to 9.0.13 with OSG 3.6 on head/login nodes
        3. We did a yum update of all packages on the AF login and head nodes, including the latest mainline kernel from ELRepo
        4. We are still upgrading workers in the background 
        5. We deferred the CephFS upgrade from v16 (Pacific) to v17 (Quincy): one node (c001) appears to have a hardware error, with all disks reporting "I/O error" when we try to mount them. We need to get the cluster clean before upgrading to a new major version.
    • 14:35 14:55
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • Working on pre-scrubbing slides and OTP
      • Quiet week, except for ~2h downtime last night due to Panda/Harvester servers going offline
      • Less than 2 weeks to physics collisions.
      • 14:35
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14:40
        Service Development & Deployment 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        XCache 

        • all working fine
        • LRZ in downtime
        • waiting for 4.5

        VP

        • NET2 is running a lot of VP jobs

        Investigating some new http caching tools.

         

        Squid Fed Ops

        • SLATE team working on testing new federation controller.  Will migrate to using this in the next couple of weeks.

         

      • 14:45
        Kubernetes R&D at UTA 5m
        Speaker: Armen Vartapetian (University of Texas at Arlington (US))

        In the Calico network configuration, the change to the IP_AUTODETECTION_METHOD parameter (the likely suspect) was being accepted, but the Calico pod on the master node did not show the update propagating correctly; something appeared to be overriding the change.
        Lincoln suggested that it might be the Calico operator running in the background. Indeed, making the update at the operator level flipped that pod to healthy. All K8s components are now healthy, though the jobs I have submitted are still waiting in the ContainerCreating state. I think I know the reason and am working on a fix.
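
        As an illustration of what "making the update at the operator level" can look like: on operator-managed Calico installs, IPv4 address autodetection is set in the `Installation` resource under `spec.calicoNetwork.nodeAddressAutodetectionV4`, and the operator reconciles the calico-node settings from it. The sketch below only builds a merge-patch body. The interface value `eth0` is a hypothetical placeholder, and the minutes do not say which autodetection method or value was actually used at UTA.

```python
import json


def autodetection_patch(interface_regex):
    """Build a merge patch that selects the node address by interface name.

    interface_regex is an assumption for illustration; other autodetection
    methods exist and may be what the site actually needs.
    """
    return {
        "spec": {
            "calicoNetwork": {
                "nodeAddressAutodetectionV4": {"interface": interface_regex}
            }
        }
    }


patch = autodetection_patch("eth0")  # "eth0" is a hypothetical value
print(json.dumps(patch))
```

        Assuming the Installation resource is named `default`, as is common but not confirmed here, the printed JSON could be passed to `kubectl patch installation default --type=merge -p '<json>'`.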

    • 14:55 15:05
      AOB 10m