
US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

 

 

    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci

      Release (hopefully tomorrow)

      • Gratia Probe 2.6.1
      • HTCondor 9.0.13
      • HTCondor-CE 5.1.5
      • XCache 3.1.0
      • xrootd-multiuser 2.0.4

      Miscellaneous

      • Contact for AGIS/CRIC? OSG Central Collector AGIS compat layer has been down for quite some time
    • 13:20 13:50
      Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
    • 13:50 13:55
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))

      Smooth running

      Chris Hollowell gave a talk on experience with Kubernetes at BNL at the pre-GDB meeting on 6/7: https://indico.cern.ch/event/1096043/

       

    • 13:55 14:15
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • Reasonably good running since last meeting especially in the last week.
        • Central ADC issues: Rucio DB outage on 5/27 for a few hours. Planned ~1 hour Oracle upgrade today (6/8), with the production system paused for the duration.
        • AGLT2 Second incident with NFS server on 5/27
        • MWT2 Job starvation on 5/29 & 5/30.
        • NET2 Recover from annual power 5/26 & 5/27. 5/30 Transfer issues. Several small drainings in June.
        • SWT2 Transfer errors(?) and/or job starvation caused draining from 5/27-5/31.
      • Draft document for entering FY22 & FY23 equipment purchases (FY23 not included yet) is at:

        https://docs.google.com/document/d/10zRzY8yWXCUY3CVG6T4pZk091raN8qkIUJaxnWn1nx0

        The charts in this document are links from a reworked WLCG-v60 tab which is where you should enter numeric data.

        https://docs.google.com/spreadsheets/d/1nZnL1kE_XCzQ2-PFpVk_8DheUqX2ZjETaUD9ynqlKs4

        This will be part of the preparation for the pre-scrubbing.
      • I'll be asking everyone about their readiness for Run 3 data taking.
         
      • 13:55
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)

        Incidents

            05/25/2022
            2nd instance of umfs02 NFS server VM problem (used for osghome and our management files)
            lost accessibility again with same error
            “NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! ”.
            Migrated it to a different VMware host

            05/27
            Trouble with the nfs server umfs02 again
            Now suspecting latency from VMware iSCSI storage (TrueNAS), so moved VM to local NVMe storage
            Unfortunately, this problem causes an apparently high "load", so BOINC job control throttles back
         
            06/02
            umfs02 in trouble again
            Realized we were missing the NFS server tuning after the transition from physical to VM
            (in /etc/nfs.conf, changed threads from the default 8 to 512).
            Problem solved
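
The thread-count tuning above can be checked with a small script. This is a minimal sketch, not the site's actual tooling: it assumes the setting lives under the `[nfsd]` section as a `threads` key (the usual nfs.conf layout), and the helper name `nfsd_threads` and the sample text are hypothetical.

```python
import configparser


def nfsd_threads(conf_text, default=8):
    """Return the [nfsd] 'threads' value from nfs.conf-style text.

    Falls back to `default` (8, the default quoted in the notes) when
    the section or key is missing or commented out.
    """
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    return parser.getint("nfsd", "threads", fallback=default)


# Hypothetical sample resembling a tuned /etc/nfs.conf.
sample = """
[nfsd]
# threads=8
threads=512
"""

print(nfsd_threads(sample))
```

On a real host the text would come from reading /etc/nfs.conf; a result still equal to the default would suggest the tuning was lost, as happened after the physical-to-VM transition.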

        Hardware

            06/03
            UM site received the 10 R6525 work nodes ordered in Sep 2021,
            nodes racked/cabled/labeled and provisioned and put in production in 2 days.

        Software

            6/07
            Updated dCache from 7.2.15 to 7.2.16,
            and also updated kernel and firmware (rebooted to install BIOS updates).
            The process went well.

            6/08
            Update condor from 9.0.12 to 9.0.13 from the osg testing repository.
            This will cause an automatic rolling draining and condor restart.
            We also set condor to drain and wait on the C6420 work nodes, so we could reboot them to apply the new BIOS updates.

            Update HTCondor-CE from 5.1.3 to 5.1.5 from the osg testing repository
            on the test gatekeeper gate04.

         

      • 14:00
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))

        Continuing to debug hardware issues on one of our dCache pool nodes.

        Upgrading condor on our workers to 9.0.13.

        Fourth gatekeeper in production and receiving jobs.

        UC and IU Squids reconfigured for multicore (4 cores/squid) and have heartbeats added.

         

      • 14:05
        NET2 5m
        Speaker: Prof. Saul Youssef
      • 14:10
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        UTA:

        OSG 3.6

        • All production jobs are running on OSG 3.6 CE
        • Backup CE needs to be updated

        IPv6

        • Moved all network connections to IPv6 switch
        • Build procedure for hosts in place
        • Requesting IPv6 addresses (w/o DNS entries, for the moment)
        • Testing still in progress

        GridFTP/LSM

        • Disabled GridFTP as WAN transfer protocol
        • Had to enable root protocol as WAN/1 to keep LSM
        • Testing the removal of LSM is tricky (managed to offline production queue twice)
        • Rucio mover/pilot may not be able to use internal ROOT door due to URL being registered. 

         

        OU:

        • Still working with Dell to get the RAID6 array fixed on cstore13. In the meantime, xrootd is working fine using the copy of the data from that data server on the ourdisk Ceph scratch partition.

         

    • 14:15 14:20
      WBS 2.3.3 HPC Operations 5m
      Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Argonne National Laboratory (US))
      • Cori operating normally
      • Still waiting on validation samples for Perlmutter
    • 14:20 14:35
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:20
        Analysis Facilities - BNL 5m
        Speaker: Ofer Rind (Brookhaven National Laboratory)
        • Multiple topics (metrics, eos mounting, discourse,...) on tap for tomorrow's 2.3/5 meeting
        • Interesting presentation by Ricardo at last week's HSF AF Forum
      • 14:25
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:30
        Analysis Facilities - Chicago 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
    • 14:35 14:55
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • GridFTP protocols removed from remaining RSEs
        • Nevis_SE_0, NERSC-PDSFSRM_SE_0 now set to HTTP
        • NERSC-PDSF_SE_0 now set to GLOBUS
        • SWT2_CPB_SE_0 write_wan/1 now XROOTD
      • No change in settings for non-Rucio pilot copytools
      • ADCR database intervention this morning. Rucio, PanDA, CRIC all down from 9-10 CEST.
      • Pre-GDB on k8s earlier this week (talks by Lincoln on SLATE, Chris on BNL OKD deployments)
      • 14:35
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14:40
        Service Development & Deployment 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
        • BNL xcache updated to 5.4.3rc4
        • other caches work fine
        • vp works fine
        • a lot of VP jobs running at NET2

        Starting preparations for ES upgrade.

        Performance upgrade for ServiceX getting ready.

         

      • 14:45
        Kubernetes R&D at UTA 5m
        Speaker: Armen Vartapetian (University of Texas at Arlington (US))

        Last week the cluster network was still misbehaving, and at the end of the week Patrick replaced the switch that was locking up. That resolved the issue. In the Calico network configuration I modified IP_AUTODETECTION_METHOD, which was the possible suspect, and the system reported that it was updated. The process recreated the Calico pod for that node, but it is not clear that it did the trick (something could have overridden the parameter); at least it didn't resolve the connectivity issue for that Calico pod.

    • 14:55 15:05
      AOB 10m