US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

 

 

    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))

      Shawn and Rob have been re-appointed for another round of facilities management.

      First big task will be to review each L3 area in prep for the pre-scrubbing.

    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
    • 13:20 13:50
      Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
      • 13:20
        TBN 30m
    • 13:50 13:55
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
    • 13:55 14:15
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • The efficiency of the sites was lower than usual:
        • AGLT2: NFS server failure on 5/21 (also updated condor to 9.0.12 on 5/12).
        • MWT2: problems keeping the site full. Partly drained over the weekend of 5/14 and 5/15 when 1 of 2 gatekeepers (GKs) failed. To add margin, a 3rd gatekeeper was turned on 5/20, and a 4th gatekeeper is being brought into service today.
        • NET2: a series of expired certificates caused the site to drain. Offline for the last 2 days for power work. Not refilling.
        • SWT2 CPB: storage issues caused SLURM to not start jobs (twice?).
        • ADC: a pilot issue caused jobs to be killed at 48 hours; a workaround is in place.
      • Run 3 data taking readiness:
        • AGLT2 and MWT2 are ready (both sites are waiting for delayed hardware deliveries).
        • NET2: needs to update to OSG 3.6 and XRootD 5.4.3(?), enable IPv6, enable new storage, and retire GPFS.
        • SWT2 OU: needs to set up an OSG 3.6 gatekeeper, set up a SLATE server, update to XRootD 5.4.3(?), and finish enabling IPv6.
        • SWT2 CPB: needs to finish the OSG 3.6 update, update to XRootD 5.4.3, finish enabling IPv6, remove LSM, and put new network hardware into service. Still waiting for delivery of some servers.
      • 13:55
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)

        05/12/2022

        Updated condor from 9.0.11 to 9.0.12

        Updated the Gratia probes on all gatekeepers. The Gratia probe stopped working for a day after the upgrade; it was fixed by reconfiguring, then manually restarting condor-ce and running

        su - condor -c " /usr/share/gratia/htcondor-ce/condor_meter"

         

        05/17/2022

        We migrated the Tier2 NFS server umfs02 to a virtual machine without taking a downtime. This NFS server provides the home directories for all grid users. The migration hit two problems: 1) The MSU worker nodes could not mount the new NFS server because of routing issues; we added routing rules as a workaround. 2) This NFS server also serves as the archive directory for the hot standby replication of the dCache PostgreSQL databases. For one of the database servers (head01), the hot standby replication did not transition smoothly during the 20 minutes of downtime while the NFS servers were being swapped, so we ended up reseeding the database from head01 to its hot standby server d-head01.

         

        We converted all 26 remaining SL7 servers at the UM site to CentOS7; this includes all the dCache pool nodes and Lustre storage nodes.

         

        05/21/2022

        The new NFS server (virtual machine) umfs02 became inaccessible; increasing its memory and CPU restored the service. The site drained to 10% usage on the 21st because of this incident.

         

        05/23/2022

        Gratia on the OSG gatekeeper (gate02) stopped working for 2 days. Restarting the condor-ce service fixed that.

        MSU finished installing and phasing in the 3 new VMware AMD host nodes (ordered Sept 2021), but is still using the old direct-attach SAS storage. The last step will be to start using the new NVMe storage via iSCSI (also received from the 2021 order).

      • 14:00
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
        • Storage node went down again. Troubleshooting further with vendor. Needed to declare a handful of files as lost as a result.
        • Production switches arrived at UC. Working on getting times to install them and take out the temporary switches.
        • Second gatekeeper at IU is set up and in production. Currently running 3 active GKs.
        • Working on setting up a second gatekeeper at UC for a total of 4 gatekeepers for MWT2.
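
The margin gained from extra gatekeepers can be estimated with a rough model. The sketch below assumes that every gatekeeper carries an equal share of the submission load, which is a simplification and not something stated in the minutes; `remaining_capacity` is a hypothetical helper name.

```python
from fractions import Fraction


def remaining_capacity(total_gatekeepers: int, failed: int = 1) -> Fraction:
    """Fraction of submission capacity left after `failed` gatekeepers go down.

    Assumption: load is spread evenly across all gatekeepers.
    """
    if total_gatekeepers <= 0 or not 0 <= failed <= total_gatekeepers:
        raise ValueError("need total_gatekeepers > 0 and 0 <= failed <= total_gatekeepers")
    return Fraction(total_gatekeepers - failed, total_gatekeepers)


# Compare the 2-gatekeeper setup that drained with the 3- and 4-gatekeeper setups.
for n in (2, 3, 4):
    share = float(remaining_capacity(n))
    print(f"{n} gatekeepers, 1 failed: {share:.0%} of capacity left")
```

Under this assumption a single failure leaves half the capacity with 2 gatekeepers, two thirds with 3, and three quarters with 4.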
      • 14:05
        NET2 5m
        Speaker: Prof. Saul Youssef
      • 14:10
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        OU:

        - One of our 7 xrootd storage servers is having RAID6 issues, so we copied all of its contents to the OSCER ceph scratch and pointed xrootd there while we re-create the RAID6 array from scratch with two new drives; then we'll copy everything back. This should take a few days.

        - xrootd pointed at the ceph copy seems to work fine.

        - We prevented new data from being stored on that server during this maintenance.

        SWT2_CPB:

        - Installing compute nodes from our purchase earlier this year (48 nodes total).

        - Still awaiting delivery of WNs from the previous purchase! Dell claims it's imminent...

        - Working to finalize scheduling for our remaining to-dos in Fred's list.

        - The partition holding the slurm DB filled up on 5/20 (ggus 157319). Took a while to clean the area and remove the debris. We'll implement some configuration changes to avoid a recurrence.
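
One way to catch a filling partition early is a periodic fill-level check. The sketch below is only an illustration, not the configuration change SWT2_CPB plans to make; the path, the 90% threshold, and the function names are all placeholders.

```python
import shutil


def partition_fill_fraction(path: str) -> float:
    """Return used/total for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def needs_attention(path: str, threshold: float = 0.90) -> bool:
    """True when the filesystem is at or above the given fill threshold."""
    return partition_fill_fraction(path) >= threshold


# Placeholder path: substitute the mount point that holds the slurm DB.
db_path = "/"
print(f"{db_path}: {partition_fill_fraction(db_path):.1%} full, "
      f"alert={needs_attention(db_path)}")
```

Run from cron or a monitoring agent, a check like this would flag the partition before the database runs out of space.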

         

    • 14:15 14:20
      WBS 2.3.3 HPC Operations 5m
      Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Argonne National Laboratory (US))
      • Perlmutter CPU queues are online and "free" currently.
        • NERSC_Perlmutter_Test has been working well in the CPU queue. 
        • Only running 1 job/node (128 cores) right now (~50% CPU efficiency).
          • 30 min to complete a job this way
          • May be worth it to be more inefficient if there will be a preemptable queue for CPU? Usually there are guarantees e.g. 2 hrs walltime before preemption.
      • Asking production coordinators to send some real tasks to NERSC_Perlmutter. 
      • Cori working well
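
The Perlmutter numbers above translate into core-hours as follows. This is a back-of-the-envelope sketch using only the figures quoted in the minutes (128 cores per node, 30 minutes per job, about 50% CPU efficiency, and the example 2-hour guaranteed walltime); the function names are illustrative.

```python
def core_hours(cores: int, walltime_hours: float) -> float:
    """Core-hours occupied by one job that holds `cores` for `walltime_hours`."""
    return cores * walltime_hours


def useful_core_hours(cores: int, walltime_hours: float, cpu_efficiency: float) -> float:
    """Core-hours actually spent on CPU work at the given efficiency."""
    return core_hours(cores, walltime_hours) * cpu_efficiency


# One job per node: 128 cores for 30 minutes at roughly 50% CPU efficiency.
allocated = core_hours(128, 0.5)
useful = useful_core_hours(128, 0.5, 0.5)

# Number of 30-minute jobs that fit back to back in a 2-hour guaranteed window.
jobs_in_window = int(2.0 // 0.5)

print(f"allocated: {allocated} core-hours, useful: {useful} core-hours")
print(f"jobs per 2-hour window: {jobs_in_window}")
```

Each job therefore occupies 64 core-hours, of which about 32 do useful work, and four such jobs fit inside a 2-hour pre-preemption window.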
    • 14:20 14:35
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:20
        Analysis Facilities - BNL 5m
        Speaker: Ofer Rind (Brookhaven National Laboratory)
        • Analysis Ecosystems Workshop this week
          • Summary report being written
        • AF Forum last week
          • Ilija spoke about integrating xcache with AF - will try this out at BNL AF
        • Discussion of EOS at AF at tomorrow's 2.3/5 meeting
      • 14:25
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:30
        Analysis Facilities - Chicago 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        It appears that users' jobs running on the AF were unaware of the Squid caches, so these accesses went directly to the CERN Squid.

        Fengping and Lincoln are working on making these jobs use the correct Squids.

         

    • 14:35 14:55
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • 14:35
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14:40
        Service Development & Deployment 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        XCache

        • works fine

        VP

        • works fine
        • integration is progressing (slowly)

        Analytics

        • work on separating code base and hardware used by AF and ATLAS Analytics

         

      • 14:45
        Kubernetes R&D at UTA 5m
        Speaker: Armen Vartapetian (University of Texas at Arlington (US))

        Started looking into the Calico network configuration; Lincoln suggested which parameters to modify first to see if that fixes the problem. While working on that, I noticed a general networking problem.
        Waiting for that to be fixed before moving forward with any network-related configuration changes on the K8S side.

    • 14:55 15:05
      AOB 10m