US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci

      Token transition

      • Register for the Oct 14 & 15 workshop! https://opensciencegrid.org/events/Token-Transition-Workshop/
      • Update your CEs using packages from 3.5 upcoming
        • HTCondor 9.0.5 contains an important change that improves bearer token + GSI support
        • There are known bugs with the new job router config transform syntax, and fixes are expected in HTCondor-CE 5.1.2. The old job router config syntax still works fine, though!
        • Docs for installing a new CE: https://opensciencegrid.org/docs/compute-element/install-htcondor-ce/
        • Docs for updating an existing CE: https://opensciencegrid.org/docs/release/updating-to-osg-35/#updating-to-htcondor-ce-5
        • CEs still on versions that don't support token submission:
          atlas-ce.bu.edu
          bgk01.sdcc.bnl.gov
          bgk02.sdcc.bnl.gov
          gate01.aglt2.org
          gate02.grid.umich.edu
          gate03.aglt2.org
          gk01.atlas-swt2.org
          gk04.swt2.uta.edu
          grid1.oscer.ou.edu
          gridgk01.racf.bnl.gov
          gridgk02.racf.bnl.gov
          gridgk03.racf.bnl.gov
          gridgk04.racf.bnl.gov
          gridgk06.racf.bnl.gov
          gridgk07.racf.bnl.gov
          gridgk08.racf.bnl.gov
          iut2-gk.mwt2.org
          mwt2-gk.campuscluster.illinois.edu
          ouhep0.nhn.ou.edu
          spce01.sdcc.bnl.gov
          tier2-01.ochep.ou.edu
          uct2-gk.mwt2.org
      • FaHui verified token-based submission to the MWT2 CE and is working on making the necessary changes to Harvester

    • 13:20 13:35
      Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
      • 13:20
        TBD 10m
    • 13:35 13:40
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
      • Working on SRM+HTTPS
        • Once tests on the testbed succeed, we will test BNLLAKE LOCALGROUPDISK
      • Removing SRM from disk-only endpoints
      • Waiting on DDM ops to set up BNLLAKE DATADISK; that setup is needed to meet the pledge.
    • 13:40 14:00
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • It was a very good running period for the last two weeks. Prod job failures were low:
        • AGLT2 1.0%
        • MWT2 1.2%
        • NET2 4.1%
        • SWT2 OU 0.8%
        • SWT2 CPB 7.3%
      • Could each site report on their order status - I think most sites have their orders in but would like to confirm.
      • IPv6 status at NET2 and SWT2?
      • HTTP TPC at NET2 and SWT2
      • MWT2 (and BNL) starting investigation of HTCondor-CE 5.1.1 / Condor 9.x using OSG 3.5 upcoming (preparation for the move to OSG 3.6)
      • Still working on SLATE / Frontier-Squid: OU is setting up a server and CPB is moving OSG jobs to the SLATE Squid.
      • Moves at UC and UTA proceeding.
      • 13:40
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)

        09/01/2021

        A fiber cut between the University of Michigan and Chicago happened yesterday and, because of our network resiliency, connectivity for MOST networks failed over to an alternate path.  However, the LHCONE peering did not fail over correctly, which caused transfer failures (over 15000 jobs in transferring status) and Squid monitoring issues (both SLATE Squid servers appeared to be down in the CERN Squid monitoring while they were actually running).  This has just been fixed (Sep 2 2021 around 10:35 AM Eastern time).

        09/06/2021 

        One of the dCache pools (umfs11_6) went offline, which made some files inaccessible (GGUS 153689). The pool was brought back online once the problem was identified.

        09/07/2021

        Retired rack 117 (became MSU T3): 39 x R610 with dual E5520 @ 2.27 GHz (16 HT cores per node), 624 HT cores total.

        AGLT2_UM has equipment on order with Dell, but delivery dates range from Dec 1, 2021 to January 5, 2022. The MSU site order is still waiting for approval.

      • 13:45
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Jess Haney (Univ. Illinois at Urbana Champaign (US)), Judith Lorraine Stephen (University of Chicago (US))

        Tested gatekeeper + queue with HTCondor-CE 5.1.1 and Condor 9.0.5. We were running 350 jobs with FaHui's test Harvester instance

        Applied mitigation for CVE-2021-3715 to all three sites

        Running network tests between the old and new UChicago datacenters before putting transitional storage online and starting datacenter transfers

        Brief power outage last weekend in the UChicago datacenter. No critical services were affected, but a number of compute nodes rebooted

        POs submitted for all three sites

      • 13:50
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

        o 1 GGUS ticket, some failures from NESE Ceph endpoints due to overloading.  

        o P.O. for 88 worker nodes at Dell, + 2 for Tier 3

        o 8TB disks failing in out-of-warranty GPFS, evacuating LUN behind the scenes. 

        o WebDAV with XRootD 5.3.0 passing smoke tests. Ramping up for production. ETA end of this week, NESE endpoints to follow.

        o Ramping up work on NESE Tape endpoints.  Working with IBM, MIT, NESE operations team.  

        o Preparing for NET2 expansion including UMass, large new compute resources.  

        o MIT has purchased equipment for upgrading MGHPCC WAN to 1800 Gb/s total on the MGHPCC-Boston-NYC-Albany loop. 

        o Met with Mark Sosebee re: spiffing OIM entries.  

      • 13:55
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        UTA_SWT2:

        • Starting the retirement process for the cluster
          • Created UTA_SWT2_UCORE_RET PanDA queue with remote I/O
          • HC worked initially but went offline this morning
          • Once this is in place, will empty UTA_SWT2_DATADISK

        SWT2_CPB:

        • Deployed HTTP-TPC service on gk06.atlas-swt2.org
          • Passing DTEAM and ATLAS smoke tests, moving to DOMA smoke tests
        • In process of deploying ~2PB of storage and new compute nodes
          • equipment racked, working on configs / burn in
        • Starting to redeploy the K8s cluster for Armen's use; hopefully something will be ready by the next meeting.

        OU:

        • Not much to report, site running well.

    • 14:00 14:05
      WBS 2.3.3 HPC Operations 5m
      Speaker: Lincoln Bryant (University of Chicago (US))
    • 14:05 14:20
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:05
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)
      • 14:10
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:15
        Analysis Facilities - Chicago 5m
        Speakers: David Jordan (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))

        David: Nothing new to report. Working on getting GPU machine up. Still on track for the bootcamp in October.

    • 14:20 14:40
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • WebDAV protocol added to SWT2_CPB SE; Alessandra issued a PR to add SWT2 to the DOMA smoke tests after manual testing was successful.
      • NET2 should be in production with HTTP-TPC by end of this week
      • Proposal to move ATLAS completely off gsiftp transfers (including tape) by end of the calendar year, with cleanup and removal of gsiftp from all configurations by start of Run 3
      • Upcoming downtimes for all ADC services due to Oracle DB upgrade
        • Mon. 9/20, 10:00-10:30 CET
        • Mon. 9/27, 10:00-18:00 CET
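The downtime windows above can be converted to this meeting's US/Eastern timezone with a minimal sketch like the one below. It assumes that "CET" in the notes means CERN local wall-clock time (the `Europe/Zurich` zone, which observes summer time in September); if literal CET (UTC+1) was intended, every result is one hour later. The names `DOWNTIMES`, `to_eastern`, and `eastern_windows` are hypothetical and exist only for this illustration.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Assumption: "CET" in the notes means CERN local time (Europe/Zurich).
CERN_TZ = ZoneInfo("Europe/Zurich")
EASTERN = ZoneInfo("America/New_York")

# Downtime windows from the notes, as (year, month, day, hour, minute) tuples.
DOWNTIMES = [
    ((2021, 9, 20, 10, 0), (2021, 9, 20, 10, 30)),
    ((2021, 9, 27, 10, 0), (2021, 9, 27, 18, 0)),
]


def to_eastern(parts):
    """Interpret the tuple as CERN local time and convert it to US/Eastern."""
    local = datetime(*parts, tzinfo=CERN_TZ)
    return local.astimezone(EASTERN)


def eastern_windows(windows=DOWNTIMES):
    """Return each (start, end) window converted to US/Eastern."""
    return [(to_eastern(start), to_eastern(end)) for start, end in windows]


if __name__ == "__main__":
    for start, end in eastern_windows():
        print(f"{start:%a %m/%d %H:%M} - {end:%H:%M %Z}")
```

Under that assumption, both windows start at 04:00 Eastern, so the second one ends at noon Eastern on 9/27.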
      • 14:20
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14:25
        Service Development & Deployment 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        XCache

        • An issue with TLS, which is now used by two sites, has been solved.
        • All the SLATE XCaches were updated to 5.3.1 and redeployed.
        • Otherwise working smoothly

        VP

        • AGLT2 VP is again getting jobs
        • all working fine

        ES

        • Soon will start migration to new computing center. 
        • Will be done in two steps. 
        • Should not need any downtime

        Rucio

        • A bug was found in how client location was cached. It is fixed and will be in testing on Sep 27th, then in production a week later.
        • Will need to switch to using latitude and longitude for sites as found in CRIC
      • 14:30
        Kubernetes R&D at UTA 5m
        Speaker: Armen Vartapetian (University of Texas at Arlington (US))

        Information on the status of the UTA_SWT2 hardware move to CPB, and on the start of redeploying the hardware for the K8s cluster, is in Patrick's report.

    • 14:40 14:45
      AOB 5m