
US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

 

 

    • 1
      WBS 2.3 Facility Management News
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
    • 2
      OSG-LHC
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
      • OSG 3.5 enters critical/bug fix only support at the end of this month: https://opensciencegrid.org/technology/policy/release-series/#life-cycle-dates.
      • Now is the time to start updating HTCondor-CEs to OSG 3.5 upcoming so that they are ready to accept tokens
        • AGLT2
          gate01.aglt2.org
          gate02.grid.umich.edu
          gate03.aglt2.org
        • BNL
          gridgk01.racf.bnl.gov
          gridgk02.racf.bnl.gov
          gridgk03.racf.bnl.gov
          gridgk04.racf.bnl.gov
          gridgk06.racf.bnl.gov
          gridgk07.racf.bnl.gov
          gridgk08.racf.bnl.gov
        • BU
          atlas-ce.bu.edu
        • MWT2
          iut2-gk.mwt2.org
          uct2-gk.mwt2.org
        • SWT2
          gk01.atlas-swt2.org
          tier2-01.ochep.ou.edu
      • OSG will host another token hackathon this month – admins are free to come with questions, for help updating their CEs, etc.
      • Pre-GDB token workshop in October (11-12?)
    • Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
      • 3
        TBD
    • 4
      WBS 2.3.1 Tier1 Center
      Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))

      Writing to tape from ATLAS has been paused so we can reduce the number of tapes used for writing. We have reduced the number of tape drives used for writing to 4 (each capable of writing 200 MB/s), so we can write 67 TB a day to tape, and we have almost 400 TB to write from the internal HPSS disk cache to tape. We cannot reduce the number of drives any further.

      We are having a problem staging files from tape to dCache disks. dCache is not pulling files from the HPSS cache and we are seeing a lot of bad dCache restore requests. We are investigating and actively trying to clear up the situation so that data can flow.

      This problem was triggered by a large request (>200k files) at midday on Saturday. The exact source of the request is under investigation.
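The tape-writing figures in the Tier1 report can be cross-checked with a short calculation. This is only a rough sketch: it assumes decimal units (1 TB = 1,000,000 MB) and drives streaming at full rate around the clock, neither of which the report states, and the function names are hypothetical.

```python
# Figures quoted in the Tier1 report above.
DRIVES = 4                 # tape drives left for writing
DRIVE_RATE_MB_S = 200      # per-drive write rate, MB/s
BACKLOG_TB = 400           # "almost 400 TB" waiting in the HPSS disk cache
QUOTED_TB_PER_DAY = 67     # daily rate quoted in the report

SECONDS_PER_DAY = 86_400
MB_PER_TB = 1_000_000      # assumption: decimal units


def peak_tb_per_day(drives: int, rate_mb_s: float) -> float:
    """Upper bound on daily volume if every drive streams at full rate all day."""
    return drives * rate_mb_s * SECONDS_PER_DAY / MB_PER_TB


def days_to_drain(backlog_tb: float, tb_per_day: float) -> float:
    """Days needed to write the backlog at a constant daily rate."""
    return backlog_tb / tb_per_day


if __name__ == "__main__":
    peak = peak_tb_per_day(DRIVES, DRIVE_RATE_MB_S)
    print(f"peak rate: {peak:.1f} TB/day")
    print(f"drain time at peak rate: {days_to_drain(BACKLOG_TB, peak):.1f} days")
    print(f"drain time at quoted rate: {days_to_drain(BACKLOG_TB, QUOTED_TB_PER_DAY):.1f} days")
```

Under these assumptions the theoretical peak comes out slightly above the quoted 67 TB a day, and the backlog takes roughly six days to drain if no new data lands in the cache in the meantime.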

       

       

       

    • WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • Good running with scheduled downtimes at AGLT2, MWT2, and NET2.
        • A few mysterious periods of draining that I will ask the relevant sites about.
      • Nearing (I think nearing!) getting final quotes for the FY21 purchases.
        • Dell will raise the prices on Sep 1 and this means we need POs by Aug 31.
        • Prices are up - way up.
        • Almost no choice in CPUs - most are back ordered to ~January 2022. For compute servers the only viable AMD processor is the EPYC 7302 (16C/32T @ 3.0 GHz), which forces either 2 GB or 4 GB per thread. (Last year we used the EPYC 7402 (24C/48T @ 2.8 GHz), which yielded 2.7 GB per thread.)
        • I need to know how many compute servers each site wants and how much money each site has to spend on compute servers.
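The memory-per-thread figures quoted for the two EPYC models can be reproduced with simple division. The sketch below assumes dual-socket servers with 128 GB or 256 GB of RAM; the notes state neither the socket count nor the RAM sizes, so both are illustrative assumptions, as is the helper name.

```python
def gb_per_thread(total_ram_gb: float, sockets: int, threads_per_socket: int) -> float:
    """RAM available per hardware thread on one server."""
    return total_ram_gb / (sockets * threads_per_socket)


SOCKETS = 2  # assumption: dual-socket compute servers

# (CPU model, threads per socket from the notes, assumed total RAM in GB)
CONFIGS = [
    ("EPYC 7302", 32, 128),
    ("EPYC 7302", 32, 256),
    ("EPYC 7402", 48, 256),
]

if __name__ == "__main__":
    for model, threads, ram_gb in CONFIGS:
        ratio = gb_per_thread(ram_gb, SOCKETS, threads)
        print(f"{model}: {ram_gb} GB over {SOCKETS * threads} threads = {ratio:.1f} GB/thread")
```

With the 7302's 32 threads per socket, RAM totals that come in powers of two land on exactly 2 GB or 4 GB per thread, whereas the 48-thread 7402 falls in between at about 2.7 GB.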
      • 5
        AGLT2
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)

        MSU

        - last of 4 waves done 11-Aug-2021 (mostly MSU T3), thus finishing the equipment move from the department building to the data center.

        - purchase planning: 3x ESXi hosts + 1x NVMe storage + ~3 dCache storage nodes + as many 1U WNs as budget allows.

        - FYI@MWT2: we may be decommissioning the EX9208 this week, to be confirmed. 

        UM:

        - draining gate03 for the HTCondor-CE update (we lost 1000 jobs when we updated gate01 with jobs still running, so to be cautious we now drain the gatekeeper first)

        - patched a bunch of nodes with IPv6 issues (adding IPv6 neigh rules manually).

        - did 2 condor updates (8.4.13->8.4.14->8.4.15)

        - rebuilt all Tier2 WN with CentOS7

        - finished rebooting all nodes to the new 1160.36 kernel

      • 6
        MWT2
        Speakers: David Jordan (University of Chicago (US)), Jess Haney (Univ. Illinois at Urbana Champaign (US)), Judith Lorraine Stephen (University of Chicago (US))

        Network downtime at UChicago this past Monday. A line card was replaced on our EX9208 in preparation for the move to our new datacenter.

        Elasticsearch upgraded to 7.14

        Continuing to work on migrating our squid traffic onto the new SLATE squids

        Discussing purchasing for new compute

      • 7
        NET2
        Speaker: Prof. Saul Youssef (Boston University (US))

         

        MGHPCC scheduled maintenance.  

        NESE_DATADISK was down for an additional day for Harvard re-networking.  

        xrootd is working now, passing smoke tests, HTTP-TPC. 

        High priority items:
         

        1) Prepare for worker node purchase
        2) Expand xrd cluster, switch over from gridftp to xrootd
        3) OSG 3.5 update
        4) Finish IPv6

      • 8
        SWT2
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        OU:

        - Nothing to report, running well.

        - Upgraded xrootd proxy (se1) to 5.3.1, seems to run well.

        UTA:

        - Preparing to install the compute nodes + storage from our last purchase. Logistically this will allow us to move forward with the move / retirement of UTA_SWT2.

        - About to install the latest version of XrootD on the HTTP-TPC test instance. Need to verify the ROCKS recipe for building the host as a final step prior to production deployment.

        - Need to schedule a downtime to install the LAN networking upgrade hardware. Many needed software updates will occur during this outage.

        - Recent operations generally stable, smooth.

    • 9
      WBS 2.3.3 HPC Operations
      Speaker: Lincoln Bryant (University of Chicago (US))
    • WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 10
        Analysis Facilities - BNL
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)

        NTR

      • 11
        Analysis Facilities - SLAC
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))

        A DTN with 100 Gbps is online

      • 12
        Analysis Facilities - Chicago
        Speakers: David Jordan (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
        • GPU arrived, not yet set up. Otherwise nothing to report on hardware side.
    • WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • XRootd checksum issue at BU is resolved, running smoke tests successfully
      • F-S DevOps mtg
        • Fred has draft of installation documentation from installing test server at IU
        • Jess deploying new server at Illinois (will further refine documentation)
        • Ilija tracked old squid usage to OSG jobs using CVMFS
      • 13
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14
        Service Development & Deployment
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
        • ES upgraded yesterday to 7.14. Changing its monitoring and preparing it for the big upgrade (8.0). Manually indexing missed data.
        • XCache - all up and running. Now SLATE xcaches auto-configured through GitOps. 
        • VP - all working fine. Working on Rucio integrations. Rucio Oracle upgrade only in September.
        • Squids - all working fine. Preparing things for retirement of "old" squids.
        • Rucio GeoIP still not tested due to unrelated changes that make it incompatible with Python 2.7. Now at version 1.26.2.
      • 15
        Kubernetes R&D at UTA
        Speaker: Armen Vartapetian (University of Texas at Arlington (US))

        The status of the hardware move to CPB is covered in Mark's report.

    • 16
      AOB