US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

 

 

    • 1
      WBS 2.3 Facility Management News
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
    • 2
      OSG-LHC
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci

      Release (this week)

      • HTCondor 9.3.0 in OSG 3.6 upcoming
      • osg-ca-certs-updater for EL8

      Other

      • Update your CEs to HTCondor-CE 5 and HTCondor 9 from OSG 3.5 upcoming!
      • FaHui is updating central Harvester infrastructure to support token-based pilots
      • For pilots, we expect to be able to consolidate token -> user mappings to a single user!
      • Working with the HTCondor team to figure out a solution for mapping SAM/ETF tests to a separate user (scope based mappings?)
    • Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
    • 4
      WBS 2.3.1 Tier1 Center
      Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))

      Smooth operation

      FTS upgrade on Nov 2, 4h downtime

       

    • WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • Relatively rough period over the last two weeks:
        • Several problems on the CERN side
          • Network outage on 10/15 (Friday)
          • Draining problem over the weekend (10/17-10/18)
          • Yesterday someone at CERN removed an "unnecessary" DB link.
          • Other decreases were due to site maintenance and various site networking problems.
          • The large increase in job slots for AGLT2 comes from redefining each BOINC job as 8 slots rather than 1.
      • 5
        AGLT2
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)


        Overall smooth 2-week operation (until today)

          - New problem: 60% of pilots are failing, but payloads from the other 40% run normally (2.4% failure rate).
          We have been asked to check the condor config and not to retry failed jobs locally.

          Also 2 events in BOINC queue, presumably unrelated
          - A CERN operational error with DBRelease yesterday seems to have caused job failures
          - Monitoring: the job count has jumped from ~2k to ~12k.
          Presumed to be proper accounting of multicore payloads.

        Storage purchase

          - UM received 5x R740xd2 last week, installed, in production, adding 1180 TB usable
          Retired 2x older storage nodes (678 TB) with 4x MD3xxx shelves (including umfs11 which had the recent hardware problems)

          - MSU received 3x R740xd2 this week, racked, soon into production, adding 708 TB usable

          - net + 1210 TB
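
The net figure can be checked from the numbers reported above. This is a minimal sketch; the variable names are illustrative only, and it assumes all three figures are usable capacity in TB.

```python
# Usable-capacity changes reported for AGLT2, in TB (values taken from the notes above).
um_added_tb = 1180    # 5x R740xd2 installed at UM
um_retired_tb = 678   # 2x older storage nodes retired at UM
msu_added_tb = 708    # 3x R740xd2 received at MSU

# Net change = additions minus retirements.
net_change_tb = um_added_tb - um_retired_tb + msu_added_tb
print(f"net +{net_change_tb} TB")  # prints "net +1210 TB"
```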

         

      • 6
        MWT2
        Speakers: David Jordan (University of Chicago (US)), Jess Haney (Univ. Illinois at Urbana Champaign (US)), Judith Lorraine Stephen (University of Chicago (US))

        Brief scidmz network outage at UChicago Oct 15

        IU downtime to reorganize racks and reconfigure network

        ICC PM downtime

        Updated vo-client on the MWT2 dCache nodes

        Removed SRMv2 from TPC list in CRIC

      • 7
        NET2
        Speaker: Prof. Saul Youssef
      • 8
        SWT2
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        OU:

        - Overall, site running well.

        - xrootd hung up on proxy (se1.oscer.ou.edu) again last night, restarted. Andy and Wei are looking into it.

        - The problem is exacerbated by the rucio copytool (site mover) still not handling write_lan correctly, meaning all stage-outs are still routed through se1 instead of going directly to the local redirector. That really needs to get fixed soon, since it causes huge inefficiencies.

        SWT2_CPB:

        - XRootD on the webDAV host is more stable following the update to the curl libraries

        - Installed more storage from the most recent purchase (Summer)

        - Set up a new perfSONAR bandwidth host. Both perfSONAR machines are now on new hardware.

        UTA_SWT2:

        - Retirement progressing

        - DDM ops restarted the cleanup of the SE. As of this morning, ~10 TB of data known to rucio remain.

         

    • 9
      WBS 2.3.3 HPC Operations
      Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Northern Illinois University (US))

      TACC 

      • Needed to update the local copy of ALRB etc. to pick up the VO Client update at TACC (the same issue that affected Tier 2s over the weekend)
      • System maintenance yesterday, some job failures because of it.
      • Priority is looking better overall, jobs are generally going through.
      • 32% of allocation remaining

       

      NERSC

      • Filesystem degraded on Monday causing job failures.
      • 60% of the additional 20M node-hour allocation remains.
      • Another 10-20M hours possibly coming. Will need to use or lose it before end of year.
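
As a rough illustration of what the NERSC figures imply, the sketch below converts the percentage to node hours and adds the possible extra award. The variable names are hypothetical, and the totals assume the 60% figure and the 10-20M range are exact, which the notes do not guarantee.

```python
# Figures from the NERSC notes above; integer arithmetic avoids float rounding.
additional_allocation = 20_000_000   # additional node hours awarded
percent_remaining = 60               # share of that allocation still unused

remaining = additional_allocation * percent_remaining // 100

# Possible extra award (not yet confirmed): 10M to 20M hours.
extra_low, extra_high = 10_000_000, 20_000_000
to_use_low = remaining + extra_low
to_use_high = remaining + extra_high

print(f"{remaining:,} node hours remaining")
print(f"{to_use_low:,} to {to_use_high:,} node hours to use before end of year if the extra award arrives")
```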
    • WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 10
        Analysis Facilities - BNL
        Speaker: Ofer Rind (Brookhaven National Laboratory)
      • 11
        Analysis Facilities - SLAC
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 12
        Analysis Facilities - Chicago
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        ML platform running fine.

        AF

        • Andrew is working on reimplementing something like the ML platform on the AF
        • Both xAOD and Uproot instances on AF running fine. A lot of improvements.
        • Got a new SLATE-deployed XCache dedicated to ServiceX. It works great and is limited only by the NIC. Will get 2x25 Gbps this week.
    • WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • 13
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14
        Service Development & Deployment
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        XCaches

        • all running fine.
        • 5.3.2 is out and I am testing it. Will update everything this week.
        • I am adding a container that will send rucio heartbeats.

        VP

        • all running fine
        • the only issues are with RAL. They should be solved by the 5.3.2 update.

        Rucio

        • work on integrating VP
          • json database for placements (PR is ready)
          • adding heartbeats (almost ready)
          • work on the placement engine not started yet

        ServiceX

        • upgraded uproot xrootd plugin version
        • much better performance with the big fast xcache.
        • will try deploying in FABRIC.
      • 15
        Kubernetes R&D at UTA
        Speaker: Armen Vartapetian (University of Texas at Arlington (US))

        UTA_SWT2 decommissioning is nearing completion; the hardware will be used for a Kubernetes cluster at CPB (see Mark's report).

    • 16
      AOB