US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci

      Release (next week?)

      • XCache 2.0.1
      • vo-client 114 containing IAM LSC files (vomses updates targeted for October)
      • HTCondor 9.0.2/blahp 2.1.0 (upstream release expected today)
      • XRootD 5.3.0 RC4 available in 3.5 upcoming-testing

      Other

    • 13:20 13:35
      Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
      • 13:20
        TBD 10m
    • 13:35 13:40
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))

      BNL staff are required to use their excess vacation days before July 20.

      BNL HPSS downtime is scheduled for 2-Aug-21 12:00 UTC to 6-Aug-21 00:00 UTC. SRM services will be affected. OIM was used to schedule the downtime.
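
      As a quick check of the length of the HPSS downtime window quoted above, the following minimal sketch computes its duration. The helper name `parse_utc` is an assumption made for illustration; only the two timestamps come from the notes.

```python
from datetime import datetime, timezone


def parse_utc(stamp: str) -> datetime:
    """Parse a timestamp such as '2-Aug-21 12:00' and mark it as UTC.

    Hypothetical helper for illustration; assumes English month abbreviations.
    """
    return datetime.strptime(stamp, "%d-%b-%y %H:%M").replace(tzinfo=timezone.utc)


# Start and end of the downtime as given in the notes.
start = parse_utc("2-Aug-21 12:00")
end = parse_utc("6-Aug-21 00:00")

# Length of the window in hours.
downtime_hours = (end - start).total_seconds() / 3600
print(f"HPSS downtime: {downtime_hours:.0f} hours")  # prints "HPSS downtime: 84 hours"
```

      The window therefore spans three and a half days.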

      Working through the FTS-dCache-HPSS local site monitoring plots and the ATLAS Data Carousel monitoring to ensure we can spot errors or inefficiencies. The latter is very simple and has no time series available.

      In the middle of splitting BNLLAKE into a DATADISK part and a LOCALGROUPDISK part. Determining the proper paths for the HPSS system to ensure accurate accounting of tape usage and to make tape recycling easier. Working with DDM Ops to rationalize space reporting, since these are Disk/Tape endpoints.

    • 13:40 14:00
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • Running OK over the last 2 weeks
        • Major Rucio issue at CERN yesterday led to massive blacklisting (Oracle overload).
        • AGLT2: Still recovering from the infrastructure upgrade (nearly there).
        • MWT2: Had difficulty last weekend because a jobset exhausted the dCache connections to various pools.
        • MWT2: aipanda158 has been withdrawn from service and replaced with aipanda159; hopefully this will solve the longstanding job starvation issue.
        • SWT2: Condor and A/C issues.
      • 13:40
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)

        MSU site:

        1) All WNs online (still at the old location).  
        2) All switches online at Data Center.  Next move will include all recent T2 WNs, now set for July 21st.  

        UM site:

        Still in the recovery process. 

        1) IPv6 network issues on a few worker nodes; for now, using a fixed IPv6 configuration resolves the problem.

        2) None of the C6420 worker nodes with Intel NICs light up.

        3) About 2/3 of the UM worker nodes are back online; the other 1/3 need network configuration/debugging.

        Upgraded dCache from 6.2.21 to 6.2.23; the update went smoothly.

      • 13:45
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Jess Haney (Univ. Illinois at Urbana Champaign (US)), Judith Lorraine Stephen (University of Chicago (US))
        • Lost files reported for MWT2_UC_LOCALGROUPDISK. Marked as lost in Rucio.
        • dCache overloaded due to analysis jobs with directio. Increased the number of pool movers.
        • Power outage at the UIUC datacenter last week. Most workers are back online now.
        • Building new IU perfSONAR nodes
        • Benchmarking Dell CPUs for upcoming compute purchase
      • 13:50
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))


        No GGUS tickets.

        Production dip due to global PanDA issue affecting all sites. 

        Normal operations otherwise.

        Hardware for NESE Tape endpoints has arrived. 

        Upgrading to xrd 5.3.0 and setting up final configs is our highest priority.

        The annual power maintenance down day is August 9.

      • 13:55
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        OU:

        - Upgraded xrootd proxy (se1) to 5.3.0.rc2, which fixed xrootdfs and http-tpc bugs. Will add site to http-tpc testing when I'm back from vacation on 7/19.

        UTA:

        - Discovered a problem with Condor spool directory access on four compute nodes. (Nodes were batch system "black holes.") Fixed, and implemented monitoring to prevent a recurrence.

        - Campus facilities fixed a problem with the A/C in the machine room.

        - Awaiting delivery of the equipment from our most recent purchase (compute nodes, storage, LAN refresh).
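
        The UTA notes above mention monitoring for batch system "black hole" nodes. The following is a minimal sketch of one way such a check could work, not the monitoring actually deployed at UTA. It assumes recent job records are available as (node, succeeded) pairs; the function name, thresholds, and node names are all hypothetical.

```python
from collections import defaultdict


def find_black_holes(jobs, min_jobs=20, max_failure_rate=0.9):
    """Return the sorted names of nodes whose recent jobs almost always fail.

    jobs: iterable of (node_name, succeeded) tuples (assumed input format).
    min_jobs: ignore nodes with fewer recent jobs than this (hypothetical threshold).
    max_failure_rate: flag nodes at or above this failure fraction (hypothetical threshold).
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for node, succeeded in jobs:
        totals[node] += 1
        if not succeeded:
            failures[node] += 1
    return sorted(
        node
        for node, count in totals.items()
        if count >= min_jobs and failures[node] / count >= max_failure_rate
    )


# Made-up sample records: node-a fails every job, node-b is healthy,
# and node-c has too few jobs to judge.
sample_jobs = (
    [("node-a", False)] * 25
    + [("node-b", True)] * 30
    + [("node-c", False)] * 5
)
print(find_black_holes(sample_jobs))  # prints ['node-a']
```

        Requiring a minimum job count keeps a node with only a handful of failures from being flagged prematurely.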

    • 14:00 14:05
      WBS 2.3.3 HPC Operations 5m
      Speaker: Lincoln Bryant (University of Chicago (US))

      NERSC: switched to the 'low' priority queue to preserve the allocation.

      TACC: changed from 50-node to 100-node jobs to try to push more throughput; so far it looks good. We've used just over a quarter of the allocation.

    • 14:05 14:20
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:05
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)

        NTR

      • 14:10
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:15
        Analysis Facilities - Chicago 5m
        Speakers: David Jordan (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
        • Possibly 2 more compute/storage machines have arrived, and a third will be arriving soon (need to verify whether they are for analysis or something else).
        • 2 machines that are currently hooked up have hardware issues. We have replacement parts for one and are ready to start a case for the second.
          • Will open a case this week and hopefully have it resolved by the end of next week.
        • Still working on temperature control.
          • There was a meeting with the vendor yesterday regarding this.
        • Working on getting the Condor queue set up.
    • 14:20 14:40
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • XRootD 5.3.0 RC4 released; final release expected by Friday
      • New monitoring dashboard for upcoming data challenges
      • OSG topology validation/update ongoing
      • Working on scrubbing slides
      • 14:20
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14:25
        Service Development & Deployment 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        Ilija is on vacation.  

        Work is needed to describe procedures and responsibilities for the federated operations team and to formalize the management of frontier-squids. One goal is to implement a GitOps-like procedure to help manage updates and record history. Another is to create an alarm and alert service.

      • 14:30
        Kubernetes R&D at UTA 5m
        Speaker: Armen Vartapetian (University of Texas at Arlington (US))

        Nothing new to report.

    • 14:40 14:45
      AOB 5m

      One very important (and now overdue) item:

      OTP numbers for ATLAS for January - June 2021

      Needs to be finished TODAY if at all possible (Scrubbing work should provide the numbers?)

      Tier-1s:
      https://docs.google.com/spreadsheets/d/1zd2buKvvySIFn4MrAa1FBYY8KVrO3h9lDDZVZYeKdEc/edit#gid=0

      Tier-2s:
      https://docs.google.com/spreadsheets/d/125LFywoKP2cighPfpDuCoPkQO05bND39vicVMRokeMQ/edit#gid=0

      Cloud Operation and Management:
      https://docs.google.com/spreadsheets/d/15VN6Rfo4fTKo7KOH-A9V0so9pw1r2QwSU7vJbu2UMMI/edit#gid=1008601579