US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09


    • 13:00 - 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

      Quarterly reports due ASAP (deadline Friday)

          - All milestones updated?

      Next week is ATLAS S&C week.  See https://indico.cern.ch/event/998138/ for some of the agenda.


    • 13:10 - 13:20
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci

      OSG 3.5.29 (release tomorrow)

      • XRootD 4.12.6
      • osg-configure 3.11.0
        • Make fetch-crl success optional (though give a warning on failure; see the sketch after this list)
        • Don't try to resolve the Squid server, since it only needs to be visible from the workers, not the CE
      • IGTF 1.109
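
      A minimal sketch of the fetch-crl change above: treat a failed fetch-crl run as a warning rather than a fatal configuration error. This is an illustrative Python pattern, not osg-configure's actual code; the function name and logging setup are assumptions.

          import logging
          import subprocess

          log = logging.getLogger("configure-sketch")

          def verify_fetch_crl() -> None:
              """Run fetch-crl, warning (not failing) if it does not succeed."""
              try:
                  subprocess.run(["fetch-crl"], check=True,
                                 capture_output=True, timeout=600)
              except (subprocess.CalledProcessError,
                      subprocess.TimeoutExpired, FileNotFoundError) as err:
                  # Success is optional: log a warning and carry on.
                  log.warning("fetch-crl did not succeed (%s); continuing", err)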

      Other

    • 13:35 - 13:40
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Eric Christian Lancon (CEA/IRFU, Centre d'etude de Saclay Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))
      • dCache incident over the new year holiday
        • The dCache disk filled up, causing job stage-out failures. There is a discrepancy between dCache's space reporting and real space usage; a stop-gap solution is in place and a long-term solution is under investigation (see the sketch after this list).
      • data transfer failures for BNLHPC_DATADISK
        • The Globus Online endpoint was overloaded. Solved by replacing NFS with Lustre as the shared filesystem; a long-term solution is under discussion.
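
      A minimal sketch of the kind of stop-gap space check mentioned above, comparing a dCache-reported free-space figure against the filesystem's own view of the pool partition. The mount point, the 5% tolerance, and the source of the reported figure are illustrative assumptions.

          import shutil

          POOL_MOUNT = "/dcache/pool"   # hypothetical pool partition
          TOLERANCE = 0.05              # flag discrepancies above 5% of capacity

          def check_space(reported_free: int, mount: str = POOL_MOUNT) -> None:
              # reported_free would come from dCache's space accounting; here
              # the caller supplies it, since querying dCache is site-specific.
              usage = shutil.disk_usage(mount)
              drift = abs(reported_free - usage.free) / usage.total
              if drift > TOLERANCE:
                  print(f"WARNING {mount}: dCache reports {reported_free} bytes free, "
                        f"filesystem reports {usage.free} ({drift:.1%} of capacity)")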
       
    • 13:40 - 14:00
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • Please get your quarterly reports for Oct-Dec in today!
      • The last 2 weeks of running were OK:


        • AGLT2 had a dCache upgrade and various knock-on effects afterwards.
        • MWT2 has a downtime today at the Illinois site.
        • NET2 had some storage outages.
        • SWT2 had an incident at OU where no jobs were scheduled, and storage issues at CPB.
      • 13:40
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)

        Incident:

        The job failure rate rose to over 40%, with file stage-in and stage-out errors.
        These errors affected dCache file replication between UM and MSU, which in turn caused jobs to fail.

        We spent several days investigating (restarting all dCache nodes,
        upgrading firmware, and testing the network between UM and MSU).
        We then noticed a strange problem that did not immediately seem related:
        packet loss of a few percent between some pairs of MSU and UM pool nodes.
        The problem was not correlated with any specific node or site.
        Another oddity, and a key observation, was that some pairs of nodes
        showed errors on the public path but not on the private path,
        even though both paths share the same interfaces.

        This shifted suspicion to an effect of hashing across the multiple cables between switches.
        We could not locate the problem on any of our on-site switches.
        Then UM IT notified us of link errors on one of the two links (2 x 40 Gbps) between MSU and UM.
        The bad half-link was taken offline, and all ping and dCache errors were resolved.
        The cause of the bad link is still under investigation; a probe like the sketch below helps localize this kind of fault.
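
        A minimal sketch of the kind of pairwise loss probe that localizes this sort of fault; the host names are hypothetical. Run it from each pool node at one site against the pool nodes at the other: with hashing over a link aggregate, only flows that land on the faulty member show loss, so the failing source/destination pairs point at the bad cable rather than at any one node.

            import re
            import subprocess

            # Hypothetical remote pool list; substitute the real MSU (or UM) hosts.
            REMOTE_POOLS = ["msufs01.aglt2.org", "msufs02.aglt2.org"]

            def loss_percent(host: str, count: int = 50) -> float:
                # Ping the host and return the packet-loss percentage it reports.
                out = subprocess.run(
                    ["ping", "-c", str(count), "-i", "0.2", "-q", host],
                    capture_output=True, text=True).stdout
                m = re.search(r"([\d.]+)% packet loss", out)
                return float(m.group(1)) if m else 100.0

            for host in REMOTE_POOLS:
                print(f"{host}: {loss_percent(host):.1f}% loss")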

        The job failure rate for the past 12 hours was 1.6%.

        Software:

        Updated HTCondor from 8.8.11 to 8.8.12 (the latest version in the OSG repo) on the head node and gatekeepers,
        and updated the worker nodes to 8.9.11 from the osg-upcoming repo.

        After the gatekeeper update, we saw the number of running jobs start to drop,
        but no error messages were spotted in the log files.
        Restarting the condor and condor-ce services fixed the issue (see the sketch below).
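
        A hypothetical watchdog sketch of the workaround above: if the running-job count sinks below an assumed floor, restart the condor and condor-ce services. The threshold, check interval, and the idea of automating the restart are illustrative assumptions, not the procedure actually used at AGLT2.

            import re
            import subprocess
            import time

            THRESHOLD = 100   # assumed floor for a healthy running-job count
            INTERVAL = 300    # seconds between checks

            def running_jobs() -> int:
                # Parse the running-job count from condor_q's totals line.
                out = subprocess.run(["condor_q", "-totals", "-allusers"],
                                     capture_output=True, text=True).stdout
                m = re.search(r"(\d+) running", out)
                return int(m.group(1)) if m else 0

            while True:
                if running_jobs() < THRESHOLD:
                    # The fix observed in practice: restart both services.
                    subprocess.run(["systemctl", "restart", "condor", "condor-ce"])
                time.sleep(INTERVAL)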

        Hardware:

        Updated the firmware, with a reboot, on all our R740xd2 and C6420 machines
        (Dell sent a warning email about a critical BIOS update).

        Called Dell support to update firmware on some older pool nodes
        where the command-line DSU method was failing.

      • 13:45
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))

        One of the dCache storage pools went offline. Working on identifying the cause and bringing it back up.

        Power issues in the NCSA server room. PDU replacements were made last week, and the affected systems will be brought back up during today's PM (preventive maintenance) window.

        NCSA quarterly PM is today. All UIUC workers are in downtime until 8pm for system updates.

      • 13:50
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))


        Problems:

        A brownout at MGHPCC caused us to lose ~1 hour of useful operation time. Solved without a GGUS ticket :)

        We're currently getting controller errors in the GPFS system pool. To repair this, we need to free up some space, evacuate, and rebuild; this is in progress without interrupting production.

        Smooth operations otherwise. 

        The main things we're working on:

        Testing xrootd 5.0.3 endpoints with help from Wei.  

        OSG 3.5.

        NESE prep.


      • 13:55
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        OU:

        - Nothing to report, running well.

        SWT2_CPB:

        Power loss at SWT2_CPB on 1/14 during campus electrical work, due to a generator failure. Awaiting clarification from the Physical Plant.

        A rack-level switch locked up yesterday, isolating two storage hosts and an NFS host and producing GGUS ticket 150261.

        Starting to work on moving the Xrootd door to OSG 3.5.


    • 14:00 - 14:05
      WBS 2.3.3 HPC Operations 5m
      Speakers: Doug Benjamin (Duke University (US)), Lincoln Bryant

      NERSC ran out of allocation, so it has been off; the new allocation period starts tomorrow.

           - Will test the FastCaloGAN container on NERSC GPUs when NERSC is back from downtime

      In recent running at TACC, all jobs failed according to PanDA; when Lincoln is back, we will need to debug the errors.

      ALCF Theta: we got a large amount of CPU hours, which overloaded the Globus endpoint at BNL.

           Solution: switch the storage backend from the NFS server to Lustre.

           Longer term: test dCache with Globus.


    • 14:05 - 14:20
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:05
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)
      • 14:10
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:15
        Analysis Facilities - Chicago 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        All working fine.

    • 14:20 - 14:40
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind
      • Adding "Ops Round Table" to ADC Weekly starting Feb. 2nd with shared doc for topics
      • Event Service configuration issues at OU - Horst to follow up with developers
      • Met yesterday to discuss Frontier/Squid deployments - planning a topical presentation for the next meeting (in two weeks); updated the associated milestone
      • Need input for the quarterly report (QR)
      • 14:20
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14:25
        Service Development & Deployment 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        XCache:

        * A lot of changes to the OSG-based image configuration.

        * Testing XCache 5.1.0rc4 at the UC River cluster and in production at SWT2_CPB.

        VP:

        * All running fine.

        * Developments related to heartbeats.

        * Still working on opening the CERN k8s load balancer.


    • 14:40 - 14:45
      AOB 5m