US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

 

 

    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci

      3.5.28 (this week)

      • HTCondor 8.8.12
      • XRootD 4.12.5
      • HTCondor 8.9.10 (upcoming)

      Miscellaneous

    • 13:20 13:35
      Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
      • 13:20
        TBD 15m
    • 13:35 13:40
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))
    • 13:40 14:00
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • Generally good running over the past two weeks but not in the last few days.
        • A HammerCloud bug caused many sites to be set offline unnecessarily.
        • Currently AGLT2 and NET2 have high failure rates.
        • Yesterday a problem with the Harvester instance on a VM at CERN (ai156) partially drained MWT2; a different issue with priorities drained OU.
        • SWT2_CPB was put in test status even though its failure rate is not especially high.
        • Sites have reported jobs consuming way too much memory.
        • I reported a job that was consuming ~175 GB on a server that had 256 GB of memory.
      Site              Total   Done    Failed  Canceled  Closed  %Failed
      AGLT2 (no BOINC)  93060   68754   6751    210       4643    9%
      MWT2              276422  224597  12613   991       8422    5%
      NET2              71892   49350   12264   744       3934    20%
      OU_ATLAS          8015    4324    58      31        638     1%
      OU_ATLAS_OPP      3797    3219    17      17        514     0%
      SWT2_CPB          97828   79512   2922    352       3818    3%
      UTA_SWT2          32183   26396   414     413       1273    2%
      • The OSG team detected an issue with SWT2_ATLAS_UTA accounting showing low CPU efficiency.
        • I am still struggling with CRIC to reproduce the plots of the official numbers that I knew how to make with the LCG/EGI accounting website. Ofer pointed out to me that using monit might be simpler.
        • We need to do a better job watching these numbers for all sites.
        • I will validate the November numbers for the US Tier 2s in the next day or two.
      • Finally received permission to change the CRIC information for Lucille and marked all queues, services, SEs, and the site itself as disabled. The site is thus hidden while retaining its historical record.
      • At Ofer's request I reran Judith's stuck/suspended script for one RSE (MWT2_UC_LOCALGROUPDISK) and found that many transfers seem to have gone back into a SUSPENDED state in which they cannot be deleted by the automatic procedure.
        • To be clear: nearly all of these transfer issues are not caused by the destination site, and all LOCALGROUPDISKs are affected.
        • Don't know about the other classes of endpoints...
        • We need to follow up on this as it probably wastes storage space.
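As a cross-check of the %Failed column in the table above, a minimal Python sketch. The formula Failed/(Done+Failed) is an assumption on my part; it reproduces most rows of the table to within rounding:

```python
# Cross-check of the %Failed column in the Tier-2 summary table above.
# Assumption: %Failed ~= Failed / (Done + Failed); this matches most
# rows to within rounding.

def failure_rate(done: int, failed: int) -> float:
    """Fraction of terminal (Done + Failed) jobs that failed."""
    return failed / (done + failed)

# (Done, Failed) pairs copied from the table above.
sites = {
    "AGLT2 (no BOINC)": (68754, 6751),
    "MWT2": (224597, 12613),
    "NET2": (49350, 12264),
    "OU_ATLAS": (4324, 58),
    "OU_ATLAS_OPP": (3219, 17),
    "SWT2_CPB": (79512, 2922),
    "UTA_SWT2": (26396, 414),
}

for site, (done, failed) in sites.items():
    print(f"{site:17s} {failure_rate(done, failed):5.1%}")
```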
      • 13:40
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)

        Incident:

        One iSCSI storage device used by our VMware cluster to store VM images failed completely, and about 1/3 of the VMs became unresponsive, including the dCache door nodes, the HTCondor head node, and the gatekeepers; the site remained in downtime for 1.5 days. The Dell storage was recovered without data loss, and we migrated the VM images to other storage locations.

        The gatekeeper received very few incoming jobs; it recovered after we restored the iSCSI VMware storage device.

        The site was flooded (over 60% of job slots) with high-memory jobs requesting 3 GB to 6 GB of RAM. Most of our worker nodes do not have that much RAM per core, so some became unresponsive due to heavy swap usage. This stemmed from a misunderstanding about the high-memory queue: we thought it was set to a 3 GB/core maxrss. We are still working on job-routing rules to adapt to this change, and have also set a limit on the number of jobs in the high-memory queue.
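One common protection against jobs like this (a sketch only, not AGLT2's actual configuration; the 25% headroom is an illustrative value) is an HTCondor periodic-hold policy that puts jobs on hold once they exceed their memory request, before they drive worker nodes into swap:

```
# condor_config fragment (illustrative, not AGLT2's actual policy):
# hold running jobs whose measured memory usage exceeds their request
# by more than 25%, before swap thrashing makes the node unresponsive.
SYSTEM_PERIODIC_HOLD = (JobStatus == 2) && (MemoryUsage > 1.25 * RequestMemory)
SYSTEM_PERIODIC_HOLD_REASON = "Job exceeded requested memory"
```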

         

        Ticket

        Closed ticket 149378 (dCache transfer/deletion errors): the deletion errors were caused by a down dCache door node, which in turn was caused by the VM storage issue. We declared as lost those files whose metadata was lost in dCache; we had missed them when summarizing the files lost on 4 October due to the loss of 2 virtual disks.

      • 13:45
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))

        Upgraded dCache to 5.2.35 and changed CAs to use osg-ca-certs instead of OASIS

        Updated all UC and IU machines using XL710s to kernel-ml, which appears to have fixed the 1099 errors

        New UIUC purchase received and in the process of being installed

      • 13:50
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

         

        GPFS issue => DDM errors => GGUS ticket.  Resolved this morning.

        Production smooth otherwise.

        Installing NESE Tape system.

        Preparing for xrootd HTTP-TPC

      • 13:55
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        UTA_SWT2:

        May need to shrink OSG pool if fewer COVID jobs are running

        SWT2_CPB:

        Lingering problem with data transfers (ticket 149701). Suspect a nearly empty data server is the cause; will re-evaluate once the server is drained.

         

        OU:

        - No site problems, running well

        - Site was drained yesterday, but Rod fixed that by fudging weights

         

    • 14:00 14:05
      WBS 2.3.3 HPC Operations 5m
      Speakers: Doug Benjamin (Duke University (US)), lincoln bryant

      NERSC down to less than 5M MPP hours; we might not get any more time. We have been given 50M hours above our initial allocation of 104M MPP hours.

      NERSC down 15-20 Dec.

      TACC ramped up to 7K concurrent slots before an outage. In the last week it simulated 7.8M events.

      ALCF is ramping up.

      Raythena debugging continues

       

    • 14:05 14:20
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:05
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)
      • 14:10
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:15
        Analysis Facilities - Chicago 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        Lincoln created a nice NFS volume for the platform. I will be adding an option for it to the frontend.

        Analytics sites had a bit of downtime due to one node running out of space.

    • 14:20 14:40
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind
      • 14:20
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14:25
        Service Development & Deployment 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        XCache

        * Had issues with full ephemeral storage at NET2 and AGLT2.

        * Agreed with Andy and Matevz on an XRootD CCM plugin for sending heartbeats from servers; it should be ready for 5.2.0.

        VP

        * Agreed with the Rucio folks on xcache/CRIC/Rucio/VP communication.

        ServiceX

        * Testing deployments on different k8s clusters.

    • 14:40 14:45
      AOB 5m