US ATLAS Computing Facility

US/Eastern
    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

      US ATLAS Computing Facility Capacity Spreadsheet: https://bit.ly/usatlas-capacity

      Through March 2020 (FY20Q2):

      • V52: CPU capacity increments & retirements
      • WLCG-v52: Pledge figures from REBUS available (fill in as needed)
      • WLCG-v52, Table 1: Installed storage capacity 
      • WLCG-v52, Table 2: FY20 Procurement plans
      • WLCG-v52, Table 3: Retirements
      • WLCG-v52, Table 4: AUX equipment (non-CPU, non-disk) 

       

    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci

      3.5.16 and 3.4.50

      • Frontier Squid
      • XRootD 4.11.3-1.2 for 3.4 (already released in 3.5), including a fix for a core dump seen at OU
      • HTCondor 8.8.9 and 8.9.7

      Other

      • We've built the osg-wn-client and relevant packages for EL8!
      • XRootD 5 RC and plugins have successfully passed internal tests
    • 13:20 13:40
      Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
    • 13:35 13:40
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))
      • Working fine
      • dCache upgrade scheduled in 3 weeks
      • Intel CPU delivered end of June 
    • 13:40 14:00
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))

      For all sites that see a small percentage of jobs fail with timeouts on input/output:

      We are investigating the interaction between the Rucio mover, gfal2, and XRootD. In a number of cases the actual transfer was not even attempted, and the reason seems to be the way the Rucio mover tries to stat the file and get its checksum. Hopefully a fix will come soon; once it is ready, we will try to get it tested and deployed quickly. This does not exclude the possibility that other issues are lurking there.
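      One way to quantify the "transfer was not even attempted" observation is to group timed-out jobs by the last step they started. The sketch below is illustrative only: the step names, the record layout, and the sample data are assumptions made for this example and are not taken from the Rucio mover or its logs.

```python
from collections import Counter

# Hypothetical step names, in the order a stage-in is assumed to run them.
STEPS = ("stat", "checksum", "transfer")


def last_step_reached(steps_started):
    """Return the latest step (in STEPS order) that the job started, or None."""
    reached = None
    for step in STEPS:
        if step in steps_started:
            reached = step
    return reached


def classify_timeouts(failed_jobs):
    """Count timed-out jobs by the last step they started."""
    counts = Counter()
    for job in failed_jobs:
        counts[last_step_reached(job["steps_started"]) or "none"] += 1
    return counts


# Made-up records, for illustration only.
failed_jobs = [
    {"job_id": 1, "steps_started": ["stat"]},
    {"job_id": 2, "steps_started": ["stat", "checksum"]},
    {"job_id": 3, "steps_started": ["stat", "checksum", "transfer"]},
    {"job_id": 4, "steps_started": ["stat"]},
]

counts = classify_timeouts(failed_jobs)
# Jobs that timed out before the transfer step was ever started.
never_transferred = sum(n for step, n in counts.items() if step != "transfer")
print(dict(counts), never_transferred)
```

      A large share of jobs in the "stat" or "checksum" buckets would point at the pre-transfer metadata calls rather than at the data movement itself.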

      Fred:

      It was an OK week for production.

      • There were a number of tasks with high failure rates, but the cause was on the submission side.
        • Most recently, in the last day, looping event generation jobs were killed as a group.
      • I was going to mention the Rucio transfer issue but Ilija beat me to it by providing the notes above.
      • There was also an unintended Rucio release, which caused trouble for about 1 day.
      • Several sites had short-term issues.
      • Covid jobs seemed to run OK but of course reduced ATLAS production.
        • NET2 had some stage-out issues with the covid jobs.
      • Looks like recovering just over a month (Feb 28 to Apr 8) of accounting data for CPB will be hard. Right now CPB is not reporting anything to the official GRACC/APEL system for the entire month of March.
      • Port scanning from LHCONE????
      • 13:40
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)

        incidents:

        21st April: one of our new R740x2d dCache servers died; the daughterboard was burnt. We got it replaced within 48 hours, with Dell sending an onsite technician. Before that, we submitted a JIRA ticket to declare the files unavailable.

        Services:

        We still see jobs getting killed due to OOM, about 200 jobs in 2 weeks. This mostly happens on worker nodes with less than 2GB/core. We are in the process of 1) adding more memory to worker nodes using retired parts, and 2) disabling HT on worker nodes without spare DIMM parts.
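        The arithmetic behind both remedies can be sketched as follows. This is a minimal illustration that assumes one job slot per logical core and two hardware threads per physical core when HT is on; the node sizes are hypothetical and are not AGLT2's actual hardware.

```python
def memory_per_slot_gb(total_memory_gb, physical_cores, hyperthreading):
    """Memory per job slot, assuming one slot per logical core."""
    # Assumption: HT exposes 2 logical cores per physical core.
    threads_per_core = 2 if hyperthreading else 1
    return total_memory_gb / (physical_cores * threads_per_core)


# Hypothetical worker node: 16 physical cores, 48 GB of RAM.
with_ht = memory_per_slot_gb(48, 16, hyperthreading=True)
without_ht = memory_per_slot_gb(48, 16, hyperthreading=False)
# Same node after a hypothetical upgrade to 64 GB, HT left on.
more_ram = memory_per_slot_gb(64, 16, hyperthreading=True)

print(with_ht, without_ht, more_ram)  # 1.5 3.0 2.0
```

        Under these assumptions, disabling HT doubles the memory per slot at the cost of half the slots, while adding DIMMs raises it without reducing the slot count.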

        We see that 60% of the cluster is being used by analysis jobs. This might be caused by our recent reconfiguration of Condor and the gatekeeper, done to balance giving enough cores to COVID-19 jobs against having less fragmentation in Condor cores. Too many analysis jobs seem to increase the job failure rate at the site.

        Condor has been updated to 8.8.8.

         

        Hardware:

        Retired 20TB usable space from dCache to get spare parts to cover the storage enclosures not under warranty anymore. 

         
      • 13:45
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
        • Fixing storage issues at UC. Two of our older out-of-warranty servers have been having controller issues. Currently draining the pools that are still online and trying to recover data from the pools that are failing
        • The root disk on the UC gatekeeper filled up, causing job failures this morning
        • NVIDIA drivers updated on the ML platform
        • LOCALGROUPDISK filled up last Friday. Cleanup ongoing, now down to 97% full
      • 13:50
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))
      • 13:55
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        UTA_SWT2:

        • GGUS-Ticket-ID: #146691 Concerns SAM test for Frontier setup.  Only the test is affected, jobs are fine.   Test probably needs to be updated.
        • Ramping up OSG Covid-19 jobs

        SWT2_CPB:

        • GGUS-Ticket-ID: #146694  Same issue as seen above.
        • GGUS-Ticket-ID: #146387  now closed.
        • Met with networking staff for IPv6 discussions.  They are evaluating options before committing to a timeline.

        OU:

        • Not much, all running well

        • Upgraded xrootd to 4.11.3, which fixed space reporting and logging, and we were able to delete some old data from OU_OSCER_ATLAS_LOCALGROUPDISK

         

    • 14:00 14:05
      WBS 2.3.3 HPC Operations 5m
      Speaker: Doug Benjamin (Duke University (US))

      Issues with credential expiration delayed processing at NERSC, but NERSC is running again. We did ramp up to over 2K jobs.

      Work continues on integrating TACC. Lincoln and Doug will work together tomorrow.

    • 14:05 14:20
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:05
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)
      • 14:10
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:15
        ATLAS ML Platform & User Support 5m
        Speaker: Ilija Vukotic (University of Chicago (US))

        Running smoothly.

        Opportunistic Folding@home running got us to sixth place:

        https://stats.foldingathome.org/team/38188

    • 14:20 14:40
      WBS 2.3.5 Continuous Operations
      Conveners: Ofer Rind, Robert William Gardner Jr (University of Chicago (US))

      Ofer, Fred, Johannes will meet on Friday to follow up on Fred's report from last week and discuss monitoring & procedures for the US cloud.

      • 14:20
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14:25
        Analytics Infrastructure & User Support 5m
        Speaker: Ilija Vukotic (University of Chicago (US))

        ES running smoothly. 

        Changes in collection at CERN. Added a JEDI task parameters data source. It is very complex but gives us possibilities we did not have before. Ivan and Mayuko are working on it. Now investigating site exclusion by users.

         

      • 14:30
        Intelligent Data Delivery R&D (co-w/ WBS 2.4.x) 5m
        Speakers: Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))

        Slowly ramping up with XCaches and VP.

        AGLT2 - replaced their node with a new one with more storage. Change them to direct access.

        Prague - running smoothly. Will upgrade further this or next week.

        LRZ - issue with the cleanup; disk usage managed to cross the high-water mark (HWM).
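        For context, cache cleanup of this kind is commonly driven by two thresholds: purging starts once usage exceeds a high-water mark and continues until usage falls to a low-water mark. The sketch below is a generic illustration of that scheme with least-recently-accessed eviction. It is not XCache's implementation, and the function name, thresholds, and file list are all hypothetical.

```python
def purge_to_low_watermark(files, capacity, hwm, lwm):
    """Pick files to delete under a high/low watermark policy.

    files: list of (name, size, last_access) tuples.
    hwm, lwm: thresholds as fractions of capacity, with lwm < hwm.
    Returns the names to delete, least recently accessed first.
    """
    used = sum(size for _, size, _ in files)
    if used <= hwm * capacity:
        return []  # below the high-water mark: nothing to do
    to_delete = []
    for name, size, _ in sorted(files, key=lambda f: f[2]):
        if used <= lwm * capacity:
            break  # reached the low-water mark
        to_delete.append(name)
        used -= size
    return to_delete


# Hypothetical cache: capacity 100 units, HWM 95%, LWM 90%.
cache_files = [("a", 40, 1), ("b", 30, 2), ("c", 27, 3)]
victims = purge_to_low_watermark(cache_files, capacity=100, hwm=0.95, lwm=0.90)
print(victims)  # ['a']
```

        If the purge step stalls or cannot delete files fast enough, usage can climb past the high-water mark even though the policy is configured correctly.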

        ROOT TChain bug discovered and fixed. Waiting for the LCG build to get it in production.

         

    • 14:40 14:45
      AOB 5m