US ATLAS Computing Facility

US/Eastern
    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci

      OSG 3.5.20

      • HTCondor-CE 4.4.0
      • Frontier Squid 4.12-2
      • CVMFS 2.7.3
      • scitokens-cpp 0.5.1

      Other

    • 13:20 13:35
      Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
      • 13:20
        Kubernetes Activities at SWT2 15m
        Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    • 13:35 13:40
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))
      • normal operation in general
      • RPVLL reprocessing started last Thursday. The BNL staging rate, as shown on the DDM dashboard, is 2-3 GB/s. A temporary glitch between HPSS and dCache affected retrieval of staged files from the HPSS disk cache to dCache DATATAPE for a couple of hours. It was fixed by restarting HPSS batch; the root cause is under investigation.
      • 40 new WNs added to the T1 farm
        • Supermicro SYS-6019U-TR4 servers. 2 x Xeon Cascade Lake 6252 CPUs (96 logical cores total). 12 x 16 GB (192 GB total) DDR4-2933 MHz DIMMs. 4 x 2 TB SSDs. 2 x 1 Gbps LACP link. 1141 HS06 per node.
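The per-node figures above can be rolled up into totals for the 40 new worker nodes. The following is a minimal sketch that only multiplies the numbers quoted in the report; the `NodeSpec` class and `fleet_totals` function are hypothetical names introduced here for illustration, not part of any facility tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Per-node figures as quoted in the Tier1 report."""
    logical_cores: int
    ram_gb: int
    ssd_tb: int
    hs06: float


def fleet_totals(node: NodeSpec, count: int) -> dict:
    """Multiply each per-node figure by the number of nodes."""
    return {
        "logical_cores": node.logical_cores * count,
        "ram_gb": node.ram_gb * count,
        "ssd_tb": node.ssd_tb * count,
        "hs06": node.hs06 * count,
    }


# 12 x 16 GB DIMMs and 4 x 2 TB SSDs per node, as listed above.
NEW_WN = NodeSpec(logical_cores=96, ram_gb=12 * 16, ssd_tb=4 * 2, hs06=1141)

totals = fleet_totals(NEW_WN, count=40)
ram_per_core_gb = NEW_WN.ram_gb / NEW_WN.logical_cores

if __name__ == "__main__":
    for key, value in totals.items():
        print(f"{key}: {value}")
    print(f"RAM per logical core: {ram_per_core_gb} GB")
```

Under these figures the new nodes add 3840 logical cores and 45640 HS06 in total, with 2 GB of RAM per logical core.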
    • 13:40 14:00
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))

      Tier 2 Notes:

      1. Please complete your additional funds requests and site reporting before the COB today. (SWT2 missing in https://docs.google.com/document/d/170cSG0BVKzntxVaa0gGrj1TClmwd4FQAUQMAxFEM2Hw/edit# )
      2. Problems over the last 2 weeks:
        1. AGLT2: dCache issues - working to find a good version. Transfer success rate ~40%.
          • The 40% success rate was apparently a monitoring issue with the new Site Oriented Dashboard. As of today the plot shows 90%-100% over the last week, and it looks much different than it did yesterday.
        2. MWT2: Transfer troubles, with the largest issue traced to an IPv6 problem at UIUC. Transfer issues remain at a low level after the IPv6 problem was fixed. (All sites have a low rate of transfer issues.)
        3. NET2: Ongoing transfer issues.
        4. SWT2: OU had memory issues and is implementing cgroups. A DNS issue at UTA affected CPB.
        5. External issues, particularly a DNS issue at CERN, lowered production.
      3. Had a nice conversation with Lincoln, Johannes, and Ofer last Friday about preserving the error information and putting it into the log when a Rucio transfer fails. Johannes gave some good ideas for debugging Rucio transfer issues. We are checking to see if the missing error information/logs turn up as dark data.
      • 13:40
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)

         Tickets:
           
           open 147805: Continued issues with dcache (see previous report for details)
           Tried installing 5.2.24 when it became available.
           This caused a high rate of transfer errors where the ftp connection is dropped,
           seemingly after the transfer has been negotiated but before the first byte of payload is sent.
           We downgraded back to the patch version of 5.2.22 where we still see that issue but with lower rate.

           closed 147784: catching up on updates for squid servers

           closed 147769: files not accessible.  One dcache server had most of its pools offline.
           
         Services:

           updating AFS servers to CentOS7, ongoing.

           BOINC: incremental improvements

           Condor: some misbehaving T3 jobs used more memory than should have been allowed (~10 GB instead of 2 GB)
           and caused ~50 worker nodes to become unresponsive.
           The condor configuration on the submit nodes was updated to protect against this problem.

           Working on updating/securing ELK at AGLT2.  Complete except that the base OS is SL6 and ELK 7.8 needs CentOS 7+.

        Hardware:

          Ordered 26x C6420s (20 for UM, 6 for MSU) and 7x R740XD2 (5 for UM, 2 for MSU)

         

      • 13:45
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))

        UC

        • Analytics cluster upgraded to Elasticsearch 7.8
        • Benchmarked Dell nodes for new purchase

        IU

        • Working on IPv6 configuration

        UIUC

        • ICC Quarterly PM July 15. All worker nodes updated to the latest kernel and GPFS client
        • IPv6 issues on a number of workers were causing problems connecting to both CERN and the UC storage. Fixed after reboots
      • 13:50
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

        Smooth operations, no tickets, site is full... other than low level DDM timeouts, mainly to NL cloud.  

        NESE_DATADISK now used for job staging as well as general I/O.

        Planning for trip to MGHPCC to replace fans, broken disks, re-cable NET2 6PB in NESE for expanding NESE_DATADISK.

        NESE Tape Tier solution will be IBM TS4500 (same as BNL).  Configs and quotes are close to finalized.  Space, power, and cooling are being prepared at MGHPCC.  One pod will be dedicated to tape libraries; it is large enough to hold four 18' libraries.  A neighboring pod will hold the front end system and ATLAS DDM nodes.  Protocols will be POSIX (GPFS) with the file system also covered as S3.  

        We've been in touch with Lincoln, re: SLATE.  No particular security issues are a problem for BU Research Computing.  We are following Lincoln's instructions and will then likely have a session with Lincoln to get things going.  

        We've ordered 16 new AMD worker nodes from Dell. 

        Additional infrastructure requests set up in Shawn & Fred's document.

      • 13:55
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        UTA:

        • No significant issues over the last two weeks
        • Previous issue with campus/Internet DNS when changing registrar for the atlas-swt2.org domain
        • Working in earnest on deploying SLATE machine, with Lincoln's help

         

        OU:

        - Overall no problems, running well

        - Today OSCER maintenance

        - OSG downtime apparently not propagated to WLCG, investigating

        - SAM3 CE tests were submitted without maxWallTime, so they went to SLURM with UNLIMITED WallTime and timed out because of the scheduled cluster maintenance window. Opened GGUS ticket; will be fixed by SAM developers.

        - Benchmarked Gold 6230 with a lot of Fred's help: 946.39 total, for a benchmark of 11.83 HS06/HT-Core
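The per-core figure in the last item can be reproduced from the total score. This is a minimal sketch assuming the benchmarked node is a dual-socket Gold 6230 (20 cores per socket) with hyper-threading enabled, giving 80 HT cores; the socket count is an assumption not stated in the report, and `hs06_per_ht_core` is a hypothetical helper name.

```python
def hs06_per_ht_core(total_hs06: float, sockets: int, cores_per_socket: int,
                     threads_per_core: int = 2) -> float:
    """Divide a node's total HS06 score by its number of hardware threads."""
    ht_cores = sockets * cores_per_socket * threads_per_core
    return total_hs06 / ht_cores


# Assumed layout: 2 sockets x 20 cores x 2 threads = 80 HT cores.
score = hs06_per_ht_core(946.39, sockets=2, cores_per_socket=20)

if __name__ == "__main__":
    print(f"{score:.2f} HS06/HT-Core")
```

With that assumed layout the result rounds to the 11.83 HS06/HT-Core quoted above.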

         

    • 14:00 14:05
      WBS 2.3.3 HPC Operations 5m
      Speakers: Doug Benjamin (Duke University (US)), Lincoln Bryant (University of Chicago (US))
      • 500,000 SU allocation on Frontera

        • (1 SU = 56 physical Xeon Platinum cores * 1 hr)

      • Jobs execute without CVMFS, running athena:21.0.15_31.8.1-noAtlasSetup container

      • ALRB set up and maintained via cron on the login nodes

      • Have been working to understand best job "shape" for optimal throughput

      • Testing number of parallel nodes (1, 5, 10, 20, 50, 100) and varying number of events (250, 500, 1000)

      • Overall: TACC is working, slowly ramping up utilization & consulting with TACC support as we go.
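The SU definition and the job-shape test grid above lend themselves to a short calculation. The sketch below converts the allocation into core-hours using the stated definition (1 SU = 56 cores for 1 hour) and enumerates the node-count and event-count combinations being tested. The function names are hypothetical, and the 2-hour wall time in the example is an illustrative assumption, not a measured value.

```python
from itertools import product

CORES_PER_SU_HOUR = 56          # from the report: 1 SU = 56 physical cores * 1 hr
ALLOCATION_SU = 500_000         # Frontera allocation quoted above
NODE_COUNTS = (1, 5, 10, 20, 50, 100)
EVENT_COUNTS = (250, 500, 1000)


def allocation_core_hours(su: int = ALLOCATION_SU) -> int:
    """Convert an SU allocation into physical core-hours."""
    return su * CORES_PER_SU_HOUR


def job_cost_su(nodes: int, wall_hours: float) -> float:
    """SU charged for a job, assuming one SU per node-hour."""
    return nodes * wall_hours


# Every (node count, event count) combination in the test matrix.
test_grid = list(product(NODE_COUNTS, EVENT_COUNTS))

if __name__ == "__main__":
    print(f"Allocation: {allocation_core_hours():,} core-hours")
    print(f"Test configurations: {len(test_grid)}")
    # Hypothetical example: a 100-node job running for 2 hours.
    print(f"100 nodes x 2 h = {job_cost_su(100, 2.0)} SU")
```

The allocation corresponds to 28,000,000 core-hours, and the grid contains 18 configurations.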

    • 14:05 14:20
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:05
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)
      • 14:10
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:15
        ATLAS ML Platform & User Support 5m
        Speaker: Ilija Vukotic (University of Chicago (US))
    • 14:20 14:40
      WBS 2.3.5 Continuous Operations
      Conveners: Ofer Rind, Robert William Gardner Jr (University of Chicago (US))
      • Fairly smooth operations (minor dCache issue at T1, various at T2, CERN DNS and FTS problems)
      • Everyone should be using new SSB dashboard
      • Thanks to everyone who worked on PQ unification
      • Discussed Rucio error logging issue with Fred, Johannes, Lincoln; gathering info for follow-up
      • Updates to downtime declaration procedure
      • SLATE tutorial at PEARC next Friday: https://pearc20.sched.com/event/cnXu
      • Working on quarterly report
      • 14:20
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14:25
        Analytics Infrastructure & User Support 5m
        Speaker: Ilija Vukotic (University of Chicago (US))

        ES upgraded to 7.8.0

        Working on cross monitoring of clusters at UC and UM. Still not finished.

        Added Tier0 Oracle data indexing. Will take a week to set up everything.

        Issue with Panda Oracle DB. Two days' worth of data missing from it.

        Regularly helping ML platform users.

      • 14:30
        Intelligent Data Delivery R&D (co-w/ WBS 2.4.x) 5m
        Speakers: Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))

        xcache at v5.0.0-1.

        Issues with g-stream. Hopefully will be fixed in a month or so.

        Rewrote my cinfo reporter to support v3 of cinfo files.

        Will be trying new OSG images produced yesterday. They should address several important issues.

        VP running but queues still in brokeroff.

    • 14:40 14:45
      AOB 5m