US ATLAS Computing Facility

US/Eastern
Videoconference Rooms
US_ATLAS_Computing_Integration_and_Operations
Name
US_ATLAS_Computing_Integration_and_Operations
Description
Bi-weekly Facilities meeting
Extension
109263008
Owner
Robert William Gardner Jr
    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci

      OSG 3.4.32

      • Singularity 3.2.1 with patches for the following issues:
        • https://github.com/sylabs/singularity/pull/3803
      • Gratia probe with Slurm v18 support
      • XCache 1.1 (with ATLAS XCache RPM)
      • Various osg-configure fixes and improvements

      OSG 3.5

      Planning for a July/August release; Q4 2020 tentative end of OSG 3.4 support. Highlights for 3.5:

      • Drop EL6 support
      • Drop CREAM support from HTCondor
      • Leverage GridFTP, MyProxy, GSI OpenSSH from EPEL
      • Retire RSV
    • 13:20 14:00
      Topical Report
      • 13:20
        TBN 15m
    • 13:40 14:25
      US Cloud Status
      • 13:40
        US Cloud Operations Summary 5m
        Speaker: Mark Sosebee (University of Texas at Arlington (US))
      • 13:45
        BNL 5m
        Speaker: Xin Zhao (Brookhaven National Laboratory (US))
        • normal operation
        • continue work on setting up a PanDA queue to use GPUs on BNL IC  (Doug@BNL)
      • 13:50
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)

        Incidents:

        2 GGUS tickets:

        1) Over 50% job failure on the analysis queue caused it to be auto-excluded. We verified that nothing was wrong with the site; the failures were caused by the jobs themselves. We requested that the ticket be closed, but nobody closed it, and it was eventually auto-closed.

        2) One worker node with a different hardware/software configuration (used for testing) caused job failures after exhausting its CVMFS cache space. We excluded the node from HTCondor and closed the ticket. The node will not be put back online until more memory is added.

        Service:

        To fix a bug in the HTCondor quota handling that affected our Tier 3 users with small submissions, we upgraded HTCondor from 8.4.11 to 8.6.13 (the most recent stable release).

        We did not schedule a downtime; instead we performed the upgrade in 4 batches across the cluster. Worker nodes had to be retired and drained of Condor jobs before being upgraded. The overall process took about 5 days to finish.
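        A rolling upgrade of this kind amounts to partitioning the worker-node list into batches, then draining and upgrading one batch at a time so the rest of the cluster keeps running jobs. A minimal sketch of the batching logic (the node names and batch count are illustrative assumptions, not the actual AGLT2 inventory):

        ```python
        # Sketch of a rolling, no-downtime upgrade: split the nodes into
        # batches, then drain and upgrade each batch in turn.
        # Node names below are hypothetical.

        def make_batches(nodes, n_batches):
            """Split the node list into n_batches roughly equal groups."""
            size, rem = divmod(len(nodes), n_batches)
            batches, start = [], 0
            for i in range(n_batches):
                end = start + size + (1 if i < rem else 0)
                batches.append(nodes[start:end])
                start = end
            return batches

        nodes = [f"wn{i:03d}.example.org" for i in range(1, 11)]
        for batch in make_batches(nodes, 4):
            for node in batch:
                # In practice: retire/drain the node's Condor jobs, wait for
                # running jobs to finish, upgrade the RPMs, rejoin the pool.
                print(f"draining and upgrading {node}")
        ```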

        Hardware:

        MSU had one of the PDUs replaced. 


      • 13:55
        MWT2 5m
        Speakers: Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US))

        Continued mystery surrounding the access mode selected with rucio-mover, etc.

        Harvester problems - faulty submit host blocking jobs.  Rod triaging.

        Working on network reconfiguration at IU.

        Working on the Frontier-analytics deployment MOU with Nurcan.

      • 14:00
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

        We had a couple of issues with CRL updates; fixed for now, but still working on it.

        Outstanding problem with excessive squid failover.  Still investigating.

        C7 and rucio-mover transition done.

        Lots of NESE related work.


      • 14:05
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        SWT2_CPB:

        • Updated to CentOS 7 and moved to Slurm as the batch manager
        • Tweaking the Slurm configuration for priority considerations
        • Recent problems with chilled water supply in machine room
        • Submitted a purchase order for 1.6PB (raw) and 10 compute nodes (560 slots)
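        Priority tuning of this kind is typically done through Slurm's multifactor priority plugin in slurm.conf. A minimal sketch of the settings involved (the weights and decay values below are illustrative assumptions, not SWT2_CPB's actual configuration):

        ```
        # slurm.conf fragment (illustrative values only)
        PriorityType=priority/multifactor
        PriorityDecayHalfLife=7-0        # fair-share usage decays with a 7-day half-life
        PriorityMaxAge=7-0               # age factor saturates after 7 days of waiting
        PriorityWeightFairshare=100000   # favor accounts below their fair share
        PriorityWeightAge=10000          # let long-waiting jobs climb the queue
        PriorityWeightQOS=50000          # differentiate QOS levels (e.g. production vs. analysis)
        ```

        The relative magnitudes of the weights determine which factor dominates; `sprio` can be used to inspect the resulting per-job priority breakdown.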

        UTA_SWT2:

        • Will move to CentOS 7 once the Slurm configuration at SWT2_CPB is finalized.


      • 14:10
        HPC Operations 5m
        Speaker: Doug Benjamin (Duke University (US))
      • 14:15
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:20
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)
    • 14:25 14:30
      AOB 5m