US ATLAS Computing Facility

US/Eastern
    • 1:00 PM 1:10 PM
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
    • 1:10 PM 1:20 PM
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci

      OSG All Hands 2020 postponed!

      3.5.10-2 and 3.4.44-2

      VO client update for FNAL VOs, GLOW, and OSG due to InCommon subject DN format changes

      3.5.11 and 3.4.45

      OSG 3.4 has entered critical bug/security fix only support; EOL scheduled for November 2020. Last release series that supports EL6! https://opensciencegrid.org/technology/policy/release-series/

      Most package updates from here on out will only be available in OSG 3.5!

      • XRootD 4.11.3
      • XCache 1.3.0 with data integrity tool
      • Singularity 3.5.3 (OSG 3.4 only, otherwise available in EPEL)
      • CVMFS 2.7.1
    • 1:20 PM 1:35 PM
      Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
      • 1:20 PM
        Xrootd vs Http protocols in TPC 15m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
    • 1:35 PM 1:40 PM
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))
      • dCache downtime scheduled for 03/24–03/26 (48h) to upgrade to version 5.2, which adds support for SRR and TPC plus many other bug fixes and improvements
      • CPU utilization fluctuated recently but is stable now
        • not enough pilots
        • job router changes on CEs
      • draining and rebooting worker nodes in the farm in a rolling fashion
        • upgrade CVMFS to 2.7.0
        • add the cvmfs-x509-helper package for LIGO jobs
      • R&D 
        • Data Carousel exercise/RPVLL reprocessing 
          • Going well: BNL staging throughput is 3+ GB/s, the best among the T1s
          • need more requests to stress the system
        • MAS
          • "moving" instead of "copying" unused datasets from DATADISK to BNL_LAKE
          • running jobs on the BNL_LAKE_UCORE PQ 
    • 1:40 PM 2:00 PM
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • Will need input from the Tier 2 sites if we do hold an ATLAS meeting
      • Please close tickets and respond to items sent to the US cloud mailing list.
      • The source of the low CPU efficiency at SWT2 is believed to be understood.
        • The issue was at SWT2_CPB and involved the Rucio mover vs. LSM
        • I leave it to SWT2 to explain the details in their report
      • Two issues in the past week appear to have been caused by settings in AGIS
        • We have to be damn careful with AGIS.
        • ===>> CHECK THAT AGIS IS SETUP FOR YOUR SITE AS YOU EXPECT!!!
      • Please make sure that requested upgrades like dCache and ipv6 are getting attention.
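
      One quick way to act on the AGIS reminder above is to pull your site's queue definitions from the AGIS JSON API and eyeball the settings. A minimal sketch, assuming the public PanDA-queue listing endpoint and the `atlas_site` field name (both may differ for your AGIS instance):

      ```python
      import json
      from urllib.request import urlopen

      # AGIS PanDA-queue listing endpoint (assumed URL; adjust for your AGIS instance)
      AGIS_URL = "http://atlas-agis-api.cern.ch/request/pandaqueue/query/list/?json"

      def queues_for_site(queues, site):
          """Return the PanDA queue entries whose 'atlas_site' field matches `site`."""
          return [q for q in queues if q.get("atlas_site") == site]

      def check_site(site):
          """Fetch the full queue list from AGIS and print key settings for one site."""
          with urlopen(AGIS_URL) as resp:
              queues = json.load(resp)
          for q in queues_for_site(queues, site):
              print(q.get("name"), q.get("state"), q.get("corecount"))

      # Example (requires network access to AGIS):
      # check_site("MWT2")
      ```

      This is only a spot check; it does not replace reviewing the full queue configuration in the AGIS web UI.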
      • 1:40 PM
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)
        1. Had two intermittent downtimes to fix the hardware of the storage enclosure of a dCache pool node.
        2. Two GGUS tickets: 145774 for a Rucio dataset replication stuck between AGLT2 and DESY_HH; after some investigation we found the problem is not at the AGLT2 site, and the ticket was reassigned to a few other sites. Ticket 145772 to upgrade dCache to the latest release and enable SRR at AGLT2; we enabled SRR and will update dCache to the latest release of 5 soon.
        3. On Feb 25, because of the CERN production issue, we noticed our site did not get any jobs for 8 hours. Because there was no notice from ADC, we thought there was a problem with our gatekeeper and ended up spending a lot of time debugging and restarting services/nodes. Could we request notice/updates on such incidents in the future?
        4. BOINC accounting to OSG: got the Gratia API document from Derek and am still reading through the Condor accounting example.

      • 1:45 PM
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))

        UC

        • dCache upgraded to 5.2
          • Everything is 5.2.15 except the xrootd door, which is at 5.2.7 due to "Protocol Xrootd-4.0:72.36.96.247:60394 is not supported" errors in 5.2.8 and above
          • Network interface errors after rebooting our SL6 nodes during the downtime, fixed by cable reseating
          • In the process of upgrading our remaining SL6 storage nodes and doors to SL7
          • Added SRR, updated CRIC
        • Stuck replication rule from MWT2_DATADISK to DESY-HH_LOCALGROUPDISK
          • It looks like the FTS transfers stalled while we were debugging network issues post-upgrade
          • Is there a regular procedure or contact to fix this? DDM?

        UIUC

        • 24 new workers online (1960 cores)
        • PDU issues after bringing the new workers online; fixed for now
      • 1:50 PM
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))
      • 1:55 PM
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        UTA:

        • Preparing to deploy new storage (2PB Raw), around 1PB will be used to cover retirements
        • Working on a condor-ce/NFS issue that is preventing pilots from being accepted; it looks like an NFS server issue. A temporary workaround is now in place
        • We believe we have identified the low efficiency problems at SWT2_CPB
          • Rucio mover was placed as primary mover by ADC although it would not work
          • LSM would be used after rucio mover failed
          • Rucio mover took significant time to fail, lowering CPU efficiency
          • Now mostly solved with adoption of rucio mover on reads, some work still needed for writes.
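
        The mechanism described above — a primary mover that fails slowly before the fallback runs — hurts CPU efficiency in a simple, quantifiable way. A back-of-the-envelope sketch with purely illustrative numbers (not SWT2 measurements):

        ```python
        def cpu_efficiency(cpu_s, wall_s):
            """CPU efficiency = CPU time / wall-clock time."""
            return cpu_s / wall_s

        # Illustrative job: 3 hours of CPU-bound work, 5 minutes of stage-in via LSM.
        cpu, lsm_stagein = 3 * 3600, 300
        baseline = cpu_efficiency(cpu, cpu + lsm_stagein)

        # If the rucio mover is tried first and takes ~20 minutes to time out
        # before LSM is attempted, wall time grows but CPU time does not.
        mover_timeout = 20 * 60
        degraded = cpu_efficiency(cpu, cpu + mover_timeout + lsm_stagein)

        print(f"baseline {baseline:.2f}, with slow-failing mover {degraded:.2f}")
        # → baseline 0.97, with slow-failing mover 0.88
        ```

        The per-job penalty scales with the mover timeout, so many short jobs suffer proportionally more than a few long ones.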

        OU:

        - Nothing to report, running fine.

        - Had a temporary OSCER authentication (LDAP/IPA) hiccup Monday night which caused some stage-out failures.


    • 2:00 PM 2:05 PM
      WBS 2.3.3 HPC Operations 5m
      Speaker: Doug Benjamin (Duke University (US))
    • 2:05 PM 2:20 PM
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 2:05 PM
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)

        No downtimes, normal operation.

        Implemented HEPSPEC reporting in the machine ClassAds on the shared pool, so we can now measure the HS06 provided per group
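
        For reference, publishing an HS06 figure in the machine ClassAd can be done with a two-line HTCondor configuration fragment; the attribute name and value below are illustrative assumptions, not BNL's actual settings:

        ```
        # condor_config fragment on each worker node
        # (attribute name and HS06 value are illustrative)
        HEPSPEC_PER_SLOT = 13.2
        STARTD_ATTRS = $(STARTD_ATTRS) HEPSPEC_PER_SLOT
        ```

        Once advertised, the attribute can be read back with, e.g., `condor_status -af Name State HEPSPEC_PER_SLOT` and summed per group on the accounting side.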

      • 2:10 PM
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 2:15 PM
        ATLAS ML Platform & User Support 5m
        Speaker: Ilija Vukotic (University of Chicago (US))
    • 2:20 PM 2:40 PM
      WBS 2.3.5 Continuous Operations
      Conveners: Ofer Rind, Robert William Gardner Jr (University of Chicago (US))
      • 2:20 PM
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 2:25 PM
        Analytics Infrastructure & User Support 5m
        Speaker: Ilija Vukotic (University of Chicago (US))
      • 2:30 PM
        Intelligent Data Delivery R&D (co-w/ WBS 2.4.x) 5m
      Speakers: Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
    • 2:40 PM 2:45 PM
      AOB 5m