US ATLAS Computing Facility

US/Eastern
    • 1:00 PM 1:10 PM
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
    • 1:10 PM 1:20 PM
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci

      OSG All Hands 2020 postponed!

      3.5.10-2 and 3.4.44-2

      VO client update for FNAL VOs, GLOW, and OSG due to InCommon subject DN format changes

      3.5.11 and 3.4.45

      OSG 3.4 has entered critical bug/security fix only support; EOL scheduled for November 2020. Last release series that supports EL6! https://opensciencegrid.org/technology/policy/release-series/

      Most package updates from here on out will only be available in OSG 3.5!

      • XRootD 4.11.3
      • XCache 1.3.0 with data integrity tool
      • Singularity 3.5.3 (OSG 3.4 only, otherwise available in EPEL)
      • CVMFS 2.7.1
    • 1:20 PM 1:35 PM
      Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
      • 1:20 PM
        Xrootd vs Http protocols in TPC 15m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
    • 1:35 PM 1:40 PM
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))
      • dCache downtime scheduled for 03/24–03/26 (48h) to upgrade to version 5.2, which adds support for SRR and TPC plus many other bug fixes and improvements
      • CPU utilization fluctuated recently but is stable now
        • not enough pilots
        • job router changes on CEs
      • draining and rebooting worker nodes in the farm in a rolling fashion
        • upgrade CVMFS to 2.7.0
        • add the cvmfs-x509-helper package for LIGO jobs
      • R&D 
        • Data Carousel exercise/RPVLL reprocessing 
          • Going well: BNL staging throughput is 3+ GB/s, the best among the T1s
          • need more requests to stress the system
        • MAS
          • "moving" instead of "copying" unused datasets from DATADISK to BNL_LAKE
          • running jobs on the BNL_LAKE_UCORE PQ 
    • 1:40 PM 2:00 PM
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • Will need input from the Tier 2 sites if we do hold an ATLAS meeting
      • Please close tickets and respond to items sent to the US cloud mailing list.
      • The source of the low CPU efficiency at SWT2 is believed to be understood.
        • The issue was at SWT2_CPB and involved the Rucio mover vs. LSM
        • I leave it to SWT2 to explain the details in their report
      • Two issues in the past week appear to have been caused by settings in AGIS
        • We have to be damn careful with AGIS.
        • ===>> CHECK THAT AGIS IS SETUP FOR YOUR SITE AS YOU EXPECT!!!
      • Please make sure that requested upgrades like dCache and ipv6 are getting attention.
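
      One quick way to act on the AGIS reminder above is to pull your site's queue definitions from the AGIS JSON API and eyeball the settings. A minimal sketch, assuming the public PanDA-queue listing endpoint and the `atlas_site` field name (both may differ for your AGIS instance):

      ```python
      import json
      from urllib.request import urlopen

      # AGIS PanDA-queue listing endpoint (assumed URL; adjust for your AGIS instance)
      AGIS_URL = "http://atlas-agis-api.cern.ch/request/pandaqueue/query/list/?json"

      def queues_for_site(queues, site):
          """Return the PanDA queue entries whose 'atlas_site' field matches `site`."""
          return [q for q in queues if q.get("atlas_site") == site]

      def check_site(site):
          """Fetch the full queue list from AGIS and print key settings for one site."""
          with urlopen(AGIS_URL) as resp:
              queues = json.load(resp)
          for q in queues_for_site(queues, site):
              print(q.get("name"), q.get("state"), q.get("corecount"))

      # Example (requires network access to AGIS):
      # check_site("MWT2")
      ```

      This is only a spot check; it does not replace reviewing the full queue configuration in the AGIS web UI.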
      • 1:40 PM
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)
        1. Had two intermittent downtimes to fix the hardware of the storage enclosure of a dCache pool node.
        2. Two GGUS tickets: 145774 for a Rucio dataset replication stuck between AGLT2 and DESY_HH; after some investigation we found the problem is not at the AGLT2 site, and the ticket was reassigned to a few other sites. Ticket 145772 to upgrade dCache to the latest release and enable SRR at AGLT2; we enabled SRR and will update dCache to the latest release of 5 soon.
        3. On Feb 25, because of the CERN production issue, we noticed our site did not get any jobs for 8 hours. Because there was no notice from ADC, we thought there was a problem with our gatekeeper and ended up spending a lot of time debugging and restarting services/nodes. Could we request notice/updates on such incidents in the future?
        4. BOINC accounting to OSG: got the Gratia API document from Derek and am still reading through the Condor accounting example.

      • 1:45 PM
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))

        UC

        • dCache upgraded to 5.2
          • Everything is 5.2.15 except the xrootd door, which is at 5.2.7 due to "Protocol Xrootd-4.0:72.36.96.247:60394 is not supported" errors in 5.2.8 and above
          • Network interface errors after rebooting our SL6 nodes during the downtime, fixed by cable reseating
          • In the process of upgrading our remaining SL6 storage nodes and doors to SL7
          • Added SRR, updated CRIC
        • Stuck replication rule from MWT2_DATADISK to DESY-HH_LOCALGROUPDISK
          • It looks like the FTS transfers stalled while we were debugging network issues post-upgrade
          • Is there a regular procedure or contact to fix this? DDM?

        UIUC

        • 24 new workers online (1960 cores)
        • PDU issues after bringing the new workers online; fixed for now
      • 1:50 PM
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))
      • 1:55 PM
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        UTA:

        • Preparing to deploy new storage (2PB Raw), around 1PB will be used to cover retirements
        • Working on a condor-ce/NFS issue that is preventing pilots from being accepted; it looks like an NFS server issue. A temporary workaround is now in place
        • We believe we have identified the low efficiency problems at SWT2_CPB
          • Rucio mover was placed as primary mover by ADC although it would not work
          • LSM would be used after rucio mover failed
          • Rucio mover took significant time to fail, lowering CPU efficiency
          • Now mostly solved with adoption of rucio mover on reads, some work still needed for writes.
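
        The mechanism described above — a primary mover that fails slowly before the fallback runs — hurts CPU efficiency in a simple, quantifiable way. A back-of-the-envelope sketch with purely illustrative numbers (not SWT2 measurements):

        ```python
        def cpu_efficiency(cpu_s, wall_s):
            """CPU efficiency = CPU time / wall-clock time."""
            return cpu_s / wall_s

        # Illustrative job: 3 hours of CPU-bound work, 5 minutes of stage-in via LSM.
        cpu, lsm_stagein = 3 * 3600, 300
        baseline = cpu_efficiency(cpu, cpu + lsm_stagein)

        # If the rucio mover is tried first and takes ~20 minutes to time out
        # before LSM is attempted, wall time grows but CPU time does not.
        mover_timeout = 20 * 60
        degraded = cpu_efficiency(cpu, cpu + mover_timeout + lsm_stagein)

        print(f"baseline {baseline:.2f}, with slow-failing mover {degraded:.2f}")
        # → baseline 0.97, with slow-failing mover 0.88
        ```

        The per-job penalty scales with the mover timeout, so many short jobs suffer proportionally more than a few long ones.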

        OU:

        - Nothing to report, running fine.

        - Had a temporary OSCER authentication (LDAP/IPA) hiccup Monday night which caused some stage-out failures.


    • 2:00 PM 2:05 PM
      WBS 2.3.3 HPC Operations 5m
      Speaker: Doug Benjamin (Duke University (US))
    • 2:05 PM 2:20 PM
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 2:05 PM
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)

        No downtimes, normal operation.

        Implemented HEPSPEC reporting in the machine ClassAds on the shared pool, so we can now measure the HS06 provided per group
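
        For reference, publishing an HS06 figure in the machine ClassAd can be done with a two-line HTCondor configuration fragment; the attribute name and value below are illustrative assumptions, not BNL's actual settings:

        ```
        # condor_config fragment on each worker node
        # (attribute name and HS06 value are illustrative)
        HEPSPEC_PER_SLOT = 13.2
        STARTD_ATTRS = $(STARTD_ATTRS) HEPSPEC_PER_SLOT
        ```

        Once advertised, the attribute can be read back with, e.g., `condor_status -af Name State HEPSPEC_PER_SLOT` and summed per group on the accounting side.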

      • 2:10 PM
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 2:15 PM
        ATLAS ML Platform & User Support 5m
        Speaker: Ilija Vukotic (University of Chicago (US))
    • 2:20 PM 2:40 PM
      WBS 2.3.5 Continuous Operations
      Conveners: Ofer Rind, Robert William Gardner Jr (University of Chicago (US))
      • 2:20 PM
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 2:25 PM
        Analytics Infrastructure & User Support 5m
        Speaker: Ilija Vukotic (University of Chicago (US))
      • 2:30 PM
        Intelligent Data Delivery R&D (co-w/ WBS 2.4.x) 5m
      Speakers: Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
    • 2:40 PM 2:45 PM
      AOB 5m