US ATLAS Computing Facility

Name: US ATLAS Computing Facility
Start: 2021-06-09T13:00:00-04:00
End: 2021-06-09T14:45:00-04:00
Location: No location set

Wednesday 9 Jun 2021, 13:00 → 14:45 US/Eastern

Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID: 996 1094 4232

Meeting password: 125

Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

- 13:00 → 13:10
  
  WBS 2.3 Facility Management News 10m
  
  Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
  
  The pre-GDB Data Center Network Architectures workshop was held Monday and Tuesday of this week: https://indico.cern.ch/event/1028690/ (Some details about AGLT2 network update/downtime are in Shawn's presentation from Monday).
- 13:10 → 13:20
  OSG-LHC 10m
  
  Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
  A bug in gratia-probe-1.23.2 affects non-HTCondor sites, preventing the upload of pilot accounting records. Any site with this version installed should update immediately to gratia-probe-1.23.3 and grep for mapped VO Unix names in /var/lib/gratia/data/quarantine/. If any records are found, please contact help@opensciencegrid.org
  
  HTCondor-CE 5.1.1, HTCondor 9.0.1, and XRootD 5.2.0 are available in osg-upcoming-testing
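For sites that prefer a script to an ad hoc grep, the quarantine check described above could be sketched roughly as follows. This is only an illustration: the function name `find_quarantined_records` is hypothetical, the account name in the usage note is a placeholder for a site's own mapped VO Unix names, and the code assumes a plain substring match (as grep would do) is enough to spot affected records.

```python
from pathlib import Path


def find_quarantined_records(quarantine_dir, unix_names):
    """Return {file path: [names found]} for files mentioning any given name.

    Hypothetical helper: mimics a recursive grep for mapped VO Unix names.
    """
    hits = {}
    root = Path(quarantine_dir)
    if not root.is_dir():
        # Nothing to scan (directory missing or not a directory).
        return hits
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        try:
            # Ignore undecodable bytes; only a substring match is needed.
            text = path.read_text(errors="ignore")
        except OSError:
            # Skip unreadable files rather than aborting the scan.
            continue
        found = [name for name in unix_names if name in text]
        if found:
            hits[str(path)] = found
    return hits
```

A site might call it as `find_quarantined_records("/var/lib/gratia/data/quarantine/", ["usatlas1"])`, where `usatlas1` stands in for whatever Unix accounts the site maps its VOs to. A non-empty result would be the cue to contact help@opensciencegrid.org.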
- 13:20 → 13:35
  
  Topical Reports
  
  Convener: Robert William Gardner Jr (University of Chicago (US))
- 13:35 → 13:40
  
  WBS 2.3.1 Tier1 Center 5m
  
  Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
  
  - successful dCache downtime at 10am ET to improve deletion performance
- 13:40 → 14:00
  WBS 2.3.2 Tier2 Centers
  
  Updates on US Tier-2 centers
  
  Convener: Fred Luehring (Indiana University (US))
  
  N_Jobs_20210609.png
  
  Success_20210609.png
  
  Trans_20210609.png
  Generally OK running (some issues at MWT2 and SWT2):
  
  MWT2 IPv6 issue
  
  UTA_SWT2 and SWT2_CPB put offline for short periods.
  
  Could NET2 and SWT2 report on IPv6 status?
  
  XRootD 5.2.0 testing underway
  
  Pre-scrubbed successfully
  - 13:40
    
    AGLT2 5m
    
    Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)
    
        Getting ready for downtime and infrastructure update starting 8am Monday 14-Jun-2021
    
        - Updating the topology in OSG/git to prepare downtime (thanks to Brian, Ofer and Mark)
    
        - UM will finish replacing all the main switches, all cabling (all fiber where possible), and the configuration.
          Some servers and worker nodes will need to be relocated.
    
        - MSU will be moving all services and dCache storage to the MSU Data Center Monday-Tuesday (Wave 1)
          to coincide with UM downtime.
          Our public and private networks are now extended from our old EX9208 to our 2 new QFX5120 at the DC.
          2 nodes (Wave 0) were moved this Monday to iron out networking issues.
          Some multicast (for ganglia) and stability issues were discovered and fixed.
          The T2 WNs (and the MSU T3) will be moved over time (Wave 2, etc).
          Moving the last set of worker nodes will need to be synchronized with the move
          of the department servers sharing the same cooling (otherwise CRACs would fail on too-cold air return).
    
        - The UM-MSU link will unfortunately not be switched over to the new State Research Corridor Triangle at this time.
          The MSU multi 100G Research Network will also not be ready for cut-over until at least July.
    
        - Optimistically we may have dCache back on Wednesday
  - 13:45
    
    MWT2 5m
    
    Speakers: David Jordan (University of Chicago (US)), Jessica Lynn Haney (Univ. Illinois at Urbana Champaign (US)), Judith Lorraine Stephen (University of Chicago (US))
    
    UC - loss of IPv6 connectivity to PanDA took the site offline from last Thursday until Sunday. On Monday, IPv6 connectivity from IU/UIUC to UC was lost.
    
    IU - new head node and perfSONAR servers are racked and ready to be brought online. Squid degradation due to an expired k8s certificate on iut2-slate.
    
    UIUC - working towards adding ICC's HTC resources.
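As an aside on the expired certificate noted for iut2-slate: a minimal sketch of how the remaining lifetime of a certificate could be computed from its `notAfter` field, so that expiry can be flagged in advance. It assumes the date string is in the format returned by Python's `ssl.getpeercert()` (for example `Jun 19 00:00:00 2021 GMT`). The function name `days_until_expiry` and the dates in the usage note are illustrative only, and nothing here reflects how the site actually monitors its certificates.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now):
    """Days from `now` until the certificate's notAfter time.

    `not_after` uses the "%b %d %H:%M:%S %Y GMT" format seen in
    ssl.getpeercert()["notAfter"]; `now` is a timezone-aware datetime.
    A negative result means the certificate has already expired.
    """
    # Convert the certificate time string to seconds since the epoch.
    expiry_ts = ssl.cert_time_to_seconds(not_after)
    return (expiry_ts - now.timestamp()) / 86400.0
```

For example, with `now = datetime(2021, 6, 9, tzinfo=timezone.utc)`, a certificate whose `notAfter` is `Jun 19 00:00:00 2021 GMT` has 10 days left, while one with `Jun  1 00:00:00 2021 GMT` returns a negative value and is therefore expired.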
  - 13:50
    
    NET2 5m
    
    Speaker: Prof. Saul Youssef (Boston University (US))
    
    Dealing with an issue right now...
    
    GPFS getting overloaded => HammerCloud bounced us a couple of times yesterday. Currently ~30% DDM failures. Having to reboot one of our GridFTP endpoints. Possibly DAOD physics validation jobs?
    
    XRootD 5.2.0 with clustering and custom containers is working for the GPFS storage.
    
    Preparing to buy worker nodes.
  - 13:55
    
    SWT2 5m
    
    Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
    
    UTA:
    
    Problem with CE shared directory caused issues, but now working fine
    
    Seeing some cooling issues at SWT2_CPB; working on getting them resolved
    
    Major purchase submitted to procurement
    
    OU:
    
    Nothing to report, running well.
- 14:00 → 14:05
  
  WBS 2.3.3 HPC Operations 5m
  
  Speaker: Lincoln Bryant (University of Chicago (US))
  
  Pilot is failing to create log files at NERSC for a small percentage of jobs. This appears to block stage-out for the entire job; I am investigating. Have been in communication with Tadashi regarding the stager. It is still not understood why the pilot is failing to create a log file - I will also email Paul if there are issues after updating the pilot to the latest version. Have reduced Harvester to a single queued job until I can reliably get jobs to complete again.
  
  TACC is down for maintenance, which is taking much longer than expected.
- 14:05 → 14:20
  WBS 2.3.4 Analysis Facilities
  
  Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
  - 14:05
    
    Analysis Facilities - BNL 5m
    
    Speaker: William Strecker-Kellogg (Brookhaven National Lab)
  - 14:10
    
    Analysis Facilities - SLAC 5m
    
    Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
  - 14:15
    
    Analysis Facilities - Chicago 5m
    
    Speakers: David Jordan (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
    
    All running fine.
- 14:20 → 14:40
  WBS 2.3.5 Continuous Operations
  
  Convener: Ofer Rind (Brookhaven National Laboratory)
  WebDAV and XRootD services added to OSG topology; WebDAV and XRootD service type mappings now added to CRIC as well.
  
  A number of questions about recent AGLT2 downtime settings and site configuration have been resolved
  
  MWT2 issue encountered trying to shorten downtime
  
  Testing with BNL dCache downtime today as well
  
  Mark is auditing site configurations
  
  Also looking at SRR settings
  
  XRootD 5.2 HTTP-TPC testbed updated at BNL and SWT2
  
  Status of deployment plan at OU?
  
  F-S DevOps meeting (minutes)
  
  Recent MWT2 squid downtime/cert issue
  
  Moving to normalize SLATE squid configuration going forward
  
  Discussion of support structure/processes
  
  Squid failover configurations (in general) - future topical presentation?
  - 14:20
    
    US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
    
    Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
    
    US-cloud-summary-6_2_21.pdf
    
    US-cloud-summary-6_9_21.pdf
  - 14:25
    Service Development & Deployment 5m
    
    Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
    
    XCaches
    
    all updated to 5.2.0, moved to gStream monitoring
    
    since Squid wants one dedicated disk, I will be removing one disk from all xcaches and restarting everything
    
    SLATE deployment in Beijing should be ready by the end of the month.
    
    Squids
    
    created everything needed for alarm/alert generation, should be in chart before end of the week
    
    Alarms & Alerts
    
    working fine; updates to the Frontier alarm-generating code.
    
    Rucio devs:
    
    fixes to how Rucio gets client IP, how it calculates replica ordering
    
    work on setting up part of that infrastructure in Cloudflare.
- 14:40 → 14:45
  
  AOB 5m