US ATLAS Computing Facility

Name: US ATLAS Computing Facility
Start: 2021-06-23T13:00:00-04:00
End: 2021-06-23T14:45:00-04:00
Location: No location set

Wednesday 23 Jun 2021, 13:00 → 14:45 US/Eastern

Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID: 996 1094 4232

Meeting password: 125

Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

- 13:00 → 13:10
  
  WBS 2.3 Facility Management News 10m
  
  Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
- 13:10 → 13:20
  OSG-LHC 10m
  
  Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
  Release (this week or next)
  
  HTCondor-CE 5.1.1 (3.5 upcoming, 3.6)
  
  HTCondor 9.0.1 (3.5 upcoming, 3.6)
  
  Patched XRootD 5.2.0
  
  Fix 'anygroup' config backwards compat segfault https://github.com/xrootd/xrootd/commit/316300c9b2176153fe50ce766079f701d0d10a25#diff-81e843035b5adc6547b1280bc821da90eb628934d57456d4e9b8cb0f497627f2
  
  Fix HTTP TPC issue due to empty cert/CRL files https://github.com/xrootd/xrootd/issues/1467
  
  frontier-squid 4.15-2.1 (no log rotation if log compression disabled)
- 13:20 → 13:35
  Topical Reports
  
  Convener: Robert William Gardner Jr (University of Chicago (US))
  - 13:20
    
    Squid Failover discussion 10m
    
    Speakers: Dave Dykstra (Fermi National Accelerator Lab. (US)), Fred Luehring (Indiana University (US))
- 13:35 → 13:40
  
  WBS 2.3.1 Tier1 Center 5m
  
  Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
  
  cooling-incident.png
  
  DATADISk.png
  
  dCache-deletion.png
  
  Large scale data deletion from life time model showed the dCache configuration changes to speed up deletion was highly successful. 11 Hz -> 61 Hz.
  
  Seeing issues with PnfsMangers. We are using 6.2.23.
  
  Chasing down transfer issues from BNL LAKE.
  
  Chilled water issues caused an automated shutdown for ATLAS compute nodes. see plot for details
- 13:40 → 14:00
  WBS 2.3.2 Tier2 Centers
  
  Updates on US Tier-2 centers
  
  Convener: Fred Luehring (Indiana University (US))
  
  N_Jobs-20210623.png
  
  Success_20210623.png
  
  Trans_20210623.png
  Various problems and down times reduced production over the last 2 weeks.
  
  AGLT: Long downtime for network hardware/cabling upgrade
  
  MWT2: IPV6 troubles with UC enterprise switches
  
  NET2: Staging / transfer issues?
  
  SWT2: OU downtime, CPB had utility issues and gatekeeper inefficiencies.
  
  Please provide updates on supporting IPV6 and HTTP-TPC for sites not supporting them.
  - 13:40
    
    AGLT2 5m
    
    Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)
    
        Infrastructure upgrades
    
    - MSU moved first wave of equipment to Data Center on 14 & 15-Jun.
        Included VMware and all dcache pool nodes.
        Scope of first wave limited by (lack of) availibility of Juniper 10/25G switches.
        Timing and choice of items was made to synchronize with UM site downtime.
        Now have all switches.
        Next wave will include about half of MSU T2 WNs.
        Last wave(s) need to be synchronized with physics dept servers in neighbor room.
    
    - UM replaced (almost) all rack switches (repurposing a few old ones).
        Redundant uplinks at 100G.
        Redundant paths to main campus and WAN.
        Mix of Cisco (campus) and Dell (in room) brings some limitations.
        Also replaced all network cabling, using fibers and transceivers everywhere possible.
        Increased KVM coverage.
        Increased switch serial port access.
    
    - Despite extensive preparations and pre-labeling and sorting of cables,
        and despite the 2 admin, 2 staff and 5 student task force
        and despite UM Networking support
        and despite working over the weekend
        ... this took much longer than anticipated.
    
    - Status
        Inter switch infrastructure was finished on Friday
        Solving bugs and problems solved over the weekend.
        dCache back online on Monday.
        Queues in TEST mode on Tuesday.
        Queues now back on AUTO on Wednesday (running mostly on MSU WNs)
    
    - Remaining: More cabling of T2 WNs and UM T3
  - 13:45
    MWT2 5m
    
    Speakers: David Jordan (University of Chicago (US)), Jess Haney (Univ. Illinois at Urbana Champaign (US)), Judith Lorraine Stephen (University of Chicago (US))
    
    UC
    
    HW failure on one of the UC SLATE nodes caused a squid outage over the weekend. The service has been moved for now, and David is working on getting the hardware replaced
    
    IU
    
    New perfSONAR machines (x4) received and in the process of building
    
    UIUC
    
    IPv6 issues last Friday, fixed by a switch reboot
  - 13:50
    
    NET2 5m
    
    Speaker: Prof. Saul Youssef (Boston University (US))
    
    No GGUS tickets.
    
    We've had a couple of power sag incidents in the past two weeks which has caused some problems. This may have caused a dip in production over the weekend, but it's unclear. Still investigating...
    
    Xrootd 5.2.0 cluster, containerized is working. I guess we'll upgrade to 5.3.0 before going into production, re: Brian's notes.
    
    Preparing for new worker nodes.
    
    Preparing to upgrade networking to NESE.
  - 13:55
    
    SWT2 5m
    
    Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
    
    UTA:
    
    SWT2_CPB had an issue with CE due to remote spool directory. Combined with minor power work, we drained and reconfigured the spool directory to be local. So far, so good.
    
    No issues at UTA_SWT2, but we are making progress with logistics necessary to retire cluster. Will start looking at retiring the storage first while using SWT2_CPB_DATADISK as storage location.
    
    OU:
    
    - xrootd proxy gateway on se1.oscer.ou.edu (OU_OSCER_ATLAS_SE) upgraded to 5.2.0, and http-tpc enabled. http transfers work, but tpc is failing because of a bug with TmpCA file creation which affects xfs file systems (which we have here). Will be fixed in 5.3.0.
    
    - Today is OSCER cluster maintenance, so jobs are held, no jobs currently running. They will automatically start after maintenance is concluded.
- 14:00 → 14:05
  
  WBS 2.3.3 HPC Operations 5m
  
  Speaker: Lincoln Bryant (University of Chicago (US))
  
  Screenshot 2021-06-22 150227.jpg
  
  TACC running well.
  
  10M more hours added to NERSC. Some instability lately - Cori degraded over the weekend, "lost heartbeat" jobs.
- 14:05 → 14:20
  WBS 2.3.4 Analysis Facilities
  
  Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
  - 14:05
    
    Analysis Facilities - BNL 5m
    
    Speaker: William Strecker-Kellogg (Brookhaven National Lab)
    
    1. Power cut didn't affect the Tier-3 very much (a 60% reduction in available slots for 2-3 hours) when many weren't used anyway.
    
    2. GPFS vulnerability may require drain & reboot soon
    
    3. Request from DOE / NPP to give large fraction of HTC resources to a short-deadline analysis until August. Tier-3 will have 40% reduction in quota through August
  - 14:10
    
    Analysis Facilities - SLAC 5m
    
    Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
  - 14:15
    Analysis Facilities - Chicago 5m
    
    Speakers: David Jordan (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
    
    atlas-ml.org - all running fine
    
    UC Analysis Facility
    
    Everything is racked. All but one machine has power and is networked.
    
    Will get that machine online and ready to be benchmarked either today or tomorrow (6/23-24).
    
    Having some temperature concerns and running benchmarks again to stress test temps now that more mitigations have been put in place on the racks.
- 14:20 → 14:40
  WBS 2.3.5 Continuous Operations
  
  Convener: Ofer Rind (Brookhaven National Laboratory)
  Hit a few XRootd deployment issues with setup at OU; some will be fixed in 5.3 release expected to come out this week.
  
  Status of BU/SWT2 deployment?
  
  Mark is working through the OSG Topology settings for sites to check compatibility with CRIC configuration
  
  VOMS-IAM testing ongoing (John DeStefano)
  
  Federated DevOps meeting
  
  Most SLATE squids redeployed to use disk space, BU change pending, AGLT2 when they're up and running fully again
  
  Further discussion of support procedures - monitoring/alerts; documentation (US ATLAS website); communication (RT, MM)
  - 14:20
    
    US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
    
    Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
    
    US-cloud-summary-6_16_21.pdf
    
    US-cloud-summary-6_23_21.pdf
  - 14:25
    Service Development & Deployment 5m
    
    Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
    
    XCache
    
    AGLT2 downtime, getting XCaches back up today
    
    MWT2 one of the nodes went down and everything still worked (as expected), added another node.
    
    BNL xcache node used by Doug, off-lined BNL VP queue
    
    Doing in depth analysis of the xcache usage data collected
    
    VP
    
    Running fine
    
    Prague VP is now running "storage-less".
    
    Alarm & Alert
    
    Running fine
    
    Updates as asked by CERN
    
    ServiceX
    
    Stress tests continuing
    
    Rucio changes merged
  - 14:30
    
    Kubernetes R&D at UTA 5m
    
    Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    
    Preparations ongoing for moving the UTA_SWT2 to CPB. Fraction of those machines will be used for our k8s cluster. Reinstall k8s, and proceed with setting up k8s queue.
- 14:40 → 14:45
  
  AOB 5m

Choose timezone

US ATLAS Computing Facility

Facilities Team Google Drive Folder