Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
OSG 3.5.29 (tomorrow)
Other
Updates on US Tier-2 centers
Incident:
The job failure rate rose to over 40%, with file stage-in and stage-out errors.
These errors affected dCache file replications between UM and MSU, which then caused jobs to fail.
We spent several days investigating it (restarting all dCache nodes,
upgrading firmware, testing the network between UM and MSU).
We then noticed an odd problem that did not immediately seem related.
We had packet loss of a few percent between some pairs of MSU-UM pool nodes.
The problem was not correlated with any specific node or site.
One other key observation was that some pairs of nodes showed errors on the public path
but not on the private path, even though both paths share the same interfaces.
Since link-aggregation hashing assigns each flow to one member cable based on its source/destination addresses,
different address pairs can traverse different physical cables; this shifted suspicion to a hashing effect across the multiple cables between switches.
We could not locate this problem on any of our on-site switches.
Then UM IT notified us of link errors on one of the two links (2x40 Gbps) between MSU and UM.
The bad half-link was taken offline and all ping and dCache errors were resolved.
But the cause of the bad link is still under investigation.
Job failure rate for the past 12h was 1.6%.
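As a hedged illustration of the kind of per-path check that exposed this (not the actual tooling used; the hostnames, addresses, and counts below are hypothetical placeholders), the sketch pings each MSU pool node over both its public and private addresses from a UM pool node and reports per-pair loss, so a bad member cable in a hashed bundle shows up as loss on only some address pairs:
```python
#!/usr/bin/env python3
"""Hedged sketch: check ping loss to MSU pool nodes over both paths.

Meant to be run from a UM pool node (and vice versa); the hostnames and
addresses below are placeholders, not the real AGLT2 nodes.
"""
import re
import subprocess

# (node name, public address, private address) -- placeholders.
MSU_POOLS = [
    ("msufs01", "msufs01.example.edu", "10.10.2.1"),
    ("msufs02", "msufs02.example.edu", "10.10.2.2"),
]

def loss_percent(target: str, count: int = 50) -> float:
    """Run ping and parse the '% packet loss' figure from its summary line."""
    out = subprocess.run(
        ["ping", "-q", "-c", str(count), "-i", "0.2", target],
        capture_output=True, text=True,
    ).stdout
    m = re.search(r"([\d.]+)% packet loss", out)
    return float(m.group(1)) if m else 100.0

for name, public, private in MSU_POOLS:
    # A healthy inter-site bundle shows ~0% loss on both paths; a bad
    # member cable tends to show loss only for some source/destination
    # address pairs, because the hash spreads flows across the cables.
    print(f"{name}: public {loss_percent(public):.1f}%  "
          f"private {loss_percent(private):.1f}%")
```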
Software:
Updated HTCondor from 8.8.11 to 8.8.12 (the latest version in the OSG repo) on the head node and gatekeepers,
and updated the worker nodes to 8.9.11 from the osg-upcoming repo.
After the gatekeeper update we saw the number of running jobs start to drop,
but no error messages were found in the log files.
Restarting the condor/condor-ce services fixed the issue.
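A minimal sketch of the kind of check that catches such a drop, assuming the HTCondor Python bindings are available on the head node (the threshold and polling interval are illustrative, not the monitoring actually in place):
```python
#!/usr/bin/env python3
"""Hedged sketch: watch the running-job count after an HTCondor update.

Assumes the htcondor Python bindings are installed on the head node;
the threshold and interval are illustrative only.
"""
import time
import htcondor

DROP_FRACTION = 0.5   # warn if running jobs fall below half the baseline
INTERVAL = 300        # seconds between polls

def running_jobs(schedd: htcondor.Schedd) -> int:
    """Count jobs in the Running state (JobStatus == 2)."""
    return len(schedd.query("JobStatus == 2", ["ClusterId", "ProcId"]))

schedd = htcondor.Schedd()
baseline = running_jobs(schedd)
print(f"baseline running jobs: {baseline}")

while True:
    time.sleep(INTERVAL)
    now = running_jobs(schedd)
    print(f"running jobs: {now}")
    if baseline and now < DROP_FRACTION * baseline:
        # Roughly the point at which restarting the condor / condor-ce
        # services recovered things in the incident described above.
        print("WARNING: running-job count dropped; check condor/condor-ce")
```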
Hardware:
Updated the firmware, with reboots, on all our R740xd2 and C6420 machines
(Dell sent a warning email about a critical BIOS update).
Called Dell support to update firmware on some older pool nodes
where the command-line DSU method was failing.
One of the dCache storage pools went offline. Working on identifying the cause and bringing it back up.
Power issues in the NCSA server room. PDU replacements were made last week, and the affected systems will be brought back up during today's PM.
NCSA quarterly PM today. All UIUC workers are in downtime until 8pm for system updates.
Problems:
Brownout at MGHPCC caused us to lose ~1 hour of useful operation time. Solved without a GGUS ticket :)
We're currently getting controller errors in the GPFS system pool. To repair it, we need to free up some space, evacuate the pool, and rebuild it. This is in progress without interrupting production.
Smooth operations otherwise.
The main things we're working on:
Testing xrootd 5.0.3 endpoints with help from Wei (a small smoke-test sketch follows this list).
OSG 3.5.
NESE prep.
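A hedged sketch of the sort of basic smoke test such endpoint checks involve, assuming the stock xrdfs/xrdcp command-line clients are installed; the endpoint URL and paths are hypothetical placeholders, not the actual test plan:
```python
#!/usr/bin/env python3
"""Hedged sketch: basic smoke test of an xrootd 5.0.x endpoint.

Drives the stock xrdfs/xrdcp clients via subprocess; the endpoint URL
and remote directory below are placeholders, not real NET2 hosts.
"""
import subprocess

ENDPOINT = "root://xrootd.example.edu:1094"   # hypothetical endpoint
REMOTE_DIR = "/atlas/testdir"                 # hypothetical path

def run(cmd):
    """Echo and execute a command, failing loudly on a non-zero exit."""
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Query the server version to confirm the endpoint answers.
run(["xrdfs", ENDPOINT, "query", "config", "version"])

# 2. List a directory to confirm the namespace is reachable.
run(["xrdfs", ENDPOINT, "ls", REMOTE_DIR])

# 3. Round-trip a small file: copy in, then copy back out.
run(["xrdcp", "--force", "/etc/hostname",
     f"{ENDPOINT}/{REMOTE_DIR}/smoke_test.txt"])
run(["xrdcp", "--force", f"{ENDPOINT}/{REMOTE_DIR}/smoke_test.txt",
     "/tmp/smoke_test.txt"])
```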
OU:
- Nothing to report, running well.
SWT2_CPB:
Power loss at SWT2_CPB on 1/14 during campus electrical work, due to a generator failure. Awaiting clarification from the Physical Plant.
A rack-level switch locked up yesterday, isolating two storage hosts and an NFS host and producing GGUS ticket 150261.
Starting to work on moving the Xrootd door to OSG 3.5.
NERSC ran out of allocation, so it has been off until the new allocation period starts tomorrow.
- Will test the FastCaloGan container on NERSC GPUs when NERSC is back from downtime.
In recent running at TACC all jobs failed according to PanDA; will need to debug the errors when Lincoln is back.
ALCF Theta - got a large amount of CPU hours, which overloaded the Globus endpoint at BNL.
Solution: switch the storage back end from the NFS server to Lustre.
Longer term: test dCache with Globus.