US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
-
1
WBS 2.3 Facility Management News
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
-
2
OSG-LHC
Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci
OSG 3.5.29 (tomorrow)
- XRootD 4.12.6
- osg-configure 3.11.0
  - Make fetch-crl success optional (though give a warning on failure)
  - Don't try to resolve the Squid server, since it only needs to be visible from the workers, not the CE
- IGTF 1.109
Other
- HTCondor submit hosts, including CEs, on 8.9.x should update to 8.9.11
- Targeting OSG 3.6 for the end of February. Details here: https://opensciencegrid.atlassian.net/browse/SOFTWARE-4282
-
3
WBS 2.3.1 Tier1 Center
Speakers: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))
- dCache incident over the new year holiday
  - The dCache disk was full, causing job stage-out failures. There is a discrepancy between the reported space and the real space usage; a stop-gap solution is in place and a long-term solution is under investigation.
- Data transfer failures for BNLHPC_DATADISK
  - The Globus Online endpoint was overloaded. Solved by replacing NFS with Lustre as the shared FS. A long-term solution is under discussion.
-
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- Please get your quarterly reports for Oct-Dec in today!
- Check that your totals on the V54 capacity sheet are current as of 2020.
- Check that your service versions on the services sheet are listed correctly.
- Certain PIs also need to fill in their networking purchase status in the tracking sheet.
- While the drop-dead date is Friday, various levels of managers need to see your report before they can enter theirs.
- The last 2 weeks of running were OK:
- AGLT2 had a dCache upgrade and various knock-on effects afterwards.
- MWT2 has a downtime today at the Illinois site.
- NET2 had some storage outages.
- SWT2 had an incident at OU where no jobs were scheduled, and storage issues at CPB.
-
4
AGLT2
Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)
Incident:
The job failure rate rose to over 40%, with file stage-in and stage-out errors.
These errors affected dCache file replication between UM and MSU, which then caused jobs to fail.
We spent several days investigating (restarting all dCache nodes, upgrading FW, testing the network between UM and MSU).
We then noticed an odd problem that did not immediately seem related: packet loss of a few percent between some pairs of MSU-UM pool nodes.
The problem was not correlated with any specific node or site.
Another oddity, and a key observation, was that some pairs of nodes showed errors on the public path but not on the private path sharing the same interfaces.
This moved the suspicion to some effect of hashing across multiple cables between switches.
We could not locate the problem on any of our on-site switches.
Then UM IT notified us of link errors on one of the two links (2x40 Gbps) between MSU and UM.
The bad half-link was set offline, and all ping and dCache errors were resolved.
The cause of the bad link is still under investigation.
Job failure rate for the past 12h was 1.6%.
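The pair-dependent symptom described above is what one would expect if traffic is spread over the two inter-site links by hashing on the address pair. The following is a minimal sketch of that idea, not the actual switch algorithm: the hash function, the host names, and all addresses are hypothetical and chosen only for illustration. It shows that the public and private address pairs of the same two hosts can land on different links, so a single bad link would affect a scattered subset of pairs and paths.

```python
import hashlib
from itertools import product

N_LINKS = 2  # the two inter-site links mentioned in the notes


def pick_link(src_ip, dst_ip, n_links=N_LINKS):
    """Deterministically map an address pair to a link index.

    Illustrative stand-in for a switch's link-aggregation hash;
    real hardware uses its own (vendor-specific) function.
    """
    digest = hashlib.sha256(f"{src_ip}|{dst_ip}".encode()).digest()
    return digest[0] % n_links


# Hypothetical hosts: name -> (public address, private address).
UM = {
    "um-pool-a": ("192.0.2.11", "10.10.0.11"),
    "um-pool-b": ("192.0.2.12", "10.10.0.12"),
}
MSU = {
    "msu-pool-a": ("198.51.100.21", "10.20.0.21"),
    "msu-pool-b": ("198.51.100.22", "10.20.0.22"),
}


def paths_on_link(link):
    """Return (um_host, msu_host, path_name) triples whose flow hashes onto `link`."""
    hits = []
    for (um, um_ips), (msu, msu_ips) in product(UM.items(), MSU.items()):
        for idx, path in enumerate(("public", "private")):
            if pick_link(um_ips[idx], msu_ips[idx]) == link:
                hits.append((um, msu, path))
    return hits


if __name__ == "__main__":
    # Pretend link 0 is the one with errors and list what would cross it.
    for um, msu, path in paths_on_link(0):
        print(f"{um} <-> {msu} via {path} path crosses the bad link")
```

Because the mapping depends only on the address pair, losses of this kind follow neither a node nor a site, which matches the observation that the problem could not be tied to any single host.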
Software:
Updated HTCondor from 8.8.11 to 8.8.12 (latest version in the OSG repo) on the head node and gatekeepers, and updated worker nodes to 8.9.11 from the osg-upcoming repo.
After the gatekeeper update, the number of running jobs started to drop, but no error messages were spotted in the log files.
Restarting the condor/condor-ce services fixed this issue.
Hardware:
Updated the firmware, with a reboot, on all our R740xd2 and C6420 machines (Dell sent a warning email about a critical BIOS update).
Called Dell support to update FW on some older pool nodes where the command-line dsu method was failing.
-
5
MWT2
Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
One of the dCache storage pools went offline. Working on identifying the cause and bringing it back up.
Power issues in the NCSA server room. PDU replacements were made last week, and affected systems will be brought back up during today's PM.
NCSA quarterly PM today. All UIUC workers are in downtime until 8pm for system updates.
-
6
NET2
Speaker: Prof. Saul Youssef (Boston University (US))
Problems:
Brownout at MGHPCC caused us to lose ~1 hour of useful operation time. Solved without a GGUS ticket :)
We're currently getting controller errors in the GPFS system pool. To repair, we needed to free up some space, evacuate and rebuild. In progress without interrupting production.
Smooth operations otherwise.
The main things we're working on:
Testing xrootd 5.0.3 endpoints with help from Wei.
OSG 3.5.
NESE prep.
-
7
SWT2
Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
OU:
- Nothing to report, running well.
SWT2_CPB:
Power loss at SWT2_CPB on 1/14 during campus electrical work, due to a generator failure. Awaiting clarification from Physical Plant.
A rack-level switch locked up yesterday, isolating two storage hosts and an NFS host and producing GGUS ticket 150261.
Starting to work on moving the Xrootd door to OSG 3.5.
-
8
WBS 2.3.3 HPC Operations
Speakers: Doug Benjamin (Duke University (US)), Lincoln Bryant
NERSC ran out of allocation, so it has been off until the new allocation period starts tomorrow.
- Will test the FastCaloGan container on a NERSC GPU when NERSC is back from downtime.
TACC: all recent jobs failed according to PanDA. Will need to debug the errors when Lincoln is back.
ALCF Theta: got a large amount of CPU hours, which overloaded the Globus endpoint at BNL.
Solution: switched the storage back end from an NFS server to Lustre.
Longer term: test dCache with Globus.
-
WBS 2.3.4 Analysis Facilities
Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
-
9
Analysis Facilities - BNL
Speaker: William Strecker-Kellogg (Brookhaven National Lab)
-
10
Analysis Facilities - SLAC
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind
- Adding "Ops Round Table" to ADC Weekly starting Feb. 2nd with shared doc for topics
- Event Service configuration issues at OU - Horst to follow up with developers
- Met yesterday to discuss Frontier/Squid deployments - planning a topical presentation for next meeting (in two weeks), updated associated milestone
- Need input for QR
-
12
US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News
Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
13
Service Development & Deployment
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
-
14
AOB
-