Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
OSG 3.5.29 (tomorrow)
Other
Updates on US Tier-2 centers
Incident:
The job failure rate rose to over 40%, with file stage-in and stage-out errors.
These errors affected dCache file replications between UM and MSU, which then caused jobs to fail.
We spent several days investigating it (restarting all dCache nodes,
upgrading firmware, testing the network between UM and MSU).
We then noticed an odd problem that did not immediately seem related.
We had packet loss of a few percent between some pairs of MSU-UM pool nodes.
The problem was not correlated with any specific node or site.
One other key observation was that some pairs of nodes showed errors on the public path
but not on the private path, even though both paths share the same interfaces.
Since link-aggregation hashing assigns each flow to one member cable based on its source/destination addresses,
different address pairs can traverse different physical cables; this shifted suspicion to a hashing effect across the multiple cables between switches.
We could not locate this problem on any of our on-site switches.
Then UM IT notified us of link errors on one of the two links (2x40 Gbps) between MSU and UM.
The bad half-link was taken offline and all ping and dCache errors were resolved.
But the cause of the bad link is still under investigation.
Job failure rate for the past 12h was 1.6%.
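As a hedged illustration of the kind of per-path check that exposed this (not the actual tooling used; the hostnames, addresses, and counts below are hypothetical placeholders), the sketch pings each MSU pool node over both its public and private addresses from a UM pool node and reports per-pair loss, so a bad member cable in a hashed bundle shows up as loss on only some address pairs:
```python
#!/usr/bin/env python3
"""Hedged sketch: check ping loss to MSU pool nodes over both paths.

Meant to be run from a UM pool node (and vice versa); the hostnames and
addresses below are placeholders, not the real AGLT2 nodes.
"""
import re
import subprocess

# (node name, public address, private address) -- placeholders.
MSU_POOLS = [
    ("msufs01", "msufs01.example.edu", "10.10.2.1"),
    ("msufs02", "msufs02.example.edu", "10.10.2.2"),
]

def loss_percent(target: str, count: int = 50) -> float:
    """Run ping and parse the '% packet loss' figure from its summary line."""
    out = subprocess.run(
        ["ping", "-q", "-c", str(count), "-i", "0.2", target],
        capture_output=True, text=True,
    ).stdout
    m = re.search(r"([\d.]+)% packet loss", out)
    return float(m.group(1)) if m else 100.0

for name, public, private in MSU_POOLS:
    # A healthy inter-site bundle shows ~0% loss on both paths; a bad
    # member cable tends to show loss only for some source/destination
    # address pairs, because the hash spreads flows across the cables.
    print(f"{name}: public {loss_percent(public):.1f}%  "
          f"private {loss_percent(private):.1f}%")
```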
Software:
Updated HTCondor from 8.8.11 to 8.8.12 (the latest version in the OSG repo) on the head node and gatekeepers,
and updated the worker nodes to 8.9.11 from the osg-upcoming repo.
After the gatekeeper update we saw the number of running jobs start to drop,
but no error messages were found in the log files.
Restarting the condor/condor-ce services fixed the issue.
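A minimal sketch of the kind of check that catches such a drop, assuming the HTCondor Python bindings are available on the head node (the threshold and polling interval are illustrative, not the monitoring actually in place):
```python
#!/usr/bin/env python3
"""Hedged sketch: watch the running-job count after an HTCondor update.

Assumes the htcondor Python bindings are installed on the head node;
the threshold and interval are illustrative only.
"""
import time
import htcondor

DROP_FRACTION = 0.5   # warn if running jobs fall below half the baseline
INTERVAL = 300        # seconds between polls

def running_jobs(schedd: htcondor.Schedd) -> int:
    """Count jobs in the Running state (JobStatus == 2)."""
    return len(schedd.query("JobStatus == 2", ["ClusterId", "ProcId"]))

schedd = htcondor.Schedd()
baseline = running_jobs(schedd)
print(f"baseline running jobs: {baseline}")

while True:
    time.sleep(INTERVAL)
    now = running_jobs(schedd)
    print(f"running jobs: {now}")
    if baseline and now < DROP_FRACTION * baseline:
        # Roughly the point at which restarting the condor / condor-ce
        # services recovered things in the incident described above.
        print("WARNING: running-job count dropped; check condor/condor-ce")
```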
Hardware:
Updated the firmware, with reboots, on all our R740xd2 and C6420 machines
(Dell sent a warning email about a critical BIOS update).
Called Dell support to update firmware on some older pool nodes
where the command-line DSU method was failing.
One of the dCache storage pools went offline. Working on identifying the cause and bringing it back up.
Power issues in the NCSA server room. PDU replacements were made last week, and the affected systems will be brought back up during today's PM.
NCSA quarterly PM today. All UIUC workers are in downtime until 8pm for system updates.
Problems:
Brownout at MGHPCC caused us to lose ~1 hour of useful operation time. Solved without a GGUS ticket :)
We're currently getting controller errors in the GPFS system pool. To repair it, we need to free up some space, evacuate the pool, and rebuild it. This is in progress without interrupting production.
Smooth operations otherwise.
The main things we're working on:
Testing xrootd 5.0.3 endpoints with help from Wei (a small smoke-test sketch follows this list).
OSG 3.5.
NESE prep.
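A hedged sketch of the sort of basic smoke test such endpoint checks involve, assuming the stock xrdfs/xrdcp command-line clients are installed; the endpoint URL and paths are hypothetical placeholders, not the actual test plan:
```python
#!/usr/bin/env python3
"""Hedged sketch: basic smoke test of an xrootd 5.0.x endpoint.

Drives the stock xrdfs/xrdcp clients via subprocess; the endpoint URL
and remote directory below are placeholders, not real NET2 hosts.
"""
import subprocess

ENDPOINT = "root://xrootd.example.edu:1094"   # hypothetical endpoint
REMOTE_DIR = "/atlas/testdir"                 # hypothetical path

def run(cmd):
    """Echo and execute a command, failing loudly on a non-zero exit."""
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Query the server version to confirm the endpoint answers.
run(["xrdfs", ENDPOINT, "query", "config", "version"])

# 2. List a directory to confirm the namespace is reachable.
run(["xrdfs", ENDPOINT, "ls", REMOTE_DIR])

# 3. Round-trip a small file: copy in, then copy back out.
run(["xrdcp", "--force", "/etc/hostname",
     f"{ENDPOINT}/{REMOTE_DIR}/smoke_test.txt"])
run(["xrdcp", "--force", f"{ENDPOINT}/{REMOTE_DIR}/smoke_test.txt",
     "/tmp/smoke_test.txt"])
```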
OU:
- Nothing to report, running well.
SWT2_CPB:
Power loss at SWT2_CPB on 1/14 during campus electrical work, due to a generator failure. Awaiting clarification from the Physical Plant.
A rack-level switch locked up yesterday, isolating two storage hosts and an NFS host and producing GGUS ticket 150261.
Starting to work on moving the Xrootd door to OSG 3.5.
NERSC ran out of allocation, so it has been off until the new allocation period starts tomorrow.
- Will test the FastCaloGan container on NERSC GPUs when NERSC is back from downtime.
In recent running at TACC all jobs failed according to PanDA; will need to debug the errors when Lincoln is back.
ALCF Theta - got a large amount of CPU hours, which overloaded the Globus endpoint at BNL.
Solution: switch the storage back end from the NFS server to Lustre.
Longer term: test dCache with Globus.