ceph/erin outage (15 Jan):
Around 13:30 there was a network failure in the datacenter. The OSDs couldn't reach their heartbeat peers, and after about an hour they started getting marked as failed.
This put client operations on hold for roughly an hour.
The nodes were restarted and operations resumed as normal within 30 minutes.
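For next time: if we know the network itself is flapping, we can temporarily stop the cluster from reacting to the lost heartbeats. A minimal sketch using the standard ceph CLI (the flags are real; exactly when to unset them is a judgment call):

    ceph osd set noout    # don't mark unreachable OSDs out, so no rebalancing starts
    ceph osd set nodown   # ignore failure reports, so OSDs don't flap up/down
    ceph health detail    # watch the blocked requests while the network recovers
    # once the network is stable again:
    ceph osd unset nodown
    ceph osd unset noout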
An interesting SCSI glitch with mdraid reassembly was seen on AFS: https://its.cern.ch/jira/projects/AFS/issues/AFS-508 -- what would we do if such a glitch happened on a Ceph server?
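My working assumption for the Ceph case: we wouldn't reassemble anything -- we'd let replication handle it and replace the OSD. A sketch of the usual dead-disk procedure (OSD id 42 and /dev/sdX are placeholders):

    ceph osd out 42                           # trigger backfill of its PGs to other OSDs
    ceph -s                                   # wait for recovery to finish (HEALTH_OK)
    systemctl stop ceph-osd@42                # on the affected host
    ceph osd purge 42 --yes-i-really-mean-it  # remove it from the CRUSH map and osdmap
    ceph-volume lvm create --data /dev/sdX    # recreate an OSD on the replacement disk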
Kopano servers have dual 10Gig-E NICs -- Jose configured them in LACP mode (link aggregation). In theory this should give 20Gig-E per machine, but my rados bench showed only ~1000MB/s -- roughly one saturated 10Gig-E link, which is what you'd expect if the traffic is landing on a single slave. Jose is checking with the network people (Vincent).
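Likely explanation: LACP hashes each TCP flow onto one slave link, so whether we see 10 or 20Gig-E depends on the bond's transmit hash policy and on how many flows the benchmark opens. A sketch of what to check and how to rerun the test (bond0 and the pool name "test" are assumptions):

    cat /proc/net/bonding/bond0   # should show 802.3ad and the Transmit Hash Policy
    # layer2 hashing pins all traffic to one link per peer MAC;
    # layer3+4 spreads individual TCP flows across both links
    rados bench -p test 60 write -t 64 --no-cleanup   # many concurrent ops => many flows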
Just before the meeting, the cephkopano cluster lost (several?) nodes at once, so the cluster is currently down.
Upgraded ceph/dwight to v14.2.6 (from v13.2.7). The mon/mgr/osd upgrades all went smoothly. The MDS upgrade was less smooth: the moment I enabled msgr v2, the active MDS reported that the leading mon had "lost contact". Details here: https://tracker.ceph.com/issues/43596
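For reference, the final Nautilus steps I ran (this is the documented upgrade procedure, nothing dwight-specific):

    ceph osd require-osd-release nautilus   # once all OSDs are running v14
    ceph mon enable-msgr2                   # mons start listening on msgr v2 (port 3300)
    ceph mon dump                           # verify each mon now advertises a v2 address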
I don't think this is a showstopper for the flax upgrade. I'll do more multi-MDS testing on dwight before scheduling it.
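The multi-MDS test itself is small (the filesystem name "cephfs" is an assumption):

    ceph fs set cephfs max_mds 2   # add a second active MDS rank
    ceph fs status                 # both ranks should show as active
    ceph fs set cephfs max_mds 1   # drop back to a single active rank afterwards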