Re-signing whitelist with YubiKey Tuesday to Thursday (OTG0053606)
Approx. 10 repositories per day
Big repositories left untouched for now
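The re-signing above can be sketched as follows, assuming the standard `cvmfs_server resign` command with the master key held on the YubiKey; the repository name and list file are illustrative:

```shell
# Re-sign the whitelist of one repository (master key on the smartcard):
cvmfs_server resign example.cern.ch

# Batch form for the ~10 repositories per day (repos-to-resign.txt is
# a hypothetical list, one fully qualified repository name per line):
while read -r repo; do
    cvmfs_server resign "$repo"
done < repos-to-resign.txt
```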
atlas.cern.ch volume extension failed
Attempts to fix it resulted in a corrupted partition table
Data has been replicated via zfs send/recv and via cvmfs_snapshot to two independent 8TB volumes
Will coordinate with repo owners to do the switch-over
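A minimal sketch of the zfs send/recv replication mentioned above; dataset, pool, and host names are illustrative:

```shell
# Snapshot the source dataset, then stream it to a backup volume:
zfs snapshot tank/atlas@migration
zfs send tank/atlas@migration | ssh backuphost zfs recv backup/atlas

# Later deltas can be sent incrementally against the same snapshot:
zfs snapshot tank/atlas@migration2
zfs send -i tank/atlas@migration tank/atlas@migration2 \
    | ssh backuphost zfs recv backup/atlas
```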
cms-ib requires execution of `cvmfs_server eliminate-hardlinks` due to migration to CC7
Needs to walk the whole file catalog from the root -- can be time-consuming
The first attempt failed (inode limit?)
Second attempt scheduled for tomorrow (Tue Dec 3)
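The migration step itself, assuming the standard `cvmfs_server` syntax; since it rewrites every hardlink it has to traverse all nested catalogs from the root:

```shell
# Rewrite hardlinks as individual files ahead of the CC7 migration:
cvmfs_server eliminate-hardlinks cms-ib.cern.ch
```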
cms.cern.ch needs to be migrated to CC7
`cvmfs_server eliminate-hardlinks` is blocking
projects.cern.ch accessible from CERN only
Now also S3 returns 403 Forbidden to non-CERN IPs
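A quick way to verify the behaviour described above (the URL is illustrative, not the actual endpoint):

```shell
# Print only the HTTP status code of a fetch against the S3 endpoint;
# expect 403 from a non-CERN IP and 200 from inside CERN:
curl -sS -o /dev/null -w '%{http_code}\n' \
    https://s3.cern.ch/cvmfs/projects.cern.ch/.cvmfspublished
```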
ams.cern.ch unable to complete a transaction because the root partition filled up
On CC7, the spooling area is on the root partition (hypervisor SSD)
Only 150 GB of the root partition is usable as spool area
Currently a 1TB volume is attached to let the transaction go through (it failed over the weekend due to AFS)
Needs a debriefing to understand why they run such huge transactions
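Spool pressure during a transaction can be watched like this, assuming the default CVMFS spool layout; the repository path is from the notes above:

```shell
# Overall usage of the spool filesystem on the release manager:
df -h /var/spool/cvmfs

# Size of the scratch area of the open transaction:
du -sh /var/spool/cvmfs/ams.cern.ch/scratch
```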
Ceph Upstream News
Releases, Tickets, Testing, Board, ...
Dan van der Ster
Mimic 13.2.7 released.
Notable new feature is slow ping detection -- the usual OSD heartbeats already trigger "osd failed" messages when an OSD stops pinging; now Ceph also raises a health warning if the ping time exceeds 5% of the heartbeat timeout (see `mon_warn_on_slow_ping_time`, `mon_warn_on_slow_ping_ratio`)
Default bluefs allocator changed from "stupid" to "bitmap" -- this gives consistent object-create latency at the cost of ~100 MB of RAM
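Both behaviours can be tuned through the cluster config database; the option names are the ones given in the notes above, the values here are illustrative:

```shell
# Tune the slow-ping health warning:
ceph config set mon mon_warn_on_slow_ping_time 0      # 0 = derive from ratio
ceph config set mon mon_warn_on_slow_ping_ratio 0.05  # warn at 5% of heartbeat timeout

# Revert to the old bluefs allocator if the extra RAM is a problem
# (requires an OSD restart to take effect):
ceph config set osd bluefs_allocator stupid
```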
MDS has the gradual cap recall that Teo backported.
Planning to upgrade ceph/dwight to 13.2.7 this week, then assess for the other clusters whether to upgrade directly to Nautilus or to 13.2.7 first.
(National and Kapodistrian University of Athens (GR))
ceph/erin had one PG inactive (from a test pool, so not critical). The pool had size=2, min_size=2, which looks like a misconfiguration. Set min_size=1 so the PG could activate, then set size=3, min_size=2 as the permanent fix.
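The fix described above amounts to the following pool commands (the pool name is illustrative):

```shell
# Temporary: allow the PG to activate with a single surviving replica:
ceph osd pool set testpool min_size 1

# Permanent fix: three replicas, serve I/O with at least two:
ceph osd pool set testpool size 3
ceph osd pool set testpool min_size 2
```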
ceph/beesly/osd/critical had multiple failures this morning, and loadavg much higher than usual. p05798818b00174 is particularly bad: ssh is not working (but mco is). Maybe related to user activity -- still investigating.
ceph/erin: the second-to-last rack is being reformatted; waiting for a ticket about a disk failure to be resolved so the newly formatted rack can enter the cluster.
Ceph Disk Management
OSD Replacements, Liaison with CF, Failure Predictions
FILER-120: All filers in the critical power barn will need to be recreated next year. The hardware is being decommissioned, and because they are on the LCG network, migration is not possible.
CephFS (HPC - jim)
The HPC team asked whether the MDSs in jim are busy and whether the count can be reduced (to get a worker node back into the cluster). We found that one should be sufficient, so changed to max_mds=1.
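The change above is a single filesystem setting; the fs name is from the notes:

```shell
# Reduce the number of active MDS ranks to one:
ceph fs set jim max_mds 1

# Verify: one active MDS, the freed daemon should drop to standby:
ceph fs status jim
```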
Jose Castro Leon (CERN), Julien Collet (CERN), Roberto Valverde Cameselle (Universidad de Oviedo (ES))
Kopano (from Dan):
CephFS is being used for 3x the expected use-cases (and space): attachments, backup staging area, and folder indices. Need to review the expected space usage of each with CDA (and review the individual share quotas on ceph/kelly).
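The quota review can be done with CephFS directory quotas via extended attributes; the mount paths below are illustrative:

```shell
# Inspect the current quota on a share directory:
getfattr -n ceph.quota.max_bytes /cephfs/kopano/attachments

# Set a 2 TiB quota on that directory (2 * 2^40 bytes):
setfattr -n ceph.quota.max_bytes -v 2199023255552 /cephfs/kopano/attachments
```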