Ceph/CVMFS/Filer Service Meeting

Europe/Zurich
600/R-001 (CERN)

Description

Zoom: Ceph Zoom

    • 14:00 → 14:15
      CVMFS 15m
      Speakers: Enrico Bocchi (CERN) , Fabrizio Furano (CERN)

      Enrico:

      • Overload on cvmfs-stratum-zero (single VM with httpd ProxyPass-ing)
        • Caused replication delays on the Stratum 1
        • Mitigation:
          • Hot-spare S1 reading from hpc-zero and replicating every 10 minutes (was 2)
          • Backup replicating every hour
        • Fabrizio has a new script that runs at most N snapshots in parallel and records the time each replication takes (minimal sketch below)
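          A minimal sketch of how such a driver could look (the repository list, parallelism limit and log path are illustrative assumptions, not Fabrizio's actual script):

            # Run at most N cvmfs_server snapshots in parallel, recording the
            # wall time of each replication. REPOS, N and LOG are placeholders.
            N=4
            LOG=/var/log/cvmfs-snapshot-times.log
            REPOS="alice.cern.ch atlas.cern.ch cms.cern.ch lhcb.cern.ch"

            snapshot_one() {
                repo="$1"
                start=$(date +%s)
                cvmfs_server snapshot "$repo"        # replicate one repository
                end=$(date +%s)
                echo "$(date -Is) $repo $((end - start))s" >> "$LOG"
            }
            export -f snapshot_one
            export LOG

            # xargs -P caps the number of concurrent snapshots at N
            printf '%s\n' $REPOS | xargs -P"$N" -I{} bash -c 'snapshot_one "$@"' _ {}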
    • 14:15 → 14:35
      Ceph: Operations Reports 20m
      • Teo (cta, erin, kelly, levinson) 5m
        Speaker: Theofilos Mouratidis (CERN)
        • Ceph/Filer dependencies done
        • /mnt/projectspace has some trouble being extended
        • ceph-kelly and ceph-erin have memory leaks:
          • Enabled profiling on cephkelly-mon-39bee08afe to check for the known issue in https://tracker.ceph.com/issues/48381
        • ceph-erin
          • ran long SMART tests on osd.386, which uses sdu and sdv on p05972678k48823.cern.ch (commands sketched at the end of this report)
            • sdv showed SMART errors
              Device Model:     HGST HUS726060ALA640
              Serial Number:    1EJBJENJ
            • will ask for replacement
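        For reference, a hedged sketch of the long SMART self-test and follow-up checks (device path as reported above; the exact invocation may have differed):

          # Kick off an extended (long) SMART self-test on the suspect drive
          smartctl -t long /dev/sdv

          # Later: review the self-test log, health status and error counters
          smartctl -l selftest /dev/sdv
          smartctl -H -A /dev/sdv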
      • Julien (hw repair) 5m
        Speaker: Julien Collet (CERN)
        • Pretty busy weeks with lots of drives being replaced
          • slowly catching up on the Erin failures from the Christmas break
          • a drive re-seat in Prévessin this Thursday

         

        • Live testing of the new script in progress
          • Roll out this week
          • Writing of the procedure in progress

         

        • Feedback on the new repair channel?

         

      • Enrico (beesly, gabe, merideth, nethub, vault) 5m
        Speaker: Enrico Bocchi (CERN)
        • Meredith installation completed. Should appear in OpenStack as `io2` soon
        • Intervention on NetHub (memory replacement on 3 hosts) went fine
          • Some slow ops and long ping times on a few OSDs, fixed by restarting them (restart sketched at the end of this report)
        • Request for 40 TB on NetHub -- backup of media content (RQF1717470)
        • New capacity for Beesly and Beesly' (== Barn)
          • CEPH-1043
          • 3 machines fail to install -- console reports cryptic dracut messages
          • Issue with the SSDs used for the system disk -- mdraid layout:
            sdaw       67:0    0  1.8T  0 disk  
            ├─sdaw1    67:1    0    1G  0 part  
            │ └─md126   9:126  0 1022M  0 raid1 /boot
            ├─sdaw2    67:2    0 64.3M  0 part  
            ├─sdaw3    67:3    0  256M  0 part  
            │ └─md125   9:125  0  256M  0 raid1 /boot/efi
            └─sdaw4    67:4    0  1.8T  0 part  
              └─md127   9:127  0  1.8T  0 raid1 /
            sdax       67:16   0  1.8T  0 disk  
            ├─sdax1    67:17   0    1G  0 part  
            │ └─md126   9:126  0 1022M  0 raid1 /boot
            ├─sdax2    67:18   0  256M  0 part  
            │ └─md125   9:125  0  256M  0 raid1 /boot/efi
            └─sdax3    67:19   0  1.8T  0 part  
              └─md127   9:127  0  1.8T  0 raid1 /
        • Review down/out OSDs with Julio (thanks!) starting today
        • Massive network intervention on Mon Jan 25 at 09:00 impacting NetHub
          • Cluster fully down -- 100%
          • OTG0061575 -- IP 100/1/2, wrongly reported as EOS nodes
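        For reference, the slow-ops mitigation above amounted to restarting the affected OSD daemons; a hedged sketch (the OSD id is a placeholder):

          # Identify OSDs currently reporting slow ops or long heartbeat pings
          ceph health detail | grep -i 'slow ops'

          # Restart the affected daemon on its host (placeholder OSD id)
          systemctl restart ceph-osd@123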
      • Dan (dwight, flax, kopano, jim) 5m
        Speaker: Dan van der Ster (CERN)
        • Flax:
          • https://cern.service-now.com/service-portal?id=outage&n=OTG0061582
            • Failing over mds.0 to standby so that the HV intervention on Jan 25 is transparent.
          • "IT Opencast" granted a large quota extension (+58TB to 100TB total).
          • CodiMD has truncated files again (INC2671188) -- recovery ongoing; asked them to do a full audit on their side to see why this keeps happening.
            • I believe it is a known kernel client bug fixed in el7 kernels since March 2020. I have asked them to upgrade their kernel and we will debug further.
              • The bug is that files can be zero sized when written via splice(2). I don't expect this to be common.
            • Recommending all users to upgrade: https://cern.service-now.com/service-portal?id=outage&n=OTG0061587
        • Dwight:
          • Rebooted all MDS's after enabling swap. Failover was transparent.
        • Jim:
          • I advised HPC team to move the standby MDS to a new rack -- currently both MDS's are in the same rack.
        • Erin (not my cluster but..)
          • enabled bluefs_buffered_io = true to help reduce slow requests (command sketched at the end of this report)
          • Some OSDs show high latency (visible on the new osd-perf dashboard) -- there is some problem with the underlying disks, so they need to be replaced (e.g. osd.431 was stopped/drained for this reason).
        • Cephadm updated to v14.2.16 and rebooted today.
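        For reference, a hedged sketch of the two commands behind the flax failover and the erin bluefs change (syntax from memory; the actual invocations may have differed):

          # Fail rank 0 of the flax filesystem over to its standby MDS
          ceph mds fail flax:0

          # Enable buffered reads in BlueFS on all OSDs (slow-request mitigation)
          ceph config set osd bluefs_buffered_io true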
    • 14:35 → 14:45
      Ceph: Operations Tools (ceph-scripts, puppet, monitoring, etc...) 10m
      • hg_ceph:
        • all clusters should have switched from cephmirror.cern.ch to linuxsoft.cern.ch
        • I want to change clusters to install from Koji instead of the upstream mirror by default -- thoughts?
      • ceph-scripts:
        • CEPH-1047: New script to add a swapfile to a VM (e.g. for an MDS, so that it is less likely to go OOM during failover); a minimal sketch follows at the end of this list.
          • Swapfiles added to dwight and flax mds's.
        • New reboot-mds.sh helper
        • Scripts to apply the LVM extent bug fix when creating OSDs (issue seen on Meredith)
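        A minimal sketch of what the swapfile script boils down to (size and path are illustrative placeholders, not the actual CEPH-1047 parameters):

          # Create and enable a swapfile so an MDS can absorb a memory spike
          # during failover; 8G and /swapfile are illustrative values only.
          fallocate -l 8G /swapfile
          chmod 600 /swapfile
          mkswap /swapfile
          swapon /swapfile
          echo '/swapfile none swap sw 0 0' >> /etc/fstab   # persist across reboots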
    • 14:45 → 14:55
      Ceph: R&D Projects Reports 10m

      Reva (theo):

      • The cbox/ceph test bench filled up its disk after months of inactivity
        • couldn't run any command
        • rebuilt the instance on C8
        • setting up the tests again for the presentation
      • finishing the slides for the CS3 flash talk
    • 14:55 → 15:05
      Ceph: Upstream News 10m
    • 15:05 → 15:10
      AOB 5m