Ceph/CVMFS/Filer Service Meeting

600/R-001 (CERN)



Dan van der Ster
    • 14:00 – 14:15
      CVMFS 15m
      Speakers: Enrico Bocchi (CERN), Fabrizio Furano (CERN)


      • Added the snapshot_error alarm for all the backed-up repos. Interestingly enough, the alarm was triggered over the weekend; the occurrence was interpreted as a glitch, since subsequent snapshots went fine:

      Starting ganga.cern.ch at Sun May 24 07:32:01 CEST 2020
      CernVM-FS: replicating from http://cvmfs-stratum-zero.cern.ch/cvmfs/ganga.cern.ch
      CernVM-FS: using public key(s) /etc/cvmfs/keys/cern.ch/cern.ch.pub, /etc/cvmfs/keys/cern.ch/cern-it1.cern.ch.pub, /etc/cvmfs/keys/cern.ch/cern-it2.cern.ch.pub, /etc/cvmfs/keys/cern.ch/cern-it4.cern.ch.pub, /etc/cvmfs/keys/cern.ch/cern-it5.cern.ch.pub
      Failed to contact stratum 0 server (9 - host returned HTTP error)
      ERROR from cvmfs_server
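The "glitch" interpretation above could be encoded in the alarm itself, e.g. only raising snapshot_error after several consecutive failures. A minimal sketch of that logic (hypothetical helper, not the actual collectd plugin):

```python
# Sketch: raise the snapshot_error alarm only after `threshold`
# consecutive failed snapshots, so that a one-off glitch (like the
# weekend occurrence above) does not trigger an alarm.
# Hypothetical logic, not the real plugin.
def should_alarm(results, threshold=2):
    """results: list of booleans, True = snapshot succeeded,
    ordered oldest to newest. Alarm only if the last `threshold`
    snapshots all failed."""
    if len(results) < threshold:
        return False
    return all(not ok for ok in results[-threshold:])
```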



      • Merged Fabrizio's changes for collectd plugins and alarms to qa
      • Preparation for alice.cern.ch and lhcb.cern.ch migrations (not scheduled yet)
      • Mail for lhcb.cern.ch for replication and backup strategy
      • atlas.cern.ch and atlas-nightlies.cern.ch notified about the migration to CC7
      • Disk replacement (zfs raid) on hot-spare stratum1
    • 14:15 – 14:30
      Ceph: Operations 15m
      • Cluster Upgrades, Migrations 5m
        Speaker: Theofilos Mouratidis (CERN)


          Down OSDs warning on Ceph Bot channel
          • Ignore OSDs which are on hosts in an intervention state (In Progress, finishing)
        • Ceph Guide
          • Add steps for handling monitoring (cephadm, grafana)
          • Plans to move to central monitoring?
            • Ceph project on central monitoring already created
            • SSD durability graph is already in there, maybe try to migrate others 
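The intervention-state filter for the bot could look something like the sketch below. The data shapes (osd→host map, host→state map) are assumptions about the bot's existing data sources, not the real implementation:

```python
# Sketch: suppress down-OSD warnings for OSDs whose host is in an
# intervention state, per the plan above. The set of states to ignore
# and the dict shapes are assumptions.
INTERVENTION_STATES = {"In Progress"}

def osds_to_warn(down_osds, osd_host, host_state):
    """down_osds: iterable of OSD ids; osd_host: osd id -> hostname;
    host_state: hostname -> intervention state (or absent/None)."""
    return [o for o in down_osds
            if host_state.get(osd_host[o]) not in INTERVENTION_STATES]
```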
      • Hardware Repairs 5m
        Speaker: Julien Collet (CERN)


        • Handling of 3 disk replacements for nautilus scripts validation
      • Incidents, Requests, Capacity Planning 5m
        Speaker: Dan van der Ster (CERN)
      • Puppet and Tools 5m


        • The hg_ceph puppet WIP (with spec testing coverage) is now building, so I will aim to merge it all this week.
        • In ceph-scripts/tools/bluestore there are scripts for doing an offline RocksDB compaction: a parallel version, a serial version, and a script to show the bluefs statistics for all the OSDs on a machine.
          • Useful for CEPH-898 (gabe has lots of slow_used_bytes)
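For reference, the core of the offline compaction is one `ceph-kvstore-tool bluestore-kv <osd-dir> compact` invocation per stopped OSD. A sketch of how the command list might be built (execution and any `xargs -P`-style parallelism left out; the real scripts live in ceph-scripts/tools/bluestore):

```python
import glob

# Sketch: build one offline compaction command per OSD data dir on
# the machine. The OSDs must be stopped before running these; this
# only constructs the command lines, it does not execute them.
def compaction_cmds(osd_dirs=None):
    dirs = osd_dirs if osd_dirs is not None else sorted(
        glob.glob("/var/lib/ceph/osd/ceph-*"))
    return [["ceph-kvstore-tool", "bluestore-kv", d, "compact"]
            for d in dirs]
```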


        • New hostgroups for ceph/gabe/radosgw
          • ceph/gabe/radosgw/hashi : formerly known as ceph/gabe/radosgw
          • ceph/gabe/radosgw/bare: new bare rgw subhostgroup
    • 14:30 – 14:45
      Ceph: Projects, News, Other 15m
      • Kopano/Dovecot 5m
        Speaker: Dan van der Ster (CERN)
      • REVA/CephFS 5m
        Speaker: Theofilos Mouratidis (CERN)


        • Preparations for Cernbox on CephFS
          • Reading and understanding Hugo's thesis
          • Evaluating current CephFS features
        • Designing a plan
    • 14:45 – 14:55
      S3 10m
      Speakers: Julien Collet (CERN), Roberto Valverde Cameselle (CERN)


      • CEPH-871
        • ceph-radosgw-bare-0 operational
        • Puppet changes merged (new subhostgroups)
      • CEPH-891 / Cleanup of s3 logs in ES:
        • There seems to be a drop in ES usage (to be confirmed over the long run)


      • Up next this week:
        • CEPH-845: ES Preparation for nethub:
        • CEPH-899: "no osd left behind"
        • CEPH-898: gabe cluster block.db slow bytes used


      • CEPH-898: the 8 new gabe OSDs with large HDDs have lots of bluefs slow_used_bytes (block.db is not big enough, so data spills over onto the HDD). This can slow down bucket index performance.
        • We can try compacting the OSDs (offline, to be non-disruptive) -- see the new tools in ceph-scripts/tools/bluestore
        • Or, better, we can get some new SSD-only OSDs to store the S3 bucket indexes, which would remove all the omap burden from the mixed SSD/HDD OSDs.
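The spillover can be spotted per OSD from `ceph daemon osd.N perf dump`. A small sketch of the check, assuming the nautilus JSON shape (`"bluefs"` → `"slow_used_bytes"`):

```python
# Sketch: flag OSDs whose bluefs has spilled over onto the slow (HDD)
# device, given parsed `ceph daemon osd.N perf dump` output. The key
# names match what we see on nautilus; treat them as assumptions for
# other releases.
def spilled_over(perf_dumps):
    """perf_dumps: dict of osd name -> parsed perf dump JSON.
    Returns {osd: slow_used_bytes} for OSDs with spillover."""
    return {osd: d["bluefs"]["slow_used_bytes"]
            for osd, d in perf_dumps.items()
            if d.get("bluefs", {}).get("slow_used_bytes", 0) > 0}
```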
    • 14:55 – 15:05
      Filer/CephFS 10m
      Speakers: Dan van der Ster (CERN), Theofilos Mouratidis (CERN)


      • CEPH-901: ceph/flax: one MDS went OOM because a client was stat'ing 100k's of files, causing the MDS lru inode cache to fill up more quickly than it could trim the cache.
        • New config (in nautilus this is 64k but the trim is done every 1s, in luminous the trim is every 5s):

      -  mds cache trim threshold: 200000 # default 65536. trim LRU space more quickly
      +  mds cache trim threshold: 400000 # default 65536. trim LRU space more quickly

      • ceph/jim: lots of caps-release warnings coming from hpc004.cern.ch. I think the MDS is too aggressive in asking this client to trim caps (the workload naturally has lots of files), so I'm testing this config change:

      -  mds max caps per client: 20000  # default 1 million. Limits memory consumption of single clients.
      +  mds max caps per client: 100000  # default 1 million. Limits memory consumption of single clients.

      • CEPH-902: On ceph/flax we've seen substantial growth in the number and activity of clients over the past few months. Some of the clients seem to have highly unoptimized workloads – with a quick check I noticed a cmsweb-test workload generating thousands of `lookup` calls per second without end. It is therefore a good moment to identify the heavy cephfs clients and give some guidance in cases where we see suboptimal workloads.
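One way to identify the heavy clients is to rank the sessions from `ceph daemon mds.<name> session ls` by held caps. A sketch, with the field names (`num_caps`, `client_metadata`/`hostname`) taken as assumptions based on what our nautilus MDSes report:

```python
# Sketch: rank CephFS clients by held caps from the parsed JSON of
# `ceph daemon mds.<name> session ls`. Field names are assumptions
# from our nautilus MDSes; other releases may differ.
def heaviest_clients(sessions, top=5):
    """sessions: list of session dicts. Returns a list of
    (hostname, num_caps) tuples, heaviest first."""
    ranked = sorted(sessions, key=lambda s: s.get("num_caps", 0),
                    reverse=True)
    return [(s.get("client_metadata", {}).get("hostname", "?"),
             s.get("num_caps", 0)) for s in ranked[:top]]
```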



      CEPH-573: Test setup of NFS Ganesha over CephFS:

      • Vanilla NFS server works nicely by exporting local directory with krb5 authentication
      • NFS Ganesha (2.7.1) with NFS_KRB5 uses client hostname as krb5 principal and mapping to uid fails
      • Ganesha 2.8 and 3.2 showed incompatibilities with the installed nfs-ganesha-ceph (likely a Linux issue), and their config files are not backward compatible with 2.7
      • Plan is to upgrade ceph and nfs-ganesha to latest, try with sys (or none) auth first, then proceed with krb5
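For the sys-auth-first test, the export would look roughly like the following FSAL_CEPH block (ganesha 3.x syntax; the export id, paths, and cephx user are placeholders, not our actual config):

```
EXPORT {
    Export_ID = 100;
    Path = /;
    Pseudo = /cephfs;
    Access_Type = RW;
    Squash = No_Root_Squash;
    SecType = "sys";          # start with sys auth, then move to krb5
    FSAL {
        Name = CEPH;
        User_Id = "nfs-test"; # placeholder cephx user
    }
}
```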
    • 15:05 – 15:10
      AOB 5m
      • Users are reporting rocksdb corruptions in v15.2.2: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/CX5PRFGL6UBFMOJC6CLUMLPMT4B2CXVQ/
        • The issue is now understood: it was caused by a regression following the change of the default for `bluefs_buffered_io` from true to false.
          • bluefs_buffered_io = false + bluefs_preextend_wal_files = true were giving a corruption due to a race.
        • This doesn't affect us, but underlines the need to wait a few weeks after each release before upgrading!
        • CEPH-903 to track these bluefs issues before we upgrade to anything.
      • The UEFI booting epic is not resolved. On Friday RedHat sent this:
        I apologize for the delay in the updates.
        Post discussion with backline engineering, and the feedback is the /boot partition must be created on a partition outside of the RAID array.
        Also EFI System partition on RAID is not supported.
        There maybe possibility if the RAID is broken or not functioning as expected it may lead to inconsistency  with  respect to bootloader and may cause failure for server booting.
        Hence the recommendation will be to proceed with your RAID scheme partitioning for all mount-points except _/boot_ and _/boot/efi_ which need to be on a separate partition.
        I apologize for the delay and less positive note, but let me know if there are any additional queries that I can assist with.
      • I am now trying to reproduce this kind of corruption. We have also sent a response to RedHat:

      If EFI system partitions on RAID are not supported, then why does anaconda create /boot/efi a RAID1 with metadata=1.0? Anaconda is doing the right thing here, surely not by accident: https://github.com/rhinstaller/anaconda/blob/master/pyanaconda/modules/storage/platform.py#L145

      If anaconda will be updated to explicitly not support /boot/efi on a RAID1, then would RedHat consider fixing their kernel/grub tooling to support two boot disks?

      Another thing: we had assumed that the weekly raid-check cron would let us know if the 2 disks get out of sync (e.g. following an external write during boot).
      However, it seems that /usr/sbin/raid-check simply ignores mismatches for raid1 and raid10! [1] We're not sure if that false positive scenario applies to /boot/efi (since it is so rarely written to on a running system). Would it be safe for us to monitor mismatch_cnt ourselves?
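Monitoring mismatch_cnt ourselves would amount to reading the md sysfs counter and alerting on a non-zero value. A sketch (the sysfs layout is the standard `/sys/block/<md>/md/mismatch_cnt`; the alerting side is left out):

```python
from pathlib import Path

# Sketch: read the md mismatch counter ourselves, since raid-check
# ignores mismatches on raid1/raid10. Note the counter is only
# refreshed by a "check" or "repair" sync action; a non-zero value
# is what we'd want to alert on.
def mismatch_count(md="md0", sysfs="/sys/block"):
    """Return the mismatch_cnt for the given md device."""
    return int(Path(sysfs, md, "md", "mismatch_cnt")
               .read_text().strip())
```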