Ceph/CVMFS/Filer Service Meeting

600/R-001 (CERN)



    • 2:00 PM 2:15 PM
      CVMFS 15m
      Speaker: Enrico Bocchi (CERN)

      Last week:

      • Removed wlcg-clouds.cern.ch
      • Migrated grid.cern.ch to CC7 + S3
      • Dedicated caches for Alice, Atlas, CMS, LHCb repos now in QA (on bagplus)
      • Unusual request (RQF1549750) for 20 TB on CVMFS to "share them on the internet a set of gaugefield configurations that can be used as basis for multiple physics projects in lattice QCD (part of TH)"


      • LHCb wants 3 release managers and gateway for lhcbdev.cern.ch
      • Remove lhcbdev-test.cern.ch on lxcvmfs94 (early test repo on SLC6 + S3)
      • Sending out emails for more migrations SLC6+Cinder --> CC7+S3
      • Reproduce ZFS issue on volume extension


      • Network intervention on Mar 25th: Stratum 1 backend offline
      • Network intervention on Apr 1st: several Stratum 0s, fronts, and caches offline
      • Need to be onsite for whitelist signing?
    • 2:15 PM 2:30 PM
      Ceph: Operations 15m
      • Notable Incidents or Requests 5m


        Brocade -> Juniper router migrations will carry on as planned (times changed to 06h30):

        • Mon 23 Mar 2020             OTG0054830
        • Wed 25 Mar 2020             OTG0055147
        • Mon 30 Mar 2020             OTG0055154
        • Wed 01 Apr 2020              OTG0055159

        CEPH-826: I have carried out a procedure on all Ceph machines to make sure they do not have a static IPv6 address. This was a prerequisite for the router migrations above.

      • Repair Service Liaison 5m
        Speaker: Julien Collet (CERN)
      • Backend Cluster Maintenance 5m
        Speaker: Theofilos Mouratidis (CERN)


        Ceph Erin upgrade to 14.2.8

        After the upgrade, one OSD's RocksDB became slow, which triggered a lot of slow ops.
        Dan did an offline compaction of the RocksDB and started the OSD again. The cluster's health is now OK.

        ceph osd blocked-by
        ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-464/ compact


    • 2:30 PM 2:45 PM
      Ceph: Projects, News, Other 15m
      • Backup 5m
        Speaker: Roberto Valverde Cameselle (CERN)
        • Started adding backup jobs to the new S3 zone. 
        • Will continue to add a dual backup of our accounts so we can compare performance against the current one. 
        • If everything goes fine, we can start moving new users / newly migrated projects to the new zone.
      • HPC 5m
        Speaker: Dan van der Ster (CERN)

        Due to the COVID-19 situation, the workshop foreseen for next week has been postponed.

        The tentative date is the 9th of June.


      • Kopano 5m
        Speaker: Dan van der Ster (CERN)
        • ceph/kopano: v14.2.8 upgrade (Teo) and pool reconfiguration todo (Dan)
      • Upstream News 5m
        Speaker: Dan van der Ster (CERN)


        • v14.2.8 has improved mgr upmap balancing. By default it will now balance to within 5 PGs stddev across all OSDs. In our clusters please do the following after an upgrade:
          • ceph config set mgr mgr/balancer/upmap_max_deviation 1
        • There is new activity on the osdmap trimming issue: https://github.com/ceph/ceph/pull/19076
          • The issue is coming up in the context of slow ops during pool creation. 
    • 2:45 PM 2:55 PM
      S3 10m
      Speakers: Julien Collet (CERN) , Roberto Valverde Cameselle (CERN)


      • cleanup of jira tickets:
        • improving the s3-scanner (group-by owner, blacklisting)
        • improving the no-osd-left-behind script
      • ongoing:
        • SSL Certificate for S3/nethub
      • gabe usage:
        • ceph/gabe usage
        • last 30 days:
          • ~300TB monthly increase
        • last 7 days:
          • ~24TB weekly increase (daily increase slowing down)
        • Will be 70% full in around 2-4 weeks, depending on the pace:
          • Currently 61% full
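
The fill-level estimate above can be sanity-checked with a simple linear projection. This is only a rough sketch: the 61% and 70% fill levels and the growth rates come from the notes, but the cluster capacity is not stated in these minutes, so `CAPACITY_TB` below is a hypothetical placeholder, and the helper name `weeks_until_threshold` is made up for illustration.

```python
def weeks_until_threshold(current_pct, target_pct, capacity_tb, growth_tb_per_week):
    """Weeks until usage reaches target_pct, assuming constant linear growth."""
    if growth_tb_per_week <= 0:
        raise ValueError("growth_tb_per_week must be positive")
    # Space (in TB) still free below the target fill level.
    remaining_tb = (target_pct - current_pct) / 100.0 * capacity_tb
    return max(remaining_tb, 0.0) / growth_tb_per_week


# Hypothetical capacity: NOT taken from the minutes, replace with the real value.
CAPACITY_TB = 1000.0

# Growth rates from the notes: ~24 TB/week (last 7 days) and
# ~300 TB/month (last 30 days), converted here to TB/week.
slow_weeks = weeks_until_threshold(61, 70, CAPACITY_TB, 24.0)
fast_weeks = weeks_until_threshold(61, 70, CAPACITY_TB, 300.0 / 30.0 * 7.0)

print(f"at weekly pace:  {slow_weeks:.2f} weeks")
print(f"at monthly pace: {fast_weeks:.2f} weeks")
```

With the real capacity plugged in, the two results bracket the estimate between the recent (slower) pace and the 30-day (faster) pace.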
    • 2:55 PM 3:05 PM
      Filer/CephFS 10m
      Speakers: Dan van der Ster (CERN) , Theofilos Mouratidis (CERN)


      • Preparing the CentOS 8 clients. The aarch64 build is not working for some unknown reason -- Alex I. is helping, and we've kicked off one last build this morning. If it doesn't succeed, we will disable aarch64.
      • FILER-120: last filers for MIC scheduled this week Thursday at 9am. This involves several filers, so Theo and Dan to coordinate beforehand how to divide and conquer.


      • Migrated twiki Thursday 12 March (07:30) on itnfs23c
      • itnfs23b to be decommissioned
    • 3:05 PM 3:10 PM
      AOB 5m


      • Dan: Mobile working, Yubikey works for ssh to aiadm-multi, not SSO.
      • Teo: Yubikey working.
      • Enrico: Mobile and Yubikey working, both for ssh and SSO.
      • Julien: Mobile working, ssh+SSO.
      • Fabrizio: ?
      • Roberto: Yubikey ssh+sso