Ceph/CVMFS/Filer Service Meeting

600/R-001 (CERN)




Zoom: Ceph Zoom

    • 14:00 – 14:15
      CVMFS 15m
      Speakers: Enrico Bocchi (CERN) , Fabrizio Furano (CERN)


    • 14:15 – 14:30
      Ceph: Operations 15m
      • Incidents, Requests, Capacity Planning 5m
        Speaker: Dan van der Ster (CERN)
        • CEPH-972: EC and RJ racks to be decommissioned. Neither is urgent. RJ hosts gabe+flax, and we already have the hardware to go ahead. (It would be best to wait for flax to be upgraded to Nautilus before migrating that cluster.) EC is CASTOR -- I pinged Cano and the deadline is distant enough that it is not a priority yet. (Unclear what the space requirements for CASTOR Ceph will be in 2021.)
      • Cluster Upgrades, Migrations 5m
        Speaker: Theofilos Mouratidis (CERN)
      • Hardware Repairs 5m
        Speaker: Julien Collet (CERN)


        • Relatively quiet week
        • Beesly bluestore conversion:
          • Relatively smooth so far, including on corner cases
          • Status:
            • 765 BlueStore OSDs
            • 457 FileStore OSDs
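        The per-OSD conversion presumably follows the standard FileStore-to-BlueStore replacement cycle; a minimal sketch, where the OSD id 42 and device /dev/sdX are hypothetical placeholders:

```shell
# Hedged sketch of converting one OSD from FileStore to BlueStore.
# OSD id and device are placeholders, not from the minutes.
ID=42
ceph osd out ${ID}
# wait until the cluster no longer depends on this OSD's data
while ! ceph osd safe-to-destroy ${ID}; do sleep 60; done
systemctl stop ceph-osd@${ID}
ceph osd destroy ${ID} --yes-i-really-mean-it
# wipe the device and recreate the OSD as BlueStore, reusing the same id
ceph-volume lvm zap /dev/sdX --destroy
ceph-volume lvm create --bluestore --data /dev/sdX --osd-id ${ID}
```

Reusing the id via `--osd-id` keeps the CRUSH map stable, so only this OSD's data is backfilled.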
      • Puppet and Tools 5m
    • 14:30 – 14:45
      Ceph: Projects, News, Other 15m
      • Kopano/Dovecot 5m
        Speaker: Dan van der Ster (CERN)
      • REVA/CephFS 5m
        Speaker: Theofilos Mouratidis (CERN)
        • mtime propagation works on the ceph module
        • localhome doesn't even notice the file changes
          • ceph module is based on localhome
        • started investigating file versioning via snapshots
    • 14:45 – 14:55
      S3 10m
      Speakers: Julien Collet (CERN) , Roberto Valverde Cameselle (CERN)
      • CEPH-970: scrubbing delayed on gabe. Setting osd_max_scrubs to 5 unlocked the contention, and scrubbing has now caught up to 2020-09-26.
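      The knob change above can be applied at runtime and then persisted; a sketch, assuming a Nautilus-era cluster with a mon config database:

```shell
# Allow up to 5 concurrent scrubs per OSD (default is 1) so that
# gabe's scrubbing can catch up; runtime change on all OSDs first.
ceph tell osd.* injectargs '--osd_max_scrubs 5'
# Persist the setting in the mon config store (Mimic and later).
ceph config set osd osd_max_scrubs 5
```

The value would normally be dropped back to the default once the scrub backlog has cleared.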
      • CEPH-974: the large omap object in the gitlabartifacts bucket results from a bug where the sharding key is left unset once an object is marked "to be purged" but not yet deleted, so those objects all get sharded to '0'. Still looking into how to clean this up.
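      To locate the oversized index shard, the usual approach is to compare per-shard omap key counts; a sketch (the index pool name is an assumption, and <bucket-id> is a placeholder):

```shell
# Report buckets whose objects-per-shard count exceeds the warning limit.
radosgw-admin bucket limit check
# Count omap keys per index shard object of the suspect bucket; the
# index pool name and <bucket-id> are placeholders for this cluster.
for obj in $(rados -p default.rgw.buckets.index ls | grep "<bucket-id>"); do
  echo "${obj}: $(rados -p default.rgw.buckets.index listomapkeys ${obj} | wc -l) keys"
done
```

A shard whose key count dwarfs its siblings would confirm the everything-sharded-to-'0' symptom described above.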



      • lbcheck for broken Traefik configs is on its way
        • the base script dumps the Traefik config and checks for the expected entries
        • puppetization in progress
      • S3 account cleanup in progress
        • another batch of personal accounts has been removed
        • 3 empty user accounts have been suspended; they will be purged next Monday
        • 3 non-empty personal accounts remain
          • last sync was at the end of 2018...
      • S3 email check
        • Script ready, expected to be pushed today
        • Will run a daily check and send a notification when quota usage is above 95%
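        A daily check along those lines could be sketched as follows; the uid listing, the JSON field names (which vary by RGW release), and the 95% threshold handling are all assumptions, not the actual script:

```shell
# Hypothetical daily S3 quota check: flag users above 95% of their quota.
# JSON field names vary across RGW releases; treat them as placeholders.
THRESHOLD=95
for uid in $(radosgw-admin metadata list user | jq -r '.[]'); do
  max=$(radosgw-admin user info --uid "${uid}" | jq '.user_quota.max_size')
  used=$(radosgw-admin user stats --uid "${uid}" | jq '.stats.total_bytes')
  if [ "${max}" -gt 0 ] && [ $((100 * used / max)) -ge "${THRESHOLD}" ]; then
    echo "WARNING: ${uid} is above ${THRESHOLD}% of quota"
  fi
done
```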
      • S3 pubsub
        • The pubsub zone configuration seems more or less correct
        • Able to "list" topics (as in tracking that the request is correctly handled by the rgw)



    • 14:55 – 15:05
      Filer/CephFS 10m
      Speakers: Dan van der Ster (CERN) , Theofilos Mouratidis (CERN)
      • CEPH-973: dwight metadata cleanup still ongoing after last week's crash. We now see a few duplicated-inode entries, e.g.:

      2020-10-02 10:13:34.590 7fc8a143a700  0 mds.0.cache.dir(0x607) _fetched  badness: got (but i already had) [inode 0x10041176fd1 [2,head] /volumes/_nogroup/b2951883-e2f0-433a-b0b8-005a1026ed58/spark-sam3-enrichment-qa_4/commits/.cfad6d61-a099-4b3b-9763-33425d2b3ff2.tmp auth v92095819 s=0 n(v0 rc2020-09-30 14:21:29.495555 1=1+0) (ifile lock) (iversion lock) 0x5605240dc700] mode 33188 mtime 2020-09-30 14:21:29.491796
      2020-10-02 10:13:34.590 7fc8a143a700 -1 log_channel(cluster) log [ERR] : loaded dup inode 0x10041176fd1 [2,head] v2632141162 at ~mds0/stray7/10041176fd1, but inode 0x10041176fd1.head v92095819 already exists at /volumes/_nogroup/b2951883-e2f0-433a-b0b8-005a1026ed58/spark-sam3-enrichment-qa_4/commits/.cfad6d61-a099-4b3b-9763-33425d2b3ff2.tmp

      • The fix is to take this cephfs down for maintenance and run `cephfs-data-scan scan_links`.
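      The offline repair would look roughly like this (the filesystem name is taken from the ticket context; exact steps may differ by release):

```shell
# Take the filesystem offline for maintenance ("dwight" per the note).
ceph fs fail dwight
# Repair duplicate/stray inode linkage while all MDS daemons are down.
cephfs-data-scan scan_links
# Allow MDS daemons to join again and bring the filesystem back.
ceph fs set dwight joinable true
```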
      • CEPH-971: levinson cluster caps recall tuning. One heavy user was stat'ing inodes faster than the MDS was recalling caps, so this workload could lead to a future OOM of the MDS.
        • Fixed by tuning the MDS to recall caps more aggressively; levinson needs the caps recall settings at 8x the defaults to neutralize this workload.
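        A sketch of such tuning, assuming the Nautilus-era MDS recall knobs; the "8x the defaults" factor is from the note, but the specific option names and values shown are assumptions:

```shell
# More aggressive caps recall: roughly 8x the Nautilus defaults.
# Option names/values are assumed, not copied from the actual change.
ceph config set mds mds_recall_max_caps 40000              # default 5000
ceph config set mds mds_recall_max_decay_threshold 131072  # default 16Ki
```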
      • FILER-134: hg_filer is not yet puppet6-ready.



    • 15:05 – 15:10
      AOB 5m