Ceph/CVMFS/Filer Service Meeting

Europe/Zurich
600/R-001 (CERN)

Description

Zoom: Ceph Zoom

    • 14:00 14:15
      CVMFS 15m
      Speakers: Enrico Bocchi (CERN) , Fabrizio Furano (CERN)

      Enrico:

      • Migrated sft and atlas-nightlies (OTG0059306, OTG0059450)
        • atlas-nightlies is special as the Stratum 0 was accessed directly by clients/caches
        • A clean-up campaign with the ATLAS team partially fixed it. A few requests/clients are still using it.
        • For now, there is also an httpd on the Stratum 0 that redirects to S3. The plan is to remove it.
      • compass and ams planned for Thursday (OTG0059532, OTG0059533)
      • Failed to migrate sft-nightlies
        • Problem with catalog files not being copied over
        • Investigating with developers on CVMFSOPS-271
      cvmfs probe (Fabrizio)
      • there has been an outage of the IT new monitoring infrastructure, lasted a few days
      • the new CERN cvmfs probe coped correctly with it, giving the expected errors/alerts
      • we decided to keep the old, historical serviceid even if it's a bit strangely worded for a CERN probe
      • Fab is about to start rewriting the other probe, the one for the external sites
      • We would like to remove the old probe and do the substitution
    • 14:15 14:30
      Ceph: Operations 15m
      • Incidents, Requests, Capacity Planning 5m
        Speaker: Dan van der Ster (CERN)
      • Cluster Upgrades, Migrations 5m
        Speaker: Theofilos Mouratidis (CERN)
      • Hardware Repairs 5m
        Speaker: Julien Collet (CERN)

        Giuliano

        • Filestore to Bluestore migration:
          • bluestore: 874 (was 765)
          • filestore: 348 (was 457)
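The counts above can be turned into a progress figure with a small sketch. It assumes the two numbers cover all OSDs in scope for the migration, which the notes do not state; the function name is illustrative only.

```python
def migration_progress(bluestore_now, filestore_now, bluestore_before, filestore_before):
    """Summarise Filestore-to-Bluestore progress from two OSD counts."""
    total_now = bluestore_now + filestore_now
    total_before = bluestore_before + filestore_before
    return {
        "total_now": total_now,
        "total_before": total_before,
        # OSDs converted since the previous report
        "converted_since_last": bluestore_now - bluestore_before,
        # Share of OSDs already on Bluestore, in percent
        "percent_bluestore": round(100.0 * bluestore_now / total_now, 1),
    }


# Figures from this meeting: bluestore 874 (was 765), filestore 348 (was 457)
progress = migration_progress(874, 348, 765, 457)
print(progress)
```

With these inputs the total is unchanged between the two reports (1222 OSDs), 109 OSDs were converted, and about 71.5% are now on Bluestore.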
      • Puppet and Tools 5m
    • 14:30 14:45
      Ceph: Projects, News, Other 15m
      • Kopano/Dovecot 5m
        Speaker: Dan van der Ster (CERN)
      • REVA/CephFS 5m
        Speaker: Theofilos Mouratidis (CERN)
        • Sync test done
        • Most tests pass
          • They fail only when the file size of objects gets too large (cernboxcmd fails)
          • All the tests that fail on the cephfs module also fail on localhome
        • Version system to be reimplemented:
          • Current: data/.shadow/.versions/<user path>/v<timestamp>
          • TODO: data/<user path>/.snap/_<timestamp>_<data dir inode>
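The two version layouts can be compared side by side with a minimal sketch that follows the templates above literally. The function names, the example user path, the timestamp format (epoch seconds) and the inode number are all hypothetical; the notes do not specify them.

```python
import posixpath


def current_version_path(user_path, timestamp):
    """Current layout: data/.shadow/.versions/<user path>/v<timestamp>."""
    return posixpath.join(
        "data", ".shadow", ".versions", user_path.strip("/"), f"v{timestamp}"
    )


def snapshot_version_path(user_path, timestamp, data_dir_inode):
    """Planned layout: data/<user path>/.snap/_<timestamp>_<data dir inode>."""
    return posixpath.join(
        "data", user_path.strip("/"), ".snap", f"_{timestamp}_{data_dir_inode}"
    )


# Hypothetical values, for illustration only
old_path = current_version_path("einstein/docs", 1596450000)
new_path = snapshot_version_path("einstein/docs", 1596450000, 1099511627776)
print(old_path)
print(new_path)
```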
    • 14:45 14:55
      S3 10m
      Speakers: Julien Collet (CERN) , Roberto Valverde Cameselle (CERN)
      • CEPH-980: gabe osds using too much memory:
        • We had several OSDs going OOM with a huge amount of buffer_anon usage and then lots of osd_pglog memory after restarting them (>1.5GB each, when this should normally be <20MB).
        • I tuned down the max entries on the pg log from 3000 to 500, so pglog entries get trimmed more aggressively.
        • I created a new dashboard to watch the mempool usage over time: https://filer-carbon.cern.ch/grafana/d/000000108/ceph-osd-mempools
        • The issue seems stable for now.
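A check like the one behind this debugging can be sketched as follows: given a mempool dump from an OSD, flag the pools above a limit. This is only an illustration. The JSON layout (a "mempool" / "by_pool" nesting, with a flat fallback) is an assumption about what the OSD admin socket returns and should be verified against the release in use; the sample values are made up, and only the 20MB figure for osd_pglog comes from the notes.

```python
import json

MIB = 1024 * 1024


def oversized_mempools(dump, limits_bytes):
    """Return {pool: bytes} for pools whose usage exceeds the given limit.

    Assumed layout: {"mempool": {"by_pool": {name: {"items": N, "bytes": M}}}}.
    If that nesting is absent, the dump is treated as a flat {name: {...}} map.
    """
    pools = dump.get("mempool", {}).get("by_pool", dump)
    flagged = {}
    for name, limit in limits_bytes.items():
        used = pools.get(name, {}).get("bytes", 0)
        if used > limit:
            flagged[name] = used
    return flagged


# Made-up sample resembling the situation described: osd_pglog well above 1.5GB
sample = json.loads(
    '{"mempool": {"by_pool": {'
    '"buffer_anon": {"items": 1000, "bytes": 52428800}, '
    '"osd_pglog": {"items": 5000000, "bytes": 1717986918}}}}'
)
limits = {"osd_pglog": 20 * MIB}  # "<20MB normally", per the notes
flagged = oversized_mempools(sample, limits)
print(flagged)
```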
      • While debugging the above, I noticed that the gitlabregistry bucket gets thousands of list operations per second. I asked the Gitlab team and they noticed that some registry cache was not enabled; they enabled it last week and things seem stable in that area too. (Not obvious if it had any effect.)

      Julien

      • S3 accounting:
        • Will need some changes to accommodate API v3 of the central accounting
        • The central accounting dashboard is possibly broken
          • We're pushing the data and the system is acknowledging it
          • The dashboard shows a red icon, as if we were not pushing anything
      • S3 accounts cleaning:
        • The remaining personal accounts (4+3) will be purged after this meeting
        • More to follow on that topic

    • 14:55 15:05
      Filer/CephFS 10m
      Speakers: Dan van der Ster (CERN) , Theofilos Mouratidis (CERN)
      • Dwight issues:
        • CEPH-973: a few metadata issues were fixed with an offline `cephfs-data-scan scan_links`, which took 2 hours on this cluster.
        • CEPH-985: in preparation for testing the flax upgrade, I re-enabled 2 active MDSs and did the manual pinning.
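Manual pinning of directories to MDS ranks can be sketched as below. CephFS export pins are set through the `ceph.dir.pin` extended attribute; everything else here is an assumption for illustration: the round-robin assignment, the function names and the example paths are hypothetical and not necessarily how the pinning was done on this cluster.

```python
import os


def assign_pins(directories, active_ranks):
    """Spread directories over MDS ranks round-robin (illustrative policy)."""
    if active_ranks < 1:
        raise ValueError("need at least one active MDS rank")
    return {d: i % active_ranks for i, d in enumerate(sorted(directories))}


def apply_pins(pins):
    """Set the ceph.dir.pin xattr on each directory (Linux, on a CephFS mount)."""
    for path, rank in pins.items():
        os.setxattr(path, "ceph.dir.pin", str(rank).encode())


# Hypothetical directories, two active MDS ranks as in the notes
pins = assign_pins(
    ["/cephfs/volumes/c", "/cephfs/volumes/a", "/cephfs/volumes/b"], 2
)
print(pins)
# apply_pins(pins) would write the xattrs; it is not called in this sketch.
```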
      • Flax update scheduled for Weds all day. Procedure at CEPH-855
        • I have emailed P. Donnelly to see if this issue might affect us: https://tracker.ceph.com/issues/46648 (mds cannot handle thousands of subtrees)
          • It has not affected us yet even though we have 1500 pinned subtrees in luminous. I asked if nautilus added something new there which is worsening things.
      • CEPH-959: random mds crashes on flax caused by some corruption in the messaging layer. Hoping these go away with the upgrade, but so far at least they seem transparent to the users.
      • CEPH-984: added a warning to the cephfs puppet module for users who don't have purge enabled. Once they enable it, unmanaged ceph mounts will be removed from /etc/fstab
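Before enabling purge, a user may want to preview which ceph entries in /etc/fstab would count as unmanaged. The sketch below is not the puppet module's logic; it only parses fstab text and compares mount points against a list the caller supplies. The function names, the sample fstab content and the managed list are hypothetical.

```python
def ceph_fstab_entries(fstab_text):
    """Return mount points of fstab entries whose type is ceph or fuse.ceph."""
    mountpoints = []
    for line in fstab_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blanks and comments
        fields = stripped.split()
        # fstab fields: device, mount point, type, options, dump, pass
        if len(fields) >= 3 and fields[2] in ("ceph", "fuse.ceph"):
            mountpoints.append(fields[1])
    return mountpoints


def unmanaged_ceph_mounts(fstab_text, managed_mountpoints):
    """Ceph mount points present in fstab but absent from the managed list."""
    managed = set(managed_mountpoints)
    return [m for m in ceph_fstab_entries(fstab_text) if m not in managed]


# Made-up fstab content for illustration
sample_fstab = """\
# /etc/fstab
/dev/sda1 / ext4 defaults 0 1
mon.example.org:6789:/projects /mnt/projects ceph name=app,_netdev 0 0
mon.example.org:6789:/scratch /mnt/scratch ceph name=app,_netdev 0 0
"""
unmanaged = unmanaged_ceph_mounts(sample_fstab, ["/mnt/projects"])
print(unmanaged)
```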
    • 15:05 15:10
      AOB 5m