Ceph/CVMFS/Filer Service Meeting

Europe/Zurich
600/R-001 (CERN)

Description

Zoom: Ceph Zoom

    • 14:00–14:15
      CVMFS 15m
      Speakers: Enrico Bocchi (CERN), Fabrizio Furano (CERN)
      • Long email exchange due to a problem at the RAL Stratum 1
        • Cache-Control header was set to 3 days instead of 61 seconds (see the check sketched after this list)
        • Post-mortem tomorrow at the CVMFS coordination meeting
      • cms-ib complained about slowness when deleting one folder
        • Need to gather statistics; the folder likely contains millions of files (a count sketch follows below)
      • Collectd complains about skipped packages on the zero machines, but there is no evidence in `/var/log/distro_sync.log`
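
      A quick check for the Stratum 1 item, as a hedged sketch: the host name below is a placeholder, and we assume the object of interest is the repository manifest, which normally carries the short 61-second TTL.

        # Inspect the Cache-Control header a Stratum 1 returns for the
        # repository manifest (host name is a placeholder):
        curl -sI http://stratum1.example.org/cvmfs/cms-ib.cern.ch/.cvmfspublished \
          | grep -i '^cache-control'

      For the cms-ib item, a plain file count would confirm the millions-of-files guess (the folder path is a placeholder):

        # Count files under the slow-to-delete folder:
        find /cvmfs/cms-ib.cern.ch/some-folder -type f | wc -l
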
    • 14:15–14:30
      Ceph: Operations 15m
      • Notable Incidents/Requests 5m
      • Upgrades, Migrations, (De-)Commissioning 5m
        Speaker: Enrico Bocchi (CERN)
        • Gabe OSDs restarted Friday afternoon (needed to feel alive)
          • Mempools are now used for the BlueStore cache instead of OSD maps (see the dump sketched after this list)
        • Flax draining is completed.
        • Gabe RJ* draining continues.
        • CEPH-1005: The optimized PG removal PR has been merged into master; now discussing whether it will be backported to Nautilus (non-trivial cherry-pick).
          • Realistically, we're not going to make any more upgrades in 2020, so we can target 2021 for this.
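
        For reference, per-OSD mempool usage (including the BlueStore cache and osdmap pools) can be inspected via the admin socket; `osd.0` is a placeholder id:

          # Dump per-pool memory usage for one OSD (run on the OSD host):
          ceph daemon osd.0 dump_mempools
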
      • Hardware Repair Liaison 5m
        Speaker: Julien Collet (CERN)
      • Puppet, Tools, Monitoring 5m
        Speaker: Theofilos Mouratidis (CERN)
        • CEPH-1013: New OSD mempool charts from metrictank showed a problem with osdmap mempool usage on gabe.
          • osdmap memory was not trimmed after the pool PG-merge exercise in November; a bug was filed upstream and the gabe OSDs were restarted.
        • Erin had ~500 MB of PG logs in the middle of last week; reconfigured to keep only 500 entries, down from the default of 3000 (usage decreased to <150 MB). A config sketch follows below.
          • Asked Teo to investigate PG log memory usage (CEPH-1012) so we can better understand this.
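
        A minimal sketch of the Erin change, assuming the standard PG log options were the knob used (the exact option names applied are an assumption):

          # Lower the PG log entry limits for all OSDs via the central config.
          ceph config set osd osd_min_pg_log_entries 500
          ceph config set osd osd_max_pg_log_entries 500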

        (Theo)

        • Fixed memory limits for scylladb; things look normal now.
          • Had to restart scylla, empty the swap, and restart metrictank (see the sketch after this list).
        • Puppet class for metrictank is ready.
          • Once we verify the instance looks good, we can create a proper VM in our production OpenStack project.
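
        The recovery sequence, sketched as shell; the `scylla-server` and `metrictank` systemd unit names are assumptions about this host's setup:

          # Stop scylla, drain and re-enable swap, then bring everything back.
          systemctl stop scylla-server
          swapoff -a && swapon -a
          systemctl start scylla-server
          systemctl restart metrictank
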
      • Upstream News 5m
        Speaker: Dan van der Ster (CERN)
    • 14:30–14:45
      Ceph: Ongoing Projects 15m
      • Kopano/Dovecot 5m
        Speaker: Dan van der Ster (CERN)
        • Only 3 users left on Kopano, with 231 on Dovecot.
        • CEPH-1015: kcephfs crash seen on dovecot-backend-00. The fix is in the latest el7.9 kernel, but the mail team will wait until January to upgrade.
      • REVA/CephFS 5m
        Speaker: Theofilos Mouratidis (CERN)
    • 14:45–14:55
      S3 10m
      Speakers: Enrico Bocchi (CERN), Julien Collet (CERN)

      Enrico:

      • New S3 frontend ready in CC7 + C8
        • Not running RGW yet, redirecting to existing backends
        • Some fixes on Prometheus might be required (it uses `consul_sd_configs` to retrieve scrape targets); see the config sketch after this list
      • Struggling to get access logs ingested by Elasticsearch
        • Meeting with Pablo at 3pm today
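
      For reference, the shape of a `consul_sd_configs` scrape job in prometheus.yml; the Consul address and service name are placeholders, not our values:

        # prometheus.yml sketch: discover scrape targets from Consul.
        scrape_configs:
          - job_name: 's3-frontend'
            consul_sd_configs:
              - server: 'consul.example.ch:8500'
                services: ['s3-frontend']
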
    • 14:55–15:05
      Filer/CephFS 10m
      Speakers: Dan van der Ster (CERN), Enrico Bocchi (CERN), Theofilos Mouratidis (CERN)
      • Jim has seen lots of slow requests (OSD + MDS) over the past week. They correspond to an HPC user job that uses all CPUs while writing at several GB/s.
        • Currently ceph-osd and user jobs run at the same priority. We are now testing `renice -10 -u ceph` on all the HPC nodes to see whether that prevents CPU starvation of the OSD processes (see the sketch below).
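
      A sketch of the test on one HPC node; note that renice only affects already-running processes, so it has to be reapplied after an OSD restart:

        # Raise the scheduling priority of everything owned by the ceph user,
        # then confirm the new nice value (NI column).
        sudo renice -10 -u ceph
        ps -o pid,ni,cmd -C ceph-osd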

    • 15:05–15:10
      AOB 5m
      • Giving a talk at "WD Live Hack" this Thursday:
        • Organized by our Ceph friends in Turkey.
        • https://register.gotowebinar.com/register/6959370254321102352