Ceph/CVMFS/Filer Service Meeting

Europe/Zurich
600/R-001 (CERN)

Description

Zoom: Ceph Zoom

    • 14:00 - 14:20
      Ceph: Operations Reports 20m
      • Teo (cta, erin, kelly, levinson) 5m
        Speaker: Theofilos Mouratidis (CERN)
        • CTA objectstore
          • The CTA team will set up a four-node cluster for us and give access to ceph-admins
          • They will use cephkelly for production in the interim
          • They will move their dev part to cephdwight
            • we upgrade cephdwight first, so we will encounter any upgrade problems beforehand
          • Once we get the hosts, we will create a cluster for them, plus the namespaces etc. (see the namespace sketch below)
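          • For context: RADOS namespaces partition objects within a single pool, so tenants can be separated without extra pools. A minimal sketch with the python-rados bindings (pool name 'cta-objectstore', namespace 'cta-dev', and object name are hypothetical):

            import rados

            # Connect using the local ceph.conf (path is an assumption)
            cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
            cluster.connect()
            try:
                ioctx = cluster.open_ioctx('cta-objectstore')  # pool must already exist
                try:
                    # All reads/writes below are confined to this namespace
                    ioctx.set_namespace('cta-dev')
                    ioctx.write_full('probe-object', b'hello from cta-dev')
                    print(ioctx.read('probe-object'))
                finally:
                    ioctx.close()
            finally:
                cluster.shutdown()
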
        • Castor
          • EC racks don't have replacement HDDs
            • each 2-HDD RAID-0 OSD will be recreated as a 1-HDD OSD
            • operators will still drain them (following the first part of the replacement procedure; see the drain sketch after this list)
          • Ceph public will be set to read-only by Steve
          • Once he deploys his own machines:
            • we will decommission the EC racks
            • we will give them back the machines that Giuseppe gathered for us
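          • The drain mentioned above amounts to marking the OSD out and waiting until Ceph reports it safe to remove. A minimal sketch driving the ceph CLI from Python (the polling interval is an assumption, and this covers only the first part of the procedure):

            import subprocess
            import time

            def drain_osd(osd_id: int, poll_secs: int = 60) -> None:
                # Mark the OSD out so its PGs backfill to other OSDs
                subprocess.run(['ceph', 'osd', 'out', str(osd_id)], check=True)
                # 'safe-to-destroy' exits nonzero until no data depends on this OSD
                while subprocess.run(['ceph', 'osd', 'safe-to-destroy',
                                      str(osd_id)]).returncode != 0:
                    time.sleep(poll_secs)
                print(f'osd.{osd_id} is drained and safe to remove')
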
      • Enrico (barn, beesly, gabe, meredith, nethub, vault) 5m
        Speaker: Enrico Bocchi (CERN)
        • Barn:
          • Doing some benchmarking and finalization today for enrollment in OpenStack
          • Requires discussion of volume labeling and of the volume migration from Beesly
        • Beesly:
          • MGR cephbeesly-mon-2a00f134e5 failed twice to report ceph-exporter stats to Prometheus
          • It ships ~16 MB of metrics every 30 seconds
          • Serving them frequently takes more than 15 seconds (the scrape interval); see the timing sketch below
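          • A quick way to confirm the timing is to pull the metrics endpoint by hand; a sketch (endpoint URL is an assumption; 9283 is the mgr prometheus module default port):

            import time
            import urllib.request

            URL = 'http://cephbeesly-mon-2a00f134e5:9283/metrics'  # hypothetical endpoint

            start = time.monotonic()
            with urllib.request.urlopen(URL, timeout=60) as resp:
                body = resp.read()
            elapsed = time.monotonic() - start

            print(f'scraped {len(body) / 1e6:.1f} MB in {elapsed:.1f} s')
            if elapsed > 15:
                print('slower than the 15 s scrape interval; scrapes will time out')
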
        • Gabe:
          • Uptime alarm set to 3y (applies to all Ceph machines; a minimal check is sketched below)
          • Rebooted 4 OSD nodes last week -- very smooth.
          • 2 nodes require reinstallation to get RAID-1 on the system disk (CEPH-1045)
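          • The check behind such an alarm is trivial; a sketch assuming it reads /proc/uptime (the threshold arithmetic is ours, not the actual alarm code):

            # Flag machines that have been up for more than ~3 years
            THRESHOLD_S = 3 * 365 * 24 * 3600

            with open('/proc/uptime') as f:
                uptime_s = float(f.read().split()[0])

            if uptime_s > THRESHOLD_S:
                print(f'uptime {uptime_s / 86400:.0f} days exceeds 3y; schedule a reboot')
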
        • Meredith, Nethub, Vault: NTR
      • Dan (dwight, flax, kopano, jim) 5m
        Speaker: Dan van der Ster (CERN)
        • Learned from Sean Crosby at UniMelb about a critical firmware issue on Toshiba 12TB drives: https://www.dell.com/support/home/fr-fr/drivers/driversdetails?driverid=0942y
          • RQF1747025 opened with HW Procurement to check with our vendor.
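          • A sketch for auditing our fleet with smartctl (the 'TOSHIBA MG07' model prefix is a placeholder; check the advisory for the affected models and fixed firmware):

            import re
            import subprocess

            def drive_info(dev: str) -> dict:
                """Parse model and firmware from 'smartctl -i'."""
                out = subprocess.run(['smartctl', '-i', dev], check=True,
                                     capture_output=True, text=True).stdout
                fields = {}
                for key in ('Device Model', 'Firmware Version'):
                    m = re.search(rf'^{key}:\s+(.*)$', out, re.MULTILINE)
                    if m:
                        fields[key] = m.group(1).strip()
                return fields

            for dev in ('/dev/sda', '/dev/sdb'):  # enumerate real devices in practice
                info = drive_info(dev)
                if info.get('Device Model', '').startswith('TOSHIBA MG07'):
                    print(dev, info)
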
        • Flax:
          • CEPH-1022: trying to reproduce the MDS locking crash. It doesn't reproduce on the cephoctopus cluster; I will test on a Nautilus cluster this week.
          • CEPH-1075: new instance of the 'loaded dup inode' issue
          • CEPH-1076: osd.431 was just replaced but is getting slow ops; running some tests there.
    • 14:20 - 14:30
      Ceph: Operations Tools (ceph-scripts, puppet, monitoring, etc...) 10m
      • CEPH-1074: the new ceph-scripts/tools/find-slow-osds.py shows OSDs with very high latency compared to the average, e.g.:

      [10:05][root@p05517715y58557 (production:ceph/beesly/mon*2:peon) ~]# ceph-scripts/tools/find-slow-osds.py
      Mean Commit Latency: 28.950
      Std Commit Latency: 18.443
      osd.720 has high commit latency: 264.000 ms
      osd.573 has high commit latency: 169.000 ms
      osd.865 has high commit latency: 143.000 ms
      osd.912 has high commit latency: 127.000 ms
      osd.1045 has high commit latency: 129.000 ms

      • ...
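      • The gist of the script is an outlier cut over the per-OSD commit latencies from 'ceph osd perf'. A condensed sketch (the JSON layout matches Nautilus; the 5-sigma cutoff is a guess at the real threshold):

      import json
      import statistics
      import subprocess

      # 'ceph osd perf' reports recent per-OSD commit/apply latency
      raw = subprocess.run(['ceph', 'osd', 'perf', '-f', 'json'],
                           check=True, capture_output=True).stdout
      osds = json.loads(raw)['osdstats']['osd_perf_infos']

      lat = {o['id']: o['perf_stats']['commit_latency_ms'] for o in osds}
      mean = statistics.mean(lat.values())
      std = statistics.stdev(lat.values())
      print(f'Mean Commit Latency: {mean:.3f}')
      print(f'Std Commit Latency: {std:.3f}')

      for osd_id, ms in sorted(lat.items()):
          if ms > mean + 5 * std:  # cutoff is an assumption
              print(f'osd.{osd_id} has high commit latency: {ms:.3f} ms')
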
      • S3 monitoring scripts moved to service account 'cephacc' (thanks @Jose!)
        • Contacted by Jan (Iven) on CEPH-848 -- some improvements are needed for central accounting (see the usage sketch below)
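          • The raw per-user numbers for central accounting can come from the RGW usage log; a sketch (the 'summary'/'total' JSON keys are from memory, and the usage log must be enabled on the gateways):

            import json
            import subprocess

            raw = subprocess.run(['radosgw-admin', 'usage', 'show',
                                  '--show-log-entries=false'],
                                 check=True, capture_output=True).stdout
            usage = json.loads(raw)

            # One summary entry per user with aggregate byte/op counters
            for entry in usage.get('summary', []):
                total = entry.get('total', {})
                print(entry.get('user'), total.get('bytes_sent', 0),
                      total.get('bytes_received', 0), total.get('ops', 0))
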
    • 14:30 - 14:40
      Ceph: R&D Projects Reports 10m
    • 14:40 - 14:50
      Ceph: Upstream News 10m
    • 14:50 - 15:05
      CVMFS 15m
      Speakers: Enrico Bocchi (CERN) , Fabrizio Furano (CERN)
    • 15:05 - 15:10
      AOB 5m
      • Enrico (likely) absent Thursday + Friday