Ceph/CVMFS/Filer Service Meeting

Europe/Zurich
600/R-001 (CERN)
    • 14:00 → 14:05
      CVMFS 5m
      Speaker: Enrico Bocchi (CERN)
    • 14:05 → 14:10
      Ceph Upstream News 5m

      Releases, Tickets, Testing, Board, ...

      Speaker: Dan van der Ster (CERN)
    • 14:10 → 14:15
      Ceph Backends & Block Storage 5m

      Cluster upgrades, capacity changes, rebalancing, ...
      News from OpenStack block storage.

      Speaker: Theofilos Mouratidis (National and Kapodistrian University of Athens (GR))

      ceph/flax: now 75% bluestore (a sketch for checking the ratio is after these notes)

      ceph/erin: upgraded to mimic

      nfs-ganesha:

      • Created a VM with a working example
      • Trying to figure out Kerberos auth
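
      The bluestore fraction quoted above can be checked with something like the following minimal sketch, assuming `ceph osd count-metadata` is available (Luminous or later) and the local keyring may run it; the cluster name and output handling are illustrative, not the actual flax tooling.

          #!/usr/bin/env python
          # Sketch: report the bluestore vs filestore split of a cluster's OSDs.
          # Assumes 'ceph osd count-metadata' is available (Luminous or later)
          # and that the local keyring is allowed to run it. Illustrative only.
          import json
          import subprocess

          def objectstore_counts(cluster='flax'):
              # Returns e.g. {"bluestore": 432, "filestore": 144}
              out = subprocess.check_output(
                  ['ceph', '--cluster', cluster, 'osd', 'count-metadata',
                   'osd_objectstore', '--format', 'json'])
              return json.loads(out.decode('utf-8'))

          if __name__ == '__main__':
              counts = objectstore_counts()
              total = sum(counts.values())
              for store, n in sorted(counts.items()):
                  print('%s: %d / %d OSDs (%.0f%%)' % (store, n, total, 100.0 * n / total))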
         
    • 14:15 → 14:20
      Ceph Disk Management 5m

      OSD Replacements, Liaison with CF, Failure Predictions

      Speaker: Julien Collet (CERN)

      Julien

      • Disk replacement (almost) fully handled by Paul and Remy now
      • Modified scripts to handle the updated procedure.
    • 14:20 → 14:25
      S3 5m

      Ops, Use-cases (backup, DB), ...

      Speakers: Julien Collet (CERN), Roberto Valverde Cameselle (Universidad de Oviedo (ES))

      (Dan)

      • gabe cluster is 80% full -- need to add spare servers to this cluster and balance (change crush rule, enable balancer).
        • default.rgw.meta (obsolete pool) has been deleted (removing 10M objects)
        • Script to clean ~13 million obsolete bucket indexes is running now on cephgabe0 (the approach is sketched after this list).
          • all have size=0, num omap keys = 0
          • 347 from 2019, 2975 from 2018, 13.65 M from 2017.
        • Roberto paused cbox backups (60TB used).
        • Kubernetes RECAST Higgs demo using 70TB until mid-May.
        • Deleted 7TB of my test data, 9TB of old Oracle tests.
        • CEPH-697 S3: Contact legacy users, move to openstack accounts
      • S3 outage this weekend: https://cern.service-now.com/service-portal/view-outage.do?n=OTG0049284
        • An OSD had locked up, which caused 404s in the Traefik LBs. Restarting the OSD unblocked the gateways.
        • (Likely related to the obsolete bucket index cleanup and compaction in the OSD LevelDBs. I re-compacted all OSDs after this and resumed the index deletion this morning.)
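
      A minimal sketch of the kind of index cleanup mentioned above, assuming the stale objects live in the default.rgw.buckets.index pool and that the rados CLI is available with admin credentials; the pool name, the dry-run default and the lack of any cross-check against live buckets are simplifications, not the exact script running on cephgabe0.

          #!/usr/bin/env python
          # Sketch: remove bucket index objects that are completely empty
          # (size == 0 and no omap keys), as described above. Pool name and
          # filtering are simplifications; the real script may also verify
          # that an index is not referenced by any live bucket.
          import subprocess

          POOL = 'default.rgw.buckets.index'   # assumed pool name

          def rados(*args):
              return subprocess.check_output(('rados', '-p', POOL) + args).decode('utf-8')

          def is_empty(obj):
              # 'rados stat' prints "... mtime ..., size N"; crude parse, but enough here.
              size = int(rados('stat', obj).rsplit('size', 1)[-1].split()[0])
              has_omap = rados('listomapkeys', obj).strip()
              return size == 0 and not has_omap

          def main(dry_run=True):
              for obj in rados('ls').splitlines():
                  obj = obj.strip()
                  if obj and is_empty(obj):
                      print('would remove %s' % obj if dry_run else 'removing %s' % obj)
                      if not dry_run:
                          rados('rm', obj)

          if __name__ == '__main__':
              main(dry_run=True)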

      Julien

      • Updated TLS certificates (see CEPH-670) with Dan
      • Started S3 multi-region preparation (see CEPH-682) with Roberto
    • 14:25 → 14:30
      CephFS/Manila/FILER 5m

      Filer Migration, CephFS/Manila status and plans.

      Speaker: Dan van der Ster (CERN)

      (Dan)

      • https://its.cern.ch/jira/browse/CRM-3101: (Fixing a small bug which created the file /root/1 instead of piping to /dev/null)
      • The remaining FileStore machines in ceph/flax are clearly more heavily loaded now that 75% of the cluster is bluestore. I caught some live OSDs flapping on Sunday and found that disabling deep scrub on those OSDs helped (sketch below).
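
      A minimal sketch of disabling deep scrub on a specific set of OSDs, assuming a Ceph release that understands `ceph osd set-group`/`unset-group`; on older releases the only option is the cluster-wide `ceph osd set nodeep-scrub` flag. The OSD ids are placeholders, not the actual flax FileStore OSDs.

          #!/usr/bin/env python
          # Sketch: disable deep scrub on a handful of busy FileStore OSDs only.
          # Assumes 'ceph osd set-group' is supported by the cluster's release;
          # otherwise fall back to the cluster-wide 'ceph osd set nodeep-scrub'.
          # OSD ids below are placeholders.
          import subprocess

          FILESTORE_OSDS = ['osd.12', 'osd.37', 'osd.101']   # placeholders

          def set_nodeep_scrub(osds, cluster='flax', unset=False):
              verb = 'unset-group' if unset else 'set-group'
              subprocess.check_call(['ceph', '--cluster', cluster, 'osd', verb,
                                     'nodeep-scrub'] + list(osds))

          if __name__ == '__main__':
              set_nodeep_scrub(FILESTORE_OSDS)                 # quiesce deep scrubs
              # set_nodeep_scrub(FILESTORE_OSDS, unset=True)   # re-enable afterwards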
    • 14:30 → 14:35
      HPC 5m

      Performance testing, HPC storage status and plans

      Speakers: Alberto Chiusole (Universita e INFN Trieste (IT)), Pablo Llopis Sanmillan (CERN)

      (Dan)

      • Attended HPC Users Workshop: https://indico.cern.ch/event/805674/
        • Users are happy to get more performance from the kernel mount.
        • One user asked how they could access CephFS from lxbatch. (We explicitly do not want to mount on lxbatch.) They can copy files between the two via EOS -- they will get back in touch if they need something else.
          (Once we have NFS-Ganesha working, we could enable some low-performance access to the HPC files from outside the cluster.)
    • 14:35 → 14:40
      HyperConverged 5m
      Speakers: Jose Castro Leon (CERN), Julien Collet (CERN), Roberto Valverde Cameselle (Universidad de Oviedo (ES))

      (Dan)

      • CEPH-698 kelly: redeploy OSDs with bluestore_min_alloc_size=4096 (sketch below)
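
      A minimal sketch of the config change that has to be in place before the redeploy, assuming the option is set via a ceph.conf fragment managed outside the central config store; the file path and the use of configparser are illustrative.

          #!/usr/bin/env python
          # Sketch: ensure bluestore_min_alloc_size=4096 is in the [osd] section
          # of ceph.conf before re-creating the OSDs. Path and approach are
          # illustrative; in practice this is done by configuration management.
          try:
              import configparser                  # Python 3
          except ImportError:
              import ConfigParser as configparser  # Python 2

          CEPH_CONF = '/etc/ceph/kelly.conf'       # illustrative path

          def ensure_min_alloc_size(path=CEPH_CONF, size='4096'):
              conf = configparser.ConfigParser()
              conf.read(path)
              if not conf.has_section('osd'):
                  conf.add_section('osd')
              # Only newly created BlueStore OSDs pick this up; existing OSDs keep
              # the min_alloc_size they were built with, hence the redeploy.
              conf.set('osd', 'bluestore_min_alloc_size', size)
              with open(path, 'w') as f:
                  conf.write(f)

          if __name__ == '__main__':
              ensure_min_alloc_size()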

       

      Julien & Jose

      • CEPH-688: testing of cache configurations is ongoing

       

    • 14:40 → 14:45
      Monitoring 5m

      Julien

      • ProphetStor: one disk failed ("driver error count value") and had not been predicted as bad by ProphetStor...
      • Currently investigating what triggered the alarm (a simple SMART sanity check is sketched below).
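
      A minimal sketch of the kind of SMART sanity check that can run alongside the failure predictions, assuming smartmontools with JSON output (smartctl 7.x) on SATA drives; the device list and the attributes checked are illustrative.

          #!/usr/bin/env python
          # Sketch: flag disks whose raw SMART error counters are non-zero,
          # independently of the ProphetStor predictions. Assumes smartctl 7.x
          # (JSON output) and SATA/ATA drives; devices are placeholders.
          import json
          import subprocess

          DEVICES = ['/dev/sda', '/dev/sdb']   # placeholders
          SUSPECT_ATTRS = ('Reported_Uncorrect', 'Reallocated_Sector_Ct',
                           'Current_Pending_Sector')

          def smart_attrs(dev):
              out = subprocess.check_output(['smartctl', '--json', '-A', dev])
              table = json.loads(out).get('ata_smart_attributes', {}).get('table', [])
              return {row['name']: row['raw']['value'] for row in table}

          if __name__ == '__main__':
              for dev in DEVICES:
                  bad = {k: v for k, v in smart_attrs(dev).items()
                         if k in SUSPECT_ATTRS and v > 0}
                  if bad:
                      print('%s: suspicious SMART counters: %s' % (dev, bad))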

    • 14:45 → 14:50
      AOB 5m

      Kopano:

      • CephFS will be used for attachments.
        • Currently all attachments are stored; they will test dedup.
        • Most attachments are around 4 kB.
        • Currently 2 levels of nesting -- [0-10]/[
      • Current PoC HW is HC machines (14*1TB SSD * 20 servers); back-of-the-envelope numbers after this list.
        • CDA want all HW to be in the barn. Need to find space and probably new hardware.
        • Eric Bonfillou will get price for 4x4TB SSD that could go into new arriving batch nodes.
      • Ceph Kelly is now dedicated to Kopano (including DBoD tests)
        • CTA can stay there because they're small, exceptionally.
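
      Back-of-the-envelope numbers for the PoC hardware quoted above; the 3x replication factor is an assumption, the rest comes from the notes (20 servers x 14 x 1 TB SSD, ~4 kB attachments).

          #!/usr/bin/env python
          # Rough capacity estimate for the Kopano PoC hardware.
          # 3x replication and the ~4 kB average attachment size are assumptions
          # taken from the meeting notes, not measured values.
          SERVERS = 20
          SSDS_PER_SERVER = 14
          SSD_TB = 1.0
          REPLICATION = 3          # assumption: standard 3x replicated pool
          AVG_ATTACHMENT_KB = 4    # "most attachments are around 4 kB"

          raw_tb = SERVERS * SSDS_PER_SERVER * SSD_TB
          usable_tb = raw_tb / REPLICATION
          attachments = usable_tb * 1e9 / AVG_ATTACHMENT_KB   # 1 TB = 1e9 kB

          print('raw: %.0f TB, usable (3x): %.0f TB' % (raw_tb, usable_tb))
          print('~%.1e attachments of %d kB fit in the usable space'
                % (attachments, AVG_ATTACHMENT_KB))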

      ceph/scripts: Change scripts to be closer to Python 3 and avoid linter warnings (typical changes sketched below).
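
      A minimal before/after sketch of the kind of Python 2 to Python 3 changes involved; the functions are illustrative, not taken from ceph/scripts.

          #!/usr/bin/env python
          # Illustrative Python-3-friendly rewrites of common Python-2 idioms;
          # not actual ceph/scripts code.
          from __future__ import print_function   # keeps the file usable on 2.7

          import subprocess

          def osd_tree(cluster='ceph'):
              """Return 'ceph osd tree' output as text on both Python 2 and 3."""
              out = subprocess.check_output(['ceph', '--cluster', cluster, 'osd', 'tree'])
              return out.decode('utf-8')            # bytes vs str: decode explicitly

          def print_sizes(sizes):
              # Python 2: "for k, v in sizes.iteritems(): print k, v"
              for name, size in sizes.items():      # .items() works on both
                  print('{0}: {1}'.format(name, size))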

       

       

      Julien:

      • Conference call last Thursday with the French Ministère de l'Intérieur (Ministry of the Interior).
        • They operate (or plan to operate) a Ceph cluster and are really interested in tips/help/guidance.
        • Planning to come mid-June for a day:
          • They will present their architecture/use-case/etc.
          • We'll probably do a "Ceph at CERN" sort of day
      • Confcall with UNIGE later this week

       

       
