Ceph/CVMFS/Filer Service Meeting

600/R-001 (CERN)



    • 14:00 14:15
      Visitor Talk: CephFS Benchmarking for ATLAS 15m
      Speaker: Adam Abed Abud (Universita and INFN (IT))
    • 14:15 14:20
      CVMFS 5m
      Speaker: Enrico Bocchi (CERN)


      • WIP atlas-nightlies on S3
        • Issues with CC7 (they have never used ayum with CC7)
        • Extension of the Manila share for $HOME
        • Long publication time due to many 'HEAD' requests against S3
      • New set of squids for Atlas repos
      • Extension of atlas.cern.ch volume (4.5TB)
      • Memory allocation problem with cvmfs-server-2.5.2 (`_CVMFS_SERVER_PIPELINE_MB=64`)
      • Migrated {belle, boss, geant4, glast}.cern.ch to more powerful VM
    • 14:20 14:25
      Ceph Rota Report 5m


       - 3 disk replacements

       - meeting with the repair service this week on improving procedures?

    • 14:25 14:30
      Ceph Upstream News 5m

      Releases, Tickets, Testing, Board, ...

      Speaker: Dan van der Ster (CERN)

      From Dan

      • Mimic 13.2.5 was released. This *should* be the good one to start our big upgrades. Will prepare some Jira tickets to test and start the upgrades.
      • Nautilus stable nearly released. The final release notes are a good read: https://github.com/ceph/ceph/pull/27019
        Interestingly, RH and SUSE enterprise Ceph will update directly from Luminous to Nautilus.
    • 14:30 14:35
      Ceph Backends 5m

      Upgrades, capacity changes, rebalancing, ...

      Speaker: Theofilos Mouratidis (National and Kapodistrian University of Athens (GR))

      ceph/flax: 2nd host, first 6+1 disks formatted to bluestore, now backfilling

      ceph-scripts: avoid using `ceph-volume lvm zap` on `ceph-disk`-formatted disks, because it starts the OSDs and then tries to delete them. (I use `ceph-disk zap` now; it may need improvement because it creates an unusable GPT partition table.)

      Didn't see any problems with restarting hosts with reformatted disks.
      They booted normally and started all processes.




      • cephmon0 being deleted next week. 
        Need to identify new mons for beesly. Physical vs. VM is being debated (bootstrapping issue).
    • 14:35 14:40
      S3 5m

      Ops, Use-cases (backup, DB), ...

      Speakers: Julien Collet (CERN), Roberto Valverde Cameselle (Universidad de Oviedo (ES))

      Roberto (Backup):

      •  Working on a restic puppet module to easily install restic and schedule a backup to S3.
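
As a rough illustration of what such a module would end up scheduling, the sketch below builds a restic invocation against an S3-hosted repository. The bucket name, the backup path, and the helpers `restic_backup_command` and `run_backup` are hypothetical and are not taken from the Puppet module. The `s3:` repository prefix, the `backup` subcommand, and the `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `RESTIC_PASSWORD` environment variables are standard restic usage.

```python
import os
import subprocess


def restic_backup_command(endpoint, bucket, path):
    """Return the argv for backing up `path` to an S3-hosted restic repository."""
    repo = f"s3:https://{endpoint}/{bucket}"
    return ["restic", "-r", repo, "backup", path]


def run_backup(endpoint, bucket, path, access_key, secret_key, repo_password):
    """Run the backup; credentials are passed through the environment."""
    env = dict(os.environ,
               AWS_ACCESS_KEY_ID=access_key,
               AWS_SECRET_ACCESS_KEY=secret_key,
               RESTIC_PASSWORD=repo_password)
    return subprocess.run(restic_backup_command(endpoint, bucket, path),
                          env=env, check=True)


# Hypothetical bucket and path, for illustration only.
cmd = restic_backup_command("s3.cern.ch", "my-backups", "/home/user")
print(" ".join(cmd))
```

The repository would need to be created once with `restic init` before the first backup; a cron job or systemd timer could then call `run_backup` on a schedule.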



      • s3.cern.ch upgraded to the latest Ceph v12.2.11.
      • setting up and testing of cosbench to measure current S3 performance

      Herve (by Dan)

      • OTG0048860: S3 had only one IP in s3.cern.ch, following an updated lbclient rpm and our incompatible configuration. https://gitlab.cern.ch/ai/it-puppet-hostgroup-ceph/commit/5209d601b1eaccb096f7363e6a33423db7ba238d
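
An incident like this could be caught by a simple check on how many addresses the alias publishes. The following is a minimal sketch, assuming a threshold of two addresses (an arbitrary choice for illustration); the function names are hypothetical and this is not part of the existing monitoring.

```python
import socket


def alias_addresses(hostname, port=443, resolver=socket.getaddrinfo):
    """Return the sorted set of IP addresses the hostname resolves to."""
    infos = resolver(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def check_alias(hostname, minimum=2, resolver=socket.getaddrinfo):
    """Return (ok, addresses); ok is False when fewer than `minimum` IPs are published."""
    addresses = alias_addresses(hostname, resolver=resolver)
    return len(addresses) >= minimum, addresses
```

Calling `check_alias("s3.cern.ch")` from a periodic job would flag the alias as soon as it drops below the assumed minimum. The `resolver` parameter exists only so the check can be exercised without real DNS.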


      • We need to prepare the configuration for multi-region S3.
        Roberto tested multi-region previously; is there any documentation?
        We could create a tiny VM cluster now (with exact same zone config as gabe cluster), then create a 2nd tiny VM cluster and work out the procedure to add the 2nd zone.
    • 14:40 14:45
      Block Storage 5m

      OpenStack Cinder, Beesly, Wigner Decommissioning, ...

      Speaker: Theofilos Mouratidis (National and Kapodistrian University of Athens (GR))


      • beesly balancing of RA21 is getting there: currently 66% used (from peak of 85% used roughly 1 month ago)
    • 14:45 14:50
      CephFS/FILER 5m
      Speakers: Alberto Chiusole (Universita e INFN Trieste (IT)), Dan van der Ster (CERN)


      • OpenShift filer needed a hard reboot late Friday evening: OTG0048862
      • HPC 1m
        Speaker: Alberto Chiusole (Universita e INFN Trieste (IT))

        Running an HPC application (RegCM, Regional Climate Model - https://gforge.ictp.it/gf/project/regcm/) on Ceph /bescratch (kernel-mounted), with varying numbers of processes.

        Strange "patterns" observed: investigating with IOR using different write methods (MPI-IO, HDF5, NetCDF, etc.).
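
A scan over process counts and write methods could be scripted along these lines. This is only a sketch: the test file path, block and transfer sizes, and process counts are assumptions, and the HDF5 and NCMPI backends are available only if IOR was built with those libraries. The flags used (`-a`, `-w`, `-b`, `-t`, `-o`) are standard IOR options.

```python
from itertools import product

# IOR backend names; NCMPI is IOR's Parallel-NetCDF interface.
APIS = ["POSIX", "MPIIO", "HDF5", "NCMPI"]


def ior_commands(proc_counts, apis=APIS, test_file="/bescratch/ior.dat",
                 block_size="1g", transfer_size="4m"):
    """Build one write-only IOR command line per (process count, API) pair."""
    commands = []
    for nprocs, api in product(proc_counts, apis):
        commands.append([
            "mpirun", "-np", str(nprocs),
            "ior", "-a", api, "-w",
            "-b", block_size, "-t", transfer_size,
            "-o", test_file,
        ])
    return commands


# Hypothetical process counts, for illustration only.
for cmd in ior_commands([16, 32]):
    print(" ".join(cmd))
```

Keeping the block and transfer sizes fixed across all runs means any difference in throughput can be attributed to the API or the process count rather than to the I/O size.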

    • 14:50 14:55
      HyperConverged 5m
      Speakers: Jose Castro Leon (CERN), Julien Collet (CERN), Roberto Valverde Cameselle (Universidad de Oviedo (ES))


      • Mail and DB guys are testing the hyperconverged setup
      • We may need to test enabling some tunables that are currently not there
      • Look at the impact of the cache configuration on the Client side
    • 14:55 15:00
      Monitoring 5m


      • ProphetStor set up on a couple of erin hosts: it predicts that one disk will fail this week or next.
      • Willing to increase the free trial period
      • Larger-scale tests? (currently 3 erin hosts)


      • Updated today to Thanos v0.3.2, which should reduce the compactor's memory usage.
      • Some gaps observed in long-term metrics, need to check.


      • New tickets needing a volunteer:
        • Create KPI/SLA dashboards: https://its.cern.ch/jira/browse/CEPH-679
        • rationalize the ceph-health-cron: https://its.cern.ch/jira/browse/CEPH-680
    • 15:00 15:05
      AOB 5m


      • Guys from IBM sent a ping about the LinuxOne machine

      RAL workshop:

      • Was useful -- they have a 5000 OSD cluster and enjoy some of the issues we also see (balancing, how to handle inconsistent PGs)
      • Slides available here: https://indico.cern.ch/event/803456/overview
      • Some dev ideas coming out:
        • better handling of weak writes (which lead to inconsistent PGs): repair the PG immediately, and *overwrite* the object in place so that it may fix the PendingSector.
        • add pg-upmap-items-force (and rm-pg-upmap-items-force) which do not do any crush rule validation. This would be useful when migrating to new crush topologies. 
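
For context, the existing `ceph osd pg-upmap-items` command takes a PG id followed by pairs of source and target OSD ids. The sketch below only builds that argument list; the helper names are hypothetical. The proposed `-force` variants do not exist yet and are deliberately not modelled here.

```python
def pg_upmap_items_command(pgid, mappings):
    """Build the `ceph osd pg-upmap-items` argv for one PG.

    `mappings` is a list of (from_osd, to_osd) integer pairs.
    """
    if not mappings:
        raise ValueError("at least one (from_osd, to_osd) pair is required")
    args = ["ceph", "osd", "pg-upmap-items", pgid]
    for from_osd, to_osd in mappings:
        if from_osd == to_osd:
            raise ValueError(f"osd.{from_osd} mapped to itself")
        args += [str(from_osd), str(to_osd)]
    return args


def rm_pg_upmap_items_command(pgid):
    """Build the argv that removes all upmap exceptions for one PG."""
    return ["ceph", "osd", "rm-pg-upmap-items", pgid]
```

With the current command, a mapping that violates the CRUSH rule is rejected or silently dropped by the cluster, which is the behaviour the proposed variants would bypass during topology migrations.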