Ceph/CVMFS/Filer Service Meeting

Europe/Zurich
600/R-001 (CERN)

    • 14:00 14:05
      CVMFS 5m
      Speaker: Enrico Bocchi (CERN)

      Enrico:

      • Quiet Christmas break, apart from:
        • atlas.cern.ch hanging on publish
        • cvmrepo.web.cern.ch down (again) for 1+ day
        • BNL complaining about new signing key for 'belle.cern.ch'
      • Collectd process spinning 100% on machines in 'cvmfs/one/backend' hostgroup: INC2261294
      • Working on frontier-squid on a personal VM (not yet puppetized)
      • Restarting YubiKey signing campaign. Will circulate the list of repos and dates, and warn owners and Stratum operators well in advance
    • 14:05 14:10
      Ceph Upstream News 5m

      Releases, Tickets, Testing, Board, ...

      Speaker: Dan van der Ster (CERN)

      Version recap:

      • nautilus: v14.2.5 has ~one known issue on large clusters, with ceph-mgr using lots of cpu: https://tracker.ceph.com/issues/43364 Fix is available and v14.2.6 should be released soon. We could also build a quick hotfix.
      • mimic: v13.2.7 should have the same issue as above (the issue is coming from the new network ping monitoring). We're running this version already on large clusters and the cpu usage is high but not causing any problems.
      • luminous: v12.2.12 is stable but there are a lot of fixes on the luminous backports branch so there should be a .13 release at some point. I don't see any strong reason to stay on luminous much longer -- nautilus looks good.
    • 14:10 14:15
      Ceph Backends & Block Storage 5m

      Cluster upgrades, capacity changes, rebalancing, ...
      News from OpenStack block storage.

      Speaker: Theofilos Mouratidis (National and Kapodistrian University of Athens (GR))

      Incidents over break:

      • SSD failed, affecting 6 OSDs in beesly
      • Dec 29 incident: possible network or block storage glitch.
      • CephFS slow ops.

      Maintenance after break:

      • osdmaps on beesly were piling up: restarted the leader
      • stray dir growing on flax (200k entries): fixed by running ls -lR
    • 14:15 14:20
      Ceph Disk Management 5m

      OSD Replacements, Liaison with CF, Failure Predictions

      Speaker: Julien Collet (CERN)

      Julien:

      • Disk replacement procedures improvements (CEPH-790):
        • Current procedure when machine goes into intervention:
          • Setting a hardcoded routine to prevent the script from running
          • Probably not very good practice
        • Proposition:
          • The person responsible for the intervention sets roger to intervention, with a message
          • The script checks the roger status before actually running
          • This is being tested, could go in production today.
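
      The proposed check can be sketched as below. This is a minimal illustration, not the CEPH-790 script: the state string "intervention", the function names, and the host names are assumptions, and the roger lookup is passed in as a callable because the actual query is not described in these notes.

```python
from typing import Callable

# Assumption: a host under intervention reports this state string.
BLOCKING_STATES = {"intervention"}


def may_run(hostname: str, get_state: Callable[[str], str]) -> bool:
    """Return True only if the host is not flagged for intervention.

    get_state is a hypothetical stand-in for the real roger query.
    Any lookup failure is treated as "do not run" (fail closed).
    """
    try:
        state = get_state(hostname)
    except Exception:
        return False
    return state.strip().lower() not in BLOCKING_STATES


def fake_lookup(hostname: str) -> str:
    # Hypothetical data used only to exercise the gate.
    states = {"cephdata-a": "production", "cephdata-b": "intervention"}
    return states[hostname]


if __name__ == "__main__":
    for host in ("cephdata-a", "cephdata-b"):
        action = "run" if may_run(host, fake_lookup) else "skip"
        print(host, action)
```

      Failing closed means a broken or unreachable status lookup keeps the script from touching a machine that may be under intervention.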
    • 14:20 14:25
      S3 5m

      Ops, Use-cases (backup, DB), ...

      Speakers: Julien Collet (CERN), Roberto Valverde Cameselle (Universidad de Oviedo (ES))

      Julien:

      • CEPH-782: cern-s3-scanner.py
        • Most of the available tools (S3scanner, slurp, ...) only issue requests and check the returned status code.
        • Their code is difficult to adapt for us, since it relies on many Amazon-specific features we don't need or use.
        • First python implementation here: https://gitlab.cern.ch/jcollet/code-misc/tree/master/cern-s3-scanner
        • Questions:
          • Where do we want this to be running?
          • How often?
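
      For reference, the status-code approach taken by the existing tools can be sketched as below. This is not the code of cern-s3-scanner.py; the function names, the path-style URL layout, and the mapping of status codes to labels are assumptions based on common S3 behaviour for anonymous requests.

```python
import urllib.error
import urllib.request


def classify_bucket(status: int) -> str:
    """Map the HTTP status of an anonymous bucket request to a label."""
    if status == 200:
        return "public"    # anonymous listing succeeded
    if status == 403:
        return "private"   # bucket exists, access denied
    if status == 404:
        return "missing"   # no such bucket
    return "unknown"


def probe_bucket(endpoint: str, bucket: str, timeout: float = 5.0) -> str:
    """Send one unauthenticated GET and classify the response.

    endpoint is supplied by the caller; path-style addressing
    (endpoint/bucket) is an assumption.
    """
    url = f"{endpoint.rstrip('/')}/{bucket}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify_bucket(resp.status)
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the status code is in err.code.
        return classify_bucket(err.code)
```

      A scanner built this way only learns whether a bucket can be listed anonymously; it does not inspect object-level permissions.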
    • 14:25 14:30
      CephFS/HPC/FILER/Manila 5m

      Filer Migration, CephFS/Manila, HPC status and plans.

      Speakers: Dan van der Ster (CERN), Pablo Llopis Sanmillan (CERN)
    • 14:30 14:35
      HyperConverged 5m
      Speakers: Jose Castro Leon (CERN), Julien Collet (CERN), Roberto Valverde Cameselle (Universidad de Oviedo (ES))
    • 14:35 14:40
      Monitoring 5m
    • 14:40 14:45
      AOB 5m