Ceph/CVMFS/Filer Service Meeting

Europe/Zurich
31/3-004 - IT Amphitheatre (CERN)

Videoconference
Ceph Daily Standup
Zoom Meeting ID
707092061
Host
Dan van der Ster
    • 3:00 PM 3:15 PM
      CVMFS 15m
      Speakers: Enrico Bocchi (CERN), Fabrizio Furano (CERN)
      • Fabrizio 5m
        Speaker: Fabrizio Furano (CERN)
      • Enrico 5m
        Speaker: Enrico Bocchi (CERN)
        • Minor fixes due to modulesync and the C8 --> CS8 migration of test machines.
        • We should organize a cvmfs coffee // lunch now that Radu is here and Jakob is not gone yet.
    • 3:15 PM 3:35 PM
      Ceph Operations Reports 20m
      • Teo (cta, kelly) 5m
        Speaker: Theofilos Mouratidis (CERN)
        • CTA:
          • This week, Julien and I will do a test migration with test data
          • If the export/import operation succeeds, we will plan an intervention to migrate production to the new cluster
        • Ceph/Kelly upgraded to 15.2.14-7 (CTA is 15.2.14)
        • Metrictank:
          • carbon-relay-ng leaks memory:
            • usage should stay below 1GB, but can reach up to 8GB
            • until it is fixed, I will set a 2GB MemoryMax limit in the systemd unit
          • some issues with pprof profiling, which accounts for only half of the memory used; will create a GitHub ticket for this
          • helping the FTS team use Metrictank to transition away from Elasticsearch
            • Consult on how to collect metrics, send them
            • How to create dashboards and use the graphite query language
            • Create variables, etc
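        For the "collect metrics, send them" item above, a minimal sketch of pushing data points in the Graphite plaintext format (`<path> <value> <timestamp>`), which carbon-style relays accept. The metric path, host name, and port 2003 are assumptions for illustration, not the actual FTS or Metrictank setup.

        ```python
        import socket
        import time
        from typing import Iterable, Optional


        def carbon_line(path: str, value: float, timestamp: Optional[int] = None) -> str:
            # One data point per line: "<metric path> <value> <unix timestamp>".
            if not path or any(ch.isspace() for ch in path):
                raise ValueError("metric path must be non-empty and contain no whitespace")
            ts = int(time.time()) if timestamp is None else int(timestamp)
            return f"{path} {value} {ts}\n"


        def send_metrics(lines: Iterable[str], host: str, port: int = 2003,
                         timeout: float = 5.0) -> None:
            # Hypothetical relay endpoint; 2003 is the conventional plaintext port.
            payload = "".join(lines).encode("ascii")
            with socket.create_connection((host, port), timeout=timeout) as sock:
                sock.sendall(payload)


        if __name__ == "__main__":
            # "fts.test.transfers_active" is a made-up metric name.
            print(carbon_line("fts.test.transfers_active", 42, 1634000000), end="")
        ```

        Building the line separately from sending it keeps the formatting easy to check without a running relay.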
      • Enrico (barn, beesly, gabe, meredith, nethub, vault) 5m
        Speaker: Enrico Bocchi (CERN)

        Barn:

        • Upgraded to Octopus last Monday, and doing ok since then.

        Beesly:

        • MGRs running only on two C8 VMs (cephbeesly-mon-*) -- Required for upgrade to Octopus
        • Upgrade to Octopus (OTG0066572) went much faster than expected
          • Several slow requests, otherwise uneventful
        • Will start enrolling new hw and drain RA machines (OTG0066404)

        Nethub:

        • Bytes keep on flying from old hw to new hw in HA racks
        • Moving monitor from old (almost-empty) machine to new hw for decommissioning
        • Same for MGRs, but C8 boxes in one rack only for now (acceptable)

        Gabe, Meredith, Vault: NTR

        -----

        Commissioning/Decommissioning:

        • Confirmed with CF, RA racks are the last ones from ST to be decommissioned
        • For delivery 21Q4, should we consider PCC PoC for location?

        Upgrade plans:

        • Vault // Meredith should also be upgraded to Octopus
        • Nethub should go to Octopus to unblock mirroring

        3rd RBD cluster:

        • Can I take the ceph/cephadm boxes and make the 3rd RBD region out of it?
        • Suggestions for names are welcome (we are at Pam)
      • Dan (dwight, flax, kopano, jim, upstream) 5m
        Speaker: Dan van der Ster (CERN)
        • dwight 2/3rds replaced with new hw.
        • flax: one OCIS client is particularly noisy with caps recall issues. Has been quiet over the weekend -- not clear why it resolved itself.
      • Arthur (levinson, pam) 5m
        Speaker: Arthur Outhenin-Chalandre (CERN)
        • Levinson upgraded to octopus last Tuesday
      • Jose (OpenStack) 20m
        Speaker: Jose Castro Leon (CERN)
        • Jose is sending per-type accounting info
        • Jose to prepare a detailed type-rationalisation plan for review and OTG.
        • all HVs updated to octopus client
    • 3:35 PM 3:45 PM
      R&D Projects Reports 10m
      • Reva/CephFS 5m
        Speaker: Theofilos Mouratidis (CERN)
        • Reva
          • Issues with trying to implement file versions
          • There are no tools to inspect if the code works, apart from the UI
          • There is a routing issue where the request cannot find the storage provider, although other operations succeed
          • I will work with Ishank tomorrow morning to fix the issues that keep me from developing the module further
        • Snapshots
          • The smallest schedule interval is one hour; additional implementation is required to snapshot every 15 min
          • We would also need to assess whether it is feasible to take snapshots every 15 min for all users operating on the cluster.
            • Can we create snapshots every 15 min on flax for every manila share for example?
          • The naming structure of scheduled snapshots differs from that of manual snapshots
            • In subdirs, names change from "<inode of snapshot dir>_<snapshot name>" to "_scheduled-<datetime>_<some number>"
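        To tell scheduled snapshots apart in subdirs, the "_scheduled-<datetime>_<some number>" pattern noted above could be parsed as sketched below. The datetime layout `%Y-%m-%d-%H_%M_%S` is an assumption about how the scheduler formats timestamps, and the example name is made up.

        ```python
        import re
        from datetime import datetime
        from typing import NamedTuple, Optional

        # Assumed layout: _scheduled-YYYY-MM-DD-HH_MM_SS_<number>
        _SCHEDULED = re.compile(
            r"^_scheduled-(\d{4}-\d{2}-\d{2}-\d{2}_\d{2}_\d{2})_(\d+)$"
        )


        class ScheduledSnap(NamedTuple):
            taken_at: datetime
            number: int  # the trailing "<some number>" from the notes


        def parse_scheduled_snap(name: str) -> Optional[ScheduledSnap]:
            # Returns None for names that do not follow the scheduled pattern.
            match = _SCHEDULED.match(name)
            if match is None:
                return None
            taken_at = datetime.strptime(match.group(1), "%Y-%m-%d-%H_%M_%S")
            return ScheduledSnap(taken_at, int(match.group(2)))


        if __name__ == "__main__":
            print(parse_scheduled_snap("_scheduled-2021-10-11-14_00_00_1099511627776"))
        ```

        Names that do not match, such as manually created snapshots, simply yield `None`, so the two kinds can be separated when listing a `.snap` directory.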
      • Disaster Recovery 5m
        Speaker: Arthur Outhenin-Chalandre (CERN)

        At the latest CDM, Prometheus metrics for rbd-mirror were discussed:

        • "We really need that for Quincy"
        • They also plan to do that for most of the daemons, to eventually replace the mgr prometheus module
          • It has some scalability issues because it is a single endpoint

        Still developing my rbd-mirror patch

        • Hope to finish sometime next week
      • EOS CephFS Test 5m
        Speaker: Roberto Valverde Cameselle (CERN)

        CephFS Testing:

        - Still no time to add the cephfs mounts in canary -> this week (hopefully)
        - Installed a canary version with the fix for async replica creation [not enabled yet, will do this week]. This should improve 2-replica layout performance. EOS-4930

        - Also a new fix that should improve performance by introducing a memory cache for leveldb, which should reduce the time needed to get inconsistency reports. CERNBOX-2241

        Monitoring:

        - Herve bumped the prometheus module in qa; it looks ok on the EOS side, but the ceph prometheus stuff is all in production. Maybe a test environment should be created for the monitoring? That way Aswin would also have a place to test upgrades.

      • Monitoring NG 5m
        Speaker: Aswin Toni (CERN)

        - https://github.com/ceph/ceph/pull/43384 merged to ceph and being backported yay

        - The Prometheus/alertmanager/thanos upgrade does not appear to have obvious breaking changes; still need to test with puppet and figure out deployment.

    • 3:45 PM 3:50 PM
      AOB 5m
      • Enrico: Absent next week (18th -- 22nd Oct.)