Minor fixes due to modulesync and C8-->Cs8 migration of test machines.
We should organize a cvmfs coffee // lunch now that Radu is here and Jakob is not gone yet.
Ceph Operations Reports
Teo (cta, kelly)
This week with Julien we will do a test migration with test data
If the export/import operation succeeds, we will plan an intervention to migrate production to the new cluster
Ceph/Kelly upgraded to 15.2.14-7 (CTA is 15.2.14)
carbon-relay-ng leaks memory:
should be less than 1GB, can reach up to 8GB
until it is fixed, I will set a 2GB MemoryMax in systemd
some issues with pprof profiling: it shows half of the memory used; will create a GitHub ticket for this
Helping the FTS team use Metrictank for their transition away from Elasticsearch
Consult on how to collect metrics, send them
How to create dashboards and use the graphite query language
Create variables, etc
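For reference, a minimal sketch of what "collect metrics, send them" and a Graphite-style query can look like. It assumes the metrics enter through a carbon plaintext listener (for example carbon-relay-ng, port 2003 by convention) and that dashboards query a Graphite-compatible /render endpoint. The metric names, host and port are hypothetical placeholders, not the actual FTS or Metrictank setup.

```python
import socket
import time
from urllib.parse import urlencode


def carbon_line(path, value, timestamp=None):
    """Format one sample in the carbon plaintext protocol: 'path value timestamp'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_lines(host, port, lines):
    """Send pre-formatted plaintext lines over TCP (host and port are assumptions)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


def render_query(target, since="-1h", fmt="json"):
    """Build the query part of a Graphite-compatible /render request."""
    return "/render?" + urlencode({"target": target, "from": since, "format": fmt})


# Hypothetical metric name, for illustration only.
line = carbon_line("fts.example.transfers.active", 42, 1633340000)
# sumSeries() is a standard Graphite function; the series pattern is made up.
query = render_query("sumSeries(fts.example.transfers.*)")
```

send_lines() is not called here, since it needs a reachable relay. The target expression passed to render_query() is the same kind of string that would go into a dashboard panel.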
Enrico (barn, beesly, gabe, meredith, nethub, vault)
Upgraded to Octopus last Monday, and doing ok since then.
MGRs running only on two C8 VMs (cephbeesly-mon-*) -- Required for upgrade to Octopus
Upgrade to Octopus (OTG0066572) went much faster than expected
Several slow requests, otherwise uneventful
Will start enrolling new hw and drain RA machines (OTG0066404)
Bytes keep on flying from old hw to new hw in HA racks
Moving monitor from old (almost-empty) machine to new hw for decommissioning
Same for MGRs, but C8 boxes in one rack only for now (acceptable)
Gabe, Meredith, Vault: NTR
Confirmed with CF, RA racks are the last ones from ST to be decommissioned
For delivery 21Q4, should we consider PCC PoC for location?
Vault // Meredith should also be upgraded to Octopus
Nethub should go to Octopus to unblock mirroring
3rd RBD cluster:
Can I take the ceph/cephadm boxes and make the 3rd RBD region out of it?
Suggestions for names are welcome (we are at Pam)
Dan (dwight, flax, kopano, jim, upstream)
Dan van der Ster
dwight 2/3rds replaced with new hw.
flax: one OCIS client is particularly noisy with caps recall issues. Has been quiet over the weekend -- not clear why it resolved itself.
Arthur (levinson, pam)
Levinson upgraded to octopus last Tuesday
Jose Castro Leon
Jose is sending per-type accounting info
Jose to prepare a detailed type-rationalisation plan for review and OTG.
all HVs updated to octopus client
R&D Projects Reports
Issues with trying to implement file versions
There are no tools to inspect if the code works, apart from the UI
There is a routing issue where the request cannot find the storage provider, but other operations succeed
I will work with Ishank tomorrow morning to fix all those issues that keep me from developing the module further
The smallest interval is one hour; additional implementation is required to snapshot every 15 min
We would also need to assess whether it is feasible to take snapshots every 15 min for all users operating on the cluster.
Can we create snapshots every 15 min on flax for every manila share for example?
The structure of scheduled snapshots is different from that of manual snapshots
In subdirs, the name changes from "<inode of snapshot dir>_<snapshot name>" to "_scheduled-<datetime>_<some number>"
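A rough sketch of how tooling could recognise the scheduled form in subdirs. It assumes entries look like "_<snapshot name>_<number>" with scheduled names starting with "scheduled-", as noted above. The timestamp layout in TS_FORMAT is an assumption and should be checked against the actual snapshot names on the cluster.

```python
from datetime import datetime

SCHEDULED_PREFIX = "scheduled-"
# Assumption: timestamp layout inside scheduled snapshot names.
TS_FORMAT = "%Y-%m-%d-%H_%M_%S"


def parse_subdir_snapshot(entry):
    """Split '_<snapshot name>_<number>' into (name, number, timestamp or None)."""
    if not entry.startswith("_"):
        raise ValueError(f"not a subdir snapshot entry: {entry!r}")
    # The timestamp itself contains underscores, so split on the last one only.
    name, sep, number = entry[1:].rpartition("_")
    if not sep or not name or not number.isdigit():
        raise ValueError(f"unexpected snapshot entry: {entry!r}")
    taken_at = None
    if name.startswith(SCHEDULED_PREFIX):
        # Raises ValueError if the assumed TS_FORMAT does not match.
        taken_at = datetime.strptime(name[len(SCHEDULED_PREFIX):], TS_FORMAT)
    return name, int(number), taken_at


# Made-up entries, for illustration only.
scheduled = parse_subdir_snapshot("_scheduled-2021-10-04-12_00_00_1099511627776")
manual = parse_subdir_snapshot("_mysnap_123")
```

A manual snapshot comes back with None as its timestamp, so the third field is enough to tell the two kinds apart.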
On the latest CDM they talked about prom metrics of rbd-mirror:
"We really need that for Quincy"
They also plan to do that for most of the daemons, to eventually replace the mgr prom module
The mgr module has scalability issues because it is a single endpoint
Still developing my rbd mirror patch
Hope to have it finished sometime next week
EOS CephFS Test
Roberto Valverde Cameselle
- Still no time to add those cephfs mounts in canary -> This week (hopefully)
- Installed the canary version with the fix for the async replica creation [not enabled yet, will do this week]. This should improve 2-replica layout performance. EOS-4930
- Also new fix that should improve performance, introducing a memory cache for leveldb, which should reduce the time needed to get inconsistency reports. CERNBOX-2241
- Herve bumped prometheus module in qa, looks ok on EOS side, but ceph prometheus stuff is all in production. Maybe a test environment should be created for the monitoring? This way Aswin should also have a test place to test upgrades.
- https://github.com/ceph/ceph/pull/43384 merged to ceph and being backported yay
- Prometheus/alertmanager/thanos upgrade does not appear to have obvious breaking changes, still need to test with puppet and figure out deployment.