Ceph/CVMFS/Filer Service Meeting

600/R-001 (CERN)




Zoom: Ceph Zoom

    • 14:00 - 14:20
      Ceph: Operations Reports 20m
      • Teo (cta, erin, kelly, levinson) 5m
        Speaker: Theofilos Mouratidis (CERN)
      • Enrico (barn, beesly, gabe, meredith, nethub, vault) 5m
        Speaker: Enrico Bocchi (CERN)

        Meredith, Vault: NTR


        • Upgraded to 14.2.20 this morning
        • Warnings on insecure_global_id_reclaim (CEPH-1129)


        • Problem with osd recreation after upgrade to 14.2.19
          • `ceph-volume lvm batch` sees the device for block-db (typically ssd) as full
          • Sadly, this works only with physical devices and not with LVs/VGs
          • `--db-devices` and `--block-db-size` do not help


        • Draining of old machines almost done (2 racks, 6 boxes to go)
        • 1 out of 4 big boxes also drained
        • Once draining is finished, we will have non-contiguous OSD ids. Could this be a problem for the OSD map size?
          New: osd.0 --> osd.191
          Old: osd.912  --> osd.959, osd.960  --> osd.1007, osd.1104 --> osd.1151, osd.1200 --> osd.1247
        • TLS certificate not compliant with IGTF:
          • User complaining (INC2768117) doing TPC from S3 to dCache
          • Dan's old ticket on certificate procurement: RQF1193130
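
To put numbers on the OSD map size question above, here is a minimal sketch that counts how many ids stay in use and how many id slots would sit empty. It assumes the OSD map is sized by the highest id plus one (an assumption to verify against the cluster's `max_osd`). The id ranges are the ones listed above; the function name is hypothetical.

```python
# Inclusive OSD id ranges that remain after draining, as listed in the minutes.
REMAINING_RANGES = [
    (0, 191),      # new machines
    (912, 959),    # old machines
    (960, 1007),
    (1104, 1151),
    (1200, 1247),
]


def summarize_osd_ids(ranges):
    """Return (ids_in_use, id_slots, unused_slots) for inclusive id ranges.

    Assumption: the map needs one slot per id from 0 up to the highest id.
    """
    ids_in_use = sum(high - low + 1 for low, high in ranges)
    id_slots = max(high for _, high in ranges) + 1
    return ids_in_use, id_slots, id_slots - ids_in_use


if __name__ == "__main__":
    in_use, slots, unused = summarize_osd_ids(REMAINING_RANGES)
    print(f"ids in use: {in_use}, id slots: {slots}, unused slots: {unused}")
```

Under that assumption, roughly two thirds of the id slots would be unused. Whether this matters in practice is the open question raised above.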


        • Traefik working nicely since last week -- switch to be planned for Gabe
        • Still draining one machine for HW repair, painful process
        • Schedule upgrade to 14.2.20 (osd restart will enforce numa interleave)

        Gabe + Nethub:

        • Review S3 access log ingestion into Kafka + Elasticsearch (CEPH-1132)
          • Immediate action: migrate from monit-kafkax to monit-kafkay (ETA: this week)
          • Replace kafka with DB-provided kafka streaming or send properly-formatted logs to `monit-logs` (ETA: summer)
      • Dan (dwight, flax, kopano, jim) 5m
        Speaker: Dan van der Ster (CERN)
      • Arthur 5m
        Speaker: Arthur Outhenin-Chalandre (CERN)

        Created the pam cluster for CephFS for HPC users ~ 2 PB raw

        Tested subvolume v1/v2 and manila interaction on Octopus

        • Legacy subvolumes created by manila are manageable with the `ceph fs` CLI and are treated like v1 subvolumes
          • The Wallaby version of manila should be able to handle the legacy volumes
    • 14:20 - 14:30
      Ceph: Operations Tools (ceph-scripts, puppet, monitoring, etc...) 10m
      • Draft SSB to inform users about CVE: 
        • We ask users to upgrade at their earliest convenience.
        • Additionally, we can set a deadline for RBD users who don't upgrade; after that date we can coordinate a forced reboot campaign.
        • For CephFS users, a forced campaign is not feasible. We'll need to chase users.
        • Beyond the SSB, we could email impacted users directly:
          • `ceph daemon mon.$(hostname -s) session ls` -- check for machines with old global_id renewal.
          • ai-dump to find the machine's owners, group hosts by email
          • send emails
      • CEPH-1020: upmap-remapped fix for cycles in upmap-items entries.
      • CEPH-1051: Teo made a dashboard for drives with high IO latency: https://monit-grafana.cern.ch/d/WikHriBGz/ceph-osd-perf?orgId=49
      • CEPH-1127: puppet nftables rules are now applied before installing ceph, so OSDs cannot start before the rules are installed.
      • CEPH-1133: tuning cephfs warnings about clients not responding to caps recall.
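
The direct-email idea for the CVE SSB above (list sessions, look up machine owners, group hosts by email, send emails) could be sketched as follows. This only covers the grouping and drafting steps. It assumes the flagged hostnames have already been extracted from the `session ls` output and that an owner lookup has been built from ai-dump; both inputs, the function names, and the message wording are hypothetical.

```python
from collections import defaultdict


def group_hosts_by_owner(flagged_hosts, owner_of):
    """Group flagged hosts by owner email.

    flagged_hosts: iterable of hostnames still using old global_id renewal
                   (assumed to be extracted beforehand from `session ls`).
    owner_of:      dict hostname -> owner email (assumed to come from ai-dump).

    Returns (grouped, unknown): grouped maps each email to a sorted list of
    its hosts; unknown is a sorted list of hosts without a known owner.
    """
    grouped = defaultdict(set)
    unknown = set()
    for host in flagged_hosts:
        email = owner_of.get(host)
        if email is None:
            unknown.add(host)
        else:
            grouped[email].add(host)
    return {email: sorted(hosts) for email, hosts in grouped.items()}, sorted(unknown)


def draft_message(hosts):
    """Build one message body listing all hosts of a single owner."""
    lines = ["The following machines need a Ceph client upgrade:"]
    lines += [f"  - {host}" for host in hosts]
    lines.append("Please upgrade at your earliest convenience.")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder data for illustration only.
    owners = {"host-a": "alice@example.org", "host-b": "alice@example.org"}
    grouped, unknown = group_hosts_by_owner(["host-a", "host-b", "host-c"], owners)
    for email, hosts in grouped.items():
        print(f"To: {email}\n{draft_message(hosts)}\n")
    print("No owner found for:", unknown)
```

Grouping first means each owner receives a single email covering all of their machines. Hosts without a known owner are returned separately so they can be chased by hand.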
    • 14:30 - 14:40
      Ceph: R&D Projects Reports 10m
      • Reva/CephFS 5m
        Speaker: Theofilos Mouratidis (CERN)
      • Disaster Recovery 5m
        Speaker: Arthur Outhenin-Chalandre (CERN)
        • Sent a patch upstream to fix a segfault on snapshot mirroring https://github.com/ceph/ceph/pull/40937
          • Will likely need to add this patch to my cluster to continue my testing
        • Writeback cache didn't help on 4M io with RBD journaling
          • testing to push other settings further
    • 14:40 - 14:50
      Ceph: Upstream News 10m
    • 14:50 - 15:05
      CVMFS 15m
      Speakers: Enrico Bocchi (CERN), Fabrizio Furano (CERN)

      Fabrizio: all the CVMFS repositories that do not rely on gateways have been upgraded to the new version, 2.8.1


      • Better to upgrade the gateways (and related repos) only after double-checking with the devs in the Friday meeting
      • ca-proxy high miss rate for Atlas frontier traffic to be investigated
      • Review logging for squid caches RQF1785541



    • 15:05 - 15:10
      AOB 5m


      • To add a new volume on mic-nfs05.cern.ch for MICprojects
      • DBOD instance 'fdohridb' to expire in 30 days (?!)


      Telegram notification for S3/CVMFS issues:

      • There have been problems lately with acron: OTG0063586
      • Scripts doing basic checks and sending notifications to telegram were impacted


      Enrico is absent on Wednesday afternoon and in training on Friday


      Bloomberg Virtual Visit: https://indico.cern.ch/event/1032902/