Ceph/CVMFS/Filer Service Meeting

Europe/Zurich
600/R-001 (CERN)

Videoconference room: Ceph_CVMFS_Filer_Service_Meeting (extension 10896342, owner: Dan van der Ster)
    • 14:00-14:15
      CVMFS 15m
      Speakers: Enrico Bocchi (CERN), Fabrizio Furano (CERN)

      Enrico:

      CVMFS @ Group meeting on Tue, 8 September
       
    • 14:15-14:30
      Ceph: Operations 15m
      • Incidents, Requests, Capacity Planning 5m
        Speaker: Dan van der Ster (CERN)

         

        Issues and alarms over the weekend:

        • Beesly: inconsistent PG (INC2534066, osd.114); standby manager daemon p05517715y58557 started
        • Dwight: unresponsive client (k8s, htc-operator-test-a3cbb6-uzemjggmz22z-node-0.cern.ch) evicted
        • Erin: slow ops on osd.261
        • Flax: 1 MDSs report slow requests; client.127643749 isn't responding to mclientcaps(revoke); unresponsive client sonarcluster-dev-t2d4suuwcgnz-minion-0.cern.ch evicted
        • Gabe: Slow requests on osd.219
        • Vault: missing primary copy of 1:8ca621b8:::rbd_data.e6fd06c06f344.0000000000111de7:head
         
      • Cluster Upgrades, Migrations 5m
        Speaker: Theofilos Mouratidis (CERN)

        Enrico:

        • All Ceph clusters updated to 14.2.11 except flax, gabe, and kelly (CEPH-558)

         

        Giuliano:

        • S3 test cluster upgraded from Luminous to v14.2.11
          • Provided the Swift issue isn't blocking on the OpenStack side, we can proceed with the gabe update
        • Nethub upgrade went fine
         
      • Hardware Repairs 5m
        Speaker: Julien Collet (CERN)

        Julien:

        • Repair scripts updated; we can resume normal operations
        • Misconfigured OSDs on beesly:
          • Almost cleared, 3 OSDs remaining
      • Puppet and Tools 5m

        Dan:

        • The c8 changes for module-ceph and hostgroup-ceph were completed and merged into qa. I'll merge them to master tomorrow morning.
      • JIRA 5m
    • 14:30-14:45
      Ceph: Projects, News, Other 15m
      • Kopano/Dovecot 5m
        Speaker: Dan van der Ster (CERN)
      • REVA/CephFS 5m
        Speaker: Theofilos Mouratidis (CERN)
    • 14:45-14:55
      S3 10m
      Speakers: Julien Collet (CERN), Roberto Valverde Cameselle (CERN)

      Dan:

      • Slow requests on gabe seem to have increased. My theory is that they are caused by large omap objects (bucket indices) belonging to a few large buckets. We can and should reshard those buckets, then compact the RocksDBs (CEPH-950).
        • So far I have tested the reshard on a few large cvmfs buckets: reads are not affected, but writes block for the duration of the reshard (~10 minutes).

      Example:

      2020-08-31 10:41:10.868622 7fc4d28b5dc0  1 execute INFO: reshard of bucket "cvmfs-sft-test0" from "cvmfs-sft-test0:61c59385-085d-4caa-9070-63a3868dccb6.205147250.6" to "cvmfs-sft-test0:61c59385-085d-4caa-9070-63a3868dccb6.271824312.1" completed successfully

      real    10m13.655s
      user    2m59.270s
      sys     0m37.773s

       # rados ls -p default.rgw.buckets.index | grep 61c59385-085d-4caa-9070-63a3868dccb6.271824312.1
      .dir.61c59385-085d-4caa-9070-63a3868dccb6.271824312.1.213
      .dir.61c59385-085d-4caa-9070-63a3868dccb6.271824312.1.192
      .dir.61c59385-085d-4caa-9070-63a3868dccb6.271824312.1.151
       # rados ls -p default.rgw.buckets.index | grep 61c59385-085d-4caa-9070-63a3868dccb6.205147250.6
      .dir.61c59385-085d-4caa-9070-63a3868dccb6.205147250.6.4
      .dir.61c59385-085d-4caa-9070-63a3868dccb6.205147250.6.3
      .dir.61c59385-085d-4caa-9070-63a3868dccb6.205147250.6.26
      .dir.61c59385-085d-4caa-9070-63a3868dccb6.205147250.6.25
      .dir.61c59385-085d-4caa-9070-63a3868dccb6.205147250.6.30

      In some cases, like the one above, the old bucket index objects still need to be removed; the procedure is TBD.

      [11:04][root@cephgabe0 (production:ceph/gabe/mon*2:leader) ~]# radosgw-admin reshard stale-instances list
      []
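      The grep above is a substring match, so an ID like "...205147250.6" would also catch objects of a hypothetical "...205147250.61" instance. A minimal Python sketch of a stricter lookup, assuming the index object naming visible in the listing (".dir.<bucket instance id>.<shard>", or ".dir.<bucket instance id>" for an unsharded index): it requires the full instance ID followed by a purely numeric shard suffix, and only prints "rados rm" commands as a dry run. The function names are hypothetical, and this is not the cleanup procedure (still TBD); the old instance should be confirmed as unreferenced before anything is removed.

```python
from typing import Iterable, List, Tuple


def index_objects_for_instance(object_names: Iterable[str], instance_id: str) -> List[str]:
    """Return the bucket index objects belonging to one bucket instance,
    ordered by shard number (an unsharded '.dir.<instance_id>' sorts first)."""
    prefix = f".dir.{instance_id}"
    matches: List[Tuple[int, str]] = []
    for raw in object_names:
        name = raw.strip()
        if name == prefix:
            matches.append((-1, name))  # unsharded index object
        elif name.startswith(prefix + "."):
            suffix = name[len(prefix) + 1:]
            # Require a purely numeric shard suffix, so that searching for
            # instance 'X.205147250' does not pick up 'X.205147250.6.4',
            # which is shard 4 of instance 'X.205147250.6'.
            if suffix.isdigit():
                matches.append((int(suffix), name))
    return [name for _, name in sorted(matches)]


def print_cleanup_dry_run(object_names: Iterable[str], old_instance_id: str,
                          pool: str = "default.rgw.buckets.index") -> List[str]:
    """Print (but do not run) one 'rados rm' command per leftover index object."""
    leftovers = index_objects_for_instance(object_names, old_instance_id)
    for name in leftovers:
        print(f"rados -p {pool} rm {name}")
    return leftovers
```

      For example, save the output of "rados ls -p default.rgw.buckets.index" to a file and pass its lines together with the old instance ID taken from the reshard log line.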
       

      In any case, we still need to reshard the following buckets:

      • cvmfs-na62
      • gitlabartifacts
      • gitlabregistry
      • cvmfs-lhcb
      • cvmfs-cms-ci
      • cvmfs-atlas
      • cvmfs-sft
      • cvmfs-lhcbdev
       
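      To produce and size such a list, a Python sketch that reads the JSON from "radosgw-admin bucket limit check" and returns every bucket whose fill_status is not "OK", with a suggested shard count. Assumptions: the report is a list of per-user entries, each holding a "buckets" array with "bucket", "num_objects", "num_shards" and "fill_status" fields; rgw_max_objs_per_shard is at its default of 100000; and the 2x headroom factor is a hypothetical choice, not a Ceph rule.

```python
import json
import math
from typing import List, Tuple

# Assumption: rgw_max_objs_per_shard is at its default of 100000.
MAX_OBJS_PER_SHARD = 100_000
# Hypothetical headroom factor: aim for shards roughly half full after resharding.
HEADROOM = 2


def buckets_to_reshard(limit_check_json: str,
                       max_objs_per_shard: int = MAX_OBJS_PER_SHARD,
                       headroom: int = HEADROOM) -> List[Tuple[str, int, int, int]]:
    """Return (bucket, num_objects, num_shards, suggested_num_shards) for every
    bucket whose fill_status is anything other than "OK"."""
    report = json.loads(limit_check_json)
    result = []
    for user in report:  # one entry per user, each with a "buckets" list
        for bucket in user.get("buckets", []):
            if bucket.get("fill_status", "OK") == "OK":
                continue
            suggested = max(1, math.ceil(bucket["num_objects"] * headroom / max_objs_per_shard))
            result.append((bucket["bucket"], bucket["num_objects"],
                           bucket["num_shards"], suggested))
    return result
```

      Each suggestion could then be applied with "radosgw-admin bucket reshard --bucket=<name> --num-shards=<N>", keeping in mind that writes to the bucket block while the reshard runs (about 10 minutes for the cvmfs-sft-test0 test above).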

      Julien:

      • S3 account cleaning campaign
        • Lots of personal accounts previously flagged as "illegitimate" have been disabled
        • Most of them have 0 bytes used
        • They will be deleted if there are no complaints within a couple of weeks
          • To check with Jose whether we should email the owners of accounts that still hold data
    • 14:55-15:05
      Filer/CephFS 10m
      Speakers: Dan van der Ster (CERN), Theofilos Mouratidis (CERN)

      Enrico:

      • "Meyrin CephFS Vault SSD A" enabled for "IT Ceph Storage Service" (1TB, 10 shares)
      • OTG0058715: Filer for gitlab-dev-storage down
      • INC2534546: Predicted low space for MICscratch (mic-nfs12.cern.ch)
        • Using 3.71 TB, 1.01 TB free; the volume is 5 TB
      • CEPH-946: Ganesha not releasing caps on the jim cluster
      • CEPH-573: Benchmarking CephFS; still missing: Levinson, EOS, vanilla NFS
        • Can I take over itnfs08c (the old filer for the OpenShift registry)?

      Dan:

      • CEPH-948: documented how to monitor cephfs_toofull, and implemented the example to monitor /mnt/projectspace on cephadm.cern.ch.
      • CEPH-947: changed the MDS cache_full warning threshold on ceph/kelly.
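      The CEPH-948 documentation itself is not reproduced here. As a rough Python sketch, assuming "cephfs_toofull" amounts to a usage threshold on a mounted path such as /mnt/projectspace, a probe could look like this; the 90% threshold, the function name and the output format are hypothetical.

```python
import shutil

# Hypothetical threshold: report the mount as "too full" at 90% usage.
TOOFULL_THRESHOLD = 0.90


def check_mount(path: str, threshold: float = TOOFULL_THRESHOLD) -> str:
    """Return a one-line status for the filesystem mounted at 'path'."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        return f"UNKNOWN: {path} reports zero total size"
    # Use used/total: used + free can be smaller than total when the
    # filesystem reserves space, so free/total would understate usage.
    ratio = usage.used / usage.total
    state = "TOOFULL" if ratio >= threshold else "OK"
    return f"{state}: {path} {ratio:.1%} used ({usage.free / 1e12:.2f} TB free)"
```

      For example, check_mount("/mnt/projectspace") could be run periodically on cephadm.cern.ch and its first word fed to the alarm system.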
    • 15:05-15:10
      AOB 5m