Ceph/CVMFS/Filer Service Meeting

600/R-001 (CERN)




Zoom: Ceph Zoom

    • 2:00 PM → 2:20 PM
      Ceph: Operations Reports 20m
      • Teo (cta, erin, kelly, levinson) 5m
        Speaker: Theofilos Mouratidis (CERN)
        • cta migration to kelly
          • organising a meeting this Thursday at 14:30
        • tickets regarding the broken EPEL repo
          • only QA C8 machines affected
        • tcpdump + nc to forward carbon queries
          • may need some help with that
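          The tcpdump + nc trick above boils down to piping a listening socket into an outbound connection. A minimal sketch of the same idea, assuming plaintext carbon traffic on TCP (`start_relay` is a hypothetical helper name; hosts and ports are illustrative):

          ```python
          import socket
          import threading

          def start_relay(target_host, target_port):
              """Listen on an ephemeral local port and pipe any bytes received
              to (target_host, target_port), like `nc -l <port> | nc target 2003`.
              Handles a single connection, then exits."""
              srv = socket.socket()
              srv.bind(("127.0.0.1", 0))  # 0 = let the OS pick a free port
              srv.listen(1)
              port = srv.getsockname()[1]

              def pump():
                  conn, _ = srv.accept()
                  with socket.create_connection((target_host, target_port)) as dst:
                      while True:
                          data = conn.recv(4096)
                          if not data:  # client closed the connection
                              break
                          dst.sendall(data)
                  conn.close()
                  srv.close()

              threading.Thread(target=pump, daemon=True).start()
              return port  # point the carbon client here
          ```

          Clients send to the returned local port and the bytes arrive at the real carbon host; unlike the nc one-liner, this keeps the forwarder restartable from a script.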
      • Enrico (barn, beesly, gabe, meredith, nethub, vault) 5m
        Speaker: Enrico Bocchi (CERN)


        • Fixed monitoring and documented
        • MONIT team asked for it for their new Kafka deployment (20 TB); they need to enroll in OpenStack


        • Stats mismatch due to new OSDs being up but holding 0 PGs (CEPH-1010)
        • Still waiting for the new software release that fixes PG deletion slowness



        • Stats mismatch, unclear why:
              POOL                           ID     STORED      OBJECTS     USED        %USED     MAX AVAIL 
              default.rgw.control             2         0 B           8         0 B         0       2.0 TiB 
              default.rgw.log                 4      59 MiB         249      59 MiB         0       2.0 TiB 
              default.rgw.buckets.index       7     175 GiB      43.41k     175 GiB      2.74       2.0 TiB 
        • Planning for new capacity in B773
          • We will continue to use 10 racks
          • RA[06-09], 09 new; RB[01-07], 1 to be freed
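        For a replicated pool, `ceph df` should normally report USED ≈ STORED × replica count, which is what makes the listing above look odd (USED equals STORED). A rough consistency check, assuming 3× replication (`suspicious_pools` is a hypothetical helper; the byte values below are illustrative, not the real pool stats):

        ```python
        def suspicious_pools(stats, replicas=3, tolerance=0.1):
            """stats: list of (pool_name, stored_bytes, used_bytes) tuples,
            as read from `ceph df`. Return names of pools whose USED deviates
            from STORED * replicas by more than `tolerance` (fractional)."""
            flagged = []
            for pool, stored, used in stats:
                if stored == 0:  # empty pools carry no signal
                    continue
                expected = stored * replicas
                if abs(used - expected) / expected > tolerance:
                    flagged.append(pool)
            return flagged
        ```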


        • Fixed monitoring and documented

        Vault: NTR

      • Dan (dwight, flax, kopano, jim) 5m
        Speaker: Dan van der Ster (CERN)
        • dwight: while replacing the hardware there was a CRUSH and peering issue (sometimes PGs are not marked degraded even though only 2 OSDs are acting). The CRUSH issue is fixable with a better config (proposing a PR), but the peering issue looks like a bug. Tracking it here: https://tracker.ceph.com/issues/49104
        • flax: some HPC nodes did not reconnect correctly after their network outage. There is a similar old thread on the ML; I replied to it with our logs.

        Kafka would like 20TB on each of:

        • hyperc (kelly)
        • cpio2 (barn)
        • io2 (meredith)

        Are we ok with that?

        https://its.cern.ch/jira/browse/FILER-140: I will start removing one volume this week.

    • 2:20 PM → 2:30 PM
      Ceph: Operations Tools (ceph-scripts, puppet, monitoring, etc.) 10m
      • Improved monitoring of S3
      • CEPH-1069/1070: hg_ceph issues when new clusters are only partially added
    • 2:30 PM → 2:40 PM
      Ceph: R&D Projects Reports 10m
    • 2:40 PM → 2:50 PM
      Ceph: Upstream News 10m

      I made a quick tool to see the size of the releases:

      From v14.2.1 back to v14.2.0: 214 commits
      From v14.2.2 back to v14.2.1: 424 commits
      From v14.2.3 back to v14.2.2: 315 commits
      From v14.2.4 back to v14.2.3: 7 commits
      From v14.2.5 back to v14.2.4: 671 commits
      From v14.2.6 back to v14.2.5: 3 commits
      From v14.2.7 back to v14.2.6: 3 commits
      From v14.2.8 back to v14.2.7: 524 commits
      From v14.2.9 back to v14.2.8: 7 commits
      From v14.2.10 back to v14.2.9: 611 commits
      From v14.2.11 back to v14.2.10: 186 commits
      From v14.2.12 back to v14.2.11: 225 commits
      From v14.2.13 back to v14.2.12: 46 commits
      From v14.2.14 back to v14.2.13: 35 commits
      From v14.2.15 back to v14.2.14: 8 commits
      From v14.2.16 back to v14.2.15: 6 commits
      From nautilus back to v14.2.16: 163 commits


      From v15.2.1 back to v15.2.0: 103 commits
      From v15.2.2 back to v15.2.1: 252 commits
      From v15.2.3 back to v15.2.2: 2 commits
      From v15.2.4 back to v15.2.3: 447 commits
      From v15.2.5 back to v15.2.4: 653 commits
      From v15.2.6 back to v15.2.5: 4 commits
      From v15.2.7 back to v15.2.6: 3 commits
      From v15.2.8 back to v15.2.7: 505 commits
      From octopus back to v15.2.8: 217 commits
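      The listing above can be reproduced with `git rev-list --count` between consecutive tags in a Ceph clone. A sketch (the helper names are mine, not the actual tool):

      ```python
      import subprocess

      def count_commits(old, new, repo="."):
          """Number of commits reachable from `new` but not from `old`."""
          out = subprocess.run(
              ["git", "-C", repo, "rev-list", "--count", f"{old}..{new}"],
              capture_output=True, text=True, check=True,
          )
          return int(out.stdout.strip())

      def release_sizes(tags, count=count_commits):
          """Yield one line per consecutive tag pair, in the format above."""
          for old, new in zip(tags, tags[1:]):
              yield f"From {new} back to {old}: {count(old, new)} commits"
      ```

      Passing a custom `count` callable keeps the formatting testable without a git checkout.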


    • 2:50 PM → 3:05 PM
      CVMFS 15m
      Speakers: Enrico Bocchi (CERN) , Fabrizio Furano (CERN)
      • All front-lcg0{1..4} machines recreated and doing very well
        • Request/Traffic hit ratio +10% wrt the old caches
        • Disk usage at 78% (~490 GB / 640 GB total, max cache size 550 GB)
        • Dashboard
      • Bug in frontier-squid counters observed in ca-proxy
        • `squidclient -h localhost cache_object://localhost/counters`
        • client_http.requests may overflow
        • Dashboard
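      If client_http.requests wraps as a fixed-width counter, monitoring can still reconstruct a monotonic series by treating any decrease between successive samples as one wrap. A sketch, assuming a 32-bit unsigned counter (the width is an assumption about squid's internals):

      ```python
      WRAP = 2 ** 32  # assumed width of the squid counter

      def unwrap(samples, wrap=WRAP):
          """Turn raw counter samples (which may wrap to a small value)
          into a cumulative monotonic series starting at 0. Assumes at
          most one wrap between consecutive samples."""
          total, prev = 0, None
          out = []
          for s in samples:
              if prev is not None:
                  delta = s - prev
                  if delta < 0:        # counter wrapped since last sample
                      delta += wrap
                  total += delta
              out.append(total)
              prev = s
          return out
      ```

      This only holds if the polling interval is short enough that the counter cannot wrap twice between samples.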
    • 3:05 PM → 3:10 PM
      AOB 5m