Ceph/CVMFS/Filer Service Meeting

Europe/Zurich
600/R-001 (CERN)

Description

Zoom: Ceph Zoom

    • 2:00 PM 2:20 PM
      Ceph: Operations Reports 20m
      • Teo (cta, erin, kelly, levinson) 5m
        Speaker: Theofilos Mouratidis (CERN)
        • Plan to migrate from cephcta to cephkelly:
          • once done, dismantle the cephcta cluster
        • Split metrictank and scylladb
          • Openstack side: done
          • Puppet side: in progress
      • Enrico (barn, beesly, gabe, meredith, nethub, vault) 5m
        Speaker: Enrico Bocchi (CERN)

        Beesly:

        • New capacity installed and configured -- 12 servers, 48x12TB each
        • Waiting for a new software release that fixes the PG deletion issue before rebalancing / decommissioning old HW
        • Leading MGR was not serving Prometheus data. Restarted.

        Barn==Beesly':

        • Configured and ready -- 4 servers, 48x12TB each, 2.1 PB raw, 3 replicas
        • To be benchmarked and enrolled in OpenStack
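
        As a quick cross-check of the capacity figures above, here is a minimal sketch of the arithmetic. It assumes the Barn drives are 12 TB each (as stated for Beesly) and that 3 replicas means usable space is one third of raw. The function and variable names are illustrative only.

```python
# Capacity arithmetic for the new hardware (illustrative names, not from any tool).
TB = 10**12   # decimal terabyte, as quoted by drive vendors
TIB = 2**40   # binary tebibyte


def raw_capacity_bytes(servers: int, drives_per_server: int, drive_tb: int) -> int:
    """Raw capacity in bytes for identical servers full of identical drives."""
    return servers * drives_per_server * drive_tb * TB


# Barn: 4 servers, 48 drives each, assumed 12 TB per drive, 3 replicas.
barn_raw = raw_capacity_bytes(4, 48, 12)
barn_usable = barn_raw // 3  # 3x replication keeps one usable copy out of three

# Beesly: newly installed capacity, 12 servers with 48 x 12 TB each.
beesly_new_raw = raw_capacity_bytes(12, 48, 12)

print(f"barn raw:       {barn_raw / TB:.0f} TB = {barn_raw / TIB:.0f} TiB")
print(f"barn usable:    {barn_usable / TB:.0f} TB")
print(f"beesly new raw: {beesly_new_raw / TB:.0f} TB")
```

        Under these assumptions Barn comes to 2304 TB, or roughly 2095 TiB, which would match the quoted 2.1 PB if that figure was read off in binary units.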

        Gabe:

        • One of the Traefik frontends unhappy on Sunday morning
        • One power user on CVMFS complained + many emails due to our scripts failing
        • Thanks Dan!

        Nethub:

        • Network intervention last Monday (router replacement + redundancy ToR switch to router) went fine 
      • Dan (dwight, flax, kopano, jim) 5m
        Speaker: Dan van der Ster (CERN)
        • dwight: added new host from ceph/spare (CEPH-1059) to replace failed old hardware.
          • All the new boxes are very large -- too much resource to allocate to dwight, which should be for testing only.
          • Propose to move 3/4 of the RA* boxes in flax to dwight, then add the 4 ceph/spare cephdata20b boxes to flax.
        • "projectx" EOS/CephFS test new pools (CEPH-1057):
          • cephdata_ec82 -> 8+2 EC with 2 parts on each host.
          • cephdata_ec162 -> 16+2 EC with 3 parts on each host
          • the cephfs was configured to use these pools this morning and tests are ongoing.
        • Beams ML users have completed tests of S3 -- they now want to go to production but require some TN component (either S3 entirely in TN, or just the gateways, or neither). Mail thread ongoing with Stefan L, and a Zoom call will be organized.
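
          The two layouts above can be compared with a short sketch. This is plain arithmetic on the k+m values and chunks-per-host figures from the notes, not a readout of the actual cluster or CRUSH rules. The class and field names are hypothetical, and the failure figure assumes every involved host holds the full per-host number of chunks of a placement group.

```python
# Compare the two erasure-coded pool layouts using only the numbers in the notes.
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ECLayout:
    k: int                # data chunks
    m: int                # coding chunks
    chunks_per_host: int  # chunks of one placement group placed on each host

    @property
    def total_chunks(self) -> int:
        return self.k + self.m

    @property
    def min_hosts(self) -> int:
        # Hosts needed to place all chunks of one placement group.
        return math.ceil(self.total_chunks / self.chunks_per_host)

    @property
    def raw_overhead(self) -> float:
        # Raw bytes stored per byte of user data.
        return self.total_chunks / self.k

    @property
    def host_failures_tolerated(self) -> int:
        # Each lost host removes chunks_per_host chunks; at most m may be lost.
        return self.m // self.chunks_per_host


layouts = {
    "cephdata_ec82": ECLayout(k=8, m=2, chunks_per_host=2),
    "cephdata_ec162": ECLayout(k=16, m=2, chunks_per_host=3),
}

for name, lay in layouts.items():
    print(
        f"{name}: {lay.total_chunks} chunks, >= {lay.min_hosts} hosts, "
        f"overhead {lay.raw_overhead:.3f}x, "
        f"whole-host failures tolerated: {lay.host_failures_tolerated}"
    )
```

          Under these assumptions the 16+2 layout has the lower raw overhead, but a single whole-host outage would remove three chunks, more than its two coding chunks can cover.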
        • All OpenShift hosts have been rebooted so the CodiMD 0-byte file issue should be solved definitively.
    • 2:20 PM 2:30 PM
      Ceph: Operations Tools (ceph-scripts, puppet, monitoring, etc...) 10m

      DAN:

      • filer-carbon has had a few backlog glitches -- could be a network issue, but the cause is still unclear.
      • CEPH-1052: the manila shares grafana probe was optimized
      • CEPH-1053: decreased the down out interval to 15 minutes (from 60) on all clusters. Mons need a restart for this to take effect.
      • CEPH-1038: yum repos are now clearer per cluster: most are using koji only (currently 14.2.11-2), with exceptions for new hw (14.2.16).

      Enrico:

    • 2:30 PM 2:40 PM
      Ceph: R&D Projects Reports 10m
    • 2:40 PM 2:50 PM
      Ceph: Upstream News 10m
      • NFS Ganesha 3.5 has been released with several FSAL_CEPH fixes. Worth testing again before the HPC/SWAN call?
        • Has been mirrored to cephmirror.cern.ch this morning.
      • Still waiting for new octopus/nautilus releases. Dan is testing the latest octopus again to see when we should plan for N->O upgrades to begin.
    • 2:50 PM 3:05 PM
      CVMFS 15m
      Speakers: Enrico Bocchi (CERN) , Fabrizio Furano (CERN)
      • Recreating front0N machines (2 out of 4)
        • Got s2.2xl flavor on LCG -- 640GB disk instead of 160GB
        • Cache is at 480 GB after 3 days
        • Efficiency is very high -- Grafana
      • Reviewed and updated DNS alias policies for stratum-zero-lbp and stratum-one-lbp
        • Follow up of INC2678250
        • Now using 'cmsfrontier' as metric and polling interval of 300s (was 'minimum' and 900)
      • Test repo for na61 (na61test.cern.ch) for integration with GitLab runners
      • Disk usage on backup machine is 78% (and 70% on the Stratum 1)
      • CVMFS workshop today+tomorrow afternoon -- https://indico.cern.ch/event/885212/
    • 3:05 PM 3:10 PM
      AOB 5m

      Enrico:

      • On holiday Feb 18, 19. Is this a problem?