Ceph/CVMFS/Filer Service Meeting

Europe/Zurich
600/R-001 (CERN)

    • 14:00–14:15
      CVMFS 15m
      Speaker: Enrico Bocchi (CERN)

    • 14:15–14:30
      Ceph: Operations 15m
      • Notable Incidents or Requests 5m
        • Additional large block storage request from IT-DB for Oracle recovery servers (~500TB eventually, but 200TB for now). Waiting on the new vault cluster before granting this.
        • Following this and the AFS request (+800TB), it was decided that:
          • New Ceph servers in the vault will form a new cluster, "cephvault", exposed in Cinder as new volume types "vault-100" and "vault-500" for the two QoS types. (The original plan was for this vault hardware to replace the beesly RA racks.)
          • We expect another 4 quads in July (8.5PB); these were originally meant to replace the S3/CASTOR/CephFS machines in the RJ and EC racks. Instead we will use the July delivery to replace the current beesly hardware (RA racks).
          • Bernd will aim to get us 8 more quads for Q4 delivery to replace S3, CASTOR, and CephFS.

        • The nethub S3 cluster has been suffering from slow ping times for quite some time. On Friday I ran some iperf3 tests and found that one of the racks is extremely slow.
          • CEPH-893 and INC2407400
          • I propose that we configure a regular iperf3 test between all Ceph OSD hosts: run an iperf3 server on port 8001, then from an hourly cron run an iperf3 client test and send an email to ceph-alerts if the measured bandwidth is below 500Mbps. A sketch follows this item.
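
          A minimal sketch of such a probe, assuming the hostnames, sender address, and local mail relay shown here (all placeholders), and that "iperf3 -s -p 8001" is already running on every OSD host:

```python
#!/usr/bin/env python3
# Hourly cron probe: measure bandwidth to each OSD peer and mail
# ceph-alerts when it drops below 500 Mbps. The peer list, sender,
# and relay below are placeholders, not real deployment values.
import json
import smtplib
import subprocess
from email.message import EmailMessage

PEERS = ["osd-host-01.cern.ch", "osd-host-02.cern.ch"]  # hypothetical names
PORT = 8001
THRESHOLD = 500e6  # bits per second

def measure(host):
    # -t 5: a short 5-second test; --json for machine-readable output.
    out = subprocess.run(
        ["iperf3", "-c", host, "-p", str(PORT), "-t", "5", "--json"],
        capture_output=True, text=True, check=True)
    return json.loads(out.stdout)["end"]["sum_received"]["bits_per_second"]

def alert(host, bps):
    msg = EmailMessage()
    msg["Subject"] = f"iperf3: {host} at {bps / 1e6:.0f} Mbps"
    msg["From"] = "iperf3-probe@cern.ch"  # placeholder sender
    msg["To"] = "ceph-alerts@cern.ch"     # ceph-alerts list; domain assumed
    msg.set_content(f"Measured {bps / 1e6:.1f} Mbps to {host}:{PORT}.")
    with smtplib.SMTP("localhost") as smtp:  # assumes a local mail relay
        smtp.send_message(msg)

if __name__ == "__main__":
    for peer in PEERS:
        try:
            bps = measure(peer)
        except subprocess.CalledProcessError:
            continue  # an unreachable peer is a separate alarm's job
        if bps < THRESHOLD:
            alert(peer, bps)
```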

        • This morning one ceph/gabe OSD host went down and did not come back after a reboot. Connected to the console it looked OK, but there was no network. I pinged Vincent Ducret on Mattermost:
          • There had been an intervention to fix CRC errors on port 12 of the switch; our server is on port 11.
          • He went to the switch, saw there were no activity LEDs, so he re-plugged the cable, and our server came back up.
          • He asked the 2nd-line CS operators to please exercise more caution when manipulating cables.
          • On the Ceph side, the cluster was degraded for a few hours; I set noout so that backfilling wouldn't start.
      • Repair Service Liaison 5m
        Speaker: Julien Collet (CERN)
      • Backend Cluster Maintenance 5m
        Speaker: Theofilos Mouratidis (CERN)

        Theo:

        • Added new capacity to ceph/beesly
        • Formatted its OSDs
        • Removed its OSDs
        • Removed the new capacity from ceph/beesly
        • Created the new "vault" cluster
        • Moved the new machines to the vault cluster
        • Picked 3 machines on different racks to also act as mons
          • SE04, SE06, SE07
        • OSD creation in progress
    • 14:30–14:45
      Ceph: Projects, News, Other 15m
      • Backup 5m
        Speaker: Roberto Valverde Cameselle (CERN)
      • HPC 5m
        Speaker: Dan van der Ster (CERN)
      • Kopano 5m
        Speaker: Dan van der Ster (CERN)
      • Upstream News 5m
        Speaker: Dan van der Ster (CERN)
    • 14:45–14:55
      S3 10m
      Speakers: Julien Collet (CERN), Roberto Valverde Cameselle (CERN)

      Giuliano:

      • cephgabe bare rgw:
        • hostgroup changes in ceph_dev (will be: ceph/gabe/radosgw/bare and ceph/gabe/radosgw/hashi)
        • rgw is up and running, but there is a keystone issue

      • es-ceph
        • Request for a new es-ceph instance for nethub is postponed for now
        • Logs pushed to es-ceph from S3 are quite big (9TB for 30 days of logs) and need to be shrunk
        • Pablo shared pointers for addressing the issue (in progress):
          • aggregation
          • filtering of useless indices (a sketch of the idea follows this list)
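
        A sketch of the index-filtering idea; the endpoint, the "s3-logs-YYYY.MM.DD" naming scheme, and the 30-day retention are assumptions for illustration, not the actual es-ceph configuration:

```python
#!/usr/bin/env python3
# Delete expired S3-log indices. Endpoint, index-name pattern, and
# retention below are illustrative placeholders.
from datetime import datetime, timedelta

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://es-ceph.cern.ch:9200"])  # placeholder endpoint
PREFIX = "s3-logs-"                                  # hypothetical daily-index prefix
cutoff = datetime.utcnow() - timedelta(days=30)

for name in es.indices.get(index=PREFIX + "*"):
    try:
        day = datetime.strptime(name[len(PREFIX):], "%Y.%m.%d")
    except ValueError:
        continue  # skip indices without a date suffix
    if day < cutoff:
        es.indices.delete(index=name)
```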


      • s3 accounting
        • publishing of the gathered data is working
        • the data format needs a bit of refining (some jq reshaping; an illustrative sketch follows this item)
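
        Purely illustrative sketch of that kind of reshaping, in Python rather than jq; the input layout and the field names are hypothetical, since the actual record format isn't in these notes:

```python
#!/usr/bin/env python3
# Reshape raw accounting records before publishing. The field names
# ("user", "total_bytes", "total_objects") are hypothetical.
import json
import sys

for line in sys.stdin:  # assumes one JSON record per line
    rec = json.loads(line)
    print(json.dumps({
        "user": rec.get("user"),
        "bytes_used": rec.get("total_bytes", 0),
        "objects": rec.get("total_objects", 0),
    }))
```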


      Enrico:

      • Prototype of traefik 2.2 with httpbin on the test cluster
        • To finalize and decide how to deploy in production
      • Merge Requests and Docs for S3 upgrade
      • Tickets on draining and recreating OSDs (with Giuliano)
      • S3 SSL certificate expiry alarm in Prometheus (a sketch of the underlying check follows)
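
      An illustrative sketch of the quantity such an alarm would track; the endpoint is a placeholder, and the production alarm would be a Prometheus rule rather than this script:

```python
#!/usr/bin/env python3
# Days until a TLS certificate expires: the quantity the Prometheus
# alarm would track. The endpoint below is a placeholder.
import socket
import ssl
from datetime import datetime

HOST, PORT = "s3.cern.ch", 443  # placeholder endpoint

ctx = ssl.create_default_context()
with socket.create_connection((HOST, PORT)) as sock:
    with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
        cert = tls.getpeercert()

# "notAfter" looks like "Jun  1 12:00:00 2021 GMT"
expires = datetime.utcfromtimestamp(ssl.cert_time_to_seconds(cert["notAfter"]))
print(f"{HOST}: certificate expires in {(expires - datetime.utcnow()).days} days")
```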
    • 14:55–15:05
      Filer/CephFS 10m
      Speakers: Dan van der Ster (CERN), Theofilos Mouratidis (CERN)
      • CEPH-728: cephfs: auto-evict hung clients with a non-zero mds_cap_revoke_eviction_timeout
        • dwight and flax have this set to 900s now. It evicted one hung client over the weekend, so it seems to be working ok.
    • 15:05–15:10
      AOB 5m