Ceph/CVMFS/Filer Service Meeting

Europe/Zurich
600/R-001 (CERN)

Description

Zoom: Ceph Zoom

    • 14:00 - 14:15
      CVMFS 15m
      Speakers: Enrico Bocchi (CERN), Fabrizio Furano (CERN)
      • Migrations completed!
        • lhcbdev's feedback: very smooth migration and we publish by 09:30 what previously required the whole day
      • Stratum1 and S3 (CVMFSOPS-283):
        • Most of the repositories are still replicated and served from a local copy
        • A few are `ProxyPass`ed to S3
        • In some cases (e.g., nightly repos with frequent GC), replication might be impossible
      • Problem loading the root catalog on the release manager because S3 is slow
      • Problem with unpacked.cern.ch
        • `cvmfs2: (unpacked.cern.ch) failed to fetch manifest (5 - outdated manifest), trying another stratum 1`
        • It looks like front* caches are returning a very old .cvmfspublished right after a publication
        • Unclear why this happens, but it is one of the few repos which are `ProxyPass`ed to S3
      • Issues with S3 today -- see below in S3 section
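To check whether a cache is serving an outdated `.cvmfspublished`, one option is to compare the revision recorded in the cached manifest with the one at the origin. The following is a minimal sketch, assuming the manifest header consists of lines with a one-letter key followed by the value (`S` for revision, `N` for repository name, `T` for publication timestamp) and ends at a `--` separator before the signature. The two sample manifests are hypothetical, and real manifests carry a binary signature, so fetched data would need to be handled as bytes.

```python
def parse_manifest(text):
    """Parse the key/value header of a .cvmfspublished manifest.

    Assumption: each header line is a one-letter key followed by its value,
    and the header ends at the '--' separator that precedes the signature.
    """
    fields = {}
    for line in text.splitlines():
        if line == "--":
            break
        if line:
            fields[line[0]] = line[1:]
    return fields


def is_stale(cached_text, origin_text):
    """Return True if the cached manifest has a lower revision (S) than the origin."""
    cached = parse_manifest(cached_text)
    origin = parse_manifest(origin_text)
    return int(cached["S"]) < int(origin["S"])


# Hypothetical manifests: hashes, revisions and timestamps are made up.
ORIGIN = "C" + "a" * 40 + "\nNunpacked.cern.ch\nS1200\nT1606140000\n--\nsignature\n"
CACHED = "C" + "b" * 40 + "\nNunpacked.cern.ch\nS1150\nT1605000000\n--\nsignature\n"

print(is_stale(CACHED, ORIGIN))
```

Running the same comparison against each front* cache right after a publication would show which ones lag behind.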
    • 14:15 - 14:30
      Ceph: Operations 15m
      • Notable Incidents/Requests 5m
        • See S3 below.
        • Core routing issue impacted Ceph on Friday: INC2614993
          • Ceph OTG: https://cern.service-now.com/service-portal?id=outage&n=OTG0060562
          • Caused by CS OTG: https://cern.service-now.com/service-portal?id=outage&n=OTG0060564
      • Upgrades, Migrations, (De-)Commissioning 5m
        Speaker: Enrico Bocchi (CERN)
        • Flax: RJ* machines are fully drained. Next step: recreate the first (new) machine with 4 SSDs, then decide what to do with the RA* machines.
        • Gabe: CQ* machines half drained. Once these are out of the picture we can start draining RJ*.
      • Hardware Repair Liaison 5m
        Speaker: Julien Collet (CERN)

        Julien

        • Put all the pending drives in beesly back into production
        • Also, there were some hiccups with drives replaced on beesly
          • Investigation in progress

        Enrico

        • cephnethub-data-c116fa59b2 crashed overnight
          • Missing memory module (see INC2622305)
          • Intervention on Thu 26th?
      • Puppet, Tools, Monitoring 5m
        Speaker: Theofilos Mouratidis (CERN)

        Metrictank:

        • Possible through monit-grafana
        • Port 6060 only supports functions implemented in metrictank
        • Port 8080 should be used instead
          • It is the Graphite-web port, with query proxy to 6060 for metrictank
          • Something is wrong in the current config: no data can be displayed
        • Metrictank team has an Epic for getting rid of graphite-web and implementing every function
          • Ticket active: Last month
          • Now closed, will be opened for review in ~2 months
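For reference, a query against the Graphite-web port can be built with the standard Graphite render API (`/render` with `target`, `from` and `format` parameters). This is a minimal sketch only: the host name and the metric path are hypothetical placeholders, and nothing is sent over the network.

```python
from urllib.parse import urlencode


def render_url(host, target, port=8080, window="-1h"):
    """Build a Graphite render API URL.

    Port 8080 is the Graphite-web port from the notes above; passing
    port=6060 would address metrictank directly instead.
    """
    query = urlencode({"target": target, "from": window, "format": "json"})
    return f"http://{host}:{port}/render?{query}"


# Hypothetical host and metric path, for illustration only.
url = render_url("metrictank.example.org", "sumSeries(ceph.*.osd.*.op_latency)")
print(url)
```

Building the same URL for both ports makes it easy to check whether a given function is only available through graphite-web.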

        Filer-Carbon:

      • Upstream News 5m
        Speaker: Dan van der Ster (CERN)
        • v14.2.13/v14.2.14 review:
          • Good news is that the upstream S3 data loss issue (https://tracker.ceph.com/issues/47866) has now been understood: (a) it does not impact Nautilus at all and (b) there is a config workaround and a fix in development.
          • Still not confident about the new "use-extra" bluestore allocator feature. There are a couple of reports of 14.2.14 crashing OSDs in aio_write, which might be related (https://tracker.ceph.com/issues/48276)
          • CVE-2020-25660: replay attack allowing auth to Ceph clusters. This requires packet sniffing the network.
        • Ceph SWG Meeting this Weds at 4pm.
    • 14:30 - 14:45
      Ceph: Ongoing Projects 15m
      • Kopano/Dovecot 5m
        Speaker: Dan van der Ster (CERN)
      • REVA/CephFS 5m
        Speaker: Theofilos Mouratidis (CERN)
    • 14:45 - 14:55
      S3 10m
      Speakers: Enrico Bocchi (CERN), Julien Collet (CERN)
      • Network intervention (OTG0060179) caused trouble for S3
        • Opened S3.cern.ch OTG: OTG0060593
        • Machines put out of the LB last week were not re-added
          • Some bare machines became part of the alias
            • 16 machines in consul - 7 today - 4 last week = 5 IPs
            • We always return 10 IPs, hence 5 were bare boxes
        • Bare machines showed misconfiguration in:
          • TLS, to be investigated (reported by Adrian and Ricardo)
          • IPv6 (reported by LHCb - CVMFS)
          • No logs reporting via filebeat
          • See CEPH-1008
        • Services complaining: CVMFS, GitLab Registry, Indico
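The alias arithmetic above can be written out as a small sketch. It assumes the figures mean: 16 machines registered in consul, 7 taken out today, 4 taken out last week and not re-added, and an alias that is always padded to 10 IPs. The function and parameter names are illustrative only.

```python
def bare_ips_in_alias(registered, removed_today, removed_last_week, alias_size):
    """Return (configured IPs left in the alias, bare boxes filling the gap)."""
    healthy = registered - removed_today - removed_last_week
    # The alias always returns alias_size IPs; whatever the configured
    # machines cannot cover is assumed to be filled by bare boxes.
    bare = max(alias_size - healthy, 0)
    return healthy, bare


healthy, bare = bare_ips_in_alias(16, 7, 4, 10)
print(healthy, bare)  # 5 5
```

With the 4 machines from last week re-added, the same calculation gives 9 configured IPs and a single bare box.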
    • 14:55 - 15:05
      Filer/CephFS 10m
      Speakers: Dan van der Ster (CERN), Enrico Bocchi (CERN), Theofilos Mouratidis (CERN)
    • 15:05 - 15:10
      AOB 5m