Ceph/CVMFS/Filer Service Meeting

Europe/Zurich
600/R-001 (CERN)

    • 14:00 14:15
      CVMFS 15m
      Speaker: Enrico Bocchi (CERN)

      (Enrico):

      • CVMFS 2020 PoW: https://indico.cern.ch/event/871750/
        • New server-side code. Today: cvmfs_server + cvmfs_swissknife; Future: shared library with primitives and CLI built on top.
        • Push sw.hsf.org: S3 + Multiple RMs && Gateway service + Publish jobs from Jenkins
        • Conveyor: publication jobs go to a central queue; release managers are ephemeral containers
      • Collectd spinning 100%:
        • collectd-blockdevice-drivers (20.1.3-1) in qa 
        • shadow cvmfs-backend machine no longer complains
      • Frontier Squid (4.9-4.1) in production for site caches
        • Next for Stratum1: front*.cern.ch machines
        • New s2.2xlarge flavor -- 640 GB disk
        • Cache not filled. ~80% after 5 days of operation
      • YubiKey signing today @3pm for {ams, cvmfs-config, grid}.cern.ch
      • Migration plan to CC7 (and S3?)
        • 40 repos on SLC6, 25 release managers
        • All but one (lhcbdev-test, no longer needed) on cinder volumes

       

      (Dan)

      • INC2277760: cvmfs-alice had failed publications on Sat. Giulio tried again a few times and it finally worked. I didn't find any reason for the failures.
    • 14:15 14:30
      Ceph: Operations 15m
      • Notable Incidents or Requests 5m

        ceph/erin outage (15 Jan):

        Around 13:30 there was a network failure in the datacenter. The OSDs couldn't peer with their neighbours and, after an hour, started marking themselves as failed.
        This put a lot of client operations on hold for about an hour.
        The nodes were then restarted and operations returned to normal within 30 minutes.
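
        A rough sketch of the triage for this kind of partition (assumes the ceph CLI and an admin keyring on the host; not a record of the exact commands run during the incident):

          #!/usr/bin/env python3
          # Sketch: quick triage of a suspected network-partition outage on a Ceph cluster.
          import json
          import subprocess

          def ceph_json(*args):
              """Run a ceph CLI command and return its parsed JSON output."""
              return json.loads(subprocess.check_output(["ceph", *args, "--format", "json"]))

          # Which OSDs do the monitors currently consider down?
          osd_dump = ceph_json("osd", "dump")
          down = [o["osd"] for o in osd_dump["osds"] if not o["up"]]
          print(f"OSDs marked down: {down}")

          # Are placement groups stuck inactive/peering (the symptom seen on ceph/erin)?
          subprocess.run(["ceph", "pg", "dump_stuck", "inactive"], check=False)

          # Full health detail for the incident ticket.
          subprocess.run(["ceph", "health", "detail"], check=False)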

         

        An interesting scsi glitch, mdraid reassembly seen on AFS: https://its.cern.ch/jira/projects/AFS/issues/AFS-508 -- what would we do if such a glitch happened on a Ceph server?
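
        One hedged answer, assuming the glitch hit a single OSD host (the OSD id and PG id below are placeholders): deep-scrub the affected OSD and only repair once an inconsistency is confirmed.

          #!/usr/bin/env python3
          # Sketch: after a suspected SCSI/controller glitch under an OSD, deep-scrub it
          # and look for inconsistent PGs before considering a repair.
          import subprocess

          OSD = "osd.42"  # hypothetical OSD on the affected host

          # Ask the OSD to deep-scrub all PGs it holds.
          subprocess.run(["ceph", "osd", "deep-scrub", OSD], check=True)

          # Later, look for PGs reported inconsistent.
          subprocess.run(["ceph", "health", "detail"], check=False)
          # For a specific inconsistent PG (placeholder id), inspect the objects involved:
          #   subprocess.run(["rados", "list-inconsistent-obj", "1.2f", "--format=json-pretty"])
          # and only then, if the damage is confined to one replica:
          #   subprocess.run(["ceph", "pg", "repair", "1.2f"])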

      • Repair Service Liaison 5m
        Speaker: Julien Collet (CERN)

        Giuliano

        • Updated the rota document for the repair team
        • Added a pvscan --cache step to clean up after wild pvremoves (see the sketch below)
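
        A minimal sketch of that cleanup step (nothing site-specific assumed):

          #!/usr/bin/env python3
          # Sketch of the repair-rota cleanup: after a stray pvremove, refresh the LVM
          # metadata cache so pvs/vgs no longer list stale physical volumes.
          import subprocess

          # Rebuild the LVM device cache (same as running `pvscan --cache` by hand).
          subprocess.run(["pvscan", "--cache"], check=True)

          # Verify that no stale PVs remain.
          subprocess.run(["pvs"], check=True)
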
      • Backend Cluster Maintenance 5m
        Speaker: Theofilos Mouratidis (CERN)

        ceph-beesly (critical-power):

        Migrated:
        p05798818s40185
        p05798818s49204
        p05798818s63747


        Migrated-backfilling:
        p05798818s98313

    • 14:30 14:45
      Ceph: Projects, News, Other 15m
      • Backup 5m
        Speaker: Roberto Valverde Cameselle (CERN)
      • HPC 5m
        Speaker: Dan van der Ster (CERN)

        Tuesday: Pablo Llopis saw stuck requests on ceph/jim (/bescratch). I found slow requests on the MDS (hpc-be144) and failed over to the standby to unblock them. The root cause is unclear, but jim is due for an upgrade.
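
        For reference, a sketch of the inspect-then-failover sequence; the admin-socket call has to run on the MDS host itself, and the daemon name is the one from this incident:

          #!/usr/bin/env python3
          # Sketch: inspect slow MDS requests, then fail over to the standby MDS.
          import subprocess

          MDS_NAME = "hpc-be144"

          # On the MDS host: dump in-flight operations to see what is stuck and for how long.
          subprocess.run(["ceph", "daemon", f"mds.{MDS_NAME}", "dump_ops_in_flight"], check=False)

          # From an admin node: fail the active MDS so the standby takes over,
          # which is what unblocked the stuck /bescratch requests here.
          subprocess.run(["ceph", "mds", "fail", MDS_NAME], check=True)

          # Confirm the standby has become active.
          subprocess.run(["ceph", "fs", "status"], check=False)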

      • Kopano 5m
        Speaker: Dan van der Ster (CERN)

        Kopano servers have dual 10Gig-E NICs -- Jose configured them in LACP mode -- link aggregation. In theory this should give 20Gig-E per machine, but my rados bench showed only 1000MB/s. Jose is checking with network people (Vincent).
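
        For context, a sketch of the kind of rados bench run behind that number; the pool name and durations are placeholders, not the ones actually used:

          #!/usr/bin/env python3
          # Sketch: measure raw RADOS write and sequential-read throughput against a pool.
          import subprocess

          POOL = "test-bench"   # hypothetical benchmarking pool
          SECONDS = "60"

          # Write benchmark; keep the objects so they can be read back afterwards.
          subprocess.run(["rados", "bench", "-p", POOL, SECONDS, "write", "--no-cleanup"], check=True)

          # Sequential read of the objects written above.
          subprocess.run(["rados", "bench", "-p", POOL, SECONDS, "seq"], check=True)

          # Remove the benchmark objects when done.
          subprocess.run(["rados", "-p", POOL, "cleanup"], check=True)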

        Just before the meeting, the cephkopano cluster lost (several?) nodes at once, so the cluster is currently down.

      • Upstream News 5m
        Speaker: Dan van der Ster (CERN)
    • 14:45 14:55
      S3 10m
      Speakers: Julien Collet (CERN), Roberto Valverde Cameselle (CERN)

      Julien

      • CEPH-811: Identify rgw by nomad job
        • Quite useful in debugging (like last week)
        • Should be really easy to implement
      • S3 accounting:
        • Remove the useless storage of old usage snapshots
        • Now only storing the day-1 usage on the S3 bucket, so as to trigger the usage alarm (see the sketch after this list)
          • (still pushes actual data in the accounting folder)
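
      A sketch of the pruning logic, with hypothetical bucket, prefix, and endpoint names, and assuming snapshot keys contain their date:

        #!/usr/bin/env python3
        # Sketch: keep only yesterday's usage snapshot on S3 and delete older ones.
        # Bucket, prefix, and endpoint are placeholders, not the production ones.
        import datetime
        import boto3

        s3 = boto3.client("s3", endpoint_url="https://s3.example.ch")  # placeholder endpoint
        BUCKET = "accounting"        # hypothetical bucket
        PREFIX = "usage-snapshots/"  # hypothetical key prefix

        keep = (datetime.date.today() - datetime.timedelta(days=1)).isoformat()

        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
            for obj in page.get("Contents", []):
                # Delete every snapshot except the day-1 one that feeds the usage alarm.
                if keep not in obj["Key"]:
                    s3.delete_object(Bucket=BUCKET, Key=obj["Key"])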

       

      Roberto

      • S3 backup broken since the middle of last week. Really difficult to debug; what we know:
        • Clear degradation of performance from Wednesday around 13:30 [es-ceph], maybe related to the router intervention [OTG0054209]. There was also a lot going on that day on the network side.
        • During the restic scan prior to the backup, some timeouts and errors are visible in the fusex logs. This makes the process take far more time than it needs, most visibly for accounts with a lot of files. I tried a configuration change suggested by Andreas but it did not help. We have been seeing this for a long time, so it may not be the root cause of the current problem [EOS-3619]. It is also reproduced when backing up to a local restic repository (not S3 related); see the sketch after this list. I'll check with Andreas after this meeting.
        • While backing up big users with big files, the backup process gets stuck at some point after the initial scan, at random places. Nothing useful in the fusex, MGM, or diskserver logs. I haven't tried to reproduce it locally; that needs more space.
        • Prune agents are also affected, and they are not related to fusex (it is not even configured on those nodes). The process just hangs at a random point :D
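
      A sketch of the local-repository reproduction mentioned above; the paths and password handling are placeholders:

        #!/usr/bin/env python3
        # Sketch: run the same backup into a local restic repository to take S3
        # out of the picture. Paths and password handling are placeholders.
        import os
        import subprocess

        env = dict(os.environ, RESTIC_PASSWORD="dummy")  # placeholder secret handling
        repo = "/srv/restic-test"                        # local repository, no S3 involved
        source = "/eos/user/x/someuser"                  # hypothetical fusex mount path

        subprocess.run(["restic", "-r", repo, "init"], env=env, check=True)
        # Verbose output gives per-file progress, which helps spot where the scan or backup stalls.
        subprocess.run(["restic", "-r", repo, "backup", "--verbose=2", source], env=env, check=True)
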
    • 14:55 15:05
      Filer/CephFS 10m
      Speakers: Dan van der Ster (CERN), Theofilos Mouratidis (CERN)

      Upgraded ceph/dwight to v14.2.6 (from v13.2.7). The mon/mgr/osd upgrades all went smoothly. The MDS upgrade was not too smooth when I enabled msgr v2 -- the active MDS reported that the leading mon "lost contact" at the moment I enabled v2. Details here: https://tracker.ceph.com/issues/43596
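
      For the flax upgrade, a sketch of the msgr v2 step as in the standard Nautilus procedure (upstream commands, not a record of exactly what was run on dwight):

        #!/usr/bin/env python3
        # Sketch: only enable msgr v2 once every daemon reports Nautilus,
        # and watch the MDS and overall health while doing so.
        import subprocess

        # All daemons should report 14.2.x before enabling msgr v2.
        subprocess.run(["ceph", "versions"], check=True)

        # Enable the v2 protocol on the monitors (the step where the active MDS on
        # dwight reported losing contact with the leading mon; see the tracker above).
        subprocess.run(["ceph", "mon", "enable-msgr2"], check=True)

        # Confirm the MDS stayed (or came back) active and the cluster is healthy.
        subprocess.run(["ceph", "fs", "status"], check=False)
        subprocess.run(["ceph", "health", "detail"], check=False)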

      I don't think this is a showstopper for flax upgrade. I'll be doing more multi-MDS testing on dwight now before scheduling the flax upgrade.

    • 15:05 15:10
      AOB 5m