Seeking Cost-Optimal Infrastructure Size for Distributed File Systems:

A Ceph Case Study

2025-03-19

Motivations

What machines do I need to buy to reach a given level of performance from a distributed file system?

  • How many CPU cores?
  • How many GBs of RAM?
  • How fast does my HDD need to be?
  • How fast does my NIC need to be?

Idea

Explore all possible configurations and measure the performance!

Without having to buy all the possible configurations.

Methodology

Emulate smaller hardware configurations on existing machines, without virtualization or additional software layers; a combined sketch follows the list:

  • Cgroups to throttle device access, such as hard drive bandwidth and IOPS:
$ # Cap write bandwidth (bytes per second) for the device identified by <major>:<minor>
$ echo "<major>:<minor>  <rate_bytes_per_second>" > /cgrp/blkio.throttle.write_bps_device
$ # Cap read IOPS for the same device
$ echo "<major>:<minor>  <rate_io_per_second>" > /cgrp/blkio.throttle.read_iops_device
  • Linux hot-plug interface to manage CPU cores and RAM:
$ # Take CPU core 1 offline
$ echo 0 > /sys/devices/system/cpu/cpu1/online
$ # Bring memory block 1 back online (write "offline" to remove it instead)
$ echo "online" > /sys/devices/system/memory/memory1/state

Storage nodes

The storage nodes consist of 8 DELL PowerEdge R760 servers, configured with:

  • 2 x Intel Xeon Gold 6426Y, 16 cores
  • 192 GB of RAM (12 x 16 GB) @ 4800 MT/s
  • 12 x 22 TB SAS HDD WDC WUH722222AL5200
  • 1 x NVIDIA ConnectX-7 100 Gbit InfiniBand

MDS and client nodes

10 DELL PowerEdge R6625 HPC nodes were used, configured with:

  • 2 x AMD EPYC 9374F 3.85 GHz, 32 cores (Genoa)
  • 512 GB of RAM (16 x 32 GB) @ 4800 MT/s
  • 1 x NVIDIA ConnectX-7 100 Gbit InfiniBand

Two nodes are dedicated to the MDS role and up to eight are used as clients.

Ceph configuration

  • Metadata servers (MDS) run on dedicated compute nodes so that metadata operations are not slowed down by co-located services.
  • Monitoring services (Prometheus, Grafana) and Ceph daemons (Manager, Monitor) run on separate nodes to avoid interference.
  • The metadata pool is hosted on NVMe storage.
  • The file system data pool is hosted on HDD storage and tested with replica counts of 2 and 3.
  • The failure domain is set to host and the scrub procedure is disabled; a sketch of the corresponding commands follows this list.
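
The original does not list the exact commands; a minimal sketch of how such a layout is usually expressed, with hypothetical pool and CRUSH rule names (cephfs_meta, cephfs_data, meta_nvme, data_hdd) and placeholder PG counts, is:

$ # CRUSH rules with failure domain "host", split by device class
$ ceph osd crush rule create-replicated meta_nvme default host nvme
$ ceph osd crush rule create-replicated data_hdd default host hdd
$ # Metadata pool on NVMe, data pool on HDD; replica count 2 or 3 on the data pool
$ ceph osd pool create cephfs_meta 64
$ ceph osd pool create cephfs_data 2048
$ ceph osd pool set cephfs_meta crush_rule meta_nvme
$ ceph osd pool set cephfs_data crush_rule data_hdd
$ ceph osd pool set cephfs_data size 3
$ ceph fs new cephfs cephfs_meta cephfs_data
$ # Disable scrubbing for the duration of the benchmarks
$ ceph osd set noscrub
$ ceph osd set nodeep-scrub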

Benchmark

Each compute node runs 64 FIO clients, together continuously processing 1 TB of data per node (64 jobs x 16 GB).
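
The original does not state how the FIO clients were launched across the nodes; one common pattern, assuming fio's built-in client/server mode and hypothetical file names (hosts.list, workload.fio), is:

$ # On every client node: start an fio server in the background
$ fio --server --daemonize=/run/fio-server.pid
$ # From a coordinating node: hosts.list lists the client hostnames, workload.fio is the job file
$ fio --client=hosts.list workload.fio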

Workload Types

  • Sequential Read/Write: block size 8 MB
  • Random I/O (IOPS test): block size 4 KB

All jobs share the following FIO global section (per-workload sections are sketched after it):
[global]
ioengine=libaio
direct=1
buffered=0
invalidate=1
runtime=150
size=16G
numjobs=64
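
The per-workload job sections are not reproduced above; a minimal sketch consistent with the listed block sizes, with hypothetical section names and rw modes, could be:

; Sketch only: section names and rw modes are assumptions; block sizes match the text
[seq-write]
rw=write
bs=8M

[seq-read]
rw=read
bs=8M

[rand-write]
rw=randwrite
bs=4k

[rand-read]
rw=randread
bs=4k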

Sequential IO

[Figure: sequential I/O results]

Random IO

[Figure: random I/O results]
Conclusion

  • Sequential I/O workloads are not significantly influenced by core count or RAM.
  • Random writes benefit from increased RAM, while random reads also gain from a higher core count.
  • Sequential writes and random reads improve with faster storage devices, but only up to a certain threshold.

Limitations

  • Use the entire cgroup interface, including device IOPS and NIC bandwidth throttling.
  • Evaluate performance when using erasure coding.
  • Assess the impact on recovery and rebalance procedures.

Contacts

niccolo.tosato@phd.units.it