Description
High Energy Physics (HEP) faces increasingly severe data storage challenges. Next-generation particle collider experiments are expected to generate unprecedented data volumes and acquisition rates, demanding sustained I/O with sub-millisecond latency and PB/s-level throughput. Traditional kernel-based file systems, burdened by context switching, interrupt handling, and heavy metadata overhead, struggle to exploit the full performance of emerging NVMe SSD hardware and become a critical bottleneck in experimental data processing and analysis pipelines.
To address this, we present ODGDFS, a user-space object storage engine optimized for HEP data access patterns, originating from the JwanFS project at IHEP. Built on the SPDK Blobstore, the system minimizes software stack overhead by bypassing the OS kernel entirely and using lockless, polling-based I/O with a lightweight metadata architecture. Its core innovations include:
- Flat Metadata Organization: A custom superblock-backed metadata scheme uses in-memory hash indexing to locate objects in O(1) time, eliminating the multi-level directory lookups of traditional file systems;
- Zero-Copy Tail Cache: A tail cache aggregation mechanism optimizes the small asynchronous writes common in HEP experiments, significantly reducing write amplification while boosting sequential write throughput;
- Stream Decoupling & Lazy Loading: Data and index streams are logically decoupled and loaded lazily, keeping memory and CPU utilization efficient while supporting thousands of concurrent data volumes.
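The first two ideas can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions, not the ODGDFS implementation: the class names `FlatStore` and `ObjectEntry`, the 4096-byte block size, and the object name are hypothetical, the "device" is a Python list, and zero-copy behavior and SPDK calls are not modeled. It shows a flat hash index keyed by object name and a per-object tail buffer that absorbs small appends until a whole block can be written.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4096  # assumed device write granularity in bytes (hypothetical)


@dataclass
class ObjectEntry:
    """Hypothetical per-object metadata record."""
    blocks: list = field(default_factory=list)          # full blocks already written
    tail: bytearray = field(default_factory=bytearray)  # unflushed partial block


class FlatStore:
    """Toy model of a flat hash index combined with a per-object tail cache."""

    def __init__(self):
        # Flat namespace: object name -> entry, average O(1) lookup,
        # with no directory hierarchy to traverse.
        self.index = {}
        self.device_writes = 0  # number of block-sized writes issued

    def append(self, name, data):
        entry = self.index.setdefault(name, ObjectEntry())
        entry.tail.extend(data)
        # Write out only whole blocks; the remainder stays in the tail cache.
        while len(entry.tail) >= BLOCK_SIZE:
            entry.blocks.append(bytes(entry.tail[:BLOCK_SIZE]))
            del entry.tail[:BLOCK_SIZE]
            self.device_writes += 1

    def read(self, name):
        entry = self.index[name]
        return b"".join(entry.blocks) + bytes(entry.tail)


store = FlatStore()
for _ in range(100):
    # 100 small appends of 100 bytes each, 10,000 bytes in total.
    # The name is a flat key, not a path that is walked component by component.
    store.append("run42/events", b"x" * 100)

print(store.device_writes)                     # block writes issued so far
print(len(store.index["run42/events"].tail))   # bytes still held in the tail cache
```

In this model the 100 small appends result in two block-sized writes, with the remaining bytes held in the tail buffer, rather than 100 separate partial-block writes. A real engine would also need a flush policy for the tail and persistence of the index, which are omitted here.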
Preliminary benchmarks based on rigorous stress testing have confirmed the system's stability and correctness under high-concurrency simulated workloads. We expect that, on typical HEP workloads, ODGDFS will deliver significant improvements in I/O throughput together with stable low latency, providing a scalable and efficient storage solution for managing massive datasets in future large-scale experimental data centers.