Description
The CBM experiment at GSI/FAIR will investigate QCD matter at high baryon densities with a free-streaming, self-triggered detector readout delivering time-stamped data on approximately 5000 input links. Designed for aggregate data rates exceeding 1 TB/s, the First-level Event Selector (FLES) system performs timeslice building, aggregating these streams into overlapping processing intervals for online event reconstruction.
Years of production experience with the original Flesnet software stack at the mCBM FAIR Phase-0 experiment revealed limitations in the monolithic run concept, particularly regarding resilience against detector malfunctions and external failures. These challenges motivated a complete rewrite of the timeslice building infrastructure.
The new system introduces a central manager architecture that enables dynamic load balancing and fault tolerance across the FLES HPC cluster. Key innovations include wall-time-driven operation with opportunistic timeouts, dynamic calculation of timeslice components with flexible overlap, dynamic buffer management for improved memory efficiency, and support for live scaling of build nodes during data acquisition. Senders, builders, and the central manager communicate via UCX (Unified Communication X), which provides efficient RDMA transport over InfiniBand while remaining open to alternative network technologies.
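As an illustration of timeslice intervals with a configurable overlap, the following minimal Python sketch maps a timeslice index to a time interval and a timestamp to the timeslices that contain it. It is not taken from the FLES code base: the class name `TimesliceSpec`, its fields, and the convention that the overlap region is appended to the end of each core interval are assumptions made for this example only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimesliceSpec:
    """Hypothetical description of a timeslice grid (all times in ns)."""
    start_ns: int      # time at which timeslice 0 begins
    duration_ns: int   # length of the core interval of each timeslice
    overlap_ns: int    # extra region appended after the core interval

    def interval(self, index):
        """Return the half-open interval [begin, end) of timeslice `index`."""
        begin = self.start_ns + index * self.duration_ns
        end = begin + self.duration_ns + self.overlap_ns
        return begin, end

    def indices_for(self, timestamp_ns):
        """Return all timeslice indices whose interval contains the timestamp."""
        if timestamp_ns < self.start_ns:
            return []
        # Timeslice whose core interval contains the timestamp.
        core = (timestamp_ns - self.start_ns) // self.duration_ns
        result = [core]
        # Earlier timeslices whose overlap region still reaches the timestamp.
        k = core - 1
        while k >= 0 and timestamp_ns < self.interval(k)[1]:
            result.insert(0, k)
            k -= 1
        return result


spec = TimesliceSpec(start_ns=0, duration_ns=100, overlap_ns=10)
first = spec.interval(0)         # core interval plus overlap
shared = spec.indices_for(105)   # timestamp inside the overlap region
```

Because the interval depends only on the index and the three parameters, any node can compute it independently, which is one way a time-driven scheme can avoid per-timeslice negotiation.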
The system gracefully handles non-ideal conditions: failing senders are bypassed after short timeouts, failing builders are automatically excluded from scheduling, and the manager can be restarted during active runs. Initial deployment in development setups demonstrates the system's operational readiness for SIS100 commissioning.
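The failure handling described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the actual manager implementation: `BuilderPool`, `select_contributions`, the least-pending-work assignment policy, and the single per-timeslice deadline are all hypothetical choices made for the example.

```python
class BuilderPool:
    """Hypothetical manager-side view of the available build nodes."""

    def __init__(self, builders):
        # Number of timeslices currently assigned to each healthy builder.
        self.pending = {name: 0 for name in builders}

    def add(self, name):
        """Register a builder that joined during the run (live scaling)."""
        self.pending.setdefault(name, 0)

    def mark_failed(self, name):
        """Exclude a failing builder from all further scheduling."""
        self.pending.pop(name, None)

    def assign(self):
        """Pick the healthy builder with the least pending work."""
        if not self.pending:
            raise RuntimeError("no healthy builders available")
        name = min(self.pending, key=lambda n: (self.pending[n], n))
        self.pending[name] += 1
        return name

    def complete(self, name):
        """Record that a builder finished one timeslice."""
        if self.pending.get(name, 0) > 0:
            self.pending[name] -= 1


def select_contributions(arrival_ns, deadline_ns):
    """Split senders into those that delivered by the deadline and the rest.

    `arrival_ns` maps sender name to arrival time, or None if nothing arrived.
    """
    included = sorted(
        s for s, t in arrival_ns.items() if t is not None and t <= deadline_ns
    )
    bypassed = sorted(s for s in arrival_ns if s not in included)
    return included, bypassed


pool = BuilderPool(["b0", "b1"])
target = pool.assign()
pool.mark_failed("b0")  # b0 no longer receives timeslices
included, bypassed = select_contributions(
    {"s0": 5, "s1": None, "s2": 20}, deadline_ns=10
)
```

In this sketch a timeslice is built from whatever arrived before the deadline, so a silent sender delays a timeslice by at most the timeout instead of stalling the run.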
This work is supported by BMFTR (05P24RF3).