Description
The data reduction stage is a major bottleneck in processing data from the Large Hadron Collider (LHC) at CERN, which generates hundreds of petabytes annually for fundamental particle physics research. At this stage, scientists must distill petabytes into only gigabytes of information relevant to a given analysis. This filtering is limited by slow network transfers when fetching data from globally dispersed storage facilities, wasting thousands of CPU hours as compute nodes sit idle waiting for data to arrive.
We demonstrate a near-data computing model that improves data access and overall performance by filtering LHC data close to where it is stored, before it is transmitted over the slow network. The model is designed to require minimal changes to the existing data layout and to integrate seamlessly with the underlying storage infrastructure, ensuring compatibility with current systems and ease of adoption.
We achieve this by deploying Data Processing Units (DPUs) within the storage cluster. Our model leverages the DPUs' high-bandwidth connections to retrieve and filter data close to storage, significantly improving overall data processing speed and freeing compute-node CPUs for more important tasks. It also streamlines the workflow by removing coding complexity, making the system accessible to end users without DPU-specific programming. We demonstrate that our model significantly outperforms current methods using real physics data and a realistic data reduction workflow.
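To make the idea concrete, the following is a minimal sketch (not the system presented here) of the kind of near-storage filter a DPU in the storage cluster could run, using the uproot and awkward libraries common in LHC analysis. The file path, tree name, branch names, and cut expression are illustrative assumptions.

# Hypothetical near-storage filter of the kind a DPU inside the storage
# cluster could run: open a ROOT file on fast local storage, apply the
# event selection there, and write a compact file for transfer, so that
# rejected events never cross the slow wide-area network.
# The file path, tree name, branch names, and cut are illustrative
# assumptions, not the actual system described above.
import awkward as ak
import uproot

def filter_near_storage(in_path: str, out_path: str) -> int:
    """Apply an event selection next to the data; return events kept."""
    with uproot.open(in_path) as f:
        tree = f["Events"]  # assumed tree name
        # Read only the branches the analysis needs and apply the cut
        # during the read, so only selected events are materialized.
        selected = tree.arrays(
            ["MET_pt", "nMuon"],   # assumed branch names
            cut="MET_pt > 25",     # assumed per-event selection
            library="ak",
        )
    ak.to_parquet(selected, out_path)  # compact columnar output to ship
    return len(selected)

if __name__ == "__main__":
    kept = filter_near_storage("events.root", "selected.parquet")
    print(f"kept {kept} events")

In this sketch, only the selected events and branches are written out for transfer to the compute nodes; in the model described above, the equivalent filtering runs on the DPU, so the compute-node CPUs never touch the rejected data.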