Speaker
Description
The High-Level Trigger (HLT) of the Compact Muon Solenoid (CMS) selects event data in real time, reducing the data rate from hundreds of kHz to a few kHz for offline storage. With the upcoming Phase-2 upgrade of the CMS experiment, data volumes are expected to increase substantially, making efficient lossless compression essential for sustainable storage and processing.
Recent work has shown that autoregressive transformer models can achieve higher compression ratios than traditional algorithms. The LMCompress framework reaches compression ratios of 2-3x on images, audio, and text by training models to predict data patterns and using those predictions to drive arithmetic coding for lossless compression.
We investigate whether transformer-based compression can be applied to CMS HLT RAW data. We propose using Byte-level Generative Pre-trained Transformers (bGPT) trained on CMS Run 3 RAW data to learn the probability distribution of the detector readouts. We explore detector-specific fine-tuning by partitioning training data by Front-End Driver (FED), which could yield models specialized to the characteristics of individual subdetector systems. The trained models would predict a probability distribution for each byte in the RAW data stream, enabling arithmetic coding to achieve stronger lossless compression.
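The connection between per-byte model predictions and achievable compression can be sketched with the Shannon bound: an arithmetic coder driven by a model approaches a total code length of -log2 p(byte) summed over the stream. The snippet below is a minimal illustration, not the proposed implementation; the `predict` interface and the uniform toy model are assumptions standing in for a trained bGPT.

```python
import math

def ideal_code_length_bits(data: bytes, predict) -> float:
    """Shannon-optimal code length (in bits) that arithmetic coding
    approaches when `predict(prefix)` returns a 256-entry probability
    distribution over the next byte, given the bytes seen so far."""
    total = 0.0
    for i, b in enumerate(data):
        p = predict(data[:i])[b]
        total += -math.log2(p)  # cost of coding byte b under the model
    return total

# Toy stand-in for a trained model: a uniform distribution over bytes,
# which costs exactly 8 bits per byte -- i.e. no compression at all.
def uniform_model(prefix: bytes):
    return [1.0 / 256] * 256

data = bytes(range(16))
bits = ideal_code_length_bits(data, uniform_model)
print(bits / 8)  # 16.0 -- a uniform model gives no gain over raw bytes
```

A model that assigns higher probability to the bytes that actually occur lowers this bound below 8 bits per byte, which is the gain the proposed bGPT approach targets.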
We benchmark this approach against existing compression algorithms (LZMA and ZSTD) currently used in the HLT framework, evaluating compression ratios and processing time. Performance is measured on Run 3 data and Phase-2 simulation datasets to assess the method's viability for future data-taking, where significant compression ratio improvements could substantially reduce storage requirements.
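A baseline measurement of the kind used in such a benchmark can be sketched with the Python standard library; the synthetic payload below is a hypothetical stand-in for a RAW event buffer, and real ZSTD timing would use the third-party `zstandard` package rather than the stdlib.

```python
import lzma
import time

def benchmark(name: str, compress, payload: bytes) -> float:
    """Report compression ratio and wall-clock time for one codec."""
    t0 = time.perf_counter()
    out = compress(payload)
    dt = time.perf_counter() - t0
    ratio = len(payload) / len(out)
    print(f"{name}: ratio={ratio:.2f}x, time={dt * 1000:.1f} ms")
    return ratio

# Hypothetical stand-in for a RAW payload: structured, repetitive bytes.
payload = bytes((i * 7) % 256 for i in range(1 << 12)) * 16

benchmark("LZMA", lambda d: lzma.compress(d, preset=6), payload)
# ZSTD is not in the standard library; with the `zstandard` package the
# equivalent call is zstandard.ZstdCompressor(level=3).compress(d).
```

In the actual study the payloads are CMS RAW events and both ratio and throughput matter, since HLT compression runs under real-time constraints.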