Speaker
Description
The High-Level Trigger (HLT) of the Compact Muon Solenoid (CMS) selects event data in real time, reducing the data rate from hundreds of kHz to a few kHz for offline storage. With the upcoming Phase-2 upgrade of the CMS experiment, data volumes are expected to increase substantially, making efficient lossless compression essential for sustainable storage and processing.
Recent work has shown that autoregressive transformer models can achieve higher compression ratios than traditional algorithms. The LMCompress framework reaches compression ratios of 2-3x on images, audio, and text by training models to predict data patterns and using those predictions to drive arithmetic coding for lossless compression.
We investigate whether transformer-based compression can be applied to CMS HLT RAW data. We propose using Byte-level Generative Pre-trained Transformers (bGPT) trained on CMS Run 3 RAW data to learn the probability distribution of the detector readouts. We explore detector-specific fine-tuning by partitioning training data by Front-End Driver (FED), which could yield models specialized to the characteristics of individual subdetector systems. The trained models would predict a probability distribution for each byte in the RAW data stream, enabling arithmetic coding to achieve stronger lossless compression.
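The connection between per-byte model predictions and achievable compression can be sketched with the Shannon bound: an arithmetic coder driven by a model approaches a total code length of -log2 p(byte) summed over the stream. The snippet below is a minimal illustration, not the proposed implementation; the `predict` interface and the uniform toy model are assumptions standing in for a trained bGPT.

```python
import math

def ideal_code_length_bits(data: bytes, predict) -> float:
    """Shannon-optimal code length (in bits) that arithmetic coding
    approaches when `predict(prefix)` returns a 256-entry probability
    distribution over the next byte, given the bytes seen so far."""
    total = 0.0
    for i, b in enumerate(data):
        p = predict(data[:i])[b]
        total += -math.log2(p)  # cost of coding byte b under the model
    return total

# Toy stand-in for a trained model: a uniform distribution over bytes,
# which costs exactly 8 bits per byte -- i.e. no compression at all.
def uniform_model(prefix: bytes):
    return [1.0 / 256] * 256

data = bytes(range(16))
bits = ideal_code_length_bits(data, uniform_model)
print(bits / 8)  # 16.0 -- a uniform model gives no gain over raw bytes
```

A model that assigns higher probability to the bytes that actually occur lowers this bound below 8 bits per byte, which is the gain the proposed bGPT approach targets.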
We benchmark this approach against existing compression algorithms (LZMA and ZSTD) currently used in the HLT framework, evaluating compression ratios and processing time. Performance is measured on Run 3 data and Phase-2 simulation datasets to assess the method's viability for future data-taking, where significant compression ratio improvements could substantially reduce storage requirements.
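A baseline measurement of the kind used in such a benchmark can be sketched with the Python standard library; the synthetic payload below is a hypothetical stand-in for a RAW event buffer, and real ZSTD timing would use the third-party `zstandard` package rather than the stdlib.

```python
import lzma
import time

def benchmark(name: str, compress, payload: bytes) -> float:
    """Report compression ratio and wall-clock time for one codec."""
    t0 = time.perf_counter()
    out = compress(payload)
    dt = time.perf_counter() - t0
    ratio = len(payload) / len(out)
    print(f"{name}: ratio={ratio:.2f}x, time={dt * 1000:.1f} ms")
    return ratio

# Hypothetical stand-in for a RAW payload: structured, repetitive bytes.
payload = bytes((i * 7) % 256 for i in range(1 << 12)) * 16

benchmark("LZMA", lambda d: lzma.compress(d, preset=6), payload)
# ZSTD is not in the standard library; with the `zstandard` package the
# equivalent call is zstandard.ZstdCompressor(level=3).compress(d).
```

In the actual study the payloads are CMS RAW events and both ratio and throughput matter, since HLT compression runs under real-time constraints.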