Description
The High-Luminosity LHC (HL-LHC) era will deliver unprecedented luminosity and new detector capabilities to the LHC experiments, leading to significant computing challenges in storing, processing, and analyzing the data. The development of small, analysis-ready storage formats such as CMS NanoAOD (4 kB/event), suitable for up to half of physics searches and measurements, helps achieve the necessary reductions in data processing and storage. However, a large fraction of analyses require computationally expensive machine learning outputs or data stored only in larger, less accessible formats such as CMS MiniAOD (45 kB/event) or AOD (450 kB/event). This necessitates the non-volatile storage of derived data in custom formats. In this work, we present research on the development of workflows and the integration of tools with ServiceX to efficiently fetch, cache, and join data for use with columnar analysis tools.
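As an illustration of such a fetch step, the sketch below requests a handful of NanoAOD branches through ServiceX and loads them into Awkward Arrays for columnar analysis. It is a minimal sketch assuming the ServiceX 3.x Python client (`deliver`, `dataset.Rucio`, `query.UprootRaw`); the Rucio dataset identifier and branch names are hypothetical placeholders, not part of the work described here.

```python
# Minimal sketch of a ServiceX fetch, assuming the ServiceX 3.x Python
# client; the Rucio dataset identifier and branch names are hypothetical.
import awkward as ak
import uproot
from servicex import dataset, query, deliver

spec = {
    "Sample": [
        {
            "Name": "zmumu",
            # Hypothetical Rucio dataset identifier
            "Dataset": dataset.Rucio("cms:DYJetsToLL_M-50_NanoAODv12"),
            # Skim only the branches the analysis actually needs
            "Query": query.UprootRaw(
                [{"treename": "Events",
                  "filter_name": ["Muon_pt", "Muon_eta", "Muon_phi"]}]
            ),
        }
    ]
}

# deliver() runs the transformation server-side, caches the result,
# and returns local paths to the skimmed output files per sample.
files = deliver(spec)

# Load the skimmed columns into an Awkward Array for columnar analysis.
events = uproot.concatenate([f"{path}:Events" for path in files["zmumu"]])
print(ak.num(events.Muon_pt, axis=1))  # muon multiplicity per event
```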
We leverage scalable, distributed SQL query engines such as Trino to join disparate columns sourced from multiple files, without restrictions on relative row ordering. Replacing many customized datasets, whose contents largely overlap, with smaller, unique sets of information that can be joined on demand with common central data reduces duplication. Caching these results keeps the cost of subsequent retrieval low, fitting well with modern physics analysis paradigms.
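As a rough sketch of the join step, the example below uses the `trino` Python client to combine central event columns with a small, analysis-specific table of machine learning scores, keyed on the (run, luminosityBlock, event) triple so that no relative row ordering is needed. The host, catalog, schema, and table names are hypothetical assumptions for illustration.

```python
# Sketch of joining analysis-specific columns with common central data in
# Trino, assuming the `trino` Python client; the host, catalog, schema,
# and table names are hypothetical.
import trino

conn = trino.dbapi.connect(
    host="trino.example.org",  # hypothetical Trino coordinator
    port=8080,
    user="analyst",
    catalog="hive",
    schema="physics",
)
cur = conn.cursor()

# Join a small analysis-specific table of ML scores onto central columns.
# The (run, luminosityBlock, event) triple uniquely identifies an event,
# so the two tables need not share any row ordering.
cur.execute("""
    SELECT c.run, c.luminosityBlock, c.event,
           c.Muon_pt, c.Muon_eta,
           s.ml_score
    FROM central_nanoaod AS c
    JOIN analysis_ml_scores AS s
      ON  c.run = s.run
      AND c.luminosityBlock = s.luminosityBlock
      AND c.event = s.event
    WHERE s.ml_score > 0.9
""")
rows = cur.fetchall()
```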