14–17 Jul 2025
Seattle, Washington
US/Pacific timezone

Using Commodity Data Tools in LEGEND-1000

15 Jul 2025, 11:50
20m
Seattle, Washington

University of Washington

Speaker

Isaac Kenneth Kunen

Description

The current phase of the LEGEND neutrinoless double-beta decay search, LEGEND-200, holds its primary experimental data in a customized HDF5 format. This requires the team to build and maintain a significant custom data access layer that lies outside the team’s core physics mission and expertise, and the performance and complexity of that system affect both the data production pipelines and analysis of the data.

Multi-petabyte data sets like those LEGEND will amass used to be outliers but are now common in industry, and the database community has produced a wealth of tools for dealing with them. For the future phase of the project, LEGEND-1000, we’re exploring how leveraging these tools can improve performance and functionality while reducing cost to the team.

In this discussion, we’ll give an overview of our early work in three areas: vanilla Parquet with Hive partitioning (and possibly Iceberg) for storage; off-the-shelf data access and coordination systems usable from Python, such as DuckDB and PySpark, to process and query data; and standard OCI containers to simplify deployment across environments.
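
As a rough illustration of what Hive partitioning means for the storage layout, the sketch below builds and parses Hive-style paths (directories named key=value) using only the Python standard library. The partition keys period and run, the data root, and the helper names are hypothetical; the abstract does not specify a schema. Engines such as DuckDB and PySpark understand this directory convention and can skip non-matching directories when a query filters on a partition key, which the prune function imitates here.

```python
from pathlib import PurePosixPath


def hive_path(root, partitions, filename):
    """Build a Hive-style path: root/key1=value1/key2=value2/filename."""
    parts = [f"{key}={value}" for key, value in partitions.items()]
    return PurePosixPath(root, *parts, filename)


def parse_partitions(path):
    """Recover partition key/value pairs from the directory names of a path."""
    result = {}
    # Skip the last component, which is the file name itself.
    for segment in PurePosixPath(path).parts[:-1]:
        if "=" in segment:
            key, value = segment.split("=", 1)
            result[key] = value
    return result


def prune(paths, **wanted):
    """Keep only files whose partition values match every requested key."""
    return [
        p for p in paths
        if all(parse_partitions(p).get(k) == v for k, v in wanted.items())
    ]


# Hypothetical partition keys and values, chosen only for illustration.
files = [
    hive_path("data", {"period": "p01", "run": "r001"}, "part-0.parquet"),
    hive_path("data", {"period": "p01", "run": "r002"}, "part-0.parquet"),
    hive_path("data", {"period": "p02", "run": "r001"}, "part-0.parquet"),
]

# Select files for one period without opening any of them.
selected = prune(files, period="p01")
for path in selected:
    print(path)
```

The point of the layout is that the selection above needs only path names, so a query restricted to one period never has to touch files belonging to the others.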

Author

Isaac Kenneth Kunen
