Description
High Energy Physics analyses frequently rely on large-scale datasets stored in ROOT format, while modern machine learning workflows are increasingly built around PyTorch and its data pipeline abstractions. This disconnect between domain-specific storage and general-purpose ML frameworks creates a barrier to efficient end-to-end workflows.
We introduce F9columnar (https://pypi.org/project/f9columnar/), a lightweight Python package that bridges ROOT, HDF5, and PyTorch. The package provides dedicated data loader classes for both ROOT and HDF5 file formats that integrate natively with PyTorch’s Dataset and DataLoader interfaces, enabling physicists to stream columnar data directly into training pipelines built with PyTorch or PyTorch Lightning.
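To illustrate the general streaming pattern, the following is a minimal sketch of a chunked columnar source wrapped in a PyTorch IterableDataset. It is not F9columnar's API: `iter_chunks` and `ColumnarChunkDataset` are hypothetical names, and the reader yields random numbers where a real loader would read columns from a ROOT or HDF5 file.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, IterableDataset


def iter_chunks(columns, n_events, chunk_size):
    """Hypothetical stand-in for a columnar file reader.

    Yields one dict of NumPy arrays per chunk; a real reader would
    fill these arrays from a ROOT or HDF5 file instead of an RNG.
    """
    rng = np.random.default_rng(0)
    for start in range(0, n_events, chunk_size):
        size = min(chunk_size, n_events - start)
        yield {name: rng.normal(size=size).astype(np.float32) for name in columns}


class ColumnarChunkDataset(IterableDataset):
    """Streams chunks of events as (n_events_in_chunk, n_columns) tensors."""

    def __init__(self, columns, n_events, chunk_size):
        self.columns = list(columns)
        self.n_events = n_events
        self.chunk_size = chunk_size

    def __iter__(self):
        for chunk in iter_chunks(self.columns, self.n_events, self.chunk_size):
            # Stack the selected columns into one feature matrix per chunk.
            features = np.stack([chunk[name] for name in self.columns], axis=1)
            yield torch.from_numpy(features)


dataset = ColumnarChunkDataset(["pt", "eta", "phi"], n_events=10, chunk_size=4)
# batch_size=None disables automatic batching: each chunk is already a batch.
loader = DataLoader(dataset, batch_size=None)
shapes = [tuple(batch.shape) for batch in loader]
```

Treating each chunk as a ready-made batch keeps the reads contiguous, which suits columnar formats. With multiple workers, the dataset would also need to split the chunks between workers, which this sketch omits.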
Beyond integrated PyTorch I/O, F9columnar offers optimized parallel writing and shuffling of events to HDF5 datasets, facilitating efficient data preparation for large-scale training. It also introduces a DAG-based pipeline framework that allows users to compose custom data flows and seamlessly integrate them into the PyTorch DataLoader, supporting flexible and modular data processing workflows.
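The idea of a DAG-based pipeline can be sketched with the Python standard library alone. This is an illustrative toy, not F9columnar's implementation: the `Pipeline` class, its `add` and `run` methods, and the node names are all hypothetical, and events are plain dictionaries rather than Awkward Arrays. It assumes Python 3.9 or newer for `graphlib`.

```python
from graphlib import TopologicalSorter


class Pipeline:
    """Toy DAG of named processing steps (hypothetical, for illustration)."""

    def __init__(self):
        self._funcs = {}
        self._deps = {}

    def add(self, name, func, depends_on=()):
        # Register a node together with the nodes whose outputs it consumes.
        self._funcs[name] = func
        self._deps[name] = tuple(depends_on)
        return self

    def run(self, events):
        results = {}
        # static_order() yields every node after all of its dependencies.
        for name in TopologicalSorter(self._deps).static_order():
            # Nodes without dependencies receive the raw events.
            inputs = [results[dep] for dep in self._deps[name]] or [events]
            results[name] = self._funcs[name](*inputs)
        return results


events = [{"pt": 10.0}, {"pt": 25.0}, {"pt": 40.0}]

pipeline = (
    Pipeline()
    .add("select", lambda evs: [e for e in evs if e["pt"] > 20.0])
    .add("pt_values", lambda sel: [e["pt"] for e in sel], depends_on=["select"])
    .add("n_selected", lambda sel: len(sel), depends_on=["select"])
    .add("mean_pt", lambda pts, n: sum(pts) / n, depends_on=["pt_values", "n_selected"])
)

results = pipeline.run(events)
```

Because the execution order is derived from the declared dependencies, steps can be registered in any order and reused across pipelines. Calling `run` on each chunk inside a dataset's `__iter__` would be one way to embed such a graph in a DataLoader.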
By building on the existing Python HEP ecosystem, notably Awkward Array and uproot, F9columnar creates a natural bridge to modern machine learning frameworks, lowering the barrier to applying ML techniques in physics and enabling more efficient and reproducible workflows.