PyHEP 2025 - "Python in HEP" Users Workshop (hybrid), CERN

Europe/Zurich
Ianna Osborne (Princeton University), Manfred Peter Fackeldey (Princeton University (US)), Marcel Rieger (Hamburg University (DE)), Matthew Feickert (University of Wisconsin Madison (US)), Nikolai Krug (Ludwig Maximilians Universitat (DE))
Description

The PyHEP workshops are a series of workshops initiated and supported by the HEP Software Foundation (HSF) with the aim of providing an environment to discuss and promote the usage of Python in the HEP community at large. Further information is given on the PyHEP Working Group website.

PyHEP 2025 will be a hybrid workshop, held in this format for the first time. While the event has traditionally taken place online, this edition will welcome participants both virtually and in person at CERN. The workshop will serve as a forum for participants and the wider community to discuss developments in Python packages and tools, share experiences, and help shape the future of community activities. Ample time will be dedicated to open discussions.

The agenda is composed of plenary sessions: topical sessions with three types of presentations (see the call for abstracts).
 
Registration is closed. There are no workshop fees.
 
You are encouraged to join the PyHEP WG Gitter channel and/or the HSF forum to receive further information concerning the organisation of the workshop. Workshop updates and information will also be shared by email and on the workshop Twitter account, @PyHEPConf.
 

Organising Committee

Peter Fackeldey - Princeton University
Ianna Osborne - Princeton University
Nikolai Krug - Ludwig Maximilian University of Munich
Marcel Rieger - Hamburg University
Matthew Feickert - University of Wisconsin-Madison                     

 

Sponsors

The event is kindly sponsored by [sponsor logos].

    • Plenary Session Monday (1) 40/S2-B01 - Salle Bohr (CERN)
      Convener: Ianna Osborne (Princeton University)
      • 1
        Welcome and workshop overview
        Speaker: Ianna Osborne (Princeton University)
      • 2
        ROOT's Newest Pythonizations: UHI, RDataFrame and More

        The ROOT software package features automatic and dynamic Python bindings that provide access to its powerful and performant C++ core. With the growing adoption of Python in the HEP community, ROOT continues to evolve to offer a more intuitive and Pythonic user experience.

        Recent developments make key components of the framework more accessible and interoperable from Python. This includes full support of the Unified Histogram Interface (UHI) for ROOT histograms, enhanced functionality in RDataFrame (ROOT's high-level interface for data analysis) enabling the execution of user-defined Python functions through Numba-based JIT compilation, as well as work on streamlining the conda and pip distributions.

        This contribution demonstrates these new capabilities and discusses ongoing efforts toward a more Pythonic framework for the benefit of the HEP community.
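
        A minimal sketch of the UHI-style access this enables (assuming a recent ROOT build that ships these pythonizations; names follow the UHI PlottableHistogram protocol):

            import ROOT

            h = ROOT.TH1D("h", "demo", 10, 0.0, 1.0)
            h.Fill(0.3)

            # UHI plotting protocol: bin contents as a NumPy array
            vals = h.values()
            # UHI indexing: bin contents addressed like array elements
            first_bin = h[0]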

        Speaker: Silia Taider (CERN)
      • 3
        RNTuple and Uproot

        RNTuple is a new columnar data storage format with a variety of improvements over TTree. The first stable version of the specification became available earlier this year, so the transition to RNTuple has now begun. Thanks to the format's modern and simple design, the Uproot Python library aims to provide much better support for reading and writing RNTuples than it did for TTrees. Uproot already offers full support for reading any RNTuple into an Awkward Array, and for writing any Awkward Array into an RNTuple.

        In this talk, I will briefly introduce the RNTuple format and its benefits, demonstrate how to use the Uproot library to read and write RNTuple data, and discuss current capabilities, limitations, and future work that will be done to support as much of the RNTuple specification as possible.
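
        A minimal sketch of the reading path (the file and object names are assumptions):

            import uproot

            with uproot.open("events.root") as f:
                rntuple = f["Events"]       # an RNTuple object, accessed much like a TTree
                events = rntuple.arrays()   # materialized as an awkward.Array
                print(events.fields)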

        Speaker: Andres Rios-Tascon (Princeton University)
      • 4
        uproot-custom: customize branch data reading in Uproot

        uproot-custom is an extension of Uproot that allows users to define custom behaviors when reading branch data from ROOT files. This capability is particularly useful when handling classes with overloaded Streamer methods or when specific data transformations are required during reading. Implemented in a mix of C++ and Python, uproot-custom offers both high performance and seamless integration with Uproot components.

        Speaker: Mingrun Li (IHEP, CAS)
      • 5
        Simulations, Post-Processing and Visualisations of Detector Cooling Systems

        CO2-based two-phase pumped loop systems are now the de facto solution for detector cooling at CERN. The scope of these systems grows ever larger, and with it, so does the complexity of the underlying technology.

        For the past decade and a half, MATLAB has been our one-stop shop for simulations, post-processing, data analyses and data visualisations. Recently, we have begun a piecemeal migration of some of this rich toolbox to Python. Python allows easy extraction of data from NXCALS, including from SWAN, and offers rich plotting libraries.

        In this talk, we present the different tools we use to simulate component behaviour, extract and post-process data, and conduct visualisations for plotting and understanding system performance.

        Speaker: Viren Bhanot (CERN)
      • 6
        ISpy NanoAOD: An event display for the NanoAOD format of the CMS Experiment

        The CMS Experiment introduced a new lightweight format for physics analysis, NanoAOD, during Run 2. Stored as ROOT TTrees, NanoAOD can be read directly with ROOT or with Python libraries such as uproot. Current CMS event displays rely on the larger MiniAOD data tier, which requires CMS-specific software and resources and includes information not available in NanoAOD.
        ISpy NanoAOD is a prototype Python package that provides interactive 3D visualization of NanoAOD event content within Jupyter notebooks. Built on uproot, awkward, and pythreejs (a Python–three.js bridge for Jupyter widgets), it offers lightweight, synoptic views of events directly from NanoAOD.

        Speaker: Thomas McCauley (University of Notre Dame (US))
      • 7
        PyLHE in 2025: New features and improvements

        The PyLHE library (Python LHE interface) has seen major improvements since 2024. Recent releases introduced LHE file writing (v0.9.0) and extended event weight support for POWHEG (v0.8.0). Event weights, when available, are now included in the output Awkward Arrays, and systematic tests are performed using LHE files from widely used general-purpose Monte Carlo event generators. In addition to bug fixes, new Sphinx-based documentation has been created. These enhancements make PyLHE more powerful, reliable, and user-friendly for high-energy physics workflows.

        https://github.com/scikit-hep/pylhe
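
        A short sketch of the reading path, following the package README (the file name is an assumption):

            import pylhe

            # Read an LHE file, keeping per-event attributes such as weights,
            # then convert the events to an Awkward Array for columnar analysis.
            events = pylhe.read_lhe_with_attributes("events.lhe")
            arr = pylhe.to_awkward(events)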

        Speaker: Alexander Puck Neuwirth (UNIMIB & INFN)
      • 8
        Coffea schemas modifications

        This talk presents the results of my recent project as an IRIS-HEP fellow, in which I worked on improving the coffea schemas by simplifying how they work internally. The project eventually transitioned into creating a new package, separate from coffea, that contains all the simplified schemas; coffea will eventually use these instead of its old schemas. The new package is named 'zipper', and both the package and its documentation are available online.

        Speaker: Maksym Naumchyk
    • 16:00
      Break
    • Plenary Session Monday (2) 40/S2-B01 - Salle Bohr (CERN)
      Convener: Matthew Feickert (University of Wisconsin Madison (US))
      • 9
        Reproducible reuse by default: Use of Pixi for software in (Py)HEP

        While advancements in software development practices across particle physics and the adoption of Linux container technology have made a substantial impact on the ease of replicating and reusing analysis software stacks, the underlying software environments are still primarily bespoke builds that lack a full manifest to ensure reproducibility across time. Pixi is a new technology supporting the conda and Python packaging communities that allows for the declarative specification of dependencies across multiple platforms and the automatic creation of fully specified and portable scientific computing environments. This applies to the Python ecosystem, hardware-accelerated software, and the broader HEP and scientific open source ecosystems as well.

        This talk will be structured as a practical, hands-on tutorial that will explore relevant use cases for the PyHEP community as well as provide participants with their own example repository for future reference.

        Speaker: Matthew Feickert (University of Wisconsin Madison (US))
      • 10
        Histogram Serialization

        This talk covers the histogram serialization that has been added to the latest versions of boost-histogram, hist, and uhi. We'll see how you can serialize and deserialize histograms to multiple formats. We'll also look at related recent advancements, such as the new cross-library tests provided in uhi.

        We'll take a deeper look at the new serialization specification developed in UHI and at how other libraries can take advantage of it. We will also discuss some future possibilities for formats beyond the initial three (HDF5, zip, and JSON).
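
        As a rough illustration of the underlying idea (this shows the concept only, not the uhi API): a histogram can be reduced to plain arrays and written to one of the supported formats, such as JSON.

            import json
            import hist

            h = hist.Hist.new.Reg(10, 0, 1, name="x").Double()
            h.fill(x=[0.1, 0.2, 0.25, 0.7])

            # Flatten the histogram into plain data; the uhi specification
            # defines a richer, standardized schema with metadata and storage types.
            payload = {
                "axes": [{"edges": h.axes[0].edges.tolist()}],
                "values": h.values().tolist(),
            }
            text = json.dumps(payload)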

        Speaker: Henry Fredrick Schreiner (Princeton University)
      • 11
        PyTrees for vectors

        PyTrees are a powerful mechanism for working with nested data structures, while allowing algorithms like finite differences, minimization, and integration routines to run on flattened 1D arrays of the same data. The Scikit-HEP vector package recently added pytree support through optree. In this lightning talk, we'll introduce pytrees, show an example of usage, and discuss opportunities for further development.
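
        A rough sketch of the idea (the optree namespace and registration details are assumptions; see the talk for actual usage):

            import numpy as np
            import optree
            import vector

            v = vector.array({"px": np.array([1.0, 2.0]), "py": np.array([0.5, 1.5])})

            # Flatten the structured vector into 1D leaf arrays, operate on the
            # flat data (as a minimizer would), then rebuild the structure.
            leaves, treedef = optree.tree_flatten(v, namespace="vector")
            scaled = [leaf * 2.0 for leaf in leaves]
            v2 = optree.tree_unflatten(treedef, scaled)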

        Speaker: Nick Smith (Fermi National Accelerator Lab. (US))
    • Plenary Session Tuesday (3) 222/R-001 (CERN)
      Convener: Iason Krommydas (Rice University (US))
      • 12
        Up your ML game: an overview of Python ML tools relevant to HEP

        Machine learning is advancing at a breathtaking pace, and navigating the ever-growing ecosystem of Python tools can be time-consuming. This talk offers a practical guide to the ML landscape most relevant to high-energy physics. We discuss:

        • Common ML frameworks including PyTorch, PyTorch Lightning, Keras, JAX, scikit-learn: strengths and weaknesses and how to choose
        • ML workflow tools including Weights & Biases, MLflow, Optuna, b-hive: leveraging tools to improve productivity and code quality
        • Model training and deployment tools including hls4ml, SWAN, HTCondor, ONNX: resources for training and scaling your models
        • Supporting packages and structures including uproot, hist, Awkward Array: bridging HEP and ML workflows
        • Industry tools such as SageMaker, testing and linting, GitHub Actions: the differences between ML in HEP and industry, and what we can learn
        • Fun shortcuts: make LLMs do your work, steal models from Hugging Face, and other fun shenanigans

        Through live demonstrations, we will highlight practical strategies for adopting these tools with minimal friction, helping you up your ML game whether you’re new to the field or already deep in the weeds.

        Speaker: Liv Helen Vage (Princeton University (US))
      • 13
        Zero-overhead ML training from Python with ROOT in an ATLAS Open Data analysis

        The ROOT software framework is widely used from Python in HEP for the storage, processing, analysis and visualization of large datasets. With the large increase in the usage of ML from the Python ecosystem in experiment workflows, especially in the last steps of the analysis pipeline, the matter of exposing ROOT data ergonomically to ML models becomes ever more pressing. In this contribution we discuss the experimental component of ROOT that exposes ROOT datasets in batches ready for the training phase. A new shuffling strategy for creating the batches to prevent biased training is discussed, taking real-life use cases based on ATLAS Open Data as examples.
        An end-to-end ML physics analysis using ATLAS Open Data is carried out to show how a model can be trained with common ML tools directly from ROOT datasets, avoiding intermediate data conversions, streamlining workflows, and supporting the case where the training data does not fit in memory.
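
        For reference, the simplest in-memory route from ROOT to PyTorch today uses RDataFrame's AsNumpy (the tree, file, and branch names below are assumptions); the batch generator discussed in this contribution instead streams shuffled batches lazily, so the dataset never needs to fit in memory:

            import ROOT
            import torch

            df = ROOT.RDataFrame("Events", "opendata.root")
            cols = df.AsNumpy(["lep_pt", "lep_eta"])

            # Assemble a feature tensor for training
            x = torch.stack(
                [torch.from_numpy(cols[c]) for c in ("lep_pt", "lep_eta")], dim=1
            )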

        Speaker: Martin Foll (University of Oslo (NO))
      • 14
        From ROOT to PyTorch: Seamless Data Pipelines with HDF5

        High Energy Physics analyses frequently rely on large-scale datasets stored in ROOT format, while modern machine learning workflows are increasingly built around PyTorch and its data pipeline abstractions. This disconnect between domain-specific storage and general-purpose ML frameworks creates a barrier to efficient end-to-end workflows.

        We introduce F9columnar (https://pypi.org/project/f9columnar/), a lightweight Python package that bridges ROOT, HDF5, and PyTorch. The package provides dedicated data loader classes for both ROOT and HDF5 file formats that integrate natively with PyTorch’s Dataset and DataLoader interfaces, enabling physicists to stream columnar data directly into training pipelines built with PyTorch or PyTorch Lightning.

        Beyond integrated PyTorch I/O, F9columnar offers optimized parallel writing and shuffling of events to HDF5 datasets, facilitating efficient data preparation for large-scale training. It also introduces a DAG-based pipeline framework that allows users to compose custom data flows and seamlessly integrate them into the PyTorch DataLoader, supporting flexible and modular data processing workflows.

        By building on the existing Python HEP ecosystem - notably Awkward Arrays and uproot - F9columnar creates a natural bridge to modern machine learning frameworks, lowering the barrier to applying ML techniques in physics and enabling more efficient and reproducible workflows.

        Speaker: Jan Gavranovic (Jozef Stefan Institute (SI))
      • 15
        Efficient Statistical Modeling for Particle Physics Using Computational Graphs in Python

        Statistical modeling is central to discovery in particle physics, yet the tools commonly used to define, share, and evaluate these models are often complex, fragmented, or tightly coupled to legacy systems. In parallel, the scientific Python community has developed a variety of statistical modeling tools that have been widely adopted for their performance and ease of use, but remain under-utilized in particle physics. We attempt to bridge this gap with a lightweight Python framework that calculates likelihood ratios through the construction and evaluation of computational graphs. With modularity, auto-differentiability, and computational efficiency in mind, we designed the framework to integrate with modern scientific computing ecosystems while providing a clean, well-documented, and extendable API. This implementation makes published particle physics results more transparent, reproducible, and accessible for reanalysis. We present the initial framework, validate its results against established calculations, examine its performance relative to existing systems, and outline future development plans.
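
        A generic illustration of the concept (not this framework's API): a binned Poisson negative log-likelihood expressed as a differentiable computational graph, sketched here with JAX.

            import jax
            import jax.numpy as jnp

            def nll(mu, signal, background, observed):
                # Poisson NLL up to a constant, built as a computational graph
                expected = mu * signal + background
                return jnp.sum(expected - observed * jnp.log(expected))

            signal = jnp.array([5.0, 10.0])
            background = jnp.array([50.0, 60.0])
            observed = jnp.array([53.0, 72.0])

            # Auto-differentiability falls out of the graph structure.
            grad_nll = jax.grad(nll)
            print(grad_nll(1.0, signal, background, observed))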

        This work was supported by the U.S. Department of Energy (DOE) Office of High Energy Physics under Grant No. DE-SC0010107.

        Speaker: Dr Giordon Holtsberg Stark (University of California,Santa Cruz (US))
    • 16:00
      Break
    • Plenary Session Tuesday (4) 222/R-001 (CERN)
      Convener: Iason Krommydas (Rice University (US))
      • 16
        Smart Gradient-Based Optimization of HEP Analyses

        Automatic differentiation, the technique behind modern deep learning, can be applied more broadly in High Energy Physics (HEP) to make entire analysis pipelines differentiable. This enables direct optimization of analysis choices such as selection thresholds, binning strategies, and systematic treatments by propagating gradients through the statistical analysis chain.

        This talk will introduce automatic differentiation for HEP, explaining how it works and how its usefulness extends beyond deep learning. We will highlight the potential and challenges of writing differentiable pipelines in the Scikit-HEP ecosystem, while utilising tools such as JAX and evermore. We will also outline ongoing developments in differentiable particle reconstruction and identification algorithms, placing them in the broader vision of an end-to-end differentiable pipeline.

        The second half will be a tutorial on building differentiable statistical analyses with GRAEP (Gradient-based End-to-End Physics Analysis). We will show how to implement selections, construct differentiable histograms, and perform likelihood inference within a gradient-enabled environment. Examples will demonstrate how gradients can be used to optimize analysis parameters and streamline exploration of strategies compared to traditional approaches.

        All code and examples will be openly available on GitHub for participants to try out after the session.
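
        One ingredient of such pipelines, sketched with JAX (values and names are illustrative): a "soft" histogram in which hard bin assignment is relaxed with sigmoids so gradients can flow through the binning.

            import jax
            import jax.numpy as jnp

            def soft_histogram(x, edges, bandwidth=0.1):
                # Each event contributes fractionally to every bin through
                # sigmoid-smoothed bin edges, making the counts differentiable.
                left = jax.nn.sigmoid((x[:, None] - edges[None, :-1]) / bandwidth)
                right = jax.nn.sigmoid((x[:, None] - edges[None, 1:]) / bandwidth)
                return jnp.sum(left - right, axis=0)

            x = jnp.array([0.2, 0.4, 0.7])
            edges = jnp.linspace(0.0, 1.0, 5)
            counts = soft_histogram(x, edges)
            d_counts_dx = jax.jacobian(soft_histogram)(x, edges)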

        Speakers: Lino Oscar Gerlach (Princeton University (US)), Mohamed Aly (Princeton University (US))
      • 17
        evermore: differentiable binned likelihoods in JAX

        evermore is a software package for statistical inference using likelihood functions of binned data. It fulfils three key concepts: performance, differentiability, and object-oriented statistical model building. evermore is built on JAX, a powerful autodifferentiation Python framework. By making every component in evermore a "PyTree", each component can be jit-compiled (jax.jit), vectorized (jax.vmap) and differentiated (jax.grad). This additionally enables novel computational concepts, such as running thousands of fits simultaneously on a GPU or differentiating through measurements of physical observables.
        We present the key concepts of evermore, show its features, and discuss performance benchmarks with toy datasets.
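
        The "thousands of fits simultaneously" idea, in miniature (generic JAX rather than evermore's own API): vectorize one fit step over many toy datasets with jax.vmap and jit-compile the result.

            import jax
            import jax.numpy as jnp

            def nll(mu, observed):
                expected = mu * jnp.array([5.0, 10.0]) + jnp.array([50.0, 60.0])
                return jnp.sum(expected - observed * jnp.log(expected))

            def newton_step(mu, observed):
                # One Newton update from first and second derivatives.
                g = jax.grad(nll)(mu, observed)
                h = jax.hessian(nll)(mu, observed)
                return mu - g / h

            toys = jnp.array([[53.0, 72.0], [48.0, 65.0], [55.0, 70.0]])
            step = jax.jit(jax.vmap(newton_step, in_axes=(None, 0)))
            mus = step(1.0, toys)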

        Speakers: Felix Philipp Zinn (Rheinisch Westfaelische Tech. Hoch. (DE)), Manfred Peter Fackeldey (Princeton University (US))
      • 18
        Efficient binned profile likelihood maximization with Rabbit

        The High-Luminosity LHC era will deliver unprecedented data volumes, enabling measurements on fine-grained multidimensional histograms containing millions of bins with thousands of events each. Achieving ultimate precision requires modeling thousands of systematic uncertainty sources, creating computational challenges for likelihood maximization and inference. Fast optimization is crucial for efficient analysis development.

        We present a novel TensorFlow-based tool, Rabbit, that leverages optimized parallelization on CPUs and GPUs for this task. We utilize automatic differentiation to compute first- and second-order derivatives, yielding robust and efficient results. We implement nonlinear Poisson profile likelihoods as well as Gaussian approximations, including a fully linearized formalism that yields analytic solutions. Our Python API supports the Unified Histogram Interface and flexible configurations with symmetrization options to establish Gaussian approximations.

        Our tool distinctly focuses on measuring physical observables rather than intrinsic parameters, disentangling the likelihood parameterization from the quantities of interest and creating a more intuitive, less error-prone user experience. Comprehensive benchmarking demonstrates excellent scaling with increased threading and reveals significant efficiency gaps when compared to commonly used frameworks in the field. These performance differences highlight the need for continued development of optimized statistical tools for high-energy physics analyses.

        Speaker: David Walter (Massachusetts Inst. of Technology (US))
    • Plenary Session Wednesday (5) 13/2-005 (CERN)
      Convener: Manfred Peter Fackeldey (Princeton University (US))
      • 19
        FLARE: FCCee b2Luigi Automated Reconstruction And Event processing

        The FCCee b2Luigi Automated Reconstruction And Event processing (FLARE) package is an open-source, Python-based data workflow orchestration tool powered by b2luigi. FLARE automates the workflow of Monte Carlo (MC) generators inside the Key4HEP stack, such as Whizard, MadGraph5_aMC@NLO, Pythia8 and Delphes. FLARE also automates the Future Circular Collider (FCC) Physics Analysis software workflow. These two workflows are naturally combined inside FLARE, allowing a user to have a fully automated pipeline from MC production to final FCCanalysis histograms. With its many customizations and easy-to-use API, FLARE can simplify running FCCee analyses, especially those that require their own MC to be produced via the Key4HEP stack. FLARE also gives HEP researchers interested in the FCC project an easy way to begin FCCee analyses in an automated and controlled way. FLARE is available on PyPI as the hep-flare package.
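
        A minimal sketch of the b2luigi pattern that FLARE builds on (the task and parameter names are illustrative, not FLARE's API):

            import b2luigi

            class GenerateMC(b2luigi.Task):
                nevents = b2luigi.IntParameter()

                def output(self):
                    yield self.add_to_output("events.root")

                def run(self):
                    # Invoke a generator (e.g. Whizard or Pythia8 from the
                    # Key4HEP stack) and write its result to the output target.
                    ...

            if __name__ == "__main__":
                b2luigi.process(GenerateMC(nevents=1000))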

        Speaker: Cameron Harris
      • 20
        PocketCoffea: Configuration Framework for CMS Analyses based on Coffea

        PocketCoffea is an analysis framework based on Coffea for CMS NanoAOD events. It relies on a BaseProcessor class which processes the NanoAOD files in a columnar fashion.
        PocketCoffea defines a Configurator class to handle parameters and analysis workflow configuration, such as dataset definitions, object and event selections, event weights, systematic uncertainties and output histogram characteristics, in a user-friendly way. The user can customize the workflow through a configuration file and/or by redefining steps in a custom workflow derived from the BaseProcessor.
        We present how to set up a simple analysis using PocketCoffea and demonstrate important features, such as the CLI and task workflow structure and the latest utilities to create fit templates and datacards for the CMS combine tool.

        Speaker: Felix Philipp Zinn (Rheinisch Westfaelische Tech. Hoch. (DE))
      • 21
        Workflows via ParaO - Parametrizing Objects

        Luigi is a powerful workflow tool for data analyses. Yet it has some limitations that become quite debilitating in larger and more complex workflows. The PyHEP.dev 2024 talk "waluigi - Beyond luigi" outlined some basic principles and ideas that sought to address these shortcomings. Together with feedback gathered from the community, an implementation is now available:
        ParaO is the name of a new Python package and of its central building block, the parametrized object. Dependencies on one another are expressed as parameters of corresponding type, thus relying on composition instead of inheritance. Beyond this, the means by which specific parameters can be set are vastly expanded, enabling effective use of large and deep graphs. These parameter mechanics also allow transplanting and stitching together different graphs at runtime.
        While the package is still quite young, the majority of the features are already implemented, allowing serious use already. This contribution introduces the core principles behind ParaO and contextualizes some of them with respect to Luigi. A live demonstration shows a broad scope of features, from getting started up to some more advanced patterns.

        Speaker: Benjamin Fischer (RWTH Aachen University (DE))
      • 22
        Scattering Amplitude Reconstruction in Python

        Scattering amplitudes encode the chances of different outcomes when
        particles collide. Calculating them to the precision required by
        current and future colliders is extremely challenging: the
        intermediate steps explode in size and become unwieldy even for modern
        computers. Yet the final answers often turn out to be surprisingly
        simple and efficient to use, if only they can be uncovered.

        In this talk I will present a suite of Python libraries designed to
        achieve precisely this. These include pyadic, which provides $p$-adic
        numbers, finite fields, and interpolation algorithms; syngular, an
        object-oriented extension and interface to the algebraic-geometry
        software Singular; and lips (Lorentz-invariant phase space), which
        generates and manipulates phase-space points across number
        fields. Together, these packages provide the building blocks for
        antares, a framework under development for the automated
        reconstruction of amplitudes from numerical evaluations.
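
        A small taste of the finite-field arithmetic underpinning this workflow (class name as in the pyadic documentation; details are illustrative):

            from pyadic import ModP

            p = 2**31 - 1            # a Mersenne prime modulus
            x = ModP(3, p)
            y = 1 / x                # modular inverse
            assert x * y == ModP(1, p)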

        Speaker: Giuseppe De Laurentis (University of Edinburgh)
    • 16:00
      Break
    • Plenary Session Wednesday (6) 13/2-005 (CERN)
      Convener: Manfred Peter Fackeldey (Princeton University (US))
      • 23
        Pythonic GPU Parallelism for HEP with cuda-cccl

        High-energy physics analyses involve complex computations over large, irregular, nested data structures. Libraries such as Awkward Array have demonstrated that the massive parallelism of GPUs can be applied to accelerate these analyses. However, today this requires significant expertise from both library developers and end users, who must navigate the low-level details of CUDA kernel programming, often writing kernels in Numba or CUDA C++ to get the job done.

        This tutorial introduces the CUDA Core Compute Libraries for Python (cuda-cccl), designed to simplify parallel programming on NVIDIA GPUs. It exposes parallel primitives such as reduce, sort and histogram, and tools to combine them into more complex algorithms. In particular, algorithms can be segmented to achieve efficient event-level parallelism crucial to HEP analyses. These enable Python developers to compose high-performance GPU algorithms that result in efficient, fused kernels, without ever leaving Python or writing low-level CUDA.

        This tutorial is for both maintainers of Python libraries like Awkward and analysts interested in speeding up algorithms for HEP analysis with GPUs. Participants will understand how to use cuda-cccl to develop optimized GPU algorithms purely from Python, and how to make GPU acceleration more accessible and maintainable across the HEP ecosystem.
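
        For contrast, a sketch of the status quo the tutorial aims to replace: a hand-written Numba CUDA histogram kernel (illustrative; cuda-cccl instead composes such operations from ready-made primitives).

            import numpy as np
            from numba import cuda

            @cuda.jit
            def histogram_kernel(x, edges, counts):
                i = cuda.grid(1)
                if i < x.size:
                    # Linear scan over bin edges; fine for a handful of bins.
                    for b in range(edges.size - 1):
                        if edges[b] <= x[i] < edges[b + 1]:
                            cuda.atomic.add(counts, b, 1)
                            break

            x = np.random.rand(10_000).astype(np.float32)
            edges = np.linspace(0.0, 1.0, 11).astype(np.float32)
            counts = np.zeros(10, dtype=np.int32)
            histogram_kernel.forall(x.size)(x, edges, counts)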

        Speaker: Ashwin Srinath
      • 24
        Accelerating scientific Python code with dispatching: Graphs and Arrays

        Modern high-energy physics workflows rely heavily on large-scale computation, where performance bottlenecks often emerge as data sizes grow. This talk explores the dispatching mechanisms incorporated in libraries like NetworkX (graphs), NumPy (arrays) and scikit-image, and how they enable code acceleration across heterogeneous backends such as GPUs, distributed arrays, CPU parallelism, or even different programming languages. We’ll walk through how Python entry points, array protocols like __array_function__, and the Array API standard unify user APIs while dynamically selecting the fastest backend implementation at runtime. A demo will show how existing pure Python code can achieve significant speedups by seamlessly dispatching across GPUs, multiple CPU cores, and HPC environments, or even to an implementation written in a different language, without major code changes. Attendees and maintainers will gain insights into adopting dispatching mechanisms to accelerate existing scientific and HEP workflows with minimal code changes. The talk may also be helpful to those working with large graph or array data.
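
        A minimal sketch of entry-point dispatching in NetworkX (the backend name assumes an installed backend package such as nx-parallel):

            import networkx as nx

            G = nx.erdos_renyi_graph(1000, 0.01)

            # Identical user API; NetworkX routes the call to the requested
            # backend implementation at runtime if one is installed.
            bc = nx.betweenness_centrality(G, backend="parallel")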

        Speaker: Aditi Juneja
      • 25
        Reviving Formulate

        The formulate Python package, first released in 2018, aims to be a translation tool between the C++ expressions used in ROOT and the Python counterparts used in the Scikit-HEP ecosystem. It worked well for simple expressions, but had serious performance issues when expressions were lengthy and complex. Last year, there was an effort to rewrite the package from scratch to solve these performance issues. The first "stable" version 1.0.0 was released, but it had to be yanked due to various parsing issues and inconsistencies. In this talk, I will present my work fixing this newer version and the heavy refactoring I have done to fix and simplify the code. I will briefly showcase the new v1.0 working properly, and mention plans to use this package in other parts of the Scikit-HEP ecosystem.
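
        The intended usage, sketched from the package documentation (treat exact names as illustrative):

            import formulate

            # Parse a ROOT-style selection and emit a numexpr-compatible one.
            expr = formulate.from_root("(pt > 20) && (abs(eta) < 2.5)")
            print(expr.to_numexpr())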

        Speaker: Andres Rios-Tascon (Princeton University)
    • Awkward Array Internals: A Hands-On Hacking Session 31/3-004 - IT Amphitheatre (CERN)
      Convener: Ianna Osborne (Princeton University)
      • 26
        Awkward Array Internals: A Hands-On Hacking Session

        We’re planning a hands-on session to explore Awkward Array’s internals, contribute to development, or just learn how it works.

        Vote for what you’d like to focus on: GitHub poll link

        Options include array internals, performance hacks, GPU/Numba integration, extending Awkward, debugging, interoperability, or just learning the basics.

        Speaker: Ianna Osborne (Princeton University)
    • Plenary Session Thursday 222/R-001 (CERN)
      Convener: Manfred Peter Fackeldey (Princeton University (US))
      • 27
        Coffea Framework: Current Status and Recent Updates

        This tutorial will provide a comprehensive introduction to the current state of Coffea (Columnar Object Framework for Effective Analysis), focusing on its transition to virtual arrays as the primary backend for efficient HEP data processing. With the introduction of Awkward Array's Virtual Arrays feature, Coffea now offers lazy data loading capabilities that dramatically reduce memory consumption while maintaining the familiar, user-friendly analysis syntax that physicists expect.

        The tutorial will begin with an introduction to columnar analysis concepts, demonstrating how Coffea enables physicists to work with complex, nested particle physics data using familiar NumPy-like operations. Through interactive Jupyter examples, participants will learn to structure typical HEP analyses, from basic event selection to histogramming.

        A key focus will be the seamless migration path from Coffea 0.7 to the current virtual arrays implementation. Attendees will see how existing analysis code requires minimal modifications, often none at all, to benefit from lazy loading capabilities that dramatically reduce memory consumption while maintaining computational efficiency.

        The session will cover advanced optimization techniques including explicit branch preloading for network-efficient data access, workflow tracing to identify required branches for efficient bulk loading, and new checkpointing features for robust, resumable workflows. Practical examples will demonstrate how virtual arrays work transparently behind familiar analysis patterns, allowing physicists to focus on physics rather than data management details.
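
        A minimal sketch of the entry point (the file, tree, and schema below are assumptions; with virtual arrays, branches are only read when touched):

            from coffea.nanoevents import NanoEventsFactory, NanoAODSchema

            events = NanoEventsFactory.from_root(
                {"nano.root": "Events"},
                schemaclass=NanoAODSchema,
            ).events()

            # Touching a branch triggers its (lazy) load from the file.
            good_muons = events.Muon[events.Muon.pt > 20]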

        Speaker: Iason Krommydas (Rice University (US))
      • 28
        (Super)powering ntuple analysis with coffea for ATLAS

        ATLAS analysis in Run 2 was chaotic. In Run 3 and beyond, ATLAS has started to consolidate to a few common frameworks that are maintained more centrally. The top two most popular analysis frameworks are currently TopCPToolkit and easyjet. Both are configurable with YAML; the former is part of ATLAS's offline software (athena), while the latter is developed primarily for use by Higgs/di-Higgs physics groups and has started to see adoption outside of those groups.

        In both cases, these frameworks analyze (D)AODs and can produce ntuple outputs. These ntuple outputs have similar patterned structures that can be represented by a coffea schema to power columnar analysis efforts within the ATLAS collaboration.

        This talk will advertise atlas-schema and discuss some of the key UI/UX benefits that this provides over using a more generic schema from coffea. Some of these features include easier interpretation of enum-like data (such as truth classification bits), additional utilities for common kinematic algorithms or selections, and an organized interface for accessing systematic variations within these ntuples. As this work is under active development, feature requests from users are welcome in order to aid their analysis workflows.

        Speaker: Dr Giordon Holtsberg Stark (University of California,Santa Cruz (US))
      • 29
        Wrangling Massive Task Graphs with Dynamic Hierarchical Composition

        Data analysis in High Energy Physics is constrained by the scalability of systems that rely on a single, static workflow graph. This representation is rigid, struggles with overhead when applied to workflows involving large data, and can be slow to construct (such as with Dask). To overcome this, we introduce Dynamic Data Reduction (DDR), built upon a common pattern in event processing, sketched below: an analysis function is applied to event chunks, followed by a commutative and associative reduction operation. Recognizing this property allows us to decouple decisions about data chunking and result accumulation from the global workflow definition.
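
        The pattern in miniature (plain Python for illustration): per-chunk analysis followed by a commutative, associative combine, which is what lets chunking and accumulation decisions be made independently of the global workflow.

            from functools import reduce

            def analyze(chunk):
                # Per-chunk analysis producing a partial result
                # (histograms, counters, ...).
                return {"n": len(chunk), "sum_pt": sum(chunk)}

            def combine(a, b):
                # Commutative and associative accumulation of partial results.
                return {key: a[key] + b[key] for key in a}

            chunks = [[10.0, 22.5], [31.0], [5.5, 7.0, 40.0]]
            result = reduce(combine, map(analyze, chunks))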

        DDR implements this through a hierarchical and dynamic composition of tasks, separating cluster-level and node-level concerns. For coffea applications, this means we flip the generation of events: there is one event factory per chunk at the execution nodes, rather than one factory for the whole workflow, deferring resource decisions until execution time. The scheduler manages distribution to the cluster using an abstract workflow representation, while tasks for computation on the node are generated on demand and in parallel before execution. This approach defers parallelization settings, making execution adaptive to resources.

        We use Cortado, a skimming coffea application, for empirical validation. This workflow, involving 14 terabytes of data and 12 billion events, proved intractable for static graph methods, often failing after ∼20 hours of graph generation. DDR, however, reliably completed the entire analysis in only ∼5.5 hours.

        Speaker: Benjamin Tovar Lopez (University of Notre Dame)
      • 30
        Workshop close-out
        Speaker: Ianna Osborne (Princeton University)