PyHEP.dev 2025 - "Python in HEP" Developer's Workshop

US/Pacific
Seattle, Washington

University of Washington
Ianna Osborne (Princeton University), Manfred Peter Fackeldey (Princeton University (US)), Marcel Rieger (Hamburg University (DE)), Matthew Feickert (University of Wisconsin Madison (US)), Nikolai Krug (Ludwig Maximilians Universitat (DE))
Description

PyHEP.dev is an in-person, informal workshop for developers of Python software in HEP to plan a coherent roadmap and set priorities for the upcoming year. It complements the online PyHEP Users workshop, which is intended for both developers and physicists.

Both PyHEP workshops are supported by the HEP Software Foundation (HSF). Further information is on the PyHEP Working Group website.

You are encouraged to join the PyHEP WG Gitter channel and/or the HSF forum to receive further information concerning the organisation of the workshop.

PyHEP.dev 2025 shared live notes

Program

The agenda will consist of morning kick-off talks and afternoon discussions, in which the discussion groups and topics are self-assigned. Pre-workshop organization is happening via GitHub Issues.
 
The following topics can be expected for discussions and hacking sessions:
  • Statistical tooling
  • Scaling HEP analyses with dask & plans for dask-awkward
  • Autodifferentiation of HEP analyses (clarifying plans & scope)
  • Serialization: Histograms (UHI), RNTuple, HS3
  • Workflows for HEP analyses

Organising Committee

Peter Fackeldey - Princeton University
Ianna Osborne - Princeton University
Nikolai Krug - Ludwig Maximilian University of Munich
Marcel Rieger - Hamburg University
Matthew Feickert - University of Wisconsin-Madison

Local Organising Committee

Gordon Watts - University of Washington

 

This event is also kindly sponsored by the Python Software Foundation.

 

Zoom Meeting ID
61061661309
Host
Manfred Peter Fackeldey
Alternative host
Peter Elmer
    • 08:30
      Coffee
    • 1
      Welcome
      Speakers: Gordon Watts (University of Washington (US)), Manfred Peter Fackeldey (Princeton University (US))
    • Talks: Introductions
    • 10:00
      Coffee
    • Talks: Deep Dive
      • 14
        Awkward Array: The Swiss Army Knife of Irregular Data (and Still a Little Awkward)

        Awkward Array is a stable and widely used Python library for working with nested, variable-length, and irregular data — the kind of data that traditional NumPy arrays can’t easily handle. Originally developed for high-energy physics, it has grown into a reliable tool for many fields beyond HEP.

        Today, Awkward Array offers strong integration with libraries like NumPy, Numba, JAX, and GPU backends — to name a few. It’s fast, flexible, and lets scientists work with complex data structures in a clear, efficient way.

        But there’s still more we can do. This talk will give a short update on the current status of Awkward Array, recent improvements, and how it fits into the broader scientific Python ecosystem. We’ll also discuss ideas for the future: better JAX and GPU support, simpler APIs (or ones that align better with array standards), and stronger connections to other scientific libraries. Even though Awkward Array is already a solid, stable tool, we want to keep making it better — and we invite the community to help guide where it goes next.
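
        As a minimal illustration of the kind of ragged data Awkward Array handles (the event/particle numbers below are made up for the example):

        import awkward as ak
        import numpy as np

        # A ragged "events" array: each event holds a variable number of particle pts.
        pts = ak.Array([[42.1, 18.3, 7.7], [], [95.0, 12.5]])

        # Columnar operations work across the irregular structure without Python loops.
        n_particles = ak.num(pts)       # [3, 0, 2]
        leading = ak.max(pts, axis=1)   # [42.1, None, 95.0]; empty events give None
        selected = pts[pts > 20.0]      # keeps the ragged structure: [[42.1], [], [95.0]]

        # Interoperability: hand flat views back to NumPy once regularity is restored.
        flat = ak.to_numpy(ak.flatten(pts))
        print(n_particles.tolist(), flat.shape)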

        Speaker: Ianna Osborne (Princeton University)
      • 15
        Lazy Data Loading with "Virtual Arrays" in Awkward

        High-energy physics (HEP) analyses frequently manage massive datasets that surpass available computing resources, requiring specialized techniques for efficient data handling. Awkward Array, a widely adopted Python library in the HEP community, effectively manages complex, irregularly structured ("ragged") data by mapping flat arrays into nested structures that intuitively represent physical objects like particles and their associated properties. Typically, analyses utilize only specific subsets of these objects and properties, presenting an important opportunity to reduce memory usage through lazy data loading strategies.

        In this presentation, we will introduce and delve into Awkward Array's newly developed "Virtual Arrays" feature, explicitly designed for lazy loading of data buffers. Instead of immediately loading entire datasets into memory, Virtual Arrays defer data retrieval from disk until explicitly requested by computation. We will discuss in greater detail the underlying architecture, design considerations, and practical implementation of Virtual Arrays, highlighting their integration into analytical workflows.

        We will illustrate how developers and analysts can seamlessly incorporate lazy data loading into their existing frameworks using Coffea—the Columnar Object Framework For Effective Analysis. Coffea facilitates efficient event data processing through columnar operations and transparently scales computations from personal laptops to extensive distributed computing environments without modifications to analysis code. Real-world examples from high-energy physics, including selective data processing and efficient histogramming, will underscore the technical implications and significant performance improvements provided by Virtual Arrays, accelerating data-intensive analysis and enhancing computational efficiency in collider experiments.
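
        The core idea can be sketched without the real Awkward/Uproot machinery: a buffer is represented by a zero-argument callable that is only invoked when a computation actually needs the data. The class and loader functions below are purely illustrative and are not the Virtual Arrays API.

        import numpy as np

        class DeferredBuffer:
            """Illustrative stand-in for a lazily materialized data buffer."""

            def __init__(self, load):
                self._load = load        # zero-argument callable standing in for a disk read
                self._data = None        # nothing is held in memory yet

            def materialize(self):
                if self._data is None:   # the first access triggers the (simulated) read
                    self._data = self._load()
                return self._data

        def read_pt():   # in a real workflow this would decompress bytes from a ROOT file
            return np.random.exponential(30.0, size=1_000_000)

        def read_eta():  # this branch is never requested below, so it is never "read"
            return np.random.normal(0.0, 2.0, size=1_000_000)

        buffers = {"pt": DeferredBuffer(read_pt), "eta": DeferredBuffer(read_eta)}

        mean_pt = buffers["pt"].materialize().mean()   # only the 'pt' buffer is materialized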

        Speaker: Iason Krommydas (Rice University (US))
      • 16
        Towards rapid and efficient columnar-based analyses at scale

        As we pursue new physics at the LHC, the challenge of efficiently analyzing our rapidly mounting data volumes will continue to grow. This talk will describe the development and benchmarking of a realistic columnar-based end-user analysis workflow (for skimming Run 2 + Run 3 scale data with the Coffea framework) in order to characterize the current capabilities and understand bottlenecks as we scale towards HL-LHC data volumes. This talk will also discuss how the execution of columnar operations can be accelerated with GPUs, studying the performance with a set of benchmark queries and discussing paths towards running a full-scale columnar analysis on GPUs.
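
        For context, the flavor of columnar operation being benchmarked looks roughly like the following (toy data in place of real events); the last line sketches how the same array operations can be moved to a GPU, assuming an Awkward Array installation with the CUDA backend:

        import awkward as ak
        import numpy as np

        # Toy stand-in for an events collection: a ragged array of jet pts per event.
        rng = np.random.default_rng(0)
        counts = rng.poisson(4, size=100_000)
        jet_pt = ak.unflatten(rng.exponential(40.0, size=int(counts.sum())), counts)

        # A typical columnar "query": select events with >= 2 jets above 30 GeV
        # and compute the scalar sum of the selected jet pts per event.
        good_jets = jet_pt[jet_pt > 30.0]
        mask = ak.num(good_jets) >= 2
        ht = ak.sum(good_jets[mask], axis=1)

        # The same operations can run on a GPU by switching backends
        # (requires the CUDA backend; uncomment on a GPU node).
        # jet_pt_gpu = ak.to_backend(jet_pt, "cuda")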

        Speaker: Kelci Ann Mohrman (University of Florida (US))
    • 11:15
      Coffee
    • Talks: Deep Dive
      • 17
        rootfilespec

        The rootfilespec package is designed to efficiently parse ROOT file binary data into Python data structures. It does not drive I/O and expects materialized byte buffers as input. It also does not return any types beyond Python dataclasses of primitive types (and NumPy arrays thereof). The goal of the project is to provide a stable and feature-complete read/write backend for packages such as uproot.
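
        To make the "dataclasses over materialized buffers" design concrete, here is a toy parser in the same spirit; the record layout is invented for illustration and is not part of the ROOT binary format or the rootfilespec API:

        import struct
        from dataclasses import dataclass

        @dataclass
        class ToyHeader:
            """A made-up fixed-size record: a version (uint16) and an entry count (uint32)."""
            version: int
            num_entries: int

        def parse_toy_header(buffer: bytes, offset: int = 0) -> tuple[ToyHeader, int]:
            # The parser never performs I/O: it only interprets bytes it is handed.
            version, num_entries = struct.unpack_from(">HI", buffer, offset)
            return ToyHeader(version, num_entries), offset + struct.calcsize(">HI")

        # The caller is responsible for materializing the bytes (local read, XRootD, ...).
        raw = struct.pack(">HI", 3, 1234)
        header, next_offset = parse_toy_header(raw)
        print(header, next_offset)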

        Speaker: Nick Smith (Fermi National Accelerator Lab. (US))
      • 18
        RNTuple support in Scikit-HEP

        RNTuple is a new columnar data storage format with a variety of improvements over TTree. The first stable version of the specification became available in ROOT 6.34, at the beginning of the year. Thus, we have entered the transition period in which our software migrates from TTrees to RNTuples. The Uproot Python library has stayed at the forefront of this transition and already has fairly comprehensive support for reading and writing RNTuples. In this talk, I will briefly introduce the RNTuple format and its benefits, demonstrate how to use Uproot to read and write RNTuple data, and discuss current capabilities, limitations, and future work to support RNTuples in the rest of the Scikit-HEP ecosystem.
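
        Reading an RNTuple with Uproot is intended to feel like reading a TTree. A minimal sketch, assuming a file containing an RNTuple named "events" (file and field names are placeholders, and the attribute names follow Uproot's TTree-like interface):

        import uproot

        with uproot.open("example.root") as f:   # hypothetical file
            events = f["events"]                 # assumed to be an RNTuple

            print(events.num_entries)            # metadata is available without reading payloads

            arrays = events.arrays()             # read the fields into Awkward Arrays,
                                                 # just as with a TTree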

        Speaker: Andres Rios-Tascon (Princeton University)
      • 19
        Accelerating binned likelihood fits in HEP with JAX

        Binned likelihoods (and optimizations thereof) in HEP offer various parallelization opportunities. This talk discusses those opportunities and how they can be implemented using the JAX package. Finally, the evermore package is presented as a showcase that already enables those optimizations with JAX.
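
        A minimal sketch of the kind of computation being parallelized (not the evermore API): a binned Poisson likelihood with a single signal-strength parameter, written directly in JAX so that gradients and JIT compilation come for free:

        import jax
        import jax.numpy as jnp

        # Toy binned model: expected counts are mu * signal + background.
        signal = jnp.array([5.0, 10.0, 3.0])
        background = jnp.array([50.0, 52.0, 48.0])
        observed = jnp.array([57.0, 60.0, 51.0])

        def nll(mu):
            """Poisson negative log-likelihood (constant terms dropped)."""
            expected = mu * signal + background
            return jnp.sum(expected - observed * jnp.log(expected))

        # Differentiate and compile; the same pattern vectorizes with jax.vmap
        # over many toy datasets or systematic variations.
        grad_nll = jax.jit(jax.grad(nll))
        print(nll(1.0), grad_nll(1.0))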

        Speaker: Manfred Peter Fackeldey (Princeton University (US))
    • 12:30
      Lunch
    • 15:00
      Coffee
    • Hacking
    • 08:30
      Coffee
    • Talks: Introductions
      • 20
        Introduction: Roger Janusiak
        Speaker: Roger Janusiak (University of Washington)
      • 21
        Introduction: Massimiliano Galli
        Speaker: Massimiliano Galli (Princeton University (US))
      • 22
        Introduction: Nikolai Krug
        Speaker: Nikolai Krug (Ludwig Maximilians Universitat (DE))
      • 23
        Introduction: Isaac Kunen
        Speaker: Isaac Kenneth Kunen
      • 24
        Introduction: Leon Lin
        Speaker: Yuan-Ru Lin (University of Washington (US))
      • 25
        Introduction: Dennis Daniel Nick Noll
        Speaker: Dennis Daniel Nick Noll (Lawrence Berkeley National Lab (US))
      • 26
        Introduction: George Marshall
        Speaker: George Marshall (University of Washington)
      • 27
        Introduction: Henry Fredrick Schreiner
        Speaker: Henry Fredrick Schreiner (Princeton University)
      • 28
        Introduction: Jonas Eschle
        Speaker: Jonas Eschle
      • 29
        Introduction: Saheed Oyeniran
        Speaker: Saheed Oyeniran (University of New Mexico)
      • 30
        Introduction: Mason Proffitt
        Speaker: Mason Proffitt (University of Washington (US))
      • 31
        Introduction: Matthew Feickert
        Speaker: Matthew Feickert (University of Wisconsin Madison (US))
    • 10:00
      Coffee
    • Talks: Deep Dive
      • 32
        HEP Packaging Coordination: Reproducible reuse by default

        While advancements in software development practices across particle physics and the adoption of Linux container technology have had a substantial impact on the ease of replicating and reusing analysis software stacks, the underlying software environments are still primarily bespoke builds that lack a full manifest to ensure reproducibility across time. The HEP Packaging Coordination community project is bootstrapping packaging of the broader community ecosystem on conda-forge. This process covers multi-platform packaging from low-level-language phenomenology tools, to the broader simulation stack, to end-user analysis tools, and the reinterpretation ecosystem. When combined with next-generation scientific package management and manifest tools, the creation of fully specified, portable, and trivially reproducible environments becomes easy and fast, even with the use of hardware accelerators. This ongoing process significantly lowers technical barriers across tool development, distribution, and use, and when combined with public data products provides a transparent system for full analysis reinterpretation and reuse.

        This also represents an opportunity for the PyHEP community to ensure that Pythonic community tooling can be robustly distributed for multiple computing platforms across Python package indexes (e.g. PyPI), conda package indexes (e.g. conda-forge), and as bespoke overlays through CVMFS. Supporting these distribution methods will allow Analysis Facility managers and end-user physicists alike to build complex scientific computing environments with confidence in their stability and speed.

        Speaker: Matthew Feickert (University of Wisconsin Madison (US))
      • 33
        Histogram Serialization

        This talk covers histogram serialization development. We'll take a look at the new serialization specification being developed in UHI, we'll look at how libraries can be developed to support serialization (such as boost-histogram), and work through some examples.

        This is intended to be an introduction to serialization so that it can be a hackathon/sprint target later.
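
        As a warm-up for the sprint, here is one way to round-trip a boost-histogram object through plain JSON; the field names are an ad-hoc layout for illustration, not the UHI serialization schema under discussion:

        import json
        import numpy as np
        import boost_histogram as bh

        # Build and fill a simple 1D histogram.
        h = bh.Histogram(bh.axis.Regular(10, 0.0, 1.0))
        h.fill(np.random.uniform(0.0, 1.0, 10_000))

        # Reduce it to plain lists so the standard-library json module can handle it.
        payload = {
            "axes": [{"type": "regular", "edges": h.axes[0].edges.tolist()}],
            "values": h.values().tolist(),
        }
        text = json.dumps(payload)

        # Round-trip: rebuild a histogram with the same binning and contents.
        loaded = json.loads(text)
        h2 = bh.Histogram(bh.axis.Variable(loaded["axes"][0]["edges"]))
        h2[...] = np.array(loaded["values"])
        assert np.allclose(h.values(), h2.values())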

        Speaker: Henry Fredrick Schreiner (Princeton University)
      • 34
        News and overview of the fitting ecosystem

        In this talk, I plan to give an informal overview of the current fitting ecosystem used in HEP (mainly pyhf, zfit, hepstats, evermore, ...). The talk covers current efforts, needs, future plans, and challenges, and discusses model building, inference, optimization, serialization, interchange, and backends.
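
        To anchor the discussion, a minimal model-plus-inference round trip with pyhf (toy numbers) is shown below; the other packages in the ecosystem cover the same model building, fitting, and inference steps with different interfaces and backends:

        import pyhf

        # Single-channel counting model: signal plus background with an uncertainty.
        model = pyhf.simplemodels.uncorrelated_background(
            signal=[5.0, 10.0], bkg=[50.0, 52.0], bkg_uncertainty=[7.0, 7.2]
        )
        data = [57.0, 62.0] + model.config.auxdata

        # Maximum-likelihood fit and a CLs hypothesis test for the signal-strength POI.
        best_fit = pyhf.infer.mle.fit(data, model)
        cls_obs = pyhf.infer.hypotest(1.0, data, model, test_stat="qtilde")
        print(best_fit, cls_obs)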

        Speaker: Jonas Eschle
    • 11:15
      Coffee
    • Talks: Deep Dive
      • 35
        The LEGEND-200 Analysis Framework

        The LEGEND Collaboration has developed a fully Python-based framework for its data analysis and processing. The framework comprises five main packages: lgdo for handling the data objects, dspeed for fast digital signal processing, pygama for the calibration and optimisation routines, pylegendmeta for handling metadata/configs, and legend-dataflow, which uses Snakemake to manage the data processing. In the last year this software was used to repeatedly process ~100 TB of data to produce the first LEGEND-200 result. This talk will present the implementation and performance of this software stack.

        Speaker: George Marshall
      • 36
        Using Commodity Data Tools in LEGEND-1000

        The current phase of the LEGEND neutrinoless double-beta decay search, LEGEND-200, holds its primary experimental data in a customized HDF5 format. This requires the team to build and maintain a significant custom data access layer that lies outside the team’s core physics mission and expertise, and the performance and complexity of the system impact both data production pipelines and analysis of the data.

        Multi-petabyte data sets like those LEGEND will amass used to be outliers, but are now common in industry, and the database community has produced a wealth of tools for dealing with them. For the future phase of the project, LEGEND-1000, we’re exploring how we can improve performance and functionality, while reducing cost to the team by leveraging these tools.

        In this discussion, we’ll give an overview of our early work to use vanilla Parquet in conjunction with Hive partitioning (and possibly Iceberg) for storage, off-the-shelf data access and coordination systems in Python such as DuckDB and PySpark to process and query data, and standard OCI containers to simplify deployment across environments.
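
        A sketch of the "commodity tools" workflow, with invented paths and column names: Hive-partitioned Parquet files on disk, queried directly from Python with DuckDB:

        import duckdb

        # Hypothetical layout: data/period=p10/run=123/events.parquet, partitioned
        # by 'period' and 'run'; DuckDB recovers the partition columns from the paths.
        con = duckdb.connect()
        result = con.sql(
            """
            SELECT run, count(*) AS n_events, avg(energy_kev) AS mean_energy
            FROM read_parquet('data/**/*.parquet', hive_partitioning = true)
            WHERE period = 'p10' AND energy_kev BETWEEN 1900 AND 2100
            GROUP BY run
            ORDER BY run
            """
        )
        df = result.df()   # hand the aggregated result to pandas for further analysis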

        Speaker: Isaac Kenneth Kunen
    • 12:30
      Lunch
    • Hacking
    • 08:30
      Coffee
    • Talks: Introductions
    • 10:00
      Coffee
    • Talks: Deep Dive
      • 47
        Static Compilation in Julia -- using FHist.jl as an example

        In the past year, development in Julia has led to the ability to statically compile small (relative to the full runtime and LLVM) binaries.

        In this presentation we quickly go over its basic principle and its challenges, and demonstrate a proof-of-concept binding to FHist.jl.

        Finally, we discuss potential future uses as well as ongoing development in larger projects, such as JetReconstruction.jl, that involve static compilation.

        Speaker: Jerry 🦑 Ling (Harvard University (US))
      • 48
        FlexCAST: Enabling Flexible Scientific Data Analyses

        The development of scientific data analyses is a resource-intensive process that often yields results with untapped potential for reuse and reinterpretation. In many cases, a developed analysis can be used to measure more than it was designed for, by changing its input data or parametrization. Building on the RECAST approach, which enables the reinterpretation of a physics analysis in the context of high-energy physics to variations in a part of the input data, namely a specific signal model, we introduce FlexCAST, an approach that allows for changes in the entire input data and parametrization. FlexCAST is based on three core principles: modularity, validity, and robustness. Modularity enables the input data and parametrization of the analysis to change, while validity ensures that the obtained results remain meaningful, and robustness ensures that as many configurations as possible yield meaningful results. While not being limited to data-driven machine learning techniques, FlexCAST is particularly valuable for the reinterpretation of analyses in this context, where changes in input data can significantly impact the parametrization of the analysis. Using a state-of-the-art anomaly detection analysis on LHC-like data, we showcase FlexCAST's core principles and implementation and demonstrate how it can expand the reach of scientific data analysis through flexible reuse and reinterpretation.

        Speaker: Dennis Daniel Nick Noll (Lawrence Berkeley National Lab (US))
      • 49
        Common interface for end-of-analysis statistics with general PyTree operations

        Statistical procedures at the end stages of analysis, such as hypothesis testing, likelihood scans, and pull plots, are currently implemented across multiple Python packages, yet lack interoperability despite performing similar functions once the log-likelihood is constructed. We present a contribution to HEPStats of the Scikit-HEP ecosystem to provide a common interface for these final stages of analysis. Any combination of log-likelihood and parameter objects adhering to a minimal interface becomes compatible with HEPStats and gains access to a comprehensive suite of tools supporting both frequentist and asymptotic inference. Internally, generality is achieved by being able to handle model parameters and data provided in nearly arbitrary Python data structures. We introduce a novel approach by representing these structures as PyTrees, enabling automatic traversal, fitting, and tracking of parameters of interest without requiring custom logic for each data type. Any nesting of common Python objects such as lists, dicts, and NamedTuples is recognized natively as a PyTree, and additional types can be registered internally to extend functionality without sacrificing generality. These tree operations are efficiently implemented using the optree package, offering performance benefits over manual traversals. The talk will demonstrate how this approach streamlines statistical inference in HEP statistical workflows and its implementation with PyTrees.
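
        The PyTree idea in one snippet: parameters kept in an arbitrary nesting of dicts, lists, and tuples are flattened to leaves, updated, and rebuilt without any per-type traversal code (the parameter layout is made up; the optree calls are real):

        import optree

        # Model parameters in whatever structure the user prefers.
        params = {
            "signal": {"mu": 1.0},
            "background": [{"norm": 50.0}, {"norm": 48.0}],
            "nuisance": (0.0, 0.0),
        }

        # Flatten to a list of leaves plus a structure object ("treedef").
        leaves, treedef = optree.tree_flatten(params)

        # A fitter only ever sees the flat leaves; results are mapped back afterwards.
        shifted = [x + 0.1 for x in leaves]
        updated = optree.tree_unflatten(treedef, shifted)

        # Elementwise operations over the whole structure in one call.
        doubled = optree.tree_map(lambda x: 2 * x, params)
        print(updated["signal"]["mu"], doubled["background"][0]["norm"])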

        Speaker: Max Zhao (Princeton University (US))
    • 11:15
      Coffee
    • Talks: Deep Dive
      • 50
        Graph Me If You Can: Modern Python Meets HEP Statistical Models

        Statistical tooling in the scientific Python ecosystem continues to advance, while at the same time ROOT has recently adopted the HEP Statistics Serialization Standard (HS3) as the way of serializing RooWorkspaces for any probability model that has been built. There is a gap between packages such as jax and scipy.stats and what HS3 provides. This is where pyhs3 comes in: a modern Python implementation of HS3 designed with modern scientific Python development practices. Prioritizing a developer-friendly interface and cross-platform compatibility, pyhs3 provides a Python-callable function built from the computational graph encoded in serialized HS3 probability models.

        The goal of this effort is to facilitate existing efforts in statistical inference (pyhf, zfit, cabinetry) and auto-differentiability (neos, MadJax, evermore, relaxed) by providing a common core for bidirectional translation of HS3-compatible workspaces.

        We’ll discuss the design of the library, how the pieces are defined, how to extend or contribute to it, and a proof of concept with a real-world workspace from the ATLAS $HH\to bb\gamma\gamma$ analysis. The talk presents the pyhs3 package as a step towards a common 'inference API', providing implementations of many mathematical probability distributions common in HEP.

        Speaker: Dr Giordon Holtsberg Stark (University of California,Santa Cruz (US))
      • 51
        A tool for unbinned frequentist inference for quasi-background free searches

        Current statistical inference tools in high-energy physics typically focus on binned analyses and often use asymptotic approximations to draw statistical inferences. However, present and future neutrinoless double beta decay experiments, such as the Large Enriched Germanium Experiment for Neutrinoless ββ Decay (LEGEND), operate in a quasi-background free regime, where the expected number of background counts in the signal region is less than or close to one. Due to the well-established peak shape and good energy resolution of these experiments, an unbinned frequentist analysis is used to maximize the power of the statistical analysis.

        For the first physics analysis of LEGEND-200 [1], a new Python-based tool (freqfit) for conducting unbinned frequentist inference was created [2], making heavy use of the existing iminuit package. This tool builds up test statistic distributions through Monte Carlo pseudoexperiments, enabling frequentist inference in the non-asymptotic, low-statistics regime in which LEGEND and other experiments operate. By allowing for user-defined likelihoods, freqfit is applicable for a broad class of experiments, not only neutrinoless double beta decay. This talk will discuss the development of freqfit, including the computing and mathematical challenges encountered, and its application to LEGEND data.

        [1] H. Acharya et al., arXiv:2505.10440.
        [2] L. Varriano, S. Borden, G. Song, CJ Nave, Y.-R. Lin, & J. Detwiler. (2025). cenpa/freqfit: https://github.com/cenpa/freqfit
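
        The underlying recipe (Monte Carlo pseudoexperiments to build the test-statistic distribution instead of relying on asymptotic formulae) can be sketched with iminuit alone; the Gaussian-peak-on-flat-background toy below is invented for illustration and is far simpler than the LEGEND likelihood and the freqfit interface:

        import numpy as np
        from iminuit import Minuit

        rng = np.random.default_rng(1)
        E_LO, E_HI = 1930.0, 2190.0     # invented analysis window (keV)
        PEAK, SIGMA = 2039.0, 1.2       # invented peak position and resolution

        def gauss(x, mu, sigma):
            return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

        def nll(s, b, data):
            """Extended unbinned NLL: Gaussian peak (s counts) on a flat background (b counts)."""
            density = s * gauss(data, PEAK, SIGMA) + b / (E_HI - E_LO)
            return (s + b) - np.sum(np.log(density))

        def t0(data):
            """Discovery test statistic: likelihood ratio between s = 0 and the best fit."""
            m = Minuit(lambda s, b: nll(s, b, data), s=1.0, b=1.0)
            m.limits = [(0.0, None), (1e-9, None)]
            m.migrad()
            m0 = Minuit(lambda b: nll(0.0, b, data), b=1.0)
            m0.limits = [(1e-9, None)]
            m0.migrad()
            return 2.0 * (m0.fval - m.fval)

        # Test-statistic distribution from background-only pseudoexperiments; in this
        # low-statistics regime the asymptotic chi-squared approximation is not reliable.
        toys = [t0(rng.uniform(E_LO, E_HI, size=rng.poisson(1.0))) for _ in range(200)]

        observed = t0(np.array([2039.4, 2105.0]))   # an invented "observed" dataset
        p_value = np.mean(np.array(toys) >= observed)
        print(p_value)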

        Speaker: Sam Borden (University of Washington, CENPA)
      • 52
        Packaging Collaboration Offline Software

        Is it possible for all individual collaboration software to be packaged and maintained on conda-forge? There are many caveats involved, from non-technical aspects such as licensing and usage to technical aspects such as cross-compilation, the larger number of dependencies, and configuration / parallel releases, all of which may make this challenging. The collaborations I am thinking about include ATLAS, CMS, LHCb, ALICE, as well as Belle II, DUNE, EIC, FCC, and so on. Could it be impactful to improve the ability to preserve the code in a state that is reusable and allows for reproducible physics, especially if the analysis code depends on this offline software?

        Speaker: Dr Giordon Holtsberg Stark (University of California,Santa Cruz (US))
    • 12:30
      Lunch
    • 15:00
      Coffee
    • Hackathon
    • Dinner Vista Cafe

      Vista Cafe

      Located just down the road from the physics department. Google Maps has it as well!