PyHEP.dev 2025 - "Python in HEP" Developer's Workshop
Seattle, Washington
PyHEP.dev is an in-person, informal workshop for developers of Python software in HEP to plan a coherent roadmap and set priorities for the upcoming year. It complements the PyHEP Users online workshop, which is intended for both developers and physicists.
Both PyHEP workshops are supported by the HEP Software Foundation (HSF). Further information is on the PyHEP Working Group website.
PyHEP.dev 2025 shared live notes
Program
- Statistical tooling
- Scaling HEP analyses with dask & plans for dask-awkward
- Autodifferentiation of HEP analyses (clarifying plans & scope)
- Serialization: Histograms (UHI), RNTuple, HS3
- Workflows for HEP analyses
Organising Committee
Peter Fackeldey - Princeton University
Ianna Osborne - Princeton University
Nikolai Krug - Ludwig Maximilian University of Munich
Marcel Rieger - Hamburg University
Matthew Feickert - University of Wisconsin-Madison
Local Organising Committee
Gordon Watts - University of Washington
This event is also kindly sponsored by the Python Software Foundation.
-
-
08:30
Coffee
-
1
Welcome
Speakers: Gordon Watts (University of Washington (US)), Manfred Peter Fackeldey (Princeton University (US))
-
Talks: Introductions
-
2
Introduction: Andres Rios-Tascon
Speaker: Andres Rios-Tascon (Princeton University)
-
3
Introduction: Samantha Abbott
Speaker: Samantha Abbott (University of California Davis (US))
-
4
Introduction: Peter Fackeldey
Speaker: Manfred Peter Fackeldey (Princeton University (US))
-
5
Introduction: Elise Chavez
Speaker: Elise Chavez (University of Wisconsin Madison (US))
-
6
Introduction: Nick Smith
Speaker: Nick Smith (Fermi National Accelerator Lab. (US))
-
7
Introduction: Artur Cordeiro Oudot Choi
Speaker: Artur Cordeiro Oudot Choi (University of Washington (US))
-
8
Introduction: Sari Damen
Speaker: Sari Damen (University of Kocaeli)
-
9
Introduction: Kelci Mohrman
Speaker: Kelci Ann Mohrman (University of Florida (US))
- 10
- 11
-
12
Introduction: Iason Krommydas
Speaker: Iason Krommydas (Rice University (US))
-
13
Introduction: Lindsey Gray
Speaker: Lindsey Gray (Fermi National Accelerator Lab. (US))
-
-
10:00
Coffee
-
Talks: Deep Dive
-
14
Awkward Array: The Swiss Army Knife of Irregular Data (and Still a Little Awkward)
Awkward Array is a stable and widely used Python library for working with nested, variable-length, and irregular data — the kind of data that traditional NumPy arrays can’t easily handle. Originally developed for high-energy physics, it has grown into a reliable tool for many fields beyond HEP.
Today, Awkward Array offers strong integration with libraries like NumPy, Numba, JAX, and GPU backends — to name a few. It’s fast, flexible, and lets scientists work with complex data structures in a clear, efficient way.
But there’s still more we can do. This talk will give a short update on the current status of Awkward Array, recent improvements, and how it fits into the broader scientific Python ecosystem. We’ll also discuss ideas for the future: better JAX and GPU support, simpler APIs (or ones that align better with array standards), and stronger connections to other scientific libraries. Even though Awkward Array is already a solid, stable tool, we want to keep making it better — and we invite the community to help guide where it goes next.
Speaker: Ianna Osborne (Princeton University)
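A minimal illustrative sketch (not from the talk) of the kind of ragged, nested data Awkward Array handles directly, where a rectangular NumPy array would not work:

```python
import awkward as ak
import numpy as np

# Events with a variable number of particles per event.
events = ak.Array(
    [
        {"pt": [54.2, 23.1, 12.8], "eta": [0.1, -1.2, 2.3]},
        {"pt": [31.5], "eta": [0.9]},
        {"pt": [], "eta": []},
    ]
)

# Columnar operations broadcast over the ragged structure.
leading_pt = ak.firsts(events.pt)                 # first particle pt per event (None if empty)
central = events.pt[np.abs(events.eta) < 1.5]     # per-event selection on eta
print(ak.to_list(leading_pt))
print(ak.to_list(central))
```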
15
Lazy Data Loading with "Virtual Arrays" in Awkward
High-energy physics (HEP) analyses frequently manage massive datasets that surpass available computing resources, requiring specialized techniques for efficient data handling. Awkward Array, a widely adopted Python library in the HEP community, effectively manages complex, irregularly structured ("ragged") data by mapping flat arrays into nested structures that intuitively represent physical objects like particles and their associated properties. Typically, analyses utilize only specific subsets of these objects and properties, presenting an important opportunity to reduce memory usage through lazy data loading strategies.
In this presentation, we will introduce and delve into Awkward Array's newly developed "Virtual Arrays" feature, explicitly designed for lazy loading of data buffers. Instead of immediately loading entire datasets into memory, Virtual Arrays defer data retrieval from disk until explicitly requested by computation. We will discuss in greater detail the underlying architecture, design considerations, and practical implementation of Virtual Arrays, highlighting their integration into analytical workflows.
We will illustrate how developers and analysts can seamlessly incorporate lazy data loading into their existing frameworks using Coffea—the Columnar Object Framework For Effective Analysis. Coffea facilitates efficient event data processing through columnar operations and transparently scales computations from personal laptops to extensive distributed computing environments without modifications to analysis code. Real-world examples from high-energy physics, including selective data processing and efficient histogramming, will underscore the technical implications and significant performance improvements provided by Virtual Arrays, accelerating data-intensive analysis and enhancing computational efficiency in collider experiments.
Speaker: Iason Krommydas (Rice University (US))
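A conceptual sketch of the lazy-loading idea only; the names here (VirtualBuffer, load_branch) are hypothetical and do not reflect Awkward Array's actual Virtual Arrays API:

```python
import numpy as np

class VirtualBuffer:
    """Defers loading a flat buffer until a computation first touches it."""

    def __init__(self, loader):
        self._loader = loader      # callable that reads the buffer from disk
        self._materialized = None  # filled on first access

    @property
    def data(self):
        if self._materialized is None:
            self._materialized = self._loader()  # I/O happens here, exactly once
        return self._materialized

def load_branch():
    # Stand-in for reading one branch's buffer from a ROOT file.
    return np.arange(1_000_000, dtype=np.float32)

pt = VirtualBuffer(load_branch)
# No I/O has happened yet; only this reduction triggers the read.
print(pt.data.mean())
```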
16
Towards rapid and efficient columnar-based analyses at scale
As we pursue new physics at the LHC, the challenge of efficiently analyzing our rapidly mounting data volumes will continue to grow. This talk will describe the development and benchmarking of a realistic columnar-based end-user analysis workflow (for skimming Run 2 + Run 3 scale data with the Coffea framework) in order to characterize the current capabilities and understand bottlenecks as we scale towards HL-LHC data volumes. This talk will also discuss how the execution of columnar operations can be accelerated with GPUs, studying the performance with a set of benchmark queries and discussing paths towards running a full-scale columnar analysis on GPUs.
Speaker: Kelci Ann Mohrman (University of Florida (US))
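A hedged sketch of moving a columnar operation to the GPU with Awkward Array's CUDA backend; it assumes a CUDA-capable GPU and the optional CuPy dependency, and the talk's actual benchmarking setup may differ:

```python
import awkward as ak

pt = ak.Array([[54.2, 23.1], [31.5], [12.8, 7.4, 5.0]])

pt_gpu = ak.to_backend(pt, "cuda")        # buffers now live on the device
mask = pt_gpu > 20.0                      # kernels run on the GPU
selected = ak.sum(pt_gpu[mask], axis=1)   # per-event sum of the selected pt
print(ak.to_backend(selected, "cpu"))     # bring the small result back to the host
```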
-
-
11:15
Coffee
-
Talks: Deep Dive
-
17
rootfilespec
The rootfilespec package is designed to efficiently parse ROOT file binary data into Python data structures. It does not drive I/O and expects materialized byte buffers as input. It also does not return any types beyond Python dataclasses of primitive types (and NumPy arrays thereof). The goal of the project is to provide a stable and feature-complete read/write backend for packages such as uproot.
Speaker: Nick Smith (Fermi National Accelerator Lab. (US))
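A simplified sketch of the division of labor described above, not rootfilespec's actual API: the caller performs the I/O and hands materialized bytes to a pure parser that returns a plain dataclass of primitive types.

```python
import struct
from dataclasses import dataclass

@dataclass
class FileHeader:
    magic: bytes
    version: int
    begin: int

def parse_file_header(buffer: bytes) -> FileHeader:
    # The first bytes of a ROOT file: 4-byte magic "root", then two big-endian int32s.
    magic, version, begin = struct.unpack(">4sii", buffer[:12])
    return FileHeader(magic=magic, version=version, begin=begin)

# Usage: the caller drives the I/O, e.g.
# with open("example.root", "rb") as f:
#     header = parse_file_header(f.read(12))
```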
18
RNTuple support in Scikit-HEP
RNTuple is a new columnar data storage format with a variety of improvements over TTree. The first stable version of the specification became available in ROOT 6.34, at the beginning of the year. Thus, we have entered the transition period in which our software migrates from TTrees to RNTuples. The Uproot Python library has stayed at the forefront of this transition and already has fairly comprehensive support for reading and writing RNTuples. In this talk, I will briefly introduce the RNTuple format and its benefits, demonstrate how to use Uproot to read and write RNTuple data, and discuss current capabilities, limitations, and future work to support RNTuples in the rest of the Scikit-HEP ecosystem.
Speaker: Andres Rios-Tascon (Princeton University)
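A hedged sketch of reading an RNTuple with Uproot: it assumes a file "example.root" containing an RNTuple named "Events" with a field "pt" (names are illustrative) and a recent Uproot release with RNTuple reading support; the exact call pattern may differ by version.

```python
import uproot

with uproot.open("example.root") as f:
    ntuple = f["Events"]      # RNTuples are looked up by name, like TTrees
    events = ntuple.arrays()  # materialize the fields as an Awkward Array
    print(events["pt"])
```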
19
Accelerating binned Likelihood fits in HEP with JAX
Binned likelihoods (and optimizations thereof) in HEP offer various parallelization opportunities. This talk discusses those opportunities and how they can be implemented using the JAX package. Finally, the evermore package is presented as a showcase that already enables these optimizations with JAX.
Speaker: Manfred Peter Fackeldey (Princeton University (US))
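A minimal sketch (independent of evermore) of a binned Poisson negative log-likelihood in JAX; jit-compilation and automatic differentiation are the hooks that the optimizations above build on. The signal/background numbers are illustrative.

```python
import jax
import jax.numpy as jnp

observed = jnp.array([12.0, 20.0, 7.0])
signal = jnp.array([3.0, 5.0, 2.0])
background = jnp.array([9.0, 14.0, 6.0])

@jax.jit
def nll(mu):
    expected = mu * signal + background
    # Poisson term per bin, summed; data-dependent constants dropped.
    return jnp.sum(expected - observed * jnp.log(expected))

grad_nll = jax.grad(nll)
print(nll(1.0), grad_nll(1.0))
```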
-
-
12:30
Lunch
-
15:00
Coffee
-
Hacking
-
08:30
-
-
08:30
Coffee
-
Talks: Introductions
-
20
Introduction: Roger Janusiak
Speaker: Roger Janusiak (University of Washington)
- 21
-
22
Introduction: Nikolai Krug
Speaker: Nikolai Krug (Ludwig Maximilians Universitat (DE))
- 23
-
24
Introduction: Leon Lin
Speaker: Yuan-Ru Lin (University of Washington (US))
-
25
Introduction: Dennis Daniel Nick Noll
Speaker: Dennis Daniel Nick Noll (Lawrence Berkeley National Lab (US))
-
26
Introduction: George Marshall
Speaker: George Marshall (University of Washington)
-
27
Introduction: Henry Fredrick Schreiner
Speaker: Henry Fredrick Schreiner (Princeton University)
- 28
-
29
Introduction: Saheed Oyeniran
Speaker: Saheed Oyeniran (University of New Mexico)
-
30
Introduction: Mason Proffitt
Speaker: Mason Proffitt (University of Washington (US))
-
31
Introduction: Matthew Feickert
Speaker: Matthew Feickert (University of Wisconsin Madison (US))
-
-
10:00
Coffee
-
Talks: Deep Dive
-
32
HEP Packaging Coordination: Reproducible reuse by default
While advancements in software development practices across particle physics and the adoption of Linux container technology have made a substantial impact on the ease of replicability and reuse of analysis software stacks, the underlying software environments are still primarily bespoke builds that lack a full manifest to ensure reproducibility across time. The HEP Packaging Coordination community project is bootstrapping packaging of the broader community ecosystem on conda-forge. This process covers multi-platform packaging from low-level language phenomenology tools, to the broader simulation stack, to end-user analysis tools, and the reinterpretation ecosystem. When combined with next-generation scientific package management and manifest tools, the creation of fully specified, portable, and trivially reproducible environments becomes easy and fast, even with the use of hardware accelerators. This ongoing process significantly lowers technical barriers across tool development, distribution, and use, and, when combined with public data products, provides a transparent system for full analysis reinterpretation and reuse.
This also represents an opportunity for the PyHEP community to ensure that Pythonic community tooling can be robustly distributed for multiple computing platforms across Python package indexes (e.g. PyPI), conda package indexes (e.g. conda-forge), and as bespoke overlays through CVMFS. Supporting these distribution methods will allow Analysis Facility managers and end-user physicists alike to build complex scientific computing environments with confidence in their stability and speed.
Speaker: Matthew Feickert (University of Wisconsin Madison (US))
33
Histogram Serialization
This talk covers histogram serialization development. We'll take a look at the new serialization specification being developed in UHI, see how libraries such as boost-histogram can add support for serialization, and work through some examples.
This is intended as an introduction to serialization so that it can be a hackathon/sprint target later.
Speaker: Henry Fredrick Schreiner (Princeton University)
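Illustrative only: the hand-rolled dict below is not the UHI serialization schema under discussion, just a sketch of turning a boost-histogram into plain, JSON-serializable data.

```python
import json
import numpy as np
import boost_histogram as bh

h = bh.Histogram(bh.axis.Regular(10, 0.0, 1.0))
h.fill(np.random.default_rng(0).random(1000))

# Reduce the histogram to plain lists/dicts that any serializer can handle.
payload = {
    "axes": [{"type": "regular", "edges": h.axes[0].edges.tolist()}],
    "values": h.values().tolist(),
}
print(json.dumps(payload)[:80], "...")
```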
34
News and overview of the fitting ecosystem
In this talk, I plan to give an informal overview of the current fitting ecosystem used in HEP (mainly pyhf, zfit, hepstats, evermore, ...). The talk covers current efforts, needs, future plans, and challenges, and discusses model building, inference, optimization, serialization, interchange, and backends.
Speaker: Jonas Eschle
-
-
11:15
Coffee
-
Talks: Deep Dive
-
35
The LEGEND-200 Analysis Framework
The LEGEND Collaboration has developed a fully Python-based framework for its data analysis and processing. The framework comprises five main packages: lgdo for handling the data objects, dspeed for fast digital signal processing, pygama for the calibration and optimisation routines, pylegendmeta for handling metadata/configs, and legend-dataflow, which uses snakemake to manage the data processing. In the last year this software was used to repeatedly process ~100 TB of data to produce the first LEGEND-200 result. This talk will present the implementation and performance of this software stack.
Speaker: George Marshall
36
Using Commodity Data Tools in LEGEND-1000
The current phase of the LEGEND neutrinoless double-beta decay search, LEGEND-200, holds its primary experimental data in a customized HDF5 format. This requires the team to build and maintain a significant custom data access layer that lies outside the team's core physics mission and expertise, and the performance and complexity of the system impact both data production pipelines and analysis of the data.
Multi-petabyte data sets like those LEGEND will amass used to be outliers, but are now common in industry, and the database community has produced a wealth of tools for dealing with them. For the future phase of the project, LEGEND-1000, we’re exploring how we can improve performance and functionality, while reducing cost to the team by leveraging these tools.
In this discussion, we'll give an overview of our early work to use vanilla Parquet in conjunction with Hive partitioning (and possibly Iceberg) for storage, off-the-shelf data access and coordination systems in Python like DuckDB and PySpark to process and query data, and standard OCI containers to simplify deployment across environments.
Speaker: Isaac Kenneth Kunen
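A hedged sketch of the query pattern described above; the paths, partition keys, and column names are hypothetical, not LEGEND's actual layout.

```python
import duckdb

# Query hive-partitioned Parquet files directly, without a custom access layer.
query = """
    SELECT run, COUNT(*) AS n_events, AVG(energy) AS mean_energy
    FROM read_parquet('data/run=*/period=*/*.parquet', hive_partitioning = true)
    WHERE energy > 2000.0
    GROUP BY run
    ORDER BY run
"""
print(duckdb.sql(query).df())
```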
-
-
12:30
Lunch
-
Hacking
-
08:30
-
-
08:30
Coffee
-
Talks: Introductions
- 37
-
38
Introduction: Yehyun Choi
Speaker: Yehyun Choi
-
39
Introduction: Giordon Holtsberg Stark
Speaker: Dr Giordon Holtsberg Stark (University of California, Santa Cruz (US))
-
40
Introduction: Louis Varriano
Speaker: Louis Varriano (University of Washington)
-
41
Introduction: Gordon Watts
Speaker: Gordon Watts (University of Washington (US))
-
42
Introduction: Peter Zabback
Speaker: Peter Zabback (UW / CENPA)
- 43
-
44
Introduction: Samuel Borden
Speaker: Sam Borden (University of Washington, CENPA)
-
45
Introduction: Pankaj Kumar Bind
Speaker: Pankaj Kumar Bind (Uka Tarsadia University, India)
- 46
-
10:00
Coffee
-
Talks: Deep Dive
-
47
Static Compilation in Julia -- using FHist.jl as an example
In the past year, development in Julia has led to the ability to statically compile small (relative to the full runtime and LLVM) binaries.
In this presentation we quickly go over the basic principles and challenges, and demonstrate a proof-of-concept binding to FHist.jl.
Finally, we discuss potential future uses as well as ongoing development in larger projects, such as JetReconstruction.jl, that involve static compilation.
Speaker: Jerry 🦑 Ling (Harvard University (US))
48
FlexCAST: Enabling Flexible Scientific Data Analyses
The development of scientific data analyses is a resource-intensive process that often yields results with untapped potential for reuse and reinterpretation. In many cases, a developed analysis can be used to measure more than it was designed for, by changing its input data or parametrization. Building on the RECAST approach, which enables the reinterpretation of a physics analysis in the context of high-energy physics to variations in a part of the input data, namely a specific signal model, we introduce FlexCAST, an approach that allows for changes in the entire input data and parametrization. FlexCAST is based on three core principles: modularity, validity, and robustness. Modularity enables the input data and parametrization of the analysis to change, while validity ensures that the obtained results remain meaningful, and robustness ensures that as many configurations as possible yield meaningful results. While not being limited to data-driven machine learning techniques, FlexCAST is particularly valuable for the reinterpretation of analyses in this context, where changes in input data can significantly impact the parametrization of the analysis. Using a state-of-the-art anomaly detection analysis on LHC-like data, we showcase FlexCAST's core principles and implementation and demonstrate how it can expand the reach of scientific data analysis through flexible reuse and reinterpretation.
Speaker: Dennis Daniel Nick Noll (Lawrence Berkeley National Lab (US))
49
Common interface for end-of-analysis statistics with general PyTree operations
Statistical procedures at the end stages of analysis, such as hypothesis testing, likelihood scans, and pull plots, are currently implemented across multiple Python packages, yet lack interoperability despite performing similar functions once the log-likelihood is constructed. We present a contribution to HEPStats of the Scikit-HEP ecosystem to provide a common interface for these final stages of analysis. Any combination of log-likelihood and parameter objects adhering to a minimal interface becomes compatible with HEPStats and gains access to a comprehensive suite of tools supporting both frequentist and asymptotic inference. Internally, generality is achieved by being able to handle model parameters and data provided in nearly arbitrary Python data structures. We introduce a novel approach by representing these structures as PyTrees, enabling automatic traversal, fitting, and tracking of parameters of interest without requiring custom logic for each data type. Any nesting of common Python objects such as lists, dicts, and NamedTuples is recognized natively as a PyTree, and additional types can be internally registered to extend functionality without sacrificing generality. These tree operations are efficiently implemented using the optree package, offering performance benefits over manual traversals. The talk will demonstrate how this approach streamlines statistical inference in HEP statistical workflows and its implementation with PyTrees.
Speaker: Max Zhao (Princeton University (US))
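A minimal sketch of the PyTree idea with optree; the parameter names and nesting are hypothetical, but this is how a generic fitting interface can traverse user-supplied parameter structures without custom logic per type.

```python
import optree

params = {"signal": {"mu": 1.0}, "background": [0.9, 1.1], "nuisances": (0.0, 0.2)}

# Flatten any nesting of dicts/lists/tuples into a flat list of leaves plus a treedef.
leaves, treedef = optree.tree_flatten(params)

# Transform the leaves and rebuild the original structure.
shifted = optree.tree_unflatten(treedef, [x + 0.1 for x in leaves])

# Or map a function over the tree directly.
scaled = optree.tree_map(lambda x: 2.0 * x, params)

print(shifted)
print(scaled)
```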
-
-
11:15
Coffee
-
Talks: Deep Dive
-
50
Graph Me If You Can: Modern Python Meets HEP Statistical Models
Statistical tooling in the scientific Python ecosystem continues to advance, while at the same time ROOT has recently adopted the HEP Statistics Serialization Standard (HS3) as the way of serializing RooWorkspaces for any probability model that has been built. There is a gap between packages such as jax and scipy.stats and what HS3 provides. This is where pyhs3 comes in: a modern Python implementation of HS3 designed with modern scientific Python development practices. Prioritizing a developer-friendly interface and cross-platform compatibility, pyhs3 provides a Python-callable function built from the computational graph encoded in serialized HS3 probability models.
The goal of this effort is to facilitate existing efforts in statistical inference (pyhf, zfit, cabinetry) and auto-differentiability (neos, MadJax, evermore, relaxed) by providing a common core for bidirectional translation of HS3-compatible workspaces.
We'll discuss the design of the library, how the pieces are defined, how to extend or contribute to it, and a proof-of-concept with a real-world workspace from the ATLAS $HH\to bb\gamma\gamma$ analysis. The talk presents the pyhs3 package as a step towards a common 'inference API' and towards providing implementations of many mathematical probability distributions common in HEP.
Speaker: Dr Giordon Holtsberg Stark (University of California, Santa Cruz (US))
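A hedged sketch, not the pyhs3 API: HS3 workspaces are plain JSON documents, so their structure can be inspected directly. The top-level key names assumed here ("distributions", "name", "type") are based on the HS3 draft and may differ from the final schema.

```python
import json

# "workspace.hs3.json" is a placeholder path for an HS3-serialized probability model.
with open("workspace.hs3.json") as f:
    workspace = json.load(f)

print("top-level components:", sorted(workspace))
for dist in workspace.get("distributions", []):
    print(dist.get("name"), "->", dist.get("type"))
```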
51
A tool for unbinned frequentist inference for quasi-background free searches
Current statistical inference tools in high-energy physics typically focus on binned analyses and often use asymptotic approximations to draw statistical inferences. However, present and future neutrinoless double beta decay experiments, such as the Large Enriched Germanium Experiment for Neutrinoless ββ Decay (LEGEND), operate in a quasi-background free regime, where the expected number of background counts in the signal region is less than or close to one. Due to the well-established peak shape and good energy resolution of these experiments, an unbinned frequentist analysis is used to maximize the power of the statistical analysis.
For the first physics analysis of LEGEND-200 [1], a new Python-based tool (freqfit) for conducting unbinned frequentist inference was created [2], making heavy use of the existing iminuit package. This tool builds up test statistic distributions through Monte Carlo pseudoexperiments, enabling frequentist inference in the non-asymptotic, low-statistics regime in which LEGEND and other experiments operate. By allowing for user-defined likelihoods, freqfit is applicable to a broad class of experiments, not only neutrinoless double beta decay. This talk will discuss the development of freqfit, including the computing and mathematical challenges encountered, and its application to LEGEND data.
[1] H. Acharya et al., arXiv:2505.10440.
[2] L. Varriano, S. Borden, G. Song, CJ Nave, Y.-R. Lin, & J. Detwiler (2025). cenpa/freqfit: https://github.com/cenpa/freqfit
Speaker: Sam Borden (University of Washington, CENPA)
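A minimal sketch, independent of freqfit, of the kind of unbinned fit with iminuit that such a tool builds on; the Gaussian-signal model, data, and parameter values are purely illustrative.

```python
import numpy as np
from iminuit import Minuit
from iminuit.cost import UnbinnedNLL
from scipy.stats import norm

rng = np.random.default_rng(1)
data = rng.normal(2039.0, 1.0, size=20)  # a handful of events near a Qbb-like peak

def pdf(x, mu, sigma):
    # Normalized signal model evaluated at the observed event energies.
    return norm.pdf(x, mu, sigma)

m = Minuit(UnbinnedNLL(data, pdf), mu=2040.0, sigma=1.5)
m.migrad()
print(m.values["mu"], m.values["sigma"])
```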
52
Packaging Collaboration Offline Software
Is it possible for all individual collaboration software to be packaged and maintained on conda-forge? There are lots of caveats involved, ranging from non-technical aspects, including licensing and usage, to technical aspects such as cross-compilation, the large number of dependencies, and the configuration / parallel releases that may make this challenging. The collaborations I am thinking about include ATLAS, CMS, LHCb, ALICE, as well as Belle-II, DUNE, EIC, FCC, and so on. Could it be impactful to improve the ability to preserve the code in a state that could be reusable and allow for reproducible physics? Especially if the analysis code depends on this offline software?
Speaker: Dr Giordon Holtsberg Stark (University of California,Santa Cruz (US))
-
-
12:30
Lunch
-
15:00
Coffee
-
Hackathon
-
Dinner: Vista Cafe
Located just down the road from the physics department. Google Maps has it as well!
-
08:30
-
-
08:30
Coffee
-
Discussion: Summary of Discussion & Hacking Sessions
-
10:30
Coffee
-
53
Farewell
-
08:30