
PyHEP.dev 2024 - "Python in HEP" Developer's Workshop

Europe/Brussels
Aachen, Germany

Erholungs-Gesellschaft, Reihstraße 13, 52062 Aachen
Description

PyHEP.dev is an in-person, informal workshop for developers of Python software in HEP to plan a coherent roadmap and set priorities for the upcoming year. It complements the PyHEP Users online workshop, which is intended for both developers and physicists.

Both PyHEP workshops are supported by the HEP Software Foundation (HSF). Further information is on the PyHEP Working Group website.

The agenda will consist of morning kick-off talks and afternoon discussions, in which the discussion groups and topics are self-assigned. Pre-workshop organization is happening via GitHub Issues.

You are encouraged to join the PyHEP WG Gitter channel and/or the HSF forum to receive further information concerning the organisation of the workshop. Workshop updates and information will also be shared on the workshop's Twitter account, @PyHEPConf, in addition to email.

Organising Committee

Eduardo Rodrigues - University of Liverpool (Chair)
Jim Pivarski - Princeton University
Nikolai Hartmann - Ludwig Maximilian University of Munich
Matthew Feickert - University of Wisconsin-Madison

Local Organising Committee

Peter Fackeldey - RWTH Aachen University & ErUM-Data-Hub
Angela Warkentin - ErUM-Data-Hub

The workshop is sponsored by, and organized in cooperation with, the ErUM-Data-Hub. The ErUM-Data-Hub is the central networking and transfer office for digital transformation in research on universe and matter in Germany and is funded by the German Federal Ministry of Education and Research (BMBF).

 

This event is also kindly sponsored by the Python Software Foundation.

Participants
  • Angela Warkentin
  • Azzah Alshehri
  • Benjamin Fischer
  • Eduardo Rodrigues
  • Jim Pivarski
  • Jonas Eschle
  • Judith Steinfeld
  • Nikolai Hartmann
  • Oksana Shadura
  • Peter Fackeldey
  • plus 16 more participants
    • 08:30
      Coffee
    • 1
    • Kick-off talks
      • 2
        Self-introduction: Eduardo Rodrigues
        Speaker: Eduardo Rodrigues (University of Liverpool (GB))
      • 3
        Self-introduction: Juraj Smiesko
        Speaker: Juraj Smiesko (CERN)
      • 4
        Self-introduction: Jan Bürger
        Speaker: Jan Bürger (ErUM-Data-Hub)
      • 5
        Self-introduction: Jim Pivarski
        Speaker: Jim Pivarski (Princeton University)
      • 6
        Self-introduction: Lino Oscar Gerlach
        Speaker: Lino Oscar Gerlach (Princeton University (US))
      • 7
        Self-introduction: Josue Molina
        Speaker: Josue Molina
      • 8
        Self-introduction: Ianna Osborne
        Speaker: Ianna Osborne (Princeton University)
      • 9
        Self-introduction: Máté Farkas
        Speaker: Mate Farkas (Rheinisch Westfaelische Tech. Hoch. (DE))
      • 10
        Self-introduction: Yaroslav Nikitenko
        Speaker: Yaroslav Nikitenko
      • 11
        Fast end-to-end analysis pipelines for the HL-LHC

        We provide an overview of two ongoing projects that aim to ensure the availability of fast and user-friendly solutions for physics analysis pipelines towards the HL-LHC. The Analysis Grand Challenge (AGC) defines an analysis task that captures relevant physics analysis workflow aspects. A variety of implementations have been developed for this task, which makes it possible to probe user experience and interoperability and helps center community discussions around a common benchmark. We will focus on the reference implementation provided by IRIS-HEP, which makes use of many tools in the Python HEP ecosystem and in particular a stack of Scikit-HEP libraries.

        A second project started in 2024 with a specific focus on achieving very large data throughput in a physics analysis context; it is often referred to by its target of sustaining a "200 Gbps" data rate. The project involved collaboration between many areas of expertise, but we describe its user-facing software aspect, which we built with libraries from Scikit-HEP and the surrounding ecosystem.

        Speakers: Alexander Held (University of Wisconsin Madison (US)), Oksana Shadura (University of Nebraska Lincoln (US))
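        The columnar analysis style used by these implementations can be illustrated with a minimal NumPy sketch. This is not taken from the AGC reference implementation (which uses Scikit-HEP libraries such as awkward and hist for jagged arrays and histograms); the variable names and cut values are purely illustrative:

```python
import numpy as np

# Toy event data: one transverse-momentum value per event (hypothetical numbers).
rng = np.random.default_rng(42)
jet_pt = rng.exponential(scale=50.0, size=10_000)  # GeV

# Columnar selection: a boolean mask applied to all events at once,
# instead of an explicit Python loop over events.
selected = jet_pt[jet_pt > 25.0]

# Histogram the selected column, as a stand-in for a hist.Hist fill.
counts, edges = np.histogram(selected, bins=50, range=(0.0, 500.0))

print(f"{selected.size} of {jet_pt.size} events pass the pT cut")
```

        The same mask-then-fill pattern scales from this toy to real NanoAOD-sized inputs, which is what makes the columnar approach attractive for HL-LHC data volumes.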
    • 11:00
      Coffee break
    • Discussion: What is a HEP analysis? What does PyHEP cover?
    • 12:30
      Lunch
    • Discussion: Challenges - AGC & 200Gbps
    • 15:00
      Coffee break
    • Hacking
    • Welcome Reception & Dinner
    • 08:30
      Coffee
    • Kick-off talks
      • 13
        Self-introduction: Peter Fackeldey
        Speaker: Manfred Peter Fackeldey (RWTH Aachen University (DE))
      • 14
        Self-introduction: Stefan Fröse
        Speaker: Stefan Fröse (ErUM-Data-Hub)
      • 15
        Self-introduction: Matthew Feickert
        Speaker: Matthew Feickert (University of Wisconsin Madison (US))
      • 16
        Self-introduction: Jonas Eschle
        Speaker: Jonas Eschle (Syracuse University (US))
      • 17
        Self-introduction: Alexander Held
        Speaker: Alexander Held (University of Wisconsin Madison (US))
      • 18
        Self-introduction: Giordon Holtsberg Stark
        Speaker: Dr Giordon Holtsberg Stark (University of California, Santa Cruz (US))
      • 19
        Self-introduction: Marcel Rieger
        Speaker: Marcel Rieger (Hamburg University (DE))
      • 20
        Self-introduction: Jonas Eppelt
        Speaker: Jonas Eppelt (Karlsruhe Institute of Technology (KIT))
      • 21
        Self-introduction: Alexander Heidelbach
        Speaker: Alexander Heidelbach
      • 22
        Self-introduction: Vincenzo Eduardo Padulano
        Speaker: Dr Vincenzo Eduardo Padulano (CERN)
      • 23
        An overview of the fitting ecosystem

        This talk will give a broad overview of the fitting that we're doing in
        HEP. On the one hand, the talk will cover the variety of fits in HEP, the
        different needs and types of inference, as well as efforts towards
        serialization and standardization. On the other hand, the relevant
        libraries will be covered, that is zfit, pyhf, hepstats, iminuit, and
        general Python packages like SciPy, and how they work together today, as
        well as future plans and technical considerations.

        Speaker: Jonas Eschle (Syracuse University (US))
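        As a concrete example of the kind of inference these libraries perform, here is a minimal unbinned maximum-likelihood fit of a Gaussian with SciPy. This is a toy sketch with made-up numbers, not the API of zfit or iminuit, which wrap this pattern with much richer model-building and uncertainty machinery:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(loc=91.2, scale=2.5, size=5_000)  # toy resonance sample

def nll(params):
    """Unbinned negative log-likelihood of a Gaussian (constants dropped)."""
    mu, sigma = params
    if sigma <= 0:
        return np.inf  # keep the optimizer out of the unphysical region
    return 0.5 * np.sum(((data - mu) / sigma) ** 2) + data.size * np.log(sigma)

result = minimize(nll, x0=[90.0, 2.0], method="Nelder-Mead")
mu_hat, sigma_hat = result.x
print(f"fitted mu = {mu_hat:.2f}, sigma = {sigma_hat:.2f}")
```

        The HEP fitting libraries discussed in the talk add, on top of this core minimization, composable model definitions, constraint handling, and proper uncertainty estimation.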
      • 24
        evermore: differentiable (binned) likelihoods in JAX

        I'd like to present evermore (https://github.com/pfackeldey/evermore), which focuses on efficiently building and evaluating likelihoods, typically for HEP. Currently, it focuses on binned template fits.
        It supports autodiff, JIT compilation, and vectorization of full fits (even on GPUs).

        Speaker: Manfred Peter Fackeldey (RWTH Aachen University (DE))
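        The kind of binned template likelihood evermore builds can be sketched in NumPy. This is an illustration of the statistical model only, not evermore's actual API (which expresses it in JAX to get autodiff and JIT); all template yields here are invented:

```python
import numpy as np
from scipy.special import gammaln
from scipy.optimize import minimize_scalar

# Hypothetical binned templates: per-bin signal and background expectations.
signal = np.array([2.0, 8.0, 15.0, 8.0, 2.0])
background = np.array([20.0, 18.0, 15.0, 12.0, 10.0])
observed = np.array([23, 30, 33, 22, 12])

def nll(mu):
    """Binned Poisson negative log-likelihood for signal strength mu."""
    expected = mu * signal + background
    return np.sum(expected - observed * np.log(expected) + gammaln(observed + 1))

fit = minimize_scalar(nll, bounds=(0.0, 10.0), method="bounded")
print(f"best-fit signal strength mu = {fit.x:.2f}")
```

        In a JAX formulation, the same `nll` becomes differentiable and JIT-compilable, which is what enables vectorized full fits on GPUs.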
    • 25
      Group photo
    • 11:07
      Coffee break
    • Discussion: Building and evaluating likelihoods
    • 12:30
      Lunch
    • Discussion: Statistical model serialisation
    • 15:00
      Coffee break
    • Hacking
    • 08:30
      Coffee
    • Kick-off talks
      • 26
        Plothist - integrating with other histogram libraries (Remote talk)
        Speakers: Cyrille Praz, Tristan Fillinger (KEK / IPNS)
      • 27
        Self-introduction: Saransh Chopra
        Speaker: Saransh Chopra (Princeton University (US))
      • 28
        Self-introduction: Oksana Shadura
        Speaker: Oksana Shadura (University of Nebraska Lincoln (US))
      • 29
        Self-introduction: Nikolai Hartmann
        Speaker: Nikolai Hartmann (Ludwig Maximilians Universität (DE))
      • 30
        b2luigi — bringing batch 2 luigi!

        Workflow managers help structure the code of pipelined jobs by defining and managing dependencies between tasks in a clear and easy-to-understand fashion. This abstraction allows independent tasks to be parallelised automatically, largely independently of the underlying computing systems. Additionally, workflow managers help keep track of different tasks' outputs and inputs.

        b2luigi is an extension of the workflow manager luigi and offers easy integration with batch systems such as HTCondor and LSF, allowing the combination of different systems within one workflow.

        b2luigi also provides additional interfaces tailored for Belle II workflows, allowing smooth interaction with the Belle II analysis software framework and distributed computing. Workflows such as VIBE, an automated Monte Carlo validation framework, the Systematics Framework, and many Belle II physics analyses have been automated using b2luigi.

        As the current maintainers of b2luigi and Belle II users, we look forward to discussing our experiences and plans for this tool at the PyHEP.dev 2024 workshop.

        Speakers: Alexander Heidelbach, Jonas Eppelt (Karlsruhe Institute of Technology (KIT))
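        The dependency-driven execution that luigi-style managers provide can be sketched in plain Python. This is a toy scheduler illustrating the idea, not b2luigi's actual API; the task names are invented:

```python
# Toy task graph: each task names its dependencies; the "scheduler"
# runs a task only after all of its dependencies have completed.
tasks = {
    "skim": [],
    "reconstruct": ["skim"],
    "histogram": ["reconstruct"],
    "fit": ["histogram"],
    "plot": ["histogram"],
}

def run(name, done, order):
    """Depth-first execution honoring dependencies, skipping completed tasks."""
    if name in done:
        return
    for dep in tasks[name]:
        run(dep, done, order)
    order.append(name)  # "run" the task
    done.add(name)

done, order = set(), []
for target in ("fit", "plot"):
    run(target, done, order)
print(order)  # → ['skim', 'reconstruct', 'histogram', 'fit', 'plot']
```

        What b2luigi adds on top of this core idea is dispatching each task to a batch system (HTCondor, LSF) and tracking the task outputs as completion markers.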
      • 31
        End-to-end workflow automation: updates of the luigi analysis workflow package

        Physicists performing data analyses are usually required to steer their individual, complex workflows manually, frequently involving job submission in several stages and interaction with distributed storage systems by hand. This process is not only time-consuming and error-prone, but also leads to undocumented relations between particular workloads, rendering the steering of an analysis a serious challenge, especially for newcomers to the field. In this presentation, I will demonstrate the main components of the Luigi Analysis Workflow (Law) package, which is developed independently of any experiment or of the language of the executed code. Its core consists of flexible, pythonic workflow descriptions, interfaces to remote batch-job and storage systems, as well as a granular environment-sandboxing mechanism. In the second half, I will highlight the recent key changes to the package driven by requests from a user base that has grown steadily over the past years.

        Speaker: Marcel Rieger (Hamburg University (DE))
      • 32
        waluigi - Beyond luigi

        Workflows for research in HEP experiments are not only quite complex but also require sufficient flexibility to adapt to changes in structure, conditions, methodologies, and research interests. This holds especially true for the physics analyses extracting the results and measurements.
        Here, workflow systems, specifically Luigi, have proven to be of great use for managing and organizing the intricate dependencies of the large task graphs that describe such analyses.
        Still, with intensive use comes insight into where the limitations lie. Now, as adoption of such software is rising, is a good point to start thinking about how to improve upon it.
        I present a list of grievances and an idea to address them, both to be discussed and iterated upon. While the "issues" are specific to the principles within Luigi, the current idea implies the need for a new software package: waluigi (Why Another LUIGI).

        Speaker: Benjamin Fischer (RWTH Aachen University (DE))
      • 33
        offloading @ coffea

        Offloading resource-intensive tasks, e.g.:
        - histogram accumulation (memory-intensive)
        - DL algorithms (compute-intensive)

        Speaker: Benjamin Fischer (RWTH Aachen University (DE))
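        Offloading histogram accumulation amounts to merging partial results computed on separate chunks; with fixed binning, that merge is a plain elementwise sum. The NumPy sketch below illustrates the pattern, not coffea's actual accumulator API:

```python
import numpy as np

rng = np.random.default_rng(7)
bins = np.linspace(0.0, 100.0, 21)

# Each worker histograms its own chunk of events independently.
chunks = [rng.uniform(0.0, 100.0, size=1_000) for _ in range(4)]
partials = [np.histogram(chunk, bins=bins)[0] for chunk in chunks]

# The memory-intensive accumulation step is just an elementwise sum,
# so it can run wherever memory (or a GPU) is available.
total = np.sum(partials, axis=0)
print(total.sum())  # every event landed in exactly one bin
```

        Because the merge is associative and commutative, the partial histograms can be combined in any order, on any machine, which is what makes this step a natural candidate for offloading.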
    • 11:00
      Coffee break
    • Discussion: Workflows
    • 34
      Sustainability in computing
      Speaker: Martin Erdmann (Rheinisch Westfaelische Tech. Hoch. (DE))
    • 12:30
      Lunch
    • Discussion: Histogramming
    • 15:00
      Coffee break
    • Hacking
    • 35
      Workshop Dinner

      Ratskeller Aachen, Markt 40, 52062 Aachen

    • 08:30
      Coffee
    • Kick-off talks
      • 36
        Job openings in the ROOT team
        Speaker: Dr Vincenzo Eduardo Padulano (CERN)
      • 37
        Self-introduction: Benjamin Fischer
        Speaker: Benjamin Fischer (RWTH Aachen University (DE))
      • 38
        Self-introduction: Azzah Alshehri
        Speaker: Azzah Aziz Alshehri (University of Glasgow (GB))
      • 39
        File synchronization between Linux systems in Python with yarsync

        Yet Another Rsync is a Python wrapper around the well-established Linux tool rsync, with the simple and familiar interface of git. Python allows us to create a higher-level instrument that is safer and sometimes more efficient than the original binary.

        While many data analysts today heavily use databases and rely on cloud computing, other approaches also have their benefits. Many kinds of data are difficult or time-consuming to represent in relational databases. Files in a user-defined format then become a simpler and more general solution, which is often less expensive and less error-prone. Linux servers hold a considerable share today, and many data analysts also use Linux as a good programming environment. Our approach is inspired by the data analysis workflow in HEP. We will describe creating data repositories with yarsync, the relevant rsync features, and how the tool helps guard against possible problems in data synchronization.

        Speaker: Yaroslav Nikitenko
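        The wrapper idea can be sketched by assembling an rsync invocation in Python. The helper below is hypothetical (it is not yarsync's actual interface); only standard rsync flags are used:

```python
import shlex

def build_rsync_command(src, dst, dry_run=False):
    """Assemble an rsync command line; safer higher-level defaults
    (archive mode, human-readable output) are baked in by the wrapper."""
    cmd = ["rsync", "--archive", "--verbose", "--human-readable"]
    if dry_run:
        cmd.append("--dry-run")  # preview changes, much like 'git status'
    cmd += [src, dst]
    return cmd

cmd = build_rsync_command("data/", "backup:/srv/data/", dry_run=True)
print(shlex.join(cmd))
```

        Building the argument list in Python (rather than interpolating a shell string) avoids quoting bugs, which is one way a wrapper can be safer than calling the binary by hand.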
      • 40
        Architectural framework for data analysis Lena

        The term "architecture" in software has numerous definitions. Ultimately, architecture determines whether your analysis code will be extensible and maintainable. We propose an architecture based on a functional style and the separation of data, logic, and presentation. It is implemented in the free software framework Lena.

        Lena is a general data analysis framework in Python, named after a great Siberian river. It allows the use of any Python constructs and functions, but structures the analysis into reusable sequences and elements. It natively supports metadata (which is important for modern data analysis). It employs lazy evaluation, which makes it suitable for processing data that would not fit into memory, in particular for big data analysis.

        The talk will be of primary interest to those who write large programs and face architectural challenges, and to those who need to create many similar plots automatically. The audience will gain a powerful tool to make their code structured and beautiful, or an understanding of the strengths and weaknesses of an alternative approach to data analysis in Python.

        Speaker: Yaroslav Nikitenko
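        The lazy, sequence-based style described above can be sketched with plain Python generators. This illustrates the general idea, not Lena's actual element classes:

```python
def read_values():
    """Lazily yield data; nothing is materialized up front."""
    for x in range(1_000_000):
        yield x

def select(flow, predicate):
    """Reusable element: pass through only items satisfying the predicate."""
    for x in flow:
        if predicate(x):
            yield x

def transform(flow, func):
    """Reusable element: apply a function to each item in the flow."""
    for x in flow:
        yield func(x)

# Compose reusable elements into a sequence; data streams through one
# item at a time, so arbitrarily large inputs fit in memory.
pipeline = transform(select(read_values(), lambda x: x % 2 == 0),
                     lambda x: x * x)

first_three = [next(pipeline) for _ in range(3)]
print(first_three)  # → [0, 4, 16]
```

        Because each element is a standalone generator, the same `select` or `transform` can be reused across many analyses, which is the structuring benefit the abstract argues for.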
      • 41
        FCCAnalyses: A Framework for Future Circular Collider Physics Performance Studies

        Physics performance analyses provide essential input for defining detector requirements in the Future Circular Collider (FCC) project. To streamline these analyses, we employ FCCAnalyses, a software framework built on top of ROOT's RDataFrame.

        Among the functionalities offered by FCCAnalyses are:
        * A standard set of RDataFrame EDM4hep functions: events can be analysed directly in the EDM4hep event data format, with the ability to easily work with relationships among the data-model objects.
        * A multi-stage analysis workflow: analyses, split into multiple stages, can be run locally or on CERN's HTCondor cluster.
        * Metadata management: the framework manages metadata associated with centrally produced samples, generated in Delphes fast simulation and Geant4 full simulation.

        The framework can be used equally well from Python and C++, and since the analyser functions are written in C++, they can directly employ any of the High Energy Physics (HEP) C++ frameworks.

        Speaker: Juraj Smiesko (CERN)
      • 42
        Bridging Python and Julia for Enhanced Data Analysis

        Let’s discuss the exciting world of combining Python and Julia for data analysis for high-energy physics (HEP) and other data-intensive fields.

        We'll kick things off with a quick overview of why Python is so popular for data analysis and introduce Julia, which is making waves with its incredible performance and suitability for scientific computing.

        Next, I'll show you how we can get the best of both worlds. We'll talk about using PythonCall to bring Python functions and libraries into Julia and how we can embed Julia code right into our Python scripts using JuliaCall. It's easier than you might think!

        I'll walk you through some practical examples where mixing Python and Julia really shines. We'll look at real-world scenarios and see how this combination can speed up our data analysis and make our work more efficient.

        Of course, there are always some bumps in the road, so I'll share some common challenges you might face and how to overcome them. We'll cover best practices for managing dependencies and keeping everything running smoothly.

        Finally, we'll look ahead to the future. There's so much potential for deeper integration and community-driven innovation. I hope to inspire you to explore these possibilities and collaborate with other developers to push the boundaries of what's possible.

        By the end of this talk, you'll have a good grasp of how to mix Python and Julia in your projects and leverage the strengths of both languages to supercharge your data analysis.

        Speaker: Ianna Osborne (Princeton University)
      • 43
        A Deep Dive into PocketCoffea

        PocketCoffea is a Python columnar analysis framework for CMS NanoAOD events, based on coffea. It provides a workflow for HEP analyses using a combination of customizable abstractions and configuration files. The package features dataset-query automation, jet calibration, data processing, histogramming, and plotting. PocketCoffea also provides out-of-the-box support for code execution on various remote clusters.
        In this talk, a detailed overview of PocketCoffea will be given from both a user's and a technical perspective.

        Speaker: Mate Farkas (Rheinisch Westfaelische Tech. Hoch. (DE))
    • 11:05
      Coffee break
    • Discussion: RDataFrame/coffea analyses (at scale)
    • 12:30
      Lunch
    • Discussion: Future of PyHEP.dev
    • 44
      Sightseeing Tour

      Starting and ending at Erholungsgesellschaft (workshop venue)

    • 18:00
      Optional Dinner at "60 Seconds to Napoli" - feel free to join!

      Markt 17, 52062 Aachen

    • 08:30
      Breakfast
    • 45
      Organizing the paper-writing
    • Discussion: Paper-writing
    • 11:00
      Coffee break
    • Discussion: Paper-writing
      • 11:30
        Coffee break
    • 46
      Close-out
    • 12:30
      Lunch