Status of HS3

Europe/Zurich
Felicia Volle (University of Birmingham (GB)), Jack Y. Araz (City St George's, University of London), Jonas Wurzinger (Technische Universitat Munchen (DE)), Sabine Kraml (LPSC Grenoble), Sezen Sekmen (Kyungpook National University (KR))
Zoom Meeting ID
62939975990
Host
Martin Habedank
Passcode
42295867
    • 14:00 14:05
      Introduction 5m
      Speakers: Felicia Volle (University of Birmingham (GB)), Jack Y. Araz (City St George's, University of London), Jonas Wurzinger (Technische Universitat Munchen (DE)), Sabine Kraml (LPSC Grenoble), Sezen Sekmen (Kyungpook National University (KR))
    • 14:05 14:30
      Data uncertainties in HS3 25m
      Speakers: Andy Buckley (University of Glasgow (GB)), Dr Carsten Burgard (Hamburg University (DE)), Simon Cello (Technische Universitaet Dortmund (DE))
      ## Summary of the HS3 Data Uncertainties Discussion (AI Generated)

      ### Context and Motivating Problem

      Andy Buckley opened by framing the problem broadly: in LHC reinterpretation, the "data" fed into a new-physics likelihood is often not raw event counts but the output of a prior inference step, for example unfolded differential cross-sections, or correlated measurements of EW parameters (m_top, m_W, α_EW). Such datasets carry non-trivial, correlated uncertainties inherited from the original analysis. Any HS3 standard that handles only Poisson counting likelihoods is therefore insufficient; one needs to be able to compose these "pre-inferred" likelihoods with new physics models in a well-defined way.

      Carsten then showed the current HS3 draft, which has three data types (point data, unbinned data, binned data), each with a rudimentary `uncertainty` field. He acknowledged that this had been brainstormed in an afternoon and had no implementation yet. Sabine immediately identified two things the uncertainty field lacked: the **shape** of the distribution and the **level** (1σ, 2σ, etc.). In other words, it was not yet a proper probabilistic object.

      ---

      ### The Two Central Debates

      **1. Should data uncertainties duplicate the distribution machinery, or reference it?**

      Andy proposed encoding data uncertainties via HistFactory-style ±1σ templates plus an interpolation recipe (in the same spirit as `histosys` modifiers). Carsten pointed out that this would reproduce structure already present in the HS3 `distributions` section. His preferred resolution: instead of defining a new uncertainty vocabulary on the data side, the data record should simply **link by name to a distribution already declared in the HS3 file**. Any distribution type the standard supports is then automatically available for data uncertainties, with no duplication. Andy agreed this was the right architecture, confirming that distributions are already a top-level section in HS3.
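
      The link-by-name architecture can be sketched as follows. This is a minimal illustration written as a Python dictionary rather than an actual HS3 file: the `uncertainty` key holding a distribution name, the type strings, and the helper `resolve_uncertainty` are all assumptions made for this sketch and are not taken from the HS3 specification.

```python
# Illustrative HS3-like document as a Python dict. All key names and type
# strings below are assumptions for this sketch, not confirmed HS3 syntax.
hs3_doc = {
    "distributions": [
        {
            "name": "xsec_unc",
            "type": "multivariate_normal_dist",
            "mean": [0.0, 0.0, 0.0],
            "covariances": [
                [4.0, 1.0, 0.0],
                [1.0, 9.0, 2.0],
                [0.0, 2.0, 16.0],
            ],
        }
    ],
    "data": [
        {
            "name": "xsec",
            "type": "binned",
            "contents": [120.0, 85.0, 40.0],
            # Hypothetical: the uncertainty is a name, not an inline definition.
            "uncertainty": "xsec_unc",
        }
    ],
}


def resolve_uncertainty(doc, data_name):
    """Return the distribution object that a data record links to by name."""
    data = next(d for d in doc["data"] if d["name"] == data_name)
    by_name = {dist["name"]: dist for dist in doc["distributions"]}
    try:
        return by_name[data["uncertainty"]]
    except KeyError as err:
        raise KeyError(f"data '{data_name}' has no resolvable uncertainty link") from err


dist = resolve_uncertainty(hs3_doc, "xsec")
print(dist["type"], len(dist["covariances"]))
```

      Because the data record carries only a name, any distribution type that a reader already supports in the `distributions` section would become usable as a data uncertainty without new parsing code.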

      **2. What are the mathematical semantics of a data uncertainty?**

      This occupied most of the discussion (Oliver Schulz, Waltenberger, Andy, Sabine). The key tension was:

      - In a frequentist reading, an uncertainty on data arises from imagined repetitions of auxiliary measurements, and enters the likelihood as a **constraint term** (product likelihood).
      - In a Bayesian reading, the uncertainty represents a **prior** on a parameter that happens to be centred at the observed value.

      Waltenberger argued that these remain two distinct interpretations even if the mathematics is identical. Oliver pressed for unambiguous semantics in the standard, independent of tooling, and then supplied the resolution: provided the uncertainty is expressed as a constraint on explicit shift parameters, both interpretations add the **same log-density term** to the total log-likelihood. The standard therefore need not adjudicate between them.
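
      A small numerical check illustrates the point for the simplest case, a one-dimensional Gaussian. The function names and the choice of an auxiliary observation fixed at zero are assumptions of this sketch. The equality shown relies on the Gaussian being symmetric in the difference between its argument and its mean; other distribution families need more care in how the two readings are matched.

```python
import math


def gaussian_logpdf(x, mean, sigma):
    """Log-density of a one-dimensional Gaussian."""
    z = (x - mean) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))


def frequentist_constraint(delta, sigma, aux_obs=0.0):
    # Implicit auxiliary measurement: aux_obs is "observed", delta is its mean.
    return gaussian_logpdf(aux_obs, mean=delta, sigma=sigma)


def bayesian_prior(delta, sigma, centre=0.0):
    # Prior density on delta, centred on zero shift.
    return gaussian_logpdf(delta, mean=centre, sigma=sigma)


# Both readings give the same number at every value of the shift.
for delta in (-1.5, 0.0, 0.7, 3.0):
    f = frequentist_constraint(delta, sigma=2.0)
    b = bayesian_prior(delta, sigma=2.0)
    assert math.isclose(f, b)
```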

      ---

      ### The Agreed Formalisation

      Oliver Schulz proposed, and all participants agreed, the following structure:

      > Attaching an uncertainty distribution to a data object **induces new variational parameters**, named `delta_<dataname>` by convention, that represent additive shifts on the data values. The attached distribution constrains these δ parameters. Whether one treats the distribution as a Bayesian prior on δ or as a frequentist constraint (an implicit auxiliary measurement), the identical log-density term is added to the log-likelihood. This makes the semantics **interpretation-agnostic**.

      The agreed normative elements are:

      1. **Additivity**: data uncertainties are additive shifts δ on the nominal data values. The effective observed data entering the likelihood is d + δ.
      2. **Induced parameters**: each δ is a parameter named `delta_<dataname>` (or by a scheme derived from that stem), so that different tools can produce comparable outputs.
      3. **Reference to existing distributions**: the constraint on δ is not a new keyword but a reference to a named distribution object already declared in the HS3 file.
      4. **Downstream freedom**: whether the inference engine profiles, marginalises, or treats δ as Bayesian or frequentist is left to the tool and user — the standard only specifies the functional form of the constraint.
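
      The first two elements can be sketched in a few lines. The per-value suffix `_<index>` and both function names are assumptions of this illustration; the discussion fixed only the `delta_<dataname>` stem.

```python
def induced_parameter_names(data_name, n_values):
    """Names of the shift parameters induced on a data object.

    The per-value suffix `_<index>` is an assumption of this sketch; the
    discussion fixed only the `delta_<dataname>` stem.
    """
    return [f"delta_{data_name}_{i}" for i in range(n_values)]


def shifted_data(nominal, deltas):
    """Effective data d + delta entering the likelihood."""
    if len(nominal) != len(deltas):
        raise ValueError("need one shift per data value")
    return [d + s for d, s in zip(nominal, deltas)]


names = induced_parameter_names("xsec", 3)
effective = shifted_data([120.0, 85.0, 40.0], [1.5, -2.0, 0.0])
print(names)      # ['delta_xsec_0', 'delta_xsec_1', 'delta_xsec_2']
print(effective)  # [121.5, 83.0, 40.0]
```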

      ### The Likelihood Structure

      Writing this out explicitly, the full log-likelihood for a dataset d with data uncertainty distribution p_data and a physics model with nuisance parameters θ is:

      **log L(μ, θ, δ) = log L_core(d + δ | ŷ(μ, θ)) + log p_data(δ) + log p_model(θ)**

      where:

      - μ is the parameter of interest (POI),
      - ŷ(μ, θ) is the model prediction (a function of both the physics parameters and the model-side nuisances θ),
      - δ are the data-side variational parameters induced by the data uncertainty attachment,
      - p_data(δ) is the constraint distribution defined in the HS3 file and referenced by name from the data record,
      - p_model(θ) collects the existing model-side constraint terms (e.g. Gaussian or log-normal nuisance priors/auxiliary measurements already handled in HS3).
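
      As a rough illustration, the three terms can be evaluated for a toy two-bin model. Everything specific here is an assumption of the sketch and was not part of the discussion: the Gaussian form of L_core with fixed widths `core_sigma`, the uncorrelated Gaussian p_data with widths `data_sigma`, the unit-Gaussian p_model, and the toy `prediction` in which θ scales the background by 10% per unit.

```python
import math


def gaussian_logpdf(x, mean, sigma):
    """Log-density of a one-dimensional Gaussian."""
    z = (x - mean) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))


def prediction(mu, theta, signal, background):
    """Toy model prediction y(mu, theta); theta scales the background by 10% per unit."""
    return [mu * s + b * (1.0 + 0.1 * theta) for s, b in zip(signal, background)]


def log_likelihood(mu, theta, delta, data, signal, background, core_sigma, data_sigma):
    y = prediction(mu, theta, signal, background)
    # log L_core(d + delta | y): Gaussian core, an assumption of this sketch.
    core = sum(
        gaussian_logpdf(d + dl, yi, cs)
        for d, dl, yi, cs in zip(data, delta, y, core_sigma)
    )
    # log p_data(delta): constraint on the data-side shifts (uncorrelated here).
    data_term = sum(gaussian_logpdf(dl, 0.0, ds) for dl, ds in zip(delta, data_sigma))
    # log p_model(theta): unit-Gaussian constraint on the model-side nuisance.
    model_term = gaussian_logpdf(theta, 0.0, 1.0)
    return core + data_term + model_term


signal = [10.0, 5.0]
background = [100.0, 50.0]
data = prediction(1.0, 0.0, signal, background)  # pseudo-data at mu=1, theta=0
args = (data, signal, background, [10.0, 7.0], [3.0, 2.0])
ll_true = log_likelihood(1.0, 0.0, [0.0, 0.0], *args)
ll_off = log_likelihood(2.0, 0.0, [0.0, 0.0], *args)
print(ll_true > ll_off)
```

      Whether δ and θ are then profiled or marginalised is a choice made by the caller of such a function, in line with the "downstream freedom" element above.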

      For the common HEP case (binned data with a published covariance matrix, e.g., from HEPData), p_data(δ) is a multivariate Gaussian:

      **p_data(δ) = N(δ | 0, Σ)**

      where Σ is the covariance matrix accompanying the dataset. This distribution is declared once in the HS3 `distributions` section and linked by name from the data block — it is not repeated inside every physics model being tested.
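
      For reference, the corresponding log-density is log N(δ | 0, Σ) = -½ [δᵀ Σ⁻¹ δ + log det Σ + n log 2π], with n the number of data values. A dependency-free sketch of its evaluation via a Cholesky factorisation is given below; the function names are chosen for this illustration, and a real tool would more likely call a linear-algebra library.

```python
import math


def cholesky(matrix):
    """Lower-triangular L with L L^T = matrix (symmetric positive definite input)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = matrix[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                if acc <= 0.0:
                    raise ValueError("covariance is not positive definite")
                lower[i][j] = math.sqrt(acc)
            else:
                lower[i][j] = acc / lower[j][j]
    return lower


def mvn_logpdf_zero_mean(delta, cov):
    """log N(delta | 0, cov) for a list of shifts and a covariance matrix."""
    n = len(delta)
    lower = cholesky(cov)
    # Forward substitution: solve L z = delta, so that z.z = delta^T cov^-1 delta.
    z = []
    for i in range(n):
        acc = delta[i] - sum(lower[i][k] * z[k] for k in range(i))
        z.append(acc / lower[i][i])
    log_det = 2.0 * sum(math.log(lower[i][i]) for i in range(n))
    return -0.5 * (sum(v * v for v in z) + log_det + n * math.log(2.0 * math.pi))


cov = [[1.0, 0.5], [0.5, 1.0]]
print(mvn_logpdf_zero_mean([1.0, 1.0], cov))
```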

      ---

      ### Sabine's Reference: arXiv:2109.04981

      The reference is the white paper "Publishing statistical models: Getting the most out of particle physics experiments", edited by Cranmer, Kraml, and Prosper (2021); Sabine is one of the editors. She pointed specifically to pages 4–6, which discuss the frequentist vs. Bayesian ambiguity in exactly this context. Andy acknowledged its relevance but noted that the paper is written from a HistFactory-centric perspective and does not fully generalise to arbitrary distributions, so it is useful background but the HS3 treatment needs to go further.
    • 14:30 14:55
      HS3 in Spey 25m
      Speaker: Jack Y. Araz (UCL & City St George's, University of London)
    • 14:55 15:20
      HEPData handling of HS3: conversion and visualization 25m
      Speaker: Steffen Albrecht (Hamburg University (DE))
    • 15:20 15:55
      General discussion 35m