Computing coverage probabilities

Of Bayesian and Frequentist intervals with NumPy and Numba

by Hans Dembinski, Max Planck Institute for Nuclear Physics, Heidelberg

Overview

  • Coverage probability is an important concept in high energy physics
  • Universal "yardstick" to compare different methods to compute uncertainty intervals and limits
    • In particular: Frequentist and Bayesian methods
    • No objections to Bayesian intervals if intervals have coverage
  • Published uncertainty intervals should always have 68% ("1 sigma") coverage probability

This talk

  • How to compute the coverage probability of various uncertainty intervals for measured efficiencies
  • Challenge: No perfect interval with exact coverage, only approximate coverage
  • Same problem for all discrete distributions (Poisson, binomial, ...)
In [1]:
%matplotlib inline
In [2]:
# setup notebook

# !pip install numpy matplotlib numba

# Also a ROOT installation with PyROOT support is needed.
# These results were generated with ROOT 6.18.

import numpy as np
import matplotlib.pyplot as plt
plt.rcParams["font.size"] = 20

import ROOT
TE = ROOT.TEfficiency
Welcome to JupyROOT 6.18/04

What is coverage?

  • Let us look at a typical plot of a histogram with error bars
  • We will see what the coverage probability is and why it is useful
In [3]:
# generate some random data
np.random.seed(1)
x = np.random.rand(6000)
# compute histogram to get counts in bins (counts are Poisson-distributed)
w, xe = np.histogram(x, bins=20, range=(0, 1))

# you all know the sqrt(N) estimate for counts
wsigma = w ** 0.5
In [4]:
# draw data
cx = 0.5 * (xe[:-1] + xe[1:])  # bin centers
plt.errorbar(cx, w, wsigma, fmt="o")

# draw expectation: number of counts / number of bins
expectation = len(x) / len(w)
plt.axhline(expectation, color="r")

# compute coverage probability: fraction of points within 1 sigma of the expectation
mask = np.abs(w - expectation) < wsigma
p_coverage = np.sum(mask) * 1.0 / len(w)
plt.title("coverage probability = %.2f" % p_coverage, y=1.03)

# highlight points further away
plt.plot(cx[~mask], w[~mask], "o", ms=20, mfc="None")

plt.xlim(-0.1, 1.1);
  • 6 out of 20 bins are more than 1 sigma away from the expectation
  • Coverage probability = 14 / 20 = 70 %
  • On average: 68 % of the intervals should cover the expected value (the definition of coverage!)
  • Needed to make different analysis results and experiments comparable

Hypothetical scenario to avoid

  • Two experiments A and B perform the same measurement and are equally sensitive
  • Experiment A computes uncertainty with a method that has less coverage than 68 %
  • Experiment B uses a method with proper coverage of 68 %
  • On average: Experiment A will quote a smaller uncertainty interval than B
  • Experiment A will (falsely) appear superior to B

Rule of 2/3

If everyone uses uncertainties with 68 % coverage, we can quickly judge results.

If the model is accurate and the uncertainties are correct, about 2/3 of the data points with error bars should overlap with the model.
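This check takes only a few lines. Below is a minimal sketch (an addition to the original notebook) that reuses w, wsigma, and expectation from the cells above; the helper name fraction_covered is ours:

In [ ]:
# fraction of points whose 1 sigma error bar overlaps the model prediction
def fraction_covered(values, sigmas, model):
    return np.mean(np.abs(values - model) < sigmas)

# with correct 68 % intervals this should be close to 2/3
fraction_covered(w, wsigma, expectation)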

Coverage not always good

  • Uncertainty intervals computed by common methods usually only have approximate coverage
    • Poisson "error"
    • Efficiency "error"
    • Intervals produced by MINUIT and iminuit
  • Usually OK for large counts ($n > 50$); problems appear when counts are small
  • General method: Compute coverage with toy MC experiments
  • Let us do this for Poisson and binomial distributed data

Coverage of $n \pm \sqrt{n}$ interval

  • Lots of counting experiments in particle physics
  • Counts follow Poisson distribution
  • Standard uncertainty estimate: $n \pm \sqrt{n}$
    • But: $n = 0 \rightarrow 0 \pm 0$ never contains true value
  • How good is coverage of this interval?
In [5]:
# compute coverage n +/- sqrt(n) with random trials
@np.vectorize
def compute_coverage(mu, n_trials):
    n = np.random.poisson(mu, n_trials)
    n_sigma = n ** 0.5  # sqrt(n) estimate
    n_cover = np.sum(np.abs(n - mu) < n_sigma)
    return n_cover * 1.0 / n_trials
    
mu = np.linspace(0.1, 20, 500)
prob = compute_coverage(mu, 10000)
In [6]:
# draw expected coverage probability
plt.axhline(0.68, color="r")

# draw actual coverage probability
plt.plot(mu, prob)
plt.ylim(0, 1)
plt.xlim(0, 20)
plt.xlabel("$\mu$")
plt.ylabel("actual coverage");
  • Coverage of the $n \pm \sqrt{n}$ interval is bad for $\mu \lesssim 1$
  • The jumps around the nominal coverage are systematic; they are caused by the discreteness of the Poisson distribution, as the exact computation below confirms
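The following sketch (an addition, not from the original talk) computes the coverage of the $n \pm \sqrt{n}$ interval exactly, by summing Poisson probabilities over all $n$ whose interval contains $\mu$:

In [ ]:
# exact coverage of n +/- sqrt(n): sum P(n | mu) over all n that cover mu
def exact_coverage(mu, n_max=200):
    cov = 0.0
    pmf = np.exp(-mu)  # P(n = 0)
    for n in range(n_max):
        if abs(n - mu) < n ** 0.5:
            cov += pmf
        pmf *= mu / (n + 1)  # recurrence: P(n + 1) = P(n) * mu / (n + 1)
    return cov

# same jumps as the toy MC curve, but free of statistical noise
exact_coverage(1.0), exact_coverage(10.0)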

Coverage of uncertainty interval for efficiencies

  • Efficiency in counting experiment is $\epsilon = k/n$ (called binomial proportion in statistics)
    • $n$ is number of trials, fixed by experiment design (not random)
    • $k$ randomly varies from experiment to experiment, function of true efficiency
  • $k$ is binomially distributed with variance $\text{Var}[k] = n \, \epsilon \, (1 - \epsilon)$
$$ P(k) = \binom{n}{k} \epsilon^k (1-\epsilon)^{n-k} $$
  • Common (bad) uncertainty estimate for $\epsilon$
$$ \sigma[\epsilon] := \frac{\sqrt{\text{Var}[k]}}{n} = \sqrt{\frac{\epsilon \, (1 - \epsilon)}{n}} $$
  • Has bad coverage for $\epsilon$ close to 0 and 1 (see the sketch after this list)
    • $k = 0 \rightarrow \sigma[\epsilon] = 0$
    • $k = n \rightarrow \sigma[\epsilon] = 0$
  • Many other (differently constructed) intervals have been proposed to fix this
    • Many of those are implemented in ROOT's TEfficiency class
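The zero-width failure mode noted above takes a few lines to demonstrate. A minimal sketch (ours; $n = 10$ is an arbitrary demo value):

In [ ]:
# the common estimate collapses to a zero-width interval at k = 0 and k = n
n_demo = 10
for k_demo in (0, 3, 10):
    eps = k_demo / n_demo
    sigma_eps = (eps * (1 - eps) / n_demo) ** 0.5
    print("k = %2d: eps = %.2f +/- %.2f" % (k_demo, eps, sigma_eps))
# k = 0 and k = 10 report +/- 0.00, an interval that cannot cover the true efficiency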

The following is based on a study from the TEfficiency reference page. We unify the color scales and focus on results for small $n$.

In [7]:
# lots of intervals are implemented in TEfficiency
import ROOT
TE = ROOT.TEfficiency
In [8]:
# draw Wilson interval
n = 6
k = np.arange(n + 1)
p = k * 1.0 / n
l = [TE.Wilson(n, ki, 0.68, 0) for ki in k]
u = [TE.Wilson(n, ki, 0.68, 1) for ki in k]
plt.errorbar(k, p, np.array(((p - l), (u - p))), fmt="o")
plt.xticks(k)
plt.xlim(-0.1, n + 0.1)
plt.xlabel("$k$")
plt.ylabel("$\epsilon$")
plt.title("Wilson interval", y=1.03);
In [9]:
# accelerate following computation with Numba
from numba import njit
In [10]:
# generate lookup-table of TEfficiency intervals which we can pass to Numba
def make_interval_table(method, n_max):
    # table[n, k] = (lower, upper) interval bounds; row n = 0 stays unused
    table = np.empty((n_max, n_max + 1, 2))
    for n in range(1, n_max):
        for k in range(n + 1):
            # the boolean argument selects the bound: False = lower, True = upper
            table[n, k] = method(n, k, False), method(n, k, True)
    return table
In [11]:
# generate lookup-table of factorials (Numba currently has no factorial)
@njit
def make_factorial_table(n):
    table = np.empty(n)
    table[0] = 1
    for ni in range(1, n):
        table[ni] = table[ni-1] * ni
    return table
In [12]:
# actual computation happens here;
# coverage computed exactly without using random trials
def coverage(n_values, p_values, method):
    @njit
    def compute(n_values, p_values, fact, interval):
        cov = np.empty((len(n_values), len(p_values)), dtype=np.double)
        for i, n in enumerate(n_values):
            for j, p in enumerate(p_values):
                csum = 0.0
                for k in range(n + 1):
                    # binomial probability to observe k successes in n trials
                    binom = fact[n] / (fact[k] * fact[n - k])
                    prob_k = binom * p ** k * (1.0 - p) ** (n - k)
                    # explicit indexing (instead of tuple unpacking) is safe in Numba
                    l = interval[n, k, 0]
                    u = interval[n, k, 1]
                    # add P(k) when the interval for this outcome covers the true p
                    if l <= p <= u:
                        csum += prob_k
                cov[i, j] = csum
        return cov

    facts = make_factorial_table(n_values[-1] + 1)
    intervals = make_interval_table(method, n_values[-1] + 1)
    return compute(n_values, p_values, facts, intervals)
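In formula form, the quantity computed by compute is the exact coverage probability

$$ C(n, \epsilon) = \sum_{k=0}^{n} \binom{n}{k} \epsilon^k (1-\epsilon)^{n-k} \; \mathbf{1}\left[ l_{nk} \le \epsilon \le u_{nk} \right], $$

where $[l_{nk}, u_{nk}]$ is the interval the method returns for the observation $(n, k)$: the total probability of all outcomes whose interval covers the true efficiency.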
In [13]:
# scan parameters
n = np.arange(1, 100)
p = np.linspace(0, 1, 201)
pcov = 0.68
In [14]:
cov_agresti_coull = coverage(n, p, lambda n, k, up:
                             TE.AgrestiCoull(n, k, pcov, up))
In [15]:
cov_bayesian_uniform = coverage(n, p, lambda n, k, up:
                                TE.Bayesian(n, k, pcov, 1, 1, up))
In [16]:
cov_bayesian_jeffreys = coverage(n, p, lambda n, k, up:
                                 TE.Bayesian(n, k, pcov, 0.5, 0.5, up))
In [17]:
cov_wilson = coverage(n, p, lambda n, k, up:
                      TE.Wilson(n, k, pcov, up))
In [18]:
# normal or Wald interval, the common (bad) estimate from above
cov_normal = coverage(n, p, lambda n, k, up:
                      TE.Normal(n, k, pcov, up))
In [19]:
# the "exact" frequentist interval
cov_clopper_pearson = coverage(n, p, lambda n, k, up:
                               TE.ClopperPearson(n, k, pcov, up))
In [20]:
# shared imshow options: x axis spans the true efficiency, y axis the number of trials n
imshow_opts = {'origin': 'lower', 'extent': (0, 1, n[0], n[-1]), 'aspect': 0.01}
In [21]:
# coverage map of the Clopper-Pearson interval
plt.imshow(cov_clopper_pearson, **imshow_opts)
plt.clim(0, 1)
plt.colorbar();
In [22]:
# coverage map of the normal (Wald) interval
plt.imshow(cov_normal, **imshow_opts)
plt.clim(0, 1)
plt.colorbar();
In [23]:
# coverage map of the Agresti-Coull interval
plt.imshow(cov_agresti_coull, **imshow_opts)
plt.clim(0, 1)
plt.colorbar();
In [24]:
# coverage map of the Bayesian interval with uniform prior
plt.imshow(cov_bayesian_uniform, **imshow_opts)
plt.clim(0, 1)
plt.colorbar();
In [25]:
# coverage map of the Bayesian interval with Jeffreys prior
plt.imshow(cov_bayesian_jeffreys, **imshow_opts)
plt.clim(0, 1)
plt.colorbar();
In [26]:
# coverage map of the Wilson interval
plt.imshow(cov_wilson, **imshow_opts)
plt.clim(0, 1)
plt.colorbar();
In [27]:
# compute mean coverage
plt.figure(figsize=(10, 7))
plt.plot(n, np.mean(cov_clopper_pearson, axis=1), ls=":", lw=2,
         label="Clopper Pearson")
plt.plot(n, np.mean(cov_normal, axis=1), ls=":", lw=2,
         label="Normal")
plt.plot(n, np.mean(cov_wilson, axis=1), ls="-", lw=2,
         label="Wilson")
plt.plot(n, np.mean(cov_bayesian_uniform, axis=1), ls="--", lw=2,
         label="Bayesian (uniform prior)")
plt.plot(n, np.mean(cov_bayesian_jeffreys, axis=1), ls="--", lw=2,
         label="Bayesian (Jeffreys prior)")
plt.plot(n, np.mean(cov_agresti_coull, axis=1), ls=":", lw=2,
         label="Agresti Coull")
plt.xlim(1, 30)
plt.ylim(0.0, 1.0)
plt.axhline(pcov, ls="-", color="k", zorder=0)
plt.legend()
plt.xlabel("$N$")
plt.ylabel("measured coverage");
  • Wilson interval performs best
    • Good coverage (the Bayesian interval with uniform prior has a better average coverage)
    • Never produces intervals that cannot contain the true efficiency
    • Recommended in the LHCb Statistics Guidelines
  • "Exact" frequentist interval (Clopper-Pearson) has worse coverage than the Bayesian intervals
  • Normal interval seriously undercovers even for large $n$