Computing coverage probabilities

Of Bayesian and Frequentist intervals with NumPy and Numba

by Hans Dembinski, Max Planck Institute for Nuclear Physics, Heidelberg

Overview

  • Coverage probability is an important concept in high energy physics
  • Universal "yardstick" to compare different methods to compute uncertainty intervals and limits
    • In particular: Frequentist and Bayesian methods
    • No objections to Bayesian intervals if intervals have coverage
  • Published uncertainty intervals should always have 68% ("1 sigma") coverage probability

This talk

  • How to compute the coverage probability of various uncertainty intervals for measured efficiencies
  • Challenge: No perfect interval with exact coverage, only approximate coverage
  • Same problem for all discrete distributions (Poisson, binomial, ...)
In [1]:
%matplotlib inline
In [2]:
# setup notebook

# !pip install numpy matplotlib numba

# Also a ROOT installation with PyROOT support is needed.
# These results were generated with ROOT 6.18.

import numpy as np
import matplotlib.pyplot as plt
plt.rcParams["font.size"] = 20

import ROOT
TE = ROOT.TEfficiency
Welcome to JupyROOT 6.18/04

What is coverage?

  • Let us look at a typical plot of a histogram with error bars
  • We will see what the coverage probability is and why it is useful
In [3]:
# generate some random data
np.random.seed(1)
x = np.random.rand(6000)
# compute histogram to get counts in bins (counts are Poisson-distributed)
w, xe = np.histogram(x, bins=20, range=(0, 1))

# you all know the sqrt(N) estimate for counts
wsigma = w ** 0.5
In [4]:
# draw data
cx = 0.5 * (xe[:-1] + xe[1:])  # bin centers
plt.errorbar(cx, w, wsigma, fmt="o")

# draw expectation: number of counts / number of bins
expectation = len(x) / len(w)
plt.axhline(expectation, color="r")

# compute coverage probability: fraction of points within 1 sigma of the expectation
mask = np.abs(w - expectation) < wsigma
p_coverage = np.sum(mask) * 1.0 / len(w)
plt.title("coverage probability = %.2f" % p_coverage, y=1.03)

# highlight points further away
plt.plot(cx[~mask], w[~mask], "o", ms=20, mfc="None")

plt.xlim(-0.1, 1.1);
  • 6 out of 20 bins are more than 1 sigma away from the expectation
  • Coverage probability = 14 / 20 = 70 %
  • On average: 68 % of the intervals should cover the expected value (the definition of coverage!)
  • Needed to make different analysis results and experiments comparable

Hypothetical scenario to avoid

  • Two experiments A and B perform the same measurement and are equally sensitive
  • Experiment A computes uncertainty with a method that has less coverage than 68 %
  • Experiment B uses a method with proper coverage of 68 %
  • On average: Experiment A will quote a smaller uncertainty interval than B
  • Experiment A will (falsely) appear superior to B

Rule of 2/3

If everyone uses uncertainties with 68 % coverage, we can quickly judge results.

If the model is accurate and the uncertainties are correct, about 2/3 of the data points with error bars should overlap with the model.
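This check takes only a few lines. Below is a minimal sketch (an addition to the original notebook) that reuses w, wsigma, and expectation from the cells above; the helper name fraction_covered is ours:

In [ ]:
# fraction of points whose 1 sigma error bar overlaps the model prediction
def fraction_covered(values, sigmas, model):
    return np.mean(np.abs(values - model) < sigmas)

# with correct 68 % intervals this should be close to 2/3
fraction_covered(w, wsigma, expectation)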

Coverage not always good

  • Uncertainty intervals computed by common methods usually only have approximate coverage
    • Poisson "error"
    • Efficiency "error"
    • Intervals produced by MINUIT and iminuit
  • Usually OK for large counts ($n > 50$); problems appear when counts are small
  • General method: Compute coverage with toy MC experiments
  • Let us do this for Poisson and binomial distributed data

Coverage of $n \pm \sqrt{n}$ interval

  • Lots of counting experiments in particle physics
  • Counts follow Poisson distribution
  • Standard uncertainty estimate: $n \pm \sqrt{n}$
    • But: $n = 0 \rightarrow 0 \pm 0$ never contains true value
  • How good is coverage of this interval?
In [5]:
# compute coverage n +/- sqrt(n) with random trials
@np.vectorize
def compute_coverage(mu, n_trials):
    n = np.random.poisson(mu, n_trials)
    n_sigma = n ** 0.5  # sqrt(n) estimate
    n_cover = np.sum(np.abs(n - mu) < n_sigma)
    return n_cover * 1.0 / n_trials
    
mu = np.linspace(0.1, 20, 500)
prob = compute_coverage(mu, 10000)
In [6]:
# draw expected coverage probability
plt.axhline(0.68, color="r")

# draw actual coverage probability
plt.plot(mu, prob)
plt.ylim(0, 1)
plt.xlim(0, 20)
plt.xlabel("$\mu$")
plt.ylabel("actual coverage");
  • Coverage of the $n \pm \sqrt{n}$ interval is bad for $\mu \lesssim 1$
  • The jumps around the nominal coverage are systematic; they are caused by the discreteness of the Poisson distribution, as the exact computation below confirms
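The following sketch (an addition, not from the original talk) computes the coverage of the $n \pm \sqrt{n}$ interval exactly, by summing Poisson probabilities over all $n$ whose interval contains $\mu$:

In [ ]:
# exact coverage of n +/- sqrt(n): sum P(n | mu) over all n that cover mu
def exact_coverage(mu, n_max=200):
    cov = 0.0
    pmf = np.exp(-mu)  # P(n = 0)
    for n in range(n_max):
        if abs(n - mu) < n ** 0.5:
            cov += pmf
        pmf *= mu / (n + 1)  # recurrence: P(n + 1) = P(n) * mu / (n + 1)
    return cov

# same jumps as the toy MC curve, but free of statistical noise
exact_coverage(1.0), exact_coverage(10.0)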

Coverage of uncertainty interval for efficiencies

  • Efficiency in counting experiment is $\epsilon = k/n$ (called binomial proportion in statistics)
    • $n$ is number of trials, fixed by experiment design (not random)
    • $k$ randomly varies from experiment to experiment, function of true efficiency
  • $k$ is binomially distributed with variance $\text{Var}[k] = n \, \epsilon \, (1 - \epsilon)$
$$ P(k) = \binom{n}{k} \epsilon^k (1-\epsilon)^{n-k} $$
  • Common (bad) uncertainty estimate for $\epsilon$
$$ \sigma[\epsilon] := \frac{\sqrt{\text{Var}[k]}}{n} = \sqrt{\frac{\epsilon \, (1 - \epsilon)}{n}} $$
  • Has bad coverage for $\epsilon$ close to 0 and 1 (see the sketch after this list)
    • $k = 0 \rightarrow \sigma[\epsilon] = 0$
    • $k = n \rightarrow \sigma[\epsilon] = 0$
  • Many other (differently constructed) intervals have been proposed to fix this
    • Many of those are implemented in ROOT's TEfficiency class
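The zero-width failure mode noted above takes a few lines to demonstrate. A minimal sketch (ours; $n = 10$ is an arbitrary demo value):

In [ ]:
# the common estimate collapses to a zero-width interval at k = 0 and k = n
n_demo = 10
for k_demo in (0, 3, 10):
    eps = k_demo / n_demo
    sigma_eps = (eps * (1 - eps) / n_demo) ** 0.5
    print("k = %2d: eps = %.2f +/- %.2f" % (k_demo, eps, sigma_eps))
# k = 0 and k = 10 report +/- 0.00, an interval that cannot cover the true efficiency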

The following is based on a study from the TEfficiency reference page. We unify the color scales and focus on results for small $n$.

In [7]:
# lots of intervals are implemented in TEfficiency
import ROOT
TE = ROOT.TEfficiency
In [8]:
# draw Wilson interval
n = 6
k = np.arange(n + 1)
p = k * 1.0 / n
l = [TE.Wilson(n, ki, 0.68, 0) for ki in k]
u = [TE.Wilson(n, ki, 0.68, 1) for ki in k]
plt.errorbar(k, p, np.array(((p - l), (u - p))), fmt="o")
plt.xticks(k)
plt.xlim(-0.1, n + 0.1)
plt.xlabel("$k$")
plt.ylabel("$\epsilon$")
plt.title("Wilson interval", y=1.03);
In [9]:
# accelerate following computation with Numba
from numba import njit
In [10]:
# generate lookup-table of TEfficiency intervals which we can pass to Numba
def make_interval_table(method, n_max):
    # table[n, k] = (lower, upper) interval bounds; row n = 0 stays unused
    table = np.empty((n_max, n_max + 1, 2))
    for n in range(1, n_max):
        for k in range(n + 1):
            # the boolean argument selects the bound: False = lower, True = upper
            table[n, k] = method(n, k, False), method(n, k, True)
    return table
In [11]:
# generate lookup-table of factorials (Numba currently has no factorial)
@njit
def make_factorial_table(n):
    table = np.empty(n)
    table[0] = 1
    for ni in range(1, n):
        table[ni] = table[ni-1] * ni
    return table
In [12]:
# actual computation happens here;
# coverage computed exactly without using random trials
def coverage(n_values, p_values, method):
    @njit
    def compute(n_values, p_values, fact, interval):
        cov = np.empty((len(n_values), len(p_values)), dtype=np.double)
        for i, n in enumerate(n_values):
            for j, p in enumerate(p_values):
                csum = 0.0
                for k in range(n + 1):
                    # binomial probability to observe k successes in n trials
                    binom = fact[n] / (fact[k] * fact[n - k])
                    prob_k = binom * p ** k * (1.0 - p) ** (n - k)
                    # explicit indexing (instead of tuple unpacking) is safe in Numba
                    l = interval[n, k, 0]
                    u = interval[n, k, 1]
                    # add P(k) when the interval for this outcome covers the true p
                    if l <= p <= u:
                        csum += prob_k
                cov[i, j] = csum
        return cov

    facts = make_factorial_table(n_values[-1] + 1)
    intervals = make_interval_table(method, n_values[-1] + 1)
    return compute(n_values, p_values, facts, intervals)
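In formula form, the quantity computed by compute is the exact coverage probability

$$ C(n, \epsilon) = \sum_{k=0}^{n} \binom{n}{k} \epsilon^k (1-\epsilon)^{n-k} \; \mathbf{1}\left[ l_{nk} \le \epsilon \le u_{nk} \right], $$

where $[l_{nk}, u_{nk}]$ is the interval the method returns for the observation $(n, k)$: the total probability of all outcomes whose interval covers the true efficiency.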
In [13]:
# scan parameters
n = np.arange(1, 100)
p = np.linspace(0, 1, 201)
pcov = 0.68
In [14]:
cov_agresti_coull = coverage(n, p, lambda n, k, up:
                             TE.AgrestiCoull(n, k, pcov, up))
In [15]:
cov_bayesian_uniform = coverage(n, p, lambda n, k, up:
                                TE.Bayesian(n, k, pcov, 1, 1, up))
In [16]:
cov_bayesian_jeffreys = coverage(n, p, lambda n, k, up:
                                 TE.Bayesian(n, k, pcov, 0.5, 0.5, up))
In [17]:
cov_wilson = coverage(n, p, lambda n, k, up:
                      TE.Wilson(n, k, pcov, up))
In [18]:
# normal or Wald interval, the common (bad) estimate from above
cov_normal = coverage(n, p, lambda n, k, up:
                      TE.Normal(n, k, pcov, up))
In [19]:
# the "exact" frequentist interval
cov_clopper_pearson = coverage(n, p, lambda n, k, up:
                               TE.ClopperPearson(n, k, pcov, up))
In [20]:
# shared imshow options: x axis spans the true efficiency, y axis the number of trials n
imshow_opts = {'origin': 'lower', 'extent': (0, 1, n[0], n[-1]), 'aspect': 0.01}
In [21]:
# coverage map of the Clopper-Pearson interval
plt.imshow(cov_clopper_pearson, **imshow_opts)
plt.clim(0, 1)
plt.colorbar();
In [22]:
# coverage map of the normal (Wald) interval
plt.imshow(cov_normal, **imshow_opts)
plt.clim(0, 1)
plt.colorbar();
In [23]:
# coverage map of the Agresti-Coull interval
plt.imshow(cov_agresti_coull, **imshow_opts)
plt.clim(0, 1)
plt.colorbar();
In [24]:
# coverage map of the Bayesian interval with uniform prior
plt.imshow(cov_bayesian_uniform, **imshow_opts)
plt.clim(0, 1)
plt.colorbar();
In [25]:
# coverage map of the Bayesian interval with Jeffreys prior
plt.imshow(cov_bayesian_jeffreys, **imshow_opts)
plt.clim(0, 1)
plt.colorbar();
In [26]:
# coverage map of the Wilson interval
plt.imshow(cov_wilson, **imshow_opts)
plt.clim(0, 1)
plt.colorbar();
In [27]:
# compute mean coverage
plt.figure(figsize=(10, 7))
plt.plot(n, np.mean(cov_clopper_pearson, axis=1), ls=":", lw=2,
         label="Clopper Pearson")
plt.plot(n, np.mean(cov_normal, axis=1), ls=":", lw=2,
         label="Normal")
plt.plot(n, np.mean(cov_wilson, axis=1), ls="-", lw=2,
         label="Wilson")
plt.plot(n, np.mean(cov_bayesian_uniform, axis=1), ls="--", lw=2,
         label="Bayesian (uniform prior)")
plt.plot(n, np.mean(cov_bayesian_jeffreys, axis=1), ls="--", lw=2,
         label="Bayesian (Jeffreys prior)")
plt.plot(n, np.mean(cov_agresti_coull, axis=1), ls=":", lw=2,
         label="Agresti Coull")
plt.xlim(1, 30)
plt.ylim(0.0, 1.0)
plt.axhline(pcov, ls="-", color="k", zorder=0)
plt.legend()
plt.xlabel("$N$")
plt.ylabel("measured coverage");
  • Wilson interval performs best
    • Good coverage (the Bayesian interval with uniform prior has a better average coverage)
    • Never produces intervals that cannot contain the true efficiency
    • Recommended in the LHCb Statistics Guidelines
  • "Exact" frequentist interval (Clopper-Pearson) has worse coverage than the Bayesian intervals
  • Normal interval seriously undercovers even for large $n$