# Computing coverage probabilities¶

Of Bayesian and Frequentist intervals with NumPy and Numba

by Hans Dembinski, Max Planck Institute for Nuclear Physics, Heidelberg

## Overview¶

• Coverage probability important concept in high energy physics
• Universal "yardstick" to compare different methods to compute uncertainty intervals and limits
• In particular: Frequentist and Bayesian methods
• No objections to Bayesian intervals if intervals have coverage
• Published uncertainty intervals should always have 68% ("1 sigma") coverage probability

## This talk¶

• How to compute coverage probability for various uncertainty intervals for measured efficiencies
• Challenge: No perfect interval with exact coverage, only approximate coverage
• Same problem for all discrete distributions (Poisson, binomial, ...)
In [1]:
%matplotlib inline

In [2]:
# setup notebook

# !pip install numpy matplotlib numba

# Also a ROOT installation is needed with PyROOT support.
# These results were generated with ROOT-6.14.

import numpy as np
import matplotlib.pyplot as plt
plt.rcParams["font.size"] = 20

import ROOT
TE = ROOT.TEfficiency

Welcome to JupyROOT 6.18/04


## What is coverage?¶

• Let look at typical plot of histogram with error bars
• Will see what coverage probability is and why it is useful
In [3]:
# generate some random data
np.random.seed(1)
x = np.random.rand(6000)
# compute histogram to get counts in bins (counts are Poisson-distributed)
w, xe = np.histogram(x, bins=20, range=(0, 1))

# you all know the sqrt(N) estimate for counts
wsigma = w ** 0.5

In [4]:
# draw data
cx = 0.5 * (xe[1:] + xe[1:])
plt.errorbar(cx, w, wsigma, fmt="o")

# draw expectation: number of counts / number of bins
expectation = len(x) / len(w)
plt.axhline(expectation, color="r")

# compute coverage probability: fraction of points 1 sigma away from expectation
mask = np.abs(w - expectation) < wsigma
p_coverage = np.sum(mask) * 1.0 / len(w)
plt.title("coverage probability = %.2f" % p_coverage, y=1.03)

# highlight points further away

plt.xlim(-0.1, 1.1);

• 6 out of 20 bins are further away than 1 sigma
• Coverage probability = 14 / 20 = 70 %
• On average: 68 % of intervals should cover expected value (the definition of coverage!)
• Needed to make different analysis results and experiments comparable

## Hypothetical scenario to avoid¶

• Two experiments A and B are equally sensitive on same measurement
• Experiment A computes uncertainty with a method that has less coverage than 68 %
• Experiment B uses a method with proper coverage of 68 %
• On average: Experiment A will quote a smaller uncertainty interval than B
• Experiment A will (falsely) appear superior to B

## Rule of 2/3¶

If everyone uses uncertainties with 68 % coverage, we can quickly judge results.

If model is accurate and uncertainties are correct, about 2/3 of data points with error bars should overlap with model.

## Coverage not always good¶

• Uncertainty intervals computed by common methods usually only have approximate coverage
• Poisson "error"
• Efficiency "error"
• Intervals produced by MINUIT and iminuit
• Usually OK for large number of counts $n > 50$, problems appear when counts small
• General method: Compute coverage with toy MC experiments
• Let us do this for Poisson and binomial distributed data

## Coverage of $n \pm \sqrt{n}$ interval¶

• Lots of counting experiments in particle physics
• Standard uncertainty estimate: $n \pm \sqrt{n}$
• But: $n = 0 \rightarrow 0 \pm 0$ never contains true value
• How good is coverage of this interval?
In [5]:
# compute coverage n +/- sqrt(n) with random trials
@np.vectorize
def compute_coverage(mu, n_trials):
n = np.random.poisson(mu, n_trials)
n_sigma = n ** 0.5  # sqrt(n) estimate
n_cover = np.sum(np.abs(n - mu) < n_sigma)
return n_cover * 1.0 / n_trials

mu = np.linspace(0.1, 20, 500)
prob = compute_coverage(mu, 10000)

In [6]:
# draw expected coverage probability
plt.axhline(0.68, color="r")

# draw actual coverage probability
plt.plot(mu, prob)
plt.ylim(0, 1)
plt.xlim(0, 20)
plt.xlabel("$\mu$")
plt.ylabel("actual coverage");

• Coverage of $n \pm \sqrt{n}$ interval is bad for $\mu \lesssim 1$
• Jumps around nominal coverage are systematic, discreteness of Poisson distribution

## Coverage of uncertainty interval for efficiencies¶

• Efficiency in counting experiment is $\epsilon = k/n$ (called binomial proportion in statistics)
• $n$ is number of trials, fixed by experiment design (not random)
• $k$ randomly varies from experiment to experiment, function of true efficiency
• $k$ is binomially distributed with variance $\text{Var}[k] = n \, \epsilon \, (1 - \epsilon)$
$$P(k) = \binom{n}{k} \epsilon^k (1-\epsilon)^{n-k}$$
• Common (bad) uncertainty estimate for $\epsilon$
$$\sigma[\epsilon] := \frac{\sqrt{\text{Var}[k]}}{n} = \sqrt{\frac{\epsilon \, (1 - \epsilon)}{n}}$$
• Has bad coverage for $\epsilon$ close to 0 and 1
• $k = 0 \rightarrow \sigma[\epsilon] = 0$
• $k = n \rightarrow \sigma[\epsilon] = 0$
• Many other (differently constructed) intervals have been proposed to fix this
• Many of those are implemented in ROOT's TEfficiency class

The following is based on a study from the TEfficiency reference page. We unify the color scales and focus on results for small $n$.

In [7]:
# lots of intervals are implemented in TEfficiency
import ROOT
TE = ROOT.TEfficiency

In [8]:
# draw Wilson interval
n = 6
k = np.arange(n + 1)
p = k * 1.0 / n
l = [TE.Wilson(n, ki, 0.68, 0) for ki in k]
u = [TE.Wilson(n, ki, 0.68, 1) for ki in k]
plt.errorbar(k, p, np.array(((p - l), (u - p))), fmt="o")
plt.xticks(k)
plt.xlim(-0.1, n + 0.1)
plt.xlabel("$k$")
plt.ylabel("$\epsilon$")
plt.title("Wilson interval", y=1.03);

In [9]:
# accelerate following computation with Numba
from numba import njit

In [10]:
# generate lookup-table of TEfficiency intervals which we can pass to Numba
def make_interval_table(method, n_max):
table = np.empty((n_max, n_max + 1, 2))
for n in range(1, n_max):
for k in range(n + 1):
table[n, k] = method(n, k, False), method(n, k, True)
return table

In [11]:
# generate lookup-table of factorials (Numba currently has no factorial)
@njit
def make_factorial_table(n):
table = np.empty(n)
table[0] = 1
for ni in range(1, n):
table[ni] = table[ni-1] * ni
return table

In [12]:
# actual computation happens here;
# coverage computed exactly without using random trials
def coverage(n_values, p_values, method):
@njit
def compute(n_values, p_values, fact, interval):
cov = np.empty((len(n_values), len(p_values)), dtype=np.double)
for i, n in enumerate(n_values):
for j, p in enumerate(p_values):
csum = 0.0
for k in range(n + 1):
binom = fact[n] / (fact[k] * fact[n - k])
prob_k = binom * p ** k * (1.0 - p) ** (n - k)
l, u = interval[n, k]
if l <= p <= u:
csum += prob_k
cov[i, j] = csum
return cov

facts = make_factorial_table(n_values[-1] + 1)
intervals = make_interval_table(method, n_values[-1] + 1)
return compute(n_values, p_values, facts, intervals)

In [13]:
# scan parameters
n = np.arange(1, 100)
p = np.linspace(0, 1, 201)
pcov = 0.68

In [14]:
cov_agresti_coull = coverage(n, p, lambda n, k, up:
TE.AgrestiCoull(n, k, pcov, up))

In [15]:
cov_bayesian_uniform = coverage(n, p, lambda n, k, up:
TE.Bayesian(n, k, pcov, 1, 1, up))

In [16]:
cov_bayesian_jeffreys = coverage(n, p, lambda n, k, up:
TE.Bayesian(n, k, pcov, 0.5, 0.5, up))

In [17]:
cov_wilson = coverage(n, p, lambda n, k, up:
TE.Wilson(n, k, pcov, up))

In [18]:
# normal or Wald interval, the common (bad) estimate from above
cov_normal = coverage(n, p, lambda n, k, up:
TE.Normal(n, k, pcov, up))

In [19]:
# the "exact" frequentist interval
cov_clopper_pearson = coverage(n, p, lambda n, k, up:
TE.ClopperPearson(n, k, pcov, up))

In [20]:
imshow_opts = {'origin': 'lower', 'extent': (0, 1, n[0], n[-1]), 'aspect': 0.01}

In [21]:
plt.imshow(cov_clopper_pearson, **imshow_opts)
plt.clim(0, 1)
plt.colorbar();

In [22]:
plt.imshow(cov_normal, **imshow_opts)
plt.clim(0, 1)
plt.colorbar();

In [23]:
plt.imshow(cov_agresti_coull, **imshow_opts)
plt.clim(0, 1)
plt.colorbar();

In [24]:
plt.imshow(cov_bayesian_uniform, **imshow_opts)
plt.clim(0, 1)
plt.colorbar();

In [25]:
plt.imshow(cov_bayesian_jeffreys, **imshow_opts)
plt.clim(0, 1)
plt.colorbar();

In [26]:
plt.imshow(cov_wilson, **imshow_opts)
plt.clim(0, 1)
plt.colorbar();

In [27]:
# compute mean coverage
plt.figure(figsize=(10, 7))
plt.plot(n, np.mean(cov_clopper_pearson, axis=1), ls=":", lw=2,
label="Clopper Pearson")
plt.plot(n, np.mean(cov_normal, axis=1), ls=":", lw=2,
label="Normal")
plt.plot(n, np.mean(cov_wilson, axis=1), ls="-", lw=2,
label="Wilson")
plt.plot(n, np.mean(cov_bayesian_uniform, axis=1), ls="--", lw=2,
label="Bayesian (uniform prior)")
plt.plot(n, np.mean(cov_bayesian_jeffreys, axis=1), ls="--", lw=2,
label="Bayesian (Jeffreys prior)")
plt.plot(n, np.mean(cov_agresti_coull, axis=1), ls=":", lw=2,
label="Agresti Coull")
plt.xlim(1, 30)
plt.ylim(0.0, 1.0)
plt.axhline(pcov, ls="-", color="k", zorder=0)
plt.legend()
plt.xlabel("$N$")
plt.ylabel("measured coverage");

• Wilson interval performs best
• Good coverage (Bayesian with uniform prior has better average coverage)
• No intervals that never contain true efficiency
• Recommended in LHCb Statistics Guidelines
• "Exact" frequentist interval (Clopper-Pearson) has worse coverage than Bayesian intervals
• Normal interval seriously undercovers even for large $n$
In [ ]: