"Title"

Mandatory imports

In [1]:
%pylab inline
import pandas as pd
import plotly
import seaborn as sns
sns.set_context("talk")
from matplotlib import cm
tab20 = cm.get_cmap('tab20')
Populating the interactive namespace from numpy and matplotlib

First a few notations

Dimensions and matrices

$N$: number of events (~100k)
$m$: number of input attributes (~10); $X$: input matrix ($N \times m$)
$p$: number of target features (~1); $Y$: matrix of targets ($N \times p$)
$q$: number of OPU input attributes (~1M); $V$: matrix of OPU inputs ($N \times q$)
$r$: number of OPU random features (~10k); $U$: matrix of random features ($N \times r$)

Ridge regression: $\min_{\beta \in \mathbb{R}^m} \| X \beta - y \|^2 + \gamma \| \beta \|^2$
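
For reference, in this notation the ridge problem has a closed-form solution. A minimal numpy sketch (not the code used below, where sklearn's Ridge does the fitting):

import numpy as np

def ridge_closed_form(X, y, gamma=1.0):
    # beta = (X^T X + gamma * I)^{-1} X^T y, with X of shape (N, m) and y of shape (N, p)
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + gamma * np.eye(m), X.T @ y)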

Workflow

(Workflow diagram)

Tracking 2D simplified model

Track reconstruction at LHC as a collaborative data challenge use case with RAMP

Original paper : http://inspirehep.net/record/1616034

Original library : https://github.com/yetkinyilmaz/tracking

Python3 port : https://github.com/LAL/tracking2Dsim

(Figure: the 2D toy tracker)

Then run ./runMany.sh:

#!/bin/sh
for i in `seq 1439 2000`; do
    python produce.py --seed $i --N 100 --M 10
done

Particles ground truth

In [2]:
part_cols=['event_id','cluster_id','pt', 'phi', 'xVtx', 'yVtx']
particles=pd.read_csv("data/particles_100_1439.csv", names=part_cols)
part_ev0=particles.query("event_id==143900")
part_ev0
Out[2]:
event_id cluster_id pt phi xVtx yVtx
0 143900 0 3875.998795 -3.113895 0.677386 -0.103509
1 143900 1 7350.878984 -1.019270 -0.748584 0.954167

The hits files

In [3]:
hits_cols=['event_id','cluster_id','layer','iphi', 'x', 'y']
hits=pd.read_csv("data/hits_100_1439.csv", names=hits_cols)
hits_ev0=hits.query("event_id==143900")
hits_ev0
Out[3]:
event_id cluster_id layer iphi x y
0 143900 1 1 17898 44.550584 -72.389540
1 143900 1 2 32651 81.526956 -131.826990
2 143900 0 5 25801 -404.613287 -17.694297
3 143900 0 8 64177 -997.739581 -67.199164
4 143900 1 8 105688 541.313646 -840.820752
5 143900 0 1 10792 -84.955114 -2.761992
6 143900 0 0 4951 -38.979971 -1.249757
7 143900 1 0 8207 20.331063 -33.281344
8 143900 1 4 57118 143.210975 -230.068722
9 143900 0 7 48759 -760.727131 -44.025352
10 143900 1 3 44881 112.294733 -180.994179
11 143900 1 5 42704 215.023742 -343.205172
12 143900 0 4 34474 -270.797581 -10.472350
13 143900 1 7 80461 409.395360 -642.681445
14 143900 0 3 27078 -212.857658 -7.785727

The detector

See ./starting_kit/HEP_tracking_starting_kit.ipynb

The data provided to the user is a list of hit positions from a simple toy detector model that mimics the ATLAS detector design (and is generic enough for recent silicon-based tracking detectors). The detector has an onion-like geometry of 9 concentric layers around the origin, at radii R = [39, 85, 155, 213, 271, 405, 562, 762, 1000] cm. The layers are very thin compared to these radii, so their thickness can be neglected.

Each layer is finely segmented in azimuth: a layer of radius $R$ contains roughly $2\pi R/\mathrm{pitch}$ pixels, where the pitch is 0.025 cm for layers 0-3 and 0.05 cm for layers 4-8 (see the computation below).

Every "pixel" is the smallest detector unit, identified by its layer and its azimuthal index iphi.
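
As a quick sanity check of this granularity formula (a worked example, not part of the starting kit), the innermost layer at R = 39 with a 0.025 cm pitch gives:

import numpy as np

n_layer0 = int(np.ceil(2 * np.pi * 39 / 0.025))
print(n_layer0)   # 9802, matching nseg[0] computed below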

In [4]:
R=np.array([39,85,155,213,271,405,562,762,1000])        # layer radii
K=2*pi/0.025                                            # pixels per unit radius at the 0.025 cm pitch
gran=np.hstack([ K * np.ones(4) , K/2 * np.ones(5) ])   # fine pitch for layers 0-3, coarse for 4-8
nseg=np.ceil(gran*R).astype(int)                        # number of pixels per layer
nseg
Out[4]:
array([  9802,  21363,  38956,  53533,  34055,  50894,  70624,  95756,
       125664])
In [5]:
nseg_cs=np.cumsum(nseg)
nseg_cs
Out[5]:
array([  9802,  31165,  70121, 123654, 157709, 208603, 279227, 374983,
       500647])
In [6]:
nseg_loffset=np.roll(nseg_cs, 1)    # starting global pixel index of each layer
nseg_loffset[0]=0
nseg_loffset
Out[6]:
array([     0,   9802,  31165,  70121, 123654, 157709, 208603, 279227,
       374983])
In [7]:
dmdresolution=(912, 1140)
dmdtotpix=np.product(dmdresolution)
print("nbr DMD pixels :        ", dmdtotpix)
print("2x nbr detector pixels :", 2*nseg_cs[-1])
print("Difference :            ", dmdtotpix- 2*nseg_cs[-1] )
nbr DMD pixels :         1039680
2x nbr detector pixels : 1001294
Difference :             38386
In [8]:
dmdshape=(910, 1101)
dmdnpix=np.product(dmdshape)

fig, (ax1, ax2) = subplots(1, 2, figsize=(16, 8))

for ir, r in enumerate(R):
    circle = Circle((0, 0), r, color=tab20(2*ir), fill=False, linewidth=10)
    ax1.add_artist(circle)
axlim=1.1*R[-1]
ax1.set_xlim(-axlim, axlim)
ax1.set_ylim(-axlim, axlim)
ax1.axis("equal")

background=np.zeros(dmdnpix)
for i, (start, end) in enumerate(zip(2*nseg_loffset, 2*nseg_cs)):
    background[start:end]=1+2*i
background=background.reshape(dmdshape)
ax2.matshow(background, cmap="tab20", vmin=0, vmax=19.5)
Out[8]:
<matplotlib.image.AxesImage at 0x7fe2c3c37110>

Naive DMD representation

In [9]:
for cluster, gp in hits_ev0.sort_values(['event_id','cluster_id', 'layer']).groupby("cluster_id"):
    indices=2*(gp["iphi"]+ nseg_loffset[gp["layer"]]).values   # 2 DMD pixels per detector pixel
# only the last cluster's indices survive the loop; they are the ones shown below
bliparr=np.zeros(dmdnpix)
bliparr[indices]=True

fig, axes = plt.subplots(1, 2, sharey=True, figsize=(18, 8))
axes[0].matshow(background, cmap="tab20", vmin=0, vmax=19.5)
axes[1].matshow(bliparr.reshape(dmdshape), cmap="Blues")
fig.tight_layout()
indices
Out[9]:
array([ 16414,  55400, 127632, 230004, 361544, 400826, 719376, 961342])

Response curve of the OPU

OPU response curve

(Iacopo Poli, personal communication)

Switching at each index: starting from a fully lit DMD, the pattern is toggled at every hit index, so the whole stretch between consecutive hits of a track is switched off.

In [10]:
if (len(indices)%2==1): indices=np.append(indices, -1)   # pad so the indices pair up
arr=ones(dmdnpix)
for row in indices.reshape(-1, 2):
    arr[row[0]:row[1]]=False

fig, axes = plt.subplots(1, 2, sharey=True, figsize=(18, 8))
axes=axes.flatten()
axes[0].matshow(background-arr.reshape(dmdshape), cmap="tab20", vmin=0, vmax=19.5)
axes[1].matshow(arr.reshape(dmdshape), cmap="Blues")
fig.tight_layout()
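
For the record, the same switching pattern can be built without the explicit loop. A minimal vectorised sketch (an equivalent formulation, not the implementation used here), assuming indices is sorted as above:

def switch_pattern(indices, npix=dmdnpix):
    # a cumulative count of toggles: odd count -> pixel off, even count -> pixel on
    toggles = np.zeros(npix)
    toggles[indices] = 1
    return 1 - (np.cumsum(toggles) % 2)

Reshaping switch_pattern(indices) to dmdshape reproduces arr above.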

And we iterate over the hits files

In [11]:
from glob import glob
from tqdm import tqdm
from os.path import exists

pbar=tqdm(glob("data/hits*.csv"))

for file in pbar:
    break   # the switch patterns have already been generated; remove this break to (re)run
    events=file[14:18]   # seed encoded in the file name, e.g. data/hits_100_1439.csv -> "1439"
    if exists("opu_switch2/swi"+events+".npz"):
        continue
    pbar.set_description("Evt " + events)
    hits=pd.read_csv(file, names=['event_id','cluster_id','layer','iphi', 'x', 'y'])
    clusters={}
    for event, egp in hits.sort_values(['event_id','cluster_id', 'layer']).groupby("event_id"):
        for cluster, gp in egp.sort_values(['event_id','cluster_id', 'layer']).groupby("cluster_id"):
            arr=np.ones((910*1101))
            indices=2*(gp["iphi"]+ nseg_loffset[gp["layer"]]).values
            if (len(indices)%2==1): indices=np.append(indices, -1) 
            indices=indices.reshape(-1, 2)
            for row in indices:
                arr[row[0]:row[1]]=False
            arr=arr.reshape(910,1101)
            clusters[str(event)+"_"+str(cluster)]=arr
    np.savez_compressed("opu_switch2/swi"+events, **clusters)
        
  0%|          | 0/563 [00:00<?, ?it/s]

To the OPU...

... and back

Loading data

In [12]:
from os.path import exists

X_train=np.array([])
dfy=pd.DataFrame()

for prefile in range(1400, 1900):
    sp=str(prefile)
    fy="data/particles_100_"+sp+".csv"
    fopu="switchoutput/trackml_opu_X_train_30k_2Dswitch_"+sp+".npy.npz"

    if not (exists(fy) and exists(fopu)): continue
    Xt=np.load(fopu)['train']
    dftmp=pd.read_csv(fy, names=['event','particle','pt', 'phi', 'xVtx', 'yVtx'])

    if len(Xt) != len(dftmp):   # skip files where OPU output and ground truth are inconsistent
        continue
    X_train=Xt if len(X_train)==0 else np.concatenate([X_train, Xt])

    dfy=pd.concat([dfy, dftmp])


print(np.shape(X_train), len(dfy))
(62555, 30000) 62555

OK, but does it even work?

In [13]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from time import time


def Xridge(X, y, gamma=1):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2)
    ridge=Ridge(alpha=gamma)
    start = time()
    ridge.fit(Xtr, ytr)
    end = time()
    ypr=ridge.predict(Xte)
    return ypr, yte.values, end-start
In [14]:
n_feats=5000
choice=np.random.choice(30000, n_feats)   # note: np.random.choice samples with replacement by default
X_choice=X_train[:, choice] 
y_pred, y_true, timing=Xridge(X_choice, dfy[["pt", "phi"]])

Visualisation for $P_T$ fit

In [15]:
figsize(12, 12)
scatter(y_true[:, 0], y_pred[:, 0], alpha=0.5, s=20)
xylim=10000
plot( (-xylim, xylim), (-xylim, xylim) , color="red")
xlabel("True value")
ylabel("Prediction")
title("$P_T$ estimate from OPU ridge regression, 5K features")

axis("equal")
print(timing, "s")
10.475732326507568 s

Visualisation for $\phi$ fit

In [16]:
scatter(y_true[:, 1], y_pred[:, 1], alpha=0.5, s=20)
xylim=pi
plot( (-xylim, xylim), (-xylim, xylim) , color="red")
xlabel("True value")
ylabel("Prediction")
title("$\phi$ estimate from OPU ridge regression, 5K features")

axis("equal")
Out[16]:
(-3.459043566740272, 3.4593753916708403, -5.231807773130645, 5.521512745768216)

Visualisation for $1/P_T$ fit

In [17]:
dfy["rPt"]= 1/dfy["pt"]
y_pred, y_true, timing=Xridge(X_choice, dfy.rPt)
In [18]:
scatter(y_true, y_pred, alpha=0.5, s=20)
xylim=0.004
plot( (-xylim, xylim), (-xylim, xylim) , color="red")
xlabel("True value")
ylabel("Prediction")
title("1/Pt estimate from OPU ridge regression, 5K features")
axvline(1e-4, c="g")
axvline(-1e-4, c="g")
annotate("$1 / P_T^{max} = 1/10000$", xy=(1e-4, 3e-3), xytext=(5e-4, 3e-3), color="g", arrowprops=dict(arrowstyle="->", color='green'))
axis("equal")
xlim(-xylim, xylim)
ylim(-xylim, xylim)
Out[18]:
(-0.004, 0.004)

Convenience dataframe with the prediction residuals

In [19]:
dfp=pd.DataFrame.from_dict({"true": y_true, "pred": y_pred})
dfp["dif"]=dfp.pred-dfp.true
dfp.head()
Out[19]:
true pred dif
0 0.000509 0.000451 -0.000058
1 -0.000195 0.000168 0.000363
2 -0.000101 0.000205 0.000306
3 -0.000133 0.000323 0.000456
4 0.000320 0.000029 -0.000291

Residuals

In [20]:
scatter(dfp.true, dfp.dif, alpha=0.5, s=20)
xylim=0.004
xlabel("$y_{true}$")
ylabel("$y_{pred} - y_{true}$")
title("1/Pt estimate from OPU ridge regression, 5K features")
axvline(1e-4, c="g")
axvline(-1e-4, c="g")
annotate("$1 / P_T  = 1/10 000$", xy=(1e-4, 3e-3), xytext=(5e-4, 3e-3), color="g", arrowprops=dict(arrowstyle="->", color='green'))
xlim(-xylim, xylim)
ylim(-xylim, xylim)
Out[20]:
(-0.004, 0.004)

Computing the interquartile range (IQR)
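
The IQR of the residuals is the distance between their 25% and 75% quantiles. As a one-line sketch (a plain global IQR, not the rolling version computed below):

global_iqr = np.percentile(dfp.dif, 75) - np.percentile(dfp.dif, 25)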

In [21]:
from scipy import signal

figsize(15, 6)
# running quantiles of the residual as a function of the true value:
# sort by y_true, then take a centred rolling window over the residuals
dfs=dfp.sort_values("true")
window=dfs["dif"].rolling(50, center=True)
xquart=dfs["true"].values
lquart=window.quantile(.25, interpolation='midpoint')
hquart=window.quantile(.75, interpolation='midpoint')
mquart=window.quantile(.50, interpolation='midpoint')

# drop the NaNs left by the rolling window at both edges
xquart=xquart[~np.isnan(mquart)]
lquart=lquart[~np.isnan(lquart)]
hquart=hquart[~np.isnan(hquart)]
mquart=mquart[~np.isnan(mquart)]

lsmooth=signal.savgol_filter(lquart, 5001, 3)
hsmooth=signal.savgol_filter(hquart, 5001, 3)
msmooth=signal.savgol_filter(mquart, 5001, 3)

plot(xquart, lquart, ".", alpha=0.2, color=tab20(1))
plot(xquart, lsmooth, color=tab20(0))
plot(xquart, mquart, ".", alpha=0.2, color=tab20(3))
plot(xquart, msmooth, color=tab20(2))
plot(xquart, hquart, ".", alpha=0.2, color=tab20(5))
plot(xquart, hsmooth, color=tab20(4))
xlabel("$y_{true}$")
ylabel("$y_{pred} - y_{true}$")
Out[21]:
Text(0, 0.5, '$y_{pred} - y_{true}$')

IQR visualisation

In [22]:
figsize(12, 12)
scatter(dfp.true, dfp.dif, alpha=0.5, s=20, label="data")
xylim=0.004
xlabel("$y_{true}$")
ylabel("$y_{pred} - y_{true}$")
title("1/Pt estimate from OPU ridge regression, 5K features")
axvline(1e-4, c="g")
axvline(-1e-4, c="g")
annotate("$1 / P_T  = 1/10 000$", xy=(1e-4, 3e-3), xytext=(5e-4, 3e-3), color="g", arrowprops=dict(arrowstyle="->", color='green'))
xlim(-xylim, xylim)
ylim(-xylim, xylim) 
plot(xquart, msmooth, color="red", label="median")
fill_between(xquart, lsmooth, hsmooth, alpha=0.2, color="red", label="IQR")
legend()
Out[22]:
<matplotlib.legend.Legend at 0x7fe2b9eb9dd0>
In [23]:
scatter(dfp.true, dfp.dif, alpha=0.5, s=20, label="data")
xlabel("$y_{true}$")
ylabel("$y_{pred} - y_{true}$")
title("1/Pt estimate from OPU ridge regression, 5K features")

axvline(1e-4, c="g")
axvline(-1e-4, c="g")

xlim(-0.0025, 0.0025)
ylim(-0.001, 0.001)

plot(xquart, msmooth, color="red", label="median")
fill_between(xquart, lsmooth, hsmooth, alpha=0.2, color="red", label="IQR")
legend()
Out[23]:
<matplotlib.legend.Legend at 0x7fe2ba25f490>

Distribution of residuals

In [24]:
figsize(15, 5)
ax=sns.distplot(dfp.dif, hist=False, rug=True)
ax.set_xlim(-1e-3, 1e-3)
Out[24]:
(-0.001, 0.001)

Just a convenience function

In [25]:
from scipy.stats import iqr

fitres=[{"n_feats":n_feats, 
         "ytrue": y_true,
         "ypred": y_pred,
         "ydiff": y_pred - y_true,
         "time" : timing,
         "std": std(y_pred - y_true),
         "iqr": iqr(y_pred - y_true)
        }]


def appendRidge(n_feats):
    # fit a ridge on n_feats randomly chosen OPU features and record precision and timing
    dic={}
    
    choice=np.random.choice(30000, n_feats)
    X_choice=X_train[:, choice]
    
    ypr, yte, timing=Xridge(X_choice, dfy.rPt)
    dic["n_feats"]=n_feats
    dic["ytrue"]=yte
    dic["ypred"]=ypr
    dic["ydiff"]=ypr-yte
    dic["time"]=timing
    dic["std"]=std(ypr-yte)
    dic["iqr"]=iqr(ypr-yte)
    fitres.append(dic)
    

xs=[x["n_feats"] for x in fitres]
ys=[x["iqr"]     for x in fitres]
ts=[x["time"] for x in fitres]
text=["{:.3f} s".format(x["time"]) for x in fitres]
In [26]:
import plotly.graph_objects as go

marker_prop=dict(colorscale="Viridis", size=12, showscale=True, cmin=0, cmax=10, color=ts, colorbar={"title":"Time (s)"})
data=go.Scatter(x=xs, y=ys, hovertext=text, mode="markers", marker=marker_prop, hoverinfo="all")
fig = go.FigureWidget(data=[data])
fig.update_xaxes(range=[0.9, 4.5], type="log")
fig.update_yaxes(range=[0, 6.1e-4])
fig.update_layout(height=600,  title_font_size=20, title_text="Ridge regression fit precision vs N random features ",
                 xaxis_title="Number of random features", yaxis_title="IQR", xaxis_title_font=dict(size=20),  yaxis_title_font=dict(size=20))
trace= fig.data[0]
fig 
In [ ]:
for n_feats in np.logspace(1, log10(20000), 20, dtype=int):#[::-1]:
    appendRidge(n_feats)
    xs=[x["n_feats"] for x in fitres]
    ys=[x["iqr"]     for x in fitres]
    ts=[x["time"] for x in fitres]
    text=["{:.3f} s".format(x["time"]) for x in fitres]
    trace.x = xs
    trace.y = ys
    trace.marker.color=ts
    trace.hovertext=text
    print(n_feats)
    
10
14
22
33
49
73
110
164
245
366
546
814
1215
1813
2706
4037

Conclusions

  • A hands-on talk on how (and whether) to use the OPU
  • It works!
  • The problem at hand must be carefully designed / crafted to fit the OPU constraints
  • It is not obvious how the full TrackML challenge could fit into this frame
  • The OPU itself is fast, but there is additional overhead for compression, transfer and analysis
  • Is it worth it compared to neural networks? Under which conditions?