Kaggle's Flavours of Physics: the second-ranked solution

This solution ranked second on the Private Leaderboard of the Kaggle "Flavours of Physics: Finding τ → μμμ" competition. The model is based on gradient boosting and implemented in Python using the XGBoost library. It combines two XGBoost classifiers (boosters) trained on different sets of features. The first booster is an ensemble of 200 decision trees that targets mostly geometric features (such as impact parameters and track isolation variables). The second booster consists of 100 trees trained on purely kinematic features. The final prediction is a weighted average of the probabilities predicted by the two classifiers, with a weight of 0.78 assigned to the first booster. Combining two independent classifiers makes it easy to pass the correlation test. To pass the agreement test, it is sufficient to exclude SPDhits from the features used in training.
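
The weighted average described above can be illustrated with a minimal sketch. The function name `blend` and the example probabilities are hypothetical, and giving the second booster the remaining weight of 1 - 0.78 is an assumption based on the phrase "weighted average"; the actual combination lives in predict.py.

```python
def blend(p_geometric, p_kinematic, w_geometric=0.78):
    """Weighted average of two signal probabilities.

    The 0.78 weight for the first booster comes from the text above.
    Assigning the second booster the remaining 1 - 0.78 is an assumption.
    """
    return w_geometric * p_geometric + (1.0 - w_geometric) * p_kinematic


# Hypothetical per-event probabilities from the two boosters.
first = [0.90, 0.20, 0.55]
second = [0.70, 0.40, 0.10]
blended = [blend(a, b) for a, b in zip(first, second)]
```

Because the function uses only arithmetic operators, it should also apply elementwise when given numpy arrays of predictions instead of single numbers.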

Dependencies

The XGBoost library must be installed
The Python packages numpy and pandas are required, along with the csv module from the standard library
The training and test datasets (the files training.csv and test.csv) can be downloaded from here

How to generate the solution

Put the data files training.csv and test.csv in the data directory.
To train the XGBoost classifiers, run python train.py. The trained boosters will be saved in the files bst1.model and bst2.model, so you can make predictions on new datasets without re-training the model.
To make a prediction, run python predict.py. Results will be written to submission.csv.
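
As a rough illustration of the last step, the sketch below writes predictions with the csv module. The function `write_submission` is hypothetical and not taken from predict.py, and the column names `id` and `prediction` are an assumption about the submission format rather than something stated in this README.

```python
import csv
import io


def write_submission(ids, predictions, out):
    """Write one (id, prediction) row per test event to a file-like object."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "prediction"])  # assumed column names
    for event_id, p in zip(ids, predictions):
        writer.writerow([event_id, f"{p:.6f}"])


# An in-memory buffer stands in for open("submission.csv", "w", newline="").
buffer = io.StringIO()
write_submission([101, 102], [0.856, 0.244], buffer)
text = buffer.getvalue()
```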

Feature engineering

Several new features were designed in addition to the original ones. The original feature SPDhits was not used because including it prevents the model from passing the agreement test. The features used to train each booster are listed below.

Features for the first booster

Original features: FlightDistance, FlightDistanceError, LifeTime, IP, IPSig, VertexChi2, dira, pt, DOCAone, DOCAtwo, DOCAthree, IP_p0p2, IP_p1p2, isolationa, isolationb, isolationc, isolationd, isolatione, isolationf, iso, CDF1, CDF2, CDF3, ISO_SumBDT, p0_IsoBDT, p1_IsoBDT, p2_IsoBDT, p0_track_Chi2Dof, p1_track_Chi2Dof, p2_track_Chi2Dof, p0_IP, p0_IPSig, p1_IP, p1_IPSig, p2_IP, p2_IPSig.
New features:
- E is the full energy of the mother particle calculated assuming that the final-state particles p0, p1, and p2 are muons (E = E0 + E1 + E2).
- FlightDistanceSig is the ratio (FlightDistance / FlightDistanceError).
- DOCA_sum is the sum (DOCAone + DOCAtwo + DOCAthree).
- isolation_sum is the sum (isolationa + isolationb + isolationc + isolationd + isolatione + isolationf).
- IsoBDT_sum is the sum (p0_IsoBDT + p1_IsoBDT + p2_IsoBDT).
- track_Chi2Dof is calculated as sqrt[(p0_track_Chi2Dof – 1)^2 + (p1_track_Chi2Dof – 1)^2 + (p2_track_Chi2Dof – 1)^2].
- IP_sum is the sum (p0_IP + p1_IP + p2_IP).
- IPSig_sum is the sum (p0_IPSig + p1_IPSig + p2_IPSig).
- CDF_sum is the sum (CDF1 + CDF2 + CDF3).
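
The derived features listed above can be sketched for a single event as follows. The function `add_geometric_features` and the example values are hypothetical; the repository's own implementation is in features.py and may differ. E is left out here because it depends on the per-muon energies defined for the second booster.

```python
def add_geometric_features(row):
    """Return a copy of `row` with the derived first-booster features added.

    `row` maps original feature names to numbers.
    """
    out = dict(row)
    out["FlightDistanceSig"] = row["FlightDistance"] / row["FlightDistanceError"]
    out["DOCA_sum"] = row["DOCAone"] + row["DOCAtwo"] + row["DOCAthree"]
    out["isolation_sum"] = sum(row["isolation" + s] for s in "abcdef")
    out["IsoBDT_sum"] = sum(row[f"p{i}_IsoBDT"] for i in range(3))
    # Distance of the three track chi2/dof values from the ideal value of 1.
    out["track_Chi2Dof"] = sum(
        (row[f"p{i}_track_Chi2Dof"] - 1.0) ** 2 for i in range(3)
    ) ** 0.5
    out["IP_sum"] = sum(row[f"p{i}_IP"] for i in range(3))
    out["IPSig_sum"] = sum(row[f"p{i}_IPSig"] for i in range(3))
    out["CDF_sum"] = row["CDF1"] + row["CDF2"] + row["CDF3"]
    return out


# Hypothetical event with made-up values, for illustration only.
event = {
    "FlightDistance": 12.0, "FlightDistanceError": 0.5,
    "DOCAone": 0.01, "DOCAtwo": 0.02, "DOCAthree": 0.03,
    "CDF1": 1.0, "CDF2": 0.5, "CDF3": 0.25,
}
for s in "abcdef":
    event["isolation" + s] = 1.0
for i in range(3):
    event[f"p{i}_IsoBDT"] = -0.1
    event[f"p{i}_track_Chi2Dof"] = 1.0 + i  # 1.0, 2.0, 3.0
    event[f"p{i}_IP"] = 0.2
    event[f"p{i}_IPSig"] = 4.0

features = add_geometric_features(event)
```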

Features for the second booster

Original features: dira, pt, p0_pt, p0_p, p0_eta, p1_pt, p1_p, p1_eta, p2_pt, p2_p, p2_eta.
New features:
- E is the full energy of the mother particle calculated assuming that the final-state particles p0, p1, and p2 are muons (E = E0 + E1 + E2).
- pz is the longitudinal momentum of the mother particle.
- beta is the relativistic beta of the mother particle (beta = v / c).
- gamma is the relativistic gamma of the mother particle (gamma = 1 / sqrt(1 – beta^2)).
- beta_gamma is beta×gamma calculated as FlightDistance / (LifeTime×c), where c is the speed of light.
- Delta_E is the difference between energies of the mother particle calculated in two different ways.
- Delta_M is the difference between masses of the mother particle calculated in two different ways.
- flag_M equals 1 if the mass of the mother particle is close to the tau mass, and 0 otherwise.
- E0 is the full energy of the particle p0 calculated as E0 = sqrt[(m_mu)^2 + (p0_p)^2], where m_mu is the muon mass.
- E1 is the full energy of the particle p1 calculated as E1 = sqrt[(m_mu)^2 + (p1_p)^2], where m_mu is the muon mass.
- E2 is the full energy of the particle p2 calculated as E2 = sqrt[(m_mu)^2 + (p2_p)^2], where m_mu is the muon mass.
- E0_ratio is the ratio (E0 / E).
- E1_ratio is the ratio (E1 / E).
- E2_ratio is the ratio (E2 / E).
- p0_pt_ratio is the ratio (p0_pt / pt).
- p1_pt_ratio is the ratio (p1_pt / pt).
- p2_pt_ratio is the ratio (p2_pt / pt).
- eta_01 is the difference (p0_eta – p1_eta).
- eta_02 is the difference (p0_eta – p2_eta).
- eta_12 is the difference (p1_eta – p2_eta).
- t_coll is calculated as (p0_pt + p1_pt + p2_pt) / pt (this equals unity if the final-state particles p0, p1, and p2 are collinear in the transverse plane).
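
A minimal sketch of the features above whose formulas are fully specified is given below. The function `add_kinematic_features` is hypothetical. The constants assume momenta in MeV/c, FlightDistance in mm, and LifeTime in ns, which this README does not state. The expression for pz (the sum of pt × sinh(eta) over the three tracks) is the standard kinematic relation and is an assumption about how the feature was built. beta, gamma, Delta_E, Delta_M, and flag_M are omitted because their exact definitions are not given here; see features.py for the actual implementation.

```python
import math

M_MU = 105.6583745    # muon mass in MeV/c^2 (assumes momenta in MeV/c)
C_LIGHT = 299.792458  # speed of light in mm/ns (assumes mm and ns units)


def add_kinematic_features(row, m_mu=M_MU, c=C_LIGHT):
    """Return a copy of `row` with derived second-booster features added."""
    out = dict(row)
    for i in range(3):
        # Energy of each track under the muon mass hypothesis.
        out[f"E{i}"] = math.sqrt(m_mu ** 2 + row[f"p{i}_p"] ** 2)
    out["E"] = out["E0"] + out["E1"] + out["E2"]
    for i in range(3):
        out[f"E{i}_ratio"] = out[f"E{i}"] / out["E"]
        out[f"p{i}_pt_ratio"] = row[f"p{i}_pt"] / row["pt"]
    out["eta_01"] = row["p0_eta"] - row["p1_eta"]
    out["eta_02"] = row["p0_eta"] - row["p2_eta"]
    out["eta_12"] = row["p1_eta"] - row["p2_eta"]
    out["t_coll"] = (row["p0_pt"] + row["p1_pt"] + row["p2_pt"]) / row["pt"]
    out["beta_gamma"] = row["FlightDistance"] / (row["LifeTime"] * c)
    # Assumption: mother pz is the sum of the daughters' pz = pt * sinh(eta).
    out["pz"] = sum(
        row[f"p{i}_pt"] * math.sinh(row[f"p{i}_eta"]) for i in range(3)
    )
    return out


# Hypothetical event: three tracks collinear in the transverse plane.
event = {"pt": 6000.0, "FlightDistance": 15.0, "LifeTime": 0.001}
for i, (pt, eta) in enumerate([(1000.0, 2.0), (2000.0, 2.5), (3000.0, 3.0)]):
    event[f"p{i}_pt"] = pt
    event[f"p{i}_eta"] = eta
    event[f"p{i}_p"] = pt * math.cosh(eta)  # |p| = pt * cosh(eta)

kin = add_kinematic_features(event)
```

In this example the mother pt is set equal to the sum of the track pt values, so t_coll evaluates to unity, matching the collinear case described in the list.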
