Speaker
Dr
Giulio Palombo
(University of Milan - Bicocca)
Description
Datasets in modern High Energy Physics (HEP) experiments are often
described by dozens or even hundreds of input variables (features).
Reducing a full feature set to a subset that most completely represents
the information in the data is therefore an important task in the analysis
of HEP data. We compare various feature selection algorithms for supervised
learning on several datasets, including imaging gamma-ray Cherenkov
telescope (MAGIC) data from the UCI repository.
We use classifiers and feature selection methods implemented in the
statistical package StatPatternRecognition (SPR), a free open-source C++
package developed in the HEP community
(http://sourceforge.net/projects/statpatrec/). For each dataset, we select
a powerful classifier and estimate its learning accuracy on feature
subsets obtained by various feature selection algorithms. When possible,
we also estimate the CPU time needed for the feature subset selection. The
results of this analysis are compared with those published previously for
these datasets using other statistical packages such as R and Weka. We
show that the most accurate, yet slowest, method is a wrapper algorithm
known as generalized sequential forward selection ("Add N Remove R"),
as implemented in SPR.
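
To illustrate the idea behind an "Add N Remove R" wrapper, the following is a minimal Python sketch under stated assumptions; it is not the SPR implementation. The function name `add_n_remove_r`, its parameters, and the toy figure of merit `toy_score` are hypothetical. In a real analysis the score would be a classifier's accuracy estimated on validation data for the given feature subset, which is why wrapper methods are slow. This sketch adds and removes features one at a time in greedy steps; generalized variants may evaluate groups of features together.

```python
from typing import Callable, FrozenSet, List, Sequence


def add_n_remove_r(
    features: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
    n_add: int,
    n_remove: int,
    target_size: int,
) -> List[str]:
    """Greedy wrapper selection: add n_add features, then remove n_remove."""
    if not 0 <= n_remove < n_add:
        raise ValueError("need 0 <= n_remove < n_add for a forward search")
    if not 0 < target_size <= len(features):
        raise ValueError("target_size must be in 1..len(features)")

    selected: List[str] = []

    def remove_worst() -> None:
        # Drop the feature whose removal leaves the highest-scoring subset.
        worst = max(selected, key=lambda f: score(frozenset(selected) - {f}))
        selected.remove(worst)

    while len(selected) < target_size:
        # Forward phase: add the single best feature, n_add times.
        for _ in range(n_add):
            remaining = [f for f in features if f not in selected]
            if not remaining:
                break
            best = max(remaining, key=lambda f: score(frozenset(selected + [f])))
            selected.append(best)
        if len(selected) == len(features):
            break  # nothing left to add; trim below instead of cycling
        # Backward phase: remove the least useful feature, n_remove times.
        for _ in range(n_remove):
            remove_worst()

    # If the last cycle overshot the target, trim back greedily.
    while len(selected) > target_size:
        remove_worst()
    return selected


# Toy figure of merit (an assumption for illustration only): "b" and "c"
# are strong together, while "a" is the best single feature but partly
# redundant with "b".
WEIGHTS = {"a": 0.5, "b": 0.4, "c": 0.1, "d": 0.0}


def toy_score(subset: FrozenSet[str]) -> float:
    total = sum(WEIGHTS[f] for f in subset)
    if {"b", "c"} <= subset:
        total += 0.6  # interaction bonus
    if {"a", "b"} <= subset:
        total -= 0.4  # redundancy penalty
    return total


chosen = add_n_remove_r(["a", "b", "c", "d"], toy_score,
                        n_add=2, n_remove=1, target_size=2)
print(sorted(chosen))
```

In this toy case a plain forward search would keep "a", its first pick, whereas the removal step lets the search discard "a" once "b" and "c" are both present. Each call to `score` stands in for a full classifier evaluation, so the cost grows quickly with the number of features.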
Author
Dr
Giulio Palombo
(University of Milan - Bicocca)
Co-author
Dr
Ilya Narsky
(California Institute of Technology)