The concept of statistical anomalies, or outliers, has fascinated experimentalists since the earliest attempts to interpret data. We want to know why some data points don’t seem to belong with the others. Sometimes the goal is to exclude spurious or unrepresentative data from our model. In other cases, the anomalies themselves are what we are interested in: an outlier could represent the symptom of a disease, an attack on a computer network, a scientific discovery, or even an unfaithful partner.
We start with some general considerations, such as the relationship between clustering and anomaly detection, the choice between supervised and unsupervised methods, and the difference between global and local anomalies. We then survey the most representative anomaly detection algorithms, highlighting the kind of data each approach is best suited to and discussing its limitations. We finish with a discussion of the difficulties of anomaly detection in high-dimensional data and some new directions for anomaly detection research.