Anomaly Detection - A Survey (PDF download)
Main content:
tection, we provide a detailed discussion of the application domains where anomaly
detection techniques have been used. For each domain we discuss the notion of an
anomaly, the different aspects of the anomaly detection problem, and the challenges
faced by the anomaly detection techniques. We also provide a list of techniques
that have been applied in each application domain.
The existing surveys discuss anomaly detection techniques that detect the simplest form of anomalies. We distinguish the simple anomalies from complex anomalies. The discussion of applications of anomaly detection reveals that for most application domains, the interesting anomalies are complex in nature, while most of
the algorithmic research has focussed on simple anomalies.
1.5 Organization
This survey is organized into three parts and its structure closely follows Figure
2. In Section 2 we identify the various aspects that determine the formulation
of the problem and highlight the richness and complexity associated with anomaly
detection. We distinguish simple anomalies from complex anomalies and define two
types of complex anomalies, viz., contextual and collective anomalies. In Section
3 we briefly describe the different application domains where anomaly detection
has been applied. In subsequent sections we provide a categorization of anomaly
detection techniques based on the research area to which they belong. The majority
of the techniques can be categorized into classification based (Section 4), nearest
neighbor based (Section 5), clustering based (Section 6), and statistical techniques
(Section 7). Some techniques belong to research areas such as information theory
(Section 8), and spectral theory (Section 9). For each category of techniques we also
discuss their computational complexity for training and testing phases. In Section
10 we discuss various contextual anomaly detection techniques. We discuss various
collective anomaly detection techniques in Section 11. We present some discussion
on the limitations and relative performance of various existing techniques in Section
12. Section 13 contains concluding remarks.
2. DIFFERENT ASPECTS OF AN ANOMALY DETECTION PROBLEM
This section identifies and discusses the different aspects of anomaly detection. As
mentioned earlier, a specific formulation of the problem is determined by several
different factors such as the nature of the input data, the availability (or unavailability) of labels as well as the constraints and requirements induced by the application
domain. This section brings forth the richness in the problem domain and justifies
the need for the broad spectrum of anomaly detection techniques.
2.1 Nature of Input Data
A key aspect of any anomaly detection technique is the nature of the input data.
Input is generally a collection of data instances (also referred to as object, record, point,
vector, pattern, event, case, sample, observation, entity) [Tan et al. 2005, Chapter
2]. Each data instance can be described using a set of attributes (also referred
to as variable, characteristic, feature, field, dimension). The attributes can be of
different types such as binary, categorical or continuous. Each data instance might
consist of only one attribute (univariate) or multiple attributes (multivariate). In
To Appear in ACM Computing Surveys, 09 2009.
the case of multivariate data instances, all attributes might be of the same type or
might be a mixture of different data types.
The nature of the attributes determines the applicability of anomaly detection techniques. For example, for statistical techniques different statistical models have to
be used for continuous and categorical data. Similarly, for nearest neighbor based
techniques, the nature of attributes would determine the distance measure to be
used. Often, instead of the actual data, the pairwise distance between instances
might be provided in the form of a distance (or similarity) matrix. In such cases,
techniques that require original data instances are not applicable, e.g., many statistical and classification based techniques.
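The two points above, that attribute types drive the choice of distance measure and that some techniques can work from a distance matrix alone, can be sketched as follows. This is only an illustration under assumptions: the Gower-style mixed distance, the k-th-nearest-neighbor score, the function names, and the sample records are all hypothetical choices, not something the survey prescribes.

```python
def mixed_distance(a, b, is_categorical, ranges):
    """Gower-style distance (an assumed choice): 0/1 mismatch for
    categorical attributes, range-normalized absolute difference
    for continuous attributes, averaged over all attributes."""
    total = 0.0
    for x, y, cat, r in zip(a, b, is_categorical, ranges):
        if cat:
            total += 0.0 if x == y else 1.0
        else:
            total += abs(x - y) / r if r else 0.0
    return total / len(a)


def knn_scores(dist, k):
    """Score each instance by the distance to its k-th nearest neighbor.
    Only the pairwise distance matrix is needed, not the original data."""
    scores = []
    for i, row in enumerate(dist):
        others = sorted(d for j, d in enumerate(row) if j != i)
        scores.append(others[k - 1])
    return scores


# Hypothetical multivariate records: (amount, merchant category).
records = [
    (20.0, "grocery"),
    (25.0, "grocery"),
    (22.0, "grocery"),
    (24.0, "fuel"),
    (900.0, "jewelry"),
]
is_categorical = (False, True)
amounts = [r[0] for r in records]
ranges = (max(amounts) - min(amounts), None)  # range unused for categorical

# Build the distance matrix once; afterwards the raw records are not needed.
dist = [
    [mixed_distance(a, b, is_categorical, ranges) for b in records]
    for a in records
]

scores = knn_scores(dist, k=2)
most_anomalous = max(range(len(scores)), key=scores.__getitem__)
```

Note that `knn_scores` never touches `records`, which is why a nearest neighbor based approach remains usable when only the distance (or similarity) matrix is supplied.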
Input data can also be categorized based on the relationship present among data
instances [Tan et al. 2005]. Most of the existing anomaly detection techniques deal
with record data (or point data), in which no relationship is assumed among the
data instances.
In general, data instances can be related to each other. Some examples are
sequence data, spatial data, and graph data. In sequence data, the data instances
are linearly ordered, e.g., time-series data, genome sequences, protein sequences. In
spatial data, each data instance is related to its neighboring instances, e.g., vehicular
traffic data, ecological data. When the spatial data has a temporal (sequential)
component it is referred to as spatio-temporal data, e.g., climate data. In graph
data, data instances are represented as vertices in a graph and are connected to
other vertices with edges. Later in this section we will discuss situations where
such relationships among data instances become relevant for anomaly detection.
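One way to picture the difference between these data types is through the neighbors that each relationship assigns to an instance. The sketch below is a hypothetical illustration: the containers, the window width, the 8-cell grid neighborhood, and the function names are assumptions made for the example.

```python
# Sequence data: instances are linearly ordered, so position matters.
sequence = [12.1, 12.4, 12.3, 30.0, 12.2]


def sequence_neighbors(seq, i, width=1):
    """Instances within `width` positions of index i."""
    lo, hi = max(0, i - width), min(len(seq), i + width + 1)
    return [seq[j] for j in range(lo, hi) if j != i]


# Spatial data: each instance is related to nearby locations (here, a grid).
grid = [
    [5, 5, 6],
    [5, 9, 6],
    [4, 5, 5],
]


def spatial_neighbors(g, r, c):
    """Values in the up-to-8 cells surrounding cell (r, c)."""
    out = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            if (dr or dc) and 0 <= rr < len(g) and 0 <= cc < len(g[0]):
                out.append(g[rr][cc])
    return out


# Graph data: instances are vertices; edges define the relationship.
graph = {"a": ["b", "c"], "b": ["a"], "c": ["a"], "d": []}


def graph_neighbors(adj, v):
    """Vertices directly connected to v."""
    return list(adj[v])
```

In record (point) data none of these neighbor functions would exist, since no relationship among instances is assumed.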
2.2 Type of Anomaly
An important aspect of an anomaly detection technique is the nature of the desired
anomaly. Anomalies can be classified into the following three categories:
2.2.1 Point Anomalies. If an individual data instance can be considered as
anomalous with respect to the rest of data, then the instance is termed as a point
anomaly. This is the simplest type of anomaly and is the focus of the majority of
research on anomaly detection.
For example, in Figure 1, points o1 and o2 as well as points in region O3 lie
outside the boundary of the normal regions, and hence are point anomalies since
they are different from normal data points.
As a real life example, consider credit card fraud detection. Let the data set
correspond to an individual’s credit card transactions. For the sake of simplicity,
let us assume that the data is defined using only one feature: amount spent. A
transaction for which the amount spent is very high compared to the normal range
of expenditure for that person will be a point anomaly.
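The credit card example can be sketched with a single feature, the amount spent. The text does not name a specific detector, so the code below assumes one possible choice: a modified z-score based on the median and the median absolute deviation (MAD), with the conventional constants 0.6745 and 3.5. The amounts are made up for illustration.

```python
from statistics import median


def point_anomalies(amounts, threshold=3.5):
    """Return indices of amounts whose modified z-score exceeds `threshold`.

    The median/MAD detector is an assumed stand-in for "very high compared
    to the normal range of expenditure"; other detectors would also fit.
    """
    med = median(amounts)
    mad = median(abs(a - med) for a in amounts)
    if mad == 0:
        return []  # no spread in the data, so nothing can be scored
    flagged = []
    for i, a in enumerate(amounts):
        score = 0.6745 * abs(a - med) / mad
        if score > threshold:
            flagged.append(i)
    return flagged


# Hypothetical transaction amounts for one cardholder.
amounts = [23.5, 41.0, 18.2, 35.9, 27.4, 30.1, 1250.0, 22.8]
flagged = point_anomalies(amounts)
```

Because the score depends only on each instance's own value relative to the rest of the data, this is a point anomaly check: no context or ordering of the transactions is used.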
2.2.2 Contextual Anomalies. If a data instance is anomalous in a specific context (but not otherwise), then it is termed as a contextual anomaly (also referred
to as conditional anomaly [Song et al. 2007]).
The notion of a context is induced by the structure in the data set and has to be
specified as a part of the problem formulation. Each data instance is defined using
the following two sets of attributes:
[Figure: monthly temperature plotted against time (x-axis ticks Mar, Jun, Sept, Dec, repeated over three years), with two marked points t1 and t2.]
Chandola, Banerjee and Kumar
(1) Contextual attributes. The contextual attributes are used to determine the
context (or neighborhood) for that instance. For example, in spatial data sets,
the longitude and latitude of a location are the contextual attributes. In time-series data, time is a contextual attribute which determines the position of an
instance on the entire sequence.
(2) Behavioral attributes. The behavioral attributes define the non-contextual characteristics of an instance. For example, in a spatial data set describing the
average rainfall of the entire world, the amount of rainfall at any location is a
behavioral attribute.
The anomalous behavior is determined using the values for the behavioral attributes
within a specific context. A data instance might be a contextual anomaly in a given
context, but an identical data instance (in terms of behavioral attributes) could
be considered normal in a different context. This property is key in identifying
contextual and behavioral attributes for a contextual anomaly detection technique.
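The split between contextual and behavioral attributes can be illustrated with monthly temperatures, where the month is the contextual attribute and the temperature is the behavioral attribute. This is a minimal sketch under assumptions: the per-month mean and standard deviation profile, the three-sigma threshold, the function names, and the readings are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def fit_context_profiles(history):
    """Group behavioral values by context and summarize each group.

    `history` is a list of (month, temperature) pairs; the result maps
    each month to the (mean, standard deviation) of its temperatures.
    """
    by_context = defaultdict(list)
    for month, temp in history:
        by_context[month].append(temp)
    return {m: (mean(v), stdev(v)) for m, v in by_context.items()}


def is_contextual_anomaly(profiles, month, temp, threshold=3.0):
    """Judge the behavioral value only against its own context."""
    mu, sigma = profiles[month]
    return abs(temp - mu) > threshold * sigma


# Hypothetical historical readings in degrees Celsius.
history = [
    ("Dec", -4.0), ("Dec", -6.5), ("Dec", -5.0), ("Dec", -3.5),
    ("Jun", 21.0), ("Jun", 23.5), ("Jun", 22.0), ("Jun", 24.0),
]
profiles = fit_context_profiles(history)

# The same behavioral value, evaluated in two different contexts.
in_winter = is_contextual_anomaly(profiles, "Dec", -5.0)
in_summer = is_contextual_anomaly(profiles, "Jun", -5.0)
```

Here the identical reading of -5.0 is treated as normal in December and as anomalous in June, which is the property described above: the behavioral value alone does not decide the outcome, the context does.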