Survey on Models and Techniques for Root-Cause Analysis
complex human decisions becomes essential to manage large and
distributed systems in the Cloud and IoT era. Understanding
the root cause of an observed symptom in a complex system
has been a major problem for decades. As industry dives into
the IoT world and the amount of data generated per year
grows at an amazing speed, an important question is how to
find appropriate mechanisms to determine root causes that can
handle huge amounts of data or may provide valuable feedback
in real-time. While many survey papers aim at summarizing
the landscape of techniques for modelling system behavior and
inferring the root cause of a problem based on the resulting
models, none of those focuses on analyzing how the different
techniques in the literature fit growing requirements in terms
of performance and scalability. In this survey, we provide a
review of root-cause analysis, focusing on these particular aspects.
We also provide guidance to choose the best root-cause analysis
strategy depending on the requirements of a particular system
and application.
Index Terms—Big data, failure diagnosis, root-cause analysis.
I. INTRODUCTION
With the onset of the SMAC industry (i.e. Social, Mobile,
Analytics and Cloud), Software as a Service, and the Internet
of Things (IoT), more organizations in all industry sectors
recognize they are evolving into technology and data companies.¹
While one may be tempted to believe that this may only
affect some limited number of markets, all indicators show a
much more aggressive trend for companies in all industries
to transform businesses through software, exploring much
further market adjacencies. 54% of CEOs have entered a new
sector or sub-sector, or considered it, in the past three years.²
According to the same PricewaterhouseCoopers survey, 55%
of entertainment and media CEOs, 52% of communications
CEOs, 48% of power and utilities CEOs and 47% of banking
and capital markets CEOs say a significant competitor from
the technology sector is emerging or will emerge.
Several factors speed up the growing importance of software. The cloud has become necessary for the survival of
technology companies. Revenue for SaaS is expected to grow
at a compound annual rate of more than 20% throughout
this decade. By 2018, 59% of the total cloud workloads will
be SaaS.³ Besides, we are seeing an integration of digital
sensors, processing, connectivity and security into virtually
every industry’s products. As IoT moves out of the hype
phase, it will drive demand for network infrastructure, sensors,
software applications, and all technologies needed to operate
IoT applications, including data analytics.⁴ Cisco predicted
that IoT will unleash US$19 trillion in new profits and
cost savings globally in the next decade (Burrows, 2014).
According to Gartner, there will be nearly 26 billion devices on
the Internet of Things by 2020, which will potentially generate
zettabytes of data annually.
M. Solé and V. Muntés are with CA Technologies.
A. Ibrahim and G. Estrada are with Intel.
¹ McKendrick, J. (2015, April 30). Every Company Now A Technology
Company: Latest Round Of Mergers And Acquisitions Confirms It. Forbes.
Retrieved from http://tinyurl.com/j6f7ub5
² 18th Annual Global CEO Survey. PricewaterhouseCoopers. Retrieved
from: http://www.pwc.com/gx/en/ceo-survey/2015/download.jhtml
³ Global Cloud Index: Forecast and Methodology, 2014-2019. Retrieved
from: http://tinyurl.com/q9vxgcx
The growth of cloud and IoT poses serious challenges to IT
leaders. Developing and deploying smart, connected products
and retrofitting existing equipment is very challenging, requiring coordination of network connectivity, application protocols, data analytics, and system management. IoT platforms
are being developed⁵ to simplify the processes of developing,
connecting, controlling, and capturing insight from connected
products and assets, allowing firms to sense and respond
to changing customer needs. In particular, controlling these
complex and distributed IoT systems will require advanced
Root-Cause Analysis (RCA) capable of precisely synthesizing
the status of the system for human beings to make decisions.
Specifically, human beings will no longer be capable of
controlling such complex systems through traditional dashboards
and will require a higher level of automation to generate
hypotheses of potential root causes much more accurately.
While several decades of research have produced a large
number of algorithms and techniques to perform root cause
analysis in many different fields, there is still a lack of
understanding of how they can be used and adapted to the
growing complexity of IoT and other similar environments,
where scalability and real-time reaction become essential. In
particular, the appropriate interpretation and management of
the vast amounts of data generated in these environments need
to be underpinned by IoT and Cloud platforms in order for
them to be genuinely viable [1].
In this survey we will thus focus on the RCA models
available and the existing generation and inference algorithms
that have been developed for them, paying special attention
to performance aspects. Although there is a vast literature on
RCA, we will restrict ourselves to techniques that can be applied to
IT systems. Our survey builds on previous
contributions from other general surveys like [2]. For surveys
on specific areas, the reader may refer to [3], [4] for computer
networks, [5], [6] for software, [7], [8] for industrial systems,
[9] for smart buildings, [10] for buildings, [11] for machinery,
[12] for swarm systems, [13], [14] for automatic control systems, [15], [16] for automotive systems and [17] for aerospace
systems.
⁴ Global technology M&A 1Q15: first look. Ernst & Young. Retrieved from:
http://www.ey.com/GL/en/Industries/Technology/EY-global-technology-ma-1q15-first-look;
and Manyika, J., Chui, M., Bisson, P., Woetzel, J., Dobbs, R., Bughin, J. and
Aharon, D. (2015, June). Unlocking the potential of the Internet of Things.
McKinsey Global Institute. Retrieved from: http://tinyurl.com/jqpymqu
⁵ 2015 Technology Industry Outlook. Deloitte. Retrieved from:
http://tinyurl.com/o7sb52h
arXiv:1701.08546v2 [cs.AI] 3 Jul 2017
The remainder of the paper is organized as follows. Section II introduces the main concepts and terminology used in
the rest of the paper. Section III describes the main models
used for RCA, the different ways in which they can be
obtained and some of the learning algorithms available for
each model. The inference algorithms that can be used on
these models are the subject of Section IV while Section V
concludes this paper.
II. MAIN CONCEPTS AND TERMINOLOGY
In this section, we discuss the main concepts and terminology used in the area of RCA. In particular, we provide
some essential terminology that we will use throughout the
remainder of the survey and provide a classification of RCA
tasks.
The core concepts behind RCA are causality and explanation. Although central to any scientific endeavour, there
is no consensus on their formal definition despite centuries-long
discussions on the subject [18], [19]. This is relevant,
especially in the case of explanation, because an explanation
is often a desired output of RCA. As we will see in
Section IV, many alternatives have been proposed.
A. Terminology
In this survey, we follow the terminology proposed in [2]:
An Event is an exceptional condition occurring in the operation
of a system.
Faults (also called problems or root causes) are events that can cause other
events but are not themselves caused by other events.
According to their duration, they can be classified as permanent if the fault persists until it is repaired, intermittent
if they are discontinuous and periodic, and transient if
temporary.
An Error is caused by one or more faults and is a
discrepancy between a condition of the system and its
theoretically correct condition.
A Failure is an error that is observable from outside
the system.
A Symptom is an external manifestation of failures. This includes direct observations of failures themselves and externally visible indicators that a failure
happened that are not failures by themselves, like alarms
raised by anomaly detectors.
Root Cause Analysis, also referred to as fault localization, fault
isolation or alarm/event correlation, is the process of
inferring the set of faults that generated a given set
of symptoms. Note that this process might be trivial if
faults are directly observable, in which case they are
also symptoms. However, this is not the usual case in
complex systems. In such cases, a model that explains the
relationship between faults and symptoms must be used
to perform this inference process.
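To make the last definition concrete, the following is a minimal Python sketch of RCA as inference over a fault-to-symptom model. It is an illustration only and not a technique prescribed by this survey: the model is a hypothetical hand-written mapping, all fault and symptom names are invented for the example, and "explanation" is assumed here to mean a smallest set of faults whose symptoms cover everything observed.

```python
from itertools import combinations

# Hypothetical model: each fault maps to the symptoms it can produce.
# All fault and symptom names are illustrative assumptions.
MODEL = {
    "disk_full": {"write_errors", "service_down_alarm"},
    "network_partition": {"timeout_alarm", "service_down_alarm"},
    "memory_leak": {"high_latency_alarm"},
}


def explains(faults, symptoms, model=MODEL):
    """Return True if the given faults can produce every observed symptom."""
    caused = set().union(*(model[f] for f in faults))
    return symptoms <= caused


def minimal_explanations(symptoms, model=MODEL):
    """Return all smallest fault sets that cover the observed symptoms.

    Brute-force search by increasing set size; exponential in the number
    of faults, so only suitable for very small models.
    """
    symptoms = set(symptoms)
    faults = sorted(model)
    for size in range(len(faults) + 1):
        found = [
            set(candidate)
            for candidate in combinations(faults, size)
            if explains(candidate, symptoms, model)
        ]
        if found:
            return found
    return []  # no combination of known faults explains the symptoms


if __name__ == "__main__":
    print(minimal_explanations({"service_down_alarm"}))
    print(minimal_explanations({"write_errors", "timeout_alarm"}))
```

In this sketch a single observed alarm can be ambiguous (two different faults explain "service_down_alarm" equally well), while a richer set of symptoms narrows the candidates. The exhaustive search also hints at why performance and scalability matter once the number of possible faults grows.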