Bayesian Networks With Examples in R
The Discrete Case: Multinomial Bayesian Networks
In this chapter we will introduce the fundamental ideas behind Bayesian networks (BNs) and their basic interpretation, using a hypothetical survey on
the usage of different means of transport. We will focus on modelling discrete
data, leaving continuous data to Chapter 2 and more complex data types to
Chapter 3.
1.1 Introductory Example: Train Use Survey
Consider a simple, hypothetical survey whose aim is to investigate the usage
patterns of different means of transport, with a focus on cars and trains. Such
surveys are used to assess customer satisfaction across different social groups,
to evaluate public policies or for urban planning. Some real-world examples
can be found in Kenett et al. (2012).
In our current example we will examine, for each individual, the following
six discrete variables (labels used in computations and figures are reported in
parentheses):
• Age (A): the age, recorded as young (young) for individuals below 30 years
old, adult (adult) for individuals between 30 and 60 years old, and old
(old) for people older than 60.
• Sex (S): the biological sex of the individual, recorded as male (M) or female
(F).
• Education (E): the highest level of education or training completed by
the individual, recorded either as up to high school (high) or university
degree (uni).
• Occupation (O): whether the individual is an employee (emp) or a self-employed (self) worker.
• Residence (R): the size of the city the individual lives in, recorded as
either small (small) or big (big).
• Travel (T): the means of transport favoured by the individual, recorded
either as car (car), train (train) or other (other).
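The variables and labels listed above can be written down directly in R. The following is only a minimal sketch in base R: the object names survey.levels and n.configs are hypothetical and do not come from the text. It records the levels of each variable and counts how many joint configurations the six variables admit.

```r
# Levels of the six survey variables, using the labels given in the text.
# The list name `survey.levels` is an illustrative choice, not from the book.
survey.levels <- list(
  A = c("young", "adult", "old"),
  S = c("M", "F"),
  E = c("high", "uni"),
  O = c("emp", "self"),
  R = c("small", "big"),
  T = c("car", "train", "other")
)

# Number of levels of each variable.
n.levels <- lengths(survey.levels)

# Number of distinct joint configurations of the six variables:
# 3 * 2 * 2 * 2 * 2 * 3 = 144.
n.configs <- prod(n.levels)
n.configs
```

Even with only six small variables there are 144 possible combinations of values, which gives a sense of why a compact representation of their relationships is useful.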
In the scope of this survey, each variable falls into one of three groups. Age and
Sex are demographic indicators. In other words, they are intrinsic characteristics of the individual; they may result in different patterns of behaviour, but
are not influenced by the individual himself. On the other hand, the opposite
is true for Education, Occupation and Residence. These variables are socioeconomic indicators, and describe the individual’s position in society. Therefore,
they provide a rough description of the individual’s expected lifestyle; for example, they may characterise his spending habits or his work schedule. The
last variable, Travel, is the target of the survey, the quantity of interest whose
behaviour is under investigation.
1.2 Graphical Representation
The nature of the variables recorded in the survey, and more generally of the
three categories they belong to, suggests how they may be related to each
other. Some of those relationships will be direct, while others will be mediated
by one or more variables (indirect).
Both kinds of relationships can be represented effectively and intuitively
by means of a directed graph, which is one of the two fundamental entities
characterising a BN. Each node in the graph corresponds to one of the variables in the survey. In fact, nodes and variables are usually referred to interchangeably in
the literature. Therefore, the graph produced from this example will contain 6
nodes, labelled after the variables (A, S, E, O, R and T). Direct dependence
relationships are represented as arcs between pairs of variables (e.g., A → E
means that E depends on A). The node at the tail of the arc is called the
parent, while that at the head (where the arrow is) is called the child. Indirect
dependence relationships are not explicitly represented. However, they can be
read from the graph as sequences of arcs leading from one variable to the other
through one or more mediating variables (e.g., the combination of A → E and
E → R means that R depends on A through E). Such sequences of arcs are
said to form a path leading from one variable to the other; these two variables
must be distinct. Paths of the form A → … → A, which are known as cycles,
are not allowed. For this reason, the graphs used in BNs are called directed
acyclic graphs (DAGs).
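The notions of arc, path and cycle described above can be illustrated without any BN software. The sketch below is written in base R under stated assumptions: the adjacency-matrix encoding and the helper names reachability and is.acyclic are hypothetical choices made for this illustration, and only the two arcs used as examples in the text (A → E and E → R) are included.

```r
# Nodes of the survey and an adjacency matrix with no arcs.
# adj[i, j] == 1 encodes the arc i -> j (row = parent, column = child).
nodes <- c("A", "S", "E", "O", "R", "T")
adj <- matrix(0L, nrow = length(nodes), ncol = length(nodes),
              dimnames = list(nodes, nodes))
adj["A", "E"] <- 1L  # A -> E
adj["E", "R"] <- 1L  # E -> R

# reach[i, j] is TRUE when a directed path with at least one arc
# leads from node i to node j (Warshall's transitive closure).
reachability <- function(adj) {
  reach <- adj > 0
  for (k in seq_len(nrow(adj))) {
    # Allow paths that pass through node k.
    reach <- reach | outer(reach[, k], reach[k, ], "&")
  }
  reach
}

# A directed graph is acyclic when no node can reach itself.
is.acyclic <- function(adj) !any(diag(reachability(adj)))

reach <- reachability(adj)
reach["A", "R"]   # TRUE: R depends on A through E
reach["R", "A"]   # FALSE: no path leads back from R to A
is.acyclic(adj)   # TRUE

# Adding R -> A would close the cycle A -> E -> R -> A.
cyclic <- adj
cyclic["R", "A"] <- 1L
is.acyclic(cyclic)  # FALSE
```

The transitive closure is used here only because it makes both questions (is there a path, and is there a cycle) one-line lookups; it is not meant to reflect how any particular package checks acyclicity.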
Note, however, that some caution must be exercised in interpreting both
direct and indirect dependencies. The presence of arrows or arcs seems to
imply, at an intuitive level, that for each arc one variable should be interpreted
as a cause and the other as an effect (e.g., A → E means that A causes E). This
interpretation, which is called causal, is difficult to justify in most situations;
for this reason, in general we speak about dependence relationships instead
of causal effects. The assumptions required for causal BN modelling will be
discussed in Section 4.7.
To create and manipulate DAGs in the context of BNs, we will mainly use
the bnlearn package (short for “Bayesian network learning”).
> library(bnlearn)
As a first step, we create a DAG with one node for each variable in the survey
and no arcs.
> dag <- empty.graph(nodes = c("A", "S", "E", "O", "R", "T"))
Such a DAG is usually called an empty graph, because it has an empty arc
set. The DAG is stored in an object of class bn, which looks as follows when
printed.
> dag
  Random/Generated Bayesian network

  model:
   [A][S][E][O][R][T]
  nodes:                                 6
  arcs:                                  0
    undirected arcs:                     0
    directed arcs:                       0
  average markov blanket size:           0.00
  average neighbourhood size:            0.00
  average branching factor:              0.00

  generation algorithm:                  Empty
Now we can start adding the arcs that encode the direct dependencies
between the variables in the survey. As we said in the previous section, Age
and Sex are not influenced by any of the other variables. Therefore, there are
no arcs pointing to either variable. On the other hand, both Age and Sex
have a direct influence on Education. It is well known, for instance, that the
number of people attending universities has increased over the years. As a
consequence, younger people are more likely to have a university degree than
older people.
> dag <- set.arc(dag, from = "A", to = "E")
Similarly, Sex also influences Education; the gender gap in university applications has been widening for many years, with women outnumbering and
outperforming men.
> dag <- set.arc(dag, from = "S", to = "E")
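To check that the two arcs were recorded as intended, the graph can be queried for its arc set and for the parents and children of individual nodes. The following is a minimal sketch that rebuilds the DAG from scratch so that it runs on its own; it assumes the bnlearn package is installed, and the variable names e.parents, a.children and arc.set are hypothetical names chosen for illustration.

```r
library(bnlearn)

# Rebuild the DAG constructed so far: six nodes, then A -> E and S -> E.
dag <- empty.graph(nodes = c("A", "S", "E", "O", "R", "T"))
dag <- set.arc(dag, from = "A", to = "E")
dag <- set.arc(dag, from = "S", to = "E")

# Two-column character matrix with one row per arc (columns "from" and "to").
arc.set <- arcs(dag)
arc.set

# Nodes with an arc pointing into E: these should be A and S.
e.parents <- parents(dag, "E")
e.parents

# Nodes that A points to: at this stage only E.
a.children <- children(dag, "A")
a.children

# Age has no incoming arcs, in line with its role as a demographic indicator.
parents(dag, "A")
```

Running such checks after each call to set.arc is a cheap way to catch an arc entered in the wrong direction, since swapping from and to would silently make E a parent of A.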