失效链接处理 |
Graph Databases PDF 下载
本站整理下载:
提取码:fkr1
相关截图:
主要内容:
Graph Databases
Adrian Silvescu, Doina Caragea, Anna Atramentov
Artificial Intelligence Research Laboratory
Department of Computer Science
Iowa State University
Ames, Iowa 50011
Abstract
Gathering huge amounts of complex information (data and knowledge) is very common nowadays. This calls
for the necessity to represent, store and manipulate complex information (e.g. detect correlations and patterns,
discover explanations, construct predictive models etc.). Furthermore, being autonomously maintained, data can
change in time or even change its base structure, making it difficult for representation systems to accommodate
these changes.
Current representation and storage systems are not very flexible in dealing with big changes and also they are
not concerned with the ability of performing complex data manipulations of the sort mentioned above. On the
other hand, data manipulation systems cannot easily work with structural or relational data, but just with flat data
representations. We want to bridge the gap between the two, by introducing a new type of database structure,
called Graph Databases (GDB), based on a natural graph representation. Our Graph Databases are able to
represent as graphs any kind of information, naturally accommodate changes in data, and they also make easier for
Machine Learning methods to use the stored information.
We are mainly concerned with the design and implementation of GDB in this paper. Thus, we define the Data
Definition Language (DDL) that contains extensional definitions as well as intentional definitions, and Data
Manipulation Language (DML) that we use to pose queries. Then we show conceptually how to do updates, how
to accommodate changes, and how to define new concepts or ask higher order queries that are not easily answer
by SQL language. We also show how to transform a relational database into a GDB.
Although, we show how all these things can be done using our graph databases, we do not implement all of
them. This is a very laborious project that goes much beyond our class project goals. We do implement the graph
databases, show how we can ask queries on them, and also how to transform relational databases into graph
databases in order to be able to reuse relational data already existent.
1. Introduction
Huge amounts of information (data and knowledge) become increasingly common in nowadays
sciences. The sheer volume and diversity make it increasingly difficult to the unaided contemporary
scientist to be able to exploit this information at its full potential. This observation calls for automatic and
semi-automatic (human-driven, but not in every single detail) methods of acquiring, integrating,
representing and manipulating complex data (detecting correlations and patterns, discovering hidden
variables, explanations, a-priori unknown regularities, constructing predictive models etc.).
Usually data resides in multiple locations, is autonomously maintained, and may change in time, thus
posing even more challenges. Being autonomous, it is possible that the data sources don't have a unifying
schema or we cannot control their schemas. Furthermore, these schemas can be dynamically modified, and
it may be difficult to accommodate changes. In this setting, several major issues arise: how to structurally
match schemas of different data sources, how to obtain a global schema from multiple schemas, and how to
change the global schema according to possible changes in data sources schemas.
Sometimes the attempts to solve these problems depart from the traditional approach to databases,
namely RDBMS, leading into the realm of Object-Oriented and XML technologies. The new approaches
can deal with increasingly complex data, namely objects in Object-Oriented Databases and trees in XML.
However, neither the objects nor the trees can naturally represent graphs, which are the most general data
structure. Subsequently, they do not support natural queries on graphs. Moreover, when changes in schema
of the data sources occur, these approaches may lead to major restructuring (redesigning) of global schema.
This may imply very sophisticated and expensive operations, such as renormalization, reindexing etc.,
which may not be performed automatically. Thus, the necessity to represent, store and manipulate complex
data makes RDBMS somewhat obsolete.
We identify three major problems that can arise in the kind of dynamic, distributed, autonomous
environments described above. First, violations of the 1NF can appear when dealing with multi-valued
attributes, complex attributes, or combination of multi-valued and complex attributes. We denote this
problem by [P1] in what follows. Second, when acquiring data from autonomous dynamic sources (e.g.
gene expression data) or from Web (e.g. data about restaurants), a good data representation should be able
to accommodate changes not only in data, but also in the data schema. We call this problem [P2]. It should
be noted that RDBMS might require schema renormalization in such cases, which is not always easy to do.
The third problem, which we denote by [P3], refers to the ability of a database to represent data in a
unified way. Thus, it would be very good if data, knowledge (e.g. schemas), queries (or more generally,
goals) and models (e.g. concepts) would have a unified representation. This would give us a better
understanding of a data problem.
We should note that the three problems mentioned above are very relevant to the Scientific Discovery
(SD) process using Artificial Intelligence (AI), and especially Machine Learning tools. This is because,
much of the data available for SD is structural in nature, needs be often updated, can change easily in
structure etc. Although most of the traditional machines learning algorithms work with data found in flat
data representations, currently there is a considerable tendency to design intelligent learning algorithms that
can deal with structural [Cook and Holder, 2000] or relational data [Getoor et. al., 2001]. However, these
approaches, one based on graphs structures, the other on relational databases, do not take into consideration
the problems [P1], [P2], [P3] described above. They work with data that does not change in time, and they
are not dealing with any updates. Also, they do not address the unified representation problem. The
problem they solve is to extract useful knowledge from data that is already represented as graphs or as
relations. In [Getoor et. al., 2001], data and query do not have a unified representation at all. Although, in
[Cook and Holder, 2000] the authors regard both data and query as graphs, they are not able to put together
data and knowledge. Furthermore, they are not interested in these structures from a databases perspective at
all, but just from a learning perspective. However, we want to stress on the fact that more and more
graphical representations are used in ML lately [Jordan, 1998; Pearl, 2000; Jensen, 2001]. Although quite
distant from what we are doing here, this suggests the importance of being able to deal with the natural
graph representation of information.
We aim at designing database structures that can take into account problems [P1], [P2], [P3], facilitate
the application of learning and reasoning methods on these structures, and also be able to reuse data that is
already stored in relational databases. We call our newly designed structures graph databases, and we think
about them as being at the core of information storage, information integration, and information
exploration. The big picture showing how all these components come into place into the DS process is
shown in Figure 1.
We are not interested in the integration process in this paper. We can assume that the available
information is already integrated [Reinoso et al., 2001]. The problem that we are trying to solve is how to
represent this integrated available information in a way that makes it easier to accommodate changes,
updates, and also provides a unified representaion for data and knowledge, so that the learning process is
facilitated.
K N O W L E D G E
S O U R C E S
L IN K
S O U R C E S
D A T A
S O U R C E S
A V A IL A B L E
IN F O R M A T IO N
IN T H E W O R L D
... ... ...
In fo rm a tio n
In te g ra ti o n
D is c o ve r y & M o d e l li n g Q u e r y = G o a l S p e c i fic a ti o n
U S E R
In fo rm a ti o n
E xp lo ita ti o n In fo rm a tio n
IN F O R M A T IO N IN T E G R A T IO N
Figure 1: Big Picture of the Discovery Science Process
2. Related Work
We mentioned above that RDBMS systems are somewhat obsolete regarding the ability to represent,
store and manipulate complex information that can change in time. Thus, they might require very
sophisticated and expensive operations, such as renormalization, reindexing etc., which may not be
performed automatically. Schema renormalization in such cases is nor desirable neither easy to do.
Therefore, the attempts to solve these problems depart from the traditional approach to databases,
namely RDBMS, leading into the realm of Object-Oriented and XML technologies. The new approaches
can deal with increasingly complex data, namely objects in Object-Oriented Databases and trees in XML.
However, neither the objects nor the trees can naturally represent graphs, which are the most general data
structure. Subsequently, they do not support natural queries on graphs. Moreover, when changes in schema
of the data sources occur, these approaches may lead to major restructuring (redesigning) of global schema.
We briefly mention some of the existing older and newer databases systems, showing how they
address the problems [P1], [P2] and [P3] that we are interested in. We do not go into too many details,
because, as we mentioned, we are not only interested into database system appropriate for representing,
storing and querying data, but also in systems that facilitate the information extraction process. On the
other hand, we do not talk more about graphical learning methods [Jordan, 1998; Pearl, 2000; Jensen, 2001;
Cook and Holder, 2000; Getoor et. al., 2001] because they assume fixed representations, not appropriate for changes,
being concerned only with the learning part. They will be relevant to a discussion about how to use GDB for learning,
which is beyond the purpose of this current paper, but very probable in future work.
To be more precise, we are not aware of any system that tries to bridge the gap between representation and
learning.
|