PDF download of the paper on SageDB, the new machine-learning-based database introduced by Google and others
Main content:
1. INTRODUCTION
Database systems have a long history of automatically selecting efficient algorithms, e.g., a merge vs. hash join, based on data statistics. Yet existing databases remain general-purpose systems and are not engineered on a case-by-case basis for the specific workload and data characteristics of a user, because doing so manually would be hugely time-consuming. Nevertheless, specialized solutions can be much more efficient. For example, if the goal is to build a highly tuned system to store and query ranges of fixed-length records with continuous integer keys (e.g., the keys 1 to 100M), one should not use a conventional index. Using B+Trees for such range queries would not make much sense, since the key itself can be used as an offset, making it a constant O(1) operation to look up the first key of a range.¹ Indeed, a simple C program that loads 100M integers into an array and performs a summation over a range runs in about 300ms on a modern desktop, but doing the same operations in a modern database (Postgres 9.6) takes about 150 seconds. This represents a 500× overhead for a general-purpose design that is not aware of the specific data distribution.
¹Note that we use the big-O notation here over the particular instance of a database, similar to the notion of instance-optimality [10], except that our class of databases is exactly one.
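As a rough illustration of this point, the following C sketch (not the benchmark program from the paper; the array layout, sizes, and names are assumed for illustration) shows how, with continuous keys, locating the start of a range is a single O(1) offset computation followed by a sequential scan:

#include <stdint.h>
#include <stdlib.h>

#define N 100000000UL  /* 100M records with continuous keys 1..N */

/* Records are stored at offset key-1, so no index is needed:
 * locating the first key of a range is a constant-time O(1)
 * address computation instead of a B+Tree traversal. */
static int64_t *values;  /* values[k-1] holds the record for key k */

int64_t range_sum(uint64_t lo, uint64_t hi) {
    int64_t sum = 0;
    for (uint64_t k = lo; k <= hi; k++)   /* sequential scan of the range */
        sum += values[k - 1];             /* key used directly as offset */
    return sum;
}

int main(void) {
    values = malloc(N * sizeof *values);
    if (!values) return 1;
    for (uint64_t k = 1; k <= N; k++)     /* load the 100M integers */
        values[k - 1] = (int64_t)k;
    return (int)(range_sum(1000, 2000) % 256);
}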
Similar benefits extend to operations like sorting or joins. For sorting, if we know the keys come from a dense integer domain, we can simplify the sorting of incoming records based on the primary key, as the key can again be used as an offset to put the data directly into a sorted array, reducing the complexity of sorting from O(N log N) to O(N) for this particular instance of the data. Joins over dense integer keys also simplify to lookups, requiring only a direct lookup using the key as an offset. Inserts and concurrency control might also become easier, as we do not need to keep index structures in sync.
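A minimal sketch of both ideas, again assuming dense integer keys and an illustrative record layout (the function and field names are ours, not SageDB's):

#include <stdint.h>
#include <stddef.h>

typedef struct { uint64_t key; int64_t payload; } record;

/* Sorting: with dense keys min_key..min_key+n-1, every record's position
 * is known from its key, so one O(N) pass places the input directly into
 * sorted order -- no comparisons needed. */
void sort_dense(const record *in, record *out, size_t n, uint64_t min_key) {
    for (size_t i = 0; i < n; i++)
        out[in[i].key - min_key] = in[i];   /* key - min_key is the offset */
}

/* Join: probing the inner table for a key becomes a single array access
 * rather than a hash-table probe or index traversal. */
const record *join_probe(const record *sorted, size_t n,
                         uint64_t min_key, uint64_t key) {
    if (key < min_key || key >= min_key + n) return NULL;
    return &sorted[key - min_key];
}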
Perhaps surprisingly, the same optimizations are also possible for other data patterns. For example, if the data contains only even numbers or follows any other deterministic distribution, we can apply similar optimizations. In short, with knowledge of the exact data distribution we can optimize almost any algorithm or data structure used by the database. These optimizations can sometimes even change the complexity class of well-known data processing algorithms. Of course, in most real-world use cases the data does not perfectly follow a known pattern, and it is usually not worthwhile to engineer a specialized system for every use case. However, if it were possible to learn the data pattern, correlations, etc. of the data, we argue that we can automatically synthesize index structures, sorting and join algorithms, and even entire query optimizers that leverage these patterns for performance gains. Ideally, even with imperfect knowledge of the patterns it will be possible to change the complexity class of algorithms, as in the above example.
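Returning to the even-number example above, a tiny sketch (the helper and its parameters are illustrative): the "model" of the distribution is just a closed-form expression, so lookups stay O(1) without an index.

#include <stdint.h>

/* If the stored keys are exactly the even numbers min, min+2, min+4, ...,
 * the position of a key is a closed-form expression of the key, so the
 * distribution itself acts as the index. */
static inline uint64_t position_of(uint64_t key, uint64_t min_even_key) {
    return (key - min_even_key) / 2;   /* exact model of the distribution */
}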
In this paper we present our vision of SageDB, a new class of data management system that specializes itself to exploit the distributions of the data it stores and the queries it serves. Prior work has explored the use of learning to tune knobs [37, 34, 7, 9, 35], choose indexes [12, 4, 36], select partitioning schemes [6, 2, 27, 28], or build materialized views (see [5] for a general overview), but our goal is different. We argue that learned components can fully replace core components of a database system, such as index structures, sorting algorithms, or even the query executor. This may seem counter-intuitive because machine learning cannot provide the semantic guarantees we traditionally associate with these components, and because machine learning models are traditionally thought of as being very expensive to evaluate. This vision paper argues that neither of these apparent obstacles is as problematic as it might seem.
In terms of performance, we observe that more and more devices are equipped with specialized hardware for machine learning. For example, the iPhone has the “Neural Engine”, Google’s phones have a “Visual Core,” Google’s Cloud has Cloud TPUs, and Microsoft developed BrainWave. As it is easier to scale the simple (math) operations required for machine learning, these devices already deliver stunning performance. For instance, Nvidia’s TESLA V100 is capable of performing 120 TeraFlops of neural net operations. It has also been stated that GPUs will increase 1000× in performance by 2025, whereas Moore’s law for CPUs is essentially dead [1]. By replacing branch-heavy algorithms with neural networks, the DBMS can profit from these hardware trends.
Similarly, it is often surprisingly easy to provide the same semantic guarantees. For example, a B-Tree can be seen as a model that points to a page containing records with a particular key, requiring a scan of the page to return all records that actually satisfy a query. In this sense, a B-Tree already trades off execution performance for accuracy [19]. Learning-based models will simply have more flexibility to explore that trade-off.
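A sketch of how a learned index can preserve that guarantee, assuming a simple linear approximation of the key distribution (the CDF) whose maximum prediction error is measured when the model is built; the model form, names, and linear-scan correction are illustrative, and [19] describes the actual designs:

#include <stdint.h>
#include <stddef.h>

typedef struct {
    double slope, intercept;  /* linear model approximating key -> position */
    size_t max_err;           /* worst-case prediction error, measured at build time */
} learned_index;

/* Predict a position, then correct it with a bounded local search. If the key
 * exists in the sorted array, it must lie in [pred - max_err, pred + max_err],
 * so the lookup answers exactly what a B-Tree lookup would. */
ptrdiff_t lookup(const learned_index *idx, const uint64_t *keys, size_t n,
                 uint64_t key) {
    if (n == 0) return -1;
    double guess = idx->slope * (double)key + idx->intercept;
    size_t pred = guess <= 0.0 ? 0 : (size_t)guess;
    if (pred >= n) pred = n - 1;

    size_t lo = pred > idx->max_err ? pred - idx->max_err : 0;
    size_t hi = pred + idx->max_err < n ? pred + idx->max_err : n - 1;
    for (size_t i = lo; i <= hi; i++)      /* small scan; binary search also works */
        if (keys[i] == key) return (ptrdiff_t)i;
    return -1;                             /* key not present */
}

Because the error bound is known, the model answers the same lookups a B-Tree would; it merely explores a different point on the accuracy/performance trade-off.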
Finally, aggressive use of synthesis and code generation
will allow us to automatically create efficient data structures, which combine models with more traditional algorithms. Here our goal is to (1) generate code tailored to the
user’s data distribution, and (2) synthesize database components with embedded machine learning models, which balance the trade-off between memory, latency, and compute
for a given use case while providing the same semantic guarantees as the traditional components.
Building SageDB requires a radical departure from the
way database systems are currently developed and involves
challenges from databases, machine learning and programming systems. SageDB is currently being developed as part
of MIT’s new Data Systems for AI Lab (DSAIL), which consists of an interdisciplinary team with expertise in all of these
areas. Our focus is on building a new analytical (OLAP) engine, but we believe the approach also has significant advantages for OLTP or hybrid workloads. The remainder of the paper outlines the overall architecture of SageDB and its individual challenges, and presents some promising initial results.
2. MODEL-DRIVEN APPROACH
The core idea behind SageDB is to build one or more
models about the data and workload distribution and based
on them automatically build the best data structures and
algorithms for all components of the database system. This approach, which we call “database synthesis”, will allow
us to achieve unprecedented performance by specializing the
implementation of every database component to the specific
database, query workload, and execution environment.
2.1 Types of Customization
The proposed customization goes far beyond the current use of statistics and models about the data, hardware, or performance of algorithms; the different forms of customization can be roughly classified into the following levels:
Customization through Configuration: The most basic form of customization is configuring the system, a.k.a. knob tuning. Most systems and heuristics have a huge number of settings (e.g., page size, buffer-pool size, etc.). Traditionally, database administrators tune those knobs to configure the general-purpose database for a particular use case. In that sense, the creation of indexes, finding the right partitioning scheme, or the creation of materialized views for performance can also be considered as finding the best configuration of the system. It therefore comes as no surprise that there has been a lot of work on automatically tuning those configurations [37, 34, 7, 9, 35] based on the workload and data characteristics.
Customization through Algorithm Picking: While configuring the system is largely static, databases have a long history of using query optimizers to dynamically “customize” the execution strategy for a particular query based on statistics about the data and the configuration (e.g., available indexes) of the system. That is, the query optimizer decides on the best execution order (e.g., predicate push-downs, join ordering, etc.) and picks the best implementation from a set of available algorithms (e.g., nested-loop join vs. hash join). This declarative approach, which allows the user to specify the query at a high level while the system figures out how best to achieve it, is one of the most significant contributions of the database community.
Customization through Self-Design: Self-designed systems rely on the notion of mapping the possible space of critical design decisions in a system and automatically generating a design that best fits the workload and hardware [15].
Here the space of possible designs is defined by all combinations and tunings of first-principle components, such as fence pointers, links, temporal partitioning, etc., which together
form a “periodic table of data structures” [14]. This goes far
beyond algorithm picking or configuring a system because
new combinations of these primitives might yield previously
unknown algorithms/data structures and can lead to significant performance gains [15].
Customization through Learning: In contrast to self-design, learned systems replace core data system components with learned models. For example, in [19] we show how indexes can be replaced by models, whereas [21] shows how to learn workload-specific scheduling strategies. Models make it possible to capture data and workload properties that traditional data structures and algorithms have a hard time supporting well. As a result, under certain conditions these data structures can provide the best-case complexity, e.g., O(N) instead of O(N log N), and yield even higher performance gains than customization through self-design. Furthermore, they change the type of computation from traditional control-flow-heavy computation to data-dependency-focused computation, which can often be executed more efficiently on CPUs and the upcoming ML accelerators.
These different types of customization can be composed. In particular, customization through self-design and customization through learning go hand in hand, as the learned models often have to be combined with more traditional algorithms and data structures in order to provide the same semantic guarantees. More interestingly, models can potentially be shared among different components of a database system. In that sense, we argue in this paper that customization through learning is the most powerful form of customization, and we outline how SageDB deeply embeds models into all algorithms and data structures, making the models the brain of the database (see Figure 2).