PDF download of the paper on SageDB, the new machine-learning-based database introduced by Google and others
Main content:
1. INTRODUCTION
Database systems have a long history of automatically selecting efficient algorithms, e.g., a merge vs. hash join, based on data statistics. Yet existing databases remain general-purpose systems and are not engineered on a case-by-case basis for the specific workload and data characteristics of a user, because doing so manually would be hugely time-consuming. Nevertheless, specialized solutions can be much more efficient. For example, if the goal is to build a highly tuned system to store and query ranges of fixed-length records with continuous integer keys (e.g., the keys 1 to 100M), one should not use a conventional index. Using B+Trees for such range queries would not make much sense, since the key itself can be used as an offset, making it a constant O(1) operation to look up the first key of a range.¹ Indeed, a simple C program that loads 100M integers into an array and performs a summation over a range runs in about 300ms on a modern desktop, but doing the same operations in a modern database (Postgres 9.6) takes about 150 seconds. This represents a 500× overhead for a general-purpose design that is not aware of the specific data distribution.
¹Note that we use the big-O notation here over the particular instance of a database, similar to the notion of instance-optimality [10], except that our class of databases is exactly one.
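As a rough illustration of this point, the following C sketch (not the benchmark program from the paper; the array layout, sizes, and names are assumed for illustration) shows how, with continuous keys, locating the start of a range is a single O(1) offset computation followed by a sequential scan:

#include <stdint.h>
#include <stdlib.h>

#define N 100000000UL  /* 100M records with continuous keys 1..N */

/* Records are stored at offset key-1, so no index is needed:
 * locating the first key of a range is a constant-time O(1)
 * address computation instead of a B+Tree traversal. */
static int64_t *values;  /* values[k-1] holds the record for key k */

int64_t range_sum(uint64_t lo, uint64_t hi) {
    int64_t sum = 0;
    for (uint64_t k = lo; k <= hi; k++)   /* sequential scan of the range */
        sum += values[k - 1];             /* key used directly as offset */
    return sum;
}

int main(void) {
    values = malloc(N * sizeof *values);
    if (!values) return 1;
    for (uint64_t k = 1; k <= N; k++)     /* load the 100M integers */
        values[k - 1] = (int64_t)k;
    return (int)(range_sum(1000, 2000) % 256);
}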
Similar benefits extend to operations like sorting or joins. For sorting, if we know the keys come from a dense integer domain, we can simplify the sorting of incoming records based on the primary key, as the key can again be used as an offset to put the data directly into a sorted array, reducing the complexity of sorting from O(N log N) to O(N) for this particular instance of the data. Joins over dense integer keys also simplify to lookups, requiring only a direct lookup using the key as an offset. Inserts and concurrency control might also become easier, as we do not need to keep index structures in sync.
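A minimal sketch of both ideas, again assuming dense integer keys and an illustrative record layout (the function and field names are ours, not SageDB's):

#include <stdint.h>
#include <stddef.h>

typedef struct { uint64_t key; int64_t payload; } record;

/* Sorting: with dense keys min_key..min_key+n-1, every record's position
 * is known from its key, so one O(N) pass places the input directly into
 * sorted order -- no comparisons needed. */
void sort_dense(const record *in, record *out, size_t n, uint64_t min_key) {
    for (size_t i = 0; i < n; i++)
        out[in[i].key - min_key] = in[i];   /* key - min_key is the offset */
}

/* Join: probing the inner table for a key becomes a single array access
 * rather than a hash-table probe or index traversal. */
const record *join_probe(const record *sorted, size_t n,
                         uint64_t min_key, uint64_t key) {
    if (key < min_key || key >= min_key + n) return NULL;
    return &sorted[key - min_key];
}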
Perhaps surprisingly, the same optimizations are also possible for other data patterns. For example, if the data contains only even numbers or follows any other deterministic distribution, we can apply similar optimizations. In short, with knowledge of the exact data distribution we can optimize almost any algorithm or data structure used by the database. These optimizations can sometimes even change the complexity class of well-known data processing algorithms. Of course, in most real-world use cases the data does not perfectly follow a known pattern, and it is usually not worthwhile to engineer a specialized system for every use case. However, if it were possible to learn the data pattern, correlations, etc. of the data, we argue that we can automatically synthesize index structures, sorting and join algorithms, and even entire query optimizers that leverage these patterns for performance gains. Ideally, even with imperfect knowledge of the patterns it will be possible to change the complexity class of algorithms, as in the above example.
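Returning to the even-number example above, a tiny sketch (the helper and its parameters are illustrative): the "model" of the distribution is just a closed-form expression, so lookups stay O(1) without an index.

#include <stdint.h>

/* If the stored keys are exactly the even numbers min, min+2, min+4, ...,
 * the position of a key is a closed-form expression of the key, so the
 * distribution itself acts as the index. */
static inline uint64_t position_of(uint64_t key, uint64_t min_even_key) {
    return (key - min_even_key) / 2;   /* exact model of the distribution */
}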
In this paper we present our vision of SageDB, a new class of data management system that specializes itself to exploit the distributions of the data it stores and the queries it serves. Prior work has explored the use of learning to tune knobs [37, 34, 7, 9, 35], choose indexes [12, 4, 36], select partitioning schemes [6, 2, 27, 28], or build materialized views (see [5] for a general overview), but our goal is different. We argue that learned components can fully replace core components of a database system, such as index structures, sorting algorithms, or even the query executor. This may seem counter-intuitive because machine learning cannot provide the semantic guarantees we traditionally associate with these components, and because machine learning models are traditionally thought of as being very expensive to evaluate. This vision paper argues that neither of these apparent obstacles is as problematic as it might seem.
In terms of performance, we observe that more and more devices are equipped with specialized hardware for machine learning. For example, the iPhone has the “Neural Engine”, Google’s phones have a “Visual Core,” Google’s Cloud has Cloud TPUs, and Microsoft developed BrainWave. As it is easier to scale the simple (math) operations required for machine learning, these devices already deliver stunning performance. For instance, Nvidia’s TESLA V100 is capable of performing 120 TeraFlops of neural net operations. It has also been stated that GPUs will increase 1000× in performance by 2025, whereas Moore’s law for CPUs is essentially dead [1]. By replacing branch-heavy algorithms with neural networks, the DBMS can profit from these hardware trends.
Similarly, it is often surprisingly easy to provide the same semantic guarantees. For example, a B-Tree can be seen as a model that points to a page containing records with a particular key, requiring a scan of the page to return all records that actually satisfy a query. In this sense, a B-Tree already trades off execution performance for accuracy [19]. Learning-based models will simply have more flexibility to explore that trade-off.
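A sketch of how a learned index can preserve that guarantee, assuming a simple linear approximation of the key distribution (the CDF) whose maximum prediction error is measured when the model is built; the model form, names, and linear-scan correction are illustrative, and [19] describes the actual designs:

#include <stdint.h>
#include <stddef.h>

typedef struct {
    double slope, intercept;  /* linear model approximating key -> position */
    size_t max_err;           /* worst-case prediction error, measured at build time */
} learned_index;

/* Predict a position, then correct it with a bounded local search. If the key
 * exists in the sorted array, it must lie in [pred - max_err, pred + max_err],
 * so the lookup answers exactly what a B-Tree lookup would. */
ptrdiff_t lookup(const learned_index *idx, const uint64_t *keys, size_t n,
                 uint64_t key) {
    if (n == 0) return -1;
    double guess = idx->slope * (double)key + idx->intercept;
    size_t pred = guess <= 0.0 ? 0 : (size_t)guess;
    if (pred >= n) pred = n - 1;

    size_t lo = pred > idx->max_err ? pred - idx->max_err : 0;
    size_t hi = pred + idx->max_err < n ? pred + idx->max_err : n - 1;
    for (size_t i = lo; i <= hi; i++)      /* small scan; binary search also works */
        if (keys[i] == key) return (ptrdiff_t)i;
    return -1;                             /* key not present */
}

Because the error bound is known, the model answers the same lookups a B-Tree would; it merely explores a different point on the accuracy/performance trade-off.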
Finally, aggressive use of synthesis and code generation
will allow us to automatically create efficient data structures, which combine models with more traditional algorithms. Here our goal is to (1) generate code tailored to the
user’s data distribution, and (2) synthesize database components with embedded machine learning models, which balance the trade-off between memory, latency, and compute
for a given use case while providing the same semantic guarantees as the traditional components.
Building SageDB requires a radical departure from the
way database systems are currently developed and involves
challenges from databases, machine learning and programming systems. SageDB is currently being developed as part
of MIT’s new Data Systems for AI Lab (DSAIL), which consists of an interdisciplinary team with expertise in all of these
areas. Our focus is on building a new analytical (OLAP) engine, but we believe the approach also has significant advantages for OLTP or hybrid workloads. The remainder of the paper outlines the overall architecture of SageDB and its individual challenges, and presents some promising initial results.
2. MODEL-DRIVEN APPROACH
The core idea behind SageDB is to build one or more
models about the data and workload distribution and based
on them automatically build the best data structures and
algorithms for all components of the database system. This approach, which we call “database synthesis”, will allow
us to achieve unprecedented performance by specializing the
implementation of every database component to the specific
database, query workload, and execution environment.
2.1 Types of Customization
The proposed customization goes far beyond the current use of statistics and models about the data, hardware, or performance of algorithms; the different forms of customization can be roughly classified into the following levels:
Customization through Configuration: The most basic form of customization is configuring the system, a.k.a. knob tuning. Most systems and heuristics have a huge number of settings (e.g., page size, buffer-pool size, etc.). Traditionally, database administrators tune those knobs to configure the general-purpose database for a particular use case. In that sense, the creation of indexes, finding the right partitioning scheme, or the creation of materialized views for performance can also be considered as finding the best configuration of the system. It therefore comes as no surprise that there has been a lot of work on automatically tuning those configurations [37, 34, 7, 9, 35] based on the workload and data characteristics.
Customization through Algorithm Picking: While configuring the system is largely static, databases have a long history of using query optimizers to dynamically “customize” the execution strategy for a particular query based on statistics about the data and the configuration (e.g., available indexes) of the system. That is, the query optimizer decides on the best execution order (e.g., predicate push-downs, join ordering, etc.) and picks the best implementation from a set of available algorithms (e.g., nested-loop join vs. hash join). This declarative approach, which allows the user to specify the query at a high level while the system figures out how best to achieve it, is one of the most significant contributions of the database community.
Customization through Self-Design: Self-designed systems rely on the notion of mapping the possible space of critical design decisions in a system and automatically generating a design that best fits the workload and hardware [15].
Here the space of possible designs is defined by all combinations and tunings of first-principle components, such as fence pointers, links, temporal partitioning, etc., which together
form a “periodic table of data structures” [14]. This goes far
beyond algorithm picking or configuring a system because
new combinations of these primitives might yield previously
unknown algorithms/data structures and can lead to significant performance gains [15].
Customization through Learning: In contrast to self-design, learned systems replace core data system components with learned models. For example, in [19] we show how indexes can be replaced by models, whereas [21] shows how to learn workload-specific scheduling strategies. Models make it possible to capture data and workload properties that traditional data structures and algorithms have a hard time supporting well. As a result, under certain conditions these data structures can provide the best-case complexity, e.g., O(N) instead of O(N log N), and yield even higher performance gains than customization through self-design. Furthermore, they change the type of computation from traditional control-flow-heavy computation to data-dependency-focused computation, which can often be executed more efficiently on CPUs and the upcoming ML accelerators.
These different types of customization can be composed. In particular, customization through self-design and customization through learning go hand in hand, as the learned models often have to be combined with more traditional algorithms and data structures in order to provide the same semantic guarantees. More interestingly, models can potentially be shared among different components of a database system. In that sense, we argue in this paper that customization through learning is the most powerful form of customization, and we outline how SageDB deeply embeds models into all algorithms and data structures, making the models the brain of the database (see Figure 2).