
Google and Others Propose SageDB, a New Machine-Learning-Based Database: Related Paper PDF Download


Date: 2020-11-24 09:10    Source: http://www.java1234.com    Author: reprinted


Download (collected by this site):
Extraction code: sc1i
 
 
 
Main content:

1. INTRODUCTION
Database systems have a long history of automatically selecting efficient algorithms, e.g., a merge vs. hash-join, based on data statistics. Yet, existing databases remain general-purpose systems and are not engineered on a case-by-case basis for the specific workload and data characteristics of a user, because doing so manually would be hugely time consuming. At the same time, specialized solutions can be much more efficient. For example, if the goal is to build a highly-tuned system to store and query ranges of fixed-length records with continuous integer keys (e.g., the keys 1 to 100M), one should not use a conventional index. Using B+Trees for such range queries would not make much sense, since the key itself can be used as an offset, making it a constant O(1) operation to look up the first key of a range.[1] Indeed, a simple C program that loads 100M integers into an array and performs a summation over a range runs in about 300ms on a modern desktop, but doing the same operations in a modern database (Postgres 9.6) takes about 150 seconds. This represents a 500x overhead for a general-purpose design that isn't aware of the specific data distribution.
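To make the offset trick concrete, the following short C sketch (illustrative only, not the benchmark program the authors ran) assumes fixed-length records whose keys are exactly the contiguous integers 1..n; the key then doubles as the array offset, so locating the start of a range is O(1) and a range summation is a single sequential scan.

#include <stdint.h>

/* Illustrative sketch: records with contiguous integer keys 1..n are
 * stored at index key-1, so no index structure is needed at all. */
typedef struct {
    int64_t key;    /* contiguous keys 1..n */
    int64_t value;  /* fixed-length payload */
} record;

/* Locating the first record of a range is a constant-time offset
 * computation instead of a B+Tree descent. */
static inline const record *lookup(const record *table, int64_t key) {
    return &table[key - 1];          /* O(1): the key doubles as the offset */
}

/* Summing the values in [lo, hi] degenerates to one sequential scan. */
int64_t range_sum(const record *table, int64_t lo, int64_t hi) {
    int64_t sum = 0;
    for (const record *r = lookup(table, lo); r <= lookup(table, hi); r++)
        sum += r->value;
    return sum;
}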
Similar benefits extend to operations like sorting or joins. For sorting, if we know keys come from a dense integer domain, we can simplify the sorting of incoming records based on the primary key, as the key can again be used as an offset to put the data directly into a sorted array, reducing the complexity of sorting from O(N log N) to O(N) for this particular instance of the data. Joins over dense integer keys also simplify to lookups, requiring only a direct lookup using the key as an offset. Inserts and concurrency control might also become easier, as we do not need to keep index structures in sync.

[1] Note that we use the big-O notation here over the particular instance of a database, similar to the notion of instance-optimality [10], except that our class of databases is exactly one.
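The O(N) sort mentioned above can be sketched in the same style. This is a hedged illustration assuming distinct keys that are exactly the dense domain 1..n (the precondition the paragraph describes); under that assumption the sort degenerates to writing each record into the slot its key dictates.

#include <stdint.h>

typedef struct {
    int64_t key;    /* assumed distinct and dense: exactly the values 1..n */
    int64_t value;  /* fixed-length payload */
} record;

/* One pass over the input, O(n): each record lands directly in its
 * sorted position, instead of the O(n log n) of a comparison sort. */
void sort_dense_keys(const record *in, record *out, int64_t n) {
    for (int64_t i = 0; i < n; i++)
        out[in[i].key - 1] = in[i];
}

/* A join on the same dense key likewise collapses to a direct lookup:
 * the matching partner for key k simply lives at out[k - 1]. */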
Perhaps surprisingly, the same optimizations are also possible for other data patterns. For example, if the data contains only even numbers or follows any other deterministic distribution, we can apply similar optimizations. In short, we can optimize almost any algorithm or data structure used by the database with knowledge of the exact data distribution. These optimizations can sometimes even change the complexity class of well-known data processing algorithms.

Of course, in most real-world use cases the data does not perfectly follow a known pattern, and it is usually not worthwhile to engineer a specialized system for every use case. However, if it were possible to learn the data pattern, correlations, etc. of the data, we argue that we can automatically synthesize index structures, sorting and join algorithms, and even entire query optimizers that leverage these patterns for performance gains. Ideally, even with imperfect knowledge of the patterns it will be possible to change the complexity class of algorithms, as in the above example.
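As a hedged illustration of the even-numbers case mentioned above (the function and setup are invented for this sketch, not taken from the paper): once the key distribution is known exactly, the "index" collapses to a closed-form expression, here the inverse of the mapping that generates the even keys 2, 4, ..., 2n.

#include <stdint.h>

/* Keys are assumed to be exactly the even numbers 2, 4, ..., 2n, stored
 * in key order. A key's position is then a pure function of the key
 * itself, so no search structure is needed at all. */
static inline int64_t position_of_even_key(int64_t key) {
    return key / 2 - 1;   /* key 2 -> slot 0, key 4 -> slot 1, ... */
}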
In this paper we present our vision of SageDB, a new class of data management system that specializes itself to exploit the distributions of the data it stores and the queries it serves. Prior work has explored the use of learning to tune knobs [37, 34, 7, 9, 35], or to choose indexes [12, 4, 36], partitioning schemes [6, 2, 27, 28], or materialized views (see [5] for a general overview), but our goal is different. We argue that learned components can fully replace core components of a database system, such as index structures, sorting algorithms, or even the query executor. This may seem counter-intuitive because machine learning cannot provide the semantic guarantees we traditionally associate with these components, and because machine learning models are traditionally thought of as being very expensive to evaluate. This vision paper argues that neither of these apparent obstacles is as problematic as it might seem.
In terms of performance, we observe that more and more devices are equipped with specialized hardware for machine learning. For example, the iPhone has the "Neural Engine", Google's phones have a "Visual Core," Google's Cloud has Cloud TPUs, and Microsoft developed BrainWave. As it is easier to scale the simple (math) operations required for machine learning, these devices already deliver stunning performance. For instance, Nvidia's TESLA V100 is capable of performing 120 TeraFlops of neural net operations. It has also been stated that GPUs will increase 1000× in performance by 2025, whereas Moore's law for CPUs is essentially dead [1]. By replacing branch-heavy algorithms with neural networks, the DBMS can profit from these hardware trends.
Similarly, it is often surprisingly easy to provide the same semantic guarantees. For example, a B-Tree can be seen as a model that points to a page containing records with a particular key, requiring a scan of the page to return all records that actually satisfy a query. In this sense, a B-Tree already trades off execution performance for accuracy [19]. Learning-based models will simply have more flexibility to explore that trade-off.
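The B-Tree-as-model view can be sketched as follows. This is an illustrative, hedged example (a trivially simple linear model fitted offline over a sorted key array; the structure names and the error-bound mechanism are assumptions of this sketch, not definitions from the paper): the model predicts an approximate position, and a bounded local search corrects it, which is what preserves the exact-lookup guarantee.

#include <stdint.h>
#include <stddef.h>

/* Illustrative learned-index-style lookup over a sorted key array.
 * A model (here a linear fit whose slope/intercept were learned offline)
 * predicts a position; max_err bounds how far the prediction can be off,
 * so a short local search restores exact semantics. */
typedef struct {
    double slope, intercept;  /* learned offline from the key distribution */
    int64_t max_err;          /* maximum observed prediction error */
} linear_model;

ptrdiff_t learned_lookup(const int64_t *keys, int64_t n,
                         const linear_model *m, int64_t key) {
    int64_t guess = (int64_t)(m->slope * (double)key + m->intercept);
    /* Clamp the error window to the array bounds. */
    int64_t lo = guess - m->max_err, hi = guess + m->max_err;
    if (lo < 0) lo = 0;
    if (hi > n - 1) hi = n - 1;
    /* Local search within the error bound recovers exactness,
     * analogous to scanning a B-Tree leaf page. */
    for (int64_t i = lo; i <= hi; i++)
        if (keys[i] == key)
            return (ptrdiff_t)i;
    return -1;  /* key not present */
}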
Finally, aggressive use of synthesis and code generation
will allow us to automatically create efficient data structures, which combine models with more traditional algorithms. Here our goal is to (1) generate code tailored to the
user’s data distribution, and (2) synthesize database components with embedded machine learning models, which balance the trade-off between memory, latency, and compute
for a given use case while providing the same semantic guarantees as the traditional components.
Building SageDB requires a radical departure from the
way database systems are currently developed and involves
challenges from databases, machine learning and programming systems. SageDB is currently being developed as part
of MIT’s new Data Systems for AI Lab (DSAIL), which consists of an interdisciplinary team with expertise in all of these
areas. Our focus is on building a new analytical (OLAP) engine, but we believe the approach also has significant advantages for OLTP or hybrid workloads. The remainder of the paper outlines the overall architecture of SageDB, describes individual challenges, and presents some promising initial results.
2. MODEL-DRIVEN APPROACH
The core idea behind SageDB is to build one or more models about the data and workload distribution and, based on them, to automatically build the best data structures and algorithms for all components of the database system. This approach, which we call "database synthesis," will allow us to achieve unprecedented performance by specializing the implementation of every database component to the specific database, query workload, and execution environment.
2.1 Types of Customization
The proposed customization goes far beyond the current use of statistics and models about the data, the hardware, or the performance of algorithms, and can be roughly classified into the following levels:
Customization through Configuration: The most basic form of customization is configuring the system, a.k.a. knob tuning. Most systems and heuristics have a huge number of settings (e.g., page size, buffer-pool size, etc.). Traditionally, database administrators tune those knobs to configure the general-purpose database for a particular use case. In that sense, the creation of indexes, finding the right partitioning scheme, or the creation of materialized views for performance can also be considered as finding the best configuration of the system. It therefore comes as no surprise that there has been a lot of work on automatically tuning those configurations [37, 34, 7, 9, 35] based on the workload and data characteristics.
Customization through Algorithm Picking: While configuring the system is largely static, databases have a long history of using query optimizers to dynamically "customize" the execution strategy for a particular query based on statistics about the data and the configuration (e.g., available indexes) of the system. That is, the query optimizer decides on the best execution order (e.g., predicate push-downs, join ordering, etc.) and picks the best implementation from a set of available algorithms (e.g., nested-loop join vs. hash-join). This declarative approach, which allows the user to specify the query at a high level while the system figures out how best to achieve it, is one of the most significant contributions of the database community.
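The algorithm-picking step can be sketched with a toy cost model. The constants and the heuristic below are invented purely for illustration (they come from neither the paper nor any particular optimizer); the point is only that cardinality estimates derived from data statistics drive the choice of the physical operator.

#include <stdint.h>

typedef enum { HASH_JOIN, NESTED_LOOP_JOIN } join_algorithm;

/* Toy cost-based choice between two physical join operators, driven only
 * by estimated input cardinalities. Real optimizers use far richer
 * statistics and cost models; this shows only the shape of the decision. */
join_algorithm pick_join(int64_t outer_rows, int64_t inner_rows) {
    double hash_cost = 1.2 * (double)(outer_rows + inner_rows); /* build + probe */
    double nl_cost   = (double)outer_rows * (double)inner_rows; /* pairwise probes */
    return (hash_cost < nl_cost) ? HASH_JOIN : NESTED_LOOP_JOIN;
}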
Customization through Self-Design: Self-designed systems rely on the notion of mapping the possible space of critical design decisions in a system and automatically generating a design that best fits the workload and hardware [15].
Here the space of possible designs is defined by all combinations and tunings of first-principles components, such as fence
pointers, links, temporal partitioning, etc., which together
form a “periodic table of data structures” [14]. This goes far
beyond algorithm picking or configuring a system because
new combinations of these primitives might yield previously
unknown algorithms/data structures and can lead to significant performance gains [15].
Customization through Learning: In contrast to self-design, learned systems replace core data system components with learned models. For example, in [19] we show how indexes can be replaced by models, whereas [21] shows how to learn workload-specific scheduling strategies. Models make it possible to capture data and workload properties that traditional data structures and algorithms have a hard time supporting well. As a result, under certain conditions these data structures can provide the best-case complexity, e.g., O(N) instead of O(N log N), and yield even higher performance gains than customization through self-design. Furthermore, they change the type of computation from traditional control-flow-heavy computation to data-dependency-focused computation, which can often be executed more efficiently on CPUs and the upcoming ML accelerators.
These different types of customization can be composed. In particular, customization through self-design and customization through learning go hand in hand, as the learned models
often have to be combined with more traditional algorithms
and data structures in order to provide the same semantic
guarantees. More interestingly, models can potentially be
shared among different components of a database system.
In that sense, we argue in this paper that customization
through learning is the most powerful form of customization
and outline how SageDB deeply embeds models into all algorithms and data structures, making the models the brain
of the database (see Figure 2).

 
 