Deep Learning Theory: Notes (《深度学习理论》笔记)
Introduction
Machine learning aims to solve the following problem:
\[ R(f) \to \min_{f \in \mathcal{F}}. \tag{1.1} \]
Here $R(f) = \mathbb{E}_{x,y \sim \mathcal{D}}\, r(y, f(x))$ is the true risk of a model $f$ from a class $\mathcal{F}$, and $\mathcal{D}$ is the data distribution. However, we do not have access to the true data distribution; instead, we have a finite set of i.i.d. samples from it: $S_n = \{(x_i, y_i)\}_{i=1}^n \sim \mathcal{D}^n$. For this reason, instead of approaching (1.1), we substitute it with empirical risk minimization:
\[ \hat{R}_n(f) \to \min_{f \in \mathcal{F}}, \tag{1.2} \]
where $\hat{R}_n(f) = \mathbb{E}_{x,y \in S_n}\, r(y, f(x))$ is the empirical risk of a model $f$ from the class $\mathcal{F}$.
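As a concrete reading of (1.2), the sketch below evaluates the empirical risk of a fixed model on a finite sample. The zero-one risk and the toy threshold classifier are illustrative assumptions, not anything prescribed by the notes.

import numpy as np

def zero_one_risk(y_true, y_pred):
    # r(y, f(x)) = 1 if the prediction differs from the label, else 0
    return (y_true != y_pred).astype(float)

def empirical_risk(f, X, y, r=zero_one_risk):
    # \hat{R}_n(f): average loss of f over the finite sample S_n
    return r(y, f(X)).mean()

# Toy example: a threshold classifier on 1-D inputs (purely illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=200)
y = (X > 0).astype(int)
f = lambda X: (X > 0.1).astype(int)  # a hypothetical model from the class F
print(empirical_risk(f, X, y))       # estimates R(f) from n = 200 samples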
1.1 Generalization ability
How does the solution of (1.2) relate to (1.1)? In other words, we aim to upper-bound the difference between the two risks:
\[ R(\hat{f}_n) - \hat{R}_n(\hat{f}_n) \le \mathrm{bound}(\hat{f}_n, \mathcal{F}, n, \delta) \quad \text{w.p.} \ge 1 - \delta \text{ over } S_n, \tag{1.3} \]
where $\hat{f}_n \in \mathcal{F}$ is the result of training the model on the dataset $S_n$.
We call the bound (1.3) a-posteriori if it depends on the resulting model $\hat{f}_n$, and we call it a-priori if it does not. An a-priori bound allows one to estimate the risk difference before training, while an a-posteriori bound estimates the risk difference based on the final model.
Uniform bounds are instances of the a-priori class:
\[ R(\hat{f}_n) - \hat{R}_n(\hat{f}_n) \le \sup_{f \in \mathcal{F}} \left| R(f) - \hat{R}_n(f) \right| \le \mathrm{ubound}(\mathcal{F}, n, \delta) \quad \text{w.p.} \ge 1 - \delta \text{ over } S_n. \tag{1.4} \]
A typical form of the uniform bound is the following:
\[ \mathrm{ubound}(\mathcal{F}, n, \delta) = O\left( \sqrt{\frac{C(\mathcal{F}) + \log(1/\delta)}{n}} \right), \tag{1.5} \]
where $C(\mathcal{F})$ is a complexity measure of the class $\mathcal{F}$.
The bound above suggests that the generalization ability, measured by the risk difference, decays as the model class becomes larger. This conforms to the classical bias-variance trade-off curve. The curve can be reproduced by fitting the Runge function with a polynomial using a train set of equidistant points; the same phenomenon can be observed for decision trees.
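The U-shaped curve mentioned above can be reproduced in a few lines; the sketch below fits polynomials of growing degree to the Runge function on equidistant points and evaluates both risks on a dense grid. The degree range and sample sizes are illustrative choices.

import numpy as np

runge = lambda x: 1.0 / (1.0 + 25.0 * x**2)   # the Runge function on [-1, 1]

x_train = np.linspace(-1, 1, 15)              # equidistant train points
y_train = runge(x_train)
x_test = np.linspace(-1, 1, 1000)             # dense grid as a proxy for the true risk
y_test = runge(x_test)

for degree in [1, 3, 5, 9, 14]:
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    train_risk = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_risk = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree={degree:2d}  train={train_risk:.2e}  test={test_risk:.2e}")

# Train risk decreases monotonically with degree, while test risk eventually
# blows up near the interval ends (Runge's phenomenon): the bias-variance curve.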
A typical notion of model class complexity is VC-dimension [Vapnik and Chervonenkis, 1971]. For neural networks, VC-dimension grows at least linearly with the number of parameters [Bartlett et al., 2019]. Hence the bound (1.5) becomes vacuous for large enough nets. However, as we observe, the empirical (train) risk $\hat{R}_n$ vanishes, while the true (test) risk saturates for large enough width (see Figure 1 of [Neyshabur et al., 2015]).
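To see concretely why (1.5) goes vacuous at modern scale, here is a back-of-the-envelope instance; the parameter count and sample size are illustrative assumptions, not figures from the cited papers. Taking $C(\mathcal{F})$ of the order of the parameter count,
\[ C(\mathcal{F}) \approx 10^6, \quad n = 10^5, \quad \delta = 0.01 \quad \Longrightarrow \quad \sqrt{\frac{10^6 + \log 100}{10^5}} \approx \sqrt{10} \approx 3.2, \]
which is useless for a risk bounded by 1.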
One might hypothesize that the problem is in VC-dimension, which overestimates the complexity of neural nets. However, the problem turns out to be in uniform bounds in general. Indeed, if the class $\mathcal{F}$ contains a bad network, i.e. a network that perfectly fits the train data but fails badly on the true data distribution, the uniform bound (1.4) becomes at least nearly vacuous. In realistic scenarios, such a bad network can be found explicitly: [Zhang et al., 2016] demonstrated that practically-sized nets can fit data with random labels; similarly, these nets can fit the training data plus some additional data with random labels. Such nets fit the training data perfectly but generalize poorly.
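A minimal version of the random-label experiment of [Zhang et al., 2016] can be sketched as follows, assuming PyTorch is available; the architecture, optimizer, and sizes are arbitrary illustrative choices, not the original paper's setup.

import torch
import torch.nn as nn

torch.manual_seed(0)
n, d, classes = 512, 32, 10
X = torch.randn(n, d)
y = torch.randint(0, classes, (n,))      # labels are pure noise

# An over-parameterized MLP: far more parameters than samples.
net = nn.Sequential(nn.Linear(d, 2048), nn.ReLU(), nn.Linear(2048, classes))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(net(X), y)
    loss.backward()
    opt.step()

with torch.no_grad():
    train_acc = (net(X).argmax(dim=1) == y).float().mean().item()
print(f"train accuracy on random labels: {train_acc:.3f}")  # approaches 1.0
# A perfect fit is reached even though no generalization is possible here.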
Up to this point, we know that among the networks with zero training risk, some nets generalize well, while some generalize poorly. Suppose we managed to come up with some model complexity measure that is symptomatic of poor generalization: bad nets have higher complexity than good ones. If we did, we could come up with a better bound by prioritizing less complex models.
Such prioritization is naturally supported by the PAC-Bayesian paradigm. First, we come up with a prior distribution $P$ over models. This distribution should not depend on the train dataset $S_n$. Then we build a posterior distribution $Q \mid S_n$ over models based on the observed data. For instance, if we fix random seeds, a usual network training procedure gives a posterior distribution concentrated in a single model $\hat{f}_n$. The PAC-Bayesian bound [McAllester, 1999b] takes the following form:
\[ R(Q \mid S_n) - \hat{R}_n(Q \mid S_n) \le O\left( \sqrt{\frac{\mathrm{KL}(Q \mid S_n \,\|\, P) + \log(1/\delta)}{n}} \right) \quad \text{w.p.} \ge 1 - \delta \text{ over } S_n, \tag{1.6} \]
where $R(Q)$ is the expected risk for models sampled from $Q$; similarly for $\hat{R}_n(Q)$. If more complex models are less likely to be found, then we can embed this information into the prior, thus making the KL-divergence typically smaller.
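As a small illustration of how the right-hand side of (1.6) is evaluated, the sketch below computes the bound for a diagonal-Gaussian prior and posterior over weights, a common modeling choice assumed here for concreteness; the dimensions and variances are arbitrary.

import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    # KL(Q || P) for diagonal Gaussians, summed over dimensions.
    return 0.5 * np.sum(var_q / var_p + (mu_p - mu_q) ** 2 / var_p
                        - 1.0 + np.log(var_p / var_q))

d, n, delta = 1000, 50_000, 0.05
mu_p, var_p = np.zeros(d), np.ones(d)       # data-independent prior
mu_q = 0.1 * np.ones(d)                     # posterior mean found by training
var_q = 0.5 * np.ones(d)

kl = kl_diag_gaussians(mu_q, var_q, mu_p, var_p)
bound = np.sqrt((kl + np.log(1.0 / delta)) / n)   # shape of the RHS of (1.6)
print(f"KL = {kl:.1f}, bound ~ {bound:.3f}")

A posterior that stays close to the prior (small KL) yields a tight bound, which is exactly why embedding knowledge about likely models into the prior helps.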
The PAC-Bayesian bound (1.6) is an example of an a-posteriori bound, since the bound depends on $Q$. However, it is possible to obtain an a-priori bound using the same paradigm [Neyshabur et al., 2018].
The bound (1.6) becomes better when our training procedure tends to find models that are probable according to the prior. But what kind of models does gradient descent typically find? Does it implicitly minimize some complexity measure of the resulting model? Despite the existence of bad networks, minimizing the train loss using gradient descent typically reveals well-performing solutions. This phenomenon is referred to as the implicit bias of gradient descent.
Another problem with a-priori bounds is that they are all effectively two-sided: all of them bound the absolute value of the risk difference, rather than the risk difference itself. Two-sided bounds fail if there exist networks that generalize well while failing on a given train set. [Nagarajan and Kolter, 2019] have constructed a problem for which such networks are typically found by gradient descent.
1.2 Global convergence
We have introduced the empirical risk minimization problem (1.2) because we were not able to minimize the true risk directly: see (1.1). But are we able to minimize the empirical risk? Let $f(x; \theta)$ be a neural net evaluated at input $x$ with parameters $\theta$. Consider a loss function $\ell$ that is a convex surrogate of the risk $r$. Then minimizing the train loss implies minimizing the empirical risk:
\[ \hat{L}_n(\theta) = \mathbb{E}_{x,y \in S_n}\, \ell(y, f(x; \theta)) \to \min_\theta. \tag{1.7} \]
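To make the surrogate idea concrete, the sketch below runs plain gradient descent on the logistic loss, a convex surrogate of the zero-one risk, for a linear model; the linear model, step size, and data are illustrative assumptions, since for an actual neural net the loss is non-convex in $\theta$.

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = np.sign(X @ w_true)                       # labels in {-1, +1}

def logistic_loss(theta):
    # \hat{L}_n(θ) with ℓ(y, f) = log(1 + exp(-y f)) and f(x; θ) = <x, θ>
    margins = y * (X @ theta)
    return np.mean(np.log1p(np.exp(-margins)))

def grad(theta):
    margins = y * (X @ theta)
    return -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)

theta = np.zeros(d)
for _ in range(500):                          # plain gradient descent
    theta -= 1.0 * grad(theta)

train_risk = np.mean(np.sign(X @ theta) != y) # zero-one empirical risk
print(f"loss={logistic_loss(theta):.4f}, train 0-1 risk={train_risk:.3f}")

Driving the convex surrogate loss toward zero drives the empirical zero-one risk to zero as well, which is the upper-bounding role the surrogate plays in (1.7).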
Neural nets are complex non-linear functions of both inputs and weights; we can hardly expect the loss landscape $\hat{L}_n(\theta)$ induced by such functions to be simple. At least, for non-trivial neural nets, $\hat{L}_n$ is a non-convex function of $\theta$. Hence it can have local minima that are not global.
The most widely-used method of solving the problem (1.7) for deep learning is gradient descent (GD), or one of its variants. Since GD is a local method, it cannot have any global convergence guarantees in the general case. However, for practically-sized neural nets it consistently succeeds in finding a global minimum.
Given this observation, it is tempting to hypothesize that, despite the non-convexity, all local minima of $\hat{L}_n(\theta)$ are global. This turns out to be true for linear nets [Kawaguchi, 2016, Lu and Kawaguchi, 2017, Laurent and Brecht, 2018], and for non-linear nets if they are sufficiently wide [Nguyen, 2019].
While globality of local minima implies almost sure convergence of gradient descent [Lee et al., 2016, Panageas and Piliouras, 2017],
there are no guarantees on convergence speed. Generally, convergence speed depends on initialization. For instance,