Notes on Deep Learning Theory
Machine learning aims to solve the following problem:
$$R(f) \to \min_{f \in \mathcal{F}}. \tag{1.1}$$
Here $R(f) = \mathbb{E}_{x,y \sim \mathcal{D}}\, r(y, f(x))$ is the true risk of a model $f$ from a class $\mathcal{F}$, and $\mathcal{D}$ is the data distribution. However, we do not have access to the true data distribution; instead we have a finite set of i.i.d. samples from it: $S_n = \{(x_i, y_i)\}_{i=1}^n \sim \mathcal{D}^n$. For this reason, instead of approaching (1.1) directly, we substitute it with empirical risk minimization:
$$\hat R_n(f) \to \min_{f \in \mathcal{F}}, \tag{1.2}$$
where $\hat R_n(f) = \mathbb{E}_{x,y \in S_n}\, r(y, f(x))$ is the empirical risk of a model $f$ from the class $\mathcal{F}$.
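The gap between the two risks can be made concrete with a small numerical sketch. Everything below is an illustrative assumption rather than part of the notes: the toy distribution (the label is the sign of $x$, flipped with probability 0.1), the fixed model `f`, and the 0-1 risk function. The true risk is approximated by a Monte Carlo average over a very large sample.

```python
import numpy as np

def zero_one_risk(y, y_pred):
    """Risk function r(y, f(x)): 1 for a wrong prediction, 0 otherwise."""
    return (y != y_pred).astype(float)

def sample(n, rng):
    """Draw n i.i.d. pairs (x, y) from a toy distribution D (an assumption)."""
    x = rng.normal(size=n)
    # The label is the sign of x, flipped with probability 0.1 (label noise).
    flip = rng.random(n) < 0.1
    y = np.where(flip, -np.sign(x), np.sign(x))
    return x, y

def f(x):
    """A fixed model: predict the sign of x."""
    return np.sign(x)

def empirical_risk(model, x, y):
    """Average of r over a finite sample, i.e. the empirical risk of the model."""
    return float(zero_one_risk(y, model(x)).mean())

rng = np.random.default_rng(0)
x_train, y_train = sample(50, rng)
x_big, y_big = sample(200_000, rng)

risk_small = empirical_risk(f, x_train, y_train)   # empirical risk with n = 50
risk_large = empirical_risk(f, x_big, y_big)       # Monte Carlo proxy for R(f)
print("empirical risk, n = 50:", risk_small)
print("approximate true risk:", risk_large)
```

For this toy distribution the true risk of `f` equals the label-noise rate, 0.1, so the large-sample estimate should be close to that value, while the 50-sample estimate fluctuates around it.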
1.1 Generalization ability
How does the solution of (1.2) relate to (1.1)? In other words, we aim to upper-bound the difference between the two risks:
$$R(\hat f_n) - \hat R_n(\hat f_n) \le \mathrm{bound}(\hat f_n, \mathcal{F}, n, \delta) \quad \text{w.p. } \ge 1 - \delta \text{ over } S_n, \tag{1.3}$$
where $\hat f_n \in \mathcal{F}$ is the result of training the model on the dataset $S_n$.
We call the bound (1.3) a-posteriori if it depends on the resulting model $\hat f_n$, and a-priori if it does not. An a-priori bound allows one to estimate the risk difference before training, while an a-posteriori bound estimates the risk difference based on the final model.
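Uniform bounds are an instance of the a-priori class:
$$R(\hat f_n) - \hat R_n(\hat f_n) \le \sup_{f \in \mathcal{F}} |R(f) - \hat R_n(f)| \le \mathrm{ubound}(\mathcal{F}, n, \delta) \quad \text{w.p. } \ge 1 - \delta \text{ over } S_n. \tag{1.4}$$
A typical form of the uniform bound is the following:
$$\mathrm{ubound}(\mathcal{F}, n, \delta) = O\left(\sqrt{\frac{C(\mathcal{F}) + \log(1/\delta)}{n}}\right), \tag{1.5}$$
where $C(\mathcal{F})$ is a complexity of the class $\mathcal{F}$.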
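To get a feel for the magnitude of the right-hand side of (1.5), the sketch below simply evaluates the square-root expression. The constant hidden by the $O(\cdot)$ notation is unknown, so `const=1.0` is an assumption, and the complexity values plugged in are hypothetical numbers chosen only for illustration.

```python
import math

def ubound(complexity, n, delta, const=1.0):
    """Evaluate const * sqrt((C(F) + log(1/delta)) / n) from (1.5).

    `const` stands in for the constant hidden in the O-notation (assumed).
    """
    return const * math.sqrt((complexity + math.log(1.0 / delta)) / n)

n, delta = 50_000, 0.05
small_class = ubound(complexity=10, n=n, delta=delta)         # low complexity
large_class = ubound(complexity=1_000_000, n=n, delta=delta)  # complexity >> n
print("small class:", small_class)
print("large class:", large_class)
```

When the complexity term exceeds the number of samples, the evaluated bound is larger than 1, which carries no information for a risk that takes values in $[0, 1]$.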
The bound above suggests that the generalization ability, measured by the risk difference, degrades as the model class becomes larger. This agrees with the classical bias-variance trade-off curve. The curve can be reproduced by fitting the Runge function with a polynomial on a train set of equidistant points; the same phenomenon can be observed for decision trees.
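The Runge experiment mentioned above can be sketched in a few lines. The specifics are assumptions made for illustration: the Runge function is taken as $1 / (1 + 25 x^2)$ on $[-1, 1]$, the train set has 15 equidistant points, the fit is ordinary least squares via `numpy.polynomial.Polynomial.fit`, and the test risk is a mean squared error on a dense grid.

```python
import numpy as np
from numpy.polynomial import Polynomial

def runge(x):
    """The Runge function on [-1, 1]."""
    return 1.0 / (1.0 + 25.0 * x**2)

def fit_and_score(degree, n_train=15, n_test=1001):
    """Least-squares polynomial fit on equidistant points; returns (train, test) MSE."""
    x_train = np.linspace(-1.0, 1.0, n_train)
    x_test = np.linspace(-1.0, 1.0, n_test)
    p = Polynomial.fit(x_train, runge(x_train), degree)
    train_mse = float(np.mean((p(x_train) - runge(x_train)) ** 2))
    test_mse = float(np.mean((p(x_test) - runge(x_test)) ** 2))
    return train_mse, test_mse

# Only even degrees are scanned, since the target function is even.
scores = {degree: fit_and_score(degree) for degree in range(0, 15, 2)}
for degree, (train_mse, test_mse) in scores.items():
    print(f"degree {degree:2d}: train MSE {train_mse:.2e}, test MSE {test_mse:.2e}")
```

The train error can only decrease as the degree grows, whereas the test error is expected to rise again at high degree because the interpolating polynomial oscillates near the interval ends.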
A typical notion of model class complexity is the VC-dimension [Vapnik and Chervonenkis, 1971]. For neural networks, the VC-dimension grows at least linearly with the number of parameters [Bartlett et al., 2019]. Hence the bound (1.5) becomes vacuous for large enough nets. However, as we observe, the empirical (train) risk $\hat R_n$ vanishes, while the true (test) risk saturates for large enough width (see Figure 1 of [Neyshabur et al., 2015]).
One might hypothesize that the problem lies in the VC-dimension, which overestimates the complexity of neural nets. However, the problem turns out to lie in uniform bounds in general. Indeed, if the class $\mathcal{F}$ contains a bad network, i.e. a network that fits the train data perfectly but fails badly on the true data distribution, the uniform bound (1.4) becomes at least nearly vacuous. In realistic scenarios, such a bad network can be found explicitly: [Zhang et al., 2016] demonstrated that nets of practical size can fit data with random labels; similarly, these nets can fit the training data plus some additional data with random labels. Such nets fit the training data perfectly but generalize poorly.
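A much smaller analogue of this random-label experiment can be run with NumPy alone. The setup is an assumption and not the one used by [Zhang et al., 2016]: a two-layer ReLU model whose hidden layer is random and frozen, with only the output weights fit by least squares, and with labels drawn as fair coin flips independent of the inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, dim, width = 100, 2000, 10, 2000

def random_data(n):
    """Gaussian inputs; labels are fair coin flips in {-1, +1}, independent of x."""
    x = rng.normal(size=(n, dim))
    y = rng.choice([-1.0, 1.0], size=n)
    return x, y

# Fixed random hidden layer with ReLU activation; only the output layer is fit.
w_hidden = rng.normal(size=(dim, width)) / np.sqrt(dim)

def features(x):
    return np.maximum(x @ w_hidden, 0.0)

x_train, y_train = random_data(n_train)
x_test, y_test = random_data(n_test)

# Minimum-norm least-squares solution for the output weights (width > n_train).
w_out, *_ = np.linalg.lstsq(features(x_train), y_train, rcond=None)

def error_rate(x, y):
    """Fraction of points whose predicted sign differs from the label."""
    return float(np.mean(np.sign(features(x) @ w_out) != y))

train_error = error_rate(x_train, y_train)
test_error = error_rate(x_test, y_test)
print("train error:", train_error)
print("test error:", test_error)
```

Because the model has far more output weights than training points, it can interpolate the random labels, while on fresh random labels it can do no better than chance.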
Up to this point, we know that among the networks with zero training risk, some generalize well, while others generalize poorly. Suppose we managed to come up with a model complexity measure that is symptomatic of poor generalization: bad nets have higher complexity than good ones. If we did, we could obtain a better bound by prioritizing less complex models.
Such prioritization is naturally supported by the PAC-Bayesian paradigm. First, we come up with a prior distribution $P$ over models. This distribution should not depend on the train dataset $S_n$. Then we build a posterior distribution $Q \mid S_n$ over models based on the observed data. For instance, if we fix random seeds, a usual network training procedure gives a posterior distribution concentrated in a single model $\hat f_n$. The PAC-Bayesian bound [McAllester, 1999b] takes the following form:
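$$R(Q \mid S_n) - \hat R_n(Q \mid S_n) \le O\left(\sqrt{\frac{\mathrm{KL}(Q \mid S_n \,\|\, P) + \log(1/\delta)}{n}}\right) \quad \text{w.p. } \ge 1 - \delta \text{ over } S_n, \tag{1.6}$$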
where $R(Q)$ is the expected risk of models sampled from $Q$, and similarly for $\hat R_n(Q)$. If more complex models are less likely to be found, then we can embed this information into the prior, thus making the KL-divergence typically smaller.
The PAC-Bayesian bound (1.6) is an example of an a-posteriori bound, since the bound depends on $Q$. However, it is possible to obtain an a-priori bound using the same paradigm [Neyshabur et al., 2018].
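The right-hand side of (1.6) can be evaluated in closed form once a prior and a posterior are fixed. The sketch below assumes, purely for illustration, that both are isotropic Gaussians over a weight vector, for which the KL-divergence has a standard closed form; the constant hidden by $O(\cdot)$ is again set to 1, and the weight vectors are hypothetical.

```python
import math
import numpy as np

def kl_isotropic_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2 I) || N(mu_p, sigma_p^2 I) ) in closed form."""
    mu_q = np.asarray(mu_q, dtype=float)
    mu_p = np.asarray(mu_p, dtype=float)
    d = mu_q.size
    ratio = sigma_q**2 / sigma_p**2
    sq_dist = float(np.sum((mu_q - mu_p) ** 2))
    return 0.5 * (d * ratio + sq_dist / sigma_p**2 - d - d * math.log(ratio))

def pac_bayes_rhs(kl, n, delta, const=1.0):
    """Evaluate const * sqrt((KL + log(1/delta)) / n); `const` is assumed."""
    return const * math.sqrt((kl + math.log(1.0 / delta)) / n)

# Hypothetical example: the prior is centered at zero, the posterior at some weights.
prior_mean = np.zeros(3)
near_weights = np.array([1.0, 0.0, 0.0])   # close to the prior mean
far_weights = np.array([10.0, 0.0, 0.0])   # far from the prior mean

kl_near = kl_isotropic_gaussians(near_weights, 1.0, prior_mean, 1.0)
kl_far = kl_isotropic_gaussians(far_weights, 1.0, prior_mean, 1.0)
print("bound, posterior near prior:", pac_bayes_rhs(kl_near, n=10_000, delta=0.05))
print("bound, posterior far from prior:", pac_bayes_rhs(kl_far, n=10_000, delta=0.05))
```

A posterior that stays close to the prior yields a smaller KL term and hence a tighter evaluated bound, which is the sense in which the prior prioritizes some models over others.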
The bound (1.6) becomes better when our training procedure tends to find models that are probable according to the prior. But what kind of models does gradient descent typically find? Does it implicitly minimize some complexity measure of the resulting model? Despite the existence of bad networks, minimizing the train loss with gradient descent typically finds well-performing solutions. This phenomenon is referred to as the implicit bias of gradient descent.
Another problem with a-priori bounds is that they are all effectively two-sided: they bound the absolute value of the risk difference rather than the risk difference itself. Two-sided bounds fail if there exist networks that generalize well while failing on a given train set. [Nagarajan and Kolter, 2019] constructed a problem for which such networks are typically found by gradient descent.
1.2 Global convergence
We introduced the empirical risk minimization problem (1.2) because we were not able to minimize the true risk directly: see (1.1). But are we able to minimize the empirical risk? Let $f(x; \theta)$ be a neural net evaluated at input $x$ with parameters $\theta$. Consider a loss function $\ell$ that is a convex surrogate of the risk $r$. Then minimizing the train loss will imply empirical risk minimization:
$$\hat L_n(\theta) = \mathbb{E}_{x,y \in S_n}\, \ell(y, f(x; \theta)) \to \min_{\theta}. \tag{1.7}$$
Neural nets are complex non-linear functions of both inputs and weights; we can hardly expect the loss landscape $\hat L_n(\theta)$ induced by such functions to be simple. At the very least, for non-trivial neural nets $\hat L_n$ is a non-convex function of $\theta$. Hence it can have local minima that are not global.
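A one-dimensional toy example shows how a local method can end up in a non-global minimum. The quartic loss below is a hypothetical stand-in chosen for illustration, not a neural-net loss; it has a global minimum near $\theta \approx -1.30$ and a worse local minimum near $\theta \approx 1.13$, and plain gradient descent reaches one or the other depending on the starting point.

```python
def loss(theta):
    """A non-convex toy loss with two minima (illustrative assumption)."""
    return theta**4 - 3.0 * theta**2 + theta

def grad(theta):
    """Derivative of the toy loss."""
    return 4.0 * theta**3 - 6.0 * theta + 1.0

def gradient_descent(theta0, lr=0.01, steps=2000):
    """Plain gradient descent with a fixed step size."""
    theta = theta0
    for _ in range(steps):
        theta -= lr * grad(theta)
    return theta

theta_right = gradient_descent(2.0)    # starts on the right-hand slope
theta_left = gradient_descent(-2.0)    # starts on the left-hand slope
print("from  2.0:", theta_right, "loss", loss(theta_right))
print("from -2.0:", theta_left, "loss", loss(theta_left))
```

Both runs stop at a point with vanishing gradient, but only the run started at -2.0 reaches the lower of the two loss values.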
The most widely used method for solving problem (1.7) in deep learning is gradient descent (GD) or one of its variants. Since GD is a local method, it cannot have global convergence guarantees in the general case. However, for practically sized neural nets it succeeds in finding a global minimum in practice.
Given this observation, it is tempting to hypothesize that, despite the non-convexity, all local minima of $\hat L_n(\theta)$ are global. This turns out to be true for linear nets [Kawaguchi, 2016, Lu and Kawaguchi, 2017, Laurent and Brecht, 2018], and for non-linear nets if they are sufficiently wide [Nguyen, 2019].
While globality of local minima implies almost sure convergence of gradient descent [Lee et al., 2016, Panageas and Piliouras, 2017],
there are no guarantees on convergence speed. Generally, convergence speed depends on initialization. For instance,