Contrastive Multiview Coding Notes: PDF Download
Main content:
Humans view the world through many sensory channels,
e.g., the long-wavelength light channel, viewed by the left
eye, or the high-frequency vibrations channel, heard by the
right ear. Each view is noisy and incomplete, but important factors, such as physics, geometry, and semantics, tend
to be shared between all views (e.g., a “dog” can be seen,
heard, and felt). We investigate the classic hypothesis that
a powerful representation is one that models view-invariant
factors. We study this hypothesis under the framework of
multiview contrastive learning, where we learn a representation that aims to maximize mutual information between
different views of the same scene but is otherwise compact.
Our approach scales to any number of views, and is view-agnostic. We analyze key properties of the approach that
make it work, finding that the contrastive loss outperforms
a popular alternative based on cross-view prediction, and
that the more views we learn from, the better the resulting
representation captures underlying scene semantics. Our approach achieves state-of-the-art results on image and video
unsupervised learning benchmarks. Code is released at:
http://github.com/HobbitLong/CMC/.
1. Introduction
A foundational idea in coding theory is to learn compressed representations that nonetheless can be used to reconstruct the raw data. This idea shows up in contemporary
representation learning in the form of autoencoders [65] and
generative models [40, 24], which try to represent a data
point or distribution as losslessly as possible. Yet lossless
representation might not be what we really want, and indeed
it is trivial to achieve – the raw data itself is a lossless representation. What we might instead prefer is to keep the
“good” information (signal) and throw away the rest (noise).
How can we identify what information is signal and what is
noise?
To an autoencoder, or a max likelihood generative model,
a bit is a bit. No one bit is better than any other. Our conjecture in this paper is that some bits are in fact better than
others. Some bits code important properties like semantics, physics, and geometry, while others code attributes that
we might consider less important, like incidental lighting
conditions or thermal noise in a camera's sensor.

Figure 1: Given a set of sensory views, a deep representation is learnt by bringing views of the same scene together in embedding space, while pushing views of different scenes apart. Here we show an example of a 4-view dataset (NYU RGBD [53]) and its learned representation. The encodings for each view may be concatenated to form the full representation of a scene.
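To make the last sentence of the caption concrete, here is a minimal PyTorch sketch (our own illustration, not the authors' released code) of concatenating per-view encodings into a full scene representation; the encoder modules are hypothetical placeholders.

import torch

def full_representation(views, encoders):
    # views: list of per-view input tensors, each of shape (batch, ...)
    # encoders: list of matching torch.nn.Module encoders, one per view,
    #           each mapping its view to a (batch, dim) code
    codes = [encoder(view) for encoder, view in zip(encoders, views)]
    # Concatenate the per-view codes along the feature dimension.
    return torch.cat(codes, dim=1)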
We revisit the classic hypothesis that the good bits are the
ones that are shared between multiple views of the world,
for example between multiple sensory modalities like vision,
sound, and touch [70]. Under this perspective “presence of
dog” is good information, since dogs can be seen, heard,
and felt, but “camera pose” is bad information, since a camera’s pose has little or no effect on the acoustic and tactile
properties of the imaged scene. This hypothesis corresponds
to the inductive bias that the way you view a scene should
not affect its semantics. There is significant evidence in
the cognitive science and neuroscience literature that such
view-invariant representations are encoded by the brain (e.g.,
[70, 15, 32]). In this paper, we specifically study the setting
where the different views are different image channels, such
as luminance, chrominance, depth, and optical flow. The fundamental supervisory signal we exploit is the co-occurrence,
in natural data, of multiple views of the same scene. For
example, we consider an image in Lab color space to be a
paired example of the co-occurrence of two views of the
scene, the L view and the ab view: {L, ab}.
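As an illustration of this view construction, the sketch below splits an RGB image into the {L, ab} pair; it assumes scikit-image is available, and make_lab_views is our own helper name, not code from the paper.

import numpy as np
from skimage.color import rgb2lab

def make_lab_views(rgb_image: np.ndarray):
    # rgb_image: (H, W, 3) array, float in [0, 1] or uint8 (scikit-image rescales)
    lab = rgb2lab(rgb_image)      # convert to Lab color space
    l_view = lab[..., :1]         # luminance channel L
    ab_view = lab[..., 1:]        # chrominance channels a and b
    return l_view, ab_view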
Our goal is therefore to learn representations that capture
information shared between multiple sensory channels but
that are otherwise compact (i.e. discard channel-specific
nuisance factors). To do so, we employ contrastive learning,
where we learn a feature embedding such that views of the
same scene map to nearby points (measured with Euclidean
distance in representation space) while views of different
scenes map to far apart points. In particular, we adapt the
recently proposed method of Contrastive Predictive Coding
(CPC) [57], except we simplify it – removing the recurrent
network – and generalize it – showing how to apply it to
arbitrary collections of image channels, rather than just to
temporal or spatial predictions. In reference to CPC, we term
our method Contrastive Multiview Coding (CMC), although
we note that our formulation is arguably equally related to
Instance Discrimination [79]. The contrastive objective in
our formulation, as in CPC and Instance Discrimination,
can be understood as attempting to maximize the mutual
information between the representations of multiple views
of the data.
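For readers who prefer code, the following is a minimal PyTorch sketch of a two-view contrastive (InfoNCE-style) objective of the kind described above; it is our own simplified illustration rather than the released implementation, and the temperature value is an assumption.

import torch
import torch.nn.functional as F

def two_view_contrastive_loss(z1, z2, temperature=0.07):
    # z1, z2: (batch, dim) embeddings of two views of the same batch of scenes;
    # row i of z1 and row i of z2 form the positive pair, all other rows are negatives.
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                    # pairwise similarity scores
    labels = torch.arange(z1.size(0), device=z1.device)   # positives lie on the diagonal
    # Treat each view in turn as the anchor and contrast it against the other view.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))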
We intentionally leave “good bits” only loosely defined
and treat its definition as an empirical question. Ultimately,
the proof is in the pudding: we consider a representation
to be good if it makes subsequent problem solving easy, on
tasks of human interest. For example, a useful representation
of images might be a feature space in which it is easy to learn
to recognize objects. We therefore evaluate our method by
testing if the learned representations transfer well to standard
semantic recognition tasks. On several benchmark tasks, our
method achieves results competitive with the state of the
art, compared to other methods for self-supervised representation learning. We additionally find that the quality of
the representation improves as a function of the number of
views used for training. Finally, we compare the contrastive
formulation of multiview learning to the recently popular
approach of cross-view prediction, and find that in head-to-head comparisons, the contrastive approach learns stronger
representations.
The core ideas that we build on: contrastive learning,
mutual information maximization, and deep representation
learning, are not new and have been explored in the literature
on representation and multiview learning for decades [64,
45, 80, 3]. Our main contribution is to set up a framework to
extend these ideas to any number of views, and to empirically
study the factors that lead to success in this framework. A
review of the related literature is given in Section 2; and Fig.
1 gives a pictorial overview of our framework. Our main
contributions are:
• We apply contrastive learning to the multiview setting,
attempting to maximize mutual information between
representations of different views of the same scene (in
particular, between different image channels).
• We extend the framework to learn from more than two
views, and show that the quality of the learned representation improves as the number of views increases. Ours is
the first work to explicitly show the benefits of multiple
views on representation quality.
• We conduct controlled experiments to measure the effect of mutual information estimates on representation
quality. Our experiments show that the relationship
between mutual information and views is a subtle one.
• Our representations rival state of the art on popular
benchmarks.
• We demonstrate that the contrastive objective is superior to cross-view prediction.
2. Related work
Unsupervised representation learning is about learning
transformations of the data that make subsequent problem
solving easier [7]. This field has a long history, starting with
classical methods with well established algorithms, such as
principal components analysis (PCA [37]) and independent
components analysis (ICA [33]). These methods tend to
learn representations that focus on low-level variations in
the data, which are not very useful from the perspective of
downstream tasks such as object recognition.
Representations better suited to such tasks have been
learnt using deep neural networks, starting with seminal
techniques such as Boltzmann machines [71, 65], autoencoders [30], variational autoencoders [40], generative adversarial networks [24], and autoregressive models [56]. Numerous other works exist; for a review, see [7]. A powerful family of models for unsupervised representations is
collected under the umbrella of “self-supervised” learning
[64, 35, 85, 84, 78, 60, 83]. In these models, an input X to
the model is transformed into an output X̂, which is supposed to be close to another signal Y (usually in Euclidean
space), which itself is related to X in some meaningful way.
Examples of such X/Y pairs are: luminance and chrominance color channels of an image [85], patches from a single
image [57], modalities such as vision and sound [58] or the
frames of a video [78]. Clearly, such examples are numerous
in the world, and provide us with nearly infinite amounts
of training data: this is one of the appeals of this paradigm.
Time contrastive networks [68] use a triplet loss framework
to learn representations from aligned video sequences of
the same scene, taken by different video cameras. Closely
related to self-supervised learning is the idea of multi-view
learning, which is a general term involving many different
approaches such as co-training [8], multi-kernel learning
[13] and metric learning [6, 87]; for comprehensive surveys
please see [80, 45]. Nearly all existing works have dealt with
one or two views such as video or image/sound. However, in
many situations, many more views are available to provide
training signals for any representation.