Contrastive Multiview Coding Notes: PDF Download
Main content:
Humans view the world through many sensory channels,
e.g., the long-wavelength light channel, viewed by the left
eye, or the high-frequency vibrations channel, heard by the
right ear. Each view is noisy and incomplete, but important factors, such as physics, geometry, and semantics, tend
to be shared between all views (e.g., a “dog” can be seen,
heard, and felt). We investigate the classic hypothesis that
a powerful representation is one that models view-invariant
factors. We study this hypothesis under the framework of
multiview contrastive learning, where we learn a representation that aims to maximize mutual information between
different views of the same scene but is otherwise compact.
Our approach scales to any number of views, and is view-agnostic. We analyze key properties of the approach that
make it work, finding that the contrastive loss outperforms
a popular alternative based on cross-view prediction, and
that the more views we learn from, the better the resulting
representation captures underlying scene semantics. Our approach achieves state-of-the-art results on image and video
unsupervised learning benchmarks. Code is released at:
http://github.com/HobbitLong/CMC/.
1. Introduction
A foundational idea in coding theory is to learn compressed representations that nonetheless can be used to reconstruct the raw data. This idea shows up in contemporary
representation learning in the form of autoencoders [65] and
generative models [40, 24], which try to represent a data
point or distribution as losslessly as possible. Yet lossless
representation might not be what we really want, and indeed
it is trivial to achieve – the raw data itself is a lossless representation. What we might instead prefer is to keep the
“good” information (signal) and throw away the rest (noise).
How can we identify what information is signal and what is
noise?
To an autoencoder, or a max likelihood generative model,
a bit is a bit. No one bit is better than any other. Our conjecture in this paper is that some bits are in fact better than
others. Some bits code important properties like semantics, physics, and geometry, while others code attributes that
we might consider less important, like incidental lighting
conditions or thermal noise in a camera's sensor.

Figure 1: Given a set of sensory views, a deep representation is learnt by bringing views of the same scene together in embedding space, while pushing views of different scenes apart. Here we show an example of a 4-view dataset (NYU RGBD [53]) and its learned representation. The encodings for each view may be concatenated to form the full representation of a scene.
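To make the last sentence of the caption concrete, here is a minimal PyTorch sketch (our own illustration, not the authors' released code) of concatenating per-view encodings into a full scene representation; the encoder modules are hypothetical placeholders.

import torch

def full_representation(views, encoders):
    # views: list of per-view input tensors, each of shape (batch, ...)
    # encoders: list of matching torch.nn.Module encoders, one per view,
    #           each mapping its view to a (batch, dim) code
    codes = [encoder(view) for encoder, view in zip(encoders, views)]
    # Concatenate the per-view codes along the feature dimension.
    return torch.cat(codes, dim=1)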
We revisit the classic hypothesis that the good bits are the
ones that are shared between multiple views of the world,
for example between multiple sensory modalities like vision,
sound, and touch [70]. Under this perspective “presence of
dog” is good information, since dogs can be seen, heard,
and felt, but “camera pose” is bad information, since a camera’s pose has little or no effect on the acoustic and tactile
properties of the imaged scene. This hypothesis corresponds
to the inductive bias that the way you view a scene should
not affect its semantics. There is significant evidence in
the cognitive science and neuroscience literature that such
view-invariant representations are encoded by the brain (e.g.,
[70, 15, 32]). In this paper, we specifically study the setting
where the different views are different image channels, such
as luminance, chrominance, depth, and optical flow. The fundamental supervisory signal we exploit is the co-occurrence,
in natural data, of multiple views of the same scene. For
example, we consider an image in Lab color space to be a
paired example of the co-occurrence of two views of the
scene, the L view and the ab view: {L, ab}.
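As an illustration of this view construction, the sketch below splits an RGB image into the {L, ab} pair; it assumes scikit-image is available, and make_lab_views is our own helper name, not code from the paper.

import numpy as np
from skimage.color import rgb2lab

def make_lab_views(rgb_image: np.ndarray):
    # rgb_image: (H, W, 3) array, float in [0, 1] or uint8 (scikit-image rescales)
    lab = rgb2lab(rgb_image)      # convert to Lab color space
    l_view = lab[..., :1]         # luminance channel L
    ab_view = lab[..., 1:]        # chrominance channels a and b
    return l_view, ab_view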
Our goal is therefore to learn representations that capture
information shared between multiple sensory channels but
that are otherwise compact (i.e. discard channel-specific
nuisance factors). To do so, we employ contrastive learning,
where we learn a feature embedding such that views of the
same scene map to nearby points (measured with Euclidean
distance in representation space) while views of different
scenes map to far apart points. In particular, we adapt the
recently proposed method of Contrastive Predictive Coding
(CPC) [57], except we simplify it – removing the recurrent
network – and generalize it – showing how to apply it to
arbitrary collections of image channels, rather than just to
temporal or spatial predictions. In reference to CPC, we term
our method Contrastive Multiview Coding (CMC), although
we note that our formulation is arguably equally related to
Instance Discrimination [79]. The contrastive objective in
our formulation, as in CPC and Instance Discrimination,
can be understood as attempting to maximize the mutual
information between the representations of multiple views
of the data.
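For readers who prefer code, the following is a minimal PyTorch sketch of a two-view contrastive (InfoNCE-style) objective of the kind described above; it is our own simplified illustration rather than the released implementation, and the temperature value is an assumption.

import torch
import torch.nn.functional as F

def two_view_contrastive_loss(z1, z2, temperature=0.07):
    # z1, z2: (batch, dim) embeddings of two views of the same batch of scenes;
    # row i of z1 and row i of z2 form the positive pair, all other rows are negatives.
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                    # pairwise similarity scores
    labels = torch.arange(z1.size(0), device=z1.device)   # positives lie on the diagonal
    # Treat each view in turn as the anchor and contrast it against the other view.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))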
We intentionally leave “good bits” only loosely defined
and treat its definition as an empirical question. Ultimately,
the proof is in the pudding: we consider a representation
to be good if it makes subsequent problem solving easy, on
tasks of human interest. For example, a useful representation
of images might be a feature space in which it is easy to learn
to recognize objects. We therefore evaluate our method by
testing if the learned representations transfer well to standard
semantic recognition tasks. On several benchmark tasks, our
method achieves results competitive with the state of the
art, compared to other methods for self-supervised representation learning. We additionally find that the quality of
the representation improves as a function of the number of
views used for training. Finally, we compare the contrastive
formulation of multiview learning to the recently popular
approach of cross-view prediction, and find that in head-to-head comparisons, the contrastive approach learns stronger
representations.
The core ideas that we build on: contrastive learning,
mutual information maximization, and deep representation
learning, are not new and have been explored in the literature
on representation and multiview learning for decades [64,
45, 80, 3]. Our main contribution is to set up a framework to
extend these ideas to any number of views, and to empirically
study the factors that lead to success in this framework. A
review of the related literature is given in Section 2; and Fig.
1 gives a pictorial overview of our framework. Our main
contributions are:
• We apply contrastive learning to the multiview setting,
attempting to maximize mutual information between
representations of different views of the same scene (in
particular, between different image channels).
• We extend the framework to learn from more than two
views, and show that the quality of the learned representation improves as the number of views increases. Ours is
the first work to explicitly show the benefits of multiple
views on representation quality.
• We conduct controlled experiments to measure the effect of mutual information estimates on representation
quality. Our experiments show that the relationship
between mutual information and views is a subtle one.
• Our representations rival state of the art on popular
benchmarks.
• We demonstrate that the contrastive objective is superior to cross-view prediction.
2. Related work
Unsupervised representation learning is about learning
transformations of the data that make subsequent problem
solving easier [7]. This field has a long history, starting with
classical methods with well established algorithms, such as
principal components analysis (PCA [37]) and independent
components analysis (ICA [33]). These methods tend to
learn representations that focus on low-level variations in
the data, which are not very useful from the perspective of
downstream tasks such as object recognition.
Representations better suited to such tasks have been
learnt using deep neural networks, starting with seminal
techniques such as Boltzmann machines [71, 65], autoencoders [30], variational autoencoders [40], generative adversarial networks [24], and autoregressive models [56]. Numerous other works exist; for a review, see [7]. A powerful family of models for unsupervised representations is
collected under the umbrella of “self-supervised” learning
[64, 35, 85, 84, 78, 60, 83]. In these models, an input X to
the model is transformed into an output X̂, which is supposed to be close to another signal Y (usually in Euclidean
space), which itself is related to X in some meaningful way.
Examples of such X/Y pairs are: luminance and chrominance color channels of an image [85], patches from a single
image [57], modalities such as vision and sound [58] or the
frames of a video [78]. Clearly, such examples are numerous
in the world, and provide us with nearly infinite amounts
of training data: this is one of the appeals of this paradigm.
Time contrastive networks [68] use a triplet loss framework
to learn representations from aligned video sequences of
the same scene, taken by different video cameras. Closely
related to self-supervised learning is the idea of multi-view
learning, which is a general term involving many different
approaches such as co-training [8], multi-kernel learning
[13] and metric learning [6, 87]; for comprehensive surveys
please see [80, 45]. Nearly all existing works have dealt with
one or two views such as video or image/sound. However, in
many situations, many more views are available to provide
training signals for any representation.