Contrastive Multiview Coding Notes (PDF Download)
	Main content:
		Humans view the world through many sensory channels, 
		e.g., the long-wavelength light channel, viewed by the left 
		eye, or the high-frequency vibrations channel, heard by the 
		right ear. Each view is noisy and incomplete, but important factors, such as physics, geometry, and semantics, tend 
		to be shared between all views (e.g., a “dog” can be seen, 
		heard, and felt). We investigate the classic hypothesis that 
		a powerful representation is one that models view-invariant 
		factors. We study this hypothesis under the framework of 
		multiview contrastive learning, where we learn a representation that aims to maximize mutual information between 
		different views of the same scene but is otherwise compact. 
		Our approach scales to any number of views, and is view-agnostic. We analyze key properties of the approach that 
		make it work, finding that the contrastive loss outperforms 
		a popular alternative based on cross-view prediction, and 
		that the more views we learn from, the better the resulting 
		representation captures underlying scene semantics. Our approach achieves state-of-the-art results on image and video 
		unsupervised learning benchmarks. Code is released at: 
		http://github.com/HobbitLong/CMC/. 
		1. Introduction 
		A foundational idea in coding theory is to learn compressed representations that nonetheless can be used to reconstruct the raw data. This idea shows up in contemporary 
		representation learning in the form of autoencoders [65] and 
		generative models [40, 24], which try to represent a data 
		point or distribution as losslessly as possible. Yet lossless 
		representation might not be what we really want, and indeed 
		it is trivial to achieve – the raw data itself is a lossless representation. What we might instead prefer is to keep the 
		“good” information (signal) and throw away the rest (noise). 
		How can we identify what information is signal and what is 
		noise? 
		To an autoencoder, or a max likelihood generative model, 
		a bit is a bit. No one bit is better than any other. Our conjecture in this paper is that some bits are in fact better than 
		others. Some bits code important properties like semantics, physics, and geometry, while others code attributes that 
		we might consider less important, like incidental lighting 
		[Figure 1 diagram: encoders f_θ1, …, f_θ4 map the four views v_i^1 ∈ V_1, …, v_i^4 ∈ V_4 of scene i to codes z_i^1, …, z_i^4 that are pulled together ("matching views"), while the code z_j^1 of a view v_j^1 ∈ V_1 from a different scene j is pushed away ("unmatching view").]
		Figure 1: Given a set of sensory views, a deep representation is 
		learnt by bringing views of the same scene together in embedding 
		space, while pushing views of different scenes apart. Here we show 
		an example of a 4-view dataset (NYU RGBD [53]) and its learned 
		representation. The encodings for each view may be concatenated 
		to form the full representation of a scene. 
		conditions or thermal noise in a camera’s sensor. 
		We revisit the classic hypothesis that the good bits are the 
		ones that are shared between multiple views of the world, 
		for example between multiple sensory modalities like vision, 
		sound, and touch [70]. Under this perspective “presence of 
		dog” is good information, since dogs can be seen, heard, 
		and felt, but “camera pose” is bad information, since a camera’s pose has little or no effect on the acoustic and tactile 
		properties of the imaged scene. This hypothesis corresponds 
		to the inductive bias that the way you view a scene should 
		not affect its semantics. There is significant evidence in 
		the cognitive science and neuroscience literature that such 
		view-invariant representations are encoded by the brain (e.g., 
		[70, 15, 32]). In this paper, we specifically study the setting 
		where the different views are different image channels, such 
		as luminance, chrominance, depth, and optical flow. The fundamental supervisory signal we exploit is the co-occurrence, 
		in natural data, of multiple views of the same scene. For 
		example, we consider an image in Lab color space to be a 
		paired example of the co-occurrence of two views of the 
		scene, the L view and the ab view: {L, ab}. 
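		As a concrete illustration of this {L, ab} view construction, the snippet below splits an RGB image into its luminance and chrominance channels. This is only a minimal sketch assuming scikit-image is available; the function name split_lab_views and the toy input are illustrative, not taken from the released code.

		    import numpy as np
		    from skimage.color import rgb2lab

		    def split_lab_views(rgb_image):
		        """Split an RGB image (H, W, 3) with values in [0, 1] into its two Lab views."""
		        lab = rgb2lab(rgb_image)       # convert to CIE Lab color space
		        l_view = lab[..., :1]          # L (luminance) channel, shape (H, W, 1)
		        ab_view = lab[..., 1:]         # ab (chrominance) channels, shape (H, W, 2)
		        return l_view, ab_view

		    # Example: a random array stands in for a real photo.
		    l, ab = split_lab_views(np.random.rand(224, 224, 3))
		    print(l.shape, ab.shape)           # (224, 224, 1) (224, 224, 2)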
		Our goal is therefore to learn representations that capture 
		information shared between multiple sensory channels but 
		that are otherwise compact (i.e. discard channel-specific 
		nuisance factors). To do so, we employ contrastive learning, 
		where we learn a feature embedding such that views of the 
		same scene map to nearby points (measured with Euclidean 
		distance in representation space) while views of different 
		scenes map to far apart points. In particular, we adapt the 
		recently proposed method of Contrastive Predictive Coding 
		(CPC) [57], except we simplify it – removing the recurrent 
		network – and generalize it – showing how to apply it to 
		arbitrary collections of image channels, rather than just to 
		temporal or spatial predictions. In reference to CPC, we term 
		our method Contrastive Multiview Coding (CMC), although 
		we note that our formulation is arguably equally related to 
		Instance Discrimination [79]. The contrastive objective in 
		our formulation, as in CPC and Instance Discrimination, 
		can be understood as attempting to maximize the mutual 
		information between the representations of multiple views 
		of the data. 
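		To make this objective concrete, the following is a simplified in-batch, two-view version written in PyTorch. The released CMC code uses a memory-bank NCE estimator, so treat this as an illustrative sketch of the idea (matching views attract, non-matching views repel) rather than the paper's exact loss; the temperature value and the function name are assumptions.

		    import torch
		    import torch.nn.functional as F

		    def two_view_contrastive_loss(z1, z2, temperature=0.07):
		        """z1, z2: (N, D) embeddings of two views of the same N scenes;
		        row i of z1 and row i of z2 come from the same scene (positive pair)."""
		        z1 = F.normalize(z1, dim=1)
		        z2 = F.normalize(z2, dim=1)
		        logits = z1 @ z2.t() / temperature                    # (N, N) pairwise similarities
		        targets = torch.arange(z1.size(0), device=z1.device)  # positives lie on the diagonal
		        # Symmetrize: view 1 as anchor against all view-2 codes, and vice versa.
		        return 0.5 * (F.cross_entropy(logits, targets) +
		                      F.cross_entropy(logits.t(), targets))

		    # Example with random embeddings standing in for encoder outputs.
		    loss = two_view_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))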
		We intentionally leave “good bits” only loosely defined 
		and treat its definition as an empirical question. Ultimately, 
		the proof is in the pudding: we consider a representation 
		to be good if it makes subsequent problem solving easy, on 
		tasks of human interest. For example, a useful representation 
		of images might be a feature space in which it is easy to learn 
		to recognize objects. We therefore evaluate our method by 
		testing if the learned representations transfer well to standard 
		semantic recognition tasks. On several benchmark tasks, our 
		method achieves results competitive with the state of the 
		art, compared to other methods for self-supervised representation learning. We additionally find that the quality of 
		the representation improves as a function of the number of 
		views used for training. Finally, we compare the contrastive 
		formulation of multiview learning to the recently popular 
		approach of cross-view prediction, and find that in head-to-head comparisons, the contrastive approach learns stronger 
		representations. 
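		One standard way to run such a transfer test (a hedged sketch; the exact benchmark protocols vary) is to freeze the pretrained encoder and train only a linear classifier on its features:

		    import torch
		    import torch.nn as nn

		    def linear_probe(encoder, feat_dim, num_classes):
		        """Freeze a pretrained encoder and attach a trainable linear classifier."""
		        for p in encoder.parameters():
		            p.requires_grad = False          # pretrained features stay fixed
		        return nn.Sequential(encoder, nn.Linear(feat_dim, num_classes))

		    # Example: a toy encoder stands in for a pretrained CMC network.
		    toy_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())
		    classifier = linear_probe(toy_encoder, feat_dim=128, num_classes=10)
		    logits = classifier(torch.randn(4, 3, 32, 32))   # only the final Linear layer would be trained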
		The core ideas that we build on: contrastive learning, 
		mutual information maximization, and deep representation 
		learning, are not new and have been explored in the literature 
		on representation and multiview learning for decades [64, 
		45, 80, 3]. Our main contribution is to set up a framework to 
		extend these ideas to any number of views, and to empirically 
		study the factors that lead to success in this framework. A 
		review of the related literature is given in Section 2; and Fig. 
		1 gives a pictorial overview of our framework. Our main 
		contributions are: 
		• We apply contrastive learning to the multiview setting, 
		attempting to maximize mutual information between 
		representations of different views of the same scene (in 
		particular, between different image channels). 
		• We extend the framework to learn from more than two 
		views, and show that the quality of the learned representation improves as the number of views increases (a sketch of this multiview extension follows this list). Ours is 
		the first work to explicitly show the benefits of multiple 
		views on representation quality. 
		• We conduct controlled experiments to measure the effect of mutual information estimates on representation 
		quality. Our experiments show that the relationship 
		between mutual information and views is a subtle one. 
		• Our representations rival state of the art on popular 
		benchmarks. 
		• We demonstrate that the contrastive objective is superior to cross-view prediction. 
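		As flagged in the second bullet, one simple way to extend the two-view loss to M views is to sum the pairwise contrastive losses over all view pairs (roughly what the paper later calls the "full graph" pairing). The sketch below reuses two_view_contrastive_loss from the earlier snippet and is illustrative only.

		    import itertools
		    import torch

		    def multiview_contrastive_loss(view_embeddings):
		        """view_embeddings: list of (N, D) tensors, one per view of the same N scenes."""
		        total = torch.zeros((), device=view_embeddings[0].device)
		        for za, zb in itertools.combinations(view_embeddings, 2):
		            total = total + two_view_contrastive_loss(za, zb)  # pairwise loss from the earlier sketch
		        return total

		    # Example: four views, as in the NYU RGBD setting of Figure 1.
		    loss = multiview_contrastive_loss([torch.randn(8, 128) for _ in range(4)])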
		2. Related work 
		Unsupervised representation learning is about learning 
		transformations of the data that make subsequent problem 
		solving easier [7]. This field has a long history, starting with 
		classical methods with well established algorithms, such as 
		principal components analysis (PCA [37]) and independent 
		components analysis (ICA [33]). These methods tend to 
		learn representations that focus on low-level variations in 
		the data, which are not very useful from the perspective of 
		downstream tasks such as object recognition. 
		Representations better suited to such tasks have been 
		learnt using deep neural networks, starting with seminal 
		techniques such as Boltzmann machines [71, 65], autoencoders [30], variational autoencoders [40], generative adversarial networks [24] and autoregressive models [56]. Numerous other works exist; for a review, see [7]. A powerful family of models for unsupervised representations is 
		collected under the umbrella of “self-supervised” learning 
		[64, 35, 85, 84, 78, 60, 83]. In these models, an input X to 
		the model is transformed into an output X̂, which is supposed to be close to another signal Y (usually in Euclidean 
		space), which itself is related to X in some meaningful way. 
		Examples of such X/Y pairs are: luminance and chrominance color channels of an image [85], patches from a single 
		image [57], modalities such as vision and sound [58] or the 
		frames of a video [78]. Clearly, such examples are numerous 
		in the world, and provide us with nearly infinite amounts 
		of training data: this is one of the appeals of this paradigm. 
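		For contrast with the earlier contrastive sketch, here is a minimal version of this cross-view prediction paradigm: a network maps one view X (e.g., the L channel) to a prediction X̂ that should be close, in Euclidean distance, to the other view Y (the ab channels). The tiny network below is illustrative only, not any published architecture.

		    import torch
		    import torch.nn as nn
		    import torch.nn.functional as F

		    # Toy predictor: maps the 1-channel L view to a 2-channel ab prediction.
		    predictor = nn.Sequential(
		        nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
		        nn.Conv2d(32, 2, kernel_size=3, padding=1),
		    )

		    L_view = torch.randn(4, 1, 64, 64)    # X: luminance channel
		    ab_view = torch.randn(4, 2, 64, 64)   # Y: chrominance channels
		    loss = F.mse_loss(predictor(L_view), ab_view)   # push the prediction X̂ toward Y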
		Time contrastive networks [68] use a triplet loss framework 
		to learn representations from aligned video sequences of 
		the same scene, taken by different video cameras. Closely 
		related to self-supervised learning is the idea of multi-view 
		learning, which is a general term involving many different 
		approaches such as co-training [8], multi-kernel learning 
		[13] and metric learning [6, 87]; for comprehensive surveys 
		please see [80, 45]. Nearly all existing works have dealt with 
		one or two views such as video or image/sound. However, in 
		many situations, many more views are available to provide 
		training signals for any representation



 


 
    