
Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding (PDF download)


Date: 2025-05-31 11:01  Source: http://www.java1234.com  Author: reposted

 
 
Main content:
1. Introduction
Pretrained backbones with fine-tuning have been widely applied to various 2D vision and NLP tasks [132103], where a backbone network pretrained on a large dataset is concatenated with a task-specific back-end and then fine-tuned for different downstream tasks. This approach demonstrates its superior performance and great advantages in reducing the workload of network design and training, as well as the amount of labeled data required for different vision tasks.

* Interns at Microsoft Research Asia. † Contact person.
In this work, we present a pretrained 3D backbone, named SWIN3D, for 3D indoor scene understanding tasks. Our method represents the 3D point cloud of an input 3D scene as sparse voxels in 3D space and adapts the Swin Transformer [30], designed for regular 2D images, to unorganized 3D points as the 3D backbone. We analyze the key issues that prevent the naïve 3D extension of the Swin Transformer from exploring large models and achieving high performance, i.e., the high memory complexity and the ignorance of signal irregularity. Based on our analysis, we develop a novel 3D self-attention operator to compute the self-attention of sparse voxels within each local window, which reduces the memory cost of self-attention from quadratic to linear with respect to the number of sparse voxels within a window while remaining efficient to compute, and enhances self-attention by capturing various signal irregularities via our generalized contextual relative positional embedding [48, 26].
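The excerpt does not include the operator's implementation. A minimal sketch of the quadratic-to-linear memory idea, assuming a NumPy setting and a hypothetical function name: within one window, attention is computed one query row at a time, so only O(n) scratch memory per row is needed instead of materializing the full n × n attention matrix (this illustrates the memory-scaling claim only, not Swin3D's actual operator, which also handles sparsity and contextual relative positional embeddings):

```python
import numpy as np

def window_attention_linear_mem(q, k, v):
    """Self-attention over the voxels of one local window.

    q, k, v: arrays of shape (n, d), where n is the number of sparse
    voxels in the window. Computes softmax(q k^T / sqrt(d)) v row by
    row, so peak scratch memory is O(n) per query rather than the
    O(n^2) of the full attention matrix.
    """
    n, d = q.shape
    out = np.empty_like(v)
    for i in range(n):
        logits = k @ q[i] / np.sqrt(d)   # (n,) scores for query i
        logits -= logits.max()           # subtract max for stability
        w = np.exp(logits)
        out[i] = (w @ v) / w.sum()       # weighted average of values
    return out
```

The result is identical to full-matrix attention; only the memory profile differs, which is what makes larger windows (and hence larger models) feasible.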
The novel design of our SWIN3D backbone enables us to scale up the backbone model and the amount of data used for pretraining. To this end, we pretrained a large SWIN3D model with 60M parameters via a 3D semantic segmentation task over a synthetic 3D indoor scene dataset [60] that includes 21K rooms and is about ten times larger than the ScanNet dataset. After pretraining, we cascade the pretrained SWIN3D backbone with task-specific back-end decoders and fine-tune the models for various downstream 3D indoor scene understanding tasks.
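The introduction describes representing the input point cloud as sparse voxels. A minimal sketch of such quantization, assuming NumPy and mean-pooling of points per voxel (a common pooling choice; the hypothetical `voxelize` helper is illustrative and the actual Swin3D feature construction may differ):

```python
import numpy as np

def voxelize(points, voxel_size):
    """Quantize a point cloud (n, d) into sparse voxels.

    Returns the unique integer voxel coordinates and, for each
    occupied voxel, the mean of the points that fall into it.
    Empty voxels are never stored, which is what keeps the
    representation sparse.
    """
    coords = np.floor(points / voxel_size).astype(np.int64)
    uniq, inv = np.unique(coords, axis=0, return_inverse=True)
    counts = np.bincount(inv, minlength=len(uniq)).astype(float)
    feats = np.zeros((len(uniq), points.shape[1]))
    for d in range(points.shape[1]):
        feats[:, d] = (
            np.bincount(inv, weights=points[:, d], minlength=len(uniq))
            / counts
        )
    return uniq, feats
```

Only occupied voxels survive, so the downstream window attention operates on a set whose size tracks scene surface area rather than the full dense grid.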
 


 
