
Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding (PDF download)


Date: 2025-05-31 11:01  Source: http://www.java1234.com  Author: reposted

 
 
Main content:
1. Introduction
Pretrained backbones with fine-tuning have been widely applied to various 2D vision and NLP tasks [132103], where a backbone network pretrained on a large dataset is concatenated with a task-specific back-end and then fine-tuned for different downstream tasks. This approach demonstrates its superior performance and great advantages in reducing the workload of network design and training, as well as the amount of labeled data required for different vision tasks.

* Interns at Microsoft Research Asia. † Contact person.
In this work, we present a pretrained 3D backbone, named SWIN3D, for 3D indoor scene understanding tasks. Our method represents the 3D point cloud of an input 3D scene as sparse voxels in 3D space and adapts the Swin Transformer [30], designed for regular 2D images, to unorganized 3D points as the 3D backbone. We analyze the key issues that prevent the naïve 3D extension of the Swin Transformer from exploring large models and achieving high performance, i.e., the high memory complexity and the ignorance of signal irregularity. Based on our analysis, we develop a novel 3D self-attention operator to compute the self-attention of sparse voxels within each local window, which reduces the memory cost of self-attention from quadratic to linear with respect to the number of sparse voxels within a window while remaining efficient to compute, and enhances self-attention by capturing various signal irregularities via our generalized contextual relative positional embedding [48, 26].
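The excerpt does not include the operator's implementation. A minimal sketch of the quadratic-to-linear memory idea, assuming a NumPy setting and a hypothetical function name: within one window, attention is computed one query row at a time, so only O(n) scratch memory per row is needed instead of materializing the full n × n attention matrix (this illustrates the memory-scaling claim only, not Swin3D's actual operator, which also handles sparsity and contextual relative positional embeddings):

```python
import numpy as np

def window_attention_linear_mem(q, k, v):
    """Self-attention over the voxels of one local window.

    q, k, v: arrays of shape (n, d), where n is the number of sparse
    voxels in the window. Computes softmax(q k^T / sqrt(d)) v row by
    row, so peak scratch memory is O(n) per query rather than the
    O(n^2) of the full attention matrix.
    """
    n, d = q.shape
    out = np.empty_like(v)
    for i in range(n):
        logits = k @ q[i] / np.sqrt(d)   # (n,) scores for query i
        logits -= logits.max()           # subtract max for stability
        w = np.exp(logits)
        out[i] = (w @ v) / w.sum()       # weighted average of values
    return out
```

The result is identical to full-matrix attention; only the memory profile differs, which is what makes larger windows (and hence larger models) feasible.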
The novel design of our SWIN3D backbone enables us to scale up the backbone model and the amount of data used for pretraining. To this end, we pretrained a large SWIN3D model with 60M parameters via a 3D semantic segmentation task over a synthetic 3D indoor scene dataset [60] that includes 21K rooms and is about ten times larger than the ScanNet dataset. After pretraining, we cascade the pretrained SWIN3D backbone with task-specific back-end decoders and fine-tune the models for various downstream 3D indoor scene understanding tasks.
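The introduction describes representing the input point cloud as sparse voxels. A minimal sketch of such quantization, assuming NumPy and mean-pooling of points per voxel (a common pooling choice; the hypothetical `voxelize` helper is illustrative and the actual Swin3D feature construction may differ):

```python
import numpy as np

def voxelize(points, voxel_size):
    """Quantize a point cloud (n, d) into sparse voxels.

    Returns the unique integer voxel coordinates and, for each
    occupied voxel, the mean of the points that fall into it.
    Empty voxels are never stored, which is what keeps the
    representation sparse.
    """
    coords = np.floor(points / voxel_size).astype(np.int64)
    uniq, inv = np.unique(coords, axis=0, return_inverse=True)
    counts = np.bincount(inv, minlength=len(uniq)).astype(float)
    feats = np.zeros((len(uniq), points.shape[1]))
    for d in range(points.shape[1]):
        feats[:, d] = (
            np.bincount(inv, weights=points[:, d], minlength=len(uniq))
            / counts
        )
    return uniq, feats
```

Only occupied voxels survive, so the downstream window attention operates on a set whose size tracks scene surface area rather than the full dense grid.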
 


 
