TY - JOUR
T1 - TMSDNet
T2 - Transformer with multi-scale dense network for single and multi-view 3D reconstruction
AU - Zhu, Xiaoqiang
AU - Yao, Xinsheng
AU - Zhang, Junjie
AU - Zhu, Mengyao
AU - You, Lihua
AU - Yang, Xiaosong
AU - Zhang, Jianjun
AU - Zhao, He
AU - Zeng, Dan
N1 - Publisher Copyright:
© 2023 John Wiley & Sons Ltd.
PY - 2024/1/1
Y1 - 2024/1/1
N2 - 3D reconstruction is a long-standing problem. Recently, a number of studies have emerged that utilize transformers for 3D reconstruction, and these approaches have demonstrated strong performance. However, transformer-based 3D reconstruction methods tend to establish the transformation relationship between the 2D image and the 3D voxel space directly using transformers, or rely solely on the powerful feature extraction capabilities of transformers. They ignore the crucial role played by a deep multi-scale representation of the object in the voxel feature domain, which can provide extensive global shape and local detail information about the object in a multi-scale manner. In this article, we propose TMSDNet (transformer with multi-scale dense network), a novel transformer-based framework for single-view and multi-view 3D reconstruction, to address this problem. Based on our well-designed combined-transformer block, which follows a canonical encoder–decoder architecture, voxel features with spatial order are extracted from the input image; these features are then used to extract multi-scale global features in parallel with a multi-scale residual attention module. Furthermore, a residual dense attention block is introduced for deep local feature extraction and adaptive fusion. Finally, the reconstructed objects are produced by the voxel reconstruction block. Experimental results on benchmarks such as the ShapeNet and Pix3D datasets demonstrate that TMSDNet substantially outperforms existing state-of-the-art reconstruction methods.
AB - 3D reconstruction is a long-standing problem. Recently, a number of studies have emerged that utilize transformers for 3D reconstruction, and these approaches have demonstrated strong performance. However, transformer-based 3D reconstruction methods tend to establish the transformation relationship between the 2D image and the 3D voxel space directly using transformers, or rely solely on the powerful feature extraction capabilities of transformers. They ignore the crucial role played by a deep multi-scale representation of the object in the voxel feature domain, which can provide extensive global shape and local detail information about the object in a multi-scale manner. In this article, we propose TMSDNet (transformer with multi-scale dense network), a novel transformer-based framework for single-view and multi-view 3D reconstruction, to address this problem. Based on our well-designed combined-transformer block, which follows a canonical encoder–decoder architecture, voxel features with spatial order are extracted from the input image; these features are then used to extract multi-scale global features in parallel with a multi-scale residual attention module. Furthermore, a residual dense attention block is introduced for deep local feature extraction and adaptive fusion. Finally, the reconstructed objects are produced by the voxel reconstruction block. Experimental results on benchmarks such as the ShapeNet and Pix3D datasets demonstrate that TMSDNet substantially outperforms existing state-of-the-art reconstruction methods.
KW - deep learning
KW - multi-scale
KW - single-view and multi-view 3D reconstruction
KW - transformer
UR - http://www.scopus.com/inward/record.url?scp=85166617447&partnerID=8YFLogxK
U2 - 10.1002/cav.2201
DO - 10.1002/cav.2201
M3 - Article
AN - SCOPUS:85166617447
SN - 1546-4261
VL - 35
JO - Computer Animation and Virtual Worlds
JF - Computer Animation and Virtual Worlds
IS - 1
M1 - e2201
ER -