TY - JOUR
T1 - TMSDNet
T2 - Transformer with multi-scale dense network for single and multi-view 3D reconstruction
AU - Zhu, Xiaoqiang
AU - Yao, Xinsheng
AU - Zhang, Junjie
AU - Zhu, Mengyao
AU - You, Lihua
AU - Yang, Xiaosong
AU - Zhang, Jianjun
AU - Zhao, He
AU - Zeng, Dan
N1 - Publisher Copyright:
© 2023 John Wiley & Sons Ltd.
PY - 2024/1/1
Y1 - 2024/1/1
N2 - 3D reconstruction is a long-standing problem. Recently, a number of studies have emerged that utilize transformers for 3D reconstruction, and these approaches have demonstrated strong performance. However, transformer-based 3D reconstruction methods tend to establish the transformation relationship between the 2D image and the 3D voxel space directly using transformers, or rely solely on the powerful feature extraction capabilities of transformers. They ignore the crucial role played by a deep multi-scale representation of the object in the voxel feature domain, which can provide extensive global shape and local detail information about the object in a multi-scale manner. In this article, we propose TMSDNet (transformer with multi-scale dense network), a novel transformer-based framework for single-view and multi-view 3D reconstruction, to address this problem. Based on our well-designed combined-transformer block, which follows a canonical encoder–decoder architecture, voxel features with spatial order are extracted from the input image; these features are then used to extract multi-scale global features in parallel with a multi-scale residual attention module. Furthermore, a residual dense attention block is introduced for deep local feature extraction and adaptive fusion. Finally, the reconstructed objects are produced by the voxel reconstruction block. Experimental results on benchmarks such as the ShapeNet and Pix3D datasets demonstrate that TMSDNet substantially outperforms existing state-of-the-art reconstruction methods.
AB - 3D reconstruction is a long-standing problem. Recently, a number of studies have emerged that utilize transformers for 3D reconstruction, and these approaches have demonstrated strong performance. However, transformer-based 3D reconstruction methods tend to establish the transformation relationship between the 2D image and the 3D voxel space directly using transformers, or rely solely on the powerful feature extraction capabilities of transformers. They ignore the crucial role played by a deep multi-scale representation of the object in the voxel feature domain, which can provide extensive global shape and local detail information about the object in a multi-scale manner. In this article, we propose TMSDNet (transformer with multi-scale dense network), a novel transformer-based framework for single-view and multi-view 3D reconstruction, to address this problem. Based on our well-designed combined-transformer block, which follows a canonical encoder–decoder architecture, voxel features with spatial order are extracted from the input image; these features are then used to extract multi-scale global features in parallel with a multi-scale residual attention module. Furthermore, a residual dense attention block is introduced for deep local feature extraction and adaptive fusion. Finally, the reconstructed objects are produced by the voxel reconstruction block. Experimental results on benchmarks such as the ShapeNet and Pix3D datasets demonstrate that TMSDNet substantially outperforms existing state-of-the-art reconstruction methods.
KW - deep learning
KW - multi-scale
KW - single-view and multi-view 3D reconstruction
KW - transformer
UR - http://www.scopus.com/inward/record.url?scp=85166617447&partnerID=8YFLogxK
U2 - 10.1002/cav.2201
DO - 10.1002/cav.2201
M3 - Article
AN - SCOPUS:85166617447
SN - 1546-4261
VL - 35
JO - Computer Animation and Virtual Worlds
JF - Computer Animation and Virtual Worlds
IS - 1
M1 - e2201
ER -