TY - GEN
T1 - Towards Deeper and Better Multi-view Feature Fusion for 3D Semantic Segmentation
AU - Yang, Chaolong
AU - Yan, Yuyao
AU - Zhao, Weiguang
AU - Ye, Jianan
AU - Yang, Xi
AU - Hussain, Amir
AU - Dong, Bin
AU - Huang, Kaizhu
N1 - Publisher Copyright:
© 2024, The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
PY - 2024
Y1 - 2024
N2 - 3D point clouds are rich in geometric structure information, while 2D images contain important and continuous texture information. Combining 2D information to achieve better 3D semantic segmentation has become mainstream in 3D scene understanding. Despite this success, it remains elusive how to fuse and process cross-dimensional features from these two distinct spaces. Existing state-of-the-art methods usually exploit bidirectional projection to align the cross-dimensional features and realize both 2D and 3D semantic segmentation tasks. However, to enable bidirectional mapping, this framework often requires a symmetrical 2D-3D network structure, limiting the network’s flexibility. Meanwhile, such dual-task settings may easily distract the network and lead to over-fitting on the 3D segmentation task. Constrained by this inflexibility, fused features can only pass through a decoder network, which hurts model performance due to insufficient depth. To alleviate these drawbacks, we argue in this paper that, despite its simplicity, unidirectionally projecting multi-view 2D deep semantic features into the 3D space, aligned with 3D deep semantic features, can lead to better feature fusion. On the one hand, the unidirectional projection forces our model to focus more on the core task, i.e., 3D segmentation; on the other hand, relaxing the bidirectional projection to a unidirectional one enables deeper cross-domain semantic alignment and offers the flexibility to fuse better, more complicated features from very different spaces. Among joint 2D-3D approaches, our proposed method achieves superior performance on the ScanNetv2 benchmark for 3D semantic segmentation.
AB - 3D point clouds are rich in geometric structure information, while 2D images contain important and continuous texture information. Combining 2D information to achieve better 3D semantic segmentation has become mainstream in 3D scene understanding. Despite this success, it remains elusive how to fuse and process cross-dimensional features from these two distinct spaces. Existing state-of-the-art methods usually exploit bidirectional projection to align the cross-dimensional features and realize both 2D and 3D semantic segmentation tasks. However, to enable bidirectional mapping, this framework often requires a symmetrical 2D-3D network structure, limiting the network’s flexibility. Meanwhile, such dual-task settings may easily distract the network and lead to over-fitting on the 3D segmentation task. Constrained by this inflexibility, fused features can only pass through a decoder network, which hurts model performance due to insufficient depth. To alleviate these drawbacks, we argue in this paper that, despite its simplicity, unidirectionally projecting multi-view 2D deep semantic features into the 3D space, aligned with 3D deep semantic features, can lead to better feature fusion. On the one hand, the unidirectional projection forces our model to focus more on the core task, i.e., 3D segmentation; on the other hand, relaxing the bidirectional projection to a unidirectional one enables deeper cross-domain semantic alignment and offers the flexibility to fuse better, more complicated features from very different spaces. Among joint 2D-3D approaches, our proposed method achieves superior performance on the ScanNetv2 benchmark for 3D semantic segmentation.
KW - Multi-view fusion
KW - Point cloud
KW - Semantic segmentation
UR - http://www.scopus.com/inward/record.url?scp=85178593734&partnerID=8YFLogxK
U2 - 10.1007/978-981-99-8184-7_1
DO - 10.1007/978-981-99-8184-7_1
M3 - Conference Proceeding
AN - SCOPUS:85178593734
SN - 9789819981830
T3 - Communications in Computer and Information Science
SP - 3
EP - 15
BT - Neural Information Processing - 30th International Conference, ICONIP 2023, Proceedings
A2 - Luo, Biao
A2 - Cheng, Long
A2 - Wu, Zheng-Guang
A2 - Li, Hongyi
A2 - Li, Chaojie
PB - Springer Science and Business Media Deutschland GmbH
T2 - 30th International Conference on Neural Information Processing, ICONIP 2023
Y2 - 20 November 2023 through 23 November 2023
ER -