Hybrid Multi-Class Token Vision Transformer Convolutional Network for DOA Estimation

Yuxuan Xie, Aifei Liu*, Xinyu Lu, Dufei Chong

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

In this letter, we propose an efficient hybrid model, named HMC-ViT, that combines a convolutional neural network (CNN) with a multi-class token vision transformer (ViT) to address the problem of direction of arrival (DOA) estimation. HMC-ViT integrates the local feature extraction capability of CNN with the global feature extraction capability of ViT to enhance DOA estimation performance and improve the computational efficiency of ViT. Additionally, the ViT component employs multiple class tokens in parallel to generate spatial spectra for sub-regions, further enhancing the model's performance. Simulation results demonstrate that the proposed method outperforms existing approaches under low signal-to-noise ratio (SNR) scenarios.
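To illustrate the multi-class token idea described in the abstract, the following is a minimal numpy sketch: CNN feature maps are flattened into patch tokens, several class tokens (one per angular sub-region) are prepended, one self-attention layer mixes them, and each class token is read out by its own head to produce the spatial spectrum of its sub-region. All dimensions, the single-head attention, and the per-token linear heads are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical dimensions (not taken from the paper):
n_patches, d = 16, 32      # patch tokens flattened from CNN feature maps
n_cls = 4                  # parallel class tokens, one per angular sub-region
bins_per_region = 45       # e.g. a 180-degree field split into 4 sub-regions

# Stand-in for CNN output reshaped into patch tokens
patch_tokens = rng.standard_normal((n_patches, d))

# Multiple learnable class tokens, prepended to the patch tokens
cls_tokens = rng.standard_normal((n_cls, d))
tokens = np.concatenate([cls_tokens, patch_tokens], axis=0)

# One single-head self-attention layer (MLP and residuals omitted for brevity)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
attn = softmax(Q @ K.T / np.sqrt(d), axis=-1)
out = attn @ V

# Each class token feeds its own linear head, yielding the spatial spectrum
# of its sub-region; concatenating the heads covers the full angular range.
heads = rng.standard_normal((n_cls, d, bins_per_region)) / np.sqrt(d)
spectra = [out[i] @ heads[i] for i in range(n_cls)]
full_spectrum = np.concatenate(spectra)  # shape: (n_cls * bins_per_region,)

print(full_spectrum.shape)  # (180,)
```

Splitting the spectrum across parallel class tokens lets each token specialize in one sub-region, which is the stated mechanism behind the sub-region spatial spectra in HMC-ViT.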

Original language: English
Pages (from-to): 2279-2283
Number of pages: 5
Journal: IEEE Signal Processing Letters
Volume: 32
Publication status: Published - 2025

Keywords

  • Convolutional neural network (CNN)
  • deep learning
  • direction of arrival (DOA) estimation
  • vision transformer (ViT)

