MAN and CAT: mix attention to nn and concatenate attention to YOLO

Runwei Guan; Ka Lok Man; Haocheng Zhao; Ruixiao Zhang; Shanliang Yao; Jeremy Smith; Eng Gee Lim; Yutao Yue

doi:10.1007/s11227-022-04726-7

Corresponding author: Yutao Yue

School of Advanced Technology

Research output: Contribution to journal › Article › peer-review

8 Citations (Scopus)

Abstract

CNNs have achieved remarkable image classification and object detection results over the past few years. Because convolution is a local operation, CNNs can extract rich features of an object itself but can hardly capture global context in an image. This makes CNN-based networks poorly suited to detecting objects from the information of nearby objects, especially partially obscured objects that are hard to detect. ViTs obtain rich context through multi-head self-attention and dramatically improve prediction in complex scenes. However, they suffer from long inference times and large parameter counts, which makes ViT-based detection networks hard to deploy in real-time detection systems. In this paper, we first design a novel plug-and-play attention module called mix attention (MA). MA combines channel, spatial and global contextual attention. It enhances the feature representation of individual objects and the correlation between multiple objects. Second, we propose a backbone network based on mix attention called MANet. MANet-Base achieves state-of-the-art performance on ImageNet and CIFAR. Finally, we propose a lightweight object detection network called CAT-YOLO, which trades off precision against speed. It achieves an AP of 25.7% on COCO 2017 test-dev with only 9.17 million parameters, making it possible to deploy models containing ViT on hardware while ensuring real-time detection. CAT-YOLO detects obscured objects better than other state-of-the-art lightweight models.
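As a rough illustration of what combining channel, spatial and global contextual attention can look like, here is a minimal pure-Python sketch. It is not the MA module from the paper: the abstract does not say how the three branches are built or fused, so the parameter-free gates, the single-head self-attention and the additive fusion below are all assumptions, and the name `mix_attention_sketch` is hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def softmax(row):
    # Subtract the maximum for numerical stability.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def mix_attention_sketch(x):
    """x: feature map given as C channels, each a list of N flattened positions."""
    C, N = len(x), len(x[0])

    # Channel attention (assumed form): one gate per channel from its global average.
    channel_gate = [sigmoid(sum(ch) / N) for ch in x]

    # Spatial attention (assumed form): one gate per position from the channel mean.
    spatial_gate = [sigmoid(sum(x[c][p] for c in range(C)) / C) for p in range(N)]

    # Global context (assumed form): scaled dot-product self-attention over
    # positions, with each position's channel vector as query, key and value.
    tokens = [[x[c][p] for c in range(C)] for p in range(N)]
    context = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(C) for k in tokens]
        weights = softmax(scores)
        context.append([sum(w * t[c] for w, t in zip(weights, tokens)) for c in range(C)])

    # Assumed fusion: gated local features plus the global context term.
    return [
        [x[c][p] * channel_gate[c] * spatial_gate[p] + context[p][c] for p in range(N)]
        for c in range(C)
    ]


# Toy input: C=2 channels over a flattened 2x2 map (N=4 positions).
features = [[0.5, -1.0, 2.0, 0.0], [1.0, 1.0, -0.5, 0.25]]
mixed = mix_attention_sketch(features)
```

The output has the same shape as the input, which is what lets a module of this kind be dropped between existing layers. A real implementation would use learned projections in each branch; they are left out here to keep the sketch self-contained.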

Original language: English
Pages (from-to): 2108-2136
Number of pages: 29
Journal: Journal of Supercomputing
Volume: 79
Issue number: 2
DOI: https://doi.org/10.1007/s11227-022-04726-7
Publication status: Published - 11 Feb 2023

Keywords

Attention mechanism
Lightweight NN
Object detection
Object recognition
Plug-and-play NN

Cite this

@article{7e4fe4b0154d4030b38e9ebc88374445,

title = "MAN and CAT: mix attention to nn and concatenate attention to YOLO",

abstract = "CNNs have achieved remarkable image classification and object detection results over the past few years. Due to the locality of the convolution operation, although CNNs can extract rich features of the object itself, they can hardly obtain global context in images. It means the CNN-based network is not a good candidate for detecting objects by utilizing the information of the nearby objects, especially when the partially obscured object is hard to detect. ViTs can get a rich context and dramatically improve the prediction in complex scenes with multi-head self-attention. However, it suffers from long inference time and huge parameters, which leads ViT-based detection network that is hardly be deployed in the real-time detection system. In this paper, firstly, we design a novel plug-and-play attention module called mix attention (MA). MA combines channel, spatial and global contextual attention together. It enhances the feature representation of individuals and the correlation between multiple individuals. Secondly, we propose a backbone network based on mix attention called MANet. MANet-Base achieves the state-of-the-art performances on ImageNet and CIFAR. Last but not least, we propose a lightweight object detection network called CAT-YOLO, where we make a trade-off between precision and speed. It achieves the AP of 25.7\% on COCO 2017 test-dev with only 9.17 million parameters, making it possible to deploy models containing ViT on hardware and ensure real-time detection. CAT-YOLO could better detect obscured objects than other state-of-the-art lightweight models.",

keywords = "Attention mechanism, Lightweight NN, Object detection, Object recognition, Plug-and-play NN",

author = "Runwei Guan and Man, {Ka Lok} and Haocheng Zhao and Ruixiao Zhang and Shanliang Yao and Jeremy Smith and Lim, {Eng Gee} and Yutao Yue",

note = "Publisher Copyright: {\textcopyright} 2022, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.",

year = "2023",

month = feb,

day = "11",

doi = "10.1007/s11227-022-04726-7",

language = "English",

volume = "79",

pages = "2108--2136",

journal = "Journal of Supercomputing",

issn = "0920-8542",

number = "2",

}

TY - JOUR

T1 - MAN and CAT

T2 - mix attention to nn and concatenate attention to YOLO

AU - Guan, Runwei

AU - Man, Ka Lok

AU - Zhao, Haocheng

AU - Zhang, Ruixiao

AU - Yao, Shanliang

AU - Smith, Jeremy

AU - Lim, Eng Gee

AU - Yue, Yutao

PY - 2023/2/11

Y1 - 2023/2/11

N2 - CNNs have achieved remarkable image classification and object detection results over the past few years. Due to the locality of the convolution operation, although CNNs can extract rich features of the object itself, they can hardly obtain global context in images. It means the CNN-based network is not a good candidate for detecting objects by utilizing the information of the nearby objects, especially when the partially obscured object is hard to detect. ViTs can get a rich context and dramatically improve the prediction in complex scenes with multi-head self-attention. However, it suffers from long inference time and huge parameters, which leads ViT-based detection network that is hardly be deployed in the real-time detection system. In this paper, firstly, we design a novel plug-and-play attention module called mix attention (MA). MA combines channel, spatial and global contextual attention together. It enhances the feature representation of individuals and the correlation between multiple individuals. Secondly, we propose a backbone network based on mix attention called MANet. MANet-Base achieves the state-of-the-art performances on ImageNet and CIFAR. Last but not least, we propose a lightweight object detection network called CAT-YOLO, where we make a trade-off between precision and speed. It achieves the AP of 25.7% on COCO 2017 test-dev with only 9.17 million parameters, making it possible to deploy models containing ViT on hardware and ensure real-time detection. CAT-YOLO could better detect obscured objects than other state-of-the-art lightweight models.

KW - Attention mechanism

KW - Lightweight NN

KW - Object detection

KW - Object recognition

KW - Plug-and-play NN

UR - http://www.scopus.com/inward/record.url?scp=85135619062&partnerID=8YFLogxK

U2 - 10.1007/s11227-022-04726-7

DO - 10.1007/s11227-022-04726-7

M3 - Article

AN - SCOPUS:85135619062

SN - 0920-8542

VL - 79

SP - 2108

EP - 2136

JO - Journal of Supercomputing

JF - Journal of Supercomputing

IS - 2

ER -
