Context Does Matter: End-to-end Panoptic Narrative Grounding with Deformable Attention Refined Matching Network

Yiming Lin, Xiao Bo Jin^*, Qiufeng Wang, Kaizhu Huang

^*Corresponding author for this work

doi:10.1109/ICDM58522.2023.00142

Research output: Chapter in Book or Report/Conference proceeding › Conference Proceeding › peer-review

3 Citations (Scopus)

Abstract

Panoptic Narrative Grounding (PNG) is an emerging visual grounding task that aims to segment visual objects in images based on dense narrative captions. The current state-of-the-art methods first refine the representation of each phrase by aggregating the k most similar image pixels, and then match the refined text representations with the pixels of the image feature map to generate segmentation results. However, simply aggregating sampled image features ignores contextual information, which can lead to phrase-to-pixel mismatch. In this paper, we propose a novel learning framework called Deformable Attention Refined Matching Network (DRMN), whose main idea is to bring deformable attention into the iterative process of feature learning to incorporate essential context information of pixels at different scales. DRMN iteratively re-encodes pixels with the deformable attention network after updating the feature representation of the top-k most similar pixels. As such, DRMN can produce accurate and discriminative pixel representations, purify the top-k most similar pixels, and consequently alleviate the phrase-to-pixel mismatch substantially. Experimental results show that our novel design significantly improves the matching results between text phrases and image pixels. Concretely, DRMN achieves new state-of-the-art performance on the PNG benchmark with an average recall improvement of 3.5%. The code is available at: https://github.com/JaMesLiMers/DRMN.
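
As a rough illustration of the baseline pipeline the abstract describes (refine each phrase with its top-k most similar pixels, then match the refined phrase against every pixel), here is a minimal NumPy sketch. It is not the authors' implementation and does not include the deformable attention re-encoding step. The function name `refine_and_match`, the dot-product similarity, the mean aggregation, the residual update, and the 0.5 threshold are all assumptions chosen for illustration.

```python
import numpy as np


def refine_and_match(phrases, pixels, k=3, threshold=0.5):
    """Hypothetical sketch of top-k phrase refinement followed by matching.

    phrases: (P, D) array of phrase features.
    pixels:  (N, D) array of flattened pixel features, with k <= N.
    Returns the refined phrases (P, D) and boolean masks (P, N).
    """
    # Dot-product similarity between every phrase and every pixel (assumed metric).
    sim = phrases @ pixels.T  # (P, N)

    # Indices of the k most similar pixels for each phrase (order within the k is arbitrary).
    topk = np.argpartition(-sim, k - 1, axis=1)[:, :k]  # (P, k)

    # Aggregate the selected pixel features by their mean and add them to the
    # phrase feature (residual update; an assumption, not the paper's exact rule).
    refined = phrases + pixels[topk].mean(axis=1)  # (P, D)

    # Match the refined phrases against all pixels and threshold the sigmoid scores.
    logits = refined @ pixels.T  # (P, N)
    probs = 1.0 / (1.0 + np.exp(-logits))
    return refined, probs > threshold
```

In this sketch the pixel features stay fixed between refinement and matching. According to the abstract, DRMN differs at exactly this point: it re-encodes the pixels with a deformable attention network after each update so that the next top-k selection sees context-aware features.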

Original language	English
Title of host publication	IEEE International Conference on Data Mining (ICDM), 2023
Editors	Guihai Chen, Latifur Khan, Xiaofeng Gao, Meikang Qiu, Witold Pedrycz, Xindong Wu
Pages	1163-1168
Number of pages	6
ISBN (Electronic)	9798350307887
DOIs	https://doi.org/10.1109/ICDM58522.2023.00142
Publication status	Published - 2023

Keywords

One-stage Method
Panoptic Narrative Grounding
Visual Grounding

Cite this

@inproceedings{f334eddf2f224e66ae97a0b52ccc868f,

title = "Context Does Matter: End-to-end Panoptic Narrative Grounding with Deformable Attention Refined Matching Network",

abstract = "Panoptic Narrative Grounding (PNG) is an emerging visual grounding task that aims to segment visual objects in images based on dense narrative captions. The current state-of-the-art methods first refine the representation of each phrase by aggregating the k most similar image pixels, and then match the refined text representations with the pixels of the image feature map to generate segmentation results. However, simply aggregating sampled image features ignores contextual information, which can lead to phrase-to-pixel mismatch. In this paper, we propose a novel learning framework called Deformable Attention Refined Matching Network (DRMN), whose main idea is to bring deformable attention into the iterative process of feature learning to incorporate essential context information of pixels at different scales. DRMN iteratively re-encodes pixels with the deformable attention network after updating the feature representation of the top-k most similar pixels. As such, DRMN can produce accurate and discriminative pixel representations, purify the top-k most similar pixels, and consequently alleviate the phrase-to-pixel mismatch substantially. Experimental results show that our novel design significantly improves the matching results between text phrases and image pixels. Concretely, DRMN achieves new state-of-the-art performance on the PNG benchmark with an average recall improvement of 3.5\%. The code is available at: https://github.com/JaMesLiMers/DRMN.",

keywords = "One-stage Method, Panoptic Narrative Grounding, Visual Grounding",

author = "Yiming Lin and Jin, {Xiao Bo} and Qiufeng Wang and Kaizhu Huang",

note = "Publisher Copyright: {\textcopyright} 2023 IEEE.",

year = "2023",

doi = "10.1109/ICDM58522.2023.00142",

language = "English",

pages = "1163--1168",

editor = "Guihai Chen and Latifur Khan and Xiaofeng Gao and Meikang Qiu and Witold Pedrycz and Xindong Wu",

booktitle = "IEEE International Conference on Data Mining (ICDM), 2023",

}

Context Does Matter: End-to-end Panoptic Narrative Grounding with Deformable Attention Refined Matching Network. / Lin, Yiming; Jin, Xiao Bo ; Wang, Qiufeng et al.
IEEE International Conference on Data Mining (ICDM), 2023. ed. / Guihai Chen; Latifur Khan; Xiaofeng Gao; Meikang Qiu; Witold Pedrycz; Xindong Wu. 2023. p. 1163-1168.

TY - GEN

T1 - Context Does Matter

T2 - End-to-end Panoptic Narrative Grounding with Deformable Attention Refined Matching Network

AU - Lin, Yiming

AU - Jin, Xiao Bo

AU - Wang, Qiufeng

AU - Huang, Kaizhu

PY - 2023

Y1 - 2023

AB - Panoptic Narrative Grounding (PNG) is an emerging visual grounding task that aims to segment visual objects in images based on dense narrative captions. The current state-of-the-art methods first refine the representation of each phrase by aggregating the k most similar image pixels, and then match the refined text representations with the pixels of the image feature map to generate segmentation results. However, simply aggregating sampled image features ignores contextual information, which can lead to phrase-to-pixel mismatch. In this paper, we propose a novel learning framework called Deformable Attention Refined Matching Network (DRMN), whose main idea is to bring deformable attention into the iterative process of feature learning to incorporate essential context information of pixels at different scales. DRMN iteratively re-encodes pixels with the deformable attention network after updating the feature representation of the top-k most similar pixels. As such, DRMN can produce accurate and discriminative pixel representations, purify the top-k most similar pixels, and consequently alleviate the phrase-to-pixel mismatch substantially. Experimental results show that our novel design significantly improves the matching results between text phrases and image pixels. Concretely, DRMN achieves new state-of-the-art performance on the PNG benchmark with an average recall improvement of 3.5%. The code is available at: https://github.com/JaMesLiMers/DRMN.

KW - One-stage Method

KW - Panoptic Narrative Grounding

KW - Visual Grounding

UR - http://www.scopus.com/inward/record.url?scp=85185401769&partnerID=8YFLogxK

U2 - 10.1109/ICDM58522.2023.00142

DO - 10.1109/ICDM58522.2023.00142

M3 - Conference Proceeding

AN - SCOPUS:85185401769

SP - 1163

EP - 1168

BT - IEEE International Conference on Data Mining (ICDM), 2023

A2 - Chen, Guihai

A2 - Khan, Latifur

A2 - Gao, Xiaofeng

A2 - Qiu, Meikang

A2 - Pedrycz, Witold

A2 - Wu, Xindong

ER -
