Modality-Aware Shot Relating and Comparing for Video Scene Detection

Jiawei Tan; Hongxing Wang; Kang Dang; Jiaxin Li; Zhilong Ou

doi:10.1609/aaai.v39i7.32773

Jiawei Tan, Hongxing Wang^*, Kang Dang, Jiaxin Li, Zhilong Ou

^*Corresponding author for this work

Chongqing University

Research output: Chapter in Book or Report/Conference proceeding › Conference Proceeding › peer-review

Abstract

Video scene detection involves assessing whether each shot and its surroundings belong to the same scene. Achieving this requires carefully correlating multi-modal cues, e.g., visual entity and place modalities, among shots and comparing semantic changes around each shot. However, most methods treat multi-modal semantics equally and do not examine contextual differences between the two sides of a shot, leading to sub-optimal detection performance. In this paper, we propose the Modality-Aware Shot Relating and Comparing approach (MASRC), which relates shots according to the distinct characteristics of the visual entity and place modalities and compares multi-shot similarities so that scene changes are explicitly encoded. Specifically, to fully harness the potential of visual entity and place modalities in modeling shot relations, we mine long-term shot correlations from entity semantics while simultaneously revealing short-term shot correlations from place semantics. In this way, we can learn distinctive shot features that consolidate coherence within scenes and amplify distinguishability across scenes. Once equipped with distinctive shot features, we further encode the relations between the preceding and succeeding shots of each target shot by similarity convolution, which aids in identifying scene-ending shots. We validate the broad applicability of the proposed components in MASRC. Extensive experimental results on public benchmark datasets demonstrate that the proposed MASRC significantly advances video scene detection.
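The two ideas in the abstract, relating shots per modality over different temporal ranges and then comparing the shots on either side of a target shot, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`windowed_attention`, `relate_shots`, `boundary_scores`), the window sizes, the temperature, and the hand-set kernel are all assumptions made for illustration. In particular, the fixed kernel only stands in for the learned similarity convolution the abstract mentions.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def windowed_attention(feats, window, temperature=0.1):
    """Relate each shot to shots at most `window` positions away.

    Hypothetical helper: softmax over cosine similarities restricted to a
    temporal window, followed by a weighted average of the shot features.
    """
    n = feats.shape[0]
    normed = l2_normalize(feats)
    sim = normed @ normed.T                      # (n, n) cosine similarities
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window
    logits = np.where(mask, sim / temperature, -np.inf)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ feats


def relate_shots(entity_feats, place_feats, long_window=8, short_window=2):
    """Assumed reading of the abstract: entity cues are related over a long
    range, place cues over a short range; the results are concatenated."""
    entity_ctx = windowed_attention(entity_feats, long_window)
    place_ctx = windowed_attention(place_feats, short_window)
    return np.concatenate([entity_ctx, place_ctx], axis=1)


def boundary_scores(shot_feats, k=2):
    """Score each shot t as a candidate scene-ending shot.

    A fixed 2k x 2k kernel is slid along the diagonal of the shot similarity
    map. It averages similarities within each side (shots t-k+1..t and
    t+1..t+k) and subtracts the average similarity across the two sides.
    A learned convolution would replace this hand-set kernel.
    """
    n = shot_feats.shape[0]
    normed = l2_normalize(shot_feats)
    sim = normed @ normed.T
    kernel = np.full((2 * k, 2 * k), -1.0 / (2 * k * k))
    kernel[:k, :k] = 1.0 / (2 * k * k)
    kernel[k:, k:] = 1.0 / (2 * k * k)
    scores = np.zeros(n)
    for t in range(k - 1, n - k):                # shots with full context
        patch = sim[t - k + 1:t + k + 1, t - k + 1:t + k + 1]
        scores[t] = float((patch * kernel).sum())
    return scores
```

On synthetic features where shots 0 to 5 share one entity and place and shots 6 to 11 share another, the score would be expected to peak at shot 5, the last shot before the change. Shots too close to either end of the video keep a score of zero because they lack a full window on both sides.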

Original language	English
Title of host publication	Special Track on AI Alignment
Editors	Toby Walsh, Julie Shah, Zico Kolter
Publisher	Association for the Advancement of Artificial Intelligence
Pages	7193-7201
Number of pages	9
Edition	7
ISBN (Electronic)	157735897X, 9781577358978
DOIs	https://doi.org/10.1609/aaai.v39i7.32773
Publication status	Published - 11 Apr 2025
Event	39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025 - Philadelphia, United States Duration: 25 Feb 2025 → 4 Mar 2025

Publication series

Name	Proceedings of the AAAI Conference on Artificial Intelligence
Number	7
Volume	39
ISSN (Print)	2159-5399
ISSN (Electronic)	2374-3468

Conference

Conference	39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025
Country/Territory	United States
City	Philadelphia
Period	25/02/25 → 4/03/25

Cite this

Tan, J., Wang, H., Dang, K., Li, J., & Ou, Z. (2025). Modality-Aware Shot Relating and Comparing for Video Scene Detection. In T. Walsh, J. Shah, & Z. Kolter (Eds.), Special Track on AI Alignment (7 ed., pp. 7193-7201). (Proceedings of the AAAI Conference on Artificial Intelligence; Vol. 39, No. 7). Association for the Advancement of Artificial Intelligence. https://doi.org/10.1609/aaai.v39i7.32773

@inproceedings{68fe595bf40c49a787f1ea08c6ad8be3,
title = "Modality-Aware Shot Relating and Comparing for Video Scene Detection",
author = "Jiawei Tan and Hongxing Wang and Kang Dang and Jiaxin Li and Zhilong Ou",
note = "Publisher Copyright: Copyright {\textcopyright} 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.; 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025 ; Conference date: 25-02-2025 Through 04-03-2025",
year = "2025",
month = apr,
day = "11",
doi = "10.1609/aaai.v39i7.32773",
language = "English",
series = "Proceedings of the AAAI Conference on Artificial Intelligence",
publisher = "Association for the Advancement of Artificial Intelligence",
number = "7",
pages = "7193--7201",
editor = "Toby Walsh and Julie Shah and Zico Kolter",
booktitle = "Special Track on AI Alignment",
edition = "7",
}

Tan, J, Wang, H, Dang, K, Li, J & Ou, Z 2025, Modality-Aware Shot Relating and Comparing for Video Scene Detection. in T Walsh, J Shah & Z Kolter (eds), Special Track on AI Alignment. 7 edn, Proceedings of the AAAI Conference on Artificial Intelligence, no. 7, vol. 39, Association for the Advancement of Artificial Intelligence, pp. 7193-7201, 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025, Philadelphia, United States, 25/02/25. https://doi.org/10.1609/aaai.v39i7.32773

Modality-Aware Shot Relating and Comparing for Video Scene Detection. / Tan, Jiawei; Wang, Hongxing; Dang, Kang et al.
Special Track on AI Alignment. ed. / Toby Walsh; Julie Shah; Zico Kolter. 7. ed. Association for the Advancement of Artificial Intelligence, 2025. p. 7193-7201 (Proceedings of the AAAI Conference on Artificial Intelligence; Vol. 39, No. 7).

TY - GEN
T1 - Modality-Aware Shot Relating and Comparing for Video Scene Detection
AU - Tan, Jiawei
AU - Wang, Hongxing
AU - Dang, Kang
AU - Li, Jiaxin
AU - Ou, Zhilong
PY - 2025/4/11
Y1 - 2025/4/11
UR - http://www.scopus.com/inward/record.url?scp=105003995113&partnerID=8YFLogxK
U2 - 10.1609/aaai.v39i7.32773
DO - 10.1609/aaai.v39i7.32773
M3 - Conference Proceeding
AN - SCOPUS:105003995113
T3 - Proceedings of the AAAI Conference on Artificial Intelligence
SP - 7193
EP - 7201
BT - Special Track on AI Alignment
A2 - Walsh, Toby
A2 - Shah, Julie
A2 - Kolter, Zico
PB - Association for the Advancement of Artificial Intelligence
T2 - 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025
Y2 - 25 February 2025 through 4 March 2025
ER -
