Abstract
Understanding the interactions between human–object (HO) pairs is key to the human–object interaction (HOI) detection task. Recent advances in linguistic–visual contrastive learning have significantly influenced visual understanding research. In HOI detection, exploiting linguistic knowledge typically requires aligning linguistic and visual features, which in turn demands extra training data or longer training time. This study proposes an effective approach that uses multimodal knowledge to enhance HOI learning at both global and instance scales. By applying linguistically guided projection at the global scale and merging multimodal features at the instance scale, performance on Rare HOI categories is markedly improved. The proposed model achieves state-of-the-art performance on the HICO-Det benchmark, validating the effectiveness of the proposed global- and local-scale multimodal learning approach.
| Original language | English |
|---|---|
| Article number | 130882 |
| Number of pages | 10 |
| Journal | Neurocomputing |
| Volume | 651 |
| Issue number | 28 |
| Early online date | 18 Jul 2025 |
| DOIs | |
| Publication status | Published - 28 Oct 2025 |
Keywords
- Computer vision
- Human–object interaction (HOI) detection
- Multimodal learning
Title
Exploring interaction concepts for human–object-interaction detection via global- and local-scale enhancing