FD-DeCap: A Front-Door Causal Inference-Based Framework for Debiasing Automatic Audio Captioning

  • Jinyun Liu
  • Hui Li
  • Mingjun Wei
  • Zhanlin Ji*
  • Haiyang Zhang*
  • Ivan Ganchev*

  *Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Automatic Audio Captioning (AAC) aims to generate natural language descriptions of audio content. However, existing methods are often affected by latent confounders and spurious co-occurrence patterns in the data, leading to bias and semantic inaccuracies. This paper proposes FD-DeCap, a front-door causal inference-based framework for the AAC task. The framework consists of three core components: 1) an AudioAug module introduces noise perturbations into audio features to enhance robustness against environmental interference; 2) a MedGate module explicitly introduces a mediator variable to satisfy the identifiability conditions of the front-door criterion, thereby disentangling direct and indirect effects; and 3) a MSeCE consistency loss jointly optimizes cross-entropy and MSE constraints, encouraging reliance on mediator representations rather than spurious correlations. Experimental results demonstrate that FD-DeCap achieves stable performance improvements over state-of-the-art frameworks on the Clotho and AudioCaps datasets, with SPIDEr scores of 0.282 and 0.429, respectively. A multi-perspective causal validation of the front-door adjustment, performed on the Clotho dataset, includes analyses of similarity-score distributions, feature distributions, and representative case studies. After debiasing, the similarity between generated captions and reference captions shifts upward overall, the mediator feature distributions become more dispersed, and the representative cases more accurately capture true acoustic scenes. These findings indicate that the proposed FD-DeCap framework effectively alleviates bias caused by latent confounders and spurious co-occurrence, enhances the semantic consistency and robustness of generated captions, and provides a novel solution for the AAC task in complex acoustic scenarios.
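
For readers unfamiliar with the front-door adjustment that the abstract builds on, the following is a minimal sketch of the standard formula on a toy discrete model. It is not the authors' implementation: the binary variables U (hidden confounder), X (input), M (mediator), Y (outcome), all probability values, and all function names are hypothetical and chosen only for illustration. The sketch estimates P(Y=1 | do(X=x)) from the observational joint over (X, M, Y) alone, using sum_m P(m | x) * sum_x' P(x') * P(Y=1 | x', m), and compares it with the ground truth that is computable only when U is known.

```python
from itertools import product

# Hypothetical toy model (all numbers are illustrative assumptions).
P_U = {0: 0.6, 1: 0.4}                      # P(U=u), hidden confounder
P_X1_GIVEN_U = {0: 0.2, 1: 0.8}             # P(X=1 | U=u)
P_M1_GIVEN_X = {0: 0.3, 1: 0.9}             # P(M=1 | X=x), mediator depends on X only
P_Y1_GIVEN_MU = {(0, 0): 0.1, (0, 1): 0.5,  # P(Y=1 | M=m, U=u), no direct X -> Y edge
                 (1, 0): 0.6, (1, 1): 0.9}


def bern(p_one, value):
    """Probability of `value` under a Bernoulli with P(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def observational_joint():
    """Joint P(x, m, y) with the confounder U marginalized out."""
    joint = {}
    for u, x, m, y in product((0, 1), repeat=4):
        p = (P_U[u] * bern(P_X1_GIVEN_U[u], x)
             * bern(P_M1_GIVEN_X[x], m)
             * bern(P_Y1_GIVEN_MU[(m, u)], y))
        joint[(x, m, y)] = joint.get((x, m, y), 0.0) + p
    return joint


def front_door(joint, x):
    """Front-door estimate of P(Y=1 | do(X=x)) from observational data only."""
    p_x = {xv: sum(joint[(xv, m, y)] for m in (0, 1) for y in (0, 1))
           for xv in (0, 1)}
    total = 0.0
    for m in (0, 1):
        p_m_given_x = sum(joint[(x, m, y)] for y in (0, 1)) / p_x[x]
        inner = 0.0
        for x2 in (0, 1):
            p_x2_m = sum(joint[(x2, m, y)] for y in (0, 1))
            inner += p_x[x2] * joint[(x2, m, 1)] / p_x2_m
        total += p_m_given_x * inner
    return total


def naive_conditional(joint, x):
    """Plain P(Y=1 | X=x), which is biased by the hidden confounder."""
    p_x = sum(joint[(x, m, y)] for m in (0, 1) for y in (0, 1))
    return sum(joint[(x, m, 1)] for m in (0, 1)) / p_x


def ground_truth(x):
    """True P(Y=1 | do(X=x)), computable here only because U is known."""
    return sum(bern(P_M1_GIVEN_X[x], m)
               * sum(P_U[u] * P_Y1_GIVEN_MU[(m, u)] for u in (0, 1))
               for m in (0, 1))


if __name__ == "__main__":
    joint = observational_joint()
    for x in (0, 1):
        print(x, naive_conditional(joint, x), front_door(joint, x), ground_truth(x))
```

In this toy setting the front-door estimate matches the ground truth, while the naive conditional does not. The estimate is exact here because the mediator satisfies the front-door conditions by construction; whether a learned mediator meets those conditions in a captioning model is a separate question that the toy example does not address.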

Original language: English
Pages (from-to): 6029-6042
Number of pages: 14
Journal: IEEE Access
Volume: 14
Publication status: Published - 2026

Keywords

  • Automatic audio captioning (AAC)
  • bias
  • causal inference
  • front-door adjustment
