TY - JOUR
T1 - FD-DeCap: A Front-Door Causal Inference-Based Framework for Debiasing Automatic Audio Captioning
AU - Liu, Jinyun
AU - Li, Hui
AU - Wei, Mingjun
AU - Ji, Zhanlin
AU - Zhang, Haiyang
AU - Ganchev, Ivan
N1 - Publisher Copyright: © 2013 IEEE.
PY - 2026
Y1 - 2026
AB - Automatic Audio Captioning (AAC) aims to generate natural language descriptions of audio content. However, existing methods are often affected by latent confounders and spurious co-occurrence patterns in the data, leading to bias and semantic inaccuracies. This paper proposes FD-DeCap, a front-door causal inference-based framework for the AAC task. The framework consists of three core components: 1) an AudioAug module that introduces noise perturbations into audio features to enhance robustness against environmental interference; 2) a MedGate module that explicitly introduces a mediator variable to satisfy the identifiability conditions of the front-door criterion, thereby disentangling direct and indirect effects; and 3) an MSeCE consistency loss that jointly optimizes cross-entropy and MSE constraints, encouraging reliance on mediator representations rather than spurious correlations. Experimental results demonstrate that FD-DeCap achieves stable performance improvements over state-of-the-art frameworks on the Clotho and AudioCaps datasets, with SPIDEr scores of 0.282 and 0.429, respectively. A multi-perspective causal validation of the front-door adjustment, performed on the Clotho dataset, includes analyses of similarity-score distributions, feature distributions, and representative case studies. After debiasing, the similarity between generated captions and reference captions shifts upward overall, the mediator feature distributions become more dispersed, and the representative cases more accurately capture the true acoustic scenes. These findings indicate that the proposed FD-DeCap framework effectively alleviates bias caused by latent confounders and spurious co-occurrence, enhances the semantic consistency and robustness of generated captions, and provides a novel solution for the AAC task in complex acoustic scenarios.
KW - Automatic audio captioning (AAC)
KW - bias
KW - causal inference
KW - front-door adjustment
UR - https://www.scopus.com/pages/publications/105028000258
U2 - 10.1109/ACCESS.2026.3651636
DO - 10.1109/ACCESS.2026.3651636
M3 - Article
AN - SCOPUS:105028000258
SN - 2169-3536
VL - 14
SP - 6029
EP - 6042
JO - IEEE Access
JF - IEEE Access
ER -