CAPSTONE: Curriculum Sampling for Dense Retrieval with Document Expansion

Xingwei He; Yeyun Gong; A-Long Jin; Hang Zhang; Anlei Dong; Jian Jiao; Siu-Ming Yiu; Nan Duan

doi:10.18653/v1/2023.emnlp-main.651

Research output: Chapter in Book or Report/Conference proceeding › Conference Proceeding › peer-review

Abstract

The dual-encoder has become the de facto architecture for dense retrieval. Typically, it computes the latent representations of the query and document independently, thus failing to fully capture the interactions between the query and document. To alleviate this, recent research has focused on obtaining query-informed document representations. During training, it expands the document with a real query, but during inference, it replaces the real query with a generated one. This inconsistency between training and inference causes the dense retrieval model to prioritize query information while disregarding the document when computing the document representation. Consequently, it performs even worse than the vanilla dense retrieval model because its performance heavily relies on the relevance between the generated queries and the real query. In this paper, we propose a curriculum sampling strategy that utilizes pseudo queries during training and progressively enhances the relevance between the generated query and the real query. By doing so, the retrieval model learns to extend its attention from the document alone to both the document and query, resulting in high-quality query-informed document representations. Experimental results on both in-domain and out-of-domain datasets demonstrate that our approach outperforms previous dense retrieval models.
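The curriculum idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how relevance between a generated query and the real query is scored, how the schedule is shaped, or how the query is attached to the document. The names `curriculum_sample`, `expand_document`, `progress`, and `window`, as well as the `[SEP]` separator, are hypothetical choices made for this example.

```python
import random
from typing import Optional, Sequence, Tuple


def curriculum_sample(
    pseudo_queries: Sequence[Tuple[str, float]],
    progress: float,
    window: int = 2,
    rng: Optional[random.Random] = None,
) -> str:
    """Sample one pseudo query; later in training, more relevant ones are drawn.

    pseudo_queries: (query_text, relevance_to_real_query) pairs. How relevance
    is scored is an assumption of this sketch, not taken from the paper.
    progress: fraction of training completed, in [0, 1].
    window: how many neighbouring relevance ranks are eligible at each step.
    """
    if not pseudo_queries:
        raise ValueError("at least one pseudo query is required")
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    if window < 1:
        raise ValueError("window must be >= 1")
    rng = rng or random.Random()
    # Order candidates from least to most relevant to the real query.
    ranked = sorted(pseudo_queries, key=lambda pair: pair[1])
    # The highest eligible rank moves from 0 to len(ranked) - 1 as training proceeds.
    top = int(progress * (len(ranked) - 1))
    low = max(0, top - window + 1)
    return rng.choice(ranked[low : top + 1])[0]


def expand_document(document: str, query: str, sep: str = " [SEP] ") -> str:
    """Prepend the sampled query to the document text (hypothetical input format)."""
    return f"{query}{sep}{document}"
```

Under these assumptions, a training loop would call `curriculum_sample(pool, step / total_steps)` for each document and pass the result of `expand_document` to the document encoder. Early steps then see weakly related pseudo queries, which pushes the encoder to rely on the document, and later steps see pseudo queries closer to the real one.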

Original language	English
Title of host publication	EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings
Editors	Houda Bouamor, Juan Pino, Kalika Bali
Publisher	Association for Computational Linguistics (ACL)
Pages	10531-10541
Number of pages	11
ISBN (Electronic)	9798891760608
DOIs	https://doi.org/10.18653/v1/2023.emnlp-main.651
Publication status	Published - 2023
Externally published	Yes
Event	2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023 - Hybrid, Singapore, Singapore
Duration	6 Dec 2023 → 10 Dec 2023

Publication series

Name	EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings

Conference

Conference	2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023
Country/Territory	Singapore
City	Hybrid, Singapore
Period	6/12/23 → 10/12/23

Access to Document

10.18653/v1/2023.emnlp-main.651

Cite this

He, X., Gong, Y., Jin, A.-L., Zhang, H., Dong, A., Jiao, J., Yiu, S.-M., & Duan, N. (2023). CAPSTONE: Curriculum Sampling for Dense Retrieval with Document Expansion. In H. Bouamor, J. Pino, & K. Bali (Eds.), EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings (pp. 10531-10541). (EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.emnlp-main.651

He, Xingwei ; Gong, Yeyun ; Jin, A-Long et al. / CAPSTONE: Curriculum Sampling for Dense Retrieval with Document Expansion. EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings. editor / Houda Bouamor ; Juan Pino ; Kalika Bali. Association for Computational Linguistics (ACL), 2023. pp. 10531-10541 (EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings).

@inproceedings{d32ec4b22e98446eac2d5214e1cbc7d9,

title = "CAPSTONE: Curriculum Sampling for Dense Retrieval with Document Expansion",

abstract = "The dual-encoder has become the de facto architecture for dense retrieval. Typically, it computes the latent representations of the query and document independently, thus failing to fully capture the interactions between the query and document. To alleviate this, recent research has focused on obtaining query-informed document representations. During training, it expands the document with a real query, but during inference, it replaces the real query with a generated one. This inconsistency between training and inference causes the dense retrieval model to prioritize query information while disregarding the document when computing the document representation. Consequently, it performs even worse than the vanilla dense retrieval model because its performance heavily relies on the relevance between the generated queries and the real query. In this paper, we propose a curriculum sampling strategy that utilizes pseudo queries during training and progressively enhances the relevance between the generated query and the real query. By doing so, the retrieval model learns to extend its attention from the document alone to both the document and query, resulting in high-quality query-informed document representations. Experimental results on both in-domain and out-of-domain datasets demonstrate that our approach outperforms previous dense retrieval models.",

author = "Xingwei He and Yeyun Gong and A-Long Jin and Hang Zhang and Anlei Dong and Jian Jiao and Siu-Ming Yiu and Nan Duan",

note = "Publisher Copyright: {\textcopyright}2023 Association for Computational Linguistics.; 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023 ; Conference date: 06-12-2023 Through 10-12-2023",

year = "2023",

doi = "10.18653/v1/2023.emnlp-main.651",

language = "English",

series = "EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings",

publisher = "Association for Computational Linguistics (ACL)",

pages = "10531--10541",

editor = "Houda Bouamor and Juan Pino and Kalika Bali",

booktitle = "EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings",

}

He, X, Gong, Y, Jin, A-L, Zhang, H, Dong, A, Jiao, J, Yiu, S-M & Duan, N 2023, CAPSTONE: Curriculum Sampling for Dense Retrieval with Document Expansion. in H Bouamor, J Pino & K Bali (eds), EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings. EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings, Association for Computational Linguistics (ACL), pp. 10531-10541, 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Hybrid, Singapore, Singapore, 6/12/23. https://doi.org/10.18653/v1/2023.emnlp-main.651

CAPSTONE: Curriculum Sampling for Dense Retrieval with Document Expansion. / He, Xingwei; Gong, Yeyun; Jin, A-Long et al.
EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings. ed. / Houda Bouamor; Juan Pino; Kalika Bali. Association for Computational Linguistics (ACL), 2023. p. 10531-10541 (EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings).

TY - GEN

T1 - CAPSTONE: Curriculum Sampling for Dense Retrieval with Document Expansion

AU - He, Xingwei

AU - Gong, Yeyun

AU - Jin, A-Long

AU - Zhang, Hang

AU - Dong, Anlei

AU - Jiao, Jian

AU - Yiu, Siu-Ming

AU - Duan, Nan

PY - 2023

Y1 - 2023

AB - The dual-encoder has become the de facto architecture for dense retrieval. Typically, it computes the latent representations of the query and document independently, thus failing to fully capture the interactions between the query and document. To alleviate this, recent research has focused on obtaining query-informed document representations. During training, it expands the document with a real query, but during inference, it replaces the real query with a generated one. This inconsistency between training and inference causes the dense retrieval model to prioritize query information while disregarding the document when computing the document representation. Consequently, it performs even worse than the vanilla dense retrieval model because its performance heavily relies on the relevance between the generated queries and the real query. In this paper, we propose a curriculum sampling strategy that utilizes pseudo queries during training and progressively enhances the relevance between the generated query and the real query. By doing so, the retrieval model learns to extend its attention from the document alone to both the document and query, resulting in high-quality query-informed document representations. Experimental results on both in-domain and out-of-domain datasets demonstrate that our approach outperforms previous dense retrieval models.

UR - http://www.scopus.com/inward/record.url?scp=85184795148&partnerID=8YFLogxK

U2 - 10.18653/v1/2023.emnlp-main.651

DO - 10.18653/v1/2023.emnlp-main.651

M3 - Conference Proceeding

AN - SCOPUS:85184795148

T3 - EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings

SP - 10531

EP - 10541

BT - EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings

A2 - Bouamor, Houda

A2 - Pino, Juan

A2 - Bali, Kalika

PB - Association for Computational Linguistics (ACL)

T2 - 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023

Y2 - 6 December 2023 through 10 December 2023

ER -

He X, Gong Y, Jin AL, Zhang H, Dong A, Jiao J et al. CAPSTONE: Curriculum Sampling for Dense Retrieval with Document Expansion. In Bouamor H, Pino J, Bali K, editors, EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings. Association for Computational Linguistics (ACL). 2023. p. 10531-10541. (EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings). doi: 10.18653/v1/2023.emnlp-main.651
