TY - JOUR
T1 - Discriminative Triad Matching and Reconstruction for Weakly Referring Expression Grounding
AU - Sun, Mingjie
AU - Xiao, Jimin
AU - Lim, Eng Gee
AU - Liu, Si
AU - Goulermas, John Y.
N1 - Publisher Copyright: © 1979-2012 IEEE.
PY - 2021/11/1
Y1 - 2021/11/1
N2 - In this paper, we tackle the weakly-supervised referring expression grounding task, which localizes a referent object in an image according to a query sentence, where the mapping between image regions and queries is not available during the training stage. In traditional methods, the object region that best matches the referring expression is selected, and the query sentence is then reconstructed from the selected region, with the reconstruction difference serving as the loss for back-propagation. The existing methods, however, conduct both the matching and the reconstruction approximately, as they ignore the fact that the matching correctness is unknown. To overcome this limitation, a discriminative triad is designed here as the basis of the solution, through which a query can be converted into one or multiple discriminative triads in a very scalable way. Based on the discriminative triad, we further propose triad-level matching and reconstruction modules that are lightweight yet effective for weakly-supervised training, making the method three times lighter and faster than the previous state-of-the-art methods. One important merit of our work is its superior performance despite the simple and neat design. Specifically, the proposed method achieves a new state-of-the-art accuracy when evaluated on the RefCOCO (39.21 percent), RefCOCO+ (39.18 percent) and RefCOCOg (43.24 percent) datasets, which is 4.17, 4.08 and 7.8 percent higher than the previous state of the art, respectively. The code is available at https://github.com/insomnia94/DTWREG.
AB - In this paper, we tackle the weakly-supervised referring expression grounding task, which localizes a referent object in an image according to a query sentence, where the mapping between image regions and queries is not available during the training stage. In traditional methods, the object region that best matches the referring expression is selected, and the query sentence is then reconstructed from the selected region, with the reconstruction difference serving as the loss for back-propagation. The existing methods, however, conduct both the matching and the reconstruction approximately, as they ignore the fact that the matching correctness is unknown. To overcome this limitation, a discriminative triad is designed here as the basis of the solution, through which a query can be converted into one or multiple discriminative triads in a very scalable way. Based on the discriminative triad, we further propose triad-level matching and reconstruction modules that are lightweight yet effective for weakly-supervised training, making the method three times lighter and faster than the previous state-of-the-art methods. One important merit of our work is its superior performance despite the simple and neat design. Specifically, the proposed method achieves a new state-of-the-art accuracy when evaluated on the RefCOCO (39.21 percent), RefCOCO+ (39.18 percent) and RefCOCOg (43.24 percent) datasets, which is 4.17, 4.08 and 7.8 percent higher than the previous state of the art, respectively. The code is available at https://github.com/insomnia94/DTWREG.
KW - Referring expression grounding
KW - discriminative triad matching
KW - weakly supervised training
UR - http://www.scopus.com/inward/record.url?scp=85100846675&partnerID=8YFLogxK
U2 - 10.1109/TPAMI.2021.3058684
DO - 10.1109/TPAMI.2021.3058684
M3 - Article
C2 - 33571088
AN - SCOPUS:85100846675
SN - 0162-8828
VL - 43
SP - 4189
EP - 4195
JO - IEEE Transactions on Pattern Analysis and Machine Intelligence
JF - IEEE Transactions on Pattern Analysis and Machine Intelligence
IS - 11
ER -