Text-to-image person retrieval, a fine-grained cross-modal retrieval problem, aims to search an image library for person images that match a given textual caption. Existing text-to-image person retrieval methods usually use fixed-point embeddings to express the semantics of the two modalities and perform multi-granularity alignment between the modalities in the embedding space. However, owing to the inherent mutual one-to-many correspondence between images and texts, fixed-point embedding methods often fail to adequately capture this relationship, leading to erroneous retrieval results. To address this problem, we propose a novel uncertainty-aware coarse-to-fine alignment method, which first maps fixed-point embeddings to probability distributions and then aligns the two modalities in terms of distributions and sampling points at a coarse-to-fine granularity, for accurate text-to-image person retrieval. Specifically, we first introduce two contrastive learning tasks, distribution contrast learning and point contrast learning, to achieve uncertainty-aware coarse-grained inter-modal alignment. The distribution contrast learning task ensures, through distribution-based contrastive learning, that distributions with the same identity are as similar as possible across modalities. The point contrast learning task performs contrastive learning on inter-modal and intra-modal sampling points, which not only models rich and diverse cross-modal associations but also optimizes the learning of the distributions. To meet the fine-grained association requirements of text-to-image person retrieval, we design an uncertainty-aware attribute masking language reconstruction task, which achieves fine-grained alignment by randomly masking attribute words in the text and reconstructing them via inter-modal sample point interactions. Extensive experiments on two public datasets demonstrate the superior performance of our method.
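The coarse-grained stage described above can be illustrated with a minimal sketch. The abstract does not specify the distribution family, the distance between distributions, or the loss form, so everything below is an assumption made for illustration: diagonal Gaussians parameterized by a mean and a log-variance, the squared 2-Wasserstein distance as the distribution dissimilarity, and an InfoNCE-style loss. The function names (`sample_point`, `wasserstein2_sq`, `info_nce`) and the toy values are hypothetical, and only the inter-modal half of the point contrast is shown.

```python
import math
import random


def sample_point(mean, log_var, rng):
    """Draw one sample from a diagonal Gaussian via reparameterization."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def wasserstein2_sq(mean_a, log_var_a, mean_b, log_var_b):
    """Squared 2-Wasserstein distance between two diagonal Gaussians."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mean_a, mean_b))
    std_term = sum((math.exp(0.5 * a) - math.exp(0.5 * b)) ** 2
                   for a, b in zip(log_var_a, log_var_b))
    return mean_term + std_term


def neg_sq_dist(x, y):
    """Similarity between two sampled points: negative squared distance."""
    return -sum((a - b) ** 2 for a, b in zip(x, y))


def info_nce(similarities, positive_index, temperature=1.0):
    """Cross-entropy of a softmax over similarities, given the positive."""
    logits = [s / temperature for s in similarities]
    top = max(logits)  # subtract the max for numerical stability
    log_norm = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_norm - logits[positive_index]


# Toy data (assumed): one text distribution and two image distributions,
# each given as (mean, log-variance).
text = ([1.0, 0.0], [-2.0, -2.0])
images = [([0.9, 0.1], [-2.0, -2.0]),   # same identity as the text
          ([-1.0, 0.5], [-2.0, -2.0])]  # different identity

# Distribution contrast: compare whole distributions across modalities.
dist_sims = [-wasserstein2_sq(*text, *img) for img in images]
distribution_loss = info_nce(dist_sims, positive_index=0)

# Point contrast (inter-modal part only): compare sampled points.
rng = random.Random(0)
text_point = sample_point(*text, rng)
image_points = [sample_point(*img, rng) for img in images]
point_sims = [neg_sq_dist(text_point, p) for p in image_points]
point_loss = info_nce(point_sims, positive_index=0)

print(distribution_loss, point_loss)
```

In this sketch the distribution-level loss is small when the text distribution lies closer to the same-identity image distribution than to the other one, while the point-level loss varies with the drawn samples, which is one way sampled points could expose the one-to-many correspondence that a single fixed embedding cannot.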
Funding: Supported by the National Natural Science Foundation of China (No. 62376004), the Natural Science Foundation of Anhui Province (No. 2208085J18), and the University Synergy Innovation Program of Anhui Province (No. GXXT-2022-033).