Abstract: To address the problem that current video-text retrieval methods fail to jointly model temporal information and relevance information, a video-text retrieval method based on a cross-modal attention mechanism is proposed. First, a large-scale pre-trained image-text model is used to extract embedding representations of the text and the video frames, and knowledge transfer is used to alleviate the heterogeneity between data of different modalities. Then, a joint text-frame cross-modal attention module simultaneously encodes the temporal information among video frames and the relevance information between video frames and the text, capturing a more competitive video feature representation. Finally, a cross-entropy loss function is used to constrain model training. Comparative experiments verify that the method effectively captures the temporal and relevance information of video frames and achieves competitive results on the MSR-VTT (Microsoft Research Video to Text) and LSMDC (Large-Scale Movie Description Challenge) datasets.
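A minimal PyTorch sketch of the pipeline described in this abstract, under stated assumptions: frame and text embeddings come from a frozen CLIP-like encoder of dimension d, the joint text-frame cross-modal attention is approximated by a standard multi-head attention layer in which the text query attends to the frame sequence, and training uses the symmetric cross-entropy over cosine similarities common in CLIP-style retrieval. All class names, hyperparameters, and the exact attention layout are illustrative, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextFrameCrossAttention(nn.Module):
    """Text-conditioned pooling of frame embeddings (illustrative sketch)."""
    def __init__(self, dim: int = 512, num_heads: int = 8, max_frames: int = 64):
        super().__init__()
        # Learned positional embeddings inject the temporal order of the frames.
        self.pos_emb = nn.Parameter(torch.zeros(1, max_frames, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_emb: torch.Tensor, frame_emb: torch.Tensor) -> torch.Tensor:
        # text_emb: (B, d) sentence embeddings; frame_emb: (B, T, d) frame embeddings.
        frames = frame_emb + self.pos_emb[:, : frame_emb.size(1)]
        query = text_emb.unsqueeze(1)                  # (B, 1, d) text query
        video, _ = self.attn(query, frames, frames)    # attend over frames
        return self.norm(video.squeeze(1))             # (B, d) video representation

def retrieval_loss(text_emb: torch.Tensor, video_emb: torch.Tensor,
                   temperature: float = 0.05) -> torch.Tensor:
    # Symmetric cross-entropy over the text-video similarity matrix.
    t = F.normalize(text_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = t @ v.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

# Usage with random stand-ins for CLIP features. For brevity each video is pooled
# only with its paired caption; a full model would pool per text-video pair when
# scoring every caption against every video.
text = torch.randn(4, 512)
frames = torch.randn(4, 12, 512)
video = TextFrameCrossAttention()(text, frames)
loss = retrieval_loss(text, video)
```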
Funding: Supported by the National Key Research and Development Program of China (No. 2020YFB1406800).
Abstract: The widespread adoption of mobile Internet and the Internet of Things (IoT) has led to a significant increase in the amount of video data. While video data are increasingly important, language and text remain the primary means of interaction in everyday communication, so text-based cross-modal retrieval has become a crucial demand in many applications. Most previous text-video retrieval works utilize the implicit knowledge of pre-trained models such as contrastive language-image pre-training (CLIP) to boost retrieval performance. However, implicit knowledge only records the co-occurrence relationships present in the data; it cannot help the model understand specific words or scenes. Another type of out-of-domain knowledge, explicit knowledge, which usually takes the form of a knowledge graph, can play an auxiliary role in understanding the content of different modalities. Therefore, we study the application of an external knowledge base in a text-video retrieval model for the first time, and propose KnowER, a model based on knowledge enhancement for efficient text-video retrieval. The knowledge-enhanced model achieves state-of-the-art performance on three widely used text-video retrieval datasets, i.e., MSRVTT, DiDeMo, and MSVD.
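The abstract does not describe how KnowER injects explicit knowledge, so the following is only a generic, hypothetical illustration of knowledge-graph enhancement for retrieval, not KnowER's actual architecture. It assumes a lookup table `entity_table` mapping words to pre-trained knowledge-graph entity embeddings; matched entity vectors are averaged, projected into the text-embedding space, and fused with the CLIP-style caption embedding by a gated residual before similarity scoring.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeFusedTextEncoder(nn.Module):
    """Hypothetical sketch: fuse knowledge-graph entity embeddings into a text query."""
    def __init__(self, entity_table: dict, kg_dim: int, text_dim: int = 512):
        super().__init__()
        self.entity_table = entity_table            # word -> (kg_dim,) KG embedding
        self.proj = nn.Linear(kg_dim, text_dim)     # project KG space into text space
        self.gate = nn.Linear(2 * text_dim, text_dim)

    def forward(self, text_emb: torch.Tensor, captions: list) -> torch.Tensor:
        # text_emb: (B, text_dim) caption embeddings from a frozen text encoder.
        kg_feats = []
        for caption in captions:
            hits = [self.entity_table[w] for w in caption.lower().split()
                    if w in self.entity_table]
            kg_feats.append(torch.stack(hits).mean(0) if hits
                            else torch.zeros(self.proj.in_features))
        kg = self.proj(torch.stack(kg_feats).to(text_emb.device))        # (B, text_dim)
        g = torch.sigmoid(self.gate(torch.cat([text_emb, kg], dim=-1)))  # fusion gate
        return F.normalize(text_emb + g * kg, dim=-1)                    # enhanced query
```

The enhanced query embedding would then be scored against video embeddings exactly as in a plain CLIP-based retriever, so the explicit knowledge acts purely as an auxiliary signal on the text side in this sketch.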