Abstract
【Purpose/significance】In the big data environment, scientific literature spans a large volume of information across different fields, topics, and modalities, such as text, images, and audio, so effectively filtering out redundant multimodal data is one of the main difficulties at this stage.【Method/process】To address this, the study proposes a multi-dimensional semantic cross-modal retrieval algorithm for scientific literature in the big data environment. A biased Kalman filter removes redundant data from the scientific literature database. On this basis, the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm extracts the feature words that represent each text. The semantic similarity between feature words is computed from the values of the literature retrieval elements and combined with the relevance of those elements to generate a retrieval matrix, which completes the multi-dimensional semantic cross-modal retrieval of scientific literature.【Result/conclusion】The experimental results show that the proposed algorithm achieves high retrieval accuracy, high NDCG values, and shorter retrieval time.【Innovation/limitation】The algorithm addresses the limitations of traditional keyword retrieval by fusing multimodal data, exploiting rich semantic information, and bridging the semantic gap. It improves the effectiveness and accuracy of scientific literature retrieval and gives researchers and scholars more convenient and comprehensive information retrieval services.
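The abstract names TF-IDF for feature-word extraction and NDCG as an evaluation metric but gives no formulas. The sketch below uses the textbook definitions as an assumption: term frequency normalized by document length, inverse document frequency as log(N/df), and NDCG with a log2 rank discount. The function names `tfidf_feature_words` and `ndcg` and the toy documents are hypothetical and are not taken from the paper. It does not reproduce the authors' Kalman filtering or retrieval-matrix steps.

```python
import math
from collections import Counter


def tfidf_feature_words(docs, top_k=3):
    """Return the top_k highest-scoring TF-IDF terms for each tokenized document.

    Assumes textbook TF-IDF: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    features = []
    for doc in docs:
        counts = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        features.append([term for term, _ in ranked[:top_k]])
    return features


def ndcg(relevances, k=None):
    """NDCG of a ranked list of graded relevance scores (log2 discount)."""
    k = len(relevances) if k is None else k

    def dcg(scores):
        # Rank 0 is the top result, so its discount is log2(2) = 1.
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(scores[:k]))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Toy corpus (hypothetical), already tokenized by whitespace.
    docs = [
        "kalman filter removes redundant data".split(),
        "semantic similarity between feature words".split(),
        "retrieval matrix for cross modal retrieval".split(),
    ]
    print(tfidf_feature_words(docs, top_k=2))
    print(round(ndcg([3, 2, 0, 1]), 4))
```

A term that appears in every document receives an IDF of zero under this definition, so it never outranks a term that is specific to one document. An NDCG of 1.0 means the ranked list is already in ideal relevance order.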
Authors
岑丹
闫奕文
CEN Dan; YAN Yiwen (Library, Jilin University, Changchun 130012, China; Library of Jilin Jianzhu University, Changchun 130000, China)
Source
《情报科学》
Peking University Core Journal (北大核心)
2025, No. 9, pp. 133-138 (6 pages)
Information Science
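Funding
Key R&D Project of the Jilin Province Science and Technology Development Plan, "Research on Key Technologies of an Intelligent Tourism Experience Platform" (20250203107SF).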
Keywords
scientific literature database filtering
text feature words
averaged word frequency
semantic similarity
element membership degree