Abstract
To improve the text recall rate of intelligent question-answering (Q&A) systems based on the Retrieval-Augmented Generation (RAG) architecture, this study analyzes the text vectorization strategies in common use. To address their problems, namely incoherent contextual semantics and noise introduced into the vectors after word embedding, it proposes a semantic feature space model and a vectorization strategy that uses text key points for semantic retrieval. Using this model, the study analyzes and proves that the semantic feature space constructed with the key-point strategy better approximates the domain knowledge space, and derives a method that improves text recall by projecting text vectors into a low-rank semantic feature space for semantic retrieval. Applied together, the model, strategy, and method form a scheme that optimizes and improves the RAG architecture. Experimental results show that its recall rate is significantly higher than that of the traditional RAG architecture, and the scheme was used to implement an intelligent Q&A system for science and technology policies and regulations built on a large language model. The scheme further completes the RAG application development technology stack, and its semantic feature space can be used to improve the search algorithms of vector databases.
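The projection step described in the abstract can be illustrated with a small sketch. The abstract does not say how the semantic feature space is constructed, so everything below is an assumption made for illustration: the feature space is taken to be the span of a few key-point embedding vectors, orthonormalized with Gram-Schmidt, and the function names (`orthonormal_basis`, `project`, `retrieve`) and the 4-dimensional toy vectors are hypothetical and are not taken from the paper.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`.

    Assumption: the key-point embeddings span the semantic feature space.
    """
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors that add no new direction
            basis.append([wi / norm for wi in w])
    return basis


def project(v, basis):
    """Coordinates of v in the low-rank subspace spanned by `basis`."""
    return [dot(v, b) for b in basis]


def cosine(u, v):
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return 0.0 if nu == 0 or nv == 0 else dot(u, v) / (nu * nv)


def retrieve(query, docs, basis=None, top_k=1):
    """Rank docs by cosine similarity, optionally after projection."""
    proj = (lambda v: project(v, basis)) if basis is not None else (lambda v: v)
    q = proj(query)
    scored = [(cosine(q, proj(d)), i) for i, d in enumerate(docs)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored[:top_k]]


# Made-up toy embeddings: dimensions 0-1 carry meaning, 2-3 carry noise.
key_points = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
docs = [[1.0, 0.1, 0.0, 2.0], [0.1, 1.0, 2.0, 0.0]]
query = [0.2, 1.0, 0.0, 2.0]  # closest to docs[1] once noise is ignored

basis = orthonormal_basis(key_points)
full_space_hit = retrieve(query, docs)          # noise dominates the ranking
projected_hit = retrieve(query, docs, basis)    # ranking inside the subspace
print(full_space_hit, projected_hit)
```

In this toy setup the full-space ranking is driven by the shared noise component, while the ranking inside the rank-2 subspace depends only on the key-point directions. A production system would more likely derive the subspace from real embeddings, for example with a truncated singular value decomposition, which is also an assumption rather than something the abstract states.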
Authors
黄红伟
杜军
卢云涛
马继涛
马健
朱培虎
HUANG Hongwei; DU Jun; LU Yuntao; MA Jitao; MA Jian; ZHU Peihu (Yunnan Academy of Scientific & Technical Information, Kunming 650051, China; Department of Information Systems, City University of Hong Kong, Hong Kong 999077, China; IRISaas Limited, Shenzhen 518063, China)
Source
《软件导刊》
2026, No. 2, pp. 8-13 (6 pages)
Software Guide
Funding
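Yunnan Province Special Project for Science and Technology Development Strategy and Policy Research (202404AL030008)
Yunnan Province Special Project for Public Science and Technology Service Platforms (202305AH340004)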