Research on Intelligent Generation and Evidence-based of Literature Review Based on Large Language Model


Abstract: [Objective] This paper aims to automatically generate structured literature reviews with references, helping researchers quickly grasp the scientific knowledge of a given field. [Methods] A corpus was built from 70,000 papers selected from the NSTL platform, with rhetorical moves identified in their abstracts. A set of 3,000 review samples was created through large language model generation followed by manual revision and was used to fine-tune the GLM3-6B model. The corpus was converted into high-dimensional vectors, stored in an index, and queried by vector retrieval to implement an external knowledge base with LangChain. To compensate for the poor retrieval of proper nouns, BM25 retrieval was combined with vector retrieval and the results were reranked to improve retrieval precision. [Results] A review generation system was built from the fine-tuned model and the hybrid retrieval framework; BLEU and ROUGE-L scores improved by 109.64% and 40.22%, respectively, and the authenticity score in manual evaluation improved by 62.17%. [Limitations] Owing to limited computational resources, the local model has a small parameter scale and its generation capability needs further improvement. [Conclusions] Using retrieval-augmented generation to exploit the strengths of large language models not only produces high-quality literature reviews but also provides evidence tracing for the generated content, assisting researchers in intelligent reading.
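The hybrid retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not use LangChain. The abstract does not name the embedding model, the index, or the reranking method, so everything below is an assumption made for illustration: a hashed character-trigram vector stands in for a real embedding model, reciprocal rank fusion (RRF) is swapped in as the reranking step, and all function names are hypothetical.

```python
import math
import zlib
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (assumption: real systems use a proper tokenizer)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def embed(text, dim=2048):
    """Toy stand-in for an embedding model: L2-normalized hashed character trigrams."""
    vec = [0.0] * dim
    s = text.lower()
    for i in range(len(s) - 2):
        vec[zlib.crc32(s[i:i + 3].encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def dense_scores(query, docs):
    """Cosine similarity between the query vector and each document vector."""
    q = embed(query)
    return [sum(a * b for a, b in zip(q, embed(d))) for d in docs]


def ranking(scores):
    """Document indices sorted from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def hybrid_search(query, docs, top_k=3, rrf_k=60):
    """Fuse the BM25 and dense rankings with reciprocal rank fusion."""
    fused = [0.0] * len(docs)
    for scores in (bm25_scores(query, docs), dense_scores(query, docs)):
        for rank, idx in enumerate(ranking(scores), start=1):
            fused[idx] += 1.0 / (rrf_k + rank)
    return [(docs[i], fused[i]) for i in ranking(fused)[:top_k]]


# Hypothetical mini-corpus; the paper's corpus is built from NSTL abstracts.
docs = [
    "Move recognition in abstracts of scientific papers",
    "BM25 keyword retrieval handles rare proper nouns well",
    "Fine-tuning a language model on review data",
]
results = hybrid_search("BM25 proper nouns", docs, top_k=2)
for text, score in results:
    print(f"{score:.4f}  {text}")
```

RRF is used here because it needs only the two rank orders, not comparable score scales. A learned cross-encoder reranker would be another plausible reading of "reranked".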

Authors: Song Mengpeng; Bai Haiyan (Institute of Scientific and Technical Information of China, Beijing 100038, China)

Affiliation: Institute of Scientific and Technical Information of China

Source: Data Analysis and Knowledge Discovery (Peking University Core Journal), 2025, No. 6, pp. 21-34 (14 pages)

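Funding: This work is one of the research outcomes of the Youth Project of the Innovation Research Fund of the Institute of Scientific and Technical Information of China (Project No. QN2024-15).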

Keywords: Large Language Model; Automatic Review; Retrieval-Augmented Generation

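Classification codes: TP391 [Automation and Computer Technology: Computer Application Technology]; G35 [Culture and Science: Information Science]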
