
Nuclear physics AI research assistant and arXiv vector database

Cited by: 1
Abstract: Faced with the exponential growth of interdisciplinary scientific literature and the limitations of existing retrieval systems, this study builds on a dataset of 2.66 million arXiv papers to develop an intelligent system that combines vector-based semantic retrieval with large language model (LLM) analysis. A vector database of papers provides an initial screening by semantic similarity, and LLM contextual reasoning then refines the ranking, which addresses both the semantic gap of traditional keyword search and the hallucination problem of LLMs. An application in nuclear physics shows that the system can accurately locate cross-disciplinary solutions: compared with keyword retrieval and vector similarity retrieval on a specific task, recall for the top 10 papers rises from 10% to 60% and precision from 20% to 90%. The project is open source and provides three core modules: 1) a vector database of the full paper collection; 2) an intelligent retrieval optimization framework (including agents for query generation, relevance analysis, and other tasks); 3) a PDF deep-parsing toolchain. By combining semantic retrieval with LLM reasoning, this study offers a scalable solution to the research challenges of the knowledge explosion era (open-source repository: https://gitee.com/lgpang/arxiv_vectordb).

[Background] The exponential growth of scientific literature, particularly in physics and nuclear physics, makes it difficult for researchers to track advances and identify cross-disciplinary solutions. While large language models (LLMs) offer potential for intelligent retrieval, their reliability is hindered by inaccuracies and hallucinations. The arXiv dataset (2.66 million papers) provides an unprecedented resource to address these challenges.

[Purpose] This study aims to develop a hybrid retrieval system that integrates vector-based semantic search with LLM-driven contextual analysis to enhance the accuracy and accessibility of scientific knowledge across disciplines.

[Methods] We processed the titles and abstracts of 2.66 million arXiv papers with the BGE-M3 model to generate 1024-dimensional vector representations. Cosine similarity was computed between user queries (vectorized with the same model) and the pre-encoded paper vectors to produce a preliminary semantic ranking. The top 50 candidates then underwent contextual relevance analysis by DeepSeek-r1, which evaluated technical depth, methodological alignment, and cross-domain connections through multi-step reasoning. A nuclear physics case study validated the system using 1000 AI-human-annotated documents. The framework incorporates four specialized agents: query generation, relevance scoring, structured data correction, and PDF analysis.

[Results] We constructed a vector database comprising 2.66 million arXiv papers (titles and abstracts), occupying 30 GB of disk space. Our vector-based semantic search system demonstrated superior performance in a nuclear physics query benchmark, achieving 90% precision and 60% recall for the top-10 retrieved documents. This significantly outperformed traditional keyword-based search, which yielded only 20% precision and 10% recall under the same evaluation conditions.

[Conclusions] By combining vector semantics with LLM reasoning, this work establishes a new paradigm for scientific knowledge retrieval that effectively bridges disciplinary divides. The open-sourced system (https://gitee.com/lgpang/arxiv_vectordb) provides researchers with scalable tools to navigate literature complexity, and is particularly valuable for identifying non-obvious interdisciplinary connections.
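The cosine-similarity ranking and the top-k precision and recall figures quoted in the abstract can be illustrated with a minimal sketch. This is not the author's implementation: the function names, paper IDs, and 3-dimensional toy vectors are hypothetical stand-ins for the 1024-dimensional BGE-M3 embeddings, and the LLM re-ranking step is omitted.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          paper_vecs: Dict[str, Sequence[float]],
          k: int) -> List[Tuple[str, float]]:
    """Rank all papers by cosine similarity to the query and keep the best k."""
    scored = [(pid, cosine_similarity(query_vec, vec))
              for pid, vec in paper_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


def precision_recall_at_k(ranked_ids: Sequence[str],
                          relevant: Set[str],
                          k: int) -> Tuple[float, float]:
    """Precision = hits / retrieved; recall = hits / all relevant documents."""
    top = list(ranked_ids[:k])
    hits = sum(1 for pid in top if pid in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical toy embeddings (real ones would come from an embedding model).
    papers = {
        "paper_a": [1.0, 0.0, 0.0],
        "paper_b": [0.8, 0.6, 0.0],
        "paper_c": [0.0, 1.0, 0.0],
        "paper_d": [0.0, 0.0, 1.0],
    }
    query = [1.0, 0.1, 0.0]
    relevant = {"paper_a", "paper_c"}  # hypothetical human annotations

    ranked = [pid for pid, _ in top_k(query, papers, k=2)]
    print(ranked, precision_recall_at_k(ranked, relevant, k=2))
```

In a pipeline like the one described, the ranked list from `top_k` (with k = 50) would be passed to an LLM for relevance analysis before the final top-10 metrics are computed.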
Author: PANG Longgang (庞龙刚) (Key Laboratory of Quark and Lepton Physics (MOE) & Institute of Particle Physics, Central China Normal University, Wuhan 430079, China; Artificial Intelligence and Computational Physics Research Center, Central China Normal University, Wuhan 430079, China)
Source: Nuclear Techniques (《核技术》), PKU Core Journal, 2025, Issue 5, pp. 84-94 (11 pages)
Funding: National Natural Science Foundation of China (No. 12075098, No. 12435009, No. CCNU24JC011); Fundamental Research Funds for the Central Universities, Central China Normal University.
Keywords: arXiv vector database; Large language model agent; DeepSeek; AI scientist
