Funding: Supported by the Innovation Fund for Medical Sciences of the Chinese Academy of Medical Sciences (2021-I2M-1-033) and the National Key Research and Development Program of China (2022YFF0711900).
Abstract: Recent advancements in large language models (LLMs) have driven remarkable progress in text processing, opening new avenues for medical knowledge discovery. In this study, we present ERQA, a mEdical knowledge Retrieval and Question-Answering framework powered by an enhanced LLM that integrates a semantic vector database and a curated literature repository. The ERQA framework applies domain-specific incremental pretraining followed by supervised fine-tuning on medical literature, enabling retrieval and question-answering (QA) tasks to be completed with high precision. Evaluations on the coronavirus disease 2019 (COVID-19) and TripClick datasets demonstrate the robust capabilities of ERQA across multiple tasks. On the COVID-19 dataset, ERQA-13B achieves state-of-the-art retrieval metrics, with normalized discounted cumulative gain at top 10 (NDCG@10) = 0.297, recall at top 10 (Recall@10) = 0.347, and mean reciprocal rank (MRR) = 0.370; it also attains strong abstract summarization performance, with a recall-oriented understudy for gisting evaluation (ROUGE)-1 score of 0.434, and strong QA performance, with a bilingual evaluation understudy (BLEU)-1 score of 7.851. The comparable performance achieved on the TripClick dataset further underscores the adaptability of ERQA across diverse medical topics. These findings suggest that ERQA represents a significant step toward efficient biomedical knowledge retrieval and QA.
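For readers unfamiliar with the retrieval metrics named in the abstract, the following is a minimal Python sketch of how Recall@10, MRR, and NDCG@10 are commonly computed. It is not the authors' evaluation code: the function names and the toy document IDs are hypothetical, and the NDCG variant shown (linear gain with a log2 rank discount) is an assumption, since the abstract does not state which formulation was used.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of all relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

def ndcg_at_k(ranked_ids, gains, k=10):
    """NDCG@k with linear gain (assumed variant).

    gains maps doc_id -> graded relevance; missing documents count as 0.
    """
    dcg = sum(
        gains.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Toy query (hypothetical IDs): the system returns four documents,
# and three documents are judged relevant, one of which is never retrieved.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}
binary_gains = {doc_id: 1 for doc_id in relevant}

print(recall_at_k(ranked, relevant))          # 2 of 3 relevant documents retrieved
print(reciprocal_rank(ranked, relevant))      # first relevant document is at rank 2
print(ndcg_at_k(ranked, binary_gains))
```

All three metrics lie in the range 0 to 1, and the reported values are averages over the queries in a test set. BLEU, by contrast, is often reported on a 0 to 100 scale, which would be consistent with the BLEU-1 score of 7.851 quoted above, although the abstract does not say which scale is used.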