Abstract
To evaluate the multifaceted capabilities of large language models, researchers have proposed numerous evaluation datasets. Although these datasets cover test data from dozens of disciplines, they lack data from the field of geology, making it impossible to assess the geological knowledge of large models. Using publicly released examination questions from geology-related programs at higher education institutions as the data source, this paper constructs Geo-Eval, an evaluation dataset covering 8 geological disciplines and containing 1,007 single-choice questions. Six Chinese large language models were evaluated on this dataset. The results show that the average accuracy of these models ranges from 46.4% to 65.4%, leaving a considerable gap between their geological knowledge and a good, let alone expert, level. Their performance is unsatisfactory in application scenarios that demand high accuracy in geological knowledge, but their breadth of knowledge is an advantage over human domain experts. In addition, models with over 100 billion parameters outperform billion-parameter-scale models. By constructing the Geo-Eval dataset, this paper addresses the lack of a geological evaluation dataset and achieves a quantitative assessment of the geological knowledge of large language models.
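The quantitative evaluation described above reduces to scoring single-choice predictions against gold answers, per discipline. Below is a minimal illustrative sketch of such scoring; the field names (`discipline`, `answer`, `prediction`) are assumptions for illustration, not the paper's actual Geo-Eval schema.

```python
# Hedged sketch: computing per-discipline accuracy on a single-choice
# benchmark such as Geo-Eval. The record fields used here are
# illustrative assumptions, not the dataset's actual schema.
from collections import defaultdict

def accuracy_by_discipline(items):
    """items: iterable of dicts with 'discipline', 'answer', 'prediction'."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for it in items:
        total[it["discipline"]] += 1
        if it["prediction"] == it["answer"]:
            correct[it["discipline"]] += 1
    # Accuracy per discipline = correct answers / total questions
    return {d: correct[d] / total[d] for d in total}

sample = [
    {"discipline": "mineralogy", "answer": "B", "prediction": "B"},
    {"discipline": "mineralogy", "answer": "C", "prediction": "A"},
    {"discipline": "petrology", "answer": "D", "prediction": "D"},
]
print(accuracy_by_discipline(sample))  # {'mineralogy': 0.5, 'petrology': 1.0}
```

Averaging these per-discipline scores (or pooling all 1,007 questions) yields the overall accuracy figures reported in the abstract.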
Authors
LIU Shunzheng, CHAI Xinxia, ZHOU Feng, WANG Chunning (National Geological Library of China, Beijing 100083, China)
Source
Natural Resources Informatization (《自然资源信息化》), 2025, Issue 4, pp. 49-55 (7 pages)
Keywords
large language model
evaluation dataset
geology
artificial intelligence