MathEval: A Comprehensive Benchmark for Evaluating Large Language Models on Mathematical Reasoning Capabilities (cited by: 2)
Authors: Tianqiao Liu, Zui Chen, Zhensheng Fang, Weiqi Luo, Mi Tian, Zitao Liu. Frontiers of Digital Education, 2025, No. 2, pp. 125-143 (19 pages).
Mathematical reasoning is a fundamental aspect of intelligence, encompassing a spectrum from basic arithmetic to intricate problem-solving. Recent investigations into the mathematical abilities of large language models (LLMs) have yielded inconsistent and incomplete assessments. In response, we introduce MathEval, a comprehensive benchmark designed to methodically evaluate the mathematical problem-solving proficiency of LLMs across various contexts, adaptation strategies, and evaluation metrics. MathEval consolidates 22 distinct datasets, encompassing a broad spectrum of mathematical disciplines, languages (including English and Chinese), and problem categories (ranging from arithmetic and competitive mathematics to higher mathematics), with difficulty varying from elementary to advanced. To address the complexity of mathematical reasoning outputs and adapt to diverse models and prompts, we employ GPT-4 as an automated pipeline for answer extraction and comparison. Additionally, we trained a publicly available DeepSeek-LLM-7B-Base model using GPT-4 results, enabling precise answer validation without requiring GPT-4 access. To mitigate potential test data contamination and truly gauge progress, MathEval incorporates an annually refreshed set of problems from the latest Chinese National College Entrance Examination (Gaokao-2023, Gaokao-2024), thereby benchmarking genuine advancements in mathematical problem-solving skills.
Keywords: mathematical reasoning; large language models; benchmark; answer grading
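The abstract describes an LLM-based grading pipeline: a grader model first extracts the final answer from a solution's free-form output, then judges whether it matches the reference answer. Below is a minimal illustrative sketch of that two-step idea, not the authors' released code; the prompt wording, the field names, and the use of the OpenAI Python SDK with a "gpt-4" grader are assumptions made for illustration. In principle, the paper's fine-tuned DeepSeek-LLM-7B-Base validator could be substituted as the grading backend.

# Hypothetical sketch of LLM-based answer extraction and comparison.
# Not the MathEval authors' actual pipeline; prompts and field names are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

EXTRACT_PROMPT = (
    "Below is a model's solution to a math problem.\n"
    "Extract only the final answer, with no explanation.\n\n"
    "Problem: {question}\n\nSolution: {model_output}\n\nFinal answer:"
)

COMPARE_PROMPT = (
    "Are the following two answers to the same math problem "
    "mathematically equivalent? Reply with exactly 'yes' or 'no'.\n\n"
    "Answer A: {extracted}\nAnswer B: {reference}"
)

def call_grader(prompt: str, model: str = "gpt-4") -> str:
    """Send a single-turn prompt to the grading model and return its text reply."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic grading
    )
    return resp.choices[0].message.content or ""

def grade_example(question: str, model_output: str, reference_answer: str) -> bool:
    """Return True if the extracted final answer is judged equivalent to the reference."""
    extracted = call_grader(
        EXTRACT_PROMPT.format(question=question, model_output=model_output)
    ).strip()
    verdict = call_grader(
        COMPARE_PROMPT.format(extracted=extracted, reference=reference_answer)
    ).strip().lower()
    return verdict.startswith("yes")

Separating extraction from comparison mirrors the paper's stated motivation: free-form reasoning outputs vary widely across models and prompts, so a fixed string match is unreliable and an LLM judge handles the normalization.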