Research on open-source large language models (LLMs) has made significant progress, but most studies focus predominantly on general-purpose English data, which poses challenges for LLM research in Chinese education. To address this, this work first reviews and synthesizes the core technologies of representative open-source LLMs and designs an advanced 1.5B-parameter LLM tailored for the Chinese education field. The Chinese education large language model (CELLM) is trained from scratch in two stages: pre-training and instruction fine-tuning. In the pre-training phase, an open-source dataset for the Chinese education domain is used. In the instruction fine-tuning stage, a Chinese instruction dataset comprising over 258,000 entries is developed and open-sourced. Finally, the results and analysis of CELLM across multiple evaluation datasets are presented, providing a reference baseline for future research. All models, data, and code are open-sourced to foster community research on LLMs in the Chinese education domain.
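The instruction fine-tuning stage mentioned above typically requires flattening each instruction entry into a single training string. The sketch below illustrates one common way to do this; the field names ("instruction", "input", "output") and the prompt template are illustrative assumptions, not the dataset's actual schema.

```python
# Minimal sketch of preparing instruction fine-tuning examples.
# The field names ("instruction", "input", "output") and the prompt
# template below are illustrative assumptions, not the paper's schema.

def build_training_text(entry: dict) -> str:
    """Flatten one instruction entry into a single training string."""
    prompt = entry["instruction"]
    if entry.get("input"):
        prompt += "\n" + entry["input"]
    return f"### Instruction:\n{prompt}\n### Response:\n{entry['output']}"

example = {
    "instruction": "Explain the Pythagorean theorem to a middle-school student.",
    "input": "",
    "output": "In a right triangle, a^2 + b^2 = c^2, where c is the hypotenuse.",
}
print(build_training_text(example))
```

Each of the 258,000+ entries would be rendered this way (and tokenized) before being fed to the causal language-modeling objective.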