Abstract
[Objective] This work addresses the scarcity of supervised data for classical Chinese entity extraction by using knowledge distillation to inject additional knowledge from unsupervised sources, in the form of training data, into a student entity extraction model. [Methods] A large language model serves as a generative knowledge teacher and performs knowledge distillation on unsupervised corpora. In addition, a dictionary knowledge teacher, built from the supervised ZuoZhuan and GuNer data, distills dictionary knowledge. The knowledge from both teachers is combined into a semi-supervised dataset for classical Chinese entity extraction; the task is reformulated as a sequence-to-sequence problem, and pre-trained models such as mT5 and UIE are fine-tuned on this dataset. [Results] On the ZuoZhuan and GuNer datasets, the method extracts four entity types with F1-scores of 89.15% and 95.47%, respectively, outperforming the baseline models SikuBERT and SikuRoBERTa, which were incrementally fine-tuned on classical Chinese corpora, by 8.15 and 9.27 percentage points. [Limitations] The method does not incorporate additional entity information, and results are constrained by the quality of the data generated by the LLM. [Conclusions] In low-resource settings, the proposed approach effectively distills the knowledge advantages of pre-trained large language models and dictionary resources into the student entity extraction model, significantly improving classical Chinese entity extraction.
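The sequence-to-sequence reformulation described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact scheme: the task prefix, the "type: span" target format, and the example sentence are all assumptions introduced here for clarity. The idea is that each annotated sentence becomes a (source, target) text pair on which an mT5/UIE-style generative model can be fine-tuned.

```python
def to_seq2seq_pair(text, entities):
    """Turn one annotated sentence into a seq2seq training pair.

    text: raw classical Chinese sentence.
    entities: list of (entity_type, span) tuples found in `text`.
    Returns (source, target) strings for generative fine-tuning.
    """
    source = f"抽取实体: {text}"  # task prefix (assumed, not from the paper)
    # Linearize entities as "type: span" records; empty list -> "无实体".
    target = "; ".join(f"{t}: {s}" for t, s in entities) or "无实体"
    return source, target


# Hypothetical example: the opening of a ZuoZhuan-style sentence with
# person (人名) and place (地名) annotations.
src, tgt = to_seq2seq_pair(
    "晋侯、秦伯围郑",
    [("人名", "晋侯"), ("人名", "秦伯"), ("地名", "郑")],
)
# src → "抽取实体: 晋侯、秦伯围郑"
# tgt → "人名: 晋侯; 人名: 秦伯; 地名: 郑"
```

Pairs of this shape can then be fed to any text-to-text model; at inference time the generated target string is parsed back into typed entity spans.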
Authors
Tang Chao
Chen Bo
Tan Zelin
Zhao Xiaobing
Tang Chao; Chen Bo; Tan Zelin; Zhao Xiaobing (School of Philosophy and Religious Studies, Minzu University of China, Beijing 100081, China; Institute of National Security, Minzu University of China, Beijing 100081, China; School of Information Engineering, Minzu University of China, Beijing 100081, China; School of Chinese Ethnic Minority Languages and Literatures, Minzu University of China, Beijing 100081, China; National Language Resource Monitoring and Research Center of Minority Languages, Beijing 100081, China)
Source
Data Analysis and Knowledge Discovery (《数据分析与知识发现》)
Peking University Core Journal
2025, Issue 7, pp. 118-129 (12 pages)
Funding
Supported by the National Social Science Fund of China (Grant No. 22&ZD035).
Keywords
Named Entity Recognition
Semi-supervised Learning
LLMs
Knowledge Distillation