Abstract
Texts in the field of Tibetan medicine are stored mainly in unstructured form, so information extraction from these texts plays an important role in mining Tibetan medical knowledge. To address the weak semantic representation and low nested-entity extraction accuracy of existing Tibetan entity relation extraction models, this paper presents an entity relation extraction method based on a pre-trained model. The TibetanAI_ALBERT_v2.0 pre-trained language model is used so that the model recognizes entities better, and a Span method is used to handle nested entities. On top of Dropout, a KL divergence term is added to the loss function to improve the model's generalization ability. Experiments on the TibetanAI_TMIE_v1.0 Tibetan medicine dataset show that precision, recall, and F1 reach 84.5%, 80.1%, and 82.2%, respectively, with F1 improving by 4.4 percentage points over the baseline, which demonstrates the effectiveness of the proposed method.
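The abstract names three ideas that a small sketch can make concrete: enumerating candidate spans so that nested entities can each be scored, adding a KL divergence term between two Dropout-perturbed predictions, and computing F1 from precision and recall. The Python below is only an illustration under stated assumptions. The function names are hypothetical, the symmetric (bidirectional) form of the KL term is an assumption rather than something this record specifies, and none of it is the paper's actual implementation.

```python
import math


def enumerate_spans(tokens, max_len):
    """Return every (start, end) token span of at most max_len tokens.

    `end` is exclusive. Overlapping and nested spans are all kept, which is
    what lets a span-based model score an inner and an outer entity separately.
    """
    spans = []
    for start in range(len(tokens)):
        last_end = min(start + max_len, len(tokens))
        for end in range(start + 1, last_end + 1):
            spans.append((start, end))
    return spans


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def symmetric_kl(p, q):
    """Average of KL(p || q) and KL(q || p).

    Assumption: p and q stand for the label distributions predicted for the
    same input by two forward passes with different Dropout masks. The
    symmetric form is an illustrative choice, not taken from the source.
    """
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Placeholder tokens; a real system would use Tibetan syllables or subwords.
    print(enumerate_spans(["t1", "t2", "t3"], max_len=2))
    # Two hypothetical predictions for one span under different Dropout masks.
    print(symmetric_kl([0.9, 0.1], [0.7, 0.3]))
    # Precision and recall as reported in the abstract.
    print(round(f1_score(84.5, 80.1), 1))
```

With the reported precision of 84.5% and recall of 80.1%, `f1_score` gives about 82.24, which rounds to the 82.2% stated in the abstract. In training, a term like `symmetric_kl` would typically be added to the task loss with a weighting coefficient; that coefficient is not given in this record.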
Authors
周青
拥措
拉毛东只
尼玛扎西
ZHOU Qing; YONG Tso; LAMAO Dongzhi; NYIMA Trashi (School of Information Science and Technology, Xizang University, Lhasa, Xizang 850000, China; Key Laboratory of Tibetan Information Technology and Artificial Intelligence of Tibet, Lhasa, Xizang 850000, China; Engineering Research Center for Tibetan Language Information Technology under the Ministry of Education, Lhasa, Xizang 850000, China)
Source
《中文信息学报》
CSCD
Peking University Core Journals (北大核心)
2024, No. 8, pp. 76-83 (8 pages)
Journal of Chinese Information Processing
Funding
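Project of the Science and Technology Department of the Xizang Autonomous Region (XZ202401JD0010)
Science and Technology Innovation 2030 Major Project on "New Generation Artificial Intelligence" (2022ZD0116100)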
Keywords
Tibetan medicine
entity relation extraction
pre-trained language model