Abstract
To reduce the amount of labeled target-domain data required for biomedical named entity recognition (NER), this paper recasts NER in biomedical text as a hidden Markov model (HMM) problem based on transfer learning. The target-domain data set does not need to be labeled extensively; instead, recognition and classification in the target domain are achieved through transfer learning. Labeled data from a different but related domain serve as an auxiliary data set, and a data gravitation method is used to evaluate how much each auxiliary sample contributes to learning in the target domain. Weights are then computed over the auxiliary and target-domain data sets for transfer learning. Based on this weight-learning model, the paper constructs BioTrHMM, a transfer-learning-based HMM algorithm. Experiments on the GENIA corpus show that BioTrHMM outperforms the traditional HMM algorithm and achieves good NER performance with only a small amount of labeled target-domain data.
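The abstract does not state the formulas behind the data gravitation weighting or the BioTrHMM estimation step, so the following is only a minimal sketch of the general idea of instance-weighted HMM training. It assumes each auxiliary sentence already carries a weight in [0, 1] (here hard-coded, not computed by data gravitation) and that target-domain sentences have weight 1.0. The function names `train_weighted_hmm` and `viterbi`, the add-k smoothing, and the toy sentences are all illustrative assumptions, not the authors' implementation.

```python
import math
from collections import defaultdict


def train_weighted_hmm(weighted_sentences, smoothing=0.1):
    """Collect weighted HMM counts from (sentence, weight) pairs.

    A sentence is a list of (token, tag) pairs. Its weight scales every
    count it contributes, so low-weight auxiliary sentences influence the
    model less than target-domain sentences. (Hypothetical helper.)
    """
    start = defaultdict(float)
    trans = defaultdict(lambda: defaultdict(float))
    emit = defaultdict(lambda: defaultdict(float))
    vocab = set()
    for sentence, weight in weighted_sentences:
        prev = None
        for token, tag in sentence:
            vocab.add(token)
            emit[tag][token] += weight
            if prev is None:
                start[tag] += weight
            else:
                trans[prev][tag] += weight
            prev = tag
    return {
        "tags": sorted(emit),
        "vocab": vocab,
        "start": dict(start),
        "trans": {t: dict(c) for t, c in trans.items()},
        "emit": {t: dict(c) for t, c in emit.items()},
        "smoothing": smoothing,
    }


def viterbi(tokens, model):
    """Return the most likely tag sequence under the weighted-count HMM."""
    tags, k = model["tags"], model["smoothing"]
    n_tags = len(tags)
    n_vocab = len(model["vocab"]) + 1  # one extra slot for unseen tokens

    def lp(counts, key, n_outcomes):
        # Add-k smoothed log-probability from a dict of weighted counts.
        total = sum(counts.values())
        return math.log((counts.get(key, 0.0) + k) / (total + k * n_outcomes))

    def step(prev_tag, tag):
        return lp(model["trans"].get(prev_tag, {}), tag, n_tags)

    best = [{t: lp(model["start"], t, n_tags)
                + lp(model["emit"][t], tokens[0], n_vocab) for t in tags}]
    back = []
    for token in tokens[1:]:
        scores, pointers = {}, {}
        for t in tags:
            p = max(tags, key=lambda q: best[-1][q] + step(q, t))
            scores[t] = best[-1][p] + step(p, t) + lp(model["emit"][t], token, n_vocab)
            pointers[t] = p
        best.append(scores)
        back.append(pointers)
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy data: one target-domain sentence (weight 1.0) and one auxiliary
# sentence with an assumed, hand-picked weight of 0.4.
target = [([("IL-2", "B-protein"), ("gene", "O"), ("expression", "O")], 1.0)]
auxiliary = [([("p53", "B-protein"), ("binds", "O"), ("DNA", "B-DNA")], 0.4)]
model = train_weighted_hmm(target + auxiliary)
predicted = viterbi(["IL-2", "gene"], model)
```

In this sketch the auxiliary sentence adds only 0.4 to each count it touches, which is one simple way sample-level weights could enter HMM parameter estimation; how BioTrHMM actually derives and applies its weights is described in the full paper, not in this abstract.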
Authors
高冰涛
张阳
刘斌
Gao Bingtao; Zhang Yang; Liu Bin (College of Information Engineering, Northwest A&F University, Yangling, Shaanxi 712100, China)
Source
《计算机应用研究》
CSCD
PKU Core Journal
2019, No. 1, pp. 45-48 (4 pages)
Application Research of Computers
Funding
National Natural Science Foundation of China (61602388)
Fundamental Research Funds for the Central Universities (2452015193, 2452015194, 2452016081)