Abstract
Occupation is a representative feature of person entities and can effectively distinguish between them. Traditional name disambiguation algorithms treat occupation as just another ordinary feature and overlook its importance. To address this problem, this paper proposes a name disambiguation algorithm based on occupation features. First, a basic occupation dictionary is built manually from internet sources. Second, with all Chinese Wikipedia pages as the training corpus, the basic occupation dictionary is extended using the word activation force model to obtain an occupation feature dictionary. Third, occupation features are extracted from the text, and person names and work titles are extracted as supplementary features to compensate for texts in which occupation features are missing and for cases in which one person has multiple occupations. Finally, name disambiguation is performed by agglomerative hierarchical clustering. Experiments on the CLP2010 person name disambiguation training corpus show that the proposed algorithm disambiguates person names effectively.
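The final clustering step can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, linkage criterion, or stopping threshold, so the Jaccard similarity, average linkage, and the threshold value of 0.3 used here are assumptions chosen only for illustration, as are the function names and the toy feature sets. Each document mentioning the ambiguous name is represented by the set of occupation, person-name, and work-title features extracted from it.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two feature sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_link(c1, c2, features):
    """Mean pairwise similarity between the documents of two clusters."""
    total = sum(jaccard(features[i], features[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def cluster_mentions(features, threshold=0.3):
    """Agglomerative clustering of documents by their feature sets.

    Starts with one cluster per document and repeatedly merges the most
    similar pair until no pair reaches `threshold` (an assumed value).
    Returns a list of clusters, each a list of document indices.
    """
    clusters = [[i] for i in range(len(features))]
    while len(clusters) > 1:
        best_sim, (a, b) = max(
            (average_link(clusters[x], clusters[y], features), (x, y))
            for x, y in combinations(range(len(clusters)), 2)
        )
        if best_sim < threshold:
            break
        # a < b always holds, so deleting index b leaves index a valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical feature sets for four documents that mention the same name.
example_features = [
    {"professor", "physicist"},
    {"physicist", "researcher"},
    {"singer", "actor"},
    {"singer", "album_title"},
]
result = cluster_mentions(example_features)
```

Each resulting cluster is taken to correspond to one real-world person. In this toy input, the two documents sharing "physicist" and the two sharing "singer" are grouped separately; a real system would need a similarity measure and threshold tuned on the training corpus.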
Source
Journal of Information Engineering University (《信息工程大学学报》)
2016, No. 5, pp. 548-554 (7 pages)
Funding
Supported by the National Social Science Fund of China (14BXW028)
Keywords
occupational characteristics
affinity
name disambiguation
word activation force
agglomerative hierarchical clustering