Abstract
Recognizing Chinese transliterations of Tibetan personal names is a form of person-name recognition, but existing person-name recognition methods do not fit the naming conventions of Tibetan names well. First, Tibetan names carry strong religious and cultural connotations, so their character (string) features and internal structure are complex. Second, Tibetan names contain many high-frequency single characters, which sharpens the ambiguity between Tibetan names and common words and blurs the boundary between a name and its context. Based on a survey of a large collection of Tibetan name instances and corpora, this paper statistically analyzes the character (string) usage of Tibetan names and builds a library of Tibetan name attribute features. Using the naming rules and attribute features to represent Tibetan names formally, it implements a system that automatically recognizes Chinese transliterations of Tibetan names. In an open test on a large-scale real corpus, the system achieves an F-measure of 87.12%.
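As a reading aid for the reported F-measure, the following is a minimal sketch of how precision, recall, and F are commonly computed for name recognition. It assumes the balanced F1 definition (beta = 1) and exact-match scoring of name spans; the abstract does not state which variant the authors used. The function name `f_measure` and the toy spans are hypothetical and are not taken from the paper.

```python
def f_measure(true_names, predicted_names, beta=1.0):
    """Return (precision, recall, F) for gold and predicted name spans.

    Assumption: a prediction counts as correct only on an exact span match.
    """
    true_set, pred_set = set(true_names), set(predicted_names)
    correct = len(true_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(true_set) if true_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Hypothetical toy data: (sentence_id, start, end) spans, not from the paper.
gold = {(0, 2, 6), (1, 0, 4), (2, 5, 9), (3, 1, 3)}
pred = {(0, 2, 6), (1, 0, 4), (2, 5, 8)}

p, r, f = f_measure(gold, pred)
print(f"precision={p:.4f} recall={r:.4f} F={f:.4f}")
```

In this toy case two of the three predicted spans match the four gold spans exactly, so the boundary error on the third span lowers both precision and recall. This is the kind of boundary ambiguity the abstract describes.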
Source
《情报学报》
CSSCI
PKU Core Journal (北大核心)
2009, No. 3, pp. 475-480 (6 pages)
Journal of the China Society for Scientific and Technical Information
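Funding
This work was supported by the National Natural Science Foundation of China (60572159) and the Key Project of Science and Technology Research of the Ministry of Education (107017).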
Keywords
recognition of Tibetan names
out-of-vocabulary words
reliability
automatic word segmentation