Abstract
Example-based machine translation (EBMT) is an efficient approach to machine translation. Quickly retrieving candidate examples similar to the sentence to be translated from a large-scale translation example base is one of the key techniques in EBMT research. Based on a statistical analysis of the letter distribution in Uyghur words, this paper constructs an inverted-index hash table keyed on Uyghur words; under the equal-probability assumption, its average search length is 1.59. To resolve hash collisions, the paper proposes a new algorithm, the second optimal tree for synonyms, which uses the frequency of the colliding words (synonyms, i.e. keys that hash to the same address) in a Uyghur corpus as weights. Experiments show that the algorithm outperforms the traditional sequential search and binary search approaches by 27.5% and 21.8%, respectively, indicating high retrieval efficiency in EBMT.
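The abstract does not spell out how the tree is built, so the following is only a minimal sketch of the general idea, assuming the textbook "nearly optimal search tree" construction: within one hash bucket, the colliding keys are sorted, and the root of each subtree is the key that best balances the total frequency on its left and right. The names `Node`, `build_nearly_optimal_tree`, `search`, and `weighted_average_search_length`, as well as the sample keys and frequencies, are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    key: str
    weight: int  # assumed: corpus frequency of the word
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_nearly_optimal_tree(items: List[Tuple[str, int]]) -> Optional[Node]:
    """Build a weight-balanced search tree from (key, frequency) pairs.

    Illustrative heuristic only; not necessarily the paper's exact algorithm.
    """
    items = sorted(items)  # search-tree order: sort by key
    prefix = [0]  # prefix[i] = total weight of items[0:i]
    for _, w in items:
        prefix.append(prefix[-1] + w)

    def build(lo: int, hi: int) -> Optional[Node]:
        # Build a subtree over the half-open range items[lo:hi].
        if lo >= hi:
            return None
        # Pick the root that minimizes |left weight - right weight|.
        best = min(
            range(lo, hi),
            key=lambda i: abs((prefix[i] - prefix[lo]) - (prefix[hi] - prefix[i + 1])),
        )
        key, weight = items[best]
        return Node(key, weight, build(lo, best), build(best + 1, hi))

    return build(0, len(items))


def search(root: Optional[Node], key: str) -> Tuple[bool, int]:
    """Return (found, number of key comparisons made)."""
    comparisons = 0
    node = root
    while node is not None:
        comparisons += 1
        if key == node.key:
            return True, comparisons
        node = node.left if key < node.key else node.right
    return False, comparisons


def weighted_average_search_length(root: Optional[Node]) -> float:
    """Frequency-weighted average number of comparisons for successful lookups."""
    total_weight = 0
    total_cost = 0
    stack = [(root, 1)]
    while stack:
        node, depth = stack.pop()
        if node is None:
            continue
        total_weight += node.weight
        total_cost += node.weight * depth
        stack.append((node.left, depth + 1))
        stack.append((node.right, depth + 1))
    return total_cost / total_weight if total_weight else 0.0


# Hypothetical bucket of five colliding keys; "c" is by far the most frequent.
bucket = [("a", 1), ("b", 1), ("c", 10), ("d", 1), ("e", 1)]
tree = build_nearly_optimal_tree(bucket)
```

In this sketch the frequent key `"c"` ends up at the root, so it is found in a single comparison, whereas a sequential scan of the same bucket in key order would need three. This is the intuition behind weighting colliding keys by corpus frequency; the 27.5% and 21.8% figures reported in the abstract come from the paper's own experiments, not from this example.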
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
PKU Core Journal (北大核心)
2009, No. 4, pp. 124-128 (5 pages)
Funding
National Natural Science Foundation of China (Grant No. 60663006)
Keywords
computer application
Chinese information processing
EBMT
hash
average search length
second optimal tree