Abstract
This paper proposes several optimizations to the Double-Array Trie algorithm for word segmentation. First, when a double-array trie is built from a standard trie, the nodes with the most child branches are processed first in order to reduce collisions. Second, a sequence of empty states is constructed. Third, colliding nodes are placed in a hash table, so that nodes do not have to be re-allocated. A Chinese word segmentation system was then built with these methods and compared with several other segmentation methods. The results show that the optimized double-array trie greatly improves insertion speed and space utilization, and that segmentation lookup efficiency also improves.
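The abstract gives no implementation details, so the following is only a minimal sketch of the first strategy (processing the pending node with the most children first), under stated assumptions. The function names `build_trie`, `build_double_array`, `contains`, and `segment` are hypothetical, the free-slot search is a plain linear scan rather than the paper's empty-state sequence, the hash-table overflow for colliding nodes is not reproduced, and forward maximum matching is assumed as the segmentation strategy.

```python
import heapq

def build_trie(words):
    """Plain trie: children[node] maps a character to a child node id."""
    children, terminal = [{}], set()
    for word in words:
        node = 0
        for ch in word:
            if ch not in children[node]:
                children.append({})
                children[node][ch] = len(children) - 1
            node = children[node][ch]
        terminal.add(node)
    return children, terminal

def build_double_array(words):
    """Convert the trie to base/check arrays, most-branching nodes first."""
    children, terminal = build_trie(words)
    alphabet = sorted({ch for w in words for ch in w})
    code = {ch: i + 1 for i, ch in enumerate(alphabet)}  # character codes start at 1
    # Index 0 is unused; the root lives at index 1. check[t] == 0 means "free".
    base, check, final = [0, 0], [0, 1], [False, False]
    pos = {0: 1}                                # trie node id -> array index
    heap = [(-len(children[0]), 0)]             # max-heap on child count
    while heap:
        _, node = heapq.heappop(heap)
        kids = children[node]
        if not kids:
            continue
        cs = [code[ch] for ch in kids]
        b = 1                                   # assumption: naive linear search
        while any(b + c < len(check) and check[b + c] != 0 for c in cs):
            b += 1
        grow = b + max(cs) + 1 - len(check)
        if grow > 0:
            base.extend([0] * grow)
            check.extend([0] * grow)
            final.extend([False] * grow)
        s = pos[node]
        base[s] = b
        for ch, child in kids.items():
            t = b + code[ch]
            check[t] = s                        # slot t now belongs to state s
            final[t] = child in terminal
            pos[child] = t
            heapq.heappush(heap, (-len(children[child]), child))
    return base, check, final, code

def step(da, s, ch):
    """Follow one transition; return the next state or None."""
    base, check, _, code = da
    c = code.get(ch)
    if c is None:
        return None
    t = base[s] + c
    return t if t < len(check) and check[t] == s else None

def contains(da, word):
    s = 1
    for ch in word:
        s = step(da, s, ch)
        if s is None:
            return False
    return da[2][s]

def segment(da, text):
    """Forward maximum matching; unknown characters become single tokens."""
    out, i = [], 0
    while i < len(text):
        s, end = 1, i + 1
        for j in range(i, len(text)):
            s = step(da, s, text[j])
            if s is None:
                break
            if da[2][s]:
                end = j + 1
        out.append(text[i:end])
        i = end
    return out

da = build_double_array(["中国", "中国人", "人民", "分词", "词典"])
print(segment(da, "中国人民"))  # ['中国人', '民']
```

Ordering the pending nodes by child count places the hardest-to-fit sibling sets while the arrays are still sparse, which is the intuition behind the collision reduction the abstract reports. The speed and space figures themselves depend on the parts not sketched here.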
Source
Journal of Hunan University (Natural Sciences)
EI
CAS
CSCD
Peking University Core Journals
2009, No. 5, pp. 77-80 (4 pages)
Funding
Supported by the Key Project of Science and Technology Research of the Ministry of Education (106458)
Keywords
natural language processing systems
double-array
trie
lexicon
word segmentation