Abstract
To address the non-standard, repeated, and redundant base-class phrases that arise when Suffix Tree Clustering (STC) selects its base classes, an improved STC algorithm is proposed. First, a phrase mutual information algorithm improves base-class selection so that the chosen base-class phrases obey Uyghur grammar rules. Second, a phrase reduction (merging) algorithm based on Uyghur grammar merges the repeated base-class phrases. Finally, building on the first two steps, a phrase redundancy-removal algorithm based on Uyghur grammar eliminates the redundant base-class phrases. Experimental results show that the improved algorithm achieves higher recall and precision than traditional STC, which indicates that it effectively improves clustering performance.
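The abstract names a phrase mutual information step but does not give its formula. As a rough illustration only, the sketch below scores adjacent word pairs with standard pointwise mutual information (PMI) and keeps pairs above a threshold. The function names `bigram_pmi` and `select_base_phrases`, the restriction to two-word phrases, and the threshold are all assumptions made for this example, not the authors' method, and no Uyghur grammar rules are modelled.

```python
import math
from collections import Counter


def bigram_pmi(sentences):
    """Return {(w1, w2): PMI} for adjacent word pairs in tokenized sentences.

    Assumption: plain pointwise mutual information, log2(p(x, y) / (p(x) p(y))),
    estimated from raw counts. The paper's own definition may differ.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    pmi = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        pmi[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return pmi


def select_base_phrases(sentences, threshold=0.0):
    """Keep word pairs whose PMI exceeds a (hypothetical) threshold."""
    scores = bigram_pmi(sentences)
    return {pair for pair, score in scores.items() if score > threshold}
```

For a toy input such as `[["a", "b"], ["a", "b"], ["c", "d"], ["a", "d"]]`, the pair `("a", "b")` scores higher than `("a", "d")` because `"b"` occurs only after `"a"`, which is the kind of cohesion signal a mutual-information filter relies on.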
Source
《计算机应用》
CSCD
Peking University Core Journal
2012, No. 4, pp. 1078-1081 (4 pages)
Journal of Computer Applications
Funding
National Natural Science Foundation of China (60963017)
National Social Science Foundation of China (10BTQ045, 11XTQ007)
Xinjiang University Doctoral Fund (BS100120)
Keywords
Uyghur
Suffix Tree (ST)
Mutual Information (MI)
reduction
redundancy