Abstract
To address the automatic classification of Uyghur documents, a classification scheme based on an extensive similarity measure and K-means clustering is proposed. Uyghur documents are first preprocessed, and the term frequency-inverse document frequency (TF-IDF) algorithm is used to obtain a keyword set. The proposed extensive similarity measure then computes the similarity between two documents by also taking into account their distances to the other documents in the corpus. A cluster distance matrix is built from the extensive similarity to obtain a set of base clusters. The centers of the base clusters serve as the initial centers for K-means clustering, which completes the clustering of all documents. Experimental results show that the scheme achieves relatively high classification accuracy and low computation time.
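The pipeline summarized in the abstract (TF-IDF keywords, extensive similarity, base clusters, K-means seeded from the base-cluster centers) can be illustrated with a minimal Python sketch. The abstract does not give the formula for the extensive similarity, so the definition below is an assumption: two documents are compared only if their cosine similarity exceeds a hypothetical threshold `theta`, and their distance is the number of corpus documents on which their thresholded similarities disagree. The function names, the toy English tokens, the greedy merging step, and the threshold value are all illustrative and are not taken from the paper.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized documents into sparse {term: TF-IDF weight} dicts."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return [{t: (c / len(doc)) * math.log(n / df[t])
             for t, c in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def extensive_distance(vectors, theta):
    """Assumed definition: for a similar pair, count the documents on which
    the two thresholded similarity profiles disagree; dissimilar pairs get None."""
    n = len(vectors)
    far = [[int(i != j and cosine(vectors[i], vectors[j]) <= theta)
            for j in range(n)] for i in range(n)]
    return [[sum(abs(far[i][k] - far[j][k]) for k in range(n))
             if not far[i][j] else None
             for j in range(n)] for i in range(n)]

def base_clusters(dist, max_dist=0):
    """Greedy merge (a simplification) of documents whose distance <= max_dist."""
    labels = list(range(len(dist)))
    for i in range(len(dist)):
        for j in range(i + 1, len(dist)):
            if dist[i][j] is not None and dist[i][j] <= max_dist:
                old, new = labels[j], labels[i]
                labels = [new if l == old else l for l in labels]
    clusters = {}
    for idx, l in enumerate(labels):
        clusters.setdefault(l, []).append(idx)
    return list(clusters.values())

def centroid(vectors):
    """Mean of a list of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)  # adds the weights term by term
    return {t: w / len(vectors) for t, w in total.items()}

def kmeans(vectors, initial_centers, iterations=10):
    """K-means with cosine similarity, seeded with the given centers."""
    centers = list(initial_centers)
    assign = []
    for _ in range(iterations):
        assign = [max(range(len(centers)), key=lambda c: cosine(v, centers[c]))
                  for v in vectors]
        for c in range(len(centers)):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centers[c] = centroid(members)
    return assign

# Toy corpus of already-preprocessed token lists (placeholder tokens).
docs = [
    ["sport", "team", "match", "goal"],
    ["sport", "team", "goal", "league"],
    ["match", "team", "league", "sport"],
    ["bank", "market", "price", "trade"],
    ["market", "price", "bank", "stock"],
    ["trade", "stock", "market", "bank"],
]
vectors = tfidf_vectors(docs)
groups = base_clusters(extensive_distance(vectors, theta=0.1))
labels = kmeans(vectors, [centroid([vectors[i] for i in g]) for g in groups])
print(groups, labels)
```

Seeding K-means with base-cluster centers, rather than random points, also fixes the number of clusters from the data, which is presumably one source of the lower computation time the abstract reports.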
Source
Computer Engineering and Design (《计算机工程与设计》)
Peking University Core Journal index (北大核心)
2017, Issue 6, pp. 1686-1691 (6 pages)
Funding
Natural Science Foundation of Xinjiang Uygur Autonomous Region, research fund project (2015211A016)