Abstract
The k-nearest neighbor (k-NN) method is a simple and effective algorithm for text categorization. However, k-NN search requires intensive similarity computation, and when the training set is very large an exhaustive search of the whole set is impractical. Speeding up the search for the k nearest neighbors is therefore the key to making k-NN categorization usable in practice. This paper presents a fast text categorization approach based on k-NN that classifies text documents quickly and effectively even when searching a massive training set. Experimental results show that the approach performs significantly better than the traditional method.
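The cost that motivates this work can be seen in a minimal sketch of the conventional exhaustive k-NN classifier. This is not the paper's accelerated method, which the abstract does not detail. The bag-of-words term counts, cosine similarity, majority vote, and the names `vectorize`, `cosine`, `knn_classify`, and `TRAINING_SET` are all illustrative assumptions.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn a document into a sparse term-frequency vector (bag of words)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(query, training_set, k=3):
    """Label `query` by majority vote among its k most similar training documents.

    Every training document is compared with the query, so a single
    classification costs one similarity computation per training document.
    """
    q = vectorize(query)
    scored = [(cosine(q, vectorize(doc)), label) for doc, label in training_set]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus, used only to exercise the sketch.
TRAINING_SET = [
    ("the team won the football match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("football fans cheered the goal", "sports"),
    ("the central bank raised interest rates", "finance"),
    ("stock markets fell as rates rose", "finance"),
    ("investors sold bank stock", "finance"),
]

if __name__ == "__main__":
    print(knn_classify("a football goal", TRAINING_SET, k=3))
```

Because the list comprehension in `knn_classify` touches every training document, the work per query grows linearly with the size of the training set. An index over the document vectors, such as the multidimensional index named in the keywords, is one way to avoid scanning the whole set.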
Source
Journal of the Graduate School of the Chinese Academy of Sciences
CAS
CSCD
2005, No. 5, pp. 554-559 (6 pages)
Keywords
text categorization, k-Nearest Neighbor (k-NN), multidimensional index, similarity retrieval