摘要
传统K最近邻一个明显缺陷是样本相似度的计算量很大,在具有大量高维样本的文本分类中,由于复杂度太高而缺乏实用性。为此,将粗糙集理论引入到文本分类中,利用上下近似概念刻画各类训练样本的分布,并在训练过程中计算出各类上下近似的范围。在分类过程中根据待分类文本向量在样本空间中的分布位置,改进算法可以直接判定一些文本的归属,缩小K最近邻搜索范围。实验表明,该算法可以在保持K最近邻分类性能基本不变的情况下,显著提高分类效率。
The traditional K Nearest Neighbor(KNN) has a fatal defect that time of similarity computing is huge. For text classification task with high dimension and huge samples, it has extremely complexity. This is not practicable for real applications. In this paper, rough set theory is introduced into classification process. The distribution of training samples is described with the concepts of upper approximation and lower approximation and also the range of upper approximation space and lower approximation space of each class are computed in the training process. According to the position of the documents in the sample space, this algorithm can label some documents directly. It reduces the searching range of KNN of some documents in the classification process. The results of experiments show that this algorithm can save largely the classification time and has almost the same classification performance as that of the traditional KNN classification algorithm.
出处
《计算机工程》
CAS
CSCD
北大核心
2010年第24期175-177,共3页
Computer Engineering
基金
国家自然科学基金资助项目(60775036
60475019)
博士学科点专项科研基金资助项目(20060247039)
关键词
文本分类
K最近邻
粗糙集
text classification: K Nearest Neighbor(KNN)
rough set