Abstract
This paper reviews the basic principle of the traditional KNN classifier and its shortcomings. To address the sharp growth in similarity computations as the number of samples and the dimensionality increase, it proposes an improved KNN classification algorithm based on a dimension index table. The algorithm builds a dimension index table over feature terms to speed up the search for the K nearest neighbors. Using news documents from the text categorization corpus of Sogou Natural Language Lab as the experimental data and the macro-averaged F-measure as the evaluation criterion, the improved KNN method is compared experimentally with the traditional KNN method. The results show that the proposed method greatly reduces the number of similarity computations needed to find the K nearest neighbors.
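The abstract does not spell out how the dimension index table is built, so the following is only a minimal sketch under stated assumptions: documents are sparse term-weight vectors with non-negative weights, the index maps each feature dimension to the training documents that have a nonzero weight in it, and similarity is cosine. The function names (`build_dimension_index`, `knn_classify`) and the toy data are hypothetical and are not taken from the paper. Under these assumptions, a training document that shares no dimension with the query has similarity 0, so it can be skipped without computing anything.

```python
from collections import Counter, defaultdict
from math import sqrt


def build_dimension_index(train_vectors):
    """Map each feature dimension to the ids of training docs with a nonzero weight in it."""
    index = defaultdict(set)
    for doc_id, vec in enumerate(train_vectors):
        for dim, weight in vec.items():
            if weight != 0:
                index[dim].add(doc_id)
    return index


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(dim, 0.0) for dim, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(query, train_vectors, labels, k, index=None):
    """Return (predicted_label, number_of_similarity_computations)."""
    if index is None:
        # Traditional KNN: compare the query with every training document.
        candidates = range(len(train_vectors))
    else:
        # Assumed indexed variant: only documents sharing a nonzero dimension.
        candidates = set()
        for dim, weight in query.items():
            if weight != 0:
                candidates |= index.get(dim, set())
    scored = sorted(((cosine(query, train_vectors[i]), i) for i in candidates),
                    reverse=True)
    votes = Counter(labels[i] for _, i in scored[:k])
    predicted = votes.most_common(1)[0][0] if votes else None
    return predicted, len(scored)


# Hypothetical toy data, for illustration only.
train = [
    {"stock": 1.0, "market": 1.0},
    {"market": 1.0, "bank": 1.0},
    {"match": 1.0, "goal": 1.0},
    {"goal": 1.0, "team": 1.0},
]
labels = ["finance", "finance", "sports", "sports"]
query = {"goal": 1.0, "team": 0.5}

index = build_dimension_index(train)
indexed_result = knn_classify(query, train, labels, k=1, index=index)
brute_result = knn_classify(query, train, labels, k=1)
```

On this toy data both variants predict the same label, but the indexed variant computes 2 similarities where the exhaustive one computes 4. If fewer than K candidates share a dimension with the query, this sketch simply votes over the candidates it has; how the paper handles that case is not stated in the abstract.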
Source
Information Studies: Theory & Application (《情报理论与实践》)
CSSCI
Peking University Core Journal (北大核心)
2014, Issue 5, pp. 102-106 (5 pages)
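Funding
One of the outcomes of the National Natural Science Foundation of China project "Multidisciplinary Collaborative Modeling Theory and Experimental Research for Text Classification"
Grant No. 71373291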
Keywords
text categorization
dimension index table
vector space model
classification algorithm