Abstract
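In text classification based on the vector space model, feature vectors are high-dimensional and extremely sparse, so the dimensionality of the vector space must be reduced to improve classification efficiency. Building on statistical feature selection, which first reduces the dimensionality of the feature space, this work applies latent semantic analysis to mine the semantic information among document features and uses matrix singular value decomposition (SVD) to reduce the dimensionality further. Experimental results show that the macro-averaged F1 of the classification results improves by about 5%, which confirms the effectiveness of the method.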
The feature vectors of Chinese Web pages are high-dimensional and very sparse, which makes reducing the dimensionality of the feature space a key problem for practical text categorization. This paper describes a new method that combines statistical feature selection with latent semantic analysis. The K-Nearest Neighbor (KNN) method is used as the evaluating classifier. The experimental results show that the proposed method is promising for Chinese Web page categorization.
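The pipeline the abstract outlines (statistical feature selection, then latent semantic analysis via SVD, then KNN classification) can be illustrated with a minimal NumPy sketch. Several things here are assumptions, not details from the paper: the abstract says only "statistical methods", so the chi-square statistic stands in for the selection criterion; the toy term-frequency matrix, the term list, the rank `k=2`, the three neighbors, and the helper names `chi_square_scores`, `fit_lsa`, and `knn_predict` are all hypothetical.

```python
import numpy as np

def chi_square_scores(X, y):
    """Chi-square score of each term (column), maximized over classes.
    Assumption: the source does not say which statistic it uses."""
    present = X > 0
    n = X.shape[0]
    scores = np.zeros(X.shape[1])
    for c in np.unique(y):
        in_c = (y == c)[:, None]
        a = (present & in_c).sum(axis=0)     # term present, class c
        b = (present & ~in_c).sum(axis=0)    # term present, other classes
        cc = (~present & in_c).sum(axis=0)   # term absent, class c
        d = n - a - b - cc                   # term absent, other classes
        denom = (a + cc) * (b + d) * (a + b) * (cc + d)
        chi = np.where(denom > 0,
                       n * (a * d - cc * b) ** 2 / np.maximum(denom, 1),
                       0.0)
        scores = np.maximum(scores, chi)
    return scores

def fit_lsa(X, k):
    """Return a (terms x k) projection built from the top-k right singular vectors."""
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:k].T

def knn_predict(train, labels, query, n_neighbors=3):
    """Majority vote among the n_neighbors most cosine-similar training rows."""
    def unit(m):
        norms = np.linalg.norm(m, axis=-1, keepdims=True)
        return m / np.where(norms == 0, 1.0, norms)
    sims = unit(train) @ unit(query)
    nearest = np.argsort(-sims)[:n_neighbors]
    values, counts = np.unique(labels[nearest], return_counts=True)
    return values[np.argmax(counts)]

# Invented toy data: rows are documents, columns are term frequencies.
terms = ["goal", "match", "team", "stock", "market", "price", "the"]
X = np.array([
    [2, 1, 1, 0, 0, 0, 1],
    [1, 2, 0, 0, 0, 0, 1],
    [1, 0, 2, 0, 0, 0, 1],
    [0, 0, 0, 3, 1, 1, 1],
    [0, 0, 0, 1, 2, 0, 1],
    [0, 0, 0, 2, 0, 2, 1],
], dtype=float)
y = np.array(["sports"] * 3 + ["finance"] * 3)

# Step 1: keep the 6 highest-scoring terms (drops the uninformative "the").
keep = np.sort(np.argsort(-chi_square_scores(X, y))[:6])

# Step 2: project the selected features into a 2-dimensional latent space.
projection = fit_lsa(X[:, keep], k=2)
train_lsa = X[:, keep] @ projection

# Step 3: classify a new document in the latent space with KNN.
query = np.array([1, 0, 1, 0, 0, 0, 1], dtype=float)
predicted = knn_predict(train_lsa, y, query[keep] @ projection)
print(predicted)
```

The sketch applies the two reduction steps in the order the abstract gives: selection shrinks the vocabulary first, and SVD then compresses the remaining terms into a few latent dimensions in which KNN measures similarity. A new document must pass through the same `keep` mask and `projection` as the training data.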
Source
《计算机工程与应用》 (Computer Engineering and Applications)
CSCD
Peking University Core Journal (北大核心)
2007, Issue 24, pp. 169-171 (3 pages)
Funding
The Natural Science Foundation of Hebei Province (Grant No. F2006001020)
The Foundation of Education Bureau of Hebei Province (Grant No. 2005347)
The Foundation of Hebei University (Grant No. Y2004045)
Keywords
Web page categorization
latent semantic analysis
feature selection
KNN