Hypothesis Test-based Feature Selection for Text Categorization (基于假设检验的文本分类特征选择)
Abstract: In a T-C (term-category) two-way fourfold contingency table, a feature and a document category are mutually independent if and only if they are uncorrelated. Building on this equivalence, the paper applies two novel hypothesis tests of independence to measure how strongly each feature is associated with the document categories, and uses these measures to select, from the feature space of the text collection, a feature subset that is highly representative of document content for text categorization. Experimental results show that applying hypothesis testing to feature selection helps improve categorization performance.
Source: Information and Control (《信息与控制》), indexed in CSCD and the Peking University Core Journal list, 2011, No. 3, pp. 273-277 (5 pages)
Funding: National Natural Science Foundation of China (60776806, 60672174); Doctoral Start-up Fund of Civil Aviation University of China (06qd08s)
Keywords: feature selection; hypothesis test; text categorization; T-C two-way fourfold contingency table
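The abstract does not name the two independence tests the paper uses, so the sketch below substitutes the classical Pearson chi-square test of independence as a stand-in. It only illustrates the general idea of scoring each term from its 2x2 term-category table and keeping the highest-scoring terms. The function names, the cell layout (a, b, c, d), and the example counts are all hypothetical and not taken from the paper.

```python
from typing import Dict, List, Tuple

# Assumed cell layout of one T-C (term-category) 2x2 table (hypothetical):
#   a: documents in the category that contain the term
#   b: documents outside the category that contain the term
#   c: documents in the category that lack the term
#   d: documents outside the category that lack the term
Table = Tuple[int, int, int, int]


def chi2_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic (1 degree of freedom) for a 2x2 table.

    This is a stand-in test, not one of the paper's two methods.
    Under independence of term and category the statistic is near 0;
    values above 3.841 reject independence at the 5% level.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A zero margin means the table carries no evidence either way.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_features(tables: Dict[str, Table], k: int) -> List[str]:
    """Return the k terms whose tables deviate most from independence."""
    ranked = sorted(tables, key=lambda t: chi2_2x2(*tables[t]), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Made-up counts for a 100-document collection and one category.
    example = {
        "goal": (40, 5, 10, 45),
        "match": (30, 10, 20, 40),
        "the": (48, 47, 2, 3),
    }
    print(select_features(example, k=2))
```

A term that occurs about equally often inside and outside the category (such as "the" in the made-up counts) scores close to zero and is dropped, while terms concentrated in one category score high and are retained.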