Abstract
An n-gram language modeling approach to Chinese text classification that requires no word segmentation is proposed. Unlike traditional text classification models, the approach applies n-gram modeling at the character level, so classification requires neither word segmentation nor an explicit feature selection step, a step that can discard a significant amount of useful information. Because the vocabulary of characters is far smaller than the vocabulary of words, the method also greatly reduces the impact of data sparseness compared with word-level classifiers. Key factors in the language model and their influence on classification results are studied systematically. Experiments on the Chinese TREC data set show that the method achieves an F_{β=1} score of 86.8%.
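The character-level scheme described in the abstract can be sketched roughly as follows: each class gets a character n-gram language model, and a document is assigned to the class whose model gives it the highest likelihood. This is a minimal Naive-Bayes-style sketch with add-one smoothing, not the paper's exact estimator; the class names and toy documents are illustrative assumptions.

```python
from collections import Counter, defaultdict
import math

def char_ngrams(text, n):
    # Slide a window of n characters over the text;
    # no word segmentation is needed.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class CharNgramClassifier:
    """Per-class character n-gram model with add-one smoothing (illustrative)."""

    def __init__(self, n=2):
        self.n = n
        self.counts = defaultdict(Counter)  # class -> n-gram counts
        self.totals = Counter()             # class -> total n-gram tokens
        self.vocab = set()                  # all n-grams seen in training

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            grams = char_ngrams(text, self.n)
            self.counts[label].update(grams)
            self.totals[label] += len(grams)
            self.vocab.update(grams)

    def predict(self, text):
        # Score the document under each class model in log space
        # and return the highest-scoring class.
        grams = char_ngrams(text, self.n)
        vocab_size = len(self.vocab)
        best_label, best_logprob = None, float("-inf")
        for label in self.counts:
            logprob = sum(
                math.log((self.counts[label][g] + 1)
                         / (self.totals[label] + vocab_size))
                for g in grams
            )
            if logprob > best_logprob:
                best_label, best_logprob = label, logprob
        return best_label
```

With toy training documents such as `clf.fit(["股市上涨", "基金收益增加", "足球比赛胜利", "篮球队夺冠"], ["finance", "finance", "sports", "sports"])`, the classifier scores unsegmented Chinese text directly, which is the point the abstract makes: no segmenter and no separate feature selection step sit between the raw characters and the model.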
Source
Transactions of Beijing Institute of Technology (《北京理工大学学报》)
EI
CAS
CSCD
Peking University Core Journals (北大核心)
2005, Issue 9, pp. 778-781 (4 pages)
Funding
Supported by the Yunnan Province Information Technology Fund (2002IT03)
Keywords
text classification
word segmentation
n-gram model