Abstract
Traditional text classification is mostly based on the vector space model and uses a flat category scheme, which ignores the hierarchical relationships among categories. Based on latent semantic analysis (LSA) theory, this paper proposes a multi-level web text classification method. When building the class models, the method works bottom-up through the category tree, level by level, building one class model for each group of categories that share the same parent node. When classifying, it works top-down, classifying the text in the latent semantic (LS) space according to the corresponding class model. This approach addresses the difficulty of performing singular value decomposition on the high-dimensional matrices of the LSA model. It also reflects the semantic relationships among terms in web text and takes into account the form in which terms appear in the web page. Experiments show that the multi-level web text classification method achieves better recall and accuracy than classification methods based on a flat category scheme.
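The two phases described in the abstract (one model per group of sibling categories, then top-down classification in the LS space) can be illustrated with a minimal sketch. It is not the authors' implementation: the category tree, the toy documents, the names `NodeModel`, `build_models` and `classify`, the raw term counts, and the centroid-plus-cosine decision rule are all assumptions made for illustration, and the weighting of terms by their form in the web page is not modeled. It assumes NumPy is available.

```python
import numpy as np

# Hypothetical category tree and toy training texts (not from the paper).
TREE = {"root": ["science", "sports"],
        "science": ["physics", "biology"],
        "sports": ["football", "tennis"]}
LEAF_DOCS = {
    "physics": ["quantum particle energy", "particle energy experiment"],
    "biology": ["cell gene protein", "gene protein experiment"],
    "football": ["goal striker league", "striker league match"],
    "tennis": ["serve racket set", "racket set match"],
}

def tokenize(text):
    return text.lower().split()

class NodeModel:
    """LSA model for the children of one parent node (illustrative only)."""

    def __init__(self, docs_by_child, k=2):
        vocab = sorted({t for docs in docs_by_child.values()
                        for d in docs for t in tokenize(d)})
        self.index = {t: i for i, t in enumerate(vocab)}
        labels, columns = [], []
        for child, docs in docs_by_child.items():
            for doc in docs:
                labels.append(child)
                columns.append(self.term_vector(doc))
        A = np.array(columns).T                      # terms x documents
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        k = min(k, int(np.sum(s > 1e-10)))           # keep non-zero singular values only
        self.Uk, self.sk = U[:, :k], s[:k]
        doc_coords = Vt[:k].T                        # one row per document in LS space
        self.centroids = {
            child: doc_coords[[i for i, l in enumerate(labels) if l == child]].mean(axis=0)
            for child in docs_by_child
        }

    def term_vector(self, text):
        v = np.zeros(len(self.index))
        for t in tokenize(text):
            if t in self.index:                      # terms unseen at this node are ignored
                v[self.index[t]] += 1.0
        return v

    def classify(self, text):
        q = self.term_vector(text) @ self.Uk / self.sk   # fold the text into LS space

        def cosine(a, b):
            return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

        return max(self.centroids, key=lambda c: cosine(q, self.centroids[c]))

def subtree_docs(node, tree, leaf_docs):
    """All training texts under a node."""
    if node not in tree:
        return list(leaf_docs[node])
    return [d for c in tree[node] for d in subtree_docs(c, tree, leaf_docs)]

def build_models(node, tree, leaf_docs, k=2, models=None):
    """Build one model per parent node, deepest levels first."""
    models = {} if models is None else models
    if node in tree:
        for child in tree[node]:
            build_models(child, tree, leaf_docs, k, models)
        models[node] = NodeModel(
            {c: subtree_docs(c, tree, leaf_docs) for c in tree[node]}, k)
    return models

def classify(text, root, tree, models):
    """Walk from the root to a leaf, choosing one child per level."""
    node, path = root, []
    while node in tree:
        node = models[node].classify(text)
        path.append(node)
    return path

models = build_models("root", TREE, LEAF_DOCS)
print(classify("gene cell", "root", TREE, models))
print(classify("racket serve", "root", TREE, models))
```

Each `NodeModel` decomposes only the term-document matrix of its own subtree, so every singular value decomposition involves a smaller matrix than a single model over all categories would; this is the property the abstract credits with easing the decomposition problem.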
Source
《情报学报》
CSSCI
PKU Core Journal (北大核心)
2005, No. 6, pp. 684-689 (6 pages)
Journal of the China Society for Scientific and Technical Information
Funding
Natural Science Foundation of Zhejiang Province