Abstract
Manual classification of Web documents yields accurate results but requires a large investment of time and labor, while traditional classification methods based on feature vectors have relatively low accuracy. This paper proposes an automatic Web document classification approach that combines mining of a website's topological structure with existing document classification methods. The approach first mines a classification pattern for each individual website by extracting features from extended pages, then synthesizes the patterns of multiple websites to generate a classification pattern for the search engine.
Source
Journal of Computer Applications (《计算机应用》)
CSCD
PKU Core Journals (北大核心)
2003, No. 7, pp. 37-39 (3 pages)
Funding
Tianjin Science and Technology Development Program (023100511)
Keywords
structure mining
automatic Web page categorization
classification pattern
extended page