Abstract
Web pages often contain content that is unrelated to their main topic, and this irrelevant content must be removed before useful text can be obtained. Text classification is essential to the further processing of that text and is a separate research topic in the information search field. To remove irrelevant content from Web pages, this paper proposes a method for extracting the main text of a page based on the structural characteristics of HTML itself and, by combining it with article title information, a simple method for automatic text classification. The method can improve the efficiency of main-text extraction and of automatic text classification. Experiments show that the method is feasible.
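The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a simple heuristic: ignore script and style content, split the page at block-level tags, treat the longest text block as the main text, and classify by counting category keywords in the page title. The tag sets, the class `BlockTextParser`, the functions `extract_main_text` and `classify_by_title`, and the keyword table are all hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser

# Assumption: these tags never carry body text.
SKIP_TAGS = {"script", "style", "noscript"}
# Assumption: these tags delimit candidate content blocks.
BLOCK_TAGS = {"div", "td", "p", "article", "section"}


class BlockTextParser(HTMLParser):
    """Collects the <title> text and the text of each block-level element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.blocks = []        # finished text blocks
        self._buf = []          # text of the block currently being read
        self._skip = 0          # nesting depth inside SKIP_TAGS
        self._in_title = False

    def flush(self):
        # Normalize whitespace and keep the block only if it is non-empty.
        text = " ".join(" ".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "title":
            self._in_title = True
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip:
            self._skip -= 1
        elif tag == "title":
            self._in_title = False
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        if self._in_title:
            self.title += data.strip()
        else:
            self._buf.append(data)


def extract_main_text(html):
    """Return (title, longest text block) for an HTML string."""
    parser = BlockTextParser()
    parser.feed(html)
    parser.close()
    parser.flush()
    body = max(parser.blocks, key=len, default="")
    return parser.title, body


def classify_by_title(title, categories):
    """Return the category whose keywords occur most often in the title, or None."""
    scores = {name: sum(title.count(k) for k in kws)
              for name, kws in categories.items()}
    best = max(scores, key=scores.get, default=None)
    return best if best is not None and scores[best] > 0 else None
```

Calling `extract_main_text(page)` on a page string returns the title and the candidate main text; the title can then be passed to `classify_by_title` together with a dictionary that maps category names to keyword lists. A longest-block rule is deliberately crude and would misfire on pages whose navigation or comment areas are longer than the article.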
Source
Computer Technology and Development (《计算机技术与发展》)
2008, No. 10, pp. 37-39 (3 pages)
Funding
National Natural Science Foundation of China (60573064)
Keywords
tag
text categorization
information extraction