网页信息抽取及其自动文本分类的实现 (cited by: 7)

Extraction of Homepage Text Information and Realization of Text Automatic Categorization
Abstract: Web pages often contain content unrelated to their main topic, and this noise must be removed before a page yields useful text. Text classification is essential to the further processing of that text and is another research topic in the information search field. To remove the noise, the paper proposes a method for extracting a page's body text based on the structural characteristics of HTML itself and, by also using the article title, implements a simple method for automatic text classification. The method can improve the efficiency of body-text extraction and of automatic text classification. Experiments show that the method is feasible.
Source: Computer Technology and Development (《计算机技术与发展》), 2008, No. 10, pp. 37-39 (3 pages)
Funding: National Natural Science Foundation of China (60573064)
Keywords: tag; text categorization; information extraction
