Abstract
Data mining is typically applied to large, highly structured databases to discover the knowledge they contain. As the volume of online text grows, so does the knowledge embedded in it, yet that knowledge remains difficult to analyze and exploit. Developing effective methods for discovering knowledge in text is therefore an important research topic. This paper describes an approach to mining knowledge from web pages: relevant pages are retrieved through the Google search engine, then filtered and cleaned to obtain the relevant text; the text is clustered; episodes are used for event recognition and information extraction; and the extracted data are integrated and mined to discover knowledge. Finally, a prototype system based on this approach is presented and used to test knowledge discovery in practice, with good results.
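The abstract does not say how episodes are defined or matched. As a rough illustration only, the sketch below assumes the common reading of a serial episode: an ordered list of terms that must all occur, in that order, within a bounded window of tokens. The function names, the example episodes, and the window width are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence


def occurs_in_window(tokens: Sequence[str], episode: Sequence[str], width: int) -> bool:
    """Return True if the terms of `episode` appear in order within some
    span of at most `width` consecutive tokens."""
    if not episode:
        return True
    for start, tok in enumerate(tokens):
        if tok != episode[0]:
            continue
        matched = 1  # the first term matched at position `start`
        # Greedily match the remaining terms inside the window that
        # begins at `start` and covers `width` tokens in total.
        for later in tokens[start + 1 : start + width]:
            if matched == len(episode):
                break
            if later == episode[matched]:
                matched += 1
        if matched == len(episode):
            return True
    return False


def recognize_events(
    tokens: Sequence[str],
    episodes: Dict[str, Sequence[str]],
    width: int = 8,
) -> List[str]:
    """Return the event types whose episode occurs in the token sequence."""
    return [
        event_type
        for event_type, episode in episodes.items()
        if occurs_in_window(tokens, episode, width)
    ]


# Hypothetical episodes: event type -> ordered trigger terms.
EPISODES = {
    "acquisition": ("acquire", "for"),
    "lawsuit": ("sued", "court"),
}

tokens = "acme announced that it will acquire beta for ten million".split()
print(recognize_events(tokens, EPISODES))
```

In a pipeline like the one the abstract outlines, a step of this kind would run on the cleaned, clustered text, and the recognized event type would then select which fields to extract.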
Source
Computer Engineering and Applications (《计算机工程与应用》)
CSCD
PKU Core Journals (北大核心)
2004, No. 30, pp. 178-180, 220 (4 pages)
Keywords
search engine, text clustering, episode, information extraction, knowledge discovery