Abstract
In the hidden Web domain, general-purpose search engines (e.g., Google and Yahoo) have several shortcomings. Each covers less than one-third of the total data in the hidden Web, and, unlike on the surface Web, combining several search engines yields roughly the same coverage. Many hidden Web sites provide large amounts of high-quality information, and the hidden Web is becoming one of the most important information sources. This paper proposes a three-step framework of three classifiers to automatically identify domain-specific hidden Web entries. Once the query interfaces are obtained, they can be integrated into a unified interface that is presented to users for querying. Eight large-scale experiments demonstrate that the approach finds domain-specific hidden Web entries accurately and efficiently.
Source
Journal of Software (《软件学报》)
Indexed in: EI, CSCD, PKU Core
2008, Issue 2, pp. 246-256 (11 pages)
Funding
Supported by the National Natural Science Foundation of China under Grant No. 60373099
and the Science and Technology Development Program of Jilin Province of China under Grant No. 20070533
Keywords
deep Web
hidden Web
surface Web
hidden Web entry
searchable form