Abstract
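For massive volumes of Web text, a similarity computation model is built on top of an inverted index, using feature keywords extracted from the topical content of web pages. For a web page document newly added to the collection, the keywords it contains are used to quickly narrow the scope of computation and improve efficiency. Experimental results show that the algorithm is effective, and a small-scale evaluation achieved good results.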
Replicas and near-replicas of documents are very common on the Web. To detect near-replicas among the large-scale collections of web pages crawled by search engines, a similarity algorithm based on keywords extracted from the web pages is proposed. The algorithm reduces the number of web pages that have to be processed and greatly improves efficiency.
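The idea of using keywords and an inverted index to narrow the comparison scope can be illustrated with a minimal Python sketch. It is not the paper's implementation: the class `KeywordIndex`, the function names, the `min_shared` and `threshold` parameters, and the use of cosine similarity over keyword weights are all assumptions made for illustration (the keyword list mentions a vector space model, but the abstract does not give the exact similarity formula).

```python
from collections import defaultdict
from math import sqrt


class KeywordIndex:
    """Hypothetical inverted index: keyword -> documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # keyword -> {doc_id, ...}
        self.vectors = {}                 # doc_id -> {keyword: weight}

    def add(self, doc_id, keywords):
        """Store a document's keyword weights and update the postings."""
        self.vectors[doc_id] = dict(keywords)
        for kw in keywords:
            self.postings[kw].add(doc_id)

    def candidates(self, keywords, min_shared=2):
        """Return only documents sharing at least min_shared keywords."""
        counts = defaultdict(int)
        for kw in keywords:
            for doc_id in self.postings.get(kw, ()):
                counts[doc_id] += 1
        return {d for d, c in counts.items() if c >= min_shared}


def cosine(a, b):
    """Cosine similarity between two sparse keyword-weight vectors."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_near_replicas(index, keywords, threshold=0.8, min_shared=2):
    """Score the new document against candidates only, not the whole set."""
    hits = []
    for doc_id in index.candidates(keywords, min_shared):
        score = cosine(keywords, index.vectors[doc_id])
        if score >= threshold:
            hits.append((doc_id, score))
    return sorted(hits, key=lambda h: h[1], reverse=True)
```

In this sketch, a newly crawled page is checked with `find_near_replicas(index, keywords)` before being added with `index.add(doc_id, keywords)`. Only documents that share keywords with the new page are scored, which is the source of the efficiency gain described above.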
Source
《微计算机应用》
2008, No. 2, pp. 41-45 (5 pages)
Microcomputer Applications
Keywords
Near-duplicate web pages
Search engine
Web page deduplication
Near-replica detection, vector space model, search engine