Abstract
Neither general-purpose search engines nor the site search mechanisms that websites provide can support content-based lookup of corporate website information. After analysing the types of information found on corporate websites, we propose a general site search engine architecture to address this problem. We present the design ideas behind the engine and introduce its implementation techniques, including the object mapping and matching method, the weighted object similarity calculation algorithm, and index construction. The engine supports searching and locating information based on web page content and on the content of Word and PDF attachments. Experimental results show that the method achieves high precision and recall. The engine can support content search and personalised services for corporate websites.
Source
Computer Applications and Software (《计算机应用与软件》)
CSCD
2016, No. 4, pp. 91-94 (4 pages)
Keywords
Site search
Object mapping
Attachment content
Object similarity