Abstract
This paper analyzes the current state of Web information retrieval (IR) and argues that the root cause of its low retrieval efficiency lies in the ranking functions and term weighting techniques that search engines adopt. Classic IR ranking functions and term weighting techniques are then introduced. The characteristics of Web documents are examined: most are HTML documents, a kind of structured document whose structure is defined explicitly by tags, and different structural elements contribute differently to retrieval performance. The work of researchers in China and abroad on HTML document structure is compared, namely work that exploits these characteristics to extend classic ranking functions and term weighting techniques into structure-aware ones. Finally, development trends for Web IR ranking functions and term weighting techniques are discussed.
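The idea of structure-aware term weighting can be illustrated with a minimal sketch. It is not the method of any work surveyed in the paper: the tag weights in `TAG_WEIGHTS`, the helper names `weighted_tf` and `score`, and the idf form `log(1 + N/df)` are all assumptions chosen for illustration. Each term occurrence is scaled by the heaviest weight among its enclosing HTML tags, and the resulting weighted term frequency is combined with an inverse document frequency.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical per-tag weights; illustrative values, not taken from the paper.
TAG_WEIGHTS = {"title": 4.0, "h1": 3.0, "h2": 2.0, "b": 1.5, "strong": 1.5}
DEFAULT_WEIGHT = 1.0


class WeightedTermCounter(HTMLParser):
    """Accumulates term frequencies, scaling each occurrence by its enclosing tags."""

    def __init__(self):
        super().__init__()
        self.stack = []              # tags that are currently open
        self.weighted_tf = Counter()

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Use the heaviest weight among all enclosing tags.
        weight = max(
            (TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.stack),
            default=DEFAULT_WEIGHT,
        )
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.weighted_tf[term] += weight


def weighted_tf(html):
    """Return a Counter mapping each term to its structure-weighted frequency."""
    parser = WeightedTermCounter()
    parser.feed(html)
    parser.close()
    return parser.weighted_tf


def score(query, doc_tf, doc_freq, n_docs):
    """Sum of weighted tf times idf over the query terms (assumed idf form)."""
    total = 0.0
    for term in re.findall(r"[a-z0-9]+", query.lower()):
        df = doc_freq.get(term, 0)
        if df:
            total += doc_tf.get(term, 0.0) * math.log(1 + n_docs / df)
    return total


docs = [
    "<html><head><title>Ranking functions</title></head>"
    "<body><p>Notes on term weighting.</p></body></html>",
    "<html><head><title>Cooking</title></head>"
    "<body><p>A ranking of soup recipes.</p></body></html>",
]
tfs = [weighted_tf(d) for d in docs]
doc_freq = Counter(term for tf in tfs for term in tf)  # documents containing each term
scores = [score("ranking", tf, doc_freq, len(docs)) for tf in tfs]
```

Under these assumed weights, the first document ranks above the second for the query "ranking" because the term occurs in its title rather than in body text. A flat term-frequency scheme would score the two documents equally.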
Source
《计算机工程与应用》
CSCD
Peking University Core Journal (北大核心)
2007, No. 11, pp. 181-184 (4 pages)
Computer Engineering and Applications
Funding
Key Technologies Project of the Ministry of Education of China (No. 03144)
Natural Science Foundation of Hainan Province of China (Grant No. 60533)
Keywords
ranking function
term weighting
document structure
search engine