Abstract
To address the difficulty that the inherently arbitrary nature of web fragment information poses for information extraction, this paper studies the structural features of the DOM tree of web fragment information together with its textual features (such as time, author, and message). Combining the two is found to support effective extraction of web fragment information, and a feature-tree-based extraction algorithm is proposed. Experiments on 100 information-sharing platforms, including Sina Weibo, Tencent Weibo, and Sohu Weibo, show that the algorithm performs well and achieves high recall and precision.
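The abstract does not describe how the feature tree is built, so the following is only a minimal sketch of the general idea of combining a structural DOM feature with a textual feature, not the paper's algorithm. It assumes that fragments appear as sibling subtrees with identical tag structure and that each fragment contains a timestamp; the names `Node`, `TreeBuilder`, `TIME_RE`, and `extract_fragments` are hypothetical.

```python
import re
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}
# Textual feature (assumed): a date like 2014-01-05 or a clock time like 10:30.
TIME_RE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}|\d{1,2}:\d{2}")


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""

    def signature(self):
        """Structural feature: the tag plus the signatures of all children."""
        return self.tag + "(" + ",".join(c.signature() for c in self.children) + ")"

    def full_text(self):
        """Own text followed by the text of descendants, whitespace-trimmed."""
        parts = [self.text.strip()] + [c.full_text() for c in self.children]
        return " ".join(p for p in parts if p)


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree; tolerant of stray end tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def extract_fragments(html, min_repeat=2):
    """Return the text of repeated sibling subtrees that contain a timestamp."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    fragments = []
    for parent in walk(builder.root):
        groups = {}
        for child in parent.children:
            groups.setdefault(child.signature(), []).append(child)
        for group in groups.values():
            # Skip non-repeating structures and bare leaf elements.
            if len(group) < min_repeat or not group[0].children:
                continue
            for node in group:
                text = node.full_text()
                if TIME_RE.search(text):
                    fragments.append(text)
    return fragments
```

Calling `extract_fragments(page_html)` on a feed page returns one string per candidate fragment. Real pages would need looser structural matching than exact signature equality, and separate handling of the author and message fields, which this sketch leaves out.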
Source
Journal of Lanzhou University of Technology (《兰州理工大学学报》)
CAS
PKU Core (北大核心)
2014, Issue 1, pp. 104-107 (4 pages)
Funding
Guizhou Province Governor's Special Fund for Outstanding Science, Technology and Education Talent (grant No. 黔省专合字(2012)82号)
Keywords
Web
web fragment information
DOM tree
information extraction
recall rate