Focused crawling is an important technique for topical resource discovery on the Web.The key issue in focused crawling is to prioritize uncrawled uniform resource locators(URLs) in the frontier to focus the crawling o...Focused crawling is an important technique for topical resource discovery on the Web.The key issue in focused crawling is to prioritize uncrawled uniform resource locators(URLs) in the frontier to focus the crawling on relevant pages.Traditional focused crawlers mainly rely on content analysis.Link-based techniques are not effectively exploited despite their usefulness.In this paper,we propose a new frontier prioritizing algorithm,namely the on-line topical importance estimation(OTIE) algorithm.OTIE combines link-and content-based analysis to evaluate the priority of an uncrawled URL in the frontier.We performed real crawling experiments over 30 topics selected from the Open Directory Project(ODP) and compared harvest rate and target recall of the four crawling algorithms:breadth-first,link-context-prediction,on-line page importance computation(OPIC) and our OTIE.Experimental results showed that OTIE significantly outperforms the other three algorithms on the average target recall while maintaining an acceptable harvest rate.Moreover,OTIE is much faster than the traditional focused crawling algorithm.展开更多
The existing fuzzy clustering algorithm (FCM) is sensitive to the initial center point. And simple clustering of distance can neither discovery hot topics on the Network accurately nor solve the problem of semantic di...The existing fuzzy clustering algorithm (FCM) is sensitive to the initial center point. And simple clustering of distance can neither discovery hot topics on the Network accurately nor solve the problem of semantic diversity in Chinese. Aiming at these problems, an improved fuzzy clustering method based on dynamic adaptive step firefly algorithm (FA) was proposed. The clustering center was optimized by improved FA, and the FCM was used to complete the final clustering. First, the step length was adjusted adaptively in the current iteration, and the relationship between fireflies was established according to text similarity, then the topic influence value was applied to fuzzy clustering algorithm to improve fitness function optimization. In this process the topic was categorized into the closest class to the cluster center, which can reduce the impact of topic variation. Finally, according to the level of influence value got hot topics. By collecting real data from Sina micro-blog, the effectiveness of the algorithm was verified by experiments, and the accuracy of topic discovery was improved greatly.展开更多
基金Project (No.2007C23086) supported by the Science and Technology Plan of Zhejiang Province,China
文摘Focused crawling is an important technique for topical resource discovery on the Web.The key issue in focused crawling is to prioritize uncrawled uniform resource locators(URLs) in the frontier to focus the crawling on relevant pages.Traditional focused crawlers mainly rely on content analysis.Link-based techniques are not effectively exploited despite their usefulness.In this paper,we propose a new frontier prioritizing algorithm,namely the on-line topical importance estimation(OTIE) algorithm.OTIE combines link-and content-based analysis to evaluate the priority of an uncrawled URL in the frontier.We performed real crawling experiments over 30 topics selected from the Open Directory Project(ODP) and compared harvest rate and target recall of the four crawling algorithms:breadth-first,link-context-prediction,on-line page importance computation(OPIC) and our OTIE.Experimental results showed that OTIE significantly outperforms the other three algorithms on the average target recall while maintaining an acceptable harvest rate.Moreover,OTIE is much faster than the traditional focused crawling algorithm.
文摘The existing fuzzy clustering algorithm (FCM) is sensitive to the initial center point. And simple clustering of distance can neither discovery hot topics on the Network accurately nor solve the problem of semantic diversity in Chinese. Aiming at these problems, an improved fuzzy clustering method based on dynamic adaptive step firefly algorithm (FA) was proposed. The clustering center was optimized by improved FA, and the FCM was used to complete the final clustering. First, the step length was adjusted adaptively in the current iteration, and the relationship between fireflies was established according to text similarity, then the topic influence value was applied to fuzzy clustering algorithm to improve fitness function optimization. In this process the topic was categorized into the closest class to the cluster center, which can reduce the impact of topic variation. Finally, according to the level of influence value got hot topics. By collecting real data from Sina micro-blog, the effectiveness of the algorithm was verified by experiments, and the accuracy of topic discovery was improved greatly.