Funding: Supported by the National Natural Science Foundation of China (60472099) and the Ningbo Natural Science Foundation (2006A610017).
Abstract: Because a data warehouse changes frequently, incremental data can invalidate knowledge that was mined earlier. To maintain discovered knowledge and patterns dynamically, this study presents IPARUC, a novel algorithm for updating global frequent patterns. IPARUC first applies a rapid clustering method to divide the database into n parts, so that the data within each part are similar. Then, during insertion, the nodes of the tree are adjusted dynamically by "pruning and laying back" to keep them in descending order of frequency, so that node sharing approaches the optimum. Finally, the local frequent itemsets mined from each local dataset are merged into global frequent itemsets. The experimental results are encouraging: IPARUC is more effective and efficient than the two methods it was compared with. The approach also has significant application potential in a prototype Web log analyzer for Web usage mining, where it can help discover useful knowledge and support managers in making decisions.
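The final step above, merging local frequent itemsets into global ones, can be illustrated with a generic partition-and-merge sketch. This is not IPARUC itself: it omits the clustering step and the tree structure, uses brute-force counting, and the names `local_frequent`, `global_frequent`, `min_ratio`, and `max_size` are hypothetical. It relies on the standard property that an itemset frequent in the whole database must be frequent in at least one partition when the same relative threshold is used.

```python
from itertools import combinations


def local_frequent(transactions, min_ratio, max_size=3):
    """Brute-force frequent itemsets of one partition (only for tiny examples)."""
    counts = {}
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, min(max_size, len(items)) + 1):
            for itemset in combinations(items, size):
                counts[itemset] = counts.get(itemset, 0) + 1
    threshold = min_ratio * len(transactions)
    return {itemset for itemset, count in counts.items() if count >= threshold}


def global_frequent(partitions, min_ratio, max_size=3):
    """Merge local results: union the local candidates, then recount globally."""
    candidates = set()
    for part in partitions:
        candidates |= local_frequent(part, min_ratio, max_size)

    all_transactions = [set(t) for part in partitions for t in part]
    threshold = min_ratio * len(all_transactions)

    result = {}
    for candidate in candidates:
        support = sum(1 for t in all_transactions if set(candidate) <= t)
        if support >= threshold:
            result[candidate] = support
    return result


# Two hypothetical partitions of a small transaction database.
partitions = [
    [["a", "b"], ["a", "b", "c"], ["a", "c"]],
    [["b", "d"], ["a", "b"], ["b", "c"]],
]
frequent = global_frequent(partitions, min_ratio=0.5)
# ("a", "c") is frequent in the first partition only and is dropped by the recount.
print(sorted(frequent.items()))
```

The global recount is what keeps the merge correct: a locally frequent itemset is only a candidate until its support over all partitions is checked.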
Funding: Supported by the Huo Yingdong Education Foundation of China (91101).
Abstract: A semantic session analysis method for partitioning Web usage logs is presented. A semantic Web usage log preparation model enriches the usage logs with semantics. A Markov chain model based on ontology semantic measurement is used to identify which active session a request belongs to, and a competitive method is applied to determine where sessions end. Compared with other algorithms, the method successfully detects additional sessions by means of semantic outlier analysis.
Funding: Supported by the Natural Science Foundation of Jiangsu Province (BK2005046).
Abstract: To alleviate the scalability problem caused by growing Web usage and changing user interests, this paper presents a novel Web usage mining algorithm: an incremental Web usage mining algorithm based on Active Ant Colony Clustering. First, an active movement strategy for direction selection and speed, different from the positive strategy employed by other ant colony clustering algorithms, is proposed to construct an Active Ant Colony Clustering algorithm. This strategy avoids idle moves and the "flying over the plane" phenomenon, and it effectively improves the quality and speed of clustering on large datasets. Then, a mechanism for decomposing clusters, built on the above methods, is introduced to form new clusters when users' interests change. Empirical studies on a real Web dataset show that the Active Ant Colony Clustering algorithm performs better than previous algorithms, and that the incremental approach based on the proposed mechanism can efficiently implement incremental Web usage mining.
Abstract: This paper studies improvements to mining frequently visited groups of Web pages. First, in the data preprocessing phase, we introduce an extra frame-filtering step that reduces the negative influence of frame pages on the resulting page groups. By recognizing the frame pages in the site documents and constructing the frame-subframe relation set, the subframe pages that influence the final mining result can be filtered efficiently. Second, we enhance the mining algorithm to consider both the site topology and the content of the Web pages. By introducing the normalized content-link ratio of a Web page and the group interlink degree of a page group, the enhanced algorithm concentrates on content pages that are less interlinked. Experiments show that the new approach effectively reveals more interesting page groups that would not be found without these enhancements.
Abstract: Because Web log files contain a great deal of valuable information, the results of Web mining can be used to improve decision making in electronic commerce (EC) operation and management. Because Web log files are ambiguous and voluminous, a minimal decision-making model based on rough set theory is presented for Web mining, and an example is given to explain the model. The model simplifies the decision table so that its minimal solution can be obtained. According to the minimal solution, the corresponding decisions for individual services can be made in sequence. Web mining based on rough set theory is also currently an original and distinctive method.
Funding: Supported by the National Natural Science Foundation of China (60673093).
Abstract: With the explosive growth of information sources available on the World Wide Web, combining the results of multiple search engines has become an important problem. This paper proposes a search strategy for search engines in Web mining based on genetic simulated annealing. According to the proposed strategy, important relationships exist among Web statistical studies, search engines, and optimization techniques. We show experimentally the relevance of our approach to the presented queries by comparing the quality of the output pages with that of the originally downloaded pages. As the number of iterations increases, better results are obtained within a reasonable execution time.
Funding: Supported by the National Natural Science Foundation of China (No. 61175052, 60975039, 61203297, 60933004, 61035003), the National High-tech R&D Program of China (863 Program) (No. 2012AA011003), and the ZTE research fund of the Parallel Web Mining project.
Abstract: Traditional machine-learning algorithms struggle to handle the exceedingly large amount of data generated by the internet. In real-world applications, there is an urgent need for machine-learning algorithms that can handle large-scale, high-dimensional text data. Cloud computing involves the delivery of computing and storage as a service to a heterogeneous community of recipients, and it has recently aroused much interest in industry and academia. Most previous work on cloud platforms focuses only on parallel algorithms for structured data. In this paper, we focus on the parallel implementation of web-mining algorithms and develop a parallel web-mining system that includes a parallel web crawler; parallel text extract, transform, and load (ETL) and modeling; and parallel text mining and application subsystems. The complete system enables a variety of real-world web-mining applications for massive data.
Abstract: Rough set theory is a new soft computing tool that has received much attention from researchers around the world. It can deal with incomplete and uncertain information and has been applied successfully in many areas. This paper introduces the basic concepts of rough sets and discusses their applications in Web mining. In particular, some applications of rough set theory to intelligent information processing are emphasized.
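The basic concepts mentioned above centre on the lower and upper approximations of a target set under an indiscernibility relation. The following minimal sketch computes both for a toy table of Web sessions. The session data, the attribute names `pages` and `duration`, and the function name `approximations` are illustrative assumptions and do not come from the paper.

```python
from collections import defaultdict


def approximations(objects, attributes, target):
    """Return (lower, upper) approximations of `target`.

    objects: dict mapping object name -> dict of attribute values
    attributes: condition attributes that define indiscernibility
    target: set of object names to approximate
    """
    # Objects with identical values on `attributes` are indiscernible.
    classes = defaultdict(set)
    for name, values in objects.items():
        classes[tuple(values[a] for a in attributes)].add(name)

    lower, upper = set(), set()
    for block in classes.values():
        if block <= target:   # class lies entirely inside the target
            lower |= block
        if block & target:    # class overlaps the target
            upper |= block
    return lower, upper


# Hypothetical sessions described by two discretized attributes.
sessions = {
    "s1": {"pages": "many", "duration": "long"},
    "s2": {"pages": "many", "duration": "long"},
    "s3": {"pages": "few", "duration": "short"},
    "s4": {"pages": "many", "duration": "short"},
    "s5": {"pages": "few", "duration": "short"},
}
purchased = {"s1", "s4"}  # sessions that ended in a purchase

lower, upper = approximations(sessions, ["pages", "duration"], purchased)
print(sorted(lower))  # ['s4']
print(sorted(upper))  # ['s1', 's2', 's4']
```

Here `s1` and `s2` look identical on the chosen attributes but differ in outcome, so they fall in the boundary region: they belong to the upper approximation but not the lower one.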
Abstract: A large amount of data is present on the web that can be used for purposes such as product recommendation, price comparison, and demand forecasting for a particular product. Websites are designed for human understanding, not for machines. Therefore, making the data machine-readable requires techniques for extracting data from web pages. Researchers have addressed the problem using two approaches: knowledge engineering and machine learning. State-of-the-art knowledge engineering approaches use the structure of documents, visual cues, clustering of the attributes of data records, and text processing techniques to identify data records on a web page. Machine learning approaches use annotated pages to learn rules, which are then used to extract data from unseen web pages. The structure of web documents is continuously evolving, so new techniques are needed to handle the emerging requirements of web data extraction. In this paper, we present a novel, simple, and efficient technique for extracting data from web pages using visual styles and the structure of documents. The proposed technique detects the Rich Data Region (RDR) using the query and correlative words of the query. The RDR is then divided into data records using style similarity. Noisy elements are removed using a Common Tag Sequence (CTS) and formatting entropy. The system is implemented in Java and runs on a dataset of real-world working websites. The effectiveness of the results is evaluated using precision, recall, and F-measure and compared with five existing systems. The comparison of the proposed technique with existing systems shows encouraging results.
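The evaluation measures named above follow standard definitions. The sketch below computes them for a set of extracted data records against a hand-labelled reference set. The record identifiers and the function name `precision_recall_f1` are hypothetical, and the F-measure shown is the balanced F1 variant, which the abstract does not specify.

```python
def precision_recall_f1(extracted, expected):
    """Compare extracted records with the reference records."""
    extracted, expected = set(extracted), set(expected)
    true_positives = len(extracted & expected)

    # Guard against empty sets so the ratios are always defined.
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0

    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical run: four records extracted, five records on the page.
extracted = {"r1", "r2", "r3", "r4"}
expected = {"r1", "r2", "r3", "r5", "r6"}

p, r, f = precision_recall_f1(extracted, expected)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# precision=0.75 recall=0.60 f1=0.67
```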
Abstract: The diversity of e-commerce business-to-consumer (B2C) systems, and the significant increase in their use during the COVID-19 pandemic as one of the primary channels of retail commerce, have made it all the more important to measure their quality using practical methods. This paper presents a quality evaluation framework for web metrics that are B2C specific. The framework uses three dimensions based on end-user interaction categories, the metrics' internal specifications, and quality sub-characteristics as defined by ISO 25010. Starting from the existing large corpus of general-purpose web metrics, e-commerce-specific metrics are chosen and categorized. The analysis results are then subjected to data mining to derive association rules between the various dimensions of the framework. Finally, an ontology that corresponds to the framework is developed to answer complicated questions related to metrics use and to facilitate the production of new, user-defined meta-metrics.
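Association rules between framework dimensions are usually judged by support and confidence. The sketch below computes both for one rule over a handful of categorized metrics. The dimension labels such as `interaction=checkout` and the function name `rule_stats` are invented for illustration and are not taken from the paper's dataset.

```python
def rule_stats(rows, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent.

    rows: list of sets, each holding the dimension labels of one metric
    """
    if not rows:
        return 0.0, 0.0
    with_antecedent = sum(1 for row in rows if antecedent <= row)
    with_both = sum(1 for row in rows if antecedent <= row and consequent <= row)

    support = with_both / len(rows)
    confidence = with_both / with_antecedent if with_antecedent else 0.0
    return support, confidence


# Hypothetical metrics, each tagged along two of the framework's dimensions.
metrics = [
    {"interaction=checkout", "quality=security"},
    {"interaction=checkout", "quality=security"},
    {"interaction=checkout", "quality=usability"},
    {"interaction=search", "quality=usability"},
]

support, confidence = rule_stats(
    metrics, {"interaction=checkout"}, {"quality=security"}
)
print(f"support={support:.2f} confidence={confidence:.2f}")
# support=0.50 confidence=0.67
```

A rule miner such as Apriori would enumerate many candidate rules and keep those above chosen thresholds; this sketch only scores a single rule.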
Abstract: This paper describes Web data mining technology in detail and analyzes the relationships among the data on a tourism e-commerce Web site (including the server log, the tourism commodity database, the user database, and the shopping cart) to obtain information on users' preferences for tourism commodities. Based on these models, the paper presents recommendation strategies for the site's registered users and gives the corresponding formulas for calculating the current user's recommendation values for particular items, together with the corresponding recommendation algorithm, so that the system can produce recommendations for the user.
Abstract: Backdoors or information leaks in Web servers can be detected by applying Web mining techniques to abnormal Web log and Web application log data, so that the security of Web servers can be enhanced and the damage caused by illegal access avoided. First, a system is proposed for discovering patterns of information leakage in CGI scripts from Web log data. Second, those patterns are provided to system administrators so that they can modify their code and enhance their Web site security. Two aspects are described. One is to combine the Web application log with the Web log to extract more information, so that Web data mining can discover information that the firewall and the Information Detection System cannot find. The other is to propose an operation module for the Web site to enhance its security. In the cluster server session, a density-based clustering technique is used to reduce resource cost and obtain better efficiency.
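The abstract does not say which density-based clustering technique is used. As an illustration only, the sketch below implements DBSCAN, a widely used density-based algorithm, with a brute-force neighbour search. The two features per session (for example, requests per minute and distinct URLs) and the parameter values are hypothetical.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Brute-force eps-neighbourhood, including the point itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbors)
    return labels


# Hypothetical session features: two dense groups and one outlier.
sessions = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (20, 20)]
print(dbscan(sessions, eps=1.5, min_pts=3))
# [0, 0, 0, 1, 1, 1, -1]
```

Points labelled -1 belong to no dense region, which is why density-based methods are a natural fit for flagging unusual sessions in log data.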
Abstract: The integration of two fast-developing research areas, the Semantic Web and Web mining, is known as Semantic Web Mining. The huge increase in the amount of Semantic Web data has made it a natural target for researchers applying data mining techniques. This paper gives a detailed state-of-the-art survey of ongoing research in this new area. It shows the positive effects of Semantic Web Mining and the obstacles faced by researchers, and it proposes a number of approaches for dealing with the very complex and heterogeneous information and knowledge produced by Semantic Web technologies.