Corporations focus on web-based education to train their employees more than ever before. Unlike traditional learning environments, web-based education applications store large amounts of data. This growing availability of data has stimulated the emergence of a new field called educational data mining. In this study, a classification method is applied to data obtained from a company that uses web-based education to train its employees. The authors' aim is to find the most critical factors that influence users' success. For the classification of the data, two decision tree algorithms, Classification and Regression Tree (CART) and Quick, Unbiased and Efficient Statistical Tree (QUEST), are applied. According to the results, the assurance of a certificate at the end of the training is found to be the most critical factor influencing users' success. The position, number of work years, and education level of the user are also found to be important factors.
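The CART algorithm mentioned above grows a tree by repeatedly choosing the split that minimizes Gini impurity. The following is a minimal sketch of that splitting criterion on a made-up training set; the feature names and records are illustrative and not taken from the study.

```python
# Minimal sketch of the CART splitting criterion (Gini impurity).
# Data and feature names below are hypothetical, not the study's.

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(rows, labels):
    """Pick the (feature, value) binary split minimizing weighted Gini impurity."""
    best, best_score = None, float("inf")
    n = len(rows)
    for feature in rows[0]:
        for value in {r[feature] for r in rows}:
            left = [y for r, y in zip(rows, labels) if r[feature] == value]
            right = [y for r, y in zip(rows, labels) if r[feature] != value]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best_score, best = score, (feature, value)
    return best, best_score

# Hypothetical trainees: does a promised certificate separate pass from fail?
rows = [
    {"certificate": "yes", "level": "bsc"},
    {"certificate": "yes", "level": "msc"},
    {"certificate": "no", "level": "bsc"},
    {"certificate": "no", "level": "msc"},
]
labels = ["pass", "pass", "fail", "fail"]
split, score = best_split(rows, labels)
print(split, score)
```

On this toy data the certificate feature yields a pure split (impurity 0.0), mirroring how a dominant factor surfaces at the root of a CART tree.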
This paper describes web data mining technology in detail and analyzes the relationships among the data held by a tourism e-commerce site (including the server log, the tourism commodity database, the user database, and the shopping cart) to obtain user preference information for tourism commodities. Based on these models, the paper presents recommendation strategies for the site's registered users, derives formulas for calculating the recommendation values of particular items for the current user together with the corresponding recommendation algorithm, and shows that the system can generate recommendations for users.
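The abstract does not reproduce the paper's actual formulas, so the following is only an assumed illustration of how a recommendation value for an item might be computed: score it by the similarity-weighted preferences of other registered users. All ratings and names are made up.

```python
# Hypothetical recommendation-value sketch: similarity-weighted average
# of other users' ratings. Not the paper's actual formulas.

def cosine(u, v):
    """Cosine similarity between two users' item-rating dicts."""
    common = set(u) & set(v)
    num = sum(u[i] * v[i] for i in common)
    den = (sum(x * x for x in u.values()) ** 0.5) * \
          (sum(x * x for x in v.values()) ** 0.5)
    return num / den if den else 0.0

def recommend_value(target, others, item):
    """Similarity-weighted average rating of `item` over users who rated it."""
    num = den = 0.0
    for other in others:
        if item in other:
            s = cosine(target, other)
            num += s * other[item]
            den += abs(s)
    return num / den if den else 0.0

# Made-up ratings (1-5) of tourism commodities.
alice = {"beach_tour": 5, "museum_pass": 1}
others = [{"beach_tour": 5, "museum_pass": 1, "cruise": 4},
          {"beach_tour": 1, "museum_pass": 5, "cruise": 2}]
print(round(recommend_value(alice, others, "cruise"), 2))
```

Because the first user's preferences resemble Alice's, the predicted value for "cruise" lands closer to that user's rating of 4 than to the dissimilar user's rating of 2.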
Web data mining is applied to the analysis of distance education. Considering the individual differences among learners, a framework for a personalized distance-learning system and the concept of personalized service are proposed. By mining the relevant information, a distance-education system that integrates intelligence and personalization is built, thereby improving the current state of distance-education services.
Improvements to mining the frequently visited groups of web pages were studied. First, in the data preprocessing phase, we introduce an extra frame-filtering step that reduces the negative influence of frame pages on the resulting page groups. By recognizing the frame pages in the site documents and constructing the frame-subframe relation set, the subframe pages that would distort the final mining result can be efficiently filtered out. Second, we enhance the mining algorithm to consider both the site topology and the content of the web pages. Through the introduction of the normalized content-link ratio of a web page and the interlink degree of a page group, the enhanced algorithm concentrates more on content pages that are less interlinked. The experiments show that the new approach can effectively reveal more interesting page groups, which would not be found without these enhancements.
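The abstract does not define the normalized content-link ratio, so the definition below is an assumption made purely for illustration: compare a page's text volume to its out-link count and normalize by the site-wide maximum, so index-style hub pages score low and content pages score high.

```python
# Assumed, illustrative definition of a normalized content-link ratio;
# the paper's actual formula may differ.

def content_link_ratio(text_len, out_links):
    """Raw ratio of page text to out-links (+1 avoids division by zero)."""
    return text_len / (out_links + 1)

def normalized_ratios(pages):
    """Scale each page's ratio into [0, 1] by the site-wide maximum."""
    raw = {url: content_link_ratio(t, l) for url, (t, l) in pages.items()}
    top = max(raw.values())
    return {url: r / top for url, r in raw.items()}

# Hypothetical site: an index page (many links, little text) vs. content pages.
pages = {"/index": (200, 49), "/article1": (4000, 4), "/article2": (3000, 5)}
scores = normalized_ratios(pages)
print(scores)
```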
Lung cancer is one of the leading cancers worldwide for both genders, and its occurrence has risen markedly since the early 19th century. In this manuscript, we discuss various data mining techniques that have been employed for cancer diagnosis. Exposure to air pollution has been related to various adverse health effects; this work analyzes various air pollutants and their associated health hazards and evaluates the impact of air pollution on lung cancer. We apply data mining to relate lung cancer to air pollution, and our approach includes preprocessing, data mining, testing and evaluation, and knowledge discovery. Initially, we remove noise and irrelevant data; after that, we join the multiple information sources into a common source, from which we select the information relevant to our investigation. We then convert the selected data into a form suitable for the mining process. Patterns are extracted using a relational suggestion rule mining process, and the information these patterns reveal is categorized with the help of an Auto Associative Neural Network (AANN) classification method. The proposed method is compared with an existing method on various factors. In conclusion, the proposed AANN and relational suggestion rule mining methods achieve high accuracy.
Objective: This study analyzed the medical cases in the book "Clinical Guide Medical Records" using a data mining method, to provide a reference for Ye Tianshi's academic thoughts. Methods: We used the web version of the Ancient and Modern Medical Records Cloud Platform to perform distribution statistics, association rules, cluster analysis, and complex network analysis on all the medical records in the "Clinical Guide Medical Records." These methods were used to summarize the baseline data and to identify the core relationships between Chinese medicine diseases and Chinese medicines, as well as the classification of the Chinese medicines. Results: A total of 2572 medical records, 3136 visits, and 2879 prescriptions involving 1127 traditional Chinese medicines were included in this study. The most common diseases (such as hematemesis), syndromes (such as liver-stomach disharmony), symptoms (such as rapid pulse), disease sites (such as the gastric cavity), disease properties (such as Yang deficiency), treatment methods (such as activating Yang), and traditional Chinese medicines (such as Poria cocos) were identified. Furthermore, medicines with a warm, flat, or cold nature and a sweet or bitter taste, acting on the lungs, spleen, and heart, were the most common. The observed effects of the drugs included clearing dampness, promoting diuresis, and strengthening the spleen. The association analysis showed that the high-confidence associations between TCM diseases and traditional Chinese medicines included "phlegm and fluid retention-Poria cocos," "diarrhea-Poria cocos," etc. The cluster analysis classified the traditional Chinese medicines into five categories. The complex network showed the core relationships between nine high-frequency diseases and nine high-frequency traditional Chinese medicines. Conclusion: This study revealed the most important relationships between traditional Chinese medicine diseases and traditional Chinese medicines and classified the most used traditional Chinese medicines. These findings may help coming generations of doctors make accurate diagnoses, treat patients effectively, and improve clinicians' efficacy in clinical diagnosis and treatment.
The increasing usage of the internet requires a robust system for effective communication. To provide effective communication for internet users, the shortest routing path, based on the nature of their queries, is usually preferred for data forwarding. But when too much data chooses the same path, a bottleneck occurs in the traffic, which leads to data loss or delivers irrelevant data to the users. In this paper, a Rule Based System using Improved Apriori (RBS-IA) rule mining framework is proposed for effective monitoring and control of traffic over the network. The RBS-IA framework integrates both traffic control and decision making to make internet usage more effective. At first, the network traffic data are analyzed, and the incoming and outgoing data information is processed using the Apriori rule mining algorithm. After generating the set of rules, the network traffic condition is analyzed. Based on the traffic conditions, a decision rule framework is introduced that derives and assigns the set of suitable rules to the appropriate states of the network. The decision rule framework improves the effectiveness of network traffic control by updating the traffic condition states to identify the relevant route path for packet data transmission. An experimental evaluation is conducted on the Dodgers loop sensor data set from the UCI repository to assess the effectiveness of the proposed RBS-IA rule mining framework. The performance evaluation shows that the proposed RBS-IA rule mining framework provides significant improvement in managing the network traffic control scheme. The RBS-IA rule mining framework is evaluated on factors such as the accuracy of the decisions obtained, the interestingness measure, and the execution time.
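The classic Apriori step that RBS-IA builds on can be sketched compactly: count candidate itemsets level by level, keeping only those above a minimum support. The "improved" parts of RBS-IA are not reproduced here, and the traffic "transactions" below are made up.

```python
# Minimal Apriori frequent-itemset sketch over hypothetical traffic records.
from itertools import combinations

def apriori(transactions, min_support):
    """Return frequent itemsets (frozensets) mapped to their support."""
    n = len(transactions)
    items = {i for t in transactions for i in t}
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        k += 1
        prev = list(level)
        # Join step: build size-k candidates from frequent (k-1)-itemsets.
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
    return frequent

# Each transaction: the set of routes touched by one packet burst (made up).
transactions = [frozenset(t) for t in
                [{"r1", "r2"}, {"r1", "r2", "r3"}, {"r1", "r2"}, {"r2", "r3"}]]
freq = apriori(transactions, min_support=0.5)
print(freq)
```

Frequent itemsets such as {r1, r2} then feed rule generation ("if r1 then r2"), which is the raw material the decision rule framework filters into traffic-state rules.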
Speed and quality are the two major challenges faced by clustering algorithms. DBSCAN (density based spatial clustering of applications with noise) is a typical density-based clustering method, and clustering experiments on large databases have demonstrated its advantage in speed. A recursive density-based clustering algorithm (RDBC) is proposed that can intelligently and dynamically adjust its density parameters. RDBC is an improved algorithm based on DBSCAN, with the same computational complexity as DBSCAN. Clustering experiments on Web documents show that RDBC not only retains the speed advantage of DBSCAN but also produces clustering results that are far better than DBSCAN's.
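The DBSCAN baseline that RDBC refines can be written in a few lines: expand a cluster from any point with enough neighbors within `eps`, and mark sparse points as noise. The 1-D "document feature" values below are illustrative only.

```python
# Compact DBSCAN sketch (the baseline RDBC improves on); toy 1-D data.

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if abs(points[i] - q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seed = neighbors(i)
        if len(seed) < min_pts:
            labels[i] = -1              # provisionally noise
            continue
        labels[i] = cluster
        queue = [j for j in seed if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster     # border point reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbors(j)
            if len(nb) >= min_pts:      # core point: keep expanding
                queue.extend(nb)
        cluster += 1
    return labels

points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 9.9]  # two dense groups + one outlier
print(dbscan(points, eps=0.3, min_pts=2))
```

RDBC's contribution, per the abstract, is to re-run this density expansion recursively while adapting `eps`/`min_pts` instead of fixing them globally.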
Backdoors or information leaks of Web servers can be detected by applying Web mining techniques to abnormal Web log and Web application log data. The security of Web servers can thereby be enhanced and the damage of illegal access avoided. First, a system for discovering the patterns of information leakages in CGI scripts from Web log data is proposed. Second, those patterns are provided to system administrators so that they can modify their code and enhance their Web site security. The following aspects are described: one is to combine the web application log with the web log to extract more information, so that web data mining can be used to mine the web log and discover information that a firewall and an Intrusion Detection System cannot find; another is to propose an operation module for the web site to enhance Web site security. In the cluster server session, a density-based clustering technique is used to reduce resource cost and obtain better efficiency.
Influenza is an infectious disease that spreads quickly and widely, and its outbreaks have brought huge losses to society. In this paper, four major categories of flu keywords were set: "prevention phase", "symptom phase", "treatment phase", and "commonly-used phrase". A Python web crawler was used to obtain relevant influenza data from the National Influenza Center's weekly influenza surveillance reports and the Baidu Index. Support vector regression (SVR), least absolute shrinkage and selection operator (LASSO), and convolutional neural network (CNN) prediction models were established through machine learning, taking into account the seasonal characteristics of influenza; a time series model (ARMA) was also established. The results show that it is feasible to predict influenza based on web search data and that machine learning shows a certain forecasting effect in this setting, so it will have reference value for influenza prediction in the future. The ARMA(3,0) model yields better predictions and generalizes better. Finally, the limitations of this study and future research directions are given.
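The SVR/LASSO/CNN/ARMA models themselves are not reproduced here; the sketch below is only an illustrative ordinary-least-squares fit of weekly flu cases against a search index, with made-up numbers, to show the shape of the search-data prediction task.

```python
# Illustrative regression of reported cases on a search index (made-up data);
# stands in for the paper's SVR/LASSO/CNN/ARMA models, which are not shown.

def fit_line(xs, ys):
    """Closed-form simple linear regression: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

# Hypothetical Baidu-Index values and reported cases for six weeks.
index = [100, 120, 140, 160, 180, 200]
cases = [210, 250, 290, 330, 370, 410]
slope, intercept = fit_line(index, cases)
predicted = slope * 220 + intercept   # forecast for a week with index 220
print(round(predicted))
```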
Purpose: Our study proposes a bootstrapping-based method to automatically extract data-usage statements from academic texts. Design/methodology/approach: The method starts with seed entities and iteratively learns patterns and data-usage statements from unlabeled text. In each iteration, new patterns are constructed and added to the pattern list based on their calculated scores. Three seed-selection strategies are also proposed in this paper. Findings: The performance of the method is verified by means of experiments on real data collected from computer science journals. The results show that the method achieves satisfactory performance regarding the precision of extraction and the extensibility of the obtained patterns. Research limitations: While the triple representation of sentences is effective and efficient for extracting data-usage statements, it is unable to handle complex sentences. Additional features that can address complex sentences should thus be explored in the future. Practical implications: Data-usage statement extraction is beneficial for data-repository construction and facilitates research on data-usage tracking, dataset-based scholar search, and dataset evaluation. Originality/value: To the best of our knowledge, this paper is among the first to address the important task of automatically extracting data-usage statements from real data.
The integration of the two fast-developing research areas Semantic Web and Web Mining is known as Semantic Web Mining. The huge increase in the amount of Semantic Web data has made it a perfect target for many researchers to apply data mining techniques to. This paper gives a detailed state-of-the-art survey of ongoing research in this new area. It shows the positive effects of Semantic Web Mining and the obstacles faced by researchers, and proposes a number of approaches to deal with the very complex and heterogeneous information and knowledge produced by Semantic Web technologies.
Web data extraction obtains valuable data from the tremendous information resource of the World Wide Web according to a pre-defined pattern, and processes and classifies the data on the Web. A formalization of the procedure of Web data extraction is presented, together with a description of the crawling and extraction algorithms. Based on this formalization, an XML-based page structure description language, TIDL, is proposed, including the object model, the HTML object reference model, and the definition of tags. Finally, a Web data gathering and querying application based on Internet agent technology, named the Web Integration Services Kit (WISK), is described.
A large amount of data is present on the web that can be used for useful purposes such as product recommendation, price comparison, and demand forecasting for a particular product. Websites are designed for human understanding, not for machines; therefore, making the data machine-readable requires techniques to grab data from web pages. Researchers have addressed the problem using two approaches, i.e., knowledge engineering and machine learning. State-of-the-art knowledge engineering approaches use the structure of documents, visual cues, clustering of the attributes of data records, and text processing techniques to identify data records on a web page. Machine learning approaches use annotated pages to learn rules, which are then used to extract data from unseen web pages. The structure of web documents is continuously evolving, so new techniques are needed to handle the emerging requirements of web data extraction. In this paper, we present a novel, simple, and efficient technique to extract data from web pages using visual styles and the structure of documents. The proposed technique detects the Rich Data Region (RDR) using a query and its correlative words. The RDR is then divided into data records using style similarity. Noisy elements are removed using a Common Tag Sequence (CTS) and formatting entropy. The system is implemented in Java and runs on a dataset of real-world working websites. The effectiveness of the results is evaluated using precision, recall, and F-measure and compared with five existing systems; the comparison shows encouraging results.
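The "style similarity" step above can be illustrated with a toy sketch: represent each sibling subtree by its tag sequence and greedily group consecutive subtrees whose sequences look alike. The similarity measure, threshold, and tag sequences below are assumptions for illustration; the paper's RDR detection and entropy filtering are not reproduced.

```python
# Simplified record-splitting by tag-sequence similarity (illustrative only).

def similarity(a, b):
    """Dice coefficient over tag bigrams as a crude style-similarity measure."""
    bi = lambda s: {tuple(s[i:i + 2]) for i in range(len(s) - 1)}
    A, B = bi(a), bi(b)
    return 2 * len(A & B) / (len(A) + len(B)) if A or B else 1.0

def group_records(nodes, threshold=0.5):
    """Greedily group consecutive nodes whose tag sequences look alike."""
    groups = [[nodes[0]]]
    for node in nodes[1:]:
        if similarity(groups[-1][-1], node) >= threshold:
            groups[-1].append(node)
        else:
            groups.append([node])
    return groups

# Tag sequences of sibling subtrees: three product records, then a footer.
nodes = [["div", "img", "span", "a"],
         ["div", "img", "span", "a"],
         ["div", "img", "span", "b", "a"],
         ["table", "tr", "td"]]
print([len(g) for g in group_records(nodes)])
```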
With user-generated content, anyone can be a content creator. This phenomenon has vastly increased the amount of information circulated online, and it is becoming harder to obtain the required information efficiently. In this paper, we describe how natural language processing and text mining can be parallelized using Hadoop and the Message Passing Interface. We propose a parallel web text mining platform that processes massive amounts of data quickly and efficiently. Our web knowledge service platform is designed to collect information about the IT and telecommunications industries from the web and process this information using natural language processing and data mining techniques.
It is difficult to detect anomalies in which the matching relationship among some data attributes is very different from that of the others in a dataset. Aiming at this problem, an approach based on wavelet analysis for detecting and amending anomalous samples is proposed. Taking full advantage of the multi-resolution and local analysis properties of wavelet analysis, this approach can detect and amend anomalous samples effectively. To realize the rapid numerical computation of the wavelet transform for a discrete sequence, a modified algorithm based on the Newton-Cotes formula is also proposed. The experimental results show that the approach is feasible, effective, and practical.
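The local-analysis property that makes wavelets useful here can be shown with the simplest case, one level of the Haar transform: a sample that breaks the local pattern produces a large detail coefficient. The thresholding rule and data below are illustrative, not the paper's Newton-Cotes-based algorithm.

```python
# One Haar decomposition level used to flag anomalies (illustrative sketch).

def haar_step(seq):
    """One level of the Haar transform: (approximations, details) per pair."""
    approx = [(seq[i] + seq[i + 1]) / 2 for i in range(0, len(seq) - 1, 2)]
    detail = [(seq[i] - seq[i + 1]) / 2 for i in range(0, len(seq) - 1, 2)]
    return approx, detail

def flag_anomalies(seq, k=3.0):
    """Flag pair-start indices whose |detail| exceeds k x mean |detail|."""
    _, detail = haar_step(seq)
    mean_abs = sum(abs(d) for d in detail) / len(detail)
    return [2 * i for i, d in enumerate(detail) if abs(d) > k * mean_abs]

# Smooth sequence with one spike at index 6.
seq = [1.0, 1.1, 1.0, 1.1, 1.0, 1.1, 9.0, 1.1]
print(flag_anomalies(seq))
```

Amending the flagged sample could then replace it with the local approximation coefficient, which is the "amend" half of the approach described above.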
With increasingly complex website structures and continuously advancing web technologies, accurately recognizing user clicks in massive HTTP data, which is critical for web usage mining, becomes more difficult. In this paper, we propose a dependency graph model to describe the relationships between web requests. Based on this model, we design and implement a heuristic parallel algorithm to distinguish user clicks with the assistance of cloud computing technology. We evaluate the proposed algorithm on real massive data: a 228.7 GB dataset collected from a mobile core network, covering more than three million users. The experimental results demonstrate that the proposed algorithm achieves higher accuracy than previous methods.
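A toy sketch of the dependency-graph idea: link each HTTP request to the request that referred it, then treat page-type requests as candidate user clicks and embedded objects (images, stylesheets) as dependencies. The heuristics, content types, and log records below are made up for illustration and are far simpler than the paper's parallel algorithm.

```python
# Toy referrer-dependency graph and click heuristic (illustrative only).

PAGE_TYPES = {"text/html"}

def build_graph(requests):
    """Map each referrer URL to the URLs requested with it as referrer."""
    graph = {}
    for r in requests:
        graph.setdefault(r["referrer"], []).append(r["url"])
    return graph

def user_clicks(requests):
    """Heuristic: an HTML request is a click; embedded objects are not."""
    return [r["url"] for r in requests if r["type"] in PAGE_TYPES]

requests = [
    {"url": "/home", "referrer": None, "type": "text/html"},
    {"url": "/logo.png", "referrer": "/home", "type": "image/png"},
    {"url": "/news", "referrer": "/home", "type": "text/html"},
    {"url": "/style.css", "referrer": "/news", "type": "text/css"},
]
print(user_clicks(requests))
print(build_graph(requests))
```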
The diversity of e-commerce Business-to-Consumer (B2C) systems, and the significant increase in their use during the COVID-19 pandemic as one of the primary channels of retail commerce, have made the need to measure their quality with practical methods all the more important. This paper presents a quality evaluation framework for B2C-specific web metrics. The framework uses three dimensions based on end-user interaction categories, the metrics' internal specifications, and the quality sub-characteristics defined by ISO 25010. Starting from the existing large corpus of general-purpose web metrics, e-commerce-specific metrics are chosen and categorized. The analysis results are subjected to a data mining analysis to provide association rules between the various dimensions of the framework. Finally, an ontology corresponding to the framework is developed to answer complicated questions related to metrics use and to facilitate the production of new, user-defined meta-metrics.
The goal of this project is to use Semantic Web technologies and data mining for disease diagnosis, assisting health care professionals with the possible medications and drugs to prescribe (drug recommendation) according to the features of the patient. Numerous Decision Support Systems (DSS) and expert systems enable medical collaboration, for example in specific or general differential diagnosis. However, a medical recommendation system using both Semantic Web technologies and data mining had not yet been developed, which motivated this work. It should be mentioned that there are several reference systems for medicine or active-ingredient interactions, but their final goal is not the drug recommendation that uses the above technologies. With this project we aim to provide an assistant to the doctor for better recommendations. The patient will also be able to use this system for explanations of drugs, food interactions, and side effects of corresponding drugs.
Funding: supported by Taif University Researchers Supporting Project Number (TURSP-2020/215), Taif University, Taif, Saudi Arabia.
Funding: supported by the National Natural Science Foundation of China, "Research on the discovery of key diagnosis and treatment elements and clinical optimization decision of spleen and stomach diseases based on deep learning" (No. 81873200), and "Construction and application of an intelligent early warning system for TCM clinical drug contraindications based on rule engine" (No. ZZ150321).
Abstract: Influenza is an infectious disease that spreads quickly and widely, and its outbreaks have brought huge losses to society. In this paper, four major categories of flu keywords were set: "prevention phase", "symptom phase", "treatment phase", and "commonly-used phrase". A Python web crawler was used to obtain relevant influenza data from the National Influenza Center's weekly influenza surveillance reports and the Baidu Index. Support vector regression (SVR), least absolute shrinkage and selection operator (LASSO), and convolutional neural network (CNN) prediction models were built through machine learning, taking the seasonal characteristics of influenza into account; a time series model (ARMA) was also established. The results show that predicting influenza from web search data is feasible and that machine learning achieves a certain forecasting effect on such data, giving it reference value for future influenza prediction. The ARMA(3,0) model predicts well and generalizes better. Finally, the limitations of this study and directions for future research are given.
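The core idea of regressing case counts on lagged search-index features can be sketched in a few lines. This is a hedged stand-in: plain least squares replaces the paper's SVR/LASSO/CNN/ARMA models, and the `search` and `cases` series are synthetic, not Baidu Index or surveillance data.

```python
import numpy as np

# Synthetic stand-ins for weekly search-index values and reported flu cases.
rng = np.random.default_rng(0)
search = rng.uniform(50, 150, size=60)
cases = np.empty(60)
cases[1:] = 2.0 * search[:-1] + rng.normal(0, 5, 59)  # cases track last week's searches
cases[0] = cases[1]

# Lagged features: predict this week's cases from the previous 3 weeks' searches.
lags = 3
X = np.column_stack([search[i:len(search) - lags + i] for i in range(lags)])
y = cases[lags:]
X = np.column_stack([np.ones(len(X)), X])             # intercept column

# Ordinary least squares stands in for the paper's learned models.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ beta
```

With real data, the same feature matrix would simply be handed to an SVR or LASSO estimator instead of `lstsq`.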
Funding: Supported by the National Natural Science Foundation of China (Grant No. 71473183).
Abstract: Purpose: Our study proposes a bootstrapping-based method to automatically extract data-usage statements from academic texts. Design/methodology/approach: The method for data-usage statement extraction starts with seed entities and iteratively learns patterns and data-usage statements from unlabeled text. In each iteration, new patterns are constructed and added to the pattern list based on their calculated score. Three seed-selection strategies are also proposed in this paper. Findings: The performance of the method is verified by means of experiments on real data collected from computer science journals. The results show that the method achieves satisfactory performance regarding precision of extraction and extensibility of the obtained patterns. Research limitations: While the triple representation of sentences is effective and efficient for extracting data-usage statements, it is unable to handle complex sentences. Additional features that can address complex sentences should thus be explored in the future. Practical implications: Data-usage statement extraction is beneficial for data-repository construction and facilitates research on data-usage tracking, dataset-based scholar search, and dataset evaluation. Originality/value: To the best of our knowledge, this paper is among the first to address the important task of automatically extracting data-usage statements from real data.
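A toy sketch of the bootstrapping loop described above: seeds yield context patterns, and patterns yield new entities. It assumes a simplified one-word-left/one-word-right pattern shape and omits the paper's pattern scoring; the corpus and seed are invented.

```python
import re

def bootstrap(corpus, seeds, rounds=2):
    """Toy bootstrapping: learn 'left ENTITY right' word-context patterns
    from seed entities, then harvest new entities with those patterns."""
    found = set(seeds)
    patterns = set()
    for _ in range(rounds):
        # Learn patterns: the single word before and after each known entity.
        for sent in corpus:
            for ent in list(found):
                m = re.search(r"(\w+) " + re.escape(ent) + r" (\w+)", sent)
                if m:
                    patterns.add((m.group(1), m.group(2)))
        # Apply patterns to extract new candidate entities.
        for left, right in patterns:
            for sent in corpus:
                for m in re.finditer(
                        rf"{re.escape(left)} (\w+) {re.escape(right)}", sent):
                    found.add(m.group(1))
    return found - set(seeds)

corpus = [
    "we used MNIST dataset for evaluation",
    "we used CIFAR10 dataset for evaluation",
    "experiments used ImageNet dataset for training",
]
new = bootstrap(corpus, seeds={"MNIST"})
```

The real method would score each candidate pattern before admitting it, which is what keeps semantic drift under control across iterations.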
Abstract: The integration of two fast-developing research areas, the Semantic Web and Web Mining, is known as Semantic Web Mining. The huge increase in the amount of Semantic Web data has made it a prime target for researchers applying Data Mining techniques to it. This paper gives a detailed state-of-the-art survey of ongoing research in this new area. It shows the positive effects of Semantic Web Mining and the obstacles faced by researchers, and proposes a number of approaches for dealing with the very complex and heterogeneous information and knowledge produced by Semantic Web technologies.
Funding: Note: Contents discussed in this paper are part of a key project, No. 2000-A31-01-04, sponsored by the Ministry of Science and Technology of P.R. China.
Abstract: Web data extraction obtains valuable data from the tremendous information resource of the World Wide Web according to a pre-defined pattern; it processes and classifies the data on the Web. A formalization of the Web data extraction procedure is presented, along with a description of the crawling and extraction algorithms. Based on this formalization, an XML-based page structure description language, TIDL, is introduced, including the object model, the HTML object reference model, and the definition of tags. Finally, a Web data gathering and querying application based on Internet agent technology, named the Web Integration Services Kit (WISK), is described.
Abstract: A large amount of data is present on the web, which can be used for useful purposes like product recommendation, price comparison, and demand forecasting for a particular product. Websites are designed for human understanding, not for machines; therefore, making the data machine-readable requires techniques to grab data from web pages. Researchers have addressed the problem using two approaches, i.e., knowledge engineering and machine learning. State-of-the-art knowledge engineering approaches use the structure of documents, visual cues, clustering of the attributes of data records, and text-processing techniques to identify data records on a web page. Machine learning approaches use annotated pages to learn rules, which are then used to extract data from unseen web pages. The structure of web documents is continuously evolving; therefore, new techniques are needed to handle the emerging requirements of web data extraction. In this paper, we present a novel, simple, and efficient technique to extract data from web pages using the visual styles and structure of documents. The proposed technique detects the Rich Data Region (RDR) using a query and the query's correlative words. The RDR is then divided into data records using style similarity, and noisy elements are removed using a Common Tag Sequence (CTS) and formatting entropy. The system is implemented in Java and runs on a dataset of real-world working websites. The effectiveness of the results is evaluated using precision, recall, and F-measure and compared with five existing systems; the comparison shows encouraging results.
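The "formatting entropy" idea can be illustrated generically: a repetitive region of data records has a low-entropy tag distribution, while a noisy region mixes many tags. The definition below is a plain Shannon-entropy stand-in and may differ from the paper's exact measure; the tag sequences are invented examples.

```python
from collections import Counter
from math import log2

def formatting_entropy(tag_seq):
    """Shannon entropy of a region's tag sequence: uniform, repetitive
    formatting (data records) scores low; mixed noise scores high."""
    counts = Counter(tag_seq)
    n = len(tag_seq)
    return -sum(c / n * log2(c / n) for c in counts.values())

records = ["div", "span", "div", "span", "div", "span"]  # repetitive region
noise = ["div", "a", "ul", "li", "img", "script"]        # mixed region
e_rec, e_noise = formatting_entropy(records), formatting_entropy(noise)
```

A threshold on this score would then let the extractor keep low-entropy record regions and discard high-entropy noise.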
Abstract: With user-generated content, anyone can be a content creator. This phenomenon has vastly increased the amount of information circulated online, and it is becoming harder to obtain the required information efficiently. In this paper, we describe how natural language processing and text mining can be parallelized using Hadoop and the Message Passing Interface. We propose a parallel web text mining platform that processes massive amounts of data quickly and efficiently. Our web knowledge service platform is designed to collect information about the IT and telecommunications industries from the web and process this information using natural language processing and data-mining techniques.
Funding: Project (50374079) supported by the National Natural Science Foundation of China.
Abstract: It is difficult to detect anomalies whose matching relationship among some data attributes differs greatly from the rest of a dataset. Aiming at this problem, an approach based on wavelet analysis for detecting and amending anomalous samples is proposed. Taking full advantage of wavelet analysis' properties of multi-resolution and local analysis, the approach is able to detect and amend anomalous samples effectively. To realize rapid numeric computation of the wavelet transform for a discrete sequence, a modified algorithm based on the Newton-Cotes formula is also proposed. Experimental results show that the approach is feasible, effective, and practical.
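A minimal sketch of the wavelet-based detection idea, assuming a single-level Haar transform on a synthetic signal; the paper's multi-resolution analysis and Newton-Cotes-based fast computation are not reproduced here, and the 5x-median threshold is an illustrative choice.

```python
import numpy as np

def haar_detail(x):
    """One level of the Haar wavelet transform: detail coefficients
    capture sharp local changes, which is where anomalies show up."""
    x = np.asarray(x, dtype=float)
    return (x[0::2] - x[1::2]) / np.sqrt(2.0)

# Smooth synthetic signal with one injected anomalous sample.
t = np.linspace(0, 4 * np.pi, 128)
signal = np.sin(t)
signal[60] += 5.0                        # the anomaly

d = haar_detail(signal)
# Flag detail coefficients far larger than their typical magnitude.
threshold = 5.0 * np.median(np.abs(d))
anomalous_pairs = np.where(np.abs(d) > threshold)[0]
```

Amending the sample would then amount to replacing the flagged value, e.g. by interpolating from its neighbors, and inverting the transform.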
Funding: Supported in part by the Fundamental Research Funds for the Central Universities under Grant No. 2013RC0114 and the 111 Project of China under Grant No. B08004.
Abstract: With increasingly complex website structures and continuously advancing web technologies, accurate user-click recognition from massive HTTP data, which is critical for web usage mining, becomes more difficult. In this paper, we propose a dependency graph model to describe the relationships between web requests. Based on this model, we design and implement a heuristic parallel algorithm to distinguish user clicks with the assistance of cloud computing technology. We evaluate the proposed algorithm with real massive data: the dataset, collected from a mobile core network, is 228.7 GB and covers more than three million users. The experimental results demonstrate that the proposed algorithm achieves higher accuracy than previous methods.
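One simple referrer-based heuristic behind dependency-graph click detection can be sketched as follows: a request whose referrer is not the URL of any captured request is treated as a user click, while requests referred by a captured page are its embedded resources. The request records are invented, and the paper's parallel heuristic is considerably more elaborate.

```python
def user_clicks(requests):
    """Treat a request as a user click when its referrer is not itself
    the URL of another request in the capture (i.e. it is a graph root)."""
    urls = {r["url"] for r in requests}
    return [r["url"] for r in requests if r.get("referrer") not in urls]

# Hypothetical HTTP records: url plus referrer header.
reqs = [
    {"url": "/news", "referrer": None},          # typed or clicked by the user
    {"url": "/style.css", "referrer": "/news"},  # embedded resource
    {"url": "/logo.png", "referrer": "/news"},   # embedded resource
    {"url": "/article", "referrer": "/portal"},  # referrer outside the capture
]
clicks = user_clicks(reqs)
```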
Abstract: The diversity of e-commerce Business-to-Consumer (B2C) systems, and the significant increase in their use during the COVID-19 pandemic as one of the primary channels of retail commerce, has made the need to measure their quality with practical methods all the more important. This paper presents a quality evaluation framework for B2C-specific web metrics. The framework uses three dimensions based on end-user interaction categories, the metrics' internal specifications, and the quality sub-characteristics defined by ISO 25010. Starting from the existing large corpus of general-purpose web metrics, e-commerce-specific metrics are chosen and categorized. The analysis results are subjected to data mining to provide association rules between the various dimensions of the framework. Finally, an ontology corresponding to the framework is developed to answer complicated questions related to metrics use and to facilitate the production of new, user-defined meta-metrics.
Abstract: The goal of this project is to use Semantic Web technologies and Data Mining for disease diagnosis, assisting health care professionals with the possible medication and drugs to prescribe (drug recommendation) according to the features of the patient. Numerous Decision Support Systems (DSS) and expert systems allow medical collaboration, as in specific or general differential diagnosis. However, a medical recommendation system using both Semantic Web technologies and Data Mining had not yet been developed, which initiated this work. It should be mentioned that several reference systems exist for medicine or active-ingredient interactions, but their final goal is not drug recommendation using the above technologies. With this project, we try to provide an assistant to the doctor for better recommendations. The patient will also be able to use this system for explanations of drugs, food interactions, and side effects of the corresponding drugs.