Abstract: Most proposed algorithms for mining association rules follow the conventional level-wise approach. The dynamic candidate generation idea introduced in the dynamic itemset counting (DIC) algorithm broke away from the level-wise limitation and can find the large itemsets using fewer passes over the database than level-wise algorithms. However, the dynamic approach is very sensitive to the data distribution of the database, and it requires a proper interval size. In this paper, an optimization technique named adaptive interval configuration (AIC) is developed to enhance the dynamic approach. The AIC optimization has two functions. First, it achieves a homogeneous distribution of large itemsets over intervals, so that fewer unnecessary candidates are generated and fewer database scanning passes are needed. Second, it determines a near-optimal interval size adaptively to produce the best response time. We also developed a candidate pruning technique named virtual partition pruning to reduce the size-2 candidate set and incorporated it into the AIC optimization. Based on this optimization technique, we propose the efficient AIC algorithm for mining association rules. AIC, DIC and the classic Apriori algorithm were implemented on a Sun Ultra Enterprise 4000 for performance comparison. The results show that AIC performed much better than both DIC and Apriori and showed strong robustness.
Abstract: Data mining techniques offer great opportunities for developing ethics lines, whose main aim is to ensure improvement of and compliance with the values, conduct and commitments that make up a code of ethics. The aim of this study is to suggest a process for exploiting the data generated and collected from an ethics line by extracting association rules with the Apriori algorithm. This makes it possible to identify anomalies and behaviour patterns requiring action to review, correct, promote or expand them, as appropriate.
Abstract: The Apriori algorithm is a classical method of association rule mining. Based on an analysis of this method, the paper presents an improved Apriori algorithm that combines a hash table technique with a reduction of candidate itemsets, in order to improve the efficiency of resource usage as well as the individualized service of the data library.
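For orientation, the baseline that such improvements start from can be sketched as follows. This is a minimal illustration of classical Apriori, not the improved algorithm of the paper, whose details the abstract does not give. The function name `apriori` and the use of an absolute count for `min_support` are assumptions made for this example. Counting uses a Python dict (a hash table), and candidates are pruned when any of their (k-1)-subsets is infrequent.

```python
from collections import defaultdict
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets whose count is >= min_support.

    min_support is an absolute transaction count (an assumption of this sketch).
    """
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count single items in a hash table.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        candidates = set()
        # Join step: unite pairs of frequent (k-1)-itemsets into k-itemsets.
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) != k:
                    continue
                # Prune step: every (k-1)-subset must itself be frequent.
                if all(frozenset(sub) in prev for sub in combinations(union, k - 1)):
                    candidates.add(union)

        # One database scan to count the surviving candidates.
        counts = defaultdict(int)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


demo = apriori(
    [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}],
    min_support=3,
)
```

In the toy data every single item and every pair reaches the threshold, while the triple {a, b, c} occurs only twice and is dropped after the final scan.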
Abstract: Hotspots (active fires) indicate the spatial distribution of fires. Determining the factors that influence hotspot occurrence is essential so that fire events can be predicted from the characteristics of a given area. This study discovers possible influence factors on the occurrence of fire events using the Apriori association rule algorithm in the study area of Rokan Hilir, Riau Province, Indonesia. The Apriori algorithm was applied to a forest fire dataset containing data on the physical environment (land cover, river, road and city center), socio-economic conditions (income source, population, and number of schools), weather (precipitation, wind speed, and screen temperature), and peatlands. The experiments revealed 324 multidimensional association rules indicating relationships between hotspot occurrence and other factors. The associations between hotspot occurrence and other geographical objects were discovered at a minimum support of 10% and a minimum confidence of 80%. The results show strong relations between hotspot occurrence and influence factors at a support of about 12.42%, a confidence of 1, and a lift of 2.26. These factors are precipitation greater than or equal to 3 mm/day, wind speed in [1 m/s, 2 m/s), non-peatland area, screen temperature in [297 K, 298 K), the number of schools per 1 km² less than or equal to 0.1, and a distance from each hotspot to the nearest road of less than or equal to 2.5 km.
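The three measures quoted above (support, confidence, lift) have standard definitions that a short sketch makes concrete. The transactions and item labels below (`near_road`, `far_road`, `hotspot`) are invented toy data for illustration only and are not taken from the study; the function names are likewise assumptions of this example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= frozenset(t))
    return hits / len(transactions)


def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    a = frozenset(antecedent)
    c = frozenset(consequent)
    s_both = support(transactions, a | c)
    s_a = support(transactions, a)
    s_c = support(transactions, c)
    confidence = s_both / s_a if s_a else 0.0
    # Lift > 1 means the consequent is more likely when the antecedent holds.
    lift = confidence / s_c if s_c else 0.0
    return s_both, confidence, lift


# Hypothetical toy transactions, not data from the paper.
toy = [
    {"near_road", "hotspot"},
    {"near_road", "hotspot"},
    {"far_road"},
    {"far_road", "hotspot"},
]
s, conf, lift = rule_metrics(toy, {"near_road"}, {"hotspot"})
```

Here the rule `near_road -> hotspot` holds in 2 of 4 transactions (support 0.5), every `near_road` transaction contains `hotspot` (confidence 1.0), and `hotspot` alone appears in 3 of 4 transactions, so the lift is 1.0 / 0.75, about 1.33.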
Abstract: Developing an efficient algorithm that can maintain discovered information as a database changes is quite important in data mining. Many proposed algorithms focused on a single level and did not utilize previously mined information in incrementally growing databases. In the past, we proposed an incremental mining algorithm for maintenance of multiple-level association rules as new transactions were inserted. Deletion of records in databases is, however, commonly seen in real-world applications. In this paper, we thus attempt to extend our previous approach to solve this issue. The concept of pre-large itemsets is used to reduce the need for rescanning original databases and to save maintenance costs. A pre-large itemset is not truly large, but promises to be large in the future. A lower support threshold and an upper support threshold are used to realize this concept. The two user-specified upper and lower support thresholds make the pre-large itemsets act as a gap to avoid small itemsets becoming large in the updated database when transactions are deleted. A new algorithm is thus proposed based on the concept to maintain discovered multiple-level association rules for deletion of records. The proposed algorithm does not need to rescan the original database until a number of records have been deleted. It can thus save much maintenance time.
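The two-threshold idea can be illustrated with a small sketch. It shows only the three-way classification of an itemset as large, pre-large, or small; it is not the maintenance algorithm proposed in the paper. The function name `classify_itemset`, the threshold values, and the counts are all assumptions chosen for the example.

```python
def classify_itemset(count, n_transactions, lower, upper):
    """Classify an itemset by its support ratio against two thresholds.

    large:     ratio >= upper
    pre-large: lower <= ratio < upper
    small:     ratio < lower
    """
    if not 0 <= lower <= upper <= 1:
        raise ValueError("thresholds must satisfy 0 <= lower <= upper <= 1")
    ratio = count / n_transactions
    if ratio >= upper:
        return "large"
    if ratio >= lower:
        return "pre-large"
    return "small"


# Hypothetical numbers: an itemset seen in 4 of 10 transactions.
before = classify_itemset(4, 10, lower=0.3, upper=0.5)

# Two transactions that do not contain the itemset are deleted,
# so its count stays at 4 while the database shrinks to 8.
after = classify_itemset(4, 8, lower=0.3, upper=0.5)
```

The example shows why tracking pre-large itemsets is useful under deletion: an itemset that was only pre-large (0.4) can become large (0.5) once unrelated records are removed, and its stored count makes that change detectable without rescanning the original database.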
Abstract: In order to make effective use of the large amount of graduate data that colleges and universities accumulate through teaching management, the paper studies data mining on a database of higher vocational graduates. A variety of data preprocessing methods are applied to the original data, and a mining algorithm based on the commonly used Apriori association rule algorithm is put forward. An association rule mining system is then designed and implemented according to actual needs. The mining results have benefited employment guidance for graduates and teaching management decisions in colleges.
Abstract: Maximum frequent pattern generation from a large database of transactions and items for association rule mining is an important research topic in data mining. Association rule mining aims to discover interesting correlations, frequent patterns, associations, or causal structures between items hidden in a large database. By exploiting quantum computing, we propose an efficient quantum search algorithm design to discover the maximum frequent patterns. We modify Grover’s search algorithm so that a subspace of arbitrary symmetric states is used instead of the whole search space. We present a novel quantum oracle design that employs a quantum counter to count the maximum frequent items and a quantum comparator to check the count against a minimum support threshold. The proposed algorithm increases the rate of correct solutions, since the search is performed only in a subspace. Furthermore, our algorithm scales well and optimizes the number of qubits required by the design, which is directly reflected in better performance. The proposed design can accommodate more transactions and items and still perform well with a small number of qubits.
Funding: Supported by the National Natural Science Foundation of China (60472099) and the Ningbo Natural Science Foundation (2006A610017).
Abstract: Because a data warehouse changes frequently, incremental data can render previously mined knowledge obsolete. In order to maintain the discovered knowledge and patterns dynamically, this study presents IPARUC, a novel algorithm for updating global frequent patterns. First, a rapid clustering method divides the database into n parts such that the data within each part are similar. Then, the nodes in the tree are adjusted dynamically during insertion by "pruning and laying back" to keep the frequency-descending order, so that they can be shared in a near-optimal way. Finally, the local frequent itemsets mined from each local dataset are merged into global frequent itemsets. The experimental results are very encouraging: IPARUC is more effective and efficient than the two other methods it was compared with. Furthermore, the approach has significant application potential in a prototype Web log analyzer for web usage mining, which can help discover useful knowledge effectively and even help managers make decisions.
Funding: The project is supported by the National Natural Science Foundation of China (No. 69931040).
Abstract: Association rule discovery and prediction with data mining methods are two topics in the field of information processing. In this paper, the records in a database are divided into many linguistic values, expressed as normal fuzzy numbers, by the fuzzy c-means algorithm, and a series of linguistic-valued association rules is generated. The records in the database are then mapped onto the linguistic values according to the largest subject principle, and definitions of support and confidence for linguistic-valued association rules are provided. Finally, the discovery and prediction methods for linguistic-valued association rules are discussed through a weather example.
Funding: Supported by the National Natural Science Foundation of China (No. 50975193) and the Specialized Research Fund for the Doctoral Program of Higher Education of China (No. 20060056016).
Abstract: By analyzing the existing prefix-tree data structure, an improved pattern tree was introduced for processing new transactions. It first stored transactions in a lexicographic-order tree and then restructured the tree by sorting each path in frequency-descending order. While updating the improved pattern tree, there was no need to rescan the entire new database or reconstruct a new tree for incremental updating. A test was performed on the synthetic dataset T10I4D100K with 100 000 transactions and 870 items. Experimental results show that the smaller the minimum support threshold, the greater the speed advantage of the improved pattern tree over CanTree for all datasets. As the minimum support threshold increased from 2% to 3.5%, the runtime of the improved pattern tree decreased from 452.71 s to 186.26 s. Meanwhile, the runtime required by CanTree decreased from 1 367.03 s to 432.19 s. When the database was updated, the execution time of the improved pattern tree consisted of the construction of the original improved pattern trees and the reconstruction of the initial tree. The experimental results showed that the runtime was reduced by about 15% compared with that of CanTree. As the number of transactions increased, the runtime of the improved pattern tree was about 25% shorter than that of FP-tree. The improved pattern tree also required less memory than CanTree.
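Why item ordering matters for a prefix tree can be seen in a small sketch. It builds the same transactions into a tree twice, once with items in lexicographic order and once in frequency-descending order, and compares node counts. This is a generic illustration only; it rebuilds the tree from scratch and does not reproduce the paper's in-place restructuring or its incremental update. All names (`Node`, `build_tree`, `frequency_key`, `count_nodes`) and the toy data are assumptions of this example.

```python
from collections import Counter


class Node:
    """Prefix-tree node holding a pass-through count and child links."""

    def __init__(self):
        self.count = 0
        self.children = {}


def build_tree(transactions, key):
    """Insert each transaction with its items sorted by the given key."""
    root = Node()
    for t in transactions:
        node = root
        for item in sorted(set(t), key=key):
            node = node.children.setdefault(item, Node())
            node.count += 1
    return root


def frequency_key(transactions):
    """Sort key: most frequent items first, ties broken lexicographically."""
    freq = Counter(item for t in transactions for item in set(t))
    return lambda item: (-freq[item], item)


def count_nodes(node):
    """Number of nodes below the root."""
    return sum(1 + count_nodes(child) for child in node.children.values())


# Hypothetical toy data: item "z" is shared by every transaction.
data = [["a", "z"], ["b", "z"], ["c", "z"]]
lex_tree = build_tree(data, key=lambda item: item)
freq_tree = build_tree(data, key=frequency_key(data))
lex_size = count_nodes(lex_tree)
freq_size = count_nodes(freq_tree)
```

In lexicographic order the shared item `z` sits at the end of three separate paths (6 nodes), whereas in frequency-descending order it becomes a single shared prefix (4 nodes). Greater prefix sharing of this kind is what makes frequency-ordered trees more compact.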
Funding: Deputyship for Research & Innovation, Ministry of Education in Saudi Arabia, through Project Number RI-44-0444.
Abstract: Data mining plays a crucial role in extracting meaningful knowledge from large-scale data repositories, such as data warehouses and databases. Association rule mining, a fundamental process in data mining, involves discovering correlations, patterns, and causal structures within datasets. In the healthcare domain, association rules offer valuable opportunities for building knowledge bases, enabling intelligent diagnoses, and extracting invaluable information rapidly. This paper presents a novel approach called the Machine Learning based Association Rule Mining and Classification for Healthcare Data Management System (MLARMC-HDMS). The MLARMC-HDMS technique integrates classification and association rule mining (ARM) processes. Initially, the chimp optimization algorithm-based feature selection (COAFS) technique is employed within MLARMC-HDMS to select relevant attributes. Inspired by the foraging behavior of chimpanzees, the COA algorithm mimics their search strategy for food. Subsequently, the classification process utilizes stochastic gradient descent with a multilayer perceptron (SGD-MLP) model, while the Apriori algorithm determines attribute relationships. We propose a COA-based feature selection approach for medical data classification using machine learning techniques. This approach involves selecting pertinent features from medical datasets through COA and training machine learning models using the reduced feature set. We evaluate the performance of our approach on various medical datasets employing diverse machine learning classifiers. Experimental results demonstrate that our proposed approach surpasses alternative feature selection methods, achieving higher accuracy and precision rates in medical data classification tasks. The study showcases the effectiveness and efficiency of the COA-based feature selection approach in identifying relevant features, thereby enhancing the diagnosis and treatment of various diseases. To provide further validation, we conduct detailed experiments on a benchmark medical dataset, revealing the superiority of the MLARMC-HDMS model over other methods, with a maximum accuracy of 99.75%. Therefore, this research contributes to the advancement of feature selection techniques in medical data classification and highlights the potential for improving healthcare outcomes through accurate and efficient data analysis. The presented MLARMC-HDMS framework and COA-based feature selection approach offer valuable insights for researchers and practitioners working in the field of healthcare data mining and machine learning.
Abstract: Since the digital revolution, large quantities of data have been generated over time through various networks. This growth has made it very difficult to analyze the data and detect attacks using suitable techniques. While Intrusion Detection Systems (IDSs) secure resources against threats, they still face challenges in improving detection accuracy, reducing false alarm rates, and detecting unknown attacks. This paper presents a framework that integrates data mining classification algorithms and association rules to implement network intrusion detection. Several experiments have been performed and evaluated to assess various machine learning classifiers on the KDD99 intrusion dataset. Our study focuses on several data mining algorithms, such as naïve Bayes, decision trees, support vector machines, decision tables, k-nearest neighbor algorithms, and artificial neural networks. Moreover, this paper is concerned with the association process for creating attack rules to identify attacks in the network audit data, using the KDD99 dataset for anomaly detection. The focus is on the false negative and false positive performance metrics, in order to enhance the detection rate of the intrusion detection system. The experiments compare the results of each algorithm and demonstrate that the decision tree is the most powerful algorithm, as it has the highest accuracy (0.992) and the lowest false positive rate (0.009).
Abstract: The traditional Apriori algorithm, when applied in a book management system, slows system operation because it scans the database frequently and produces an excessive number of candidate item-sets. An information recommendation book management system based on an improved Apriori data mining algorithm is therefore designed, in which the C/S (client/server) architecture and the B/S (browser/server) architecture are integrated so as to open the book information to library staff and borrowers. The data preprocessing sub-module in the system function module extracts the related information on borrowers and books from the book lending database. After the data is cleaned, converted and integrated, the association rule mining sub-module applies the improved Apriori data mining algorithm to the processed data to mine the strong association rules, that is, those with support greater than the minimum support threshold and confidence greater than the minimum confidence threshold, and generates an association rule database. The personalized recommendation sub-module then performs association matching in the association rule database according to the borrower and the books he or she has selected. The book information associated with the books read by the borrower is recommended to him or her, realizing personalized recommendation of book information. The experimental results show that the system can effectively recommend book-related information, and that its CPU occupation rate is only 6.47% when 50 clients are running it at the same time. Overall, the system performs well.
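The strong-rule criterion described in the abstract above (minimum support plus minimum confidence) can be illustrated with the classical, unimproved Apriori procedure. This is a minimal sketch under stated assumptions: it is not the paper's improved algorithm, the function names and the sample book titles are hypothetical, and thresholds are applied as "greater than or equal to", which is a common convention rather than something the abstract specifies.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori: map each frequent itemset to its support (a fraction)."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)
    # Level 1 candidates: every single item that occurs in the data.
    candidates = {frozenset([item]) for basket in baskets for item in basket}
    frequent = {}
    k = 1
    while candidates:
        # One pass over the data per level to count each candidate.
        counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-itemsets.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def association_rules(frequent, min_confidence):
    """Return (antecedent, consequent, support, confidence) for each strong rule."""
    rules = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                # confidence(A -> B) = support(A and B) / support(A)
                confidence = support / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, support, confidence))
    return rules


if __name__ == "__main__":
    # Hypothetical lending records: each set is the books one borrower took out.
    loans = [
        {"Python Basics", "Data Mining"},
        {"Python Basics", "Data Mining", "Statistics"},
        {"Python Basics", "Statistics"},
        {"Data Mining", "Statistics"},
        {"Python Basics", "Data Mining"},
    ]
    for lhs, rhs, sup, conf in association_rules(frequent_itemsets(loans, 0.5), 0.7):
        print(sorted(lhs), "->", sorted(rhs), f"support={sup:.2f} confidence={conf:.2f}")
```

The repeated pass over `baskets` at every level is the database-scanning cost that the abstract identifies as the bottleneck of the traditional algorithm.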
Funding: Supported by the National Natural Science Foundation of China (No. 61601471).
Abstract: With the development of smart agriculture, the data accumulated in the field of pesticide regulation has reached a considerable scale. The pesticide transaction data collected by the Pesticide National Data Center alone produces more than 10 million records daily. However, because the technical means are outdated, the existing pesticide supervision data lack deep mining and usage. The Apriori algorithm is one of the classic algorithms in association rule mining, but it needs to traverse the transaction database multiple times, which causes an extra IO burden. Spark is an emerging big data parallel computing framework with advantages such as in-memory computing and flexible distributed data sets. Compared with the Hadoop MapReduce computing framework, its IO performance is greatly improved. Therefore, this paper proposes ICAMA, an improved Apriori algorithm based on the Spark framework. A MapReduce process is used to count the support of the candidate sets and then to generate the candidate sets. In the experimental comparison, when the data volume exceeds 250 Mb, the performance of the Spark-based Apriori algorithm is 20% higher than that of the traditional Hadoop-based Apriori algorithm, and the improvement becomes more pronounced as the data volume increases.
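The map-then-reduce pattern for counting candidate support, mentioned in the abstract above, can be sketched in plain Python. This is only an illustration of the general idea: it does not use Spark or Hadoop, it is not the ICAMA algorithm, and the function names, the partition layout, and the sample items are all hypothetical.

```python
from collections import Counter
from functools import reduce


def count_in_partition(partition, candidates):
    """Map step: count, within one data partition, how often each candidate occurs."""
    local = Counter()
    for transaction in partition:
        items = frozenset(transaction)
        for candidate in candidates:
            if candidate <= items:  # candidate is a subset of the transaction
                local[candidate] += 1
    return local


def global_support_counts(partitions, candidates):
    """Reduce step: merge the per-partition counters into global support counts."""
    partial_counts = (count_in_partition(p, candidates) for p in partitions)
    return reduce(lambda left, right: left + right, partial_counts, Counter())


if __name__ == "__main__":
    # Two hypothetical partitions of a transaction log, as a cluster might split it.
    partitions = [
        [{"A", "B"}, {"A", "C"}],
        [{"A", "B", "C"}, {"B"}],
    ]
    candidates = [frozenset({"A", "B"}), frozenset({"A", "C"}), frozenset({"B", "C"})]
    for itemset, count in global_support_counts(partitions, candidates).items():
        print(sorted(itemset), count)
```

Because each partition is counted independently, the map step can run in parallel and only the small counters need to be combined, which is what makes the pattern attractive for data that is too large to scan repeatedly on one machine.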
Abstract: Most proposed algorithms for mining association rules follow the conventional level-wise approach. The dynamic candidate generation idea introduced in the dynamic itemset counting (DIC) algorithm broke away from the level-wise limitation and could find the large itemsets using fewer passes over the database than level-wise algorithms. However, the dynamic approach is very sensitive to the data distribution of the database, and it requires a proper interval size. In this paper, an optimization technique named adaptive interval configuration (AIC) has been developed to enhance the dynamic approach. The AIC optimization has two functions. The first is to achieve a homogeneous distribution of large itemsets over intervals, so that fewer unnecessary candidates are generated and fewer database scanning passes are needed. The second is to determine a near-optimal interval size adaptively to produce the best response time. We also developed a candidate pruning technique named virtual partition pruning to reduce the size-2 candidate set and incorporated it into the AIC optimization. Based on this optimization technique, we proposed the efficient AIC algorithm for mining association rules. The AIC, DIC and classic Apriori algorithms were implemented on a Sun Ultra Enterprise 4000 for performance comparison. The results show that AIC performed much better than both DIC and Apriori and showed strong robustness.