Maximum frequent pattern generation from a large database of transactions and items for association rule mining is an important research topic in data mining. Association rule mining aims to discover interesting correlations, frequent patterns, associations, or causal structures between items hidden in a large database. By exploiting quantum computing, we propose an efficient quantum search algorithm design to discover the maximum frequent patterns. We modify Grover’s search algorithm so that a subspace of arbitrary symmetric states is used instead of the whole search space. We present a novel quantum oracle design that employs a quantum counter to count the maximum frequent items and a quantum comparator to check the count against a minimum support threshold. The proposed algorithm increases the rate of correct solutions, since the search is confined to a subspace. Furthermore, our algorithm scales well and optimizes the number of qubits the design requires, which directly benefits performance. The proposed design can accommodate more transactions and items and still perform well with a small number of qubits.
The classical algorithm for finding the association rules generated by a frequent itemset has to generate all non-empty subsets of the frequent itemset as the candidate set of consequents. Xiongfei Li addressed this and proposed an improved algorithm, which finds all consequents layer by layer and is therefore breadth-first. In this paper, we propose a new algorithm, Generate Rules by using Set-Enumeration Tree (GRSET), which uses the structure of a set-enumeration tree and a depth-first method to find all consequents of the association rules one by one and to obtain all association rules corresponding to those consequents. Experiments show the GRSET algorithm to be practicable and efficient.
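To make the depth-first idea concrete, the following is a minimal Python sketch of enumerating rule consequents over a set-enumeration tree. It is not the GRSET algorithm itself, whose details the abstract does not give. The function names, the brute-force `support_count` helper, and the pruning rule (stop descending once a consequent fails the confidence threshold, because any larger consequent has a smaller antecedent and therefore no higher confidence) are assumptions made for illustration.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def rules_depth_first(frequent, transactions, min_conf):
    """Enumerate rules (frequent - C) -> C depth-first over consequents C.

    Assumes `frequent` occurs in at least one transaction, so no
    antecedent has a support count of zero.
    """
    items = sorted(frequent)
    full = frozenset(items)
    full_count = support_count(full, transactions)
    rules = []

    def visit(consequent, start):
        # Children of a node add one item that comes later in the ordering,
        # which is what makes this a set-enumeration tree.
        for pos in range(start, len(items)):
            extended = consequent | {items[pos]}
            antecedent = full - extended
            if not antecedent:
                continue  # a rule needs a non-empty antecedent
            conf = full_count / support_count(antecedent, transactions)
            if conf >= min_conf:
                rules.append((antecedent, extended, conf))
                visit(extended, pos + 1)  # descend only below passing nodes

    visit(frozenset(), 0)
    return rules
```

Each returned tuple is `(antecedent, consequent, confidence)`. Recounting support by scanning the transactions keeps the sketch short; a practical implementation would look supports up in the table produced by the frequent-itemset mining step.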
Association rule mining is an important issue in data mining. The paper proposes a binary-system-based method to generate candidate frequent itemsets and their corresponding support counts efficiently, which needs only operations such as "and", "or" and "xor". Applying this idea to the existing distributed association rule mining algorithm FDM, the improved algorithm BFDM is proposed. Theoretical analysis and experiments show that BFDM is effective and efficient.
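The binary encoding idea can be illustrated with a short Python sketch. It is not BFDM or the paper's method. It assumes each transaction is encoded as an integer bitmask with one bit per item, so that testing whether a candidate itemset is contained in a transaction reduces to a single bitwise "and". All function names are hypothetical.

```python
from itertools import combinations


def encode(items, index):
    """Pack a collection of items into an integer bitmask."""
    mask = 0
    for item in items:
        mask |= 1 << index[item]
    return mask


def support_count(mask, encoded):
    """Count transactions whose bitmask contains every bit of `mask`."""
    return sum(1 for t in encoded if (t & mask) == mask)


def frequent_itemsets(transactions, min_count):
    """Brute-force level-wise search using bitmask containment tests."""
    universe = sorted({item for t in transactions for item in t})
    index = {item: pos for pos, item in enumerate(universe)}
    encoded = [encode(t, index) for t in transactions]
    result = {}
    for size in range(1, len(universe) + 1):
        found_any = False
        for combo in combinations(universe, size):
            count = support_count(encode(combo, index), encoded)
            if count >= min_count:
                result[frozenset(combo)] = count
                found_any = True
        if not found_any:
            break  # no frequent itemset of this size, so none larger
    return result
```

The sketch enumerates every combination at each level for brevity; Apriori-style candidate generation would only build candidates from the frequent itemsets of the previous level.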
Data mining techniques offer great opportunities for developing ethics lines whose main aim is to ensure improvements and compliance with the values, conduct and commitments making up the code of ethics. The aim of this study is to suggest a process for exploiting the data generated and collected from an ethics line by extracting rules of association and applying the Apriori algorithm. This makes it possible to identify anomalies and behaviour patterns requiring action to review, correct, promote or expand them, as appropriate.
In this paper, we propose an efficient algorithm, called FFP-Growth (short for fast FP-Growth), to mine frequent itemsets. Similar to FP-Growth, FFP-Growth searches the FP-tree in bottom-up order, but it need not construct conditional pattern bases and sub-FP-trees, thus saving a substantial amount of time and space. The FP-tree it creates is also much smaller than that created by TD-FP-Growth, which further improves efficiency. At the same time, FFP-Growth can easily be extended to reduce the search space, as TD-FP-Growth (M) and TD-FP-Growth (C) do. Experimental results show that the proposed algorithm is effective and efficient.
Market trends have changed rapidly over the last two decades. The primary reason is the newly created opportunities and the increased number of competitors competing to gain market share using business analysis techniques. Market Basket Analysis (MBA) has had a tangible effect in facilitating this change. MBA is one of the well-known fields that deal with Big Data and data mining applications. MBA uses Association Rule Learning (ARL) as a means of realization. ARL is beneficial for analyzing market data and understanding customers’ behavior. An important motive for using such techniques is maximizing business profit while matching customer needs as closely as possible. In this survey paper, we discuss several applications and methods of MBA based on ARL. We also review some association rule learning measures, including trust, lift, leverage, and others. Furthermore, we discuss some open issues and future topics in the area of market basket analysis and association rule learning.
The fight against fraud and trafficking is a fundamental mission of customs. The conditions for carrying out this mission depend both on the evolution of economic issues and on the behaviour of the actors in charge of its implementation. As part of the customs clearance process, customs are nowadays confronted with an increasing volume of goods in connection with the development of international trade. Automated risk management is therefore required to limit intrusive control. In this article, we propose an unsupervised classification method to extract knowledge rules from a database of customs offences in order to identify abnormal behaviour resulting from customs control. The idea is to apply the Apriori principle, on the basis of frequent patterns, to a database of customs offences in customs procedures, in order to uncover potential rules of association between a customs operation and an offence and thereby extract knowledge governing the occurrence of fraud. This mass of often heterogeneous and complex data generates new needs that knowledge extraction methods must be able to meet. The assessment of infringements inevitably requires a proper identification of the risks. It is an original approach based on data mining that builds association rules in two steps: first, search for frequent patterns (support >= minimum support); then, from the frequent patterns, produce association rules (confidence >= minimum confidence). The simulations carried out highlighted three main kinds of association rules (forecasting rules, targeting rules and neutral rules), with the introduction of a third indicator of rule relevance, the lift measure. Confidence in the first two kinds of rules has been set to at least 50%.
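The three indicators named above follow the standard definitions: the support of a rule A -> C is the fraction of transactions containing both A and C, its confidence is support(A and C) / support(A), and its lift is confidence / support(C). A minimal Python sketch of these definitions follows. The function names are hypothetical and nothing here is taken from the customs study itself.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    Assumes both sides occur in at least one transaction, so neither
    division is by zero.
    """
    s_rule = support(antecedent | consequent, transactions)
    confidence = s_rule / support(antecedent, transactions)
    lift = confidence / support(consequent, transactions)
    return s_rule, confidence, lift
```

A lift above 1 indicates that the antecedent and consequent occur together more often than they would if they were independent, a lift near 1 indicates no association, and a lift below 1 indicates that they tend to exclude each other.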
In this paper, an association rule mining algorithm is used to analyze the correlations among the various factors that cause traffic accidents, from which a relationship model of dangerous driving behaviors is established. In this model, the factors and their correlations include: ability of risk control, ability of driving self-confidence, individual characteristics, and incorrect driving operations. By selecting drivers in the city of Chengdu as the objects of investigation, a group of valid sample data is obtained. Based on these data, the support and confidence of the association rules are analyzed. In the analysis, the two-stage computation of the Apriori algorithm is simulated, from which some important rules are obtained. With these rules, traffic administration departments can focus on these key factors when handling traffic matters. By training drivers’ skills and their physical and mental behaviors, incorrect driving operations can be greatly reduced and traffic safety can be effectively guaranteed.
The traditional Apriori algorithm, when applied in a book management system, slows the system down because of frequent database scans and an excessive number of candidate itemsets. An information recommendation book management system based on an improved Apriori data mining algorithm is therefore designed, in which the C/S (client/server) and B/S (browser/server) architectures are integrated so as to open the book information to library staff and borrowers. The data on borrowers and books are extracted from the book lending database by the data preprocessing sub-module in the system function module. After the data are cleaned, converted and integrated, the association rule mining sub-module uses the improved Apriori data mining algorithm on the processed data to mine the strong association rules whose support is greater than the minimum support threshold and whose confidence is greater than the minimum confidence threshold, and generates an association rule database. The personalized recommendation sub-module then performs association matching in the association rule database according to the borrower and his or her selected books. The book information associated with the books read by the borrower is recommended to him or her, realizing personalized recommendation of book information. The experimental results show that the system can effectively recommend book-related information, and that its CPU occupation rate is only 6.47% when 50 clients are running it at the same time. Overall, it has good performance.
Hotspots (active fires) indicate the spatial distribution of fires. A study on determining influence factors for hotspot occurrence is essential so that fire events can be predicted based on the characteristics of a certain area. This study discovers the possible influence factors on the occurrence of fire events using the association rule algorithm Apriori in the study area of Rokan Hilir, Riau Province, Indonesia. The Apriori algorithm was applied to a forest fire dataset which contained data on the physical environment (land cover, river, road and city center), socio-economic conditions (income source, population, and number of schools), weather (precipitation, wind speed, and screen temperature), and peatlands. The experimental results revealed 324 multidimensional association rules indicating relationships between hotspot occurrence and other factors. The association between hotspot occurrence and other geographical objects was discovered for a minimum support of 10% and a minimum confidence of 80%. The results show that strong relations between hotspot occurrence and influence factors are found for a support of about 12.42%, a confidence of 1, and a lift of 2.26. These factors are precipitation greater than or equal to 3 mm/day, wind speed in [1 m/s, 2 m/s), non-peatland area, screen temperature in [297 K, 298 K), number of schools per 1 km² less than or equal to 0.1, and distance from each hotspot to the nearest road less than or equal to 2.5 km.
In this paper, we study the Apriori and FP-growth algorithms for mining association rules and give a method for computing all the frequent itemsets in a database. Its basic idea is a concept based on the Boolean vector product of transactions, which is computed between all the transactions and yields all the frequent 2-itemsets (minsup = 2). Based on their inclusion relation, we construct a set-tree of the itemsets in the database transactions, and then traverse the paths in it to get all the frequent itemsets. Therefore, we can get the minimal frequent itemsets between transactions and items in the database without repeatedly scanning the database and iteratively computing as in the Apriori algorithm.
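The abstract does not define the vector product precisely. One plausible reading, sketched below in Python, represents each transaction as a Boolean vector over the item universe and takes the element-wise "and" of every pair of transactions. Every non-empty result is an itemset contained in at least two transactions, so it meets minsup = 2. The function names and this interpretation are assumptions, not the paper's definition.

```python
from itertools import combinations


def to_vector(transaction, universe):
    """Boolean vector with one entry per item of the universe."""
    return [item in transaction for item in universe]


def pairwise_products(transactions):
    """Itemsets shared by at least one pair of transactions."""
    universe = sorted(set().union(*transactions))
    vectors = [to_vector(t, universe) for t in transactions]
    shared = set()
    for u, v in combinations(vectors, 2):
        product = [a and b for a, b in zip(u, v)]  # element-wise "and"
        common = frozenset(i for i, bit in zip(universe, product) if bit)
        if common:
            shared.add(common)
    return shared
```

Every subset of a returned itemset also has support of at least 2, which is consistent with the abstract's next step of organizing the itemsets by inclusion in a tree.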
This paper presents some new algorithms to efficiently mine max frequent generalized itemsets (g-itemsets) and essential generalized association rules (g-rules). These are compact and general representations for all frequent patterns and all strong association rules in the generalized environment. Our results fill an important gap among algorithms for frequent patterns and association rules by combining two concepts. First, generalized itemsets employ a taxonomy of items, rather than a flat list of items. This produces more natural frequent itemsets and associations such as (meat, milk) instead of (beef, milk), (chicken, milk), etc. Second, compact representations of frequent itemsets and strong rules, whose result size is exponentially smaller, can solve a standard dilemma in mining patterns: with small threshold values for support and confidence, the user is overwhelmed by the extraordinary number of identified patterns and associations; but with large threshold values, some interesting patterns and associations fail to be identified. Our algorithms can also expand those max frequent g-itemsets and essential g-rules into the much larger set of ordinary frequent g-itemsets and strong g-rules. While that expansion is not recommended in most practical cases, we do so in order to present a comparison with existing algorithms that only handle ordinary frequent g-itemsets. In this case, the new algorithm is shown to be thousands, and in some cases millions, of times faster than previous algorithms. Further, the new algorithm succeeds in analyzing deeper taxonomies, with depths of seven or more. Experimental results for previous algorithms were limited to taxonomies with depth at most three or four. In each of the two problems, a straightforward lattice-based approach is briefly discussed and then a classification-based algorithm is developed. In particular, the two classification-based algorithms are MFGI_class for mining max frequent g-itemsets and EGR_class for mining essential g-rules. The classification-based algorithms feature conceptual classification trees and dynamic generation and pruning algorithms.
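The effect of a taxonomy on support can be shown with a small Python sketch. It is not MFGI_class or EGR_class. It uses the common baseline technique of extending each transaction with the ancestors of its items, so that a generalized itemset such as (meat, milk) is counted whenever any descendant of "meat" appears together with milk. The `parent` mapping and the function names are hypothetical.

```python
def ancestors(item, parent):
    """All ancestors of `item` in a taxonomy given as a child -> parent map."""
    found = set()
    while item in parent:
        item = parent[item]
        found.add(item)
    return found


def extend(transaction, parent):
    """Transaction plus every ancestor of each of its items."""
    extended = set(transaction)
    for item in transaction:
        extended |= ancestors(item, parent)
    return extended


def generalized_support_count(itemset, transactions, parent):
    """Count transactions whose extended form contains `itemset`."""
    return sum(1 for t in transactions if itemset <= extend(t, parent))
```

With a taxonomy in which beef and chicken are both kinds of meat, the transactions {beef, milk} and {chicken, milk} each support (meat, milk), although neither (beef, milk) nor (chicken, milk) is supported by both.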
A method for mining frequent itemsets by evaluating the probability of their supports based on association analysis is presented. The method obtains the probability of every 1-itemset by scanning the database, then evaluates the probability of every 2-itemset, every 3-itemset, up to every k-itemset from the frequent 1-itemsets, and so obtains all the candidate frequent itemsets. It then scans the database to verify the support of the candidate frequent itemsets. Finally, the frequent itemsets are mined. The method greatly reduces the time spent scanning the database and shortens the computation time of the algorithm.
One of the obstacles to efficient association rule mining is the explosive expansion of data sets, since it is costly or impossible to scan large databases, especially multiple times. A popular solution for improving the speed and scalability of association rule mining is to run the algorithm on a random sample instead of the entire database. But how to effectively define and efficiently estimate the degree of error in the outcome of the algorithm, and how to determine the sample size needed, remain open research problems. In this paper, an effective and efficient algorithm is given, based on PAC (Probably Approximately Correct) learning theory, to measure and estimate the sample error. Then a new adaptive, on-line, fast sampling strategy, multi-scaling sampling, is presented, inspired by MRA (Multi-Resolution Analysis) and the Shannon sampling theorem, for quickly obtaining acceptably approximate association rules at an appropriate sample size. Both theoretical analysis and empirical study show that the sampling strategy can achieve a very good speed-accuracy trade-off.
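The abstract does not state the bound the paper derives. As an illustration of the kind of guarantee PAC-style analysis gives, the Python sketch below uses the standard Hoeffding inequality: if n transactions are drawn independently and uniformly at random, the sampled support of one fixed itemset deviates from its true support by at least epsilon with probability at most 2·exp(-2·n·epsilon²). Solving for n gives the sample size. The function name is hypothetical, and this is not the multi-scaling sampling strategy.

```python
import math


def hoeffding_sample_size(epsilon, delta):
    """Smallest n with 2 * exp(-2 * n * epsilon**2) <= delta.

    epsilon: tolerated absolute error in the estimated support.
    delta:   tolerated probability of exceeding that error.
    The bound covers one fixed itemset; covering many itemsets at once
    needs a union bound or a tighter argument.
    """
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))
```

The required size depends on the error tolerance and the failure probability but not on the size of the database, which is why sampling becomes more attractive as the database grows.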
Because a data warehouse changes frequently, incremental data can make previously mined knowledge obsolete. In order to maintain the discovered knowledge and patterns dynamically, this study presents IPARUC, a novel algorithm for updating global frequent patterns. In IPARUC, a rapid clustering method is first introduced to divide the database into n parts, such that the data within each part are similar. Then, the nodes in the tree are adjusted dynamically during insertion by "pruning and laying back" to keep the frequency-descending order, so that node sharing approaches the optimum. Finally, the local frequent itemsets mined from each local dataset are merged into global frequent itemsets. The results of the experimental study are very encouraging: IPARUC is more effective and efficient than the two methods it was compared with. Furthermore, it has significant application potential in a prototype Web log analyzer for web usage mining, which can help discover useful knowledge effectively and can even help managers make decisions.
This paper aims to mine the knowledge and rules on the compatibility of drugs from the prescriptions for curing arrhythmia in the Chinese traditional medicine database using the Apriori algorithm. For data preparation, 1,113 prescriptions for arrhythmia, including 535 herbs (10,884 herb occurrences in total), were collected into the database. The prescription data were preprocessed through redundancy reduction, normalized storage, and knowledge induction according to the pretreatment demands of data mining. Then the Apriori algorithm was used to analyze the data and form the related technical rules and treatment procedures. The experimental results show that the prescription compatibility obtained by the Apriori algorithm generally accords with the basic law of traditional Chinese medicine for arrhythmia. Some special, previously unreported compatibilities were also discovered in the experiment, which may be used as the basis for developing new prescriptions for arrhythmia.
Working at height is widespread across various industries, with frequent and hazardous falls occurring regularly. Such tasks are often linked to multifactorial issues, where the interplay of diverse factors leads to accidents that are challenging to control effectively. This study establishes an index system for the factors influencing falls from height by statistically analyzing 101 incidents, identifying 64 causative elements classified into four categories. These include 17 factors related to operator condition and behavior, 13 concerning equipment and facility conditions, 7 pertaining to site conditions, and 27 associated with production operations management. Utilizing the Apriori algorithm and Gephi software, the study mined the association rules of causal factors in falls from height and constructed their network diagram. By examining association rules with high support, confidence, and lift, the relationships between key causal factors leading to accidents are clarified, identifying critical operational control points and providing a scientific foundation for reducing the incidence of falls from height. Currently, China's standards related to working at height remain fragmented. This study lays the foundation for the development of comprehensive, systematic, generic safety management standards for working at height, satisfying the needs of the field.
Funding: Supported by the National Natural Science Foundation of China (No. 60474022) and the Natural Science Foundation of Henan Province (No. G2002026, 200510475028).
Funding: Supported by the National Natural Science Foundation of China (70371015).
文摘Association rule mining is an important issue in data mining. The paper proposed an binary system based method to generate candidate frequent itemsets and corresponding supporting counts efficiently, which needs only some operations such as "and", "or" and "xor". Applying this idea in the existed distributed association rule mining al gorithm FDM, the improved algorithm BFDM is proposed. The theoretical analysis and experiment testify that BFDM is effective and efficient.
文摘Data mining techniques offer great opportunities for developing ethics lines whose main aim is to ensure improvements and compliance with the values, conduct and commitments making up the code of ethics. The aim of this study is to suggest a process for exploiting the data generated by the data generated and collected from an ethics line by extracting rules of association and applying the Apriori algorithm. This makes it possible to identify anomalies and behaviour patterns requiring action to review, correct, promote or expand them, as appropriate.
文摘In this paper, we propose an efficient algorithm, called FFP-Growth (shortfor fast FP-Growth) , to mine frequent itemsets. Similar to FP-Growth, FFP-Growth searches theFP-tree in the bottom-up order, but need not construct conditional pattern bases and sub-FP-trees,thus, saving a substantial amount of time and space, and the FP-tree created by it is much smallerthan that created by TD-FP-Growth, hence improving efficiency. At the same time, FFP-Growth can beeasily extended for reducing the search space as TD-FP-Growth (M) and TD-FP-Growth (C). Experimentalresults show that the algorithm of this paper is effective and efficient.
文摘The market trends rapidly changed over the last two decades.The primary reason is the newly created opportunities and the increased number of competitors competing to grasp market share using business analysis techniques.Market Basket Analysis has a tangible effect in facilitating current change in the market.Market Basket Analysis is one of the famous fields that deal with Big Data and Data Mining applications.MBA initially uses Association Rule Learning(ARL)as a mean for realization.ARL has a beneficial effect in providing a plenty benefit in analyzing the market data and understanding customers’behavior.An important motive of using such techniques is maximizing the business profit as well as matching the exact customer needs as closely as possible.In this survey paper,we discussed several applications and methods of MBA based on ARL.Also,we reviewed some association rule learning measurements including trust,lift,leverage,and others.Furthermore,we discuss some open issues and future topics in the area of market basket analysis and association rule learning.
文摘The fight against fraud and trafficking is a fundamental mission of customs. The conditions for carrying out this mission depend both on the evolution of economic issues and on the behaviour of the actors in charge of its implementation. As part of the customs clearance process, customs are nowadays confronted with an increasing volume of goods in connection with the development of international trade. Automated risk management is therefore required to limit intrusive control. In this article, we propose an unsupervised classification method to extract knowledge rules from a database of customs offences in order to identify abnormal behaviour resulting from customs control. The idea is to apply the Apriori principle on the basis of frequent grounds on a database relating to customs offences in customs procedures to uncover potential rules of association between a customs operation and an offence for the purpose of extracting knowledge governing the occurrence of fraud. This mass of often heterogeneous and complex data thus generates new needs that knowledge extraction methods must be able to meet. The assessment of infringements inevitably requires a proper identification of the risks. It is an original approach based on data mining or data mining to build association rules in two steps: first, search for frequent patterns (support >= minimum support) then from the frequent patterns, produce association rules (Trust >= Minimum Trust). The simulations carried out highlighted three main association rules: forecasting rules, targeting rules and neutral rules with the introduction of a third indicator of rule relevance which is the Lift measure. Confidence in the first two rules has been set at least 50%.
文摘In this paper,association rule mining algorithm is utilized to analyze the correlations of various factors of causing traffic accidents,from which the relationship model of dangerous driving behaviors is established.In this model,the factors and their correlations include:ability of risk control,ability of driving self-confidence,individual characteristics,and incorrect driving operations.By selecting the drivers in the city of Chengdu to be the objects of investigation,a group of valid sample data is obtained.Based on these data,the Support and Confidence for association rules are analyzed.In the analysis,the two stage computing of Apriori algorithm programming is simulated,and from which some important rules are obtained.With these rules,departments of traffic administration can focus on these key factors in their processing of traffic transactions.By the training of drivers’skills and their physical and mental behaviors,the incorrect driving operations can be greatly reduced and the traffic safety can be effectively guaranteed.
文摘The traditional Apriori applied in books management system causes slow system operation due to frequent scanning of database and excessive quantity of candidate item-sets, so an information recommendation book management system based on improved Apriori data mining algorithm is designed, in which the C/S (client/server) architecture and B/S (browser/server) architecture are integrated, so as to open the book information to library staff and borrowers. The related information data of the borrowers and books can be extracted from books lending database by the data preprocessing sub-module in the system function module. After the data is cleaned, converted and integrated, the association rule mining sub-module is used to mine the strong association rules with support degree greater than minimum support degree threshold and confidence coefficient greater than minimum confidence coefficient threshold according to the processed data and by means of the improved Apriori data mining algorithm to generate association rule database. The association matching is performed by the personalized recommendation sub-module according to the borrower and his selected books in the association rule database. The book information associated with the books read by borrower is recommended to him to realize personalized recommendation of the book information. The experimental results show that the system can effectively recommend book related information, and its CPU occupation rate is only 6.47% under the condition that 50 clients are running it at the same time. Anyway, it has good performance.
文摘Hotspots (active fires) indicate spatial distribution of fires. A study on determining influence factors for hotspot occurrence is essential so that fire events can be predicted based on characteristics of a certain area. This study discovers the possible influence factors on the occurrence of fire events using the association rule algorithm namely Apriori in the study area of Rokan Hilir Riau Province Indonesia. The Apriori algorithm was applied on a forest fire dataset which containeddata on physical environment (land cover, river, road and city center), socio-economic (income source, population, and number of school), weather (precipitation, wind speed, and screen temperature), and peatlands. The experiment results revealed 324 multidimensional association rules indicating relationships between hotspots occurrence and other factors.The association among hotspots occurrence with other geographical objects was discovered for the minimum support of 10% and the minimum confidence of 80%. The results show that strong relations between hotspots occurrence and influence factors are found for the support about 12.42%, the confidence of 1, and the lift of 2.26. These factors are precipitation greater than or equal to 3 mm/day, wind speed in [1m/s, 2m/s), non peatland area, screen temperature in [297K, 298K), the number of school in 1 km2 less than or equal to 0.1, and the distance of each hotspot to the nearest road less than or equal to 2.5 km.
Abstract: In this paper, we study the Apriori and FP-growth algorithms for mining association rules and give a method for computing all the frequent item-sets in a database. Its basic idea is to define a Boolean vector "business product", which is computed between all the transactions and yields all the frequent 2-item-sets (minsup = 2). Based on their inclusion relations, we construct a set-tree of the item-sets in the database transactions and then traverse its paths to obtain all the frequent item-sets. In this way, we can obtain the minimal frequent item sets between transactions and items in the database without the repeated database scans and iterative computation of the Apriori algorithm.
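One possible reading of the Boolean vector idea is sketched below, assuming that minsup = 2 is an absolute count and that the product is an element-wise AND of two transaction vectors. Under that assumption, any two items that survive the AND occur together in at least two transactions and therefore form a frequent 2-item-set. The function name `frequent_pairs_minsup2` and the toy data are illustrative and are not taken from the paper.

```python
from itertools import combinations

def frequent_pairs_minsup2(transactions, items):
    """Return item pairs that occur together in at least two transactions."""
    # Encode each transaction as a Boolean vector over the ordered item list.
    vectors = [[item in t for item in items] for t in transactions]
    pairs = set()
    for u, v in combinations(vectors, 2):
        # Element-wise AND: the items shared by the two transactions.
        common = [items[i] for i, (x, y) in enumerate(zip(u, v)) if x and y]
        # Every pair of shared items has a support count of at least 2.
        pairs.update(combinations(common, 2))
    return pairs

items = ["a", "b", "c", "d"]
transactions = [{"a", "b", "c"}, {"a", "b"}, {"c", "d"}]
pairs = frequent_pairs_minsup2(transactions, items)
```

The sketch touches each pair of transactions once instead of counting candidates level by level, which matches the abstract's stated aim of avoiding Apriori's iterative computation.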
Abstract: This paper presents new algorithms to efficiently mine max frequent generalized itemsets (g-itemsets) and essential generalized association rules (g-rules). These are compact and general representations of all frequent patterns and all strong association rules in the generalized environment. Our results fill an important gap among algorithms for frequent patterns and association rules by combining two concepts. First, generalized itemsets employ a taxonomy of items rather than a flat list of items. This produces more natural frequent itemsets and associations such as (meat, milk) instead of (beef, milk), (chicken, milk), etc. Second, compact representations of frequent itemsets and strong rules, whose result size is exponentially smaller, can solve a standard dilemma in mining patterns: with small threshold values for support and confidence, the user is overwhelmed by the extraordinary number of identified patterns and associations; but with large threshold values, some interesting patterns and associations fail to be identified. Our algorithms can also expand those max frequent g-itemsets and essential g-rules into the much larger set of ordinary frequent g-itemsets and strong g-rules. While that expansion is not recommended in most practical cases, we do so in order to present a comparison with existing algorithms that only handle ordinary frequent g-itemsets. In this case, the new algorithm is shown to be thousands, and in some cases millions, of times faster than previous algorithms. Further, the new algorithm succeeds in analyzing deeper taxonomies, with depths of seven or more. Experimental results for previous algorithms were limited to taxonomies with depth at most three or four. For each of the two problems, a straightforward lattice-based approach is briefly discussed and then a classification-based algorithm is developed. In particular, the two classification-based algorithms are MFGI_class for mining max frequent g-itemsets and EGR_class for mining essential g-rules. The classification-based algorithms feature conceptual classification trees and dynamic generation and pruning algorithms.
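The (meat, milk) example above can be made concrete with a short sketch of how support is commonly counted for generalized itemsets: each transaction is extended with the taxonomy ancestors of its items before the itemset is matched. This shows only the counting idea, not MFGI_class or EGR_class. The `parent` map, the function names and the toy transactions are assumptions made for illustration.

```python
def ancestors(item, parent):
    """Return all taxonomy ancestors of an item, nearest first."""
    out = []
    while item in parent:
        item = parent[item]
        out.append(item)
    return out

def extend(transaction, parent):
    """Return the transaction plus the ancestors of every item in it."""
    extended = set(transaction)
    for item in transaction:
        extended.update(ancestors(item, parent))
    return extended

def g_support(itemset, transactions, parent):
    """Count the transactions whose extended form contains the g-itemset."""
    itemset = set(itemset)
    return sum(itemset <= extend(t, parent) for t in transactions)

# Hypothetical two-level taxonomy: beef and chicken are kinds of meat.
parent = {"beef": "meat", "chicken": "meat"}
transactions = [{"beef", "milk"}, {"chicken", "milk"}, {"beef"}]

meat_milk = g_support({"meat", "milk"}, transactions, parent)
beef_milk = g_support({"beef", "milk"}, transactions, parent)
```

In this toy data the generalized itemset (meat, milk) is supported by two transactions, while (beef, milk) and (chicken, milk) are each supported by one, which is why the generalized form can pass a threshold that the specific forms miss.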
Funding: National 973 Project (No. 2003CB415205).
Abstract: A method is presented for mining frequent itemsets by evaluating the probability of their supports based on association analysis. The method obtains the probability of every 1-itemset by scanning the database, then evaluates the probability of every 2-itemset, every 3-itemset, and so on up to every k-itemset from the frequent 1-itemsets, which gives all the candidate frequent itemsets. It then scans the database to verify the support of the candidate frequent itemsets. Finally, the frequent itemsets are mined. The method greatly reduces the time spent scanning the database and shortens the computation time of the algorithm.
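Funding: CAS Project of Brain and Mind Science; National High Technology Research and Development Program of China (863 Program); National Basic Research Program of China (973 Program); National Natural Science Foundation of China; Natural Science Foundation of Hunan Province.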
Abstract: One of the obstacles to efficient association rule mining is the explosive growth of data sets, since it is costly or impossible to scan large databases, especially multiple times. A popular way to improve the speed and scalability of association rule mining is to run the algorithm on a random sample instead of the entire database. But how to effectively define and efficiently estimate the degree of error in the outcome of the algorithm, and how to determine the sample size needed, remain open research problems. In this paper, an effective and efficient algorithm based on PAC (Probably Approximately Correct) learning theory is given to measure and estimate sample error. Then, a new adaptive, on-line, fast sampling strategy, multi-scaling sampling, is presented, inspired by MRA (Multi-Resolution Analysis) and the Shannon sampling theorem, for quickly obtaining acceptably approximate association rules at an appropriate sample size. Both theoretical analysis and empirical study show that the sampling strategy achieves a very good speed-accuracy trade-off.
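To give a feel for the sample-size question raised above, the sketch below uses the standard Hoeffding bound, a textbook PAC-style guarantee for the support of a single itemset. It is not the paper's error measure or its multi-scaling sampling strategy. The function name `sample_size` and the parameter names `epsilon` and `delta` are assumptions.

```python
import math

def sample_size(epsilon, delta):
    """Smallest n with 2 * exp(-2 * n * epsilon**2) <= delta (Hoeffding bound).

    With n uniformly random transactions, the sampled support of one fixed
    itemset differs from its true support by more than epsilon with
    probability at most delta.
    """
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))

# Example: support error of at most 0.01 with probability at least 0.95.
n_needed = sample_size(epsilon=0.01, delta=0.05)
```

The bound does not depend on the database size, which is what makes sampling attractive for very large databases. It covers only one itemset at a time, so a guarantee over many itemsets would need a smaller `delta` per itemset.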
Funding: Supported by the National Natural Science Foundation of China (60472099) and the Ningbo Natural Science Foundation (2006A610017).
Abstract: Because a data warehouse changes frequently, incremental data can make previously mined knowledge obsolete. To maintain the discovered knowledge and patterns dynamically, this study presents IPARUC, a novel algorithm for updating global frequent patterns. First, a rapid clustering method divides the database into n parts such that the data within each part are similar. Then, the nodes in the tree are adjusted dynamically during insertion by "pruning and laying back" to keep the frequency-descending order, so that nodes can be shared in a near-optimal way. Finally, the local frequent itemsets mined from each local dataset are merged into global frequent itemsets. The experimental results are very encouraging: they show that IPARUC is more effective and efficient than the two other methods it was compared with. Furthermore, it has significant application potential in a prototype Web log analyzer for web usage mining, which can help discover useful knowledge effectively and even support managers in making decisions.
Abstract: This paper aims to mine knowledge and rules on drug compatibility from prescriptions for treating arrhythmia in a traditional Chinese medicine database using the Apriori algorithm. For data preparation, 1,113 prescriptions for arrhythmia, involving 535 herbs (10,884 herb occurrences in total), were collected into the database. The prescription data were preprocessed through redundancy reduction, normalized storage, and knowledge induction according to the pretreatment demands of data mining. The Apriori algorithm was then used to analyze the data and form the related technical rules and treatment procedures. The experimental results show that the prescription compatibility obtained by the Apriori algorithm generally accords with the basic principles of traditional Chinese medicine for arrhythmia. Some previously unreported compatibilities were also discovered in the experiment, which may serve as a basis for developing new prescriptions for arrhythmia.
Abstract: Working at height is widespread across various industries, and hazardous falls occur frequently. Such tasks are often linked to multifactorial issues, where the interplay of diverse factors leads to accidents that are challenging to control effectively. This study establishes an index system for the factors influencing falls from height by statistically analyzing 101 incidents, identifying 64 causative elements classified into four categories. These include 17 factors related to operator condition and behavior, 13 concerning equipment and facility conditions, 7 pertaining to site conditions, and 27 associated with production operations management. Utilizing the Apriori algorithm and Gephi software, the study mined the association rules of causal factors in falls from height and constructed their network diagram. By examining association rules with high support, confidence, and lift, the relationships between key causal factors leading to accidents are clarified, identifying critical operational control points and providing a scientific foundation for reducing the incidence of falls from height. Currently, China's standards related to working at height remain fragmented. This study lays the foundation for the development of comprehensive, systematic, generic safety management standards for working at height, satisfying the needs of the field.