The Travelling Salesman Problem (TSP) is a classical optimization problem and one of the well-known NP-hard problems. The purpose of this work is to apply data mining methodologies to explore the patterns in data generated by an Ant Colony Algorithm (ACA) performing a search operation, and to develop a rule-set searcher that approximates the ACA's searcher. An attribute-oriented induction methodology was used to explore the relationship between an operation sequence and its attributes, and a set of rules was developed. The experimental results show that the proposed approach performs well with respect to both solution quality and computation speed.
By improving the traditional ant colony algorithm, a data routing model for remote data exchange over a WAN is presented. In the model, random heuristic factors are introduced to realize multi-path search. The pheromone updating model dynamically adjusts the pheromone concentration on the optimal path according to the path load, so that the system stays load-balanced. The simulation results show that the improved model achieves better convergence and load balance.
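The load-dependent pheromone update described in the abstract above can be illustrated with a minimal sketch. The abstract does not give the update formula, so everything here is an assumption made for illustration: the function name `update_pheromone`, the evaporation rate `rho`, the deposit constant `q`, and the choice to scale the deposit by `(1 - load)`.

```python
def update_pheromone(pheromone, best_path, load, rho=0.1, q=1.0):
    """Evaporate pheromone on every link, then reinforce links on the best path.

    Assumption: the deposit on a link is scaled by (1 - load), where load is a
    utilisation in [0, 1], so heavily loaded links are reinforced less.
    """
    updated = {}
    for link, tau in pheromone.items():
        tau = (1.0 - rho) * tau                      # evaporation on all links
        if link in best_path:
            tau += q * (1.0 - load.get(link, 0.0))   # load-aware deposit
        updated[link] = tau
    return updated


# Hypothetical three-link network; A-B is busy, B-C is lightly loaded.
pheromone = {("A", "B"): 1.0, ("B", "C"): 1.0, ("A", "C"): 1.0}
load = {("A", "B"): 0.9, ("B", "C"): 0.2}
best_path = [("A", "B"), ("B", "C")]

new_pheromone = update_pheromone(pheromone, best_path, load)
```

Under these assumptions the lightly loaded link gains the most pheromone, the busy link gains little, and the unused link only evaporates, which is one simple way to steer later ants away from congested paths.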
Clustering, in data mining, is a useful technique for discovering interesting data distributions and patterns in the underlying data, and has many application fields, such as statistical data analysis, pattern recognition, and image processing. We combine a sampling technique with the DBSCAN algorithm to cluster large spatial databases, and develop two sampling-based DBSCAN (SDBSCAN) algorithms. One algorithm introduces the sampling technique inside DBSCAN, and the other applies the sampling procedure outside DBSCAN. Experimental results demonstrate that our algorithms are effective and efficient in clustering large-scale spatial databases.
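The idea of sampling outside DBSCAN can be sketched as follows: cluster a random sample with plain DBSCAN, then give every original point the label of its nearest clustered sample point. This is only an illustration under assumptions; the abstract does not describe the SDBSCAN procedures in detail, and the names `sampled_dbscan`, `sample_size`, and the nearest-neighbour labelling rule are hypothetical.

```python
import math
import random


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Plain DBSCAN. Returns one label per point; -1 marks noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1                  # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster         # noise becomes a border point
            if labels[j] is not None:
                continue                    # already processed
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)   # j is a core point: keep expanding
    return labels


def sampled_dbscan(points, eps, min_pts, sample_size, seed=0):
    """Assumed scheme: run DBSCAN on a random sample, then label every point
    by its nearest clustered sample point within eps (otherwise noise)."""
    rng = random.Random(seed)
    sample = [points[i] for i in rng.sample(range(len(points)), sample_size)]
    sample_labels = dbscan(sample, eps, min_pts)
    labels = []
    for p in points:
        best_label, best_dist = -1, eps
        for q, label in zip(sample, sample_labels):
            d = math.dist(p, q)
            if label != -1 and d <= best_dist:
                best_label, best_dist = label, d
        labels.append(best_label)
    return labels


# Two well-separated 5x5 grids of points (synthetic data for illustration).
blob_a = [(0.1 * x, 0.1 * y) for x in range(5) for y in range(5)]
blob_b = [(10 + 0.1 * x, 10 + 0.1 * y) for x in range(5) for y in range(5)]
labels = sampled_dbscan(blob_a + blob_b, eps=1.0, min_pts=3, sample_size=30)
```

The expensive neighbourhood queries run only on the sample, while the full dataset needs a single pass for label assignment; that is the trade-off a sampling-based variant aims for.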
The Circle algorithm was proposed for large datasets. The idea of the algorithm is to find a set of vertices that are close to each other and far from other vertices. The algorithm makes use of the connection between clustering aggregation and the problem of correlation clustering. The best deterministic approximation algorithm is provided for this variation of the correlation clustering problem, and it is shown how sampling can be used to scale the algorithms to large datasets. An extensive empirical evaluation demonstrates the usefulness of the problem and of the solutions. The results show that the method reduces the running time by more than 50% without sacrificing clustering quality.
This paper presents a new algorithm for clustering large amounts of data. We improved the ant colony clustering algorithm, which uses the swarm intelligence of ants, to overcome the weaknesses of classical cluster analysis methods. The proposed algorithm improves the efficiency of agent operations and adds a new function, "cluster condensation", which reduces cluster size by uniting similar objects and incorporating them into a condensed cluster. Compared with classical cluster analysis methods, this procedure reduces the number of steps required to complete the clustering to 1% or less, and it also reduces the dispersion of the result. Moreover, thanks to cluster condensation, clustering remains possible even in a small field. Because clusters condense, the number of objects in the field decreases, so new objects can be added to the space that has become empty. In other words, the majority of the data is first put on standby and is then clustered by gradually adding parts of the standby data to the data being clustered. The method can therefore be applied to large amounts of data. Numerical experiments confirmed that the proposed algorithm can theoretically be applied to an unrestricted volume of data.
High-dimensional data clustering, with the inherent sparsity of the data and the presence of noise, is a serious challenge for clustering algorithms. A new linear manifold clustering method was proposed to address this problem. The basic idea is to search for the line manifold clusters hidden in a dataset and then fuse some of them to construct higher-dimensional manifold clusters. The orthogonal distance and the tangent distance are considered together as the linear manifold distance metric. Spatial neighbor information is fully utilized to construct the initial line manifolds and to optimize them during the line manifold cluster search. Experiments on real and synthetic datasets demonstrate the superiority of the proposed method over several competing clustering methods in terms of accuracy and computation time. The method obtains high clustering accuracy on datasets of different sizes, manifold dimensions, and noise ratios, which confirms its robustness to noise and its accuracy on high-dimensional data.
Medical data classification (MDC) refers to the application of classification methods to medical datasets. This work focuses on applying a classification task to medical datasets related to specific diseases in order to predict the associated diagnosis or prognosis. To gain experts' trust, the prediction and the reasoning behind it are equally important. Accordingly, we confine our research to learning rule-based models, because they are transparent and comprehensible. One approach to MDC involves the use of metaheuristic (MH) algorithms. Here we report on the development and testing of a novel MH algorithm, IWD-Miner, which can be viewed as a fusion of Intelligent Water Drops (IWDs) and AntMiner+. It was subjected to a four-stage sensitivity analysis to optimize its performance. For this purpose, 21 publicly available medical datasets from the Machine Learning Repository at the University of California, Irvine were used. Interestingly, there were only limited differences in performance between IWD-Miner variants, which suggests that the algorithm is robust. Finally, using the same 21 datasets, we compared the performance of the optimized IWD-Miner against two extant algorithms, AntMiner+ and J48. The experiments showed that both rival algorithms are comparable in effectiveness to IWD-Miner, as confirmed by the Wilcoxon nonparametric statistical test. The results suggest that IWD-Miner is more efficient than AntMiner+ as measured by the average number of fitness evaluations to a solution (1,386,621.30 vs. 2,827,283.88, respectively). J48 exhibited higher accuracy on average than IWD-Miner (79.58 vs. 73.65, respectively) but produced larger models (32.82 leaves vs. 8.38 terms, respectively).
The world produces vast quantities of high-dimensional multi-semantic data. However, extracting valuable information from such a large amount of high-dimensional, multi-label data is arduous and challenging. Feature selection aims to mitigate the adverse impact of high dimensionality in multi-label data by eliminating redundant and irrelevant features. The ant colony optimization algorithm has shown encouraging results in multi-label feature selection because of its simplicity, efficiency, and similarity to reinforcement learning. Nevertheless, existing methods do not consider crucial correlation information, such as dynamic redundancy and label correlation. To address these concerns, the paper proposes a multi-label feature selection technique based on the ant colony optimization algorithm (MFACO), focusing on dynamic redundancy and label correlation. First, the dynamic redundancy between the selected feature subset and the candidate features is assessed. Meanwhile, the ant colony optimization algorithm extracts label correlation from the label set, which is then incorporated into the heuristic factor as label weights. Experimental results demonstrate that the proposed strategies effectively enhance the search ability of the ant colony and outperform the other algorithms compared in the paper.
Objective: Following the RFM model from customer relationship management, data mining technology was used to group patients with chronic infectious diseases, in order to explore the effect of customer segmentation on the management of patients with different characteristics. Methods: A total of 170,246 outpatient records from January 2016 to July 2016 were extracted from the hospital management information system (HIS); 43,448 records remained after data cleaning. The K-Means clustering algorithm was used to group patients with chronic infectious diseases, and the C5.0 decision tree algorithm was then used to predict their treatment situation. Results: Male patients accounted for 58.7%, and patients living in Shanghai accounted for 85.6%. The average patient age was 45.88 years, and the high-incidence age range was 25 to 65 years. Patients were grouped into three categories: (1) Cluster 1, important patients (4786 people, 11.72%, R = 2.89, F = 11.72, M = 84,302.95); (2) Cluster 2, major patients (23,103 people, 53.2%, R = 5.22, F = 3.45, M = 9146.39); (3) Cluster 3, potential patients (15,559 people, 35.8%, R = 19.77, F = 1.55, M = 1739.09). In the C5.0 decision tree used to predict the treatment situation, the final treatment time (weeks) was an important predictor, and the accuracy was 99.94% as verified by the confusion matrix. Conclusion: Medical institutions should strengthen adherence education for patients with chronic infectious diseases, establish a chronic infectious disease and customer relationship management database, and take the initiative to help patients improve treatment adherence. Chinese governments at all levels should speed up the construction of hospital information systems, establish a chronic infectious disease database, and strengthen the blocking of mother-to-child transmission, in order to effectively curb chronic infectious diseases and reduce disease burden and mortality.
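For readers unfamiliar with RFM, the three scores can be derived from visit records as in the sketch below. The record layout, the function name `rfm`, the reference date, and the units (days for R, visit count for F, summed cost for M) are assumptions for illustration; the abstract does not state how the study defined or scaled them.

```python
from collections import defaultdict
from datetime import date


def rfm(visits, today):
    """Compute (recency_days, frequency, monetary) per patient.

    visits: iterable of (patient_id, visit_date, amount) tuples (assumed layout).
    """
    last_visit = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for patient_id, visit_date, amount in visits:
        previous = last_visit.get(patient_id, visit_date)
        last_visit[patient_id] = max(previous, visit_date)
        frequency[patient_id] += 1
        monetary[patient_id] += amount
    return {
        pid: ((today - last_visit[pid]).days, frequency[pid], monetary[pid])
        for pid in last_visit
    }


# Made-up records, not data from the study.
visits = [
    ("p1", date(2016, 1, 5), 120.0),
    ("p1", date(2016, 6, 20), 80.0),
    ("p2", date(2016, 3, 1), 40.0),
]
scores = rfm(visits, today=date(2016, 7, 31))
```

Each patient is then a three-dimensional point (R, F, M), which is the kind of input a K-Means grouping step would take, usually after the three features are brought to comparable scales.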
Data clustering is a significant information retrieval technique in today's data-intensive society. Over the last few decades, a large number and variety of data clustering algorithms have been designed and implemented for almost all data types. The quality of the results of a cluster analysis depends mainly on the clustering algorithm used. The architecture of a versatile, less user-dependent, dynamic, and scalable data clustering machine is presented. The machine selects the best available data clustering algorithm for an analysis on the basis of the credentials of the data and previously used domain knowledge. The domain knowledge is updated on completion of each data analysis session.
DNS (domain name system) query log analysis has been a popular research topic in recent years. CLOPE, a representative transactional clustering algorithm, can readily be used for DNS query log mining. However, the algorithm is inefficient when processing large-scale data. The MR-CLOPE algorithm is proposed as an extension and improvement of CLOPE based on MapReduce. Unlike previous parallel clustering methods, it uses a two-stage MapReduce implementation framework, in which each stage is implemented by one kind of MapReduce task. In the first stage, the DNS query logs are divided into multiple splits and the CLOPE algorithm is executed on each split. The second stage usually iterates many times to merge the small clusters into larger, satisfactory ones. Across these two stages, a novel partition process is designed to randomly spread out the original sub-clusters, which are then moved and merged in the map phase of the second stage according to the defined merge criteria. In this way, the advantages of the original CLOPE algorithm are kept and its disadvantages are addressed, yielding better clustering performance. The experimental results show that MR-CLOPE is not only faster than CLOPE but also achieves better clustering quality on DNS query logs.
The fuzzy c-means (FCM) clustering algorithm is sensitive to noise points and outliers. The possibilistic fuzzy c-means (PFCM) clustering algorithm handles this problem well, but it has problems of its own: it is still sensitive to the initial cluster centers, and its clustering results are poor when the noisy datasets under test contain clusters of very unequal size. An improved kernel possibilistic fuzzy c-means algorithm based on invasive weed optimization (IWO-KPFCM) is proposed in this paper. The algorithm first uses the invasive weed optimization (IWO) algorithm to seek the optimal solution as the initial cluster centers, and introduces a kernel method to map the input data from the sample space into a high-dimensional feature space. Then, the sample variance is introduced into the objective function to measure the compactness of the data. Finally, the improved algorithm is used to cluster the data. Simulation results on University of California, Irvine (UCI) datasets and artificial datasets show that the proposed algorithm has stronger noise resistance, higher clustering accuracy, and faster convergence than the PFCM algorithm.
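The noise sensitivity of FCM mentioned above follows from its standard membership rule, in which a point's memberships over all clusters must sum to 1. The sketch below implements that standard rule only; it is not the IWO-KPFCM method, and the sample points and the function name `fcm_memberships` are made up for illustration.

```python
import math


def fcm_memberships(point, centers, m=2.0):
    """Standard FCM membership of one point in each cluster (fuzzifier m > 1).

    u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1)), where d_i is the distance
    from the point to center i.
    """
    dists = [math.dist(point, c) for c in centers]
    if any(d == 0.0 for d in dists):
        # The point coincides with one or more centers: split membership there.
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]


centers = [(0.0, 0.0), (10.0, 0.0)]
typical = fcm_memberships((1.0, 0.0), centers)     # close to the first center
outlier = fcm_memberships((5.0, 100.0), centers)   # far from both centers
```

The outlier is equally far from both centers, so it receives membership 0.5 in each cluster even though it belongs to neither; such points still pull on the center updates, which is the weakness possibilistic variants try to reduce.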
Most clustering algorithms need to describe the similarity of objects by a predefined distance function. Three distance functions that are widely used in two traditional clustering algorithms, k-means and hierarchical clustering, were investigated. Both theoretical analysis and detailed experimental results are given. It is shown that the distance function greatly affects the clustering results, and that comparing the different results can be used to detect the outliers of a cluster and to obtain information about the shape of clusters. In practical situations, it is suggested to apply different distance functions separately, compare the clustering results, and pick out the "swing points"; such points may reveal more information to data analysts.
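A small sketch shows how the choice of distance function can change which centroid a point is assigned to. The abstract does not name the three distance functions studied; Euclidean, Manhattan, and Chebyshev distances are used here purely as an assumption, and the point and centroids are made up.

```python
def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def chebyshev(p, q):
    return max(abs(a - b) for a, b in zip(p, q))


def nearest(point, centroids, dist):
    """Name of the centroid closest to point under the given distance."""
    return min(centroids, key=lambda name: dist(point, centroids[name]))


centroids = {"a": (3.0, 3.0), "b": (4.0, 0.0)}
point = (0.0, 0.0)

assignments = {
    fn.__name__: nearest(point, centroids, fn)
    for fn in (euclidean, manhattan, chebyshev)
}
```

Here the point goes to centroid "b" under Euclidean and Manhattan distance but to "a" under Chebyshev distance, so it behaves like one of the swing points whose assignment depends on the metric.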
Detecting naturally arising structures in data is central to knowledge extraction from data. In most applications, the main challenge lies in choosing an appropriate model for exploring the data features. The choice is generally poorly understood, and any tentative choice may be too restrictive. Growing volumes of data, disparate data sources, and the variety of modelling techniques entail the need for model optimization via adaptability rather than comparability. We propose a novel two-stage algorithm for modelling continuous data, consisting of an unsupervised stage, in which the algorithm searches through the data for optimal parameter values, and a supervised stage, which adapts the parameters for predictive modelling. The method is applied to the sunspots data, with inherently Gaussian distributional properties and assumed bi-modality. Optimal values separating high cycles from low cycles are obtained via multiple simulations. Early patterns for each recorded cycle reveal that the first 3 years provide a sufficient basis for predicting the peak. Multiple Support Vector Machine runs using repeatedly improved data parameters show that the approach yields greater accuracy and reliability than conventional approaches and provides a good basis for model selection. Model reliability is established via multiple simulations of this type.
Clustering is one of the most widely used data mining techniques and can be used to create homogeneous clusters. K-means is a popular clustering algorithm that, despite its inherent simplicity, has some major problems. One way to resolve these problems and improve the k-means algorithm is to use evolutionary algorithms in clustering. In this study, the Imperialist Competitive Algorithm (ICA) is developed further and then used in the clustering process. Clustering the IRIS, Wine, and CMC datasets with the developed ICA, and comparing the results with those of the original ICA, GA, and PSO algorithms, demonstrates the improvement of the Imperialist Competitive Algorithm.
In the field of data mining and machine learning, clustering is a typical problem that has been widely studied, and many effective algorithms have been proposed, including K-means, fuzzy c-means (FCM), and DBSCAN. However, the traditional clustering methods are easily trapped in local optima. Thus, many evolutionary clustering methods have been investigated. Considering the effectiveness of brain storm optimization (BSO) in increasing diversity during optimization, in this paper we propose a new clustering model based on BSO that exploits its global search ability. In our experiment, we apply the novel binary model to solve the problem. During data processing, BSO was mainly used for iteration, and within the K-means process we selected more appropriate parameters to match it. Four datasets were used in our experiment. In our model, BSO is introduced for the first time to solve the clustering problem. With the algorithm run repeatedly on each dataset, the experimental results show good convergence and diversity. In addition, a comparison with other clustering models shows that the BSO clustering model also achieves high accuracy. The simulation results therefore show that the proposed model performs well in several respects.
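The local-optimum problem of traditional methods can be seen in a tiny one-dimensional K-means run. This sketch uses plain Lloyd iterations on made-up data and is unrelated to the BSO model itself; the function name `kmeans_1d` and both initialisations are hypothetical.

```python
def kmeans_1d(points, centers, max_iter=100):
    """Plain Lloyd iterations on 1-D data. Returns (centers, sse)."""
    centers = list(centers)
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for x in points:
            idx = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            groups[idx].append(x)
        # An empty cluster keeps its previous center.
        new_centers = [
            sum(g) / len(g) if g else c for g, c in zip(groups, centers)
        ]
        if new_centers == centers:
            break                       # assignments are stable: converged
        centers = new_centers
    sse = sum(min((x - c) ** 2 for c in centers) for x in points)
    return centers, sse


points = [0, 1, 10, 11, 20, 21]         # three obvious pairs

bad_centers, bad_sse = kmeans_1d(points, [0, 1, 15.5])
good_centers, good_sse = kmeans_1d(points, [0, 10, 20])
```

Both runs converge, but the first one stops at a sum of squared errors of 101.0 while the second reaches 1.5. The outcome depends entirely on the starting centers, which is why population-based global search methods are combined with K-means.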
Funding (ant colony data routing model for WAN): Sponsored by the National High Technology Research and Development Program of China (2006AA701306) and the National Innovation Foundation of Enterprises (05C26212200378).
Funding (sampling-based DBSCAN): Supported by the Open Research Fund Program of LIESMARS (WKL(00)0302).
Funding (Circle algorithm): Projects (60873265, 60903222) supported by the National Natural Science Foundation of China; Project (IRT0661) supported by the Program for Changjiang Scholars and Innovative Research Team in University of China.
Funding (ant colony clustering with cluster condensation): Project (No. 18510132) supported by the Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research.
Funding (linear manifold clustering): Project (60835005) supported by the National Nature Science Foundation of China.
Funding (IWD-Miner): A grant from the "Research Center of the Female Scientific and Medical Colleges", the Deanship of Scientific Research, King Saud University.
Funding (MFACO): Supported by the National Natural Science Foundation of China (Grant Nos. 62376089, 62302153, 62302154, 62202147) and the Key Research and Development Program of Hubei Province, China (Grant No. 2023BEB024).
Funding: Project (61103046) supported in part by the National Natural Science Foundation of China; Project (B201312) supported by the DHU Distinguished Young Professor Program, China; Project (LY14F020007) supported by the Zhejiang Provincial Natural Science Funds of China; Project (2014A610072) supported by the Natural Science Foundation of Ningbo City, China
Abstract: DNS (domain name system) query log analysis has been a popular research topic in recent years. CLOPE, a representative transactional clustering algorithm, can readily be used for DNS query log mining. However, the algorithm is inefficient when processing large-scale data. The MR-CLOPE algorithm is proposed, which extends and improves CLOPE on the basis of MapReduce. Unlike previous parallel clustering methods, it uses a two-stage MapReduce implementation framework in which each stage is implemented by one kind of MapReduce task. In the first stage, the DNS query logs are divided into multiple splits and the CLOPE algorithm is executed on each split. The second stage usually iterates many times to merge the small clusters into larger, satisfactory ones. Across these two stages, a novel partition process is designed to spread out the original sub-clusters randomly; these are then moved and merged in the map phase of the second stage according to the defined merge criteria. In this way, the advantages of the original CLOPE algorithm are kept and its disadvantages are addressed, so that the proposed framework achieves better clustering performance. The experimental results show that, compared with CLOPE, MR-CLOPE is not only faster but also produces better clustering quality on DNS query logs.
Abstract: The fuzzy c-means (FCM) clustering algorithm is sensitive to noise points and outliers. The possibilistic fuzzy c-means (PFCM) clustering algorithm overcomes this problem well, but it has problems of its own: it is still sensitive to the initial clustering centers, and its clustering results are poor when the noisy test datasets are very unequal in size. An improved kernel possibilistic fuzzy c-means algorithm based on invasive weed optimization (IWO-KPFCM) is proposed in this paper. The algorithm first uses the invasive weed optimization (IWO) algorithm to find an optimal solution to serve as the initial clustering centers, and introduces a kernel method to map the input data from the sample space into a high-dimensional feature space. The sample variance is then introduced into the objective function to measure the compactness of the data. Finally, the improved algorithm is used to cluster the data. Simulation results on University of California-Irvine (UCI) datasets and artificial datasets show that the proposed algorithm has stronger noise resistance, higher clustering accuracy, and faster convergence than the PFCM algorithm.
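The noise sensitivity of plain FCM mentioned above can be seen in its membership update, where each point's memberships are forced to sum to 1. The sketch below implements only that textbook update for standard FCM; it is not the IWO-KPFCM algorithm of the abstract, and the function name `fcm_memberships` and the toy data are assumptions for illustration.

```python
import math


def fcm_memberships(points, centers, m=2.0):
    """Standard FCM membership update.

    Returns u where u[i][j] is the membership of points[i] in centers[j],
    computed as 1 / sum_k (d_ij / d_ik) ** (2 / (m - 1)), with fuzzifier m > 1.
    """
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if any(d == 0.0 for d in dists):
            # The point coincides with one or more centers:
            # share full membership among those centers only.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            total = sum(hits)
            memberships.append([h / total for h in hits])
            continue
        memberships.append(
            [1.0 / sum((dj / dk) ** exponent for dk in dists) for dj in dists]
        )
    return memberships


centers = [(0.0, 0.0), (4.0, 0.0)]
# The last point is a far-away outlier, equidistant from both centers.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 100.0)]
u = fcm_memberships(points, centers)
```

Because every row of `u` must sum to 1, the distant outlier still receives a membership of 0.5 in each cluster and therefore pulls both centers toward it. Possibilistic variants such as PFCM relax this constraint, which is the motivation the abstract refers to.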
Abstract: Most clustering algorithms need a predefined distance function to describe the similarity of objects. Three distance functions that are widely used in two traditional clustering algorithms, k-means and hierarchical clustering, were investigated. Both theoretical analysis and detailed experimental results are given. It is shown that the distance function greatly affects the clustering results, and that comparing the results obtained with different distance functions can be used to detect the outliers of a cluster and to obtain information about the shape of clusters. In practice, it is suggested to run the clustering with each distance function separately, compare the results, and pick out the "swing points". Such points may reveal more information to data analysts.
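The idea of comparing assignments under different distance functions can be sketched as follows. The abstract does not name its three distance functions, so Euclidean, Manhattan, and Chebyshev are assumed here purely for illustration; the helper names `nearest_center` and `swing_points` and the fixed centers are likewise hypothetical.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


def nearest_center(point, centers, distance):
    """Index of the center closest to point under the given distance."""
    return min(range(len(centers)), key=lambda j: distance(point, centers[j]))


def swing_points(points, centers, distances):
    """Points whose nearest center changes depending on the distance function."""
    return [
        p
        for p in points
        if len({nearest_center(p, centers, d) for d in distances}) > 1
    ]


centers = [(3, 3), (5, 0)]
points = [(0, 0), (4, 4), (5, 1)]
swings = swing_points(points, centers, [euclidean, manhattan, chebyshev])
```

In this toy data the point (0, 0) is closest to the first center under the Euclidean and Chebyshev distances but closest to the second under the Manhattan distance, so it is flagged; the other two points keep the same assignment under all three.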
Abstract: Detecting naturally arising structures in data is central to knowledge extraction from data. In most applications, the main challenge lies in choosing the appropriate model for exploring the data features. The choice is generally poorly understood, and any tentative choice may be too restrictive. Growing volumes of data, disparate data sources, and a variety of modelling techniques entail the need for model optimization via adaptability rather than comparability. We propose a novel two-stage algorithm for modelling continuous data, consisting of an unsupervised stage, in which the algorithm searches through the data for optimal parameter values, and a supervised stage, which adapts the parameters for predictive modelling. The method is implemented on the sunspots data, which have inherently Gaussian distributional properties and assumed bimodality. Optimal values separating high from low cycles are obtained via multiple simulations. Early patterns for each recorded cycle reveal that the first 3 years provide a sufficient basis for predicting the peak. Multiple Support Vector Machine runs using repeatedly improved data parameters show that the approach yields greater accuracy and reliability than conventional approaches and provides a good basis for model selection. Model reliability is established via multiple simulations of this type.
Abstract: Clustering is one of the most widely used data mining techniques for creating homogeneous groups. K-means is a popular clustering algorithm that, despite its inherent simplicity, has some major problems. One way to resolve these problems and improve the k-means algorithm is to use evolutionary algorithms in clustering. In this study, the Imperialist Competitive Algorithm (ICA) is further developed and then used in the clustering process. Clustering the IRIS, Wine, and CMC datasets with the developed ICA, and comparing the results with those of the original ICA, GA, and PSO algorithms, demonstrates the improvement of the Imperialist Competitive Algorithm.
Funding: Supported by the Natural Science Foundation of Jiangsu Province (Grant No. BK20141005) and by the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (Grant No. 14KJB520025).
Abstract: In the field of data mining and machine learning, clustering is a typical problem that has been widely studied, and many effective algorithms have been proposed, including K-means, fuzzy c-means (FCM), and DBSCAN. However, traditional clustering methods are easily trapped in local optima, so many evolutionary clustering methods have been investigated. Considering the effectiveness of brain storm optimization (BSO) in maintaining diversity during optimization, in this paper we propose a new clustering model based on BSO that exploits its global search ability. In our experiment, we apply the novel binary model to solve the problem. During data processing, BSO was mainly used for iteration. For the K-means step, parameters were selected to match it well. Four datasets were used in our experiment. In our model, BSO was introduced for the first time to solve the clustering problem. With the algorithm run repeatedly on each dataset, our experimental results show good convergence and diversity. In addition, a comparison with other clustering models shows that the BSO clustering model also achieves high accuracy. Overall, the simulation results show that the proposed model performs well.