Data clustering is an essential technique for analyzing complex datasets and continues to be a central research topic in data analysis. Traditional clustering algorithms, such as K-means, are widely used due to their simplicity and efficiency. This paper proposes a novel Spiral Mechanism-Optimized Phasmatodea Population Evolution Algorithm (SPPE) to improve clustering performance. The SPPE algorithm introduces several enhancements to the standard Phasmatodea Population Evolution (PPE) algorithm. Firstly, a Variable Neighborhood Search (VNS) factor is incorporated to strengthen the local search capability and foster population diversity. Secondly, a position update model incorporating a spiral mechanism is designed to improve the algorithm's global exploration and convergence speed. Finally, a dynamic balancing factor, guided by fitness values, adjusts the search process to balance exploration and exploitation effectively. The performance of SPPE is first validated on CEC2013 benchmark functions, where it demonstrates excellent convergence speed and superior optimization results compared to several state-of-the-art metaheuristic algorithms. To further verify its practical applicability, SPPE is combined with the K-means algorithm for data clustering and tested on seven datasets. Experimental results show that SPPE-K-means improves clustering accuracy, reduces dependency on initialization, and outperforms other clustering approaches. This study highlights SPPE's robustness and efficiency in solving both optimization and clustering challenges, making it a promising tool for complex data analysis tasks.
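A common way to couple a population-based optimizer with K-means is to treat a flattened list of centroid coordinates as one candidate solution, score it by the within-cluster sum of squared errors (SSE), and optionally polish it with a few K-means iterations. The abstract does not state SPPE's encoding or fitness function, so the sketch below is an assumption; the names `decode`, `sse_fitness`, and `kmeans_refine` are hypothetical, and the SPPE search itself is not shown.

```python
from typing import List, Sequence

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def decode(candidate: Sequence[float], k: int, dim: int) -> List[List[float]]:
    """Split a flat candidate vector into k centroids of dimension dim."""
    return [list(candidate[i * dim:(i + 1) * dim]) for i in range(k)]


def sse_fitness(candidate: Sequence[float], data: Sequence[Point], k: int) -> float:
    """Assumed fitness: sum of squared distances to the nearest centroid."""
    centroids = decode(candidate, k, len(data[0]))
    return sum(min(squared_distance(p, c) for c in centroids) for p in data)


def kmeans_refine(candidate: Sequence[float], data: Sequence[Point], k: int,
                  iterations: int = 10) -> List[float]:
    """Polish a candidate with standard K-means (Lloyd) iterations."""
    centroids = decode(candidate, k, len(data[0]))
    for _ in range(iterations):
        groups: List[List[Point]] = [[] for _ in range(k)]
        for p in data:
            nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            groups[nearest].append(p)
        for j, members in enumerate(groups):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return [x for c in centroids for x in c]
```

In a hybrid of this kind, the optimizer would minimize `sse_fitness` over candidates, which is one way the dependence on the initial centroids can be reduced.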
Nature-inspired optimization algorithms refer to techniques that simulate the behavior and ecosystem of living organisms or natural phenomena. One such technique is the "Photosynthesis Spectrum Algorithm," which was developed by mimicking the process by which photons behave as a population in plants. This optimization technique has three stages that mimic the structure of leaves and the fluorescence phenomenon. Each stage updates the fitness of the solution by using a mathematical equation to direct the photon to the reaction center. Three stages of testing have been conducted to test the efficacy of this approach. In the first stage, functions from the CEC 2019 and CEC 2021 competitions are used to evaluate the performance and convergence of the proposed method. The statistical results from non-parametric Friedman and Kendall's W tests show that the proposed method is superior to other methods in terms of obtaining the best average of solutions and achieving stability in finding solutions. In other sections, the experiment is designed for data clustering. The proposed method is compared with recent data clustering and classification metaheuristic algorithms, indicating that this method can achieve significant performance for clustering in less than 10 s of CPU time and with an accuracy of over 90%.
Data clustering is a significant information retrieval technique in today's data-intensive society. Over the last few decades, a vast number of data clustering algorithms have been designed and implemented for almost all data types. The quality of the results of cluster analysis depends mainly on the clustering algorithm used in the analysis. The architecture of a versatile, less user-dependent, dynamic, and scalable data clustering machine is presented. The machine selects the best available data clustering algorithm for analysis on the basis of the credentials of the data and previously used domain knowledge. The domain knowledge is updated on completion of each session of data analysis.
This paper presents a new algorithm for clustering a large amount of data. We improved the ant colony clustering algorithm, which uses the swarm intelligence of ants, and tried to overcome the weaknesses of the classical cluster analysis methods. In our proposed algorithm, improvements in the efficiency of an agent operation were achieved, and a new function, "cluster condensation," was added. Our proposed algorithm is a processing method by which a cluster size is reduced by uniting similar objects and incorporating them into the cluster condensation. Compared with classical cluster analysis methods, the number of steps required to complete the clustering can be suppressed to 1% or less by performing this procedure, and the dispersion of the result can also be reduced. Moreover, our clustering algorithm has the advantage of being possible even in a small-field cluster condensation. In addition, the number of objects that exist in the field decreases because the cluster condenses; therefore, it becomes possible to add an object to a space that has become empty. In other words, the majority of the data is first put on standby. The data are then clustered, with parts of the standby data gradually added to the clustering data. The method can be adopted for a large amount of data. Numerical experiments confirmed that our proposed algorithm can theoretically be applied to an unrestricted volume of data.
With the rapid development of the economy, the scale of the power grid is expanding. The number of power equipment items that constitute the power grid has become very large, which makes the state data of power equipment grow explosively. These multi-source heterogeneous data have data differences, which lead to data variation in the process of transmission and preservation, thus forming the bad information of incomplete data. Therefore, research on data integrity has become an urgent task. This paper is based on the characteristics of random chance and the spatio-temporal difference of the system. According to the characteristics and data sources of the massive data generated by power equipment, a fuzzy mining model of power equipment data is established, and the data are divided into numerical and non-numerical data based on numerical data. The text data of power equipment defects are taken as the mining material. Then, the array-based Apriori algorithm is used for deep mining. The strong association rules in incomplete data of power equipment are obtained and analyzed. From the change trend of NRMSE metrics and classification accuracy, most of the filling methods combined with the two frameworks in this method usually show a relatively stable filling trend and will not fluctuate greatly with the growth of the missing rate. The experimental results show that the proposed algorithm model can effectively improve the filling effect of the existing filling methods on most data sets, and the filling effect fluctuates greatly with the increase of the missing rate; that is, with the increase of the missing rate, the improvement effect of the model for the existing filling methods is higher than 4.3%. Through the incomplete data clustering technology studied in this paper, a more innovative state assessment of smart grid reliability operation is carried out, which has good research value and reference significance.
In this paper, a cardinality compensation method based on the Information-weighted Consensus Filter (ICF) using data clustering is proposed in order to accurately estimate the cardinality of the Cardinalized Probability Hypothesis Density (CPHD) filter. Although the joint propagation of the intensity and the cardinality distribution in the CPHD filter process allows for more reliable estimation of the cardinality (target number) than the PHD filter, tracking loss may occur when noise and clutter are high in the measurements in a practical situation. For that reason, a cardinality compensation process is included in the CPHD filter, based on an information fusion step that uses the estimated cardinality obtained from the CPHD filter and the measured cardinality obtained through data clustering. Here, the ICF is used for information fusion. To verify the performance of the proposed method, simulations were carried out, and it was confirmed that the multi-target tracking performance was improved because the cardinality was estimated more accurately than with the existing techniques.
An algorithm, Clustering Algorithm Based On Sparse Feature Vector (CABOSFV), was proposed for the high-dimensional clustering of binary sparse data. This algorithm compresses the data effectively by using a tool called the "Sparse Feature Vector," thus reducing the data scale enormously, and can obtain the clustering result with only one data scan. Both theoretical analysis and empirical tests showed that CABOSFV is of low computational complexity. The algorithm finds clusters in high-dimensional large datasets efficiently and handles noise effectively.
In recent years, functional data has been widely used in finance, medicine, biology, and other fields. Current clustering analysis can solve problems in finite-dimensional space, but it is difficult to apply directly to the clustering of functional data. In this paper, we propose a new unsupervised clustering algorithm based on adaptive weights. In the absence of initialization parameters, we use entropy-type penalty terms and a fuzzy partition matrix to find the optimal number of clusters. At the same time, we introduce a measure based on adaptive weights to reflect the difference in information content between different clustering metrics. Simulation experiments show that the proposed algorithm has higher purity than some algorithms.
The distillation process is an important chemical process, and the application of a data-driven modelling approach has the potential to reduce model complexity compared to mechanistic modelling, thus improving the efficiency of process optimization or monitoring studies. However, the distillation process is highly nonlinear and has multiple uncertainty perturbation intervals, which brings challenges to accurate data-driven modelling of distillation processes. This paper proposes a systematic data-driven modelling framework to solve these problems. Firstly, data segment variance was introduced into the K-means algorithm to form K-means data interval (KMDI) clustering in order to cluster the data into perturbed and steady-state intervals for steady-state data extraction. Secondly, the maximal information coefficient (MIC) was employed to calculate the nonlinear correlation between variables for removing redundant features. Finally, extreme gradient boosting (XGBoost) was integrated as the basic learner into adaptive boosting (AdaBoost), with an error threshold (ET) set to improve the weight update strategy, to construct the new integrated learning algorithm, XGBoost-AdaBoost-ET. The superiority of the proposed framework is verified by applying this data-driven modelling framework to a real industrial process of propylene distillation.
Traditional methods tend to generate a large number of fake samples or lose data when classifying unbalanced data. Therefore, this paper proposes a novel DBSCAN (density-based spatial clustering of applications with noise) for data clustering. The density-based DBSCAN clustering decomposition algorithm is applied to the majority classes of unbalanced data sets, which reduces the advantage of majority-class samples without data loss. The algorithm uses different distance measurements for unordered and ordered categorical data, and assigns corresponding weights with average entropy. The experimental results show that the new algorithm has a better clustering effect than other advanced clustering algorithms on both artificial and real data sets.
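For reference, the baseline DBSCAN procedure that this abstract builds on can be sketched as follows. This is the standard algorithm with plain Euclidean distance, which is an assumption standing in for the paper's entropy-weighted measures for categorical data; the function names `region_query` and `dbscan` are illustrative only.

```python
from typing import List, Optional, Sequence

NOISE = -1  # label for points that belong to no cluster


def region_query(data: Sequence[Sequence[float]], i: int, eps: float) -> List[int]:
    """Indices of all points within distance eps of point i (including i)."""
    return [j for j, q in enumerate(data)
            if sum((a - b) ** 2 for a, b in zip(data[i], q)) <= eps * eps]


def dbscan(data: Sequence[Sequence[float]], eps: float, min_pts: int) -> List[int]:
    """Standard DBSCAN: returns one cluster label per point, NOISE for outliers."""
    labels: List[Optional[int]] = [None] * len(data)  # None means unvisited
    cluster = -1
    for i in range(len(data)):
        if labels[i] is not None:
            continue
        neighbors = region_query(data, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1  # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(data, j, eps)
            if len(j_neighbors) >= min_pts:  # j is also a core point
                queue.extend(j_neighbors)
    return [NOISE if label is None else label for label in labels]
```

Swapping the squared-distance expression in `region_query` for a weighted mixed-type distance would be the natural place to plug in a measure like the one the abstract describes.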
Big data clustering plays an important role in the field of data processing in wireless sensor networks. However, there are some problems, such as poor clustering effect and a low Jaccard coefficient. This paper proposes a novel big data clustering optimization method based on intuitionistic fuzzy set distance and particle swarm optimization for wireless sensor networks. This method combines principal component analysis and information entropy dimensionality reduction to process big data and reduce the time required for data clustering. A new distance measurement method for intuitionistic fuzzy sets is defined, which considers not only membership and non-membership information but also the allocation of hesitancy to membership and non-membership, thereby indirectly introducing hesitancy into the intuitionistic fuzzy set distance. The intuitionistic fuzzy kernel clustering algorithm is used to cluster big data, and particle swarm optimization is introduced to optimize the intuitionistic fuzzy kernel clustering method. The optimized algorithm is used to obtain the optimization results of wireless sensor network big data clustering, and the big data clustering is realized. Simulation results show that the proposed method has a good clustering effect compared with other state-of-the-art clustering methods.
Raw data are classified using clustering techniques in a reasonable manner to create disjoint clusters. Many clustering algorithms based on specific parameters have been proposed to access a high volume of datasets. This paper focuses on cluster analysis based on neutrosophic set implication, i.e., a k-means algorithm with a threshold-based clustering technique. This algorithm addresses the shortcomings of the k-means clustering algorithm by overcoming the limitations of the threshold-based clustering algorithm. To evaluate the validity of the proposed method, several validity measures and validity indices are applied to the Iris dataset (from the University of California, Irvine, Machine Learning Repository) along with k-means and threshold-based clustering algorithms. The proposed method results in more segregated datasets with compacted clusters, thus achieving higher validity indices. The method also eliminates the limitations of the threshold-based clustering algorithm and validates measures and respective indices along with k-means and threshold-based clustering algorithms.
Purpose - The purpose of the paper is to study the multiple viewpoints that are required to access the more informative similarity features among tweet documents, which is useful for achieving robust tweet data clustering results. Design/methodology/approach - Let "N" be the number of tweet documents for topic extraction. Unwanted text, punctuation, and other symbols are removed, and tokenization and stemming operations are performed in the initial tweet pre-processing step. Bag-of-features are determined for the tweets; later, tweets are modelled with the obtained bag-of-features during the process of topic extraction. Approximate topic features are extracted for every tweet document. These sets of topic features of the N documents are treated as multi-viewpoints. The key idea of the proposed work is to use multi-viewpoints in the similarity feature computation. The paper's figure illustrates the multi-viewpoint-based cosine similarity computation for five tweet documents (here N = 5), with the corresponding documents defined in a projected space with five viewpoints, say, v1, v2, v3, v4, and v5. For example, similarity features between two documents (viewpoints v1 and v2) are computed with respect to the other three viewpoints (v3, v4, and v5), unlike the single viewpoint in the traditional cosine metric. Findings - Healthcare problems with tweets data. Topic models play a crucial role in the classification of health-related tweets by finding topics (or health clusters) instead of finding term frequency and inverse document frequency (TF-IDF) for unlabelled tweets. Originality/value - Topic models play a crucial role in the classification of health-related tweets by finding topics (or health clusters) instead of finding TF-IDF for unlabelled tweets.
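The multi-viewpoint idea described above can be sketched as follows: instead of measuring the angle between two document vectors from the origin, the angle is measured from each of the other documents in turn and the results are averaged. The abstract does not give the exact formula, so averaging the cosine of the difference vectors is an assumption, and `multi_viewpoint_similarity` is a hypothetical name.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def multi_viewpoint_similarity(docs: Sequence[Sequence[float]], i: int, j: int) -> float:
    """Average cosine between docs[i] and docs[j] as seen from every other document.

    Assumed form: mean over h not in {i, j} of cos(docs[i] - docs[h], docs[j] - docs[h]).
    """
    viewpoints = [h for h in range(len(docs)) if h not in (i, j)]
    if not viewpoints:
        raise ValueError("at least one viewpoint other than i and j is required")
    total = 0.0
    for h in viewpoints:
        u = [a - c for a, c in zip(docs[i], docs[h])]
        v = [b - c for b, c in zip(docs[j], docs[h])]
        total += cosine(u, v)
    return total / len(viewpoints)
```

With five documents, the similarity of documents 0 and 1 is the mean over the three viewpoints 2, 3, and 4, which matches the v1, v2 versus v3, v4, v5 example in the abstract.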
The data clustering problem consists in dividing a data set into prescribed groups of homogeneous data. This is an NP-hard problem that can be relaxed in spectral graph theory, where the optimal cuts of a graph are related to the eigenvalues of the graph 1-Laplacian. In this paper, we first give new notations to describe the paths, among critical eigenvectors of the graph 1-Laplacian, realizing sets with prescribed genus. We introduce pseudo-orthogonality to characterize m_3(G), a special eigenvalue of the graph 1-Laplacian. Furthermore, we use it to give an upper bound for the third graph Cheeger constant h_3(G), that is, h_3(G) ≤ m_3(G). This is a first step toward proving that the k-th Cheeger constant is the minimum of the 1-Laplacian Rayleigh quotient among vectors that are pseudo-orthogonal to the vectors realizing the previous k−1 Cheeger constants. Finally, we apply these results to give a method and a numerical algorithm to compute m_3(G), based on a generalized inverse power method.
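For orientation, the quantities named in this abstract are usually written as below. These are the standard definitions of the k-way Cheeger constant and the 1-Laplacian Rayleigh quotient, assumed here because the abstract does not state them; the paper's own notation and normalization may differ. Only the inequality in the last line is taken from the abstract. The block assumes the `amsmath` package.

```latex
% Assumed setting: G = (V, E) is a finite graph, d_i is the degree of vertex i,
% vol(S) = \sum_{i \in S} d_i, and \partial S is the set of edges leaving S.
\[
  h_k(G) \;=\; \min_{\substack{S_1, \dots, S_k \subset V \\ \text{nonempty, pairwise disjoint}}}
  \; \max_{1 \le i \le k} \frac{|\partial S_i|}{\operatorname{vol}(S_i)}
\]
% Rayleigh quotient of the graph 1-Laplacian for a nonzero vector f on V.
\[
  R(f) \;=\; \frac{\sum_{\{i,j\} \in E} |f_i - f_j|}{\sum_{i \in V} d_i \, |f_i|}
\]
% Bound stated in the abstract.
\[
  h_3(G) \;\le\; m_3(G)
\]
```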
The Harmony Search (HS) algorithm is highly effective in solving a wide range of real-world engineering optimization problems. However, it still has problems such as being prone to local optima, low optimization accuracy, and low search efficiency. To address the limitations of the HS algorithm, a novel approach called the Dual-Memory Dynamic Search Harmony Search (DMDS-HS) algorithm is introduced. The main innovations of this algorithm are as follows. Firstly, a dual-memory structure is introduced to rank and hierarchically organize the harmonies in the harmony memory, creating an effective and selectable trust region to reduce blind searching. Furthermore, the trust region is dynamically adjusted to improve the convergence of the algorithm while maintaining its global search capability. Secondly, to boost the algorithm's convergence speed, a phased dynamic convergence domain concept is introduced to strategically devise a global random search strategy. Lastly, the algorithm constructs an adaptive parameter adjustment strategy to adjust the usage probability of the algorithm's search strategies, which aims to rationalize the exploration and exploitation abilities of the algorithm. The results on the IEEE Congress on Evolutionary Computation 2017 (CEC2017) test function set show that DMDS-HS outperforms the other nine HS algorithms and the other four state-of-the-art algorithms in terms of diversity, freedom from local optima, and solution accuracy. In addition, when DMDS-HS is applied to data clustering problems, the results show that its clustering performance exceeds that of the other seven classical clustering algorithms, which verifies the effectiveness and reliability of DMDS-HS in solving complex data clustering problems.
Clustering, in data mining, is a useful technique for discovering interesting data distributions and patterns in the underlying data, and has many application fields, such as statistical data analysis, pattern recognition, and image processing. We combine a sampling technique with the DBSCAN algorithm to cluster large spatial databases, and two sampling-based DBSCAN (SDBSCAN) algorithms are developed. One algorithm introduces the sampling technique inside DBSCAN, and the other uses a sampling procedure outside DBSCAN. Experimental results demonstrate that our algorithms are effective and efficient in clustering large-scale spatial databases.
The Circle algorithm was proposed for large datasets. The idea of the algorithm is to find a set of vertices that are close to each other and far from other vertices. This algorithm makes use of the connection between clustering aggregation and the problem of correlation clustering. The best deterministic approximation algorithm was provided for this variation of the correlation clustering problem, and it was shown how sampling can be used to scale the algorithms to large datasets. An extensive empirical evaluation was given of the usefulness of the problem and the solutions. The results show that this method achieves more than a 50% reduction in running time without sacrificing the quality of the clustering.
Clustering is used to gain an intuition of the structures in the data. Most current clustering algorithms produce a clustering structure even on data that do not possess such structure. In these cases, the algorithms force a structure on the data instead of discovering one. To avoid false structures in the relations of data, a novel clusterability assessment method called the density-based clusterability measure is proposed in this paper. It measures the prominence of clustering structure in the data to evaluate whether a cluster analysis could produce meaningful insight into the relationships in the data. This is especially useful for time-series data, since visualizing the structure in time-series data is hard. The performance of the clusterability measure is evaluated against several synthetic data sets and time-series data sets, which illustrate that the density-based clusterability measure can successfully indicate the clustering structure of time-series data.
A heterogeneous wireless sensor network comprises a number of inexpensive, energy-constrained wireless sensor nodes that collect data from the sensing environment and transmit them toward the improved cluster head in a coordinated way. Employing clustering techniques in such networks can achieve balanced energy consumption among member nodes and prolong the network lifetime. In classical clustering techniques, clustering and in-cluster data routing are usually separated into independent operations. Although considering these two issues separately simplifies the system design, it often yields a non-optimal lifetime expectancy for wireless sensor networks. This paper proposes an integral framework that integrates these two correlated items into an interactive whole. To that end, we formulate the clustering problems using nonlinear programming. The evolution process of clustering is provided in simulations. Results show that our joint-design proposal reaches a near-optimal match between member nodes and cluster heads.
Funding (ant colony clustering algorithm paper): Project (No. 18510132) supported by the Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research.
Funding: Supported by the National GNSS Research Center Program of the Defense Acquisition Program Administration and Agency for Defense Development, and by the Ministry of Science and ICT of the Republic of Korea through the Space Core Technology Development Program (No. NRF2018M1A3A3A02065722).
Abstract: In this paper, a cardinality compensation method based on the Information-weighted Consensus Filter (ICF) using data clustering is proposed in order to accurately estimate the cardinality of the Cardinalized Probability Hypothesis Density (CPHD) filter. Although the joint propagation of the intensity and the cardinality distribution in the CPHD filter allows for more reliable estimation of the cardinality (target number) than the PHD filter, tracking loss may occur in practical situations when noise and clutter in the measurements are high. For that reason, a cardinality compensation process is included in the CPHD filter. It is based on an information fusion step that uses the estimated cardinality obtained from the CPHD filter and the measured cardinality obtained through data clustering. Here, the ICF is used for information fusion. To verify the performance of the proposed method, simulations were carried out, and they confirmed that multi-target tracking performance improved because the cardinality was estimated more accurately than with the existing techniques.
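As a rough illustration of the fusion idea, the sketch below merges two scalar cardinality values by weighting each with its information (inverse variance). It is not the full ICF, which is an iterative consensus scheme, and the function name, the variances, and the sample numbers are assumptions made for the example.

```python
def fuse_cardinality(estimated, estimated_var, measured, measured_var):
    """Information-weighted (inverse-variance) fusion of two scalar values.

    `estimated` stands for the cardinality reported by the filter and
    `measured` for the cluster count found in the measurements; both
    variances are assumed to be known and positive.
    """
    info_est = 1.0 / estimated_var
    info_meas = 1.0 / measured_var
    fused = (info_est * estimated + info_meas * measured) / (info_est + info_meas)
    fused_var = 1.0 / (info_est + info_meas)
    return fused, fused_var


# Hypothetical values: the filter reports 3.2 targets (variance 0.5),
# clustering of the measurements reports 4 targets (variance 2.0).
fused, fused_var = fuse_cardinality(3.2, 0.5, 4.0, 2.0)
```

The source with the smaller variance dominates the result, and the fused variance is smaller than either input variance.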
Abstract: An algorithm, Clustering Algorithm Based On Sparse Feature Vector (CABOSFV), was proposed for the high dimensional clustering of binary sparse data. This algorithm compresses the data effectively by using a tool called the "Sparse Feature Vector", thus reducing the data scale enormously, and can get the clustering result with only one data scan. Both theoretical analysis and empirical tests showed that CABOSFV is of low computational complexity. The algorithm finds clusters in high dimensional large datasets efficiently and handles noise effectively.
Abstract: In recent years, functional data has been widely used in finance, medicine, biology and other fields. Current clustering analysis can solve problems in finite-dimensional space, but it is difficult to apply directly to the clustering of functional data. In this paper, we propose a new unsupervised clustering algorithm based on adaptive weights. In the absence of initialization parameters, we use entropy-type penalty terms and a fuzzy partition matrix to find the optimal number of clusters. At the same time, we introduce a measure based on adaptive weights to reflect the difference in information content between different clustering metrics. Simulation experiments show that the proposed algorithm has higher purity than some existing algorithms.
Funding: Supported by the National Key Research and Development Program of China (2023YFB3307801); the National Natural Science Foundation of China (62394343, 62373155, 62073142); the Major Science and Technology Project of Xinjiang (No. 2022A01006-4); the Programme of Introducing Talents of Discipline to Universities (the 111 Project) under Grant B17017; the Fundamental Research Funds for the Central Universities, Science Foundation of China University of Petroleum, Beijing (No. 2462024YJRC011); and the Open Research Project of the State Key Laboratory of Industrial Control Technology, China (Grant No. ICT2024B70).
Abstract: The distillation process is an important chemical process, and a data-driven modelling approach has the potential to reduce model complexity compared to mechanistic modelling, thus improving the efficiency of process optimization or monitoring studies. However, the distillation process is highly nonlinear and has multiple uncertainty perturbation intervals, which makes accurate data-driven modelling of distillation processes challenging. This paper proposes a systematic data-driven modelling framework to solve these problems. Firstly, data segment variance was introduced into the K-means algorithm to form K-means data interval (KMDI) clustering, which clusters the data into perturbed and steady-state intervals for steady-state data extraction. Secondly, the maximal information coefficient (MIC) was employed to calculate the nonlinear correlation between variables in order to remove redundant features. Finally, extreme gradient boosting (XGBoost) was integrated as the base learner into adaptive boosting (AdaBoost), with an error threshold (ET) set to improve the weight update strategy, to construct a new ensemble learning algorithm, XGBoost-AdaBoost-ET. The superiority of the proposed framework is verified by applying it to a real industrial propylene distillation process.
Abstract: Traditional methods tend to generate a large number of fake samples or lose data when classifying unbalanced data. Therefore, this paper proposes a novel DBSCAN (density-based spatial clustering of applications with noise) for data clustering. The density-based DBSCAN clustering decomposition algorithm is applied to the majority classes of unbalanced data sets, which reduces the advantage of the majority-class samples without data loss. The algorithm uses different distance measurements for disordered and ordered categorical data, and assigns corresponding weights with average entropy. The experimental results show that the new algorithm has a better clustering effect than other advanced clustering algorithms on both artificial and real data sets.
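For readers unfamiliar with the base method, the following is a minimal sketch of standard DBSCAN with Euclidean distance. It does not include the entropy-weighted distances or the majority-class decomposition described in the abstract; `eps`, `min_pts`, and the sample points are illustrative assumptions.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # Brute-force range query; the point itself counts as a neighbor.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
# labels == [0, 0, 0, 1, 1, 1, -1]
```

The isolated point at (50, 50) is reported as noise, which is the behavior that makes density-based methods attractive when a data set contains outliers.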
Funding: 2021 Scientific Research Funding Project of Liaoning Provincial Education Department (Research and implementation of university scientific research information platform serving the transformation of achievements).
Abstract: Big data clustering plays an important role in the field of data processing in wireless sensor networks. However, existing methods suffer from problems such as poor clustering effect and a low Jaccard coefficient. This paper proposes a novel big data clustering optimization method based on intuitionistic fuzzy set distance and particle swarm optimization for wireless sensor networks. The method combines principal component analysis and information entropy dimensionality reduction to process big data and reduce the time required for data clustering. A new distance measure for intuitionistic fuzzy sets is defined, which considers not only membership and non-membership information but also the allocation of hesitancy to membership and non-membership, thereby indirectly introducing hesitancy into the intuitionistic fuzzy set distance. The intuitionistic fuzzy kernel clustering algorithm is used to cluster big data, and particle swarm optimization is introduced to optimize the intuitionistic fuzzy kernel clustering method. The optimized algorithm is then used to obtain the clustering results for wireless sensor network big data. Simulation results show that the proposed method achieves a good clustering effect in comparison with other state-of-the-art clustering methods.
Abstract: Raw data are classified using clustering techniques in a reasonable manner to create disjoint clusters. Many clustering algorithms based on specific parameters have been proposed to handle high volumes of data. This paper focuses on cluster analysis based on neutrosophic set implication, i.e., a k-means algorithm with a threshold-based clustering technique. This algorithm addresses the shortcomings of the k-means clustering algorithm by overcoming the limitations of the threshold-based clustering algorithm. To evaluate the validity of the proposed method, several validity measures and validity indices are applied to the Iris dataset (from the University of California, Irvine, Machine Learning Repository) along with k-means and threshold-based clustering algorithms. The proposed method results in more segregated datasets with compacted clusters, thus achieving higher validity indices. The method also eliminates the limitations of the threshold-based clustering algorithm and validates measures and respective indices along with k-means and threshold-based clustering algorithms.
Abstract: Purpose: The purpose of the paper is to study the multiple viewpoints required to access more informative similarity features among tweet documents, which is useful for achieving robust tweet data clustering results. Design/methodology/approach: Let "N" be the number of tweet documents used for topic extraction. In the initial tweet pre-processing step, unwanted text, punctuation and other symbols are removed, and tokenization and stemming operations are performed. Bag-of-features are determined for the tweets; the tweets are then modelled with the obtained bag-of-features during topic extraction. Approximate topic features are extracted for every tweet document. These sets of topic features of the N documents are treated as multi-viewpoints. The key idea of the proposed work is to use multi-viewpoints in the similarity features computation. The following figure illustrates the multi-viewpoint-based cosine similarity computation for five tweet documents (here N = 5); the corresponding documents are defined in the projected space with five viewpoints, say, v_(1), v_(2), v_(3), v_(4), and v_(5). For example, similarity features between two documents (viewpoints v_(1) and v_(2)) are computed with respect to the other three viewpoints (v_(3), v_(4), and v_(5)), unlike the single viewpoint in the traditional cosine metric. Findings: Healthcare problems with tweets data. Topic models play a crucial role in the classification of health-related tweets by finding topics (or health clusters) instead of computing term frequency and inverse document frequency (TF-IDF) for unlabelled tweets. Originality/value: Topic models play a crucial role in the classification of health-related tweets by finding topics (or health clusters) instead of computing TF-IDF for unlabelled tweets.
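The multi-viewpoint idea can be sketched as follows: instead of measuring the angle between two document vectors from the origin, the angle is measured from every other document in turn and the results are averaged. The abstract does not give the exact formula, so the averaging rule, the function names, and the toy topic vectors below are assumptions for illustration only.

```python
from math import sqrt


def cosine(u, v):
    """Cosine of the angle between u and v (0.0 if either is a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def multi_viewpoint_similarity(docs, i, j):
    """Average cosine between docs[i] and docs[j] as seen from each other doc."""
    sims = []
    for h, viewpoint in enumerate(docs):
        if h in (i, j):
            continue
        # Shift both documents so that the viewpoint becomes the origin.
        di = [a - b for a, b in zip(docs[i], viewpoint)]
        dj = [a - b for a, b in zip(docs[j], viewpoint)]
        sims.append(cosine(di, dj))
    if not sims:
        raise ValueError("at least one viewpoint besides i and j is required")
    return sum(sims) / len(sims)


# Hypothetical topic-feature vectors for N = 5 tweet documents.
topic_vectors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.2, 0.8],
    [0.1, 0.1, 0.8],
]
sim_12 = multi_viewpoint_similarity(topic_vectors, 0, 1)  # uses v3, v4, v5
```

With N = 5, the similarity of the first two documents is computed from the three remaining viewpoints, matching the example in the abstract.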
Funding: Supported by the MiUR-Dipartimenti di Eccellenza 2018–2022 grant “Sistemi distribuiti intelligenti” of Dipartimento di Ingegneria Elettrica e dell’Informazione “M. Scarano”, by the MiSE-FSC 2014–2020 grant “SUMMa: Smart Urban Mobility Management”, and by GNAMPA of INdAM. The authors would also like to thank D. A. La Manna and V. Mottola for the helpful conversations during the starting stage of this work.
Abstract: The data clustering problem consists in dividing a data set into prescribed groups of homogeneous data. This is an NP-hard problem that can be relaxed in spectral graph theory, where the optimal cuts of a graph are related to the eigenvalues of the graph 1-Laplacian. In this paper, we first give new notations to describe the paths, among critical eigenvectors of the graph 1-Laplacian, realizing sets with prescribed genus. We introduce pseudo-orthogonality to characterize m_(3)(G), a special eigenvalue of the graph 1-Laplacian. Furthermore, we use it to give an upper bound for the third graph Cheeger constant h_(3)(G), that is, h_(3)(G)≤m_(3)(G). This is a first step toward proving that the k-th Cheeger constant is the minimum of the 1-Laplacian Rayleigh quotient among vectors that are pseudo-orthogonal to the vectors realizing the previous k−1 Cheeger constants. Finally, we apply these results to give a method and a numerical algorithm to compute m_(3)(G), based on a generalized inverse power method.
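For orientation, the quantities named above are commonly defined as follows. This is the standard formulation from the spectral graph theory literature, stated here as an assumption; the paper's own normalization may differ. Here G = (V, E) is a graph with vertex degrees d_i, ∂S is the set of edges leaving S, and vol(S) is the sum of the degrees in S.

```latex
% k-th Cheeger constant: best worst-case cut ratio over k disjoint subsets
h_k(G) \;=\; \min_{\substack{S_1,\dots,S_k \subset V \\ \text{disjoint, nonempty}}}
\; \max_{1 \le i \le k} \frac{|\partial S_i|}{\operatorname{vol}(S_i)},
\qquad \operatorname{vol}(S) = \sum_{i \in S} d_i .

% Rayleigh quotient of the graph 1-Laplacian for a nonzero vector f
R_1(f) \;=\; \frac{\sum_{\{i,j\} \in E} |f_i - f_j|}{\sum_{i \in V} d_i\, |f_i|} .

% Bound stated in the abstract for k = 3
h_3(G) \;\le\; m_3(G).
```

For the indicator vector of a set S, the quotient R_1 equals |∂S| / vol(S), which is why minimizing it over suitably constrained vectors relates to optimal cuts.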
Funding: This work was supported by the Fund of Innovative Training Program for College Students of Guangzhou University (No. s202211078116); Guangzhou City School Joint Fund Project (No. SL2022A03J01009); National Natural Science Foundation of China (No. 61806058); Natural Science Foundation of Guangdong Province (No. 2018A030310063); and Guangzhou Science and Technology Plan Project (No. 201804010299).
Abstract: The Harmony Search (HS) algorithm is highly effective in solving a wide range of real-world engineering optimization problems. However, it is still prone to local optima and suffers from low optimization accuracy and low search efficiency. To address these limitations, a novel approach called the Dual-Memory Dynamic Search Harmony Search (DMDS-HS) algorithm is introduced. The main innovations of this algorithm are as follows. Firstly, a dual-memory structure is introduced to rank and hierarchically organize the harmonies in the harmony memory, creating an effective and selectable trust region to reduce blind searching. Furthermore, the trust region is dynamically adjusted to improve the convergence of the algorithm while maintaining its global search capability. Secondly, to boost the algorithm’s convergence speed, a phased dynamic convergence domain concept is introduced to strategically devise a global random search strategy. Lastly, the algorithm constructs an adaptive parameter adjustment strategy to adjust the usage probability of its search strategies, which aims to balance the algorithm's exploration and exploitation abilities. Results on the Computational Experiment Competition on 2017 (CEC2017) test function set show that DMDS-HS outperforms nine other HS algorithms and four other state-of-the-art algorithms in terms of diversity, freedom from local optima, and solution accuracy. In addition, when DMDS-HS is applied to data clustering problems, its clustering performance exceeds that of seven other classical clustering algorithms, which verifies the effectiveness and reliability of DMDS-HS in solving complex data clustering problems.
Funding: Supported by the Open Researches Fund Program of LIESMARS (WKL(00)0302).
Abstract: Clustering, in data mining, is a useful technique for discovering interesting data distributions and patterns in the underlying data, and has many application fields, such as statistical data analysis, pattern recognition, and image processing. We combine a sampling technique with the DBSCAN algorithm to cluster large spatial databases, and develop two sampling-based DBSCAN (SDBSCAN) algorithms. One algorithm introduces the sampling technique inside DBSCAN, and the other uses a sampling procedure outside DBSCAN. Experimental results demonstrate that our algorithms are effective and efficient in clustering large-scale spatial databases.
Funding: Projects (60873265, 60903222) supported by the National Natural Science Foundation of China; Project (IRT0661) supported by the Program for Changjiang Scholars and Innovative Research Team in University of China.
Abstract: The Circle algorithm was proposed for large datasets. The idea of the algorithm is to find a set of vertices that are close to each other and far from other vertices. This algorithm makes use of the connection between clustering aggregation and the problem of correlation clustering. The best deterministic approximation algorithm was provided for this variation of the correlation clustering problem, and it was shown how sampling can be used to scale the algorithms to large datasets. An extensive empirical evaluation demonstrated the usefulness of the problem and the solutions. The results show that this method achieves more than a 50% reduction in running time without sacrificing the quality of the clustering.
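The connection between clustering aggregation and correlation clustering can be made concrete with a small sketch. It uses the usual formulation in which the distance between two objects is the fraction of input clusterings that separate them; this is an assumption about the setup, it is not the Circle algorithm itself, and the function names and input labelings are hypothetical.

```python
from itertools import combinations


def disagreement_distances(clusterings):
    """Map each pair (u, v) to the fraction of clusterings that separate it."""
    n = len(clusterings[0])
    m = len(clusterings)
    return {
        (u, v): sum(1 for labels in clusterings if labels[u] != labels[v]) / m
        for u, v in combinations(range(n), 2)
    }


def correlation_cost(labels, distances):
    """Correlation clustering cost of a candidate labeling.

    A pair placed together pays its distance; a pair placed apart pays
    one minus its distance. Lower is better.
    """
    return sum(
        d if labels[u] == labels[v] else 1.0 - d
        for (u, v), d in distances.items()
    )


# Three hypothetical input clusterings of four objects.
inputs = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 1, 1, 1],
]
distances = disagreement_distances(inputs)
cost_split = correlation_cost([0, 0, 1, 1], distances)
cost_single = correlation_cost([0, 0, 0, 0], distances)
```

In this toy case the two-cluster labeling has a lower cost than putting everything in one cluster, so it agrees better with the three inputs. Objects that are "close to each other and far from other vertices" in the abstract's sense are those with small pairwise distances inside a group and large distances to the rest.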
Abstract: Clustering is used to gain an intuition of the structures in the data. Most of the current clustering algorithms produce a clustering structure even on data that do not possess such structure. In these cases, the algorithms force a structure in the data instead of discovering one. To avoid false structures in the relations of data, a novel clusterability assessment method called the density-based clusterability measure is proposed in this paper. It measures the prominence of clustering structure in the data to evaluate whether a cluster analysis could produce a meaningful insight into the relationships in the data. This is especially useful for time-series data, since visualizing the structure in time-series data is hard. The performance of the clusterability measure is evaluated against several synthetic data sets and time-series data sets, which illustrates that the density-based clusterability measure can successfully indicate the clustering structure of time-series data.
Funding: Supported by the National Natural Science Foundation of China (Nos. 61304131 and 61402147); Grant of China Scholarship Council (No. 201608130174); Natural Science Foundation of Hebei Province (Nos. F2016402054 and F2014402075); the Scientific Research Plan Projects of Hebei Education Department (Nos. BJ2014019, ZD2015087 and QN2015046); and the Research Program of Talent Cultivation Project in Hebei Province (No. A2016002023).
Abstract: A heterogeneous wireless sensor network comprises a number of inexpensive, energy-constrained wireless sensor nodes which collect data from the sensing environment and transmit them toward the improved cluster head in a coordinated way. Employing clustering techniques in such networks can achieve balanced energy consumption of member nodes and prolong the network lifetime. In classical clustering techniques, clustering and in-cluster data routing are usually treated as independent operations. Although considering these two issues separately simplifies the system design, it often yields a non-optimal lifetime expectancy for wireless sensor networks. This paper proposes an integral framework that treats these two correlated items as an interactive whole. To that end, we formulate the clustering problems using nonlinear programming. The evolution process of clustering is shown in simulations. Results show that our joint-design proposal reaches a near-optimal match between member nodes and cluster heads.