Abstract: An algorithm, Clustering Algorithm Based On Sparse Feature Vector (CABOSFV), was proposed for clustering high-dimensional binary sparse data. The algorithm compresses the data effectively by using a tool called the 'Sparse Feature Vector', which greatly reduces the data scale, and it obtains the clustering result with only one data scan. Both theoretical analysis and empirical tests showed that CABOSFV has low computational complexity. The algorithm finds clusters in large high-dimensional datasets efficiently and handles noise effectively.
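The abstract does not define the Sparse Feature Vector, so the following is only a minimal sketch of one-scan clustering of binary sparse data under stated assumptions. Each cluster is summarized by its size, the attributes on which all members are 1, and the attributes on which members disagree. The difference measure (disagreeing attributes divided by cluster size times shared attributes), the function names, and the threshold are all assumptions made for illustration, not details taken from the source.

```python
def merge(summary_a, summary_b):
    """Combine two cluster summaries (size, shared attributes, disagreeing attributes)."""
    n_a, shared_a, diff_a = summary_a
    n_b, shared_b, diff_b = summary_b
    shared = shared_a & shared_b
    # Attributes shared by only one side become disagreeing attributes.
    disagreeing = diff_a | diff_b | (shared_a ^ shared_b)
    return (n_a + n_b, shared, disagreeing)


def difference(summary):
    """Assumed difference measure: lower means a more homogeneous cluster."""
    n, shared, disagreeing = summary
    if not shared:
        return float("inf")
    return len(disagreeing) / (n * len(shared))


def one_scan_cluster(objects, threshold):
    """Assign each object (a set of attribute indices equal to 1) in a single pass."""
    clusters = []  # list of (summary, member indices)
    for idx, attrs in enumerate(objects):
        single = (1, frozenset(attrs), frozenset())
        best = None
        for c, (summary, _members) in enumerate(clusters):
            candidate = merge(summary, single)
            d = difference(candidate)
            if d <= threshold and (best is None or d < best[0]):
                best = (d, c, candidate)
        if best is None:
            clusters.append((single, [idx]))  # start a new cluster
        else:
            _, c, candidate = best
            clusters[c] = (candidate, clusters[c][1] + [idx])
    return [members for _, members in clusters]


# Made-up sparse binary objects, each given as the set of attributes equal to 1.
objects = [{1, 2, 3}, {1, 2, 3, 4}, {7, 8}, {7, 8, 9}, {1, 2}]
clusters = one_scan_cluster(objects, threshold=0.5)
print(clusters)
```

Because only the summaries are kept, each object is read once and never revisited, which is the property the abstract attributes to the single data scan.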
Funding: National Natural Science Foundation of China, No. 40971102; Knowledge Innovation Project of the Chinese Academy of Sciences, No. KZCX2-YW-322; Special Grant for Postgraduates' Scientific Innovation and Social Practice in 2008.
Abstract: Traditional spatial clustering methods have the disadvantage of "hardware division" and cannot describe the physical characteristics of spatial entities effectively. In view of this, the paper sets forth a general multi-dimensional cloud model, which describes the characteristics of spatial objects more reasonably according to the ideas of non-homogeneity and asymmetry. Based on the classification and demarcation of infrastructure in Zhanjiang, a detailed interpretation of the clustering results is made from three angles: the spatial distribution of the clustering membership degree, a comparative study with Fuzzy C-means, and a coupled analysis of residential land prices. The general multi-dimensional cloud model reflects the integrated characteristics of spatial objects better, reveals the spatial distribution of potential information, and realizes spatial division more accurately in complex circumstances. However, due to the complexity of spatial interactions between geographical entities, generating the cloud model is a specific and challenging task.
Funding: Supported in part by the National Natural Science Foundation of China (Nos. 61303074, 61309013), the National Key Basic Research and Development Program ("973") of China (No. 2012CB315900), and the Programs for Science and Technology Development of Henan Province (Nos. 12210231003, 13210231002).
Abstract: To address the issue that traditional clustering methods are not appropriate for high-dimensional data, a cuckoo search fuzzy-weighting algorithm for subspace clustering is presented on the basis of the existing soft subspace clustering algorithm. In the proposed algorithm, a novel objective function is first designed by considering the fuzzy-weighted within-cluster compactness and the between-cluster separation, and by loosening the constraints on the dimension weight matrix. Then gradual membership and an improved cuckoo search, a global search strategy, are introduced to optimize the objective function and search for subspace clusters, giving novel learning rules for clustering. Finally, the performance of the proposed algorithm on the clustering analysis of various low- and high-dimensional datasets is experimentally compared with that of several competitive subspace clustering algorithms. Experimental studies demonstrate that the proposed algorithm obtains better performance than most existing soft subspace clustering algorithms.
Funding: Project (61362021) supported by the National Natural Science Foundation of China; Project (2016GXNSFAA380149) supported by the Natural Science Foundation of Guangxi Province, China; Projects (2016YJCXB02, 2017YJCX34) supported by the Innovation Project of GUET Graduate Education, China; Project (2011KF11) supported by the Key Laboratory of Cognitive Radio and Information Processing, Ministry of Education, China.
Abstract: Outlier detection is an important task in data mining. In some sophisticated multidimensional datasets it is difficult to find the clustering centers and to measure the degree of deviation of each potential outlier. In this work, an effective outlier detection method based on multi-dimensional clustering and local density (ODBMCLD) is proposed. ODBMCLD first identifies the center objects by the local density peaks of the data objects and clusters the whole dataset based on these center objects. Then, outlier objects belonging to different clusters are marked as candidates for abnormal data. Finally, the top N points among these candidates, those with high outlier factors, are chosen as the final anomaly objects. The feasibility and effectiveness of the method are verified by experiments.
Abstract: A hierarchical scheme for clustering data is presented which applies to spaces with a high number of dimensions (). The data set is first reduced to a smaller set of partitions (multi-dimensional bins). Multiple clustering techniques are used, including spectral clustering; however, new techniques are also introduced based on the path length between partitions that are connected to one another. A Line-of-Sight algorithm is also developed for clustering. A test bank of 12 data sets with varying properties is used to expose the strengths and weaknesses of each technique. Finally, a robust clustering technique is discussed that is based on reaching a consensus among the multiple approaches, overcoming the weaknesses found individually.
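Abstract: To effectively identify and remove abnormal data from measured wind turbine data, an abnormal-data identification algorithm based on manifold learning is proposed by analyzing the high-dimensional characteristics of the measured data. First, a k-nearest-neighbor mutual information algorithm is used to select the wind turbine feature variables. Next, an optimized t-distributed stochastic neighbor embedding (t-SNE) algorithm, in which the distance metric between samples is replaced by a weighted sum of the Euclidean metric and the local principal component analysis (LPCA) difference, is used to extract low-dimensional features with intrinsic regularity from the high-dimensional manifold data, so that data with different distribution characteristics are clearly separated in a two-dimensional visualization space. Finally, the density-based spatial clustering of applications with noise (DBSCAN) algorithm is used to cluster the data in the two-dimensional space. The results show that, compared with the principal component analysis (PCA) algorithm, the locally linear embedding (LLE) algorithm, and the original t-SNE algorithm, the proposed method can visually separate and cluster data from a variety of complex operating conditions and can identify and remove abnormal data.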
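The last step of this pipeline, DBSCAN on two-dimensional points, can be sketched with the standard library alone. This is a minimal illustration of the textbook algorithm, not the authors' implementation: the coordinates are made-up stand-ins for an embedding, and the function name `dbscan` and the values of `eps` and `min_pts` are assumptions. Points labeled -1 are noise, which is how abnormal samples would be flagged for removal.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster index, or NOISE if it is in no dense region."""
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # core point: keep growing the cluster
    return labels


# Made-up 2-D coordinates standing in for embedded operating-condition data.
embedded = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),  # one operating condition
    (3.0, 3.0), (3.1, 3.0), (3.0, 3.1),              # another operating condition
    (10.0, 10.0),                                    # isolated sample
]
labels = dbscan(embedded, eps=0.2, min_pts=3)
print(labels)
```

The isolated sample has no neighbors within `eps`, so it stays labeled as noise while the two dense groups each receive their own cluster index.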
Funding: Supported by the State Grid Science and Technology Project (No. 5442AI90009) and the Natural Science Foundation of China (No. 6170337).
Abstract: Due to the increase in the number of smart meter devices, a power grid generates a large amount of data. Analyzing the data can help in understanding the users' electricity consumption behavior and demands, thus enabling better service to be provided to them. Power load profile clustering is the basis for mining the users' electricity consumption behavior. By examining the complexity, randomness, and uncertainty of the users' electricity consumption behavior, this paper proposes an ensemble clustering method to analyze this behavior. First, principal component analysis (PCA) is used to reduce the dimensions of the data. Subsequently, single clustering methods are applied and the majority result is selected for the integrated clustering. As a result, the users' electricity consumption behavior is classified into different modes, and their characteristics are analyzed in detail. The paper examines the electricity power data of 19 real users in China for simulation purposes and provides a thorough analysis, along with suggestions, of the users' weekly electricity consumption behavior. The results verify the effectiveness of the proposed method.
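The abstract does not say how the majority is taken across the single clustering results. One common way to do it, shown here purely as an assumption, is pairwise voting: two users end up in the same final mode when more than half of the base clusterings group them together. This avoids having to match cluster labels between methods. The function name `majority_consensus` and the three base labelings are hypothetical.

```python
from itertools import combinations


def majority_consensus(labelings):
    """Merge users that a strict majority of base clusterings place together."""
    n = len(labelings[0])
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        votes = sum(1 for labels in labelings if labels[i] == labels[j])
        if 2 * votes > len(labelings):
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    # Renumber the merged groups 0, 1, 2, ... in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


# Hypothetical results of three base clustering methods for five users.
# Label values are arbitrary per method; only co-membership matters.
base = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
]
consensus = majority_consensus(base)
print(consensus)
```

Linking pairs through union-find makes the final grouping transitive, so a user can join a mode through a chain of majority-supported pairs even when one base method disagrees.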
Funding: Supported by a Grant-in-Aid for Scientific Research on Innovative Areas, "Synchronized Long-Period Stacking Ordered Structure", from the Ministry of Education, Culture, Sports, Science and Technology, Japan (No. 23109006), and by the Fundamental Research Funds for the Central Universities (No. FRFTP-17-003A1).
Abstract: The three-dimensional distribution of solute elements in an Mg–Zn–Gd alloy during the ageing process is quantitatively characterized by three-dimensional atom probe (3DAP) tomography. Based on the radial distribution function, it is found that Zn–Gd solute pairs in the Mg matrix appear mainly at two peaks at the early stage of ageing, and the separation distance between Zn and Gd atoms can be well rationalized by first-principles calculation. Moreover, the fraction of Zn–Gd solute pairs first increases and then decreases due to the precipitation of long-period stacking ordered (LPSO) structures. Both the composition of the structural unit in the LPSO structure and the solute enrichment around it are quantified. It is found that Zn and Gd elements are synchronized in the LPSO structure, and solute segregation of pure Zn or Gd is not observed at the transformation front of the LPSO structure in this alloy. In addition, the crystallography of the transformation front is further determined from the 3DAP data.
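Preferred Zn–Gd separations show up as peaks when the pair distances are binned. The sketch below only counts pairs per distance bin; it is not a normalized radial distribution function, which would also divide each bin by shell volume and mean density. The coordinates, the bin width, and the function name `pair_separation_histogram` are assumptions for illustration, not data from the study.

```python
from collections import Counter
from math import dist


def pair_separation_histogram(atoms_a, atoms_b, bin_width, r_max):
    """Count A-B pairs per distance bin; key k covers [k*bin_width, (k+1)*bin_width)."""
    counts = Counter()
    for a in atoms_a:
        for b in atoms_b:
            r = dist(a, b)
            if r < r_max:
                counts[int(r // bin_width)] += 1
    return dict(counts)


# Made-up positions in nanometres: one Zn atom and three Gd atoms.
zn_atoms = [(0.0, 0.0, 0.0)]
gd_atoms = [(0.32, 0.0, 0.0), (0.34, 0.0, 0.0), (0.55, 0.0, 0.0)]
histogram = pair_separation_histogram(zn_atoms, gd_atoms, bin_width=0.1, r_max=1.0)
print(histogram)
```

In this toy input, two pairs fall in the 0.3 to 0.4 nm bin and one in the 0.5 to 0.6 nm bin, which is the kind of two-peak pattern the abstract describes.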
Abstract: To overcome the limitation of traditional clustering algorithms, which fail to produce meaningful clusters in high-dimensional, sparse, binary-valued data sets, a new method based on a hypergraph model is proposed. The hypergraph model maps the relationships present in the original data in high-dimensional space into a hypergraph. A hyperedge represents the similarity of the attribute-value distribution between two points. A hypergraph partitioning algorithm is used to find a partitioning of the vertices such that the corresponding data items in each partition are highly related and the weight of the hyperedges cut by the partitioning is minimized. The quality of the clustering result can be evaluated by applying the intra-cluster singularity value. Analysis and experimental results have demonstrated that this approach is applicable and effective in a wide range of schemes.
Funding: Sponsored by the Youth Innovation Promotion Association of the Chinese Academy of Sciences (2017233); the National Natural Science Foundation of China (No. 51472249); the Innovation Project of the Institute of Metal Research (2015-ZD04); the National Natural Science Foundation of China Research Fund for International Young Scientists (No. 51750110515); and the Special Program for Applied Research on Super Computation of the NSFC-Guangdong Joint Fund (second phase) under Grant No. U1501501.
Abstract: The effect of Co addition on the formation of Ni-Ti clusters in maraging stainless steel was studied by three-dimensional atom probe (3DAP) and first-principles calculation. The cluster analysis based on the maximum separation approach showed an increase in size but a decrease in density of Ni-Ti clusters with increasing Co content. The first-principles calculation indicated weaker Co-Ni (Co-Ti) interactions than Co-Ti (Fe-Ti) interactions, which should be the essential reason for the change in the distribution characteristics of Ni-Ti clusters in bcc Fe caused by Co addition.
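The abstract names the maximum separation approach without giving its parameters. As a minimal sketch, assuming the usual formulation in which solute atoms closer than a cutoff distance are linked and linked groups smaller than a minimum size are discarded, the grouping can be written with a union-find structure. The names `d_max` and `n_min`, their values, and the coordinates are hypothetical.

```python
from itertools import combinations
from math import dist


def max_separation_clusters(points, d_max, n_min):
    """Return groups of point indices linked by separations <= d_max, size >= n_min."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= d_max:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    # Keep only groups large enough to count as clusters.
    return sorted((g for g in groups.values() if len(g) >= n_min), key=lambda g: g[0])


# Made-up solute atom positions in nanometres.
solutes = [
    (0.0, 0.0, 0.0), (0.4, 0.0, 0.0), (0.8, 0.1, 0.0),  # chain of close atoms
    (5.0, 5.0, 5.0), (5.3, 5.0, 5.0),                   # a pair, too small to count
    (9.0, 0.0, 0.0),                                    # isolated atom
]
found = max_separation_clusters(solutes, d_max=0.5, n_min=3)
print(found)
```

Cluster size and number density, the two quantities compared across Co contents in the abstract, would then follow from the lengths of these groups and their count per analyzed volume.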
Funding: Supported by the Eleventh Five-Year Development Planning for Instructional Science in Hubei Province (2006B131).
Abstract: [Objective] This study aimed to construct four-dimensional graphics of the nucleotide sequences of six rice genes (GluB-6, GluB-7, PDIL2, OsMPK1, OsCATC, OsCATA) and to conduct phase-space clustering, thus demonstrating the relationship between the structure and function of rice genes. [Method] Base sequences were represented by four-dimensional graphics and clustered in the phase space. The relationship between the clustering results and the biological characteristics of these genes was analyzed. [Result] Genes with similar four-dimensional graphics exhibit similar biological characteristics. [Conclusion] The four-dimensional graphics of genes with different functions and base lengths show a phase-space relationship with their biological functions, which provides an effective way to predict gene function.