Abstract: Clustering evolving data streams must be performed in limited time and with reasonable quality. Existing micro-clustering-based methods do not consider the distribution of data points inside a micro-cluster. We propose LeaDen-Stream (Leader Density-based clustering algorithm over evolving data Stream), a density-based clustering algorithm using leader clustering. The algorithm has two phases. The online phase selects the proper mini-micro or micro-cluster leaders based on the distribution of data points in the micro-clusters. The leader centers are then sent to the offline phase to form the final clusters. By carefully choosing between the two kinds of micro leaders, LeaDen-Stream decreases the time complexity of clustering while maintaining cluster quality. A pruning strategy that introduces dense and sparse mini-micro and micro-cluster leaders is also used to separate real data from noise. Our performance study over a number of real and synthetic data sets demonstrates the effectiveness and efficiency of our method.
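The leader clustering that the online phase builds on can be illustrated with a minimal sketch. This is the classic single-pass leader algorithm, not LeaDen-Stream's selection between mini-micro and micro-cluster leaders, which the abstract does not specify; the function name `leader_clustering`, the `threshold` parameter, and the toy stream are assumptions made for illustration.

```python
import math

def leader_clustering(points, threshold):
    """Single-pass leader clustering (illustrative, not LeaDen-Stream itself).

    Each point joins the nearest existing leader if it lies within
    `threshold`; otherwise the point becomes a new leader.
    """
    leaders = []   # leader coordinates, in order of creation
    members = []   # members[i] holds the points assigned to leaders[i]
    for p in points:
        best_idx, best_dist = None, float("inf")
        for i, leader in enumerate(leaders):
            d = math.dist(p, leader)
            if d < best_dist:
                best_idx, best_dist = i, d
        if best_idx is not None and best_dist <= threshold:
            members[best_idx].append(p)
        else:
            leaders.append(p)
            members.append([p])
    return leaders, members

# Hypothetical toy stream: two groups of nearby points.
stream = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 4.9), (0.0, 0.2)]
leaders, members = leader_clustering(stream, threshold=1.0)
```

Because every point is visited once and compared only with the current leaders, the sketch respects the one-pass constraint of a stream; the result depends on arrival order, which is a known property of leader clustering.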
Funding: Supported by the National Natural Science Foundation of China (60661003) and the Research Project Department of Education of Jiangxi Province (GJJ10566).
Abstract: In this paper, we explore a novel ensemble method for spectral clustering. In contrast to traditional clustering ensemble methods, which combine all the obtained clustering results, we propose an adaptive spectral clustering ensemble method to achieve a better clustering solution. The method can adaptively determine the number of component members, a capability that many other algorithms lack. The component clusterings of the ensemble are generated by spectral clustering (SC), which has characteristics well suited to producing diverse committees. The selection process evaluates the generated component spectral clusterings using a resampling technique and the population-based incremental learning (PBIL) algorithm. Experimental results on UCI datasets demonstrate that the proposed algorithm achieves better results than traditional clustering ensemble methods, especially when the number of component clusterings is large.
Funding: Project (61362021) supported by the National Natural Science Foundation of China; Project (2016GXNSFAA380149) supported by the Natural Science Foundation of Guangxi Province, China; Projects (2016YJCXB02, 2017YJCX34) supported by the Innovation Project of GUET Graduate Education, China; Project (2011KF11) supported by the Key Laboratory of Cognitive Radio and Information Processing, Ministry of Education, China.
Abstract: Outlier detection is an important task in data mining. In some complex multidimensional datasets, it is difficult to find the cluster centers and to measure the degree of deviation of each potential outlier. In this work, an effective outlier detection method based on multi-dimensional clustering and local density (ODBMCLD) is proposed. ODBMCLD first identifies the center objects from the local density peaks of the data objects and clusters the whole dataset around these centers. Outlier objects belonging to different clusters are then marked as candidates for abnormal data. Finally, the top N points with the highest outlier factors among these candidates are chosen as the final anomalies. The feasibility and effectiveness of the method are verified by experiments.
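Identifying centers from local density peaks can be sketched as follows. The abstract does not give ODBMCLD's formulas, so this sketch assumes the common density-peak formulation: a local density `rho` (number of neighbors inside a cutoff) and a separation `delta` (distance to the nearest point of higher density). The function name, the `cutoff` value, and the toy data are hypothetical.

```python
import math

def density_peaks(points, cutoff):
    """Return (rho, delta) per point, in the style of density-peak clustering.

    rho[i]   = number of other points closer than `cutoff` to point i.
    delta[i] = distance to the nearest point with strictly higher rho,
               or the largest distance from i if no such point exists.
    """
    n = len(points)
    dist = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
           for i in range(n)]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta

# Hypothetical toy data: two tight groups plus one isolated point (index 10).
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0), (10.5, 10.5),
    (5.0, 20.0),
]
rho, delta = density_peaks(points, cutoff=1.0)

# Rank by rho * delta; the two highest-ranked indices are candidate centers.
ranked = sorted(range(len(points)), key=lambda i: rho[i] * delta[i], reverse=True)
centers = ranked[:2]
```

In this toy data the middle point of each group is ranked as a center, while the isolated point has zero local density and a large separation, which is the kind of profile an outlier factor would be expected to rank highly.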
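Abstract: To effectively identify and remove abnormal data from measured wind turbine data, an abnormal data identification algorithm based on manifold learning is proposed by analyzing the high-dimensional characteristics of the measured data. First, a k-nearest-neighbor mutual information algorithm is used to select the wind turbine feature variables. Next, an optimized t-distributed stochastic neighbor embedding (t-SNE) algorithm, in which the distance metric between samples is replaced by a weighted sum of the Euclidean metric and the local principal component analysis (LPCA) difference, extracts low-dimensional features with intrinsic regularity from the high-dimensional manifold data, so that data with different distribution characteristics are clearly separated in the two-dimensional visualization space. Finally, the density-based spatial clustering of applications with noise (DBSCAN) algorithm clusters the data in the two-dimensional space. The results show that, compared with principal component analysis (PCA), locally linear embedding (LLE), and the original t-SNE algorithm, the proposed method can visually separate and cluster data from various complex operating conditions and can identify and remove abnormal data.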
Funding: Supported by the National Key Research and Development Program of China (2018YFB1003700); the Scientific and Technological Support Project (Society) of Jiangsu Province (BE2016776); the "333" Project of Jiangsu Province (BRA2017228, BRA2017401); and the Talent Project in Six Fields of Jiangsu Province (2015-JNHB-012).
Abstract: For imbalanced datasets, the focus of classification is to identify samples of the minority class. Current data mining algorithms do not perform well enough on imbalanced datasets. The synthetic minority over-sampling technique (SMOTE) is specifically designed for learning from imbalanced datasets; it generates synthetic minority class examples by interpolating between nearby minority class examples. However, SMOTE suffers from the overgeneralization problem. Density-based spatial clustering of applications with noise (DBSCAN) is not rigorous when dealing with samples near the borderline. We optimize the DBSCAN algorithm for this problem to make the clustering more reasonable. This paper integrates the optimized DBSCAN with SMOTE and proposes a density-based synthetic minority over-sampling technique (DSMOTE). First, the optimized DBSCAN divides the minority class samples into three groups: core samples, borderline samples, and noise samples. The noise samples of the minority class are then removed so that more effective samples can be synthesized. To make full use of the information in core and borderline samples, different strategies are used to over-sample each of the two groups. Experiments show that DSMOTE achieves better results than SMOTE and Borderline-SMOTE in terms of precision, recall, and F-value.
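The two ingredients named above, grouping minority samples into core, borderline, and noise samples and interpolating between minority neighbors, can be sketched in a few lines. This sketch uses the standard DBSCAN definitions and the basic SMOTE interpolation; the paper's optimized DBSCAN and its separate strategies for core and borderline samples are not described in the abstract and are not reproduced here. The names `categorize` and `smote_sample`, the `eps` and `min_pts` values, and the toy data are assumptions.

```python
import math
import random

def categorize(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' (standard DBSCAN rules).

    A point is core if at least `min_pts` points (itself included) lie
    within `eps`; border if it is not core but has a core point within
    `eps`; noise otherwise.
    """
    n = len(points)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    core = {i for i in range(n) if len(neighbors[i]) >= min_pts}
    labels = []
    for i in range(n):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighbors[i]):
            labels.append("border")
        else:
            labels.append("noise")
    return labels

def smote_sample(x, neighbor, rng):
    """Basic SMOTE step: a random point on the segment from x to neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(x, neighbor))

# Hypothetical minority-class samples; the last one is isolated.
minority = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (8.0, 8.0)]
labels = categorize(minority, eps=1.0, min_pts=3)

# Synthesize one sample between a core sample and a borderline sample.
rng = random.Random(0)
synthetic = smote_sample(minority[3], minority[4], rng)
```

Samples labeled noise would be dropped before interpolation, so that no synthetic point is generated around an isolated minority sample.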
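Abstract: To address the difficulty of accurately identifying changes in customer profile information, caused by the intensifying variability of customers' industry characteristics in recent years, this paper proposes a data-driven method for identifying abnormal load characteristics. First, a two-stage method for constructing typical industry load patterns is proposed: hierarchical density-based spatial clustering of applications with noise (HDBSCAN) extracts each customer's typical daily load curves under different scenarios, and an improved K-means algorithm clusters the extracted curves to construct the industry's typical load patterns. Second, an intelligent multi-dimensional, scenario-based method for judging load characteristic anomalies is proposed: customer load features are constructed, the entropy weight method evaluates the relative importance of the industry's typical scenarios, and a one-class support vector machine (OCSVM) quantifies the degree of abnormality of the customer's load features in each scenario. A weighted calculation then yields a comprehensive suspicion score for each customer, and customers are ranked by this score to accurately identify those with abnormal load characteristics. Finally, actual customer data from a certain region are used for case verification. The simulation results show that the proposed method is feasible and of practical value for constructing typical industry load scenarios and identifying abnormal load characteristics.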
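The entropy weight method mentioned above can be sketched in its textbook form: a column whose values vary more across samples carries more information and receives a larger weight. The abstract does not state how the paper builds its scenario matrix, so the input layout (rows are samples, columns are scenarios), the function name, and the numbers below are assumptions.

```python
import math

def entropy_weights(matrix):
    """Textbook entropy weight method for a non-negative score matrix.

    Rows are samples and columns are criteria (here, hypothetical
    scenarios). Assumes at least two rows, positive column sums, and at
    least one column that is not uniform.
    """
    n = len(matrix)
    m = len(matrix[0])
    divergence = []
    for j in range(m):
        column = [row[j] for row in matrix]
        total = sum(column)
        probs = [v / total for v in column]
        # Normalized Shannon entropy in [0, 1]; zero entries contribute 0.
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)
        # Clamp guards against tiny negative values from rounding.
        divergence.append(max(0.0, 1.0 - entropy))
    total_divergence = sum(divergence)
    return [d / total_divergence for d in divergence]

# Hypothetical scores: the first scenario is identical for all samples,
# so it should receive (almost) no weight.
scores = [
    [1.0, 1.0],
    [1.0, 2.0],
    [1.0, 9.0],
]
weights = entropy_weights(scores)
```

A comprehensive score of the kind described in the abstract would then be a weighted sum of per-scenario anomaly scores using these weights.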
Funding: Supported by the National Natural Science Foundation of P. R. China (60673178) and the National Basic Research Program of P. R. China (2006 CB 303000).
Abstract: In wireless sensor networks, topology control plays an important role in data forwarding efficiency for data gathering applications. In this paper, we present a novel topology control and data forwarding mechanism called REMUDA, which is designed for a practical indoor parking lot management system. REMUDA forms a tree-based hierarchical network topology that makes as many nodes as possible leaf nodes and constructs a virtual cluster structure. It also takes reliability, stability, and path length into account during tree construction. Through an experiment on a network of 30 real sensor nodes, we evaluate the performance of REMUDA and compare it with LEPS, another practical routing protocol in TinyOS. The experimental results show that REMUDA achieves better performance than LEPS.
Abstract: Fuzzy C-means (FCM) is simple and widely used for complex data pattern recognition and image analysis. However, selecting an appropriate fuzzifier (m) is crucial for identifying an optimal number of patterns and achieving higher clustering accuracy, and few studies have investigated this choice. Building on two existing methods for selecting the fuzzifier, we developed an integrated fuzzifier evaluation and selection algorithm and tested it on real datasets. Our findings indicate that a consistent optimal number of clusters can be learned by testing different fuzzifiers for each dataset, and that the fuzzifier with the lowest value for this consistency should be selected for clustering. Our evaluation also shows that the fuzzifier affects clustering accuracy. For longitudinal data with missing values, m = 2 could serve as an empirical rule for starting fuzzy clustering, and the best clustering accuracy was achieved for the tested data, especially when using our multiple-imputation-based fuzzy clustering.
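The effect of the fuzzifier can be seen directly in the standard FCM membership update, in which the membership of a point in cluster i is 1 / sum_j (d_i / d_j)^(2/(m-1)). The sketch below implements only this update for one point and fixed centers; it is not the selection algorithm described in the abstract, and the function name and example values are assumptions.

```python
import math

def fcm_memberships(point, centers, m=2.0):
    """Standard FCM membership update for one point and fixed centers.

    `m` is the fuzzifier (m > 1). Larger m gives softer memberships.
    """
    dists = [math.dist(point, c) for c in centers]
    # A point that coincides with one or more centers belongs fully to them.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** power for d_j in dists) for d_i in dists]

# Hypothetical example: a point at distance 1 and 3 from two centers.
centers = [(0.0, 0.0), (4.0, 0.0)]
crisp = fcm_memberships((1.0, 0.0), centers, m=2.0)  # memberships 0.9 and 0.1
soft = fcm_memberships((1.0, 0.0), centers, m=5.0)   # closer to equal shares
```

With m = 2 the nearer center receives a membership of 0.9; raising m to 5 pulls the memberships toward each other, which is why the choice of m influences both the apparent number of clusters and the clustering accuracy.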
Abstract: In this paper, we address the problem of determining the most suitable candidates in a Merger & Acquisition (M&A) scenario for banks and financial institutions. During the pre-merger period of an M&A, a number of candidates may be available for the merger or acquisition, but not all of them may be suitable. The normal practice is to carry out a due diligence exercise to identify the candidates that should lead to the optimum increase in shareholder value and customer satisfaction after the merger. The due diligence ought to be able to determine which candidates are unsuitable for merger, which are relatively suitable, and which are most suitable. To achieve this objective, we propose a Fuzzy Data Mining Framework in which Fuzzy Cluster Analysis is used to assess the advisability of merging two banks or other financial institutions. Subsequently, we propose the orchestration/composition of the business processes of the two banks into a consolidated business process during the M&A scenario. Our paper discusses the modeling of individual business processes with UML and the consolidation of the individual business process models by means of our proposed Knowledge Based approach.
Funding: Supported by the University of Malaya Research under Grant No. RG097-12ICT.
Abstract: Clustering data streams has drawn a lot of attention in the last few years due to their ever-growing presence. Data streams pose additional challenges for clustering, such as limited time and memory and one-pass processing. Furthermore, discovering clusters of arbitrary shape is very important in data stream applications. Data streams are infinite and evolve over time, and the number of clusters is not known in advance. In a data stream environment, noise appears occasionally due to various factors. Density-based methods are a notable class of data stream clustering methods: they can discover clusters of arbitrary shape and detect noise, and they do not need the number of clusters in advance. Because of the characteristics of data streams, traditional density-based clustering is not directly applicable. Recently, many density-based clustering algorithms have been extended to data streams. The main idea in these algorithms is to use density-based methods in the clustering process while overcoming the constraints imposed by the nature of data streams. The purpose of this paper is to shed light on some of the algorithms in the literature on density-based clustering over data streams. We not only summarize the main density-based clustering algorithms for data streams and discuss their uniqueness and limitations, but also explain how they address the challenges of clustering data streams. Moreover, we investigate the evaluation metrics used to validate cluster quality and measure algorithm performance. It is hoped that this survey will serve as a stepping stone for researchers studying data stream clustering, particularly density-based algorithms.
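Abstract: Existing network intrusion detection methods overlook the importance of correlations among traffic features for feature selection, and they fail to consider the scattered distribution of low-frequency attack samples when balancing data, which degrades detection performance. To address these problems, a mutual information value fusion (MIVF) method is proposed to select features that are highly correlated with attack behavior and weakly correlated with each other. A DBSCAN-improved SMOTE method is proposed to oversample low-frequency attack samples according to their density-clustering distribution, and an SAE-MSCNN classification model is built to evaluate performance. Validated on the NSL-KDD and UNSW-NB15 datasets, the accuracy reaches 92.89% and 94.85%, respectively. The results show that the proposed method can effectively select features and balance the data, and in particular improves the detection accuracy for low-frequency attacks.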
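The quantity underlying this kind of feature selection is the mutual information between two discrete variables, which can be sketched as below. How MIVF fuses the relevance (feature versus label) and redundancy (feature versus feature) values is not given in the abstract and is not attempted here; the function name and the toy sequences are assumptions.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c / n) / ((count_x[x] / n) * (count_y[y] / n)))
        for (x, y), c in count_xy.items()
    )

# Hypothetical binary label and two candidate features.
label = [0, 0, 1, 1]
informative = [0, 0, 1, 1]   # copies the label
uninformative = [0, 1, 0, 1]  # independent of the label

relevance_high = mutual_information(informative, label)
relevance_low = mutual_information(uninformative, label)
```

The same function applied to a pair of features gives a redundancy measure, so a selector of this kind would favor features with high values against the label and low values against features already chosen.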
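Abstract: Data continuously produced by Internet of Things (IoT) devices contain some abnormal data, which lowers the quality and efficiency of IoT communication data classification. A fast classification method for IoT communication data based on ensemble learning is therefore proposed. Communication data are collected from IoT devices; the isolation forest algorithm determines the anomaly score of each IoT communication data sample, and data with high anomaly scores are removed. The density-based spatial clustering of applications with noise (DBSCAN) algorithm then consolidates the remaining data, and an ensemble learning algorithm performs fast classification of the IoT communication data. Experimental results show that the classification accuracy of the proposed method stays above 97.2% and that the mean classification time is about 1.55 s, indicating good application potential.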