Funding: Supported by the UGC, SERO, Hyderabad, under FDP during the XI plan period, and by the UGC, New Delhi, through financial assistance under major research project Grant No. F-34-105/2008.
Abstract: Feature selection (FS) is the process of selecting the most informative features, and it is an important step in knowledge discovery. Not all features are important: some may be redundant, and others may be irrelevant or noisy. Conventional supervised FS methods evaluate candidate feature subsets with an evaluation function or metric and keep only those features that are related to the decision classes of the data under consideration. In many data mining applications, however, decision class labels are unknown or incomplete, which makes unsupervised feature selection significant. In this paper, we propose a new unsupervised quick reduct (QR) algorithm based on rough set theory. The quality of the reduced data is measured by classification performance, evaluated using the WEKA classifier tool. The method is compared with existing supervised methods, and the results demonstrate the efficiency of the proposed algorithm.
Funding: Supported by the National Natural Science Foundation of China under Grant No. 61373120 and the Aeronautical Science Foundation of China under Grant No. 2014ZD53049.
Abstract: The backup requirement of data centres is tremendous, as the volume of data created by humans is massive and increasing exponentially. Single-node deduplication cannot meet this growing requirement. A feasible alternative is the deduplication cluster, which can meet it by adding storage nodes. The data routing strategy is the key to a deduplication cluster. DRSS (data routing strategy using semantics) greatly improves on the storage utilization of the MCS (minimum chunk signature) data routing strategy. However, for a large deduplication cluster, the load balance of DRSS is worse than that of MCS. To improve the load balance of DRSS, we propose a load balance strategy for DRSS, namely DRSSLB. When a node is overloaded, DRSSLB iteratively migrates the current smallest container of that node to the smallest node in the deduplication cluster until the overloaded node is no longer overloaded. A container is the minimum unit of data migration. Similar files that share the same features or file names are stored in the same container, which ensures that groups of similar data remain on the same node after the nodes are rebalanced. We evaluate DRSSLB on a real-world dataset. Experimental results show that, for various numbers of nodes in the deduplication cluster, the data skew of DRSSLB stays below a predefined value while its storage utilization hardly increases compared with DRSS, at a low penalty (the data migration rate is only 6.5% when the number of nodes is 64).
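The migration loop described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define how overload or data skew is measured, so the rule used here (a node is overloaded when its stored size exceeds `threshold` times the mean node size) is an assumption, and the names `rebalance`, `node_size`, `nodes`, and `threshold` are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: node id -> {container id -> container size}.
Cluster = Dict[str, Dict[str, int]]


def node_size(containers: Dict[str, int]) -> int:
    """Total size of all containers stored on one node."""
    return sum(containers.values())


def rebalance(nodes: Cluster, threshold: float) -> List[Tuple[str, str, str]]:
    """Move smallest containers off overloaded nodes, in place.

    ASSUMPTION: a node counts as overloaded when its size exceeds
    threshold * mean node size; the abstract gives no exact criterion.
    Returns the migrations as (container id, source node, target node).
    """
    moves: List[Tuple[str, str, str]] = []
    # Migration preserves the total, so the mean (and the limit) is fixed.
    mean = sum(node_size(c) for c in nodes.values()) / len(nodes)
    limit = threshold * mean

    # Visit nodes from largest to smallest (order fixed at the start).
    for src in sorted(nodes, key=lambda n: node_size(nodes[n]), reverse=True):
        while nodes[src] and node_size(nodes[src]) > limit:
            # Target is the currently smallest node in the cluster.
            dst = min(nodes, key=lambda n: node_size(nodes[n]))
            if dst == src:
                break  # nothing smaller to migrate to
            # The container is the unit of migration; take the smallest one.
            cid = min(nodes[src], key=nodes[src].get)
            nodes[dst][cid] = nodes[src].pop(cid)
            moves.append((cid, src, dst))
    return moves
```

Because whole containers are moved, files grouped into one container stay together on the target node, which matches the abstract's point about keeping similar data on the same node after rebalancing.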