With the abundance of exceptionally high-dimensional data, feature selection has become an essential element of the data mining process. In this paper, we investigate efficient feature selection for classification on high-dimensional datasets. We present a novel filter-based approach that ranks features by a score, and we then measure the performance of four different data mining classification algorithms on the resulting data. In the proposed approach, we partition the sorted features and search for important features in both forward and reverse order, starting from the first and last features of the sorted list simultaneously. The approach is highly scalable and effective, as it parallelizes over both attributes and tuples, allowing us to evaluate many potential features on high-dimensional datasets. Experiments on real and synthetic high-dimensional datasets show that the proposed framework is valuable and improves the precision of the selected features. We have also measured its classification accuracy against various feature selection processes.
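The bidirectional scan over a score-ranked feature list can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function names, the score values, and the fixed threshold rule are all our own assumptions.

```python
# Hypothetical sketch: rank features by a relevance score, then examine the
# sorted list from both ends at once, as the abstract describes.
def rank_features(scores):
    """Return feature indices sorted by descending score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def bidirectional_select(scores, threshold):
    """Scan the ranked list from the front and the back simultaneously,
    keeping any feature whose score clears the threshold."""
    order = rank_features(scores)
    selected = []
    lo, hi = 0, len(order) - 1
    while lo <= hi:
        if scores[order[lo]] >= threshold:
            selected.append(order[lo])
        if hi != lo and scores[order[hi]] >= threshold:
            selected.append(order[hi])
        lo += 1
        hi -= 1
    return sorted(selected)

# Example: four features with relevance scores
print(bidirectional_select([0.9, 0.1, 0.7, 0.3], threshold=0.5))  # -> [0, 2]
```

In a parallel setting, the forward and reverse scans could run concurrently, which is presumably where the scalability claim comes from.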
This paper proposes a feature selection method based on Bayes' theorem. Its purpose is to reduce computational complexity and increase the classification accuracy of the selected feature subsets. The dependence between two binary attributes is determined from the probabilities of their joint values that contribute to positive and negative classification decisions. If opposing sets of attribute values do not lead to opposing classification decisions (zero probability), the two attributes are considered independent of each other; otherwise they are dependent, and one of them can be removed, reducing the number of attributes. The process is repeated over all combinations of attributes. The paper also evaluates the approach by comparing it with existing feature selection algorithms on 8 datasets from the University of California, Irvine (UCI) machine learning repository. The proposed method yields better results than most existing algorithms in terms of the number of selected features, classification accuracy, and running time.
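One loose reading of the dependence test above can be sketched as follows. The abstract does not spell out the probability computation, so the predicate below, which flags two binary attributes as dependent when some joint value pattern of theirs co-occurs with both classification decisions, is an assumption, as are all the names.

```python
# Hypothetical sketch of the pairwise dependence check: estimate, from the
# data, whether a joint value of attributes i and j ever supports opposing
# classification decisions (nonzero probability of conflict).
def joint_values_conflict(rows, i, j):
    """rows: (feature_vector, label) pairs with binary features and labels.
    Returns True if attributes i and j jointly support opposing decisions."""
    seen = {}
    for features, label in rows:
        key = (features[i], features[j])
        seen.setdefault(key, set()).add(label)
    return any(len(labels) > 1 for labels in seen.values())

# Toy data: three consistent rows, then one that contradicts an earlier pattern
rows = [([0, 0], 0), ([0, 1], 1), ([1, 1], 1)]
print(joint_values_conflict(rows, 0, 1))                  # no conflict
print(joint_values_conflict(rows + [([0, 1], 0)], 0, 1))  # pattern (0,1) conflicts
```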
To address the high-dimensionality issue and improve accuracy in credit risk assessment, a high-dimensionality-trait-driven learning paradigm is proposed for feature extraction and classifier selection. The paradigm consists of three main stages: categorization of high-dimensional data, high-dimensionality-trait-driven feature extraction, and high-dimensionality-trait-driven classifier selection. In the first stage, according to the definition of high dimensionality and the relationship between sample size and feature dimensions, the high-dimensionality traits of credit datasets are categorized into two types: 100 < feature dimensions < sample size, and feature dimensions ≥ sample size. In the second stage, some typical feature extraction methods are tested on the two categories of high dimensionality. In the final stage, four types of classifiers are applied to evaluate credit risk under the different high-dimensionality traits. For illustration and verification, credit classification experiments are performed on two publicly available credit risk datasets. The results show that the proposed paradigm is effective in handling high-dimensional credit classification and improves credit classification accuracy relative to the benchmark models listed in this study.
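The first-stage categorization rule quoted above is simple enough to state directly in code. The labels "type I" / "type II" and the fallback branch are our own; the abstract gives only the two inequalities.

```python
# Sketch of the first-stage categorization: p = feature dimensions,
# n = sample size; the abstract defines type I as 100 < p < n and
# type II as p >= n.
def categorize(n_samples, n_features):
    if n_features >= n_samples:
        return "type II"              # feature dimensions >= sample size
    if 100 < n_features < n_samples:
        return "type I"               # 100 < feature dimensions < sample size
    return "not high-dimensional"     # our own fallback label

print(categorize(n_samples=1000, n_features=500))   # -> type I
print(categorize(n_samples=100, n_features=2000))   # -> type II
```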
In this paper, we study sure independence screening for ultrahigh-dimensional censored data under a varying-coefficient single-index model. This general model framework covers a large number of commonly used survival models. The fact that the proposed method is not derived for a specific model is appealing in ultrahigh-dimensional regression, as it is difficult to specify a correct model for ultrahigh-dimensional predictors; once the assumed data-generating process does not match the actual one, a model-based screening method becomes problematic. We establish the sure screening property and the consistency-in-ranking property of the proposed method. Simulations are conducted to study finite-sample performance, and the results demonstrate that the proposed method is competitive with existing methods. We also illustrate the results via an analysis of data from the National Alzheimer's Coordinating Center (NACC).
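For readers unfamiliar with screening, the classical sure independence screening idea (rank predictors by a marginal utility, keep the top d) can be sketched as below. This plain correlation-based version is not the paper's model-free censored-data method; it only illustrates the screening-and-ranking step that the theoretical properties refer to.

```python
import math

# Classical SIS-style screening: rank predictors by absolute marginal
# correlation with the response and keep the d strongest.
def marginal_corr(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = math.sqrt(sum((a - mx) ** 2 for a in x))
    vy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (vx * vy)

def sis_screen(X, y, d):
    """X: list of predictor columns. Keep indices of the d predictors with
    the largest absolute marginal correlation with y."""
    ranked = sorted(range(len(X)),
                    key=lambda j: abs(marginal_corr(X[j], y)),
                    reverse=True)
    return sorted(ranked[:d])

X = [[1, 2, 3, 4], [4, 3, 2, 1], [1, 1, 2, 2]]
y = [1, 2, 3, 4]
print(sis_screen(X, y, 2))  # -> [0, 1]
```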
The current study proposes a novel feature selection technique that incorporates robustness into the conventional signal-to-noise ratio (SNR). The proposed method replaces the usual measures in the SNR with the robust measure of location, the median, and the robust measures of variation, the median absolute deviation (MAD) and the interquartile range (IQR). In this way, two independent robust signal-to-noise ratios are proposed. The method selects the most informative genes/features by combining the minimum subset of genes obtained via a greedy search with the top-ranked genes selected through the robust signal-to-noise ratio (RSNR). The results obtained via the proposed method are compared with well-known gene/feature selection methods on the basis of classification error rate. A total of 5 gene expression datasets are used in this study. Different subsets of informative genes are selected by the proposed method and the competing methods, and their efficacy in terms of classification is investigated using classifier models such as the support vector machine (SVM), random forest (RF), and k-nearest neighbors (k-NN). The analysis reveals that the proposed method (RSNR) produces lower error rates than the competing feature selection methods in the majority of cases. For further assessment, a detailed simulation study is also conducted.
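A median/MAD variant of the gene-wise SNR can be sketched as follows. The abstract does not give the exact formula, so the Golub-style form |median1 − median2| / (MAD1 + MAD2) used here is an assumption; the IQR variant would substitute the interquartile range in the denominator.

```python
import statistics

# Robust SNR sketch for one gene measured in two classes: medians replace
# means and MADs replace standard deviations, so a single outlier has
# little effect on the score.
def mad(xs):
    m = statistics.median(xs)
    return statistics.median([abs(x - m) for x in xs])

def robust_snr(class1, class2):
    num = abs(statistics.median(class1) - statistics.median(class2))
    return num / (mad(class1) + mad(class2))

# Toy expression values; note the outlier 100 barely moves the score
print(robust_snr([1, 2, 3, 4, 100], [10, 11, 12, 13, 14]))  # -> 4.5
```

Genes would then be ranked by this score and the top-ranked set merged with the greedy-search subset, as the abstract describes.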
To address the high feature dimensionality and class imbalance found in medical data, a multi-stage hybrid feature selection algorithm for imbalanced medical data, HFSIM (hybrid feature selection for imbalanced medical data), is proposed on the basis of SFE (simple, fast and effective high-dimensional feature selection). HFSIM uses an improved adaptive borderline SMOTE oversampling technique to generate new minority-class instances satisfying the borderline conditions, thereby addressing class imbalance in medical data. At the same time, to remedy insufficient diversity in the search space, the unselected-operator rate parameter UR (unselected rate) of the SFE algorithm is optimized, effectively avoiding premature convergence and entrapment in local optima. The filter-style Fisher Score method is combined with the UR-optimized algorithm, so that good search capability is attained at low computational cost. Experiments show that, compared with SFE, HFSIM reaches 99.67% accuracy on the Ovarian dataset, an improvement of 2.11 percentage points, while G-means and F1 improve by 5.13 and 2.30 percentage points, respectively. Comparisons of feature counts and running times further show that HFSIM preserves accuracy while effectively reducing computational cost.
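The Fisher Score filter mentioned above ranks a feature by between-class scatter over within-class scatter. A minimal version of that standard score is sketched below; it is a generic illustration, not the HFSIM pipeline (the SMOTE and UR-optimization stages are omitted).

```python
# Fisher Score for a single feature: sum over classes of
# n_c * (class mean - overall mean)^2, divided by the pooled
# within-class variance sum; larger is more discriminative.
def fisher_score(values, labels):
    classes = sorted(set(labels))
    overall = sum(values) / len(values)
    num = den = 0.0
    for c in classes:
        xs = [v for v, l in zip(values, labels) if l == c]
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs)
        num += len(xs) * (mu - overall) ** 2
        den += len(xs) * var
    return num / den if den else float("inf")

# Perfectly separating feature vs. uninformative feature
print(fisher_score([0, 0, 1, 1], [0, 0, 1, 1]))  # -> inf
print(fisher_score([0, 1, 0, 1], [0, 0, 1, 1]))  # -> 0.0
```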
Classification of gene expression data is a pivotal research area that plays a substantial role in the diagnosis and prediction of diseases. Feature selection is one of the most extensively used techniques in data mining, especially in classification. Gene expression data are usually composed of dozens of samples characterized by thousands of genes; this high dimensionality, coupled with irrelevant and redundant features, makes the selection of informative genes difficult and degrades gene classification accuracy. In this paper, we consider feature selection for classifying gene expression microarray datasets. The goal is to detect the genes most plausibly related to cancer in a distributed manner, which helps in effectively classifying the samples. Initially, the huge set of candidate features is subdivided and distributed among several processors. Then, a new filter selection method based on a fuzzy inference system is applied to each subset of the dataset. Finally, all the resulting features are ranked and a wrapper-based selection method is applied. Experimental results show that the proposed feature selection technique performs better than other techniques, producing lower latency and improved classification performance.
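The distributed pipeline above (partition the features, filter each subset, merge into a global ranking) can be outlined as follows. The variance score stands in for the paper's fuzzy-inference filter, which is not specified in the abstract, and all names are our own; in practice each partition would run on a separate processor.

```python
# Sketch of the distributed filter stage: split feature indices across
# workers, score each subset independently, then merge and rank globally.
def partition(indices, n_parts):
    """Round-robin split of feature indices across n_parts workers."""
    return [indices[i::n_parts] for i in range(n_parts)]

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def filter_subset(X, subset, score):
    """Score the features of one partition (one worker's job)."""
    return {j: score(X[j]) for j in subset}

def distributed_rank(X, n_parts, score):
    indices = list(range(len(X)))
    scores = {}
    for subset in partition(indices, n_parts):  # sequential here; each
        scores.update(filter_subset(X, subset, score))  # part is independent
    return sorted(indices, key=lambda j: scores[j], reverse=True)

X = [[1, 1, 1], [0, 5, 10], [2, 3, 4]]  # three genes, three samples
print(distributed_rank(X, n_parts=2, score=variance))  # -> [1, 2, 0]
```

The ranked list would then feed the wrapper-based stage the abstract describes.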
Funding (credit risk assessment study): This work is partially supported by grants from the Key Program of the National Natural Science Foundation of China (NSFC Nos. 71631005 and 71731009) and the Major Program of the National Social Science Foundation of China (No. 19ZDA103).
Funding (ultrahigh-dimensional screening study): Supported by the National Natural Science Foundation of China (No. 11801567).
Funding (robust SNR study): King Saud University, through Researchers Supporting Project Number (RSP2022R426), King Saud University, Riyadh, Saudi Arabia.