Abstract: In ultra-high-dimensional data, the response variable is often multi-class. This paper therefore proposes a model-free feature screening method for multi-class responses that uses the Jensen-Shannon divergence to measure the importance of covariates. The method calculates the Jensen-Shannon divergence between the conditional probability distribution of a covariate given each response class and the unconditional probability distribution of that covariate, and then uses the probabilities of the response classes as weights to form a weighted Jensen-Shannon divergence; a larger weighted divergence indicates a more important covariate. We also investigate an adapted version for the case where the number of categories varies across covariates, in which the weighted Jensen-Shannon divergence is adjusted by the logarithmic factor of the number of categories. Theoretical analysis and simulation experiments demonstrate that the proposed methods have the sure screening and ranking consistency properties. Finally, simulation and real-data experiments show that, in feature screening, the proposed methods are robust in performance and computationally faster than an existing method.
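The weighted Jensen-Shannon score described in this abstract can be sketched as follows. This is a minimal illustration for one categorical covariate and a categorical response, not the authors' implementation: the function names (`kl`, `js`, `weighted_js`) are hypothetical, empirical frequencies stand in for the probabilities, and the "adjusted" variant is assumed here to divide the score by the logarithm of the number of covariate categories.

```python
from collections import Counter
from math import log


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for two aligned probability lists."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def weighted_js(x, y, adjust=False):
    """Weighted JS divergence between P(X | Y = r) and P(X), weighted by P(Y = r).

    x: categorical covariate values, y: class labels (same length).
    adjust=True divides by log(number of categories of x); this exact form
    of the adjustment is an assumption of this sketch.
    """
    n = len(x)
    x_counts = Counter(x)
    levels = list(x_counts)
    marginal = [x_counts[level] / n for level in levels]
    score = 0.0
    for cls, n_cls in Counter(y).items():
        cond_counts = Counter(xi for xi, yi in zip(x, y) if yi == cls)
        conditional = [cond_counts[level] / n_cls for level in levels]
        score += (n_cls / n) * js(conditional, marginal)
    if adjust:
        return score / log(len(levels)) if len(levels) > 1 else 0.0
    return score
```

In a screening setting, one would compute this score for every covariate and retain those with the largest values; a covariate that is empirically independent of the response scores zero.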
Abstract: Data commonly contain both categorical and continuous covariates, yet most feature screening methods for ultrahigh-dimensional classification assume that the covariates are continuous, and applicable feature screening methods are very limited. To handle this non-trivial situation, we propose a model-free feature screening procedure for ultrahigh-dimensional multi-classification with both categorical and continuous covariates. The proposed method uses Gini impurity to evaluate the prediction power of covariates. Under certain regularity conditions, we prove that the proposed screening procedure possesses the sure screening and ranking consistency properties. We demonstrate the finite-sample performance of the procedure through simulation studies and illustrate it with a real data analysis.
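A Gini-impurity screening score of the kind described here can be sketched as the reduction in the impurity of the response after grouping by a covariate. This is an illustrative sketch only: the names `gini`, `gini_reduction`, and `slice_continuous` are hypothetical, and handling continuous covariates by slicing them into equal-frequency bins is an assumption of this example rather than a detail stated in the abstract.

```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum_r p_r^2 of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_reduction(x, y):
    """Impurity of y minus the size-weighted impurity of y within each level of x."""
    n = len(y)
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(xi, []).append(yi)
    within = sum(len(group) / n * gini(group) for group in groups.values())
    return gini(y) - within


def slice_continuous(x, n_slices=4):
    """Map a continuous covariate to equal-frequency bin indices (assumed step)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    bins = [0] * len(x)
    for rank, i in enumerate(order):
        bins[i] = rank * n_slices // len(x)
    return bins
```

A categorical covariate is passed to `gini_reduction` directly, while a continuous one is first passed through `slice_continuous`, so both types are ranked on the same scale.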
Abstract: It is common for datasets to contain both categorical and continuous variables. However, many feature screening methods designed for high-dimensional classification assume that the variables are continuous, which limits their applicability in this mixed setting. To address this issue, we propose a model-free feature screening approach for ultra-high-dimensional multi-classification that can handle both categorical and continuous variables. The proposed method uses the Maximal Information Coefficient to assess the predictive power of the variables. Under certain regularity conditions, we prove that the screening procedure possesses the sure screening and ranking consistency properties. We conduct simulation studies and real data analyses to demonstrate its finite-sample performance. In summary, the proposed method offers a solution for effectively screening features in ultra-high-dimensional datasets with a mixture of categorical and continuous covariates.
Abstract: Current high-dimensional feature screening methods still face significant challenges in handling mixed linear and nonlinear relationships, controlling redundant information, and improving model robustness. In this study, we propose a Dynamic Conditional Feature Screening (DCFS) method tailored for high-dimensional economic forecasting tasks. Our goal is to accurately identify key variables, enhance predictive performance, and provide both theoretical foundations and practical tools for macroeconomic modeling. The DCFS method constructs a comprehensive test statistic by integrating conditional mutual information with conditional regression error differences. By introducing a dynamic weighting mechanism, DCFS adaptively balances the linear and nonlinear contributions of features during the screening process. In addition, a dynamic thresholding mechanism is designed to effectively control the false discovery rate (FDR), thereby improving the stability and reliability of the screening results. On the theoretical front, we rigorously prove that the proposed method satisfies the sure screening property and rank consistency, ensuring accurate identification of the truly important feature set in high-dimensional settings. Simulation results demonstrate that under purely linear, purely nonlinear, and mixed dependency structures, DCFS consistently outperforms classical screening methods such as SIS, CSIS, and IG-SIS in terms of true positive rate (TPR), false discovery rate (FDR), and rank correlation. These results highlight the superior accuracy, robustness, and stability of our method. Furthermore, an empirical analysis based on the U.S. FRED-MD macroeconomic dataset confirms the practical value of DCFS in real-world forecasting tasks. The experimental results show that DCFS achieves lower prediction errors (RMSE and MAE) and higher R² values in forecasting GDP growth. The selected key variables, including the Industrial Production Index (IP), Federal Funds Rate, Consumer Price Index (CPI), and Money Supply (M2), possess clear economic interpretability, offering reliable support for economic forecasting and policy formulation.
Funding: Supported by the Research Grant Council of Hong Kong (Grant Nos. 509413 and 14311916), Direct Grants for Research of The Chinese University of Hong Kong (Grant Nos. 3132754 and 4053235), the Natural Science Foundation of Jiangxi Province (Grant No. 20161BAB201024), the Key Science Fund Project of Jiangxi Province Education Department (Grant No. GJJ150439), the National Natural Science Foundation of China (Grant Nos. 11461029, 11601197 and 61562030), and the Canadian Institutes of Health Research (Grant No. 145546).
Abstract: With the rapid growth in the size of scientific data in various disciplines, feature screening plays an important role in reducing high dimensionality to a moderate scale in many scientific fields. In this paper, we introduce a unified and robust model-free feature screening approach for high-dimensional survival data with censoring, which has several advantages: it is a model-free approach under a general model framework, and hence avoids the complication of specifying an actual model form with a huge number of candidate variables; and under mild conditions, without requiring the existence of any moment of the response, it enjoys the ranking consistency and sure screening properties in ultra-high dimension. In particular, we impose a conditional independence assumption between the response and the censoring variable given each covariate, instead of assuming that the censoring variable is independent of the response and the covariates. Moreover, we propose a more robust variant of the new procedure, which possesses desirable theoretical properties without any finite moment condition on the predictors or the response. The newly proposed methods do not require any complicated numerical optimization and are fast and easy to implement. Extensive numerical studies demonstrate that the proposed methods perform competitively in various configurations. The application is illustrated with an analysis of a genetic data set.
Funding: Supported by the National Natural Science Foundation of China (Grant No. 11701034) and the National Science Foundation of USA (Grant No. DMS1820702).
Abstract: In this paper, we propose a new correlation, called stable correlation, to measure the dependence between two random vectors. The new correlation is well defined without the moment condition and is zero if and only if the two random vectors are independent. We also study its other theoretical properties. Based on the new correlation, we further propose a robust model-free feature screening procedure for ultrahigh-dimensional data and establish its sure screening and rank consistency properties without imposing the subexponential or sub-Gaussian tail condition that is commonly required in the feature screening literature. We also examine the finite-sample performance of the proposed robust feature screening procedure via Monte Carlo simulation studies and illustrate the procedure with a real data example.
Funding: Supported by the National Natural Science Foundation of China under Grant No. 11901006 and the Natural Science Foundation of Anhui Province under Grant Nos. 1908085QA06 and 1908085MA20.
Abstract: This paper proposes a new sure independence screening procedure for high-dimensional survival data based on censored quantile correlation (CQC). This framework has two distinctive features: 1) by incorporating a weighting scheme, our metric is a natural extension of the quantile correlation (QC) considered by Li (2015) to high-dimensional survival data; 2) the proposed method is not only robust against outliers, but can also discover nonlinear relationships between the independent variables and the censored dependent variable. Additionally, the proposed method enjoys the sure screening property under certain technical conditions. Simulation results demonstrate that the proposed method performs competitively on survival datasets with high-dimensional predictors.
Funding: This work was supported by the Equipment Pre-Research Foundation of China (6140001020310).
Abstract: Three-dimensional (3D) reconstruction based on aerial images has broad prospects, and feature matching is an important step in it. However, for high-resolution aerial images, traditional algorithms usually suffer from long running times, mismatches, and sparse feature pairs. Therefore, an algorithm is proposed to realize fast, accurate, and dense feature matching. The algorithm consists of four steps. First, we achieve a balance between the feature matching time and the number of matching pairs by appropriately reducing the image resolution. Second, to further screen out mismatches, a feature screening algorithm based on similarity judgment or local optimization is proposed. Third, to make the algorithm more widely applicable, we combine the results of different algorithms to obtain dense results. Finally, all matching feature pairs in the low-resolution images are restored to the original images. Comparisons between the original algorithms and our algorithm show that the proposed algorithm can effectively reduce the matching time, screen out mismatches, and increase the number of matches.
Funding: Funded by the German Federal Ministry of Food and Agriculture (BMEL) through the Federal Agency of Food and Agriculture (BLE) with grant number "28DE207A21". SMC's input was funded by the UK NERC/BBSRC AgZero+ Project NE/W005050/1.
Abstract: To reduce damage caused by insect pests, farmers use insecticides to protect produce from crop pests. This practice leads to high synthetic chemical usage because a large portion of the applied insecticide does not reach its intended target; instead, it may affect non-target organisms and pollute the environment. One approach to mitigating this is the selective application of insecticides to only those crop plants (or patches of plants) where the insect pests are located, avoiding non-targets and beneficials. The first step towards this is the identification of insects on plants and discrimination between pests and beneficial non-targets. However, detecting small individual insects is challenging using image-based machine learning techniques, especially in natural field settings. This paper proposes a method based on explainable artificial intelligence feature selection and machine learning to detect pests and beneficial insects in field crops. An insect-plant dataset reflecting real field conditions was created. It comprises two pest insects, the Colorado potato beetle (CPB, Leptinotarsa decemlineata) and the green peach aphid (Myzus persicae), and the beneficial seven-spot ladybird (Coccinella septempunctata). The specialist herbivore CPB was imaged only on potato plants (Solanum tuberosum), while green peach aphids and seven-spot ladybirds were imaged on three crops: potato, faba bean (Vicia faba), and sugar beet (Beta vulgaris subsp. vulgaris). This increased dataset diversity, broadening the potential application of the developed method for discriminating between pests and beneficial insects in several crops. The insects were imaged in both laboratory and outdoor settings. Using the GrabCut algorithm, regions of interest in the image were identified before shape, texture, and colour features were extracted from the segmented regions. The concept of explainable artificial intelligence was adopted by incorporating permutation feature importance ranking and Shapley Additive explanations values to identify the feature set that optimized a model's performance while reducing computational complexity. The proposed explainable artificial intelligence feature selection method was compared to conventional feature selection techniques, including mutual information, chi-square coefficient, maximal information coefficient, Fisher separation criterion, and variance thresholding. Results showed improved accuracy (92.62% Random forest, 90.16% Support vector machine, 83.61% K-nearest neighbours, and 81.97% Naïve Bayes) and a reduction in the number of model parameters and memory usage (7.22×10^7 Random forest, 6.23×10^3 Support vector machine, 3.64×10^4 K-nearest neighbours, and 1.88×10^2 Naïve Bayes) compared to using all features. Prediction and training times were also reduced by approximately half compared to conventional feature selection techniques. This demonstrates that a simple machine learning algorithm combined with an ideal feature selection methodology can achieve robust performance comparable to other methods. With feature selection, model performance can be maximized and hardware requirements reduced, both of which are essential for real-world applications with resource constraints. This research offers a reliable approach towards automatic detection and discrimination of pest and beneficial insects, which will facilitate the development of alternative pest control approaches and other targeted pest removal methods that are less harmful to the environment than the broad-scale application of synthetic insecticides.
Abstract: Feature screening with missing data is a critical problem but has not been well addressed in the literature. In this discussion we propose a new screening index based on "information value" and apply it to feature screening with missing covariates.
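For orientation, the classical information value of a categorical covariate against a binary response is IV = Σ_k (p_k − q_k) ln(p_k / q_k), where p_k and q_k are the shares of the two response classes falling in level k. The sketch below computes that classical quantity and is not the new index proposed in the discussion. Two choices are assumptions of this example: missing values (`None`) are treated as a level of their own, and a smoothing constant of 0.5 is added to each count to avoid division by zero. The function name `information_value` is hypothetical.

```python
from collections import Counter
from math import log


def information_value(x, y, smooth=0.5):
    """Classical information value of covariate x for a binary response y (0/1).

    Missing entries of x are expected as None and are kept as a separate
    level (an assumption of this sketch). `smooth` is added to every cell
    count so that empty cells do not produce log(0).
    """
    levels = set(x)
    pos = Counter(xi for xi, yi in zip(x, y) if yi == 1)
    neg = Counter(xi for xi, yi in zip(x, y) if yi == 0)
    n_pos = sum(pos.values()) + smooth * len(levels)
    n_neg = sum(neg.values()) + smooth * len(levels)
    iv = 0.0
    for level in levels:
        p = (pos[level] + smooth) / n_pos  # smoothed share of class 1 in this level
        q = (neg[level] + smooth) / n_neg  # smoothed share of class 0 in this level
        iv += (p - q) * log(p / q)
    return iv
```

Each term (p − q) ln(p / q) is non-negative, so the score is zero only when the two classes are distributed identically across the levels, including the missing level.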
Funding: This work is supported by the National Natural Science Foundation of China (No. 11971171), the 111 Project (B14019), and the Project of National Social Science Fund of China (15BTJ027). Weidong Liu's research is supported by the National Program on Key Basic Research Project (973 Program, 2018AAA0100704), the National Natural Science Foundation of China (No. 11825104, 11690013), the Youth Talent Support Program, and a grant from the Australian Research Council. Hansheng Wang's research is partially supported by the National Natural Science Foundation of China (No. 11831008, 11525101, 71532001). It is also supported in part by China's National Key Research Special Program (No. 2016YFC0207704).
Abstract: The rapid emergence of massive datasets in various fields poses a serious challenge to traditional statistical methods. Meanwhile, it provides opportunities for researchers to develop novel algorithms. Inspired by the idea of divide-and-conquer, various distributed frameworks for statistical estimation and inference have been proposed to deal with large-scale statistical optimization problems. This paper aims to provide a comprehensive review of the related literature. It covers parametric models, nonparametric models, and other frequently used models, and summarizes their key ideas and theoretical properties. The trade-off between communication cost and estimation precision is discussed, together with other concerns.
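The divide-and-conquer idea mentioned in this abstract can be illustrated with the simplest possible case, a one-shot averaging estimator of a mean. This toy sketch is not drawn from the review itself: the helper names `split` and `one_shot_average` are hypothetical, and each simulated "machine" sends back only its sample size and local estimate, which is what keeps the communication cost low.

```python
from statistics import fmean


def split(data, k):
    """Partition data into at most k contiguous chunks of near-equal size."""
    size = -(-len(data) // k)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def one_shot_average(data, k):
    """Estimate the mean by combining local means computed on k chunks.

    Each chunk reports (sample size, local mean); the combined estimate
    is the size-weighted average of the local means.
    """
    local = [(len(chunk), fmean(chunk)) for chunk in split(data, k)]
    total = sum(n for n, _ in local)
    return sum(n * m for n, m in local) / total
```

For the mean, the size-weighted combination reproduces the full-sample estimate exactly; for nonlinear estimators, simple averaging generally only approximates it, which is where the trade-off between communication cost and precision arises.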