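Abstract: The knockoff method is a recent approach to variable selection in high-dimensional data that controls the false discovery rate (FDR). Although knockoff methods have developed rapidly in regression analysis, they have received much less attention in classification settings. To address variable selection for binary classification models, this study proposes a knockoff method based on the AUC (area under the curve), called AUC knockoff. We introduce the AUC measure into the knockoff framework and propose a new AUC statistic that assesses each variable's contribution to the response; a threshold computed from these statistics is then used to select the important variables. We prove that AUC knockoff controls the FDR under reasonable conditions and that its power approaches 1 as the sample size tends to infinity. Beyond these theoretical guarantees, the method also shows superior performance in simulation experiments and on real data.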
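The thresholding step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not define the AUC statistic or the exact threshold, so two things here are assumptions: the input `W` is a hypothetical array of per-variable statistics (large positive values favor the original variable over its knockoff), and the rule shown is the standard knockoff+ threshold, which may differ from the one the paper uses. The function names are illustrative only.

```python
import numpy as np

def knockoff_plus_threshold(W, q):
    """Smallest t among the nonzero |W_j| such that
    (1 + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= q.
    Returns np.inf when no candidate satisfies the bound."""
    W = np.asarray(W, dtype=float)
    # Candidate thresholds: distinct nonzero magnitudes, in increasing order.
    candidates = np.unique(np.abs(W[W != 0]))
    for t in candidates:
        # Negative statistics act as a proxy count of false positives.
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return t
    return np.inf

def knockoff_select(W, q):
    """Indices of variables whose statistic reaches the threshold."""
    W = np.asarray(W, dtype=float)
    t = knockoff_plus_threshold(W, q)
    return np.flatnonzero(W >= t)

if __name__ == "__main__":
    # Hypothetical statistics; in the paper these would come from the AUC statistic.
    W = [5, 4, 3.5, 3, 2.5, 2, -1, 0.5, -0.2, 1.5]
    print(knockoff_plus_threshold(W, q=0.25))
    print(knockoff_select(W, q=0.25))
```

The `1 +` in the numerator is what makes the estimate conservative: with few selections, a single negative statistic is enough to block a low threshold.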
Funding: Supported by the National Key R&D Program of China (No. 2018YFB0704304), the National Natural Science Foundation of China (Nos. 32070668, 62002231, 61832003, 61433014), and the K. C. Wong Education Foundation.
Abstract: Traditional approaches to false discovery rate (FDR) control in multiple hypothesis testing are usually based on the null distribution of a test statistic. However, all types of null distributions, including theoretical, permutation-based, and empirical ones, have inherent drawbacks. For example, the theoretical null might fail because of improper assumptions about the sample distribution. Here, we propose a null-distribution-free approach to FDR control for multiple hypothesis testing in case-control studies. This approach, named the target-decoy procedure, builds only on the ordering of tests by some statistic or score, whose null distribution need not be known. Competitive decoy tests are constructed from permutations of the original samples and are used to estimate the number of false target discoveries. We prove that this approach controls the FDR when the score function is symmetric and the scores are independent between different tests. Simulations demonstrate that it is more stable and powerful than two popular traditional approaches, even in the presence of dependency. The approach is also evaluated on two real datasets: an Arabidopsis genomics dataset and a COVID-19 proteomics dataset.
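The idea of competing each test against a permutation-based decoy can be sketched as follows. This is a generic target-decoy competition under stated assumptions, not the authors' exact procedure: the score (absolute difference in group means), the use of a single label permutation shared by all tests, the tie-breaking rule, and the `+1` correction in the FDR estimate are all illustrative choices, and the function names are hypothetical.

```python
import numpy as np

def mean_diff_score(x, labels):
    """Assumed symmetric score: |mean(cases) - mean(controls)|."""
    return abs(x[labels == 1].mean() - x[labels == 0].mean())

def target_decoy_select(X, labels, q, rng):
    """Select columns of X (one test per column) at estimated FDR <= q.

    X: (n_samples, n_tests) array; labels: 0/1 case-control vector."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    m = X.shape[1]
    # Decoy tests reuse the same data with permuted case-control labels.
    permuted = rng.permutation(labels)
    target = np.array([mean_diff_score(X[:, j], labels) for j in range(m)])
    decoy = np.array([mean_diff_score(X[:, j], permuted) for j in range(m)])
    # Competition: each test keeps its better score; ties count as decoy wins.
    is_target = target > decoy
    best = np.maximum(target, decoy)
    order = np.argsort(-best)  # rank tests by winning score, best first
    n_decoy = np.cumsum(~is_target[order])
    n_target = np.cumsum(is_target[order])
    # Decoy wins above a cutoff estimate the false target wins above it.
    fdr_hat = (n_decoy + 1) / np.maximum(n_target, 1)
    ok = np.flatnonzero(fdr_hat <= q)
    if ok.size == 0:
        return np.array([], dtype=int)
    top = order[: ok[-1] + 1]
    return np.sort(top[is_target[top]])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = np.repeat([0, 1], 20)
    X = rng.normal(size=(40, 50))
    X[labels == 1, :10] += 6.0  # simulated signal in the first 10 tests
    print(target_decoy_select(X, labels, q=0.2, rng=rng))
```

Only the ranking of scores enters the estimate, which is why no null distribution of the score is needed in this kind of procedure.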