Abstract
In linear regression modeling, variable selection is a problem of strong theoretical and practical interest that has attracted wide attention and a large literature. The consistency of the selected subset is a central issue: a selection method is desirable if the subset it selects is consistent as the sample size tends to infinity and its prediction mean square error is small. The BIC criterion can select a consistent subset, but its computational cost becomes prohibitive when the number of variables is large; the adaptive lasso is computationally more efficient while still yielding a consistent subset. This paper proposes a simpler selection method that requires only two ordinary least squares regressions: the first pass regresses on the full variable set to obtain coefficient estimates, these estimates are used to select a subset, and a second ordinary regression on the selected subset gives the final result. Consider the regression model Y_n = X_n β* + ε^(n), where J_O denotes the index set of the nonzero components of the coefficient vector β*, J_n is the subset selected by our method, and β^(n) is the coefficient estimate of our method (coefficients of unselected variables are set to zero). We prove that, under suitable conditions, the selected subset is consistent and (β^(n) − β*)_{J_O}, the subvector of β^(n) − β* with indices in J_O, satisfies an asymptotic result involving the error variance σ² and a matrix Σ and constant c determined by the limit of X_n^T X_n / n. Simulation results show that the method has good small- and moderate-sample properties.
Regression variable subset selection is one of the most important aspects of linear model theory. If the selected subset is consistent as the sample size tends to infinity, and the prediction mean square error is small, then the selection method is preferred. The BIC criterion can give a consistent subset, but as the number of variables gets large it involves too much computation. The adaptive lasso has better computational efficiency while retaining consistency. In this paper we propose a new approach to multiple linear regression variable selection which is much simpler than other variable selection methods while still giving a consistent subset. The new method computes only two passes of ordinary least squares regression: the first pass fits a full-model regression and selects a variable subset based on the resulting coefficient estimates; the second pass regresses on the selected variables.
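The two-pass procedure described above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's exact specification: the hard-thresholding rule on the first-pass coefficients and the threshold value used below are assumptions chosen for the demonstration, since the abstract does not state the selection rule.

```python
import numpy as np

def two_pass_ols_select(X, y, threshold):
    # First pass: OLS on the full variable set
    beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Select indices whose estimated coefficient magnitude exceeds
    # the threshold (an assumed selection rule for illustration)
    selected = np.flatnonzero(np.abs(beta_full) > threshold)
    # Second pass: OLS restricted to the selected columns;
    # coefficients of unselected variables stay at zero
    beta = np.zeros(X.shape[1])
    if selected.size:
        beta[selected], *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
    return selected, beta

# Simulated check: sparse true coefficients with support {0, 2, 5}
rng = np.random.default_rng(0)
n, p = 200, 8
X = rng.standard_normal((n, p))
beta_star = np.array([3.0, 0, 1.5, 0, 0, 2.0, 0, 0])
y = X @ beta_star + 0.1 * rng.standard_normal(n)
selected, beta_hat = two_pass_ols_select(X, y, threshold=0.5)
```

With low noise and n much larger than p, the first-pass estimates concentrate near β*, so thresholding recovers the true support and the second pass refits only those columns.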
Source
《应用概率统计》
CSCD
Peking University Core Journal
2015, No. 1, pp. 71-88 (18 pages)
Chinese Journal of Applied Probability and Statistics
Funding
Supported by the Ministry of Education-Microsoft Key Laboratory of Statistics and Information Technology, Peking University