Screening biomolecular markers from high-dimensional biological data is a long-standing task in biomedical translational research. Because it offers both feature shrinkage and biological interpretability, the Least Absolute Shrinkage and Selection Operator (LASSO) is one of the most popular methods for clinical biomarker development. In practice, however, applying LASSO to omics data with high dimensionality and small sample size often yields an excessive number of predictive variables, leading to model overfitting. Here, we present VSOLassoBag, a wrapped LASSO approach that integrates an ensemble learning strategy to help select efficient and stable variables with high confidence from omics data. Using a bagging strategy in combination with a parametric method or an inflection-point search method, VSOLassoBag integrates and votes on the variables generated by multiple LASSO models to determine the optimal candidates. Applications of VSOLassoBag to both simulated and real-world datasets show that the algorithm can effectively identify markers for either case-control binary classification or prognosis prediction. In addition, compared with multiple existing algorithms, VSOLassoBag shows comparable performance under different scenarios while yielding fewer features. In summary, VSOLassoBag, which is available at https://seqworld.com/VSOLassoBag/ under the GPL v3 license, provides an alternative strategy for selecting reliable biomarkers from high-dimensional omics data. For users' convenience, VSOLassoBag is implemented as an R package that provides multithreading computing configurations.
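The bagging-and-voting idea in this abstract can be illustrated with a minimal Python sketch. This is not the VSOLassoBag R package or its API: the function names (`lasso_cd`, `selection_frequency`), the synthetic data, the penalty `lam=0.5`, and the fixed 0.8 vote cutoff are all assumptions made for illustration. In particular, the fixed cutoff stands in for the parametric and inflection-point cutoff methods named in the abstract, which are not reproduced here.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the LASSO proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_cd(X, y, lam, n_iter=50):
    """Minimise (1/2n)*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    col_ms = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_ms[j] == 0.0:
                continue
            # Covariance of column j with the residual that excludes feature j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_wj = soft_threshold(rho, lam) / col_ms[j]
            delta = new_wj - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_wj
    return w


def selection_frequency(X, y, lam, n_bags=30, seed=0):
    """Fit one LASSO per bootstrap resample and count how often each feature is kept."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_bags):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        w = lasso_cd([X[i] for i in idx], [y[i] for i in idx], lam)
        for j in range(p):
            if w[j] != 0.0:
                counts[j] += 1
    return [c / n_bags for c in counts]


# Synthetic data (assumption): only features 0 and 1 drive the response.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(6)] for _ in range(50)]
y = [3.0 * row[0] - 2.0 * row[1] + data_rng.gauss(0, 0.5) for row in X]

freq = selection_frequency(X, y, lam=0.5)
# Assumed fixed vote cutoff; a stand-in for the cutoff methods in the abstract.
selected = [j for j, f in enumerate(freq) if f >= 0.8]
print(freq)
print(selected)
```

Voting over resamples is what lends stability: a feature that enters the model only in a few bootstrap fits receives a low frequency and is dropped, even if a single LASSO fit on the full data would have kept it.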
In deriving a regression model, analysts often have to use variable selection, despite the problems introduced by data-dependent model building. Resampling approaches have been proposed to handle some of the critical issues. To assess and compare several strategies, we conduct a simulation study with 15 predictors and a complex correlation structure in the linear regression model. Using sample sizes of 100 and 400 and estimates of the residual variance corresponding to R² of 0.50 and 0.71, we consider four scenarios with varying amounts of information. We also consider two examples with 24 and 13 predictors, respectively. We discuss the value of cross-validation, shrinkage, and backward elimination (BE) with varying significance levels. We assess whether two-step approaches using global or parameterwise shrinkage factors (PWSF) can improve the selected models, and we compare the results with models derived by the LASSO procedure. Besides MSE, we use model sparsity and further criteria for model assessment. The amount of information in the data influences both the selected models and the comparison of the procedures. None of the approaches was best in all scenarios. Backward elimination with a suitably chosen significance level performed no worse than the LASSO, and the models selected by BE were much sparser, an important advantage for interpretation and transportability. PWSF performed better than global shrinkage. Provided that the amount of information is not too small, we conclude that BE followed by PWSF is a suitable approach when variable selection is a key part of data analysis.
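To address inaccurate topology identification caused by multicollinearity among node voltages in distribution networks, this paper proposes identifying the topology from node voltage data over multiple time sections: nodes with high similarity are located, and the least absolute shrinkage and selection operator (Lasso) algorithm is then used to screen neighbor nodes. First, Pearson correlation coefficients between nodes are analyzed; many nodes turn out to be highly correlated with non-adjacent nodes, and an approximate linear relationship between node voltages is derived. Next, the Pearson coefficient, Euclidean distance, and dynamic time warping (DTW) are used as correlation metrics in a first identification pass that finds the nodes exhibiting multicollinearity. Taking the main power-source node as the parent node, Lasso regression is used to determine its child nodes; each child node then becomes a new parent node, and repeating this loop forms the second identification pass, which generates the topology. Finally, the feasibility and accuracy of the method are verified on the IEEE 33-node test case.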
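The three similarity metrics named in this abstract (Pearson coefficient, Euclidean distance, DTW) can be sketched in plain Python as follows. The voltage values, the node names, and the 0.95 screening threshold are invented for illustration and are not taken from the paper; the Lasso-based parent/child pass is not shown.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(a, b):
    """Dynamic time warping distance with absolute difference as local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical per-unit voltages at three nodes over five time sections.
voltages = {
    "node1": [1.000, 0.990, 0.980, 0.990, 1.000],
    "node2": [0.980, 0.960, 0.940, 0.960, 0.980],
    "node3": [0.970, 0.975, 0.950, 0.940, 0.965],
}

# First-pass screening: keep node pairs whose |r| reaches an assumed threshold.
THRESHOLD = 0.95
candidate_pairs = [
    (u, v)
    for u, v in combinations(sorted(voltages), 2)
    if abs(pearson(voltages[u], voltages[v])) >= THRESHOLD
]
print(candidate_pairs)
```

A high correlation on its own does not imply adjacency, which is the multicollinearity problem the abstract describes; the screened pairs are only candidates for the subsequent Lasso step.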
Index tracking is a passive portfolio management strategy that replicates the performance of a real or virtual index. However, full replication, which holds all the constituent assets of the index, often suffers from small and illiquid positions and large transaction costs, so sparse portfolios are preferred. In addition, the existing literature has pointed out co-movement in asset returns, indicating that index tracking problems may contain group structure together with sparsity. Taking both grouping effects and sparsity into account, this paper proposes a grouped sparse index tracking model with nonnegativity constraints. We derive a modified coordinate descent algorithm for solving the model and discuss its asymptotic properties in detail. To demonstrate the efficiency of the model, we apply it to a constrained index tracking problem in the Shanghai stock market, namely tracking the SSE 50 Index. When about 10 stocks are selected, the results show that the nonnegative group lasso outperforms the nonnegative lasso in asset allocation.
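As a rough illustration of coordinate descent under a nonnegativity constraint, the sketch below solves the plain nonnegative lasso, i.e. the baseline this abstract compares against, not the paper's grouped model or its modified algorithm. The function name `nonnegative_lasso`, the synthetic returns, and the penalty value are assumptions.

```python
import random


def nonnegative_lasso(X, y, lam, n_iter=100):
    """Minimise (1/2n)*||y - Xw||^2 + lam*sum(w) subject to w >= 0."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    col_ms = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_ms[j] == 0.0:
                continue
            # Covariance of asset j with the tracking error that excludes asset j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            # One-sided soft threshold: negative solutions are clipped to zero.
            new_wj = max(0.0, rho - lam) / col_ms[j]
            delta = new_wj - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_wj
    return w


# Synthetic returns in percent (assumption): the index is driven by stocks 0 and 1.
rng = random.Random(7)
stock_returns = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(100)]
index_returns = [0.6 * r[0] + 0.4 * r[1] + rng.gauss(0, 0.1) for r in stock_returns]

weights = nonnegative_lasso(stock_returns, index_returns, lam=0.05)
print(weights)
```

The only change from the ordinary lasso update is the one-sided threshold `max(0, rho - lam)`, which rules out short positions and sets weakly related stocks to exactly zero, giving a sparse long-only tracking portfolio.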
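To address the curse of dimensionality in software complexity metric attributes for early software reliability prediction, a feature selection method for software complexity metrics is proposed, based on the Least Absolute Shrinkage and Selection Operator (LASSO) and the Least Angle Regression (LARS) algorithm. The method filters out complexity metric attributes that have little influence on the early prediction result and yields the key attribute subset most closely related to early prediction. First, the characteristics of LASSO regression and its application to feature selection are analyzed. The LARS algorithm is then modified so that it can solve the problem posed by the LASSO method, producing the relevant subset of complexity metric attributes. Finally, early software reliability prediction is performed with a Learning Vector Quantization (LVQ) neural network, with experiments based on ten-fold cross-validation. Comparison with traditional feature selection methods shows that the proposed method can significantly improve the accuracy of early software reliability prediction.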
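For reference, the LASSO problem and the textbook modification that lets LARS trace its solution path can be written as below. This is the standard formulation, given here as an assumption about what "modifying LARS" refers to; the paper's exact variant may differ. The symbols $X$, $y$, $\beta$, $\lambda$, the active set $\mathcal{A}$, the direction $d$, and the step lengths are notation introduced for this note.

```latex
% LASSO objective: squared loss plus an L1 penalty on the coefficients
\hat{\beta}(\lambda)
  = \arg\min_{\beta}\; \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2
    + \lambda\,\lVert \beta \rVert_1

% LARS moves the active coefficients along a direction d with step \gamma:
%   \beta_j(\gamma) = \beta_j + \gamma\, d_j, \qquad j \in \mathcal{A}
% An active coefficient would cross zero at
\gamma_j = -\frac{\beta_j}{d_j}, \qquad
\tilde{\gamma} = \min_{j \in \mathcal{A},\; \gamma_j > 0} \gamma_j

% LASSO modification: if \tilde{\gamma} is smaller than the ordinary LARS
% step \hat{\gamma}, take the step \tilde{\gamma} instead, remove the
% zero-crossing variable from \mathcal{A}, and recompute the direction.
```

Attributes whose coefficients are zero at the chosen $\lambda$ are the ones filtered out, which is how the path doubles as a feature selection procedure.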
Funding (VSOLassoBag paper): supported by the National Key R&D Program of China (2021YFA1302100 to Q.Z.), the National Natural Science Foundation of China (82172861 to Q.Z.), the Guangdong Basic and Applied Basic Research Foundation (2021A1515011743 to Q.Z.), and the National Key Clinical Discipline (to D.Z.).
Funding (index tracking paper): supported by the Science and Technology Research Program of Chongqing Municipal Education Commission (Grant No. KJQN202400514) and the Foundation Project of Chongqing Normal University (Grant No. 23XLB020), and partly supported by the Chongqing Social Science Planning Doctoral Program (Grant No. 2022BS064) and the Science and Technology Research Program of Chongqing Municipal Education Commission (Grant No. KJQN202301541).