Funding: Supported by the Hangzhou Joint Fund of the Zhejiang Provincial Natural Science Foundation of China (LHZY24A010002) and the MOE Project of Humanities and Social Sciences (21YJCZH235).
Abstract: High-dimensional heterogeneous data have attracted increasing attention over the past decade. In the presence of heterogeneity, semiparametric regression has emerged as a popular statistical method for modeling this type of data. In this paper, we leverage the computational efficiency and analytical robustness of expectile regression under heterogeneity, and propose a regularized partially linear additive expectile regression model with a nonconvex penalty, such as SCAD or MCP, for high-dimensional heterogeneous data. We focus on a more realistic scenario in which the regression error follows a heavy-tailed distribution with only finite moments. This scenario departs from the classical sub-Gaussian assumption and is more prevalent in practical applications. Under certain regularity conditions, we show that, with probability tending to one, the oracle estimator is one of the local minima of the induced optimization problem. Our theoretical analysis suggests that the dimensionality of the linear covariates that our estimation procedure can handle is fundamentally limited by the moment condition on the regression error. Computationally, because the induced optimization problem is nonconvex and nonsmooth, we develop a two-step algorithm. Monte Carlo simulation studies and a real-data application demonstrate the method's high estimation accuracy and effective model selection. Furthermore, by varying the expectile weight, our method detects heterogeneity and explores the complete conditional distribution of the response variable, underscoring its utility in analyzing high-dimensional heterogeneous data.
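The two ingredients named in this abstract, the expectile loss and the SCAD penalty, can be illustrated in a few lines. The sketch below is not the paper's two-step algorithm; it only shows the standard asymmetric squared loss, a sample expectile computed by iteratively reweighted averaging, and the usual SCAD formula. The function names and the default `a=3.7` are assumptions chosen for illustration.

```python
def expectile_loss(residual, tau):
    """Asymmetric squared loss: |tau - 1{r < 0}| * r**2."""
    weight = tau if residual >= 0 else 1.0 - tau
    return weight * residual ** 2


def sample_expectile(values, tau, iterations=100, tol=1e-12):
    """Constant minimising the average expectile loss,
    found by iteratively reweighted averaging."""
    estimate = sum(values) / len(values)  # start from the mean (tau = 0.5)
    for _ in range(iterations):
        # Points at or above the estimate get weight tau, the rest 1 - tau.
        weights = [tau if v >= estimate else 1.0 - tau for v in values]
        updated = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        if abs(updated - estimate) < tol:
            return updated
        estimate = updated
    return estimate


def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty for a single coefficient (a > 2, lam > 0)."""
    b = abs(beta)
    if b <= lam:
        return lam * b                     # lasso-like near zero
    if b <= a * lam:
        return (2 * a * lam * b - b * b - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2         # constant for large coefficients


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0, 5.0]
    for tau in (0.1, 0.5, 0.9):
        print(tau, sample_expectile(data, tau))
```

With `tau = 0.5` the expectile reduces to the mean; moving `tau` toward 0 or 1 shifts the fit toward the lower or upper part of the distribution, which is how varying the expectile weight can reveal heterogeneity.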
Abstract: Some properties of a conditioned superdiffusion are investigated. Using a basic property that we obtain for this process, we study a class of linear additive functionals, the so-called weighted occupation times. Finally, we obtain an interesting result on its extinction property.
Funding: National Natural Science Foundation of China (Grant No. 12071166), the Fundamental Research Funds for the Central Universities of China (Nos. 2662023LXPY005, 2662022XXYJ005), and the HZAU-AGIS Cooperation Fund (No. SZYJY2023010).
Abstract: Interpretability has drawn increasing attention in machine learning. Most work focuses on post-hoc explanations rather than on building self-explaining models. We therefore propose a Neural Partially Linear Additive Model (NPLAM), which automatically distinguishes insignificant, linear, and nonlinear features in neural networks. On the one hand, a neural network construction fits data better than spline functions with the same number of parameters; on the other hand, a learnable gate design and a sparsity regularization term preserve the ability to perform feature selection and structure discovery. We theoretically establish generalization error bounds for the proposed method using Rademacher complexity. Experiments on both simulated and real-world datasets verify its good performance and interpretability.
Abstract: Generalized additive partial linear models (GAPLM) have been widely used for flexible modeling of various types of response. In practice, missing data commonly occur in studies of economics, medicine, and public health. We address the problem of identifying and estimating a GAPLM when the response variable is nonignorably missing. Three types of monotone missing-data mechanisms are assumed: the logistic model, the probit model, and the complementary log-log model. In this situation, the likelihood based on the observed data may not be identifiable. In this article, we show that the parameters of interest are identifiable under very mild conditions, and then construct estimators of the unknown parameters and unknown functions using a likelihood-based approach in which the unknown functions are expanded as linear combinations of polynomial spline functions. We establish asymptotic normality for the estimators of the parametric components. Simulation studies demonstrate that the proposed inference procedure performs well in many settings. We apply the proposed method to the household income dataset from the Chinese Household Income Project Survey 2013.
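The three missing-data mechanisms named in this abstract correspond to three standard monotone link functions. As a minimal sketch, the code below maps a linear predictor `eta` to a probability of observing the response under each one. How `eta` depends on the response and covariates is specified in the paper, not here, so `eta` is treated as a plain number and the function names are illustrative assumptions.

```python
import math


def logistic_prob(eta):
    """Logistic mechanism: 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + math.exp(-eta))


def probit_prob(eta):
    """Probit mechanism: standard normal CDF evaluated at eta."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))


def cloglog_prob(eta):
    """Complementary log-log mechanism: 1 - exp(-exp(eta))."""
    return 1.0 - math.exp(-math.exp(eta))


if __name__ == "__main__":
    for eta in (-2.0, 0.0, 2.0):
        print(eta, logistic_prob(eta), probit_prob(eta), cloglog_prob(eta))
```

All three functions are strictly increasing in `eta`, which is the monotonicity the abstract refers to; unlike the other two, the complementary log-log curve is not symmetric about probability 0.5.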
Funding: Supported by the National Natural Science Foundation of China under Grant No. 11771250, the Natural Science Foundation of Shandong Province under Grant No. ZR2019MA002, and the Program for Scientific Research Innovation of Graduate Dissertation under Grant No. LWCXB201803.
Abstract: This paper considers partially linear additive models with a diverging number of parameters when some linear constraints on the parametric part are available. It proposes a constrained profile least-squares estimator for the parametric components, with the nonparametric functions estimated by basis function approximations. The consistency and asymptotic normality of the restricted estimator are established under certain conditions. The authors construct a profile likelihood ratio test statistic to test the validity of the linear constraints on the parametric components, and demonstrate that it asymptotically follows a chi-squared distribution under the null and alternative hypotheses. The finite-sample performance of the proposed method is illustrated by simulation studies and a data analysis.
Funding: This work was supported by the National Natural Science Foundation of China (Nos. 72192804, 72192800, and 12201619) and the China Postdoctoral Science Foundation (No. 2022M723333).
Abstract: In this paper, we investigate the optimization model for the set cover problem that minimizes the cost function subject to the cover function exceeding a required threshold, where the cost function is additive linear and the cover function is non-monotone approximately submodular. We study the problem under the streaming model and propose three bicriteria approximation algorithms. First, we provide an intuitive streaming algorithm under the assumption that the optimal objective value is known. This algorithm returns a solution whose cover function value is no less than $\alpha(1-\epsilon)$ times the threshold and whose cost is no more than $(2+\epsilon)^{2}/(\epsilon^{2}\omega^{2})\cdot\kappa$, where $\kappa$ is a value that we suppose for the optimal solution and $\alpha$ is the approximation ratio of an algorithm for the unconstrained maximization problem that we can call directly. Next, we present a bicriteria streaming algorithm that scans the ground set in multiple passes, which weakens the assumption that the optimal objective value is guessed in advance while maintaining the same bicriteria approximation ratio. Finally, we modify the multi-pass streaming algorithm into a single-pass one without compromising the performance ratio. Additionally, we conduct numerical experiments to compare our algorithms' performance with that of some existing methods.
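To make the streaming setting concrete, the sketch below shows a generic single-pass density-threshold rule on a plain (monotone) set coverage function: an arriving set is kept when its marginal coverage per unit cost reaches a fixed density, and the pass stops once the coverage target is met. This is a deliberate simplification and not the paper's algorithm, which handles non-monotone approximately submodular cover functions and comes with the bicriteria guarantees stated above. The names `threshold_stream`, `density`, and the toy data are assumptions made for illustration.

```python
def threshold_stream(stream, costs, target, density):
    """Single pass over (name, items) pairs.

    Keep a set when (newly covered items) >= density * cost,
    and stop as soon as at least `target` items are covered.
    Returns (chosen names, number of covered items, total cost).
    """
    chosen = []
    covered = set()
    total_cost = 0.0
    for name, items in stream:
        if len(covered) >= target:
            break  # coverage threshold already reached
        gain = len(items - covered)  # marginal coverage of this set
        if gain >= density * costs[name]:
            chosen.append(name)
            covered |= items
            total_cost += costs[name]
    return chosen, len(covered), total_cost


if __name__ == "__main__":
    # Toy instance: sets arrive one at a time with additive costs.
    stream = [
        ("A", {1, 2, 3}),
        ("B", {3, 4}),
        ("C", {4, 5, 6, 7}),
        ("D", {1, 7}),
    ]
    costs = {"A": 1.0, "B": 2.0, "C": 2.0, "D": 1.0}
    print(threshold_stream(stream, costs, target=6, density=1.5))
```

In this toy run, set B is skipped because it adds only one new item for a cost of 2, while A and C together pass the target. The single accepted density plays the role that a guessed optimal value plays in threshold-based streaming methods: a higher density keeps cost low but may fall short of the coverage target, which is the trade-off a bicriteria guarantee quantifies.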