Funding: Supported by the Scientific Research Foundation for the Returned Overseas Chinese Scholars of the State Education Ministry, the Key Scientific Research Project of the Hunan Provincial Education Department (19A342), the National Natural Science Foundation of China (11671132, 61903309 and 12271418), the National Key Research and Development Program of China (2020YFA0714200), the Sichuan Science and Technology Program (2023NSFSC1355), and the Applied Economics of Hunan Province.
Abstract: Long memory is an important phenomenon that sometimes arises in the analysis of time series or spatial data. Most definitions of long memory for a stationary process are based on the second-order properties of the process. The mutual information between the past and future, I_(p−f), of a stationary process represents the information stored in the history of the process that can be used to predict the future. We suggest that a stationary process be called long memory if its I_(p−f) is infinite. For a stationary process with finite block entropy, I_(p−f) is equal to the excess entropy, which is the sum of the redundancies that relate the convergence rate of the conditional (differential) entropy to the entropy rate. Since the definitions of I_(p−f) and the excess entropy of a stationary process require only a very weak moment condition on the distribution of the process, they can be applied to processes whose distributions do not have a bounded second moment. A significant property of I_(p−f) is that it is invariant under one-to-one transformations; this enables us to obtain the I_(p−f) of a stationary process from that of other processes. For a stationary Gaussian process, long memory in the sense of mutual information is stricter than long memory in the sense of covariance. We demonstrate that the I_(p−f) of fractional Gaussian noise is infinite if and only if the Hurst parameter satisfies H∈(1/2,1).
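For reference, a minimal sketch in standard information-theoretic notation (the abstract does not spell these formulas out): with past block X_{-n+1},...,X_0 and future block X_1,...,X_n,

I_{p-f} \;=\; \lim_{n\to\infty} I\big(X_{-n+1},\dots,X_{0}\,;\,X_{1},\dots,X_{n}\big), \qquad E \;=\; \sum_{n=1}^{\infty}\big(h_{n}-h\big),

where h_n = H(X_n | X_{n-1},...,X_1) is the conditional (block) entropy, h = lim_{n→∞} h_n is the entropy rate, and each term h_n − h is the redundancy at lag n. The proposed criterion declares long memory when this sum, equivalently I_(p−f), diverges.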
Funding: Supported by the National Key R&D Program of China (Grant No. 2022YFA1003702), the National Natural Science Foundation of China (Grant No. 11931014), and the New Cornerstone Science Foundation.
Abstract: In multiple heterogeneous networks, developing a model that considers both individual and shared structures is crucial for improving estimation efficiency and interpretability. In this paper, we introduce a semi-parametric individual network autoregressive model. We allow the autoregression and regression coefficients to vary across networks with a subgroup structure, and we integrate both covariates and node relationships into the network dependence through a single-index structure with unknown links. To estimate all individual and commonly shared parameters and functions, we introduce a novel penalized semiparametric approach based on the generalized method of moments. Theoretically, our proposed semiparametric estimator for heterogeneous networks exhibits estimation and selection consistency under regularity conditions. Numerical experiments are conducted to illustrate the effectiveness of the proposed estimator. The proposed method is applied to analyze patient distribution in hospitals to further demonstrate its utility.
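As a purely illustrative reading of this description (all symbols below are our own notation; the exact specification is given in the paper), one schematic form for node i in network m at time t is

Y_{it}^{(m)} \;=\; \rho_{c(m)}\sum_{j} w_{ij}^{(m)}\, g\big(U_{ij}^{(m)\top}\alpha\big)\, Y_{j,t-1}^{(m)} \;+\; X_{it}^{(m)\top}\beta_{c(m)} \;+\; \varepsilon_{it}^{(m)},

where w_{ij}^{(m)} encodes node relationships, U_{ij}^{(m)} collects covariates entering the dependence through the unknown link g and the single index α, and c(m) denotes the latent subgroup of network m, so that the coefficients (ρ, β) are shared within subgroups.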
Funding: Supported by the National Key R&D Program of China (Grant No. 2022YFA1003702) and the National Natural Science Foundation of China (Grant No. 12271441).
Abstract: With the advent of modern devices, such as smartphones and wearable devices, high-dimensional data are collected on many participants for a period of time or even in perpetuity. For this type of data, dependencies between and within data batches exist because data are collected from the same individual over time. Under the streamed-data framework, individual historical data are not available due to the storage and computation burden. It is therefore urgent to develop computationally efficient methods with statistical guarantees that can analyze high-dimensional streamed data and deliver reliable inference in practice. In addition, the homogeneity assumption on the model parameters may not remain valid over time. To address these issues, in this paper we develop a new renewable debiased-lasso inference method for high-dimensional streamed data that allows dependencies between and within data batches and allows the model parameters to change gradually. We establish the large-sample properties of the proposed estimators, including consistency and asymptotic normality. Numerical results, including simulations and a real data analysis, show the superior performance of the proposed method.
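For intuition, a minimal sketch of the generic renewable-updating idea for streamed estimation (our simplification in our own notation; the paper's procedure additionally debiases the lasso and accommodates dependence and drifting parameters): with score U_b and negative Hessian J_b computed on batch D_b only,

\hat\beta_{b} \;\approx\; \hat\beta_{b-1} \;+\; \Big(\textstyle\sum_{k\le b} J_{k}\Big)^{-1} U_{b}\big(D_{b};\hat\beta_{b-1}\big),

so that only the running estimate and the aggregated information matrix, rather than the raw historical batches, need to be stored and updated as new batches arrive.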
Funding: Supported by the National Key R&D Program of China (Grant No. 2022YFA1003702), the National Natural Science Foundation of China (Grant Nos. 11931014 and 12271441), and the New Cornerstone Science Foundation.
Abstract: To learn the subgroup structure generated by multidimensional interaction, we propose a novel multiview subgroup integration technique based on tensor decomposition. Compared with traditional subgroup analysis, which can only handle single-view heterogeneity, our proposed method achieves a greater level of homogeneity within the subgroups, leading to enhanced interpretability and predictive power. To make the proposed method computationally practical, we build an algorithm that incorporates pairwise shrinkage-encouraging penalties and ADMM techniques. Theoretically, we establish the asymptotic consistency and normality of the proposed estimators. Extensive simulation studies and real data analysis demonstrate that our proposal outperforms other methods in terms of prediction accuracy and grouping consistency. In addition, the analysis based on the proposed method indicates that intergenerational care significantly increases the risk of chronic diseases associated with diet and fatigue in all provinces, while only reducing the risk of emotion-related chronic diseases in the eastern coastal and central regions of China.
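For orientation, pairwise shrinkage-encouraging penalties of this kind typically take the generic fusion form (our notation, not the paper's exact criterion):

\min_{\theta_{1},\dots,\theta_{n}} \; \mathcal{L}(\theta_{1},\dots,\theta_{n}) \;+\; \sum_{1\le i<j\le n} p_{\lambda}\big(\|\theta_{i}-\theta_{j}\|\big),

where a concave penalty p_λ shrinks pairwise differences of unit-level parameters exactly to zero, so that units with θ_i = θ_j fall into the same subgroup; ADMM handles the coupling by introducing auxiliary variables δ_{ij} = θ_i − θ_j.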
Funding: Supported by the National Key R&D Program of China (Grant No. 2022YFA1003702) and the National Natural Science Foundation of China (Grant Nos. 11931014 and 12271441).
Abstract: Missingness in mixed-type variables is commonly encountered in a variety of areas. The requirement of complete observations necessitates data imputation when a moderate or large proportion of the data is missing. However, inappropriate imputation degrades the performance of machine learning algorithms, leading to poor predictions and unreliable statistical inference. For high-dimensional, large-scale, mixed-type missing data, we develop a computationally efficient imputation method, missing value imputation via generalized factor models (MIG), under the missing-at-random assumption. The proposed MIG method allows the missing variables to be of different types, including continuous, binary, and count variables, and is scalable in both the data size n and the variable dimension p, whereas existing imputation methods rely on restrictive assumptions such as missing variables of a single type, low variable dimensionality, or a limited sample size. We explicitly show that the imputation error of the proposed MIG method diminishes to zero at the rate Op(max{n^(-1/2), p^(-1/2)}) as both n and p tend to infinity. Five real datasets demonstrate the superior empirical performance of the proposed MIG method over existing methods: the average normalized absolute imputation error is reduced by 5.3%–34.1%.
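To give a feel for factor-model-based imputation, here is a minimal sketch for continuous variables only (our own illustration: the function name, rank, and iteration count are hypothetical, and the MIG method itself uses generalized factor models to also handle binary and count variables):

import numpy as np

def lowrank_impute(X, rank=3, n_iter=50):
    # Fill NaN entries of an n x p array with a rank-`rank` factor approximation.
    mask = ~np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    Xf = np.where(mask, X, col_means)                    # start from column-mean imputation
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Xf, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]    # rank-r "factors x loadings" fit
        Xf = np.where(mask, X, approx)                   # keep observed entries, refresh missing ones
    return Xf

A typical call would be X_imputed = lowrank_impute(X_with_nans, rank=5); the alternation between a low-rank fit and re-filling of the missing entries mirrors, in a simplified continuous-only setting, the factor-model idea underlying MIG.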
Funding: Supported by the Major Program of the National Natural Science Foundation of China (Grant No. 11731101), the National Natural Science Foundation of China (Grant No. 11671349), the National Natural Science Foundation of China (Grant No. 72171226), the Beijing Municipal Social Science Foundation (Grant No. 19GLC052), the National Statistical Science Research Project (Grant No. 2020LZ38), the National Natural Science Foundation of China (Grant Nos. 71532001, 11931014, 12171395 and 71991472), the Joint Lab of Data Science and Business Intelligence at Southwestern University of Finance and Economics, the National Natural Science Foundation of China (Grant No. 11831008), and the Open Research Fund of the Key Laboratory of Advanced Theory and Application in Statistics and Data Science (Grant No. Klatasds-Moe-EcnuKlatasds2101).
Abstract: One of the key research problems in financial markets is the investigation of inter-stock dependence. A good understanding in this regard is crucial for portfolio optimization. To this end, various econometric models have been proposed. Most of them assume that the random noise associated with each subject is independent. However, dependence might still exist within this random noise, and ignoring this valuable information might lead to biased estimates and inaccurate predictions. In this article, we study a spatial autoregressive moving average model with exogenous covariates, in which spatial dependence from both the response and the random noise is considered simultaneously. A quasi-maximum likelihood estimator is developed, and the estimated parameters are shown to be consistent and asymptotically normal. We then conduct an extensive analysis of the proposed method by applying it to Chinese stock market data.
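A schematic version of such a specification (our notation; the exact form and sign conventions are given in the paper) is

Y \;=\; \lambda W_{1} Y \;+\; X\beta \;+\; u, \qquad u \;=\; \varepsilon \;+\; \rho W_{2}\,\varepsilon,

where W_1 and W_2 are spatial weight matrices, λ captures dependence transmitted through the responses, and ρ captures moving-average-type dependence in the noise.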
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 11001225, 11401482 and 71532001).
Abstract: In panel data analysis, the cross-sectional dependence (CD) test has been extensively used to detect cross-sectional dependence. However, this traditional CD test does not take serial correlation into consideration, even though serial correlation commonly occurs in many fields. To solve this problem, we propose an adjusted CD test that can effectively handle serial correlation; in our work, the serial correlation may be of arbitrary form. Furthermore, we establish the theoretical properties of the proposed adjusted CD test. Our extensive Monte Carlo experiments show that the traditional CD test does not work well under serial correlation, while the proposed adjusted CD test provides rather satisfactory performance.
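For reference, the classical CD statistic for an N-unit, T-period panel is built from the pairwise sample correlations \hat\rho_{ij} of the regression residuals of units i and j:

\mathrm{CD} \;=\; \sqrt{\frac{2T}{N(N-1)}}\;\sum_{i=1}^{N-1}\sum_{j=i+1}^{N}\hat\rho_{ij},

which is asymptotically standard normal under cross-sectional independence; serial correlation in the residuals distorts this null behavior, which is what the adjusted test is designed to correct.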
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 11528102 and 11571282), the Fundamental Research Funds for the Central Universities of China (Grant Nos. JBK120509 and 14TD0046), and the National Science Foundation of USA (Grant No. DMS-1620898).
Abstract: We propose a dynamically integrated regression model to predict the price of online auctions, including the final price. Unlike existing models, the proposed method uses not only the historical price but also the information contained in the bidding times; consequently, its prediction accuracy is improved over existing methods. An estimation method based on B-spline approximation is proposed for the estimation and inference of the parameters and nonparametric functions in this model. The minimax rate of convergence for the prediction risk and large-sample results, including consistency and asymptotic normality, are established. Simulation studies verify the finite-sample performance and the appealing prediction accuracy and robustness. Finally, when we apply our method to 7-day auctions of the iPhone 6s held between December 2015 and March 2016, the proposed method predicts the ending price with a much smaller error than the existing models.
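For concreteness, the B-spline approximation referred to here replaces each unknown nonparametric function g with a finite basis expansion (generic form, in our notation):

g(t) \;\approx\; \sum_{j=1}^{J_{n}} b_{j}\,B_{j}(t),

where {B_j} are B-spline basis functions on the bidding-time interval and the coefficients b_j are estimated jointly with the parametric part, turning the semiparametric fit into a finite-dimensional estimation problem.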
Funding: Supported by the Fundamental Research Funds for the Central Universities (Grant No. JBK2207075). The second author was supported by the National Natural Science Foundation of China (Grant Nos. 71991472, 12171395, 11931014 and 71532001), the Joint Lab of Data Science and Business Intelligence at Southwestern University of Finance and Economics, and the Fundamental Research Funds for the Central Universities (Grant No. JBK1806002). The fourth author was supported by the Humanity and Social Science Youth Foundation of the Ministry of Education of China (Grant No. 19YJC790204).
Abstract: We propose a novel polynomial network autoregressive model that incorporates higher-order connection relationships in order to simultaneously model the effects of both direct and indirect connections. A quasi-maximum likelihood estimation method is proposed to estimate the unknown influence parameters, and we establish its consistency and asymptotic normality without imposing any distributional assumption. Moreover, an extended Bayesian information criterion is developed for order selection with a diverging upper bound on the order. The application of the proposed polynomial network autoregressive model is demonstrated through both simulation and real data analysis.
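As a purely schematic illustration (our notation; the paper's exact specification may differ), with a row-normalized network weight matrix W whose k-th power W^k links nodes that are k steps apart, such a model can be written as

Y_{t} \;=\; \sum_{k=1}^{K} \beta_{k}\,W^{k} Y_{t-1} \;+\; \gamma\,Y_{t-1} \;+\; \varepsilon_{t},

so that β_1 captures the effect of direct connections, β_k for k ≥ 2 captures indirect k-step connections, and the extended BIC chooses the order K from a diverging set of candidates.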
Abstract: Truncated L1 regularization, proposed by Fan in [5], is an approximation to the L0 regularization in high-dimensional sparse models. In this work, we prove a non-asymptotic error bound for the global optimal solution of the truncated L1 regularized linear regression problem and study its support recovery property. Moreover, a primal dual active set algorithm (PDAS) for variable estimation and selection is proposed. Coupling it with continuation via a warm-start strategy leads to the primal dual active set with continuation algorithm (PDASC). Data-driven parameter selection rules, such as cross-validation, BIC, or a voting method, can be applied to select a proper regularization parameter. The application of the proposed method is demonstrated by applying it to simulated data and a breast cancer gene expression data set (bcTCGA).
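For orientation, one common parametrization of a truncated L1 penalty (the exact form used here follows [5]) replaces the L0 penalty λ‖β‖_0 with

P_{\lambda,\tau}(\beta) \;=\; \lambda \sum_{j=1}^{p} \min\!\big(|\beta_{j}|,\,\tau\big),

which behaves like the L1 penalty for small coefficients but, like the L0 penalty, stops growing once |β_j| exceeds the truncation level τ, thereby reducing the bias on large coefficients.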