Funding: Supported by the National Natural Science Foundation of China (Grant No. 11861042) and the China Statistical Research Project (Grant No. 2020LZ25).
Abstract: Missing covariate data are a central problem in modern statistical analysis. They often arise in surveys or interviews, and the problem becomes more complex in the presence of heavy-tailed, skewed, and heteroscedastic data. In such settings, a robust quantile regression method is of particular interest. This paper presents an inverse weighted quantile regression method for exploring the relationship between the response and the covariates. The method has several advantages over the naive estimator: on the one hand, it uses all available data and allows the missing covariates to be heavily correlated with the response; on the other hand, the estimator is uniformly asymptotically normal over all quantile levels. The effectiveness of the method is verified by simulation. Finally, to illustrate its generality, we extend it to the multivariate and nonparametric cases.
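For intuition, here is a minimal sketch of inverse-probability-weighted quantile regression with covariates missing at random. The logistic selection model, the Nelder-Mead minimization of the weighted check loss, and all parameter values are illustrative assumptions, not the paper's exact estimator.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import LogisticRegression

def check_loss(u, tau):
    """Quantile check function rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (u < 0))

def ipw_quantile_fit(y, X, observed, tau):
    """Minimize the inverse-probability-weighted check loss over complete cases."""
    # Assumed selection model: P(covariates observed | y) via logistic regression.
    sel = LogisticRegression().fit(y.reshape(-1, 1), observed.astype(int))
    pi = sel.predict_proba(y.reshape(-1, 1))[:, 1]
    yo, Xo, w = y[observed], X[observed], 1.0 / pi[observed]

    def objective(beta):
        return np.mean(w * check_loss(yo - Xo @ beta, tau))

    # Nelder-Mead handles the nonsmooth check loss without gradients.
    return minimize(objective, np.zeros(X.shape[1]), method="Nelder-Mead").x

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.standard_t(df=3, size=n)    # heavy-tailed errors
observed = rng.random(n) < 1 / (1 + np.exp(-(0.5 + 0.3 * y)))  # missingness tied to y
print(ipw_quantile_fit(y, X, observed, tau=0.5))
```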
Funding: Financial support from the Deutsche Forschungsgemeinschaft via the IRTG 1792 "High-dimensional, Non-stationary Time Series", Humboldt-Universität zu Berlin, is gratefully acknowledged, as is funding from the European Union's "FIN-TECH: A Financial supervision and Technology compliance training programme" under Grant agreement No. 825215.
Abstract: We construct an exchange-traded fund (ETF) based on the CRyptocurrency IndeX (CRIX), which closely tracks nonstationary cryptocurrency (CC) dynamics by adapting the weights of its constituents dynamically. Our scenario analysis considers the fee schedules of regulated CC exchanges and spreads obtained from order book data, while investment in- and outflows to the ETF are modelled stochastically. The scenario analysis yields valuable insights into the mechanisms, costs, and risks of this innovative financial product: i) although the composition of the CRIX ETF changes frequently (from 5 to 30 constituents), it remains robust in its core, as the weights of Bitcoin (BTC) and Ethereum (ETH) are stable over time; ii) on average, 5.2% of the portfolio needs to be rebalanced on the rebalancing dates; iii) trading costs are low compared with traditional assets; iv) the liquidity of the CC sector increases significantly during the analysis period; spreads occur especially for altcoins and increase with the size of the transactions. However, because BTC and ETH are the most affected by rebalancing, the cost of spreads remains limited.
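As a toy illustration of the rebalancing-share figure (the weights below are hypothetical, not CRIX weights), half the L1 distance between the old and new weight vectors is one common measure of the fraction of the portfolio that must be traded:

```python
import numpy as np

w_old = np.array([0.55, 0.25, 0.10, 0.06, 0.04])   # hypothetical index weights
w_new = np.array([0.52, 0.27, 0.12, 0.05, 0.04])   # weights after reweighting

turnover = 0.5 * np.abs(w_new - w_old).sum()       # fraction that must be traded
print(f"rebalanced share: {turnover:.1%}")
```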
Funding: National Key Basic Research Program of China (973 Program) under Grant Nos. 2012CB315802 and 2013CB329102; National Natural Science Foundation of China under Grant Nos. 61171102 and 61132001; New Generation Broadband Wireless Mobile Communication Network Key Projects for Science and Technology Development under Grant No. 2011ZX03002-002-01; Beijing Nova Program under Grant No. 2008B50; and Beijing Higher Education Young Elite Teacher Project under Grant No. YETP0478.
Abstract: Unlike the general online social network (OSN), the location-based mobile social network (LMSN), which seamlessly integrates mobile computing and social computing technologies, has unique characteristics of temporal, spatial, and social correlation. Recommending friends instantly based on users' current real-world locations has become increasingly popular in LMSNs. However, existing friend recommendation methods based on the topological structure of a social network, or on non-topological information such as similar user profiles, cannot adequately address instant friend-making in the real world. In this article, we analyze users' check-in behavior on a real LMSN site named Gowalla. Based on this analysis, we present an approach for recommending friends instantly to LMSN users by simultaneously considering real-time physical location proximity, offline behavior similarity, and friendship network information in the virtual community. This approach effectively bridges the gap between users' offline behavior in the real world and the online friendship network information in the virtual community. Finally, we use a real check-in dataset from Gowalla to verify the effectiveness of our approach.
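A minimal sketch of how the three signals could be combined into a single ranking score. The proximity kernel, the similarity and overlap measures, and the weights alpha, beta, gamma are all illustrative choices, not the article's exact model.

```python
import numpy as np

def recommend_score(dist_km, checkin_u, checkin_v, common_friends,
                    deg_u, deg_v, alpha=0.5, beta=0.3, gamma=0.2):
    proximity = np.exp(-dist_km)                       # decays with distance
    # Cosine similarity of check-in count vectors over venues (offline behavior).
    sim = checkin_u @ checkin_v / (np.linalg.norm(checkin_u) *
                                   np.linalg.norm(checkin_v) + 1e-12)
    # Jaccard-style overlap of friendship circles (online network signal).
    network = common_friends / (deg_u + deg_v - common_friends + 1e-12)
    return alpha * proximity + beta * sim + gamma * network

print(recommend_score(0.4, np.array([3, 0, 1.0]), np.array([2, 1, 1.0]),
                      common_friends=5, deg_u=40, deg_v=30))
```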
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 12171328, 12001093, 12231011, and 12071176; the National Key Research and Development Program of China under Grant No. 2020YFA0714102; and the Beijing Natural Science Foundation under Grant No. Z210003.
Abstract: The paper discusses the regression analysis of current status data, which are common in various fields such as tumorigenicity research and demographic studies. Analyzing this type of data poses a significant challenge and has recently gained considerable interest. The authors consider an even more difficult scenario where, apart from censoring, one also faces left-truncation and informative censoring, meaning a potential correlation between the examination time and the failure time of interest. The authors propose a sieve maximum likelihood estimation (MLE) method in which a copula-based procedure is applied to model the informative censoring. Additionally, the authors utilize splines to estimate the unknown nonparametric functions in the model, and the asymptotic properties of the proposed estimator are established. The simulation results indicate that the developed approach is effective in practice, and it is successfully applied to a set of real data.
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 12071176, 12031016, 12171328), the Scientific and Technological Innovation Programs of Higher Education Institutions in Shanxi (Grant No. 2023L012), and the Beijing Natural Science Foundation (Grant No. Z210003).
Abstract: This paper discusses variable selection for interval-censored failure time data, a general type of failure time data that commonly arise in many areas such as clinical trials and follow-up studies. Although some methods have been developed in the literature for this problem, most of the existing procedures apply only to specific models. In this paper, we consider data arising from a general class of partly linear additive generalized odds rate models and propose a penalized variable selection approach through maximizing a derived penalized likelihood function. In the method, Bernstein polynomials are employed to approximate both the unknown baseline hazard function and the nonlinear covariate effect functions, and a coordinate descent algorithm is developed for implementation. The asymptotic properties of the proposed estimators, including the oracle property, are also established. An extensive simulation study assesses the finite-sample performance of the proposed estimators and indicates that the method works well in practice. Finally, the proposed method is applied to a set of real data on Alzheimer's disease.
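The Bernstein polynomial basis used for the nonparametric components is easy to construct. A minimal sketch, assuming the argument has been rescaled to [0, 1] and with an illustrative degree:

```python
import numpy as np
from scipy.special import comb

def bernstein_basis(t, degree):
    """(len(t), degree + 1) matrix of Bernstein basis values on [0, 1]."""
    t = np.asarray(t)[:, None]
    k = np.arange(degree + 1)
    return comb(degree, k) * t ** k * (1 - t) ** (degree - k)

# An unknown smooth function is approximated by a linear combination of the columns.
B = bernstein_basis(np.linspace(0, 1, 5), degree=3)
print(B.shape, B.sum(axis=1))   # each row sums to 1 (partition of unity)
```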
Funding: Shuwei Li's research was partially supported by the National Natural Science Foundation of China (Grant No. 11901128), the Natural Science Foundation of Guangdong Province of China (Grant Nos. 2021A1515010044, 2022A1515011901), the Science and Technology Program of Guangzhou of China (Grant No. 202102010512), and the National Statistical Science Research Project (Grant No. 2022LY041). Shishun Zhao's research was partially supported by the National Natural Science Foundation of China (Grant No. 12071176) and the Science and Technology Developing Plan of Jilin Province (20200201258JC).
Abstract: We discuss regression analysis of current status data under the additive hazards model when the failure status may suffer misclassification. Such data occur commonly in many scientific fields involving diagnostic tests with imperfect sensitivity and specificity. In particular, we consider the situation where the sensitivity and specificity are known and propose a nonparametric maximum likelihood approach. For the implementation of the method, a novel EM algorithm is developed, and the asymptotic properties of the resulting estimators are established. Furthermore, the estimated regression parameters are shown to be semiparametrically efficient. We demonstrate the empirical performance of the proposed methodology in a simulation study and show its substantial advantages over the naive method. An application to a motivating study on chlamydia is also provided.
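The key modelling ingredient is simple to state: with known sensitivity se and specificity sp, a positive reading at examination time c has probability se * F(c) + (1 - sp) * (1 - F(c)), where F is the failure-time distribution. A minimal sketch of the resulting observed-data log-likelihood, with a simple parametric F standing in for the paper's nonparametric MLE/EM machinery:

```python
import numpy as np

def loglik(theta, c, delta_star, se, sp):
    """delta_star: possibly misclassified failure status at examination time c.
    For illustration, F is exponential with rate theta."""
    F = 1 - np.exp(-theta * c)
    p_pos = se * F + (1 - sp) * (1 - F)     # P(test positive | examination at c)
    return np.sum(delta_star * np.log(p_pos) + (1 - delta_star) * np.log(1 - p_pos))

rng = np.random.default_rng(8)
n, theta_true, se, sp = 2000, 0.7, 0.9, 0.95
T = rng.exponential(1 / theta_true, n)      # latent failure times
C = rng.uniform(0.1, 3.0, n)                # examination times
true_status = (T <= C).astype(float)
flip = rng.random(n)
delta_star = np.where(true_status == 1, flip < se, flip < 1 - sp).astype(float)

grid = np.linspace(0.2, 2.0, 181)
print(grid[np.argmax([loglik(t, C, delta_star, se, sp) for t in grid])])  # near 0.7
```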
Funding: Supported by the National Key R&D Program of China (Grant No. 2022YFA1003803), the National Social Science Foundation of China (Grant No. 21BTJ048), and the National Natural Science Foundation of China (Grant Nos. 12371276 and 12131006); also supported by the National Key R&D Program of China (Grant No. 2023YFA1008700), the National Social Science Foundation of China (Grant No. 24BTJ066), and the National Natural Science Foundation of China (Grant No. 12171033).
Abstract: Principal component analysis (PCA) is ubiquitous in statistics and machine learning. It is frequently used as an intermediate step in various regression and classification problems to reduce the dimensionality of datasets. However, as datasets become extremely large, direct application of PCA may not be feasible, since loading and storing massive datasets may exceed the computational capacity of common machines. To address this problem, subsampling is usually performed, in which a small proportion of the data is used as a surrogate for the entire dataset. This paper proposes an A-optimal subsampling algorithm to decrease the computational cost of PCA for super-large datasets. Specifically, we establish the consistency and asymptotic normality of the eigenvectors of the subsampled covariance matrix, and we then derive the optimal subsampling probabilities for PCA based on the A-optimality criterion. We validate the theoretical results through extensive simulation studies. Moreover, the proposed subsampling algorithm for PCA is embedded into a classification procedure for handwriting data to assess its effectiveness in real-world applications.
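A minimal sketch of subsampled PCA: draw rows with non-uniform probabilities, reweight, and eigendecompose the subsampled covariance. The squared-row-norm probabilities below are a simple stand-in; the paper's A-optimality-based probabilities are not reproduced here.

```python
import numpy as np

def subsampled_pca(X, r, rng):
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    # Illustrative importance sampling: larger rows are sampled more often.
    p = np.linalg.norm(Xc, axis=1) ** 2
    p /= p.sum()
    idx = rng.choice(n, size=r, replace=True, p=p)
    # Inverse-probability weighting keeps the covariance estimate unbiased.
    Z = Xc[idx] / np.sqrt(n * r * p[idx])[:, None]
    eigvals, eigvecs = np.linalg.eigh(Z.T @ Z)
    return eigvals[::-1], eigvecs[:, ::-1]          # descending order

rng = np.random.default_rng(1)
X = rng.normal(size=(100_000, 10)) @ np.diag(np.linspace(3, 1, 10))
vals, vecs = subsampled_pca(X, r=2_000, rng=rng)
print(vals[:3])   # approximates the leading eigenvalues of the full covariance
```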
Funding: Supported by the National Natural Science Foundation of China under Grant No. 11671168 and the Science and Technology Developing Plan of Jilin Province under Grant No. 20200201258JC.
Abstract: This paper discusses regression analysis of interval-censored failure time data arising from the accelerated failure time model in the presence of informative censoring. For this problem, a sieve maximum likelihood estimation approach is proposed in which a copula model is employed to describe the relationship between the failure time of interest and the censoring or observation process, and I-spline functions are used to approximate the unknown functions in the model. A simulation study is carried out to assess the finite-sample performance of the proposed approach and suggests that it works well in practical situations. In addition, an illustrative example is provided.
Funding: Supported by the National Natural Science Foundation of China (Nos. 11071253, 11471335, 11626130).
Abstract: Based on the Vector Aitken (VA) method, we propose an accelerated Expectation-Maximization (EM) algorithm, the VA-accelerated EM algorithm, whose convergence is faster than that of the EM algorithm. The VA-accelerated EM algorithm does not use the information matrix; it uses only the sequence of estimates obtained from EM iterations, and thus retains the flexibility and simplicity of the EM algorithm. Considering the Steffensen iterative process, we also give the Steffensen form of the VA-accelerated EM algorithm and prove that this reformed process converges quadratically. Numerical analyses illustrate that the proposed methods are efficient and faster than the EM algorithm.
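A minimal sketch of Aitken-type acceleration of an EM-style fixed-point map theta -> F(theta). The componentwise Delta-squared update below is one standard form of Aitken extrapolation for vector sequences; the paper's exact VA and Steffensen variants may differ in detail.

```python
import numpy as np

def aitken_accelerated_fixed_point(F, theta0, tol=1e-10, max_iter=500):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        t1, t2 = F(theta), F(F(theta))            # two plain EM-style steps
        d1, d2 = t1 - theta, t2 - 2 * t1 + theta  # first and second differences
        # Componentwise Aitken update; fall back to the plain step where d2 ~ 0.
        safe = np.abs(d2) > 1e-12
        accel = np.where(safe, theta - d1 ** 2 / np.where(safe, d2, 1.0), t2)
        if np.max(np.abs(accel - theta)) < tol:
            return accel
        theta = accel
    return theta

# Toy linear map with fixed point 0.6: slow as a plain iteration, exact in one
# accelerated step because the map is linear.
F = lambda th: 0.9 * th + 0.06
print(aitken_accelerated_fixed_point(F, np.array([0.0])))
```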
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 12001093 and 12071176, the National Key Research and Development Program of China under Grant No. 2020YFA0714102, and the Science and Technology Developing Plan of Jilin Province under Grant No. 20200201258JC.
Abstract: Misclassified current status data arise when each study subject can be observed only once and the observation status is determined by a diagnostic test with imperfect sensitivity and specificity. In this situation, another issue that may occur is that the observation time may be correlated with the failure time of interest, which is often referred to as informative censoring or informative observation times. It is well known that in the presence of informative censoring, an analysis that ignores it can yield biased or even misleading results. In this paper, the authors consider such data and propose a frailty-based inference procedure. In particular, an EM algorithm based on Poisson latent variables is developed, and the asymptotic properties of the resulting estimators are established. Numerical results show that the proposed method works well in practice, and an application to a set of real data is provided.
Funding: This work was supported by the National Natural Science Foundation of China (Key Grant 10231030, Special Grant 10241001) and Humboldt-Universität zu Berlin, Sonderforschungsbereich 373.
Abstract: In this paper, linear errors-in-response models are considered in the presence of validation data on the responses. A semiparametric dimension-reduction technique is employed to define an asymptotically normal estimator of β, together with estimated empirical log-likelihoods and adjusted empirical log-likelihoods for the vector of regression coefficients and for linear combinations of the regression coefficients, respectively. The estimated empirical log-likelihoods are shown to be asymptotically distributed as weighted sums of independent χ²₁ variables, and the adjusted empirical log-likelihoods are proved to be asymptotically distributed as standard chi-squares.
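Schematically (our notation, not the paper's: p is the coefficient dimension, the w_i are unknown weights, and the χ²₁,ᵢ are independent one-degree-of-freedom chi-square variables), the two limiting results take the form

```latex
\hat{\ell}(\beta_0) \xrightarrow{\;d\;} \sum_{i=1}^{p} w_i\,\chi^2_{1,i},
\qquad
\hat{\ell}_{\mathrm{adj}}(\beta_0) \xrightarrow{\;d\;} \chi^2_{p}.
```

The practical payoff of the adjustment is that the nuisance weights w_i disappear, so standard chi-square critical values can be used for inference.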
Funding: Supported by a grant from the University Grants Council of Hong Kong, the National Natural Science Foundation of China (Grant No. 11471335), the Ministry of Education Project of Key Research Institute of Humanities and Social Sciences at Universities (Grant No. 16JJD910002), and the Fund for Building World-Class Universities (Disciplines) of Renmin University of China.
Abstract: This paper proposes a novel method for testing the equality of high-dimensional means using a multiple hypothesis test. The proposed method is based on the maximum of standardized partial sums of logarithmic p-values. Numerical studies show that the method performs well for both normal and non-normal data and has good power under both dense and sparse alternatives. For illustration, a real data analysis is implemented.
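A minimal sketch, in our notation rather than the paper's exact statistic, of a "maximum of standardized partial sums of log p-values": under the null each -log p_i is Exp(1) with mean and variance 1, so partial sums of the sorted values can be schematically centered and scaled before taking the maximum over the cut-off.

```python
import numpy as np

def max_standardized_partial_sum(pvals):
    s = np.sort(-np.log(pvals))[::-1]            # strongest signals first
    partial = np.cumsum(s)
    k = np.arange(1, len(pvals) + 1)
    return np.max((partial - k) / np.sqrt(k))    # schematic Exp(1) standardization

rng = np.random.default_rng(2)
p_null = rng.uniform(size=1000)
p_alt = p_null.copy()
p_alt[:20] = rng.uniform(0, 1e-4, size=20)       # sparse signal: 20 tiny p-values
print(max_standardized_partial_sum(p_null), max_standardized_partial_sum(p_alt))
```

Because the maximum is taken over all partial-sum lengths, a few very small p-values (sparse alternative) and many moderately small ones (dense alternative) both inflate the statistic, which mirrors the power behaviour the abstract reports.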
Funding: Beijing Natural Science Foundation (Grant No. Z200001), National Natural Science Foundation of China (Grant Nos. 11871001, 11971478, and 11971001), and the Fundamental Research Funds for the Central Universities (Grant No. 2019NTSS18).
Abstract: Testing independence between random vectors X and Y is an essential task in statistical inference. One class of testing methods is based on minimal spanning trees of X and Y. The main idea is to generate the minimal spanning tree for one random vector, X, and, for each edge in the tree, compute a rank number based on the other random vector, Y; the test statistics are then constructed from these rank numbers. However, the existing statistics are not symmetric in X and Y, so the power obtained from the minimal spanning tree of X may differ from that obtained from the minimal spanning tree of Y, and the conclusions from the two trees may even conflict. To solve these problems, we propose several symmetric independence tests for X and Y. The exact distributions of the test statistics are investigated for small sample sizes, and their asymptotic properties are studied. A permutation method is introduced for obtaining critical values. Numerical analysis demonstrates that the proposed methods are more efficient than the existing ones.
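A minimal sketch of a permutation test in this spirit: score the MST of one sample by rank differences of the paired other sample, and symmetrize by adding the construction with the roles of X and Y exchanged. The edge score, the rank reduction for multivariate Y, and the symmetrization are illustrative, not the paper's exact statistics.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform
from scipy.stats import rankdata

def mst_edge_score(A, B):
    """Sum of |rank difference of B| over the edges of the MST built on A."""
    mst = minimum_spanning_tree(squareform(pdist(A)))
    i, j = mst.nonzero()
    r = rankdata(np.linalg.norm(B - B.mean(axis=0), axis=1))  # crude ranks for multivariate B
    return np.abs(r[i] - r[j]).sum()

def symmetric_mst_test(X, Y, n_perm=199, rng=None):
    rng = rng or np.random.default_rng()
    stat = mst_edge_score(X, Y) + mst_edge_score(Y, X)        # symmetric in X and Y
    perm = np.empty(n_perm)
    for b in range(n_perm):
        sigma = rng.permutation(len(Y))                       # break the pairing
        perm[b] = mst_edge_score(X, Y[sigma]) + mst_edge_score(Y[sigma], X)
    # Dependence makes rank differences along MST edges small: left tail is extreme.
    return stat, (1 + np.sum(perm <= stat)) / (1 + n_perm)

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 2))
Y = X + 0.3 * rng.normal(size=(60, 2))                        # dependent sample
print(symmetric_mst_test(X, Y, rng=rng))
```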
Funding: Supported in part by the National Natural Science Foundation of China (No. 11271368), the Beijing Philosophy and Social Science Foundation (No. 12JGB051), the Specialized Research Fund for the Doctoral Program of Higher Education of China (Grant No. 20130004110007), the Key Program of the National Philosophy and Social Science Foundation (No. 13AZD064), the Fundamental Research Funds for the Central Universities and the Research Funds of Renmin University of China (No. 10XNK025), and the China Statistical Research Project (No. 2011LZ031).
Abstract: Prior empirical studies find positive and negative momentum effects across global markets, but few focus on explaining the mixed results. To address this issue, we apply the quantile regression approach to analyze the momentum effect in the Chinese stock market. The evidence suggests that the momentum effect in Chinese stocks is not stable across firms with different levels of performance. We find that the negative momentum effect over short and medium horizons (3 and 9 months) increases with the quantile of stock returns, and that a positive momentum effect appears over the long horizon (12 months), which also intensifies for high-performing stocks. According to our study, the momentum effect needs to be examined conditionally on stock returns; OLS estimation, which yields a single, potentially biased result, provides misleading intuition about the momentum effect across global markets. Based on the empirical results of the quantile regression, effective risk control strategies can also be designed by adjusting the proportions of assets according to their past performance.
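A minimal sketch, with simulated returns in place of Chinese stock data and a deliberately asymmetric toy effect, of estimating the momentum slope at several quantile levels with statsmodels' QuantReg, in the spirit of the paper's approach:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.regression.quantile_regression import QuantReg

rng = np.random.default_rng(4)
past = rng.normal(size=1000)                         # formation-period returns
future = -0.1 * past * (past < 0) + 0.05 * past * (past > 0) \
         + rng.normal(scale=0.5, size=1000)          # asymmetric toy momentum

X = sm.add_constant(past)
for q in (0.1, 0.5, 0.9):
    res = QuantReg(future, X).fit(q=q)
    print(f"q={q}: momentum slope = {res.params[1]:+.3f}")
```

Comparing the slope across q is exactly the kind of evidence the abstract describes: a slope that changes sign or magnitude across quantiles is invisible to a single OLS fit.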
Funding: National Natural Science Foundation of China, Grant/Award Number: 72271237; Building World-Class Universities of Renmin University of China, Grant/Award Number: 21XNF037.
Abstract: Deep learning has become increasingly popular in omics data analysis. Recent works incorporating variable selection into deep learning have greatly enhanced model interpretability. However, because deep learning requires a large sample size, existing methods may produce uncertain findings when the dataset is small, as is common in omics data analysis. With the explosion and availability of omics data from multiple populations/studies, existing methods naively pool them into one dataset to enhance the sample size, ignoring the fact that variable structures can differ across datasets, which may lead to inaccurate variable selection. We propose a penalized integrative deep neural network (PIN) to simultaneously select important variables from multiple datasets. PIN directly aggregates multiple datasets as input and accommodates both homogeneity and heterogeneity among datasets in an integrative analysis framework. Results from extensive simulation studies and from applications of PIN to gene expression datasets from elders with different cognitive statuses and from ovarian cancer patients at different stages demonstrate that PIN outperforms existing methods, with considerably improved performance across multiple datasets. The source code is freely available on GitHub (rucliyang/PINFunc). We speculate that the proposed PIN method will promote the identification of disease-related important variables based on multiple studies/datasets from diverse origins.
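A minimal PyTorch sketch of the core idea behind input-level variable selection in a deep network: a group-lasso penalty on the first-layer columns drives all outgoing weights of unimportant inputs toward zero. The integrative multi-dataset machinery of PIN is not reproduced; layer sizes and the penalty weight are illustrative.

```python
import torch
import torch.nn as nn

class SelectNet(nn.Module):
    def __init__(self, p, hidden=32):
        super().__init__()
        self.input_layer = nn.Linear(p, hidden)
        self.body = nn.Sequential(nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x):
        return self.body(self.input_layer(x)).squeeze(-1)

    def group_penalty(self):
        # L2 norm of each input variable's outgoing-weight group (one column per input).
        return self.input_layer.weight.norm(dim=0).sum()

p = 50
net = SelectNet(p)
x = torch.randn(256, p)
y = x[:, 0] - 2 * x[:, 1] + 0.1 * torch.randn(256)    # only inputs 0 and 1 matter
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(x), y) + 0.02 * net.group_penalty()
    loss.backward()
    opt.step()
print(net.input_layer.weight.detach().norm(dim=0)[:5])  # columns 0, 1 should dominate
```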
Funding: Supported by the National Natural Science Foundation of China (Nos. 12101300, 12371478, and 12071498).
Abstract: In this paper, we study the optimal timing for an insurance company to convert its risky business in order to improve its solvency. The cash flow of the company evolves according to a jump-diffusion process. A business conversion option offers the company an opportunity to transfer the jump-risk business out; in exchange for this option, the company pays both fixed and proportional transaction costs, where the proportional cost can also be viewed as the profit loading of the jump-risk business. We formulate this problem as an optimal stopping problem. By solving it, we find that the optimal timing of business conversion mainly depends on the profit loading of the jump-risk business: a large profit loading makes the conversion option worthless, while the fixed cost only delays the optimal conversion time. Finally, numerical results illustrate the impact of the transaction costs and environmental parameters on the optimal strategies.
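A minimal simulation sketch of the setting, with all parameters hypothetical and a simple threshold rule standing in for the paper's optimal stopping boundary: the surplus follows a jump-diffusion, and the business is converted (the jump component removed, fixed and proportional costs paid) the first time the surplus falls to a level b.

```python
import numpy as np

def simulate_conversion(x0=10.0, mu=1.0, sigma=0.5, lam=1.0, jump_mean=0.8,
                        b=5.0, fixed_cost=1.0, prop_cost=0.2,
                        T=10.0, dt=1e-3, rng=None):
    rng = rng or np.random.default_rng()
    x, t, converted = x0, 0.0, False
    while t < T:
        dx = mu * dt + sigma * np.sqrt(dt) * rng.normal()   # diffusion part
        if not converted:
            if rng.random() < lam * dt:                     # claim (jump) arrival
                dx -= rng.exponential(jump_mean)
            if x <= b:                                      # threshold conversion rule
                x -= fixed_cost + prop_cost * x             # pay both cost types
                converted = True
        x += dx
        t += dt
    return x, converted

print(simulate_conversion(rng=np.random.default_rng(7)))
```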
Abstract: In many fields, we need to deal with hierarchically structured data. For this kind of data, a hierarchical mixed effects model can capture the correlation of variables at the same level by building a model for the regression coefficients. Due to the complexity of the random part of this model, finding an effective method to estimate the covariance matrix is an appealing issue. The iterative generalized least squares estimation method was proposed by Goldstein in 1986 and applied to a special case of the hierarchical model. In this paper, we extend the method to the general hierarchical mixed effects model, derive its expressions in detail, and apply it to economic examples.
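A minimal sketch of iterative generalized least squares (IGLS) for a two-level random-intercept model y_ij = x_ij' beta + u_j + e_ij: alternate a GLS step for beta given the current variance components with an update of the variance components from the residuals. The simple moment updates below stand in for Goldstein's full IGLS variance step, which regresses residual cross-products on the design of V.

```python
import numpy as np

def igls_random_intercept(y, X, group, n_iter=20):
    groups = np.unique(group)
    su2, se2 = 1.0, 1.0                                  # starting variance components
    for _ in range(n_iter):
        # GLS step for beta given V_g = se2 * I + su2 * J (random intercept).
        XtVX = np.zeros((X.shape[1], X.shape[1]))
        XtVy = np.zeros(X.shape[1])
        for g in groups:
            m = group == g
            Vi = np.linalg.inv(se2 * np.eye(m.sum()) + su2)
            XtVX += X[m].T @ Vi @ X[m]
            XtVy += X[m].T @ Vi @ y[m]
        beta = np.linalg.solve(XtVX, XtVy)
        # Moment-style updates for (su2, se2) from the residuals.
        r = y - X @ beta
        gmeans = np.array([r[group == g].mean() for g in groups])
        within = np.concatenate([r[group == g] - gm for g, gm in zip(groups, gmeans)])
        se2 = within @ within / (len(y) - len(groups))
        su2 = max(gmeans.var() - se2 * np.mean([1.0 / (group == g).sum() for g in groups]), 1e-8)
    return beta, su2, se2

rng = np.random.default_rng(5)
J, n_j = 30, 10
group = np.repeat(np.arange(J), n_j)
X = np.column_stack([np.ones(J * n_j), rng.normal(size=J * n_j)])
y = X @ np.array([2.0, 0.5]) + rng.normal(size=J)[group] + rng.normal(scale=0.7, size=J * n_j)
print(igls_random_intercept(y, X, group))
```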
Funding: Supported by grants from the National Natural Science Foundation of China (No. 72104150), the MOE Project of Key Research Institute of Humanities and Social Sciences (No. 22JJD910001), the CAMS Innovation Fund for Medical Sciences (No. 2021-I2M-1-010), the Natural Science Foundation of Beijing (No. 7204249), and the Platform of Public Health & Disease Control and Prevention, Major Innovation & Planning Interdisciplinary Platform for the "Double-First Class" Initiative, Renmin University of China.
Abstract: To the Editor: Esophageal cancer, one of the most common cancer types in China, with an estimated 346,633 new cases and 323,600 deaths in 2022, is becoming an increasingly serious clinical and public health problem.[1] The successful promotion of the self-management strategy has indicated that lifestyle modifications can be valuable in the primary prevention of cancer. Adopting a healthy lifestyle has become a novel strategy for primary prevention and risk reduction in high-risk areas. Previous epidemiological studies have identified several lifestyle-related risk factors for esophageal cancer, including smoking and diet.[2] Each factor typically explains a modest proportion of cancer risk; when combined, however, these known risk factors may substantially affect the risk of esophageal cancer. Nevertheless, some risk factors for esophageal cancer are non-modifiable, including age, low socioeconomic status, and family history. Whether and how these non-modifiable risk factors affect primary cancer prevention achieved by intervening on modifiable risk factors remains unclear.
Funding: Supported by the National Natural Science Foundation of China [grant numbers 72301070, 72171226, 12271012, 12171020, 12071477, 72371241, 72222009, 71991472, and 12331009], the National Statistical Science Research Project [grant number 2023LD008], the Fundamental Research Funds for the Central Universities in UIBE [grant number CXTD13-04], the MOE Project of Key Research Institute of Humanities and Social Sciences [grant number 22JJD110001], the Program for Innovation Research, the disciplinary funding and the Emerging Interdisciplinary Project of Central University of Finance and Economics, and the Postdoctoral Fellowship Program of CPSF [grant numbers GZC20231522, GZC20230111, and GZB20230070].
Abstract: This paper presents a selective review of statistical computation methods for massive data analysis. A huge number of statistical methods for massive data computation have been developed in the past decades. In this work, we focus on three categories of statistical computation methods: (1) distributed computing, (2) subsampling methods, and (3) minibatch gradient techniques. The first class of literature concerns distributed computing and focuses on the situation where the dataset is too large to be comfortably handled by a single computer, so a distributed computation system with multiple computers has to be utilized. The second class concerns subsampling methods for the situation where the dataset is small enough to be stored on a single computer but too large to be easily processed in its memory as a whole. The last class studies minibatch-gradient-related optimization techniques, which have been extensively used for optimizing various deep learning models.
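A minimal sketch of the minibatch gradient idea the review covers, shown for least squares: each step uses the gradient on a random batch rather than the full dataset, so the per-iteration cost is O(batch size) instead of O(n). Batch size, learning rate, and epoch count are illustrative.

```python
import numpy as np

def minibatch_sgd(X, y, batch=64, lr=0.1, epochs=20, rng=None):
    rng = rng or np.random.default_rng()
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(epochs):
        # One pass over the data in shuffled batches.
        for idx in np.array_split(rng.permutation(n), n // batch):
            g = 2 * X[idx].T @ (X[idx] @ beta - y[idx]) / len(idx)  # batch gradient
            beta -= lr * g
    return beta

rng = np.random.default_rng(6)
X = rng.normal(size=(10_000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(size=10_000)
print(minibatch_sgd(X, y, rng=rng))   # close to the true coefficients
```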
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 12301323, 12261011, and 12131001, and the MOE Project of Key Research Institute of Humanities and Social Sciences under Grant No. 22JJD110001.
Abstract: Space-filling designs are popular for computer experiments, and designs with good two-dimensional projections are preferred because two-factor interactions are more likely to be important in practice than three- or higher-order interactions. With two-dimensional projections in mind, the authors propose a new class of designs called group strong orthogonal arrays. A group strong orthogonal array enjoys an attractive two-dimensional space-filling property in the sense that it can be partitioned into groups such that any two columns achieve stratifications on s^{u_1} × s^{u_2} grids for all positive integers u_1, u_2 with u_1 + u_2 = 3, and any two columns from different groups achieve stratifications on s^{v_1} × s^{v_2} grids for all positive integers v_1, v_2 with v_1 + v_2 = 4. Few existing designs in the literature enjoy such an appealing two-dimensional stratification property. The level numbers of the obtained designs can be s^3 or s^4. In addition to the attractive stratification property, the proposed designs perform very well under orthogonality and uniform projection criteria and are flexible in run sizes, rendering them highly suitable for computer experiments.
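A small numerical check of what "achieving stratification on an s^{u_1} × s^{u_2} grid" means, using a trivially constructed column pair rather than an actual group strong orthogonal array: collapse two s^3-level columns to a coarser grid and verify that every cell receives the same number of points.

```python
import numpy as np

def stratifies(col1, col2, s, u1, u2):
    """True if the column pair places equal counts in every cell of the
    s**u1 x s**u2 grid; columns take levels 0, ..., s**3 - 1."""
    c1 = col1 // s ** (3 - u1)                 # collapse to s**u1 coarse levels
    c2 = col2 // s ** (3 - u2)
    counts = np.zeros((s ** u1, s ** u2), dtype=int)
    np.add.at(counts, (c1, c2), 1)
    return counts.min() == counts.max()

s = 2
q = s ** 3                                     # number of levels per column
# Trivial example: the full factorial of a pair of q-level columns (q*q runs)
# stratifies on every grid; a real group strong orthogonal array achieves the
# stated stratifications with far fewer runs.
col1, col2 = [a.ravel() for a in np.meshgrid(np.arange(q), np.arange(q))]
for u1, u2 in [(1, 2), (2, 1)]:
    print((u1, u2), stratifies(col1, col2, s, u1, u2))
```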