Funding: Partially supported by the National Natural Science Foundation of China (Grant No. 12171079) and the National Key R&D Program of China (Grant No. 2020YFA0714102); partially supported by the National Natural Science Foundation of China (Grant No. 12101116) and the National Key Research and Development Program of China (Grant No. 2022YFA1003701).
Abstract: Gaussian graphical models (GGMs) are widely used as intuitive and efficient tools for data analysis in several application domains. To address the reproducibility of structure learning for a GGM, it is essential to control the false discovery rate (FDR) of the estimated edge set of the graph. Hence, in recent years, the problem of GGM estimation with FDR control has received increasing attention. In this paper, we propose a new GGM estimation method based on multiple data splitting. Instead of using node-by-node regressions to estimate each row of the precision matrix, we suggest directly estimating the entire precision matrix using the graphical Lasso within the multiple data splitting, which makes our computation p times faster than the previous approach. We show that the proposed method asymptotically controls the FDR and has significant advantages in computational efficiency. Finally, we demonstrate the usefulness of the proposed method through a real data analysis.
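The abstract does not state the edge-selection rule, so the following is only a minimal sketch of one common idea from the data-splitting FDR literature, the mirror statistic, and is not taken from the paper. It assumes a single split (the paper uses multiple splits), and the arrays `b1` and `b2` are synthetic stand-ins for the precision-matrix entries that a graphical Lasso fit would produce on each half of the data. The function names `mirror_stats` and `select_with_fdr` are hypothetical.

```python
import numpy as np

def mirror_stats(b1, b2):
    """sign(b1*b2) * (|b1| + |b2|): large and positive when both halves agree."""
    return np.sign(b1 * b2) * (np.abs(b1) + np.abs(b2))

def select_with_fdr(m, q=0.1):
    """Use the smallest cutoff t with #{m <= -t} / max(#{m >= t}, 1) <= q."""
    for t in np.sort(np.abs(m[m != 0])):
        fdp_hat = np.sum(m <= -t) / max(np.sum(m >= t), 1)
        if fdp_hat <= q:
            return np.flatnonzero(m >= t)   # indices of selected candidate edges
    return np.array([], dtype=int)

# Synthetic example (assumption): 200 candidate edges, the first 20 are real.
rng = np.random.default_rng(0)
n_edges, n_true = 200, 20
signal = np.zeros(n_edges)
signal[:n_true] = 1.0
b1 = signal + rng.normal(0, 0.1, n_edges)   # stand-in for estimates on half 1
b2 = signal + rng.normal(0, 0.1, n_edges)   # stand-in for estimates on half 2

m = mirror_stats(b1, b2)
selected = select_with_fdr(m, q=0.1)
print(len(selected), "edges selected")
```

Null edges produce mirror statistics that are roughly symmetric around zero, so the count of large negative values serves as an estimate of the number of false positives among the large positive ones.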
Abstract: Objective: Since COVID-19 was identified in December 2019 and became a pandemic, over 4500 research papers have been published with the term “COVID-19” in their titles. Many of these reports suggested that the coronavirus was associated with more serious chronic diseases and higher mortality, particularly in patients with chronic diseases, regardless of country and age. Therefore, there is a need to understand how common comorbidities and other factors are associated with the risk of death due to COVID-19 infection. Our investigation explores this relationship. Specifically, our analysis examines the relationship between the total number of COVID-19 cases and the mortality associated with COVID-19 infection, accounting for other risk factors. Methods: Because of overdispersion, negative binomial regression is used to model the aggregate number of COVID-19 cases. Case fatality associated with this infection is modeled as an outcome variable using machine learning predictive multivariable regression. The data are the COVID-19 cases and associated deaths from the start of the pandemic up to December 2, 2020, the day Pfizer was granted approval for its new COVID-19 vaccine. Results: Our analysis found significant regional variation in case fatality. Moreover, the aggregate number of cases had several risk factors, including chronic kidney disease, population density and the percentage of gross domestic product spent on healthcare. Conclusions: There are important regional variations in COVID-19 case fatality. We identified three factors to be significantly correlated with case fatality.
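The overdispersion mentioned in the Methods can be illustrated with a quick moment check. The sketch below is an assumption-laden illustration, not the study's analysis: it uses synthetic counts rather than the COVID-19 data, and the helper `overdispersion_summary` is hypothetical. It relies on the standard NB2 variance form Var(Y) = mu + alpha * mu^2, so a variance-to-mean ratio well above 1 signals that a Poisson model would understate the spread.

```python
import numpy as np

def overdispersion_summary(counts):
    """Return (mean, variance, variance/mean ratio, moment estimate of NB2 alpha)."""
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean()
    var = counts.var(ddof=1)
    # Solve var = mean + alpha * mean**2 for alpha; clip at 0 when not overdispersed.
    alpha = max((var - mean) / mean**2, 0.0)
    return mean, var, var / mean, alpha

# Synthetic counts (assumption): negative binomial with true alpha = 1/n = 0.5.
rng = np.random.default_rng(0)
counts = rng.negative_binomial(n=2, p=0.1, size=20000)

mean, var, ratio, alpha = overdispersion_summary(counts)
print(f"mean={mean:.2f} var={var:.2f} ratio={ratio:.2f} alpha={alpha:.3f}")
```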
Abstract: In developing Artificial Neural Networks (ANNs), the available dataset is split into three categories: training, validation and testing. However, an important problem arises: how far can the prediction provided by a particular ANN be trusted? Due to the randomness related to the network itself (architecture, initialization and learning procedure), there is usually no best choice. Considering this issue, we provide a framework that captures the randomness related to the network itself. The idea is to perform several training and test trials based on the Jackknife resampling method. Jackknife consists of iteratively deleting a single observation from the sample and recomputing the ANN on the rest of the sample data. Consequently, interval prediction is available instead of point prediction. The proposed method was applied and tested using pH, Ca and P data obtained by analyzing 118 georeferenced soil points. The results, based on the dataset size simulation, showed that a 60% reduction in the available dataset offers comparable accuracy to the full dataset, and therefore a higher cost of sampling in the field would not be necessary. The resampling method spatially characterizes the points of greater or lesser accuracy and uncertainty. The resampling method increased the success rate by using interval prediction instead of using the mean as the most probable value. Although we restrict it to the regression neural network model, the proposed resampling method can also be extended to other modern statistical tools, such as Kriging, Least Squares Collocation (LSC) and Convolutional Neural Networks (CNNs).
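The Jackknife procedure described above (delete one observation, refit, repeat) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: a least-squares linear model stands in for the ANN to keep the example self-contained, the data are synthetic (only the sample size of 118 echoes the abstract), and the percentile-based interval and the function names are hypothetical choices.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with intercept; a stand-in for training an ANN."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef

def jackknife_interval(X, y, X_new, lower=2.5, upper=97.5):
    """Refit n times, each time leaving one observation out.

    Returns (low, mean, high) predictions for every row of X_new.
    """
    n = len(X)
    preds = np.empty((n, len(X_new)))
    for i in range(n):
        keep = np.arange(n) != i                 # delete observation i
        coef = fit_linear(X[keep], y[keep])
        preds[i] = predict_linear(coef, X_new)
    low = np.percentile(preds, lower, axis=0)
    high = np.percentile(preds, upper, axis=0)
    return low, preds.mean(axis=0), high

# Synthetic data (assumption): 118 points with two features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(118, 2))
y = 1.0 + 2.0 * X[:, 0] - X[:, 1] + rng.normal(0, 0.1, size=118)
X_new = np.array([[0.5, 0.5], [0.1, 0.9]])

low, mean, high = jackknife_interval(X, y, X_new)
for lo_i, m_i, hi_i in zip(low, mean, high):
    print(f"[{lo_i:.3f}, {hi_i:.3f}] around {m_i:.3f}")
```

The spread of the leave-one-out predictions is what turns a single point prediction into an interval. With an ANN, each refit would also vary through its random initialization.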
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 12271456, 12371270 and 71988101), the Ministry of Education Research in the Humanities and Social Sciences (Grant No. 22YJA910002) and the Shanghai Science and Technology Development Funds (Grant No. 23JC1402100).
Abstract: Conditional dependence plays a crucial role in various statistical procedures, including variable selection, network analysis and causal inference. However, there remains a paucity of relevant research in the context of high-dimensional conditioning variables, a common challenge in the era of big data. To address this issue, many existing studies impose certain model structures, yet high-dimensional conditioning variables often introduce spurious correlations in these models. In this paper, we systematically study the estimation biases inherent in widely used measures of conditional dependence when spurious variables are present in high-dimensional settings. We discuss the estimation inconsistency both intuitively and theoretically, demonstrating that conditional dependencies can be either overestimated or underestimated under different scenarios. To mitigate these biases and attain consistency, we introduce a measure of high-dimensional conditional dependence based on data splitting and refitting techniques. A conditional independence test is also developed using the newly advocated measure, with a tuning-free asymptotic null distribution. Furthermore, the proposed test is applied to generating high-dimensional network graphs in graphical modeling. The superior performance of the newly proposed methods is illustrated both theoretically and through simulation studies. We also use the method to construct gene-gene networks from a dataset of breast invasive carcinoma, which yields interesting discoveries worth further scientific exploration.
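The split-and-refit idea can be illustrated with a small sketch. The abstract does not define the proposed measure, so everything below is an assumption made for illustration: conditioning variables are screened by marginal correlation on one half of the data, and a partial correlation is then computed from least-squares residuals refitted on the other half. The function names, the screening rule and the choice of `k` are hypothetical, and the data are synthetic.

```python
import numpy as np

def abs_corr(Z, v):
    """Absolute Pearson correlation between each column of Z and the vector v."""
    Zc = (Z - Z.mean(axis=0)) / Z.std(axis=0)
    vc = (v - v.mean()) / v.std()
    return np.abs(Zc.T @ vc) / len(v)

def residualize(v, Z):
    """Residual of v after least-squares regression on Z (with intercept)."""
    A = np.column_stack([np.ones(len(v)), Z])
    coef, *_ = np.linalg.lstsq(A, v, rcond=None)
    return v - A @ coef

def split_refit_partial_corr(x, y, Z, k=5, seed=0):
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(x))
    half_a, half_b = perm[: len(x) // 2], perm[len(x) // 2 :]
    # Half A: keep the k conditioning variables most correlated with x or y.
    score = np.maximum(abs_corr(Z[half_a], x[half_a]), abs_corr(Z[half_a], y[half_a]))
    selected = np.argsort(score)[-k:]
    # Half B: refit on the selected variables only and correlate the residuals.
    rx = residualize(x[half_b], Z[half_b][:, selected])
    ry = residualize(y[half_b], Z[half_b][:, selected])
    return np.corrcoef(rx, ry)[0, 1]

# Synthetic data (assumption): x and y depend on Z[:, 0] only, so they are
# marginally correlated but conditionally independent given Z.
rng = np.random.default_rng(1)
n, p = 2000, 50
Z = rng.normal(size=(n, p))
x = Z[:, 0] + rng.normal(size=n)
y = Z[:, 0] + rng.normal(size=n)

marginal = np.corrcoef(x, y)[0, 1]
partial = split_refit_partial_corr(x, y, Z)
print(f"marginal={marginal:.3f} partial={partial:.3f}")
```

Selecting variables and estimating dependence on disjoint halves keeps variables that were picked up by chance in the screening step from biasing the refitted estimate.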