A new three-parameter discrete distribution called the zero-inflated cosine geometric (ZICG) distribution is proposed here for the first time. It can be used to analyze over-dispersed count data with excess zeros. The basic statistical properties of the new distribution, such as the moment generating function, mean, and variance, are presented. Furthermore, confidence intervals for the parameters of the ZICG distribution are constructed using the Wald, Bayesian, and highest posterior density (HPD) methods. Their efficacy was investigated using both simulated data and real-world data comprising the number of daily COVID-19 positive cases at the Tokyo 2020 Olympic Games. The results show that the HPD interval performed better than the other methods in terms of coverage probability and average length in most cases studied.
In this study, we investigate the effects of missing data when estimating HIV/TB co-infection. We revisit the concept of missing data and examine three available approaches for dealing with missingness. The main objective is to identify the best method for correcting missing data in the TB/HIV co-infection setting. We employ both empirical data analysis and an extensive simulation study to examine the effects of missing data on the accuracy, sensitivity, specificity, and train and test error of the different approaches. The novelty of this work hinges on the use of modern statistical learning algorithms when treating missingness. In the empirical analysis, both HIV data and TB-HIV co-infection data were imputed, and the missing values were imputed using different approaches. In the simulation study, sets of 0% (complete case), 10%, 30%, 50% and 80% of the data were drawn randomly and replaced with missing values. Results show that the complete-case analysis gave a co-infection rate (95% confidence interval) of 29% (25%, 33%), the weighted method 27% (23%, 31%), the likelihood-based approach 26% (24%, 28%) and the multiple imputation (MI) approach 21% (20%, 22%). In conclusion, MI remains the best approach for dealing with missing data, and failure to apply it results in overestimation of the HIV/TB co-infection rate by 8 percentage points (29% versus 21%).
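The simulation design described in this abstract (randomly replacing a fixed share of the data with missing values) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the missingness mechanism or the data, so the sketch assumes values are removed uniformly at random, and the function name `inject_mcar`, the binary indicator, and the 0.25 prevalence are all hypothetical.

```python
import numpy as np

def inject_mcar(values, fraction, rng):
    """Return a float copy of `values` with `fraction` of the entries set to NaN.

    Entries are chosen uniformly at random without replacement, i.e. missing
    completely at random (an assumption; the abstract does not specify this).
    """
    out = np.asarray(values, dtype=float).copy()
    n_missing = int(round(fraction * out.size))
    idx = rng.choice(out.size, size=n_missing, replace=False)
    out[idx] = np.nan
    return out

rng = np.random.default_rng(0)
# Hypothetical binary co-infection indicator; the 0.25 prevalence is made up.
status = rng.binomial(1, 0.25, size=1000)

# Missingness levels taken from the abstract: 0%, 10%, 30%, 50%, 80%.
for fraction in (0.0, 0.1, 0.3, 0.5, 0.8):
    observed = inject_mcar(status, fraction, rng)
    rate = np.nanmean(observed)  # complete-case estimate ignores NaN entries
    print(f"{fraction:.0%} missing: complete-case rate = {rate:.3f}")
```

Under this uniform mechanism the complete-case estimate only becomes noisier as the missing share grows. A systematic overestimate like the one reported in the abstract would require missingness that depends on the data, which this sketch does not model.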
The analysis of messenger ribonucleic acid data obtained through sequencing techniques (RNA-sequencing) is very challenging. Once technical difficulties have been sorted out, an important choice has to be made during pre-processing. Two different paths can be chosen: transform RNA-sequencing count data to a continuous variable, or continue to work with count data. For each data type, analysis tools have been developed and seem appropriate at first sight, but a deeper analysis of data distribution and structure is worth discussing. In this review, open questions regarding the nature of RNA-sequencing data are discussed and highlighted, indicating important future research topics in statistics that should be addressed for a better analysis of existing and newly emerging gene expression data. Moreover, a comparative analysis of RNA-seq count and transformed data is presented. This comparison indicates that transforming RNA-seq count data seems appropriate, at least for differential expression detection.
Physical activity has been scientifically discussed as fundamental in the process of healthy ageing. Hence, this study aimed at determining the factors that influence older people to perform physical activities. The complete IPAQ (International Physical Activity Questionnaire) was applied to a population-based sample consisting of 364 elderly persons in the city of Botucatu, São Paulo, Brazil. Days of physical activity performed by the older people were considered by taking into account household and leisure activities. Models for count data were fitted by including socio-demographic variables as well as those related to life satisfaction. It was shown that housework-related physical activity is associated with women, who were predominantly more active at all levels. Men seemed more predisposed to lighter recreation, sports and leisure-time physical activities, such as walking. Additionally, poor schooling proved to be decisive for not performing physical activities both at home and during leisure.
Objectives: We introduce a special form of the Generalized Poisson Distribution. The distribution has one parameter, yet it has a variance that is larger than the mean, a phenomenon known as "over dispersion". We discuss potential applications of the distribution as a model of counts, and under the assumption of independence we perform statistical inference on the ratio of two means, with generalization to testing the homogeneity of several means. Methods: Bayesian methods depend on the choice of the prior distributions of the population parameters. In this paper, we describe a Bayesian approach for estimation and inference on the parameters of several independent Inflated Poisson (IPD) distributions with two possible priors: the first is the reciprocal of the square root of the Poisson parameter and the other is a conjugate Gamma prior. The parameters of the Gamma distribution are estimated in the empirical Bayesian framework by maximum likelihood (ML), using the nonlinear mixed model procedure (NLMIXED) in SAS. With these priors we construct the highest posterior confidence intervals on the ratio of two IPD parameters and test the homogeneity of several populations. Results: We encountered convergence problems in estimating the hyperparameters of the posterior distribution using NLMIXED. However, direct maximization of the predictive density produced solutions to the maximum likelihood equations. We apply the methodologies to RNA-seq read count data of gene expression values.
In the area of time series modelling, several real-life applications involve the analysis of count time series data. The distributional characteristics and dependence structure are the major issues that arise when specifying a modelling strategy for such data. Owing to the numerous applications, there is a need to develop models that can capture these features. However, accounting for both aspects simultaneously complicates the specification of a modelling strategy. In this paper, an alternative statistical model able to deal with discreteness, overdispersion, and serial correlation over time is proposed. In particular, we adopt a branching mechanism to develop a first-order stationary negative binomial autoregressive model. Inference is based on maximum likelihood estimation, and a simulation study is conducted to evaluate the performance of the proposed approach. As an illustration, the model is applied to a real-life dataset in crime analysis.
This paper introduces a bivariate hysteretic integer-valued autoregressive (INAR) process driven by a bivariate Poisson innovation. It deals well with the buffered or hysteretic characteristics of the data. Model properties such as stationarity and ergodicity are studied in detail. The parameter estimation problem is addressed via two-step conditional least squares (CLS) and conditional maximum likelihood (CML). The boundary parameters are estimated via a triangular grid search algorithm. The estimation performance is verified through simulations based on three scenarios. Finally, the new model is applied to offence counts in New South Wales (NSW), Australia.
Panel count data are frequently encountered when study subjects are observed only at discrete time points. However, the literature on variable selection for panel count data is limited. In this paper, without imposing a model assumption on the observation process, a more general semiparametric transformation model for panel count data with an informative observation process is developed. A penalized estimation procedure based on the quantile regression function is proposed for simultaneous variable selection and parameter estimation. The consistency and oracle properties of the estimators are established under some mild conditions. Simulations and an application are reported to evaluate the proposed approach.
The random coefficient integer-valued autoregressive process was introduced by Zheng, Basawa, and Datta in 2007. In this paper we study the asymptotic behavior of this model (in particular, weak limits of extreme values and the growth rate of partial sums) in the case where the additive term in the underlying random linear recursion belongs to the domain of attraction of a stable law.
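To make the "random linear recursion" concrete, here is a minimal simulation sketch. It assumes the usual first-order form X_t = φ_t ∘ X_{t-1} + Z_t, where ∘ denotes binomial thinning; the abstract itself does not spell out the recursion. The Beta law for φ_t, the floored Pareto-type innovation (chosen only so that the additive term is heavy-tailed), and every name and default value below are illustrative assumptions, not the paper's specification.

```python
import numpy as np

def simulate_rcinar1(n, alpha=1.5, a=2.0, b=3.0, x0=0, seed=0):
    """Simulate n steps of X_t = phi_t o X_{t-1} + Z_t (binomial thinning).

    phi_t ~ Beta(a, b) is the random coefficient (assumed law).
    Z_t = floor(Lomax(alpha)) is a heavy-tailed integer innovation (assumed);
    for alpha < 2 it has infinite variance.
    """
    rng = np.random.default_rng(seed)
    path = np.empty(n, dtype=np.int64)
    prev = int(x0)
    for t in range(n):
        phi = rng.beta(a, b)                   # random coefficient in (0, 1)
        survivors = rng.binomial(prev, phi)    # thinning: each unit kept w.p. phi
        z = int(np.floor(rng.pareto(alpha)))   # additive (innovation) term
        prev = int(survivors) + z
        path[t] = prev
    return path

path = simulate_rcinar1(1000)
# Running maxima and partial sums are the quantities whose limits the abstract studies.
print("max:", path.max(), "partial sum:", path.sum())
```

With a heavy-tailed additive term, a few very large innovations dominate both the maximum and the partial sum, which is the regime the abstract refers to.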
During an epidemic, accurate estimation of the numbers of viral infections in different regions and groups is important for understanding transmission and guiding public health actions. This depends on effective testing strategies that identify a high proportion of infections (that is, provide high ascertainment rates). For the novel coronavirus SARS-CoV-2, ascertainment rates do not appear to be high in most jurisdictions, but quantitative analysis of testing has been limited. We provide statistical models for studying testing and ascertainment rates, and illustrate them on public data on testing and case counts in Ontario, Canada.
Funding (ZICG distribution study): support from the National Science, Research and Innovation Fund (NSRF), King Mongkut's University of Technology North Bangkok (Grant No. KMUTNB-FF-65-22).
Funding (bivariate hysteretic INAR study): supported by the National Natural Science Foundation of China under Grant Nos. 12471249 and 12101417; the Natural Science Foundation of Jilin Province under Grant Nos. YDZJ202301ZYTS393 and 20220101038JC; the Postdoctoral Foundation of Jilin Province under Grant No. 2023337; and the Scientific Research Project of Jilin Provincial Department of Education under Grant No. JJKH20230665KJ.
Funding (panel count data study): partially supported by the National Natural Science Foundation of China under Grant No. 12001485; the National Bureau of Statistics of China under Grant No. 2020LY073; and the First Class Discipline of Zhejiang-A (Zhejiang University of Finance and Economics-Statistics) under Grant No. Z0111119010/024.
Funding (SARS-CoV-2 testing and ascertainment study): research was supported in part by Discovery Grant RGPIN-2017-04055 to JFL from the Natural Sciences and Engineering Research Council of Canada.