Background: Cotton is one of the most important commercial crops after food crops, especially in countries like India, where it is grown extensively under rainfed conditions. Because of its use in multiple industries, such as the textile, medicine, and automobile industries, it has great commercial importance. The crop's performance is strongly influenced by prevailing weather dynamics. As the climate changes, assessing how weather changes affect crop performance is essential. Among the various techniques available, crop models are the most effective and widely used tools for predicting yields. Results: This study compares statistical and machine learning models to assess their ability to predict cotton yield across the major producing districts of Karnataka, India, using a long-term dataset (1990-2023) of yield and weather factors. Artificial neural networks (ANNs) performed best, with yield deviations within the acceptable ±10% range during both the vegetative stage (F1) and mid stage (F2) of cotton. Model evaluation metrics such as root mean square error (RMSE), normalized root mean square error (nRMSE), and modelling efficiency (EF) were also within acceptance limits in most districts. Furthermore, the tested ANN model was used to assess the importance of the dominant weather factors influencing crop yield in each district. Specifically, morning relative humidity as an individual parameter, and its interaction with maximum and minimum temperature, had a major influence on cotton yield in most of the districts where yield was predicted. These differences highlight the distinct interactions of weather factors in cotton yield formation in each district, reflecting the individual response to each weather factor under the different soil and management conditions across the major cotton-growing districts of Karnataka. Conclusions: Compared with statistical models, machine learning models such as ANNs showed higher efficiency in forecasting cotton yield because of their ability to consider the interactive effects of weather factors on yield formation at different growth stages. This highlights the suitability of ANNs for yield forecasting under rainfed conditions and for studying the relative impacts of weather factors on yield. The study thus provides valuable insights to support stakeholders in planning effective crop management strategies and formulating relevant policies.
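For reference, the evaluation metrics named in this entry have standard definitions; the sketch below assumes the usual variants (nRMSE normalized by the mean observed yield, EF in the Nash-Sutcliffe form), which may differ in detail from the paper's:

```python
import numpy as np

def evaluate(obs, pred):
    """Standard agreement metrics used in crop-model evaluation."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    rmse = np.sqrt(np.mean((obs - pred) ** 2))
    nrmse = 100.0 * rmse / obs.mean()        # as % of mean observed yield
    ef = 1.0 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2)
    dev = 100.0 * (pred - obs) / obs         # per-observation yield deviation, %
    return rmse, nrmse, ef, dev

# toy example: observed vs predicted district yields (kg/ha)
rmse, nrmse, ef, dev = evaluate([520, 610, 480], [500, 640, 470])
print(rmse, nrmse, ef, dev)
```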
Feature selection (FS) is a pivotal pre-processing step in developing data-driven models, influencing reliability, performance, and optimization. Although existing FS techniques can yield high-performance metrics for certain models, they do not invariably guarantee the extraction of the most critical or impactful features. Prior literature underscores the significance of equitable FS practices and has proposed diverse methodologies for identifying appropriate features. However, the challenge of discerning the most relevant and influential features persists, particularly in the context of the exponential growth and heterogeneity of big data, a challenge that is increasingly salient in modern artificial intelligence (AI) applications. In response, this study introduces an innovative, automated statistical method termed Farea Similarity for Feature Selection (FSFS). The FSFS approach computes a similarity metric for each feature by benchmarking it against the record-wise mean, thereby capturing feature dependencies and mitigating the influence of outliers that could otherwise distort evaluation outcomes. Features are subsequently ranked according to their similarity scores, with the threshold established at the average similarity score. Notably, lower FSFS values indicate higher similarity and stronger data correlations, whereas higher values suggest lower similarity. The FSFS method is designed not only to yield reliable evaluation metrics but also to reduce data complexity without compromising model performance. Comparative analyses were performed against several established techniques, including Chi-squared (CS), Correlation Coefficient (CC), Genetic Algorithm (GA), Exhaustive Approach, Greedy Stepwise Approach, Gain Ratio, and Filtered Subset Eval, using a variety of datasets such as the Experimental Dataset, Breast Cancer Wisconsin (Original), KDD CUP 1999, NSL-KDD, UNSW-NB15, and Edge-IIoT. In the absence of the FSFS method, the highest classifier accuracies observed were 60.00%, 95.13%, 97.02%, 98.17%, 95.86%, and 94.62% for the respective datasets. When the FSFS technique was integrated with data normalization, encoding, balancing, and feature importance selection processes, accuracies improved to 100.00%, 97.81%, 98.63%, 98.94%, 94.27%, and 98.46%, respectively. The FSFS method, with a computational complexity of O(fn log n), demonstrates robust scalability and is well suited to large datasets, ensuring efficient processing even when the number of features is substantial. By automatically eliminating outliers and redundant data, FSFS reduces computational overhead, resulting in faster training and improved model performance. Overall, the FSFS framework not only optimizes performance but also enhances the interpretability and explainability of data-driven models, thereby facilitating more trustworthy decision-making in AI applications.
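The abstract does not give the Farea similarity formula itself, so the following is only a hypothetical sketch of the described pipeline: score each feature against the record-wise mean (lower score = higher similarity), rank, and keep features at or below the average score. The mean-absolute-deviation scoring used here is an illustrative stand-in, not the published metric:

```python
import numpy as np

def fsfs_rank(X):
    """Illustrative FSFS-style ranking: score each feature by its average
    absolute deviation from the record-wise mean (lower = more similar),
    then keep features scoring at or below the average score."""
    X = np.asarray(X, float)
    row_mean = X.mean(axis=1, keepdims=True)   # record-wise benchmark
    scores = np.mean(np.abs(X - row_mean), axis=0)
    keep = scores <= scores.mean()             # threshold = average score
    order = np.argsort(scores)                 # most similar features first
    return scores, keep, order

X = np.array([[1.0, 2.0, 9.0], [2.0, 2.5, 8.0], [3.0, 3.5, 1.0]])
print(fsfs_rank(X))
```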
This study presents an innovative development of the exponentially weighted moving average (EWMA) control chart, explicitly adapted for the examination of time series data distinguished by seasonal autoregressive moving average behavior, SARMA(1,1)L, under exponential white noise. Unlike previous works that rely on simplified models such as AR(1) or assume independence, this research derives, for the first time, an exact two-sided Average Run Length (ARL) formula for the Modified EWMA chart under SARMA(1,1)L conditions, using a mathematically rigorous Fredholm integral approach. The derived formulas are validated against numerical integral equation (NIE) solutions, showing strong agreement and a significantly reduced computational burden. Additionally, a performance comparison index (PCI) is introduced to assess the chart's detection capability. Results demonstrate that the proposed method exhibits superior sensitivity to mean shifts in autocorrelated environments, outperforming existing approaches. The findings offer a new, efficient framework for real-time quality control in complex seasonal processes, with potential applications in environmental monitoring and intelligent manufacturing systems.
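As a point of reference, a common way to write the process and the chart statistic discussed here is given below; sign conventions and the extra constant k of the Modified EWMA vary across papers, so treat these as assumed textbook forms rather than the paper's exact notation:

```latex
% SARMA(1,1)_L with seasonal lag L and exponential white noise
Y_t = \mu + \phi\, Y_{t-L} + \varepsilon_t + \theta\, \varepsilon_{t-L},
\qquad \varepsilon_t \sim \mathrm{Exp}(\beta)

% Modified EWMA statistic (the classical EWMA is recovered at k = 0)
Z_t = \lambda Y_t + (1 - \lambda) Z_{t-1} + k\,(Y_t - Y_{t-1}),
\qquad 0 < \lambda \le 1
```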
The objective of this study is to analyze the sensitivity of statistical models to sample size. The study, carried out in Ivory Coast, is based on annual maximum daily rainfall data collected from 26 stations. The methodological approach rests on the statistical modeling of maximum daily rainfall, with fits made for several sample sizes and several return periods (2, 5, 10, 20, 50 and 100 years). The main results show that the 30-year series (1931-1960; 1961-1990; 1991-2020) are best fitted by the Gumbel (26.92% - 53.85%) and Inverse Gamma (26.92% - 46.15%) distributions. The 60-year series (1931-1990; 1961-2020) are best fitted by the Inverse Gamma (30.77%), Gamma (15.38% - 46.15%) and Gumbel (15.38% - 42.31%) distributions. The full 1931-2020 record (90 years) shows a notable dominance of the Gumbel model (50%) over the Gamma (34.62%) and Inverse Gamma (15.38%) models. The Gumbel is the most dominant model overall, particularly in wet periods. Data for periods with normal and dry trends were better fitted by the Gamma and Inverse Gamma.
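A minimal sketch of the fit-and-quantile computation behind such results, assuming maximum-likelihood fits in SciPy and a made-up annual-maximum series:

```python
import numpy as np
from scipy import stats

# toy annual-maximum daily rainfall series (mm); real series span 30-90 years
amax = np.array([62.0, 75.4, 88.1, 70.3, 95.6, 81.2, 68.9, 104.5,
                 77.8, 90.2, 73.5, 85.0])

loc, scale = stats.gumbel_r.fit(amax)          # Gumbel (EV1) fit by MLE
for T in (2, 5, 10, 20, 50, 100):              # return periods (years)
    xT = stats.gumbel_r.ppf(1.0 - 1.0 / T, loc, scale)
    print(f"T = {T:3d} yr  quantile = {xT:6.1f} mm")

# a goodness-of-fit test, e.g. Kolmogorov-Smirnov, to compare candidate laws
print(stats.kstest(amax, "gumbel_r", args=(loc, scale)))
```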
This paper proposes a method to incorporate syntax-based language models into phrase-based statistical machine translation (SMT) systems. The syntax-based language model used in this paper is based on link grammar, a highly lexicalized formalism. In order to apply language models based on link grammar in phrase-based models, the concept of linked phrases, an extension of the concept of traditional phrases in phrase-based models, is introduced. Experiments were conducted, and the results showed that the use of syntax-based language models can greatly improve the performance of phrase-based models.
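Phrase-based SMT systems typically combine feature functions log-linearly, so the natural reading of this integration is one extra feature carrying the link-grammar language model score; the notation h_syn below is assumed for illustration, not taken from the paper:

```latex
\hat{e} = \arg\max_{e} \sum_{i=1}^{M} \lambda_i\, h_i(e, f),
\qquad
h_{\mathrm{syn}}(e, f) = \log P_{\mathrm{link}}(e)
```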
Several statistical methods have been developed for analyzing genotype × environment (GE) interactions in crop breeding programs to identify genotypes with high yield and stability performance. Four statistical methods, namely joint regression analysis (JRA), additive main effects and multiplicative interaction (AMMI) analysis, genotype plus GE interaction (GGE) biplot analysis, and the yield-stability (YSi) statistic, were used to evaluate GE interaction in 20 winter wheat genotypes grown in 24 environments in Iran. The main objective was to evaluate the rank correlations among the four statistical methods in genotype rankings for yield, stability, and yield-stability. Three kinds of genotypic ranks (yield ranks, stability ranks, and yield-stability ranks) were determined with each method. The results indicated the presence of GE interaction, suggesting the need for stability analysis. With respect to yield, the genotype rankings by the GGE biplot and AMMI analysis were significantly correlated (P<0.01). For stability ranking, the rank correlations ranged from 0.53 (GGE-YSi; P<0.05) to 0.97 (JRA-YSi; P<0.01). AMMI distance (AMMID) was highly correlated (P<0.01) with the variance of regression deviation (S2di) in JRA (r=0.83) and Shukla's stability variance (σ2) in YSi (r=0.86), indicating that these stability indices can be used interchangeably. No correlation was found between yield ranks and stability ranks (AMMID, S2di, σ2, and GGE stability index), indicating that they measure static stability and accordingly could be used if selection is based primarily on stability. For yield-stability, rank correlation coefficients among the statistical methods varied from 0.64 (JRA-YSi; P<0.01) to 0.89 (AMMI-YSi; P<0.01), indicating that AMMI and YSi were closely associated in ranking genotypes for integrating yield with stability performance. Based on the results, it can be concluded that YSi was closely correlated with (i) JRA in ranking genotypes for stability and (ii) AMMI in integrating yield and stability.
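Rank correlations of this kind are ordinarily Spearman coefficients computed between the genotype orderings produced by two methods; a small sketch with made-up ranks:

```python
from scipy.stats import spearmanr

# hypothetical stability ranks of 8 genotypes under two methods
ammid = [1, 3, 2, 5, 4, 7, 6, 8]   # AMMI-distance ranks
s2di  = [2, 3, 1, 4, 6, 7, 5, 8]   # deviation-from-regression ranks (JRA)

rho, p = spearmanr(ammid, s2di)
print(f"Spearman rho = {rho:.2f}, P = {p:.3f}")
```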
[Objective] The study aimed to compare several statistical analysis models for estimating the genotypic stability of sugarcane (Saccharum spp.). [Method] The data of the 2009 sugarcane regional trials in Guangdong were analyzed with three models: the Finlay-Wilkinson model, the additive main effects and multiplicative interaction (AMMI) model, and the linear regression-principal components analysis (LR-PCA) model, so as to compare the models. [Result] The Finlay-Wilkinson model was simpler, but the analyses from the other two models were more comprehensive, and there were slight differences between the AMMI model and the LR-PCA model. [Conclusion] In practice, the proper statistical method is usually chosen according to the data at hand, but the same data should also be analyzed with different statistical methods so that a more reasonable result can be obtained by comparison.
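For context, the Finlay-Wilkinson (joint regression) model regresses each genotype's performance on an environmental index; a common textbook form (notation assumed here) is:

```latex
Y_{ij} = \mu + g_i + \beta_i E_j + \delta_{ij}
```

where Y_ij is the yield of genotype i in environment j, g_i the genotype main effect, E_j the environmental index (environment mean minus grand mean), β_i the genotype's regression slope (β_i near 1 indicating average stability), and δ_ij the deviation from regression.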
QTL mapping for seven quality traits was conducted using 254 recombinant inbred lines (RILs) derived from the japonica-japonica rice cross Xiushui 79/C Bao. The seven traits investigated were grain length (GL), grain length-to-width ratio (LWR), chalky grain rate (CGR), chalkiness degree (CD), gelatinization temperature (GT), amylose content (AC), and gel consistency (GC) of head rice. The three mapping methods employed were composite interval mapping based on a mixed linear model in QTLMapper 2.0 (MCIM), inclusive composite interval mapping based on a stepwise regression linear model in QTL IciMapping 3.0 (ICIM), and multiple interval mapping with regression forward selection based on multiple regression analysis in Windows QTL Cartographer 2.5 (MIMR). Results showed that five QTLs with additive effects (A-QTLs) were detected by all three methods simultaneously, two by two methods simultaneously, and 23 by only one method. Five A-QTLs were detected by MCIM, nine by ICIM, and 28 by MIMR. The contribution rates of single A-QTLs ranged from 0.89% to 38.07%. None of the QTLs with epistatic effects (E-QTLs) detected by MIMR were detected by the other two methods. Fourteen pairs of E-QTLs were detected by both MCIM and ICIM, and 142 pairs were detected by only one method. Twenty-five pairs of E-QTLs were detected by MCIM, 141 pairs by ICIM, and four pairs by MIMR. The contribution rates of single pairs of E-QTLs ranged from 2.60% to 23.78%. In the Xiu-Bao RIL population, epistatic effects played a major role in the variation of GL and CD, additive effects were dominant in the variation of LWR, and both epistatic and additive effects were of equal importance in the variation of CGR, AC, GT, and GC. QTLs detected by two or more methods simultaneously are highly reliable and could be applied to improve quality traits in japonica hybrid rice.
Background: Survival from birth to slaughter is an important economic trait in commercial pig production. Increasing survival can improve both economic efficiency and animal welfare. The aim of this study is to explore the impact of genotyping strategies and statistical models on the accuracy of genomic prediction for survival in pigs during the total growing period from birth to slaughter. Results: We simulated pig populations with different direct and maternal heritabilities and used a linear mixed model, a logit model, and a probit model to predict genomic breeding values of pig survival based on individual survival records with binary outcomes (0, 1). The results show that when only live animals have genotype data, unbiased genomic predictions can be achieved by using variances estimated from a pedigree-based model. Models using genomic information achieved up to 59.2% higher accuracy of estimated breeding values than the pedigree-based model, depending on the genotyping scenario. The scenario of genotyping all individuals, both dead and alive, obtained the highest accuracy. When an equal number of individuals (80%) were genotyped, a random sample of genotyped individuals achieved higher accuracy than genotyping only live individuals. The linear, logit, and probit models achieved similar accuracy. Conclusions: Genomic prediction of pig survival is feasible when only live pigs have genotypes, but genomic information from dead individuals can increase the accuracy of genomic prediction by 2.06% to 6.04%.
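A common way to connect a binary survival record to the three models named here is through a liability (threshold) formulation; the split into direct and maternal genetic effects below is an assumed illustration consistent with the simulated heritabilities, not the paper's exact model:

```latex
\boldsymbol{\ell} = \mathbf{X}\mathbf{b} + \mathbf{Z}_d\,\mathbf{a}_d
                  + \mathbf{Z}_m\,\mathbf{a}_m + \mathbf{e},
\qquad
\Pr(y_i = 1) = \Phi(\ell_i) \ \text{(probit)}
\quad\text{or}\quad
\Pr(y_i = 1) = \frac{1}{1 + e^{-\ell_i}} \ \text{(logit)}
```

Here ℓ is the unobserved liability, b the fixed effects, and a_d and a_m the direct and maternal additive genetic effects; the linear mixed model instead fits the 0/1 record directly as the response.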
The water resources of the Nadhour-Sisseb-El Alem Basin in Tunisia are subject to semi-arid and arid climatic conditions. These conditions induce excessive pumping of groundwater, which creates drops in the water level of about 1-2 m/a. Such unfavorable conditions require interventions to rationalize integrated management in decision making. The aim of this study is to determine a water recharge index (WRI), delineate the potential groundwater recharge area, and estimate the potential groundwater recharge rate, based on the integration of statistical models derived from remote sensing imagery, GIS digital data (e.g., lithology, soil, runoff), measured artificial recharge data, fuzzy set theory, and multi-criteria decision making (MCDM) using the analytical hierarchy process (AHP). Eight factors affecting potential groundwater recharge were determined, namely lithology, soil, slope, topography, land cover/use, runoff, drainage, and lineaments. The WRI ranges between 1.2 and 3.1 and is classified into five classes (poor, weak, moderate, good, and very good sites of potential groundwater recharge). The very good and good classes occupied 27% and 44% of the study area, respectively. The potential groundwater recharge rate was 43% of total precipitation. According to the results of the study, river beds are favorable sites for groundwater recharge.
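Index methods of this kind usually reduce to a weighted linear overlay of reclassified factor maps, with weights from the AHP pairwise-comparison step; a per-cell sketch with hypothetical weights and ratings (not the paper's values):

```python
# hypothetical AHP weights for the eight factors (sum to 1)
weights = {"lithology": 0.22, "soil": 0.15, "slope": 0.14, "topography": 0.12,
           "landcover": 0.10, "runoff": 0.10, "drainage": 0.09, "lineaments": 0.08}

# factor ratings for one grid cell, rescaled to class scores in [1, 3]
ratings = {"lithology": 2.8, "soil": 2.1, "slope": 1.6, "topography": 2.4,
           "landcover": 2.0, "runoff": 1.8, "drainage": 2.5, "lineaments": 1.2}

# weighted linear overlay; the resulting WRI is then binned into the five
# classes (poor ... very good) over the reported 1.2-3.1 range
wri = sum(weights[f] * ratings[f] for f in weights)
print(f"WRI = {wri:.2f}")
```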
Forecasting the movement of the stock market is a long-standing, attractive topic. This paper implements different statistical learning models to predict the movement of the S&P 500 index. The S&P 500 index is influenced by other important financial factors across the world, such as commodity prices and financial technical indicators. This paper systematically investigates four supervised learning models, namely Logistic Regression, Gaussian Discriminant Analysis (GDA), Naive Bayes, and Support Vector Machine (SVM), for forecasting the S&P 500 index. After several rounds of optimization of features and models, in particular SVM kernel selection and per-model feature selection, the paper concludes that an SVM model with a Radial Basis Function (RBF) kernel can achieve an accuracy rate of 62.51% for the future market trend of the S&P 500 index.
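A minimal sketch of the winning configuration, an RBF-kernel SVM on standardized features, using scikit-learn and synthetic stand-in data; the feature set, labels, and hyperparameters here are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# stand-in data: rows = days, columns = technical/commodity indicators;
# y = next-day index direction (1 up, 0 down)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=1000) > 0).astype(int)

# shuffle=False keeps the time order, avoiding look-ahead in the test split
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, shuffle=False)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(Xtr, ytr)
print(f"directional accuracy: {clf.score(Xte, yte):.4f}")
```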
Road crash prediction models are very useful tools in highway safety, given their potential for determining both the frequency of crash occurrence and the severity of crashes. Crash frequency models predict the number of crashes that would occur on a specific road segment or intersection in a given time period, while crash severity models generally explore the relationship between injury severity and contributing factors such as driver behavior, vehicle characteristics, roadway geometry, and road-environment conditions. Effective interventions to reduce the crash toll include designing safer infrastructure and incorporating road safety features into land-use and transportation planning; improving vehicle safety features; improving post-crash care for victims of road crashes; and improving driver behavior, such as setting and enforcing laws on key risk factors and raising public awareness. Despite the great efforts that transportation agencies put into preventive measures, the annual number of traffic crashes has not yet significantly decreased. For instance, 35,092 traffic fatalities were recorded in the US in 2015, an increase of 7.2% over the previous year. Against this backdrop, this paper presents an overview of the road crash prediction models used by transportation agencies and researchers, to give a better understanding of the techniques used in predicting road accidents and the risk factors that contribute to crash occurrence.
Recently, some results have been obtained with Monte Carlo statistical experiments in ocean engineering design. The results show that Monte Carlo statistical experiments can be widely used for estimating the parameters of wave statistical distributions, checking the probability model of the long-term extreme wave distribution under typhoon conditions, and calculating the failure probability of ocean platforms.
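The first use named, parameter estimation for wave statistical distributions, is easy to illustrate: draw repeated synthetic samples from a known law, refit each, and inspect the bias and spread of the estimates. The Weibull law and the values below are illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_shape, true_scale = 1.8, 2.5          # hypothetical wave-height law (m)
estimates = []
for _ in range(500):                       # 500 Monte Carlo replications
    sample = stats.weibull_min.rvs(true_shape, scale=true_scale,
                                   size=100, random_state=rng)
    shape_hat, _, scale_hat = stats.weibull_min.fit(sample, floc=0)
    estimates.append((shape_hat, scale_hat))

est = np.array(estimates)
print("mean estimate:", est.mean(axis=0))  # compare against (1.8, 2.5)
print("sampling std :", est.std(axis=0))   # spread of the estimator
```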
The establishment of effective null models can provide reference networks for accurately describing the statistical properties of real-life signed networks. At present, the two classical null models of signed networks (i.e., the sign- and full-edge-randomized models) shuffle the positive and negative topologies at the same time, so it is difficult to distinguish the effects on network topology of the positive edges, the negative edges, and the correlation between them. In this study, we construct three refined edge-randomized null models that only randomize link relationships, without changing the positive and negative degree distributions. The results for nontrivial statistical indicators of signed networks, such as average degree connectivity and clustering coefficient, show that the position of positive edges has a stronger effect on the positive-edge topology, while the signs of negative edges have a greater influence on the negative-edge topology. For some specific statistics (e.g., embeddedness), the results indicate that the proposed null models describe real-life networks more accurately than the two existing ones, and they can be selected to facilitate a better understanding of the complex structures, functions, and dynamical behaviors of signed networks.
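A simplified sketch of the degree-preserving idea using networkx: rewiring each sign layer separately with double-edge swaps keeps every node's positive and negative degree fixed. Edge collisions between the two layers are ignored here, and the paper's refined null models presumably handle such details differently:

```python
import networkx as nx

def signed_degree_preserving_null(G, nswap=1000, seed=0):
    """Rewire a signed graph (edges carry d['sign'] in {+1, -1}) so that
    every node keeps both its positive and its negative degree."""
    null = nx.Graph()
    null.add_nodes_from(G.nodes())
    for sign in (+1, -1):
        layer = nx.Graph([(u, v) for u, v, d in G.edges(data=True)
                          if d["sign"] == sign])
        # degree-preserving rewiring inside this sign layer only
        # (layer must have at least 4 nodes and 2 edges for the swap)
        nx.double_edge_swap(layer, nswap=nswap, max_tries=100 * nswap, seed=seed)
        null.add_edges_from(layer.edges(), sign=sign)
    return null
```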
The cause-effect relationship is not always possible to trace in GCMs because of the simultaneous inclusion of several highly complex physical processes. Furthermore, the inter-GCM differences are large, and there is no simple way to reconcile them. Simple climate models, like statistical-dynamical models (SDMs), therefore appear useful in this context. These models are essentially mechanistic, being directed towards understanding the dependence of a particular mechanism on the other parameters of the problem. In this paper, the utility of SDMs for studies of climate change is discussed in some detail. We show that these models are an indispensable part of the hierarchy of climate models.
This paper carries out a critical analysis of the problems that arise in matching the classical models of statistical and phenomenological thermodynamics. The analysis shows that some concepts of the statistical and phenomenological methods of describing classical systems do not fully correlate with each other. In particular, the two methods employ different caloric ideal gas equations of state, and the possibility, existing in thermodynamic cyclic processes, of obtaining the same distributions either through a change in particle concentration or through a change in temperature is not allowed for in the statistical methods. The above-mentioned difference between the equations of state is removed by using, in the statistical functions corresponding to the canonical Gibbs equations, a new scale factor in place of Planck's constant, one that depends on the parameters of the system and coincides with Planck's constant as the system goes over to the degenerate state. Under such an approach, the statistical entropy is transformed into one of the forms of heat capacity. In turn, reconciling the two methods on the question of how the molecular distributions depend on particle concentration will apparently call for further refinement of the physical model of the ideal gas and of the techniques for its statistical description.
Lexicalized reordering models are very important components of phrase-based translation systems. Conventional methods learn these models from a word-aligned bilingual corpus by examining the reordering relationships between adjacent phrases, while ignoring the effect of the number of adjacent bilingual phrases. In this paper, we propose a method that takes the number of adjacent phrases into account for better estimation of reordering models. Instead of just checking whether there is one phrase adjacent to a given phrase, our method first uses a compact structure named the reordering graph to represent all phrase segmentations of a parallel sentence; the effect of the number of adjacent phrases can then be quantified in a forward-backward fashion and finally incorporated into the estimation of the reordering models. Experimental results on the NIST Chinese-English and WMT French-Spanish data sets show that our approach significantly outperforms the baseline method.
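Standard lexicalized reordering estimates orientation probabilities by relative frequency of counts; the reading of this abstract is that the raw counts c are replaced by expected counts accumulated forward-backward over the reordering graph (notation assumed):

```latex
p(o \mid \bar{f}, \bar{e}) =
\frac{\mathbb{E}\!\left[c(o, \bar{f}, \bar{e})\right]}
     {\sum_{o'} \mathbb{E}\!\left[c(o', \bar{f}, \bar{e})\right]},
\qquad o \in \{\text{monotone},\ \text{swap},\ \text{discontinuous}\}
```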
This contribution deals with a generative approach to the analysis of textual data. Instead of creating heuristic rules for the representation of documents and word counts, we employ a distribution able to model words along texts while considering different topics. In this regard, following Minka's proposal (2003), we implement a Dirichlet Compound Multinomial (DCM) distribution, and we then propose an extension called sbDCM that takes explicitly into account the different latent topics that compose the document. We follow two alternative approaches: on the one hand, the topics can be unknown and thus estimated from the data; on the other hand, the topics can be determined in advance on the basis of a predefined ontological schema. The two approaches are assessed on real data.
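For reference, the DCM (Dirichlet-multinomial, or Pólya) likelihood of a document's word-count vector x over a vocabulary indexed by w, as given in Minka's and related work:

```latex
P(\mathbf{x} \mid \boldsymbol{\alpha}) =
\frac{n!}{\prod_w x_w!}\,
\frac{\Gamma(A)}{\Gamma(A + n)}\,
\prod_w \frac{\Gamma(\alpha_w + x_w)}{\Gamma(\alpha_w)},
\qquad A = \sum_w \alpha_w, \quad n = \sum_w x_w
```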
Landslide susceptibility mapping is vital for landslide risk management and urban planning. In this study, we used three statistical models (frequency ratio, certainty factor, and index of entropy (IOE)) and a machine learning model (random forest (RF)) for landslide susceptibility mapping in Wanzhou County, China. First, a landslide inventory map was prepared using earlier geotechnical investigation reports, aerial images, and field surveys. Then, redundant factors were excluded from the initial fourteen landslide causal factors via factor correlation analysis. To determine the most effective causal factors, landslide susceptibility evaluations were performed for four cases with different combinations of factors ("cases"). In the analysis, 465 (70%) landslide locations were randomly selected for model training, and 200 (30%) landslide locations were used for verification. The results showed that case 3 produced the best performance for the statistical models and case 2 the best performance for the RF model. Finally, the receiver operating characteristic (ROC) curve was used to verify the accuracy of each model's results for its respective optimal case. The ROC curve analysis showed that the machine learning model performed better than the three statistical models, and among the three statistical models, the IOE model with weight coefficients was superior.
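Of the three statistical models, the frequency ratio is the simplest to state: for each class of a causal factor, FR is the share of landslide cells in that class divided by the share of all cells in that class; a minimal sketch on a toy raster:

```python
import numpy as np

def frequency_ratio(factor_class, landslide):
    """FR per class = (% of landslide cells in class) / (% of all cells in
    class). FR > 1 marks classes more landslide-prone than average."""
    factor_class = np.asarray(factor_class)
    landslide = np.asarray(landslide, bool)
    fr = {}
    for c in np.unique(factor_class):
        in_c = factor_class == c
        pct_slides = landslide[in_c].sum() / landslide.sum()
        pct_cells = in_c.sum() / in_c.size
        fr[c] = pct_slides / pct_cells
    return fr

# toy raster: slope class per cell and landslide presence per cell
print(frequency_ratio([1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
                      [0, 0, 1, 0, 1, 1, 0, 0, 0, 0]))
```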
Statistical models that use historical data on crop yields and weather to calibrate relatively simple regression equations have been widely and extensively applied in previous studies, providing a common alternative to process-based models, which require extensive input data on cultivars, management, and soil conditions. However, very few studies have systematically reviewed the previous statistical models for identifying climate contributions to crop yields. This paper introduces the three main statistical approaches, i.e., time-series models, cross-section models, and panel models, which have been used to address such issues in agrometeorology. The spatial scales of studies using statistical models can generally be categorized into two types: site scale and regional scale (e.g., global, national, provincial, and county scale). Four issues arise in identifying the response sensitivity of crop yields to climate change with statistical models: the extent of the spatial and temporal scales, removal of non-climatic trends, collinearity among climate variables, and non-consideration of adaptations. Resolutions for these four issues are put forward in the final section on perspectives for the future of statistical models.
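A panel-style sketch of the kind of regression these studies run; first-differencing each county's series is one common response to the non-climatic-trend issue listed above (the data, coefficients, and single-predictor setup are illustrative assumptions):

```python
import numpy as np
import statsmodels.api as sm

# hypothetical panel: yields (t/ha) and growing-season temperature (deg C)
# for 3 counties over 10 years, with a built-in technology trend
counties, years = 3, 10
rng = np.random.default_rng(2)
temp = rng.normal(22, 1.0, size=(counties, years))
yields = (5 + 0.05 * np.arange(years)            # non-climatic trend
          - 0.15 * temp                          # true climate sensitivity
          + rng.normal(0, 0.1, (counties, years)))

# first-differencing within each county removes the smooth trend
d_yield = np.diff(yields, axis=1).ravel()
d_temp = np.diff(temp, axis=1).ravel()
fit = sm.OLS(d_yield, sm.add_constant(d_temp)).fit()
print(fit.params)   # slope ~ yield change per +1 deg C of warming
```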