The concept of missing data is important when applying statistical methods to a dataset. Statisticians and researchers may draw inaccurate inferences about the data if missing data are not handled properly. Python and R now provide diverse packages for handling missing data. In this study, an imputation algorithm, cumulative linear regression, is proposed. The proposed algorithm is based on linear regression. It differs from existing methods in that it cumulates the imputed variables: those variables are incorporated into the linear regression equation to fill in the missing values of the next incomplete variable. The author performed a comparative study of the proposed method and those packages. Performance was measured in terms of imputation time, root-mean-square error, mean absolute error, and coefficient of determination (R^2). In an analysis of five datasets with different missing values generated from different mechanisms, performance varied with dataset size, missing percentage, and missingness mechanism. The results showed that the proposed method performed slightly better.
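The cumulative idea described in this abstract can be illustrated with a minimal sketch. It is not the author's implementation: the function name, the use of ordinary least squares via NumPy, the requirement of at least one fully observed column, and the processing order (fewest missing values first) are all assumptions made for illustration.

```python
import numpy as np

def cumulative_linear_regression_impute(X):
    """Impute NaNs column by column; each imputed column joins the predictors.

    Assumptions (not from the abstract): at least one column is fully observed,
    every incomplete column has some observed rows, and incomplete columns are
    processed from fewest to most missing values.
    """
    X = np.asarray(X, dtype=float).copy()
    missing = np.isnan(X)
    n_rows, n_cols = X.shape
    complete = [j for j in range(n_cols) if not missing[:, j].any()]
    incomplete = [j for j in range(n_cols) if missing[:, j].any()]
    if not complete:
        raise ValueError("at least one fully observed column is required")
    incomplete.sort(key=lambda j: missing[:, j].sum())

    predictors = list(complete)
    for j in incomplete:
        observed = ~missing[:, j]
        # Design matrix: intercept plus all complete and already imputed columns.
        A = np.column_stack([np.ones(n_rows), X[:, predictors]])
        coef, *_ = np.linalg.lstsq(A[observed], X[observed, j], rcond=None)
        X[~observed, j] = A[~observed] @ coef
        predictors.append(j)  # the imputed column is cumulated into the model
    return X
```

In this sketch the second incomplete column is regressed on both the fully observed column and the first imputed column, which is the "cumulating" step the abstract describes.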
Accurately filling in missing heating data is essential for ensuring data quality in applications such as energy management optimization and building efficiency analysis. Traditional machine learning methods use historical heating data as an input feature to predict the subsequent missing data. However, when the duration of missing data is long, previously estimated values are inevitably used for further imputation, leading to error accumulation and a growing deviation from true values. To overcome this problem, this paper proposes a generative network that can fill missing data based solely on weather and temporal data, without using previously imputed values for further imputation. Our method outperformed state-of-the-art models such as Seq2seq and Transformer, achieving relative normalized root mean square error (NRMSE) reductions of 1.65% to 41.38%, 0.30% to 66.43%, and 14.84% to 50.22% across three different data sources. In addition, the effect of selecting different weather variables on model performance and the benefits of transfer learning under limited data were demonstrated with the proposed method. When transfer learning was applied, the relative NRMSE reduction ranged from 3.88% to 15.85% in cold months and from 7.49% to 12.29% in warm months.
Recent advances in single-cell DNA methylation have provided unprecedented opportunities to explore cellular epigenetic differences with maximal resolution. A common workflow for single-cell DNA methylation analysis is binning the genome into multiple regions and computing the average methylation level within each region. In this process, imputing not available (NA) values, which are caused by the limited number of captured methylation sites, is a necessary preprocessing step for downstream analyses. Existing studies have employed several simple imputation methods (such as zero imputation or mean imputation); however, there is a lack of theoretical studies or benchmark tests of these approaches. Through both experiments and theoretical analysis, we found that using medians to impute NA values can effectively and simply reflect the methylation state of the NA values, providing an accurate foundation for downstream analyses.
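As a rough illustration of median imputation on a binned methylation matrix, the sketch below fills each NA with a median. The abstract does not say over which axis the median is taken; the layout (rows are cells, columns are genomic regions), the per-region median across cells, and the function name are assumptions.

```python
import numpy as np

def impute_with_region_medians(methylation):
    """Replace NaN entries with the median of the same region across cells.

    Assumed layout: rows are cells, columns are binned genomic regions.
    A region with no observed value in any cell stays NaN.
    """
    M = np.asarray(methylation, dtype=float).copy()
    medians = np.nanmedian(M, axis=0)  # one median per region (column)
    rows, cols = np.where(np.isnan(M))
    M[rows, cols] = medians[cols]
    return M
```

Zero or mean imputation, the alternatives named in the abstract, would differ only in the statistic substituted for `medians`.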
Missing data presents a crucial challenge in data analysis, especially in high-dimensional datasets, where it often leads to biased conclusions and degraded model performance. In this study, we present a novel autoencoder-based imputation framework that integrates a composite loss function to enhance robustness and precision. The proposed loss combines (i) a guided, masked mean squared error focusing on missing entries; (ii) a noise-aware regularization term to improve resilience against data corruption; and (iii) a variance penalty to encourage expressive yet stable reconstructions. We evaluate the proposed model across four missingness mechanisms (Missing Completely at Random, Missing at Random, Missing Not at Random, and Missing Not at Random with quantile censorship) under systematically varied feature counts, sample sizes, and missingness ratios ranging from 5% to 60%. Four publicly available real-world datasets (Stroke Prediction, Pima Indians Diabetes, Cardiovascular Disease, and Framingham Heart Study) were used, and the results show that the proposed model consistently outperforms baseline methods, including traditional and deep learning-based techniques. An ablation study reveals the additive value of each component in the loss function. Additionally, we assessed the downstream utility of imputed data through classification tasks, where datasets imputed by the proposed method yielded the highest receiver operating characteristic area under the curve scores across all scenarios. The model demonstrates strong scalability and robustness, improving performance with larger datasets and higher feature counts. These results underscore the capacity of the proposed method to produce imputations that are not only numerically accurate but also semantically useful, making it a promising solution for robust data recovery in clinical applications.
Single-cell RNA sequencing (scRNA-seq) provides a powerful tool to determine expression patterns of thousands of individual cells. However, the analysis of scRNA-seq data remains a computational challenge due to high technical noise, such as dropout events that lead to a large proportion of zeros for expressed genes. Taking into account cell heterogeneity and the relationship between dropout rate and expected expression level, we present a cell sub-population based bounded low-rank (PBLR) method to impute the dropouts of scRNA-seq data. Through application to both simulated and real scRNA-seq datasets, PBLR is shown to be effective in recovering dropout events, and it can dramatically improve the low-dimensional representation and the recovery of gene-gene relationships masked by dropout events compared with several state-of-the-art methods. Moreover, PBLR also detects accurate and robust cell sub-populations automatically, shedding light on its flexibility and generality for scRNA-seq data analysis.
DNA methylation is an important epigenetic modification that plays a vital role in many diseases, including cancers. With the development of high-throughput sequencing technology, much progress has been made in uncovering the relations between DNA methylation and diseases. However, the analysis of DNA methylation data is challenging due to missing values caused by the limitations of current techniques. While many methods have been developed to impute the missing values, they are mostly based on correlations between individual samples and are thus limited for the abnormal samples found in cancers. In this study, we present a novel transfer learning based neural network to impute missing DNA methylation data, namely the TDimpute-DNAmeth method. The method learns common relations between DNA methylation from pan-cancer samples and then fine-tunes the learned relations over each specific cancer type to impute the missing data. Tested on 16 cancer datasets, our method was shown to outperform other commonly used methods. Further analyses indicated that DNA methylation is related to cancer survival and thus can be used as a biomarker of cancer prognosis.
Missing values in radionuclide diffusion datasets can undermine the predictive accuracy and robustness of machine learning (ML) models. In this study, a regression-based missing data imputation method using a light gradient boosting machine (LGBM) algorithm was employed to impute more than 60% of the missing data, establishing a radionuclide diffusion dataset containing 16 input features and 813 instances. The effective diffusion coefficient (D_e) was predicted using ten ML models. The predictive accuracy of the ensemble meta-models, namely LGBM-extreme gradient boosting (XGB) and LGBM-categorical boosting (CatB), surpassed that of the other ML models, with R^2 values of 0.94. The models were applied to predict the D_e values of EuEDTA^- and HCrO_4^- in saturated compacted bentonites at compactions ranging from 1200 to 1800 kg/m^3, which were measured using a through-diffusion method. The generalization ability of the LGBM-XGB model surpassed that of LGBM-CatB in predicting the D_e of HCrO_4^-. Shapley additive explanations identified total porosity as the most significant influencing factor. Additionally, the partial dependence plot analysis technique yielded clearer results in the univariate correlation analysis. This study provides a regression imputation technique to refine radionuclide diffusion datasets, offering deeper insights into the diffusion mechanism of radionuclides and supporting the safety assessment of the geological disposal of high-level radioactive waste.
Imputation of missing data has long been an important topic and an essential application for intelligent transportation systems (ITS) in the real world. As a state-of-the-art generative model, the diffusion model has proven highly successful in image generation, speech generation, time series modelling, etc., and now opens a new avenue for traffic data imputation. In this paper, we propose a conditional diffusion model, called the implicit-explicit diffusion model, for traffic data imputation. This model exploits both the implicit and explicit features of the data simultaneously. More specifically, we design two types of feature extraction modules, one to capture the implicit dependencies hidden in the raw data at multiple time scales and the other to obtain the long-term temporal dependencies of the time series. This approach not only inherits the advantages of the diffusion model for estimating missing data, but also takes into account the multiscale correlation inherent in traffic data. To illustrate the performance of the model, extensive experiments are conducted on three real-world time series datasets using different missing rates. The experimental results demonstrate that the model improves imputation accuracy and generalization capability.
Substantial advancements have been achieved in Tunnel Boring Machine (TBM) technology and monitoring systems, yet the presence of missing data impedes accurate analysis and interpretation of TBM monitoring results. This study investigates the issue of missing data in extensive TBM datasets. Through a comprehensive literature review, we analyze the mechanism of missing TBM data and compare different imputation methods, including statistical analysis and machine learning algorithms. We also examine the impact of various missing patterns and rates on the efficacy of these methods. Finally, we propose a dynamic interpolation strategy tailored for TBM engineering sites. The results show that the K-Nearest Neighbors (KNN) and Random Forest (RF) algorithms can achieve good interpolation results; that the interpolation quality of all methods decreases as the missing rate increases; and that interpolation is poorest for block missing, followed by mixed missing, and best for sporadic missing. On-site application results validate the capability of the proposed interpolation strategy to achieve robust missing value interpolation, applicable in ML scenarios such as parameter optimization, attitude warning, and pressure prediction. These findings contribute to enhancing the efficiency of TBM missing data processing, offering more effective support for large-scale TBM monitoring datasets.
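To make the KNN option concrete, here is a minimal NumPy sketch of nearest-neighbor imputation. It is a generic illustration, not the study's strategy: the function name, the default `k=2`, the distance (root mean squared difference over jointly observed features), and the unweighted neighbor average are all assumptions.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill each NaN with the mean of that feature over the k nearest rows.

    Assumptions: distance is the root mean squared difference over features
    observed in both rows; only rows where the target feature is observed
    can act as donors; neighbors are averaged without weights.
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    nan_mask = np.isnan(X)
    for i in np.where(nan_mask.any(axis=1))[0]:
        for j in np.where(nan_mask[i])[0]:
            donors = np.where(~nan_mask[:, j])[0]
            scored = []
            for d in donors:
                shared = ~nan_mask[i] & ~nan_mask[d]
                if shared.any():
                    diff = X[i, shared] - X[d, shared]
                    scored.append((float(np.sqrt(np.mean(diff ** 2))), int(d)))
            if scored:
                scored.sort(key=lambda pair: pair[0])
                nearest = [d for _, d in scored[:k]]
                filled[i, j] = X[nearest, j].mean()
    return filled
```

A sketch like this also hints at why block missing is harder than sporadic missing: when a long run of consecutive records loses the same sensors, the nearest usable donors lie far from the gap.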
The prevailing consensus in the statistical literature is that multiple imputation is generally the most suitable method for addressing missing data in statistical analyses, whereas a complete case analysis is deemed appropriate only when the rate of missingness is negligible or when the missingness mechanism is missing completely at random (MCAR). This study investigates the applicability of this consensus within the context of supervised machine learning, with particular emphasis on the interactions between the imputation method, missingness mechanism, and missingness rate. Furthermore, we examine the time efficiency of these "state-of-the-art" imputation methods, considering the time-sensitive nature of certain machine learning applications. Using ten real-world datasets, we introduced missingness at rates ranging from approximately 5% to 75% under the MCAR, missing at random (MAR), and missing not at random (MNAR) mechanisms. We subsequently addressed missing data using five methods: complete case analysis (CCA), mean imputation, hot deck imputation, regression imputation, and multiple imputation (MI). Statistical tests are conducted on the machine learning outcomes, and the findings are presented and analyzed. Our investigation reveals that in nearly all scenarios, CCA performs comparably to MI, even with substantial levels of missingness under the MAR and MNAR conditions and with missingness in the output variable for regression problems. Under some conditions, CCA surpasses MI in performance. Thus, given the considerable computational demands associated with MI, the application of CCA is recommended within the broader context of supervised machine learning, particularly in big-data environments.
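The simplest of the setups described above can be sketched as follows: injecting MCAR missingness at a chosen rate, then handling it by complete case analysis or mean imputation. The function names and the cell-wise masking scheme are assumptions for illustration; the MAR and MNAR mechanisms and the other three methods from the study are not covered.

```python
import numpy as np

def inject_mcar(X, rate, rng):
    """Set each cell to NaN independently with probability `rate` (MCAR)."""
    X = np.asarray(X, dtype=float).copy()
    X[rng.random(X.shape) < rate] = np.nan
    return X

def complete_case_analysis(X):
    """Keep only the rows that have no missing values."""
    return X[~np.isnan(X).any(axis=1)]

def mean_impute(X):
    """Replace each NaN with the mean of the observed values in its column."""
    X = np.asarray(X, dtype=float).copy()
    means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = means[cols]
    return X
```

A usage sketch: `X_missing = inject_mcar(X, 0.2, np.random.default_rng(0))`, then train the same model on `complete_case_analysis(X_missing)` and on `mean_impute(X_missing)` and compare both accuracy and run time.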
Landslide dam failures can cause significant damage to both society and ecosystems. Predicting the failure of these dams in advance enables early preventive measures, thereby minimizing potential harm. This paper proposes a fast and accurate model for predicting the longevity of landslide dams while also addressing the issue of missing data. Given the wide variation in the survival times of landslide dams, from mere minutes to several thousand years, predicting their longevity presents a considerable challenge. The study develops predictive models by considering key factors such as dam geometry, hydrodynamic conditions, materials, and triggering parameters. A dataset of 1045 landslide dam cases is analyzed, with longevity categorized into three distinct groups: C1 (<1 month), C2 (1 month to 1 year), and C3 (>1 year). Multiple imputation and k-nearest neighbor algorithms are used to handle missing data on geometric size, hydrodynamic conditions, materials, and triggers. Based on the imputed data, two predictive models are developed: a classification model for dam longevity categories and a regression model for precise longevity predictions. The classification model achieves an accuracy of 88.38%, while the regression model outperforms existing models with an R^2 value of 0.966. Two real-life landslide dam cases are used to validate the models, which show correct classification and small prediction errors. The longevity of landslide dams is jointly influenced by factors such as geometric size, hydrodynamic conditions, materials, and triggering events. Among these, geometric size has the greatest impact, followed by hydrodynamic conditions, materials, and triggers, as confirmed by variable importance in the model development.
Accurate lithofacies classification in low-permeability sandstone reservoirs remains challenging due to class imbalance in well-log data and the difficulty of modeling vertical lithological dependencies. Traditional core-based interpretation introduces subjectivity, while conventional deep learning models often fail to capture stratigraphic sequences effectively. To address these limitations, we propose a hybrid CNN-GRU framework that integrates spatial feature extraction and sequential modeling. Heat Kernel Imputation is applied to reconstruct missing log data, and Borderline SMOTE (BSMOTE) improves class balance by augmenting boundary-case minority samples. The CNN component extracts localized petrophysical features, and the GRU component captures depth-wise lithological transitions, enabling spatial-sequential feature fusion. Experiments on real-well datasets from tight sandstone reservoirs show that the proposed model achieves an average accuracy of 93.3% and a Macro F1-score of 0.934. It outperforms baseline models, including RF (87.8%), GBDT (81.8%), CNN-only (87.5%), and GRU-only (86.1%). Leave-one-well-out validation further confirms strong generalization ability. These results demonstrate that the proposed approach effectively addresses data imbalance and enhances classification robustness, offering a scalable and automated solution for lithofacies interpretation under complex geological conditions.
The accurate prediction and analysis of emergencies in Urban Rail Transit Systems (URTS) are essential for the development of effective early warning and prevention mechanisms. This study presents an integrated perception model designed to predict emergencies and analyze their causes based on historical unstructured emergency data. To address issues related to data structuredness and missing values, we employed label encoding and an Elastic Net Regularization-based Generative Adversarial Interpolation Network (ER-GAIN) for data structuring and imputation. Additionally, to mitigate the impact of imbalanced data on the predictive performance for emergencies, we introduced an Adaptive Boosting Ensemble Model (AdaBoost) to forecast the key features of emergencies, including event types and levels. We also utilized Information Gain (IG) to analyze and rank the causes of various significant emergencies. Experimental results indicate that, compared to baseline data imputation models, ER-GAIN improved the prediction accuracy of key emergency features by 3.67% and 3.78%, respectively. Furthermore, AdaBoost enhanced the accuracy by over 4.34% and 3.25% compared to baseline predictive models. Through causation analysis, we identified the critical causes of train operation and fire incidents. The findings of this research will contribute to the establishment of early warning and prevention mechanisms for emergencies in URTS, potentially leading to safer and more reliable URTS operations.
Handling missing data accurately is critical in clinical research, where data quality directly impacts decision-making and patient outcomes. While deep learning (DL) techniques for data imputation have gained attention, challenges remain, especially when dealing with diverse data types. In this study, we introduce a novel data imputation method based on a modified convolutional neural network, specifically, a Deep Residual-Convolutional Neural Network (DRes-CNN) architecture designed to handle missing values across various datasets. Our approach demonstrates substantial improvements over existing imputation techniques by leveraging residual connections and optimized convolutional layers to capture complex data patterns. We evaluated the model on publicly available datasets, including Medical Information Mart for Intensive Care (MIMIC-III and MIMIC-IV), which contain critical care patient data, and the Beijing Multi-Site Air Quality dataset, which measures environmental air quality. The proposed DRes-CNN method achieved a root mean square error (RMSE) of 0.00006, highlighting its high accuracy and robustness. We also compared it with the Low Light-Convolutional Neural Network (LL-CNN) and U-Net methods, which had RMSE values of 0.00075 and 0.00073, respectively. This represents an improvement of approximately 92% over LL-CNN and 91% over U-Net. The results show that the DRes-CNN-based imputation method outperforms current state-of-the-art models and establish DRes-CNN as a reliable solution for addressing missing data.
Accurate traffic flow prediction (TFP) is vital for efficient and sustainable transportation management and the development of intelligent traffic systems. However, missing data in real-world traffic datasets poses a significant challenge to maintaining prediction precision. This study introduces REPTF-TMDI, a novel method that combines a Reduced Error Pruning Tree Forest (REPTree Forest) with a newly proposed Time-based Missing Data Imputation (TMDI) approach. The REPTree Forest, an ensemble learning approach, is tailored for time-related traffic data to enhance predictive accuracy and support the evolution of sustainable urban mobility solutions. Meanwhile, the TMDI approach exploits temporal patterns to estimate missing values reliably whenever empty fields are encountered. The proposed method was evaluated using hourly traffic flow data from a major U.S. roadway spanning 2012 to 2018, incorporating temporal features (e.g., hour, day, month, year, weekday), a holiday indicator, and weather conditions (temperature, rain, snow, and cloud coverage). Experimental results demonstrated that the REPTF-TMDI method outperformed conventional imputation techniques across various missing data ratios, achieving an average 11.76% improvement in correlation coefficient (R). Furthermore, REPTree Forest achieved improvements of 68.62% in RMSE and 70.52% in MAE compared to existing state-of-the-art models. These findings highlight the method's ability to significantly boost traffic flow prediction accuracy, even in the presence of missing data, thereby contributing to the broader objectives of sustainable urban transportation systems.
With the increasing complexity of production processes, there has been a growing focus on online algorithms within the domain of multivariate statistical process control (SPC). Nonetheless, conventional methods, based on the assumption of complete data obtained at uniform time intervals, exhibit suboptimal performance in the presence of missing data. To make maximal use of the available information, we propose an adaptive exponentially weighted moving average (EWMA) control chart employing a weighted imputation approach that leverages the relationships between complete and incomplete data. Specifically, we introduce two recovery methods: an improved K-Nearest Neighbors imputed value and the conventional univariate EWMA statistic. We then formulate an adaptive weighting function to combine these methods, assigning a smaller weight to the EWMA statistic when the sample information suggests an increased likelihood of the process being out of control, and vice versa. The robustness and sensitivity of the proposed scheme are shown through simulation results and an illustrative example.
Background: Genome-wide association studies and genomic predictions are thought to be optimized by using whole-genome sequence (WGS) data. However, sequencing thousands of individuals of interest is expensive. Imputation from SNP panels to WGS data is an attractive and less expensive approach to obtain WGS data. The aims of this study were to investigate the accuracy of imputation and to provide insight into the design and execution of genotype imputation. Results: We genotyped 450 chickens with a 600K SNP array and sequenced 24 key individuals by whole-genome re-sequencing. Accuracy of imputation from putative 60K and 600K array data to WGS data was 0.620 and 0.812 for Beagle, and 0.810 and 0.914 for FImpute, respectively. By increasing the sequencing cost from 24X to 144X, the imputation accuracy increased from 0.525 to 0.698 for Beagle and from 0.654 to 0.823 for FImpute. With fixed sequence depth (12X), increasing the number of sequenced animals from 1 to 24 improved accuracy from 0.421 to 0.897 for FImpute and from 0.396 to 0.777 for Beagle. Using optimally selected key individuals resulted in a higher imputation accuracy compared with using randomly selected individuals as a reference population for re-sequencing. With fixed reference population size (24), imputation accuracy increased from 0.654 to 0.875 for FImpute and from 0.512 to 0.762 for Beagle as the sequencing depth increased from 1X to 12X. With a given total cost of genotyping, accuracy increased with the size of the reference population for FImpute, but this pattern did not hold for Beagle, which showed the highest accuracy at six-fold coverage for the scenarios used in this study. Conclusions: We comprehensively investigated the impacts of several key factors on genotype imputation. Generally, increasing sequencing cost gave a higher imputation accuracy. With a fixed sequencing cost, however, optimal imputation can enhance the performance of WGP and GWAS. An optimal imputation strategy should comprehensively take into consideration the size of the reference population, imputation algorithms, marker density, the population structure of the target population, and the methods used to select key individuals. This work sheds additional light on how to design and execute genotype imputation for livestock populations.
Background: Improving feed efficiency would increase profitability for producers while also reducing the environmental footprint of livestock production. This study was conducted to investigate the relationships among feed efficiency traits and metabolizable efficiency traits in 180 male broilers. Significant loci and genes affecting the metabolizable efficiency traits were explored with an imputation-based genome-wide association study. The traits measured or calculated comprised three growth traits, five feed efficiency related traits, and nine metabolizable efficiency traits. Results: The residual feed intake (RFI) showed moderate to high and positive phenotypic correlations with eight other traits measured, including average daily feed intake (ADFI), dry excreta weight (DEW), gross energy excretion (GEE), crude protein excretion (CPE), metabolizable dry matter (MDM), nitrogen corrected apparent metabolizable energy (AMEn), abdominal fat weight (AbF), and percentage of abdominal fat (AbP). Greater correlations were observed between growth traits and the feed conversion ratio (FCR) than RFI. In addition, the RFI, FCR, ADFI, DEW, GEE, CPE, MDM, AMEn, AbF, and AbP were lower in low-RFI birds than in high-RFI birds (P < 0.01 or P < 0.05), whereas the coefficients of MDM and MCP of low-RFI birds were greater than those of high-RFI birds (P < 0.01). Five narrow QTLs for metabolizable efficiency traits were detected, including one 82.46-kb region for DEW and GEE on Gallus gallus chromosome (GGA) 26, one 120.13-kb region for MDM and AMEn on GGA1, one 691.25-kb region for the coefficients of MDM and AMEn on GGA5, one region for the coefficients of MDM and MCP on GGA2 (103.45–103.53 Mb), and one 690.50-kb region for the coefficient of MCP on GGA14. Linkage disequilibrium (LD) analysis indicated that the five regions contained high LD blocks, as well as the genes chromosome 26 C6orf106 homolog (C26H6orf106), LOC396098, SH3 and multiple ankyrin repeat domains 2 (SHANK2), ETS homologous factor (EHF), and histamine receptor H3-like (HRH3L), which are known to be involved in the regulation of neurodevelopment, cell proliferation and differentiation, and food intake. Conclusions: Selection for low RFI significantly decreased chicken feed intake, excreta output, and abdominal fat deposition, and increased nutrient digestibility without changing the weight gain. Five novel QTL regions involved in the control of metabolizable efficiency in chickens were identified. These results, combining nutritional and genetic approaches, should facilitate novel insights into improving feed efficiency in poultry and other species.
文摘The concept of missing data is important to apply statistical methods on the dataset. Statisticians and researchers may end up to an inaccurate illation about the data if the missing data are not handled properly. Of late, Python and R provide diverse packages for handling missing data. In this study, an imputation algorithm, cumulative linear regression, is proposed. The proposed algorithm depends on the linear regression technique. It differs from the existing methods, in that it cumulates the imputed variables;those variables will be incorporated in the linear regression equation to filling in the missing values in the next incomplete variable. The author performed a comparative study of the proposed method and those packages. The performance was measured in terms of imputation time, root-mean-square error, mean absolute error, and coefficient of determination (R^2). On analysing on five datasets with different missing values generated from different mechanisms, it was observed that the performances vary depending on the size, missing percentage, and the missingness mechanism. The results showed that the performance of the proposed method is slightly better.
Abstract: Accurately filling in missing heating data is essential for ensuring data quality in applications such as energy management optimization and building efficiency analysis. Traditional machine learning methods use historical heating data as an input feature to predict the following missing data. However, when the duration of missing data is long, previous estimated values are inevitably used for further imputation, leading to error accumulation and a growing deviation from true values. To overcome this problem, this paper proposes a generative network that can fill missing data solely based on weather and temporal data, without using previous imputed values for further imputation. Our method outperformed state-of-the-art models such as Seq2seq and Transformer, achieving relative normalized root mean square error (NRMSE) reductions of 1.65% to 41.38%, 0.30% to 66.43%, and 14.84% to 50.22% across three different data sources. In addition, with our proposed method, the effect of selecting different weather variables on model performance and the benefits of transfer learning under limited data were also demonstrated. When applying transfer learning, the relative NRMSE reduction ranges from 3.88% to 15.85% in cold months and from 7.49% to 12.29% in warm months.
Funding: National Natural Science Foundation of China, Grant/Award Numbers: 62203236, 62473212; Young Elite Scientists Sponsorship Program by CAST, Grant/Award Number: 2023QNRC001.
Abstract: Recent advances in single-cell DNA methylation have provided unprecedented opportunities to explore cellular epigenetic differences with maximal resolution. A common workflow for single-cell DNA methylation analysis is binning the genome into multiple regions and computing the average methylation level within each region. In this process, imputing not available (NA) values, which are caused by the limited number of captured methylation sites, is a necessary preprocessing step for downstream analyses. Existing studies have employed several simple imputation methods (such as zeros imputation or means imputation); however, there is a lack of theoretical studies or benchmark tests of these approaches. Through both experiments and theoretical analysis, we found that using the medians to impute NA values can effectively and simply reflect the methylation state of the NA values, providing an accurate foundation for downstream analyses.
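As a rough illustration of median imputation on a binned methylation matrix, the sketch below fills each NA with the median of the observed values in the same region. The layout (cells as rows, regions as columns) and the choice of a per-region median are assumptions; the abstract does not say along which axis the medians are taken.

```python
import numpy as np

def impute_na_with_median(levels):
    """Fill NaN entries with the median of the observed values in the same column.

    Assumed layout: rows are cells, columns are genomic regions (bins).
    """
    levels = np.asarray(levels, dtype=float).copy()
    # Per-region median computed over observed (non-NaN) cells only.
    medians = np.nanmedian(levels, axis=0)
    rows, cols = np.where(np.isnan(levels))
    levels[rows, cols] = medians[cols]
    return levels
```

Replacing `np.nanmedian` with `np.nanmean`, or filling with a constant zero, would give the means and zeros baselines the abstract mentions.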
Abstract: Missing data presents a crucial challenge in data analysis, especially in high-dimensional datasets, where missing data often leads to biased conclusions and degraded model performance. In this study, we present a novel autoencoder-based imputation framework that integrates a composite loss function to enhance robustness and precision. The proposed loss combines (i) a guided, masked mean squared error focusing on missing entries; (ii) a noise-aware regularization term to improve resilience against data corruption; and (iii) a variance penalty to encourage expressive yet stable reconstructions. We evaluate the proposed model across four missingness mechanisms, namely Missing Completely at Random, Missing at Random, Missing Not at Random, and Missing Not at Random with quantile censorship, under systematically varied feature counts, sample sizes, and missingness ratios ranging from 5% to 60%. Four publicly available real-world datasets (Stroke Prediction, Pima Indians Diabetes, Cardiovascular Disease, and Framingham Heart Study) were used, and the obtained results show that our proposed model consistently outperforms baseline methods, including traditional and deep learning-based techniques. An ablation study reveals the additive value of each component in the loss function. Additionally, we assessed the downstream utility of imputed data through classification tasks, where datasets imputed by the proposed method yielded the highest receiver operating characteristic area under the curve scores across all scenarios. The model demonstrates strong scalability and robustness, improving performance with larger datasets and higher feature counts. These results underscore the capacity of the proposed method to produce not only numerically accurate but also semantically useful imputations, making it a promising solution for robust data recovery in clinical applications.
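The first loss component, a mean squared error restricted to masked entries, can be sketched as follows. This covers only the generic masked-MSE idea; the "guided" weighting, the noise-aware term, and the variance penalty are not specified in the abstract and are not reproduced. The function name and the assumption that ground truth is available at the masked positions (for example, because observed entries were hidden artificially during training) are illustrative.

```python
import numpy as np

def masked_mse(x_true, x_recon, eval_mask):
    """Mean squared error computed only where eval_mask is True.

    Hypothetical helper: eval_mask marks entries that were hidden from the model
    but whose true values are known, so the error can be measured there.
    """
    eval_mask = np.asarray(eval_mask, dtype=bool)
    diff = (np.asarray(x_recon, dtype=float) - np.asarray(x_true, dtype=float))[eval_mask]
    # Return 0.0 when nothing is masked, to avoid a mean over an empty array.
    return float(np.mean(diff ** 2)) if diff.size else 0.0
```

Restricting the error to masked positions keeps the score from being dominated by entries the model could simply copy from its input.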
Funding: This work was supported by the National Key R&D Program of China (2019YFA0709501), the National Natural Science Foundation of China (11661141019 and 61621003), the National Ten Thousand Talent Program for Young Top-notch Talents, the CAS Frontier Science Research Key Project for Top Young Scientist (QYZDB-SSW-SYS008), and the Shanghai Municipal Science and Technology Major Project (2017SHZDZX01).
Abstract: Single-cell RNA sequencing (scRNA-seq) provides a powerful tool to determine expression patterns of thousands of individual cells. However, the analysis of scRNA-seq data remains a computational challenge due to high technical noise, such as the presence of dropout events that lead to a large proportion of zeros for expressed genes. Taking into account the cell heterogeneity and the relationship between dropout rate and expected expression level, we present a cell sub-population based bounded low-rank (PBLR) method to impute the dropouts of scRNA-seq data. Through application to both simulated and real scRNA-seq datasets, PBLR is shown to be effective in recovering dropout events, and it can dramatically improve the low-dimensional representation and the recovery of gene-gene relationships masked by dropout events compared to several state-of-the-art methods. Moreover, PBLR also detects accurate and robust cell sub-populations automatically, shedding light on its flexibility and generality for scRNA-seq data analysis.
Funding: Supported by the National Key Research and Development Program of China under Grant No. 2020YFB0204803; the National Natural Science Foundation of China under Grant No. 61772566; the Guangdong Key Field Research and Development Plan under Grant Nos. 2019B020228001 and 2018B010109006; the Introducing Innovative and Entrepreneurial Teams of Guangdong under Grant No. 2016ZT06D211; and the Guangzhou Science and Technology Research Plan under Grant No. 202007030010.
Abstract: DNA methylation is an important type of epigenetic modification that plays a vital role in many diseases, including cancers. With the development of high-throughput sequencing technology, much progress has been made in uncovering the relations between DNA methylation and diseases. However, the analysis of DNA methylation data is challenging due to the missing values caused by the limitations of current techniques. While many methods have been developed to impute the missing values, these methods are mostly based on the correlations between individual samples, and thus are limited for the abnormal samples in cancers. In this study, we present a novel transfer learning based neural network to impute missing DNA methylation data, namely the TDimpute-DNAmeth method. The method learns common relations between DNA methylation from pan-cancer samples, and then fine-tunes the learned relations over each specific cancer type for imputing the missing data. Tested on 16 cancer datasets, our method was shown to outperform other commonly used methods. Further analyses indicated that DNA methylation is related to cancer survival and thus can be used as a biomarker of cancer prognosis.
Funding: Supported by the National Natural Science Foundation of China (Nos. 12475340 and 12375350), the Special Branch project of South Taihu Lake, and the Scientific Research Fund of Zhejiang Provincial Education Department (No. Y202456326).
Abstract: Missing values in radionuclide diffusion datasets can undermine the predictive accuracy and robustness of machine learning (ML) models. In this study, a regression-based missing data imputation method using a light gradient boosting machine (LGBM) algorithm was employed to impute more than 60% of the missing data, establishing a radionuclide diffusion dataset containing 16 input features and 813 instances. The effective diffusion coefficient (D_e) was predicted using ten ML models. The predictive accuracy of the ensemble meta-models, namely LGBM-extreme gradient boosting (XGB) and LGBM-categorical boosting (CatB), surpassed that of the other ML models, with R^2 values of 0.94. The models were applied to predict the D_e values of EuEDTA^- and HCrO_4^- in saturated compacted bentonites at compactions ranging from 1200 to 1800 kg/m^3, which were measured using a through-diffusion method. The generalization ability of the LGBM-XGB model surpassed that of LGBM-CatB in predicting the D_e of HCrO_4^-. Shapley additive explanations identified total porosity as the most significant influencing factor. Additionally, the partial dependence plot analysis technique yielded clearer results in the univariate correlation analysis. This study provides a regression imputation technique to refine radionuclide diffusion datasets, offering deeper insights into the diffusion mechanism of radionuclides and supporting the safety assessment of the geological disposal of high-level radioactive waste.
Funding: Partially supported by the National Natural Science Foundation of China (62271485) and the SDHS Science and Technology Project (HS2023B044).
Abstract: Imputation of missing data has long been an important topic and an essential application for intelligent transportation systems (ITS) in the real world. As a state-of-the-art generative model, the diffusion model has proven highly successful in image generation, speech generation, time series modelling, etc., and now opens a new avenue for traffic data imputation. In this paper, we propose a conditional diffusion model, called the implicit-explicit diffusion model, for traffic data imputation. This model exploits both the implicit and explicit features of the data simultaneously. More specifically, we design two types of feature extraction modules, one to capture the implicit dependencies hidden in the raw data at multiple time scales and the other to obtain the long-term temporal dependencies of the time series. This approach not only inherits the advantages of the diffusion model for estimating missing data, but also takes into account the multiscale correlation inherent in traffic data. To illustrate the performance of the model, extensive experiments are conducted on three real-world time series datasets using different missing rates. The experimental results demonstrate that the model improves imputation accuracy and generalization capability.
Funding: Supported by the National Natural Science Foundation of China (Grant No. 52409151), the Programme of Shenzhen Key Laboratory of Green, Efficient and Intelligent Construction of Underground Metro Station (Programme No. ZDSYS20200923105200001), and the Science and Technology Major Project of Xizang Autonomous Region of China (XZ202201ZD0003G).
Abstract: Substantial advancements have been achieved in Tunnel Boring Machine (TBM) technology and monitoring systems, yet the presence of missing data impedes accurate analysis and interpretation of TBM monitoring results. This study aims to investigate the issue of missing data in extensive TBM datasets. Through a comprehensive literature review, we analyze the mechanism of missing TBM data and compare different imputation methods, including statistical analysis and machine learning algorithms. We also examine the impact of various missing patterns and rates on the efficacy of these methods. Finally, we propose a dynamic interpolation strategy tailored for TBM engineering sites. The research results show that K-Nearest Neighbors (KNN) and Random Forest (RF) algorithms can achieve good interpolation results; that as the missing rate increases, the interpolation performance of the different methods decreases; and that interpolation performs worst for block missing, better for mixed missing, and best for sporadic missing. On-site application results validate the proposed interpolation strategy's capability to achieve robust missing value interpolation, applicable in ML scenarios such as parameter optimization, attitude warning, and pressure prediction. These findings contribute to enhancing the efficiency of TBM missing data processing, offering more effective support for large-scale TBM monitoring datasets.
Abstract: The prevailing consensus in statistical literature is that multiple imputation is generally the most suitable method for addressing missing data in statistical analyses, whereas a complete case analysis is deemed appropriate only when the rate of missingness is negligible or when the missingness mechanism is missing completely at random (MCAR). This study investigates the applicability of this consensus within the context of supervised machine learning, with particular emphasis on the interactions between the imputation method, missingness mechanism, and missingness rate. Furthermore, we examine the time efficiency of these “state-of-the-art” imputation methods considering the time-sensitive nature of certain machine learning applications. Utilizing ten real-world datasets, we introduced missingness at rates ranging from approximately 5% to 75% under the MCAR, missing at random (MAR), and missing not at random (MNAR) mechanisms. We subsequently address missing data using five methods: complete case analysis (CCA), mean imputation, hot deck imputation, regression imputation, and multiple imputation (MI). Statistical tests are conducted on the machine learning outcomes, and the findings are presented and analyzed. Our investigation reveals that in nearly all scenarios, CCA performs comparably to MI, even with substantial levels of missingness under the MAR and MNAR conditions and with missingness in the output variable for regression problems. Under some conditions, CCA surpasses MI in terms of its performance. Thus, given the considerable computational demands associated with MI, the application of CCA is recommended within the broader context of supervised machine learning, particularly in big-data environments.
Funding: Supported by the National Natural Science Foundation of China (U42107189, 20A20111).
Abstract: Landslide dam failures can cause significant damage to both society and ecosystems. Predicting the failure of these dams in advance enables early preventive measures, thereby minimizing potential harm. This paper aims to propose a fast and accurate model for predicting the longevity of landslide dams while also addressing the issue of missing data. Given the wide variation in the survival times of landslide dams, from mere minutes to several thousand years, predicting their longevity presents a considerable challenge. The study develops predictive models by considering key factors such as dam geometry, hydrodynamic conditions, materials, and triggering parameters. A dataset of 1045 landslide dam cases is analyzed, categorizing their longevity into three distinct groups: C1 (<1 month), C2 (1 month to 1 year), and C3 (>1 year). Multiple imputation and k-nearest neighbor algorithms are used to handle missing data on geometric size, hydrodynamic conditions, materials, and triggers. Based on the imputed data, two predictive models are developed: a classification model for dam longevity categories and a regression model for precise longevity predictions. The classification model achieves an accuracy of 88.38%, while the regression model outperforms existing models with an R^2 value of 0.966. Two real-life landslide dam cases are used to validate the models, which show correct classification and small prediction errors. The longevity of landslide dams is jointly influenced by factors such as geometric size, hydrodynamic conditions, materials, and triggering events. Among these, geometric size has the greatest impact, followed by hydrodynamic conditions, materials, and triggers, as confirmed by variable importance in the model development.
Funding: Supported by the Langfang Science and Technology Program with self-raised funds under the project “Application of Deep Learning-Based Joint Well-Seismic Analysis in Lithology Prediction” (Project No. 2024011013) and the Science and Technology Innovation Program for Postgraduate students in IDP subsidized by Fundamental Research Funds for the Central Universities, under the project “Research on CNN Algorithm Enhanced by Physical Information for Lithofacies Prediction in Tight Sandstone Reservoirs” (Project No. ZY20250328).
Abstract: Accurate lithofacies classification in low-permeability sandstone reservoirs remains challenging due to class imbalance in well-log data and the difficulty of modeling vertical lithological dependencies. Traditional core-based interpretation introduces subjectivity, while conventional deep learning models often fail to capture stratigraphic sequences effectively. To address these limitations, we propose a hybrid CNN–GRU framework that integrates spatial feature extraction and sequential modeling. Heat Kernel Imputation is applied to reconstruct missing log data, and Borderline SMOTE (BSMOTE) improves class balance by augmenting boundary-case minority samples. The CNN component extracts localized petrophysical features, and the GRU component captures depth-wise lithological transitions, enabling spatial-sequential feature fusion. Experiments on real-well datasets from tight sandstone reservoirs show that the proposed model achieves an average accuracy of 93.3% and a Macro F1-score of 0.934. It outperforms baseline models, including RF (87.8%), GBDT (81.8%), CNN-only (87.5%), and GRU-only (86.1%). Leave-one-well-out validation further confirms strong generalization ability. These results demonstrate that the proposed approach effectively addresses data imbalance and enhances classification robustness, offering a scalable and automated solution for lithofacies interpretation under complex geological conditions.
Funding: Supported by the Fundamental Research Funds for the Central Universities (grant number 2024YJS096) and the National Natural Science Foundation of China (grant numbers 62433005, 62272036, 62173167).
Abstract: The accurate prediction and analysis of emergencies in Urban Rail Transit Systems (URTS) are essential for the development of effective early warning and prevention mechanisms. This study presents an integrated perception model designed to predict emergencies and analyze their causes based on historical unstructured emergency data. To address issues related to data structuredness and missing values, we employed label encoding and an Elastic Net Regularization-based Generative Adversarial Interpolation Network (ER-GAIN) for data structuring and imputation. Additionally, to mitigate the impact of imbalanced data on the predictive performance of emergencies, we introduced an Adaptive Boosting Ensemble Model (AdaBoost) to forecast the key features of emergencies, including event types and levels. We also utilized Information Gain (IG) to analyze and rank the causes of various significant emergencies. Experimental results indicate that, compared to baseline data imputation models, ER-GAIN improved the prediction accuracy of key emergency features by 3.67% and 3.78%, respectively. Furthermore, AdaBoost enhanced the accuracy by over 4.34% and 3.25% compared to baseline predictive models. Through causation analysis, we identified the critical causes of train operation and fire incidents. The findings of this research will contribute to the establishment of early warning and prevention mechanisms for emergencies in URTS, potentially leading to safer and more reliable URTS operations.
Funding: Supported by the Intelligent System Research Group (ISysRG), supported by Universitas Sriwijaya and funded by the Competitive Research 2024.
Abstract: Handling missing data accurately is critical in clinical research, where data quality directly impacts decision-making and patient outcomes. While deep learning (DL) techniques for data imputation have gained attention, challenges remain, especially when dealing with diverse data types. In this study, we introduce a novel data imputation method based on a modified convolutional neural network, specifically, a Deep Residual-Convolutional Neural Network (DRes-CNN) architecture designed to handle missing values across various datasets. Our approach demonstrates substantial improvements over existing imputation techniques by leveraging residual connections and optimized convolutional layers to capture complex data patterns. We evaluated the model on publicly available datasets, including Medical Information Mart for Intensive Care (MIMIC-III and MIMIC-IV), which contain critical care patient data, and the Beijing Multi-Site Air Quality dataset, which measures environmental air quality. The proposed DRes-CNN method achieved a root mean square error (RMSE) of 0.00006, highlighting its high accuracy and robustness. We also compared it with the Low Light-Convolutional Neural Network (LL-CNN) and U-Net methods, which had RMSE values of 0.00075 and 0.00073, respectively. This represents an improvement of approximately 92% over LL-CNN and 91% over U-Net. The results show that this DRes-CNN-based imputation method outperforms current state-of-the-art models and establish DRes-CNN as a reliable solution for addressing missing data.
Abstract: Accurate traffic flow prediction (TFP) is vital for efficient and sustainable transportation management and the development of intelligent traffic systems. However, missing data in real-world traffic datasets poses a significant challenge to maintaining prediction precision. This study introduces REPTF-TMDI, a novel method that combines a Reduced Error Pruning Tree Forest (REPTree Forest) with a newly proposed Time-based Missing Data Imputation (TMDI) approach. The REPTree Forest, an ensemble learning approach, is tailored for time-related traffic data to enhance predictive accuracy and support the evolution of sustainable urban mobility solutions. Meanwhile, the TMDI approach exploits temporal patterns to estimate missing values reliably whenever empty fields are encountered. The proposed method was evaluated using hourly traffic flow data from a major U.S. roadway spanning 2012-2018, incorporating temporal features (e.g., hour, day, month, year, weekday), a holiday indicator, and weather conditions (temperature, rain, snow, and cloud coverage). Experimental results demonstrated that the REPTF-TMDI method outperformed conventional imputation techniques across various missing data ratios, achieving an average 11.76% improvement in terms of correlation coefficient (R). Furthermore, REPTree Forest achieved improvements of 68.62% in RMSE and 70.52% in MAE compared to existing state-of-the-art models. These findings highlight the method’s ability to significantly boost traffic flow prediction accuracy, even in the presence of missing data, thereby contributing to the broader objectives of sustainable urban transportation systems.
Abstract: With the increasing complexity of production processes, there has been a growing focus on online algorithms within the domain of multivariate statistical process control (SPC). Nonetheless, conventional methods, based on the assumption of complete data obtained at uniform time intervals, exhibit suboptimal performance in the presence of missing data. In our pursuit of maximizing available information, we propose an adaptive exponentially weighted moving average (EWMA) control chart employing a weighted imputation approach that leverages the relationships between complete and incomplete data. Specifically, we introduce two recovery methods: an improved K-Nearest Neighbors imputing value and the conventional univariate EWMA statistic. We then formulate an adaptive weighting function to amalgamate these methods, assigning a diminished weight to the EWMA statistic when the sample information suggests an increased likelihood of the process being out of control, and vice versa. The robustness and sensitivity of the proposed scheme are shown through simulation results and an illustrative example.
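For readers unfamiliar with the two ingredients named here, the sketch below shows the standard univariate EWMA recursion, z_t = λ·x_t + (1 − λ)·z_{t−1}, and a plain convex combination of an EWMA-based value with a KNN-based value. The paper's adaptive weighting function is not given in the abstract, so the weight is simply passed in; the function names, the default smoothing constant, and the starting value are assumptions.

```python
def ewma(values, lam=0.2, start=0.0):
    """Standard EWMA recursion: z_t = lam * x_t + (1 - lam) * z_{t-1}."""
    z = start
    out = []
    for x in values:
        z = lam * x + (1.0 - lam) * z
        out.append(z)
    return out

def blend_imputation(knn_value, ewma_value, ewma_weight):
    """Convex combination of two candidate imputations.

    Hypothetical helper: ewma_weight would come from the adaptive weighting
    function described in the abstract, which is not reproduced here.
    """
    if not 0.0 <= ewma_weight <= 1.0:
        raise ValueError("ewma_weight must lie in [0, 1]")
    return ewma_weight * ewma_value + (1.0 - ewma_weight) * knn_value
```

A smaller `ewma_weight` leans on the KNN value, which matches the abstract's description of down-weighting the EWMA statistic when the process is more likely to be out of control.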
Funding: Supported by the National Natural Science Foundation of China (31772556); the China Agricultural Research System (CARS-41-G03); the Science Innovation Project of Guangdong (2015A020209159); the Special Program for Applied Research on Super Computation of the NSFC Guangdong Joint Fund (the second phase) under Grant No. U1501501; and technical support from the National Supercomputer Center in Guangzhou.
Abstract: Background: Genome-wide association studies and genomic predictions are thought to be optimized by using whole-genome sequence (WGS) data. However, sequencing thousands of individuals of interest is expensive. Imputation from SNP panels to WGS data is an attractive and less expensive approach to obtain WGS data. The aims of this study were to investigate the accuracy of imputation and to provide insight into the design and execution of genotype imputation. Results: We genotyped 450 chickens with a 600K SNP array, and sequenced 24 key individuals by whole genome re-sequencing. Accuracy of imputation from putative 60K and 600K array data to WGS data was 0.620 and 0.812 for Beagle, and 0.810 and 0.914 for FImpute, respectively. By increasing the sequencing cost from 24X to 144X, the imputation accuracy increased from 0.525 to 0.698 for Beagle and from 0.654 to 0.823 for FImpute. With fixed sequence depth (12X), increasing the number of sequenced animals from 1 to 24 improved accuracy from 0.421 to 0.897 for FImpute and from 0.396 to 0.777 for Beagle. Using optimally selected key individuals resulted in a higher imputation accuracy compared with using randomly selected individuals as a reference population for resequencing. With fixed reference population size (24), imputation accuracy increased from 0.654 to 0.875 for FImpute and from 0.512 to 0.762 for Beagle as the sequencing depth increased from 1X to 12X. With a given total cost of genotyping, accuracy increased with the size of the reference population for FImpute, but the pattern was not valid for Beagle, which showed the highest accuracy at sixfold coverage for the scenarios used in this study. Conclusions: In conclusion, we comprehensively investigated the impacts of several key factors on genotype imputation. Generally, increasing sequencing cost gave a higher imputation accuracy. However, with a fixed sequencing cost, optimal imputation enhances the performance of WGP and GWAS. An optimal imputation strategy should comprehensively take into consideration the size of the reference population, imputation algorithms, marker density, the population structure of the target population, and methods to select key individuals. This work sheds additional light on how to design and execute genotype imputation for livestock populations.
Funding: Supported by grants from the National Nonprofit Institute Research Grant (Y2020PT02), the earmarked fund for the modern agroindustry technology research system (CARS-41), and the Agricultural Science and Technology Innovation Program (ASTIP-IAS04, ASTIP-IAS-TS-15).
Abstract: Background: Improving feed efficiency would increase profitability for producers while also reducing the environmental footprint of livestock production. This study was conducted to investigate the relationships among feed efficiency traits and metabolizable efficiency traits in 180 male broilers. Significant loci and genes affecting the metabolizable efficiency traits were explored with an imputation-based genome-wide association study. The traits measured or calculated comprised three growth traits, five feed efficiency related traits, and nine metabolizable efficiency traits. Results: The residual feed intake (RFI) showed moderate to high and positive phenotypic correlations with eight other traits measured, including average daily feed intake (ADFI), dry excreta weight (DEW), gross energy excretion (GEE), crude protein excretion (CPE), metabolizable dry matter (MDM), nitrogen corrected apparent metabolizable energy (AMEn), abdominal fat weight (AbF), and percentage of abdominal fat (AbP). Greater correlations were observed between growth traits and the feed conversion ratio (FCR) than RFI. In addition, the RFI, FCR, ADFI, DEW, GEE, CPE, MDM, AMEn, AbF, and AbP were lower in low-RFI birds than high-RFI birds (P < 0.01 or P < 0.05), whereas the coefficients of MDM and MCP of low-RFI birds were greater than those of high-RFI birds (P < 0.01). Five narrow QTLs for metabolizable efficiency traits were detected, including one 82.46-kb region for DEW and GEE on Gallus gallus chromosome (GGA) 26, one 120.13-kb region for MDM and AMEn on GGA1, one 691.25-kb region for the coefficients of MDM and AMEn on GGA5, one region for the coefficients of MDM and MCP on GGA2 (103.45–103.53 Mb), and one 690.50-kb region for the coefficient of MCP on GGA14. Linkage disequilibrium (LD) analysis indicated that the five regions contained high LD blocks, as well as the genes chromosome 26 C6orf106 homolog (C26H6orf106), LOC396098, SH3 and multiple ankyrin repeat domains 2 (SHANK2), ETS homologous factor (EHF), and histamine receptor H3-like (HRH3L), which are known to be involved in the regulation of neurodevelopment, cell proliferation and differentiation, and food intake. Conclusions: Selection for low RFI significantly decreased chicken feed intake, excreta output, and abdominal fat deposition, and increased nutrient digestibility without changing the weight gain. Five novel QTL regions involved in the control of metabolizable efficiency in chickens were identified. These results, combined through nutritional and genetic approaches, should facilitate novel insights into improving feed efficiency in poultry and other species.