Imputation of missing data has long been an important topic and an essential application for intelligent transportation systems(ITS)in the real world.As a state-of-the-art generative model,the diffusion model has prov...Imputation of missing data has long been an important topic and an essential application for intelligent transportation systems(ITS)in the real world.As a state-of-the-art generative model,the diffusion model has proven highly successful in image generation,speech generation,time series modelling etc.and now opens a new avenue for traffic data imputation.In this paper,we propose a conditional diffusion model,called the implicit-explicit diffusion model,for traffic data imputation.This model exploits both the implicit and explicit feature of the data simultaneously.More specifically,we design two types of feature extraction modules,one to capture the implicit dependencies hidden in the raw data at multiple time scales and the other to obtain the long-term temporal dependencies of the time series.This approach not only inherits the advantages of the diffusion model for estimating missing data,but also takes into account the multiscale correlation inherent in traffic data.To illustrate the performance of the model,extensive experiments are conducted on three real-world time series datasets using different missing rates.The experimental results demonstrate that the model improves imputation accuracy and generalization capability.展开更多
Accurate traffic flow prediction(TFP)is vital for efficient and sustainable transportation management and the development of intelligent traffic systems.However,missing data in real-world traffic datasets poses a sign...Accurate traffic flow prediction(TFP)is vital for efficient and sustainable transportation management and the development of intelligent traffic systems.However,missing data in real-world traffic datasets poses a significant challenge to maintaining prediction precision.This study introduces REPTF-TMDI,a novel method that combines a Reduced Error Pruning Tree Forest(REPTree Forest)with a newly proposed Time-based Missing Data Imputation(TMDI)approach.The REP Tree Forest,an ensemble learning approach,is tailored for time-related traffic data to enhance predictive accuracy and support the evolution of sustainable urbanmobility solutions.Meanwhile,the TMDI approach exploits temporal patterns to estimate missing values reliably whenever empty fields are encountered.The proposed method was evaluated using hourly traffic flow data from a major U.S.roadway spanning 2012-2018,incorporating temporal features(e.g.,hour,day,month,year,weekday),holiday indicator,and weather conditions(temperature,rain,snow,and cloud coverage).Experimental results demonstrated that the REPTF-TMDI method outperformed conventional imputation techniques across various missing data ratios by achieving an average 11.76%improvement in terms of correlation coefficient(R).Furthermore,REPTree Forest achieved improvements of 68.62%in RMSE and 70.52%in MAE compared to existing state-of-the-art models.These findings highlight the method’s ability to significantly boost traffic flow prediction accuracy,even in the presence of missing data,thereby contributing to the broader objectives of sustainable urban transportation systems.展开更多
This research was an effort to select best imputation method for missing upper air temperature data over 24 standard pressure levels. We have implemented four imputation techniques like inverse distance weighting, Bil...This research was an effort to select best imputation method for missing upper air temperature data over 24 standard pressure levels. We have implemented four imputation techniques like inverse distance weighting, Bilinear, Natural and Nearest interpolation for missing data imputations. Performance indicators for these techniques were the root mean square error (RMSE), absolute mean error (AME), correlation coefficient and coefficient of determination ( R<sup>2</sup> ) adopted in this research. We randomly make 30% of total samples (total samples was 324) predictable from 70% remaining data. Although four interpolation methods seem good (producing <1 RMSE, AME) for imputations of air temperature data, but bilinear method was the most accurate with least errors for missing data imputations. RMSE for bilinear method remains <0.01 on all pressure levels except 1000 hPa where this value was 0.6. The low value of AME (<0.1) came at all pressure levels through bilinear imputations. Very strong correlation (>0.99) found between actual and predicted air temperature data through this method. The high value of the coefficient of determination (0.99) through bilinear interpolation method, tells us best fit to the surface. We have also found similar results for imputation with natural interpolation method in this research, but after investigating scatter plots over each month, imputations with this method seem to little obtuse in certain months than bilinear method.展开更多
Electric vehicles(EVs)are a sustainable mode of transportation,significantly reducing greenhouse gas emissions.The development of EV charging stations is crucial for supporting the growing number of EVs and integratin...Electric vehicles(EVs)are a sustainable mode of transportation,significantly reducing greenhouse gas emissions.The development of EV charging stations is crucial for supporting the growing number of EVs and integrating them into smart grid infrastructure.Efficient use of these stations requires optimized energy management and accurate forecasting of EV charging behaviors.However,forecasting accuracy is often hindered by missing data due to connectivity issues and equipment failures.To address these challenges,this study introduces a novel data imputation method ResiDualNet(Residual Dual BiLSTM-CNN Path Network),which is a residual sequence-to-sequence technique for imputing missing EV charging data.This model effectively captures underlying temporal and long-term dependencies,demonstrating strong performance across various scenarios.We compare our proposed model with two commonly used imputation methods KNN and Mean Imputation and one generative model,Generative Adversarial Network(GAN),across four different EV charging datasets.Experimental results demonstrate that our model significantly outperforms the others,showing an average improvement of 82%in terms of root mean squared error(RMSE)across all datasets.To further assess the effectiveness of our imputation model,we utilize three cutting-edge and newly introduced forecasting models:Bidirectional Long Short-Term Memory(BiLSTM),Mogrifier LSTM,and Sample Convolution and Interaction Network(SCINet)to predict EV charging load.The results indicate that SCINet outperforms the other forecasting techniques.Moreover,for SCINet,the dataset imputed by our proposed model performs second best after the real dataset,confirming the effectiveness of our imputation approach in improving forecasting accuracy for EV charging data.The complete source code is provided in the following repository:https://github.com/fffahim/ResiDualNet.git.展开更多
Handling missing data accurately is critical in clinical research, where data quality directly impacts decision-making and patient outcomes. While deep learning (DL) techniques for data imputation have gained attentio...Handling missing data accurately is critical in clinical research, where data quality directly impacts decision-making and patient outcomes. While deep learning (DL) techniques for data imputation have gained attention, challenges remain, especially when dealing with diverse data types. In this study, we introduce a novel data imputation method based on a modified convolutional neural network, specifically, a Deep Residual-Convolutional Neural Network (DRes-CNN) architecture designed to handle missing values across various datasets. Our approach demonstrates substantial improvements over existing imputation techniques by leveraging residual connections and optimized convolutional layers to capture complex data patterns. We evaluated the model on publicly available datasets, including Medical Information Mart for Intensive Care (MIMIC-III and MIMIC-IV), which contain critical care patient data, and the Beijing Multi-Site Air Quality dataset, which measures environmental air quality. The proposed DRes-CNN method achieved a root mean square error (RMSE) of 0.00006, highlighting its high accuracy and robustness. We also compared with Low Light-Convolutional Neural Network (LL-CNN) and U-Net methods, which had RMSE values of 0.00075 and 0.00073, respectively. This represented an improvement of approximately 92% over LL-CNN and 91% over U-Net. The results showed that this DRes-CNN-based imputation method outperforms current state-of-the-art models. These results established DRes-CNN as a reliable solution for addressing missing data.展开更多
Missing data presents a crucial challenge in data analysis,especially in high-dimensional datasets,where missing data often leads to biased conclusions and degraded model performance.In this study,we present a novel a...Missing data presents a crucial challenge in data analysis,especially in high-dimensional datasets,where missing data often leads to biased conclusions and degraded model performance.In this study,we present a novel autoencoder-based imputation framework that integrates a composite loss function to enhance robustness and precision.The proposed loss combines(i)a guided,masked mean squared error focusing on missing entries;(ii)a noise-aware regularization term to improve resilience against data corruption;and(iii)a variance penalty to encourage expressive yet stable reconstructions.We evaluate the proposed model across four missingness mechanisms,such as Missing Completely at Random,Missing at Random,Missing Not at Random,and Missing Not at Random with quantile censorship,under systematically varied feature counts,sample sizes,and missingness ratios ranging from 5%to 60%.Four publicly available real-world datasets(Stroke Prediction,Pima Indians Diabetes,Cardiovascular Disease,and Framingham Heart Study)were used,and the obtained results show that our proposed model consistently outperforms baseline methods,including traditional and deep learning-based techniques.An ablation study reveals the additive value of each component in the loss function.Additionally,we assessed the downstream utility of imputed data through classification tasks,where datasets imputed by the proposed method yielded the highest receiver operating characteristic area under the curve scores across all scenarios.The model demonstrates strong scalability and robustness,improving performance with larger datasets and higher feature counts.These results underscore the capacity of the proposed method to produce not only numerically accurate but also semantically useful imputations,making it a promising solution for robust data recovery in clinical applications.展开更多
Quick Access Recorder(QAR),an important device for storing data from various flight parameters,contains a large amount of valuable data and comprehensively records the real state of the airline flight.However,the reco...Quick Access Recorder(QAR),an important device for storing data from various flight parameters,contains a large amount of valuable data and comprehensively records the real state of the airline flight.However,the recorded data have certain missing values due to factors,such as weather and equipment anomalies.These missing values seriously affect the analysis of QAR data by aeronautical engineers,such as airline flight scenario reproduction and airline flight safety status assessment.Therefore,imputing missing values in the QAR data,which can further guarantee the flight safety of airlines,is crucial.QAR data also have multivariate,multiprocess,and temporal features.Therefore,we innovatively propose the imputation models A-AEGAN("A"denotes attention mechanism,"AE"denotes autoencoder,and"GAN"denotes generative adversarial network)and SA-AEGAN("SA"denotes self-attentive mechanism)for missing values of QAR data,which can be effectively applied to QAR data.Specifically,we apply an innovative generative adversarial network to impute missing values from QAR data.The improved gated recurrent unit is then introduced as the neural unit of GAN,which can successfully capture the temporal relationships in QAR data.In addition,we modify the basic structure of GAN by using an autoencoder as the generator and a recurrent neural network as the discriminator.The missing values in the QAR data are imputed by using the adversarial relationship between generator and discriminator.We introduce an attention mechanism in the autoencoder to further improve the capability of the proposed model to capture the features of QAR data.Attention mechanisms can maintain the correlation among QAR data and improve the capability of the model to impute missing data.Furthermore,we improve the proposed model by integrating a self-attention mechanism to further capture the relationship between different parameters within the QAR data.Experimental results on real datasets demonstrate that the model can reasonably impute the missing values in QAR data with excellent results.展开更多
Data imputation is an essential pre-processing task for data governance,aimed at filling in incomplete data.However,conventional data imputation methods can only partly alleviate data incompleteness using isolated tab...Data imputation is an essential pre-processing task for data governance,aimed at filling in incomplete data.However,conventional data imputation methods can only partly alleviate data incompleteness using isolated tabular data,and they fail to achieve the best balance between accuracy and eficiency.In this paper,we present a novel visual analysis approach for data imputation.We develop a multi-party tabular data association strategy that uses intelligent algorithms to identify similar columns and establish column correlations across multiple tables.Then,we perform the initial imputation of incomplete data using correlated data entries from other tables.Additionally,we develop a visual analysis system to refine data imputation candidates.Our interactive system combines the multi-party data imputation approach with expert knowledge,allowing for a better understanding of the relational structure of the data.This significantly enhances the accuracy and eficiency of data imputation,thereby enhancing the quality of data governance and the intrinsic value of data assets.Experimental validation and user surveys demonstrate that this method supports users in verifying and judging the associated columns and similar rows using theirdomain knowledge.展开更多
High-quality datasets are of paramount importance for the operation and planning of wind farms.However,the datasets collected by the supervisory control and data acquisition(SCADA)system may contain missing data due t...High-quality datasets are of paramount importance for the operation and planning of wind farms.However,the datasets collected by the supervisory control and data acquisition(SCADA)system may contain missing data due to various factors such as sensor failure and communication congestion.In this paper,a data-driven approach is proposed to fill the missing data of wind farms based on a context encoder(CE),which consists of an encoder,a decoder,and a discriminator.Through deep convolutional neural networks,the proposed method is able to automatically explore the complex nonlinear characteristics of the datasets that are difficult to be modeled explicitly.The proposed method can not only fully use the surrounding context information by the reconstructed loss,but also make filling data look real by the adversarial loss.In addition,the correlation among multiple missing attributes is taken into account by adjusting the format of input data.The simulation results show that CE performs better than traditional methods for the attributes of wind farms with hallmark characteristics such as large peaks,large valleys,and fast ramps.Moreover,the CE shows stronger generalization ability than traditional methods such as auto-encoder,K-means,k-nearest neighbor,back propagation neural network,cubic interpolation,and conditional generative adversarial network for different missing data scales.展开更多
Sufficient high-quality traffic data are a crucial component of various Intelligent Transportation System (ITS) applications and research related to congestion prediction, speed prediction, incident detection, and oth...Sufficient high-quality traffic data are a crucial component of various Intelligent Transportation System (ITS) applications and research related to congestion prediction, speed prediction, incident detection, and other traffic operation tasks. Nonetheless, missing traffic data are a common issue in sensor data which is inevitable due to several reasons, such as malfunctioning, poor maintenance or calibration, and intermittent communications. Such missing data issues often make data analysis and decision-making complicated and challenging. In this study, we have developed a generative adversarial network (GAN) based traffic sensor data imputation framework (TSDIGAN) to efficiently reconstruct the missing data by generating realistic synthetic data. In recent years, GANs have shown impressive success in image data generation. However, generating traffic data by taking advantage of GAN based modeling is a challenging task, since traffic data have strong time dependency. To address this problem, we propose a novel time-dependent encoding method called the Gramian Angular Summation Field (GASF) that converts the problem of traffic time-series data generation into that of image generation. We have evaluated and tested our proposed model using the benchmark dataset provided by Caltrans Performance Management Systems (PeMS). This study shows that the proposed model can significantly improve the traffic data imputation accuracy in terms of Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) compared to state-of-the-art models on the benchmark dataset. Further, the model achieves reasonably high accuracy in imputation tasks even under a very high missing data rate (>50%), which shows the robustness and efficiency of the proposed model.展开更多
Single-cell RNA sequencing(scRNA-seq)technology has become an effective tool for high-throughout transcriptomic study,which circumvents the averaging artifacts corresponding to bulk RNA-seq technology,yielding new per...Single-cell RNA sequencing(scRNA-seq)technology has become an effective tool for high-throughout transcriptomic study,which circumvents the averaging artifacts corresponding to bulk RNA-seq technology,yielding new perspectives on the cellular diversity of potential superficially homogeneous populations.Although various sequencing techniques have decreased the amplification bias and improved capture efficiency caused by the low amount of starting material,the technical noise and biological variation are inevitably introduced into experimental process,resulting in high dropout events,which greatly hinder the downstream analysis.Considering the bimodal expression pattern and the right-skewed characteristic existed in normalized scRNA-seq data,we propose a customized autoencoder based on a twopart-generalized-gamma distribution(AE-TPGG)for scRNAseq data analysis,which takes mixed discrete-continuous random variables of scRNA-seq data into account using a twopart model and utilizes the generalized gamma(GG)distribution,for fitting the positive and right-skewed continuous data.The adopted autoencoder enables AE-TPGG to captures the inherent relationship between genes.In addition to the ability of achieving low-dimensional representation,the AETPGG model also provides a denoised imputation according to statistical characteristic of gene expression.Results on real datasets demonstrate that our proposed model is competitive to current imputation methods and ameliorates a diverse set of typical scRNA-seq data analyses.展开更多
To fully leverage‘‘smart”transportation infrastructure data-stream investments,the creation of applications that provide real-time meaningful and actionable corridorperformance metrics is needed.However,the presenc...To fully leverage‘‘smart”transportation infrastructure data-stream investments,the creation of applications that provide real-time meaningful and actionable corridorperformance metrics is needed.However,the presence of gaps in data streams can lead to significant application implementation challenges.To demonstrate and help address these challenges,a digital twin smart-corridor application case study is presented with two primary research objectives:(1)explore the characteristics of volume data gaps on the case study corridor,and(2)investigate the feasibility of prioritizing data streams for data imputation to drive the real-time application.For the first objective,a K-means clustering analysis is used to identify similarities and differences among data gap patterns.The clustering analysis successfully identifies eight different data loss patterns.Patterns vary in both continuity and density of data gap occurrences,as well as time-dependent losses in several clusters.For the second objective,a temporal-neighboring interpolation approach for volume data imputation is explored.When investigating the use of temporalneighboring interpolation imputations on the digital twin application,performance is,in part,dependent on the combination of intersection approaches experiencing data loss,demand relative to capacity at individual locations,and the location of the loss along the corridor.The results indicate that these insights could be used to prioritize intersection approaches suitable for data imputation and to identify locations that require a more sensitive imputation methodology or improved maintenance and monitoring.展开更多
Environmental parameter data collected by sensors for monitoring the environment of agricultural facility operations are usually incomplete due to external environmental disturbances and device failures.And the missin...Environmental parameter data collected by sensors for monitoring the environment of agricultural facility operations are usually incomplete due to external environmental disturbances and device failures.And the missing of collected data is completely at random.In practice,missing data could create biased estimations and make multivariate time series predictions of environmental parameters difficult,leading to imprecise environmental control.A multivariate time series imputation model based on generative adversarial networks and multi-head attention(ATTN-GAN)is proposed in this work to reducing the negative consequence of missing data.ATTN-GAN can capture the temporal and spatial correlation of time series,and has a good capacity to learn data distribution.In the downstream experiments,we used ATTN-GAN and baseline models for data imputation,and predicted the imputed data,respectively.For the imputation of missing data,over the 20%,50%and 80%missing rate,ATTN-GAN had the lowest RMSE,0.1593,0.2012 and 0.2688 respectively.For water temperature prediction,data processed with ATTN-GAN over MLP,LSTM,DA-RNN prediction methods had the lowest MSE,0.6816,0.8375 and 0.3736 respectively.Those results revealed that ATTN-GAN outperformed all baseline models in terms of data imputation accuracy.The data processed by ATTN-GAN is the best for time series prediction.展开更多
Recent advances in single-cell DNA methylation have provided unprecedented opportunities to explore cellular epigenetic differences with maximal resolution.A common workflow for single-cell DNA methylation analysis is...Recent advances in single-cell DNA methylation have provided unprecedented opportunities to explore cellular epigenetic differences with maximal resolution.A common workflow for single-cell DNA methylation analysis is binning the genome into multiple regions and computing the average methylation level within each region.In this process,imputing not available(NA)values which are caused by the limited number of captured methylation sites is a necessary preprocessing step for downstream analyses.Existing studies have employed several simple imputation methods(such as zeros imputation or means imputation),however,there is a lack of theoretical studies or benchmark tests of these approaches.Through both experiments and theoretical analysis,we found that using the medians to impute NA values can effectively and simply reflect the methylation state of the NA values,providing an accurate foundation for downstream analyses.展开更多
Substantial advancements have been achieved in Tunnel Boring Machine(TBM)technology and monitoring systems,yet the presence of missing data impedes accurate analysis and interpretation of TBM monitoring results.This s...Substantial advancements have been achieved in Tunnel Boring Machine(TBM)technology and monitoring systems,yet the presence of missing data impedes accurate analysis and interpretation of TBM monitoring results.This study aims to investigate the issue of missing data in extensive TBM datasets.Through a comprehensive literature review,we analyze the mechanism of missing TBM data and compare different imputation methods,including statistical analysis and machine learning algorithms.We also examine the impact of various missing patterns and rates on the efficacy of these methods.Finally,we propose a dynamic interpolation strategy tailored for TBM engineering sites.The research results show that K-Nearest Neighbors(KNN)and Random Forest(RF)algorithms can achieve good interpolation results;As the missing rate increases,the interpolation effect of different methods will decrease;The interpolation effect of block missing is poor,followed by mixed missing,and the interpolation effect of sporadic missing is the best.On-site application results validate the proposed interpolation strategy's capability to achieve robust missing value interpolation effects,applicable in ML scenarios such as parameter optimization,attitude warning,and pressure prediction.These findings contribute to enhancing the efficiency of TBM missing data processing,offering more effective support for large-scale TBM monitoring datasets.展开更多
In order to improve the data quality,the big data cleaning method for distribution networks is studied in this paper.First,the Local Outlier Factor(LOF)algorithm based on DBSCAN clustering is used to detect outliers.H...In order to improve the data quality,the big data cleaning method for distribution networks is studied in this paper.First,the Local Outlier Factor(LOF)algorithm based on DBSCAN clustering is used to detect outliers.However,due to the difficulty in determining the LOF threshold,a method of dynamically calculating the threshold based on the transformer districts and time is proposed.In addition,the LOF algorithm combines the statistical distribution method to reduce the misjudgment rate.Aiming at the diversity and complexity of data missing forms in power big data,this paper has improved the Random Forest imputation algorithm,which can be applied to various forms of missing data,especially the blocked missing data and even some completely missing horizontal or vertical data.The data in this paper are from real data of 44 transformer districts of a certain 10 kV line in a distribution network.Experimental results show that outlier detection is accurate and suitable for any shape and multidimensional power big data.The improved Random Forest imputation algorithm is suitable for all missing forms,with higher imputation accuracy and better model stability.By comparing the network loss prediction between the data using this data cleaning method and the data removing outliers and missing values,it can be found that the accuracy of network loss prediction has improved by nearly 4%using the data cleaning method identified in this paper.Additionally,as the proportion of bad data increased,the difference between the prediction accuracy of cleaned data and that of uncleaned data is more significant.展开更多
With the increasing development of intelligent detection devices,a vast amount of traffic flow data can be collected from intelligent transportation systems.However,these data often encounter issues such as missing an...With the increasing development of intelligent detection devices,a vast amount of traffic flow data can be collected from intelligent transportation systems.However,these data often encounter issues such as missing and abnormal values,which can adversely affect the accuracy of future tasks like traffic flow forecasting.To address this problem,this paper proposes the Attention-based Spatiotemporal Generative Adversarial Imputation Network(ASTGAIN)model,comprising a generator and a discriminator,to conduct traffic volume imputation.The generator incorporates an information fuse module,a spatial attention mechanism,a causal inference module and a temporal attention mechanism,enabling it to capture historical information and extract spatiotemporal relationships from the traffic flow data.The discriminator features a bidirectional gated recurrent unit,which explores the temporal correlation of the imputed data to distinguish between imputed and original values.Additionally,we have devised an imputation filling technique that fully leverages the imputed data to enhance the imputation performance.Comparison experiments with several traditional imputation models demonstrate the superior performance of the ASTGAIN model across diverse missing scenarios.展开更多
Solar energy has become crucial in producing electrical energy because it is inexhaustible and sustainable.However,its uncertain generation causes problems in power system operation.Therefore,solar irradiance forecast...Solar energy has become crucial in producing electrical energy because it is inexhaustible and sustainable.However,its uncertain generation causes problems in power system operation.Therefore,solar irradiance forecasting is significant for suitable controlling power system operation,organizing the transmission expansion planning,and dispatching power system generation.Nonetheless,the forecasting performance can be decreased due to the unfitted prediction model and lacked preprocessing.To deal with mentioned issues,this paper pro-poses Meta-Learning Extreme Learning Machine optimized with Golden Eagle Optimization and Logistic Map(MGEL-ELM)and the Same Datetime Interval Averaged Imputation algorithm(SAME)for improving the fore-casting performance of incomplete solar irradiance time series datasets.Thus,the proposed method is not only imputing incomplete forecasting data but also achieving forecasting accuracy.The experimental result of fore-casting solar irradiance dataset in Thailand indicates that the proposed method can achieve the highest coeffi-cient of determination value up to 0.9307 compared to state-of-the-art models.Furthermore,the proposed method consumes less forecasting time than the deep learning model.展开更多
Metal powder contributes to the environmental burdens of additive manufacturing(AM)substantially.Current life cycle assessments(LCAs)of metal powders present considerable variations of lifecycle environmental inventor...Metal powder contributes to the environmental burdens of additive manufacturing(AM)substantially.Current life cycle assessments(LCAs)of metal powders present considerable variations of lifecycle environmental inventory due to process divergence,spatial heterogeneity,or temporalfluctuation.Most importantly,the amounts of LCA studies on metal powder are limited and primarily confined to partial material types.To this end,based on the data surveyed from a metal powder supplier,this study conducted an LCA of titanium and nickel alloy produced by electrode-inducted and vacuum-inducted melting gas atomization,respectively.Given that energy consumption dominates the environmental burden of powder production and is influenced by metal materials’physical properties,we proposed a Bayesian stochastic Kriging model to estimate the energy consumption during the gas atomization process.This model considered the inherent uncertainties of training data and adaptively updated the parameters of interest when new environmental data on gas atomization were available.With the predicted energy use information of specific powder,the corresponding lifecycle environmental impacts can be further autonomously estimated in conjunction with the other surveyed powder production stages.Results indicated the environmental impact of titanium alloy powder is slightly higher than that of nickel alloy powder and their lifecycle carbon emissions are around 20 kg CO_(2)equivalency.The proposed Bayesian stochastic Kriging model showed more accurate predictions of energy consumption compared with conventional Kriging and stochastic Kriging models.This study enables data imputation of energy consumption during gas atomization given the physical properties and producing technique of powder materials.展开更多
DNA methylation is one important epigenetic type to play a vital role in many diseases including cancers.With the development of the high-throughput sequencing technology,there is much progress to disclose the relatio...DNA methylation is one important epigenetic type to play a vital role in many diseases including cancers.With the development of the high-throughput sequencing technology,there is much progress to disclose the relations of DNA methylation with diseases.However,the analyses of DNA methylation data are challenging due to the missing values caused by the limitations of current techniques.While many methods have been developed to impute the missing values,these methods are mostly based on the correlations between individual samples,and thus are limited for the abnormal samples in cancers.In this study,we present a novel transfer learning based neural network to impute missing DNA methylation data,namely the TDimpute-DNAmeth method.The method learns common relations between DNA methylation from pan-cancer samples,and then fine-tunes the learned relations over each specific cancer type for imputing the missing data.Tested on 16 cancer datasets,our method was shown to outperform other commonly-used methods.Further analyses indicated that DNA methylation is related to cancer survival and thus can be used as a biomarker of cancer prognosis.展开更多
基金partially supported by the National Natural Science Foundation of China(62271485)the SDHS Science and Technology Project(HS2023B044)
文摘Imputation of missing data has long been an important topic and an essential application for intelligent transportation systems(ITS)in the real world.As a state-of-the-art generative model,the diffusion model has proven highly successful in image generation,speech generation,time series modelling etc.and now opens a new avenue for traffic data imputation.In this paper,we propose a conditional diffusion model,called the implicit-explicit diffusion model,for traffic data imputation.This model exploits both the implicit and explicit feature of the data simultaneously.More specifically,we design two types of feature extraction modules,one to capture the implicit dependencies hidden in the raw data at multiple time scales and the other to obtain the long-term temporal dependencies of the time series.This approach not only inherits the advantages of the diffusion model for estimating missing data,but also takes into account the multiscale correlation inherent in traffic data.To illustrate the performance of the model,extensive experiments are conducted on three real-world time series datasets using different missing rates.The experimental results demonstrate that the model improves imputation accuracy and generalization capability.
文摘Accurate traffic flow prediction(TFP)is vital for efficient and sustainable transportation management and the development of intelligent traffic systems.However,missing data in real-world traffic datasets poses a significant challenge to maintaining prediction precision.This study introduces REPTF-TMDI,a novel method that combines a Reduced Error Pruning Tree Forest(REPTree Forest)with a newly proposed Time-based Missing Data Imputation(TMDI)approach.The REP Tree Forest,an ensemble learning approach,is tailored for time-related traffic data to enhance predictive accuracy and support the evolution of sustainable urbanmobility solutions.Meanwhile,the TMDI approach exploits temporal patterns to estimate missing values reliably whenever empty fields are encountered.The proposed method was evaluated using hourly traffic flow data from a major U.S.roadway spanning 2012-2018,incorporating temporal features(e.g.,hour,day,month,year,weekday),holiday indicator,and weather conditions(temperature,rain,snow,and cloud coverage).Experimental results demonstrated that the REPTF-TMDI method outperformed conventional imputation techniques across various missing data ratios by achieving an average 11.76%improvement in terms of correlation coefficient(R).Furthermore,REPTree Forest achieved improvements of 68.62%in RMSE and 70.52%in MAE compared to existing state-of-the-art models.These findings highlight the method’s ability to significantly boost traffic flow prediction accuracy,even in the presence of missing data,thereby contributing to the broader objectives of sustainable urban transportation systems.
文摘This research was an effort to select best imputation method for missing upper air temperature data over 24 standard pressure levels. We have implemented four imputation techniques like inverse distance weighting, Bilinear, Natural and Nearest interpolation for missing data imputations. Performance indicators for these techniques were the root mean square error (RMSE), absolute mean error (AME), correlation coefficient and coefficient of determination ( R<sup>2</sup> ) adopted in this research. We randomly make 30% of total samples (total samples was 324) predictable from 70% remaining data. Although four interpolation methods seem good (producing <1 RMSE, AME) for imputations of air temperature data, but bilinear method was the most accurate with least errors for missing data imputations. RMSE for bilinear method remains <0.01 on all pressure levels except 1000 hPa where this value was 0.6. The low value of AME (<0.1) came at all pressure levels through bilinear imputations. Very strong correlation (>0.99) found between actual and predicted air temperature data through this method. The high value of the coefficient of determination (0.99) through bilinear interpolation method, tells us best fit to the surface. We have also found similar results for imputation with natural interpolation method in this research, but after investigating scatter plots over each month, imputations with this method seem to little obtuse in certain months than bilinear method.
基金the Natural Sciences and Engineering Research Council of Canada(NSERC)a start-up grant from Concordia University,Canada.
文摘Electric vehicles(EVs)are a sustainable mode of transportation,significantly reducing greenhouse gas emissions.The development of EV charging stations is crucial for supporting the growing number of EVs and integrating them into smart grid infrastructure.Efficient use of these stations requires optimized energy management and accurate forecasting of EV charging behaviors.However,forecasting accuracy is often hindered by missing data due to connectivity issues and equipment failures.To address these challenges,this study introduces a novel data imputation method ResiDualNet(Residual Dual BiLSTM-CNN Path Network),which is a residual sequence-to-sequence technique for imputing missing EV charging data.This model effectively captures underlying temporal and long-term dependencies,demonstrating strong performance across various scenarios.We compare our proposed model with two commonly used imputation methods KNN and Mean Imputation and one generative model,Generative Adversarial Network(GAN),across four different EV charging datasets.Experimental results demonstrate that our model significantly outperforms the others,showing an average improvement of 82%in terms of root mean squared error(RMSE)across all datasets.To further assess the effectiveness of our imputation model,we utilize three cutting-edge and newly introduced forecasting models:Bidirectional Long Short-Term Memory(BiLSTM),Mogrifier LSTM,and Sample Convolution and Interaction Network(SCINet)to predict EV charging load.The results indicate that SCINet outperforms the other forecasting techniques.Moreover,for SCINet,the dataset imputed by our proposed model performs second best after the real dataset,confirming the effectiveness of our imputation approach in improving forecasting accuracy for EV charging data.The complete source code is provided in the following repository:https://github.com/fffahim/ResiDualNet.git.
基金supported by the Intelligent System Research Group(ISysRG)supported by Universitas Sriwijaya funded by the Competitive Research 2024.
文摘Handling missing data accurately is critical in clinical research, where data quality directly impacts decision-making and patient outcomes. While deep learning (DL) techniques for data imputation have gained attention, challenges remain, especially when dealing with diverse data types. In this study, we introduce a novel data imputation method based on a modified convolutional neural network, specifically, a Deep Residual-Convolutional Neural Network (DRes-CNN) architecture designed to handle missing values across various datasets. Our approach demonstrates substantial improvements over existing imputation techniques by leveraging residual connections and optimized convolutional layers to capture complex data patterns. We evaluated the model on publicly available datasets, including Medical Information Mart for Intensive Care (MIMIC-III and MIMIC-IV), which contain critical care patient data, and the Beijing Multi-Site Air Quality dataset, which measures environmental air quality. The proposed DRes-CNN method achieved a root mean square error (RMSE) of 0.00006, highlighting its high accuracy and robustness. We also compared with Low Light-Convolutional Neural Network (LL-CNN) and U-Net methods, which had RMSE values of 0.00075 and 0.00073, respectively. This represented an improvement of approximately 92% over LL-CNN and 91% over U-Net. The results showed that this DRes-CNN-based imputation method outperforms current state-of-the-art models. These results established DRes-CNN as a reliable solution for addressing missing data.
文摘Missing data presents a crucial challenge in data analysis,especially in high-dimensional datasets,where missing data often leads to biased conclusions and degraded model performance.In this study,we present a novel autoencoder-based imputation framework that integrates a composite loss function to enhance robustness and precision.The proposed loss combines(i)a guided,masked mean squared error focusing on missing entries;(ii)a noise-aware regularization term to improve resilience against data corruption;and(iii)a variance penalty to encourage expressive yet stable reconstructions.We evaluate the proposed model across four missingness mechanisms,such as Missing Completely at Random,Missing at Random,Missing Not at Random,and Missing Not at Random with quantile censorship,under systematically varied feature counts,sample sizes,and missingness ratios ranging from 5%to 60%.Four publicly available real-world datasets(Stroke Prediction,Pima Indians Diabetes,Cardiovascular Disease,and Framingham Heart Study)were used,and the obtained results show that our proposed model consistently outperforms baseline methods,including traditional and deep learning-based techniques.An ablation study reveals the additive value of each component in the loss function.Additionally,we assessed the downstream utility of imputed data through classification tasks,where datasets imputed by the proposed method yielded the highest receiver operating characteristic area under the curve scores across all scenarios.The model demonstrates strong scalability and robustness,improving performance with larger datasets and higher feature counts.These results underscore the capacity of the proposed method to produce not only numerically accurate but also semantically useful imputations,making it a promising solution for robust data recovery in clinical applications.
基金This work was supported by the National Natural Science Foundation of China(Nos.61972456,61402329)the Natural Science Foundation of Tianjin(Nos.19JCYBJC15400,21YDTPJC00440)。
文摘Quick Access Recorder(QAR),an important device for storing data from various flight parameters,contains a large amount of valuable data and comprehensively records the real state of the airline flight.However,the recorded data have certain missing values due to factors,such as weather and equipment anomalies.These missing values seriously affect the analysis of QAR data by aeronautical engineers,such as airline flight scenario reproduction and airline flight safety status assessment.Therefore,imputing missing values in the QAR data,which can further guarantee the flight safety of airlines,is crucial.QAR data also have multivariate,multiprocess,and temporal features.Therefore,we innovatively propose the imputation models A-AEGAN("A"denotes attention mechanism,"AE"denotes autoencoder,and"GAN"denotes generative adversarial network)and SA-AEGAN("SA"denotes self-attentive mechanism)for missing values of QAR data,which can be effectively applied to QAR data.Specifically,we apply an innovative generative adversarial network to impute missing values from QAR data.The improved gated recurrent unit is then introduced as the neural unit of GAN,which can successfully capture the temporal relationships in QAR data.In addition,we modify the basic structure of GAN by using an autoencoder as the generator and a recurrent neural network as the discriminator.The missing values in the QAR data are imputed by using the adversarial relationship between generator and discriminator.We introduce an attention mechanism in the autoencoder to further improve the capability of the proposed model to capture the features of QAR data.Attention mechanisms can maintain the correlation among QAR data and improve the capability of the model to impute missing data.Furthermore,we improve the proposed model by integrating a self-attention mechanism to further capture the relationship between different parameters within the QAR data.Experimental results on real datasets demonstrate that the model can reasonably impute the missing values in QAR data with excellent results.
基金Project supported by the Key R&D"Pioneer"Tackling Plan Program of Zhejiang Province,China(No.2023C01119)the"Ten Thousand Talents Plan"Science and Technology Innovation Leading Talent Program of Zhejiang Province,China(No.2022R52044)+1 种基金the Major Standardization Pilot Projects for the Digital Economy(Digital Trade Sector)of Zhejiang Province,China(No.SJ-Bz/2023053)the National Natural Science Foundationof China(No.62132017)。
文摘Data imputation is an essential pre-processing task for data governance,aimed at filling in incomplete data.However,conventional data imputation methods can only partly alleviate data incompleteness using isolated tabular data,and they fail to achieve the best balance between accuracy and eficiency.In this paper,we present a novel visual analysis approach for data imputation.We develop a multi-party tabular data association strategy that uses intelligent algorithms to identify similar columns and establish column correlations across multiple tables.Then,we perform the initial imputation of incomplete data using correlated data entries from other tables.Additionally,we develop a visual analysis system to refine data imputation candidates.Our interactive system combines the multi-party data imputation approach with expert knowledge,allowing for a better understanding of the relational structure of the data.This significantly enhances the accuracy and eficiency of data imputation,thereby enhancing the quality of data governance and the intrinsic value of data assets.Experimental validation and user surveys demonstrate that this method supports users in verifying and judging the associated columns and similar rows using theirdomain knowledge.
基金This work was supported by the China Scholarship Council.
文摘High-quality datasets are of paramount importance for the operation and planning of wind farms.However,the datasets collected by the supervisory control and data acquisition(SCADA)system may contain missing data due to various factors such as sensor failure and communication congestion.In this paper,a data-driven approach is proposed to fill the missing data of wind farms based on a context encoder(CE),which consists of an encoder,a decoder,and a discriminator.Through deep convolutional neural networks,the proposed method is able to automatically explore the complex nonlinear characteristics of the datasets that are difficult to be modeled explicitly.The proposed method can not only fully use the surrounding context information by the reconstructed loss,but also make filling data look real by the adversarial loss.In addition,the correlation among multiple missing attributes is taken into account by adjusting the format of input data.The simulation results show that CE performs better than traditional methods for the attributes of wind farms with hallmark characteristics such as large peaks,large valleys,and fast ramps.Moreover,the CE shows stronger generalization ability than traditional methods such as auto-encoder,K-means,k-nearest neighbor,back propagation neural network,cubic interpolation,and conditional generative adversarial network for different missing data scales.
文摘Sufficient high-quality traffic data are a crucial component of various Intelligent Transportation System (ITS) applications and research related to congestion prediction, speed prediction, incident detection, and other traffic operation tasks. Nonetheless, missing traffic data are a common issue in sensor data which is inevitable due to several reasons, such as malfunctioning, poor maintenance or calibration, and intermittent communications. Such missing data issues often make data analysis and decision-making complicated and challenging. In this study, we have developed a generative adversarial network (GAN) based traffic sensor data imputation framework (TSDIGAN) to efficiently reconstruct the missing data by generating realistic synthetic data. In recent years, GANs have shown impressive success in image data generation. However, generating traffic data by taking advantage of GAN based modeling is a challenging task, since traffic data have strong time dependency. To address this problem, we propose a novel time-dependent encoding method called the Gramian Angular Summation Field (GASF) that converts the problem of traffic time-series data generation into that of image generation. We have evaluated and tested our proposed model using the benchmark dataset provided by Caltrans Performance Management Systems (PeMS). This study shows that the proposed model can significantly improve the traffic data imputation accuracy in terms of Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) compared to state-of-the-art models on the benchmark dataset. Further, the model achieves reasonably high accuracy in imputation tasks even under a very high missing data rate (>50%), which shows the robustness and efficiency of the proposed model.
基金This research was supported by the National Natural Science Foundation of China(Grant Nos.62136004,61802193)the National Key R&D Program of China(2018YFC2001600,2018YFC2001602)+1 种基金the Natural Science Foundation of Jiangsu Province(BK20170934)the Fundamental Research Funds for the Central Universities(NJ2020023)。
文摘Single-cell RNA sequencing(scRNA-seq)technology has become an effective tool for high-throughout transcriptomic study,which circumvents the averaging artifacts corresponding to bulk RNA-seq technology,yielding new perspectives on the cellular diversity of potential superficially homogeneous populations.Although various sequencing techniques have decreased the amplification bias and improved capture efficiency caused by the low amount of starting material,the technical noise and biological variation are inevitably introduced into experimental process,resulting in high dropout events,which greatly hinder the downstream analysis.Considering the bimodal expression pattern and the right-skewed characteristic existed in normalized scRNA-seq data,we propose a customized autoencoder based on a twopart-generalized-gamma distribution(AE-TPGG)for scRNAseq data analysis,which takes mixed discrete-continuous random variables of scRNA-seq data into account using a twopart model and utilizes the generalized gamma(GG)distribution,for fitting the positive and right-skewed continuous data.The adopted autoencoder enables AE-TPGG to captures the inherent relationship between genes.In addition to the ability of achieving low-dimensional representation,the AETPGG model also provides a denoised imputation according to statistical characteristic of gene expression.Results on real datasets demonstrate that our proposed model is competitive to current imputation methods and ameliorates a diverse set of typical scRNA-seq data analyses.
基金supported in part by the City of Atlanta(CoA)under Research Project FC-9930-Smart Cities Traffic Congestion Mitigation Program and in part by the National Center of Sustainable Transportation(NCST)under NCST Dissertation Fund.
文摘To fully leverage‘‘smart”transportation infrastructure data-stream investments,the creation of applications that provide real-time meaningful and actionable corridorperformance metrics is needed.However,the presence of gaps in data streams can lead to significant application implementation challenges.To demonstrate and help address these challenges,a digital twin smart-corridor application case study is presented with two primary research objectives:(1)explore the characteristics of volume data gaps on the case study corridor,and(2)investigate the feasibility of prioritizing data streams for data imputation to drive the real-time application.For the first objective,a K-means clustering analysis is used to identify similarities and differences among data gap patterns.The clustering analysis successfully identifies eight different data loss patterns.Patterns vary in both continuity and density of data gap occurrences,as well as time-dependent losses in several clusters.For the second objective,a temporal-neighboring interpolation approach for volume data imputation is explored.When investigating the use of temporalneighboring interpolation imputations on the digital twin application,performance is,in part,dependent on the combination of intersection approaches experiencing data loss,demand relative to capacity at individual locations,and the location of the loss along the corridor.The results indicate that these insights could be used to prioritize intersection approaches suitable for data imputation and to identify locations that require a more sensitive imputation methodology or improved maintenance and monitoring.
基金supported by the National Natural Science Foundation of China:“Regularity and prediction model of juvenile fish growth under synergistic effect of water temperature and flow fields in recirculating aquaculture”(Grant No.32373185)2115 Talent Development Program of China Agricultural University,Overseas High-level Youth Talents Program(China Agricultural University,China,Grant No.62339001)+2 种基金China Agricultural University Excellent Talents Plan(Grant No.31051015)Major Science and Technology Innovation Fund 2019 of Shandong Province(Grant No.2019JZZY010703)National Innovation Center for Digital Fishery,and Beijing Engineering and Technology Research Center for Internet of Things in Agriculture.
文摘Environmental parameter data collected by sensors for monitoring the environment of agricultural facility operations are usually incomplete due to external environmental disturbances and device failures.And the missing of collected data is completely at random.In practice,missing data could create biased estimations and make multivariate time series predictions of environmental parameters difficult,leading to imprecise environmental control.A multivariate time series imputation model based on generative adversarial networks and multi-head attention(ATTN-GAN)is proposed in this work to reducing the negative consequence of missing data.ATTN-GAN can capture the temporal and spatial correlation of time series,and has a good capacity to learn data distribution.In the downstream experiments,we used ATTN-GAN and baseline models for data imputation,and predicted the imputed data,respectively.For the imputation of missing data,over the 20%,50%and 80%missing rate,ATTN-GAN had the lowest RMSE,0.1593,0.2012 and 0.2688 respectively.For water temperature prediction,data processed with ATTN-GAN over MLP,LSTM,DA-RNN prediction methods had the lowest MSE,0.6816,0.8375 and 0.3736 respectively.Those results revealed that ATTN-GAN outperformed all baseline models in terms of data imputation accuracy.The data processed by ATTN-GAN is the best for time series prediction.
基金National Natural Science Foundation of China,Grant/Award Numbers:62203236,62473212Young Elite Scientists Sponsorship Program by CAST,Grant/Award Number:2023QNRC001。
文摘Recent advances in single-cell DNA methylation have provided unprecedented opportunities to explore cellular epigenetic differences with maximal resolution.A common workflow for single-cell DNA methylation analysis is binning the genome into multiple regions and computing the average methylation level within each region.In this process,imputing not available(NA)values which are caused by the limited number of captured methylation sites is a necessary preprocessing step for downstream analyses.Existing studies have employed several simple imputation methods(such as zeros imputation or means imputation),however,there is a lack of theoretical studies or benchmark tests of these approaches.Through both experiments and theoretical analysis,we found that using the medians to impute NA values can effectively and simply reflect the methylation state of the NA values,providing an accurate foundation for downstream analyses.
基金supported by the National Natural Science Foundation of China(Grant No.52409151)the Programme of Shenzhen Key Laboratory of Green,Efficient and Intelligent Construction of Underground Metro Station(Programme No.ZDSYS20200923105200001)the Science and Technology Major Project of Xizang Autonomous Region of China(XZ202201ZD0003G).
文摘Substantial advancements have been achieved in Tunnel Boring Machine(TBM)technology and monitoring systems,yet the presence of missing data impedes accurate analysis and interpretation of TBM monitoring results.This study aims to investigate the issue of missing data in extensive TBM datasets.Through a comprehensive literature review,we analyze the mechanism of missing TBM data and compare different imputation methods,including statistical analysis and machine learning algorithms.We also examine the impact of various missing patterns and rates on the efficacy of these methods.Finally,we propose a dynamic interpolation strategy tailored for TBM engineering sites.The research results show that K-Nearest Neighbors(KNN)and Random Forest(RF)algorithms can achieve good interpolation results;As the missing rate increases,the interpolation effect of different methods will decrease;The interpolation effect of block missing is poor,followed by mixed missing,and the interpolation effect of sporadic missing is the best.On-site application results validate the proposed interpolation strategy's capability to achieve robust missing value interpolation effects,applicable in ML scenarios such as parameter optimization,attitude warning,and pressure prediction.These findings contribute to enhancing the efficiency of TBM missing data processing,offering more effective support for large-scale TBM monitoring datasets.
基金supported in part by the National Natural Science Foundation of China(NSFC)under Grant U1966207 and 51822702in part by the Key Research and Development Program of Hunan Province of China under Grant 2018GK2031+1 种基金in part by the 111 Project of China under Grant B17016,in part by the Innovative Construction Program of Hunan Province of China under Grant 2019RS1016in part by the Excellent Innovation Youth Program of Changsha of China under Grant KQ1802029.
文摘In order to improve the data quality,the big data cleaning method for distribution networks is studied in this paper.First,the Local Outlier Factor(LOF)algorithm based on DBSCAN clustering is used to detect outliers.However,due to the difficulty in determining the LOF threshold,a method of dynamically calculating the threshold based on the transformer districts and time is proposed.In addition,the LOF algorithm combines the statistical distribution method to reduce the misjudgment rate.Aiming at the diversity and complexity of data missing forms in power big data,this paper has improved the Random Forest imputation algorithm,which can be applied to various forms of missing data,especially the blocked missing data and even some completely missing horizontal or vertical data.The data in this paper are from real data of 44 transformer districts of a certain 10 kV line in a distribution network.Experimental results show that outlier detection is accurate and suitable for any shape and multidimensional power big data.The improved Random Forest imputation algorithm is suitable for all missing forms,with higher imputation accuracy and better model stability.By comparing the network loss prediction between the data using this data cleaning method and the data removing outliers and missing values,it can be found that the accuracy of network loss prediction has improved by nearly 4%using the data cleaning method identified in this paper.Additionally,as the proportion of bad data increased,the difference between the prediction accuracy of cleaned data and that of uncleaned data is more significant.
基金funded in part by Key R&D Program of Hunan Province(Grant No.2023GK2014)Key technology projects in the transportation industry(Grant No.2022-ZD6-077)+1 种基金Transportation Science and Technology Plan Project of Shandong Transportation Department(Grant No.2022B62)the Fundamental Research Funds for the Central Universities of Central South University(Grant No.2023ZZTS0683)。
文摘With the increasing development of intelligent detection devices,a vast amount of traffic flow data can be collected from intelligent transportation systems.However,these data often encounter issues such as missing and abnormal values,which can adversely affect the accuracy of future tasks like traffic flow forecasting.To address this problem,this paper proposes the Attention-based Spatiotemporal Generative Adversarial Imputation Network(ASTGAIN)model,comprising a generator and a discriminator,to conduct traffic volume imputation.The generator incorporates an information fuse module,a spatial attention mechanism,a causal inference module and a temporal attention mechanism,enabling it to capture historical information and extract spatiotemporal relationships from the traffic flow data.The discriminator features a bidirectional gated recurrent unit,which explores the temporal correlation of the imputed data to distinguish between imputed and original values.Additionally,we have devised an imputation filling technique that fully leverages the imputed data to enhance the imputation performance.Comparison experiments with several traditional imputation models demonstrate the superior performance of the ASTGAIN model across diverse missing scenarios.
文摘Solar energy has become crucial in producing electrical energy because it is inexhaustible and sustainable.However,its uncertain generation causes problems in power system operation.Therefore,solar irradiance forecasting is significant for suitable controlling power system operation,organizing the transmission expansion planning,and dispatching power system generation.Nonetheless,the forecasting performance can be decreased due to the unfitted prediction model and lacked preprocessing.To deal with mentioned issues,this paper pro-poses Meta-Learning Extreme Learning Machine optimized with Golden Eagle Optimization and Logistic Map(MGEL-ELM)and the Same Datetime Interval Averaged Imputation algorithm(SAME)for improving the fore-casting performance of incomplete solar irradiance time series datasets.Thus,the proposed method is not only imputing incomplete forecasting data but also achieving forecasting accuracy.The experimental result of fore-casting solar irradiance dataset in Thailand indicates that the proposed method can achieve the highest coeffi-cient of determination value up to 0.9307 compared to state-of-the-art models.Furthermore,the proposed method consumes less forecasting time than the deep learning model.
基金funded by the National Natural Science Foundation of China under Grant No.52305544the Project of Guangdong Science and Technology Innovation Strategy under Grant No.STKJ202209065.
文摘Metal powder contributes to the environmental burdens of additive manufacturing(AM)substantially.Current life cycle assessments(LCAs)of metal powders present considerable variations of lifecycle environmental inventory due to process divergence,spatial heterogeneity,or temporalfluctuation.Most importantly,the amounts of LCA studies on metal powder are limited and primarily confined to partial material types.To this end,based on the data surveyed from a metal powder supplier,this study conducted an LCA of titanium and nickel alloy produced by electrode-inducted and vacuum-inducted melting gas atomization,respectively.Given that energy consumption dominates the environmental burden of powder production and is influenced by metal materials’physical properties,we proposed a Bayesian stochastic Kriging model to estimate the energy consumption during the gas atomization process.This model considered the inherent uncertainties of training data and adaptively updated the parameters of interest when new environmental data on gas atomization were available.With the predicted energy use information of specific powder,the corresponding lifecycle environmental impacts can be further autonomously estimated in conjunction with the other surveyed powder production stages.Results indicated the environmental impact of titanium alloy powder is slightly higher than that of nickel alloy powder and their lifecycle carbon emissions are around 20 kg CO_(2)equivalency.The proposed Bayesian stochastic Kriging model showed more accurate predictions of energy consumption compared with conventional Kriging and stochastic Kriging models.This study enables data imputation of energy consumption during gas atomization given the physical properties and producing technique of powder materials.
基金supported by the National Key Research and Development Program of China under Grant No.2020YFB0204803the National Natural Science Foundation of China under Grant No.61772566+2 种基金the Guangdong Key Field Research and Development Plan under Grant Nos.2019B020228001 and 2018B010109006the Introducing Innovative and Entrepreneurial Teams of Guangdong under Grant No.2016ZT06D211the Guangzhou Science and Technology Research Plan under Grant No.202007030010.
文摘DNA methylation is one important epigenetic type to play a vital role in many diseases including cancers.With the development of the high-throughput sequencing technology,there is much progress to disclose the relations of DNA methylation with diseases.However,the analyses of DNA methylation data are challenging due to the missing values caused by the limitations of current techniques.While many methods have been developed to impute the missing values,these methods are mostly based on the correlations between individual samples,and thus are limited for the abnormal samples in cancers.In this study,we present a novel transfer learning based neural network to impute missing DNA methylation data,namely the TDimpute-DNAmeth method.The method learns common relations between DNA methylation from pan-cancer samples,and then fine-tunes the learned relations over each specific cancer type for imputing the missing data.Tested on 16 cancer datasets,our method was shown to outperform other commonly-used methods.Further analyses indicated that DNA methylation is related to cancer survival and thus can be used as a biomarker of cancer prognosis.