123 articles found
Why Can Multiple Imputations and How (MICE) Algorithm Work?
1
Authors: Abdullah Z. Alruhaymi, Charles J. Kim. Open Journal of Statistics, 2021, No. 5, pp. 759-777 (19 pages).
Multiple imputation compensates for missing data by producing multiple completed datasets from regression models, and is considered a solution to the long-standing problem of univariate imputation. Univariate imputation fills in a missing cell using only the column in which that cell occurs. Multivariate imputation works simultaneously with all variables in all columns, whether missing or observed, and has emerged as a principal method of solving missing data problems. All incomplete datasets analyzed before Multiple Imputation by Chained Equations (MICE) was introduced were misdiagnosed; the results obtained were invalid and should not be relied on to yield reasonable conclusions. This article highlights why multiple imputation works and how MICE works, with a particular focus on a cyber-security dataset. Removing missing data from a dataset and replacing it is imperative for analyzing the data and building prediction models. A good imputation technique should therefore recover the missing values, which involves extracting the good features. However, the widely used univariate imputation method does not impute missing values reasonably if the values are too large, and may thus lead to bias. We therefore aim to propose an alternative imputation method that is efficient and removes potential bias after the missing values are filled in.
Keywords: Multiple Imputations; Imputations; Algorithms; MICE Algorithm
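The chained-equations idea in this abstract (each incomplete column is regressed on the other columns, and the imputations are refined in turn) can be sketched as follows. This is a deliberately simplified illustration, not the authors' implementation: it handles exactly two numeric columns, uses plain least squares, and is deterministic, so it yields one completed dataset, whereas MICE proper adds random draws to produce several. The names `fit_line` and `chained_imputation` and the toy data are assumptions made for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def chained_imputation(columns, n_iter=25):
    """Impute None entries in a dict of exactly two equal-length columns.

    Simplification (assumption): no random draws, so this is a single
    deterministic imputation rather than true multiple imputation.
    """
    names = list(columns)
    missing = {k: [i for i, v in enumerate(columns[k]) if v is None] for k in names}

    # Step 1: start from column means as placeholder values.
    filled = {}
    for k in names:
        observed = [v for v in columns[k] if v is not None]
        mean = sum(observed) / len(observed)
        filled[k] = [mean if v is None else v for v in columns[k]]

    # Step 2: cycle through the columns, re-fitting and re-predicting.
    for _ in range(n_iter):
        for target in names:
            predictor = names[1] if target == names[0] else names[0]
            # Fit only on rows where the target was actually observed.
            rows = [i for i in range(len(filled[target])) if i not in missing[target]]
            a, b = fit_line([filled[predictor][i] for i in rows],
                            [filled[target][i] for i in rows])
            for i in missing[target]:
                filled[target][i] = a + b * filled[predictor][i]
    return filled


# Toy data lying on y = 2x, with one gap in each column.
data = {"x": [1, 2, 3, None, 5, 6], "y": [2, 4, None, 8, 10, 12]}
result = chained_imputation(data)
print(result["x"][3], result["y"][2])
```

On this toy data the two gaps settle close to the values implied by the line y = 2x, which a column-mean fill would miss.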
Missing Data Imputations for Upper Air Temperature at 24 Standard Pressure Levels over Pakistan Collected from Aqua Satellite (cited by: 4)
2
Authors: Muhammad Usman Saleem, Sajid Rashid Ahmed. Journal of Data Analysis and Information Processing, 2016, No. 3, pp. 132-146 (16 pages).
This research set out to select the best imputation method for missing upper air temperature data over 24 standard pressure levels. We implemented four imputation techniques: inverse distance weighting, bilinear, natural, and nearest interpolation. The performance indicators adopted for these techniques were the root mean square error (RMSE), absolute mean error (AME), correlation coefficient, and coefficient of determination (R²). We randomly withheld 30% of the 324 total samples and predicted them from the remaining 70%. Although all four interpolation methods performed well for imputing air temperature data (RMSE and AME below 1), the bilinear method was the most accurate, with the smallest errors. RMSE for the bilinear method remained below 0.01 at all pressure levels except 1000 hPa, where it was 0.6. AME was low (<0.1) at all pressure levels with bilinear imputation. A very strong correlation (>0.99) was found between actual and predicted air temperature data with this method. The high coefficient of determination (0.99) for bilinear interpolation indicates the best fit to the surface. We found similar results for imputation with the natural interpolation method, but scatter plots for each month show that imputations with this method are somewhat less sharp in certain months than those from the bilinear method.
Keywords: Missing Data Imputations; Spatial Interpolation; Aqua Satellite; Upper Level Air Temperature; AIRX3STML
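One of the four techniques named above, inverse distance weighting, together with the RMSE and mean absolute error indicators, can be sketched as below. This is a minimal illustration under stated assumptions, not the study's code: the 2D coordinates, the `power=2.0` exponent, and the function names `idw`, `rmse`, and `mae` are all hypothetical choices made here.

```python
import math


def idw(known, query, power=2.0):
    """Inverse-distance-weighted estimate at `query`.

    known: list of ((x, y), value) pairs with observed values.
    """
    num = den = 0.0
    for (x, y), value in known:
        d = math.hypot(query[0] - x, query[1] - y)
        if d == 0.0:
            return value  # query coincides with an observed point
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Two hypothetical grid cells with known temperatures; estimate the midpoint.
stations = [((0.0, 0.0), 10.0), ((2.0, 0.0), 20.0)]
midpoint = idw(stations, (1.0, 0.0))
print(midpoint)
```

In a hold-out check like the one the abstract describes, `rmse` and `mae` would be computed between the withheld values and the values re-estimated from the retained samples.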
Determining Sufficient Number of Imputations Using Variance of Imputation Variances: Data from 2012 NAMCS Physician Workflow Mail Survey
3
Authors: Qiyuan Pan, Rong Wei, Iris Shimizu, Eric Jamoom. Applied Mathematics, 2014, No. 21, pp. 3421-3430 (10 pages).
How many imputations are sufficient in multiple imputation? The answers given by different researchers vary from as few as 2-3 to as many as hundreds. Perhaps no single number of imputations fits all situations. In this study, η, the minimally sufficient number of imputations, was determined from the relationship between m, the number of imputations, and ω, the standard error of imputation variances, using the 2012 National Ambulatory Medical Care Survey (NAMCS) Physician Workflow mail survey. Five variables with various value ranges, variances, and missing data percentages were tested. For all variables tested, ω decreased as m increased. The value of m above which the cost of a further increase in m would outweigh the benefit of reducing ω was taken as η. This method has the potential to be used by anyone to determine the η that fits his or her own data situation.
Keywords: Multiple Imputation; Sufficient Number of Imputations; Hot-Deck Imputation
Comparative Study of Four Methods in Missing Value Imputations under Missing Completely at Random Mechanism (cited by: 3)
4
Authors: Michikazu Nakai, Ding-Geng Chen, Kunihiro Nishimura, Yoshihiro Miyamoto. Open Journal of Statistics, 2014, No. 1, pp. 27-37 (11 pages).
In analyzing data from clinical trials and longitudinal studies, missing values are always a fundamental challenge, since missing data can introduce bias and lead to erroneous statistical inferences. To deal with this challenge, several methods have been developed in the literature to handle missing values; the most commonly used are the complete case method, the mean imputation method, the last observation carried forward (LOCF) method, and the multiple imputation (MI) method. In this paper, we conduct a simulation study to investigate the efficiency of these four typical methods in a longitudinal data setting under missing completely at random (MCAR). We consider three levels of missingness: a lower percentage of 5% and higher percentages of 30% and 50%. From this simulation study, we conclude that the LOCF method has more bias than the other three methods in most situations. The MI method has the least bias and the best coverage probability. Thus, we conclude that the MI method is the most effective imputation method in our MCAR simulation study.
Keywords: Missing Data; Imputation; MCAR; Complete Case; LOCF
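Two of the four methods compared in this abstract, mean imputation and LOCF, are simple enough to sketch directly. The snippet below is an illustration only; the function names and the toy series of repeated measurements are assumptions made here, not material from the paper.

```python
def mean_impute(series):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in series if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in series]


def locf_impute(series):
    """Last observation carried forward.

    A missing value before the first observation stays None, because
    there is nothing to carry forward yet.
    """
    out, last = [], None
    for v in series:
        if v is not None:
            last = v
        out.append(last)
    return out


# Hypothetical repeated measurements for one subject, with three gaps.
visits = [10.0, None, 14.0, None, None, 20.0]
print(mean_impute(visits))
print(locf_impute(visits))
```

With an upward trend like this one, LOCF holds each gap at the earlier, lower value, which gives some intuition for the bias the simulation study reports for that method.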
Longevity prediction and missing data treatment of landslide dams
5
Authors: WANG Danyan, YANG Xingguo, ZHOU Jiawen, FENG Zhenyu, LIAO Haimei. Journal of Mountain Science, 2025, No. 7, pp. 2640-2653 (14 pages).
Landslide dam failures can cause significant damage to both society and ecosystems. Predicting the failure of these dams in advance enables early preventive measures, thereby minimizing potential harm. This paper aims to propose a fast and accurate model for predicting the longevity of landslide dams while also addressing the issue of missing data. Given the wide variation in the survival times of landslide dams, from mere minutes to several thousand years, predicting their longevity presents a considerable challenge. The study develops predictive models by considering key factors such as dam geometry, hydrodynamic conditions, materials, and triggering parameters. A dataset of 1045 landslide dam cases is analyzed, categorizing their longevity into three distinct groups: C1 (<1 month), C2 (1 month to 1 year), and C3 (>1 year). Multiple imputation and k-nearest neighbor algorithms are used to handle missing data on geometric size, hydrodynamic conditions, materials, and triggers. Based on the imputed data, two predictive models are developed: a classification model for dam longevity categories and a regression model for precise longevity predictions. The classification model achieves an accuracy of 88.38%, while the regression model outperforms existing models with an R² value of 0.966. Two real-life landslide dam cases are used to validate the models, which show correct classification and small prediction errors. The longevity of landslide dams is jointly influenced by factors such as geometric size, hydrodynamic conditions, materials, and triggering events. Among these, geometric size has the greatest impact, followed by hydrodynamic conditions, materials, and triggers, as confirmed by variable importance in the model development.
Keywords: Category; Longevity range; Imputation; Prediction models; Decision Tree
Prediction of radionuclide diffusion enabled by missing data imputation and ensemble machine learning
6
Authors: Jun-Lei Tian, Jia-Xing Feng, Jia-Cong Shen, Lei Yao, Jing-Yan Wang, Tao Wu, Yao-Lin Zhao. Nuclear Science and Techniques, 2025, No. 10, pp. 47-61 (15 pages).
Missing values in radionuclide diffusion datasets can undermine the predictive accuracy and robustness of machine learning (ML) models. In this study, a regression-based missing data imputation method using a light gradient boosting machine (LGBM) algorithm was employed to impute more than 60% of the missing data, establishing a radionuclide diffusion dataset containing 16 input features and 813 instances. The effective diffusion coefficient (D_e) was predicted using ten ML models. The predictive accuracy of the ensemble meta-models, namely LGBM-extreme gradient boosting (XGB) and LGBM-categorical boosting (CatB), surpassed that of the other ML models, with R² values of 0.94. The models were applied to predict the D_e values of EuEDTA⁻ and HCrO₄⁻ in saturated compacted bentonites at compactions ranging from 1200 to 1800 kg/m³, which were measured using a through-diffusion method. The generalization ability of the LGBM-XGB model surpassed that of LGBM-CatB in predicting the D_e of HCrO₄⁻. Shapley additive explanations identified total porosity as the most significant influencing factor. Additionally, the partial dependence plot analysis technique yielded clearer results in the univariate correlation analysis. This study provides a regression imputation technique to refine radionuclide diffusion datasets, offering deeper insights into the diffusion mechanism of radionuclides and supporting the safety assessment of the geological disposal of high-level radioactive waste.
Keywords: Machine learning; Radionuclide diffusion; Bentonite; Regression imputation; Missing data; Diffusion experiments
A Diffusion Model for Traffic Data Imputation
7
Authors: Bo Lu, Qinghai Miao, Yahui Liu, Tariku Sinshaw Tamir, Hongxia Zhao, Xiqiao Zhang, Yisheng Lv, Fei-Yue Wang. IEEE/CAA Journal of Automatica Sinica, 2025, No. 3, pp. 606-617 (12 pages).
Imputation of missing data has long been an important topic and an essential application for intelligent transportation systems (ITS) in the real world. As a state-of-the-art generative model, the diffusion model has proven highly successful in image generation, speech generation, time series modelling, etc., and now opens a new avenue for traffic data imputation. In this paper, we propose a conditional diffusion model, called the implicit-explicit diffusion model, for traffic data imputation. This model exploits both the implicit and explicit features of the data simultaneously. More specifically, we design two types of feature extraction modules, one to capture the implicit dependencies hidden in the raw data at multiple time scales and the other to obtain the long-term temporal dependencies of the time series. This approach not only inherits the advantages of the diffusion model for estimating missing data, but also takes into account the multiscale correlation inherent in traffic data. To illustrate the performance of the model, extensive experiments are conducted on three real-world time series datasets using different missing rates. The experimental results demonstrate that the model improves imputation accuracy and generalization capability.
Keywords: Data imputation; diffusion model; implicit feature; time series; traffic data
An Integrated Perception Model for Predicting and Analyzing Urban Rail Transit Emergencies Based on Unstructured Data
8
Authors: Liang Mu, Yurui Kang, Zixu Yan, Guangyu Zhu. Computers, Materials & Continua, 2025, No. 8, pp. 2495-2512 (18 pages).
The accurate prediction and analysis of emergencies in Urban Rail Transit Systems (URTS) are essential for the development of effective early warning and prevention mechanisms. This study presents an integrated perception model designed to predict emergencies and analyze their causes based on historical unstructured emergency data. To address issues related to data structuredness and missing values, we employed label encoding and an Elastic Net Regularization-based Generative Adversarial Interpolation Network (ER-GAIN) for data structuring and imputation. Additionally, to mitigate the impact of imbalanced data on the predictive performance of emergencies, we introduced an Adaptive Boosting Ensemble Model (AdaBoost) to forecast the key features of emergencies, including event types and levels. We also utilized Information Gain (IG) to analyze and rank the causes of various significant emergencies. Experimental results indicate that, compared to baseline data imputation models, ER-GAIN improved the prediction accuracy of key emergency features by 3.67% and 3.78%, respectively. Furthermore, AdaBoost enhanced the accuracy by over 4.34% and 3.25% compared to baseline predictive models. Through causation analysis, we identified the critical causes of train operation and fire incidents. The findings of this research will contribute to the establishment of early warning and prevention mechanisms for emergencies in URTS, potentially leading to safer and more reliable URTS operations.
Keywords: Urban rail transit system; emergency prediction; generative adversarial imputation network; ensemble learning; cause analysis
A Modified Deep Residual-Convolutional Neural Network for Accurate Imputation of Missing Data
9
Authors: Firdaus Firdaus, Siti Nurmaini, Anggun Islami, Annisa Darmawahyuni, Ade Iriani Sapitri, Muhammad Naufal Rachmatullah, Bambang Tutuko, Akhiar Wista Arum, Muhammad Irfan Karim, Yultrien Yultrien, Ramadhana Noor Salassa Wandya. Computers, Materials & Continua, 2025, No. 2, pp. 3419-3441 (23 pages).
Handling missing data accurately is critical in clinical research, where data quality directly impacts decision-making and patient outcomes. While deep learning (DL) techniques for data imputation have gained attention, challenges remain, especially when dealing with diverse data types. In this study, we introduce a novel data imputation method based on a modified convolutional neural network, specifically, a Deep Residual-Convolutional Neural Network (DRes-CNN) architecture designed to handle missing values across various datasets. Our approach demonstrates substantial improvements over existing imputation techniques by leveraging residual connections and optimized convolutional layers to capture complex data patterns. We evaluated the model on publicly available datasets, including Medical Information Mart for Intensive Care (MIMIC-III and MIMIC-IV), which contain critical care patient data, and the Beijing Multi-Site Air Quality dataset, which measures environmental air quality. The proposed DRes-CNN method achieved a root mean square error (RMSE) of 0.00006, highlighting its high accuracy and robustness. We also compared it with the Low Light-Convolutional Neural Network (LL-CNN) and U-Net methods, which had RMSE values of 0.00075 and 0.00073, respectively. This represents an improvement of approximately 92% over LL-CNN and 91% over U-Net. The results show that this DRes-CNN-based imputation method outperforms current state-of-the-art models and establish DRes-CNN as a reliable solution for addressing missing data.
Keywords: Data imputation; missing data; deep learning; deep residual convolutional neural network
A Novel Reduced Error Pruning Tree Forest with Time-Based Missing Data Imputation(REPTF-TMDI)for Traffic Flow Prediction
10
Authors: Yunus Dogan, Goksu Tuysuzoglu, Elife Ozturk Kiyak, Bita Ghasemkhani, Kokten Ulas Birant, Semih Utku, Derya Birant. Computer Modeling in Engineering & Sciences, 2025, No. 8, pp. 1677-1715 (39 pages).
Accurate traffic flow prediction (TFP) is vital for efficient and sustainable transportation management and the development of intelligent traffic systems. However, missing data in real-world traffic datasets poses a significant challenge to maintaining prediction precision. This study introduces REPTF-TMDI, a novel method that combines a Reduced Error Pruning Tree Forest (REPTree Forest) with a newly proposed Time-based Missing Data Imputation (TMDI) approach. The REPTree Forest, an ensemble learning approach, is tailored for time-related traffic data to enhance predictive accuracy and support the evolution of sustainable urban mobility solutions. Meanwhile, the TMDI approach exploits temporal patterns to estimate missing values reliably whenever empty fields are encountered. The proposed method was evaluated using hourly traffic flow data from a major U.S. roadway spanning 2012-2018, incorporating temporal features (e.g., hour, day, month, year, weekday), a holiday indicator, and weather conditions (temperature, rain, snow, and cloud coverage). Experimental results demonstrated that the REPTF-TMDI method outperformed conventional imputation techniques across various missing data ratios, achieving an average 11.76% improvement in the correlation coefficient (R). Furthermore, REPTree Forest achieved improvements of 68.62% in RMSE and 70.52% in MAE compared to existing state-of-the-art models. These findings highlight the method’s ability to significantly boost traffic flow prediction accuracy, even in the presence of missing data, thereby contributing to the broader objectives of sustainable urban transportation systems.
Keywords: Machine learning; traffic flow prediction; missing data imputation; reduced error pruning tree (REPTree); sustainable transportation systems; traffic management; artificial intelligence
Impact of connected corridor volume data imputations on digital twin performance measures
11
Authors: Abhilasha J. Saroj, Somdut Roy, Angshuman Guin, Michael Hunter. International Journal of Transportation Science and Technology, 2023, No. 2, pp. 476-491 (16 pages).
To fully leverage "smart" transportation infrastructure data-stream investments, applications are needed that provide real-time, meaningful, and actionable corridor-performance metrics. However, the presence of gaps in data streams can lead to significant application implementation challenges. To demonstrate and help address these challenges, a digital twin smart-corridor application case study is presented with two primary research objectives: (1) explore the characteristics of volume data gaps on the case study corridor, and (2) investigate the feasibility of prioritizing data streams for data imputation to drive the real-time application. For the first objective, a K-means clustering analysis is used to identify similarities and differences among data gap patterns. The clustering analysis successfully identifies eight different data loss patterns. Patterns vary in both continuity and density of data gap occurrences, as well as time-dependent losses in several clusters. For the second objective, a temporal-neighboring interpolation approach for volume data imputation is explored. When temporal-neighboring interpolation imputations are used in the digital twin application, performance is, in part, dependent on the combination of intersection approaches experiencing data loss, demand relative to capacity at individual locations, and the location of the loss along the corridor. The results indicate that these insights could be used to prioritize intersection approaches suitable for data imputation and to identify locations that require a more sensitive imputation methodology or improved maintenance and monitoring.
Keywords: Connected corridor; Missing traffic data; Smart corridor application; Traffic data imputation; Traffic data loss
Data augmentation for bias correction in mapping PM2.5 based on satellite retrievals and ground observations (cited by: 1)
12
Authors: Tan Mi, Die Tang, Jianbo Fu, Wen Zeng, Michael L. Grieneisen, Zihang Zhou, Fengju Jia, Fumo Yang, Yu Zhan. Geoscience Frontiers (indexed: SCIE, CAS, CSCD), 2024, No. 1, pp. 17-28 (12 pages).
As most air quality monitoring sites worldwide are in urban areas, machine learning models may produce substantial estimation bias in rural areas when deriving spatiotemporal distributions of air pollutants. The bias stems from the issue of dataset shift, as the density distributions of predictor variables differ greatly between urban and rural areas. We propose a data-augmentation approach based on multiple imputation by chained equations (MICE-DA) to remedy the dataset shift problem. Compared with the benchmark models, MICE-DA exhibits superior predictive performance in deriving the spatiotemporal distributions of hourly PM2.5 in a megacity (Chengdu) at the foot of the Tibetan Plateau, especially for correcting the estimation bias, with the mean bias decreasing from -3.4 µg/m³ to -1.6 µg/m³. As a complement to the holdout validation, the semi-variance results show that MICE-DA decently preserves the spatial autocorrelation pattern of PM2.5 over the study area. The essence of MICE-DA is strengthening the correlation between PM2.5 and aerosol optical depth (AOD) during the data augmentation. Consequently, the importance of AOD for predicting PM2.5 is largely enhanced, and the summed relative importance value of the two satellite-retrieved AOD variables increases from 5.5% to 18.4%. This study resolves the puzzle that AOD exhibited relatively lower importance in local or regional studies. The results of this study can advance the utilization of satellite remote sensing in modeling air quality while drawing more attention to the common dataset shift problem in data-driven environmental research.
Keywords: Aerosol optical depth; Dataset shift; Spatiotemporal Distribution; Air quality monitoring; Multiple imputation by chained equations
An Adaptive Multivariate EWMA Control Chart for Monitoring Missing Data (cited by: 1)
13
Authors: PU Xiaolong, XIANG Dongdong, CHEN Xinyan. Chinese Journal of Applied Probability and Statistics (应用概率统计; indexed: CSCD, PKU Core), 2024, No. 2, pp. 343-363 (21 pages).
With the increasing complexity of production processes, there has been a growing focus on online algorithms within the domain of multivariate statistical process control (SPC). Nonetheless, conventional methods, based on the assumption of complete data obtained at uniform time intervals, exhibit suboptimal performance in the presence of missing data. In our pursuit of maximizing available information, we propose an adaptive exponentially weighted moving average (EWMA) control chart employing a weighted imputation approach that leverages the relationships between complete and incomplete data. Specifically, we introduce two recovery methods: an improved K-Nearest Neighbors imputing value and the conventional univariate EWMA statistic. We then formulate an adaptive weighting function to amalgamate these methods, assigning a diminished weight to the EWMA statistic when the sample information suggests an increased likelihood of the process being out of control, and vice versa. The robustness and sensitivity of the proposed scheme are shown through simulation results and an illustrative example.
Keywords: online monitoring; completely random missing; weighted imputing values; EWMA; improved K-nearest neighbors
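The two ingredients this abstract names, a univariate EWMA statistic and a weighted combination of two candidate fill-in values, can be sketched as follows. This is a hedged illustration, not the paper's scheme: the adaptive weighting function is not given in the abstract, so the weight is simply passed in by hand, and `ewma`, `blend`, `lam`, and the toy numbers are all assumptions made here.

```python
def ewma(values, lam=0.2, start=0.0):
    """Return the EWMA sequence z_t = lam * x_t + (1 - lam) * z_{t-1}."""
    z, out = start, []
    for x in values:
        z = lam * x + (1.0 - lam) * z
        out.append(z)
    return out


def blend(knn_value, ewma_value, ewma_weight):
    """Convex combination of two candidate fill-in values.

    Hypothetical helper: in the paper the weight comes from an adaptive
    function of the sample information; here the caller supplies it.
    """
    return ewma_weight * ewma_value + (1.0 - ewma_weight) * knn_value


history = ewma([1.0, 1.0], lam=0.5)
# A smaller EWMA weight leans on the neighbor-based value, as the abstract
# suggests when the process is more likely to be out of control.
filled = blend(knn_value=1.0, ewma_value=history[-1], ewma_weight=0.25)
print(history, filled)
```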
An Enhanced Integrated Method for Healthcare Data Classification with Incompleteness
14
Authors: Sonia Goel, Meena Tushir, Jyoti Arora, Tripti Sharma, Deepali Gupta, Ali Nauman, Ghulam Muhammad. Computers, Materials & Continua (indexed: SCIE, EI), 2024, No. 11, pp. 3125-3145 (21 pages).
In numerous real-world healthcare applications, handling incomplete medical data poses significant challenges for missing value imputation and subsequent clustering or classification tasks. Traditional approaches often rely on statistical methods for imputation, which may yield suboptimal results and be computationally intensive. This paper aims to integrate imputation and clustering techniques to enhance the classification of incomplete medical data with improved accuracy. Conventional classification methods are ill-suited for incomplete medical data. To enhance efficiency without compromising accuracy, this paper introduces a novel approach that combines imputation and clustering for the classification of incomplete data. Initially, the linear interpolation imputation method is applied alongside an iterative Fuzzy c-means clustering method, followed by a classification algorithm. The effectiveness of the proposed approach is evaluated using multiple performance metrics, including accuracy, precision, specificity, and sensitivity. The encouraging results demonstrate that our proposed method surpasses classical approaches across various performance criteria.
Keywords: Incomplete data; nearest neighbor; linear interpolation; Imputation; Clustering; Classification
Missing Value Imputation for Radar-Derived Time-Series Tracks of Aerial Targets Based on Improved Self-Attention-Based Network
15
Authors: Zihao Song, Yan Zhou, Wei Cheng, Futai Liang, Chenhao Zhang. Computers, Materials & Continua (indexed: SCIE, EI), 2024, No. 3, pp. 3349-3376 (28 pages).
The frequent missing values in radar-derived time-series tracks of aerial targets (RTT-AT) lead to significant challenges in subsequent data-driven tasks. However, the majority of imputation research focuses on random missing (RM), which differs significantly from the common missing patterns of RTT-AT. Methods designed for RM may experience performance degradation or failure when applied to RTT-AT imputation. Conventional autoregressive deep learning methods are prone to error accumulation and long-term dependency loss. In this paper, a non-autoregressive imputation model is proposed that addresses missing value imputation for two common missing patterns in RTT-AT. Our model consists of two probabilistic sparse diagonal masking self-attention (PSDMSA) units and a weight fusion unit. It learns missing values by combining the representations output by the two units, aiming to minimize the difference between the imputed values and their actual values. The PSDMSA units effectively capture temporal dependencies and attribute correlations between time steps, improving imputation quality. The weight fusion unit automatically updates the weights of the output representations from the two units to obtain a more accurate final representation. The experimental results indicate that, despite varying missing rates in the two missing patterns, our model consistently outperforms other methods in imputation performance and exhibits a low frequency of deviations in estimates for specific missing entries. Compared to the state-of-the-art autoregressive deep learning imputation model Bidirectional Recurrent Imputation for Time Series (BRITS), our proposed model reduces mean absolute error (MAE) by 31% to 50%. Additionally, the model trains 4 to 8 times faster than both BRITS and a standard Transformer model on the same dataset. Finally, the ablation experiments demonstrate that the PSDMSA, the weight fusion unit, the cascade network design, and the imputation loss enhance imputation performance and confirm the efficacy of our design.
Keywords: Missing value imputation; time-series tracks; probabilistic sparsity diagonal masking self-attention; weight fusion
Missing Data Imputation: A Comprehensive Review
16
作者 Majed Alwateer El-Sayed Atlam +2 位作者 Mahmoud Mohammed Abd El-Raouf Osama A. Ghoneim Ibrahim Gad 《Journal of Computer and Communications》 2024年第11期53-75,共23页
Missing data presents a significant challenge in statistical analysis and machine learning, often resulting in biased outcomes and diminished efficiency. This comprehensive review investigates various imputation techn... Missing data presents a significant challenge in statistical analysis and machine learning, often resulting in biased outcomes and diminished efficiency. This comprehensive review investigates various imputation techniques, categorizing them into three primary approaches: deterministic methods, probabilistic models, and machine learning algorithms. Traditional techniques, including mean or mode imputation, regression imputation, and last observation carried forward, are evaluated alongside more contemporary methods such as multiple imputation, expectation-maximization, and deep learning strategies. The strengths and limitations of each approach are outlined. Key considerations for selecting appropriate methods, based on data characteristics and research objectives, are discussed. The importance of evaluating imputation’s impact on subsequent analyses is emphasized. This synthesis of recent advancements and best practices provides researchers with a robust framework for effectively handling missing data, thereby improving the reliability of empirical findings across diverse disciplines. 展开更多
Keywords: Missing Data; Machine Learning; Prediction; Deep Learning; Imputation
A Study of EM Algorithm as an Imputation Method: A Model-Based Simulation Study with Application to a Synthetic Compositional Data
17
Authors: Yisa Adeniyi Abolade, Yichuan Zhao. Open Journal of Modelling and Simulation, 2024, No. 2, pp. 33-42 (10 pages)
Compositional data, such as relative information, is a crucial aspect of machine learning and other related fields. It is typically recorded as closed data, that is, data that sums to a constant such as 100%. The statistical linear model is the most used technique for identifying hidden relationships between underlying random variables of interest. However, data quality is a significant challenge in machine learning, especially when missing data is present. The linear regression model is a commonly used statistical modeling technique applied in various settings to find relationships between variables of interest. When estimating linear regression parameters, which are useful for tasks such as future prediction and partial effects analysis of independent variables, maximum likelihood estimation (MLE) is the method of choice. However, many datasets contain missing observations, which can lead to costly and time-consuming data recovery. To address this issue, the expectation-maximization (EM) algorithm has been suggested as a solution for situations involving missing data. The EM algorithm iteratively finds maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models that depend on unobserved variables or data. Using the current estimate as input, the expectation (E) step constructs the expected log-likelihood function. The maximization (M) step then finds the parameters that maximize the expected log-likelihood determined in the E step. This study examined how well the EM algorithm performed on a synthetic compositional dataset with missing observations, using both robust least squares and ordinary least squares regression techniques. The efficacy of the EM algorithm was compared with two alternative imputation techniques, k-Nearest Neighbor (k-NN) and mean imputation, in terms of Aitchison distances and covariance.
Keywords: Compositional Data; Linear Regression Model; Least Square Method; Robust Least Square Method; Synthetic Data; Aitchison Distance; Maximum Likelihood Estimation; Expectation-Maximization Algorithm; k-Nearest Neighbor and Mean Imputation
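The E and M steps described in the abstract above can be sketched on a deliberately simple case. This is a minimal illustration, not the study's procedure: it assumes a single fully observed predictor x, a response y with some missing entries (marked None), and the model y = a + b*x + Gaussian noise fitted by ordinary least squares. The function name `em_impute` and the toy data are hypothetical, and the compositional structure and the robust least squares variant are not modelled.

```python
def em_impute(x, y, iterations=50):
    """Impute missing entries of y (None) under y = a + b*x + Gaussian noise."""
    n = len(x)
    observed = [v for v in y if v is not None]
    mean_y = sum(observed) / len(observed)
    # Start from the mean-imputation fit: flat line through the observed mean.
    a, b = mean_y, 0.0
    s2 = sum((v - mean_y) ** 2 for v in observed) / len(observed)
    for _ in range(iterations):
        # E step: expected value of each y given the current parameters.
        ey = [yi if yi is not None else a + b * xi for xi, yi in zip(x, y)]
        # M step: closed-form maximiser of the expected log-likelihood.
        mx, my = sum(x) / n, sum(ey) / n
        sxx = sum((xi - mx) ** 2 for xi in x)
        sxy = sum((xi - mx) * (e - my) for xi, e in zip(x, ey))
        b = sxy / sxx
        a = my - b * mx
        # Expected squared residual; missing rows also carry the old variance.
        s2 = sum(
            (yi - (a + b * xi)) ** 2 if yi is not None
            else (e - (a + b * xi)) ** 2 + s2
            for xi, yi, e in zip(x, y, ey)
        ) / n
    imputed = [yi if yi is not None else a + b * xi for xi, yi in zip(x, y)]
    return imputed, (a, b, s2)


# Toy data: the observed points lie exactly on y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 3.0, None, 7.0, None, 11.0]
imputed, (a, b, s2) = em_impute(x, y)
```

In this toy case the iterations move from the mean-imputation starting point towards the line through the observed points, so the two missing entries are filled with values close to 5 and 9.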
Application of the Hedonic Price Method to Real Estate Price Indices (特征价格法在房地产价格指数中的应用) (Cited by: 6)
18
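Authors: Sun Xianhua (孙宪华), Liu Zhenhui (刘振惠), Zhang Chenxi (张臣曦). 现代财经(天津财经大学学报) (Modern Finance and Economics, Journal of Tianjin University of Finance and Economics), indexed in CSSCI and the Peking University Core Journals list, 2008, No. 5, pp. 61-65 (5 pages)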
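The hedonic method decomposes the quality-characteristic factors in real estate price movements to reveal the implicit price of each characteristic. It then removes the effect of changes in quality characteristics, item by item, from the total price change, so that only the pure price change is reflected. This paper uses a double imputation process to estimate missing prices and remove the influence of outliers, which resolves the comparability problem and improves the stability of the hedonic model.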
Keywords: real estate price index; quality adjustment; hedonic price method; double imputation
Establishment and verification of a surgical prognostic model for cervical spinal cord injury without radiological abnormality (Cited by: 7)
19
Authors: Jie Wang, Shuai Guo, Xuan Cai, Jia-Wei Xu, Hao-Peng Li. Neural Regeneration Research, indexed in SCIE, CAS, and CSCD, 2019, No. 4, pp. 713-720 (8 pages)
Some studies have suggested that early surgical treatment can effectively improve the prognosis of cervical spinal cord injury without radiological abnormality, but no research has focused on the development of a prognostic model of cervical spinal cord injury without radiological abnormality. This retrospective analysis included 43 patients with cervical spinal cord injury without radiological abnormality. Seven potential factors were assessed: age, sex, external force strength causing damage, duration of disease, degree of cervical spinal stenosis, Japanese Orthopaedic Association score, and physiological cervical curvature. A model was established using multiple binary logistic regression analysis. The model was evaluated by concordant profiling and the area under the receiver operating characteristic curve. Bootstrapping was used for internal validation. The prognostic model was as follows: logit(P) = -25.4545 + 21.2576 VALUE + 1.2160 SCORE - 3.4224 TIME, where VALUE refers to the Pavlov ratio indicating the extent of cervical spinal stenosis, SCORE refers to the Japanese Orthopaedic Association score (0–17) after the operation, and TIME refers to the disease duration (from injury to operation). The area under the receiver operating characteristic curve for all patients was 0.8941 (95% confidence interval, 0.7930–0.9952). Three factors assessed in the predictive model were associated with patient outcomes: a great extent of cervical stenosis, a poor preoperative neurological status, and a long disease duration. These three factors could worsen patient outcomes. Moreover, the disease prognosis was considered good when logit(P) ≥ -2.5105. Overall, the model displayed a certain clinical value.
This study was approved by the Biomedical Ethics Committee of the Second Affiliated Hospital of Xi'an Jiaotong University, China (approval number: 2018063) on May 8, 2018.
Keywords: nerve regeneration; surgical prognostic model; cervical spinal cord injury; retrospective study; multiple binary logistic regression analysis; bootstrapping; internal validation; multiple imputations; cervical spinal stenosis; duration of disease; Pavlov ratio; neural regeneration
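As a worked illustration of the prognostic formula quoted in the abstract above, the sketch below evaluates logit(P) for one patient and applies the stated cutoff. The coefficients and the cutoff come from the abstract; the function names and the example patient are hypothetical, the unit of TIME is not stated in the abstract and is left to the caller, and converting logit(P) to a probability with the inverse logit is the standard logistic regression identity rather than something the abstract states.

```python
import math

# Coefficients and cutoff as quoted in the abstract.
INTERCEPT = -25.4545
COEF_VALUE = 21.2576   # VALUE: Pavlov ratio
COEF_SCORE = 1.2160    # SCORE: Japanese Orthopaedic Association score (0-17)
COEF_TIME = -3.4224    # TIME: disease duration (unit not given in the abstract)
GOOD_PROGNOSIS_CUTOFF = -2.5105


def logit_p(value, score, time):
    """Linear predictor logit(P) of the quoted prognostic model."""
    return INTERCEPT + COEF_VALUE * value + COEF_SCORE * score + COEF_TIME * time


def probability(logit):
    """Inverse logit: converts logit(P) to P (standard identity)."""
    return 1.0 / (1.0 + math.exp(-logit))


def is_good_prognosis(logit):
    """Apply the abstract's rule: prognosis is good when logit(P) >= -2.5105."""
    return logit >= GOOD_PROGNOSIS_CUTOFF


# Hypothetical patient, made up for illustration only.
example_logit = logit_p(value=0.8, score=10, time=1.0)
example_probability = probability(example_logit)
example_is_good = is_good_prognosis(example_logit)
```

The signs of the coefficients match the abstract's reading: a larger Pavlov ratio or JOA score raises logit(P), while a longer disease duration lowers it.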
Comparative Variance and Multiple Imputation Used for Missing Values in Land Price DataSet (Cited by: 1)
20
Authors: Longqing Zhang, Xinwei Zhang, Liping Bai, Yanghong Zhang, Feng Sun, Changcheng Chen. Computers, Materials & Continua, indexed in SCIE and EI, 2019, No. 9, pp. 1175-1187 (13 pages)
Based on a two-dimensional relation table, this paper studies missing values in sample land price data for Shunde District, Foshan City. GeoDa software was used to eliminate insignificant factors by stepwise regression analysis; NORM software was adopted to construct the multiple imputation models; the EM algorithm and the augmentation algorithm were applied to fit multiple linear regression equations and construct five different imputed datasets. Statistical analysis is performed on the imputed datasets to calculate the mean and variance of each, and weights are determined according to the differences. Finally, the datasets are integrated to obtain the imputed expression for the missing values. Three missing-data cases were examined: the PRICE variable missing at a deletion rate of 5%, the PRICE variable missing at a deletion rate of 10%, and both the PRICE and CBD variables missing. In these cases, the ratio of estimates closer to the true value for the new method versus the traditional multiple imputation method was 75% to 25%, 62.5% to 37.5%, and 100% to 0%, respectively. The new method therefore outperformed the traditional multiple imputation method in these cases, and the missing values it estimates have some reference value.
Keywords: Imputation method; multiple imputations; probabilistic model