Abstract
To mitigate the adverse effects of missing values in tabular data on downstream tasks, a diffusion model-based method for missing-value imputation was proposed. Because the original diffusion model is time-consuming during generation, an imputation method based on an accelerated diffusion model (PNDM_Tab) was designed. The forward process of the diffusion model was realized through Gaussian noise addition, and pseudo numerical methods for diffusion models were employed to accelerate the reverse process. A network structure combining U-Net with an attention mechanism was used to extract salient features from the data efficiently and thereby predict the noise accurately. To provide a supervised target during training, the training data were randomly masked to generate new missing entries. Comparative experiments on nine datasets showed that PNDM_Tab achieved the lowest root mean square error on six of them relative to the other imputation methods. The results also demonstrated that, compared with the original diffusion model, using pseudo numerical methods in the reverse process reduced the number of sampling steps while maintaining generative performance.
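The training setup described in the abstract (random masking to create supervised targets, Gaussian forward noising, and evaluation by root mean square error) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the linear beta schedule, and the masking ratio are assumptions made for illustration, and the U-Net noise predictor and the PNDM reverse sampler are not shown. Only the Python standard library is used, so a single record is represented as a plain list.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def random_mask(observed, ratio, rng):
    """Split observed entries into visible conditions and hidden training targets."""
    # Each observed entry is hidden with probability `ratio`; missing entries stay out.
    target = [obs and rng.random() < ratio for obs in observed]
    cond = [obs and not tgt for obs, tgt in zip(observed, target)]
    return cond, target


def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a = math.sqrt(alpha_bar[t])
    b = math.sqrt(1.0 - alpha_bar[t])
    xt = [a * v + b * e for v, e in zip(x0, eps)]
    return xt, eps


def masked_rmse(truth, imputed, mask):
    """Root mean square error restricted to the entries selected by `mask`."""
    squared = [(t - p) ** 2 for t, p, m in zip(truth, imputed, mask) if m]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical record: the third feature is genuinely missing (placeholder 0.0).
rng = random.Random(0)
row = [0.5, -1.2, 0.0, 2.0, -0.7]
observed = [True, True, False, True, True]

cond, target = random_mask(observed, ratio=0.4, rng=rng)
abar = alpha_bar_schedule(num_steps=1000)
x_t, eps = forward_noise(row, t=500, alpha_bar=abar, rng=rng)
```

In a training loop built on this sketch, a noise-prediction network would receive `x_t`, the step index, and the entries selected by `cond`, and its output would be compared with `eps` on the entries selected by `target`. After sampling, `masked_rmse` would score the imputed values against the held-out ground truth.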
Authors
王圣举
张赞
WANG Shengju; ZHANG Zan (School of Electronics and Control Engineering, Chang’an University, Xi’an 710064, China)
Source
《浙江大学学报(工学版)》
Peking University Core Journal (北大核心)
2025, No. 7, pp. 1471-1480, 1503 (11 pages in total)
Journal of Zhejiang University (Engineering Science)
Keywords
tabular data
diffusion model
data imputation
attention mechanism
deep learning