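Abstract: Objective: To explore the feasibility of applying a combined autoregressive integrated moving average-long short-term memory (ARIMA-LSTM) model to predict the incidence of hemorrhagic fever with renal syndrome (HFRS) under different epidemic patterns. Methods: Annual national HFRS incidence from 1961 to 2020 was collected, together with monthly HFRS incidence from January 2004 to December 2020 for the whole country and for Heilongjiang, Jilin, Liaoning, Shaanxi, Shandong, Hebei, and Guangdong provinces. The whole country and Heilongjiang represent the pattern in which the winter peak is higher than the spring peak; Jilin and Liaoning represent comparable spring and winter peaks; Shaanxi and Shandong represent a winter peak only; Hebei and Guangdong represent a spring peak only. Annual incidence from 1961 to 2014 and monthly incidence from January 2004 to June 2020 served as the training set; annual incidence from 2015 to 2020 and monthly incidence from July to December 2020 served as the test set. ARIMA models and combined ARIMA-LSTM models were built, and the decline rate of mean absolute percentage error (DR_MAPE) and the decline rate of root mean squared error (DR_RMSE) were used to evaluate how much fitting and prediction accuracy improved. Results: The best-fitting ARIMA models for the national annual series and for the monthly series of the whole country, Heilongjiang, Jilin, Liaoning, Shaanxi, Shandong, Hebei, and Guangdong were, respectively, ARIMA(2,0,0), ARIMA(3,1,0)(2,1,1)_12, ARIMA(2,0,1)(2,1,1)_12, ARIMA(3,0,0)(2,1,1)_12 with a constant term, ARIMA(2,1,1)(2,1,1)_12, ARIMA(1,0,3)(1,1,0)_12, ARIMA(0,1,3)(2,1,1)_12, ARIMA(1,1,3)(2,0,0)_12, and ARIMA(3,1,1)(1,1,1)_12. In the same order, the fitting DR_MAPE of the ARIMA-LSTM model relative to the ARIMA model was -19.57%, -46.38%, -43.27%, -46.37%, -49.70%, -48.36%, -58.23%, -35.52%, and -48.74%; the fitting DR_RMSE was -11.21%, -36.17%, -64.89%, -55.68%, -54.81%, -31.76%, -39.69%, -55.64%, and -30.06%. In the same order, the prediction DR_MAPE of the ARIMA-LSTM model relative to the ARIMA model was -11.10%, -8.69%, -19.68%, -36.17%, -55.57%, -9.44%, -14.60%, -14.22%, and -9.26%; the prediction DR_RMSE was -14.43%, -7.42%, -12.66%, -13.83%, -36.56%, 10.37%, 81.14%, -19.68%, and -1.18%. Conclusion: Overall, the combined ARIMA-LSTM model fits and predicts better than the ARIMA model across the different types of HFRS data, and LSTM is suitable for optimizing HFRS prediction models in China; however, ARIMA-LSTM is not suitable for prediction in Shaanxi and Shandong provinces.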
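The abstract reports decline rates without stating the formula. The sketch below shows one plausible reading, assuming the decline rate is the relative change in error of the combined model against the ARIMA baseline, so that negative values mean the combined model has lower error. The function names and the sample numbers are illustrative assumptions, not taken from the study.

```python
import numpy as np


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def decline_rate(err_baseline, err_combined):
    """Assumed definition: relative change of the combined model's error
    versus the baseline, in percent. Negative means the error went down."""
    return (err_combined - err_baseline) / err_baseline * 100.0


# Made-up incidence values, for illustration only.
observed = [2.0, 4.0, 5.0, 8.0]
arima_pred = [3.0, 3.0, 6.0, 6.0]
combined_pred = [2.5, 3.5, 5.5, 7.0]

dr_mape = decline_rate(mape(observed, arima_pred), mape(observed, combined_pred))
dr_rmse = decline_rate(rmse(observed, arima_pred), rmse(observed, combined_pred))
print(f"DR_MAPE = {dr_mape:.2f}%, DR_RMSE = {dr_rmse:.2f}%")
```

Under this reading, the positive prediction DR_RMSE values reported for Shaanxi (10.37%) and Shandong (81.14%) correspond to the combined model having a larger RMSE than ARIMA, which matches the conclusion that ARIMA-LSTM is unsuitable for those two provinces.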
Funding: Funding support from the National Science and Technology Council (NSTC), under Grant No. 114-2410-H-011-026-MY3.
Abstract: Small datasets are often challenging to model because of their limited sample size. This research introduces a novel solution to this problem: average linkage virtual sample generation (ALVSG). ALVSG leverages the underlying data structure to create virtual samples, which can be used to augment the original dataset. The ALVSG process consists of two steps. First, an average-linkage clustering technique is applied to the dataset to create a dendrogram. The dendrogram represents the hierarchical structure of the dataset, with each merging operation regarded as a linkage. The linkages are then combined into an average-based dataset, which serves as a new representation of the dataset. Second, virtual samples are generated from the average-based dataset. The study generates a set of 100 virtual samples by distributing them uniformly within the provided boundary. These virtual samples are then added to the original dataset, creating a more extensive dataset with improved generalization performance. The efficacy of the ALVSG approach is validated through resampling experiments and t-tests conducted on two small real-world datasets. The experiments are conducted on three forecasting models: the support vector machine for regression (SVR), the deep learning model (DL), and XGBoost. The results show that the ALVSG approach outperforms the baseline methods in terms of mean square error (MSE), root mean square error (RMSE), and mean absolute error (MAE).
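The two steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each linkage contributes the mean of the samples in the merged cluster to the average-based dataset, and that the "provided boundary" is the per-column minimum and maximum of that dataset. The function names `average_based_dataset` and `generate_virtual_samples` are hypothetical; `scipy.cluster.hierarchy.linkage` and `numpy.random.Generator.uniform` are the only library calls relied on.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage


def average_based_dataset(X):
    """Step 1 (assumed form): one row per dendrogram merge, holding the
    mean of all original samples in the newly merged cluster."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    Z = linkage(X, method="average")  # shape (n - 1, 4)
    # Cluster ids 0..n-1 are single samples; merge k creates cluster id n + k.
    members = [[i] for i in range(n)]
    means = []
    for a, b, _dist, _count in Z:
        merged = members[int(a)] + members[int(b)]
        members.append(merged)
        means.append(X[merged].mean(axis=0))
    return np.vstack(means)


def generate_virtual_samples(X, n_virtual=100, seed=0):
    """Step 2 (assumed form): draw samples uniformly inside the bounding
    box of the average-based dataset."""
    rng = np.random.default_rng(seed)
    avg = average_based_dataset(X)
    low, high = avg.min(axis=0), avg.max(axis=0)
    return rng.uniform(low, high, size=(n_virtual, avg.shape[1]))


# Toy data: 12 samples with 3 columns (features and target stored together).
X = np.random.default_rng(42).normal(size=(12, 3))
virtual = generate_virtual_samples(X, n_virtual=100)
augmented = np.vstack([X, virtual])
print(augmented.shape)
```

Because the bounding box is computed from cluster means rather than raw samples, the virtual samples in this sketch stay inside the range of the original data; whether the study's boundary is defined the same way is not stated in the abstract.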