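Abstract: When building PLS quantitative models from near-infrared (NIR) spectra, the selection of training-set samples and the determination of the number of latent variables are both critical. Taking the determination of hesperidin content in tangerine leaves as an example, this study compared how four training-set selection methods affect the model: random sampling (RS), Kennard-Stone (KS), duplex, and sample set partitioning based on joint x-y distance (SPXY). It also compared leave-one-out cross-validation and the Monte Carlo method for determining the number of latent variables. The results show that the model built on the SPXY-selected training set outperformed those built with the other three methods, and that the Monte Carlo method determined the number of latent variables well and effectively reduced the risk of overfitting. The resulting model had a root mean square error of cross-validation (RMSECV), a root mean square error of prediction (RMSEP), and a prediction-set correlation coefficient of 0.7681, 0.7369, and 0.9752, respectively.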
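The SPXY partitioning named in this abstract can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the commonly described form of SPXY, in which the x-distance and y-distance between each pair of samples are each divided by their maximum, summed, and then fed to a Kennard-Stone style max-min selection. The function name `spxy_split` and its parameters are hypothetical.

```python
import math


def spxy_split(X, y, n_train):
    """Pick n_train sample indices by max-min selection on joint x-y distances.

    X is a list of feature vectors (e.g. spectra), y a list of scalar
    reference values. Returns (train_indices, remaining_indices).
    """
    n = len(X)
    if not 2 <= n_train <= n:
        raise ValueError("n_train must be between 2 and the number of samples")

    # Pairwise distances in the spectral (x) space and the property (y) space.
    dx = [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]
    dy = [[abs(y[i] - y[j]) for j in range(n)] for i in range(n)]

    # Normalise each distance by its maximum so x and y carry equal weight.
    max_dx = max(map(max, dx)) or 1.0
    max_dy = max(map(max, dy)) or 1.0
    d = [[dx[i][j] / max_dx + dy[i][j] / max_dy for j in range(n)]
         for i in range(n)]

    # Start from the two samples that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: d[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = [k for k in range(n) if k not in (first, second)]

    # Repeatedly add the sample whose nearest selected neighbour is farthest.
    while len(selected) < n_train:
        best = max(remaining, key=lambda k: min(d[k][s] for s in selected))
        selected.append(best)
        remaining.remove(best)

    return selected, remaining
```

Dropping the `dy` term from the joint distance gives plain Kennard-Stone selection, which makes the two methods easy to compare on the same data.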
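Abstract: Active label noise cleaning (ALNC) is a label-noise cleaning method that uses active learning to screen out suspected noisy samples and then hands them to human experts for relabeling. Although the method identifies noise well and preserves the integrity of the original data, it still incurs a high extra manual labeling cost, because a certain proportion of the screened suspected-noise samples are in fact normal samples. To address this problem and reduce the extra manual inspection cost of label-noise cleaning, this paper proposes an active label noise cleaning method based on SPXY (Sample Set Partitioning based on Joint X-Y Distance Sampling) sampling, called SPXYALNC (Active label noise cleaning based on SPXY). The method incorporates SPXY sampling into the active-learning step that screens suspected noisy samples, so that both the uncertainty and the representativeness of the samples are taken into account. Experiments on classification problems using standard datasets show that the method clearly reduces the extra manual inspection cost while maintaining the original noise-identification performance.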
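Abstract: The Lianghuai mining area is one of China's large coal bases, and years of mining have created extensive subsidence water areas. Measuring water depth is essential for the rational use of these water resources, and retrieving water depth by remote sensing provides an efficient and convenient way to obtain depth data over large areas. Taking the subsidence water area within the Xieqiao coal mine of the Huainan Mining Group as the study area and using Sentinel-2B multispectral imagery, this paper investigates how the sample-set partitioning method SPXY (Sample set partitioning based on joint x-y distance), compared with random sampling (RS), improves a linear fitting model and a neural network model. The experimental results show that: (1) the neural network model retrieves depth well in the study area, with coefficients of determination of 0.737 and 0.787 for the RS and SPXY partitions, respectively; (2) because suspended matter in the water strongly affects spectral reflectance, retrieval is poor in deep water of 6-9 m, and the root mean square errors of the neural network models built with the two partitioning methods are 1.178 m and 1.059 m, respectively; (3) SPXY partitioning improves the water-depth retrieval model for subsidence water areas, and the improvement is most pronounced for the neural network model, whose coefficient of determination increased by 0.05 while its root mean square error and mean absolute error decreased by 0.097 m and 0.065 m, respectively.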
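The three figures of merit reported in this abstract (coefficient of determination, root mean square error, mean absolute error) can be computed as in the sketch below. It assumes the usual definition R² = 1 − SS_res / SS_tot; the abstract does not state which definition the authors used, and the function name `regression_metrics` is hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE and MAE for paired reference and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n

    # Residual and total sums of squares.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)

    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }
```

RMSE and MAE carry the unit of the measured quantity (metres of water depth here), whereas R² is dimensionless, which is why the abstract reports only the errors in metres.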
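Abstract: Objective: To establish a new method, based on near-infrared (NIR) analysis combined with chemometrics, for determining the danshensu and hesperidin content of the alcohol-precipitation liquid of the traditional Chinese medicine Rukuaixiao tablets. Methods: The training-set and prediction-set samples were partitioned with the sample set partitioning based on joint x-y distance (SPXY) method. Different partial least squares methods were applied to select the effective spectral range and to build quantitative calibration models, and four were compared: interval partial least squares (iPLS), synergy interval partial least squares (SiPLS), backward interval partial least squares (BiPLS), and moving window partial least squares (MWPLS). Results: The best regression models were obtained with a three-interval SiPLS combination for danshensu and a four-interval SiPLS combination for hesperidin. The prediction correlation coefficients (R) were 0.9956 and 0.9940, the root mean square errors of cross-validation (RMSECV) were 0.0096 and 0.0083, and the root mean square errors of prediction (RMSEP) were 0.0062 and 0.0074, respectively. Conclusion: The NIR method predicts danshensu and hesperidin content well, is fast and convenient, requires no sample pretreatment, and is pollution-free, providing a basis for on-line monitoring of traditional Chinese medicine production processes.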
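The interval-based variable selection mentioned here can be illustrated by the bookkeeping step alone: splitting the wavelength axis into contiguous intervals and enumerating the interval combinations that SiPLS would evaluate. The sketch below assumes equal-width intervals and leaves out the PLS fitting itself; `split_intervals` and `synergy_candidates` are hypothetical names, not part of any library.

```python
from itertools import combinations


def split_intervals(n_vars, n_intervals):
    """Split column indices 0..n_vars-1 into contiguous (start, stop) ranges."""
    base, extra = divmod(n_vars, n_intervals)
    bounds = []
    start = 0
    for i in range(n_intervals):
        # The first `extra` intervals take one additional variable.
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def synergy_candidates(bounds, k):
    """Yield (interval_combo, column_indices) for every k-interval combination."""
    for combo in combinations(range(len(bounds)), k):
        columns = [c for i in combo for c in range(*bounds[i])]
        yield combo, columns
```

In an iPLS run each single interval would be modelled on its own, while an SiPLS run would fit one model per yielded combination and keep the one with the lowest cross-validation error.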
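Abstract: Pasting properties are among the most important processing characteristics of millet and strongly affect its processing performance and product quality. Based on visible-near-infrared spectral features, a rapid, non-destructive method is proposed for determining the pasting properties of millet without grinding the grains. First, diffuse reflectance spectra of millet were acquired in the 370-1020 nm range; the millet was then ground into flour, and a Rapid Visco Analyser (RVA) was used to measure seven pasting indices of the flour: peak viscosity (PV), trough viscosity (TV), breakdown (BD), final viscosity (FV), setback (SB), pasting temperature (GT), and peak time (PT). Next, the raw spectra were preprocessed by Savitzky-Golay (SG) smoothing, multiplicative scatter correction (MSC), and first-derivative (1-D) processing. Finally, using the three sets of processed spectra together with the measured pasting indices, the calibration and validation sets were determined with the sample set partitioning based on joint x-y distances (SPXY) method; characteristic wavelengths were selected with the successive projections algorithm (SPA); multiple linear regression (MLR) prediction models of the pasting indices were built from the reflectance signals at the characteristic wavelengths; and the prediction accuracy of the MLR models was checked on the validation-set samples. Prediction results: for the viscosity indices PV, TV, and SB, the best MLR models were built on MSC-preprocessed spectra with 9, 17, and 18 characteristic wavelengths, giving prediction correlation coefficients (Rp) of 0.9347, 0.8255, and 0.8746 and standard errors of prediction (SEP) of 174.0397, 67.2203, and 74.2818, respectively. For BD, the best MLR model used 14 characteristic wavelengths selected after SG preprocessing, with Rp = 0.9244 and SEP = 178.0201. For FV, the MLR model built on 16 characteristic wavelengths selected after 1-D processing gave Rp = 0.8531 and SEP = 132.1667. The results show that it is feasible to determine the pasting properties of unground millet using visible-near-infrared spectroscopy combined with the SPXY and SPA algorithms. The study offers millet-processing enterprises a new technical means of rapidly measuring the pasting properties of raw millet before production and thereby assessing product processing quality, and it has strong potential for practical application.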
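Of the preprocessing steps listed in this abstract, multiplicative scatter correction (MSC) is simple enough to sketch directly. The version below assumes the common formulation in which each spectrum is regressed on the mean spectrum by ordinary least squares and then corrected with the fitted offset and slope; the abstract does not say which reference spectrum the authors used, and the function name `msc` is hypothetical.

```python
def msc(spectra):
    """Apply multiplicative scatter correction against the mean spectrum.

    spectra is a list of equal-length lists of reflectance values.
    Each spectrum s is modelled as s = intercept + slope * reference and
    returned as (s - intercept) / slope.
    """
    n = len(spectra)
    m = len(spectra[0])

    # Reference spectrum: the mean over all samples at each wavelength.
    ref = [sum(s[k] for s in spectra) / n for k in range(m)]
    ref_mean = sum(ref) / m
    ref_var = sum((r - ref_mean) ** 2 for r in ref)

    corrected = []
    for s in spectra:
        s_mean = sum(s) / m
        # Least-squares slope and intercept of s regressed on the reference.
        slope = sum((r - ref_mean) * (v - s_mean) for r, v in zip(ref, s)) / ref_var
        intercept = s_mean - slope * ref_mean
        corrected.append([(v - intercept) / slope for v in s])
    return corrected
```

Spectra that differ only by an additive offset and a multiplicative scale collapse onto the same corrected curve, which is the scatter effect MSC is meant to remove.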
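Abstract: Near-infrared spectroscopy combined with chemometrics was used to determine the protein and fat content of liquid milk. Four methods for partitioning samples into training and prediction sets were compared: random sampling (RS), Kennard-Stone (KS), Duplex, and sample set partitioning based on joint x-y distance (SPXY). Outliers were removed with the Haaland method, and spectral preprocessing methods were discussed. For the fat model, the root mean square error of cross-validation (RMSECV) and the root mean square error of prediction (RMSEP) were 2.434 and 2.099, and the coefficient of determination of the prediction set (Rp²) was 0.964. For the protein model, RMSECV and RMSEP were 2.270 and 2.564, and Rp² was 0.940. The results show that the method is fast and accurate and offers an effective approach to on-site quality control of liquid milk.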
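Abstract: Moisture is an important constituent of dried persimmon and a key factor in its production process. Visible/near-infrared reflectance spectroscopy was used to determine moisture content during dried persimmon production. First, visible/near-infrared reflectance spectra (400-1000 nm) of dried persimmons were acquired at different processing stages, and the moisture content was measured by oven drying. The spectra were then preprocessed by mean smoothing (MS), multiplicative scatter correction (MSC), and first-derivative (1-D) processing. Finally, for each set of preprocessed spectra combined with the sample moisture content, the calibration and validation sets were partitioned with the sample set partitioning based on joint x-y distance (SPXY) method, characteristic wavelengths were selected with the SPA method, and multiple linear regression (MLR) prediction models were built. The results show that the moisture model built on the nine optimal wavelengths selected after MS processing gave the best predictions, with a prediction correlation coefficient (Rp) of 0.9690 and a standard error of prediction (SEP) of 3.4729%. Visible/near-infrared reflectance spectroscopy can therefore predict moisture content well during dried persimmon production, and the study provides technical support for rapid quality inspection during processing.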
Funding: YX and RG thank the Wellcome Trust for funding MetaboFlow (Grant 202952/Z/16/Z).
Abstract: Model validation is the most important part of building a supervised model. For building a model with good generalization performance one must have a sensible data splitting strategy, and this is crucial for model validation. In this study, we conducted a comparative study on various reported data splitting methods. The MixSim model was employed to generate nine simulated datasets with different probabilities of misclassification and variable sample sizes. Then partial least squares for discriminant analysis and support vector machines for classification were applied to these datasets. Data splitting methods tested included variants of cross-validation, bootstrapping, bootstrapped Latin partition, Kennard-Stone algorithm (K-S) and sample set partitioning based on joint X-Y distances algorithm (SPXY). These methods were employed to split the data into training and validation sets. The estimated generalization performances from the validation sets were then compared with the ones obtained from the blind test sets, which were generated from the same distribution but were unseen by the training/validation procedure used in model construction. The results showed that the size of the data is the deciding factor for the qualities of the generalization performance estimated from the validation set. We found that there was a significant gap between the performance estimated from the validation set and the one from the test set for all the data splitting methods employed on small datasets. Such disparity decreased when more samples were available for training/validation, and this is because the models were then moving towards approximations of the central limit theory for the simulated datasets used. We also found that having too many or too few samples in the training set had a negative effect on the estimated model performance, suggesting that it is necessary to have a good balance between the sizes of training set and validation set to have a reliable estimation of model performance. We also found that systematic sampling methods such as K-S and SPXY generally had very poor estimation of the model performance, most likely due to the fact that they are designed to take the most representative samples first and thus left a rather poorly representative sample set for model performance estimation.
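As an illustration of one of the data splitting methods this abstract compares, a single bootstrap split can be sketched as follows. It assumes the standard scheme in which the training set is drawn with replacement and the samples never drawn (the out-of-bag samples) form the validation set; it does not reproduce the study's exact protocol, and the function name `bootstrap_split` and its `seed` parameter are hypothetical.

```python
import random


def bootstrap_split(n_samples, seed=0):
    """Return (train_indices, validation_indices) for one bootstrap round.

    Training indices are drawn with replacement, so duplicates are expected.
    Validation indices are the out-of-bag samples that were never drawn.
    """
    rng = random.Random(seed)
    train = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(train)
    validation = [i for i in range(n_samples) if i not in drawn]
    return train, validation
```

For large datasets roughly 36.8% of the samples end up out-of-bag in each round, so repeating the split with different seeds and averaging the validation score is the usual way to stabilise the estimate.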