摘要
Background The geo-traceability of cotton is crucial for ensuring the quality and integrity of cotton brands. However, effective methods for achieving this traceability are currently lacking. This study investigates the potential of explainable machine learning for the geo-traceability of raw cotton.Results The findings indicate that principal component analysis(PCA) exhibits limited effectiveness in tracing cotton origins. In contrast, partial least squares discriminant analysis(PLS-DA) demonstrates superior classification performance, identifying seven discriminating variables: Na, Mn, Ba, Rb, Al, As, and Pb. The use of decision tree(DT), support vector machine(SVM), and random forest(RF) models for origin discrimination yielded accuracies of 90%, 87%, and 97%, respectively. Notably, the light gradient boosting machine(Light GBM) model achieved perfect performance metrics, with accuracy, precision, and recall rate all reaching 100% on the test set. The output of the Light GBM model was further evaluated using the SHapley Additive ex Planation(SHAP) technique, which highlighted differences in the elemental composition of raw cotton from various countries. Specifically, the elements Pb, Ni, Na, Al, As, Ba, and Rb significantly influenced the model's predictions.Conclusion These findings suggest that explainable machine learning techniques can provide insights into the complex relationships between geographic information and raw cotton. Consequently, these methodologies enhances the precision and reliability of geographic traceability for raw cotton.
基金
supported by Agricultural Science and Technology Innovation Program of Chinese Academy of Agricultural Science。