
MD-HIT: Machine learning for material property prediction with dataset redundancy control

Abstract: Materials datasets usually contain many redundant (highly similar) materials because of the tinkering approach historically used in material design. This redundancy skews the performance evaluation of machine learning (ML) models under random splitting, leading to overestimated predictive performance and poor performance on out-of-distribution samples. The issue is well known in bioinformatics for protein function prediction, where tools like CD-HIT reduce redundancy by ensuring that no pair of samples has sequence similarity greater than a given threshold. In this paper, we survey overestimated ML performance in materials science for material property prediction and propose MD-HIT, a redundancy reduction algorithm for material datasets. Applying MD-HIT to composition- and structure-based formation energy and band gap prediction problems, we demonstrate that, with redundancy control, ML models tend to score lower on test sets than models evaluated on highly redundant datasets, but the scores better reflect the models' true prediction capability.
Source: npj Computational Materials (Chinese title: 计算材料学(英文)), CSCD-indexed, 2024, Issue 1, pp. 638-648 (11 pages)
Funding: Supported in part by the National Science Foundation under grant number 2311202.
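
The abstract describes redundancy reduction in the CD-HIT sense: keep a subset of samples in which no pair is more similar than a given threshold. The sketch below illustrates that general idea with a simple greedy filter. It is not the MD-HIT implementation; the function name `greedy_redundancy_filter`, the use of Euclidean distance on toy feature vectors, and the threshold value are all assumptions made for illustration.

```python
import math


def greedy_redundancy_filter(features, threshold):
    """Greedily keep samples so that no two kept samples are closer than `threshold`.

    Hypothetical helper for illustration only. `features` is a sequence of
    equal-length numeric vectors; distance is plain Euclidean distance, which
    is an assumed stand-in for whatever similarity measure a real tool uses.
    Returns the indices of the kept (non-redundant) samples.
    """
    kept = []
    for i, x in enumerate(features):
        # Keep sample i only if it is far enough from every sample kept so far.
        if all(math.dist(x, features[j]) >= threshold for j in kept):
            kept.append(i)
    return kept


# Toy feature vectors (made up): samples 1 and 3 are near-duplicates of 0 and 2.
features = [
    (0.0, 0.0),
    (0.05, 0.0),
    (1.0, 1.0),
    (1.0, 1.02),
    (3.0, 0.0),
]

kept = greedy_redundancy_filter(features, threshold=0.1)
print(kept)  # [0, 2, 4]
```

A random train/test split drawn from the filtered indices cannot place a near-duplicate of a training sample in the test set, which is why scores measured after such filtering tend to be lower but more representative. The greedy result depends on sample order, so a real tool would need a deliberate ordering rule.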