Funding: Funded by the Natural Sciences and Engineering Research Council (NSERC) of Canada (No. RGPIN-2014-04100).
Abstract: Sampling design (SD) plays a crucial role in providing reliable input for digital soil mapping (DSM) and increasing its efficiency. Given a predetermined sample size and constraints of budget and spatial variability, sampling design is a selection procedure for identifying a set of sample locations that are either spread over geographical space or provide good feature space coverage. Good feature space coverage ensures accurate estimation of regression parameters, while spatial coverage contributes to effective spatial interpolation. First, we review several statistical and geometric SDs that mainly optimize the sampling pattern in geographical space and illustrate their strengths and weaknesses in terms of spatial coverage, simplicity, accuracy, and efficiency. Furthermore, Latin hypercube sampling, which obtains a full representation of the multivariate distribution in geographical space, is described in detail in terms of its development, improvement, and application. In addition, we discuss fuzzy k-means sampling, response surface sampling, and Kennard-Stone sampling, which optimize sampling patterns in feature space. We then discuss some practical applications, most of which are addressed by conditioned Latin hypercube sampling because of its flexibility and feasibility in accommodating multiple optimization criteria. We also discuss different methods of validation, an important stage of DSM, and conclude that an independent dataset selected by probability sampling is superior because it is free of model assumptions. For future work, we recommend: 1) exploring SDs with both good spatial coverage and feature space coverage; 2) uncovering the real impacts of an SD on the integral DSM procedure; and 3) testing the feasibility and contribution of SDs in three-dimensional (3D) DSM with variability across multiple layers.
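To make the idea of optimizing a sampling pattern in feature space concrete, the following is a minimal sketch of the Kennard-Stone selection rule named in the abstract. It is an illustration under stated assumptions, not the reviewed authors' implementation: the function name `kennard_stone`, the use of Euclidean distance, and the toy covariate values are all hypothetical choices made here.

```python
import math
from itertools import combinations


def kennard_stone(points, n_select):
    """Return indices of n_select points chosen by the Kennard-Stone rule.

    points: list of equal-length numeric tuples (covariate vectors).
    Assumption: Euclidean distance in the raw covariate space.
    """
    if not 2 <= n_select <= len(points):
        raise ValueError("n_select must be between 2 and len(points)")

    # Seed the selection with the two mutually most distant points.
    first, second = max(
        combinations(range(len(points)), 2),
        key=lambda pair: math.dist(points[pair[0]], points[pair[1]]),
    )
    selected = [first, second]

    # min_dist[i] is the distance from point i to its nearest selected point.
    min_dist = [
        min(math.dist(p, points[first]), math.dist(p, points[second]))
        for p in points
    ]

    while len(selected) < n_select:
        # Add the candidate that is farthest from the current selection.
        nxt = max(
            (i for i in range(len(points)) if i not in selected),
            key=lambda i: min_dist[i],
        )
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))

    return selected


# Hypothetical one-covariate example: five candidate sites.
candidates = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (5.0, 0.0), (0.5, 0.0)]
chosen = kennard_stone(candidates, 3)
```

In this toy case the rule first takes the two extremes and then the midpoint, which is the feature space coverage behaviour the abstract attributes to this family of designs. In practice the covariates would usually be standardized first so that no single variable dominates the distances.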
Funding: Supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. This research used computing resources provided by the Laboratory Institutional Computing Program and by the Darwin testbed at Los Alamos National Laboratory (LANL), which is funded by the Computational Systems and Software Environments subprogram of LANL's Advanced Simulation and Computing program (NNSA/DOE). Los Alamos National Laboratory is operated by Triad National Security, LLC, for the National Nuclear Security Administration of the U.S. DOE under Contract No. 89233218CNA0000001.
Abstract: In contrast to their empirical counterparts, machine-learning interatomic potentials (MLIAPs) promise to deliver near-quantum accuracy over broad regions of configuration space. However, due to their generic functional forms and extreme flexibility, they can catastrophically fail to capture the properties of novel, out-of-sample configurations, making the quality of the training set a determining factor, especially when investigating materials under extreme conditions. We propose a novel automated dataset generation method based on maximizing the information entropy of the feature distribution, aiming at extremely broad coverage of the configuration space in a way that is agnostic to the properties of specific target materials. The ability of the dataset to capture unique material properties is demonstrated on a range of unary materials, including elements with FCC (Al), BCC (W), HCP (Be, Re, and Os), graphite (C), and trigonal (Sb, Te) ground states. MLIAPs trained on this dataset are shown to be accurate over a range of application-relevant metrics, as well as extremely robust over very broad swaths of configuration space, even without dataset fine-tuning or hyperparameter optimization, making the approach extremely attractive for rapidly and autonomously developing general-purpose MLIAPs suitable for simulations in extreme conditions.
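The link between entropy and coverage can be illustrated with a small sketch. The abstract does not state which entropy estimator the authors use, so the histogram-based Shannon entropy below is a stand-in chosen here for simplicity; the function name `histogram_entropy`, the bin count, and the two toy feature sets are hypothetical.

```python
import math
from collections import Counter


def histogram_entropy(values, lo, hi, n_bins):
    """Shannon entropy (in nats) of values binned into n_bins
    equal-width bins on [lo, hi].

    Assumption: every value lies within [lo, hi].
    """
    width = (hi - lo) / n_bins
    counts = Counter(
        # Clamp so that a value equal to hi falls into the last bin.
        min(int((v - lo) / width), n_bins - 1)
        for v in values
    )
    total = len(values)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Two hypothetical sets of a single scalar feature on [0, 1]:
# one clustered around a single value, one spread across the range.
clustered = [0.48, 0.49, 0.50, 0.51, 0.52, 0.50, 0.49, 0.51]
spread = [0.05, 0.15, 0.30, 0.45, 0.55, 0.70, 0.85, 0.95]

h_clustered = histogram_entropy(clustered, 0.0, 1.0, 4)
h_spread = histogram_entropy(spread, 0.0, 1.0, 4)
```

Here the spread set fills all four bins evenly and reaches the maximum possible entropy for four bins, while the clustered set occupies only two. A dataset generator that favours higher entropy is therefore pushed toward broader coverage of the feature range, which is the intuition behind the method described above.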