Funding: supported by the National Natural Science Foundation of China (42250101) and the Macao Foundation.
Abstract: Earth's internal core and crustal magnetic fields, as measured by geomagnetic satellites such as MSS-1 (Macao Science Satellite-1) and Swarm, are vital for understanding core dynamics and tectonic evolution. To model these internal magnetic fields accurately, data selection based on specific criteria is often employed to minimize the influence of rapidly changing current systems in the ionosphere and magnetosphere. However, the quantitative impact of various data selection criteria on internal geomagnetic field modeling is not well understood. This study aims to address this issue and provide a reference for constructing and applying geomagnetic field models. First, we collect the latest MSS-1 and Swarm satellite magnetic data and summarize widely used data selection criteria in geomagnetic field modeling. Second, we briefly describe the method used to co-estimate the core, crustal, and large-scale magnetospheric fields from satellite magnetic data. Finally, we conduct a series of field modeling experiments with different data selection criteria to quantitatively estimate their influence. Our numerical experiments confirm that without selecting data from dark regions and geomagnetically quiet times, the resulting internal field differences at the Earth's surface can range from tens to hundreds of nanotesla (nT). Additionally, we find that the uncertainties introduced into field models by different data selection criteria are significantly larger than the measurement accuracy of modern geomagnetic satellites. These uncertainties should be considered when utilizing the constructed magnetic field models for scientific research and applications.
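As a rough illustration of the kind of selection criteria the abstract refers to, the sketch below filters satellite records to nightside, geomagnetically quiet conditions using illustrative thresholds on the Kp index, the rate of change of Dst, and the solar elevation angle. The field names and threshold values are assumptions, not the criteria actually applied to MSS-1 or Swarm data.

```python
import numpy as np

def select_quiet_dark(records, kp_max=2.0, dst_rate_max=2.0, sun_elev_max=-10.0):
    """Filter satellite magnetic records to dark, geomagnetically quiet conditions.

    records: structured array with fields 'kp', 'dst_rate' (nT/h) and
             'sun_elevation' (deg, negative below the horizon).
    Thresholds are illustrative; actual criteria vary between field models.
    """
    quiet = (records['kp'] <= kp_max) & (np.abs(records['dst_rate']) <= dst_rate_max)
    dark = records['sun_elevation'] <= sun_elev_max
    return records[quiet & dark]

# Example with synthetic records
rec = np.zeros(5, dtype=[('kp', 'f4'), ('dst_rate', 'f4'), ('sun_elevation', 'f4')])
rec['kp'] = [1.0, 3.3, 0.7, 2.0, 1.3]
rec['dst_rate'] = [0.5, 1.0, 4.0, -1.5, 0.2]
rec['sun_elevation'] = [-25.0, -40.0, -15.0, 5.0, -30.0]
print(select_quiet_dark(rec))   # keeps only rows meeting all three criteria
```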
Funding: This research was supported by a government-wide R&D fund project for infectious disease research (GFID), Republic of Korea (Grant Number: HG19C0682).
Abstract: One popular strategy to reduce the enormous number of illnesses and deaths from a seasonal influenza pandemic is to obtain the influenza vaccine on time. Usually, vaccine production preparation must be done at least six months in advance, and accurate long-term influenza forecasting is essential for this. Although diverse machine learning models have been proposed for influenza forecasting, they focus on short-term forecasting, and their performance is too dependent on input variables. For a country's long-term influenza forecasting, typical surveillance data are known to be more effective than diverse external data from the Internet. We propose a two-stage data selection scheme for worldwide surveillance data to construct a long-term forecasting model for influenza in the target country. In the first stage, using a simple forecasting model based on the country's surveillance data, we measure the change in performance when adding surveillance data from other countries, shifted by up to 52 weeks. In the second stage, for each set of surveillance data sorted by accuracy, we incrementally add the data as input if they have a positive effect on the performance of the forecasting model from the first stage. Using the selected surveillance data, we train a new long-term forecasting model for influenza and perform influenza forecasting for the target country. We conducted extensive experiments using six machine learning models for three target countries to verify the effectiveness of the proposed method, and we report some of the results.
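A minimal sketch of the second-stage greedy selection described above, under the assumption that the candidate surveillance series have already been lag-shifted and ranked by their first-stage accuracy. The ridge-regression forecaster, the `fit_and_score` helper, and the MAE criterion are illustrative stand-ins, not the authors' six models.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

def fit_and_score(X_tr, y_tr, X_va, y_va):
    """Train the stand-in forecaster and return its validation MAE."""
    model = Ridge().fit(X_tr, y_tr)
    return mean_absolute_error(y_va, model.predict(X_va))

def greedy_select(candidates, base_train, base_val, y_train, y_val):
    """Stage-2 greedy selection: candidates are (train_col, val_col) pairs,
    already lag-shifted and sorted by stage-1 accuracy; each is kept only
    if it lowers the validation MAE of the growing model."""
    X_tr, X_va = base_train, base_val
    best = fit_and_score(X_tr, y_train, X_va, y_val)
    selected = []
    for i, (c_tr, c_va) in enumerate(candidates):
        trial_tr = np.column_stack([X_tr, c_tr])
        trial_va = np.column_stack([X_va, c_va])
        score = fit_and_score(trial_tr, y_train, trial_va, y_val)
        if score < best:                          # keep only if it helps
            best, X_tr, X_va = score, trial_tr, trial_va
            selected.append(i)
    return selected, best

# Tiny synthetic demo
rng = np.random.default_rng(0)
y_tr, y_va = rng.normal(size=100), rng.normal(size=30)
base_tr, base_va = rng.normal(size=(100, 3)), rng.normal(size=(30, 3))
cands = [(rng.normal(size=100), rng.normal(size=30)) for _ in range(5)]
print(greedy_select(cands, base_tr, base_va, y_tr, y_va))
```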
Funding: NOAA Grant NA17RJ1227 and NSF Grant EIA-0205628 provided financial support for this work; also supported by RSF Grant 14-41-00039.
Abstract: Geophysical data sets are growing at an ever-increasing rate, requiring computationally efficient data selection (thinning) methods to preserve essential information. Satellites, such as WindSat, provide large data sets for assessing the accuracy and computational efficiency of data selection techniques. A new data thinning technique, based on support vector regression (SVR), is developed and tested. To manage large online satellite data streams, observations from WindSat are formed into subsets by Voronoi tessellation and each subset is then thinned by SVR (TSVR). Three experiments are performed. The first confirms the viability of TSVR for a relatively small sample, comparing it to several commonly used data thinning methods (random selection, averaging, and Barnes filtering) and producing a 10% thinning rate (90% data reduction), low mean absolute errors (MAE), and large correlations with the original data. A second experiment, using a larger dataset, shows TSVR retrievals with MAE < 1 m s^-1 and correlations ≥ 0.98; TSVR was an order of magnitude faster than the commonly used thinning methods. A third experiment applies a two-stage pipeline to TSVR to accommodate online data. The pipeline subsets reconstruct the wind field with the same accuracy as in the second experiment and are an order of magnitude faster than the non-pipeline TSVR. Pipeline TSVR is therefore two orders of magnitude faster than commonly used thinning methods that ingest the entire data set. This study demonstrates that pipeline TSVR thinning is an accurate and computationally efficient alternative to commonly used data selection techniques.
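The following sketch imitates the TSVR idea: observations are partitioned into the Voronoi cells of k-means centroids, an SVR is fitted per cell, and only the support vectors are retained as the thinned data. The cell count, kernel, and epsilon are assumptions; the published method's tessellation and retrieval details may differ.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVR

def tsvr_thin(coords, values, n_cells=20, eps=0.5):
    """Thin a scattered field: Voronoi-like partition via k-means centroids,
    then keep only the SVR support vectors inside each cell.
    coords: (N, 2) lon/lat positions, values: (N,) observed wind speed."""
    labels = KMeans(n_clusters=n_cells, n_init=10, random_state=0).fit_predict(coords)
    keep = []
    for c in range(n_cells):
        idx = np.where(labels == c)[0]
        if len(idx) < 5:
            keep.extend(idx)                 # too few points to thin safely
            continue
        svr = SVR(kernel='rbf', epsilon=eps).fit(coords[idx], values[idx])
        keep.extend(idx[svr.support_])       # support vectors summarize the cell
    return np.array(sorted(keep))

# Synthetic demo: 10,000 points reduced to the retained support vectors
rng = np.random.default_rng(1)
xy = rng.uniform(0, 10, size=(10000, 2))
v = np.sin(xy[:, 0]) + 0.1 * rng.normal(size=10000)
kept = tsvr_thin(xy, v)
print(f"kept {len(kept)} of {len(v)} observations")
```

Raising `epsilon` widens the SVR's insensitive tube and retains fewer support vectors, so it acts as the thinning-rate knob in this sketch.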
Funding: partially supported by the National Natural Science Foundation of China (Nos. 62072151, 62376236, 61932009); the Anhui Provincial Natural Science Fund for Distinguished Young Scholars, China (No. 2008085J30); the Open Foundation of Yunnan Key Laboratory of Software Engineering, China (No. 2023SE103); the CCF-Baidu Open Fund; the CAAI-Huawei MindSpore Open Fund; the Shenzhen Science and Technology Program, China (No. ZDSYS20230626091302006); and the Key Project of Science and Technology of Guangxi, China (No. AB22035022-2021AB20147).
Abstract: Post-training quantization (PTQ) can reduce the memory footprint and latency of deep model inference while still preserving model accuracy, using only a small unlabeled calibration set and without retraining on the full training set. To calibrate a quantized model, current PTQ methods usually randomly select some unlabeled data from the training set as calibration data. However, we show that random data selection results in performance instability and degradation due to activation distribution mismatch. In this paper, we address the crucial task of appropriate calibration data selection and propose a novel one-shot calibration data selection method, termed SelectQ, which selects specific data for calibration via dynamic clustering. SelectQ uses activation statistics and performs layer-wise clustering to learn the activation distribution of the training set. For that purpose, a new metric called knowledge distance is proposed to calculate the distances of activation statistics to the centroids. Finally, after calibration with the selected data, quantization noise can be alleviated by mitigating the distribution mismatch within activations. Extensive experiments on the ImageNet dataset show that SelectQ increases the top-1 accuracy of ResNet18 by over 15% in 4-bit quantization compared with randomly sampled calibration data. It is noteworthy that SelectQ involves neither backward propagation nor batch normalization parameters, which means it has fewer limitations in practical applications.
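A loose sketch of the selection idea (not SelectQ itself): collect per-sample activation statistics at one convolutional layer with a forward hook, cluster them, and pick the samples closest to each centroid so the calibration set covers the activation distribution. The layer, the mean/std statistics, and the sample budget are placeholders; the paper's knowledge-distance metric and layer-wise procedure are richer than this.

```python
import torch
from sklearn.cluster import KMeans

@torch.no_grad()
def select_calibration(model, layer, candidates, n_select=64, n_clusters=8):
    """Pick calibration samples whose activation statistics at one conv layer
    cover the cluster structure of the candidate pool; a rough stand-in for
    SelectQ's layer-wise clustering with its knowledge-distance metric."""
    feats = []
    hook = layer.register_forward_hook(
        lambda m, i, o: feats.append(
            torch.stack([o.mean(dim=(1, 2, 3)), o.std(dim=(1, 2, 3))], dim=1)))
    model.eval()
    for batch in candidates:               # iterable of image batches (N, C, H, W)
        model(batch)
    hook.remove()
    stats = torch.cat(feats).cpu().numpy()          # (num_samples, 2): mean and std
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(stats)
    chosen, per_cluster = [], n_select // n_clusters
    for c in range(n_clusters):
        idx = (km.labels_ == c).nonzero()[0]
        d = ((stats[idx] - km.cluster_centers_[c]) ** 2).sum(axis=1)
        chosen.extend(idx[d.argsort()[:per_cluster]])   # samples nearest the centroid
    return chosen
```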
Abstract: Dear Editor, we read with interest the meta-analysis by Cho et al. [1], "Obesity and mortality in patients with COVID-19: A meta-analysis of prospective studies", and congratulate the authors. However, we have a few comments concerning the methodology.
Funding: supported by the Preeminent Youth Fund of Sichuan Province, China (Grant No. 2012JQ0012); the National Natural Science Foundation of China (Grant Nos. 11173008, 10974202, and 60978049); and the National Key Scientific and Research Equipment Development Project of China (Grant No. ZDYZ2013-2).
Abstract: For the accurate extraction of cavity decay time, a selection of data points is added to the weighted least-squares method. We derive the expected precision, accuracy, and computation cost of this improved method, and examine these performances by simulation. By comparing this method with the nonlinear least-squares fitting (NLSF) method and the linear regression of the sum (LRS) method in derivations and simulations, we find that it can achieve the same or even better precision, comparable accuracy, and lower computation cost. We also test the method on experimental decay signals; the results agree with those obtained from the nonlinear least-squares fitting method.
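A small sketch of the general approach, assuming the common linearization of an exponential ring-down: discard points below a noise-based threshold (the data selection step), then fit ln(y) against t by weighted least squares with weights proportional to y^2, which is the standard weighting for a log-transformed exponential. The threshold factor and weighting here are textbook choices, not necessarily those derived in the paper.

```python
import numpy as np

def decay_time_wls(t, y, noise_floor):
    """Estimate decay time tau from y(t) ~ A*exp(-t/tau) + noise.
    Points below a few times the noise floor are discarded (data selection),
    then ln(y) is fitted against t by weighted least squares."""
    keep = y > 5.0 * noise_floor               # selection threshold is illustrative
    t, y = t[keep], y[keep]
    w = y ** 2                                 # var(ln y) ~ sigma^2 / y^2
    # Weighted least squares for ln(y) = ln(A) - t / tau
    W = np.sum(w)
    tm, lm = np.sum(w * t) / W, np.sum(w * np.log(y)) / W
    slope = np.sum(w * (t - tm) * (np.log(y) - lm)) / np.sum(w * (t - tm) ** 2)
    return -1.0 / slope                        # tau = -1 / slope

# Synthetic ring-down signal: tau = 10 us, sampled over 50 us
rng = np.random.default_rng(2)
t = np.linspace(0, 50e-6, 2000)
y = np.exp(-t / 10e-6) + rng.normal(0, 0.01, t.size)
print(decay_time_wls(t, y, noise_floor=0.01))  # should be close to 1e-5 s
```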
Funding: This work was supported by Universiti Sains Malaysia under an external grant (Grant Number 304/PNAV/650958/U154).
Abstract: Interest in selecting an appropriate cloud data center is increasing rapidly due to the popularity and continuous growth of the cloud computing sector. Cloud data center selection challenges are compounded by ever-increasing user requests and the number of data centers required to execute them. A cloud service broker policy defines the selection of a cloud data center, which is an NP-hard problem that requires an efficient, high-quality solution. The differential evolution algorithm is a metaheuristic characterized by its speed and robustness, and it is well suited to selecting an appropriate cloud data center. This paper presents a modified differential evolution algorithm-based cloud service broker policy for selecting the most appropriate data center in the cloud computing environment. The differential evolution algorithm is modified using a proposed new mutation technique that ensures enhanced performance and an appropriate selection of data centers. The superiority of the proposed policy in selecting the most suitable data center is evaluated using the CloudAnalyst simulator, and the results are compared with state-of-the-art cloud service broker policies.
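For reference, a plain DE/rand/1/bin minimizer is sketched below. The broker policy described above modifies the mutation step and works on a discrete data-center encoding, neither of which is reproduced here; the toy cost function is purely illustrative.

```python
import numpy as np

def differential_evolution(cost, bounds, pop_size=30, F=0.8, CR=0.9, iters=200, seed=0):
    """Classic DE/rand/1/bin minimizer over box-constrained real variables."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds[:, 0], bounds[:, 1]
    pop = rng.uniform(lo, hi, size=(pop_size, len(lo)))
    fit = np.array([cost(x) for x in pop])
    for _ in range(iters):
        for i in range(pop_size):
            a, b, c = pop[rng.choice([j for j in range(pop_size) if j != i], 3,
                                     replace=False)]
            mutant = np.clip(a + F * (b - c), lo, hi)       # rand/1 mutation
            cross = rng.random(len(lo)) < CR
            cross[rng.integers(len(lo))] = True             # ensure >= 1 gene crosses
            trial = np.where(cross, mutant, pop[i])         # binomial crossover
            f = cost(trial)
            if f < fit[i]:                                  # greedy selection
                pop[i], fit[i] = trial, f
    return pop[fit.argmin()], fit.min()

# Toy cost: squared distance to a "best" configuration in 4 dimensions
best_x, best_f = differential_evolution(lambda x: np.sum((x - 3.0) ** 2),
                                        np.array([[0.0, 10.0]] * 4))
print(best_x.round(2), round(best_f, 4))
```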
Funding: supported by the National High Technology Research and Development Program of China (863 Program) (No. 2012AA040608); the National Natural Science Foundation of China (Nos. 61473279, 61004131); and the Development of Scientific Research Equipment Program of the Chinese Academy of Sciences (No. YZ201247).
Abstract: Principal component analysis (PCA) combined with artificial neural networks was used to classify the spectra of 27 steel samples acquired using laser-induced breakdown spectroscopy. Three methods of spectral data selection (selecting all the peak lines of the spectra, selecting intensive spectral partitions, and using the whole spectra) were utilized to compare the influence of different PCA inputs on the classification of the steels. Three intensive partitions were selected based on experience and prior knowledge, as such partitions can obtain better results than all peak lines or the whole spectra. We also used two test data sets, mean spectra after averaging and raw spectra without any pretreatment, to verify the classification results. This comprehensive comparison shows that a back-propagation network trained using the principal components of appropriate, carefully selected spectral partitions obtains the best results; a perfect result with 100% classification accuracy can be achieved using the intensive spectral partition ranging from 357 to 367 nm.
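A compact sketch of the partition-then-PCA-then-network workflow on synthetic spectra, using scikit-learn's MLP as the back-propagation classifier. The 357-367 nm window is taken from the abstract, while the component count, network size, and synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

def partition_pca_mlp(wavelengths, spectra, labels, window=(357.0, 367.0), n_pc=10):
    """Classify spectra from one intensive spectral partition: slice the window,
    reduce with PCA, then train a back-propagation (MLP) classifier."""
    sel = (wavelengths >= window[0]) & (wavelengths <= window[1])
    clf = make_pipeline(PCA(n_components=n_pc),
                        MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000,
                                      random_state=0))
    clf.fit(spectra[:, sel], labels)
    return clf, sel

# Synthetic stand-in for 27 steel classes
rng = np.random.default_rng(3)
wl = np.linspace(200, 800, 3000)
y = rng.integers(0, 27, 270)
spec = rng.random((270, 3000)) + y[:, None] * 0.02 * (np.abs(wl - 362) < 5)
model, mask = partition_pca_mlp(wl, spec, y)
print("train accuracy:", model.score(spec[:, mask], y))
```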
Funding: Supported by the National Earthquake Major Project of China (201008007) and the Fundamental Research Funds for Central Universities of China (216275645).
Abstract: This paper proposes a model to analyze massive electricity data. A feature subset is determined by correlation-based feature selection and data-driven methods. The attribute season can be classified successfully by five classifiers using the selected feature subset, and the best model can then be determined. The effects of the other three attributes (months, businesses, and meters) on electricity consumption analysis can be estimated using the chosen model. The data used for the project were provided by the Beijing Power Supply Bureau, and WEKA was used as the machine learning tool. The models we built are promising for electricity scheduling and power theft detection.
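A rough Python counterpart of this workflow (the paper itself uses WEKA): rank features by their correlation with the class label as a simplified stand-in for correlation-based feature selection, then compare several classifiers by cross-validation. The single-variable correlation ranking is a simplification of a full CFS subset search, and the data, classifiers, and feature count are placeholders.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

def correlation_rank(X, y, k=5):
    """Rank features by absolute Pearson correlation with the class label."""
    r = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    return np.argsort(r)[::-1][:k]

def compare_classifiers(X, y, feats):
    """Cross-validate several classifiers on the selected feature subset."""
    models = {"tree": DecisionTreeClassifier(random_state=0),
              "nb": GaussianNB(),
              "knn": KNeighborsClassifier()}
    return {name: cross_val_score(m, X[:, feats], y, cv=5).mean()
            for name, m in models.items()}

# Synthetic consumption-style data: 12 features, a 4-valued "season" label
rng = np.random.default_rng(4)
X = rng.normal(size=(400, 12))
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int) + 2 * (X[:, 7] > 0).astype(int)
feats = correlation_rank(X, y)
print(feats, compare_classifiers(X, y, feats))
```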
Abstract: Understanding the factors shaping species' distributions is a key long-standing topic in ecology with unresolved issues. The aims were to test whether the relative contribution of abiotic factors that set the geographical range of freshwater fish species varies spatially and/or depends on the geographical extent being considered. The relative contribution of factors to discriminating between the conditions prevailing in the area where a species is present and those existing in the considered extent was estimated with the instability index included in the R package SPEDInstabR. We used three different extent sizes: 1) each river basin where the species is present (local); 2) all river basins where the species is present (regional); and 3) the whole Earth (global). We used a data set of 16,543 freshwater fish species with a total of 845,764 geographical records, together with bioclimatic and topographic variables. Factors associated with temperature and altitude show the highest relative contribution to explaining the distribution of freshwater fishes at the smallest considered extent. Altitude and a mix of factors associated with temperature and precipitation were more important when using the regional extent. Factors associated with precipitation show the highest contribution when using the global extent. There was also spatial variability in the importance of factors, both between and within species and from region to region. Factors associated with precipitation show a clear latitudinal trend of decreasing importance toward the equator.
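The instability index itself is implemented in SPEDInstabR; as a rough stand-in for the idea, the sketch below scores how strongly the conditions at presence records depart from the background of a chosen extent and normalizes the scores into relative contributions, showing how the same occurrences can yield different factor rankings against a local versus a global background. The index formula here is illustrative, not the one used by the package.

```python
import numpy as np

def relative_contribution(presence, extent):
    """For each environmental factor, score how far presence conditions depart
    from the background of the chosen extent, then normalize to sum to 1.
    presence: (n, p) factor values at occurrence records.
    extent:   (m, p) factor values sampled over the whole considered extent."""
    dep = np.abs(presence.mean(axis=0) - extent.mean(axis=0)) / (extent.std(axis=0) + 1e-12)
    return dep / dep.sum()

# The same presences evaluated against a local vs. a global background
rng = np.random.default_rng(5)
pres = rng.normal(loc=[20.0, 500.0], scale=[2.0, 50.0], size=(300, 2))     # temp, altitude
local = rng.normal(loc=[19.0, 600.0], scale=[3.0, 200.0], size=(5000, 2))
global_ = rng.normal(loc=[10.0, 800.0], scale=[10.0, 600.0], size=(5000, 2))
print("local extent:", relative_contribution(pres, local).round(2))
print("global extent:", relative_contribution(pres, global_).round(2))
```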
Funding: supported by the National Natural Science Foundation of China (72101066, 72131005, 72121001, 72171062, 91846301, and 71772053); the Heilongjiang Natural Science Excellent Youth Fund (YQ2022G004); and the Key Research and Development Projects of Heilongjiang Province (JD22A003).
Abstract: Numerical weather prediction (NWP) data possess internal inaccuracies, such as low NWP wind speed corresponding to high actual wind power generation. This study aims to reduce the negative effects of such inaccuracies by proposing a pure data-selection framework (PDF) to choose useful data prior to modeling, thus improving the accuracy of day-ahead wind power forecasting. Briefly, we convert an entire NWP training dataset into many small subsets and then select the best subset combination via a validation set to build a forecasting model. Although small subsets increase selection flexibility, they can also produce billions of subset combinations, resulting in computational issues. To address this problem, we incorporate metamodeling and optimization steps into PDF, proposing a design-and-analysis-of-computer-experiments-based metamodeling algorithm and a heuristic-exhaustive search optimization algorithm, respectively. Experimental results demonstrate that (1) it is necessary to select data before constructing a forecasting model; (2) using smaller subsets increases selection flexibility, leading to a more accurate forecasting model; (3) PDF generates a better training dataset than similarity-based data selection methods (e.g., K-means and support vector classification); and (4) choosing data before building a forecasting model produces a more accurate model than using a machine learning method to construct a model directly.
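A brute-force sketch of the subset-combination idea for a handful of subsets: fit a model on every combination of training subsets and keep the combination with the lowest validation error. This is exactly the combinatorial blow-up the paper avoids with its metamodeling and heuristic-exhaustive search, so the sketch only illustrates the selection criterion; the linear model and toy data are assumptions.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def best_subset_combination(subsets, X_val, y_val, min_k=1):
    """Try every combination of training subsets and keep the one whose fitted
    model scores best on the validation set."""
    best_err, best_combo = np.inf, None
    for k in range(min_k, len(subsets) + 1):
        for combo in combinations(range(len(subsets)), k):
            X = np.vstack([subsets[i][0] for i in combo])
            y = np.concatenate([subsets[i][1] for i in combo])
            model = LinearRegression().fit(X, y)
            err = mean_squared_error(y_val, model.predict(X_val))
            if err < best_err:
                best_err, best_combo = err, combo
    return best_combo, best_err

# Toy NWP-style data: subset 2 is much noisier and will typically be dropped
rng = np.random.default_rng(6)
def make(n, bad=False):
    X = rng.normal(size=(n, 3))
    noise = rng.normal(scale=3.0, size=n) if bad else rng.normal(scale=0.1, size=n)
    return X, X @ np.array([2.0, -1.0, 0.5]) + noise
subsets = [make(80), make(80), make(80, bad=True), make(80)]
X_val, y_val = make(100)
print(best_subset_combination(subsets, X_val, y_val))
```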