Abstract: This paper describes how data records can be matched across large datasets using a technique called the Identity Correlation Approach (ICA). The ICA technique is then compared with a string matching exercise. Both were employed in a big data project carried out by the CSO, the Structure of Earnings Survey Administrative Data Project (SESADP), which involved linking the 2011 Irish Census dataset to a large Public Sector Dataset. The ICA technique provides a mathematical tool to link the datasets, and the matching rate for an exact match can be calculated before the matching process begins. Based on the number of variables and the size of the population, the matching rate is calculated in the ICA from the MRUI (Matching Rate for Unique Identifier) formula, and false positives are eliminated. No string matching is used in the ICA, so names are not required on the dataset, which makes the data more secure and ensures confidentiality. The SESADP was highly successful using the ICA technique. The results of the string matching exercise and of the ICA for the SESADP are compared and discussed here.
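As a rough illustration of name-free exact matching, the sketch below links two record sets on a combination of non-name variables and keeps only combinations that are unique in both sets. The function name `link_on_unique_keys`, the variables (`birth_date`, `sex`, `county`), and the uniqueness filter are assumptions made for illustration; the abstract does not give the MRUI formula, and it is not reproduced here.

```python
from collections import Counter

def link_on_unique_keys(left, right, variables):
    """Link records whose combination of `variables` is unique in both datasets."""
    def key(rec):
        return tuple(rec[v] for v in variables)

    left_counts = Counter(key(r) for r in left)
    right_counts = Counter(key(r) for r in right)
    # Index only right-hand records whose key occurs exactly once.
    right_by_key = {key(r): r for r in right if right_counts[key(r)] == 1}

    links = []
    for rec in left:
        k = key(rec)
        # A key shared by several left-hand records is ambiguous, so skip it.
        if left_counts[k] == 1 and k in right_by_key:
            links.append((rec["id"], right_by_key[k]["id"]))
    return links

# Hypothetical toy records; no names are needed for the match.
census = [
    {"id": "c1", "birth_date": "1970-01-02", "sex": "F", "county": "Cork"},
    {"id": "c2", "birth_date": "1980-05-06", "sex": "M", "county": "Dublin"},
    {"id": "c3", "birth_date": "1980-05-06", "sex": "M", "county": "Dublin"},
]
public_sector = [
    {"id": "p1", "birth_date": "1970-01-02", "sex": "F", "county": "Cork"},
    {"id": "p2", "birth_date": "1980-05-06", "sex": "M", "county": "Dublin"},
]
links = link_on_unique_keys(census, public_sector, ["birth_date", "sex", "county"])
```

Records `c2` and `c3` share the same variable combination, so neither is linked; discarding ambiguous keys is one simple way to avoid false positives, and adding more variables makes more keys unique.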
Abstract: Data centers are being distributed worldwide by cloud service providers (CSPs) to save energy costs through efficient workload allocation strategies. Many CSPs are challenged by the significant rise in user demand because of the extensive energy consumed during workload processing. Numerous studies have examined operating cost mitigation techniques for geo-distributed data centers (DCs). However, operating cost savings during workload processing that also take string-matching techniques into account in geo-distributed DCs remain unexplored. In this research, we propose a novel string matching-based geographical load balancing (SMGLB) technique to reduce the operating cost of geo-distributed DCs. The primary goal of this study is to use a string-matching algorithm (i.e., Boyer-Moore) to compare the contents of incoming workloads with those of documents that have already been processed in a data center. On a successful match, the global load balancer does not send the user's request to a data center for processing; instead, the results of the previously processed workload are displayed to the user, which saves energy. If no match is found, the global load balancer allocates the incoming workload to a specific DC for processing, considering variable energy prices, the number of active servers, on-site green energy, and traces of incoming workload. Numerical evaluations show that SMGLB reduces the operating expenses of geo-distributed data centers more than existing workload distribution techniques.
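The match-then-dispatch idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: `boyer_moore_find` uses only the bad-character rule of Boyer-Moore (a simplification of the full algorithm), and `dispatch`, the `processed` store, and the cheapest-price placement rule are hypothetical stand-ins for the global load balancer described in the abstract.

```python
def boyer_moore_find(text, pattern):
    """Return the first index of `pattern` in `text`, or -1 (bad-character rule only)."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    last = {ch: i for i, ch in enumerate(pattern)}  # rightmost index of each character
    s = 0
    while s <= n - m:
        j = m - 1
        # Compare right to left, as Boyer-Moore does.
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1
        if j < 0:
            return s
        # Shift so the mismatched text character aligns with its last occurrence.
        s += max(1, j - last.get(text[s + j], -1))
    return -1

def dispatch(workload, processed, choose_dc):
    """Serve a stored result if the workload matches a processed document, else pick a DC."""
    for doc, result in processed.items():
        if boyer_moore_find(doc, workload) != -1:
            return ("cache", result)
    return ("dc", choose_dc(workload))

processed = {"count words in report A": "1024 words"}  # hypothetical processed store
prices = {"dc-east": 0.12, "dc-west": 0.09}            # hypothetical energy prices

def cheapest(_workload):
    # Assumed placement rule: lowest energy price only.
    return min(prices, key=prices.get)

hit = dispatch("count words in report A", processed, cheapest)
miss = dispatch("count words in report B", processed, cheapest)
```

Here `hit` is answered from the stored result, while `miss` falls through to the placement rule. A fuller placement rule would also weigh active servers, on-site green energy, and workload traces, which the abstract lists but does not specify.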
Funding: Supported by the National Natural Science Foundation of China under Grant No. 11675143 and the National Key Research and Development Program of China under Grant No. 2020YFC2201503.
Abstract: In this paper we consider a static, spherically symmetric black hole (BH) embedded in a Dehnen-(1,4,0)-type dark matter (DM) halo in the presence of a cloud string. We examine how the core density parameter of the DM halo and the cloud string parameter affect BH attributes such as quasinormal modes (QNMs) and the shadow cast. To do this, we first study the effective potential of the perturbation equations for three types of perturbation fields with different spins: the massless scalar field, the electromagnetic field, and the gravitational field. Then, using the sixth-order Wentzel-Kramers-Brillouin approximation, we examine the QNMs of the BH perturbed by the three fields and derive the quasinormal frequencies. The changes in the QNMs with the core density parameter and the cloud string parameter are explored for the three perturbations. We also investigate how the core density and the cloud string parameter affect the photon sphere and the shadow radius. Interestingly, the study shows that the influence of Dehnen-type DM and cloud strings increases both the photon sphere and the shadow radius. Finally, we use observational data from Sgr A* and M87* to constrain the BH parameters.
Funding: Financial support from the National Natural Science Foundation of China (42261026 and 42161025) and the Open Foundation of Xinjiang Key Laboratory of Water Cycle and Utilization in Arid Zone (XJYS0907-2023-01).
Abstract: Precipitation types primarily include rainfall, snowfall, and sleet, and the transformation of precipitation types has significant impacts on regional climate, ecosystems, and the land-atmosphere system. This study employs the Ding method to separate precipitation types in three datasets (CMFD, ERA5_Land, and CN05.1). Using data from 26 meteorological observation stations in the Chinese Tianshan Mountains Region (CTMR) as the validation dataset, the precipitation type separation accuracy of the three datasets was evaluated. Additionally, the impacts of relative humidity, precipitation amount, and air temperature on the accuracy of precipitation type separation were analyzed. The results indicate that the CMFD dataset provides the highest separation accuracy, followed by CN05.1, with ERA5_Land showing the poorest performance. Spatial correlation analysis reveals that CMFD outperforms the other two datasets at both annual and monthly scales. Root Mean Square Error (RMSE) and Mean Deviation (MD) values suggest that CMFD is more consistent with the station observations. The analysis further demonstrates that relative humidity and precipitation amount significantly affect separation accuracy. After bias correction, the correlation coefficients between CMFD, ERA5_Land, and the station observations improved to 0.85-0.94, while the RMSE was kept within 2 mm. The study also revealed that the overestimation of precipitation was positively correlated with the overestimation of rainfall days and negatively correlated with the overestimation of snowfall days, and that underestimated air temperatures led to an increase in the misclassification of snowfall days. This research provides a basis for selecting climate change datasets and managing water resources in alpine regions.
Funding: Under the auspices of the National Key Research and Development Program of China (No. 2019YFA0606603).
Abstract: Fire season affects the dynamics of post-fire vegetation communities and carbon emissions. Analyzing its global patterns supports understanding of the ecological impacts of fires and the responses of fires to climate change. Meteorological variables have been widely used to quantify fire season in current studies. However, their results cannot be used to assess climate impacts on the seasonality of fire activity. Here we used satellite-based Moderate Resolution Imaging Spectroradiometer (MODIS) burned area data from 2001 to 2022 to identify global fire season types based on the number of peaks within a year. We processed the satellite data to obtain a more accurate estimate of fire season length. We classified fire season types and examined their spatial distribution across the Köppen-Geiger climate (KGC) zones. At a global scale, we identified three major fire season types: unimodal (31.25%), bimodal (52.07%), and random (16.69%). The unimodal fire season primarily occurs in boreal and tropical regions and lasts about 2.7 months. In comparison, temperate ecosystems tend to have a longer fire season (3 months) with two peaks throughout the year. The KGC zones show divergent contributions from the fire season types, indicating potential impacts of climatic conditions on fire seasonality in these regions.
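Classifying a fire season by its number of peaks within a year could look roughly like the sketch below. The abstract does not describe the actual peak-detection procedure, so everything here is an assumption for illustration: the circular local-maximum test, the `min_share` threshold, the sample burned-area values, and the mapping of any other peak count to "random".

```python
def count_peaks(monthly, min_share=0.1):
    """Count local maxima in a circular monthly series that hold at least
    `min_share` of the annual total (threshold is an assumed parameter)."""
    total = sum(monthly)
    if total <= 0:
        return 0
    n = len(monthly)
    peaks = 0
    for i, value in enumerate(monthly):
        prev_value = monthly[i - 1]        # index -1 wraps December -> January
        next_value = monthly[(i + 1) % n]
        if value > prev_value and value >= next_value and value / total >= min_share:
            peaks += 1
    return peaks

def fire_season_type(monthly):
    """Map the peak count to a season type; other counts fall back to 'random'."""
    return {1: "unimodal", 2: "bimodal"}.get(count_peaks(monthly), "random")

# Hypothetical monthly burned area for two pixels (arbitrary units).
one_peak = [0, 0, 1, 2, 8, 20, 9, 3, 1, 0, 0, 0]
two_peaks = [1, 6, 15, 5, 1, 0, 1, 4, 12, 5, 1, 0]
types = (fire_season_type(one_peak), fire_season_type(two_peaks))
```

The share threshold keeps small bumps in otherwise quiet months from being counted as separate peaks.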
Funding: Supported by the National Key Research and Development Program of China (No. 2018YFB1500803), the National Natural Science Foundation of China (No. 61773118, No. 61703100), and the Fundamental Research Funds for Central Universities.
Abstract: Boosted by a strong solar power market, the electricity grid is exposed to risk from an increasing share of fluctuating solar power. To increase the stability of the electricity grid, an accurate solar power forecast is needed to evaluate such fluctuations. For forecasting, solar irradiance is the key factor in solar power generation, and it is affected by atmospheric conditions, including surface meteorological variables and column-integrated variables. These variables involve multiple numerical time series and images. However, few studies have focused on methods for processing multiple data types in an inter-hour direct normal irradiance (DNI) forecast. In this study, a framework for predicting the DNI over a 10-min time horizon was developed, which included the nondimensionalization of multiple data types and time series, the development of a forecast model, and the transformation of the outputs. Several atmospheric variables were considered in the forecast framework, including historical DNI, wind speed and direction, relative humidity time series, and ground-based cloud images. Experiments were conducted to evaluate the performance of the forecast framework. The experimental results demonstrate that the proposed method performs well, with a normalized mean bias error of 0.41% and a normalized root mean square error (nRMSE) of 20.53%, and outperforms the persistence model with an improvement of 34% in the nRMSE.
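The error metrics quoted above can be computed as in the following sketch. It assumes that both errors are normalized by the mean observed DNI and that the improvement over persistence is measured as 1 - nRMSE_model / nRMSE_persistence; the abstract does not state either convention, and the DNI values are made up for illustration.

```python
import math

def nmbe(forecast, observed):
    """Normalized mean bias error (normalized by the mean observation; assumed convention)."""
    mean_obs = sum(observed) / len(observed)
    bias = sum(f - o for f, o in zip(forecast, observed)) / len(observed)
    return bias / mean_obs

def nrmse(forecast, observed):
    """Normalized root mean square error (normalized by the mean observation; assumed convention)."""
    mean_obs = sum(observed) / len(observed)
    mse = sum((f - o) ** 2 for f, o in zip(forecast, observed)) / len(observed)
    return math.sqrt(mse) / mean_obs

def improvement(nrmse_model, nrmse_reference):
    """Relative nRMSE reduction of the model against a reference forecast."""
    return 1.0 - nrmse_model / nrmse_reference

# Hypothetical 10-min DNI samples in W/m^2.
dni = [580.0, 600.0, 650.0, 700.0, 640.0]
observed = dni[1:]
persistence = dni[:-1]                 # persistence: the next value equals the current one
model = [610.0, 640.0, 690.0, 650.0]   # made-up model output

skill = improvement(nrmse(model, observed), nrmse(persistence, observed))
```

With these toy numbers the model errors cancel in sign, so the bias is zero while the nRMSE is not, which is why the two metrics are reported together.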
Funding: Supported by the National Natural Science Foundation of China (No. U21B2062) and the Natural Science Foundation of Hubei Province (No. 2023AFB307).
Abstract: Identifying reservoir types in deep carbonates has always been a great challenge because of the complex logging responses caused by the heterogeneous scale and distribution of storage spaces. Traditional cross-plot analysis and empirical formula methods for identifying reservoir types from geophysical logging data have high uncertainty and low efficiency, and cannot accurately reflect the nonlinear relationship between reservoir types and logging data. Recently, kernel Fisher discriminant analysis (KFD), a kernel-based machine learning technique, has attracted attention in many fields because of its strong nonlinear processing ability. However, the overall performance of a KFD model may be limited because a single kernel function cannot both extrapolate and interpolate well, especially for highly complex data. To address this issue, a mixed kernel Fisher discriminant analysis (MKFD) model was established in this study and applied to identify reservoir types of the deep Sinian carbonates in the central Sichuan Basin, China. The MKFD model was trained and tested with 453 datasets from 7 coring wells, using GR, CAL, DEN, AC, CNL, and RT logs as input variables. Particle swarm optimization (PSO) was adopted for hyper-parameter optimization of the MKFD model. To evaluate model performance, the prediction results of MKFD were compared with those of basic-kernel KFD, RF, and SVM models. Subsequently, the trained MKFD model was applied in a blind well test, and a variable importance analysis was conducted. The comparison and blind test results demonstrated that MKFD outperformed traditional KFD, RF, and SVM in identifying reservoir types, providing higher accuracy and stronger generalization. MKFD can therefore be a reliable method for identifying reservoir types of deep carbonates.
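One common way to build a mixed kernel is a convex combination of a local kernel, which interpolates well, and a global kernel, which extrapolates better. The sketch below combines an RBF kernel with a polynomial kernel. The abstract does not say which kernels or weights MKFD uses, so the kernel choice and the parameters `weight`, `gamma`, and `degree` are assumptions; in the study's setting such hyper-parameters would be the ones tuned by PSO.

```python
import math

def rbf_kernel(x, y, gamma):
    """Local kernel: decays with the squared Euclidean distance."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def poly_kernel(x, y, degree, coef0=1.0):
    """Global kernel: polynomial in the inner product."""
    return (sum(a * b for a, b in zip(x, y)) + coef0) ** degree

def mixed_kernel(x, y, weight, gamma, degree):
    """Convex combination of the two kernels, with 0 <= weight <= 1."""
    return weight * rbf_kernel(x, y, gamma) + (1.0 - weight) * poly_kernel(x, y, degree)

# Two hypothetical, already-scaled feature vectors.
x = [1.0, 0.0]
y = [0.0, 1.0]
k_xy = mixed_kernel(x, y, weight=0.7, gamma=0.5, degree=2)
k_yx = mixed_kernel(y, x, weight=0.7, gamma=0.5, degree=2)
```

Because a non-negative weighted sum of valid kernels is itself a valid kernel, the combination can replace a single kernel in a kernel method without other changes.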
Funding: Supported by the National Natural Science Foundation of China (Nos. 42171407, 42077242), the Natural Science Foundation of Jilin Province (No. 20210101098JC), the Open Fund of Key Laboratory of Urban Land Resources Monitoring and Simulation, MNR (No. KF-2020-05-024), and the National Key R&D Program of China (No. 2021YFD1500100).
Abstract: Highly accurate information on the distribution of vegetation types is of great significance for forestry resource monitoring and management. To improve the classification accuracy of forest types, Sentinel-1 and Sentinel-2 data of the Changbai Mountain protection development zone were selected and combined with a DEM to construct a multi-feature random forest classification model for forest types, fusing intensity, texture, spectral, vegetation index, and topography information and using the random forest Gini index (GI) for feature optimization. The overall classification accuracy was 94.60% and the Kappa coefficient was 0.933. A comparison of the classification results before and after feature optimization shows that feature optimization has a considerable impact on classification accuracy. A comparison of the random forest, maximum likelihood, and CART decision tree classifiers under the same conditions shows that the random forest performs best and can be applied to forestry research such as forest resource survey and monitoring.
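The two accuracy figures reported above, overall accuracy and the Kappa coefficient, are both derived from a confusion matrix. The sketch below shows the standard calculation on a made-up two-class matrix; the matrix values are hypothetical and unrelated to the study's results.

```python
def overall_accuracy(cm):
    """Share of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total

def cohen_kappa(cm):
    """Cohen's kappa: agreement corrected for the agreement expected by chance."""
    total = sum(sum(row) for row in cm)
    observed = overall_accuracy(cm)
    # Chance agreement from the row (reference) and column (predicted) marginals.
    expected = sum(
        sum(cm[i]) * sum(row[i] for row in cm) for i in range(len(cm))
    ) / total ** 2
    return (observed - expected) / (1.0 - expected)

# Hypothetical matrix: rows are reference classes, columns are predicted classes.
cm = [
    [45, 5],
    [10, 40],
]
accuracy = overall_accuracy(cm)
kappa = cohen_kappa(cm)
```

Kappa is lower than overall accuracy here because half of the agreement would be expected by chance given the class marginals.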
Funding: Supported by the National Natural Science Foundation of China (42401488, 42071351), the National Key Research and Development Program of China (2020YFA0608501, 2017YFB0504204), the Liaoning Revitalization Talents Program (XLYC1802027), the Talent Recruited Program of the Chinese Academy of Science (Y938091), the Project Supported Discipline Innovation Team of the Liaoning Technical University (LNTU20TD-23), the Liaoning Province Doctoral Research Initiation Fund Program (2023-BS-202), and the Basic Research Projects of Liaoning Department of Education (JYTQN2023202).
Abstract: Accurate estimation of understory terrain is scientifically important for maintaining ecosystem balance and conserving biodiversity. Traditional forest topographic inversion methods treat the entire forest as the inversion unit and therefore represent spatial heterogeneity inadequately. To address this issue, this study proposes a differentiated modeling approach by forest type, based on refined land cover classification. Taking Puerto Rico and Maryland as study areas, a multi-dimensional feature system is constructed by integrating multi-source remote sensing data: ICESat-2 spaceborne LiDAR is used to obtain benchmark values for understory terrain, topographic factors such as slope and aspect are extracted from SRTM data, and vegetation cover characteristics are analyzed using Landsat-8 multispectral imagery. The study incorporates forest type as a classification modeling condition and applies the random forest algorithm to build differentiated topographic inversion models. Experimental results indicate that, compared with traditional whole-area modeling (RMSE = 5.06 m), forest type-based classification modeling significantly improves the accuracy of understory terrain estimation (RMSE = 2.94 m), validating the effectiveness of spatial heterogeneity modeling. Further sensitivity analysis reveals that canopy structure parameters (with RMSE variation reaching 4.11 m) exert a stronger regulatory effect on estimation accuracy than forest cover, providing important theoretical support for optimizing remote sensing models of forest topography.
Funding: Project supported partly by the National Basic Research Program (973) of China (No. 2002B312103), the National Natural Science Foundation of China (No. 3027466), and the Chinese Academy of Sciences.
Abstract: This study aimed to investigate the characteristics of tables and graphs that people perceive and the data types for which people consider each display most appropriate. Participants in this survey were 195 teachers and undergraduates from four universities in Beijing. The results showed that people hold different attitudes towards the two forms of display.