Background:In recent years,there has been a growing trend in the utilization of observational studies that make use of routinely collected healthcare data(RCD).These studies rely on algorithms to identify specific hea...Background:In recent years,there has been a growing trend in the utilization of observational studies that make use of routinely collected healthcare data(RCD).These studies rely on algorithms to identify specific health conditions(e.g.,diabetes or sepsis)for statistical analyses.However,there has been substantial variation in the algorithm development and validation,leading to frequently suboptimal performance and posing a significant threat to the validity of study findings.Unfortunately,these issues are often overlooked.Methods:We systematically developed guidance for the development,validation,and evaluation of algorithms designed to identify health status(DEVELOP-RCD).Our initial efforts involved conducting both a narrative review and a systematic review of published studies on the concepts and methodological issues related to algorithm development,validation,and evaluation.Subsequently,we conducted an empirical study on an algorithm for identifying sepsis.Based on these findings,we formulated specific workflow and recommendations for algorithm development,validation,and evaluation within the guidance.Finally,the guidance underwent independent review by a panel of 20 external experts who then convened a consensus meeting to finalize it.Results:A standardized workflow for algorithm development,validation,and evaluation was established.Guided by specific health status considerations,the workflow comprises four integrated steps:assessing an existing algorithm’s suitability for the target health status;developing a new algorithm using recommended methods;validating the algorithm using prescribed performance measures;and evaluating the impact of the algorithm on study results.Additionally,13 good practice recommendations were formulated with detailed explanations.Furthermore,a practical study on sepsis identification was included to demonstrate the application of this guidance.Conclusions:The establishment of guidance is intended to aid researchers and clinicians in the appropriate and accurate development and application of algorithms for identifying health status from RCD.This guidance has the potential to enhance the credibility of findings from observational studies involving RCD.展开更多
In recent years, the rapid decline of Arctic sea ice area (SIA) and sea ice extent (SIE), especially for the multiyear (MY) ice, has led to significant effect on climate change. The accurate retrieval of MY ice ...In recent years, the rapid decline of Arctic sea ice area (SIA) and sea ice extent (SIE), especially for the multiyear (MY) ice, has led to significant effect on climate change. The accurate retrieval of MY ice concentration retrieval is very important and challenging to understand the ongoing changes. Three MY ice concentration retrieval algorithms were systematically evaluated. A similar total ice concentration was yielded by these algorithms, while the retrieved MY sea ice concentrations differs from each other. The MY SIA derived from NASA TEAM algorithm is relatively stable. Other two algorithms created seasonal fluctuations of MY SIA, particularly in autumn and winter. In this paper, we proposed an ice concentration retrieval algorithm, which developed the NASA TEAM algorithm by adding to use AMSR-E 6.9 GHz brightness temperature data and sea ice concentration using 89.0 GHz data. Comparison with the reference MY SIA from reference MY ice, indicates that the mean difference and root mean square (rms) difference of MY SIA derived from the algorithm of this study are 0.65×10^6 km^2 and 0.69×10^6 km^2 during January to March, -0.06×10^6 km^2 and 0.14×10^6 km^2during September to December respectively. Comparison with MY SIE obtained from weekly ice age data provided by University of Colorado show that, the mean difference and rms difference are 0.69×10^6 km^2 and 0.84×10^6 km^2, respectively. The developed algorithm proposed in this study has smaller difference compared with the reference MY ice and MY SIE from ice age data than the Wang's, Lomax' and NASA TEAM algorithms.展开更多
Improved traditional ant colony algorithms,a data routing model used to the data remote exchange on WAN was presented.In the model,random heuristic factors were introduced to realize multi-path search.The updating mod...Improved traditional ant colony algorithms,a data routing model used to the data remote exchange on WAN was presented.In the model,random heuristic factors were introduced to realize multi-path search.The updating model of pheromone could adjust the pheromone concentration on the optimal path according to path load dynamically to make the system keep load balance.The simulation results show that the improved model has a higher performance on convergence and load balance.展开更多
Estimating the volume growth of forest ecosystems accurately is important for understanding carbon sequestration and achieving carbon neutrality goals.However,the key environmental factors affecting volume growth diff...Estimating the volume growth of forest ecosystems accurately is important for understanding carbon sequestration and achieving carbon neutrality goals.However,the key environmental factors affecting volume growth differ across various scales and plant functional types.This study was,therefore,conducted to estimate the volume growth of Larix and Quercus forests based on national-scale forestry inventory data in China and its influencing factors using random forest algorithms.The results showed that the model performances of volume growth in natural forests(R^(2)=0.65 for Larix and 0.66 for Quercus,respectively)were better than those in planted forests(R^(2)=0.44 for Larix and 0.40 for Quercus,respectively).In both natural and planted forests,the stand age showed a strong relative importance for volume growth(8.6%–66.2%),while the edaphic and climatic variables had a limited relative importance(<6.0%).The relationship between stand age and volume growth was unimodal in natural forests and linear increase in planted Quercus forests.And the specific locations(i.e.,altitude and aspect)of sampling plots exhibited high relative importance for volume growth in planted forests(4.1%–18.2%).Altitude positively affected volume growth in planted Larix forests but controlled volume growth negatively in planted Quercus forests.Similarly,the effects of other environmental factors on volume growth also differed in both stand origins(planted versus natural)and plant functional types(Larix versus Quercus).These results highlighted that the stand age was the most important predictor for volume growth and there were diverse effects of environmental factors on volume growth among stand origins and plant functional types.Our findings will provide a good framework for site-specific recommendations regarding the management practices necessary to maintain the volume growth in China's forest ecosystems.展开更多
With the rapid development of the genomic sequencing technology,the cost of obtaining personal genomic data and effectively analyzing it has been gradually reduced.The analysis and utilization of genomic dam gradually...With the rapid development of the genomic sequencing technology,the cost of obtaining personal genomic data and effectively analyzing it has been gradually reduced.The analysis and utilization of genomic dam gradually entered the public view,and the leakage of genomic dam privacy has attracted the attention of researchers.The security of genomic data is not only related to the protection of personal privacy,but also related to the biological information security of the country.However,there is still no.effective genomic dam privacy protection scheme using Shangyong Mima(SM)algorithms.In this paper,we analyze the widely used genomic dam file formats and design a large genomic dam files encryption scheme based on the SM algorithms.Firstly,we design a key agreement protocol based on the SM2 asymmetric cryptography and use the SM3 hash function to guarantee the correctness of the key.Secondly,we used the SM4 symmetric cryptography to encrypt the genomic data by optimizing the packet processing of files,and improve the usability by assisting the computing platform with key management.Software implementation demonstrates that the scheme can be applied to securely transmit the genomic data in the network environment and provide an encryption method based on SM algorithms for protecting the privacy of genomic data.展开更多
Direct soil temperature(ST)measurement is time-consuming and costly;thus,the use of simple and cost-effective machine learning(ML)tools is helpful.In this study,ML approaches,including KStar,instance-based K-nearest l...Direct soil temperature(ST)measurement is time-consuming and costly;thus,the use of simple and cost-effective machine learning(ML)tools is helpful.In this study,ML approaches,including KStar,instance-based K-nearest learning(IBK),and locally weighted learning(LWL),coupled with resampling algorithms of bagging(BA)and dagging(DA)(BA-IBK,BA-KStar,BA-LWL,DA-IBK,DA-KStar,and DA-LWL)were developed and tested for multi-step ahead(3,6,and 9 d ahead)ST forecasting.In addition,a linear regression(LR)model was used as a benchmark to evaluate the results.A dataset was established,with daily ST time-series at 5 and 50 cm soil depths in a farmland as models’output and meteorological data as models’input,including mean(T_(mean)),minimum(Tmin),and maximum(T_(max))air temperatures,evaporation(Eva),sunshine hours(SSH),and solar radiation(SR),which were collected at Isfahan Synoptic Station(Iran)for 13 years(1992–2005).Six different input combination scenarios were selected based on Pearson’s correlation coefficients between inputs and outputs and fed into the models.We used 70%of the data to train the models,with the remaining 30%used for model evaluation via multiple visual and quantitative metrics.Our?ndings showed that T_(mean)was the most effective input variable for ST forecasting in most of the developed models,while in some cases the combinations of variables,including T_(mean)and T_(max)and T_(mean),T_(max),Tmin,Eva,and SSH proved to be the best input combinations.Among the evaluated models,BA-KStar showed greater compatibility,while in most cases,BA-IBK and-LWL provided more accurate results,depending on soil depth.For the 5 cm soil depth,BA-KStar had superior performance(i.e.,Nash-Sutcliffe efficiency(NSE)=0.90,0.87,and 0.85 for 3,6,and 9 d ahead forecasting,respectively);for the 50 cm soil depth,DA-KStar outperformed the other models(i.e.,NSE=0.88,0.89,and 0.89 for 3,6,and 9 d ahead forecasting,respectively).The results con?rmed that all hybrid models had higher prediction capabilities than the LR model.展开更多
Many business applications rely on their historical data to predict their business future. The marketing products process is one of the core processes for the business. Customer needs give a useful piece of informatio...Many business applications rely on their historical data to predict their business future. The marketing products process is one of the core processes for the business. Customer needs give a useful piece of information that help</span><span style="font-family:Verdana;"><span style="font-family:Verdana;">s</span></span><span style="font-family:Verdana;"> to market the appropriate products at the appropriate time. Moreover, services are considered recently as products. The development of education and health services </span><span style="font-family:Verdana;"><span style="font-family:Verdana;">is</span></span><span style="font-family:Verdana;"> depending on historical data. For the more, reducing online social media networks problems and crimes need a significant source of information. Data analysts need to use an efficient classification algorithm to predict the future of such businesses. However, dealing with a huge quantity of data requires great time to process. Data mining involves many useful techniques that are used to predict statistical data in a variety of business applications. The classification technique is one of the most widely used with a variety of algorithms. In this paper, various classification algorithms are revised in terms of accuracy in different areas of data mining applications. A comprehensive analysis is made after delegated reading of 20 papers in the literature. This paper aims to help data analysts to choose the most suitable classification algorithm for different business applications including business in general, online social media networks, agriculture, health, and education. Results show FFBPN is the most accurate algorithm in the business domain. The Random Forest algorithm is the most accurate in classifying online social networks (OSN) activities. Na<span style="white-space:nowrap;">ï</span>ve Bayes algorithm is the most accurate to classify agriculture datasets. OneR is the most accurate algorithm to classify instances within the health domain. The C4.5 Decision Tree algorithm is the most accurate to classify students’ records to predict degree completion time.展开更多
Data clustering is an essential technique for analyzing complex datasets and continues to be a central research topic in data analysis.Traditional clustering algorithms,such as K-means,are widely used due to their sim...Data clustering is an essential technique for analyzing complex datasets and continues to be a central research topic in data analysis.Traditional clustering algorithms,such as K-means,are widely used due to their simplicity and efficiency.This paper proposes a novel Spiral Mechanism-Optimized Phasmatodea Population Evolution Algorithm(SPPE)to improve clustering performance.The SPPE algorithm introduces several enhancements to the standard Phasmatodea Population Evolution(PPE)algorithm.Firstly,a Variable Neighborhood Search(VNS)factor is incorporated to strengthen the local search capability and foster population diversity.Secondly,a position update model,incorporating a spiral mechanism,is designed to improve the algorithm’s global exploration and convergence speed.Finally,a dynamic balancing factor,guided by fitness values,adjusts the search process to balance exploration and exploitation effectively.The performance of SPPE is first validated on CEC2013 benchmark functions,where it demonstrates excellent convergence speed and superior optimization results compared to several state-of-the-art metaheuristic algorithms.To further verify its practical applicability,SPPE is combined with the K-means algorithm for data clustering and tested on seven datasets.Experimental results show that SPPE-K-means improves clustering accuracy,reduces dependency on initialization,and outperforms other clustering approaches.This study highlights SPPE’s robustness and efficiency in solving both optimization and clustering challenges,making it a promising tool for complex data analysis tasks.展开更多
Cloud computing has become an essential technology for the management and processing of large datasets,offering scalability,high availability,and fault tolerance.However,optimizing data replication across multiple dat...Cloud computing has become an essential technology for the management and processing of large datasets,offering scalability,high availability,and fault tolerance.However,optimizing data replication across multiple data centers poses a significant challenge,especially when balancing opposing goals such as latency,storage costs,energy consumption,and network efficiency.This study introduces a novel Dynamic Optimization Algorithm called Dynamic Multi-Objective Gannet Optimization(DMGO),designed to enhance data replication efficiency in cloud environments.Unlike traditional static replication systems,DMGO adapts dynamically to variations in network conditions,system demand,and resource availability.The approach utilizes multi-objective optimization approaches to efficiently balance data access latency,storage efficiency,and operational costs.DMGO consistently evaluates data center performance and adjusts replication algorithms in real time to guarantee optimal system efficiency.Experimental evaluations conducted in a simulated cloud environment demonstrate that DMGO significantly outperforms conventional static algorithms,achieving faster data access,lower storage overhead,reduced energy consumption,and improved scalability.The proposed methodology offers a robust and adaptable solution for modern cloud systems,ensuring efficient resource consumption while maintaining high performance.展开更多
To improve the traffic scheduling capability in operator data center networks,an analysis prediction and online scheduling mechanism(APOS)is designed,considering both the network structure and the network traffic in t...To improve the traffic scheduling capability in operator data center networks,an analysis prediction and online scheduling mechanism(APOS)is designed,considering both the network structure and the network traffic in the operator data center.Fibonacci tree optimization algorithm(FTO)is embedded into the analysis prediction and the online scheduling stages,the FTO traffic scheduling strategy is proposed.By taking the global optimal and the multi-modal optimization advantage of FTO,the traffic scheduling optimal solution and many suboptimal solutions can be obtained.The experiment results show that the FTO traffic scheduling strategy can schedule traffic in data center networks reasonably,and improve the load balancing in the operator data center network effectively.展开更多
The distillation process is an important chemical process,and the application of data-driven modelling approach has the potential to reduce model complexity compared to mechanistic modelling,thus improving the efficie...The distillation process is an important chemical process,and the application of data-driven modelling approach has the potential to reduce model complexity compared to mechanistic modelling,thus improving the efficiency of process optimization or monitoring studies.However,the distillation process is highly nonlinear and has multiple uncertainty perturbation intervals,which brings challenges to accurate data-driven modelling of distillation processes.This paper proposes a systematic data-driven modelling framework to solve these problems.Firstly,data segment variance was introduced into the K-means algorithm to form K-means data interval(KMDI)clustering in order to cluster the data into perturbed and steady state intervals for steady-state data extraction.Secondly,maximal information coefficient(MIC)was employed to calculate the nonlinear correlation between variables for removing redundant features.Finally,extreme gradient boosting(XGBoost)was integrated as the basic learner into adaptive boosting(AdaBoost)with the error threshold(ET)set to improve weights update strategy to construct the new integrated learning algorithm,XGBoost-AdaBoost-ET.The superiority of the proposed framework is verified by applying this data-driven modelling framework to a real industrial process of propylene distillation.展开更多
Wireless sensor network deployment optimization is a classic NP-hard problem and a popular topic in academic research.However,the current research on wireless sensor network deployment problems uses overly simplistic ...Wireless sensor network deployment optimization is a classic NP-hard problem and a popular topic in academic research.However,the current research on wireless sensor network deployment problems uses overly simplistic models,and there is a significant gap between the research results and actual wireless sensor networks.Some scholars have now modeled data fusion networks to make them more suitable for practical applications.This paper will explore the deployment problem of a stochastic data fusion wireless sensor network(SDFWSN),a model that reflects the randomness of environmental monitoring and uses data fusion techniques widely used in actual sensor networks for information collection.The deployment problem of SDFWSN is modeled as a multi-objective optimization problem.The network life cycle,spatiotemporal coverage,detection rate,and false alarm rate of SDFWSN are used as optimization objectives to optimize the deployment of network nodes.This paper proposes an enhanced multi-objective mongoose optimization algorithm(EMODMOA)to solve the deployment problem of SDFWSN.First,to overcome the shortcomings of the DMOA algorithm,such as its low convergence and tendency to get stuck in a local optimum,an encircling and hunting strategy is introduced into the original algorithm to propose the EDMOA algorithm.The EDMOA algorithm is designed as the EMODMOA algorithm by selecting reference points using the K-Nearest Neighbor(KNN)algorithm.To verify the effectiveness of the proposed algorithm,the EMODMOA algorithm was tested at CEC 2020 and achieved good results.In the SDFWSN deployment problem,the algorithm was compared with the Non-dominated Sorting Genetic Algorithm II(NSGAII),Multiple Objective Particle Swarm Optimization(MOPSO),Multi-Objective Evolutionary Algorithm based on Decomposition(MOEA/D),and Multi-Objective Grey Wolf Optimizer(MOGWO).By comparing and analyzing the performance evaluation metrics and optimization results of the objective functions of the multi-objective algorithms,the algorithm outperforms the other algorithms in the SDFWSN deployment results.To better demonstrate the superiority of the algorithm,simulations of diverse test cases were also performed,and good results were obtained.展开更多
This paper addresses the problem of selecting a route for every pair of communicating nodes in a virtual circuit data network in order to minimize the average delay encountered by messages. The problem was previously ...This paper addresses the problem of selecting a route for every pair of communicating nodes in a virtual circuit data network in order to minimize the average delay encountered by messages. The problem was previously modeled as a network of M/M/1 queues. Agenetic algorithm to solve this problem is presented. Extensive computational results across a variety of networks are reported. These results indicate that the presented solution procedure outperforms the other methods in the literature and is effective for a wide range of traffic loads.展开更多
In this paper, the high-level knowledge of financial data modeled by ordinary differential equations (ODEs) is discovered in dynamic data by using an asynchronous parallel evolutionary modeling algorithm (APHEMA). A n...In this paper, the high-level knowledge of financial data modeled by ordinary differential equations (ODEs) is discovered in dynamic data by using an asynchronous parallel evolutionary modeling algorithm (APHEMA). A numerical example of Nasdaq index analysis is used to demonstrate the potential of APHEMA. The results show that the dynamic models automatically discovered in dynamic data by computer can be used to predict the financial trends.展开更多
Cluster analysis is one of the major data analysis methods widely used for many practical applications in emerging areas of data mining. A good clustering method will produce high quality clusters with high intra-clus...Cluster analysis is one of the major data analysis methods widely used for many practical applications in emerging areas of data mining. A good clustering method will produce high quality clusters with high intra-cluster similarity and low inter-cluster similarity. Clustering techniques are applied in different domains to predict future trends of available data and its uses for the real world. This research work is carried out to find the performance of two of the most delegated, partition based clustering algorithms namely k-Means and k-Medoids. A state of art analysis of these two algorithms is implemented and performance is analyzed based on their clustering result quality by means of its execution time and other components. Telecommunication data is the source data for this analysis. The connection oriented broadband data is given as input to find the clustering quality of the algorithms. Distance between the server locations and their connection is considered for clustering. Execution time for each algorithm is analyzed and the results are compared with one another. Results found in comparison study are satisfactory for the chosen application.展开更多
Many search-based algorithms have been successfully applied in sev-eral software engineering activities.Genetic algorithms(GAs)are the most used in the scientific domains by scholars to solve software testing problems....Many search-based algorithms have been successfully applied in sev-eral software engineering activities.Genetic algorithms(GAs)are the most used in the scientific domains by scholars to solve software testing problems.They imi-tate the theory of natural selection and evolution.The harmony search algorithm(HSA)is one of the most recent search algorithms in the last years.It imitates the behavior of a musician tofind the best harmony.Scholars have estimated the simi-larities and the differences between genetic algorithms and the harmony search algorithm in diverse research domains.The test data generation process represents a critical task in software validation.Unfortunately,there is no work comparing the performance of genetic algorithms and the harmony search algorithm in the test data generation process.This paper studies the similarities and the differences between genetic algorithms and the harmony search algorithm based on the ability and speed offinding the required test data.The current research performs an empirical comparison of the HSA and the GAs,and then the significance of the results is estimated using the t-Test.The study investigates the efficiency of the harmony search algorithm and the genetic algorithms according to(1)the time performance,(2)the significance of the generated test data,and(3)the adequacy of the generated test data to satisfy a given testing criterion.The results showed that the harmony search algorithm is significantly faster than the genetic algo-rithms because the t-Test showed that the p-value of the time values is 0.026<α(αis the significance level=0.05 at 95%confidence level).In contrast,there is no significant difference between the two algorithms in generating the adequate test data because the t-Test showed that the p-value of thefitness values is 0.25>α.展开更多
Current methodologies for cleaning wind power anomaly data exhibit limited capabilities in identifying abnormal data within extensive datasets and struggle to accommodate the considerable variability and intricacy of ...Current methodologies for cleaning wind power anomaly data exhibit limited capabilities in identifying abnormal data within extensive datasets and struggle to accommodate the considerable variability and intricacy of wind farm data.Consequently,a method for cleaning wind power anomaly data by combining image processing with community detection algorithms(CWPAD-IPCDA)is proposed.To precisely identify and initially clean anomalous data,wind power curve(WPC)images are converted into graph structures,which employ the Louvain community recognition algorithm and graph-theoretic methods for community detection and segmentation.Furthermore,the mathematical morphology operation(MMO)determines the main part of the initially cleaned wind power curve images and maps them back to the normal wind power points to complete the final cleaning.The CWPAD-IPCDA method was applied to clean datasets from 25 wind turbines(WTs)in two wind farms in northwest China to validate its feasibility.A comparison was conducted using density-based spatial clustering of applications with noise(DBSCAN)algorithm,an improved isolation forest algorithm,and an image-based(IB)algorithm.The experimental results demonstrate that the CWPAD-IPCDA method surpasses the other three algorithms,achieving an approximately 7.23%higher average data cleaning rate.The mean value of the sum of the squared errors(SSE)of the dataset after cleaning is approximately 6.887 lower than that of the other algorithms.Moreover,the mean of overall accuracy,as measured by the F1-score,exceeds that of the other methods by approximately 10.49%;this indicates that the CWPAD-IPCDA method is more conducive to improving the accuracy and reliability of wind power curve modeling and wind farm power forecasting.展开更多
Age of knowledge explosion requires us not only to have the ability to get useful information which represented by data but also to find knowledge in information. Human Genome Project achieved large amount of such bio...Age of knowledge explosion requires us not only to have the ability to get useful information which represented by data but also to find knowledge in information. Human Genome Project achieved large amount of such biological data, and people found clustering is a promising approach to analyze those biological data for knowledge hidden. The researches on biological data go to in-depth gradually and so are the clustering algorithms. This article mainly introduces current broad-used clustering algorithms, including the main idea, improvements, key technology, advantage and disadvantage, and the applications in biological field as well as the problems they solve. What’s more, this article roughly introduces some database used in biological field.展开更多
To establish a parallel fusion approach of processing high dimensional information, the model and criterion of multisensor fuzzy stochastic data fusion were presented. In order to design genetic algorithm fusion, the ...To establish a parallel fusion approach of processing high dimensional information, the model and criterion of multisensor fuzzy stochastic data fusion were presented. In order to design genetic algorithm fusion, the fusion parameter coding, initial population and fitness function establishing, and fuzzy logic controller designing for genetic operations and probability choosing were completed. The discussion on the highly dimensional fusion was given. For a moving target with the division of 1 64 (velocity) and 1 75 (acceleration), the precision of fusion is 0 94 and 0 98 respectively. The fusion approach can improve the reliability and decision precision effectively.展开更多
To improve the performance of the traditional map matching algorithms in freeway traffic state monitoring systems using the low logging frequency GPS (global positioning system) probe data, a map matching algorithm ...To improve the performance of the traditional map matching algorithms in freeway traffic state monitoring systems using the low logging frequency GPS (global positioning system) probe data, a map matching algorithm based on the Oracle spatial data model is proposed. The algorithm uses the Oracle road network data model to analyze the spatial relationships between massive GPS positioning points and freeway networks, builds an N-shortest path algorithm to find reasonable candidate routes between GPS positioning points efficiently, and uses the fuzzy logic inference system to determine the final matched traveling route. According to the implementation with field data from Los Angeles, the computation speed of the algorithm is about 135 GPS positioning points per second and the accuracy is 98.9%. The results demonstrate the effectiveness and accuracy of the proposed algorithm for mapping massive GPS positioning data onto freeway networks with complex geometric characteristics.展开更多
基金supported by the National Natural Science Foundation of China(82225049,72104155)the Sichuan Provincial Central Government Guides Local Science and Technology Development Special Project(2022ZYD0127)the 1·3·5 Project for Disciplines of Excellence,West China Hospital,Sichuan University(ZYGD23004).
文摘Background:In recent years,there has been a growing trend in the utilization of observational studies that make use of routinely collected healthcare data(RCD).These studies rely on algorithms to identify specific health conditions(e.g.,diabetes or sepsis)for statistical analyses.However,there has been substantial variation in the algorithm development and validation,leading to frequently suboptimal performance and posing a significant threat to the validity of study findings.Unfortunately,these issues are often overlooked.Methods:We systematically developed guidance for the development,validation,and evaluation of algorithms designed to identify health status(DEVELOP-RCD).Our initial efforts involved conducting both a narrative review and a systematic review of published studies on the concepts and methodological issues related to algorithm development,validation,and evaluation.Subsequently,we conducted an empirical study on an algorithm for identifying sepsis.Based on these findings,we formulated specific workflow and recommendations for algorithm development,validation,and evaluation within the guidance.Finally,the guidance underwent independent review by a panel of 20 external experts who then convened a consensus meeting to finalize it.Results:A standardized workflow for algorithm development,validation,and evaluation was established.Guided by specific health status considerations,the workflow comprises four integrated steps:assessing an existing algorithm’s suitability for the target health status;developing a new algorithm using recommended methods;validating the algorithm using prescribed performance measures;and evaluating the impact of the algorithm on study results.Additionally,13 good practice recommendations were formulated with detailed explanations.Furthermore,a practical study on sepsis identification was included to demonstrate the application of this guidance.Conclusions:The establishment of guidance is intended to aid researchers and clinicians in the appropriate and accurate development and application of algorithms for identifying health status from RCD.This guidance has the potential to enhance the credibility of findings from observational studies involving RCD.
基金The National Natural Science Foundation of China under contract Nos 41330960 and 41276193 and 41206184
文摘In recent years, the rapid decline of Arctic sea ice area (SIA) and sea ice extent (SIE), especially for the multiyear (MY) ice, has led to significant effect on climate change. The accurate retrieval of MY ice concentration retrieval is very important and challenging to understand the ongoing changes. Three MY ice concentration retrieval algorithms were systematically evaluated. A similar total ice concentration was yielded by these algorithms, while the retrieved MY sea ice concentrations differs from each other. The MY SIA derived from NASA TEAM algorithm is relatively stable. Other two algorithms created seasonal fluctuations of MY SIA, particularly in autumn and winter. In this paper, we proposed an ice concentration retrieval algorithm, which developed the NASA TEAM algorithm by adding to use AMSR-E 6.9 GHz brightness temperature data and sea ice concentration using 89.0 GHz data. Comparison with the reference MY SIA from reference MY ice, indicates that the mean difference and root mean square (rms) difference of MY SIA derived from the algorithm of this study are 0.65×10^6 km^2 and 0.69×10^6 km^2 during January to March, -0.06×10^6 km^2 and 0.14×10^6 km^2during September to December respectively. Comparison with MY SIE obtained from weekly ice age data provided by University of Colorado show that, the mean difference and rms difference are 0.69×10^6 km^2 and 0.84×10^6 km^2, respectively. The developed algorithm proposed in this study has smaller difference compared with the reference MY ice and MY SIE from ice age data than the Wang's, Lomax' and NASA TEAM algorithms.
基金Sponsored by the National High Technology Research and Development Program of China(2006AA701306)the National Innovation Foundation of Enterprises(05C26212200378)
文摘Improved traditional ant colony algorithms,a data routing model used to the data remote exchange on WAN was presented.In the model,random heuristic factors were introduced to realize multi-path search.The updating model of pheromone could adjust the pheromone concentration on the optimal path according to path load dynamically to make the system keep load balance.The simulation results show that the improved model has a higher performance on convergence and load balance.
基金supported by the Major Program of the National Natural Science Foundation of China(No.32192434)the Fundamental Research Funds of Chinese Academy of Forestry(No.CAFYBB2019ZD001)the National Key Research and Development Program of China(2016YFD060020602).
文摘Estimating the volume growth of forest ecosystems accurately is important for understanding carbon sequestration and achieving carbon neutrality goals.However,the key environmental factors affecting volume growth differ across various scales and plant functional types.This study was,therefore,conducted to estimate the volume growth of Larix and Quercus forests based on national-scale forestry inventory data in China and its influencing factors using random forest algorithms.The results showed that the model performances of volume growth in natural forests(R^(2)=0.65 for Larix and 0.66 for Quercus,respectively)were better than those in planted forests(R^(2)=0.44 for Larix and 0.40 for Quercus,respectively).In both natural and planted forests,the stand age showed a strong relative importance for volume growth(8.6%–66.2%),while the edaphic and climatic variables had a limited relative importance(<6.0%).The relationship between stand age and volume growth was unimodal in natural forests and linear increase in planted Quercus forests.And the specific locations(i.e.,altitude and aspect)of sampling plots exhibited high relative importance for volume growth in planted forests(4.1%–18.2%).Altitude positively affected volume growth in planted Larix forests but controlled volume growth negatively in planted Quercus forests.Similarly,the effects of other environmental factors on volume growth also differed in both stand origins(planted versus natural)and plant functional types(Larix versus Quercus).These results highlighted that the stand age was the most important predictor for volume growth and there were diverse effects of environmental factors on volume growth among stand origins and plant functional types.Our findings will provide a good framework for site-specific recommendations regarding the management practices necessary to maintain the volume growth in China's forest ecosystems.
基金supported by the National Key Research and Development Program of China(No.2016YFC1000307)the National Natural Science Foundation of China(No.61571024,No.61971021).
文摘With the rapid development of the genomic sequencing technology,the cost of obtaining personal genomic data and effectively analyzing it has been gradually reduced.The analysis and utilization of genomic dam gradually entered the public view,and the leakage of genomic dam privacy has attracted the attention of researchers.The security of genomic data is not only related to the protection of personal privacy,but also related to the biological information security of the country.However,there is still no.effective genomic dam privacy protection scheme using Shangyong Mima(SM)algorithms.In this paper,we analyze the widely used genomic dam file formats and design a large genomic dam files encryption scheme based on the SM algorithms.Firstly,we design a key agreement protocol based on the SM2 asymmetric cryptography and use the SM3 hash function to guarantee the correctness of the key.Secondly,we used the SM4 symmetric cryptography to encrypt the genomic data by optimizing the packet processing of files,and improve the usability by assisting the computing platform with key management.Software implementation demonstrates that the scheme can be applied to securely transmit the genomic data in the network environment and provide an encryption method based on SM algorithms for protecting the privacy of genomic data.
文摘Direct soil temperature(ST)measurement is time-consuming and costly;thus,the use of simple and cost-effective machine learning(ML)tools is helpful.In this study,ML approaches,including KStar,instance-based K-nearest learning(IBK),and locally weighted learning(LWL),coupled with resampling algorithms of bagging(BA)and dagging(DA)(BA-IBK,BA-KStar,BA-LWL,DA-IBK,DA-KStar,and DA-LWL)were developed and tested for multi-step ahead(3,6,and 9 d ahead)ST forecasting.In addition,a linear regression(LR)model was used as a benchmark to evaluate the results.A dataset was established,with daily ST time-series at 5 and 50 cm soil depths in a farmland as models’output and meteorological data as models’input,including mean(T_(mean)),minimum(Tmin),and maximum(T_(max))air temperatures,evaporation(Eva),sunshine hours(SSH),and solar radiation(SR),which were collected at Isfahan Synoptic Station(Iran)for 13 years(1992–2005).Six different input combination scenarios were selected based on Pearson’s correlation coefficients between inputs and outputs and fed into the models.We used 70%of the data to train the models,with the remaining 30%used for model evaluation via multiple visual and quantitative metrics.Our?ndings showed that T_(mean)was the most effective input variable for ST forecasting in most of the developed models,while in some cases the combinations of variables,including T_(mean)and T_(max)and T_(mean),T_(max),Tmin,Eva,and SSH proved to be the best input combinations.Among the evaluated models,BA-KStar showed greater compatibility,while in most cases,BA-IBK and-LWL provided more accurate results,depending on soil depth.For the 5 cm soil depth,BA-KStar had superior performance(i.e.,Nash-Sutcliffe efficiency(NSE)=0.90,0.87,and 0.85 for 3,6,and 9 d ahead forecasting,respectively);for the 50 cm soil depth,DA-KStar outperformed the other models(i.e.,NSE=0.88,0.89,and 0.89 for 3,6,and 9 d ahead forecasting,respectively).The results con?rmed that all hybrid models had higher prediction capabilities than the LR model.
文摘Many business applications rely on their historical data to predict their business future. The marketing products process is one of the core processes for the business. Customer needs give a useful piece of information that help</span><span style="font-family:Verdana;"><span style="font-family:Verdana;">s</span></span><span style="font-family:Verdana;"> to market the appropriate products at the appropriate time. Moreover, services are considered recently as products. The development of education and health services </span><span style="font-family:Verdana;"><span style="font-family:Verdana;">is</span></span><span style="font-family:Verdana;"> depending on historical data. For the more, reducing online social media networks problems and crimes need a significant source of information. Data analysts need to use an efficient classification algorithm to predict the future of such businesses. However, dealing with a huge quantity of data requires great time to process. Data mining involves many useful techniques that are used to predict statistical data in a variety of business applications. The classification technique is one of the most widely used with a variety of algorithms. In this paper, various classification algorithms are revised in terms of accuracy in different areas of data mining applications. A comprehensive analysis is made after delegated reading of 20 papers in the literature. This paper aims to help data analysts to choose the most suitable classification algorithm for different business applications including business in general, online social media networks, agriculture, health, and education. Results show FFBPN is the most accurate algorithm in the business domain. The Random Forest algorithm is the most accurate in classifying online social networks (OSN) activities. Na<span style="white-space:nowrap;">ï</span>ve Bayes algorithm is the most accurate to classify agriculture datasets. OneR is the most accurate algorithm to classify instances within the health domain. The C4.5 Decision Tree algorithm is the most accurate to classify students’ records to predict degree completion time.
文摘Data clustering is an essential technique for analyzing complex datasets and continues to be a central research topic in data analysis.Traditional clustering algorithms,such as K-means,are widely used due to their simplicity and efficiency.This paper proposes a novel Spiral Mechanism-Optimized Phasmatodea Population Evolution Algorithm(SPPE)to improve clustering performance.The SPPE algorithm introduces several enhancements to the standard Phasmatodea Population Evolution(PPE)algorithm.Firstly,a Variable Neighborhood Search(VNS)factor is incorporated to strengthen the local search capability and foster population diversity.Secondly,a position update model,incorporating a spiral mechanism,is designed to improve the algorithm’s global exploration and convergence speed.Finally,a dynamic balancing factor,guided by fitness values,adjusts the search process to balance exploration and exploitation effectively.The performance of SPPE is first validated on CEC2013 benchmark functions,where it demonstrates excellent convergence speed and superior optimization results compared to several state-of-the-art metaheuristic algorithms.To further verify its practical applicability,SPPE is combined with the K-means algorithm for data clustering and tested on seven datasets.Experimental results show that SPPE-K-means improves clustering accuracy,reduces dependency on initialization,and outperforms other clustering approaches.This study highlights SPPE’s robustness and efficiency in solving both optimization and clustering challenges,making it a promising tool for complex data analysis tasks.
文摘Cloud computing has become an essential technology for the management and processing of large datasets,offering scalability,high availability,and fault tolerance.However,optimizing data replication across multiple data centers poses a significant challenge,especially when balancing opposing goals such as latency,storage costs,energy consumption,and network efficiency.This study introduces a novel Dynamic Optimization Algorithm called Dynamic Multi-Objective Gannet Optimization(DMGO),designed to enhance data replication efficiency in cloud environments.Unlike traditional static replication systems,DMGO adapts dynamically to variations in network conditions,system demand,and resource availability.The approach utilizes multi-objective optimization approaches to efficiently balance data access latency,storage efficiency,and operational costs.DMGO consistently evaluates data center performance and adjusts replication algorithms in real time to guarantee optimal system efficiency.Experimental evaluations conducted in a simulated cloud environment demonstrate that DMGO significantly outperforms conventional static algorithms,achieving faster data access,lower storage overhead,reduced energy consumption,and improved scalability.The proposed methodology offers a robust and adaptable solution for modern cloud systems,ensuring efficient resource consumption while maintaining high performance.
基金supported by National Natural Science Foundation of China(No.62163036).
文摘To improve the traffic scheduling capability in operator data center networks,an analysis prediction and online scheduling mechanism(APOS)is designed,considering both the network structure and the network traffic in the operator data center.Fibonacci tree optimization algorithm(FTO)is embedded into the analysis prediction and the online scheduling stages,the FTO traffic scheduling strategy is proposed.By taking the global optimal and the multi-modal optimization advantage of FTO,the traffic scheduling optimal solution and many suboptimal solutions can be obtained.The experiment results show that the FTO traffic scheduling strategy can schedule traffic in data center networks reasonably,and improve the load balancing in the operator data center network effectively.
基金supported by the National Key Research and Development Program of China(2023YFB3307801)the National Natural Science Foundation of China(62394343,62373155,62073142)+3 种基金Major Science and Technology Project of Xinjiang(No.2022A01006-4)the Programme of Introducing Talents of Discipline to Universities(the 111 Project)under Grant B17017the Fundamental Research Funds for the Central Universities,Science Foundation of China University of Petroleum,Beijing(No.2462024YJRC011)the Open Research Project of the State Key Laboratory of Industrial Control Technology,China(Grant No.ICT2024B70).
文摘The distillation process is an important chemical process,and the application of data-driven modelling approach has the potential to reduce model complexity compared to mechanistic modelling,thus improving the efficiency of process optimization or monitoring studies.However,the distillation process is highly nonlinear and has multiple uncertainty perturbation intervals,which brings challenges to accurate data-driven modelling of distillation processes.This paper proposes a systematic data-driven modelling framework to solve these problems.Firstly,data segment variance was introduced into the K-means algorithm to form K-means data interval(KMDI)clustering in order to cluster the data into perturbed and steady state intervals for steady-state data extraction.Secondly,maximal information coefficient(MIC)was employed to calculate the nonlinear correlation between variables for removing redundant features.Finally,extreme gradient boosting(XGBoost)was integrated as the basic learner into adaptive boosting(AdaBoost)with the error threshold(ET)set to improve weights update strategy to construct the new integrated learning algorithm,XGBoost-AdaBoost-ET.The superiority of the proposed framework is verified by applying this data-driven modelling framework to a real industrial process of propylene distillation.
基金supported by the National Natural Science Foundation of China under Grant Nos.U21A20464,62066005Innovation Project of Guangxi Graduate Education under Grant No.YCSW2024313.
文摘Wireless sensor network deployment optimization is a classic NP-hard problem and a popular topic in academic research.However,the current research on wireless sensor network deployment problems uses overly simplistic models,and there is a significant gap between the research results and actual wireless sensor networks.Some scholars have now modeled data fusion networks to make them more suitable for practical applications.This paper will explore the deployment problem of a stochastic data fusion wireless sensor network(SDFWSN),a model that reflects the randomness of environmental monitoring and uses data fusion techniques widely used in actual sensor networks for information collection.The deployment problem of SDFWSN is modeled as a multi-objective optimization problem.The network life cycle,spatiotemporal coverage,detection rate,and false alarm rate of SDFWSN are used as optimization objectives to optimize the deployment of network nodes.This paper proposes an enhanced multi-objective mongoose optimization algorithm(EMODMOA)to solve the deployment problem of SDFWSN.First,to overcome the shortcomings of the DMOA algorithm,such as its low convergence and tendency to get stuck in a local optimum,an encircling and hunting strategy is introduced into the original algorithm to propose the EDMOA algorithm.The EDMOA algorithm is designed as the EMODMOA algorithm by selecting reference points using the K-Nearest Neighbor(KNN)algorithm.To verify the effectiveness of the proposed algorithm,the EMODMOA algorithm was tested at CEC 2020 and achieved good results.In the SDFWSN deployment problem,the algorithm was compared with the Non-dominated Sorting Genetic Algorithm II(NSGAII),Multiple Objective Particle Swarm Optimization(MOPSO),Multi-Objective Evolutionary Algorithm based on Decomposition(MOEA/D),and Multi-Objective Grey Wolf Optimizer(MOGWO).By comparing and analyzing the performance evaluation metrics and optimization results of the objective functions of the multi-objective algorithms,the algorithm outperforms the other algorithms in the SDFWSN deployment results.To better demonstrate the superiority of the algorithm,simulations of diverse test cases were also performed,and good results were obtained.
文摘This paper addresses the problem of selecting a route for every pair of communicating nodes in a virtual circuit data network in order to minimize the average delay encountered by messages. The problem was previously modeled as a network of M/M/1 queues. Agenetic algorithm to solve this problem is presented. Extensive computational results across a variety of networks are reported. These results indicate that the presented solution procedure outperforms the other methods in the literature and is effective for a wide range of traffic loads.
文摘In this paper, the high-level knowledge of financial data modeled by ordinary differential equations (ODEs) is discovered in dynamic data by using an asynchronous parallel evolutionary modeling algorithm (APHEMA). A numerical example of Nasdaq index analysis is used to demonstrate the potential of APHEMA. The results show that the dynamic models automatically discovered in dynamic data by computer can be used to predict the financial trends.
文摘Cluster analysis is one of the major data analysis methods widely used for many practical applications in emerging areas of data mining. A good clustering method will produce high quality clusters with high intra-cluster similarity and low inter-cluster similarity. Clustering techniques are applied in different domains to predict future trends of available data and its uses for the real world. This research work is carried out to find the performance of two of the most delegated, partition based clustering algorithms namely k-Means and k-Medoids. A state of art analysis of these two algorithms is implemented and performance is analyzed based on their clustering result quality by means of its execution time and other components. Telecommunication data is the source data for this analysis. The connection oriented broadband data is given as input to find the clustering quality of the algorithms. Distance between the server locations and their connection is considered for clustering. Execution time for each algorithm is analyzed and the results are compared with one another. Results found in comparison study are satisfactory for the chosen application.
文摘Many search-based algorithms have been successfully applied in sev-eral software engineering activities.Genetic algorithms(GAs)are the most used in the scientific domains by scholars to solve software testing problems.They imi-tate the theory of natural selection and evolution.The harmony search algorithm(HSA)is one of the most recent search algorithms in the last years.It imitates the behavior of a musician tofind the best harmony.Scholars have estimated the simi-larities and the differences between genetic algorithms and the harmony search algorithm in diverse research domains.The test data generation process represents a critical task in software validation.Unfortunately,there is no work comparing the performance of genetic algorithms and the harmony search algorithm in the test data generation process.This paper studies the similarities and the differences between genetic algorithms and the harmony search algorithm based on the ability and speed offinding the required test data.The current research performs an empirical comparison of the HSA and the GAs,and then the significance of the results is estimated using the t-Test.The study investigates the efficiency of the harmony search algorithm and the genetic algorithms according to(1)the time performance,(2)the significance of the generated test data,and(3)the adequacy of the generated test data to satisfy a given testing criterion.The results showed that the harmony search algorithm is significantly faster than the genetic algo-rithms because the t-Test showed that the p-value of the time values is 0.026<α(αis the significance level=0.05 at 95%confidence level).In contrast,there is no significant difference between the two algorithms in generating the adequate test data because the t-Test showed that the p-value of thefitness values is 0.25>α.
基金supported by the National Natural Science Foundation of China(Project No.51767018)Natural Science Foundation of Gansu Province(Project No.23JRRA836).
文摘Current methodologies for cleaning wind power anomaly data exhibit limited capabilities in identifying abnormal data within extensive datasets and struggle to accommodate the considerable variability and intricacy of wind farm data.Consequently,a method for cleaning wind power anomaly data by combining image processing with community detection algorithms(CWPAD-IPCDA)is proposed.To precisely identify and initially clean anomalous data,wind power curve(WPC)images are converted into graph structures,which employ the Louvain community recognition algorithm and graph-theoretic methods for community detection and segmentation.Furthermore,the mathematical morphology operation(MMO)determines the main part of the initially cleaned wind power curve images and maps them back to the normal wind power points to complete the final cleaning.The CWPAD-IPCDA method was applied to clean datasets from 25 wind turbines(WTs)in two wind farms in northwest China to validate its feasibility.A comparison was conducted using density-based spatial clustering of applications with noise(DBSCAN)algorithm,an improved isolation forest algorithm,and an image-based(IB)algorithm.The experimental results demonstrate that the CWPAD-IPCDA method surpasses the other three algorithms,achieving an approximately 7.23%higher average data cleaning rate.The mean value of the sum of the squared errors(SSE)of the dataset after cleaning is approximately 6.887 lower than that of the other algorithms.Moreover,the mean of overall accuracy,as measured by the F1-score,exceeds that of the other methods by approximately 10.49%;this indicates that the CWPAD-IPCDA method is more conducive to improving the accuracy and reliability of wind power curve modeling and wind farm power forecasting.
文摘Age of knowledge explosion requires us not only to have the ability to get useful information which represented by data but also to find knowledge in information. Human Genome Project achieved large amount of such biological data, and people found clustering is a promising approach to analyze those biological data for knowledge hidden. The researches on biological data go to in-depth gradually and so are the clustering algorithms. This article mainly introduces current broad-used clustering algorithms, including the main idea, improvements, key technology, advantage and disadvantage, and the applications in biological field as well as the problems they solve. What’s more, this article roughly introduces some database used in biological field.
文摘To establish a parallel fusion approach of processing high dimensional information, the model and criterion of multisensor fuzzy stochastic data fusion were presented. In order to design genetic algorithm fusion, the fusion parameter coding, initial population and fitness function establishing, and fuzzy logic controller designing for genetic operations and probability choosing were completed. The discussion on the highly dimensional fusion was given. For a moving target with the division of 1 64 (velocity) and 1 75 (acceleration), the precision of fusion is 0 94 and 0 98 respectively. The fusion approach can improve the reliability and decision precision effectively.
文摘To improve the performance of the traditional map matching algorithms in freeway traffic state monitoring systems using the low logging frequency GPS (global positioning system) probe data, a map matching algorithm based on the Oracle spatial data model is proposed. The algorithm uses the Oracle road network data model to analyze the spatial relationships between massive GPS positioning points and freeway networks, builds an N-shortest path algorithm to find reasonable candidate routes between GPS positioning points efficiently, and uses the fuzzy logic inference system to determine the final matched traveling route. According to the implementation with field data from Los Angeles, the computation speed of the algorithm is about 135 GPS positioning points per second and the accuracy is 98.9%. The results demonstrate the effectiveness and accuracy of the proposed algorithm for mapping massive GPS positioning data onto freeway networks with complex geometric characteristics.