Well logging technology has accumulated a large amount of historical data through four generations of technological development, which forms the basis of well logging big data and digital assets. However, the value of these data has not been well stored, managed and mined. The development of cloud computing technology provides a rare opportunity for a logging big data private cloud. The traditional petrophysical evaluation and interpretation model has encountered great challenges when facing new evaluation objects. Research on integrating distributed storage, processing and learning functions for logging big data in a private cloud has not yet been carried out. This study establishes a distributed logging big-data private cloud platform centered on a unified learning model, which achieves distributed storage and processing of logging big data and facilitates the learning of novel knowledge patterns via a unified logging learning model integrating physical simulation and data models in a large-scale functional space, thus resolving the geo-engineering evaluation problem of geothermal fields. Following the research idea of “logging big data cloud platform - unified logging learning model - large function space - knowledge learning & discovery - application”, the theoretical foundation of the unified learning model, the cloud platform architecture, data storage and learning algorithms, computing power allocation and platform monitoring, platform stability, and data security are analyzed. The designed logging big data cloud platform realizes parallel distributed storage and processing of data and learning algorithms. The feasibility of constructing a well logging big data cloud platform based on a unified learning model of physics and data is analyzed in terms of the structure, ecology, management and security of the cloud platform. The case study shows that the logging big data cloud platform has obvious technical advantages over traditional logging evaluation methods in terms of knowledge discovery methods, data, software and results sharing, accuracy, speed and complexity.
Objectives: This study aimed to develop and validate a stroke risk prediction model based on machine learning (ML) and regional healthcare big data, and to determine whether it improves prediction performance compared with the conventional Logistic Regression (LR) model. Methods: This retrospective cohort study analyzed data from the CHinese Electronic health Records Research in Yinzhou (CHERRY) platform (2015–2021). We included adults aged 18–75 who had established records on the platform before 2015. Individuals with pre-existing stroke, missing key data, or excessive missingness (>30%) were excluded. Data on demographics, clinical measures, lifestyle factors, comorbidities, and family history of stroke were collected. Variable selection was performed in two stages: initial screening via univariate analysis, followed by prioritization of variables based on clinical relevance and actionability, with a focus on those that are modifiable. Stroke prediction models were developed using LR and four ML algorithms: Decision Tree (DT), Random Forest (RF), eXtreme Gradient Boosting (XGBoost), and Back Propagation Neural Network (BPNN). The dataset was split 7:3 into training and validation sets. Performance was assessed using receiver operating characteristic (ROC) curves, calibration, and confusion matrices, and the cutoff value for classifying risk groups was determined by Youden's index. Results: The study cohort comprised 92,172 participants with 436 incident stroke cases (incidence rate: 474/100,000 person-years). Ultimately, 13 predictor variables were included. RF achieved the highest accuracy (0.935), precision (0.923), sensitivity (recall: 0.947), and F1 score (0.935). Model evaluation demonstrated superior predictive performance of the ML algorithms over conventional LR, with training/validation area under the curve (AUC) values of 0.777/0.779 (LR), 0.921/0.918 (BPNN), 0.988/0.980 (RF), 0.980/0.955 (DT), and 0.962/0.958 (XGBoost). Calibration analysis revealed a better fit for the DT, LR and BPNN models than for the RF and XGBoost models. Based on the optimal performance of the RF model, the factors ranked in descending order of importance were: hypertension, age, diabetes, systolic blood pressure, waist circumference, high-density lipoprotein cholesterol, fasting blood glucose, physical activity, BMI, low-density lipoprotein cholesterol, total cholesterol, dietary habits, and family history of stroke. Using Youden's index as the optimal cutoff, the RF model stratified individuals into high-risk (>0.789) and low-risk (≤0.789) groups with robust discrimination. Conclusions: The ML-based prediction models demonstrated superior performance metrics compared to conventional LR, and RF was the optimal prediction model, providing an effective tool for risk stratification in primary stroke prevention in community settings.
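The Youden-index cutoff used above for risk stratification is simple to compute from predicted probabilities. The sketch below is not the authors' code; the function name and the toy scores are hypothetical. It scans candidate thresholds and returns the one maximizing J = sensitivity + specificity − 1:

```python
def youden_cutoff(scores, labels):
    """Return (threshold, J) maximizing Youden's J = sensitivity + specificity - 1.

    scores: predicted risk probabilities; labels: 1 = event (e.g. stroke), 0 = no event.
    """
    thresholds = sorted(set(scores))
    positives = sum(labels)
    negatives = len(labels) - positives
    best_j, best_t = -1.0, thresholds[0]
    for t in thresholds:
        # Classify as high-risk when score >= t, then measure both error types.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        sens = tp / positives
        spec = tn / negatives
        j = sens + spec - 1
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j
```

On a perfectly separable toy sample the returned threshold sits at the lowest positive score with J = 1; on real data such as the cohort above, the maximizing threshold plays the role of the 0.789 cutoff.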
The distillation process is an important chemical process, and data-driven modelling has the potential to reduce model complexity compared to mechanistic modelling, thus improving the efficiency of process optimization and monitoring studies. However, the distillation process is highly nonlinear and has multiple uncertainty perturbation intervals, which makes accurate data-driven modelling of distillation processes challenging. This paper proposes a systematic data-driven modelling framework to solve these problems. Firstly, data segment variance is introduced into the K-means algorithm to form K-means data interval (KMDI) clustering, which clusters the data into perturbed and steady-state intervals for steady-state data extraction. Secondly, the maximal information coefficient (MIC) is employed to calculate the nonlinear correlation between variables and remove redundant features. Finally, extreme gradient boosting (XGBoost) is integrated as the base learner into adaptive boosting (AdaBoost), with an error threshold (ET) set to improve the weight-update strategy, yielding a new ensemble learning algorithm, XGBoost-AdaBoost-ET. The superiority of the proposed framework is verified by applying it to a real industrial propylene distillation process.
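The segment-variance idea behind KMDI clustering can be illustrated minimally: split the series into fixed-length segments, compute each segment's variance, and separate low-variance (steady-state) segments from high-variance (perturbed) ones. This sketch replaces the paper's K-means step with a plain threshold; the function names, segment length, and threshold are assumptions for illustration only:

```python
def segment_variances(series, seg_len):
    """Split a time series into fixed-length segments and return each segment's variance."""
    out = []
    for i in range(0, len(series) - seg_len + 1, seg_len):
        seg = series[i:i + seg_len]
        mean = sum(seg) / seg_len
        out.append(sum((x - mean) ** 2 for x in seg) / seg_len)
    return out

def steady_segments(series, seg_len, var_threshold):
    """Label each segment steady (True) when its variance falls below the threshold."""
    return [v < var_threshold for v in segment_variances(series, seg_len)]
```

In the paper the variance values feed a K-means step that finds the steady/perturbed boundary automatically instead of a hand-set threshold.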
The internet has stimulated explosive progress in knowledge discovery from big-volume data resources, mining valuable hidden rules by computation. At the same time, wireless channel measurement data exhibits big-volume features, considering the massive antennas, huge bandwidths and versatile application scenarios involved. This article first presents a comprehensive survey of channel measurement and modeling research for mobile communication, especially for the 5th Generation (5G) and beyond. In light of this big data research progress, a cluster-nuclei based model is then proposed, which takes advantage of both stochastic and deterministic models. The novel model has low complexity owing to the limited number of cluster nuclei, while each cluster nucleus has a physical mapping to a real propagation object. Combining the principles of channel property variation with antenna size, frequency, mobility and scenario mined from the channel data, the proposed model can be extended to versatile applications to support future mobile research.
Facing the development of future 5G, emerging technologies such as the Internet of Things, big data, cloud computing, and artificial intelligence are driving explosive growth in data traffic. With radical changes in communication theory and implementation technologies, wireless communications and wireless networks have entered a new era. Among these developments, wireless big data (WBD) holds tremendous value, and artificial intelligence (AI) opens up unthinkable possibilities. However, in the big data development and artificial intelligence application communities, the lack of a sound theoretical foundation and mathematical methods is regarded as a real challenge that needs to be solved. Starting from the basic problem of wireless communication, the interrelationship of demand, environment and ability, this paper investigates the concept and data model of WBD, wireless data mining, wireless knowledge and wireless knowledge learning (WKL), and typical practical examples, to facilitate and open up more opportunities for WBD research and development. Such research is beneficial for creating new theoretical foundations and emerging technologies for future wireless communications.
A flood is a significantly damaging natural calamity that causes loss of life and property. Earlier work on the construction of flood prediction models aimed to reduce risks, suggest policies, reduce mortality, and limit the property damage caused by floods. The massive amount of data generated by social media platforms such as Twitter opens the door to flood analysis. Because of the real-time nature of Twitter data, some government agencies and authorities have used it to track natural catastrophe events in order to build more rapid rescue strategies. However, due to the short length of tweets, it is difficult to construct a perfect prediction model for determining floods. Machine learning (ML) and deep learning (DL) approaches can be used to statistically develop flood prediction models. At the same time, the vast number of tweets necessitates the use of a big data analytics (BDA) tool for flood prediction. In this regard, this work provides an optimal deep learning-based flood forecasting model with big data analytics (ODLFF-BDA) based on Twitter data. The suggested ODLFF-BDA technique anticipates the existence of floods using tweets in a big data setting. The ODLFF-BDA technique comprises data pre-processing to convert the input tweets into a usable format. In addition, a Bidirectional Encoder Representations from Transformers (BERT) model is used to generate emotive contextual embeddings from tweets. Furthermore, a gated recurrent unit (GRU) with a Multilayer Convolutional Neural Network (MLCNN) is used to extract local data and predict floods. Finally, an Equilibrium Optimizer (EO) is used to fine-tune the hyperparameters of the GRU and MLCNN models in order to increase prediction performance. Memory usage is kept below 3.5 MB, lower than that of the other compared techniques. The ODLFF-BDA technique's performance was validated using a benchmark Kaggle dataset, and the findings showed that it significantly outperformed other recent approaches.
Although the Internet of Things has been widely applied, problems remain in the application of cloud computing to digital smart medical big data collection, processing, analysis, and storage, especially the low efficiency of medical diagnosis. With the wide application of the Internet of Things and big data in the medical field, medical big data is growing geometrically, resulting in cloud service overload, insufficient storage, communication delay, and network congestion. To solve these medical and network problems, a medical big-data-oriented fog computing architecture and a BP algorithm application are proposed, and their structural advantages and characteristics are studied. This architecture enables the medical big data generated by medical edge devices and the existing data in the cloud service center to be calculated, compared and analyzed at the fog node through the Internet of Things. The diagnosis results are designed to reduce business processing delay and improve the diagnostic effect. Considering the weak computing power of each edge device, the artificial intelligence BP neural network algorithm is used in the core computing model of the medical diagnosis system to improve the system's computing power, enhance intelligence-aided medical decision-making, and improve clinical diagnosis and treatment efficiency. In the application process, combined with the characteristics of medical big data technology, through fog architecture design and big data technology integration, we study the processing and analysis of heterogeneous data in the medical diagnosis system in the context of the Internet of Things. The results are promising: the medical platform network is smooth, the data storage space is sufficient, data processing and analysis are fast, and the diagnostic effect is remarkable, making the system a good assistant to doctors. It not only effectively addresses the low efficiency and quality of clinical diagnosis and treatment, but also reduces patients' waiting time, effectively eases the contradiction between doctors and patients, and improves medical service quality and management.
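The back-propagation (BP) algorithm named above as the core computing model can be sketched as a one-hidden-layer sigmoid network trained by gradient descent. This is a generic textbook BP implementation, not the paper's diagnosis model; the class name, toy task, architecture and learning rate are illustrative choices:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyBP:
    """Minimal one-hidden-layer back-propagation network (illustrative only)."""

    def __init__(self, n_in, n_hidden, seed=0):
        rnd = random.Random(seed)
        # Each row of w1 holds n_in weights plus a trailing bias; w2 likewise.
        self.w1 = [[rnd.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
        self.w2 = [rnd.uniform(-1, 1) for _ in range(n_hidden + 1)]

    def forward(self, x):
        self.h = [sigmoid(sum(w * xi for w, xi in zip(ws, x + [1.0]))) for ws in self.w1]
        self.o = sigmoid(sum(w * hi for w, hi in zip(self.w2, self.h + [1.0])))
        return self.o

    def train(self, data, epochs=2000, lr=0.5):
        """Per-sample gradient descent on squared error (classic BP updates)."""
        for _ in range(epochs):
            for x, y in data:
                o = self.forward(x)
                delta_o = (o - y) * o * (1 - o)
                # Hidden-layer deltas use the pre-update output weights.
                for j, hj in enumerate(self.h):
                    delta_h = delta_o * self.w2[j] * hj * (1 - hj)
                    for k in range(len(x)):
                        self.w1[j][k] -= lr * delta_h * x[k]
                    self.w1[j][-1] -= lr * delta_h
                for j, hj in enumerate(self.h):
                    self.w2[j] -= lr * delta_o * hj
                self.w2[-1] -= lr * delta_o
```

Trained on a toy logical-AND task, the squared error drops steadily, which is the behavior a fog node would rely on when fitting diagnostic patterns locally.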
Today’s world is data-driven, with data being produced in vast amounts as a result of the rapid growth of technology that permeates every aspect of our lives. New data processing techniques must be developed and refined over time to gain meaningful insights from this vast, continuously produced volume of data in its various forms. Machine learning technologies provide promising solutions and potential methods for processing large quantities of data and extracting value from them. This study conducts a literature review on the application of machine learning techniques to big data processing. It provides a general overview of machine learning algorithms and techniques, a brief introduction to big data, and a discussion of related works that have used machine learning techniques in a variety of sectors to process large amounts of data. The study also discusses the challenges and issues associated with using machine learning for big data.
Data fusion is a multidisciplinary research area that spans different domains. It is used to attain minimum detection error probability and maximum reliability with the help of data retrieved from multiple healthcare sources. The generation of huge quantities of data from medical devices has resulted in big data, for which data fusion techniques become essential. Securing medical data is a crucial issue in an exponentially-pacing computing world and can be achieved by Intrusion Detection Systems (IDS). Since a single modality is not adequate to attain a high detection rate, there is a need to merge diverse techniques using a decision-based multimodal fusion process. In this view, this research article presents a new multimodal fusion-based IDS to secure healthcare data using Spark. The proposed model involves a decision-based fusion model with different processes such as initialization, pre-processing, Feature Selection (FS) and multimodal classification for effective detection of intrusions. In the FS process, a chaotic Butterfly Optimization (BO) algorithm called CBOA is introduced. Though the classic BO algorithm offers effective exploration, it fails to achieve fast convergence. To overcome this, i.e., to improve the convergence rate, this work modifies the required parameters of the BO algorithm using chaos theory. Finally, to detect intrusions, a multimodal classifier is applied by incorporating three Deep Learning (DL)-based classification models. Besides, Hadoop MapReduce and Spark were also utilized in this study to achieve faster computation of big data on a parallel computation platform. To validate the outcome of the presented model, a series of experiments was performed using the benchmark NSL-KDD Cup 99 dataset repository. The proposed model demonstrated its effectiveness on the applied dataset, offering a maximum accuracy of 99.21%, precision of 98.93% and detection rate of 99.59%. The results confirmed the superiority of the proposed model.
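The chaos-theory modification of BO's parameters is specific to the paper, but a common ingredient of chaotic optimization is replacing uniform random draws with a chaotic sequence such as the logistic map. The sketch below shows only that generic ingredient; the function names and parameters are assumptions, not the CBOA algorithm itself:

```python
def logistic_map_sequence(x0, n, r=4.0):
    """Generate n values of the logistic map x_{t+1} = r * x_t * (1 - x_t), chaotic at r = 4."""
    seq, x = [], x0
    for _ in range(n):
        x = r * x * (1 - x)
        seq.append(x)
    return seq

def chaotic_population(pop_size, dim, lower, upper, x0=0.7):
    """Initialize an optimizer population by scaling a chaotic sequence into [lower, upper].

    Chaotic initialization spreads candidates over the search space more
    deterministically than uniform sampling, which is one route to faster convergence.
    """
    seq = logistic_map_sequence(x0, pop_size * dim)
    return [[lower + (upper - lower) * seq[i * dim + d] for d in range(dim)]
            for i in range(pop_size)]
```

In a full CBOA, such sequences would also drive the per-iteration random coefficients of the butterfly search equations, not just the initial population.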
Calorific value is one of the most important properties of coal. Machine learning (ML) can be used to predict calorific value and thus reduce experimental costs. China is one of the world's largest coal-producing countries, and coal occupies an important position in its national energy structure. However, ML models with a large database covering the overall regions of China have been missing. Based on the extensive coal gasification practice at East China University of Science and Technology, we have built ML models with a large database covering the overall regions of China. An AutoML model was proposed and achieved a minimum MSE of 1.021. The SHAP method was used to increase model interpretability, and model validity was proved with literature data and additional in-house experiments. Model adaptability was discussed based on the databases of China and the USA, showing that geography-specific ML models are essential. This study integrated a large coal database and the AutoML method for accurate calorific value prediction and offers key tools for the Chinese coal industry.
Developing an accurate and efficient comprehensive water quality prediction model and its assessment method is crucial for the prevention and control of water pollution. Deep learning (DL), one of the most promising technologies today, plays a crucial role in the effective assessment of water body health, which is essential for water resource management. This study builds models using both the original dataset and a dataset augmented with Generative Adversarial Networks (GAN). It integrates optimization algorithms (OA) with Convolutional Neural Networks (CNN) to propose a comprehensive water quality model evaluation method aimed at identifying the optimal models for different pollutants. Specifically, after preprocessing the spectral dataset, data augmentation was conducted to obtain the two datasets. Then, six new models were developed on these datasets using particle swarm optimization (PSO), genetic algorithm (GA), and simulated annealing (SA) combined with CNN to simulate and forecast the concentrations of three water pollutants: Chemical Oxygen Demand (COD), Total Nitrogen (TN), and Total Phosphorus (TP). Finally, seven model evaluation methods, including uncertainty analysis, were used to evaluate the constructed models and select the optimal models for the three pollutants. The evaluation results indicate that the GPSCNN model performed best in predicting COD and TP concentrations, while the GGACNN model excelled in TN concentration prediction. Compared to existing technologies, the proposed models and evaluation methods provide a more comprehensive and rapid approach to water body prediction and assessment, offering new insights and methods for water pollution prevention and control.
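Of the three optimizers combined with CNN above, PSO is the easiest to sketch. The toy below minimizes a simple quadratic instead of a CNN hyperparameter objective; the function name and coefficient values (inertia `w`, accelerations `c1`, `c2`) are conventional defaults, not the paper's settings:

```python
import random

def pso_minimize(f, dim, bounds, n_particles=20, iters=100, seed=0,
                 w=0.7, c1=1.5, c2=1.5):
    """Minimal particle swarm optimization minimizing f over a box [lo, hi]^dim."""
    rnd = random.Random(seed)
    lo, hi = bounds
    pos = [[rnd.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                # Velocity blends inertia, pull toward personal best, pull toward global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * rnd.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rnd.random() * (gbest[d] - pos[i][d]))
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val
```

In a PSO-CNN hybrid such as the study's GPSCNN, `f` would instead train a CNN with the candidate hyperparameters and return its validation error, making each evaluation far more expensive.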
Climate change and human activities have reduced the area and degraded the functions and services of wetlands in China. To protect and restore wetlands, it is urgent to predict the spatial distribution of potential wetlands. In this study, the distribution of potential wetlands in China was simulated by combining the advantages of Google Earth Engine with geographic big data and machine learning algorithms. Based on a potential wetland database with 46,000 samples and an indicator system of 30 hydrologic, soil, vegetation, and topographic factors, a simulation model was constructed using machine learning algorithms. The random forest model simulated the distribution of potential wetlands in China well, with an area under the receiver operating characteristic curve of 0.851. The area of potential wetlands was 332,702 km², with 39.0% of potential wetlands in Northeast China. Geographic features were notable: potential wetlands were mainly concentrated in areas with 400-600 mm precipitation, semi-hydric and hydric soils, meadow and marsh vegetation, altitude less than 700 m, and slope less than 3°. The results provide an important reference for wetland remote sensing mapping and a scientific basis for wetland management in China.
Background: Describing where distribution hotspots and coldspots are located is crucial for any science-based species management and governance. Thus, here we created the world's first Super Species Distribution Models (SDMs) covering all described primate species with the best-available predictor set. These Super SDMs use an ensemble of modern machine learning algorithms, including Maxent, TreeNet, Random Forest, CART, CART Boosting and Bagging, and MARS, with cloud supercomputers utilized as an add-on option for more powerful models. For the global coldspot/hotspot models, we obtained global distribution data from www.GBIF.org (approx. 420,000 raw occurrence records) and utilized the world's largest Open Access environmental predictor set of 201 layers. For this analysis, all occurrences were merged into one multi-species (400+ species) pixel-based analysis. Results: We present the first quantified pixel-based global primate hotspot prediction, covering Central and Northern South America, West Africa, East Africa, Southeast Asia, Central Asia, and Southern Africa. The global primate coldspots are Antarctica, the Arctic, most temperate regions, and Oceania past the Wallace line. We additionally describe all these modeled hotspots/coldspots and discuss reasons for a quantified understanding of where the world's non-human primates occur (or not). Conclusions: This shows where the focus of most future research and conservation management efforts should be, using state-of-the-art digital data indication tools with reasoning. Those areas should be considered the highest conservation management priority, ideally following 'no killing zones' and sustainable land stewardship approaches, if primates are to have a chance of survival.
The objective of this study is to develop an advanced approach to variogram modelling by integrating genetic algorithms (GA) with machine learning-based linear regression, aiming to improve the accuracy and efficiency of geostatistical analysis, particularly in mineral exploration. The study combines GA and machine learning to optimise variogram parameters, including range, sill, and nugget, by minimising the root mean square error (RMSE) and maximising the coefficient of determination (R²). The experimental variograms were computed and modelled using theoretical models, followed by optimisation via evolutionary algorithms. The method was applied to gravity data from the Ngoura-Batouri-Kette mining district in Eastern Cameroon, covering 141 data points. Sequential Gaussian Simulations (SGS) were employed for predictive mapping to validate simulated results against true values. Key findings show variograms with ranges between 24.71 km and 49.77 km, and optimised RMSE and R² values of 11.21 mGal² and 0.969, respectively, after 42 generations of GA optimisation. Predictive mapping using SGS demonstrated that simulated values closely matched true values, with a simulated mean of 21.75 mGal compared to the true mean of 25.16 mGal, and variances of 465.70 mGal² and 555.28 mGal², respectively. The results confirmed spatial variability and anisotropies in the N170-N210 directions, consistent with prior studies. This work presents a novel integration of GA and machine learning for variogram modelling, offering an automated, efficient approach to parameter estimation. The methodology significantly enhances predictive geostatistical models, contributing to the advancement of mineral exploration and improving the precision and speed of decision-making in the petroleum and mining industries.
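The variogram fitting that the GA optimizes can be illustrated with the standard spherical model and an RMSE objective. The sketch below replaces the GA with a simple candidate search over the range parameter; the function names and toy values are illustrative, not the study's data:

```python
def spherical_variogram(h, nugget, sill, rng):
    """Spherical variogram model: gamma(h) rises to nugget + sill at the range rng."""
    if h >= rng:
        return nugget + sill
    return nugget + sill * (1.5 * h / rng - 0.5 * (h / rng) ** 3)

def rmse(params, lags, gammas):
    """RMSE between the model curve and experimental semivariances at the given lags."""
    nugget, sill, rng = params
    n = len(lags)
    return (sum((spherical_variogram(h, nugget, sill, rng) - g) ** 2
                for h, g in zip(lags, gammas)) / n) ** 0.5

def fit_range(lags, gammas, nugget, sill, candidates):
    """Pick the range minimizing RMSE (a stand-in for the paper's GA search)."""
    return min(candidates, key=lambda r: rmse((nugget, sill, r), lags, gammas))
```

A GA would evolve all three parameters (nugget, sill, range) jointly against the same RMSE objective rather than scanning one parameter over a candidate list.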
Funding: Supported by Grant PLN2022-14 of the State Key Laboratory of Oil and Gas Reservoir Geology and Exploitation (Southwest Petroleum University).
Funding: Funded by the Beijing Natural Science Foundation-Haidian Original Innovation Joint Fund (Grant No. L222103) and the National Natural Science Foundation of China (Grant No. 72174012).
Funding: Supported by the National Key Research and Development Program of China (2023YFB3307801), the National Natural Science Foundation of China (62394343, 62373155, 62073142), the Major Science and Technology Project of Xinjiang (No. 2022A01006-4), the Programme of Introducing Talents of Discipline to Universities (the 111 Project) under Grant B17017, the Fundamental Research Funds for the Central Universities, Science Foundation of China University of Petroleum, Beijing (No. 2462024YJRC011), and the Open Research Project of the State Key Laboratory of Industrial Control Technology, China (Grant No. ICT2024B70).
Funding: supported in part by the National Natural Science Foundation of China (61322110, 6141101115) and the Doctoral Fund of the Ministry of Education (201300051100013).
Abstract: Recently, the internet has stimulated explosive progress in knowledge discovery from big-volume data resources, digging out valuable hidden rules by computing. Meanwhile, wireless channel measurement data also exhibits big-volume features, considering massive antennas, huge bandwidth, and versatile application scenarios. This article first presents a comprehensive survey of channel measurement and modeling research for mobile communication, especially for 5th Generation (5G) systems and beyond. In light of this big-data research progress, a cluster-nuclei based model is then proposed, which takes advantage of both stochastic and deterministic models. The novel model has low complexity, with a limited number of cluster nuclei, while each cluster nucleus maps physically to a real propagation object. By combining the channel-property variation principles with the antenna size, frequency, mobility, and scenario information dug from the channel data, the proposed model can be extended to versatile applications supporting future mobile research.
Abstract: Facing the development of future 5G, emerging technologies such as the Internet of Things, big data, cloud computing, and artificial intelligence are driving explosive growth in data traffic. With radical changes in communication theory and implementation technologies, wireless communications and wireless networks have entered a new era. Among them, wireless big data (WBD) has tremendous value, and artificial intelligence (AI) offers unthinkable possibilities. However, in the big-data development and AI application communities, the lack of a sound theoretical foundation and mathematical methods is regarded as a real challenge that needs to be solved. Starting from the basic problem of wireless communication and the interrelationship of demand, environment, and ability, this paper investigates the concept and data model of WBD, wireless data mining, wireless knowledge and wireless knowledge learning (WKL), and typical practical examples, to facilitate and open up more opportunities for WBD research and development. Such research is beneficial for creating new theoretical foundations and emerging technologies for future wireless communications.
Abstract: A flood is a significantly damaging natural calamity that causes loss of life and property. Earlier work on the construction of flood prediction models aimed to reduce risks, suggest policies, reduce mortality, and limit property damage caused by floods. The massive amount of data generated by social media platforms such as Twitter opens the door to flood analysis. Because of the real-time nature of Twitter data, some government agencies and authorities have used it to track natural catastrophe events in order to build a more rapid rescue strategy. However, due to the short length of tweets, it is difficult to construct a perfect prediction model for determining floods. Machine learning (ML) and deep learning (DL) approaches can be used to statistically develop flood prediction models. At the same time, the vast number of tweets necessitates a big data analytics (BDA) tool for flood prediction. In this regard, this work provides an optimal deep learning-based flood forecasting model with big data analytics (ODLFF-BDA) based on Twitter data. The suggested ODLFF-BDA technique aims to anticipate the existence of floods using tweets in a big data setting. The ODLFF-BDA technique comprises data pre-processing to convert the input tweets into a usable format. In addition, a Bidirectional Encoder Representations from Transformers (BERT) model is used to generate emotive contextual embeddings from tweets. Furthermore, a gated recurrent unit (GRU) with a Multilayer Convolutional Neural Network (MLCNN) is used to extract local data and predict floods. Finally, an Equilibrium Optimizer (EO) is used to fine-tune the hyperparameters of the GRU and MLCNN models in order to increase prediction performance. Memory usage is kept below 3.5 MB, lower than that of the other algorithms compared. The ODLFF-BDA technique's performance was validated using a benchmark Kaggle dataset, and the findings showed that it significantly outperformed other recent approaches.
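The recurrent component of the pipeline above follows the standard GRU gate equations: an update gate z, a reset gate r, and a candidate state n combined as h' = (1 − z)·h + z·n. The scalar cell below is a minimal sketch of that update; the weight shapes and zero initialization are purely illustrative, since a real model learns these parameters over BERT embeddings.

```python
import math

# Minimal GRU cell update (scalar input and state). Parameter names follow the
# standard gate equations; the zero weights are an illustrative assumption.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, W, U, b):
    """One GRU step. W, U, b each hold (update z, reset r, candidate n) parameters."""
    z = sigmoid(W["z"] * x + U["z"] * h + b["z"])          # update gate
    r = sigmoid(W["r"] * x + U["r"] * h + b["r"])          # reset gate
    n = math.tanh(W["n"] * x + U["n"] * (r * h) + b["n"])  # candidate state
    return (1.0 - z) * h + z * n                           # new hidden state

zeros = {"z": 0.0, "r": 0.0, "n": 0.0}
h_new = gru_step(x=1.0, h=0.8, W=zeros, U=zeros, b=zeros)
```

With all-zero parameters, both gates sit at 0.5 and the candidate is 0, so the state simply halves, which is a handy sanity check when wiring up such a cell.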
Funding: supported by the 2020 Foshan Science and Technology Project (No. 2020001005356); Baoling Qin received the grant.
Abstract: Although the Internet of Things has been widely applied, problems remain in applying cloud computing to the collection, processing, analysis, and storage of digital smart medical big data, especially the low efficiency of medical diagnosis. With the wide application of the Internet of Things and big data in the medical field, medical big data is growing geometrically, resulting in cloud service overload, insufficient storage, communication delay, and network congestion. In order to solve these medical and network problems, a medical big-data-oriented fog computing architecture and a BP algorithm application are proposed, and their structural advantages and characteristics are studied. This architecture enables the medical big data generated by medical edge devices and the existing data in the cloud service center to be computed, compared, and analyzed at the fog node through the Internet of Things. The diagnosis process is designed to reduce business processing delay and improve the diagnostic effect. Considering the weak computing power of each edge device, the artificial-intelligence BP neural network algorithm is used in the core computing model of the medical diagnosis system to improve system computing power, enhance medical intelligence-aided decision-making, and improve clinical diagnosis and treatment efficiency. In the application process, combined with the characteristics of medical big data technology, through fog architecture design and big data technology integration, we study the processing and analysis of the heterogeneous data of a medical diagnosis system in the context of the Internet of Things. The results are promising: the medical platform network is smooth, the data storage space is sufficient, data processing and analysis are fast, and the diagnostic effect is remarkable, making the system a good assistant to doctors. It not only effectively addresses low clinical diagnosis and treatment efficiency and quality, but also reduces patients' waiting time, effectively eases the contradiction between doctors and patients, and improves medical service quality and management.
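The BP (backpropagation) learner mentioned above reduces, at its core, to forward evaluation followed by chain-rule gradient steps. The fragment below is a minimal sketch for a single sigmoid hidden unit with a linear output; the architecture, data point, and learning rate are invented for illustration and are far smaller than any real diagnostic model.

```python
import math

# One backpropagation (BP) step for a tiny 2-input, 1-hidden-unit network.
# All sizes, weights, and the training point are illustrative assumptions.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bp_step(w, x, target, lr=0.5):
    """Single gradient step on squared error; mutates the weight dict w."""
    h = sigmoid(w["w1"] * x[0] + w["w2"] * x[1] + w["b1"])  # hidden activation
    y = w["w3"] * h + w["b2"]                               # linear output
    err = y - target
    dh = err * w["w3"] * h * (1.0 - h)  # hidden gradient, using pre-update w3
    w["w3"] -= lr * err * h             # output-layer updates
    w["b2"] -= lr * err
    w["w1"] -= lr * dh * x[0]           # hidden-layer updates
    w["w2"] -= lr * dh * x[1]
    w["b1"] -= lr * dh
    return 0.5 * err * err

w = {"w1": 0.1, "w2": -0.2, "w3": 0.3, "b1": 0.0, "b2": 0.0}
losses = [bp_step(w, x=(1.0, 0.5), target=1.0) for _ in range(50)]
```

Repeating the step on one sample drives the loss toward zero, which is the basic convergence behavior the fog nodes rely on, scaled up to real layers and datasets.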
Funding: This work was supported by the Deanship of Scientific Research at Qassim University.
Abstract: Today's world is data-driven, with data being produced in vast amounts as a result of the rapid growth of technology that permeates every aspect of our lives. New data processing techniques must be developed and refined over time to gain meaningful insights from this vast, continuously produced volume of data in its various forms. Machine learning technologies provide promising solutions and potential methods for processing large quantities of data and gaining value from them. This study conducts a literature review on the application of machine learning techniques to big data processing. It provides a general overview of machine learning algorithms and techniques, a brief introduction to big data, and a discussion of related works that have used machine learning techniques in a variety of sectors to process large amounts of data. The study also discusses the challenges and issues associated with using machine learning for big data.
Abstract: Data fusion is a multidisciplinary research area that spans different domains. It is used to attain minimum detection-error probability and maximum reliability with the help of data retrieved from multiple healthcare sources. The generation of huge quantities of data by medical devices has resulted in big data, for which data fusion techniques become essential. Securing medical data is a crucial issue in an exponentially-pacing computing world and can be achieved by Intrusion Detection Systems (IDS). In this regard, since a single modality is not adequate to attain a high detection rate, there is a need to merge diverse techniques using a decision-based multimodal fusion process. In this view, this research article presents a new multimodal fusion-based IDS to secure healthcare data using Spark. The proposed model involves a decision-based fusion model with processes such as initialization, pre-processing, Feature Selection (FS), and multimodal classification for effective detection of intrusions. In the FS process, a chaotic Butterfly Optimization (BO) algorithm called CBOA is introduced. Though the classic BO algorithm offers effective exploration, it fails to achieve fast convergence. To improve the convergence rate, this work modifies the required parameters of the BO algorithm using chaos theory. Finally, to detect intrusions, a multimodal classifier is applied by incorporating three Deep Learning (DL)-based classification models. Besides, Hadoop MapReduce and Spark were also utilized in this study to achieve faster computation of big data on a parallel computation platform. To validate the presented model, a series of experiments was performed using the benchmark NSLKDDCup99 dataset repository. The proposed model demonstrated effective results on the applied dataset, offering a maximum accuracy of 99.21%, precision of 98.93%, and detection rate of 99.59%. The results confirmed the superiority of the proposed model.
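One common way such papers inject chaos into a metaheuristic like BO is to replace uniform random draws with a chaotic map during initialization, so the population covers the search space less repetitively. The sketch below uses the logistic map for that purpose; the map parameter, seed value, and population shape are illustrative assumptions rather than CBOA's actual settings.

```python
# Sketch: chaotic initialization via the logistic map x_{k+1} = mu * x_k * (1 - x_k),
# a typical chaos-theory substitute for uniform random draws in metaheuristics.
# mu, x0, and the population shape are illustrative assumptions.

def logistic_map_sequence(x0, n, mu=4.0):
    """Generate n chaotic values in (0, 1); requires x0 in (0, 1)."""
    seq, x = [], x0
    for _ in range(n):
        x = mu * x * (1.0 - x)
        seq.append(x)
    return seq

def chaotic_init(pop_size, dim, lower, upper, x0=0.7):
    """Initialize a population by scaling chaotic values into [lower, upper]."""
    chaos = logistic_map_sequence(x0, pop_size * dim)
    return [[lower + chaos[i * dim + j] * (upper - lower) for j in range(dim)]
            for i in range(pop_size)]

pop = chaotic_init(pop_size=5, dim=3, lower=-10.0, upper=10.0)
```

For mu = 4 the map stays inside [0, 1], so every scaled coordinate lands within the given bounds while remaining deterministic and non-repeating.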
Funding: Shanghai Yangfan Program (22YF1410300), Yunfei Gao; National Natural Science Foundation of China (22208104), Yunfei Gao; Shanghai Chenguang Program (21CGA35), Yunfei Gao; National Key Research and Development Program of China (2022YFA1504701 and 2022YFB4101900), Yunfei Gao.
Abstract: Calorific value is one of the most important properties of coal. Machine learning (ML) can be used to predict calorific value and thus reduce experimental costs. China is one of the world's largest coal-producing countries, and coal occupies an important position in its national energy structure. However, ML models backed by a large database covering all regions of China have been missing. Based on the extensive coal gasification practice at East China University of Science and Technology, we have built ML models on a large database covering all regions of China. An AutoML model was proposed and achieved a minimum MSE of 1.021. The SHAP method was used to increase model interpretability, and model validity was proved with literature data and additional in-house experiments. Model adaptability was discussed based on the databases of China and the USA, showing that geography-specific ML models are essential. This study integrated a large coal database with the AutoML method for accurate calorific value prediction and offers key tools for the Chinese coal industry.
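The AutoML selection criterion reported above (minimum MSE) boils down to scoring candidate models on held-out data and keeping the best. The toy sketch below shows only that selection loop; the candidate "models" are invented fixed-slope functions and the validation pairs are made-up numbers, not coal data.

```python
# Toy sketch of AutoML-style model selection by validation MSE.
# Candidates and data are invented stand-ins: each "model" is a fixed slope.

def mse(model, data):
    """Mean squared error of model predictions over (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

candidates = {
    "slope_1.0": lambda x: 1.0 * x,
    "slope_2.0": lambda x: 2.0 * x,
    "slope_3.0": lambda x: 3.0 * x,
}
val = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # (feature, target), invented

scores = {name: mse(m, val) for name, m in candidates.items()}
best = min(scores, key=scores.get)  # automated pick, as in an AutoML search
```

A real AutoML run searches over model families and hyperparameters rather than three fixed slopes, but the scoring-and-argmin structure is the same.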
Funding: Supported by the Natural Science Basic Research Plan in Shaanxi Province of China (Program No. 2022JM-396), the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDA23040101), the Shaanxi Province Key Research and Development Projects (Program No. 2023-YBSF-437), the Xi'an Shiyou University Graduate Student Innovation Fund Program (Program No. YCX2412041), the State Key Laboratory of Air Traffic Management System and Technology (SKLATM202001), the Tianjin Education Commission Research Program Project (2020KJ028), and the Fundamental Research Funds for the Central Universities (3122019132).
Abstract: Developing an accurate and efficient comprehensive water quality prediction model and its assessment method is crucial for the prevention and control of water pollution. Deep learning (DL), as one of the most promising technologies today, plays a crucial role in the effective assessment of water body health, which is essential for water resource management. This study builds models using both the original dataset and a dataset augmented with Generative Adversarial Networks (GAN). It integrates optimization algorithms (OA) with Convolutional Neural Networks (CNN) to propose a comprehensive water quality model evaluation method aimed at identifying the optimal models for different pollutants. Specifically, after preprocessing the spectral dataset, data augmentation was conducted to obtain two datasets. Then, six new models were developed on these datasets using particle swarm optimization (PSO), genetic algorithms (GA), and simulated annealing (SA) combined with CNN to simulate and forecast the concentrations of three water pollutants: Chemical Oxygen Demand (COD), Total Nitrogen (TN), and Total Phosphorus (TP). Finally, seven model evaluation methods, including uncertainty analysis, were used to evaluate the constructed models and select the optimal model for each of the three pollutants. The evaluation results indicate that the GPSCNN model performed best in predicting COD and TP concentrations, while the GGACNN model excelled in TN concentration prediction. Compared to existing technologies, the proposed models and evaluation methods provide a more comprehensive and rapid approach to water body prediction and assessment, offering new insights and methods for water pollution prevention and control.
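The PSO component used above to combine with the CNN is a standard velocity-position update driven by personal and global bests. The loop below is a minimal sketch of that update; the objective is a stand-in sphere function rather than a network's validation error, and the swarm parameters are conventional defaults, not the paper's settings.

```python
import random

# Minimal particle swarm optimization (PSO) sketch. In the paper's setting the
# objective would be CNN validation error; here a sphere function stands in.

def pso(objective, dim=2, n_particles=12, iters=60, seed=0):
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                      # personal bests
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]     # global best
    w, c1, c2 = 0.7, 1.5, 1.5                        # inertia, cognitive, social
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

best, best_val = pso(lambda p: sum(x * x for x in p))
```

Swapping the sphere function for a routine that trains or evaluates a CNN with the particle's coordinates as hyperparameters gives the PSO-CNN coupling described in the abstract.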
Funding: supported by the Natural Science Foundation of Jilin Province, China (YDZJ202301ZYTS218), the National Natural Science Foundation of China (42301430, 42222103, 42171379, U2243230, and 42101379), the Youth Innovation Promotion Association of the Chinese Academy of Sciences (2017277 and 2021227), and the Professional Association of the Alliance of International Science Organizations (ANSO-PA-2020-14).
Abstract: Climate change and human activities have reduced the area and degraded the functions and services of wetlands in China. To protect and restore wetlands, it is urgent to predict the spatial distribution of potential wetlands. In this study, the distribution of potential wetlands in China was simulated by integrating the advantages of Google Earth Engine with geographic big data and machine learning algorithms. Based on a potential-wetland database with 46,000 samples and an indicator system of 30 hydrologic, soil, vegetation, and topographic factors, a simulation model was constructed using machine learning algorithms. The random forest model simulated the distribution of potential wetlands in China well, with an area under the receiver operating characteristic curve of 0.851. The area of potential wetlands was 332,702 km², with 39.0% of potential wetlands in Northeast China. Geographic features were notable: potential wetlands were mainly concentrated in areas with 400-600 mm precipitation, semi-hydric and hydric soils, meadow and marsh vegetation, altitude less than 700 m, and slope less than 3°. The results provide an important reference for wetland remote sensing mapping and a scientific basis for wetland management in China.
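The AUC figure quoted above can be computed directly from model scores with the rank (Mann-Whitney) formulation: the probability that a random positive outscores a random negative, with ties counted as half. The scores and labels below are invented for illustration, not the study's wetland samples.

```python
# Sketch: AUC from scores via the Mann-Whitney formulation.
# AUC = P(score of random positive > score of random negative), ties count 0.5.
# Scores and labels are illustrative stand-ins.

def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
value = auc(scores, labels)
```

This pairwise form is equivalent to the area under the ROC curve and is convenient when the number of samples is modest; large-scale pipelines typically use a sorted-rank implementation instead of the O(P·N) double loop.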
Abstract: Background: Describing where distribution hotspots and coldspots are located is crucial for any science-based species management and governance. Thus, we created the world's first Super Species Distribution Models (SDMs) covering all described primate species with the best-available predictor set. These Super SDMs use an ensemble of modern machine learning algorithms, including Maxent, Tree Net, Random Forest, CART, CART Boosting and Bagging, and MARS, with cloud supercomputers utilized as an add-on option for more powerful models. For the global coldspot/hotspot models, we obtained global distribution data from www.GBIF.org (approx. 420,000 raw occurrence records) and utilized the world's largest open-access environmental predictor set of 201 layers. For this analysis, all occurrences were merged into one multi-species (400+ species) pixel-based analysis. Results: We present the first quantified pixel-based global primate hotspot prediction for Central and Northern South America, West Africa, East Africa, Southeast Asia, Central Asia, and Southern Africa. The global primate coldspots are Antarctica, the Arctic, most temperate regions, and Oceania past the Wallace line. We additionally describe all these modeled hotspots/coldspots and discuss reasons for a quantified understanding of where the world's non-human primates occur (or not). Conclusions: This shows where the focus of most future research and conservation management efforts should be, using state-of-the-art digital data indication tools with reasoning. Those areas should be considered of the highest conservation management priority, ideally following 'no killing zones' and sustainable land stewardship approaches, if primates are to have a chance of survival.
Abstract: The objective of this study is to develop an advanced approach to variogram modelling by integrating genetic algorithms (GA) with machine learning-based linear regression, aiming to improve the accuracy and efficiency of geostatistical analysis, particularly in mineral exploration. The study combines GA and machine learning to optimise variogram parameters, including range, sill, and nugget, by minimising the root mean square error (RMSE) and maximising the coefficient of determination (R²). The experimental variograms were computed and modelled using theoretical models, followed by optimisation via evolutionary algorithms. The method was applied to gravity data from the Ngoura-Batouri-Kette mining district in Eastern Cameroon, covering 141 data points. Sequential Gaussian Simulations (SGS) were employed for predictive mapping to validate simulated results against true values. Key findings show variograms with ranges between 24.71 km and 49.77 km, and optimised RMSE and R² values of 11.21 mGal² and 0.969, respectively, after 42 generations of GA optimisation. Predictive mapping using SGS demonstrated that simulated values closely matched true values, with the simulated mean at 21.75 mGal compared to the true mean of 25.16 mGal, and variances of 465.70 mGal² and 555.28 mGal², respectively. The results confirmed spatial variability and anisotropies in the N170-N210 directions, consistent with prior studies. This work presents a novel integration of GA and machine learning for variogram modelling, offering an automated, efficient approach to parameter estimation. The methodology significantly enhances predictive geostatistical models, contributing to the advancement of mineral exploration and improving the precision and speed of decision-making in the petroleum and mining industries.
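The GA-driven variogram fit described above can be sketched with a spherical model gamma(h) = nugget + sill·(1.5·h/a − 0.5·(h/a)³) for h < a, and a toy elitist evolutionary loop minimizing RMSE against experimental points. The synthetic points, parameter bounds, and GA settings below are invented for illustration; the study's actual optimiser and Cameroon gravity data are not reproduced.

```python
import math
import random

# Sketch: fitting a spherical variogram (nugget, sill, range a) to experimental
# points with a toy elitist GA minimizing RMSE. All data and settings invented.

def spherical(h, nugget, sill, a):
    if h >= a:
        return nugget + sill
    r = h / a
    return nugget + sill * (1.5 * r - 0.5 * r ** 3)

def rmse(params, points):
    return math.sqrt(sum((spherical(h, *params) - g) ** 2 for h, g in points)
                     / len(points))

def ga_fit(points, pop=30, gens=120, seed=1):
    rng = random.Random(seed)
    new = lambda: (rng.uniform(0, 5), rng.uniform(0, 50), rng.uniform(1, 60))
    population = [new() for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=lambda p: rmse(p, points))
        elite = population[: pop // 3]           # keep the best third
        # Refill by mutating random elites with Gaussian noise (kept positive).
        population = elite + [tuple(max(1e-6, x + rng.gauss(0, 0.5)) for x in p)
                              for p in (rng.choice(elite)
                                        for _ in range(pop - len(elite)))]
    return min(population, key=lambda p: rmse(p, points))

true_params = (1.0, 20.0, 30.0)                  # synthetic ground truth
points = [(h, spherical(h, *true_params)) for h in range(0, 50, 5)]
fit = ga_fit(points)
```

Because the elites are always carried over, the best RMSE is non-increasing across generations; a production fit would add crossover and adaptive mutation rather than pure elitist mutation.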