In recent years,there has been a concerted effort to improve anomaly detection tech-niques,particularly in the context of high-dimensional,distributed clinical data.Analysing patient data within clinical settings reve...In recent years,there has been a concerted effort to improve anomaly detection tech-niques,particularly in the context of high-dimensional,distributed clinical data.Analysing patient data within clinical settings reveals a pronounced focus on refining diagnostic accuracy,personalising treatment plans,and optimising resource allocation to enhance clinical outcomes.Nonetheless,this domain faces unique challenges,such as irregular data collection,inconsistent data quality,and patient-specific structural variations.This paper proposed a novel hybrid approach that integrates heuristic and stochastic methods for anomaly detection in patient clinical data to address these challenges.The strategy combines HPO-based optimal Density-Based Spatial Clustering of Applications with Noise for clustering patient exercise data,facilitating efficient anomaly identification.Subsequently,a stochastic method based on the Interquartile Range filters unreliable data points,ensuring that medical tools and professionals receive only the most pertinent and accurate information.The primary objective of this study is to equip healthcare pro-fessionals and researchers with a robust tool for managing extensive,high-dimensional clinical datasets,enabling effective isolation and removal of aberrant data points.Furthermore,a sophisticated regression model has been developed using Automated Machine Learning(AutoML)to assess the impact of the ensemble abnormal pattern detection approach.Various statistical error estimation techniques validate the efficacy of the hybrid approach alongside AutoML.Experimental results show that implementing this innovative hybrid model on patient rehabilitation data leads to a notable enhance-ment in AutoML performance,with an average improvement of 0.041 in the R2 score,surpassing the effectiveness of traditional regression models.展开更多
Clustering is used to gain an intuition of the struc tures in the data.Most of the current clustering algorithms pro duce a clustering structure even on data that do not possess such structure.In these cases,the algor...Clustering is used to gain an intuition of the struc tures in the data.Most of the current clustering algorithms pro duce a clustering structure even on data that do not possess such structure.In these cases,the algorithms force a structure in the data instead of discovering one.To avoid false structures in the relations of data,a novel clusterability assessment method called density-based clusterability measure is proposed in this paper.I measures the prominence of clustering structure in the data to evaluate whether a cluster analysis could produce a meaningfu insight to the relationships in the data.This is especially useful in time-series data since visualizing the structure in time-series data is hard.The performance of the clusterability measure is evalu ated against several synthetic data sets and time-series data sets which illustrate that the density-based clusterability measure can successfully indicate clustering structure of time-series data.展开更多
Clustering evolving data streams is important to be performed in a limited time with a reasonable quality. The existing micro clustering based methods do not consider the distribution of data points inside the micro c...Clustering evolving data streams is important to be performed in a limited time with a reasonable quality. The existing micro clustering based methods do not consider the distribution of data points inside the micro cluster. We propose LeaDen-Stream (Leader Density-based clustering algorithm over evolving data Stream), a density-based clustering algorithm using leader clustering. The algorithm is based on a two-phase clustering. The online phase selects the proper mini-micro or micro-cluster leaders based on the distribution of data points in the micro clusters. Then, the leader centers are sent to the offline phase to form final clusters. In LeaDen-Stream, by carefully choosing between two kinds of micro leaders, we decrease time complexity of the clustering while maintaining the cluster quality. A pruning strategy is also used to filter out real data from noise by introducing dense and sparse mini-micro and micro-cluster leaders. Our performance study over a number of real and synthetic data sets demonstrates the effectiveness and efficiency of our method.展开更多
Deformation prediction for extra-high arch dams is highly important for ensuring their safe operation.To address the challenges of complex monitoring data,the uneven spatial distribution of deformation,and the constru...Deformation prediction for extra-high arch dams is highly important for ensuring their safe operation.To address the challenges of complex monitoring data,the uneven spatial distribution of deformation,and the construction and optimization of a prediction model for deformation prediction,a multipoint ultrahigh arch dam deformation prediction model,namely,the CEEMDAN-KPCA-GSWOA-KELM,which is based on a clustering partition,is pro-posed.First,the monitoring data are preprocessed via variational mode decomposition(VMD)and wavelet denoising(WT),which effectively filters out noise and improves the signal-to-noise ratio of the data,providing high-quality input data for subsequent prediction models.Second,scientific cluster partitioning is performed via the K-means++algorithm to precisely capture the spatial distribution characteristics of extra-high arch dams and ensure the consistency of deformation trends at measurement points within each partition.Finally,CEEMDAN is used to separate monitoring data,predict and analyze each component,combine the KPCA(Kernel Principal Component Analysis)and the KELM(Kernel Extreme Learning Machine)optimized by the GSWOA(Global Search Whale Optimization Algorithm),integrate the predictions of each component via reconstruction methods,and precisely predict the overall trend of ultrahigh arch dam deformation.An extra high arch dam project is taken as an example and validated via a comparative analysis of multiple models.The results show that the multipoint deformation prediction model in this paper can combine data from different measurement points,achieve a comprehensive,precise prediction of the deformation situation of extra high arch dams,and provide strong technical support for safe operation.展开更多
Multi-view clustering is a critical research area in computer science aimed at effectively extracting meaningful patterns from complex,high-dimensional data that single-view methods cannot capture.Traditional fuzzy cl...Multi-view clustering is a critical research area in computer science aimed at effectively extracting meaningful patterns from complex,high-dimensional data that single-view methods cannot capture.Traditional fuzzy clustering techniques,such as Fuzzy C-Means(FCM),face significant challenges in handling uncertainty and the dependencies between different views.To overcome these limitations,we introduce a new multi-view fuzzy clustering approach that integrates picture fuzzy sets with a dual-anchor graph method for multi-view data,aiming to enhance clustering accuracy and robustness,termed Multi-view Picture Fuzzy Clustering(MPFC).In particular,the picture fuzzy set theory extends the capability to represent uncertainty by modeling three membership levels:membership degrees,neutral degrees,and refusal degrees.This allows for a more flexible representation of uncertain and conflicting data than traditional fuzzy models.Meanwhile,dual-anchor graphs exploit the similarity relationships between data points and integrate information across views.This combination improves stability,scalability,and robustness when handling noisy and heterogeneous data.Experimental results on several benchmark datasets demonstrate significant improvements in clustering accuracy and efficiency,outperforming traditional methods.Specifically,the MPFC algorithm demonstrates outstanding clustering performance on a variety of datasets,attaining a Purity(PUR)score of 0.6440 and an Accuracy(ACC)score of 0.6213 for the 3 Sources dataset,underscoring its robustness and efficiency.The proposed approach significantly contributes to fields such as pattern recognition,multi-view relational data analysis,and large-scale clustering problems.Future work will focus on extending the method for semi-supervised multi-view clustering,aiming to enhance adaptability,scalability,and performance in real-world applications.展开更多
Many fields,such as neuroscience,are experiencing the vast prolife ration of cellular data,underscoring the need fo r organizing and interpreting large datasets.A popular approach partitions data into manageable subse...Many fields,such as neuroscience,are experiencing the vast prolife ration of cellular data,underscoring the need fo r organizing and interpreting large datasets.A popular approach partitions data into manageable subsets via hierarchical clustering,but objective methods to determine the appropriate classification granularity are missing.We recently introduced a technique to systematically identify when to stop subdividing clusters based on the fundamental principle that cells must differ more between than within clusters.Here we present the corresponding protocol to classify cellular datasets by combining datadriven unsupervised hierarchical clustering with statistical testing.These general-purpose functions are applicable to any cellular dataset that can be organized as two-dimensional matrices of numerical values,including molecula r,physiological,and anatomical datasets.We demonstrate the protocol using cellular data from the Janelia MouseLight project to chara cterize morphological aspects of neurons.展开更多
Various factors,including weak tie-lines into the electric power system(EPS)networks,can lead to low-frequency oscillations(LFOs),which are considered an instant,non-threatening situation,but slow-acting and poisonous...Various factors,including weak tie-lines into the electric power system(EPS)networks,can lead to low-frequency oscillations(LFOs),which are considered an instant,non-threatening situation,but slow-acting and poisonous.Considering the challenge mentioned,this article proposes a clustering-based machine learning(ML)framework to enhance the stability of EPS networks by suppressing LFOs through real-time tuning of key power system stabilizer(PSS)parameters.To validate the proposed strategy,two distinct EPS networks are selected:the single-machine infinite-bus(SMIB)with a single-stage PSS and the unified power flow controller(UPFC)coordinated SMIB with a double-stage PSS.To generate data under various loading conditions for both networks,an efficient but offline meta-heuristic algorithm,namely the grey wolf optimizer(GWO),is used,with the loading conditions as inputs and the key PSS parameters as outputs.The generated loading conditions are then clustered using the fuzzy k-means(FKM)clustering method.Finally,the group method of data handling(GMDH)and long short-term memory(LSTM)ML models are developed for clustered data to predict PSS key parameters in real time for any loading condition.A few well-known statistical performance indices(SPI)are considered for validation and robustness of the training and testing procedure of the developed FKM-GMDH and FKM-LSTM models based on the prediction of PSS parameters.The performance of the ML models is also evaluated using three stability indices(i.e.,minimum damping ratio,eigenvalues,and time-domain simulations)after optimally tuned PSS with real-time estimated parameters under changing operating conditions.Besides,the outputs of the offline(GWO-based)metaheuristic model,proposed real-time(FKM-GMDH and FKM-LSTM)machine learning models,and previously reported literature models are compared.According to the results,the proposed methodology outperforms the others in enhancing the stability of the selected EPS networks by damping out the observed unwanted LFOs under various loading conditions.展开更多
Underwater wireless sensor networks(UWSNs)rely on data aggregation to streamline routing operations by merging information at intermediate nodes before transmitting it to the sink.However,many existing data aggregatio...Underwater wireless sensor networks(UWSNs)rely on data aggregation to streamline routing operations by merging information at intermediate nodes before transmitting it to the sink.However,many existing data aggregation techniques are designed exclusively for static networks and fail to reflect the dynamic nature of underwater environments.Additionally,conventional multi-hop data gathering techniques often lead to energy depletion problems near the sink,commonly known as the energy hole issue.Moreover,cluster-based aggregation methods face significant challenges such as cluster head(CH)failures and collisions within clusters that degrade overall network performance.To address these limitations,this paper introduces an innovative framework,the Cluster-based Data Aggregation using Fuzzy Decision Model(CDAFDM),tailored for mobile UWSNs.The proposed method has four main phases:clustering,CH selection,data aggregation,and re-clustering.During CH selection,a fuzzy decision model is utilized to ensure efficient cluster head selection based on parameters such as residual energy,distance to the sink,and data delivery likelihood,enhancing network stability and energy efficiency.In the aggregation phase,CHs transmit a single,consolidated set of non-redundant data to the base station(BS),thereby reducing data duplication and saving energy.To adapt to the changing network topology,the re-clustering phase periodically updates cluster formations and reselects CHs.Simulation results show that CDAFDM outperforms current protocols such as CAPTAIN(Collection Algorithm for underwater oPTical-AcoustIc sensor Networks),EDDG(Event-Driven Data Gathering),and DCBMEC(Data Collection Based on Mobile Edge Computing)with a packet delivery ratio increase of up to 4%,an energy consumption reduction of 18%,and a data collection latency reduction of 52%.These findings highlight the framework’s potential for reliable and energy-efficient data aggregation mobile UWSNs.展开更多
The issue of strong noise has increasingly become a bottleneck restricting the precision and application space of electromagnetic exploration methods.Noise suppression and extraction of effective electromagnetic respo...The issue of strong noise has increasingly become a bottleneck restricting the precision and application space of electromagnetic exploration methods.Noise suppression and extraction of effective electromagnetic response information under a strong noise background is a crucial scientific task to be addressed.To solve the noise suppression problem of the controlled-source electromagnetic method in strong interference areas,we propose an approach based on complex-plane 2D k-means clustering for data processing.Based on the stability of the controlled-source signal response,clustering analysis is applied to classify the spectra of different sources and noises in multiple time segments.By identifying the power spectra with controlled-source characteristics,it helps to improve the quality of the controlled-source response extraction.This paper presents the principle and workflow of the proposed algorithm,and demonstrates feasibility and effectiveness of the new algorithm through synthetic and real data examples.The results show that,compared with the conventional Robust denoising method,the clustering algorithm has a stronger suppression effect on common noise,can identify high-quality signals,and improve the preprocessing data quality of the controlledsource electromagnetic method.展开更多
The increasing prevalence of multi-view data has made multi-view clustering a crucial technique for discovering latent structures from heterogeneous representations.However,traditional fuzzy clustering algorithms show...The increasing prevalence of multi-view data has made multi-view clustering a crucial technique for discovering latent structures from heterogeneous representations.However,traditional fuzzy clustering algorithms show limitations with the inherent uncertainty and imprecision of such data,as they rely on a single-dimensional membership value.To overcome these limitations,we propose an auto-weighted multi-view neutrosophic fuzzy clustering(AW-MVNFC)algorithm.Our method leverages the neutrosophic framework,an extension of fuzzy sets,to explicitly model imprecision and ambiguity through three membership degrees.The core novelty of AWMVNFC lies in a hierarchical weighting strategy that adaptively learns the contributions of both individual data views and the importance of each feature within a view.Through a unified objective function,AW-MVNFC jointly optimizes the neutrosophic membership assignments,cluster centers,and the distributions of view and feature weights.Comprehensive experiments conducted on synthetic and real-world datasets demonstrate that our algorithm achieves more accurate and stable clustering than existing methods,demonstrating its effectiveness in handling the complexities of multi-view data.展开更多
In order to solve the problems of short network lifetime and high data transmission delay in data gathering for wireless sensor network(WSN)caused by uneven energy consumption among nodes,a hybrid energy efficient clu...In order to solve the problems of short network lifetime and high data transmission delay in data gathering for wireless sensor network(WSN)caused by uneven energy consumption among nodes,a hybrid energy efficient clustering routing base on firefly and pigeon-inspired algorithm(FF-PIA)is proposed to optimise the data transmission path.After having obtained the optimal number of cluster head node(CH),its result might be taken as the basis of producing the initial population of FF-PIA algorithm.The L′evy flight mechanism and adaptive inertia weighting are employed in the algorithm iteration to balance the contradiction between the global search and the local search.Moreover,a Gaussian perturbation strategy is applied to update the optimal solution,ensuring the algorithm can jump out of the local optimal solution.And,in the WSN data gathering,a onedimensional signal reconstruction algorithm model is developed by dilated convolution and residual neural networks(DCRNN).We conducted experiments on the National Oceanic and Atmospheric Administration(NOAA)dataset.It shows that the DCRNN modeldriven data reconstruction algorithm improves the reconstruction accuracy as well as the reconstruction time performance.FF-PIA and DCRNN clustering routing co-simulation reveals that the proposed algorithm can effectively improve the performance in extending the network lifetime and reducing data transmission delay.展开更多
Data clustering is an essential technique for analyzing complex datasets and continues to be a central research topic in data analysis.Traditional clustering algorithms,such as K-means,are widely used due to their sim...Data clustering is an essential technique for analyzing complex datasets and continues to be a central research topic in data analysis.Traditional clustering algorithms,such as K-means,are widely used due to their simplicity and efficiency.This paper proposes a novel Spiral Mechanism-Optimized Phasmatodea Population Evolution Algorithm(SPPE)to improve clustering performance.The SPPE algorithm introduces several enhancements to the standard Phasmatodea Population Evolution(PPE)algorithm.Firstly,a Variable Neighborhood Search(VNS)factor is incorporated to strengthen the local search capability and foster population diversity.Secondly,a position update model,incorporating a spiral mechanism,is designed to improve the algorithm’s global exploration and convergence speed.Finally,a dynamic balancing factor,guided by fitness values,adjusts the search process to balance exploration and exploitation effectively.The performance of SPPE is first validated on CEC2013 benchmark functions,where it demonstrates excellent convergence speed and superior optimization results compared to several state-of-the-art metaheuristic algorithms.To further verify its practical applicability,SPPE is combined with the K-means algorithm for data clustering and tested on seven datasets.Experimental results show that SPPE-K-means improves clustering accuracy,reduces dependency on initialization,and outperforms other clustering approaches.This study highlights SPPE’s robustness and efficiency in solving both optimization and clustering challenges,making it a promising tool for complex data analysis tasks.展开更多
The distillation process is an important chemical process,and the application of data-driven modelling approach has the potential to reduce model complexity compared to mechanistic modelling,thus improving the efficie...The distillation process is an important chemical process,and the application of data-driven modelling approach has the potential to reduce model complexity compared to mechanistic modelling,thus improving the efficiency of process optimization or monitoring studies.However,the distillation process is highly nonlinear and has multiple uncertainty perturbation intervals,which brings challenges to accurate data-driven modelling of distillation processes.This paper proposes a systematic data-driven modelling framework to solve these problems.Firstly,data segment variance was introduced into the K-means algorithm to form K-means data interval(KMDI)clustering in order to cluster the data into perturbed and steady state intervals for steady-state data extraction.Secondly,maximal information coefficient(MIC)was employed to calculate the nonlinear correlation between variables for removing redundant features.Finally,extreme gradient boosting(XGBoost)was integrated as the basic learner into adaptive boosting(AdaBoost)with the error threshold(ET)set to improve weights update strategy to construct the new integrated learning algorithm,XGBoost-AdaBoost-ET.The superiority of the proposed framework is verified by applying this data-driven modelling framework to a real industrial process of propylene distillation.展开更多
Finding clusters based on density represents a significant class of clustering algorithms.These methods can discover clusters of various shapes and sizes.The most studied algorithm in this class is theDensity-Based Sp...Finding clusters based on density represents a significant class of clustering algorithms.These methods can discover clusters of various shapes and sizes.The most studied algorithm in this class is theDensity-Based Spatial Clustering of Applications with Noise(DBSCAN).It identifies clusters by grouping the densely connected objects into one group and discarding the noise objects.It requires two input parameters:epsilon(fixed neighborhood radius)and MinPts(the lowest number of objects in epsilon).However,it can’t handle clusters of various densities since it uses a global value for epsilon.This article proposes an adaptation of the DBSCAN method so it can discover clusters of varied densities besides reducing the required number of input parameters to only one.Only user input in the proposed method is the MinPts.Epsilon on the other hand,is computed automatically based on statistical information of the dataset.The proposed method finds the core distance for each object in the dataset,takes the average of these distances as the first value of epsilon,and finds the clusters satisfying this density level.The remaining unclustered objects will be clustered using a new value of epsilon that equals the average core distances of unclustered objects.This process continues until all objects have been clustered or the remaining unclustered objects are less than 0.006 of the dataset’s size.The proposed method requires MinPts only as an input parameter because epsilon is computed from data.Benchmark datasets were used to evaluate the effectiveness of the proposed method that produced promising results.Practical experiments demonstrate that the outstanding ability of the proposed method to detect clusters of different densities even if there is no separation between them.The accuracy of the method ranges from 92%to 100%for the experimented datasets.展开更多
Cluster analysis is a crucial technique in unsupervised machine learning,pattern recognition,and data analysis.However,current clustering algorithms suffer from the need for manual determination of parameter values,lo...Cluster analysis is a crucial technique in unsupervised machine learning,pattern recognition,and data analysis.However,current clustering algorithms suffer from the need for manual determination of parameter values,low accuracy,and inconsistent performance concerning data size and structure.To address these challenges,a novel clustering algorithm called the fully automated density-based clustering method(FADBC)is proposed.The FADBC method consists of two stages:parameter selection and cluster extraction.In the first stage,a proposed method extracts optimal parameters for the dataset,including the epsilon size and a minimum number of points thresholds.These parameters are then used in a density-based technique to scan each point in the dataset and evaluate neighborhood densities to find clusters.The proposed method was evaluated on different benchmark datasets andmetrics,and the experimental results demonstrate its competitive performance without requiring manual inputs.The results show that the FADBC method outperforms well-known clustering methods such as the agglomerative hierarchical method,k-means,spectral clustering,DBSCAN,FCDCSD,Gaussian mixtures,and density-based spatial clustering methods.It can handle any kind of data set well and perform excellently.展开更多
We propose a new clustering algorithm that assists the researchers to quickly and accurately analyze data. We call this algorithm Combined Density-based and Constraint-based Algorithm (CDC). CDC consists of two phases...We propose a new clustering algorithm that assists the researchers to quickly and accurately analyze data. We call this algorithm Combined Density-based and Constraint-based Algorithm (CDC). CDC consists of two phases. In the first phase, CDC employs the idea of density-based clustering algorithm to split the original data into a number of fragmented clusters. At the same time, CDC cuts off the noises and outliers. In the second phase, CDC employs the concept of K-means clustering algorithm to select a greater cluster to be the center. Then, the greater cluster merges some smaller clusters which satisfy some constraint rules. Due to the merged clusters around the center cluster, the clustering results show high accuracy. Moreover, CDC reduces the calculations and speeds up the clustering process. In this paper, the accuracy of CDC is evaluated and compared with those of K-means, hierarchical clustering, and the genetic clustering algorithm (GCA) proposed in 2004. Experimental results show that CDC has better performance.展开更多
Encephalitis is a brain inflammation disease.Encephalitis can yield to seizures,motor disability,or some loss of vision or hearing.Sometimes,encepha-litis can be a life-threatening and proper diagnosis in an early stag...Encephalitis is a brain inflammation disease.Encephalitis can yield to seizures,motor disability,or some loss of vision or hearing.Sometimes,encepha-litis can be a life-threatening and proper diagnosis in an early stage is very crucial.Therefore,in this paper,we are proposing a deep learning model for computerized detection of Encephalitis from the electroencephalogram data(EEG).Also,we propose a Density-Based Clustering model to classify the distinctive waves of Encephalitis.Customary clustering models usually employ a computed single centroid virtual point to define the cluster configuration,but this single point does not contain adequate information.To precisely extract accurate inner structural data,a multiple centroids approach is employed and defined in this paper,which defines the cluster configuration by allocating weights to each state in the cluster.The multiple EEG view fuzzy learning approach incorporates data from every sin-gle view to enhance the model's clustering performance.Also a fuzzy Density-Based Clustering model with multiple centroids(FDBC)is presented.This model employs multiple real state centroids to define clusters using Partitioning Around Centroids algorithm.The Experimental results validate the medical importance of the proposed clustering model.展开更多
A new algorithm for clustering multiple data streams is proposed.The algorithm can effectively cluster data streams which show similar behavior with some unknown time delays.The algorithm uses the autoregressive (AR...A new algorithm for clustering multiple data streams is proposed.The algorithm can effectively cluster data streams which show similar behavior with some unknown time delays.The algorithm uses the autoregressive (AR) modeling technique to measure correlations between data streams.It exploits estimated frequencies spectra to extract the essential features of streams.Each stream is represented as the sum of spectral components and the correlation is measured component-wise.Each spectral component is described by four parameters,namely,amplitude,phase,damping rate and frequency.The ε-lag-correlation between two spectral components is calculated.The algorithm uses such information as similarity measures in clustering data streams.Based on a sliding window model,the algorithm can continuously report the most recent clustering results and adjust the number of clusters.Experiments on real and synthetic streams show that the proposed clustering method has a higher speed and clustering quality than other similar methods.展开更多
Clustering, in data mining, is a useful technique for discovering interesting data distributions and patterns in the underlying data, and has many application fields, such as statistical data analysis, pattern recogni...Clustering, in data mining, is a useful technique for discovering interesting data distributions and patterns in the underlying data, and has many application fields, such as statistical data analysis, pattern recognition, image processing, and etc. We combine sampling technique with DBSCAN algorithm to cluster large spatial databases, and two sampling based DBSCAN (SDBSCAN) algorithms are developed. One algorithm introduces sampling technique inside DBSCAN, and the other uses sampling procedure outside DBSCAN. Experimental results demonstrate that our algorithms are effective and efficient in clustering large scale spatial databases.展开更多
In this paper, a cardinality compensation method based on Information-weighted Consensus Filter(ICF) using data clustering is proposed in order to accurately estimate the cardinality of the Cardinalized Probability Hy...In this paper, a cardinality compensation method based on Information-weighted Consensus Filter(ICF) using data clustering is proposed in order to accurately estimate the cardinality of the Cardinalized Probability Hypothesis Density(CPHD) filter. Although the joint propagation of the intensity and the cardinality distribution in the CPHD filter process allows for more reliable estimation of the cardinality(target number) than the PHD filter, tracking loss may occur when noise and clutter are high in the measurements in a practical situation. For that reason, the cardinality compensation process is included in the CPHD filter, which is based on information fusion step using estimated cardinality obtained from the CPHD filter and measured cardinality obtained through data clustering. Here, the ICF is used for information fusion. To verify the performance of the proposed method, simulations were carried out and it was confirmed that the tracking performance of the multi-target was improved because the cardinality was estimated more accurately as compared to the existing techniques.展开更多
文摘In recent years,there has been a concerted effort to improve anomaly detection tech-niques,particularly in the context of high-dimensional,distributed clinical data.Analysing patient data within clinical settings reveals a pronounced focus on refining diagnostic accuracy,personalising treatment plans,and optimising resource allocation to enhance clinical outcomes.Nonetheless,this domain faces unique challenges,such as irregular data collection,inconsistent data quality,and patient-specific structural variations.This paper proposed a novel hybrid approach that integrates heuristic and stochastic methods for anomaly detection in patient clinical data to address these challenges.The strategy combines HPO-based optimal Density-Based Spatial Clustering of Applications with Noise for clustering patient exercise data,facilitating efficient anomaly identification.Subsequently,a stochastic method based on the Interquartile Range filters unreliable data points,ensuring that medical tools and professionals receive only the most pertinent and accurate information.The primary objective of this study is to equip healthcare pro-fessionals and researchers with a robust tool for managing extensive,high-dimensional clinical datasets,enabling effective isolation and removal of aberrant data points.Furthermore,a sophisticated regression model has been developed using Automated Machine Learning(AutoML)to assess the impact of the ensemble abnormal pattern detection approach.Various statistical error estimation techniques validate the efficacy of the hybrid approach alongside AutoML.Experimental results show that implementing this innovative hybrid model on patient rehabilitation data leads to a notable enhance-ment in AutoML performance,with an average improvement of 0.041 in the R2 score,surpassing the effectiveness of traditional regression models.
文摘Clustering is used to gain an intuition of the struc tures in the data.Most of the current clustering algorithms pro duce a clustering structure even on data that do not possess such structure.In these cases,the algorithms force a structure in the data instead of discovering one.To avoid false structures in the relations of data,a novel clusterability assessment method called density-based clusterability measure is proposed in this paper.I measures the prominence of clustering structure in the data to evaluate whether a cluster analysis could produce a meaningfu insight to the relationships in the data.This is especially useful in time-series data since visualizing the structure in time-series data is hard.The performance of the clusterability measure is evalu ated against several synthetic data sets and time-series data sets which illustrate that the density-based clusterability measure can successfully indicate clustering structure of time-series data.
文摘Clustering evolving data streams is important to be performed in a limited time with a reasonable quality. The existing micro clustering based methods do not consider the distribution of data points inside the micro cluster. We propose LeaDen-Stream (Leader Density-based clustering algorithm over evolving data Stream), a density-based clustering algorithm using leader clustering. The algorithm is based on a two-phase clustering. The online phase selects the proper mini-micro or micro-cluster leaders based on the distribution of data points in the micro clusters. Then, the leader centers are sent to the offline phase to form final clusters. In LeaDen-Stream, by carefully choosing between two kinds of micro leaders, we decrease time complexity of the clustering while maintaining the cluster quality. A pruning strategy is also used to filter out real data from noise by introducing dense and sparse mini-micro and micro-cluster leaders. Our performance study over a number of real and synthetic data sets demonstrates the effectiveness and efficiency of our method.
基金supported by the National Natural Science Foundation of China(Grant Nos.52069029,52369026)the Belt and Road Special Foundation of National Key Laboratory of Water Disaster Preven-tion(Grant No.2023490411)+2 种基金the Yunnan Agricultural Basic Research Joint Special General Project(Grant Nos.202501BD070001-060,202401BD070001-071)Construction Project of the Yunnan Key Laboratory of Water Security(No.20254916CE340051)the Youth Talent Project of“Xingdian Talent Support Plan”in Yunnan Province(Grant No.XDYC-QNRC-2023-0412).
文摘Deformation prediction for extra-high arch dams is highly important for ensuring their safe operation.To address the challenges of complex monitoring data,the uneven spatial distribution of deformation,and the construction and optimization of a prediction model for deformation prediction,a multipoint ultrahigh arch dam deformation prediction model,namely,the CEEMDAN-KPCA-GSWOA-KELM,which is based on a clustering partition,is pro-posed.First,the monitoring data are preprocessed via variational mode decomposition(VMD)and wavelet denoising(WT),which effectively filters out noise and improves the signal-to-noise ratio of the data,providing high-quality input data for subsequent prediction models.Second,scientific cluster partitioning is performed via the K-means++algorithm to precisely capture the spatial distribution characteristics of extra-high arch dams and ensure the consistency of deformation trends at measurement points within each partition.Finally,CEEMDAN is used to separate monitoring data,predict and analyze each component,combine the KPCA(Kernel Principal Component Analysis)and the KELM(Kernel Extreme Learning Machine)optimized by the GSWOA(Global Search Whale Optimization Algorithm),integrate the predictions of each component via reconstruction methods,and precisely predict the overall trend of ultrahigh arch dam deformation.An extra high arch dam project is taken as an example and validated via a comparative analysis of multiple models.The results show that the multipoint deformation prediction model in this paper can combine data from different measurement points,achieve a comprehensive,precise prediction of the deformation situation of extra high arch dams,and provide strong technical support for safe operation.
基金funded by the Research Project:THTETN.05/24-25,VietnamAcademy of Science and Technology.
文摘Multi-view clustering is a critical research area in computer science aimed at effectively extracting meaningful patterns from complex,high-dimensional data that single-view methods cannot capture.Traditional fuzzy clustering techniques,such as Fuzzy C-Means(FCM),face significant challenges in handling uncertainty and the dependencies between different views.To overcome these limitations,we introduce a new multi-view fuzzy clustering approach that integrates picture fuzzy sets with a dual-anchor graph method for multi-view data,aiming to enhance clustering accuracy and robustness,termed Multi-view Picture Fuzzy Clustering(MPFC).In particular,the picture fuzzy set theory extends the capability to represent uncertainty by modeling three membership levels:membership degrees,neutral degrees,and refusal degrees.This allows for a more flexible representation of uncertain and conflicting data than traditional fuzzy models.Meanwhile,dual-anchor graphs exploit the similarity relationships between data points and integrate information across views.This combination improves stability,scalability,and robustness when handling noisy and heterogeneous data.Experimental results on several benchmark datasets demonstrate significant improvements in clustering accuracy and efficiency,outperforming traditional methods.Specifically,the MPFC algorithm demonstrates outstanding clustering performance on a variety of datasets,attaining a Purity(PUR)score of 0.6440 and an Accuracy(ACC)score of 0.6213 for the 3 Sources dataset,underscoring its robustness and efficiency.The proposed approach significantly contributes to fields such as pattern recognition,multi-view relational data analysis,and large-scale clustering problems.Future work will focus on extending the method for semi-supervised multi-view clustering,aiming to enhance adaptability,scalability,and performance in real-world applications.
基金supported in part by NIH grants R01NS39600,U01MH114829RF1MH128693(to GAA)。
文摘Many fields,such as neuroscience,are experiencing the vast prolife ration of cellular data,underscoring the need fo r organizing and interpreting large datasets.A popular approach partitions data into manageable subsets via hierarchical clustering,but objective methods to determine the appropriate classification granularity are missing.We recently introduced a technique to systematically identify when to stop subdividing clusters based on the fundamental principle that cells must differ more between than within clusters.Here we present the corresponding protocol to classify cellular datasets by combining datadriven unsupervised hierarchical clustering with statistical testing.These general-purpose functions are applicable to any cellular dataset that can be organized as two-dimensional matrices of numerical values,including molecula r,physiological,and anatomical datasets.We demonstrate the protocol using cellular data from the Janelia MouseLight project to chara cterize morphological aspects of neurons.
基金supported by the Deanship of Research at the King Fahd University of Petroleum&Minerals,Dhahran,31261,Saudi Arabia,under Project No.EC241001.
文摘Various factors,including weak tie-lines into the electric power system(EPS)networks,can lead to low-frequency oscillations(LFOs),which are considered an instant,non-threatening situation,but slow-acting and poisonous.Considering the challenge mentioned,this article proposes a clustering-based machine learning(ML)framework to enhance the stability of EPS networks by suppressing LFOs through real-time tuning of key power system stabilizer(PSS)parameters.To validate the proposed strategy,two distinct EPS networks are selected:the single-machine infinite-bus(SMIB)with a single-stage PSS and the unified power flow controller(UPFC)coordinated SMIB with a double-stage PSS.To generate data under various loading conditions for both networks,an efficient but offline meta-heuristic algorithm,namely the grey wolf optimizer(GWO),is used,with the loading conditions as inputs and the key PSS parameters as outputs.The generated loading conditions are then clustered using the fuzzy k-means(FKM)clustering method.Finally,the group method of data handling(GMDH)and long short-term memory(LSTM)ML models are developed for clustered data to predict PSS key parameters in real time for any loading condition.A few well-known statistical performance indices(SPI)are considered for validation and robustness of the training and testing procedure of the developed FKM-GMDH and FKM-LSTM models based on the prediction of PSS parameters.The performance of the ML models is also evaluated using three stability indices(i.e.,minimum damping ratio,eigenvalues,and time-domain simulations)after optimally tuned PSS with real-time estimated parameters under changing operating conditions.Besides,the outputs of the offline(GWO-based)metaheuristic model,proposed real-time(FKM-GMDH and FKM-LSTM)machine learning models,and previously reported literature models are compared.According to the results,the proposed methodology outperforms the others in enhancing the stability of the selected EPS networks by damping out the observed unwanted LFOs under various loading conditions.
基金funded by the Deanship of Scientific Research,the Vice Presidency for Graduate Studies and Scientific Research,King Faisal University,Saudi Arabia under the project(KFU250420).
文摘Underwater wireless sensor networks(UWSNs)rely on data aggregation to streamline routing operations by merging information at intermediate nodes before transmitting it to the sink.However,many existing data aggregation techniques are designed exclusively for static networks and fail to reflect the dynamic nature of underwater environments.Additionally,conventional multi-hop data gathering techniques often lead to energy depletion problems near the sink,commonly known as the energy hole issue.Moreover,cluster-based aggregation methods face significant challenges such as cluster head(CH)failures and collisions within clusters that degrade overall network performance.To address these limitations,this paper introduces an innovative framework,the Cluster-based Data Aggregation using Fuzzy Decision Model(CDAFDM),tailored for mobile UWSNs.The proposed method has four main phases:clustering,CH selection,data aggregation,and re-clustering.During CH selection,a fuzzy decision model is utilized to ensure efficient cluster head selection based on parameters such as residual energy,distance to the sink,and data delivery likelihood,enhancing network stability and energy efficiency.In the aggregation phase,CHs transmit a single,consolidated set of non-redundant data to the base station(BS),thereby reducing data duplication and saving energy.To adapt to the changing network topology,the re-clustering phase periodically updates cluster formations and reselects CHs.Simulation results show that CDAFDM outperforms current protocols such as CAPTAIN(Collection Algorithm for underwater oPTical-AcoustIc sensor Networks),EDDG(Event-Driven Data Gathering),and DCBMEC(Data Collection Based on Mobile Edge Computing)with a packet delivery ratio increase of up to 4%,an energy consumption reduction of 18%,and a data collection latency reduction of 52%.These findings highlight the framework’s potential for reliable and energy-efficient data aggregation mobile UWSNs.
基金supported by the National Key Research and Development Program Project of China(Grant No.2023YFF0718003)the key research and development plan project of Yunnan Province(Grant No.202303AA080006).
文摘The issue of strong noise has increasingly become a bottleneck restricting the precision and application space of electromagnetic exploration methods.Noise suppression and extraction of effective electromagnetic response information under a strong noise background is a crucial scientific task to be addressed.To solve the noise suppression problem of the controlled-source electromagnetic method in strong interference areas,we propose an approach based on complex-plane 2D k-means clustering for data processing.Based on the stability of the controlled-source signal response,clustering analysis is applied to classify the spectra of different sources and noises in multiple time segments.By identifying the power spectra with controlled-source characteristics,it helps to improve the quality of the controlled-source response extraction.This paper presents the principle and workflow of the proposed algorithm,and demonstrates feasibility and effectiveness of the new algorithm through synthetic and real data examples.The results show that,compared with the conventional Robust denoising method,the clustering algorithm has a stronger suppression effect on common noise,can identify high-quality signals,and improve the preprocessing data quality of the controlledsource electromagnetic method.
文摘The increasing prevalence of multi-view data has made multi-view clustering a crucial technique for discovering latent structures from heterogeneous representations.However,traditional fuzzy clustering algorithms show limitations with the inherent uncertainty and imprecision of such data,as they rely on a single-dimensional membership value.To overcome these limitations,we propose an auto-weighted multi-view neutrosophic fuzzy clustering(AW-MVNFC)algorithm.Our method leverages the neutrosophic framework,an extension of fuzzy sets,to explicitly model imprecision and ambiguity through three membership degrees.The core novelty of AWMVNFC lies in a hierarchical weighting strategy that adaptively learns the contributions of both individual data views and the importance of each feature within a view.Through a unified objective function,AW-MVNFC jointly optimizes the neutrosophic membership assignments,cluster centers,and the distributions of view and feature weights.Comprehensive experiments conducted on synthetic and real-world datasets demonstrate that our algorithm achieves more accurate and stable clustering than existing methods,demonstrating its effectiveness in handling the complexities of multi-view data.
基金partially supported by the National Natural Science Foundation of China(62161016)the Key Research and Development Project of Lanzhou Jiaotong University(ZDYF2304)+1 种基金the Beijing Engineering Research Center of Highvelocity Railway Broadband Mobile Communications(BHRC-2022-1)Beijing Jiaotong University。
文摘In order to solve the problems of short network lifetime and high data transmission delay in data gathering for wireless sensor network(WSN)caused by uneven energy consumption among nodes,a hybrid energy efficient clustering routing base on firefly and pigeon-inspired algorithm(FF-PIA)is proposed to optimise the data transmission path.After having obtained the optimal number of cluster head node(CH),its result might be taken as the basis of producing the initial population of FF-PIA algorithm.The L′evy flight mechanism and adaptive inertia weighting are employed in the algorithm iteration to balance the contradiction between the global search and the local search.Moreover,a Gaussian perturbation strategy is applied to update the optimal solution,ensuring the algorithm can jump out of the local optimal solution.And,in the WSN data gathering,a onedimensional signal reconstruction algorithm model is developed by dilated convolution and residual neural networks(DCRNN).We conducted experiments on the National Oceanic and Atmospheric Administration(NOAA)dataset.It shows that the DCRNN modeldriven data reconstruction algorithm improves the reconstruction accuracy as well as the reconstruction time performance.FF-PIA and DCRNN clustering routing co-simulation reveals that the proposed algorithm can effectively improve the performance in extending the network lifetime and reducing data transmission delay.
文摘Data clustering is an essential technique for analyzing complex datasets and continues to be a central research topic in data analysis.Traditional clustering algorithms,such as K-means,are widely used due to their simplicity and efficiency.This paper proposes a novel Spiral Mechanism-Optimized Phasmatodea Population Evolution Algorithm(SPPE)to improve clustering performance.The SPPE algorithm introduces several enhancements to the standard Phasmatodea Population Evolution(PPE)algorithm.Firstly,a Variable Neighborhood Search(VNS)factor is incorporated to strengthen the local search capability and foster population diversity.Secondly,a position update model,incorporating a spiral mechanism,is designed to improve the algorithm’s global exploration and convergence speed.Finally,a dynamic balancing factor,guided by fitness values,adjusts the search process to balance exploration and exploitation effectively.The performance of SPPE is first validated on CEC2013 benchmark functions,where it demonstrates excellent convergence speed and superior optimization results compared to several state-of-the-art metaheuristic algorithms.To further verify its practical applicability,SPPE is combined with the K-means algorithm for data clustering and tested on seven datasets.Experimental results show that SPPE-K-means improves clustering accuracy,reduces dependency on initialization,and outperforms other clustering approaches.This study highlights SPPE’s robustness and efficiency in solving both optimization and clustering challenges,making it a promising tool for complex data analysis tasks.
基金supported by the National Key Research and Development Program of China(2023YFB3307801)the National Natural Science Foundation of China(62394343,62373155,62073142)+3 种基金Major Science and Technology Project of Xinjiang(No.2022A01006-4)the Programme of Introducing Talents of Discipline to Universities(the 111 Project)under Grant B17017the Fundamental Research Funds for the Central Universities,Science Foundation of China University of Petroleum,Beijing(No.2462024YJRC011)the Open Research Project of the State Key Laboratory of Industrial Control Technology,China(Grant No.ICT2024B70).
文摘The distillation process is an important chemical process,and the application of data-driven modelling approach has the potential to reduce model complexity compared to mechanistic modelling,thus improving the efficiency of process optimization or monitoring studies.However,the distillation process is highly nonlinear and has multiple uncertainty perturbation intervals,which brings challenges to accurate data-driven modelling of distillation processes.This paper proposes a systematic data-driven modelling framework to solve these problems.Firstly,data segment variance was introduced into the K-means algorithm to form K-means data interval(KMDI)clustering in order to cluster the data into perturbed and steady state intervals for steady-state data extraction.Secondly,maximal information coefficient(MIC)was employed to calculate the nonlinear correlation between variables for removing redundant features.Finally,extreme gradient boosting(XGBoost)was integrated as the basic learner into adaptive boosting(AdaBoost)with the error threshold(ET)set to improve weights update strategy to construct the new integrated learning algorithm,XGBoost-AdaBoost-ET.The superiority of the proposed framework is verified by applying this data-driven modelling framework to a real industrial process of propylene distillation.
基金The author extends his appreciation to theDeputyship forResearch&Innovation,Ministry of Education in Saudi Arabia for funding this research work through the project number(IFPSAU-2021/01/17758).
文摘Finding clusters based on density represents a significant class of clustering algorithms.These methods can discover clusters of various shapes and sizes.The most studied algorithm in this class is theDensity-Based Spatial Clustering of Applications with Noise(DBSCAN).It identifies clusters by grouping the densely connected objects into one group and discarding the noise objects.It requires two input parameters:epsilon(fixed neighborhood radius)and MinPts(the lowest number of objects in epsilon).However,it can’t handle clusters of various densities since it uses a global value for epsilon.This article proposes an adaptation of the DBSCAN method so it can discover clusters of varied densities besides reducing the required number of input parameters to only one.Only user input in the proposed method is the MinPts.Epsilon on the other hand,is computed automatically based on statistical information of the dataset.The proposed method finds the core distance for each object in the dataset,takes the average of these distances as the first value of epsilon,and finds the clusters satisfying this density level.The remaining unclustered objects will be clustered using a new value of epsilon that equals the average core distances of unclustered objects.This process continues until all objects have been clustered or the remaining unclustered objects are less than 0.006 of the dataset’s size.The proposed method requires MinPts only as an input parameter because epsilon is computed from data.Benchmark datasets were used to evaluate the effectiveness of the proposed method that produced promising results.Practical experiments demonstrate that the outstanding ability of the proposed method to detect clusters of different densities even if there is no separation between them.The accuracy of the method ranges from 92%to 100%for the experimented datasets.
基金the Deanship of Scientific Research at Umm Al-Qura University,Grant Code:(23UQU4361009DSR001).
文摘Cluster analysis is a crucial technique in unsupervised machine learning,pattern recognition,and data analysis.However,current clustering algorithms suffer from the need for manual determination of parameter values,low accuracy,and inconsistent performance concerning data size and structure.To address these challenges,a novel clustering algorithm called the fully automated density-based clustering method(FADBC)is proposed.The FADBC method consists of two stages:parameter selection and cluster extraction.In the first stage,a proposed method extracts optimal parameters for the dataset,including the epsilon size and a minimum number of points thresholds.These parameters are then used in a density-based technique to scan each point in the dataset and evaluate neighborhood densities to find clusters.The proposed method was evaluated on different benchmark datasets andmetrics,and the experimental results demonstrate its competitive performance without requiring manual inputs.The results show that the FADBC method outperforms well-known clustering methods such as the agglomerative hierarchical method,k-means,spectral clustering,DBSCAN,FCDCSD,Gaussian mixtures,and density-based spatial clustering methods.It can handle any kind of data set well and perform excellently.
文摘We propose a new clustering algorithm that assists the researchers to quickly and accurately analyze data. We call this algorithm Combined Density-based and Constraint-based Algorithm (CDC). CDC consists of two phases. In the first phase, CDC employs the idea of density-based clustering algorithm to split the original data into a number of fragmented clusters. At the same time, CDC cuts off the noises and outliers. In the second phase, CDC employs the concept of K-means clustering algorithm to select a greater cluster to be the center. Then, the greater cluster merges some smaller clusters which satisfy some constraint rules. Due to the merged clusters around the center cluster, the clustering results show high accuracy. Moreover, CDC reduces the calculations and speeds up the clustering process. In this paper, the accuracy of CDC is evaluated and compared with those of K-means, hierarchical clustering, and the genetic clustering algorithm (GCA) proposed in 2004. Experimental results show that CDC has better performance.
基金funded by Princess Nourah bint Abdulrahman University Researchers Supporting Project number(PNURSP2022R113)Princess Nourah bint Abdulrahman University,Riyadh,Saudi Arabia.
文摘Encephalitis is a brain inflammation disease.Encephalitis can yield to seizures,motor disability,or some loss of vision or hearing.Sometimes,encepha-litis can be a life-threatening and proper diagnosis in an early stage is very crucial.Therefore,in this paper,we are proposing a deep learning model for computerized detection of Encephalitis from the electroencephalogram data(EEG).Also,we propose a Density-Based Clustering model to classify the distinctive waves of Encephalitis.Customary clustering models usually employ a computed single centroid virtual point to define the cluster configuration,but this single point does not contain adequate information.To precisely extract accurate inner structural data,a multiple centroids approach is employed and defined in this paper,which defines the cluster configuration by allocating weights to each state in the cluster.The multiple EEG view fuzzy learning approach incorporates data from every sin-gle view to enhance the model's clustering performance.Also a fuzzy Density-Based Clustering model with multiple centroids(FDBC)is presented.This model employs multiple real state centroids to define clusters using Partitioning Around Centroids algorithm.The Experimental results validate the medical importance of the proposed clustering model.
基金The National Natural Science Foundation of China(No.60673060)the Natural Science Foundation of Jiangsu Province(No.BK2005047)
文摘A new algorithm for clustering multiple data streams is proposed.The algorithm can effectively cluster data streams which show similar behavior with some unknown time delays.The algorithm uses the autoregressive (AR) modeling technique to measure correlations between data streams.It exploits estimated frequencies spectra to extract the essential features of streams.Each stream is represented as the sum of spectral components and the correlation is measured component-wise.Each spectral component is described by four parameters,namely,amplitude,phase,damping rate and frequency.The ε-lag-correlation between two spectral components is calculated.The algorithm uses such information as similarity measures in clustering data streams.Based on a sliding window model,the algorithm can continuously report the most recent clustering results and adjust the number of clusters.Experiments on real and synthetic streams show that the proposed clustering method has a higher speed and clustering quality than other similar methods.
基金Supported by the Open Researches Fund Program of L IESMARS(WKL(0 0 ) 0 30 2 )
文摘Clustering, in data mining, is a useful technique for discovering interesting data distributions and patterns in the underlying data, and has many application fields, such as statistical data analysis, pattern recognition, image processing, and etc. We combine sampling technique with DBSCAN algorithm to cluster large spatial databases, and two sampling based DBSCAN (SDBSCAN) algorithms are developed. One algorithm introduces sampling technique inside DBSCAN, and the other uses sampling procedure outside DBSCAN. Experimental results demonstrate that our algorithms are effective and efficient in clustering large scale spatial databases.
基金supported by the National GNSS Research Center Program of the Defense Acquisition Program Administration and Agency for Defense Developmentthe Ministry of Science and ICT of the Republic of Korea through the Space Core Technology Development Program (No. NRF2018M1A3A3A02065722)
文摘In this paper, a cardinality compensation method based on Information-weighted Consensus Filter(ICF) using data clustering is proposed in order to accurately estimate the cardinality of the Cardinalized Probability Hypothesis Density(CPHD) filter. Although the joint propagation of the intensity and the cardinality distribution in the CPHD filter process allows for more reliable estimation of the cardinality(target number) than the PHD filter, tracking loss may occur when noise and clutter are high in the measurements in a practical situation. For that reason, the cardinality compensation process is included in the CPHD filter, which is based on information fusion step using estimated cardinality obtained from the CPHD filter and measured cardinality obtained through data clustering. Here, the ICF is used for information fusion. To verify the performance of the proposed method, simulations were carried out and it was confirmed that the tracking performance of the multi-target was improved because the cardinality was estimated more accurately as compared to the existing techniques.