This paper proposes a novel hybrid fraud detection framework that integrates multi-stage feature selection, unsupervised clustering, and ensemble learning to improve classification performance in financial transaction monitoring systems. The framework is structured into three core layers: (1) feature selection using Recursive Feature Elimination (RFE), Principal Component Analysis (PCA), and Mutual Information (MI) to reduce dimensionality and enhance input relevance; (2) anomaly detection through unsupervised clustering using K-Means, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and Hierarchical Clustering to flag suspicious patterns in unlabeled data; and (3) final classification using a voting-based hybrid ensemble of Support Vector Machine (SVM), Random Forest (RF), and Gradient Boosting Classifier (GBC). The experimental evaluation is conducted on a synthetically generated dataset comprising one million financial transactions, with 5% labelled as fraudulent, simulating realistic fraud rates and behavioural features, including transaction time, origin, amount, and geo-location. The proposed model demonstrated a significant improvement over baseline classifiers, achieving an accuracy of 99%, a precision of 99%, a recall of 97%, and an F1-score of 99%. Compared to individual models, it yielded a 9% gain in overall detection accuracy and reduced the false positive rate to below 3.5%, thereby minimising the operational costs associated with manually reviewing false alerts. The model's interpretability is enhanced by the integration of Shapley Additive Explanations (SHAP) values for feature importance, supporting transparency and regulatory auditability. These results affirm the practical relevance of the proposed system for deployment in real-time fraud detection scenarios such as credit card transactions, mobile banking, and cross-border payments. The study also highlights future directions, including the deployment of lightweight models and the integration of multimodal data for scalable fraud analytics.
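The voting step of the third layer can be illustrated with a minimal sketch. The abstract does not say whether the ensemble uses hard or soft voting, so hard (majority) voting is assumed here; the function name `majority_vote` and the three prediction lists are hypothetical stand-ins for the outputs of the SVM, RF, and GBC models.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one label list by hard voting.

    `predictions` is a list of equally long label lists, one per model.
    """
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all models must predict the same number of samples")
    combined = []
    for votes in zip(*predictions):
        # Pick the label that most models agree on for this sample.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Hypothetical predictions (1 = fraud, 0 = legitimate) from three models.
svm_pred = [0, 1, 0, 0, 1]
rf_pred = [0, 1, 1, 0, 0]
gbc_pred = [0, 0, 1, 0, 1]

ensemble_pred = majority_vote([svm_pred, rf_pred, gbc_pred])
```

With three binary voters a tie cannot occur; an ensemble with an even number of models would need an explicit tie-breaking rule.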
Single-cell RNA sequencing (scRNA-seq) technology enables a deep understanding of cellular differentiation during plant development and reveals heterogeneity among the cells of a given tissue. However, the computational characterization of such cellular heterogeneity is complicated by the high dimensionality, sparsity, and biological noise inherent to the raw data. Here, we introduce PhytoCluster, an unsupervised deep learning algorithm, to cluster scRNA-seq data by extracting latent features. We benchmarked PhytoCluster on four simulated datasets and five real scRNA-seq datasets with varying protocols and data quality levels. A comprehensive evaluation indicated that PhytoCluster outperforms other methods in clustering accuracy, noise removal, and signal retention. Additionally, we evaluated the performance of the latent features extracted by PhytoCluster across four machine learning models. The computational results highlight the ability of PhytoCluster to extract meaningful information from plant scRNA-seq data, with machine learning models achieving accuracy comparable to that of raw features. We believe that PhytoCluster will be a valuable tool for disentangling complex cellular heterogeneity based on scRNA-seq data.
Applying domain knowledge in fuzzy clustering algorithms continuously promotes the development of clustering technology. However, combining domain knowledge with fuzzy clustering algorithms raises problems such as initialization sensitivity and information granule weight optimization. Therefore, we propose a weighted kernel fuzzy clustering algorithm based on a relative density view (RDVWKFC). Compared with traditional density-based methods, RDVWKFC can capture the intrinsic structure of the data more accurately, thus improving the initial quality of the clustering. By introducing a Relative Density based Knowledge Extraction Method (RDKM) and an adaptive weight optimization mechanism, we effectively address the limitations of view initialization and information granule weight optimization. RDKM can accurately identify high-density regions and optimize the initialization process. The adaptive weight mechanism reduces the interference of noise and outliers in the initial cluster centre selection by dynamically allocating weights. Experimental results on 14 benchmark datasets show that the proposed algorithm is superior to existing algorithms in terms of clustering accuracy, stability, and convergence speed. It shows adaptability and robustness, especially when dealing with different data distributions and noise interference. Moreover, RDVWKFC shows significant advantages when dealing with data with complex structures and high-dimensional features. These advancements provide versatile tools for real-world applications such as bioinformatics, image segmentation, and anomaly detection.
The increasing prevalence of multi-view data has made multi-view clustering a crucial technique for discovering latent structures from heterogeneous representations. However, traditional fuzzy clustering algorithms show limitations with the inherent uncertainty and imprecision of such data, as they rely on a single-dimensional membership value. To overcome these limitations, we propose an auto-weighted multi-view neutrosophic fuzzy clustering (AW-MVNFC) algorithm. Our method leverages the neutrosophic framework, an extension of fuzzy sets, to explicitly model imprecision and ambiguity through three membership degrees. The core novelty of AW-MVNFC lies in a hierarchical weighting strategy that adaptively learns both the contribution of each data view and the importance of each feature within a view. Through a unified objective function, AW-MVNFC jointly optimizes the neutrosophic membership assignments, cluster centers, and the distributions of view and feature weights. Comprehensive experiments conducted on synthetic and real-world datasets show that our algorithm achieves more accurate and stable clustering than existing methods, demonstrating its effectiveness in handling the complexities of multi-view data.
The distillation process is an important chemical process, and applying a data-driven modelling approach has the potential to reduce model complexity compared to mechanistic modelling, thus improving the efficiency of process optimization or monitoring studies. However, the distillation process is highly nonlinear and has multiple uncertainty perturbation intervals, which brings challenges to accurate data-driven modelling of distillation processes. This paper proposes a systematic data-driven modelling framework to solve these problems. Firstly, data segment variance was introduced into the K-means algorithm to form K-means data interval (KMDI) clustering, which clusters the data into perturbed and steady-state intervals for steady-state data extraction. Secondly, the maximal information coefficient (MIC) was employed to calculate the nonlinear correlation between variables in order to remove redundant features. Finally, extreme gradient boosting (XGBoost) was integrated as the base learner into adaptive boosting (AdaBoost), with an error threshold (ET) set to improve the weight update strategy, to construct a new ensemble learning algorithm, XGBoost-AdaBoost-ET. The superiority of the proposed framework is verified by applying it to a real industrial propylene distillation process.
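The idea of separating perturbed from steady-state intervals by segment variance can be sketched as follows. This is not the KMDI algorithm itself, whose details the abstract does not give; it is a simplified illustration that assumes fixed-length, non-overlapping segments and a plain one-dimensional two-cluster K-means on the per-segment variances. All names (`segment_variances`, `two_means_1d`) and the sample signal are hypothetical.

```python
from statistics import pvariance


def segment_variances(signal, seg_len):
    """Population variance of each non-overlapping segment of length seg_len."""
    return [
        pvariance(signal[i:i + seg_len])
        for i in range(0, len(signal) - seg_len + 1, seg_len)
    ]


def two_means_1d(values, iters=50):
    """Plain 1-D K-means with k=2; label 0 = low-variance, 1 = high-variance."""
    lo, hi = min(values), max(values)  # initial centres at the extremes
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        low = [v for v, lab in zip(values, labels) if lab == 0]
        high = [v for v, lab in zip(values, labels) if lab == 1]
        new_lo = sum(low) / len(low) if low else lo
        new_hi = sum(high) / len(high) if high else hi
        if (new_lo, new_hi) == (lo, hi):  # centres stopped moving
            break
        lo, hi = new_lo, new_hi
    return labels


# Hypothetical process variable: steady, perturbed, then steady again.
signal = [5.0, 5.1, 4.9, 5.0, 5.0, 8.0, 2.0, 9.0, 5.0, 5.0, 5.1, 4.9]
variances = segment_variances(signal, seg_len=4)
interval_labels = two_means_1d(variances)
```

Segments labelled 0 would be kept as steady-state data under these assumptions; a real implementation would also need to choose the segment length and handle signals with no perturbation at all.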
Low visibility conditions, particularly those caused by fog, significantly affect road safety and reduce drivers' ability to see ahead clearly. The conventional approaches used to address this problem primarily rely on instrument-based and fixed-threshold-based theoretical frameworks, which face challenges in adaptability and demonstrate lower performance under varying environmental conditions. To overcome these challenges, we propose a real-time visibility estimation model that leverages roadside CCTV cameras to monitor and identify visibility levels under different weather conditions. The proposed method begins by identifying specific regions of interest (ROI) in the CCTV images and focuses on extracting specific features such as the number of lines and contours detected within these regions. These features are then provided as input to the proposed hierarchical clustering model, which classifies them into different visibility levels without the need for predefined rules and threshold values. In the proposed approach, we used two different distance similarity metrics, namely dynamic time warping (DTW) and Euclidean distance, alongside the proposed hierarchical clustering model and recorded its performance on several evaluation measures. The proposed model achieved an average accuracy of 97.81%, precision of 91.31%, recall of 91.25%, and F1-score of 91.27% using the DTW distance metric. We also conducted experiments with other deep learning (DL)-based models used in the literature and compared their performance with that of the proposed model. The experimental results demonstrate that the proposed model is more adaptable and consistent than the methods used in the literature. The proposed method provides drivers with real-time and accurate visibility information and enhances road safety during low visibility conditions.
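Dynamic time warping, one of the two distance metrics mentioned above, can be computed with a short dynamic program. The sketch below is the textbook formulation with an absolute-difference local cost and no warping window; it is not taken from the paper, and the two sequences of per-frame line counts are hypothetical.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] also matched an earlier b element
                cost[i][j - 1],      # b[j-1] also matched an earlier a element
                cost[i - 1][j - 1],  # one-to-one step
            )
    return cost[n][m]


# Hypothetical per-frame line counts from two cameras; the second sequence
# is the same pattern stretched in time, so DTW aligns it at zero cost.
clear_a = [1, 2, 3]
clear_b = [1, 2, 2, 3]
stretched_cost = dtw_distance(clear_a, clear_b)
```

Unlike Euclidean distance, DTW accepts sequences of different lengths, which is one reason it is often paired with hierarchical clustering of time-ordered features.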
Multi-label learning deals with objects associated with multiple class labels, and aims to induce a predictive model which can assign a set of relevant class labels to an unseen instance. Since each class might possess its own characteristics, the strategy of extracting label-specific features has been widely employed to improve the discrimination process in multi-label learning, where the predictive model is induced based on tailored features specific to each class label instead of the identical instance representations. As a representative approach, LIFT generates label-specific features by conducting clustering analysis. However, its performance may be degraded due to the inherent instability of the single clustering algorithm. To improve this, a novel multi-label learning approach named SENCE (stable label-Specific features gENeration for multi-label learning via mixture-based Clustering Ensemble) is proposed, which stabilizes the generation process of label-specific features via clustering ensemble techniques. Specifically, more stable clustering results are obtained by first augmenting the original instance representation with cluster assignments from base clusters and then fitting a mixture model via the expectation-maximization (EM) algorithm. Extensive experiments on eighteen benchmark data sets show that SENCE performs better than LIFT and other well-established multi-label learning algorithms.
In order to solve the problem of indoor place recognition for indoor service robots, a novel algorithm, clustering of features and images (CFI), is proposed in this work. Unlike traditional indoor place recognition methods, which are based on kernels or bags of features with a large-margin classifier, CFI is based on feature matching, image similarity, and clustering of features and images. It establishes independent local feature clusters by feature cloud registration to represent each room, and defines an image distance to describe the similarity between images or feature clusters, which determines the label of query images. In addition, it improves recognition speed by image scaling, and uses state inertia and a hidden Markov model to constrain state transitions and eliminate unreasonable recognitions, achieving remarkable precision and speed. A series of experiments on standard databases shows that the algorithm achieves a recognition rate of up to 97% at a speed of over 30 fps, which is far superior to traditional methods. Its precision and speed demonstrate great discriminative power in complicated environments.
Stream morphology is an important indicator for revealing the geomorphological features and evolution of the Yangtze River. Existing studies on the morphology of the Yangtze River focus on planar features. However, the vertical features are also important, as they mainly control the flow ability and erosion intensity. Furthermore, traditional studies often focus on a few stream profiles in the Yangtze River. However, stream profiles are linked together by runoff nodes and thus jointly affect the geomorphological evolution of the Yangtze River. In this study, a clustering method for stream profiles in the Yangtze River is proposed by plotting all profiles together. Then, a stream evolution index is used to investigate the geomorphological features of the stream profile clusters to reveal the evolution of the Yangtze River. Based on the stream profile clusters, the erosion base of the Yangtze River generally changes from steep to gentle from the upper reaches to the lower reaches, and the evolution degree of the stream changes from low to high. The asymmetric distribution of knickpoints in the Hanshui River Basin supports the view that the boundary of the eastward growth of the Tibetan Plateau has reached the vicinity of the Daba Mountains.
Partition-based clustering with weighted features is developed in the framework of shadowed sets. The objects in the core and boundary regions, generated by shadowed sets-based clustering, have different impacts on the prototype of each cluster. By integrating feature weights, a formula for weight calculation is introduced into the clustering algorithm. The selection of the weight exponent is crucial for good results, and the weights are updated iteratively with each partition of clusters. The convergence of the weighted algorithms is established, and feasible cluster validity indices from data mining applications are utilized. Experimental results on both synthetic and real-life numerical data with different feature weights demonstrate that the weighted algorithm performs better than the unweighted algorithms.
In order to enable clustering to be done in a lower dimension, a new feature selection method for clustering is proposed. This method has three steps, all carried out in a wrapper framework. First, all the original features are ranked according to their importance; an evaluation function E(f) is introduced to evaluate the importance of a feature. Secondly, the set of important features is selected sequentially. Finally, possibly redundant features are removed from the important feature subset. Because the features are selected sequentially, it is not necessary to search through the large feature subset space, so efficiency is improved. Experimental results show that the method finds the set of important features for clustering and discards unimportant features or features that may hinder the clustering task.
Feature extraction from range images provided by a ranging sensor is a key issue in pattern recognition. To automatically extract the environmental features sensed by a 2D laser range scanner, an improved method based on genetic clustering (VGA-clustering) is presented. By integrating the spatial neighbouring information of range data into the fuzzy clustering algorithm, a weighted fuzzy clustering algorithm (WFCA) is introduced in place of the standard clustering algorithm to realize feature extraction for the laser scanner. Because the number of clusters is unknown in advance, several validation index functions are used to estimate the validity of different clustering algorithms, and one validation index is selected as the fitness function of the genetic algorithm so as to determine the number of clusters automatically. At the same time, an improved genetic algorithm, IVGA, based on VGA is proposed to address the local-optimum problem of the clustering algorithm; it increases the population diversity and improves the genetic operators of the elitist rule to enhance the local search capacity and speed up convergence. Comparison with other algorithms demonstrates the effectiveness of the proposed algorithm.
Feature selection methods have been successfully applied to text categorization but seldom applied to text clustering due to the unavailability of class label information. In this paper, a new feature selection method for text clustering based on expectation maximization and cluster validity is proposed. It applies a supervised feature selection method to the intermediate clustering results generated during iterative clustering in order to select features for text clustering; meanwhile, the Davies-Bouldin index is used to evaluate the intermediate feature subsets indirectly. Feature subsets are then selected according to the curve of the Davies-Bouldin index. Experiments are carried out on several popular datasets, and the results show the advantages of the proposed method.
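For readers unfamiliar with the Davies-Bouldin index used above as the cluster validity measure, the following minimal sketch computes it from labelled points, using Euclidean distance and the mean distance to the centroid as the within-cluster scatter. Lower values indicate more compact, better separated clusters. The function name and the toy points are hypothetical, and the sketch does not reproduce the paper's feature selection loop.

```python
import math


def davies_bouldin(points, labels):
    """Davies-Bouldin index for points (tuples) with integer cluster labels."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    # Centroid and mean point-to-centroid distance (scatter) per cluster.
    centroids = {
        lab: tuple(sum(coord) / len(pts) for coord in zip(*pts))
        for lab, pts in clusters.items()
    }
    scatter = {
        lab: sum(math.dist(p, centroids[lab]) for p in pts) / len(pts)
        for lab, pts in clusters.items()
    }

    total = 0.0
    for i in clusters:
        # Worst-case similarity of cluster i to any other cluster.
        # Coincident centroids would raise ZeroDivisionError here.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


# Toy data: two well separated vertical pairs of points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good_dbi = davies_bouldin(points, [0, 0, 1, 1])  # natural grouping
bad_dbi = davies_bouldin(points, [0, 1, 0, 1])   # grouping across the gap
```

Comparing the index across candidate feature subsets, as the abstract describes, amounts to re-clustering with each subset and preferring subsets where the index is low.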
Railway track work may contain several internal defects with different shapes and distribution rules, and these defects affect the safety of high-speed trains. Establishing reliable detection models and methods for these internal defects remains a challenging task. To address this challenge, this study proposes an intelligent detection method based on a generalization feature cluster for internal defects of railway tracks. First, the defects are classified and counted according to their shape and location features. Then, generalized features of the internal defects are extracted and formulated based on the maximum difference between different types of defects and the maximum tolerance among defects of the same type. Finally, the extracted generalized features are expressed by function constraints and formulated as generalization feature clusters to classify and identify internal defects in the railway track. Furthermore, to improve the detection reliability and speed, a reduced-dimension method for the generalization feature clusters is presented. Based on this reduced-dimension feature and strongly constrained generalized features, the K-means clustering algorithm is developed for defect clustering, and good clustering results are achieved. For defects in the rail head region, the clustering accuracy is over 95% and the Davies-Bouldin index (DBI) is negligible, which validates the proposed generalization features with strong constraints. Experimental results show that the accuracy of the proposed method based on generalization feature clusters is up to 97.55% and the average detection time is 0.12 s/frame, which indicates that it performs well in adaptability, accuracy, and detection speed under complex working environments. The proposed algorithm can effectively detect internal defects in railway tracks using an established generalization feature cluster model.
Feature optimization is important to agricultural text mining. Usually, the vector space model is used to represent text documents. However, this basic approach suffers from two drawbacks: the curse of dimensionality and the lack of semantic information. In this paper, a novel ontology-based feature optimization method for agricultural text is proposed. First, terms of the vector space model are mapped to concepts of an agricultural ontology, whose concept frequency weights are computed statistically from term frequency weights; second, concept similarity weights are assigned to the concept features according to the structure of the agricultural ontology. By combining feature frequency weights and feature similarity weights based on the agricultural ontology, the dimensionality of the feature space can be reduced drastically. Moreover, semantic information can be incorporated into this method. The results show that this feature optimization method yields a significant improvement in agricultural text clustering.
Nearly half of coal mine disasters in China have been found to occur in clusters or to be accompanied by nearby earthquakes, and all disaster types are involved. Stress disturbances seem to exist among mining areas and to be responsible for the observed clustering. The earthquakes accompanying coal mine disasters may be vital geophysical evidence of tectonic stress disturbances around mining areas. This paper analyzes all the possible causative factors to demonstrate the authenticity and reliability of the observed phenomena. A quantitative study was performed on the degree of clustering, and space-time distribution curves were obtained. Under a threshold of 100 km, 47% of disasters are involved in cluster series and 372 coal mine disasters are accompanied by earthquakes. The majority of cluster series, lasting 1-2 days, correspond well with nearby earthquakes and are speculated to be related to local stress disturbance, while the minority, lasting longer than 4 days, correspond well with fatal earthquakes and are speculated to be related to regional stress disturbance. The cluster series have multiple properties, such as area, distance, and related disasters, and these correspond well with the energy and magnitude of the earthquakes. This indicates that the cluster series of coal mine disasters and earthquakes are linked with fatal earthquakes and may serve as footprints of regional stress disturbance. Speculations relating to the geological model are made, and five disaster-causing models are examined. The findings are suggested to have broad scientific significance for earthquake research and disaster prevention.
Due to the widespread use of the Internet, customer information is vulnerable to attacks on computer systems, which creates an urgent need for intrusion detection technology. Recently, network intrusion detection has become one of the most important technologies in network security. The accuracy of network intrusion detection has reached a high level so far. However, existing methods have very low efficiency, even the most popular SOM neural network method. In this paper, an efficient and fast network intrusion detection method is proposed. First, the fundamentals of the two underlying methods are introduced. Then, the self-organizing feature map neural network based on K-means clustering (KSOM) algorithm is presented to improve the efficiency of network intrusion detection. Finally, NSL-KDD is used as the network intrusion data set to demonstrate that the KSOM method requires significantly fewer clustering iterations than the SOM method without substantially affecting the clustering results, and that its accuracy is much higher than that of the K-means method. The experimental results show that our method can improve the accuracy of network intrusion detection and significantly reduce the number of clustering iterations.
Feature selection is very important for obtaining meaningful and interpretable clustering results from a clustering analysis. In the application of soil data clustering, there is a lack of good understanding of how clustering performance responds to different feature subsets. In the present paper, we analyzed the performance differences between the k-means, fuzzy c-means, and spectral clustering algorithms on different feature subsets of soil data sets. The experimental results demonstrated that the spectral clustering algorithm generally performed better than k-means and fuzzy c-means across the feature subsets. Feature subsets containing environmental attributes improved clustering performance more than those containing spatial attributes and produced more accurate and meaningful clustering results. Our results demonstrated that combining the spectral clustering algorithm with feature subsets containing environmental attributes rather than spatial attributes may be a better choice in applications of soil data clustering.
Effective storage, processing, and analysis of power device condition monitoring data face enormous challenges. A framework based on the Aliyun DTplus platform is proposed that supports both MapReduce and Graph for massive monitoring data analysis at the same time. First, power device condition monitoring data storage based on MaxCompute tables and parallel permutation entropy feature extraction based on MaxCompute MapReduce are designed and implemented on the DTplus platform. Then, a Graph-based k-means algorithm is implemented and used for clustering analysis of massive condition monitoring data. Finally, performance tests are performed to compare the execution time of the serial and parallel programs. Performance is analyzed in terms of CPU core consumption, memory utilization, and parallel granularity. Experimental results show that the designed framework and parallel algorithms can efficiently process massive power device condition monitoring data.
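The permutation entropy feature mentioned above can be illustrated with a small serial sketch. It follows the standard ordinal-pattern definition, normalized to the range [0, 1], and does not use MaxCompute or MapReduce; the function name, the default `order` and `delay` values, and the sample series are assumptions made for illustration only.

```python
import math
from collections import Counter


def permutation_entropy(series, order=3, delay=1):
    """Normalized permutation entropy of a numeric sequence (0 = regular)."""
    n_windows = len(series) - (order - 1) * delay
    if n_windows <= 0:
        raise ValueError("series is too short for this order and delay")

    patterns = Counter()
    for start in range(n_windows):
        window = [series[start + k * delay] for k in range(order)]
        # Ordinal pattern: window positions sorted by value
        # (ties are broken by position because sorted() is stable).
        pattern = tuple(sorted(range(order), key=window.__getitem__))
        patterns[pattern] += 1

    entropy = sum(
        (count / n_windows) * math.log2(n_windows / count)
        for count in patterns.values()
    )
    # Divide by the maximum possible entropy, log2(order!).
    return entropy / math.log2(math.factorial(order))


# Hypothetical monitoring signals: a steadily rising one and an irregular one.
rising = list(range(10))
irregular = [4, 7, 9, 10, 6, 11, 3]
pe_rising = permutation_entropy(rising)
pe_irregular = permutation_entropy(irregular)
```

Because each window is processed independently, the pattern counting is a natural fit for a map step followed by a reduce step that sums the counts, which is presumably why the framework above parallelizes it; that mapping is an inference, not a detail given in the abstract.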
With the development of information technology, radio communication technology has made rapid progress. Many radio signals that have appeared in space are difficult to classify without manual labeling, so unsupervised radio signal clustering methods have become an urgent need. Meanwhile, the high complexity of deep learning makes it difficult to understand the decisions of clustering models, making interpretability analysis essential. This paper proposes a combined loss function for unsupervised clustering based on an autoencoder. The combined loss function includes a reconstruction loss and a deep clustering loss. The deep clustering loss is added to the reconstruction loss so that similar deep features converge more closely in feature space. In addition, a feature visualization method for signal clustering is proposed to analyze the interpretability of the autoencoder using saliency maps. Extensive experiments have been conducted on a modulated signal dataset, and the results indicate the superior performance of the proposed method over other clustering algorithms. In particular, for the simulated dataset containing six modulation modes, when the SNR is 20 dB, the clustering accuracy of the proposed method is greater than 78%. The interpretability analysis of the clustering model visualized the significant features of different modulated signals and verified the high separability of the features extracted by the clustering model.
Funding (hybrid fraud detection framework): This work was funded by the Deanship of Scientific Research, Vice Presidency for Graduate Studies and Scientific Research, King Faisal University, Saudi Arabia [Grant No. KFU241683].
Funding: Supported by the National Natural Science Foundation of China (32371996 and 62372158), the National Key R&D Program of China (2022YFF0711802), the STI 2030-Major Projects (2022ZD04017), and the National Key Research and Development Program of China (2019YFA0802202 and 2020YFA0803401).
Abstract: Single-cell RNA sequencing (scRNA-seq) technology enables a deep understanding of cellular differentiation during plant development and reveals heterogeneity among the cells of a given tissue. However, the computational characterization of such cellular heterogeneity is complicated by the high dimensionality, sparsity, and biological noise inherent to the raw data. Here, we introduce PhytoCluster, an unsupervised deep learning algorithm, to cluster scRNA-seq data by extracting latent features. We benchmarked PhytoCluster on four simulated datasets and five real scRNA-seq datasets with varying protocols and data quality levels. A comprehensive evaluation indicated that PhytoCluster outperforms other methods in clustering accuracy, noise removal, and signal retention. Additionally, we evaluated the performance of the latent features extracted by PhytoCluster across four machine learning models. The computational results highlight the ability of PhytoCluster to extract meaningful information from plant scRNA-seq data, with machine learning models achieving accuracy comparable to that of raw features. We believe that PhytoCluster will be a valuable tool for disentangling complex cellular heterogeneity based on scRNA-seq data.
Abstract: Applying domain knowledge in fuzzy clustering algorithms continuously promotes the development of clustering technology. However, combining domain knowledge with fuzzy clustering algorithms raises problems such as initialization sensitivity and information granule weight optimization. Therefore, we propose a weighted kernel fuzzy clustering algorithm based on a relative density view (RDVWKFC). Compared with traditional density-based methods, RDVWKFC captures the intrinsic structure of the data more accurately, thus improving the initial quality of the clustering. By introducing a Relative Density based Knowledge Extraction Method (RDKM) and an adaptive weight optimization mechanism, we address the limitations of view initialization and information granule weight optimization. RDKM can accurately identify high-density regions and optimize the initialization process. The adaptive weight mechanism reduces the interference of noise and outliers in the initial cluster centre selection by dynamically allocating weights. Experimental results on 14 benchmark datasets show that the proposed algorithm is superior to existing algorithms in terms of clustering accuracy, stability, and convergence speed. It shows adaptability and robustness, especially when dealing with different data distributions and noise interference. Moreover, RDVWKFC shows significant advantages when dealing with data with complex structures and high-dimensional features. These advancements provide versatile tools for real-world applications such as bioinformatics, image segmentation, and anomaly detection.
Abstract: The increasing prevalence of multi-view data has made multi-view clustering a crucial technique for discovering latent structures from heterogeneous representations. However, traditional fuzzy clustering algorithms show limitations with the inherent uncertainty and imprecision of such data, as they rely on a single-dimensional membership value. To overcome these limitations, we propose an auto-weighted multi-view neutrosophic fuzzy clustering (AW-MVNFC) algorithm. Our method leverages the neutrosophic framework, an extension of fuzzy sets, to explicitly model imprecision and ambiguity through three membership degrees. The core novelty of AW-MVNFC lies in a hierarchical weighting strategy that adaptively learns both the contribution of each data view and the importance of each feature within a view. Through a unified objective function, AW-MVNFC jointly optimizes the neutrosophic membership assignments, cluster centers, and the distributions of view and feature weights. Comprehensive experiments conducted on synthetic and real-world datasets show that our algorithm achieves more accurate and stable clustering than existing methods, demonstrating its effectiveness in handling the complexities of multi-view data.
Funding: Supported by the National Key Research and Development Program of China (2023YFB3307801), the National Natural Science Foundation of China (62394343, 62373155, 62073142), the Major Science and Technology Project of Xinjiang (No. 2022A01006-4), the Programme of Introducing Talents of Discipline to Universities (the 111 Project) under Grant B17017, the Fundamental Research Funds for the Central Universities, Science Foundation of China University of Petroleum, Beijing (No. 2462024YJRC011), and the Open Research Project of the State Key Laboratory of Industrial Control Technology, China (Grant No. ICT2024B70).
Abstract: Distillation is an important chemical process, and data-driven modelling has the potential to reduce model complexity compared to mechanistic modelling, thus improving the efficiency of process optimization and monitoring studies. However, the distillation process is highly nonlinear and has multiple uncertainty perturbation intervals, which makes accurate data-driven modelling challenging. This paper proposes a systematic data-driven modelling framework to solve these problems. Firstly, data segment variance was introduced into the K-means algorithm to form K-means data interval (KMDI) clustering, which clusters the data into perturbed and steady-state intervals for steady-state data extraction. Secondly, the maximal information coefficient (MIC) was employed to calculate the nonlinear correlation between variables in order to remove redundant features. Finally, extreme gradient boosting (XGBoost) was integrated as the base learner into adaptive boosting (AdaBoost), with an error threshold (ET) set to improve the weight update strategy, to construct a new ensemble learning algorithm, XGBoost-AdaBoost-ET. The superiority of the proposed framework is verified by applying it to a real industrial propylene distillation process.
Abstract: Low visibility conditions, particularly those caused by fog, significantly affect road safety and reduce drivers' ability to see ahead clearly. The conventional approaches used to address this problem primarily rely on instrument-based and fixed-threshold-based theoretical frameworks, which face challenges in adaptability and demonstrate lower performance under varying environmental conditions. To overcome these challenges, we propose a real-time visibility estimation model that leverages roadside CCTV cameras to monitor and identify visibility levels under different weather conditions. The proposed method begins by identifying specific regions of interest (ROI) in the CCTV images and focuses on extracting specific features such as the number of lines and contours detected within these regions. These features are then provided as input to the proposed hierarchical clustering model, which classifies them into different visibility levels without the need for predefined rules and threshold values. In the proposed approach, we used two different distance metrics, namely dynamic time warping (DTW) and Euclidean distance, with the proposed hierarchical clustering model and assessed its performance using several evaluation measures. The proposed model achieved an average accuracy of 97.81%, precision of 91.31%, recall of 91.25%, and F1-score of 91.27% using the DTW distance metric. We also conducted experiments with other deep learning (DL)-based models used in the literature and compared their performance with that of the proposed model. The experimental results demonstrate that the proposed model is more adaptable and consistent than the methods used in the literature. The proposed method provides drivers with real-time and accurate visibility information and enhances road safety during low visibility conditions.
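The two distance metrics named in the visibility abstract, dynamic time warping (DTW) and Euclidean distance, behave differently when one feature sequence lags another. The sketch below is a textbook DTW recurrence with an absolute-difference local cost, not the authors' implementation; the function names and the sample sequences are illustrative assumptions.

```python
import math


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping with |x - y| local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] repeats against b[j-1]
                cost[i][j - 1],      # b[j-1] repeats against a[i-1]
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


def euclidean_distance(a, b):
    """Point-by-point distance; requires sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical feature sequences: the same ramp, one delayed by a step.
early = [1, 2, 3, 3]
late = [1, 1, 2, 3]
print(dtw_distance(early, late))        # warping absorbs the delay
print(euclidean_distance(early, late))  # the delay counts as error
```

Because DTW may stretch either sequence along the time axis, the delayed ramp aligns at zero cost, whereas the Euclidean distance penalises every misaligned sample. This tolerance to timing shifts is one plausible reason a DTW-based clustering would be preferred for time-varying image features, although the abstract itself does not give the reason.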
Funding: This work was supported by the National Science Foundation of China (62176055) and the China University S&T Innovation Plan Guided by the Ministry of Education.
Abstract: Multi-label learning deals with objects associated with multiple class labels, and aims to induce a predictive model that can assign a set of relevant class labels to an unseen instance. Since each class might possess its own characteristics, the strategy of extracting label-specific features has been widely employed to improve the discrimination process in multi-label learning, where the predictive model is induced from tailored features specific to each class label instead of identical instance representations. As a representative approach, LIFT generates label-specific features by conducting clustering analysis. However, its performance may be degraded by the inherent instability of a single clustering algorithm. To address this, a novel multi-label learning approach named SENCE (stable label-Specific features gENeration for multi-label learning via mixture-based Clustering Ensemble) is proposed, which stabilizes the generation of label-specific features via clustering ensemble techniques. Specifically, more stable clustering results are obtained by first augmenting the original instance representation with cluster assignments from base clusters and then fitting a mixture model via the expectation-maximization (EM) algorithm. Extensive experiments on eighteen benchmark data sets show that SENCE performs better than LIFT and other well-established multi-label learning algorithms.
Funding: Supported by the National Natural Science Foundation of China (Nos. 61305103 and 61473103), the Natural Science Foundation of Heilongjiang Province (No. QC2014C072), the Postdoctoral Science Foundation of Heilongjiang (No. LBH-Z14108), and the Self-Planned Task of the State Key Laboratory of Robotics and System (HIT) (No. SKLRS201609B).
Abstract: To solve the problem of indoor place recognition for indoor service robots, a novel algorithm, clustering of features and images (CFI), is proposed in this work. Unlike traditional indoor place recognition methods, which are based on kernels or bags of features with a large-margin classifier, CFI is based on feature matching, image similarity, and clustering of features and images. It establishes independent local feature clusters by feature cloud registration to represent each room, and defines an image distance to describe the similarity between images or feature clusters, which determines the label of query images. In addition, it improves recognition speed by image scaling, and uses state inertia and a hidden Markov model to constrain state transitions and suppress implausible recognitions, achieving remarkable precision and speed. A series of experiments on standard databases shows that the algorithm achieves a recognition rate of up to 97% at a speed of over 30 fps, which is much superior to traditional methods. Its precision and speed demonstrate strong discriminative power in complicated environments.
Funding: National Natural Science Foundation of China, No. 41930102, No. 41971333.
Abstract: Stream morphology is an important indicator for revealing the geomorphological features and evolution of the Yangtze River. Existing studies on the morphology of the Yangtze River focus on planar features. However, vertical features are also important, because they mainly control flow capacity and erosion intensity. Furthermore, traditional studies often focus on only a few stream profiles in the Yangtze River, although stream profiles are linked together by runoff nodes and thus jointly affect the geomorphological evolution of the river. In this study, a clustering method for stream profiles in the Yangtze River is proposed by plotting all profiles together. Then, a stream evolution index is used to investigate the geomorphological features of the stream profile clusters to reveal the evolution of the Yangtze River. Based on the stream profile clusters, the erosion base of the Yangtze River generally changes from steep to gentle from the upper reaches to the lower reaches, and the evolution degree of the stream changes from low to high. The asymmetric distribution of knickpoints in the Hanshui River Basin supports the view that the boundary of the eastward growth of the Tibetan Plateau has reached the vicinity of the Daba Mountains.
Funding: Supported by the National Natural Science Foundation of China (61139002).
Abstract: Partition-based clustering with weighted features is developed in the framework of shadowed sets. The objects in the core and boundary regions, generated by shadowed sets-based clustering, have different impacts on the prototype of each cluster. By integrating feature weights, a formula for weight calculation is introduced into the clustering algorithm. The selection of the weight exponent is crucial for good results, and the weights are updated iteratively with each partition of clusters. A convergence analysis of the weighted algorithms is given, and feasible cluster validity indices from data mining applications are utilized. Experimental results on both synthetic and real-life numerical data with different feature weights demonstrate that the weighted algorithm is better than the unweighted algorithms.
Abstract: To enable clustering to be performed in a lower dimension, a new feature selection method for clustering is proposed. The method has three steps, all carried out in a wrapper framework. First, all the original features are ranked according to their importance, using an evaluation function E(f) introduced to evaluate the importance of a feature. Second, the set of important features is selected sequentially. Finally, possibly redundant features are removed from the important feature subset. Because the features are selected sequentially, it is not necessary to search through the large feature subset space, and efficiency is thus improved. Experimental results show that the method finds the set of important features for clustering and discards unimportant features and features that may hinder the clustering task.
Funding: Supported by the National Natural Science Foundation of China (60234030), the Natural Science Foundation of Henan Educational Committee of China (2007520019, 2008B520015), and the Doctoral Foundation of Henan Polytechnic University of China (B050901, B2008-61).
Abstract: Feature extraction from range images provided by a ranging sensor is a key issue in pattern recognition. To automatically extract the environmental features sensed by a 2D ranging sensor (a laser scanner), an improved method based on the genetic clustering method VGA-clustering is presented. By integrating the spatial neighbouring information of range data into the fuzzy clustering algorithm, a weighted fuzzy clustering algorithm (WFCA) is introduced in place of the standard clustering algorithm to realize feature extraction for the laser scanner. Because the number of clusters is not known in advance, several validation index functions are used to estimate the validity of different clustering algorithms, and one validation index is selected as the fitness function of the genetic algorithm so that the number of clusters is determined automatically. At the same time, an improved genetic algorithm, IVGA, based on VGA is proposed to address the local optimum problem of the clustering algorithm; it increases population diversity and improves the genetic operators of the elitist rule to enhance local search capacity and speed up convergence. Comparison with other algorithms demonstrates the effectiveness of the proposed algorithm.
Funding: Supported by the National Natural Science Foundation of China (60503020, 60373066), the Outstanding Young Scientist's Fund (60425206), the Natural Science Foundation of Jiangsu Province (BK2005060), and the Opening Foundation of Jiangsu Key Laboratory of Computer Information Processing Technology in Soochow University.
Abstract: Feature selection methods have been successfully applied to text categorization but seldom to text clustering, because class label information is unavailable. In this paper, a new feature selection method for text clustering based on expectation maximization and cluster validity is proposed. It applies a supervised feature selection method to the intermediate clustering results generated during iterative clustering in order to select features for text clustering; meanwhile, the Davies-Bouldin index is used to evaluate the intermediate feature subsets indirectly. Feature subsets are then selected according to the curve of the Davies-Bouldin index. Experiments are carried out on several popular datasets, and the results show the advantages of the proposed method.
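The Davies-Bouldin index used above as a cluster validity measure averages, over all clusters, the worst-case ratio of within-cluster scatter to between-centroid distance; lower values indicate tighter, better-separated clusters. The following is a minimal sketch of the standard definition, not the paper's code. The function names and the toy 2-D points are assumptions, and the implementation expects at least two non-empty clusters with distinct centroids.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[k] for p in points) / len(points) for k in range(dim))


def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of points).

    Assumes at least two non-empty clusters whose centroids are distinct.
    """
    cents = [centroid(pts) for pts in clusters]
    # Scatter: mean distance of each cluster's points to its own centroid.
    scatter = [
        sum(math.dist(p, c) for p in pts) / len(pts)
        for pts, c in zip(clusters, cents)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst (largest) similarity ratio of cluster i to any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in range(k)
            if j != i
        )
    return total / k


# Toy data: same within-cluster spread, different separation.
well_separated = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
close_together = [[(0, 0), (0, 2)], [(3, 0), (3, 2)]]
print(davies_bouldin(well_separated))
print(davies_bouldin(close_together))
```

Moving the two toy clusters closer together raises the index, which is the behaviour a selection procedure based on the index curve would rely on when comparing candidate feature subsets.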
Funding: National Natural Science Foundation of China (Grant No. 61573233), Guangdong Provincial Natural Science Foundation of China (Grant No. 2018A0303130188), Guangdong Provincial Science and Technology Special Funds Project of China (Grant No. 190805145540361), and Special Projects in Key Fields of Colleges and Universities in Guangdong Province of China (Grant No. 2020ZDZX2005).
Abstract: Railway tracks may contain several internal defects with different shapes and distribution rules, and these defects affect the safety of high-speed trains. Establishing reliable detection models and methods for these internal defects remains a challenging task. To address this challenge, this study proposes an intelligent detection method based on a generalization feature cluster for internal defects of railway tracks. First, the defects are classified and counted according to their shape and location features. Then, generalized features of the internal defects are extracted and formulated based on the maximum difference between different types of defects and the maximum tolerance among defects of the same type. Finally, the extracted generalized features are expressed by function constraints and formulated as generalization feature clusters to classify and identify internal defects in the railway track. Furthermore, to improve detection reliability and speed, a dimension reduction method for the generalization feature clusters is presented. Based on this reduced-dimension feature and the strongly constrained generalized features, the K-means clustering algorithm is applied to defect clustering, and good clustering results are achieved. For defects in the rail head region, the clustering accuracy is over 95% and the Davies-Bouldin index (DBI) is negligible, which supports the validity of the proposed generalization features with strong constraints. Experimental results show that the accuracy of the proposed method based on generalization feature clusters is up to 97.55% and the average detection time is 0.12 s/frame, indicating good adaptability, accuracy, and detection speed under complex working environments. The proposed algorithm can effectively detect internal defects in railway tracks using an established generalization feature cluster model.
Funding: Supported by the National Natural Science Foundation of China (60774096) and the National High-Tech R&D Program of China (2008BAK49B05).
Abstract: Feature optimization is important to agricultural text mining. Usually, the vector space model is used to represent text documents. However, this basic approach suffers from two drawbacks: the curse of dimensionality and the lack of semantic information. In this paper, a novel ontology-based feature optimization method for agricultural text is proposed. First, terms of the vector space model are mapped to concepts of an agricultural ontology, whose concept frequency weights are computed statistically from term frequency weights; second, concept similarity weights are assigned to the concept features according to the structure of the agricultural ontology. By combining feature frequency weights and feature similarity weights based on the agricultural ontology, the dimensionality of the feature space can be reduced drastically. Moreover, semantic information can be incorporated into this method. The results show that the feature optimization yields a significant improvement in agricultural text clustering.
Abstract: Nearly half of coal mine disasters in China have been found to occur in clusters or to be accompanied by nearby earthquakes, and all disaster types are involved. Stress disturbances seem to exist among mining areas and to be responsible for the observed clustering. The earthquakes accompanying coal mine disasters may be vital geophysical evidence for tectonic stress disturbances around mining areas. This paper analyzes all the possible causative factors to demonstrate the authenticity and reliability of the observed phenomena. A quantitative study was performed on the degree of clustering, and space-time distribution curves were obtained. Under a threshold of 100 km, 47% of disasters are involved in cluster series and 372 coal mine disasters are accompanied by earthquakes. The majority of cluster series, lasting 1-2 days, correspond well with nearby earthquakes and are speculated to be related to local stress disturbance, while the minority, lasting longer than 4 days, correspond well with fatal earthquakes and are speculated to be related to regional stress disturbance. The cluster series possess multiple properties, such as area, distance, and related disasters, and these correspond well with the energy and magnitude of the earthquakes. This indicates that the cluster series of coal mine disasters and earthquakes are linked with fatal earthquakes and may serve as footprints of regional stress disturbance. Speculations relating to the geological model are made, and five disaster-causing models are examined. The findings are suggested to have broad scientific significance for earthquake research and disaster prevention.
Abstract: Due to the widespread use of the Internet, customer information is vulnerable to attacks on computer systems, which creates an urgent need for intrusion detection technology. Recently, network intrusion detection has become one of the most important technologies in network security. Existing methods have achieved high detection accuracy. However, these methods, even the most popular SOM neural network method, have very low efficiency. In this paper, an efficient and fast network intrusion detection method is proposed. Firstly, the fundamentals of the two underlying methods are introduced. Then, a self-organizing feature map neural network based on K-means clustering (KSOM) is presented to improve the efficiency of network intrusion detection. Finally, the NSL-KDD data set is used as the network intrusion data set to demonstrate that the KSOM method significantly reduces the number of clustering iterations compared with the SOM method without substantially affecting the clustering results, and that its accuracy is much higher than that of the K-means method. The experimental results show that our method improves the accuracy of network intrusion detection and significantly reduces the number of clustering iterations.
Abstract: Feature selection is very important for obtaining meaningful and interpretable results from a clustering analysis. In soil data clustering, there is a lack of good understanding of how clustering performance responds to different feature subsets. In the present paper, we analyzed the performance differences between the k-means, fuzzy c-means, and spectral clustering algorithms under different feature subsets of soil data sets. The experimental results demonstrated that spectral clustering generally performed better than k-means and fuzzy c-means across the feature subsets. Feature subsets containing environmental attributes improved clustering performance more than those containing spatial attributes and produced more accurate and meaningful clustering results. Our results suggest that combining the spectral clustering algorithm with feature subsets containing environmental rather than spatial attributes may be a better choice in soil data clustering applications.
Funding: This work has been supported by the Central University Research Fund (No. 2016MS116, No. 2016MS117, No. 2018MS074) and the National Natural Science Foundation (51677072).
Abstract: The effective storage, processing, and analysis of power device condition monitoring data face enormous challenges. A framework based on the Aliyun DTplus platform is proposed that supports both MapReduce and Graph at the same time for massive monitoring data analysis. First, power device condition monitoring data storage based on MaxCompute tables and parallel permutation entropy feature extraction based on MaxCompute MapReduce are designed and implemented on the DTplus platform. Then, a Graph-based k-means algorithm is implemented and used for clustering analysis of massive condition monitoring data. Finally, performance tests are performed to compare the execution time of the serial and parallel programs. Performance is analyzed in terms of CPU core consumption, memory utilization, and parallel granularity. Experimental results show that the designed framework and parallel algorithms can efficiently process massive power device condition monitoring data.
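Permutation entropy, the feature extracted in the power device abstract, summarises a time series by the Shannon entropy of the ordinal patterns found in its sliding windows. The sketch below is a serial, single-machine illustration of that standard definition. It is not the MaxCompute MapReduce implementation described in the abstract, and the function name, parameters, and sample series are assumptions.

```python
import math
from collections import Counter


def permutation_entropy(series, order=3, delay=1):
    """Shannon entropy (in bits) of ordinal patterns of length `order`."""
    n_windows = len(series) - (order - 1) * delay
    if n_windows <= 0:
        raise ValueError("series is too short for the given order and delay")
    patterns = Counter()
    for start in range(n_windows):
        window = [series[start + k * delay] for k in range(order)]
        # Ordinal pattern: window positions sorted by their values.
        pattern = tuple(sorted(range(order), key=window.__getitem__))
        patterns[pattern] += 1
    # Sum p * log2(1 / p) over the observed patterns.
    return sum(
        (count / n_windows) * math.log2(n_windows / count)
        for count in patterns.values()
    )


# Hypothetical monitoring signals.
steady_ramp = [1, 2, 3, 4, 5, 6]
irregular = [4, 7, 9, 10, 6, 11, 3]
print(permutation_entropy(steady_ramp, order=3))  # a single pattern only
print(permutation_entropy(irregular, order=2))    # mixed rises and falls
```

Each window is processed independently of the others, which is what makes the feature a natural fit for a map step over data segments followed by a reduce step that aggregates pattern counts; how the paper actually partitions the work is not stated in the abstract.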
Funding: Supported in part by the National Natural Science Foundation of China (No. 62276206), the Key Research and Development Program of Shaanxi under Grant S2022-YF-YBGY-0921, the State Key Program of National Natural Science of China (No. 62231027), and the Science and Technology on Communication Information Security Control Laboratory.
Abstract: With the development of information technology, radio communication technology has made rapid progress. Many radio signals that have appeared in space are difficult to classify without manual labeling, so unsupervised radio signal clustering methods have become an urgent need. Meanwhile, the high complexity of deep learning makes it difficult to understand the decisions of clustering models, making interpretability analysis essential. This paper proposes a combined loss function for unsupervised clustering based on an autoencoder. The combined loss function includes a reconstruction loss and a deep clustering loss. The deep clustering loss is added to the reconstruction loss so that similar deep features converge more closely in feature space. In addition, a feature visualization method for signal clustering is proposed to analyze the interpretability of the autoencoder using saliency maps. Extensive experiments have been conducted on a modulated signal dataset, and the results indicate the superior performance of the proposed method over other clustering algorithms. In particular, for the simulated dataset containing six modulation modes, the clustering accuracy of the proposed method is greater than 78% when the SNR is 20 dB. The interpretability analysis of the clustering model visualizes the significant features of different modulated signals and verifies the high separability of the features extracted by the clustering model.