This paper proposes a novel hybrid fraud detection framework that integrates multi-stage feature selection, unsupervised clustering, and ensemble learning to improve classification performance in financial transaction monitoring systems. The framework is structured into three core layers: (1) feature selection using Recursive Feature Elimination (RFE), Principal Component Analysis (PCA), and Mutual Information (MI) to reduce dimensionality and enhance input relevance; (2) anomaly detection through unsupervised clustering using K-Means, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and Hierarchical Clustering to flag suspicious patterns in unlabeled data; and (3) final classification using a voting-based hybrid ensemble of Support Vector Machine (SVM), Random Forest (RF), and Gradient Boosting Classifier (GBC). The experimental evaluation is conducted on a synthetically generated dataset comprising one million financial transactions, with 5% labelled as fraudulent, simulating realistic fraud rates and behavioural features, including transaction time, origin, amount, and geo-location. The proposed model demonstrated a significant improvement over baseline classifiers, achieving an accuracy of 99%, a precision of 99%, a recall of 97%, and an F1-score of 99%. Compared to individual models, it yielded a 9% gain in overall detection accuracy and reduced the false positive rate to below 3.5%, thereby minimising the operational costs associated with manually reviewing false alerts. The model's interpretability is enhanced by the integration of Shapley Additive Explanations (SHAP) values for feature importance, supporting transparency and regulatory auditability. These results affirm the practical relevance of the proposed system for deployment in real-time fraud detection scenarios such as credit card transactions, mobile banking, and cross-border payments. The study also highlights future directions, including the deployment of lightweight models and the integration of multimodal data for scalable fraud analytics.
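As a quick arithmetic aid for reading the metrics quoted above, the sketch below computes the F1-score as the harmonic mean of precision and recall. The inputs are the rounded percentages from the abstract, so the result is only approximate; the function name `f1_score` is an illustrative choice and is not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention: F1 is 0 when both inputs are 0
    return 2 * precision * recall / (precision + recall)


# Rounded figures quoted in the abstract (assumed inputs for this check).
reported_precision = 0.99
reported_recall = 0.97

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 from rounded precision and recall: {f1:.4f}")
```

With these rounded inputs the harmonic mean comes to roughly 0.98. The abstract's 99% figure may reflect unrounded values or a different averaging scheme, which the abstract does not specify.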
Modeling topics in short texts presents significant challenges due to feature sparsity, particularly when analyzing content generated by large-scale online users. This sparsity can substantially impair semantic capture accuracy. We propose a novel approach that incorporates pre-clustered knowledge into the BERTopic model while reducing the L2 norm for low-frequency words. Our method effectively mitigates feature sparsity during cluster mapping. Empirical evaluation on the StackOverflow dataset demonstrates that our approach outperforms baseline models, achieving superior Macro-F1 scores. These results validate the effectiveness of our proposed feature sparsity reduction technique for short-text topic modeling.
Single-cell RNA sequencing (scRNA-seq) technology enables a deep understanding of cellular differentiation during plant development and reveals heterogeneity among the cells of a given tissue. However, the computational characterization of such cellular heterogeneity is complicated by the high dimensionality, sparsity, and biological noise inherent to the raw data. Here, we introduce PhytoCluster, an unsupervised deep learning algorithm, to cluster scRNA-seq data by extracting latent features. We benchmarked PhytoCluster on four simulated datasets and five real scRNA-seq datasets with varying protocols and data quality levels. A comprehensive evaluation indicated that PhytoCluster outperforms other methods in clustering accuracy, noise removal, and signal retention. Additionally, we evaluated the performance of the latent features extracted by PhytoCluster across four machine learning models. The computational results highlight the ability of PhytoCluster to extract meaningful information from plant scRNA-seq data, with machine learning models achieving accuracy comparable to that of raw features. We believe that PhytoCluster will be a valuable tool for disentangling complex cellular heterogeneity based on scRNA-seq data.
Applying domain knowledge in fuzzy clustering algorithms continuously promotes the development of clustering technology. However, combining domain knowledge with fuzzy clustering algorithms raises problems such as initialization sensitivity and information granule weight optimization. We therefore propose a weighted kernel fuzzy clustering algorithm based on a relative density view (RDVWKFC). Compared with traditional density-based methods, RDVWKFC can capture the intrinsic structure of the data more accurately, thus improving the initial quality of the clustering. By introducing a Relative Density based Knowledge Extraction Method (RDKM) and an adaptive weight optimization mechanism, we address the limitations of view initialization and information granule weight optimization. RDKM can accurately identify high-density regions and optimize the initialization process. The adaptive weight mechanism dynamically allocates weights to reduce the interference of noise and outliers in the selection of initial cluster centres. Experimental results on 14 benchmark datasets show that the proposed algorithm is superior to existing algorithms in terms of clustering accuracy, stability, and convergence speed. It shows adaptability and robustness, especially when dealing with different data distributions and noise interference. Moreover, RDVWKFC shows significant advantages when dealing with data with complex structures and high-dimensional features. These advancements provide versatile tools for real-world applications such as bioinformatics, image segmentation, and anomaly detection.
The increasing prevalence of multi-view data has made multi-view clustering a crucial technique for discovering latent structures from heterogeneous representations. However, traditional fuzzy clustering algorithms struggle with the inherent uncertainty and imprecision of such data, as they rely on a single-dimensional membership value. To overcome these limitations, we propose an auto-weighted multi-view neutrosophic fuzzy clustering (AW-MVNFC) algorithm. Our method leverages the neutrosophic framework, an extension of fuzzy sets, to explicitly model imprecision and ambiguity through three membership degrees. The core novelty of AW-MVNFC lies in a hierarchical weighting strategy that adaptively learns both the contribution of each data view and the importance of each feature within a view. Through a unified objective function, AW-MVNFC jointly optimizes the neutrosophic membership assignments, cluster centers, and the distributions of view and feature weights. Comprehensive experiments conducted on synthetic and real-world datasets show that our algorithm achieves more accurate and stable clustering than existing methods, demonstrating its effectiveness in handling the complexities of multi-view data.
The distillation process is an important chemical process, and a data-driven modelling approach has the potential to reduce model complexity compared to mechanistic modelling, thus improving the efficiency of process optimization or monitoring studies. However, the distillation process is highly nonlinear and has multiple uncertainty perturbation intervals, which makes accurate data-driven modelling of distillation processes challenging. This paper proposes a systematic data-driven modelling framework to solve these problems. Firstly, data segment variance was introduced into the K-means algorithm to form K-means data interval (KMDI) clustering, which clusters the data into perturbed and steady-state intervals for steady-state data extraction. Secondly, the maximal information coefficient (MIC) was employed to calculate the nonlinear correlation between variables in order to remove redundant features. Finally, extreme gradient boosting (XGBoost) was integrated as the base learner into adaptive boosting (AdaBoost), with an error threshold (ET) set to improve the weight update strategy, to construct a new ensemble learning algorithm, XGBoost-AdaBoost-ET. The superiority of the proposed framework is verified by applying it to a real industrial propylene distillation process.
Low visibility conditions, particularly those caused by fog, significantly affect road safety and reduce drivers' ability to see ahead clearly. The conventional approaches used to address this problem primarily rely on instrument-based and fixed-threshold-based theoretical frameworks, which face challenges in adaptability and demonstrate lower performance under varying environmental conditions. To overcome these challenges, we propose a real-time visibility estimation model that leverages roadside CCTV cameras to monitor and identify visibility levels under different weather conditions. The proposed method begins by identifying specific regions of interest (ROI) in the CCTV images and focuses on extracting specific features such as the number of lines and contours detected within these regions. These features are then provided as input to the proposed hierarchical clustering model, which classifies them into different visibility levels without the need for predefined rules and threshold values. In the proposed approach, we used two different distance metrics, namely dynamic time warping (DTW) and Euclidean distance, alongside the proposed hierarchical clustering model and assessed its performance using several evaluation measures. The proposed model achieved an average accuracy of 97.81%, precision of 91.31%, recall of 91.25%, and F1-score of 91.27% using the DTW distance metric. We also conducted experiments with other deep learning (DL)-based models used in the literature and compared their performance with that of the proposed model. The experimental results demonstrate that the proposed model is more adaptable and consistent than the methods used in the literature. The proposed method provides drivers with real-time and accurate visibility information and enhances road safety during low visibility conditions.
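To make the two distance measures named in this abstract concrete, here is a minimal sketch of dynamic time warping (DTW) next to Euclidean distance for one-dimensional feature sequences. It assumes an absolute-difference local cost and no warping window; the function names and the sample sequences are hypothetical and are not taken from the paper.

```python
import math


def dtw_distance(a, b):
    """Classic DTW between two numeric sequences, using |x - y| as local cost."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b stays
                cost[i][j - 1],      # b advances, a stays
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def euclidean_distance(a, b):
    """Euclidean distance; requires sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("Euclidean distance needs equal-length sequences")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical per-frame line counts from one region of interest.
reference = [12, 11, 9, 6, 4]
delayed = [12, 12, 11, 9, 6, 4]  # same trend, shifted by one frame

print(dtw_distance(reference, delayed))
```

Unlike Euclidean distance, DTW can compare sequences of different lengths and tolerates small shifts in time, which is one plausible reason to pair it with hierarchical clustering of per-frame features.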
Partition-based clustering with weighted features is developed in the framework of shadowed sets. The objects in the core and boundary regions generated by shadowed sets-based clustering have different impacts on the prototype of each cluster. By integrating feature weights, a formula for weight calculation is introduced into the clustering algorithm. The selection of the weight exponent is crucial for good results, and the weights are updated iteratively with each partition of clusters. The convergence of the weighted algorithms is established, and feasible cluster validity indices from data mining applications are utilized. Experimental results on both synthetic and real-life numerical data with different feature weights demonstrate that the weighted algorithm performs better than the unweighted algorithms.
To enable clustering in a lower-dimensional space, a new feature selection method for clustering is proposed. The method has three steps, all carried out in a wrapper framework. First, all the original features are ranked according to their importance, using an evaluation function E(f) introduced to assess the importance of a feature. Second, the set of important features is selected sequentially. Finally, possibly redundant features are removed from the important feature subset. Because the features are selected sequentially, it is not necessary to search through the large space of feature subsets, which improves efficiency. Experimental results show that the method finds the set of important features for clustering and discards unimportant features and features that may hinder the clustering task.
Feature extraction from range images provided by a ranging sensor is a key issue in pattern recognition. To automatically extract the environmental features sensed by a 2D laser range scanner, an improved method based on genetic clustering (VGA-clustering) is presented. By integrating the spatial neighbouring information of range data into a fuzzy clustering algorithm, a weighted fuzzy clustering algorithm (WFCA) is introduced in place of the standard clustering algorithm to realize feature extraction for the laser scanner. Because the number of clusters is not known in advance, several validation index functions are used to estimate the validity of different clustering algorithms, and one validation index is selected as the fitness function of the genetic algorithm so as to determine the number of clusters automatically. At the same time, an improved genetic algorithm, IVGA, based on VGA is proposed to address the clustering algorithm's tendency to fall into local optima; it increases population diversity and improves the genetic operators of the elitist rule to enhance local search capacity and speed up convergence. Comparison with other algorithms demonstrates the effectiveness of the proposed algorithm.
Feature selection methods have been successfully applied to text categorization but seldom to text clustering, because class label information is unavailable. In this paper, a new feature selection method for text clustering based on expectation maximization and cluster validity is proposed. It applies a supervised feature selection method to the intermediate clustering results generated during iterative clustering; meanwhile, the Davies-Bouldin index is used to evaluate the intermediate feature subsets indirectly. Feature subsets are then selected according to the curve of the Davies-Bouldin index. Experiments are carried out on several popular datasets, and the results show the advantages of the proposed method.
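For readers unfamiliar with the Davies-Bouldin index mentioned above, the following minimal sketch computes it for a given partition using Euclidean distances; lower values indicate more compact, better separated clusters. It only evaluates the index and does not reproduce the paper's feature selection procedure. The function names and the toy clusters are illustrative assumptions.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def davies_bouldin(clusters):
    """Davies-Bouldin index of a partition given as a list of point lists."""
    k = len(clusters)
    if k < 2:
        raise ValueError("at least two clusters are required")
    centers = [centroid(c) for c in clusters]
    # Scatter: mean distance of each cluster's points to its own centroid.
    scatter = [
        sum(math.dist(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centers)
    ]
    total = 0.0
    for i in range(k):
        # Worst-case similarity of cluster i to any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k)
            if j != i
        )
    return total / k


# Toy one-dimensional partitions (hypothetical data).
well_separated = [[(0.0,), (2.0,)], [(10.0,), (12.0,)]]
close_together = [[(0.0,), (2.0,)], [(3.0,), (5.0,)]]

print(davies_bouldin(well_separated), davies_bouldin(close_together))
```

Evaluating the index for each candidate feature subset and plotting the values gives the kind of curve the abstract refers to.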
Railway tracks may contain several kinds of internal defects with different shapes and distribution rules, and these defects affect the safety of high-speed trains. Establishing reliable detection models and methods for these internal defects remains a challenging task. To address this challenge, this study proposes an intelligent detection method based on a generalization feature cluster for internal defects of railway tracks. First, the defects are classified and counted according to their shape and location features. Then, generalized features of the internal defects are extracted and formulated based on the maximum difference between different types of defects and the maximum tolerance among defects of the same type. Finally, the extracted generalized features are expressed by function constraints and formulated as generalization feature clusters to classify and identify internal defects in the railway track. Furthermore, to improve detection reliability and speed, a reduced-dimension method for the generalization feature clusters is presented. Based on this reduced-dimension feature and the strongly constrained generalized features, the K-means clustering algorithm is developed for defect clustering, and good clustering results are achieved. For defects in the rail head region, the clustering accuracy is over 95% and the Davies-Bouldin index (DBI) is negligible, which supports the validity of the proposed generalization features with strong constraints. Experimental results show that the accuracy of the proposed method based on generalization feature clusters is up to 97.55% and the average detection time is 0.12 s/frame, which indicates that it performs well in adaptability, accuracy, and detection speed under complex working environments. The proposed algorithm can effectively detect internal defects in railway tracks using an established generalization feature cluster model.
Feature optimization is important to agricultural text mining. Usually, the vector space model is used to represent text documents. However, this basic approach suffers from two drawbacks: the curse of dimensionality and the lack of semantic information. In this paper, a novel ontology-based feature optimization method for agricultural text is proposed. First, terms of the vector space model are mapped into concepts of the agricultural ontology, whose concept frequency weights are computed statistically from term frequency weights; second, concept similarity weights are assigned to the concept features according to the structure of the agricultural ontology. By combining feature frequency weights and feature similarity weights based on the agricultural ontology, the dimensionality of the feature space can be reduced drastically. Moreover, semantic information can be incorporated into the method. The results show that this feature optimization yields a significant improvement in agricultural text clustering.
Nearly half of the coal mine disasters in China have been found to occur in clusters or to be accompanied by nearby earthquakes, and all disaster types are involved. Stress disturbances seem to exist among mining areas and to be responsible for the observed clustering. The earthquakes that accompany coal mine disasters may be vital geophysical evidence of tectonic stress disturbances around mining areas. This paper analyzes all the possible causative factors to demonstrate the authenticity and reliability of the observed phenomena. A quantitative study was performed on the degree of clustering, and space-time distribution curves were obtained. Under a threshold of 100 km, 47% of disasters are involved in cluster series, and 372 coal mine disasters are accompanied by earthquakes. The majority of cluster series, lasting 1-2 days, correspond well with nearby earthquakes and are speculated to be related to local stress disturbance, while the minority, lasting longer than 4 days, correspond well with fatal earthquakes and are speculated to be related to regional stress disturbance. The cluster series possess multiple properties, such as area, distance, and related disasters, and these correspond well with the energy and magnitude of the earthquakes. This indicates that the cluster series of coal mine disasters and earthquakes are linked with fatal earthquakes and may serve as footprints of regional stress disturbance. Speculations relating to the geological model are made, and five disaster-causing models are examined. The findings are suggested to have broad scientific significance for earthquake research and disaster prevention.
Due to the widespread use of the Internet, customer information is vulnerable to attacks on computer systems, which creates an urgent need for intrusion detection technology. Recently, network intrusion detection has become one of the most important technologies in network security. Existing methods have achieved high detection accuracy. However, these methods, including the most popular SOM neural network method, have very low efficiency. In this paper, an efficient and fast network intrusion detection method is proposed. First, the fundamentals of the two underlying methods are introduced. Then, a self-organizing feature map neural network based on K-means clustering (KSOM) is presented to improve the efficiency of network intrusion detection. Finally, the NSL-KDD data set is used to demonstrate that the KSOM method requires significantly fewer clustering iterations than the SOM method without substantially affecting the clustering results, and that its accuracy is much higher than that of the K-means method. The experimental results show that our method improves the accuracy of network intrusion detection and significantly reduces the number of clustering iterations.
Feature selection is very important for obtaining meaningful and interpretable results from a clustering analysis. In soil data clustering, the response of clustering performance to different feature subsets is not well understood. In the present paper, we analyze the performance differences between the k-means, fuzzy c-means, and spectral clustering algorithms under different feature subsets of soil data sets. The experimental results demonstrate that the spectral clustering algorithm generally performs better than k-means and fuzzy c-means across the different feature subsets. Feature subsets containing environmental attributes improved clustering performance more than those containing spatial attributes and produced more accurate and meaningful clustering results. Our results suggest that combining the spectral clustering algorithm with feature subsets containing environmental attributes, rather than spatial attributes, may be a better choice in applications of soil data clustering.
Effective storage, processing, and analysis of power device condition monitoring data face enormous challenges. A framework based on the Aliyun DTplus platform is proposed that supports both MapReduce and Graph for massive monitoring data analysis at the same time. First, power device condition monitoring data storage based on MaxCompute tables and parallel permutation entropy feature extraction based on MaxCompute MapReduce are designed and implemented on the DTplus platform. Then, a Graph-based k-means algorithm is implemented and used for clustering analysis of massive condition monitoring data. Finally, performance tests are performed to compare the execution time of the serial and parallel programs. Performance is analyzed in terms of CPU core consumption, memory utilization, and parallel granularity. Experimental results show that the designed framework and parallel algorithms can efficiently process massive power device condition monitoring data.
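The permutation entropy feature mentioned above can be illustrated with a small serial sketch. This is a single-machine version for one signal, not the MaxCompute MapReduce implementation described in the abstract; the `order` and `delay` parameters, the tie-breaking rule, and the sample series are assumptions made for illustration.

```python
import math
from collections import Counter


def permutation_entropy(series, order=3, delay=1):
    """Shannon entropy (in bits) of the ordinal patterns found in a series."""
    n_windows = len(series) - (order - 1) * delay
    if n_windows <= 0:
        raise ValueError("series is too short for the given order and delay")
    patterns = Counter()
    for start in range(n_windows):
        window = [series[start + k * delay] for k in range(order)]
        # Ordinal pattern: indices that sort the window in ascending order.
        # sorted() is stable, so ties are broken by order of appearance.
        pattern = tuple(sorted(range(order), key=window.__getitem__))
        patterns[pattern] += 1
    return sum(
        -(count / n_windows) * math.log2(count / n_windows)
        for count in patterns.values()
    )


# Hypothetical monitoring signal with seven samples.
signal = [4, 7, 9, 10, 6, 11, 3]
print(round(permutation_entropy(signal), 3))
```

A strictly increasing signal contains a single ordinal pattern and therefore has zero permutation entropy, while more irregular signals score higher. Because each window is processed independently, the pattern counting step is the part that lends itself to a map and reduce split.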
Multi-label learning deals with objects associated with multiple class labels, and aims to induce a predictive model which can assign a set of relevant class labels to an unseen instance. Since each class might possess its own characteristics, the strategy of extracting label-specific features has been widely employed to improve the discrimination process in multi-label learning, where the predictive model is induced based on tailored features specific to each class label instead of the identical instance representations. As a representative approach, LIFT generates label-specific features by conducting clustering analysis. However, its performance may be degraded due to the inherent instability of the single clustering algorithm. To improve this, a novel multi-label learning approach named SENCE (stable label-Specific features gENeration for multi-label learning via mixture-based Clustering Ensemble) is proposed, which stabilizes the generation process of label-specific features via clustering ensemble techniques. Specifically, more stable clustering results are obtained by first augmenting the original instance representation with cluster assignments from base clusters and then fitting a mixture model via the expectation-maximization (EM) algorithm. Extensive experiments on eighteen benchmark data sets show that SENCE performs better than LIFT and other well-established multi-label learning algorithms.
Cluster analysis in spectroscopy presents some unique challenges due to the specific data characteristics in spectroscopy, namely, high dimensionality and small sample size. In order to improve cluster analysis outcomes, feature selection can be used to remove redundant or irrelevant features and reduce the dimensionality. However, for cluster analysis, this must be done in an unsupervised manner without the benefit of data labels. This paper presents a novel feature selection approach for cluster analysis, utilizing clusterability metrics to remove features that contribute least to a dataset's tendency to cluster. Two versions are presented and evaluated: the Hopkins clusterability filter, which utilizes the Hopkins test for spatial randomness, and the Dip clusterability filter, which utilizes the Dip test for unimodality. These new techniques, along with a range of existing filter and wrapper feature selection techniques, were evaluated on eleven real-world spectroscopy datasets using internal and external clustering indices. Our newly proposed Hopkins clusterability filter performed the best of the six filter techniques evaluated. However, it was observed that results varied greatly for different techniques depending on the specifics of the dataset and the number of features selected, with significant instability observed for most techniques at low numbers of features. It was identified that the genetic algorithm wrapper technique avoided this instability, performed consistently across all datasets, and gave better results on average than utilizing all the features in the spectra.
The following topics are discussed: feature clusters, the feature cluster concept, and the reasoning formula. The defects of clustering based on approach direction and feed direction are analyzed. The concept of feature tool axis direction and a method for defining it are proposed. The features of a practical part are also clustered by tool axis direction.
Funding (hybrid fraud detection framework): Funded by the Deanship of Scientific Research, Vice Presidency for Graduate Studies and Scientific Research, King Faisal University, Saudi Arabia [Grant No. KFU241683].
Funding: Supported by the National Natural Science Foundation of China (32371996 and 62372158), the National Key R&D Program of China (2022YFF0711802), the STI 2030-Major Projects (2022ZD04017), and the National Key Research and Development Program of China (2019YFA0802202 and 2020YFA0803401).
Abstract: Single-cell RNA sequencing (scRNA-seq) technology enables a deep understanding of cellular differentiation during plant development and reveals heterogeneity among the cells of a given tissue. However, the computational characterization of such cellular heterogeneity is complicated by the high dimensionality, sparsity, and biological noise inherent to the raw data. Here, we introduce PhytoCluster, an unsupervised deep learning algorithm, to cluster scRNA-seq data by extracting latent features. We benchmarked PhytoCluster on four simulated datasets and five real scRNA-seq datasets with varying protocols and data quality levels. A comprehensive evaluation indicated that PhytoCluster outperforms other methods in clustering accuracy, noise removal, and signal retention. Additionally, we evaluated the performance of the latent features extracted by PhytoCluster across four machine learning models. The computational results highlight the ability of PhytoCluster to extract meaningful information from plant scRNA-seq data, with machine learning models achieving accuracy comparable to that of raw features. We believe that PhytoCluster will be a valuable tool for disentangling complex cellular heterogeneity based on scRNA-seq data.
Abstract: Applying domain knowledge in fuzzy clustering algorithms continuously promotes the development of clustering technology. However, combining domain knowledge with fuzzy clustering algorithms raises problems such as initialization sensitivity and information granule weight optimization. Therefore, we propose a weighted kernel fuzzy clustering algorithm based on a relative density view (RDVWKFC). Compared with traditional density-based methods, RDVWKFC can capture the intrinsic structure of the data more accurately, thus improving the initial quality of the clustering. By introducing a Relative Density based Knowledge Extraction Method (RDKM) and an adaptive weight optimization mechanism, we effectively address the limitations of view initialization and information granule weight optimization. RDKM can accurately identify high-density regions and optimize the initialization process. The adaptive weight mechanism can reduce the interference of noise and outliers in the initial cluster centre selection by dynamically allocating weights. Experimental results on 14 benchmark datasets show that the proposed algorithm is superior to existing algorithms in terms of clustering accuracy, stability, and convergence speed. It shows adaptability and robustness, especially when dealing with different data distributions and noise interference. Moreover, RDVWKFC also shows significant advantages when dealing with data with complex structures and high-dimensional features. These advancements provide versatile tools for real-world applications such as bioinformatics, image segmentation, and anomaly detection.
Abstract: The increasing prevalence of multi-view data has made multi-view clustering a crucial technique for discovering latent structures from heterogeneous representations. However, traditional fuzzy clustering algorithms show limitations with the inherent uncertainty and imprecision of such data, as they rely on a single-dimensional membership value. To overcome these limitations, we propose an auto-weighted multi-view neutrosophic fuzzy clustering (AW-MVNFC) algorithm. Our method leverages the neutrosophic framework, an extension of fuzzy sets, to explicitly model imprecision and ambiguity through three membership degrees. The core novelty of AW-MVNFC lies in a hierarchical weighting strategy that adaptively learns both the contribution of each data view and the importance of each feature within a view. Through a unified objective function, AW-MVNFC jointly optimizes the neutrosophic membership assignments, cluster centers, and the distributions of view and feature weights. Comprehensive experiments conducted on synthetic and real-world datasets show that our algorithm achieves more accurate and stable clustering than existing methods, demonstrating its effectiveness in handling the complexities of multi-view data.
Funding: Supported by the National Key Research and Development Program of China (2023YFB3307801), the National Natural Science Foundation of China (62394343, 62373155, 62073142), the Major Science and Technology Project of Xinjiang (No. 2022A01006-4), the Programme of Introducing Talents of Discipline to Universities (the 111 Project) under Grant B17017, the Fundamental Research Funds for the Central Universities, Science Foundation of China University of Petroleum, Beijing (No. 2462024YJRC011), and the Open Research Project of the State Key Laboratory of Industrial Control Technology, China (Grant No. ICT2024B70).
Abstract: The distillation process is an important chemical process, and applying a data-driven modelling approach has the potential to reduce model complexity compared to mechanistic modelling, thus improving the efficiency of process optimization or monitoring studies. However, the distillation process is highly nonlinear and has multiple uncertainty perturbation intervals, which makes accurate data-driven modelling of distillation processes challenging. This paper proposes a systematic data-driven modelling framework to solve these problems. Firstly, data segment variance was introduced into the K-means algorithm to form K-means data interval (KMDI) clustering, which clusters the data into perturbed and steady-state intervals for steady-state data extraction. Secondly, the maximal information coefficient (MIC) was employed to calculate the nonlinear correlation between variables in order to remove redundant features. Finally, extreme gradient boosting (XGBoost) was integrated as the base learner into adaptive boosting (AdaBoost), with an error threshold (ET) set to improve the weight update strategy, to construct a new ensemble learning algorithm, XGBoost-AdaBoost-ET. The superiority of the proposed framework is verified by applying it to a real industrial process of propylene distillation.
Abstract: Low visibility conditions, particularly those caused by fog, significantly affect road safety and reduce drivers’ ability to see ahead clearly. The conventional approaches used to address this problem primarily rely on instrument-based and fixed-threshold-based theoretical frameworks, which face challenges in adaptability and demonstrate lower performance under varying environmental conditions. To overcome these challenges, we propose a real-time visibility estimation model that leverages roadside CCTV cameras to monitor and identify visibility levels under different weather conditions. The proposed method begins by identifying specific regions of interest (ROI) in the CCTV images and focuses on extracting specific features such as the number of lines and contours detected within these regions. These features are then provided as input to the proposed hierarchical clustering model, which classifies them into different visibility levels without the need for predefined rules and threshold values. In the proposed approach, we used two different distance metrics, namely dynamic time warping (DTW) and Euclidean distance, alongside the proposed hierarchical clustering model and evaluated its performance using several evaluation measures. The proposed model achieved an average accuracy of 97.81%, precision of 91.31%, recall of 91.25%, and F1-score of 91.27% using the DTW distance metric. We also conducted experiments with other deep learning (DL)-based models used in the literature and compared their performance with that of the proposed model. The experimental results demonstrate that the proposed model is more adaptable and consistent than the methods used in the literature. The proposed method provides drivers with real-time and accurate visibility information and enhances road safety during low visibility conditions.
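The abstract above compares dynamic time warping (DTW) with Euclidean distance as the dissimilarity fed to hierarchical clustering. As a rough illustration of how the two differ, the sketch below implements the textbook DTW recurrence with an absolute-difference local cost. The feature sequences (for example, line counts per frame) are hypothetical, and the paper's actual feature encoding and DTW variant are not given in the abstract.

```python
import math


def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def euclidean_distance(a, b):
    """Euclidean distance; unlike DTW it requires equal-length sequences."""
    return math.dist(tuple(a), tuple(b))


# Hypothetical per-frame line counts from two camera clips of different length.
clip_a = [1, 2, 3]
clip_b = [1, 2, 2, 3]
print(dtw_distance(clip_a, clip_b))
```

DTW tolerates sequences of different length or speed, which Euclidean distance cannot handle directly; that is one plausible reason a DTW-based metric would suit features sampled over time.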
Funding: Supported by the National Natural Science Foundation of China (61139002).
Abstract: Partition-based clustering with weighted features is developed in the framework of shadowed sets. The objects in the core and boundary regions, generated by shadowed sets-based clustering, have different impacts on the prototype of each cluster. By integrating feature weights, a formula for weight calculation is introduced into the clustering algorithm. The selection of the weight exponent is crucial for good results, and the weights are updated iteratively with each partition of clusters. The convergence of the weighted algorithms is given, and feasible cluster validity indices for data mining applications are utilized. Experimental results on both synthetic and real-life numerical data with different feature weights demonstrate that the weighted algorithm performs better than the unweighted algorithms.
Abstract: In order to enable clustering to be done under a lower dimension, a new feature selection method for clustering is proposed. This method has three steps which are all carried out in a wrapper framework. First, all the original features are ranked according to their importance. An evaluation function E(f) used to evaluate the importance of a feature is introduced. Secondly, the set of important features is selected sequentially. Finally, the possible redundant features are removed from the important feature subset. Because the features are selected sequentially, it is not necessary to search through the large feature subset space, thus the efficiency can be improved. Experimental results show that the set of important features for clustering can be found and those unimportant features or features that may hinder the clustering task will be discarded by this method.
Funding: Supported by the National Natural Science Foundation of China (60234030), the Natural Science Foundation of He’nan Educational Committee of China (2007520019, 2008B520015), and the Doctoral Foundation of Henan Polytechnic University of China (B050901, B2008-61).
Abstract: Feature extraction from range images provided by a ranging sensor is a key issue in pattern recognition. To automatically extract the environmental features sensed by a 2D laser scanner, an improved method based on genetic clustering (VGA-clustering) is presented. By integrating the spatial neighbouring information of range data into the fuzzy clustering algorithm, a weighted fuzzy clustering algorithm (WFCA) is introduced in place of the standard clustering algorithm to realize feature extraction for the laser scanner. Because the number of clusters is not known in advance, several validation index functions are used to estimate the validity of different clustering algorithms, and one validation index is selected as the fitness function of the genetic algorithm so as to determine the number of clusters automatically. At the same time, an improved genetic algorithm, IVGA, is proposed on the basis of VGA to address the local-optimum problem of the clustering algorithm; it increases population diversity and improves the elitist-rule genetic operators to enhance local search capacity and speed up convergence. Comparison with other algorithms demonstrates the effectiveness of the proposed algorithm.
Funding: Supported by the National Natural Science Foundation of China (60503020, 60373066), the Outstanding Young Scientist’s Fund (60425206), the Natural Science Foundation of Jiangsu Province (BK2005060), and the Opening Foundation of Jiangsu Key Laboratory of Computer Information Processing Technology in Soochow University.
Abstract: Feature selection methods have been successfully applied to text categorization but seldom applied to text clustering due to the unavailability of class label information. In this paper, a new feature selection method for text clustering based on expectation maximization and cluster validity is proposed. It applies a supervised feature selection method to the intermediate clustering results generated during iterative clustering in order to select features for text clustering; meanwhile, the Davies-Bouldin index is used to evaluate the intermediate feature subsets indirectly. Feature subsets are then selected according to the curve of the Davies-Bouldin index. Experiments are carried out on several popular datasets, and the results show the advantages of the proposed method.
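The method above scores intermediate feature subsets with the Davies-Bouldin index, where lower values indicate more compact and better separated clusters. The following is a minimal sketch of the standard index using Euclidean distance and centroid-based scatter. The helper names and toy clusters are assumptions for illustration, and the paper's exact variant is not stated in the abstract.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of point tuples).

    Assumes at least two clusters whose centroids are distinct.
    """
    centers = [centroid(c) for c in clusters]
    # Scatter: mean distance from each member to its own cluster centroid.
    scatter = [
        sum(math.dist(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centers)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case similarity ratio between cluster i and any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k)
            if j != i
        )
    return total / k


# Toy data: the same two clusters placed far apart, then close together.
far_apart = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
close_together = [[(0, 0), (0, 2)], [(3, 0), (3, 2)]]
print(davies_bouldin(far_apart), davies_bouldin(close_together))
```

Plotting this value against the size of the candidate feature subset would give a curve of the kind the abstract refers to, with a minimum suggesting a good subset.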
Funding: National Natural Science Foundation of China (Grant No. 61573233), Guangdong Provincial Natural Science Foundation of China (Grant No. 2018A0303130188), Guangdong Provincial Science and Technology Special Funds Project of China (Grant No. 190805145540361), and Special Projects in Key Fields of Colleges and Universities in Guangdong Province of China (Grant No. 2020ZDZX2005).
Abstract: There may be several internal defects in railway track work that have different shapes and distribution rules, and these defects affect the safety of high-speed trains. Establishing reliable detection models and methods for these internal defects remains a challenging task. To address this challenge, in this study, an intelligent detection method based on a generalization feature cluster is proposed for internal defects of railway tracks. First, the defects are classified and counted according to their shape and location features. Then, generalized features of the internal defects are extracted and formulated based on the maximum difference between different types of defects and the maximum tolerance among defects of the same type. Finally, the extracted generalized features are expressed by function constraints and formulated as generalization feature clusters to classify and identify internal defects in the railway track. Furthermore, to improve the detection reliability and speed, a reduced-dimension method for the generalization feature clusters is presented in this paper. Based on this reduced-dimension feature and strongly constrained generalized features, the K-means clustering algorithm is developed for defect clustering, and good clustering results are achieved. For defects in the rail head region, the clustering accuracy is over 95% and the Davies-Bouldin index (DBI) is negligible, which indicates the validity of the proposed generalization features with strong constraints. Experimental results show that the accuracy of the proposed method based on generalization feature clusters is up to 97.55% and the average detection time is 0.12 s/frame, which indicates that it performs well in adaptability, accuracy, and detection speed under complex working environments. The proposed algorithm can effectively detect internal defects in railway tracks using an established generalization feature cluster model.
Funding: Supported by the National Natural Science Foundation of China (60774096) and the National High-Tech R&D Program of China (2008BAK49B05).
Abstract: Feature optimization is important to agricultural text mining. Usually, the vector space model is used to represent text documents. However, this basic approach still suffers from two drawbacks: the curse of dimensionality and the lack of semantic information. In this paper, a novel ontology-based feature optimization method for agricultural text is proposed. First, terms of the vector space model are mapped to concepts of an agricultural ontology, and concept frequency weights are computed statistically from term frequency weights; second, concept similarity weights are assigned to the concept features according to the structure of the agricultural ontology. By combining feature frequency weights and feature similarity weights based on the agricultural ontology, the dimensionality of the feature space can be reduced drastically. Moreover, semantic information can be incorporated into this method. The results show that feature optimization with this method yields a significant improvement in agricultural text clustering.
Abstract: Nearly half of coal mine disasters in China have been found to occur in clusters or to be accompanied by earthquakes nearby, and all disaster types are involved. Stress disturbances seem to exist among mining areas and to be responsible for the observed clustering. The earthquakes accompanied by coal mine disasters may be vital geophysical evidence for tectonic stress disturbances around mining areas. This paper analyzes all the possible causative factors to demonstrate the authenticity and reliability of the observed phenomena. A quantitative study was performed on the degree of clustering, and space-time distribution curves were obtained. Under a threshold of 100 km, 47% of disasters are involved in cluster series and 372 coal mine disasters are accompanied by earthquakes. The majority of cluster series, lasting 1-2 days, correspond well with nearby earthquakes and are speculated to be related to local stress disturbance, while the minority, lasting longer than 4 days, correspond well with fatal earthquakes and are speculated to be related to regional stress disturbance. The cluster series possess multiple properties, such as area, distance, and related disasters; when these are compared with the energy and magnitude of earthquakes, good correspondence is found. This indicates that the cluster series of coal mine disasters and earthquakes are linked with fatal earthquakes and may serve as footprints of regional stress disturbance. Speculations relating to the geological model are made, and five disaster-causing models are examined. The findings suggest broad scientific significance for earthquake research and disaster prevention.
Abstract: Due to the widespread use of the Internet, customer information is vulnerable to attacks on computer systems, which creates an urgent need for intrusion detection technology. Recently, network intrusion detection has become one of the most important technologies in network security. Existing methods already achieve high detection accuracy. However, these methods, even the most popular SOM neural network method, have very low efficiency. In this paper, an efficient and fast network intrusion detection method is proposed. First, the fundamentals of the two underlying methods are introduced. Then, a self-organizing feature map neural network based on K-means clustering (KSOM) is presented to improve the efficiency of network intrusion detection. Finally, NSL-KDD is used as the network intrusion data set to demonstrate that the KSOM method requires significantly fewer clustering iterations than the SOM method without substantially affecting the clustering results, and that its accuracy is much higher than that of the K-means method. The experimental results show that our method improves the accuracy of network intrusion detection to some extent and significantly reduces the number of clustering iterations.
Abstract: Feature selection is very important for obtaining meaningful and interpretable results from a clustering analysis. In the application of soil data clustering, there is a lack of good understanding of how clustering performance responds to different feature subsets. In the present paper, we analyzed the performance differences between k-means, fuzzy c-means, and spectral clustering algorithms under different feature subsets of soil data sets. The experimental results demonstrated that the spectral clustering algorithm generally performed better than k-means and fuzzy c-means across the feature subsets. Feature subsets containing environmental attributes improved clustering performance more than those containing spatial attributes and produced more accurate and meaningful clustering results. Our results demonstrated that combining the spectral clustering algorithm with feature subsets containing environmental attributes rather than spatial attributes may be a better choice in applications of soil data clustering.
Funding: This work has been supported by the Central University Research Fund (No. 2016MS116, No. 2016MS117, No. 2018MS074) and the National Natural Science Foundation (51677072).
Abstract: Effective storage, processing, and analysis of power device condition monitoring data face enormous challenges. A framework based on the Aliyun DTplus platform is proposed that supports both MapReduce and Graph at the same time for massive monitoring data analysis. First, power device condition monitoring data storage based on MaxCompute tables and parallel permutation entropy feature extraction based on MaxCompute MapReduce are designed and implemented on the DTplus platform. Then, a Graph-based k-means algorithm is implemented and used for clustering analysis of massive condition monitoring data. Finally, performance tests are performed to compare the execution time of the serial and parallel programs. Performance is analyzed in terms of CPU core consumption, memory utilization, and parallel granularity. Experimental results show that the designed framework and parallel algorithms can efficiently process massive power device condition monitoring data.
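Permutation entropy, the feature extracted in the first step above, measures the diversity of ordinal patterns in a time series. The sketch below is a plain serial version of the standard Bandt-Pompe definition, not the MaxCompute MapReduce implementation from the paper. The parameter names `order` and `delay` and the sample series are assumptions for illustration.

```python
import math
from collections import Counter


def permutation_entropy(series, order=3, delay=1, normalize=False):
    """Permutation entropy (in bits) of a numeric sequence."""
    n_windows = len(series) - (order - 1) * delay
    if n_windows <= 0:
        raise ValueError("series is too short for the given order and delay")
    patterns = Counter()
    for start in range(n_windows):
        window = [series[start + k * delay] for k in range(order)]
        # The ordinal pattern is the index order that sorts the window.
        pattern = tuple(sorted(range(order), key=lambda k: window[k]))
        patterns[pattern] += 1
    # Shannon entropy of the pattern frequencies, written as p * log2(1 / p).
    entropy = sum(
        (count / n_windows) * math.log2(n_windows / count)
        for count in patterns.values()
    )
    if normalize:
        # Divide by the maximum possible entropy, log2(order!).
        entropy /= math.log2(math.factorial(order))
    return entropy


# Hypothetical monitoring signal; a monotonic series would give entropy 0.
signal = [4, 7, 9, 10, 6, 11, 3]
print(permutation_entropy(signal, order=3))
```

Because each series (or each segment of a long series) can be processed independently, a computation of this kind lends itself to a map step per segment, which is presumably what makes a MapReduce formulation attractive.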
Funding: This work was supported by the National Science Foundation of China (62176055) and the China University S&T Innovation Plan Guided by the Ministry of Education.
Abstract: Multi-label learning deals with objects associated with multiple class labels, and aims to induce a predictive model which can assign a set of relevant class labels to an unseen instance. Since each class might possess its own characteristics, the strategy of extracting label-specific features has been widely employed to improve the discrimination process in multi-label learning, where the predictive model is induced based on tailored features specific to each class label instead of identical instance representations. As a representative approach, LIFT generates label-specific features by conducting clustering analysis. However, its performance may be degraded due to the inherent instability of a single clustering algorithm. To improve this, a novel multi-label learning approach named SENCE (stable label-Specific features gENeration for multi-label learning via mixture-based Clustering Ensemble) is proposed, which stabilizes the generation process of label-specific features via clustering ensemble techniques. Specifically, more stable clustering results are obtained by first augmenting the original instance representation with cluster assignments from base clusters and then fitting a mixture model via the expectation-maximization (EM) algorithm. Extensive experiments on eighteen benchmark data sets show that SENCE performs better than LIFT and other well-established multi-label learning algorithms.
Abstract: Cluster analysis in spectroscopy presents some unique challenges due to the specific data characteristics in spectroscopy, namely, high dimensionality and small sample size. In order to improve cluster analysis outcomes, feature selection can be used to remove redundant or irrelevant features and reduce the dimensionality. However, for cluster analysis, this must be done in an unsupervised manner without the benefit of data labels. This paper presents a novel feature selection approach for cluster analysis, utilizing clusterability metrics to remove features that least contribute to a dataset’s tendency to cluster. Two versions are presented and evaluated: the Hopkins clusterability filter, which utilizes the Hopkins test for spatial randomness, and the Dip clusterability filter, which utilizes the Dip test for unimodality. These new techniques, along with a range of existing filter and wrapper feature selection techniques, were evaluated on eleven real-world spectroscopy datasets using internal and external clustering indices. Our newly proposed Hopkins clusterability filter performed the best of the six filter techniques evaluated. However, it was observed that results varied greatly for different techniques depending on the specifics of the dataset and the number of features selected, with significant instability observed for most techniques at low numbers of features. It was identified that the genetic algorithm wrapper technique avoided this instability, performed consistently across all datasets, and produced better results on average than utilizing all the features in the spectra.
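The Hopkins clusterability filter above relies on the Hopkins statistic, which compares nearest-neighbour distances of real points with those of uniformly random points. The sketch below shows one common form of the statistic, using unpowered distances so that values near 1 suggest clustered data and values near 0.5 suggest spatial randomness. The function name, the sampling scheme, and the toy data are assumptions, and the paper's filter built on top of the statistic is not reproduced here.

```python
import math
import random


def hopkins_statistic(data, sample_size, seed=0):
    """Hopkins statistic for a list of point tuples (assumes no duplicate points)."""
    rng = random.Random(seed)
    dim = len(data[0])
    lows = [min(p[d] for p in data) for d in range(dim)]
    highs = [max(p[d] for p in data) for d in range(dim)]

    # w: distance from each sampled real point to its nearest other real point.
    sampled = rng.sample(range(len(data)), sample_size)
    w = [
        min(math.dist(data[i], q) for j, q in enumerate(data) if j != i)
        for i in sampled
    ]

    # u: distance from uniformly random points in the bounding box to real data.
    u = []
    for _ in range(sample_size):
        point = tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
        u.append(min(math.dist(point, q) for q in data))

    return sum(u) / (sum(u) + sum(w))


# Toy data: two tight groups at opposite corners of the bounding box.
clustered = [(0.01 * i, 0.01 * i) for i in range(10)]
clustered += [(10 + 0.01 * i, 10 + 0.01 * i) for i in range(10)]
print(hopkins_statistic(clustered, sample_size=5, seed=1))
```

A filter in the spirit of the abstract could recompute such a statistic with each feature left out and drop the features whose removal reduces clusterability the least, although the exact procedure is an assumption here.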
Funding: National Natural Science Foundation of China (No. 59875006).
Abstract: The following topics are discussed: feature clustering, the feature cluster concept, and the reasoning formula. The defects of clustering based on approach direction and feed direction are analyzed. The concept of feature tool axis direction and its definition method are presented. The features of a practical part are also clustered by tool axis direction.