This paper proposes a novel hybrid fraud detection framework that integrates multi-stage feature selection, unsupervised clustering, and ensemble learning to improve classification performance in financial transaction monitoring systems. The framework is structured into three core layers: (1) feature selection using Recursive Feature Elimination (RFE), Principal Component Analysis (PCA), and Mutual Information (MI) to reduce dimensionality and enhance input relevance; (2) anomaly detection through unsupervised clustering using K-Means, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and Hierarchical Clustering to flag suspicious patterns in unlabeled data; and (3) final classification using a voting-based hybrid ensemble of Support Vector Machine (SVM), Random Forest (RF), and Gradient Boosting Classifier (GBC). The experimental evaluation is conducted on a synthetically generated dataset comprising one million financial transactions, with 5% labelled as fraudulent, simulating realistic fraud rates and behavioural features, including transaction time, origin, amount, and geo-location. The proposed model demonstrated a significant improvement over baseline classifiers, achieving an accuracy of 99%, a precision of 99%, a recall of 97%, and an F1-score of 99%. Compared to individual models, it yielded a 9% gain in overall detection accuracy and reduced the false positive rate to below 3.5%, thereby minimising the operational costs associated with manually reviewing false alerts. The model’s interpretability is enhanced by the integration of Shapley Additive Explanations (SHAP) values for feature importance, supporting transparency and regulatory auditability. These results affirm the practical relevance of the proposed system for deployment in real-time fraud detection scenarios such as credit card transactions, mobile banking, and cross-border payments. The study also highlights future directions, including the deployment of lightweight models and the integration of multimodal data for scalable fraud analytics.
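As a worked check of how the reported metrics relate to one another, recall that the F1-score is the harmonic mean of precision and recall. The sketch below plugs in the rounded percentages quoted in the abstract; the function and variable names are illustrative and do not come from the paper. With these rounded inputs the harmonic mean comes out near 0.98, so the reported 99% may reflect a different averaging convention (for example, macro- or weighted-averaging across classes), which the abstract does not specify.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values as quoted in the abstract (assumed to refer to the fraud class).
reported_precision = 0.99
reported_recall = 0.97

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 from rounded precision/recall: {f1:.4f}")
```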
Partition-based clustering with weighted features is developed in the framework of shadowed sets. The objects in the core and boundary regions, generated by shadowed sets-based clustering, have different impacts on the prototype of each cluster. By integrating feature weights, a formula for weight calculation is introduced into the clustering algorithm. The selection of the weight exponent is crucial for good results, and the weights are updated iteratively with each partition of clusters. A convergence analysis of the weighted algorithms is given, and feasible cluster validity indices from data mining applications are utilized. Experimental results on both synthetic and real-life numerical data with different feature weights demonstrate that the weighted algorithm is better than the unweighted algorithms.
To enable clustering in a lower-dimensional space, a new feature selection method for clustering is proposed. The method has three steps, all carried out in a wrapper framework. First, all the original features are ranked according to their importance, using an evaluation function E(f) introduced to measure the importance of a feature. Second, the set of important features is selected sequentially. Finally, possibly redundant features are removed from the important feature subset. Because the features are selected sequentially, it is not necessary to search the large feature subset space, which improves efficiency. Experimental results show that the method finds the set of features important for clustering and discards unimportant features or features that may hinder the clustering task.
Feature extraction from range images provided by a ranging sensor is a key issue in pattern recognition. To automatically extract the environmental features sensed by a 2D laser scanner, an improved method based on genetic clustering (VGA-clustering) is presented. By integrating the spatial neighbouring information of range data into a fuzzy clustering algorithm, a weighted fuzzy clustering algorithm (WFCA) is introduced in place of the standard clustering algorithm to extract features from laser scanner data. Because the number of clusters is not known in advance, several validation index functions are used to assess the validity of different clustering algorithms, and one validation index is selected as the fitness function of the genetic algorithm so that the number of clusters is determined automatically. In addition, an improved genetic algorithm (IVGA), built on VGA, is proposed to address the local-optimum problem of the clustering algorithm; it increases population diversity and improves the genetic operators of the elitist rule to enhance local search capacity and speed up convergence. Comparison with other algorithms demonstrates the effectiveness of the proposed algorithm.
Feature selection methods have been successfully applied to text categorization but seldom to text clustering because class label information is unavailable. In this paper, a new feature selection method for text clustering based on expectation maximization and cluster validity is proposed. It applies a supervised feature selection method to the intermediate clustering results generated during iterative clustering, and uses the Davies-Bouldin index to evaluate the intermediate feature subsets indirectly. Feature subsets are then selected according to the curve of the Davies-Bouldin index. Experiments are carried out on several popular datasets, and the results show the advantages of the proposed method.
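To make the validity criterion concrete, the following is a minimal sketch of the standard Davies-Bouldin index computed on an already-clustered dataset; lower values indicate more compact, better-separated clusters. It is not the paper's implementation: the function names and the toy data are assumptions for illustration, and Euclidean distance is assumed as the metric.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of points).

    For each cluster, take the worst-case ratio of summed within-cluster
    scatter to between-centroid distance, then average over clusters.
    Requires at least two clusters with distinct centroids.
    """
    cents = [centroid(c) for c in clusters]
    # Scatter: mean Euclidean distance of each point to its own centroid.
    scatter = [
        sum(math.dist(p, m) for p in c) / len(c)
        for c, m in zip(clusters, cents)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max(
            (scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in range(k)
            if j != i
        )
    return total / k


# Toy example: the same two clusters placed far apart and close together.
well_separated = [[[0, 0], [0, 2]], [[10, 0], [10, 2]]]
poorly_separated = [[[0, 0], [0, 2]], [[3, 0], [3, 2]]]
print(davies_bouldin(well_separated), davies_bouldin(poorly_separated))
```

In a feature selection loop of the kind the abstract describes, one would recompute such an index for each candidate feature subset and inspect how it varies along the curve.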
There may be several internal defects in railway track work that have different shapes and distribution rules, and these defects affect the safety of high-speed trains. Establishing reliable detection models and methods for these internal defects remains a challenging task. To address this challenge, this study proposes an intelligent detection method based on a generalization feature cluster for internal defects of railway tracks. First, the defects are classified and counted according to their shape and location features. Then, generalized features of the internal defects are extracted and formulated based on the maximum difference between different types of defects and the maximum tolerance among defects of the same type. Finally, the extracted generalized features are expressed by function constraints and formulated as generalization feature clusters to classify and identify internal defects in the railway track. Furthermore, to improve detection reliability and speed, a dimension-reduction method for the generalization feature clusters is presented. Based on this reduced-dimension feature and strongly constrained generalized features, the K-means clustering algorithm is developed for defect clustering, and good clustering results are achieved. For defects in the rail head region, the clustering accuracy is over 95%, and the Davies-Bouldin index (DBI) is negligible, which validates the proposed generalization features with strong constraints. Experimental results show that the accuracy of the proposed method based on generalization feature clusters is up to 97.55%, and the average detection time is 0.12 s/frame, which indicates that it performs well in adaptability, accuracy, and detection speed under complex working environments. The proposed algorithm can effectively detect internal defects in railway tracks using an established generalization feature cluster model.
Nearly half of coal mine disasters in China have been found to occur in clusters or to be accompanied by nearby earthquakes, and all disaster types are involved. Stress disturbances seem to exist among mining areas and to be responsible for the observed clustering. The earthquakes accompanying coal mine disasters may be vital geophysical evidence for tectonic stress disturbances around mining areas. This paper analyzes all the possible causative factors to demonstrate the authenticity and reliability of the observed phenomena. A quantitative study was performed on the degree of clustering, and space-time distribution curves were obtained. Under a threshold of 100 km, 47% of disasters are involved in cluster series, and 372 coal mine disasters are accompanied by earthquakes. Most cluster series, lasting 1-2 days, correspond well with nearby earthquakes and are speculated to be related to local stress disturbance, while the minority lasting longer than 4 days correspond well with fatal earthquakes and are speculated to be related to regional stress disturbance. The cluster series possess multiple properties, such as area, distance, and related disasters; when these are compared with the energy and magnitude of earthquakes, good correspondences are obtained. This indicates that the cluster series of coal mine disasters and earthquakes are linked with fatal earthquakes and may serve as footprints of regional stress disturbance. Speculations relating to the geological model are made, and five disaster-causing models are examined. The findings are suggested to have broad scientific significance for earthquake research and disaster prevention.
Feature optimization is important to agricultural text mining. Usually, the vector space model is used to represent text documents. However, this basic approach suffers from two drawbacks: the curse of dimensionality and the lack of semantic information. In this paper, a novel ontology-based feature optimization method for agricultural text is proposed. First, terms of the vector space model are mapped into concepts of an agricultural ontology, whose concept frequency weights are computed statistically from term frequency weights; second, concept similarity weights are assigned to the concept features according to the structure of the agricultural ontology. By combining feature frequency weights and feature similarity weights based on the agricultural ontology, the dimensionality of the feature space can be reduced drastically, and semantic information is incorporated into the representation. The results show that this feature optimization yields a significant improvement in agricultural text clustering.
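The first step, mapping terms onto ontology concepts and aggregating their frequencies, can be sketched as follows. This is only an illustration under stated assumptions: the tiny `TERM_TO_CONCEPT` table stands in for a real agricultural ontology, the function name is hypothetical, terms without a concept are simply dropped, and the similarity-weighting step from the abstract is not modelled.

```python
from collections import defaultdict

# Hypothetical term-to-concept table standing in for an agricultural ontology.
TERM_TO_CONCEPT = {
    "maize": "cereal",
    "corn": "cereal",
    "wheat": "cereal",
    "aphid": "pest",
    "locust": "pest",
}


def concept_frequencies(term_freqs, term_to_concept):
    """Sum term frequencies over the concept each term maps to.

    Terms with no concept in the table are dropped (an assumption made
    here for simplicity, not a detail taken from the paper).
    """
    concept_freqs = defaultdict(float)
    for term, freq in term_freqs.items():
        concept = term_to_concept.get(term)
        if concept is not None:
            concept_freqs[concept] += freq
    return dict(concept_freqs)


doc = {"maize": 3, "corn": 2, "wheat": 1, "aphid": 4, "tractor": 1}
print(concept_frequencies(doc, TERM_TO_CONCEPT))
# {'cereal': 6.0, 'pest': 4.0}
```

Five term features collapse into two concept features here, which is the kind of dimensionality reduction the abstract refers to.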
Due to the widespread use of the Internet, customer information is vulnerable to attacks on computer systems, which creates an urgent need for intrusion detection technology. Recently, network intrusion detection has become one of the most important technologies in network security. Existing network intrusion detection methods have achieved high accuracy. However, these methods have very low efficiency, even the most popular SOM neural network method. In this paper, an efficient and fast network intrusion detection method is proposed. First, the fundamentals of the two different methods are introduced. Then, the self-organizing feature map neural network based on K-means clustering (KSOM) algorithm is presented to improve the efficiency of network intrusion detection. Finally, NSL-KDD is used as the network intrusion data set to demonstrate that the KSOM method significantly reduces the number of clustering iterations compared with the SOM method without substantially affecting the clustering results, and that its accuracy is much higher than that of the K-means method. The experimental results show that the proposed method improves the accuracy of network intrusion detection and significantly reduces the number of clustering iterations.
Feature selection is very important for obtaining meaningful and interpretable results from a clustering analysis. In the application of soil data clustering, there is a lack of good understanding of how clustering performance responds to different feature subsets. In the present paper, we analyze the performance differences between the k-means, fuzzy c-means, and spectral clustering algorithms under different feature subsets of soil data sets. The experimental results demonstrate that spectral clustering generally performs better than k-means and fuzzy c-means across the different feature subsets. Feature subsets containing environmental attributes improve clustering performance more than those containing spatial attributes and produce more accurate and meaningful clustering results. Our results suggest that combining the spectral clustering algorithm with feature subsets containing environmental attributes rather than spatial attributes may be a better choice in applications of soil data clustering.
Effective storage, processing, and analysis of power device condition monitoring data face enormous challenges. A framework based on the Aliyun DTplus platform is proposed that supports both MapReduce and Graph for massive monitoring data analysis. First, power device condition monitoring data storage based on MaxCompute tables and parallel permutation entropy feature extraction based on MaxCompute MapReduce are designed and implemented on the DTplus platform. Then, a Graph-based k-means algorithm is implemented and used for clustering analysis of massive condition monitoring data. Finally, performance tests are performed to compare the execution time of the serial and parallel programs. Performance is analyzed in terms of CPU core consumption, memory utilization, and parallel granularity. Experimental results show that the designed framework and parallel algorithms can efficiently process massive power device condition monitoring data.
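For readers unfamiliar with the feature being extracted, here is a minimal serial sketch of normalized permutation entropy for a single time series. It is not the paper's MaxCompute MapReduce implementation; the function name, the default `order` and `delay` values, and the example series are assumptions for illustration.

```python
import math
from collections import Counter


def permutation_entropy(series, order=3, delay=1):
    """Normalized permutation entropy in [0, 1].

    Each window of `order` samples (spaced `delay` apart) is reduced to
    its ordinal pattern; the Shannon entropy of the pattern distribution
    is divided by log(order!) so that 1.0 means all patterns equally likely.
    """
    n_windows = len(series) - (order - 1) * delay
    if n_windows <= 0:
        raise ValueError("series is too short for the given order and delay")
    patterns = Counter()
    for i in range(n_windows):
        window = [series[i + j * delay] for j in range(order)]
        # Ordinal pattern: indices of the window sorted by value.
        pattern = tuple(sorted(range(order), key=lambda k: window[k]))
        patterns[pattern] += 1
    entropy = sum(
        (count / n_windows) * math.log(n_windows / count)
        for count in patterns.values()
    )
    return entropy / math.log(math.factorial(order))


# A strictly increasing signal has a single ordinal pattern, hence zero entropy.
print(permutation_entropy(list(range(10))))
# A short irregular signal, using pairs of consecutive samples (order=2).
print(round(permutation_entropy([4, 7, 9, 10, 6, 11, 3], order=2), 4))
```

Because each series (or each segment of a series) can be processed independently, this computation maps naturally onto a map step per segment, which is presumably what makes it suitable for the parallel framework the abstract describes.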
Multi-label learning deals with objects associated with multiple class labels, and aims to induce a predictive model which can assign a set of relevant class labels to an unseen instance. Since each class might possess its own characteristics, the strategy of extracting label-specific features has been widely employed to improve the discrimination process in multi-label learning, where the predictive model is induced based on tailored features specific to each class label instead of identical instance representations. As a representative approach, LIFT generates label-specific features by conducting clustering analysis. However, its performance may be degraded by the inherent instability of a single clustering algorithm. To improve this, a novel multi-label learning approach named SENCE (stable label-Specific features gENeration for multi-label learning via mixture-based Clustering Ensemble) is proposed, which stabilizes the generation process of label-specific features via clustering ensemble techniques. Specifically, more stable clustering results are obtained by first augmenting the original instance representation with cluster assignments from base clusters and then fitting a mixture model via the expectation-maximization (EM) algorithm. Extensive experiments on eighteen benchmark data sets show that SENCE performs better than LIFT and other well-established multi-label learning algorithms.
Cluster analysis in spectroscopy presents some unique challenges due to the specific characteristics of spectroscopic data, namely high dimensionality and small sample size. To improve cluster analysis outcomes, feature selection can be used to remove redundant or irrelevant features and reduce the dimensionality. However, for cluster analysis, this must be done in an unsupervised manner without the benefit of data labels. This paper presents a novel feature selection approach for cluster analysis, utilizing clusterability metrics to remove the features that contribute least to a dataset’s tendency to cluster. Two versions are presented and evaluated: the Hopkins clusterability filter, which utilizes the Hopkins test for spatial randomness, and the Dip clusterability filter, which utilizes the Dip test for unimodality. These new techniques, along with a range of existing filter and wrapper feature selection techniques, were evaluated on eleven real-world spectroscopy datasets using internal and external clustering indices. The newly proposed Hopkins clusterability filter performed the best of the six filter techniques evaluated. However, results varied greatly between techniques depending on the specifics of the dataset and the number of features selected, with significant instability observed for most techniques at low numbers of features. The genetic algorithm wrapper technique avoided this instability, performed consistently across all datasets, and gave better results on average than using all the features in the spectra.
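The clusterability measure behind the first filter can be illustrated with a minimal sketch of the Hopkins statistic. This follows one common formulation (nearest-neighbour distances raised to the power of the data dimension, with uniform probes drawn from the bounding box of the data) and is not the paper's code; the function name, the sample size `m`, and the toy data are assumptions. Values near 0.5 suggest spatial randomness, while values near 1 suggest a strong tendency to cluster.

```python
import math
import random


def hopkins(data, m, rng):
    """Hopkins statistic for a list of points, using m samples.

    Compares nearest-neighbour distances from uniform random probes (u)
    with nearest-neighbour distances from sampled real points (w).
    """
    dim = len(data[0])
    lows = [min(p[d] for p in data) for d in range(dim)]
    highs = [max(p[d] for p in data) for d in range(dim)]

    # w: distance from each sampled data point to its nearest other data point.
    w = []
    for i in rng.sample(range(len(data)), m):
        w.append(min(math.dist(data[i], q) for j, q in enumerate(data) if j != i))

    # u: distance from each uniform probe in the bounding box to the nearest data point.
    u = []
    for _ in range(m):
        probe = [rng.uniform(lo, hi) for lo, hi in zip(lows, highs)]
        u.append(min(math.dist(probe, q) for q in data))

    u_sum = sum(x ** dim for x in u)
    w_sum = sum(x ** dim for x in w)
    return u_sum / (u_sum + w_sum)


# Toy data: two tight, well-separated blobs in two dimensions.
rng = random.Random(0)
blobs = [[rng.gauss(0, 0.05), rng.gauss(0, 0.05)] for _ in range(50)]
blobs += [[rng.gauss(10, 0.05), rng.gauss(10, 0.05)] for _ in range(50)]
print(hopkins(blobs, m=20, rng=rng))
```

A filter in the spirit of the abstract would recompute such a statistic with each candidate feature removed and discard the features whose removal lowers clusterability the least; that loop is omitted here.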
The following topics are discussed: feature clusters, the feature cluster concept, and the reasoning formula. The defects of methods based on approach direction and feed direction are analyzed. The concept of feature tool axis direction and a method for defining it are put forward. The features of a practical part are also clustered by tool axis direction.
The structures and properties of Wn (n = 2-14) clusters were studied using density functional theory (DFT) at the LSDA level. The most stable (global-minimum) structures of the Wn (n = 2-14) clusters were determined. The average binding energy (Eb), the first and second differences of total energy (ΔE, Δ2E), the vertical detachment energy (VDE), and the HOMO-LUMO gap as functions of cluster size were also discussed. The abrupt decrease of the VDE and the HOMO-LUMO gap at sizes n = 8 and 10 implies that the W8 and W10 clusters have metallic features. These changes are accompanied by delocalization of the electron charge density and strong hybridization between the 5d and 6s orbitals in the W8 and W10 clusters. Our results are in good agreement with the available experimental data.
To solve the problem of indoor place recognition for indoor service robots, a novel algorithm, clustering of features and images (CFI), is proposed in this work. Unlike traditional indoor place recognition methods, which are based on kernels or bags of features with a large-margin classifier, CFI is based on feature matching, image similarity, and clustering of features and images. It establishes independent local feature clusters by feature cloud registration to represent each room, and defines an image distance to describe the similarity between images or feature clusters, which determines the label of query images. In addition, it improves recognition speed by image scaling, and uses state inertia and a hidden Markov model to constrain state transitions and eliminate implausible misrecognitions, achieving remarkable precision and speed. A series of experiments on standard databases shows that the algorithm achieves a recognition rate of up to 97% at a speed of over 30 fps, which is much better than traditional methods. Its precision and speed demonstrate strong discriminative power in complicated environments.
This work reports the structural features and internal motions of a novel hyperbranching cluster system in dilute solution. The cluster system is composed of HB-PS_(300)-g-PtBA_(45) hypergraft copolymer chains with uniform subchains, high molar mass, and low polydispersity (M_(w)=1.73×10^(6) g/mol and <M_(w)/M_(n)>≈1.07), where HB-PS and PtBA represent the hyperbranched polystyrene core and the poly(tert-butyl acrylate) graft, respectively. In the selective solvent for the PS blocks (cyclohexane, T_(θ)=34.5℃), the aggregation kinetics and structural features of the assembled clusters are found to be precisely tunable by the aggregation temperature (11℃<T<17℃) and time (0 h<t<24 h). An interesting structural evolution is observed: the fractal dimension (d_(f)) of the clusters first increases and then decreases with t, eventually reaching a plateau value of d_(f)≈3.0, which corresponds to a uniform spherical structure. By using dynamic light scattering (DLS) to monitor the number and strength of relaxation modes in Γ(q), with Γ being the decay rate and q the scattering vector, it is quantitatively revealed that the relaxation, intensity contribution, and mode origin of the internal motions of the clusters are similar neither to those of previously reported cluster systems with high polydispersity nor to those of classical linear chain systems. In particular, in the broad range 2.0<qR_(h)<6.0, the reduced first cumulant [Γ^(*)=Γ(q)/(q^(3)k_(B)T/η_(0))] does not display asymptotic behavior, whereas better asymptotic behavior is observed when plotting Γ(q)/q^(4) versus qR_(h). For the first time, our observation provides direct evidence that, for a hyperbranching cluster system with low polydispersity and high local chain segment density, the hydrodynamic interaction is greatly weakened due to the enhanced hydrodynamic shielding effect.
Stream morphology is an important indicator for revealing the geomorphological features and evolution of the Yangtze River. Existing studies on the morphology of the Yangtze River focus on planar features. However, the vertical features are also important, as they mainly control the flow ability and erosion intensity. Furthermore, traditional studies often focus on a few stream profiles in the Yangtze River. However, stream profiles are linked together by runoff nodes and thus jointly affect the geomorphological evolution of the Yangtze River. In this study, a clustering method for stream profiles in the Yangtze River is proposed by plotting all profiles together. Then, a stream evolution index is used to investigate the geomorphological features of the stream profile clusters to reveal the evolution of the Yangtze River. Based on the stream profile clusters, the erosion base of the Yangtze River generally changes from steep to gentle from the upper reaches to the lower reaches, and the evolution degree of the stream changes from low to high. The asymmetric distribution of knickpoints in the Hanshui River Basin supports the view that the boundary of the eastward growth of the Tibetan Plateau has reached the vicinity of the Daba Mountains.
Epigenetics is the study of phenotypic variations that do not alter DNA sequences. Cancer epigenetics has grown rapidly over the past few years, as epigenetic alterations exist in all human cancers. One of these alterations is DNA methylation, an epigenetic process that regulates gene expression and often occurs at tumor suppressor gene loci in cancer. Therefore, studying this methylation process may shed light on gene functions that cannot otherwise be interpreted using the changes that occur in DNA sequences. Currently, microarray technologies, such as Illumina Infinium BeadChip assays, are used to study DNA methylation at an extremely large number of loci. At each DNA methylation site, a beta value (β) is used to reflect the methylation intensity. Therefore, clustering these data from various types of cancers may lead to the discovery of large partitions that can help objectively classify different types of cancers as well as identify the relevant loci without user bias. This study proposes a Nested Big Data Clustering Genetic Algorithm (NBDC-GA), a novel evolutionary metaheuristic technique that can perform cluster-based feature selection on DNA methylation sites. The efficacy of the NBDC-GA was tested using real-world data sets retrieved from The Cancer Genome Atlas (TCGA), a cancer genomics program created by the National Cancer Institute (NCI) and the National Human Genome Research Institute. The performance of the NBDC-GA was then compared with that of a recently developed metaheuristic Immuno-Genetic Algorithm (IGA) that was tested using the same data sets. The NBDC-GA outperformed the IGA in terms of convergence performance. Furthermore, the NBDC-GA produced a more robust clustering configuration while simultaneously decreasing the dimensionality of features by a maximum of 67% and 94.5% for individual cancer types and collective cancer, respectively. The proposed NBDC-GA was also able to identify two chromosomes with highly contrasting DNA methylation activities that were previously linked to cancer.
Air quality prediction is an important part of environmental governance. The accuracy of air quality prediction also affects the planning of people’s outdoor activities. Mining effective information from historical air pollution data and discarding unimportant factors in order to predict how pollution changes is of great significance for pollution prevention, pollution control, and pollution early warning. In this paper, we take into account that air pollutants follow different trends and that different climatic factors have different effects on air pollutants. First, air pollutant data from different cities are collected using a sliding window technique, and the data from different cities within the sliding window are clustered by the Kohonen method to find shared trends in air pollutants. On this basis, combined with weather data, we use the ReliefF method to extract the climate factor features that are helpful for prediction. Finally, the different types of air pollutants and the correspondingly extracted climate factor features are used to train different sub-models. The experimental results of different algorithms on different air pollutants show that this method improves not only the accuracy of air quality prediction but also the operating efficiency.
Funding: Funded by the Deanship of Scientific Research, Vice Presidency for Graduate Studies and Scientific Research, King Faisal University, Saudi Arabia [Grant No. KFU241683].
Funding: Supported by the National Natural Science Foundation of China (61139002).
Abstract: Partition-based clustering with weighted features is developed in the framework of shadowed sets. The objects in the core and boundary regions, generated by shadowed sets-based clustering, have different impacts on the prototype of each cluster. By integrating feature weights, a formula for weight calculation is introduced into the clustering algorithm. The selection of the weight exponent is crucial for good results, and the weights are updated iteratively with each partition of clusters. The convergence of the weighted algorithms is given, and feasible cluster validity indices for data mining applications are utilized. Experimental results on both synthetic and real-life numerical data with different feature weights demonstrate that the weighted algorithm performs better than the unweighted algorithms.
Abstract: To enable clustering to be done in a lower dimension, a new feature selection method for clustering is proposed. The method has three steps, all carried out in a wrapper framework. First, all the original features are ranked according to their importance, using an evaluation function E(f) introduced to measure the importance of a feature. Second, the set of important features is selected sequentially. Finally, possibly redundant features are removed from the important feature subset. Because the features are selected sequentially, it is not necessary to search through the large feature subset space, so efficiency is improved. Experimental results show that the method finds the set of important features for clustering and discards unimportant features or features that may hinder the clustering task.
Funding: The National Natural Science Foundation of China (60234030); the Natural Science Foundation of He’nan Educational Committee of China (2007520019, 2008B520015); the Doctoral Foundation of Henan Polytechnic University of China (B050901, B2008-61).
Abstract: Feature extraction from range images provided by a ranging sensor is a key issue in pattern recognition. To automatically extract the environmental features sensed by a 2D laser-scanner ranging sensor, an improved method based on the genetic clustering algorithm VGA-clustering is presented. By integrating the spatial neighbouring information of range data into a fuzzy clustering algorithm, a weighted fuzzy clustering algorithm (WFCA) is introduced in place of the standard clustering algorithm to realize feature extraction for the laser scanner. Because the number of clusters is not known in advance, several validation index functions are used to estimate the validity of different clustering algorithms, and one validation index is selected as the fitness function of the genetic algorithm so that the number of clusters is determined automatically. In addition, an improved genetic algorithm, IVGA, is proposed on the basis of VGA to address the local-optimum problem of the clustering algorithm. It increases population diversity and improves the genetic operators of the elitist rule to enhance local search capacity and speed up convergence. Comparison with other algorithms demonstrates the effectiveness of the proposed algorithm.
Funding: Supported by the National Natural Science Foundation of China (60503020, 60373066); the Outstanding Young Scientist’s Fund (60425206); the Natural Science Foundation of Jiangsu Province (BK2005060); the Opening Foundation of Jiangsu Key Laboratory of Computer Information Processing Technology in Soochow University.
Abstract: Feature selection methods have been successfully applied to text categorization but seldom to text clustering, because class label information is unavailable. In this paper, a new feature selection method for text clustering based on expectation maximization and cluster validity is proposed. It applies a supervised feature selection method to the intermediate clustering result generated during iterative clustering in order to select features for text clustering; meanwhile, the Davies-Bouldin index is used to evaluate the intermediate feature subsets indirectly. Feature subsets are then selected according to the curve of the Davies-Bouldin index. Experiments are carried out on several popular datasets, and the results show the advantages of the proposed method.
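The Davies-Bouldin index used in the abstract above to score intermediate feature subsets can be computed as follows. This is a minimal sketch of the standard definition with Euclidean distances and centroid-based scatter; the function name `davies_bouldin` and the toy points are hypothetical, and the paper's exact variant is not stated in the abstract. Lower values indicate more compact, better separated clusters.

```python
import math


def davies_bouldin(points, labels):
    """Davies-Bouldin index; assumes at least two clusters with distinct centroids."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    # Centroid of each cluster (coordinate-wise mean).
    centroids = {
        label: tuple(sum(coord) / len(members) for coord in zip(*members))
        for label, members in clusters.items()
    }
    # Scatter: mean distance from the members of a cluster to its centroid.
    scatter = {
        label: sum(math.dist(p, centroids[label]) for p in members) / len(members)
        for label, members in clusters.items()
    }

    keys = list(clusters)
    total = 0.0
    for i in keys:
        # Worst-case similarity of cluster i to any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in keys
            if j != i
        )
    return total / len(keys)


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
print(davies_bouldin(points, [0, 0, 1, 1]))  # well separated grouping
print(davies_bouldin(points, [0, 1, 0, 1]))  # poorly separated grouping
```

In a selection loop of the kind the abstract describes, one would evaluate this index for each candidate feature subset and inspect how it changes as features are added or removed.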
Funding: National Natural Science Foundation of China (Grant No. 61573233); Guangdong Provincial Natural Science Foundation of China (Grant No. 2018A0303130188); Guangdong Provincial Science and Technology Special Funds Project of China (Grant No. 190805145540361); Special Projects in Key Fields of Colleges and Universities in Guangdong Province of China (Grant No. 2020ZDZX2005).
Abstract: Railway tracks may contain several internal defects with different shapes and distribution rules, and these defects affect the safety of high-speed trains. Establishing reliable detection models and methods for these internal defects remains a challenging task. To address this challenge, this study proposes an intelligent detection method for internal defects of railway tracks based on a generalization feature cluster. First, the defects are classified and counted according to their shape and location features. Then, generalized features of the internal defects are extracted and formulated based on the maximum difference between different types of defects and the maximum tolerance among defects of the same type. Finally, the extracted generalized features are expressed by function constraints and formulated as generalization feature clusters to classify and identify internal defects in the railway track. Furthermore, to improve detection reliability and speed, a reduced-dimension method for the generalization feature clusters is presented. Based on this reduced-dimension feature and strongly constrained generalized features, the K-means clustering algorithm is developed for defect clustering, and good clustering results are achieved. For defects in the rail head region, the clustering accuracy is over 95% and the Davies-Bouldin index (DBI) is negligible, which validates the proposed generalization features with strong constraints. Experimental results show that the accuracy of the proposed method based on generalization feature clusters is up to 97.55% and the average detection time is 0.12 s/frame, which indicates that it performs well in adaptability, accuracy, and detection speed under complex working environments. The proposed algorithm can effectively detect internal defects in railway tracks using an established generalization feature cluster model.
Abstract: Nearly half of coal mine disasters in China have been found to occur in clusters or to be accompanied by earthquakes nearby, and all disaster types are involved. Stress disturbances seem to exist among mining areas and to be responsible for the observed clustering. The earthquakes accompanying coal mine disasters may be vital geophysical evidence for tectonic stress disturbances around mining areas. This paper analyzes all the possible causative factors to demonstrate the authenticity and reliability of the observed phenomena. A quantitative study was performed on the degree of clustering, and space-time distribution curves are obtained. Under a threshold of 100 km, 47% of disasters are involved in cluster series and 372 coal mine disasters are accompanied by earthquakes. The majority of cluster series, lasting 1-2 days, correspond well with earthquakes nearby and are speculated to be related to local stress disturbance, while the minority, lasting longer than 4 days, correspond well with fatal earthquakes and are speculated to be related to regional stress disturbance. The cluster series possess multiple properties, such as the area, the distance, and the related disasters, and good correspondences are found when these are compared with the energy and magnitude of earthquakes. This indicates that the cluster series of coal mine disasters and earthquakes are linked with fatal earthquakes and may serve as footprints of regional stress disturbance. Speculations relating to the geological model are made, and five disaster-causing models are examined. The findings are suggested to have broad scientific significance for earthquake research and disaster prevention.
Funding: Supported by the National Natural Science Foundation of China (60774096) and the National High-Tech R&D Program of China (2008BAK49B05).
Abstract: Feature optimization is important to agricultural text mining. Usually, the vector space model is used to represent text documents. However, this basic approach suffers from two drawbacks: the curse of dimensionality and the lack of semantic information. In this paper, a novel ontology-based feature optimization method for agricultural text is proposed. First, terms of the vector space model are mapped into concepts of an agricultural ontology, whose concept frequency weights are computed statistically from term frequency weights. Second, concept similarity weights are assigned to the concept features according to the structure of the agricultural ontology. By combining feature frequency weights and feature similarity weights based on the agricultural ontology, the dimensionality of the feature space can be reduced drastically, and semantic information can be incorporated. The results show that this feature optimization yields a significant improvement in agricultural text clustering.
Abstract: Due to the widespread use of the Internet, customer information is vulnerable to attacks on computer systems, which creates an urgent need for intrusion detection technology. Recently, network intrusion detection has become one of the most important technologies in network security. Existing network intrusion detection methods have reached high accuracy. However, these methods have very low efficiency, even the most popular SOM neural network method. In this paper, an efficient and fast network intrusion detection method is proposed. First, the fundamentals of the two methods are introduced. Then, a self-organizing feature map neural network based on K-means clustering (KSOM) is presented to improve the efficiency of network intrusion detection. Finally, the NSL-KDD data set is used as network intrusion data to demonstrate that the KSOM method requires significantly fewer clustering iterations than the SOM method without substantially affecting the clustering results, and that its accuracy is much higher than that of the K-means method. The experimental results show that the method improves the accuracy of network intrusion detection and significantly reduces the number of clustering iterations.
Abstract: Feature selection is very important for obtaining meaningful and interpretable results from a clustering analysis. In soil data clustering, there is a lack of good understanding of how clustering performance responds to different feature subsets. In the present paper, we analyzed the performance differences between k-means, fuzzy c-means, and spectral clustering algorithms under different feature subsets of soil data sets. The experimental results demonstrated that the spectral clustering algorithm generally performed better than k-means and fuzzy c-means across the feature subsets. Feature subsets containing environmental attributes improved clustering performance more than those containing spatial attributes and produced more accurate and meaningful clustering results. Our results demonstrate that combining the spectral clustering algorithm with feature subsets containing environmental attributes rather than spatial attributes may be a better choice in soil data clustering applications.
Funding: This work has been supported by the Central University Research Fund (No. 2016MS116, No. 2016MS117, No. 2018MS074) and the National Natural Science Foundation (51677072).
Abstract: Effective storage, processing, and analysis of power device condition monitoring data face enormous challenges. A framework is proposed that supports both MapReduce and Graph for massive monitoring data analysis at the same time, based on the Aliyun DTplus platform. First, power device condition monitoring data storage based on MaxCompute tables and parallel permutation entropy feature extraction based on MaxCompute MapReduce are designed and implemented on the DTplus platform. Then, a Graph-based k-means algorithm is implemented and used for clustering analysis of massive condition monitoring data. Finally, performance tests are performed to compare the execution time of the serial and parallel programs. Performance is analyzed in terms of CPU core consumption, memory utilization, and parallel granularity. Experimental results show that the designed framework and parallel algorithms can efficiently process massive power device condition monitoring data.
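The permutation entropy feature mentioned in the abstract above can be sketched serially as follows. This is the standard ordinal-pattern definition rather than the paper's MaxCompute MapReduce implementation, which the abstract does not detail; the function name `permutation_entropy` and its parameters `order`, `delay`, and `normalize` are hypothetical.

```python
import math
from collections import Counter


def permutation_entropy(series, order=3, delay=1, normalize=False):
    """Shannon entropy (in bits) of the ordinal patterns found in a series."""
    n_windows = len(series) - (order - 1) * delay
    if n_windows <= 0:
        raise ValueError("series too short for the requested order and delay")

    patterns = Counter()
    for start in range(n_windows):
        window = [series[start + k * delay] for k in range(order)]
        # The ordinal pattern is the index order that sorts the window.
        pattern = tuple(sorted(range(order), key=window.__getitem__))
        patterns[pattern] += 1

    entropy = -sum(
        (count / n_windows) * math.log2(count / n_windows)
        for count in patterns.values()
    )
    if normalize:
        # Divide by the maximum possible entropy, log2(order!).
        entropy /= math.log2(math.factorial(order))
    return entropy


signal = [4, 7, 9, 10, 6, 11, 3]
print(permutation_entropy(signal, order=3))
```

Because each window is processed independently, the pattern-counting loop maps naturally onto a map step (emit one pattern per window) and a reduce step (sum the counts), which is presumably what makes the feature suitable for parallel extraction.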
Funding: This work was supported by the National Science Foundation of China (62176055) and the China University S&T Innovation Plan Guided by the Ministry of Education.
Abstract: Multi-label learning deals with objects associated with multiple class labels, and aims to induce a predictive model which can assign a set of relevant class labels to an unseen instance. Since each class might possess its own characteristics, the strategy of extracting label-specific features has been widely employed to improve the discrimination process in multi-label learning, where the predictive model is induced based on tailored features specific to each class label instead of identical instance representations. As a representative approach, LIFT generates label-specific features by conducting clustering analysis. However, its performance may be degraded by the inherent instability of a single clustering algorithm. To improve this, a novel multi-label learning approach named SENCE (stable label-Specific features gENeration for multi-label learning via mixture-based Clustering Ensemble) is proposed, which stabilizes the generation process of label-specific features via clustering ensemble techniques. Specifically, more stable clustering results are obtained by first augmenting the original instance representation with cluster assignments from base clusters and then fitting a mixture model via the expectation-maximization (EM) algorithm. Extensive experiments on eighteen benchmark data sets show that SENCE performs better than LIFT and other well-established multi-label learning algorithms.
Abstract: Cluster analysis in spectroscopy presents some unique challenges due to the specific data characteristics in spectroscopy, namely, high dimensionality and small sample size. To improve cluster analysis outcomes, feature selection can be used to remove redundant or irrelevant features and reduce the dimensionality. However, for cluster analysis, this must be done in an unsupervised manner without the benefit of data labels. This paper presents a novel feature selection approach for cluster analysis, utilizing clusterability metrics to remove the features that contribute least to a dataset’s tendency to cluster. Two versions are presented and evaluated: the Hopkins clusterability filter, which utilizes the Hopkins test for spatial randomness, and the Dip clusterability filter, which utilizes the Dip test for unimodality. These new techniques, along with a range of existing filter and wrapper feature selection techniques, were evaluated on eleven real-world spectroscopy datasets using internal and external clustering indices. The newly proposed Hopkins clusterability filter performed the best of the six filter techniques evaluated. However, results varied greatly between techniques depending on the specifics of the dataset and the number of features selected, with significant instability observed for most techniques at low numbers of features. The genetic algorithm wrapper technique avoided this instability, performed consistently across all datasets, and gave better results on average than using all the features in the spectra.
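The Hopkins statistic that underlies the Hopkins clusterability filter in the abstract above can be sketched as follows. This is a minimal version under stated assumptions: Euclidean distances, probe points drawn uniformly from the bounding box of the data, and the convention in which values near 0.5 suggest spatial randomness and values near 1 suggest clustering. The function name `hopkins` and its parameters are hypothetical, and the paper's filter may use a different variant.

```python
import math
import random


def hopkins(points, m, rng):
    """Hopkins statistic H = sum(u) / (sum(u) + sum(w)) using m probes."""
    dims = list(zip(*points))
    lows = [min(d) for d in dims]
    highs = [max(d) for d in dims]

    # w: nearest-neighbour distance from sampled data points to the other data.
    sampled = rng.sample(range(len(points)), m)
    w = [
        min(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
        for i in sampled
    ]

    # u: nearest-neighbour distance from uniform random probes to the data.
    u = []
    for _ in range(m):
        probe = [rng.uniform(lo, hi) for lo, hi in zip(lows, highs)]
        u.append(min(math.dist(probe, q) for q in points))

    return sum(u) / (sum(u) + sum(w))


rng = random.Random(0)
blob_a = [(rng.gauss(0, 0.1), rng.gauss(0, 0.1)) for _ in range(30)]
blob_b = [(rng.gauss(10, 0.1), rng.gauss(10, 0.1)) for _ in range(30)]
print(hopkins(blob_a + blob_b, m=10, rng=rng))
```

A filter in the spirit of the abstract could recompute this statistic with each feature left out in turn and drop the feature whose removal lowers it least, although the exact selection rule used in the paper is not given here.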
Funding: National Natural Science Foundation of China (No. 59875006).
Abstract: The following topics are discussed: feature clusters, the feature cluster concept, and the reasoning formula. The defects of approaches based on approach direction and feed direction are analyzed. The concept of feature tool axis direction and its definition method are presented. The features of a practical part are also clustered by tool axis direction.
Funding: Project supported by the Excellent Young Teachers’ Foundation of Xinjiang Normal University (Grant No. XJNU0730) and the Prior Developing Subject Foundation of Xinjiang Normal University.
Abstract: The structures and properties of Wn (n = 2-14) clusters were studied using density functional theory (DFT) at the LSDA level. The most stable structures of the Wn (n = 2-14) clusters, corresponding to global minima, were determined. The average binding energy (Eb), the first and second differences of total energy (ΔE, Δ₂E), the vertical detachment energy (VDE), and the HOMO-LUMO gap as functions of cluster size were also discussed. The abrupt decrease of the VDE and HOMO-LUMO gap at sizes n = 8 and 10 implies that the W8 and W10 clusters appear to have metallic features. These changes were also accompanied by the delocalization of electron charge density and strong hybridization between the 5d and 6s orbitals in the W8 and W10 clusters. The results are in good agreement with the available experimental data.
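For readers unfamiliar with the quantities named in the abstract above, the following are the definitions commonly used in cluster studies. They are an assumption here: the abstract does not state its formulas, and the paper's sign conventions may differ. $E(\mathrm{W}_n)$ denotes the total energy of the $n$-atom cluster and $E(\mathrm{W})$ that of a single atom.

```latex
\begin{align}
E_b(n) &= \frac{n\,E(\mathrm{W}) - E(\mathrm{W}_n)}{n}, \\
\Delta E(n) &= E(\mathrm{W}_{n-1}) + E(\mathrm{W}) - E(\mathrm{W}_n), \\
\Delta_2 E(n) &= E(\mathrm{W}_{n+1}) + E(\mathrm{W}_{n-1}) - 2\,E(\mathrm{W}_n).
\end{align}
```

Under these conventions, a peak in $\Delta_2 E(n)$ marks a cluster that is more stable than its neighbours in size.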
Funding: Supported by the National Natural Science Foundation of China (Nos. 61305103 and 61473103); the Natural Science Foundation of Heilongjiang Province (No. QC2014C072); the Postdoctoral Science Foundation of Heilongjiang (No. LBH-Z14108); the Self-Planned Task of the State Key Laboratory of Robotics and System (HIT) (No. SKLRS201609B).
Abstract: To solve the problem of indoor place recognition for indoor service robots, a novel algorithm, clustering of features and images (CFI), is proposed in this work. Unlike traditional indoor place recognition methods, which are based on kernels or bags of features with a large margin classifier, CFI is based on feature matching, image similarity, and clustering of features and images. It establishes independent local feature clusters by feature cloud registration to represent each room, and defines an image distance to describe the similarity between images or feature clusters, which determines the label of query images. In addition, it improves recognition speed by image scaling, and uses state inertia and a hidden Markov model to constrain state transitions and eliminate implausible misrecognitions, achieving remarkable precision and speed. A series of experiments on standard databases shows that the algorithm achieves a recognition rate of up to 97% at a speed of over 30 fps, which is much superior to traditional methods. Its precision and speed demonstrate great discriminative power in complicated environments.
Funding: Financially supported by the National Natural Science Foundation of China (No. 21973088) and the Shenzhen Science and Technology Program (Nos. RCYX20210706092101012 and ZDSYS20210623100800001).
Abstract: This work reports the structural features and internal motion of a novel hyperbranching cluster system in dilute solution. The cluster system is composed of HB-PS_(300)-g-PtBA_(45) hypergraft copolymer chains with uniform subchains, high molar mass, and low polydispersity (M_(w) = 1.73×10^(6) g/mol and <M_(w)/M_(n)> ≈ 1.07), where HB-PS and PtBA represent the hyperbranched polystyrene core and the poly(tert-butyl polyacrylate) graft, respectively. In the selective solvent of PS blocks (cyclohexane, T_(θ) = 34.5℃), the aggregation kinetics and structural features of the assembled clusters are found to be precisely tunable by the aggregation temperature (11℃ < T < 17℃) and time (0 h < t < 24 h). An interesting structural evolution is observed: the fractal dimension (d_(f)) of the clusters first increases and then decreases with t, and eventually reaches a plateau value of d_(f) ≈ 3.0, which corresponds to a uniform spherical structure. By using dynamic light scattering (DLS) to monitor the number and strength of relaxation modes in Γ(q), with Γ being the decay rate and q being the scattering vector, it is quantitatively revealed that the relaxation, intensity contribution, and mode origin of the internal motions of the clusters are similar neither to previously reported cluster systems with high polydispersity nor to classical linear chain systems. In particular, in the broad range of 2.0 < qR_(h) < 6.0, the reduced first cumulant [Γ^(*) = Γ(q)/(q^(3)k_(B)T/η_(0))] does not display asymptotic behavior, whereas better asymptotic behavior is observed by plotting Γ(q)/q^(4) versus qR_(h). For the first time, this observation provides direct evidence that, for a hyperbranching cluster system with low polydispersity and high local chain segment density, the hydrodynamic interaction is greatly weakened by the enhanced hydrodynamic shielding effect.
Funding: National Natural Science Foundation of China, No. 41930102, No. 41971333.
Abstract: Stream morphology is an important indicator for revealing the geomorphological features and evolution of the Yangtze River. Existing studies on the morphology of the Yangtze River focus on planar features. However, vertical features are also important, because they mainly control flow ability and erosion intensity. Furthermore, traditional studies often focus on only a few stream profiles in the Yangtze River, whereas stream profiles are linked together by runoff nodes and thus jointly affect the geomorphological evolution of the river. In this study, a clustering method for stream profiles in the Yangtze River is proposed by plotting all profiles together. Then, a stream evolution index is used to investigate the geomorphological features of the stream profile clusters and reveal the evolution of the Yangtze River. Based on the stream profile clusters, the erosion base of the Yangtze River generally changes from steep to gentle from the upper reaches to the lower reaches, and the degree of stream evolution changes from low to high. The asymmetric distribution of knickpoints in the Hanshui River Basin supports the view that the boundary of the eastward growth of the Tibetan Plateau has reached the vicinity of the Daba Mountains.
Abstract: Epigenetics is the study of phenotypic variations that do not alter DNA sequences. Cancer epigenetics has grown rapidly over the past few years, as epigenetic alterations exist in all human cancers. One of these alterations is DNA methylation, an epigenetic process that regulates gene expression and often occurs at tumor suppressor gene loci in cancer. Therefore, studying this methylation process may shed light on gene functions that cannot otherwise be interpreted using the changes that occur in DNA sequences. Currently, microarray technologies, such as Illumina Infinium BeadChip assays, are used to study DNA methylation at an extremely large number of varying loci. At each DNA methylation site, a beta value (β) is used to reflect the methylation intensity. Clustering these data from various types of cancers may therefore lead to the discovery of large partitions that can help objectively classify different types of cancers as well as identify the relevant loci without user bias. This study proposed a Nested Big Data Clustering Genetic Algorithm (NBDC-GA), a novel evolutionary metaheuristic technique that can perform cluster-based feature selection based on the DNA methylation sites. The efficacy of the NBDC-GA was tested using real-world data sets retrieved from The Cancer Genome Atlas (TCGA), a cancer genomics program created by the National Cancer Institute (NCI) and the National Human Genome Research Institute. The performance of the NBDC-GA was then compared with that of a recently developed metaheuristic Immuno-Genetic Algorithm (IGA) that was tested using the same data sets. The NBDC-GA outperformed the IGA in terms of convergence performance. Furthermore, the NBDC-GA produced a more robust clustering configuration while simultaneously decreasing the dimensionality of features to a maximum of 67% and 94.5% for individual cancer types and collective cancer, respectively. The proposed NBDC-GA was also able to identify two chromosomes with highly contrasting DNA methylation activities that were previously linked to cancer.
Funding: This research was supported in part by the National Natural Science Foundation of China under grant Nos. 61602202 and 61603146; the Natural Science Foundation of Jiangsu Province under contracts BK20160428 and BK20160427; the Six Talent Peaks Project in Jiangsu Province under contract XYDXX-034; and the project in Jiangsu Association for Science and Technology.
Abstract: Air quality prediction is an important part of environmental governance. The accuracy of air quality prediction also affects the planning of people’s outdoor activities. Mining effective information from historical air pollution data and discarding unimportant factors in order to predict how pollution changes is of great significance for pollution prevention, pollution control, and early warning. In this paper, we take into account that air pollutants follow different trends and that different climatic factors have different effects on air pollutants. First, air pollutant data from different cities are collected using a sliding window, and the data of different cities within the window are clustered by the Kohonen method to find shared trends in air pollutants. On this basis, combined with weather data, we use the ReliefF method to extract the climate factor features that are helpful for prediction. Finally, the different types of air pollutants and the corresponding extracted climate factor features are used to train different sub-models. Experimental results of different algorithms on different air pollutants show that this method improves both the accuracy of air quality prediction and the operating efficiency.