Funding: Supported by the National Natural Science Foundation of China (60503020, 60503033, 60703086); the Natural Science Foundation of Jiangsu Province (BK2006094); the Opening Foundation of Jiangsu Key Laboratory of Computer Information Processing Technology in Soochow University (KJS0714); the Research Foundation of Nanjing University of Posts and Telecommunications (NY207052, NY207082).
Abstract: Although K-means is widely used for general clustering, it typically converges to one of many local minima, so its performance depends heavily on the initial cluster centers. This paper proposes a novel initialization scheme for selecting the initial cluster centers for K-means clustering. The algorithm is based on reverse nearest neighbor (RNN) search, which retrieves all points in a given data set whose nearest neighbor is a given query point. The initial cluster centers computed with this method are found to be very close to the desired cluster centers for iterative clustering algorithms. The procedure is applicable to clustering algorithms for continuous data, and its application to the K-means clustering algorithm is demonstrated. Experiments on several popular datasets show the advantages of the proposed method.
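The RNN query defined in the abstract above can be illustrated with a minimal brute-force sketch. It covers only the query itself, not the paper's initialization scheme, whose details the abstract does not give. The function names, the Euclidean distance, and the sample points are assumptions made for illustration.

```python
from math import dist


def nearest_neighbor(points, i):
    """Return the index of the point closest to points[i], excluding i itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: dist(points[i], points[j]),
    )


def reverse_nearest_neighbors(points, q):
    """Return the indices of all points whose nearest neighbor is points[q]."""
    return [
        i for i in range(len(points))
        if i != q and nearest_neighbor(points, i) == q
    ]


# Hypothetical sample data: two well-separated groups of points.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.5), (10.0, 10.0), (10.0, 11.0)]
print(reverse_nearest_neighbors(points, 0))
```

This version costs O(n^2) distance computations per query and breaks ties by lowest index; a spatial index would be the usual choice for larger data sets.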
Funding: Supported by the National Natural Science Foundation of China (60373066, 60503020); the Outstanding Young Scientist’s Fund (60425206); the Doctor Foundation of Nanjing University of Posts and Telecommunications (2003-02).
Abstract: This paper proposes a new feature selection approach for text categorization (TC) based on a measure of independence between features. It discusses the fundamental hypothesis, widely used in probabilistic TC models, that terms occur in documents independently of each other. However, this basic hypothesis is incomplete with respect to the independence of the feature set. From the viewpoint of feature selection, a new measure of independence between features is designed, and a feature selection algorithm based on it is given to obtain a feature subset. The selected subset is highly relevant to the category and strongly independent between features, so it satisfies the basic hypothesis to the greatest degree. Compared with traditional feature selection methods in TC, which take only relevance into account, the feature subset selected by our method performs better in experiments on the 20 Newsgroups benchmark dataset.
Funding: Supported by the National Natural Science Foundation of China (60503020, 60373066); the Outstanding Young Scientist’s Fund (60425206); the Natural Science Foundation of Jiangsu Province (BK2005060); the Opening Foundation of Jiangsu Key Laboratory of Computer Information Processing Technology in Soochow University.
Abstract: Feature selection methods have been applied successfully to text categorization but seldom to text clustering, because class label information is unavailable. This paper proposes a new feature selection method for text clustering based on expectation maximization and cluster validity. It applies a supervised feature selection method to the intermediate clustering results generated during iterative clustering, while the Davies-Bouldin index is used to evaluate the intermediate feature subsets indirectly. Feature subsets are then selected according to the curve of the Davies-Bouldin index. Experiments on several popular datasets show the advantages of the proposed method.
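For reference, the Davies-Bouldin index mentioned above can be computed as in the following minimal sketch. It uses the standard definition (mean distance to the centroid as cluster scatter, Euclidean distance between centroids) and is not taken from the paper; the function names and sample clusters are assumptions. It requires at least two non-empty clusters with distinct centroids.

```python
from math import dist


def centroid(cluster):
    """Return the coordinate-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n for d in range(len(cluster[0])))


def davies_bouldin(clusters):
    """Davies-Bouldin index of a clustering; lower values mean better separation."""
    centers = [centroid(c) for c in clusters]
    # Scatter of a cluster: mean distance from its points to its centroid.
    scatter = [
        sum(dist(p, m) for p in c) / len(c) for c, m in zip(clusters, centers)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case similarity between cluster i and any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / dist(centers[i], centers[j])
            for j in range(k) if j != i
        )
    return total / k


# Hypothetical sample clustering with two clusters of two points each.
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
print(davies_bouldin(clusters))
```

Because a lower index indicates more compact, better separated clusters, a curve of index values over candidate feature subsets can be scanned for its low points.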
Funding: Supported by the National Natural Science Foundation of China (60425206, 60503033); the National Basic Research Program of China (973 Program, 2002CB312000); the Opening Foundation of State Key Laboratory of Software Engineering in Wuhan University.
Abstract: This paper proposes a method of data-flow testing for Web services composition. First, to facilitate data-flow analysis and constraint collection, the existing model representation of the business process execution language (BPEL) is modified together with an analysis of data dependency, and an exact representation of dead path elimination (DPE) is proposed, which overcomes the difficulties DPE brings to data-flow analysis. Then, definition and use information based on data-flow rules is collected by parsing BPEL and Web services description language (WSDL) documents, and a def-use annotated control flow graph is created. Based on this model, data-flow anomalies that indicate potential errors can be discovered by traversing the paths of the graph, and the all-du-paths used in dynamic data-flow testing for Web services composition are generated automatically. Testers can then design test cases according to the constraints collected for each selected path.
Funding: Supported by the National Natural Science Foundation of China (60503020, 60503033, 60703086); the Opening Foundation of Jiangsu Key Laboratory of Computer Information Processing Technology in Soochow University (KJS0714); the Research Foundation of Nanjing University of Posts and Telecommunications (NY207052, NY207082); the Natural Science Foundation of Jiangsu Province (BK2006094).
Abstract: A new common phrase scoring method is proposed based on term frequency-inverse document frequency (TFIDF) and the independence of the phrase. Combining the two properties helps identify more reasonable common phrases, which improves the accuracy of clustering. An equation for measuring the independence of a phrase is also proposed. The new algorithm, which improves the suffix tree clustering (STC) algorithm, is named improved suffix tree clustering (ISTC). To validate the proposed algorithm, a prototype system is implemented and used to cluster several groups of web search results obtained from the Google search engine. Experimental results show that the improved algorithm offers higher accuracy than traditional suffix tree clustering.
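The TFIDF component of the phrase score can be illustrated with the minimal sketch below. It uses one common TFIDF variant (relative frequency times the natural log of the inverse document frequency), which is an assumption and may differ from the weighting used in the paper. The phrase independence equation is not given in the abstract and is not reproduced here. The function name and sample documents are hypothetical.

```python
from math import log


def tf_idf(phrase, doc, docs):
    """TFIDF weight of a phrase in one document, relative to a collection.

    Each document is a list of phrases. Returns 0.0 for an empty document
    or a phrase that appears in no document of the collection.
    """
    if not doc:
        return 0.0
    df = sum(1 for d in docs if phrase in d)  # documents containing the phrase
    if df == 0:
        return 0.0
    tf = doc.count(phrase) / len(doc)  # relative frequency within this document
    return tf * log(len(docs) / df)


# Hypothetical search-result snippets, already split into candidate phrases.
docs = [
    ["suffix tree", "clustering"],
    ["suffix tree", "web search"],
    ["suffix tree", "clustering", "clustering"],
]
print(tf_idf("web search", docs[1], docs))
```

A phrase that occurs in every document, such as "suffix tree" here, receives a weight of zero, which is why TFIDF alone favors phrases that distinguish one group of results from another.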