Clustering is one of the recently challenging tasks since there is an ever.growing amount of data in scientific research and commercial applications. High quality and fast document clustering algorithms are in great d...Clustering is one of the recently challenging tasks since there is an ever.growing amount of data in scientific research and commercial applications. High quality and fast document clustering algorithms are in great demand to deal with large volume of data. The computational requirements for bringing such growing amount data to a central site for clustering are complex. The proposed algorithm uses optimal centroids for K.Means clustering based on Particle Swarm Optimization(PSO).PSO is used to take advantage of its global search ability to provide optimal centroids which aids in generating more compact clusters with improved accuracy. This proposed methodology utilizes Hadoop and Map Reduce framework which provides distributed storage and analysis to support data intensive distributed applications. Experiments were performed on Reuter's and RCV1 document dataset which shows an improvement in accuracy with reduced execution time.展开更多
With the wider growth of web-based documents,the necessity of automatic document clustering and text summarization is increased.Here,document summarization that is extracting the essential task with appropriate inform...With the wider growth of web-based documents,the necessity of automatic document clustering and text summarization is increased.Here,document summarization that is extracting the essential task with appropriate information,removal of unnecessary data and providing the data in a cohesive and coherent manner is determined to be a most confronting task.In this research,a novel intelligent model for document clustering is designed with graph model and Fuzzy based association rule generation(gFAR).Initially,the graph model is used to map the relationship among the data(multi-source)followed by the establishment of document clustering with the generation of association rule using the fuzzy concept.This method shows benefit in redundancy elimination by mapping the relevant document using graph model and reduces the time consumption and improves the accuracy using the association rule generation with fuzzy.This framework is provided in an interpretable way for document clustering.It iteratively reduces the error rate during relationship mapping among the data(clusters)with the assistance of weighted document content.Also,this model represents the significance of data features with class discrimination.It is also helpful in measuring the significance of the features during the data clustering process.The simulation is done with MATLAB 2016b environment and evaluated with the empirical standards like Relative Risk Patterns(RRP),ROUGE score,and Discrimination Information Measure(DMI)respectively.Here,DailyMail and DUC 2004 dataset is used to extract the empirical results.The proposed gFAR model gives better trade-off while compared with various prevailing approaches.展开更多
In recent years,the volume of information in digital form has increased tremendously owing to the increased popularity of the World Wide Web.As a result,the use of techniques for extracting useful information from lar...In recent years,the volume of information in digital form has increased tremendously owing to the increased popularity of the World Wide Web.As a result,the use of techniques for extracting useful information from large collections of data,and particularly documents,has become more necessary and challenging.Text clustering is such a technique;it consists in dividing a set of text documents into clusters(groups),so that documents within the same cluster are closely related,whereas documents in different clusters are as different as possible.Clustering depends on measuring the content(i.e.,words)of a document in terms of relevance.Nevertheless,as documents usually contain a large number of words,some of them may be irrelevant to the topic under consideration or redundant.This can confuse and complicate the clustering process and make it less accurate.Accordingly,feature selection methods have been employed to reduce data dimensionality by selecting the most relevant features.In this study,we developed a text document clustering optimization model using a novel genetic frog-leaping algorithm that efficiently clusters text documents based on selected features.The proposed approach is based on two metaheuristic algorithms:a genetic algorithm(GA)and a shuffled frog-leaping algorithm(SFLA).The GA performs feature selection,and the SFLA performs clustering.To evaluate its effectiveness,the proposed approach was tested on a well-known text document dataset:the“20Newsgroup”dataset from the University of California Irvine Machine Learning Repository.Overall,after multiple experiments were compared and analyzed,it was demonstrated that using the proposed algorithm on the 20Newsgroup dataset greatly facilitated text document clustering,compared with classical K-means clustering.Nevertheless,this improvement requires longer computational time.展开更多
To discover personalized document structure with the consideration of user preferences,user preferences were captured by limited amount of instance level constraints and given as interested and uninterested key terms....To discover personalized document structure with the consideration of user preferences,user preferences were captured by limited amount of instance level constraints and given as interested and uninterested key terms.Develop a semi-supervised document clustering approach based on the latent Dirichlet allocation(LDA)model,namely,pLDA,guided by the user provided key terms.Propose a generalized Polya urn(GPU) model to integrate the user preferences to the document clustering process.A Gibbs sampler was investigated to infer the document collection structure.Experiments on real datasets were taken to explore the performance of pLDA.The results demonstrate that the pLDA approach is effective.展开更多
This paper focuses on document clustering by clustering algorithm based on a DEnsityTree (CABDET) to improve the accuracy of clustering. The CABDET method constructs a density-based treestructure for every potential c...This paper focuses on document clustering by clustering algorithm based on a DEnsityTree (CABDET) to improve the accuracy of clustering. The CABDET method constructs a density-based treestructure for every potential cluster by dynamically adjusting the radius of neighborhood according to local density. It avoids density-based spatial clustering of applications with noise (DBSCAN) ′s global density parameters and reduces input parameters to one. The results of experiment on real document show that CABDET achieves better accuracy of clustering than DBSCAN method. The CABDET algorithm obtains the max F-measure value 0.347 with the root node's radius of neighborhood 0.80, which is higher than 0.332 of DBSCAN with the radius of neighborhood 0.65 and the minimum number of objects 6.展开更多
The search engines are indispensable tools to find information amidst massive web pages and documents. A good search engine needs to retrieve information not only in a shorter time, but also relevant to the users’ qu...The search engines are indispensable tools to find information amidst massive web pages and documents. A good search engine needs to retrieve information not only in a shorter time, but also relevant to the users’ queries. Most search engines provide short time retrieval to user queries;however, they provide a little guarantee of precision even to the highly detailed users’ queries. In such cases, documents clustering centered on the subject and contents might improve search results. This paper presents a novel method of document clustering, which uses semantic clique. First, we extracted the Features from the documents. Later, the associations between frequently co-occurring terms were defined, which were called as semantic cliques. Each connected component in the semantic clique represented a theme. The documents clustered based on the theme, for which we designed an aggregation algorithm. We evaluated the aggregation algorithm effectiveness using four kinds of datasets. The result showed that the semantic clique based document clustering algorithm performed significantly better than traditional clustering algorithms such as Principal Direction Divisive Partitioning (PDDP), k-means, Auto-Class, and Hierarchical Clustering (HAC). We found that the Semantic Clique Aggregation is a potential model to represent association rules in text and could be immensely useful for automatic document clustering.展开更多
This paper proposes a non-segmented document clustering method using self-organizing map (SOM) and frequent max substring technique to improve the efficiency of information retrieval. SOM has been widely used for docu...This paper proposes a non-segmented document clustering method using self-organizing map (SOM) and frequent max substring technique to improve the efficiency of information retrieval. SOM has been widely used for document clustering and is successful in many applications. However, when applying to non-segmented document, the challenge is to identify any interesting pattern efficiently. There are two main phases in the propose method: preprocessing phase and clustering phase. In the preprocessing phase, the frequent max substring technique is first applied to discover the patterns of interest called Frequent Max substrings that are long and frequent substrings, rather than individual words from the non-segmented texts. These discovered patterns are then used as indexing terms. The indexing terms together with their number of occurrences form a document vector. In the clustering phase, SOM is used to generate the document cluster map by using the feature vector of Frequent Max substrings. To demonstrate the proposed technique, experimental studies and comparison results on clustering the Thai text documents, which consist of non-segmented texts, are presented in this paper. The results show that the proposed technique can be used for Thai texts. The document cluster map generated with the method can be used to find the relevant documents more efficiently.展开更多
As high quality descriptors of web page semantics, social annotations or tags have been used for web document clustering and achieved promising results. However, most web pages have few tags (less than 10). This spa...As high quality descriptors of web page semantics, social annotations or tags have been used for web document clustering and achieved promising results. However, most web pages have few tags (less than 10). This sparsity seriously limits the usage of tags for clustering. In this work, we propose a user-related tag expansion method to overcome this problem, which incorporates additional useful tags into the original tag document by utilizing user tagging data as background knowledge. Unfortunately, simply adding tags may cause topic drift, i.e., the dominant topic(s) of the original document may be changed. To tackle this problem, we have designed a novel generative model called Folk-LDA, which jointly models original and expanded tags as independent observations. Experimental results show that 1) our user-related tag expansion method can be effectively applied to over 90% tagged web documents; 2) Folk-LDA can alleviate topic drift in expansion, especially for those topic-specific documents; 3) the proposed tag-based clustering methods significantly outperform the word-based methods., which indicates that tags could be a better resource for the clustering task.展开更多
Purpose:Detection of research fields or topics and understanding the dynamics help the scientific community in their decisions regarding the establishment of scientific fields.This also helps in having a better collab...Purpose:Detection of research fields or topics and understanding the dynamics help the scientific community in their decisions regarding the establishment of scientific fields.This also helps in having a better collaboration with governments and businesses.This study aims to investigate the development of research fields over time,translating it into a topic detection problem.Design/methodology/approach:To achieve the objectives,we propose a modified deep clustering method to detect research trends from the abstracts and titles of academic documents.Document embedding approaches are utilized to transform documents into vector-based representations.The proposed method is evaluated by comparing it with a combination of different embedding and clustering approaches and the classical topic modeling algorithms(i.e.LDA)against a benchmark dataset.A case study is also conducted exploring the evolution of Artificial Intelligence(AI)detecting the research topics or sub-fields in related AI publications.Findings:Evaluating the performance of the proposed method using clustering performance indicators reflects that our proposed method outperforms similar approaches against the benchmark dataset.Using the proposed method,we also show how the topics have evolved in the period of the recent 30 years,taking advantage of a keyword extraction method for cluster tagging and labeling,demonstrating the context of the topics.Research limitations:We noticed that it is not possible to generalize one solution for all downstream tasks.Hence,it is required to fine-tune or optimize the solutions for each task and even datasets.In addition,interpretation of cluster labels can be subjective and vary based on the readers’opinions.It is also very difficult to evaluate the labeling techniques,rendering the explanation of the clusters further limited.Practical implications:As demonstrated in the case study,we show that in a real-world example,how the proposed method would enable the researchers and reviewers of the academic research to detect,summarize,analyze,and visualize research topics from decades of academic documents.This helps the scientific community and all related organizations in fast and effective analysis of the fields,by establishing and explaining the topics.Originality/value:In this study,we introduce a modified and tuned deep embedding clustering coupled with Doc2Vec representations for topic extraction.We also use a concept extraction method as a labeling approach in this study.The effectiveness of the method has been evaluated in a case study of AI publications,where we analyze the AI topics during the past three decades.展开更多
Conceptual clustering is mainly used for solving the deficiency and incompleteness of domain knowledge.Based on conceptual clustering technology and aiming at the institutional framework and characteristic of Web them...Conceptual clustering is mainly used for solving the deficiency and incompleteness of domain knowledge.Based on conceptual clustering technology and aiming at the institutional framework and characteristic of Web theme information,this paper proposes and implements dynamic conceptual clustering algorithm and merging algorithm for Web documents,and also analyses the super performance of the clustering algorithm in efficiency and clustering accuracy.展开更多
Nowadays exchanging data in XML format become more popular and have widespread application because of simple maintenance and transferring nature of XML documents. So, accelerating search within such a document ensures...Nowadays exchanging data in XML format become more popular and have widespread application because of simple maintenance and transferring nature of XML documents. So, accelerating search within such a document ensures search engine’s efficiency. In this paper, we propose a technique for detecting the similarity in the structure of XML documents;in the following, we would cluster this document with Delaunay Triangulation method. The technique is based on the idea of representing the structure of an XML document as a time series in which each occurrence of a tag corresponds to a given impulse. So we could use Discrete Fourier Transform as a simple method to analyze these signals in frequency domain and make similarity matrices through a kind of distance measurement, in order to group them into clusters. We exploited Delaunay Triangulation as a clustering method to cluster the d-dimension points of XML documents. The results show a significant efficiency and accuracy in front of common methods.展开更多
Hard competition learning has the feature that each point modifies only one cluster centroid that wins. Correspondingly, soft competition learning has the feature that each point modifies not only the cluster centroid...Hard competition learning has the feature that each point modifies only one cluster centroid that wins. Correspondingly, soft competition learning has the feature that each point modifies not only the cluster centroid that wins, but also many other cluster centroids near this point. A soft competition learning method is proposed. Centroid all rank distance (CARD), CARDx, and centroid all rank distance batch K-means (CARDBK) are three clustering algorithms that adopt the proposed soft competition learning method. Among them the extent to which one point affects a cluster centroid depends on the distances from this point to the other nearer cluster centroids, rather than just the rank number of the distance from this point to this cluster centroid among the distances from this point to all cluster centroids. In addition, the validation experiments are carried out in order to compare the three soft competition learning algorithms CARD, CARDx, and CARDBK with several hard competition learning algorithms as well as neural gas (NG) algorithm on five data sets from different sources. Judging from the values of five performance indexes in the clustering results, this kind of soft competition learning method has better clustering effect and efficiency, and has linear scalability.展开更多
Since webpage classification is different from traditional text classification with its irregular words and phrases,massive and unlabeled features,which makes it harder for us to obtain effective feature.To cope with ...Since webpage classification is different from traditional text classification with its irregular words and phrases,massive and unlabeled features,which makes it harder for us to obtain effective feature.To cope with this problem,we propose two scenarios to extract meaningful strings based on document clustering and term clustering with multi-strategies to optimize a Vector Space Model(VSM) in order to improve webpage classification.The results show that document clustering work better than term clustering in coping with document content.However,a better overall performance is obtained by spectral clustering with document clustering.Moreover,owing to image existing in a same webpage with document content,the proposed method is also applied to extract image meaningful terms,and experiment results also show its effectiveness in improving webpage classification.展开更多
背景:色素沉着绒毛结节性滑膜炎在病因、临床表现、诊断与治疗等研究领域依然存在较大争议,对色素沉着绒毛结节性滑膜炎进行文献计量学及可视化研究可以理清该领域研究发展脉络,为未来的研究指明方向。目的:分析色素沉着绒毛结节性滑膜...背景:色素沉着绒毛结节性滑膜炎在病因、临床表现、诊断与治疗等研究领域依然存在较大争议,对色素沉着绒毛结节性滑膜炎进行文献计量学及可视化研究可以理清该领域研究发展脉络,为未来的研究指明方向。目的:分析色素沉着绒毛结节性滑膜炎全球研究现状、热点及趋势。方法:选用Web of Science及中国知网数据库检索1995-2023年所有与色素沉着绒毛结节性滑膜炎相关的文献出版物,用Citespace和Bibliometrics软件对所有文献进行聚类、共现和突现词分析。Web of Science核心合集数据库采用主题词加自由词进行检索,中国知网数据库通过主题词进行检索。最终纳入986篇英文文献和599篇中文文献。结果与结论:①美国在该领域研究占有绝对领导地位,共发文数量、H指数和被引次数均排名第一。中国总发文量排名第4,H指数排名第12,发文质量与国际合作方面仍有待加强。②中国知网数据库相关研究聚类分析显示排名前5的聚类为放射治疗、软组织肉瘤、类风湿性关节炎、磁共振影像和诊断。③持续突现至2023年的关键词有集落刺激因子1、腱鞘巨细胞瘤、个案报道、染色体易位、放射治疗、表达和激酶。④依据关键词及共被引文献分析发现,关于色素沉着绒毛结节性滑膜炎临床特征的研究、集落刺激因子1抑制剂新药开发,以及集落刺激因子1抑制剂在治疗过程中的应用是目前的研究热点。⑤结合主题演化及现有研究热点分析,未来在明确该病病因、发病机制及临床特征的基础上,提高色素沉着绒毛结节性滑膜炎诊断准确率、增强治疗的精准性、降低治疗后复发率将是需要重点关注的问题。展开更多
文摘Clustering is one of the recently challenging tasks since there is an ever.growing amount of data in scientific research and commercial applications. High quality and fast document clustering algorithms are in great demand to deal with large volume of data. The computational requirements for bringing such growing amount data to a central site for clustering are complex. The proposed algorithm uses optimal centroids for K.Means clustering based on Particle Swarm Optimization(PSO).PSO is used to take advantage of its global search ability to provide optimal centroids which aids in generating more compact clusters with improved accuracy. This proposed methodology utilizes Hadoop and Map Reduce framework which provides distributed storage and analysis to support data intensive distributed applications. Experiments were performed on Reuter's and RCV1 document dataset which shows an improvement in accuracy with reduced execution time.
文摘With the wider growth of web-based documents,the necessity of automatic document clustering and text summarization is increased.Here,document summarization that is extracting the essential task with appropriate information,removal of unnecessary data and providing the data in a cohesive and coherent manner is determined to be a most confronting task.In this research,a novel intelligent model for document clustering is designed with graph model and Fuzzy based association rule generation(gFAR).Initially,the graph model is used to map the relationship among the data(multi-source)followed by the establishment of document clustering with the generation of association rule using the fuzzy concept.This method shows benefit in redundancy elimination by mapping the relevant document using graph model and reduces the time consumption and improves the accuracy using the association rule generation with fuzzy.This framework is provided in an interpretable way for document clustering.It iteratively reduces the error rate during relationship mapping among the data(clusters)with the assistance of weighted document content.Also,this model represents the significance of data features with class discrimination.It is also helpful in measuring the significance of the features during the data clustering process.The simulation is done with MATLAB 2016b environment and evaluated with the empirical standards like Relative Risk Patterns(RRP),ROUGE score,and Discrimination Information Measure(DMI)respectively.Here,DailyMail and DUC 2004 dataset is used to extract the empirical results.The proposed gFAR model gives better trade-off while compared with various prevailing approaches.
基金This research was supported by a grant from the Research Center of the Center for Female Scientific and Medical Colleges Deanship of Scientific Research,King Saud University.
文摘In recent years,the volume of information in digital form has increased tremendously owing to the increased popularity of the World Wide Web.As a result,the use of techniques for extracting useful information from large collections of data,and particularly documents,has become more necessary and challenging.Text clustering is such a technique;it consists in dividing a set of text documents into clusters(groups),so that documents within the same cluster are closely related,whereas documents in different clusters are as different as possible.Clustering depends on measuring the content(i.e.,words)of a document in terms of relevance.Nevertheless,as documents usually contain a large number of words,some of them may be irrelevant to the topic under consideration or redundant.This can confuse and complicate the clustering process and make it less accurate.Accordingly,feature selection methods have been employed to reduce data dimensionality by selecting the most relevant features.In this study,we developed a text document clustering optimization model using a novel genetic frog-leaping algorithm that efficiently clusters text documents based on selected features.The proposed approach is based on two metaheuristic algorithms:a genetic algorithm(GA)and a shuffled frog-leaping algorithm(SFLA).The GA performs feature selection,and the SFLA performs clustering.To evaluate its effectiveness,the proposed approach was tested on a well-known text document dataset:the“20Newsgroup”dataset from the University of California Irvine Machine Learning Repository.Overall,after multiple experiments were compared and analyzed,it was demonstrated that using the proposed algorithm on the 20Newsgroup dataset greatly facilitated text document clustering,compared with classical K-means clustering.Nevertheless,this improvement requires longer computational time.
基金National Natural Science Foundations of China(Nos.61262006,61462011,61202089)the Major Applied Basic Research Program of Guizhou Province Project,China(No.JZ20142001)+2 种基金the Science and Technology Foundation of Guizhou Province Project,China(No.LH20147636)the National Research Foundation for the Doctoral Program of Higher Education of China(No.20125201120006)the Graduate Innovated Foundations of Guizhou University Project,China(No.2015012)
文摘To discover personalized document structure with the consideration of user preferences,user preferences were captured by limited amount of instance level constraints and given as interested and uninterested key terms.Develop a semi-supervised document clustering approach based on the latent Dirichlet allocation(LDA)model,namely,pLDA,guided by the user provided key terms.Propose a generalized Polya urn(GPU) model to integrate the user preferences to the document clustering process.A Gibbs sampler was investigated to infer the document collection structure.Experiments on real datasets were taken to explore the performance of pLDA.The results demonstrate that the pLDA approach is effective.
基金Science and Technology Development Project of Tianjin(No. 06FZRJGX02400)National Natural Science Foundation of China (No.60603027)
文摘This paper focuses on document clustering by clustering algorithm based on a DEnsityTree (CABDET) to improve the accuracy of clustering. The CABDET method constructs a density-based treestructure for every potential cluster by dynamically adjusting the radius of neighborhood according to local density. It avoids density-based spatial clustering of applications with noise (DBSCAN) ′s global density parameters and reduces input parameters to one. The results of experiment on real document show that CABDET achieves better accuracy of clustering than DBSCAN method. The CABDET algorithm obtains the max F-measure value 0.347 with the root node's radius of neighborhood 0.80, which is higher than 0.332 of DBSCAN with the radius of neighborhood 0.65 and the minimum number of objects 6.
文摘The search engines are indispensable tools to find information amidst massive web pages and documents. A good search engine needs to retrieve information not only in a shorter time, but also relevant to the users’ queries. Most search engines provide short time retrieval to user queries;however, they provide a little guarantee of precision even to the highly detailed users’ queries. In such cases, documents clustering centered on the subject and contents might improve search results. This paper presents a novel method of document clustering, which uses semantic clique. First, we extracted the Features from the documents. Later, the associations between frequently co-occurring terms were defined, which were called as semantic cliques. Each connected component in the semantic clique represented a theme. The documents clustered based on the theme, for which we designed an aggregation algorithm. We evaluated the aggregation algorithm effectiveness using four kinds of datasets. The result showed that the semantic clique based document clustering algorithm performed significantly better than traditional clustering algorithms such as Principal Direction Divisive Partitioning (PDDP), k-means, Auto-Class, and Hierarchical Clustering (HAC). We found that the Semantic Clique Aggregation is a potential model to represent association rules in text and could be immensely useful for automatic document clustering.
文摘This paper proposes a non-segmented document clustering method using self-organizing map (SOM) and frequent max substring technique to improve the efficiency of information retrieval. SOM has been widely used for document clustering and is successful in many applications. However, when applying to non-segmented document, the challenge is to identify any interesting pattern efficiently. There are two main phases in the propose method: preprocessing phase and clustering phase. In the preprocessing phase, the frequent max substring technique is first applied to discover the patterns of interest called Frequent Max substrings that are long and frequent substrings, rather than individual words from the non-segmented texts. These discovered patterns are then used as indexing terms. The indexing terms together with their number of occurrences form a document vector. In the clustering phase, SOM is used to generate the document cluster map by using the feature vector of Frequent Max substrings. To demonstrate the proposed technique, experimental studies and comparison results on clustering the Thai text documents, which consist of non-segmented texts, are presented in this paper. The results show that the proposed technique can be used for Thai texts. The document cluster map generated with the method can be used to find the relevant documents more efficiently.
基金supported by the National Natural Science Foundation of China under Grant No. 61070111
文摘As high quality descriptors of web page semantics, social annotations or tags have been used for web document clustering and achieved promising results. However, most web pages have few tags (less than 10). This sparsity seriously limits the usage of tags for clustering. In this work, we propose a user-related tag expansion method to overcome this problem, which incorporates additional useful tags into the original tag document by utilizing user tagging data as background knowledge. Unfortunately, simply adding tags may cause topic drift, i.e., the dominant topic(s) of the original document may be changed. To tackle this problem, we have designed a novel generative model called Folk-LDA, which jointly models original and expanded tags as independent observations. Experimental results show that 1) our user-related tag expansion method can be effectively applied to over 90% tagged web documents; 2) Folk-LDA can alleviate topic drift in expansion, especially for those topic-specific documents; 3) the proposed tag-based clustering methods significantly outperform the word-based methods., which indicates that tags could be a better resource for the clustering task.
文摘Purpose:Detection of research fields or topics and understanding the dynamics help the scientific community in their decisions regarding the establishment of scientific fields.This also helps in having a better collaboration with governments and businesses.This study aims to investigate the development of research fields over time,translating it into a topic detection problem.Design/methodology/approach:To achieve the objectives,we propose a modified deep clustering method to detect research trends from the abstracts and titles of academic documents.Document embedding approaches are utilized to transform documents into vector-based representations.The proposed method is evaluated by comparing it with a combination of different embedding and clustering approaches and the classical topic modeling algorithms(i.e.LDA)against a benchmark dataset.A case study is also conducted exploring the evolution of Artificial Intelligence(AI)detecting the research topics or sub-fields in related AI publications.Findings:Evaluating the performance of the proposed method using clustering performance indicators reflects that our proposed method outperforms similar approaches against the benchmark dataset.Using the proposed method,we also show how the topics have evolved in the period of the recent 30 years,taking advantage of a keyword extraction method for cluster tagging and labeling,demonstrating the context of the topics.Research limitations:We noticed that it is not possible to generalize one solution for all downstream tasks.Hence,it is required to fine-tune or optimize the solutions for each task and even datasets.In addition,interpretation of cluster labels can be subjective and vary based on the readers’opinions.It is also very difficult to evaluate the labeling techniques,rendering the explanation of the clusters further limited.Practical implications:As demonstrated in the case study,we show that in a real-world example,how the proposed method would enable the researchers and reviewers of the academic research to detect,summarize,analyze,and visualize research topics from decades of academic documents.This helps the scientific community and all related organizations in fast and effective analysis of the fields,by establishing and explaining the topics.Originality/value:In this study,we introduce a modified and tuned deep embedding clustering coupled with Doc2Vec representations for topic extraction.We also use a concept extraction method as a labeling approach in this study.The effectiveness of the method has been evaluated in a case study of AI publications,where we analyze the AI topics during the past three decades.
基金Suppurted by the Vatonnd“863”Prograan of Chia(2002AAI1010,2003AA0010321)
文摘Conceptual clustering is mainly used for solving the deficiency and incompleteness of domain knowledge.Based on conceptual clustering technology and aiming at the institutional framework and characteristic of Web theme information,this paper proposes and implements dynamic conceptual clustering algorithm and merging algorithm for Web documents,and also analyses the super performance of the clustering algorithm in efficiency and clustering accuracy.
文摘Nowadays exchanging data in XML format become more popular and have widespread application because of simple maintenance and transferring nature of XML documents. So, accelerating search within such a document ensures search engine’s efficiency. In this paper, we propose a technique for detecting the similarity in the structure of XML documents;in the following, we would cluster this document with Delaunay Triangulation method. The technique is based on the idea of representing the structure of an XML document as a time series in which each occurrence of a tag corresponds to a given impulse. So we could use Discrete Fourier Transform as a simple method to analyze these signals in frequency domain and make similarity matrices through a kind of distance measurement, in order to group them into clusters. We exploited Delaunay Triangulation as a clustering method to cluster the d-dimension points of XML documents. The results show a significant efficiency and accuracy in front of common methods.
基金supported by the Project of Natural Science Foundation Research Project of Shaanxi Province of China (2015JM6318)the Humanities and Social Sciences Research Youth Fund Project of Ministry of Education of China (13YJCZH251)
文摘Hard competition learning has the feature that each point modifies only one cluster centroid that wins. Correspondingly, soft competition learning has the feature that each point modifies not only the cluster centroid that wins, but also many other cluster centroids near this point. A soft competition learning method is proposed. Centroid all rank distance (CARD), CARDx, and centroid all rank distance batch K-means (CARDBK) are three clustering algorithms that adopt the proposed soft competition learning method. Among them the extent to which one point affects a cluster centroid depends on the distances from this point to the other nearer cluster centroids, rather than just the rank number of the distance from this point to this cluster centroid among the distances from this point to all cluster centroids. In addition, the validation experiments are carried out in order to compare the three soft competition learning algorithms CARD, CARDx, and CARDBK with several hard competition learning algorithms as well as neural gas (NG) algorithm on five data sets from different sources. Judging from the values of five performance indexes in the clustering results, this kind of soft competition learning method has better clustering effect and efficiency, and has linear scalability.
基金supported by the National Natural Science Foundation of China under Grants No.61100205,No.60873001the HiTech Research and Development Program of China under Grant No.2011AA010705the Fundamental Research Funds for the Central Universities under Grant No.2009RC0212
文摘Since webpage classification is different from traditional text classification with its irregular words and phrases,massive and unlabeled features,which makes it harder for us to obtain effective feature.To cope with this problem,we propose two scenarios to extract meaningful strings based on document clustering and term clustering with multi-strategies to optimize a Vector Space Model(VSM) in order to improve webpage classification.The results show that document clustering work better than term clustering in coping with document content.However,a better overall performance is obtained by spectral clustering with document clustering.Moreover,owing to image existing in a same webpage with document content,the proposed method is also applied to extract image meaningful terms,and experiment results also show its effectiveness in improving webpage classification.
文摘背景:色素沉着绒毛结节性滑膜炎在病因、临床表现、诊断与治疗等研究领域依然存在较大争议,对色素沉着绒毛结节性滑膜炎进行文献计量学及可视化研究可以理清该领域研究发展脉络,为未来的研究指明方向。目的:分析色素沉着绒毛结节性滑膜炎全球研究现状、热点及趋势。方法:选用Web of Science及中国知网数据库检索1995-2023年所有与色素沉着绒毛结节性滑膜炎相关的文献出版物,用Citespace和Bibliometrics软件对所有文献进行聚类、共现和突现词分析。Web of Science核心合集数据库采用主题词加自由词进行检索,中国知网数据库通过主题词进行检索。最终纳入986篇英文文献和599篇中文文献。结果与结论:①美国在该领域研究占有绝对领导地位,共发文数量、H指数和被引次数均排名第一。中国总发文量排名第4,H指数排名第12,发文质量与国际合作方面仍有待加强。②中国知网数据库相关研究聚类分析显示排名前5的聚类为放射治疗、软组织肉瘤、类风湿性关节炎、磁共振影像和诊断。③持续突现至2023年的关键词有集落刺激因子1、腱鞘巨细胞瘤、个案报道、染色体易位、放射治疗、表达和激酶。④依据关键词及共被引文献分析发现,关于色素沉着绒毛结节性滑膜炎临床特征的研究、集落刺激因子1抑制剂新药开发,以及集落刺激因子1抑制剂在治疗过程中的应用是目前的研究热点。⑤结合主题演化及现有研究热点分析,未来在明确该病病因、发病机制及临床特征的基础上,提高色素沉着绒毛结节性滑膜炎诊断准确率、增强治疗的精准性、降低治疗后复发率将是需要重点关注的问题。