This paper focuses on document clustering by clustering algorithm based on a DEnsityTree (CABDET) to improve the accuracy of clustering. The CABDET method constructs a density-based treestructure for every potential c...This paper focuses on document clustering by clustering algorithm based on a DEnsityTree (CABDET) to improve the accuracy of clustering. The CABDET method constructs a density-based treestructure for every potential cluster by dynamically adjusting the radius of neighborhood according to local density. It avoids density-based spatial clustering of applications with noise (DBSCAN) ′s global density parameters and reduces input parameters to one. The results of experiment on real document show that CABDET achieves better accuracy of clustering than DBSCAN method. The CABDET algorithm obtains the max F-measure value 0.347 with the root node's radius of neighborhood 0.80, which is higher than 0.332 of DBSCAN with the radius of neighborhood 0.65 and the minimum number of objects 6.展开更多
Many algorithms have been implemented for the problem of document categorization. The majority work in this area was achieved for English text, while a very few approaches have been introduced for the Arabic text. The...Many algorithms have been implemented for the problem of document categorization. The majority work in this area was achieved for English text, while a very few approaches have been introduced for the Arabic text. The nature of Arabic text is different from that of the English text and the preprocessing of the Arabic text is more challenging. This is due to Arabic language is a highly inflectional and derivational language that makes document mining a hard and complex task. In this paper, we present an Automatic Arabic documents classification system based on kNN algorithm. Also, we develop an approach to solve keywords extraction and reduction problems by using Document Frequency (DF) threshold method. The results indicate that the ability of the kNN to deal with Arabic text outperforms the other existing systems. The proposed system reached 0.95 micro-recall scores with 850 Arabic texts in 6 different categories.展开更多
To efficiently retrieve relevant document from the rapid proliferation of large information collections, a novel immune algorithm for document query optimization is proposed. The essential ideal of the immune algorith...To efficiently retrieve relevant document from the rapid proliferation of large information collections, a novel immune algorithm for document query optimization is proposed. The essential ideal of the immune algorithm is that the crossover and mutation of operator are constructed according to its own characteristics of information retrieval. Immune operator is adopted to avoid degeneracy. Relevant documents retrieved are merged to a single document list according to rank formula. Experimental results show that the novel immune algorithm can lead to substantial improvements of relevant document retrieval effectiveness.展开更多
为提高应用文编写效率,提出一种融合大语言模型(large language model,LLM)与向量知识库(vector knowledge base)的应用文自动生成框架.根据目标应用场景,以人工编写的标准应用文为范本,构建结构化辅助生成文件,并建立相应类型应用文的...为提高应用文编写效率,提出一种融合大语言模型(large language model,LLM)与向量知识库(vector knowledge base)的应用文自动生成框架.根据目标应用场景,以人工编写的标准应用文为范本,构建结构化辅助生成文件,并建立相应类型应用文的向量知识库.利用目标类型应用文的章节标题和用户输入的关键信息在知识库中进行检索,匹配相关文段;设置提示词引导LLM,以召回的参考文段及用户输入的提示信息为参考,使用末级标题作为分割标志,分章节生成应用文文本;最终按规定格式整合全文并输出完整的目标应用文.以应急预案为例,在同一评价标准下使用ChatGPT-4Turbo进行评测,自动生成的应急预案高度趋近于人工编写的质量,二者的文档质量相似度达95.87%.所提方法能够在算力资源有限的情况下突破字数限制,生成符合基本标准的长篇幅应用文,可供人工参考或直接使用,极大提高了编写人员的工作效率.展开更多
基金Science and Technology Development Project of Tianjin(No. 06FZRJGX02400)National Natural Science Foundation of China (No.60603027)
文摘This paper focuses on document clustering by clustering algorithm based on a DEnsityTree (CABDET) to improve the accuracy of clustering. The CABDET method constructs a density-based treestructure for every potential cluster by dynamically adjusting the radius of neighborhood according to local density. It avoids density-based spatial clustering of applications with noise (DBSCAN) ′s global density parameters and reduces input parameters to one. The results of experiment on real document show that CABDET achieves better accuracy of clustering than DBSCAN method. The CABDET algorithm obtains the max F-measure value 0.347 with the root node's radius of neighborhood 0.80, which is higher than 0.332 of DBSCAN with the radius of neighborhood 0.65 and the minimum number of objects 6.
文摘Many algorithms have been implemented for the problem of document categorization. The majority work in this area was achieved for English text, while a very few approaches have been introduced for the Arabic text. The nature of Arabic text is different from that of the English text and the preprocessing of the Arabic text is more challenging. This is due to Arabic language is a highly inflectional and derivational language that makes document mining a hard and complex task. In this paper, we present an Automatic Arabic documents classification system based on kNN algorithm. Also, we develop an approach to solve keywords extraction and reduction problems by using Document Frequency (DF) threshold method. The results indicate that the ability of the kNN to deal with Arabic text outperforms the other existing systems. The proposed system reached 0.95 micro-recall scores with 850 Arabic texts in 6 different categories.
基金TheNationalHigh TechDevelopment 863ProgramofChina (No .2 0 0 3AA1Z2 610 )
文摘To efficiently retrieve relevant document from the rapid proliferation of large information collections, a novel immune algorithm for document query optimization is proposed. The essential ideal of the immune algorithm is that the crossover and mutation of operator are constructed according to its own characteristics of information retrieval. Immune operator is adopted to avoid degeneracy. Relevant documents retrieved are merged to a single document list according to rank formula. Experimental results show that the novel immune algorithm can lead to substantial improvements of relevant document retrieval effectiveness.
文摘为提高应用文编写效率,提出一种融合大语言模型(large language model,LLM)与向量知识库(vector knowledge base)的应用文自动生成框架.根据目标应用场景,以人工编写的标准应用文为范本,构建结构化辅助生成文件,并建立相应类型应用文的向量知识库.利用目标类型应用文的章节标题和用户输入的关键信息在知识库中进行检索,匹配相关文段;设置提示词引导LLM,以召回的参考文段及用户输入的提示信息为参考,使用末级标题作为分割标志,分章节生成应用文文本;最终按规定格式整合全文并输出完整的目标应用文.以应急预案为例,在同一评价标准下使用ChatGPT-4Turbo进行评测,自动生成的应急预案高度趋近于人工编写的质量,二者的文档质量相似度达95.87%.所提方法能够在算力资源有限的情况下突破字数限制,生成符合基本标准的长篇幅应用文,可供人工参考或直接使用,极大提高了编写人员的工作效率.