Many algorithms have been implemented for the problem of document categorization. The majority work in this area was achieved for English text, while a very few approaches have been introduced for the Arabic text. The...Many algorithms have been implemented for the problem of document categorization. The majority work in this area was achieved for English text, while a very few approaches have been introduced for the Arabic text. The nature of Arabic text is different from that of the English text and the preprocessing of the Arabic text is more challenging. This is due to Arabic language is a highly inflectional and derivational language that makes document mining a hard and complex task. In this paper, we present an Automatic Arabic documents classification system based on kNN algorithm. Also, we develop an approach to solve keywords extraction and reduction problems by using Document Frequency (DF) threshold method. The results indicate that the ability of the kNN to deal with Arabic text outperforms the other existing systems. The proposed system reached 0.95 micro-recall scores with 850 Arabic texts in 6 different categories.展开更多
To efficiently retrieve relevant document from the rapid proliferation of large information collections, a novel immune algorithm for document query optimization is proposed. The essential ideal of the immune algorith...To efficiently retrieve relevant document from the rapid proliferation of large information collections, a novel immune algorithm for document query optimization is proposed. The essential ideal of the immune algorithm is that the crossover and mutation of operator are constructed according to its own characteristics of information retrieval. Immune operator is adopted to avoid degeneracy. Relevant documents retrieved are merged to a single document list according to rank formula. Experimental results show that the novel immune algorithm can lead to substantial improvements of relevant document retrieval effectiveness.展开更多
为提高应用文编写效率,提出一种融合大语言模型(large language model,LLM)与向量知识库(vector knowledge base)的应用文自动生成框架.根据目标应用场景,以人工编写的标准应用文为范本,构建结构化辅助生成文件,并建立相应类型应用文的...为提高应用文编写效率,提出一种融合大语言模型(large language model,LLM)与向量知识库(vector knowledge base)的应用文自动生成框架.根据目标应用场景,以人工编写的标准应用文为范本,构建结构化辅助生成文件,并建立相应类型应用文的向量知识库.利用目标类型应用文的章节标题和用户输入的关键信息在知识库中进行检索,匹配相关文段;设置提示词引导LLM,以召回的参考文段及用户输入的提示信息为参考,使用末级标题作为分割标志,分章节生成应用文文本;最终按规定格式整合全文并输出完整的目标应用文.以应急预案为例,在同一评价标准下使用ChatGPT-4Turbo进行评测,自动生成的应急预案高度趋近于人工编写的质量,二者的文档质量相似度达95.87%.所提方法能够在算力资源有限的情况下突破字数限制,生成符合基本标准的长篇幅应用文,可供人工参考或直接使用,极大提高了编写人员的工作效率.展开更多
随着网络与信息技术的快速发展,导致网络上产生了大量的电子文本,而文本间的相似度计算是文本处理的一种重要手段。对于大规模的文本集,通常采用向量空间模型(vector space model,VSM)进行文本表示,但是该方法面临着文本向量维度较高及...随着网络与信息技术的快速发展,导致网络上产生了大量的电子文本,而文本间的相似度计算是文本处理的一种重要手段。对于大规模的文本集,通常采用向量空间模型(vector space model,VSM)进行文本表示,但是该方法面临着文本向量维度较高及文本语义相似度难以度量的问题。提出一种改进的文本相似度计算方法,从大量的特征空间中选择出具有代表性的元数据特征向量元素,以降低向量空间的维度;构建领域概念树并设计基于领域概念树的文本相似度算法,对领域概念中广泛存在的同义词进行处理,以提高文本之间语义相似度度量的性能。实验结果表明:通过降维和概念相似度计算可提高文本相似度计算的性能。展开更多
文摘Many algorithms have been implemented for the problem of document categorization. The majority work in this area was achieved for English text, while a very few approaches have been introduced for the Arabic text. The nature of Arabic text is different from that of the English text and the preprocessing of the Arabic text is more challenging. This is due to Arabic language is a highly inflectional and derivational language that makes document mining a hard and complex task. In this paper, we present an Automatic Arabic documents classification system based on kNN algorithm. Also, we develop an approach to solve keywords extraction and reduction problems by using Document Frequency (DF) threshold method. The results indicate that the ability of the kNN to deal with Arabic text outperforms the other existing systems. The proposed system reached 0.95 micro-recall scores with 850 Arabic texts in 6 different categories.
基金TheNationalHigh TechDevelopment 863ProgramofChina (No .2 0 0 3AA1Z2 610 )
文摘To efficiently retrieve relevant document from the rapid proliferation of large information collections, a novel immune algorithm for document query optimization is proposed. The essential ideal of the immune algorithm is that the crossover and mutation of operator are constructed according to its own characteristics of information retrieval. Immune operator is adopted to avoid degeneracy. Relevant documents retrieved are merged to a single document list according to rank formula. Experimental results show that the novel immune algorithm can lead to substantial improvements of relevant document retrieval effectiveness.
文摘为提高应用文编写效率,提出一种融合大语言模型(large language model,LLM)与向量知识库(vector knowledge base)的应用文自动生成框架.根据目标应用场景,以人工编写的标准应用文为范本,构建结构化辅助生成文件,并建立相应类型应用文的向量知识库.利用目标类型应用文的章节标题和用户输入的关键信息在知识库中进行检索,匹配相关文段;设置提示词引导LLM,以召回的参考文段及用户输入的提示信息为参考,使用末级标题作为分割标志,分章节生成应用文文本;最终按规定格式整合全文并输出完整的目标应用文.以应急预案为例,在同一评价标准下使用ChatGPT-4Turbo进行评测,自动生成的应急预案高度趋近于人工编写的质量,二者的文档质量相似度达95.87%.所提方法能够在算力资源有限的情况下突破字数限制,生成符合基本标准的长篇幅应用文,可供人工参考或直接使用,极大提高了编写人员的工作效率.
文摘随着网络与信息技术的快速发展,导致网络上产生了大量的电子文本,而文本间的相似度计算是文本处理的一种重要手段。对于大规模的文本集,通常采用向量空间模型(vector space model,VSM)进行文本表示,但是该方法面临着文本向量维度较高及文本语义相似度难以度量的问题。提出一种改进的文本相似度计算方法,从大量的特征空间中选择出具有代表性的元数据特征向量元素,以降低向量空间的维度;构建领域概念树并设计基于领域概念树的文本相似度算法,对领域概念中广泛存在的同义词进行处理,以提高文本之间语义相似度度量的性能。实验结果表明:通过降维和概念相似度计算可提高文本相似度计算的性能。