1,273 articles found
Text Extraction and Enhancement of Binary Images Using Cellular Automata
1
Authors: G. Sahoo, Tapas Kumar, B. L. Raina, C. M. Bhatia. International Journal of Automation and Computing (EI), 2009, No. 3, pp. 254-260 (7 pages)
Text characters embedded in images represent a rich source of information for content-based indexing and retrieval applications. However, these characters are difficult to detect and recognize because of their varying sizes, grayscale values, and complex backgrounds. Existing methods do not handle text with differing contrast, or text embedded in a complex image background, well. In this paper, a set of sequential algorithms for text extraction and image enhancement using cellular automata is proposed. The image enhancement includes gray-level and contrast manipulation, edge detection, and filtering. First, the method applies edge detection and uses a threshold to filter out low-contrast text and to simplify the complex background of high-contrast text in the binary image. The proposed algorithm is simple and easy to use, and requires only a sample texture binary image as input. It generates textures with better perceived quality than earlier published techniques. The performance of the method is demonstrated with experimental results on a set of text-based binary images. The quality of thresholding is assessed by precision and recall analysis of the resulting text in the binary image.
Keywords: text extraction; edge detection; cellular automata algorithm; text detection; thresholding
An Efficient HW/SW Design for Text Extraction from Complex Color Image
2
Authors: Mohamed Amin Ben Atitallah, Rostom Kachouri, Ahmed Ben Atitallah, Hassene Mnif. Computers, Materials & Continua (SCIE, EI), 2022, No. 6, pp. 5963-5977 (15 pages)
In the context of constructing an embedded system to help visually impaired people interpret text, this paper proposes an efficient High-Level Synthesis (HLS) Hardware/Software (HW/SW) design for text extraction using the Gamma Correction Method (GCM). The GCM is a common method for extracting text from complex color images and video. The purpose of this work is to study the complexity of the GCM on the Xilinx ZCU102 FPGA board and to propose a HW implementation, as an Intellectual Property (IP) block, of the critical blocks of the method using the HLS flow while taking the quality of the text extraction into account. This IP is integrated and connected to the ARM Cortex-A53 as a coprocessor in a HW/SW codesign context. The experimental results show that the HLS HW/SW implementation of the GCM on the ZCU102 FPGA board reduces processing time by about 89% compared with the SW implementation. This result is obtained with the same potency and strength of text extraction as the SW implementation.
Keywords: text extraction; GCM; HW/SW codesign; FPGA; HLS flow
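Entry 2 builds on gamma correction. As a minimal sketch of the underlying power-law transform only, assuming 8-bit gray levels, one might write the following. It is not the paper's GCM pipeline or its HLS design, and the function name `gamma_correct` is hypothetical.

```python
def gamma_correct(pixels, gamma):
    """Apply out = 255 * (in / 255) ** gamma to a list of 8-bit gray levels."""
    return [round(255 * (p / 255) ** gamma) for p in pixels]


# gamma < 1 brightens mid-tones, gamma > 1 darkens them;
# the end points 0 and 255 are unchanged for any positive gamma.
print(gamma_correct([0, 64, 128, 255], 0.5))
```

A text extraction method built on this transform would still need its own rule for choosing gamma and for binarizing the corrected image; neither is shown here.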
Text extraction method for historical Tibetan document images based on block projections (cited by: 3)
3
Authors: 段立娟, 张西群, 马龙龙, 吴健. Optoelectronics Letters (EI), 2017, No. 6, pp. 457-461 (5 pages)
Text extraction is an important initial step in digitizing historical documents. In this paper, we present a text extraction method for historical Tibetan document images based on block projections. The task of text extraction is treated as a text area detection and location problem. The images are divided equally into blocks, and the blocks are filtered using the categories of connected components and the corner point density. By analyzing the projections of the filtered blocks, the approximate text areas can be located and the text regions extracted. Experiments on a dataset of historical Tibetan documents demonstrate the effectiveness of the proposed method.
Keywords: historical Tibetan document filtered blocks bounding corner approximate projection coordinate
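The projection step mentioned in entry 3 can be illustrated with a horizontal projection profile. The sketch below assumes a binary image given as a list of rows of 0/1 values and omits the block division and the connected-component and corner-density filtering described in the abstract. The names `row_projection` and `text_bands` are hypothetical.

```python
def row_projection(image):
    """Count foreground (1) pixels in each row of a binary image."""
    return [sum(row) for row in image]


def text_bands(image, min_pixels=1):
    """Return (first_row, last_row) ranges whose projection reaches min_pixels."""
    bands, start = [], None
    for y, count in enumerate(row_projection(image)):
        if count >= min_pixels and start is None:
            start = y                      # a band of text rows begins
        elif count < min_pixels and start is not None:
            bands.append((start, y - 1))   # the band ended on the previous row
            start = None
    if start is not None:                  # band runs to the bottom edge
        bands.append((start, len(image) - 1))
    return bands


page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
]
print(text_bands(page))  # [(1, 2), (4, 4)]
```

Applying the same function to the transposed image gives column ranges, so the two profiles together bound an approximate text rectangle.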
Efficient Text Extraction Algorithm Using Color Clustering for Language Translation in Mobile Phone (cited by: 2)
4
Authors: Adrián Canedo-Rodríguez, Jung Hyoun Kim, Soo-Hyung Kim, John Kelly, Jung Hee Kim, Sun Yi, Sai Kiran Veeramachaneni, Yolanda Blanco-Fernández. Journal of Signal and Information Processing, 2012, No. 2, pp. 228-237 (10 pages)
Many text extraction methodologies have been proposed, but none is suitable as part of a real system on a device with low computational resources, either because accuracy is insufficient or because performance is too slow. To this end, we propose a text extraction algorithm for language translation of scene text images on mobile phones that is both fast and accurate. The algorithm uses very efficient computations to calculate the Principal Color Components of a previously quantized image and decides which are the main foreground and background colors, after which it extracts the text in the image. We compared our algorithm with other algorithms using commercial OCR, achieving accuracy rates more than 12% higher while running twice as fast. Our methodology is also more robust against common degradations such as uneven illumination and blurring. The result is an attractive system for accurately separating foreground and background in scene text images on devices with low computational resources.
Keywords: text extraction; color quantization; text binarization; language translation
Drug and Vaccine Extractive Text Summarization Insights Using Fine-Tuned Transformers
5
Authors: Rajesh Bandaru, Y. Radhika. Journal of Artificial Intelligence and Technology, 2024, No. 4, pp. 351-362 (12 pages)
Text representation is a key factor in the success of text summarization techniques. Summarization with pretrained transformer models has produced encouraging results, yet the application of these models to medicine and drug discovery has not been examined adequately. To address this, the article performs extractive summarization based on fine-tuned transformers for the drug and medical domain, and also aims to enhance sentence representation. Exploring extractive text summarization for medicine and drug discovery is challenging because datasets are limited. The research therefore works with abstracts collected from PubMed for various medical and drug discovery domains, such as drug and COVID, totalling 1,370 abstracts. Detailed experiments with BART (Bidirectional and Auto-Regressive Transformers), T5 (Text-to-Text Transfer Transformer), LexRank, and TextRank are carried out on this dataset to perform extractive text summarization.
Keywords: BART; BERT; extractive text summarization; LexRank; TextRank
A Hybrid Query-Based Extractive Text Summarization Based on K-Means and Latent Dirichlet Allocation Techniques
6
Authors: Sohail Muhammad, Muzammil Khan, Sarwar Shah Khan. Journal on Artificial Intelligence, 2024, No. 1, pp. 193-209 (17 pages)
Retrieving information from an evolving digital data collection in response to a user's query is essential and requires efficient retrieval mechanisms that reduce the time needed to search such massive collections. Scanning and analyzing every document to find the most relevant textual item is time-consuming at scale, so querying the collection calls for a sophisticated technique, and retrieval that is both accurate and fast on a large collection remains challenging. Text summarization is a dominant research field in information retrieval and text processing for locating the most appropriate data object, as single or multiple documents, in a collection. Machine learning and knowledge-based techniques are the two families of query-based extractive text summarization in Natural Language Processing (NLP); they can be used for precise retrieval and are considered the best option. NLP uses machine learning approaches, both supervised and unsupervised, to calculate probabilistic features. This study proposes a hybrid approach to query-based extractive text summarization, with the TextRank algorithm as the core algorithm of the implementation flow. Query-based summarization of multiple documents with the hybrid approach, which combines K-Means clustering with Latent Dirichlet Allocation (LDA) as the topic modeling technique, yields precision, recall, and F-score of 0.288, 0.631, and 0.328, respectively. The results show that the proposed hybrid approach outperforms the independent graph-based approach and the approach based on sentence and word frequency.
Keywords: extractive text summarization; machine learning; natural language processing; K-Means; latent Dirichlet allocation
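A Keyword Extraction Method for the Chinese Text of Scientific Research Papers Based on Semantic Features and the TextRank Algorithm
7
Authors: 张世超, 王建宾, 孟浩. 《华南地震》, 2025, No. 3, pp. 188-194 (7 pages)
To extract keywords from the Chinese text of scientific research papers accurately and rank them correctly, this study investigates a keyword extraction method based on semantic features and the TextRank algorithm. In the candidate-keyword screening stage, which is based on semantic features, the Word2Vec tool converts the Chinese text into word vectors that serve as the semantic features of the paper text. These features are fed into a convolutional neural network, which uses classification to pick out the semantic features belonging to the candidate-keyword class; the words they correspond to become the candidate keywords. In the TextRank-based extraction stage, three features of each candidate keyword (average information entropy, part of speech, and position) serve as extraction indicators for building a graph model. A combined weight is computed for each candidate, the candidates are sorted in descending order of weight, and the top-ranked ones are taken as the final keywords. Tests show that the method improves both the precision of keyword extraction and the accuracy of keyword ranking.
Keywords: semantic features; TextRank algorithm; scientific research papers; Chinese text; keyword extraction; convolutional neural network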
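A Keyword Extraction Method for Scientific and Technical Text Based on Improved TextRank (cited by: 6)
8
Authors: 杨冬菊, 胡成富. 《计算机应用》 (CSCD, PKU Core), 2024, No. 6, pp. 1720-1726 (7 pages)
Keyword extraction from scientific and technical text performs poorly on words that occur rarely but express the main idea of the text well. To address this, a keyword extraction method based on improved TextRank is proposed. First, the term frequency-inverse document frequency (TF-IDF) statistical feature and the position feature of words are used to optimize the probability transition matrix between words in the co-occurrence graph, and initial word scores are obtained by iterative computation. Then, the K-Core (K-Core decomposition) algorithm is used to mine K-Core subgraphs and obtain the hierarchical feature of words, and the average information entropy feature is used to measure how well a word represents the topic. Finally, the hierarchical feature and the average information entropy feature are fused with the initial scores to determine the keywords. Experimental results show that, on public datasets, the proposed method raises the mean F1 by 6.5 and 3.3 percentage points over TextRank and OTextRank (Optimized TextRank), respectively, across experiments extracting different numbers of keywords; on a science and technology service project dataset, the gains are 7.4 and 3.2 percentage points, respectively. The results confirm that the method is effective at extracting low-frequency keywords that express the main idea of a text well.
Keywords: scientific and technical text; keyword extraction; TextRank; K-Core graph; average information entropy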
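Entries 7 and 8 both start from TextRank scores on a word co-occurrence graph. The sketch below shows only the plain baseline, assuming an undirected, unweighted graph and the commonly used damping factor 0.85. The TF-IDF, position, K-Core, and entropy refinements described in the abstracts are not implemented, and the function name `textrank` and its parameters are hypothetical.

```python
from collections import defaultdict


def textrank(tokens, window=2, damping=0.85, iterations=50):
    """Score words with plain TextRank over a co-occurrence graph.

    Two words are linked when they appear within `window` tokens of each other.
    """
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        # Each word receives an equal share of every neighbor's current score.
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    return scores


tokens = ["keyword", "extraction", "graph", "keyword", "ranking", "keyword", "score"]
scores = textrank(tokens)
print(sorted(scores, key=scores.get, reverse=True)[:3])
```

An improved variant of the kind the abstracts describe would replace the equal shares with weighted transition probabilities; that weighting is left out here because the abstracts do not give the formulas.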
A New Method to Extract Text from Natural Scenes
9
Authors: 郝峻晟, 戚飞虎, 朱凯华, 蒋人杰. Journal of Donghua University (English Edition) (EI, CAS), 2005, No. 4, pp. 52-57 (6 pages)
This paper presents a new method for text detection, location, and binarization in natural scenes. Several morphological steps are used to detect the general position of the text, including English, Chinese, and Japanese characters. Next, bounding boxes are processed by a new "Expand, Break and Merge" (EBM) method to obtain the precise text areas. Finally, the text is binarized by a hybrid method based on Otsu and Niblack. This approach can extract different kinds of text from complicated natural scenes. It is insensitive to noise, distortion, and text orientation, and it also performs well when extracting text of various sizes.
Keywords: text extraction; mathematical morphology; bounding boxes; binarization
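Entry 9 binarizes text with a hybrid of Otsu and Niblack thresholding. As a rough illustration of the global Otsu half only, assuming 8-bit gray levels in a flat list, a sketch might look like this. The local Niblack step and the way the paper combines the two are not shown, and the function name `otsu_threshold` is hypothetical.

```python
def otsu_threshold(pixels):
    """Return the gray level t that maximizes between-class variance,
    with pixels <= t in one class and pixels > t in the other."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the low class so far
    sum0 = 0  # sum of gray levels in the low class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance is undefined
        mean0 = sum0 / w0
        mean1 = (total_sum - sum0) / w1
        # Proportional to the between-class variance (constant factor dropped).
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


gray = [10, 12, 11, 200, 205, 198]
t = otsu_threshold(gray)
print(t, [1 if p > t else 0 for p in gray])  # 12 [0, 0, 0, 1, 1, 1]
```

Because ties keep the first maximum, the returned level sits at the top of the dark cluster; any level between the two clusters would separate this toy input equally well.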
Smart Approaches to Efficient Text Mining for Categorizing Sexual Reproductive Health Short Messages into Key Themes
10
Authors: Tobias Makai, Mayumbo Nyirenda. Open Journal of Applied Sciences, 2024, No. 2, pp. 511-532 (22 pages)
To promote behavioral change among adolescents in Zambia, the National HIV/AIDS/STI/TB Council, in collaboration with UNICEF, developed the Zambia U-Report platform. The platform gives young people improved access to information on various Sexual Reproductive Health topics through Short Messaging Service (SMS) messages. Over the years it has accumulated millions of incoming and outgoing messages, which need to be categorized into key thematic areas to better track sexual reproductive health knowledge gaps among young people. The current manual categorization of these text messages is inefficient and time-consuming, and this study aims to automate the process for improved analysis using text-mining techniques. First, the study investigates the current categorization process and identifies a list of categories adopted by counselors over time, which are then used to build and train a categorization model. Second, it presents a proof-of-concept tool that automates the categorization of U-Report messages into key thematic areas using the developed model. Finally, it compares the performance and effectiveness of the tool against the manual system. The study used a dataset of 206,625 text messages. The current process would take roughly 2.82 years to categorize this dataset, whereas the trained SVM model requires only 6.4 minutes and achieves an accuracy of 70.4%, demonstrating that the automated method is significantly faster, more scalable, and more consistent than manual categorization. These advantages make the SVM model a more efficient and effective tool for categorizing large unstructured text datasets. The results and the proof-of-concept tool demonstrate the potential for enhancing the efficiency and accuracy of message categorization on the Zambia U-Report platform and on similar text-message-based platforms.
Keywords: Knowledge Discovery in Text (KDT); Sexual Reproductive Health (SRH); Text Categorization; Text Classification; Text Extraction; Text Mining; Feature Extraction; Automated Classification; Process Performance; Stemming and Lemmatization; Natural Language Processing (NLP)
A Hybrid Method of Extractive Text Summarization Based on Deep Learning and Graph Ranking Algorithms (cited by: 1)
11
Authors: SHI Hui, WANG Tiexin. Transactions of Nanjing University of Aeronautics and Astronautics (EI, CSCD), 2022, No. S01, pp. 158-165 (8 pages)
In the era of Big Data, we face the inevitable and challenging problem of "information overload". To alleviate this problem, it is important to use effective automatic text summarization techniques to obtain the key information quickly and efficiently from huge amounts of text. In this paper, we propose a hybrid method of extractive text summarization based on deep learning and graph ranking algorithms (ETSDG). In this method, a pre-trained deep learning model is designed to yield useful sentence embeddings. Given the association between sentences in raw documents, a traditional LexRank algorithm with fine-tuning is adopted in ETSDG. To improve the performance of the extractive text summarization method, we further integrate the traditional LexRank algorithm with deep learning. Test results on the DUC2004 data set show that ETSDG performs better on ROUGE metrics than certain benchmark methods.
Keywords: extractive text summarization; deep learning; sentence embeddings; LexRank
A Method of Text Extremum Region Extraction Based on Joint-Channels (cited by: 1)
12
Authors: Xueming Qiao, Weiyi Zhu, Dongjie Zhu, Liang Kong, Yingxue Xia, Chunxu Lin, Zhenhao Guo, Yiheng Sun. Journal on Artificial Intelligence, 2020, No. 1, pp. 29-37 (9 pages)
Natural scene recognition is important in image retrieval, autonomous navigation, human-computer interaction, and industrial automation. First, non-text content takes up a relatively high proportion of a natural scene image; second, natural scene images have cluttered backgrounds and complex lighting conditions, angles, fonts, and colors. Efficiently extracting text extremal regions from complex and varied natural scene images therefore plays an important role in natural scene text recognition. This paper proposes a Text extremum region Extraction algorithm based on Joint-Channels (TEJC). On the one hand, it addresses the problem that the maximally stable extremal region (MSER) algorithm suits only grayscale images and has difficulty processing color images. On the other hand, it addresses the high complexity and low accuracy of the MSER algorithm when extracting the most stable extremal regions. The proposed algorithm is tested and evaluated on the ICDAR data set, and the experimental results show the advantage of the method.
Keywords: feature extraction; scene text detection; scene text feature extraction; extreme region
A Deep Look into Extractive Text Summarization
13
Authors: Jhonathan Quillo-Espino, Rosa María Romero-González, Ana-Marcela Herrera-Navarro. Journal of Computer and Communications, 2021, No. 6, pp. 24-37 (14 pages)
This investigation presents an approach to Extractive Automatic Text Summarization (EATS). A framework for summarizing a single document has been developed, using the TF-IDF (Term Frequency-Inverse Document Frequency) method as a reference: the document is divided into a subset of documents, a value is generated for each word contained in each of them, and those documents whose TF-IDF is equal to or higher than the threshold are taken as the most important; they can therefore be weighted to generate a text summary according to the user's request. This document presents a model derived from text mining applications in today's world. We demonstrate how the summarization is performed. Random values were used to check its performance. The experimental results show satisfactory and understandable summaries that were produced efficiently and quickly, identifying the most important sentences according to the threshold selected by the user.
Keywords: Text Mining; Preprocesses; Text Summarization; Extractive Text Summarization
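The thresholding idea in entry 13 can be sketched as follows, under several assumptions that the abstract does not confirm: each sentence is treated as one small document, a sentence score is the mean TF-IDF of its distinct words, IDF is the natural log of (number of sentences / sentences containing the word), and tokens are split on whitespace. The names `tfidf_sentence_scores` and `summarize` are hypothetical.

```python
import math
from collections import Counter


def tfidf_sentence_scores(sentences):
    """Treat each sentence as a small document and return its mean TF-IDF."""
    docs = [s.lower().split() for s in sentences]
    n = len(docs)
    # Document frequency: in how many sentences each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        weights = [(tf[w] / len(doc)) * math.log(n / df[w]) for w in tf]
        scores.append(sum(weights) / len(weights) if weights else 0.0)
    return scores


def summarize(sentences, threshold):
    """Keep, in original order, the sentences scoring at or above threshold."""
    scores = tfidf_sentence_scores(sentences)
    return [s for s, score in zip(sentences, scores) if score >= threshold]


sentences = ["the cat sat", "the dog sat", "the quantum sensor failed"]
print(summarize(sentences, 0.2))  # ['the quantum sensor failed']
```

Words shared by every sentence get an IDF of zero, so sentences made mostly of common words fall below the threshold first.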
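Research on Automatic Vulnerability Classification Based on the BiGRU TextCNN Framework
14
Authors: 张浩, 何东昊. 《信息安全研究》 (CSCD, PKU Core), 2024, No. 5, pp. 446-452 (7 pages)
Common Vulnerabilities and Exposures (CVE) information records known vulnerabilities and provides standardized semantic descriptions, and classifying vulnerabilities with Common Weakness Enumeration (CWE) information gives vulnerability mining richer background knowledge and more detailed preventive measures. However, because manual classification is uncertain and the informational parameters of vulnerabilities themselves change, classification accuracy in practice urgently needs improvement. In addition, the large and growing number of new vulnerabilities poses a major challenge to the efficiency and accuracy of manual classification. To address this, a vulnerability classification method based on the BiGRU TextCNN model is proposed. It can process vulnerability information, train on it, and make predictions, classifying vulnerabilities automatically from the descriptive information they carry. To verify the applicability and feasibility of the method, different classification models are first compared, and the proposed framework is then used to predict classes from vulnerability descriptions. The results demonstrate the correctness of the method.
Keywords: vulnerability classification; text classification; condition extraction; deep learning; security alerts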
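An Extractive Automatic Text Summarization Method Based on Improved TextRank
15
Authors: 梁高鹏, 徐鲁强. 《计算机与数字工程》, 2024, No. 12, pp. 3643-3648 (6 pages)
The TextRank model performs relatively well among extractive automatic summarization methods, but there is still considerable room to improve steps such as the quality of the initial text and the calculation of node weight scores. A new adjustment method is proposed to address this. Taking into account the practical application setting of automatic summarization and the literary characteristics of how texts express ideas, a pre-ranking step is added to the text preprocessing stage to highlight the main idea of the text, reduce semantically repetitive content, and raise the quality of the input text. The similarity formula is adjusted, and in the final node score formula, factors such as word frequency, similarity to the title, and position among paragraphs are added to the weight coefficients in specific proportions to optimize the overall calculation. The experimental results show that the adjusted model scores higher than the original model in all respects and produces summaries of higher quality that are closer to human-written ones.
Keywords: extractive; automatic text summarization; TextRank; pre-ranking; weight score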
Performance Analysis of Optimized Content Extraction for Cyrillic Mongolian Learning Text Materials in the Database
16
Authors: Bat-Erdene Nyandag, Ru Li, G. Indruska. Journal of Computer and Communications, 2016, No. 10, pp. 79-89 (12 pages)
This paper develops and tests an optimized content extraction algorithm for learning documents in a distance learning system database, using NLP methods, TF-IDF for word weighting, the vector space model (VSM) for information search, and the cosine method for similarity calculation. The tests covered: 1) parsing word structure in the distance learning system database documents and in Cyrillic Mongolian language documents, and forming new documents with a word-stem identification algorithm; 2) testing optimized content extraction from text material based on e-test results (key word, correct answer, base form with affix, and new form produced from the word stem without affix) in the distance learning system, and searching for key words selected automatically by a word extraction algorithm; 3) testing Boolean and probabilistic retrieval methods through an extended vector space retrieval method. The work covers the document content extraction and retrieval algorithm and proposes query recommendations through word stems, independent of word position, based on the distinctive features of Cyrillic Mongolian language documents.
Keywords: Cyrillic Mongolian Language; Content Extraction; Formatting; Learning Text Materials; Style
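Entry 16 ranks documents in a vector space model by cosine similarity. The sketch below illustrates that ranking step with raw term counts as weights, which is a simplification: the paper uses TF-IDF weights and word stemming, neither of which is implemented here. The names `cosine_similarity` and `rank` are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return document indices ordered from most to least similar to the query."""
    q = Counter(query.lower().split())
    vectors = [Counter(d.lower().split()) for d in documents]
    sims = [cosine_similarity(q, v) for v in vectors]
    return sorted(range(len(documents)), key=lambda i: sims[i], reverse=True)


docs = ["distance learning system", "word stem", "learning word stem algorithm"]
print(rank("word stem", docs))  # [1, 2, 0]
```

Cosine similarity ignores vector length, so the short document that matches the query exactly outranks the longer one that merely contains it.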
Mathematical Expression Extraction in Text Fields of Documents Based on HMM
17
Authors: Xuedong Tian, Ruihan Bai, Fang Yang, Jinyuan Bai, Xinfu Li. Journal of Computer and Communications, 2017, No. 14, pp. 1-13 (13 pages)
Mathematical expressions in the unstructured text fields of documents are hard to extract automatically, rapidly, and effectively. To address this, a method based on the Hidden Markov Model (HMM) is proposed. First, the HMM is trained using the symbol combination features of mathematical expressions. Then, preprocessing such as removing labels and filtering words is carried out. Finally, the preprocessed text is converted into an observation sequence that is input to the HMM, which determines which parts are mathematical expressions and extracts them. The experimental results show that the proposed method can effectively extract mathematical expressions from the text fields of documents, with relatively high accuracy and recall.
Keywords: Mathematical Expression Extraction; Hidden Markov Model; Text Fields; Documents; Symbol Combination Features
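Text Extraction and Feature Distribution of Hidden Hazards in General Aviation Training
18
Authors: 何昕, 孙文霞, 宫献鑫, 王煜涵, 虞启洲. 《电子设计工程》, 2025, No. 18, pp. 44-50 (7 pages)
Against the background of steadily growing general aviation training operations, and to address the underuse of hazard data, unsupervised text extraction methods for different target features are proposed based on an analysis of hazard text data from general aviation training. A custom dictionary is built, and TF-IDF values are introduced to extract high-frequency content words from documents in different categories. A text graph model is constructed that fuses part-of-speech and word-position features to optimize the initial node weights, and the improved TextRank algorithm is used to extract behavior-phrase fields. Validation on real cases shows that the algorithms extract the target features effectively: the TF-IDF algorithm with the custom dictionary extracts high-frequency content words with higher precision than the graph-based TextRank algorithm, and the TextRank algorithm with fused word features raises the mean F1 for key behavior phrases by 0.407 over traditional TextRank. Presenting the extracted features in visualizations can provide a reference for risk control in flight training.
Keywords: general aviation training; text extraction; TF-IDF value; TextRank; feature distribution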
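Entity Relation Extraction from Chinese Medical Text Based on Large Language Models and Prompt Engineering
19
Authors: 段宇锋, 谢佳宏. 《数据分析与知识发现》 (PKU Core), 2025, No. 9, pp. 25-36 (12 pages)
[Objective] To study how existing large language models differ in extracting entity relations from Chinese medical text, and to analyze how the number of examples and the number of relation types affect extraction. [Methods] Using prompt engineering, nine mainstream large language models were called through their APIs; the prompt template was varied along two dimensions, the number of examples and the number of relation types, and extraction results were compared in experiments on the CMeIE-V2 dataset. [Results] (1) GLM-4-0520 ranked first in overall extraction ability, with F1 values of 0.4422, 0.3869, and 0.3874 for the relation types "clinical manifestation", "drug therapy", and "etiology", respectively. (2) As the number of examples m in the prompt changed, F1 initially rose with m, peaked at 0.4742 when m = 8, and declined for m > 8. (3) Increasing the number of relation types n to be extracted lowered F1 markedly: F1 at n = 2 was 0.1182 lower than at n = 1, and at n = 10 it was only 0.2949. [Limitations] Few public datasets are available, so the results rest on a single dataset; because large language models for the medical vertical domain are currently hard to call through APIs, all models used here come from the general domain. [Conclusions] Extraction quality varies widely across large models; a suitable number of examples improves extraction, but more examples are not always better; large models are not good at extracting several relation types at once.
Keywords: large language models; prompt engineering; entity relation extraction; Chinese medical text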
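Research on Keyword Extraction Algorithms for WeChat Conversation Text
20
Authors: 王宝会, 许卜仁, 李长傲, 叶子豪. 《计算机科学》 (PKU Core), 2025, No. S1, pp. 239-246 (8 pages)
WeChat groups contain large amounts of conversation text, and extracting keywords from it helps in understanding group dynamics and topic evolution. Because WeChat conversation text is short, interleaves topics, and uses non-standard language, traditional extraction methods perform poorly. A multi-stage keyword extraction algorithm based on conversation topic clustering is therefore proposed. First, a conversation topic clustering algorithm incorporating pre-training knowledge (Single Pass Using Thread Segmentation and Pre-training Knowledge, SP_TSPK) is proposed; it jointly considers semantic relevance, message activity, and user intimacy, and effectively resolves topic interleaving and insufficient information. Second, a Multi-Stage Keyword Extraction (MSKE) algorithm is designed that splits the task into unsupervised keyword extraction and supervised keyword generation, effectively obtaining keywords both present in and absent from the original text while reducing the candidate set size and semantic redundancy. Finally, the SP_TSPK and MSKE algorithms are combined to extract keywords from WeChat conversation text. On the WeChat dataset, compared with the AutoKeyGen algorithm, F1@5 and F1@O improve by 12.8% and 10.8% on average, and R@10 reaches 2.59 times that of AutoKeyGen on average. The experimental results show that the algorithm extracts keywords from WeChat conversation text effectively.
Keywords: text clustering; text generation; conversation topic clustering; keyword extraction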