Funding: This work was funded by the Technology Research and Development Plan Program of China State Railway Group Co., Ltd. (No. Q2024T001) and the Foundation of China Academy of Railway Sciences Co., Ltd. (No. 2024YJ259).
Abstract: Purpose – This study aims to enhance the accuracy of key entity extraction from railway accident report texts and to address challenges such as complex domain-specific semantics, data sparsity and strong inter-sentence semantic dependencies. A robust entity extraction method tailored for accident texts is proposed. Design/methodology/approach – The method is implemented as a dual-branch multi-task mutual learning model named R-MLP, which jointly performs entity recognition and accident phase classification. The model leverages a shared BERT encoder to extract contextual features and incorporates a sentence span indexing module to align feature granularity. A cross-task mutual learning mechanism is also introduced to strengthen semantic representation. Findings – R-MLP effectively mitigates the impact of semantic complexity and data sparsity in domain entities and enhances the model's ability to capture inter-sentence semantic dependencies. Experimental results show that R-MLP achieves a maximum F1-score of 0.736 in extracting six types of key railway accident entities, significantly outperforming baseline models such as RoBERTa and MacBERT. Originality/value – This demonstrates the proposed method's superior generalization and accuracy in domain-specific entity extraction tasks, confirming its effectiveness and practical value.
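The abstract gives no implementation details, so the following is only a minimal PyTorch sketch of a dual-branch multi-task model in the spirit of R-MLP: a shared BERT encoder feeds a token-level entity recognition head and a sentence-level accident-phase classification head, with sentence spans mean-pooled to align feature granularity. Module names, label counts and the pooling scheme are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class DualBranchSketch(nn.Module):
    """Sketch of a shared-encoder, dual-branch multi-task model (hypothetical)."""
    def __init__(self, num_entity_labels=13,   # e.g. BIO tags for six entity types
                 num_phase_labels=4,           # placeholder count of accident phases
                 model_name="bert-base-chinese"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(model_name)      # shared BERT encoder
        hidden = self.encoder.config.hidden_size
        self.entity_head = nn.Linear(hidden, num_entity_labels)   # token-level NER branch
        self.phase_head = nn.Linear(hidden, num_phase_labels)     # sentence-level branch

    def forward(self, input_ids, attention_mask, sent_spans):
        # sent_spans: list of (start, end) token indices; batch size 1 assumed for brevity
        h = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        entity_logits = self.entity_head(h)                        # [1, T, num_entity_labels]
        # "sentence span indexing": pool each sentence span into a single vector
        sent_vecs = torch.stack([h[0, s:e].mean(dim=0) for s, e in sent_spans])
        phase_logits = self.phase_head(sent_vecs)                  # [num_sents, num_phase_labels]
        return entity_logits, phase_logits

# The two branches would be trained jointly so that each task regularizes the other's
# representation, which is the intuition behind the cross-task mutual learning mechanism.
```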
Funding: This work was supported by the Yunnan Provincial Major Science and Technology Special Plan Projects (Grant Nos. 202202AD080003, 202202AE090008, 202202AD080004, 202302AD080003), the National Natural Science Foundation of China (Grant Nos. U21B2027, 62266027, 62266028, 62266025) and the Yunnan Province Young and Middle-Aged Academic and Technical Leaders Reserve Talent Program (Grant No. 202305AC160063).
Abstract: Chinese named entity recognition (CNER) has received widespread attention as an important task in Chinese information extraction. Most previous research has studied flat CNER, overlapped CNER or discontinuous CNER in isolation. However, a unified CNER is often needed in real-world scenarios. Recent studies have shown that grid-tagging methods based on character-pair relationship classification hold great potential for achieving unified NER. Nevertheless, how to enrich Chinese character-pair grid representations and capture deeper dependencies between character pairs to improve entity recognition performance remains an unresolved challenge. In this study, we enhance the character-pair grid representation by incorporating both local and global information. Significantly, we introduce a new approach that treats the character-pair grid representation matrix as a specialized image, converting the classification of character-pair relationships into a pixel-level semantic segmentation task. We devise a U-shaped network to extract multi-scale and deeper semantic information from the grid image, allowing a more comprehensive understanding of the associative features between character pairs. This leads to improved accuracy in predicting their relationships and ultimately enhances entity recognition performance. We conducted experiments on two public CNER datasets in the biomedical domain, CMeEE-V2 and Diakg. The results demonstrate the effectiveness of our approach, which achieves F1-score improvements of 7.29 percentage points and 1.64 percentage points, respectively, over the current state-of-the-art (SOTA) models.
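As a rough illustration of treating the character-pair grid as an image, the sketch below runs a tiny U-shaped encoder-decoder over a grid tensor of shape [batch, channels, L, L] and predicts a relation class per cell, i.e. pixel-level semantic segmentation. Channel sizes, depth and the relation label set are assumptions; the paper's actual network is more elaborate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyGridUNet(nn.Module):
    """Minimal U-shaped network over an L x L character-pair grid (illustrative only)."""
    def __init__(self, in_ch=64, num_relations=3):   # e.g. NONE plus NNW/THW-style relations
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU())
        self.dec1 = nn.Sequential(nn.Conv2d(128 + 64, 64, 3, padding=1), nn.ReLU())
        self.out = nn.Conv2d(64, num_relations, 1)    # per-cell relation logits

    def forward(self, grid):                          # grid: [B, in_ch, L, L]
        e1 = self.enc1(grid)                          # full-resolution features
        e2 = self.enc2(F.max_pool2d(e1, 2))           # downsampled, multi-scale context
        up = F.interpolate(e2, size=e1.shape[-2:], mode="bilinear", align_corners=False)
        d1 = self.dec1(torch.cat([up, e1], dim=1))    # skip connection gives the U shape
        return self.out(d1)                           # [B, num_relations, L, L]

# Usage: grid cells would come from character-pair features (e.g. pairwise encoder states);
# an argmax over the relation dimension yields one label per character pair.
grid = torch.randn(1, 64, 32, 32)
print(TinyGridUNet()(grid).shape)                     # torch.Size([1, 3, 32, 32])
```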
Abstract: In recent years, large language models have shown strong understanding and generation abilities across a wide range of downstream tasks, but how to use them effectively to understand and analyze vulnerabilities remains a challenge. To this end, a key-information in-context learning method (KIICL) is proposed for vulnerability type classification. By providing in-context examples together with key information about each vulnerability, the method highlights the details in the vulnerability description, strengthening the large language model's understanding of the description and thereby improving classification. To obtain the key information, a conditional random field (CRF)-based key-information recognition method is adopted. Experimental results show that, on large language models, KIICL improves over the zero-shot setting by 6.6% and over a few-shot setting without key information by 2.2%, verifying the effectiveness of the KIICL method.
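The paper does not publish its prompt template, so the sketch below only illustrates the general shape of such a few-shot prompt: each in-context example pairs a vulnerability description with CRF-extracted key phrases and a label, and the query is formatted the same way. The extract_key_info argument stands in for the CRF tagger and is purely hypothetical.

```python
def build_kiicl_prompt(examples, query_desc, extract_key_info):
    """Assemble a few-shot prompt that injects key information into each example.

    examples: list of (description, label) pairs used as in-context demonstrations.
    extract_key_info: placeholder for a CRF-based key-phrase extractor (hypothetical).
    """
    parts = ["Classify the vulnerability type of each description."]
    for desc, label in examples:
        keys = ", ".join(extract_key_info(desc))
        parts.append(f"Description: {desc}\nKey information: {keys}\nType: {label}")
    keys = ", ".join(extract_key_info(query_desc))
    parts.append(f"Description: {query_desc}\nKey information: {keys}\nType:")
    return "\n\n".join(parts)

# Toy usage with a trivial stand-in extractor; a real system would call the CRF tagger.
demo = [("Buffer copy without checking size of input in libfoo", "CWE-120")]
print(build_kiicl_prompt(demo, "SQL query built from unsanitized user input",
                         lambda d: [w for w in d.split() if w.islower()][:3]))
```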
Abstract: Traditional archive information extraction relies mainly on manual work, which is not only time-consuming and labor-intensive but also error-prone, compromising the accuracy and reliability of the data. With the rapid development of natural language processing (NLP) technology, the efficiency of information extraction from hospital archives has improved significantly. This article discusses how NLP can be applied to improve the efficiency of hospital archive information extraction, focusing on key techniques such as text classification, named entity recognition and relation extraction. Text classification automatically categorizes archive documents and organizes the information effectively; named entity recognition identifies and extracts key information such as patient names, disease names and drugs; relation extraction reveals the relationships among these pieces of information and helps build a complete information network.
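The article names these techniques without giving code; purely as an illustration, the sketch below chains a text-classification and a token-classification pipeline from the Hugging Face transformers library over a single record. The model identifiers are placeholders and would have to be replaced with Chinese clinical models actually fine-tuned for the hospital's archives.

```python
from transformers import pipeline

# Placeholder model identifiers: substitute locally fine-tuned Chinese clinical models.
CLASSIFIER_MODEL = "path/to/archive-classifier"   # hypothetical document classifier
NER_MODEL = "path/to/clinical-ner"                # hypothetical clinical NER model

def process_record(text):
    """Classify an archive record and pull out patient/disease/drug entities."""
    classify = pipeline("text-classification", model=CLASSIFIER_MODEL)
    ner = pipeline("token-classification", model=NER_MODEL, aggregation_strategy="simple")
    category = classify(text)[0]["label"]          # e.g. admission note vs. discharge summary
    entities = [(e["entity_group"], e["word"]) for e in ner(text)]
    return {"category": category, "entities": entities}

# Relation extraction (e.g. patient-disease or disease-drug links) would run as a third
# step over the recognized entity pairs; it is omitted here for brevity.
```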
Abstract: Named Entity Recognition aims to identify and classify rigid designators in text, such as proper names, biological species and temporal expressions, into predefined categories. Interest in this field has been growing since the early 1990s, and Named Entity Recognition plays a vital role in many areas of natural language processing, including machine translation, information extraction and question answering. This paper presents Named Entity Recognition for Nepali text based on the Support Vector Machine (SVM), a machine learning approach to classification. A set of features is extracted from the training data, and the accuracy and efficiency of the SVM classifier are analyzed on three training sets of different sizes. The recognition system is tested on ten Nepali datasets. The strength of this work lies in its efficient feature extraction and comprehensive recognition techniques. The SVM-based recognizer is limited to a fixed set of features and relies on a small dictionary, which constrains its performance. Observation of the system's learning behavior shows that it learns well from a small training set and that its performance improves as the training size grows.
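The paper's exact feature set is not reproduced here; the sketch below shows the standard pattern for token-level SVM NER with scikit-learn under assumed window and affix features: a DictVectorizer turns per-token feature dicts into vectors and a LinearSVC assigns one entity tag per token.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

def token_features(tokens, i):
    """Simple window/affix features for token i (an assumed feature set, not the paper's)."""
    tok = tokens[i]
    return {
        "token": tok,
        "suffix2": tok[-2:],                        # crude morphological cue
        "prev": tokens[i - 1] if i > 0 else "<S>",
        "next": tokens[i + 1] if i < len(tokens) - 1 else "</S>",
        "is_digit": tok.isdigit(),
    }

# Toy training data: parallel lists of Nepali tokens and BIO-style entity tags.
sents = [["राम", "काठमाडौं", "गए"]]
tags = [["B-PER", "B-LOC", "O"]]

X = [token_features(s, i) for s in sents for i in range(len(s))]
y = [t for ts in tags for t in ts]

model = make_pipeline(DictVectorizer(sparse=True), LinearSVC())
model.fit(X, y)
print(model.predict([token_features(["सीता", "पोखरा", "गइन्"], 0)]))
```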
Abstract: When most span-based models split text into sequences of spans, they generate a large number of non-entity spans, causing data imbalance and high computational complexity. To address this, a joint extraction model for entity relationships based on span and boundary detection (SBDM) is proposed. SBDM first uses a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model to convert the text into word vectors and fuses syntactic dependency information obtained through graph convolution to form the text's feature representation. It then detects and marks entity boundaries using local information and sentence-level context, reducing the number of non-entity spans, and performs entity recognition on the span sequence formed by the boundary markers. Finally, local context information is fused into each span entity pair and a sigmoid function performs relation classification. Experiments show that SBDM reaches relation classification scores (S F1) of 52.86% on the SciERC (multi-task identification of entities, relations, and coreference for scientific knowledge graph construction) dataset and 74.47% on the CoNLL04 (the 2004 Conference on Natural Language Learning) dataset, achieving good results. Applied to relation classification, SBDM can advance research on span-based methods for relation extraction.
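To make the final step concrete, the sketch below shows one common way to score a candidate span pair with a sigmoid over fused span and local-context features. The dimensions and the way spans and context are pooled are assumptions and do not reproduce SBDM's exact architecture.

```python
import torch
import torch.nn as nn

class SpanPairRelationScorer(nn.Module):
    """Illustrative sigmoid relation classifier over a span pair (not SBDM's exact design)."""
    def __init__(self, hidden=768, num_relations=5):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(hidden * 3, hidden), nn.ReLU(),
            nn.Linear(hidden, num_relations),
        )

    def forward(self, token_states, head_span, tail_span):
        # token_states: [T, hidden] contextual vectors (e.g. BERT fused with graph-convolution
        # features); head_span and tail_span are (start, end) token index pairs.
        pool = lambda s: token_states[s[0]:s[1]].mean(dim=0)
        head, tail = pool(head_span), pool(tail_span)
        # "local context": tokens between the two spans, pooled (an assumed choice)
        lo, hi = min(head_span[1], tail_span[1]), max(head_span[0], tail_span[0])
        context = token_states[lo:hi].mean(dim=0) if hi > lo else torch.zeros_like(head)
        logits = self.ffn(torch.cat([head, context, tail]))
        return torch.sigmoid(logits)          # independent probability per relation type

# Usage: token_states = torch.randn(20, 768); scorer = SpanPairRelationScorer()
# probs = scorer(token_states, (2, 4), (10, 13))   # keep relations above a threshold
```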