Funding: funded by Advanced Research Project (30209040702).
Abstract: Named Entity Recognition (NER) is vital in natural language processing for the analysis of news texts, as it accurately identifies entities such as locations, persons, and organizations, which is crucial for applications like news summarization and event tracking. However, NER in the news domain faces challenges due to insufficient annotated data, complex entity structures, and strong context dependencies. To address these issues, we propose a new Chinese named entity recognition method that integrates transfer learning with word embeddings. Our approach leverages the ERNIE pre-trained model for transfer learning, obtaining general language representations, and incorporates the Soft-lexicon word embedding technique to handle varied entity structures. This dual strategy enhances the model's understanding of context and boosts its ability to process complex texts. Experimental results show that our method achieves an F1 score of 94.72% on a news dataset, surpassing baseline methods by 3%–4%, thereby confirming its effectiveness for Chinese named entity recognition in the news domain.
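To make the Soft-lexicon idea above concrete, the sketch below builds the B/M/E/S matched-word sets for each character of a sentence and concatenates their pooled embeddings into a per-character lexicon feature. The toy lexicon, embedding sizes, and mean pooling are illustrative assumptions rather than the paper's exact configuration, and the ERNIE character encoder is left out.

```python
# Minimal Soft-lexicon feature construction (hypothetical lexicon and embeddings,
# not the authors' code). For each character, lexicon words that cover it are
# grouped into B/M/E/S sets, mean-pooled, and concatenated; the result would be
# appended to the character vector produced by a pre-trained encoder such as ERNIE.
import torch

def soft_lexicon_features(sentence, lexicon, word_emb, dim):
    feats = []
    for i, _ in enumerate(sentence):
        groups = {"B": [], "M": [], "E": [], "S": []}
        for w, idx in lexicon.items():          # lexicon: word -> row index in word_emb
            for s in range(len(sentence) - len(w) + 1):
                if sentence[s:s + len(w)] != w or not (s <= i < s + len(w)):
                    continue
                if len(w) == 1:
                    groups["S"].append(word_emb[idx])
                elif i == s:
                    groups["B"].append(word_emb[idx])
                elif i == s + len(w) - 1:
                    groups["E"].append(word_emb[idx])
                else:
                    groups["M"].append(word_emb[idx])
        pooled = [torch.stack(g).mean(0) if g else torch.zeros(dim) for g in groups.values()]
        feats.append(torch.cat(pooled))          # 4 * dim soft-lexicon vector per character
    return torch.stack(feats)                    # (seq_len, 4 * dim)

# Toy usage with a two-word lexicon.
word_emb = torch.randn(2, 8)
lexicon = {"北京": 0, "大学": 1}
print(soft_lexicon_features("北京大学", lexicon, word_emb, 8).shape)  # torch.Size([4, 32])
```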
Funding: funded by the 5·5 Engineering Research & Innovation Team Project of Beijing Forestry University (No. BLRC2023C02).
Abstract: Named entity recognition (NER) in the musk deer domain is the extraction of specific types of entities from unstructured texts, constituting a fundamental component of the knowledge graph, Q&A system, and text summarization system of the musk deer domain. Due to limited annotated data, diverse entity types, and the ambiguity of Chinese word boundaries in musk deer domain NER, we present a novel NER model, CAELF-GP, which is based on lexical features enhanced by a cross-attention mechanism (CAELF). Specifically, we employ BERT as a character encoder and advocate the integration of external lexical information at the character representation layer. In the feature fusion module, instead of indiscriminately merging external dictionary information, we innovatively adopt a feature fusion method based on a cross-attention mechanism, which guides the model to focus on important lexical information by calculating the correlation between each character and its corresponding word sets. This module enhances the model's semantic representation ability and entity boundary recognition capability. Finally, we introduce the decoding module of GlobalPointer (GP) for entity type recognition, which is capable of identifying both nested and non-nested entities. Since there is currently no publicly available dataset for the musk deer domain, we built a named entity recognition dataset for this domain by collecting relevant literature and working under the guidance of domain experts. The dataset facilitates the training and validation of the model and provides a data foundation for subsequent related research. The model is evaluated on two public datasets and the musk deer domain dataset. The results show that it is superior to the baseline models, offering a promising technical avenue for the intelligent recognition of named entities in the musk deer domain.
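The feature fusion step described above can be pictured as a small cross-attention layer in which each character queries its matched word set. The sketch below is a minimal PyTorch version under assumed dimensions (BERT-sized character states, 200-dimensional word vectors); it is not the authors' implementation and omits the GlobalPointer decoder.

```python
# Illustrative cross-attention fusion between character representations and their
# candidate word sets: the character acts as the query, the matched lexicon words
# act as keys/values, and the weighted word summary is fused back into the character.
import torch
import torch.nn as nn

class CharWordCrossAttention(nn.Module):
    def __init__(self, char_dim, word_dim, hidden):
        super().__init__()
        self.q = nn.Linear(char_dim, hidden)
        self.k = nn.Linear(word_dim, hidden)
        self.v = nn.Linear(word_dim, hidden)
        self.out = nn.Linear(char_dim + hidden, char_dim)

    def forward(self, char_vec, word_vecs):
        # char_vec: (batch, seq, char_dim); word_vecs: (batch, seq, n_words, word_dim)
        q = self.q(char_vec).unsqueeze(2)                    # (B, L, 1, H)
        k, v = self.k(word_vecs), self.v(word_vecs)          # (B, L, N, H)
        scores = (q * k).sum(-1) / k.size(-1) ** 0.5         # correlation with each word
        attn = torch.softmax(scores, dim=-1).unsqueeze(-1)   # (B, L, N, 1)
        word_summary = (attn * v).sum(2)                     # (B, L, H)
        return self.out(torch.cat([char_vec, word_summary], dim=-1))

fusion = CharWordCrossAttention(char_dim=768, word_dim=200, hidden=128)
chars = torch.randn(2, 16, 768)       # e.g., BERT character states
words = torch.randn(2, 16, 4, 200)    # up to 4 matched words per character
print(fusion(chars, words).shape)     # torch.Size([2, 16, 768])
```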
Funding: supported in part by the National Science and Technology Major Project under Grant 2022ZD0116100, in part by the National Natural Science Foundation Key Project under Grant 62436006, in part by the National Natural Science Foundation Youth Fund under Grant 62406257, in part by the Xizang Autonomous Region Natural Science Foundation General Project under Grant XZ202401ZR0031, in part by the National Natural Science Foundation of China under Grant 62276055, in part by the Sichuan Science and Technology Program under Grant 23ZDYF0755, and in part by the Xizang University 'High-Level Talent Training Program' Project under Grant 2022-GSP-S098.
Abstract: Tibetan medical named entity recognition (Tibetan MNER) involves extracting specific types of medical entities from unstructured Tibetan medical texts. Tibetan MNER provides important data support for work related to Tibetan medicine. However, existing Tibetan MNER methods often struggle to comprehensively capture multi-level semantic information, failing to sufficiently extract multi-granularity features and effectively filter out irrelevant information, which ultimately impacts the accuracy of entity recognition. This paper proposes an improved embedding representation method called syllable-word-sentence embedding. By leveraging features at different granularities and using un-scaled dot-product attention to focus on key features during feature fusion, the syllable-word-sentence embedding is integrated into the transformer, enhancing the specificity and diversity of feature representations. The model leverages multi-level and multi-granularity semantic information, thereby improving the performance of Tibetan MNER. We evaluate our proposed model on datasets from various domains. The results indicate that the model effectively identifies three types of entities in the Tibetan news dataset we constructed, achieving an F1 score of 93.59%, an improvement of 1.24% over the vanilla FLAT. Additionally, results on the Tibetan medical dataset we developed show that it is effective in identifying five kinds of medical entities, with an F1 score of 71.39%, a 1.34% improvement over the vanilla FLAT.
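A minimal sketch of the un-scaled dot-product attention used to weight syllable-, word-, and sentence-level features before fusion is shown below; the tensor shapes are assumptions, and the surrounding FLAT/transformer layers are omitted.

```python
# Sketch of fusing syllable-, word-, and sentence-level features per token with
# un-scaled dot-product attention. Unlike standard scaled attention, the dot
# products are not divided by sqrt(d), which sharpens the focus on the most
# relevant granularity for each token.
import torch

def unscaled_fusion(query, granular_feats):
    # query: (seq, d) token states; granular_feats: (seq, 3, d) syllable/word/sentence
    scores = torch.einsum("sd,sgd->sg", query, granular_feats)   # no 1/sqrt(d) scaling
    weights = torch.softmax(scores, dim=-1)                      # (seq, 3)
    return torch.einsum("sg,sgd->sd", weights, granular_feats)   # fused (seq, d)

seq, d = 10, 64
fused = unscaled_fusion(torch.randn(seq, d), torch.randn(seq, 3, d))
print(fused.shape)  # torch.Size([10, 64])
```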
Funding: supported in part by the National Natural Science Foundation of China under Grants 61972424 and 62372479, in part by the High Value Intellectual Property Cultivation Project of Hubei Province, China, under Grant D2021002094, in part by JSPS KAKENHI under Grants JP16K00117 and JP19K20250, and in part by the Leading Initiative for Excellent Young Researchers (LEADER), MEXT, Japan, and the KDDI Foundation.
Abstract: Named data networking (NDN) is an idealized deployment of information-centric networking (ICN) that has attracted attention from scientists and scholars worldwide. A distributed in-network caching scheme can efficiently realize load balancing. However, such a ubiquitous caching approach may cause problems including duplicate caching and low data diversity, thus reducing the caching efficiency of NDN routers. To mitigate these caching problems and improve NDN caching efficiency, this paper proposes a hierarchical-based sequential caching (HSC) scheme. In this scheme, the NDN routers on the data transmission path are divided into various levels, and data with different request frequencies are cached at distinct router levels. The aim is to cache data with high request frequencies in the router closest to the content requester, so as to increase the response probability of nearby data, improve the data caching efficiency of named data networks, shorten the response time, and reduce cache redundancy. Simulation results show that this scheme can effectively improve the cache hit rate (CHR) and reduce the average request delay (ARD) and average route hops (ARH).
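The placement rule can be illustrated with a toy sketch: routers on the delivery path are ordered from the requester outward, and a content item's observed request frequency decides at which level it is cached. The thresholds and data structures below are invented for illustration and are not the simulated HSC parameters.

```python
# Toy model of the hierarchical placement idea: hotter content is cached at the
# router nearest the requester, while colder content stays deeper in the path.
from collections import Counter

request_counts = Counter()

def choose_cache_router(path, content_name, thresholds=(100, 10)):
    """path: routers ordered requester-side first; thresholds split hot/warm/cold."""
    request_counts[content_name] += 1
    freq = request_counts[content_name]
    if freq >= thresholds[0]:
        level = 0                       # hottest data: edge router next to the requester
    elif freq >= thresholds[1]:
        level = min(1, len(path) - 1)   # warm data: the next router level
    else:
        level = len(path) - 1           # cold data: stays near the producer
    return path[level]

path = ["edge-r1", "mid-r2", "core-r3"]
for _ in range(120):
    router = choose_cache_router(path, "/news/video1")
print(router)  # 'edge-r1' once the content has become hot
```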
Funding: funded by Research Project, grant number BHQ090003000X03.
Abstract: Multi-modal Named Entity Recognition (MNER) aims to better identify meaningful textual entities by integrating information from images. Previous work has focused on extracting visual semantics at a fine-grained level or obtaining entity-related external knowledge from knowledge bases or Large Language Models (LLMs). However, these approaches ignore the poor semantic correlation between the visual and textual modalities in MNER datasets and do not explore different multi-modal fusion approaches. In this paper, we present MMAVK, a multi-modal named entity recognition model with auxiliary visual knowledge and word-level fusion, which aims to leverage a Multi-modal Large Language Model (MLLM) as an implicit knowledge base. It also extracts vision-based auxiliary knowledge from the image for more accurate and effective recognition. Specifically, we propose vision-based auxiliary knowledge generation, which guides the MLLM to extract external knowledge exclusively derived from images to aid entity recognition by designing target-specific prompts, thus avoiding the redundant recognition and cognitive confusion caused by the simultaneous processing of image-text pairs. Furthermore, we employ a word-level multi-modal fusion mechanism to fuse the extracted external knowledge with each word embedding produced by the transformer-based encoder. Extensive experimental results demonstrate that MMAVK outperforms or equals the state-of-the-art methods on two classical MNER datasets, even when the large models employed have significantly fewer parameters than those of other baselines.
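The two mechanisms described above (a target-specific prompt that asks the MLLM for image-only knowledge, and a word-level fusion of that knowledge with each token embedding) might look roughly like the following sketch; the prompt wording, the `query_mllm` placeholder, and the gated-fusion form are assumptions rather than the paper's exact design.

```python
# Sketch under assumed interfaces: (1) a target-specific prompt asking a multi-modal
# LLM for image-only auxiliary knowledge, and (2) a word-level gate that fuses the
# encoded knowledge with every token embedding. `query_mllm` stands in for whatever
# MLLM API is actually used.
import torch
import torch.nn as nn

PROMPT = ("Describe only the entities visible in this image that could help "
          "recognize persons, locations, or organizations. Ignore the caption text.")

def build_auxiliary_knowledge(image, query_mllm):
    return query_mllm(image, PROMPT)   # returns a short knowledge string

class WordLevelFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, token_states, knowledge_vec):
        # token_states: (seq, dim); knowledge_vec: (dim,) pooled knowledge encoding
        k = knowledge_vec.expand_as(token_states)
        g = torch.sigmoid(self.gate(torch.cat([token_states, k], dim=-1)))
        return token_states + g * k    # each word decides how much knowledge to absorb

fusion = WordLevelFusion(dim=768)
print(fusion(torch.randn(12, 768), torch.randn(768)).shape)  # torch.Size([12, 768])
```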
Funding: supported by the Outstanding Youth Team Project of Central Universities (QNTD202308) and the Ant Group through the CCF-Ant Research Fund (CCF-AFSG 769498 RF20220214).
Abstract: Named Entity Recognition (NER) stands as a fundamental task within the field of biomedical text mining, aiming to extract specific types of entities such as genes, proteins, and diseases from complex biomedical texts and categorize them into predefined entity types. This process can provide basic support for the automatic construction of knowledge bases. In contrast to general texts, biomedical texts frequently contain numerous nested entities and local dependencies among these entities, presenting significant challenges to prevailing NER models. To address these issues, we propose a novel Chinese nested biomedical NER model based on RoBERTa and Global Pointer (RoBGP). Our model initially utilizes the RoBERTa-wwm-ext-large pretrained language model to dynamically generate word-level initial vectors. It then incorporates a Bidirectional Long Short-Term Memory network to capture bidirectional semantic information, effectively addressing the issue of long-distance dependencies. Furthermore, the Global Pointer model is employed to comprehensively recognize all nested entities in the text. We conduct extensive experiments on the Chinese medical dataset CMeEE, and the results demonstrate the superior performance of RoBGP over several baseline models. This research confirms the effectiveness of RoBGP in Chinese biomedical NER, providing reliable technical support for biomedical information extraction and knowledge base construction.
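The Global Pointer decoding idea, scoring every candidate span for every entity type so that nested entities fall out naturally, can be sketched as below; the projection sizes are illustrative, and the rotary position encoding and masking details of the standard GlobalPointer are omitted.

```python
# Minimal GlobalPointer-style span scorer: every span (i, j) receives a score per
# entity type, so any span whose score is positive can be emitted as an entity,
# including spans nested inside other entities.
import torch
import torch.nn as nn

class GlobalPointerHead(nn.Module):
    def __init__(self, hidden, n_types, head_dim=64):
        super().__init__()
        self.n_types, self.head_dim = n_types, head_dim
        self.proj = nn.Linear(hidden, n_types * head_dim * 2)

    def forward(self, states):                          # states: (B, L, hidden)
        B, L, _ = states.shape
        qk = self.proj(states).view(B, L, self.n_types, 2, self.head_dim)
        q, k = qk[..., 0, :], qk[..., 1, :]             # (B, L, T, D) each
        scores = torch.einsum("bmtd,bntd->btmn", q, k)  # (B, T, L, L): span start m, end n
        return scores / self.head_dim ** 0.5

head = GlobalPointerHead(hidden=768, n_types=5)
spans = head(torch.randn(2, 20, 768))                   # e.g., RoBERTa+BiLSTM states
print(spans.shape)  # torch.Size([2, 5, 20, 20])
```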
Funding: supported by the National Natural Science Foundation of China (Nos. 42488201, 42172137, 42050104, and 42050102), the National Key R&D Program of China (No. 2023YFF0804000), and the Sichuan Provincial Youth Science & Technology Innovative Research Group Fund (No. 2022JDTD0004).
Abstract: Geological reports are a significant accomplishment for geologists involved in geological investigations and scientific research, as they contain rich data and textual information. With the rapid development of science and technology, a large number of textual reports have accumulated in the field of geology. However, many non-hot topics and non-English-speaking regions are neglected in mainstream geoscience databases for geological information mining, making it more challenging for some researchers to extract necessary information from these texts. Natural Language Processing (NLP) has obvious advantages in processing large amounts of textual data. The objective of this paper is to identify geological named entities from Chinese geological texts using NLP techniques. We propose the RoBERTa-Prompt-Tuning-NER method, which leverages the concept of Prompt Learning and requires only a small amount of annotated data to train superior models for recognizing geological named entities in low-resource dataset configurations. The RoBERTa layer captures context-based information and longer-distance dependencies through dynamic word vectors. Finally, we conducted experiments on the constructed Geological Named Entity Recognition (GNER) dataset. Our experimental results show that the proposed model achieves the highest F1 score of 80.64% among the four baseline algorithms, demonstrating the reliability and robustness of using the model for named entity recognition of geological texts.
Funding: supported by the International Research Center of Big Data for Sustainable Development Goals under Grant No. CBAS2022GSP05, the Open Fund of the State Key Laboratory of Remote Sensing Science under Grant No. 6142A01210404, and the Hubei Key Laboratory of Intelligent Geo-Information Processing under Grant No. KLIGIP-2022-B03.
Abstract: Named entity recognition (NER) is an important part of knowledge extraction and one of the main tasks in constructing knowledge graphs. In today's Chinese named entity recognition (CNER) task, the BERT-BiLSTM-CRF model is widely used and often yields notable results. However, recognizing each entity with high accuracy remains challenging. Many entities do not appear as single words but as parts of complex phrases, making it difficult to achieve accurate recognition using word embedding information alone, because the intricate lexical structure often impacts performance. To address this issue, we propose an improved Bidirectional Encoder Representations from Transformers (BERT) character word conditional random field (CRF) (BCWC) model. It incorporates a pre-trained word embedding model using the skip-gram with negative sampling (SGNS) method, alongside traditional BERT embeddings. By comparing datasets with different word segmentation tools, we obtain enhanced word embedding features for the segmented data. These features are then processed using multi-scale convolution and iterated dilated convolutional neural networks (IDCNNs) with varying expansion rates to capture features at multiple scales and extract diverse contextual information. Additionally, a multi-attention mechanism is employed to fuse word and character embeddings. Finally, CRFs are applied to learn sequence constraints and optimize entity label annotations. A series of experiments conducted on three public datasets demonstrates that the proposed method outperforms recent advanced baselines. BCWC is capable of addressing the challenge of recognizing complex entities by combining character-level and word-level embedding information, thereby improving the accuracy of CNER. Such a model has potential in applications that require more precise knowledge extraction, such as knowledge graph construction and information retrieval, particularly in domain-specific natural language processing tasks demanding high entity recognition precision.
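A minimal sketch of the iterated dilated convolution component is shown below: a stack of 1-D convolutions whose dilation rates grow so the receptive field widens while the sequence length is preserved. The channel count and dilation schedule are assumptions, and the BERT/SGNS embeddings, multi-attention fusion, and CRF layers are not included.

```python
# Sketch of a multi-scale iterated dilated convolution (IDCNN) block over token
# features; padding is chosen so each layer keeps the sequence length unchanged.
import torch
import torch.nn as nn

class IDCNNBlock(nn.Module):
    def __init__(self, channels, dilations=(1, 1, 2), kernel_size=3):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size,
                      padding=d * (kernel_size - 1) // 2, dilation=d)
            for d in dilations)

    def forward(self, x):                 # x: (batch, channels, seq_len)
        for conv in self.convs:
            x = torch.relu(conv(x))
        return x

embeddings = torch.randn(2, 128, 50)      # e.g., fused character + word features
block = IDCNNBlock(channels=128)
print(block(embeddings).shape)            # torch.Size([2, 128, 50]), length preserved
```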
Funding: financially supported by the Natural Science Foundation of China (Grant No. 42301492), the National Key R&D Program of China (Grant Nos. 2022YFF0711600, 2022YFF0801201, 2022YFF0801200), the Major Special Project of Xinjiang (Grant No. 2022A03009-3), the Open Fund of the Key Laboratory of Urban Land Resources Monitoring and Simulation, Ministry of Natural Resources (Grant No. KF-2022-07014), the Opening Fund of the Key Laboratory of the Geological Survey and Evaluation of the Ministry of Education (Grant No. GLAB 2023ZR01), and the Fundamental Research Funds for the Central Universities.
Abstract: As important geological data, a geological report contains rich expert and geological knowledge, but the challenge facing current research into geological knowledge extraction and mining is how to achieve an accurate understanding of geological reports guided by domain knowledge. While generic named entity recognition models and tools can be utilized for processing geoscience reports and documents, their effectiveness is hampered by a dearth of domain-specific knowledge, which in turn leads to a pronounced decline in recognition accuracy. This study summarizes six types of typical geological entities, with reference to the ontological system of geological domains, and builds a high-quality corpus for the task of geological named entity recognition (GNER). In addition, GeoWoBERT-advBGP (Geological Word-based BERT-adversarial training-Bi-directional Long Short-Term Memory-Global Pointer) is proposed to address the issues of ambiguity, diversity, and nested entities for geological entities. The model first uses the fine-tuned word-granularity-based pre-training model GeoWoBERT (Geological Word-based BERT) and combines it with text features extracted by a BiLSTM (Bi-directional Long Short-Term Memory); an adversarial training algorithm is then applied to improve the robustness of the model and enhance its resistance to interference, and decoding is finally performed using a global association pointer algorithm. The experimental results show that the proposed model achieves high performance on the constructed dataset and is capable of mining rich geological information.
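The adversarial training step is commonly realized as an FGM-style perturbation of the embedding matrix; the sketch below shows that pattern under assumed names (`model.embedding`, epsilon) and is not the paper's exact algorithm.

```python
# FGM-style embedding-level adversarial training step (illustrative): the gradient
# of the loss w.r.t. the embedding matrix defines a small worst-case perturbation,
# the loss is recomputed on the perturbed embeddings, and both gradients are
# accumulated before the optimizer step.
import torch

def adversarial_step(model, batch, loss_fn, optimizer, epsilon=1.0):
    emb = model.embedding.weight                      # assumed embedding parameter
    loss = loss_fn(model(batch["x"]), batch["y"])
    loss.backward()                                   # clean gradients

    backup = emb.data.clone()
    norm = emb.grad.norm()
    if norm > 0:
        emb.data.add_(epsilon * emb.grad / norm)      # move embeddings toward the worst case
    adv_loss = loss_fn(model(batch["x"]), batch["y"])
    adv_loss.backward()                               # accumulate adversarial gradients
    emb.data = backup                                 # restore the original embeddings

    optimizer.step()
    optimizer.zero_grad()
    return loss.item(), adv_loss.item()
```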
Funding: supported by the Scientific and Technological Innovation 2030 Major Project of New Generation Artificial Intelligence (2020AAA0109300), the Science and Technology Commission of Shanghai Municipality (21DZ2203100), and the 2023 Anhui Province Key Research and Development Plan Project-Special Project of Science and Technology Cooperation (2023i11020002).
Abstract: The few-shot named entity recognition (NER) task aims to train a robust model in the source domain and transfer it to the target domain with very little annotated data. Currently, some approaches rely on the prototypical network for NER. However, these approaches often overlook the spatial relations in the span boundary matrix, because entity words tend to depend more on adjacent words. We propose using a multidimensional convolution module to address this limitation and capture short-distance spatial dependencies. Additionally, we utilize an improved prototypical network and assign different weights to different samples that belong to the same class, thereby enhancing the performance of the few-shot NER task. Further experimental analysis demonstrates that our approach significantly improves over baseline models across multiple datasets.
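One way to realize the per-sample weighting described above is sketched below for a single entity class: samples closer to the unweighted class mean receive larger weights when forming the prototype. The weighting function is an assumption used for illustration, not necessarily the scheme used in the paper.

```python
# Weighted prototype for one entity class, replacing the plain average of the
# vanilla prototypical network with a distance-based softmax weighting.
import torch

def weighted_prototype(support_embeddings):
    # support_embeddings: (n_samples, dim) span representations of the same class
    mean = support_embeddings.mean(dim=0, keepdim=True)
    dist = (support_embeddings - mean).norm(dim=-1)          # (n_samples,)
    weights = torch.softmax(-dist, dim=0).unsqueeze(-1)      # nearer samples weigh more
    return (weights * support_embeddings).sum(dim=0)         # (dim,) class prototype

support = torch.randn(5, 256)
proto = weighted_prototype(support)
print(proto.shape)  # torch.Size([256])
```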
Funding: This research was supported by the National Key Research and Development Program [2020YFB1006302].
Abstract: Named entity recognition (NER) is a fundamental task of information extraction (IE), and it has attracted considerable research attention in recent years. The abundant annotated English NER datasets have significantly promoted NER research in English. By contrast, much less effort has been made in Chinese NER research, especially in the scientific domain, due to the scarcity of Chinese NER datasets. To alleviate this problem, we present a Chinese scientific NER dataset, SciCN, which contains entity annotations of titles and abstracts derived from 3,500 scientific papers. We manually annotate a total of 62,059 entities, and these entities are classified into six types. Compared to English scientific NER datasets, SciCN has a larger scale and is more diverse, for it not only contains more paper abstracts but also draws these abstracts from more research fields. To investigate the properties of SciCN and provide baselines for future research, we adapt a number of previous state-of-the-art Chinese NER models to evaluate SciCN. Experimental results show that SciCN is more challenging than other Chinese NER datasets. In addition, previous studies have proven the effectiveness of using lexicons to enhance Chinese NER models. Motivated by this fact, we provide a scientific domain-specific lexicon. Validation results demonstrate that our lexicon delivers better performance gains than lexicons from other domains. We hope that the SciCN dataset and the lexicon will enable benchmarking of the NER task in the Chinese scientific domain and drive progress in future research. The dataset and lexicon are available at: https://github.com/yangjingla/SciCN.git.
Funding: supported by the Yunnan Provincial Major Science and Technology Special Plan Projects (Grant Nos. 202202AD080003, 202202AE090008, 202202AD080004, 202302AD080003), the National Natural Science Foundation of China (Grant Nos. U21B2027, 62266027, 62266028, 62266025), and the Yunnan Province Young and Middle-Aged Academic and Technical Leaders Reserve Talent Program (Grant No. 202305AC160063).
Abstract: Chinese named entity recognition (CNER) has received widespread attention as an important task of Chinese information extraction. Most previous research has focused on individually studying flat CNER, overlapped CNER, or discontinuous CNER. However, a unified CNER is often needed in real-world scenarios. Recent studies have shown that grid tagging-based methods built on character-pair relationship classification hold great potential for achieving unified NER. Nevertheless, how to enrich Chinese character-pair grid representations and capture deeper dependencies between character pairs to improve entity recognition performance remains an unresolved challenge. In this study, we enhance the character-pair grid representation by incorporating both local and global information. Significantly, we introduce a new approach that treats the character-pair grid representation matrix as a specialized image, converting the classification of character-pair relationships into a pixel-level semantic segmentation task. We devise a U-shaped network to extract multi-scale and deeper semantic information from the grid image, allowing for a more comprehensive understanding of the associative features between character pairs. This approach leads to improved accuracy in predicting their relationships, ultimately enhancing entity recognition performance. We conducted experiments on two public CNER datasets in the biomedical domain, namely CMeEE-V2 and Diakg. The results demonstrate the effectiveness of our approach, which achieves F1-score improvements of 7.29 and 1.64 percentage points over the current state-of-the-art (SOTA) models, respectively.
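Treating the character-pair grid as an image, a toy U-shaped network over it might look like the following; the channel sizes, depth, and single skip connection are illustrative assumptions, far shallower than a real segmentation backbone.

```python
# Toy U-shaped network over the character-pair grid: the L x L grid representation
# is downsampled and upsampled with a skip connection, and each "pixel" (character
# pair) is classified into a relation label.
import torch
import torch.nn as nn

class GridUNet(nn.Module):
    def __init__(self, in_ch, n_labels, base=32):
        super().__init__()
        self.down = nn.Sequential(nn.Conv2d(in_ch, base, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.mid = nn.Sequential(nn.Conv2d(base, base * 2, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.head = nn.Conv2d(base * 2, n_labels, 1)    # skip concat -> per-pair labels

    def forward(self, grid):                  # grid: (B, in_ch, L, L)
        d = self.down(grid)
        m = self.mid(self.pool(d))
        u = self.up(m)
        return self.head(torch.cat([u, d], dim=1))      # (B, n_labels, L, L)

grid = torch.randn(2, 64, 32, 32)             # character-pair grid representation
print(GridUNet(in_ch=64, n_labels=4)(grid).shape)  # torch.Size([2, 4, 32, 32])
```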
Abstract: Mathematical named entity recognition (MNER) is one of the fundamental tasks in the analysis of mathematical texts. To address the problems of local instability, fuzzy entity boundaries, and long-distance dependencies between entities that current neural networks face in the Chinese mathematical entity recognition task, we propose a series of optimization methods and construct an Adversarial Training and Bidirectional Long Short-Term Memory-Self-Attention Conditional Random Field (AT-BSAC) model. In our model, the mathematical text is vectorized by a word embedding technique, and small perturbations are added to the word vectors to generate adversarial samples, while local features are extracted by a Bi-directional Long Short-Term Memory (BiLSTM) network. A self-attention mechanism is incorporated to extract more dependency features between entities. The experimental results demonstrate that the AT-BSAC model achieves a precision (P) of 93.88%, a recall (R) of 93.84%, and an F1-score of 93.74%, the latter being 8.73% higher than the F1-score of the previous Bi-directional Long Short-Term Memory Conditional Random Field (BiLSTM-CRF) model, demonstrating the effectiveness of the proposed model in mathematical named entity recognition.
Funding: the High Tech Ship Project of the Ministry of Industry and Information Technology (No. [2019]331).
Abstract: Automatic extraction of key data from design specifications is an important means of assisting engineering design automation. Considering the characteristics of design specifications, namely diverse data types, small scale, insufficient character information content, and strong contextual relevance, a named entity recognition model integrating high-quality topics and an attention mechanism, Quality Topic-Char Embedding-BiLSTM-Attention-CRF, was proposed to automatically identify entities in design specifications. Based on the topic model, an improved algorithm for high-quality topic extraction was proposed first, and the high-quality topic information obtained was then added to the distributed representation of Chinese characters to better enrich character features. Next, the attention mechanism was used in parallel on top of the BiLSTM-CRF model to fully mine contextual semantic information. Finally, experiments were performed on a collected corpus of Chinese ship design specifications, and the model was compared with multiple sets of models. The results show that the F-score (harmonic mean of precision and recall) of the model is 80.24%. The model performs better than other models on design specifications and is expected to provide an automatic means for engineering design.
Abstract: Named Entity Recognition (NER) is crucial for extracting structured information from text. While traditional methods rely on rules, Conditional Random Fields (CRFs), or deep learning, the advent of large-scale Pre-trained Language Models (PLMs) offers new possibilities. PLMs excel at contextual learning, potentially simplifying many natural language processing tasks. However, their application to NER remains underexplored. This paper investigates leveraging the GPT-3 PLM for NER without fine-tuning. We propose a novel scheme that utilizes carefully crafted templates and context examples selected based on semantic similarity. Our experimental results demonstrate the feasibility of this approach, suggesting a promising direction for harnessing PLMs in NER.
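The scheme above has two moving parts: a template and the semantic-similarity selection of context examples. A minimal sketch of the prompt-construction side is given below; the template wording, the toy embedding function, and the example pool are placeholders, and the call to the PLM itself is left abstract.

```python
# Sketch of in-context NER prompt construction: context examples are ranked by
# cosine similarity to the query sentence, the top-k are written into a template,
# and the resulting prompt would be sent to the PLM for completion.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def build_ner_prompt(query, pool, embed, k=3):
    """pool: list of (sentence, labeled_output); embed: sentence -> np.ndarray."""
    q_vec = embed(query)
    ranked = sorted(pool, key=lambda ex: cosine(embed(ex[0]), q_vec), reverse=True)
    demos = "\n".join(f"Sentence: {s}\nEntities: {y}" for s, y in ranked[:k])
    return (f"Extract person, location and organization entities.\n"
            f"{demos}\nSentence: {query}\nEntities:")

# Usage with a toy embedding; a real run would use sentence embeddings and the PLM.
toy_embed = lambda s: np.array([len(s), s.count(" "), s.count(",")], dtype=float)
pool = [("Alice works at Google.", "Alice (PER); Google (ORG)"),
        ("Paris is in France.", "Paris (LOC); France (LOC)")]
print(build_ner_prompt("Bob visited Berlin.", pool, toy_embed, k=1))
```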
Funding: the National Social Science Foundation of China (No. 17BYY047).
Abstract: Electronic Medical Records (EMRs), with unstructured sentences and various conceptual expressions, provide rich information for medical information extraction. However, common Named Entity Recognition (NER) methods in Natural Language Processing (NLP) are not well suited to clinical NER in EMRs. This study aims at applying neural networks to clinical concept extraction. We integrate Bidirectional Long Short-Term Memory networks (Bi-LSTM) with a Conditional Random Fields (CRF) layer to detect three types of clinical named entities. The word representations fed into the neural networks are concatenations of character-based word embeddings and Continuous Bag of Words (CBOW) embeddings trained on both domain and non-domain corpora. We test our NER system on the i2b2/VA open datasets and compare its performance with six related works, achieving the best NER result with an F1 value of 0.8537. We also point out a few specific problems in clinical concept extraction that may give hints for deeper studies.
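The input representation described above, a character-based word embedding concatenated with a CBOW embedding, can be sketched as follows; the dimensions and character vocabulary are assumptions, and the downstream Bi-LSTM-CRF tagger is omitted.

```python
# Sketch of the word representation: each word vector is the concatenation of a
# pre-trained CBOW embedding and a character-based embedding produced by a small
# bidirectional LSTM over the word's characters.
import torch
import torch.nn as nn

class CharWordRepresentation(nn.Module):
    def __init__(self, n_chars, char_dim, char_hidden, cbow_dim):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.char_lstm = nn.LSTM(char_dim, char_hidden, bidirectional=True, batch_first=True)
        self.out_dim = cbow_dim + 2 * char_hidden

    def forward(self, char_ids, cbow_vec):
        # char_ids: (n_chars_in_word,); cbow_vec: (cbow_dim,) pre-trained embedding
        _, (h, _) = self.char_lstm(self.char_emb(char_ids).unsqueeze(0))
        char_vec = torch.cat([h[0, 0], h[1, 0]])       # final states of both directions
        return torch.cat([cbow_vec, char_vec])          # fed to the Bi-LSTM-CRF tagger

rep = CharWordRepresentation(n_chars=128, char_dim=25, char_hidden=30, cbow_dim=100)
word = rep(torch.tensor([3, 17, 9, 42]), torch.randn(100))
print(word.shape)  # torch.Size([160])
```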
Funding: funded by the Deanship of Scientific Research at Imam Mohammad Ibn Saud Islamic University through the Graduate Students Research Support Program.
Abstract: Named Entity Recognition (NER) is one of the fundamental tasks in Natural Language Processing (NLP), which aims to locate, extract, and classify named entities into predefined categories such as person, organization, and location. Most of the earlier research on identifying named entities relied on handcrafted features and very large knowledge resources, which is time-consuming and not adequate for resource-scarce languages such as Arabic. Recently, deep learning has achieved state-of-the-art performance on many NLP tasks, including NER, without requiring hand-crafted features. In addition, transfer learning has also proven its efficiency in several NLP tasks by exploiting pretrained language models that transfer knowledge learned from large-scale datasets to domain-specific tasks. Bidirectional Encoder Representations from Transformers (BERT) is a contextual language model that generates semantic vectors dynamically according to the context of the words. The BERT architecture relies on multi-head attention, which allows it to capture global dependencies between words. In this paper, we propose a deep learning-based model that fine-tunes BERT to recognize and classify Arabic named entities. The pre-trained BERT context embeddings were used as input features to a Bidirectional Gated Recurrent Unit (BGRU) and were fine-tuned using two annotated Arabic Named Entity Recognition (ANER) datasets. Experimental results demonstrate that the proposed model outperforms state-of-the-art ANER models, achieving 92.28% and 90.68% F-measure values on the ANERCorp dataset and the merged ANERCorp and AQMAR dataset, respectively.
Funding: the National Natural Science Foundation of China under grant 61501515.
Abstract: Owing to the continuous barrage of cyber threats, there is a massive amount of cyber threat intelligence. However, a great deal of cyber threat intelligence comes from textual sources. For the analysis of cyber threat intelligence, many security analysts rely on cumbersome and time-consuming manual efforts. A cybersecurity knowledge graph plays a significant role in the automatic analysis of cyber threat intelligence. As the foundation for constructing a cybersecurity knowledge graph, named entity recognition (NER) is required to identify critical threat-related elements from textual cyber threat intelligence. Recently, deep neural network-based models have attained very good results in NER. However, the performance of these models relies heavily on the amount of labeled data. Since labeled data in cybersecurity are scarce, in this paper we propose an adversarial active learning framework to effectively select informative samples for further annotation. In addition, leveraging the long short-term memory (LSTM) network and the bidirectional LSTM (BiLSTM) network, we propose a novel NER model by introducing a dynamic attention mechanism into a BiLSTM-LSTM encoder-decoder. With the selected informative samples annotated, the proposed NER model is retrained. As a result, the performance of the NER model is incrementally enhanced at low labeling cost. Experimental results show the effectiveness of the proposed method.
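The selection step of such an active-learning loop is sketched below using average token-level entropy as the informativeness score; the paper's framework instead scores informativeness with an adversarial discriminator, which is not reproduced here.

```python
# Minimal illustration of selecting the most informative unlabeled sentences for
# annotation: score each sentence by the mean entropy of its per-token tag
# distribution and return the indices of the most uncertain ones.
import torch

def select_for_annotation(model, unlabeled_batches, budget):
    scored = []
    with torch.no_grad():
        for idx, batch in enumerate(unlabeled_batches):
            probs = torch.softmax(model(batch), dim=-1)          # (seq_len, n_tags)
            entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()
            scored.append((entropy.item(), idx))
    scored.sort(reverse=True)                                    # most uncertain first
    return [idx for _, idx in scored[:budget]]

# Toy usage: a "model" that returns random tag logits for each token of a sentence.
toy_model = lambda batch: torch.randn(batch.shape[0], 9)
unlabeled = [torch.zeros(n, 1) for n in (12, 7, 30, 18)]
print(select_for_annotation(toy_model, unlabeled, budget=2))
```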
Funding: supported by the National Natural Science Foundation of China (No. 60302021).
Abstract: Named entity recognition is a fundamental task in biomedical data mining. In this letter, a named entity recognition system based on CRFs (Conditional Random Fields) for biomedical texts is presented. The system makes extensive use of a diverse set of features, including local features, full-text features, and external resource features. All features incorporated in the system are described in detail, and the impacts of different feature sets on system performance are evaluated. In order to improve system performance, post-processing modules are exploited to deal with abbreviation phenomena, cascaded named entities, and boundary identification errors. Evaluation of the system showed that feature selection has an important impact on system performance, and that the post-processing explored makes an important contribution to achieving better results.
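Feature-based CRF systems of this kind typically describe each token with string-valued local features; the sketch below shows a representative (assumed) subset of such templates, which could be fed to a CRF toolkit, and does not reproduce the paper's full feature set or post-processing modules.

```python
# Illustrative local feature extractor for CRF-based biomedical NER: orthographic
# shape, affixes, and neighboring tokens become features for each position.
def local_features(tokens, i):
    w = tokens[i]
    shape = "".join("A" if c.isupper() else "a" if c.islower()
                    else "0" if c.isdigit() else "-" for c in w)
    feats = {
        "word.lower": w.lower(),
        "word.shape": shape,                      # e.g., "IL-2" -> "AA-0"
        "prefix3": w[:3],
        "suffix3": w[-3:],
        "has_digit": any(c.isdigit() for c in w),
        "has_hyphen": "-" in w,
        "prev.word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next.word": tokens[i + 1].lower() if i + 1 < len(tokens) else "<EOS>",
    }
    return feats

sentence = ["IL-2", "gene", "expression", "requires", "NF-kappa", "B"]
print(local_features(sentence, 0)["word.shape"])  # AA-0
```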