Pre-trained language models can express rich syntactic and grammatical information in sentences and can model word polysemy, so they are widely used in natural language processing; the BERT (bidirectional encoder representations from transformers) pre-trained language model is one of them. Named entity recognition methods based on fine-tuning BERT suffer from too many trainable parameters and overly long training times. To address this problem, a Chinese named entity recognition method based on BERT-IDCNN-CRF (BERT-iterated dilated convolutional neural network-conditional random field) is proposed. The method obtains contextual character representations from the BERT pre-trained language model, then feeds the character vector sequence into an IDCNN-CRF model for training; during training, the BERT parameters are kept fixed and only the IDCNN-CRF part is trained, which preserves polysemy modeling while reducing the number of trainable parameters. Experiments show that the model achieves an F1 score of 94.41% on the MSRA corpus, outperforming the current best Lattice-LSTM model on the Chinese named entity task by 1.23%; compared with the BERT fine-tuning approach, its F1 score is slightly lower but training time is greatly shortened. Applying the model to sensitive entity recognition in fields such as information security and public opinion on the electromagnetic environment of power grids yields faster and more timely responses.
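The iterated dilated convolutions at the core of IDCNN can be illustrated with a minimal pure-Python sketch (a hypothetical example of the general technique, not the paper's implementation): a width-3 convolution with dilation d reads inputs at offsets -d, 0, and +d, so stacking layers with dilations 1, 2, 4 widens the receptive field exponentially while the parameter count stays fixed.

```python
def dilated_conv1d(seq, weights, dilation):
    """Apply a width-3 dilated convolution (zero padding) to a scalar sequence."""
    n = len(seq)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(weights):          # kernel taps at -d, 0, +d
            j = i + (k - 1) * dilation
            if 0 <= j < n:                       # zero padding outside bounds
                acc += w * seq[j]
        out.append(acc)
    return out

def idcnn_block(seq, dilations=(1, 2, 4)):
    """Stack dilated convolutions with growing dilation, as in an IDCNN block."""
    identity = [0.0, 1.0, 0.0]                   # pass-through kernel, for clarity
    for d in dilations:
        seq = dilated_conv1d(seq, identity, d)
    return seq
```

With the identity kernel the block simply copies its input; a trained model would learn real kernel weights per layer, and the same block would typically be iterated several times with shared parameters.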
Named entity recognition (NER) is an important part of knowledge extraction and one of the main tasks in constructing knowledge graphs. In today's Chinese named entity recognition (CNER) task, the BERT-BiLSTM-CRF model is widely used and often yields notable results. However, recognizing each entity with high accuracy remains challenging. Many entities do not appear as single words but as parts of complex phrases, making accurate recognition difficult with word embedding information alone, because the intricate lexical structure often degrades performance. To address this issue, we propose an improved Bidirectional Encoder Representations from Transformers (BERT) character-word conditional random field (CRF) (BCWC) model. It incorporates a word embedding model pre-trained with the skip-gram with negative sampling (SGNS) method alongside traditional BERT embeddings. By comparing datasets segmented with different word segmentation tools, we obtain enhanced word embedding features for the segmented data. These features are then processed with multi-scale convolutions and iterated dilated convolutional neural networks (IDCNNs) with varying dilation rates to capture features at multiple scales and extract diverse contextual information. Additionally, a multi-attention mechanism is employed to fuse word and character embeddings. Finally, CRFs are applied to learn sequence constraints and optimize entity label annotations. A series of experiments on three public datasets demonstrates that the proposed method outperforms recent advanced baselines. BCWC addresses the challenge of recognizing complex entities by combining character-level and word-level embedding information, thereby improving the accuracy of CNER. Such a model has potential for more precise knowledge extraction applications such as knowledge graph construction and information retrieval, particularly in domain-specific natural language processing tasks that require high entity recognition precision.
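The role of the CRF layer mentioned above can be sketched as follows (a hypothetical minimal example of standard CRF decoding, not trained weights from the paper): the CRF scores whole label sequences with per-position emission scores plus label-to-label transition scores, and Viterbi decoding picks the globally best sequence, which lets it suppress illegal tag patterns such as an I-tag following an O-tag.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence under emission + transition scores."""
    score = dict(emissions[0])                   # best score ending in each label
    back = []                                    # backpointers per position
    for t in range(1, len(emissions)):
        prev, score, choice = score, {}, {}
        for cur in labels:
            best_prev = max(labels, key=lambda p: prev[p] + transitions[(p, cur)])
            score[cur] = prev[best_prev] + transitions[(best_prev, cur)] + emissions[t][cur]
            choice[cur] = best_prev
        back.append(choice)
    last = max(labels, key=score.get)
    path = [last]
    for choice in reversed(back):                # follow backpointers
        path.append(choice[path[-1]])
    return path[::-1]

labels = ["B", "I", "O"]
transitions = {(p, c): 0.0 for p in labels for c in labels}
transitions[("O", "I")] = -10.0                  # illustrative constraint: I may not follow O
emissions = [{"B": 0.0, "I": 0.0, "O": 1.0},     # token 0 locally prefers O
             {"B": 0.5, "I": 1.0, "O": 0.0}]     # token 1 locally prefers I
decoded = viterbi(emissions, transitions, labels)  # ['O', 'B']: the illegal O→I is avoided
```

Greedy per-token decoding would output the illegal pair O, I here; the transition penalty steers the CRF to the legal sequence O, B instead.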
The classic BiLSTM-CRF (bi-directional long short-term memory-conditional random field) model for Chinese named entity recognition suffers from embedding vectors that cannot represent polysemous words, scattered attention when the encoding layer models the sequence, and a lack of local spatial feature capture. To address these problems, this paper proposes an ensemble model that combines the strengths of the BERT-BiGRU-MHA-CRF and BERT-IDCNN-CRF models for named entity recognition. The method uses a pruned BERT model to obtain semantic vectors containing contextual information, then feeds these vectors into a BiGRU-MHA (bi-directional gated recurrent unit-multi-head attention) network and an IDCNN (iterated dilated convolutional neural network) network. The former captures the temporal features of the input sequence and can assign weights according to character importance, while the latter mainly captures the spatial features of the input; the captured features are fused by average ensembling, and a CRF layer finally produces the globally optimal label sequence. On the People's Daily and Microsoft Research Asia (MSRA) datasets, the ensemble model achieves F1 scores of 96.09% and 95.01%, respectively, improvements of more than 0.74% and 0.55% over the individual models, validating the effectiveness of the proposed method.
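The average-ensembling step described above can be sketched as follows (a hypothetical minimal example under the assumption that each sub-model emits per-character label scores; names and score values are illustrative): the two branches' scores are averaged position-wise, and the fused scores would then be handed to the shared CRF layer for decoding.

```python
def average_ensemble(scores_a, scores_b):
    """Average two models' per-position label-score dictionaries, position-wise."""
    fused = []
    for pos_a, pos_b in zip(scores_a, scores_b):
        fused.append({label: (pos_a[label] + pos_b[label]) / 2.0 for label in pos_a})
    return fused

# Illustrative scores for one character from the two branches:
bigru_mha_scores = [{"B": 1.0, "O": 0.0}]
idcnn_scores = [{"B": 0.0, "O": 1.0}]
fused = average_ensemble(bigru_mha_scores, idcnn_scores)
```

Averaging logits rather than hard labels lets a confident branch outvote an uncertain one at each position, which is one common rationale for this kind of ensemble.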
Funding: supported by the International Research Center of Big Data for Sustainable Development Goals under Grant No. CBAS2022GSP05, the Open Fund of State Key Laboratory of Remote Sensing Science under Grant No. 6142A01210404, and the Hubei Key Laboratory of Intelligent Geo-Information Processing under Grant No. KLIGIP-2022-B03.