Abstract
Named Entity Recognition (NER) for sensitive information is a key technology for privacy protection. However, datasets for NER in the sensitive information domain are scarce, and traditional techniques suffer from low accuracy and poor portability. To address these issues, first, a sensitive information NER dataset, SenResume, was constructed by crawling text corpora containing sensitive information from the Internet and annotating them manually. Second, a data augmentation model based on entity masking, Entity-based Masked Language Modeling (E-MLM), was proposed; it uses whole-word masking to generate new data samples and expands the dataset to increase data diversity. Third, a RoBERTa-ResBiLSTM-CRF model was proposed, which uses RoBERTa-WWM (Robustly optimized Bidirectional Encoder Representations from Transformers approach with Whole Word Masking) to extract contextual features and generate high-quality word vector encodings, and uses a Residual Bidirectional Long Short-Term Memory (ResBiLSTM) network to enhance text features. Finally, a multi-layer residual network was applied to improve training efficiency and model stability, and a Conditional Random Field (CRF) was used for global decoding to improve sequence labeling accuracy. Experimental results show that E-MLM significantly improves dataset quality, and that the proposed NER model achieves the best performance on both the original and the 1x augmented datasets, with F1 scores of 96.16% and 97.84%, respectively. The introduction of E-MLM and residual networks therefore helps improve the accuracy of sensitive information NER.
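The entity-masking step behind E-MLM can be illustrated with a minimal sketch. The abstract does not give the actual procedure, so everything below is an assumption: the BIO tagging scheme, the label names `NAME` and `ORG`, the helper names `entity_spans` and `mask_entities`, and the choice to mask every entity in full. The step in which a masked language model fills the masked positions to produce new samples is not shown.

```python
from typing import List, Tuple

MASK = "[MASK]"  # mask token used by BERT-style vocabularies


def entity_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, for each BIO-tagged entity.

    Simplification (assumption): entity types are not checked, so an I- tag
    continues whatever entity is currently open.
    """
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:      # close the previous entity
                spans.append((start, i))
            start = i
        elif tag.startswith("I-") and start is not None:
            continue                   # still inside the open entity
        else:
            if start is not None:      # an O tag (or stray I-) closes it
                spans.append((start, i))
            start = None
    if start is not None:              # entity running to the end
        spans.append((start, len(tags)))
    return spans


def mask_entities(tokens: List[str], tags: List[str],
                  mask_token: str = MASK) -> List[str]:
    """Replace every token of every entity with the mask token."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    masked = list(tokens)
    for start, end in entity_spans(tags):
        for i in range(start, end):
            masked[i] = mask_token
    return masked


# Hypothetical example sentence and labels.
tokens = ["Li", "Ming", "works", "at", "Acme", "Corp"]
tags = ["B-NAME", "I-NAME", "O", "O", "B-ORG", "I-ORG"]
print(mask_entities(tokens, tags))
```

Masking an entity as a whole unit, rather than token by token, keeps the surrounding context intact while leaving the full entity slot open for replacement, which is the general idea that whole-word masking relies on.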
Authors
LI Li (李莉)
SONG Han (宋涵)
LIU Peihe (刘培鹤)
CHEN Hanlin (陈汉林)
Department of Electronic and Communication Engineering, Beijing Electronic Science and Technology Institute, Beijing 100070, China
Source
Journal of Computer Applications (《计算机应用》)
Peking University Core Journal (北大核心)
2025, No. 9, pp. 2790-2797 (8 pages)
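Funding
Fundamental Research Funds for the Central Universities (3282023017, 3282024006, 3282023054)
Research and Practice Project on an Interdisciplinary Training Model for Innovative Talents in Electronic Information Engineering (jy202202)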
Keywords
sensitive information
dataset construction
data augmentation
Bidirectional Encoder Representations from Transformers (BERT)
Named Entity Recognition (NER)