A Cybersecurity Entity Recognition Approach Based on Character Representation Learning and Temporal Boundary Diffusion

Abstract: Cybersecurity entity recognition underpins threat information extraction and knowledge graph construction, and is essential for discovering and responding to cyber threats. Mainstream named entity recognition (NER) methods generalize poorly to the cybersecurity domain and have difficulty determining cybersecurity entity boundaries clearly. To address these problems, this paper proposes a cybersecurity entity recognition approach based on character representation learning and temporal boundary diffusion. The approach first decomposes the NER task into two subtasks, entity boundary detection and entity classification, which are handled separately. For entity boundary detection, a question-answering-based method encodes predefined questions together with the data, a dilated convolutional residual character network extracts character-level features, and a temporal boundary diffusion network determines the entity boundaries. For entity classification, the same question-answering method is used, and a classifier is trained independently to determine entity types. Finally, the output of the boundary detection task is fed into the entity classification task to determine each entity's type. To validate the approach, experiments are conducted on the cyber threat intelligence dataset DNRTI. The results show that improving boundary detection efficiency effectively enhances NER performance. In cybersecurity entity recognition, the approach has a low resource overhead and outperforms recently proposed baseline methods, improving the F1 score by 0.40% to 1.65% over methods from the last two years.

Objective: The vast amount of unstructured cybersecurity information available online holds significant value. Named Entity Recognition (NER) in cybersecurity facilitates the automatic extraction of such information, providing a foundation for cyber threat analysis and knowledge graph construction. However, existing cybersecurity NER research remains limited. It relies primarily on general-purpose approaches that struggle to generalize to domain-specific datasets, which often results in errors when recognizing cybersecurity-specific terms. Some recent studies decompose the NER task into entity boundary detection and entity classification and optimize these subtasks separately to enhance performance. However, the representation of complex cybersecurity entities often exceeds the capability of single-feature semantic representations, and existing boundary detection methods frequently produce misjudgments. To address these challenges, this study proposes a cybersecurity entity recognition approach based on character representation learning and temporal boundary diffusion. The approach integrates character-level feature extraction with a boundary diffusion network based on a denoising diffusion probabilistic model. By focusing on optimizing entity boundary detection, the proposed method improves performance in cybersecurity NER tasks.

Methods: The proposed approach divides the NER task into two subtasks, entity boundary detection and entity classification, which are processed independently, as illustrated in Fig. 1. For entity boundary detection, a Question-Answering (QA) framework is adopted. The framework first generates questions about the entities to be extracted, concatenates them with the corresponding input sentences, and encodes them using a pre-trained BERT model to extract preliminary semantic features. Character-level feature extraction is then performed using a Dilated Convolutional Residual Character Network (DCR-CharNet), which processes character-level information through dilated residual blocks. Dilated convolution expands the model's receptive field, capturing broader contextual information, while a self-attention mechanism dynamically identifies key features. These components enhance the global representation of input data and provide multi-dimensional feature representations. A Temporal Boundary Diffusion Network (TBDN) is then applied for entity boundary detection. TBDN employs a fixed forward diffusion process that introduces Gaussian noise to entity boundaries at each time step, progressively blurring them. A learnable reverse diffusion process subsequently predicts and removes noise at each time step, gradually recovering accurate entity boundaries and leading to precise boundary detection. For entity classification, an independent network is trained to assign labels to detected entities. Like boundary detection, this subtask also adopts a QA framework. A cybersecurity-specific pre-trained language model, SecRoBERTa, encodes the concatenated question and input data to extract entity classification features. These features are then processed through a linear-layer-based entity classifier, which outputs the recognized entity type.

Results and Discussions: The performance of the proposed approach is evaluated on the DNRTI cybersecurity dataset, with comparative results against baseline methods presented in Table 3. The proposed approach achieved a 0.40% improvement in F1-score over UTERMMF, a model incorporating character-level, part-of-speech, and positional features along with inter-word relationship classification. Compared to CTERMRFRAT, which employs an adversarial training framework, the proposed approach improved the F1-score by 1.65%. Additionally, it outperformed BERT+BiLSTM+CRF by 5.20% and achieved gains of 12.21%, 17.90%, and 18.31% over BERT, CNN+BiLSTM+CRF, and IDCNN+CRF, respectively. These results highlight that boundary detection accuracy is a key factor limiting NER performance, and that optimizing boundary detection methods can significantly enhance overall model effectiveness. The proposed approach's emphasis on boundary detection enables more accurate identification of entity boundaries, contributing to higher F1-scores. However, in terms of accuracy, it slightly underperforms CNN+BiLSTM+CRF. This discrepancy is attributed to class imbalance in the dataset, where certain categories are overrepresented while others are underrepresented. The approach demonstrates strong performance in handling minority categories, but its focus on rare entities slightly reduces prediction accuracy for common categories, affecting overall accuracy. Despite this trade-off, the approach enhances entity boundary detection, reducing misidentifications and improving precision and recall, thereby increasing the F1-score. Errors in boundary detection may propagate to the entity classification stage, impacting overall accuracy. However, the proposed two-stage approach, which prioritizes boundary detection optimization, ensures more precise boundary identification, which is crucial for improving NER performance. In terms of computational efficiency, the proposed approach is compared with DiffusionNER, another diffusion-based NER model (Table 4). Results indicate that the proposed approach requires fewer parameters, achieves faster inference speeds, and delivers higher F1-scores under the same hardware and software conditions.

Conclusions: Enhancing boundary detection efficiency significantly improves NER performance. The proposed approach reduces resource consumption while achieving superior performance compared to recent baseline methods in cybersecurity NER tasks.
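
The abstract describes a fixed forward process that adds Gaussian noise to entity boundaries and a learnable reverse process that removes it, but this record gives no implementation details. The sketch below is a minimal illustration only, assuming the standard closed-form forward step of a denoising diffusion probabilistic model, a linear noise schedule, and boundaries normalized to [-1, 1]. All function names, the schedule values, and the normalization are assumptions for illustration and are not taken from the paper. A trained denoiser is replaced here by the true noise, purely to show that the reverse step inverts the forward step.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Noise variances beta_1..beta_T, linearly spaced (an assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) over all s up to t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def normalize_span(start, end, seq_len):
    """Map token indices to [-1, 1] (assumed normalization; needs seq_len > 1)."""
    scale = lambda i: 2.0 * i / (seq_len - 1) - 1.0
    return [scale(start), scale(end)]

def denormalize_span(x, seq_len):
    """Map normalized boundaries back to clamped integer token indices."""
    def to_index(v):
        idx = round((v + 1.0) / 2.0 * (seq_len - 1))
        return min(seq_len - 1, max(0, idx))
    return to_index(x[0]), to_index(x[1])

def forward_diffuse(x0, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0) in closed form; returns (x_t, noise)."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    return xt, noise

def recover_x0(xt, predicted_noise, t, alpha_bars):
    """Invert the closed form given a noise estimate from some denoiser."""
    a = alpha_bars[t]
    return [(x - math.sqrt(1.0 - a) * e) / math.sqrt(a)
            for x, e in zip(xt, predicted_noise)]

rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))

# Hypothetical entity occupying tokens 3..5 of a 20-token sentence.
x0 = normalize_span(start=3, end=5, seq_len=20)
xt, noise = forward_diffuse(x0, t=500, alpha_bars=alpha_bars, rng=rng)

# The true noise stands in for a perfect denoiser's prediction.
recovered = denormalize_span(recover_x0(xt, noise, 500, alpha_bars), seq_len=20)
```

In a real system the `noise` argument passed to `recover_x0` would come from a trained network conditioned on the sentence encoding, and denoising would typically proceed over several time steps rather than one.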

Authors: HU Ze; LI Wenjun; YANG Hongyu (School of Safety Science and Engineering, Civil Aviation University of China, Tianjin 300300, China; School of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China)

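Affiliations: School of Safety Science and Engineering, Civil Aviation University of China; School of Computer Science and Technology, Civil Aviation University of China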

Source: Journal of Electronics & Information Technology (Peking University Core Journal), 2025, No. 5, pp. 1554-1568 (15 pages)

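Funding: National Natural Science Foundation of China (62201576, U1833107); Supporting Fund for the National Natural Science Foundation of China (3122023PT10).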

Keywords: Named Entity Recognition (NER); Cybersecurity; Boundary detection; Deep learning; Natural language processing

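Classification codes: TN915.08 [Electronics and Telecommunications: Communication and Information Systems]; TP391.1 [Automation and Computer Technology: Computer Application Technology]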

