Abstract
Software concept drift means that the structure and composition of software of the same type change over time. In malware classification, concept drift means that the structural and compositional features of samples from the same malware family change over time, so the performance of classification algorithms built on fixed patterns degrades as time passes. Existing static malware classification methods suffer significant performance degradation under concept drift, which makes it hard for them to meet the needs of practical applications. To address this problem, and given the commonalities between natural language understanding and the analysis of binary program byte streams, a highly accurate and robust malware classification method is proposed based on BERT and a custom autoencoder architecture. The method first extracts execution-oriented malware opcode sequences through disassembly analysis to reduce redundant information. It then uses BERT to capture the contextual semantics of the sequences and produce vector embeddings, so that the deep program semantics of the malware are effectively understood. Next, it selects effective task-related features through geometric median subspace projection and a bottleneck autoencoder. Finally, a classifier composed of fully connected layers outputs the classification result. The practical effectiveness of the proposed method is validated through comparative experiments against nine state-of-the-art malware classification methods in both normal and concept drift scenarios. Experimental results show that the proposed method achieves an F1 score of 99.49% in the normal scenario, higher than all nine compared methods, and that in the concept drift scenario its F1 score is 10.78% to 43.71% higher than those of the compared methods.
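The abstract names geometric median subspace projection as one feature-selection step but does not say how it is computed. As background only, the geometric median itself (the point minimizing the sum of Euclidean distances to a set of vectors) can be approximated with Weiszfeld's iteration. The sketch below is a minimal illustration under that assumption; the function name `geometric_median` and its parameters are hypothetical and are not taken from the paper, and the subspace projection step is not reproduced here.

```python
import math


def geometric_median(points, iters=200, eps=1e-9):
    """Approximate the geometric median of `points` with Weiszfeld's iteration.

    Hypothetical helper for illustration; not the paper's implementation.
    `points` is a non-empty sequence of equal-length numeric tuples.
    """
    dim = len(points[0])
    # Start from the coordinate-wise mean.
    est = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    for _ in range(iters):
        num = [0.0] * dim
        den = 0.0
        for p in points:
            # Weight each point by the inverse of its distance to the estimate;
            # the max() guards against division by zero.
            w = 1.0 / max(math.dist(p, est), eps)
            den += w
            for i in range(dim):
                num[i] += w * p[i]
        new = [x / den for x in num]
        converged = math.dist(new, est) < eps
        est = new
        if converged:
            break
    return est


if __name__ == "__main__":
    # Four clustered vectors plus one distant outlier.
    embeddings = [(0, 0), (1, 0), (0, 1), (1, 1), (100, 100)]
    print(geometric_median(embeddings))
```

Unlike the coordinate-wise mean, which the single outlier drags far from the cluster, the geometric median stays close to the bulk of the points. That robustness is a plausible reason to prefer it when some samples have drifted, though the abstract does not state the authors' motivation.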
Authors
赵浩钧
邹德清
薛文杰
吴月明
金海
ZHAO Hao-Jun; ZOU De-Qing; XUE Wen-Jie; WU Yue-Ming; JIN Hai (National Engineering Research Center for Big Data Technology and System, Wuhan 430074, China; Key Laboratory of Services Computing Technology and System, Ministry of Education, Wuhan 430074, China; Hubei Engineering Research Center on Big Data Security, Wuhan 430074, China; Hubei Key Laboratory of Cluster and Grid Computing, Wuhan 430074, China; School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan 430074, China; School of Computing and Data Science, Nanyang Technological University, Singapore 639798, Singapore; School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China)
Source
Journal of Software (《软件学报》)
PKU Core Journal (北大核心)
2025, Issue 8, pp. 3709-3725 (17 pages)
Funding
National Natural Science Foundation of China (62172168).