The proliferation of malware and the emergence of adversarial samples pose severe threats to global cybersecurity and demand robust detection mechanisms. Traditional malware detection methods suffer from limited feature extraction capabilities, while existing Vision Transformer (ViT)-based approaches incur high computational complexity from global self-attention, which limits their efficiency on large-scale image data. To address these issues, this paper proposes HERL-ViT, a novel hybrid enhanced Vision Transformer architecture tailored for malware detection. The detection framework has five phases: malware image visualization, image segmentation with patch embedding, regional-local attention-based feature extraction, enhanced feature transformation, and classification. Methodologically, HERL-ViT integrates a multi-level pyramid structure to capture multi-scale features, a regional-to-local attention mechanism to reduce computational complexity, an Optimized Position Encoding Generator (PEG) for dynamic relative position encoding, and enhanced MLP and downsampling modules to balance performance and efficiency. Key contributions include: (1) a unified framework integrating visualization, adversarial training, and hybrid attention for malware detection; (2) regional-local attention that achieves both global awareness and local detail capture at lower complexity; (3) an Optimized PEG that enhances spatial perception and reduces overfitting; (4) a lightweight network design (5.8M parameters) that ensures high efficiency. Experimental results show that HERL-ViT achieves 99.2% accuracy (loss = 0.066) on malware classification and 98.9% accuracy (loss = 0.081) on adversarial samples, demonstrating superior performance and robustness compared with state-of-the-art methods.
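The first two phases named in the abstract (malware image visualization and image segmentation into patches) can be illustrated with a minimal sketch. The abstract does not specify how HERL-ViT builds its images or patches, so everything below is an assumption made for illustration: the function names `bytes_to_image` and `split_into_patches` are hypothetical, the one-byte-per-pixel grayscale layout is a common convention rather than the paper's stated method, and the image width and patch size are arbitrary.

```python
from typing import List


def bytes_to_image(data: bytes, width: int = 16) -> List[List[int]]:
    """Lay raw bytes out row by row as a grayscale image.

    Assumption: one byte maps to one pixel with intensity 0-255.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    # Zero-pad so that the last row is complete.
    padding = (-len(data)) % width
    padded = data + bytes(padding)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


def split_into_patches(image: List[List[int]], patch: int = 4) -> List[List[int]]:
    """Cut an image into non-overlapping patch x patch tiles.

    Each tile is flattened row by row into a vector, which is the usual
    input to a (learned) patch-embedding layer. No embedding is applied here.
    """
    if not image or not image[0]:
        raise ValueError("image must be non-empty")
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image dimensions must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [
                image[row][col]
                for row in range(top, top + patch)
                for col in range(left, left + patch)
            ]
            patches.append(tile)
    return patches


if __name__ == "__main__":
    # Stand-in for the bytes of a binary file (hypothetical sample data).
    sample = bytes(range(256))
    image = bytes_to_image(sample, width=16)
    patches = split_into_patches(image, patch=4)
    print(len(image), len(image[0]), len(patches), len(patches[0]))
```

In a full pipeline, each flattened patch would then be projected by a learned embedding before the attention stages; that step is omitted here because the abstract gives no details about it.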
Funding: This work was funded by the Special Project of Langfang Key Research and Development under Grant No. 2023011005B and the Technology Innovation Platform Construction Project of North China Institute of Aerospace Engineering under Grant No. CXPT-2023-02.