Abstract
Most research on malicious code classification addresses either family classification or the malicious-versus-benign distinction, while classification by category (type) has received comparatively little attention. To address this gap, a malicious code classification algorithm based on multi-feature fusion is proposed. Three groups of features are extracted from texture images and disassembly files and fused for classification. First, gray-level co-occurrence matrix (GLCM) features are extracted from the source files and the disassembly files, and opcode sequences are extracted with the n-gram algorithm. Second, an improved Information Gain (IG) algorithm is used to select opcode features. Third, the feature groups are standardized and learned with Random Forest (RF) as the classifier. Finally, a random forest classifier based on multi-feature fusion is implemented. In training and testing on nine categories of malicious code, the proposed algorithm achieves an accuracy of 85%, which is higher than that of random forest with a single feature group and of multi-layer perceptron and Logistic regression classifiers with multiple features.
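The feature-extraction steps named in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not describe the "improved" IG variant, so the textbook information gain is used instead; the n-gram length, the GLCM pixel offset, the choice of contrast as the texture statistic, and all function names are hypothetical.

```python
import math
from collections import Counter


def opcode_ngrams(opcodes, n=3):
    """Count contiguous opcode n-grams in one disassembled sample."""
    return Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(present, labels):
    """Textbook IG of a binary 'n-gram present' feature (not the paper's improved variant)."""
    gain = entropy(labels)
    for value in (True, False):
        subset = [y for p, y in zip(present, labels) if p == value]
        if subset:
            gain -= len(subset) / len(labels) * entropy(subset)
    return gain


def glcm(image, levels, dx=1, dy=0):
    """Normalized gray-level co-occurrence matrix for one pixel offset (assumed: right neighbor)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows - dy):
        for c in range(cols - dx):
            counts[image[r][c]][image[r + dy][c + dx]] += 1
    total = sum(map(sum, counts)) or 1
    return [[v / total for v in row] for row in counts]


def glcm_contrast(p):
    """Contrast texture statistic: sum over (i, j) of (i - j)^2 * p[i][j]."""
    return sum((i - j) ** 2 * v for i, row in enumerate(p) for j, v in enumerate(row))


def standardize(column):
    """Zero-mean, unit-variance scaling of one feature column before fusion."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column)) or 1.0
    return [(x - mean) / std for x in column]
```

In a full pipeline, the n-grams with the highest information gain would be kept as opcode features, concatenated with the texture statistics of each sample, standardized column by column, and passed to a random forest classifier; that last step is omitted here.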
Authors
LANG Dapeng
DING Wei
JIANG Haochen
CHEN Zhiyuan
College of Computer Science and Technology, Harbin Engineering University, Harbin, Heilongjiang 150001, China; Key Laboratory of Network Assessment Technology, Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China
Source
Journal of Computer Applications (《计算机应用》)
CSCD
PKU Core Journal (北大核心)
2019, Issue 8, pp. 2333-2338 (6 pages)
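Funding
Open Project of the Key Laboratory of Network Assessment Technology, Institute of Information Engineering, Chinese Academy of Sciences (10201050201)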
Keywords
malicious code
texture feature
opcode sequence
Random Forest (RF)
static analysis