Funding: Supported by the Key Laboratory of Cyberspace Security, Ministry of Education, China.
Abstract: Transformer-based models have significantly advanced binary code similarity detection (BCSD) by leveraging their semantic encoding capabilities for efficient function matching across diverse compilation settings. Although adversarial examples can strategically undermine the accuracy of BCSD models and protect critical code, existing techniques predominantly depend on inserting artificial instructions, which incurs high computational costs and offers limited diversity of perturbations. To address these limitations, we propose AIMA, a novel gradient-guided assembly instruction relocation method. Our method decouples the detection model into tokenization, embedding, and encoding layers to enable efficient gradient computation. Since the token IDs of instructions are discrete and non-differentiable, we compute gradients in the continuous embedding space to evaluate the influence of each token. The most critical tokens are identified by calculating the L2 norm of their embedding gradients. We then establish a mapping between instructions and their corresponding tokens to aggregate token-level importance into instruction-level significance. To maximize adversarial impact, a sliding window algorithm selects the most influential contiguous segments for relocation, ensuring optimal perturbation with minimal length. This approach efficiently locates critical code regions without expensive search operations. The selected segments are relocated outside their original function boundaries via a jump mechanism, which preserves runtime control flow and functionality while introducing "deletion" effects in the static instruction sequence. Extensive experiments show that AIMA reduces similarity scores by up to 35.8% in state-of-the-art BCSD models. When incorporated into training data, it also enhances model robustness, achieving a 5.9% improvement in AUROC.
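The scoring and selection steps described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the gradients are assumed to be already available as plain lists, token scores are assumed to be aggregated by summation, and the window width is assumed to be fixed.

```python
from math import sqrt


def token_importance(grad):
    """L2 norm of one token's embedding gradient."""
    return sqrt(sum(g * g for g in grad))


def instruction_importance(token_grads, token_to_instr, n_instr):
    """Aggregate token-level scores into per-instruction scores.

    Assumption: aggregation is a plain sum over the tokens of an instruction.
    token_to_instr[k] is the index of the instruction that token k belongs to.
    """
    scores = [0.0] * n_instr
    for grad, instr in zip(token_grads, token_to_instr):
        scores[instr] += token_importance(grad)
    return scores


def best_window(scores, width):
    """Return (start, total) of the contiguous window with the highest total score."""
    if not 0 < width <= len(scores):
        raise ValueError("width must be between 1 and len(scores)")
    total = sum(scores[:width])
    best_start, best_total = 0, total
    for start in range(1, len(scores) - width + 1):
        # Slide by one: add the entering score, drop the leaving one.
        total += scores[start + width - 1] - scores[start - 1]
        if total > best_total:
            best_start, best_total = start, total
    return best_start, best_total
```

The running-sum update makes the window scan linear in the number of instructions, which is one plausible reading of locating critical regions "without expensive search operations".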
Funding: The National Natural Science Foundation of China (No. 61165008).
Abstract: This paper proposes a new sequential similarity detection algorithm (SSDA) that overcomes matching errors caused by grayscale distortion while consuming much less time than conventional algorithms based on image features. The algorithm applies the Sobel operator to the sub-image and the template image, and takes the region with maximum relevance as the final result. To address the high time cost of the original algorithm, a coarse-to-fine matching method is put forward. In addition, the location correlation is updated continuously so that it holds the minimum value throughout the scanning process, which significantly decreases time consumption. Experiments show that the proposed algorithm not only overcomes grayscale distortion but also ensures accuracy. Its time consumption is at least one order of magnitude lower than that of the original algorithm.
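The idea of keeping a running minimum during the scan can be sketched as follows. This is a simplified illustration, not the paper's algorithm: it omits the Sobel preprocessing and the coarse-to-fine stage, uses the sum of absolute differences as the error measure (an assumption), and the function name is hypothetical.

```python
def ssda_match(image, template):
    """Return ((row, col), error) for the best template position.

    image and template are 2-D lists of grayscale values. The accumulated
    absolute difference at each position is compared with the smallest error
    found so far; a position is abandoned as soon as it can no longer win.
    """
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best_pos, best_err = None, float("inf")
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            err = 0
            abandoned = False
            for i in range(th):
                for j in range(tw):
                    err += abs(image[r + i][c + j] - template[i][j])
                if err >= best_err:  # running minimum acts as the threshold
                    abandoned = True
                    break
            if not abandoned:
                best_pos, best_err = (r, c), err
    return best_pos, best_err
```

Because the threshold only ever decreases, later positions tend to be rejected after a few rows, which is where the time saving comes from.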
Funding: The National Natural Science Foundation of China (No. 51675100).
Abstract: To address the end effect that arises in empirical mode decomposition (EMD), a method for constraining the end effect is proposed based on sequential similarity detection and an adaptive filter. The method divides the signal into many wavelets and varies the initial wavelet length to select the best initial wavelet, namely the one with the minimum error and the maximum number of matching seed wavelets. The wavelet slopes are used for pre-matching and secondary matching to speed up the matching process. Then, a folded self-adaptive threshold is used to select multiple seed wavelets, and finally the end waveform is predicted and extended according to the adaptive filter method. The proposed method is used to analyze a non-stationary nonlinear simulation signal and an experimental signal, and it is compared with the mirror extension and RBF extension methods. The orthogonality index and similarity index of the EMD results for the signal extended by the proposed method are better than those of the other methods. The results show that the proposed method constrains the end effect better and is valid, accurate, and stable in solving the end effect problem.
Abstract: Binary Code Similarity Detection (BCSD) is vital for vulnerability discovery, malware detection, and software security, especially when source code is unavailable. Yet it faces challenges from semantic loss, recompilation variations, and obfuscation. Recent advances in artificial intelligence, particularly natural language processing (NLP), graph representation learning (GRL), and large language models (LLMs), have markedly improved accuracy, enabling better recognition of code variants and deeper semantic understanding. This paper presents a comprehensive review of 82 studies published between 1975 and 2025, systematically tracing the historical evolution of BCSD and analyzing the progressive incorporation of artificial intelligence (AI) techniques. Particular emphasis is placed on the role of LLMs, which have recently emerged as transformative tools in advancing semantic representation and enhancing detection performance. The review is organized around five central research questions: (1) the chronological development and milestones of BCSD; (2) the construction of AI-driven technical roadmaps that chart methodological transitions; (3) the design and implementation of general analytical workflows for binary code analysis; (4) the applicability, strengths, and limitations of LLMs in capturing semantic and structural features of binary code; and (5) the persistent challenges and promising directions for future investigation. By synthesizing insights across these dimensions, the study demonstrates how LLMs reshape the landscape of binary code analysis, offering unprecedented opportunities to improve accuracy, scalability, and adaptability in real-world scenarios. This review not only bridges a critical gap in the existing literature but also provides a forward-looking perspective, serving as a valuable reference for researchers and practitioners aiming to advance AI-powered BCSD methodologies and applications.
Abstract: On the basis of Hartmann-Shack sensor imaging analysis, a new method is presented with which the wavefront slope can be obtained when the object is incoherent and extended. This method, which is demonstrated by both theoretical analysis and computer simulation, explains how to measure the wavefront slope difference between two sub-apertures by determining the image displacements on the detector plane. It includes a fast and accurate digital algorithm for detecting wavefront disturbance, which is well suited to implementation in hardware such as digital signal processors.
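The abstract does not spell out the digital algorithm, so the following Python sketch only illustrates the general idea of determining the displacement between two sub-aperture images. It is a stand-in under stated assumptions: the images are reduced to 1-D intensity profiles of equal length, the shift is an integer number of pixels, the cost is the mean absolute difference over the overlapping samples, and the function name is hypothetical.

```python
def estimate_shift(reference, image, max_shift):
    """Return the integer shift s that best aligns image[i + s] with reference[i].

    The cost of a candidate shift is the mean absolute difference over the
    overlapping samples; the shift with the lowest cost wins.
    """
    n = len(reference)
    if len(image) != n:
        raise ValueError("profiles must have the same length")
    best_shift, best_cost = 0, float("inf")
    for s in range(-max_shift, max_shift + 1):
        lo, hi = max(0, -s), min(n, n - s)  # indices where i and i + s are both valid
        if hi <= lo:
            continue
        cost = sum(abs(reference[i] - image[i + s]) for i in range(lo, hi)) / (hi - lo)
        if cost < best_cost:
            best_shift, best_cost = s, cost
    return best_shift
```

In a Hartmann-Shack sensor, a displacement of this kind is proportional to the local wavefront slope, so the difference between the shifts measured in two sub-apertures reflects the slope difference between them.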
Funding: Supported by the Basic Research Program (No. JCKY2019210B029) and Network threat depth analysis software (KY10800210013).
Abstract: Recently, security issues in smart contracts have attracted great attention due to the enormous financial losses caused by vulnerability attacks. As critical security issues in smart contracts multiply, there is an increasing need to detect similar code in order to hunt for vulnerabilities. Binary similarity detection, which quantitatively measures the difference between given pieces of code, has been widely adopted to facilitate critical security analysis. However, because smart contracts differ from common programs, for example in the diversity of bytecode generation and their high code homogeneity, directly applying existing graph matching and machine learning based techniques to smart contracts suffers from low accuracy, poor scalability, and the limitation of binary similarity to the function level. Therefore, this paper investigates graph neural networks for detecting smart contract binary code similarity at the program level. We conduct instruction-level normalization to reduce noise code during smart contract pre-processing and construct contract control flow graphs to represent smart contracts. In particular, two improved models, a Graph Convolutional Network (GCN) and a Message Passing Neural Network (MPNN), are explored to encode the contract graphs into quantitative vectors that capture the semantic information and the program-wide control flow information with temporal order. Similarity detection can then be accomplished efficiently by measuring the distance between two target contract embeddings. To evaluate the effectiveness and efficiency of the proposed method, extensive experiments are performed on two real-world datasets, i.e., smart contracts from the Ethereum and Enterprise Operation System (EOS) blockchain-based platforms. The results show that the proposed approach outperforms three state-of-the-art methods by a large margin, achieving improvements of up to 6.1% and 17.06% in accuracy.
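The final step, comparing two contract embeddings, can be sketched in a few lines of Python. The abstract does not name the distance measure, so cosine similarity is used here as an assumption; the function names and the dictionary layout are hypothetical, and the embeddings are assumed to be plain lists of floats already produced by an encoder.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero embedding as dissimilar to everything
    return dot / (norm_a * norm_b)


def most_similar(query, candidates):
    """Rank candidate names by similarity to the query embedding, best first.

    candidates maps a contract name to its embedding.
    """
    return sorted(
        candidates,
        key=lambda name: cosine_similarity(query, candidates[name]),
        reverse=True,
    )
```

Once every contract has been encoded, a comparison costs only one pass over the two vectors, which is why embedding-based detection scales to program-level matching.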
Funding: This work was supported by the National Natural Science Foundation of China (Grant Nos. 61202092 and 61173021), the Research Fund for the Doctoral Program of Higher Education of China (20112302120052), the Research Fund for the Innovative Scholars of Harbin (RC2013QN010001), and the Young Colleger Academic Backbone Project of Heilongjiang.
Abstract: Traditional similar code detection approaches are limited in detecting semantically similar code, which impedes their application in practice. In this paper, we improve the traditional metrics-based approach as well as the graph-based approach and present a combined metrics-based and graph-based approach. First, source code is represented as augmented system dependence graphs. Then, metrics-based candidate similar code extraction is performed to filter out most of the dissimilar code pairs and thereby lower the computational complexity. After that, code normalization is performed on the candidate similar code to remove code variations, so that similar code can be detected at the semantic level. Finally, program matching is performed on the normalized control dependence trees to output semantically similar code. Experimental results show that our approach can detect similar code with code variations and that it can be applied to large software.
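The metrics-based filtering stage can be pictured with a short Python sketch. The abstract does not list the metrics or the filtering rule, so everything here is illustrative: each code fragment is assumed to carry a tuple of numeric metrics, a pair is kept when every metric differs by at most a tolerance, and the function name is hypothetical.

```python
from itertools import combinations


def candidate_pairs(metrics, tolerance):
    """Return fragment pairs whose metric vectors are close in every dimension.

    metrics maps a fragment name to a tuple of numeric metrics. Pairs that
    fail this cheap test are filtered out before the costlier matching step.
    """
    pairs = []
    for a, b in combinations(sorted(metrics), 2):
        if all(abs(x - y) <= tolerance for x, y in zip(metrics[a], metrics[b])):
            pairs.append((a, b))
    return pairs
```

For example, with metrics `{"f": (10, 3), "g": (11, 3), "h": (40, 9)}` and a tolerance of 1, only the pair `("f", "g")` survives, so the expensive matching step runs on one pair instead of three.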