Funding: Supported by the Key Laboratory of Cyberspace Security, Ministry of Education, China.
Abstract: Transformer-based models have significantly advanced binary code similarity detection (BCSD) by leveraging their semantic encoding capabilities for efficient function matching across diverse compilation settings. Although adversarial examples can strategically undermine the accuracy of BCSD models and protect critical code, existing techniques predominantly depend on inserting artificial instructions, which incur high computational costs and offer limited diversity of perturbations. To address these limitations, we propose AIMA, a novel gradient-guided assembly instruction relocation method. Our method decouples the detection model into tokenization, embedding, and encoding layers to enable efficient gradient computation. Since token IDs of instructions are discrete and non-differentiable, we compute gradients in the continuous embedding space to evaluate the influence of each token. The most critical tokens are identified by calculating the L2 norm of their embedding gradients. We then establish a mapping between instructions and their corresponding tokens to aggregate token-level importance into instruction-level significance. To maximize adversarial impact, a sliding window algorithm selects the most influential contiguous segments for relocation, ensuring optimal perturbation with minimal length. This approach efficiently locates critical code regions without expensive search operations. The selected segments are relocated outside their original function boundaries via a jump mechanism, which preserves runtime control flow and functionality while introducing “deletion” effects in the static instruction sequence. Extensive experiments show that AIMA reduces similarity scores by up to 35.8% in state-of-the-art BCSD models. When incorporated into training data, it also enhances model robustness, achieving a 5.9% improvement in AUROC.
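The selection steps described in this abstract (per-token L2 gradient norms, aggregation to instruction level, and a sliding window over contiguous instructions) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the AIMA implementation: the function names are hypothetical, token scores are aggregated by simple summation, the window length is fixed, and the gradient values are made-up stand-ins for what a real model's backward pass would produce.

```python
import math


def token_importance(embedding_grads):
    """Score each token by the L2 norm of its embedding gradient."""
    return [math.sqrt(sum(g * g for g in grad)) for grad in embedding_grads]


def instruction_importance(token_scores, token_to_instr, n_instr):
    """Aggregate token scores per instruction (summation is an assumption)."""
    scores = [0.0] * n_instr
    for score, instr in zip(token_scores, token_to_instr):
        scores[instr] += score
    return scores


def best_window(instr_scores, window):
    """Return (start, total) of the contiguous window with the largest total score."""
    if not 1 <= window <= len(instr_scores):
        raise ValueError("window must be between 1 and the number of instructions")
    current = sum(instr_scores[:window])
    best_start, best_total = 0, current
    for start in range(1, len(instr_scores) - window + 1):
        # Slide by one instruction: add the entering score, drop the leaving one.
        current += instr_scores[start + window - 1] - instr_scores[start - 1]
        if current > best_total:
            best_start, best_total = start, current
    return best_start, best_total


# Made-up gradients for 5 tokens belonging to 3 instructions.
grads = [[3.0, 4.0], [0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [6.0, 8.0]]
token_to_instr = [0, 0, 1, 2, 2]

tok = token_importance(grads)
ins = instruction_importance(tok, token_to_instr, n_instr=3)
start, total = best_window(ins, window=2)
print(ins, start, total)
```

The incremental update keeps the scan linear in the number of instructions, which fits the abstract's point that critical regions are located without an expensive search.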
Abstract: Binary Code Similarity Detection (BCSD) is vital for vulnerability discovery, malware detection, and software security, especially when source code is unavailable. Yet, it faces challenges from semantic loss, recompilation variations, and obfuscation. Recent advances in artificial intelligence, particularly natural language processing (NLP), graph representation learning (GRL), and large language models (LLMs), have markedly improved accuracy, enabling better recognition of code variants and deeper semantic understanding. This paper presents a comprehensive review of 82 studies published between 1975 and 2025, systematically tracing the historical evolution of BCSD and analyzing the progressive incorporation of artificial intelligence (AI) techniques. Particular emphasis is placed on the role of LLMs, which have recently emerged as transformative tools in advancing semantic representation and enhancing detection performance. The review is organized around five central research questions: (1) the chronological development and milestones of BCSD; (2) the construction of AI-driven technical roadmaps that chart methodological transitions; (3) the design and implementation of general analytical workflows for binary code analysis; (4) the applicability, strengths, and limitations of LLMs in capturing semantic and structural features of binary code; and (5) the persistent challenges and promising directions for future investigation. By synthesizing insights across these dimensions, the study demonstrates how LLMs reshape the landscape of binary code analysis, offering unprecedented opportunities to improve accuracy, scalability, and adaptability in real-world scenarios. This review not only bridges a critical gap in the existing literature but also provides a forward-looking perspective, serving as a valuable reference for researchers and practitioners aiming to advance AI-powered BCSD methodologies and applications.
Funding: Supported by the Office of Research, Innovation, Commercialization and Consultancy (ORICC), Universiti Tun Hussein Onn Malaysia (UTHM), Malaysia (No. U063).
Abstract: Various binary similarity measures have been employed in clustering approaches to make homogeneous groups of similar entities in the data. These similarity measures are mostly based only on the presence or absence of features. Binary similarity measures have also been explored with different clustering approaches (e.g., agglomerative hierarchical clustering) for software modularization to make software systems understandable and manageable. Each similarity measure has its own strengths and weaknesses, which improve and deteriorate the clustering results, respectively. We highlight the strengths of some well-known existing binary similarity measures for software modularization. Furthermore, based on these existing similarity measures, we introduce several improved new binary similarity measures. Proofs of correctness with illustrations and a series of experiments are presented to evaluate the effectiveness of our new binary similarity measures.
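To make "based only on the presence or absence of features" concrete, here is a small sketch of two standard binary similarity measures, the Jaccard coefficient and the simple matching coefficient, computed from the usual a/b/c/d contingency counts. The abstract does not say which measures the paper analyzes or what its new measures are, so these two are chosen purely for illustration, and the feature vectors are invented.

```python
def contingency(x, y):
    """Count feature agreements between two equal-length binary vectors.

    a: present in both, b: only in x, c: only in y, d: absent in both.
    """
    if len(x) != len(y) or not x:
        raise ValueError("vectors must be non-empty and of equal length")
    a = sum(1 for i, j in zip(x, y) if i and j)
    b = sum(1 for i, j in zip(x, y) if i and not j)
    c = sum(1 for i, j in zip(x, y) if not i and j)
    d = sum(1 for i, j in zip(x, y) if not i and not j)
    return a, b, c, d


def jaccard(x, y):
    """a / (a + b + c): ignores features absent from both entities."""
    a, b, c, _ = contingency(x, y)
    return a / (a + b + c) if (a + b + c) else 0.0


def simple_matching(x, y):
    """(a + d) / (a + b + c + d): joint absences count as agreement."""
    a, b, c, d = contingency(x, y)
    return (a + d) / (a + b + c + d)


# Two hypothetical software entities described by five binary features.
entity_x = [1, 1, 0, 0, 1]
entity_y = [1, 0, 0, 1, 1]
print(jaccard(entity_x, entity_y), simple_matching(entity_x, entity_y))
```

The two measures differ only in how they treat `d`, the features absent from both entities. That treatment is one common source of the differing strengths and weaknesses the abstract refers to.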
Funding: Supported by the National Natural Science Foundation of China (50206019).
Abstract: A new algorithm using polar coordinate system similarity (PCSS) for tracking particles in particle tracking velocimetry (PTV) is proposed. The essence of the algorithm is to consider simultaneously the changes in the distance and angle of surrounding particles relative to the object particle. Monte Carlo simulations of a solid-body rotational flow and a parallel shearing flow are used to investigate the flows measurable by PCSS and the influence of experimental parameters on the implementation of the new algorithm. The results indicate that the PCSS algorithm can be applied to flows subjected to strong rotation and is less sensitive to experimental parameters than the conventional binary image cross-correlation (BICC) algorithm. Finally, PCSS is applied to images of a real experiment.
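The idea of comparing the distance and angle of surrounding particles relative to the object particle can be illustrated with the rough sketch below. It is not the published PCSS algorithm: the scoring rule (counting neighbors that agree within a distance tolerance and an angle tolerance), the parameter names `radius`, `r_tol`, and `theta_tol`, and the sample coordinates are all assumptions made for illustration.

```python
import math


def polar_neighbors(center, particles, radius):
    """Polar coordinates (r, theta) of particles within `radius` of `center`."""
    cx, cy = center
    neighbors = []
    for x, y in particles:
        dx, dy = x - cx, y - cy
        r = math.hypot(dx, dy)
        if 0.0 < r <= radius:  # skip the center particle itself
            neighbors.append((r, math.atan2(dy, dx)))
    return neighbors


def similarity(p1, frame1, p2, frame2, radius, r_tol, theta_tol):
    """Count neighbors of p1 that have a counterpart around p2 within both tolerances."""
    n1 = polar_neighbors(p1, frame1, radius)
    n2 = polar_neighbors(p2, frame2, radius)
    score = 0
    for r1, t1 in n1:
        for r2, t2 in n2:
            # Wrap the angle difference into [0, pi].
            dtheta = abs((t1 - t2 + math.pi) % (2 * math.pi) - math.pi)
            if abs(r1 - r2) <= r_tol and dtheta <= theta_tol:
                score += 1
                break
    return score


def match(p1, frame1, frame2, radius, r_tol, theta_tol):
    """Pick the frame-2 particle whose neighborhood best resembles that of p1."""
    return max(
        frame2,
        key=lambda p2: similarity(p1, frame1, p2, frame2, radius, r_tol, theta_tol),
    )


# Invented data: frame 2 is frame 1 shifted by (0.5, 0.2).
frame1 = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
frame2 = [(0.5, 0.2), (1.5, 0.2), (0.5, 1.2), (5.5, 5.2)]
best = match((0.0, 0.0), frame1, frame2, radius=2.0, r_tol=0.1, theta_tol=0.2)
print(best)
```

In this sketch the two tolerances are independent, so one can be loosened without loosening the other.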
Funding: Conducted under the illuMINEation project, funded by the European Union’s Horizon 2020 research and innovation program under grant agreement No. 869379; supported by the China Scholarship Council (No. 202006370006).
Abstract: A procedure to recognize individual discontinuities in rock mass from measurement while drilling (MWD) technology is developed, using the binary pattern of structural rock characteristics obtained from in-hole images for calibration. Data from two underground operations with different drilling technology and different rock mass characteristics are considered, which generalizes the application of the methodology to different sites and ensures the full operational integration of MWD data analysis. Two approaches are followed for site-specific structural model building: a discontinuity index (DI) built from variations in MWD parameters, and a machine learning (ML) classifier as a function of the drilling parameters and their variability. The prediction ability of the models is quantitatively assessed as the rate of recognition of discontinuities observed in borehole logs. Differences between the parameters involved in the models for each site, and differences in their weights, highlight the site-dependence of the resulting models. The ML approach offers better performance than the classical DI, with recognition rates in the range of 89% to 96%. However, the simpler DI still yields fairly accurate results, with recognition rates of 70% to 90%. These results validate the adaptive MWD-based methodology as an engineering solution to predict rock structural condition in underground mining operations.
Abstract: Code similarity analysis has become more popular due to its significant applications, including vulnerability detection, malware detection, and patch analysis. Since the source code of software is difficult to obtain under most circumstances, binary-level code similarity analysis (BCSA) has received much attention. In recent years, many BCSA studies incorporating AI techniques have focused on deriving semantic information from binary functions with code representations such as assembly code, intermediate representations, and control flow graphs to measure similarity. However, due to the impact of different compilers, architectures, and obfuscations, binaries compiled from the same source code may vary considerably, which becomes the major obstacle for these works to obtain robust features. In this paper, we propose a solution named UPPC (Unleashing the Power of Pseudo-code), which takes the pseudo-code of binary functions as input to address the binary code similarity analysis challenge, since pseudo-code has higher abstraction and is platform-independent compared to binary instructions. UPPC selectively inlines functions to capture the full function semantics across different compiler optimization levels and uses a deep pyramidal convolutional neural network to obtain the semantic embedding of the function. We evaluated UPPC on a data set containing vulnerabilities and a data set including different architectures (X86, ARM), different optimization options (O0-O3), different compilers (GCC, Clang), and four obfuscation strategies. The experimental results show that the accuracy of UPPC in function search is 33.2% higher than that of existing methods.