Abstract: Binary Code Similarity Detection (BCSD) is vital for vulnerability discovery, malware detection, and software security, especially when source code is unavailable. Yet it faces challenges from semantic loss, recompilation variations, and obfuscation. Recent advances in artificial intelligence, particularly natural language processing (NLP), graph representation learning (GRL), and large language models (LLMs), have markedly improved accuracy, enabling better recognition of code variants and deeper semantic understanding. This paper presents a comprehensive review of 82 studies published between 1975 and 2025, systematically tracing the historical evolution of BCSD and analyzing the progressive incorporation of artificial intelligence (AI) techniques. Particular emphasis is placed on the role of LLMs, which have recently emerged as transformative tools for advancing semantic representation and enhancing detection performance. The review is organized around five central research questions: (1) the chronological development and milestones of BCSD; (2) the construction of AI-driven technical roadmaps that chart methodological transitions; (3) the design and implementation of general analytical workflows for binary code analysis; (4) the applicability, strengths, and limitations of LLMs in capturing semantic and structural features of binary code; and (5) the persistent challenges and promising directions for future investigation. By synthesizing insights across these dimensions, the study demonstrates how LLMs reshape the landscape of binary code analysis, offering new opportunities to improve accuracy, scalability, and adaptability in real-world scenarios. The review bridges a critical gap in the existing literature and provides a forward-looking perspective, serving as a reference for researchers and practitioners aiming to advance AI-powered BCSD methodologies and applications.
Funding: Supported by the Key Laboratory of Cyberspace Security, Ministry of Education, China.
Abstract: Transformer-based models have significantly advanced binary code similarity detection (BCSD) by leveraging their semantic encoding capabilities for efficient function matching across diverse compilation settings. Although adversarial examples can strategically undermine the accuracy of BCSD models and protect critical code, existing techniques predominantly depend on inserting artificial instructions, which incurs high computational costs and offers limited diversity of perturbations. To address these limitations, we propose AIMA, a novel gradient-guided assembly instruction relocation method. Our method decouples the detection model into tokenization, embedding, and encoding layers to enable efficient gradient computation. Since the token IDs of instructions are discrete and non-differentiable, we compute gradients in the continuous embedding space to evaluate the influence of each token. The most critical tokens are identified by calculating the L2 norm of their embedding gradients. We then establish a mapping between instructions and their corresponding tokens to aggregate token-level importance into instruction-level significance. To maximize adversarial impact, a sliding window algorithm selects the most influential contiguous segments for relocation, ensuring optimal perturbation with minimal length. This approach efficiently locates critical code regions without expensive search operations. The selected segments are relocated outside their original function boundaries via a jump mechanism, which preserves runtime control flow and functionality while introducing "deletion" effects in the static instruction sequence. Extensive experiments show that AIMA reduces similarity scores by up to 35.8% in state-of-the-art BCSD models. When incorporated into training data, it also enhances model robustness, achieving a 5.9% improvement in AUROC.
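The scoring and selection steps described in this abstract (L2 norm of each token's embedding gradient, aggregation to instruction level, sliding-window selection of a contiguous segment) can be illustrated with a minimal sketch. This is not the AIMA implementation: the function names are hypothetical, the gradients are assumed to be already computed and passed in as plain lists rather than obtained by backpropagation through a model, and both the aggregation by summation and the fixed window size are assumptions the abstract does not specify.

```python
import math
from typing import List, Sequence, Tuple


def token_importance(token_grads: Sequence[Sequence[float]]) -> List[float]:
    """Score each token by the L2 norm of its embedding gradient."""
    return [math.sqrt(sum(g * g for g in grad)) for grad in token_grads]


def instruction_importance(
    token_scores: Sequence[float],
    token_to_instr: Sequence[int],
    num_instr: int,
) -> List[float]:
    """Aggregate token scores per instruction (summation is an assumption)."""
    scores = [0.0] * num_instr
    for score, instr_idx in zip(token_scores, token_to_instr):
        scores[instr_idx] += score
    return scores


def best_window(instr_scores: Sequence[float], window: int) -> Tuple[int, float]:
    """Return (start index, total score) of the highest-scoring contiguous
    run of `window` instructions, using a running sum in O(n)."""
    if not 0 < window <= len(instr_scores):
        raise ValueError("window must be between 1 and the instruction count")
    current = sum(instr_scores[:window])
    best_start, best_total = 0, current
    for start in range(1, len(instr_scores) - window + 1):
        # Slide right by one: add the entering score, drop the leaving one.
        current += instr_scores[start + window - 1] - instr_scores[start - 1]
        if current > best_total:
            best_start, best_total = start, current
    return best_start, best_total


# Toy input: five tokens with 2-dimensional gradients, three instructions.
grads = [[3.0, 4.0], [0.0, 0.0], [0.0, 1.0], [6.0, 8.0], [0.0, 2.0]]
token_to_instr = [0, 0, 1, 2, 2]  # hypothetical token-to-instruction mapping

tok_scores = token_importance(grads)
instr_scores = instruction_importance(tok_scores, token_to_instr, num_instr=3)
segment = best_window(instr_scores, window=2)
```

In this toy input the instruction scores are 5, 1, and 12, so the two-instruction window starting at index 1 is selected as the relocation candidate. The running sum visits each instruction once, which fits the abstract's point about locating critical regions without an expensive search.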