In recent years, large vision-language models (VLMs) have achieved significant breakthroughs in cross-modal understanding and generation. However, the safety issues arising from their multimodal interactions have become prominent. VLMs are vulnerable to jailbreak attacks, in which attackers craft carefully designed prompts to bypass safety mechanisms and lead the model to generate harmful content. To address this, we investigate the alignment between visual inputs and task execution, uncovering locality defects and attention biases in VLMs. Based on these findings, we propose VOTI, a novel jailbreak framework leveraging visual obfuscation and task induction. VOTI subtly embeds malicious keywords within neutral image layouts to evade detection and decomposes harmful queries into a sequence of subtasks. This approach disperses malicious intent across modalities, exploiting VLMs' over-reliance on local visual cues and their fragility in multi-step reasoning to bypass global safety mechanisms. Implemented as an automated framework, VOTI integrates large language models as red-team assistants to generate and iteratively optimize jailbreak strategies. Extensive experiments across seven mainstream VLMs demonstrate VOTI's effectiveness, achieving a 73.46% attack success rate on GPT-4o-mini. These results reveal critical vulnerabilities in VLMs, highlighting the urgent need for robust defenses and improved multimodal alignment.
Large language models (LLMs) have demonstrated powerful capabilities in natural language processing, but their security vulnerabilities, especially jailbreak attacks, have become a central challenge. Jailbreak attacks use carefully constructed adversarial prompts to break through a model's safety alignment mechanisms, exposing the limitations of alignment techniques such as reinforcement learning from human feedback (RLHF). Template-based and manually designed jailbreak methods, owing to their low success rates and poor generalization, quickly lose effectiveness as LLM safety mechanisms are continually updated. Optimization-based jailbreak methods, by contrast, automatically generate adversarial prompts, achieve markedly higher attack success rates and stealth, and can effectively evade conventional detection. Given white-box attacks' dependence on gradient information and their poor transferability, this paper focuses on the black-box optimization paradigm and, for the first time, systematically organizes existing jailbreak methods into four frameworks: optimization based on genetic algorithms, on reinforcement learning, on fuzzing, and on adversarial generation with LLMs. We analyze in depth the core mechanisms, technical strengths, and limitations of each class of methods. The main contribution of this paper is a novel taxonomy and research perspective: it identifies the serious shortcomings of existing defenses in real-time responsiveness, generalization, and the attack-defense balance, and further advocates building dynamic defense architectures and standardized evaluation benchmarks, providing theoretical support and practical guidance for balancing the safety and performance of LLMs in adversarial environments.
In recent years, large language models (LLMs) have been widely applied to a range of downstream tasks and have exhibited excellent text understanding, generation, and reasoning capabilities across many domains. However, jailbreak attacks are emerging as a new threat to LLMs. Jailbreak attacks can bypass an LLM's safety mechanisms, weaken the effect of value alignment, and induce aligned LLMs to produce harmful outputs. The misuse, hijacking, and leakage problems caused by jailbreak attacks already pose a serious threat to LLM-based dialogue systems and applications. This survey systematically reviews recent jailbreak research and, based on attack principles, classifies it into three categories: manually designed attacks, model-generated attacks, and adversarial-optimization-based attacks. It summarizes in detail the basic principles, implementation methods, and conclusions of the relevant studies, comprehensively reviews the development of LLM jailbreak attacks, and provides an effective reference for subsequent research. It also briefly reviews existing safety measures, introducing techniques that can mitigate jailbreak attacks and improve the safety of LLM-generated content from the two perspectives of internal defense and external defense, and enumerates and compares the strengths and weaknesses of the different methods. Building on this work, it discusses open problems and frontier directions in LLM jailbreak research and offers an outlook on future directions involving multimodality, model editing, and multi-agent systems.