Abstract
Diffusion models exhibit high visual fidelity in image generation tasks, but they still face critical challenges in image editing, such as misinterpretation of user intent, insufficient control over local details, and lag in interactive response. To address these issues, a cross-modal interactive image editing method based on bidirectional collaboration between large language models and user interaction (BiC-LLM) is proposed. At its core is a bidirectional collaborative control mechanism that organically fuses the top-down high-level semantic guidance of large language models with the bottom-up low-level visual control exercised directly by users, enhancing the controllability and precision of image editing through semantic enhancement, feature decoupling, and a dynamic feedback mechanism. First, a hierarchical semantic-driven module is designed: the large language model performs semantic decoupling and reasoning on the user's input text and generates fine-grained semantic vectors to interpret user intent precisely. Second, a dynamic control module with vision-structure decoupling is constructed, combining multi-level visual feature extractors with object-level modeling to achieve independent control over the global structure and local style of the image. Finally, a real-time interaction mechanism is introduced, allowing users to intervene in the editing process through mask annotation and parameter adjustment, thereby supporting dynamic optimization. Experiments on the LSUN, CelebA-HQ, and COCO datasets show that BiC-LLM outperforms baseline models in textual consistency, structural stability, and interactive controllability. Moreover, BiC-LLM enables multi-object semantic editing in complex scenes while preserving content consistency in unedited regions, demonstrating its effectiveness and robustness in image editing tasks.
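The three stages described in the abstract (semantic decoupling of the instruction, vision-structure decoupled control, and real-time user intervention) can be illustrated with a minimal Python sketch. All names here (`EditRequest`, `semantic_drive`, `decoupled_control`, `interactive_edit`) are hypothetical placeholders for illustration only; none come from the paper, and the string-splitting "LLM" is a stand-in for the actual semantic reasoning module.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical sketch of the three-stage pipeline described in the abstract;
# class and function names are illustrative, not from the paper.

@dataclass
class EditRequest:
    text: str                                    # user instruction
    mask: Optional[list] = None                  # user-annotated region
    params: dict = field(default_factory=dict)   # user-tuned parameters

def semantic_drive(text: str) -> dict:
    """Stage 1: decompose the instruction into fine-grained semantic slots
    (a trivial stand-in for LLM-based semantic decoupling and reasoning)."""
    target, _, attribute = text.partition(":")
    return {"object": target.strip(), "attribute": attribute.strip()}

def decoupled_control(semantics: dict, mask: Optional[list] = None) -> dict:
    """Stage 2: route global structure and local style through separate
    branches (a stand-in for the vision-structure decoupling module)."""
    return {
        "structure": {"object": semantics["object"], "keep_layout": True},
        "style": {"attribute": semantics["attribute"], "region": mask},
    }

def interactive_edit(req: EditRequest) -> dict:
    """Stage 3: one feedback iteration — the user's mask and parameter
    adjustments override the defaults before the edit is applied."""
    controls = decoupled_control(semantic_drive(req.text), req.mask)
    controls["style"].update(req.params)   # dynamic user adjustment
    return controls

edit = interactive_edit(EditRequest("sofa: red leather",
                                    mask=[10, 20, 50, 60],
                                    params={"strength": 0.7}))
```

The sketch only shows the data flow: semantics feed the structure branch, while the user's mask and parameters act on the style branch, mirroring the independence of global structure and local style that the abstract claims.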
Authors
SHI Hui; JIN Conghui (School of Computer Science and Artificial Intelligence, Liaoning Normal University, Dalian 116029)
Source
Pattern Recognition and Artificial Intelligence (《模式识别与人工智能》)
Peking University Core Journal
2025, No. 7, pp. 596-612 (17 pages)
Funding
Supported by the National Natural Science Foundation of China (No. 61601214, 61976109), the Scientific Research Project of the Education Department of Liaoning Province (No. JYTMS20231039), and the Liaoning Provincial Education Science Planning Project (No. JG22CB252).