To address the limitations of existing instruction-based image editing methods in handling complex multi-target instructions and maintaining semantic consistency, we present InferEdit, a training-free image editing system driven by a Multimodal Large Language Model (MLLM). The system parses complex multi-target instructions into sequential subtasks and performs editing iteratively through target localization and semantic reasoning. Furthermore, to adaptively select the most suitable editing models, we construct an evaluation dataset, InferDataset, and use it to evaluate various editing models on three types of tasks: object removal, object replacement, and local editing. Based on a comprehensive scoring mechanism, we build Binary Search Trees (BSTs) for the different editing types to facilitate model scheduling. Experiments demonstrate that InferEdit outperforms existing methods in handling complex instructions while maintaining semantic consistency and visual quality.
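The abstract does not specify how the per-type BSTs are keyed or queried, so the following is only a minimal sketch of one plausible reading: each editing type gets its own BST keyed by an aggregate score, and scheduling a subtask means taking the highest-scoring (rightmost) model in the tree for that type. The names `Node`, `insert`, `best`, `build_schedulers`, and `schedule`, the task labels, and any model names or scores passed in are hypothetical illustrations, not InferEdit's actual implementation or reported results.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    score: float                      # assumed aggregate score of one model
    model: str                        # hypothetical model identifier
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root: Optional[Node], score: float, model: str) -> Node:
    """Insert a (score, model) pair into a BST ordered by score."""
    if root is None:
        return Node(score, model)
    if score < root.score:
        root.left = insert(root.left, score, model)
    else:
        # Ties go right, so a later-inserted model wins among equal scores.
        root.right = insert(root.right, score, model)
    return root


def best(root: Optional[Node]) -> Optional[str]:
    """Return the model with the highest score (the rightmost node)."""
    if root is None:
        return None
    while root.right is not None:
        root = root.right
    return root.model


def build_schedulers(
    scores: Dict[str, Dict[str, float]]
) -> Dict[str, Optional[Node]]:
    """Build one BST per editing type from {task: {model: score}}."""
    trees: Dict[str, Optional[Node]] = {}
    for task, per_model in scores.items():
        root: Optional[Node] = None
        for model, score in per_model.items():
            root = insert(root, score, model)
        trees[task] = root
    return trees


def schedule(
    subtasks: List[str], trees: Dict[str, Optional[Node]]
) -> List[Tuple[str, Optional[str]]]:
    """Pair each sequential subtask with the top model for its editing type."""
    return [(task, best(trees.get(task))) for task in subtasks]
```

Under these assumptions, a multi-target instruction that has already been parsed into an ordered list of subtask types (for example removal followed by replacement) would be passed to `schedule`, which returns one selected model per subtask in the same order.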