We tackle image editing from abstract requests, such as "make this advertisement more eco-friendly." Our system creates a step-by-step plan for the request and learns how to execute it with the right tools.
Image editing systems have become very good at making realistic local changes. However, they still struggle when the request is an abstract goal rather than a single concrete edit, such as "make this advertisement more eco-friendly." Such instructions require planning, judgment, and several coordinated transformations rather than a single model invocation.
We introduce an agentic tool-calling system for open-ended image editing. A planner breaks the user goal into executable steps, while an orchestrator decides which tools to use and where to apply them in the image to satisfy each planned step.
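The planner and orchestrator split described above can be pictured as a simple loop. The following is only a minimal sketch under stated assumptions, not the system's implementation: `plan`, `orchestrate`, `run_edit`, the `recolor` and `inpaint` tools, and the bounding-box region format are all hypothetical names chosen for illustration, and the image is stood in for by a list of applied edits.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical region format: (x0, y0, x1, y1) in pixels.
Box = Tuple[int, int, int, int]
# Toy "image": a list recording which edits were applied.
Image = List[str]


@dataclass
class ToolCall:
    step: str    # the planned step this call is meant to satisfy
    tool: str    # name of the chosen tool
    region: Box  # where in the image the tool is applied


def plan(goal: str) -> List[str]:
    """Stand-in planner: a real planner would query a model to decompose the goal."""
    return [f"step {i + 1} for: {goal}" for i in range(2)]


def orchestrate(step: str, tools: Dict[str, Callable], size: Tuple[int, int]) -> ToolCall:
    """Stand-in orchestrator: picks the alphabetically first tool and the whole image.

    A learned orchestrator would choose both the tool and the region per step.
    """
    name = sorted(tools)[0]
    width, height = size
    return ToolCall(step=step, tool=name, region=(0, 0, width, height))


def run_edit(goal: str, image: Image, size: Tuple[int, int],
             tools: Dict[str, Callable]) -> Tuple[Image, List[ToolCall]]:
    """Plan the goal, then execute one tool call per planned step."""
    log: List[ToolCall] = []
    for step in plan(goal):
        call = orchestrate(step, tools, size)
        image = tools[call.tool](image, call.region)
        log.append(call)
    return image, log


# Hypothetical tools that just record what they would have done.
def recolor(image: Image, region: Box) -> Image:
    return image + [f"recolor@{region}"]


def inpaint(image: Image, region: Box) -> Image:
    return image + [f"inpaint@{region}"]


TOOLS = {"recolor": recolor, "inpaint": inpaint}
```

Calling `run_edit("make this advertisement more eco-friendly", [], (64, 64), TOOLS)` returns the toy image together with a log of one `ToolCall` per planned step, which is the kind of trace a judge could later score.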
Because these abstract goals do not come with a single prescribed edit sequence, our system learns from experience. After each attempt, a multimodal LLM judge scores whether the result follows the instruction and remains visually plausible; those scores are then used as feedback to improve the orchestrator. Training directly from exhaustive rollouts is intractable, so we introduce reward approximations that make learning practical.
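To make the feedback loop concrete, here is a minimal sketch of how two judge scores might be turned into a single reward and accumulated per tool. Everything in it is an assumption for illustration: the scores are taken to lie in [0, 1], the equal weighting is arbitrary, and the running-average `ToolPreferences` class is a deliberately simple stand-in, not the training procedure or the reward approximations used in the paper.

```python
from collections import defaultdict


def reward(follows_instruction: float, plausible: float) -> float:
    """Combine the two judge scores (assumed in [0, 1]) with equal, arbitrary weights."""
    return 0.5 * follows_instruction + 0.5 * plausible


class ToolPreferences:
    """Hypothetical running average of rewards observed for each tool."""

    def __init__(self) -> None:
        self.total = defaultdict(float)
        self.count = defaultdict(int)

    def update(self, tool: str, r: float) -> None:
        # Record one judged attempt that used this tool.
        self.total[tool] += r
        self.count[tool] += 1

    def mean(self, tool: str) -> float:
        # Average reward so far; 0.0 for tools that were never tried.
        n = self.count[tool]
        return self.total[tool] / n if n else 0.0

    def best(self) -> str:
        # Tool with the highest average reward among those tried.
        return max(self.count, key=self.mean)


prefs = ToolPreferences()
prefs.update("inpaint", reward(follows_instruction=1.0, plausible=0.5))
prefs.update("recolor", reward(follows_instruction=0.5, plausible=0.5))
```

In this toy run, `inpaint` ends up with the higher average reward, so `prefs.best()` selects it.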
The result is a system that can handle longer, more open-ended edits and produce more coherent, reliable visual transformations than single-step or rule-based editing pipelines.
Abstract edit requests rarely have one canonical plan, but they do have concrete requirements a good plan should satisfy.
Different editing tools specialize in different subtasks, so the orchestrator must learn when each tool should be used.
To train the orchestrator, we need to compare tools by running them and judging the results. Exhaustively doing this over full multi-step edits is intractable, so we use two approximations.
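The intractability is easy to see with a quick count. As a rough sketch, assume a fixed number of candidate tools at every step and ignore the choice of image region: exhaustively trying every tool sequence then grows exponentially with the number of steps. The tool and step counts below are made-up illustrative values, and the per-step count is shown only as a generic point of contrast, not as either of the two approximations used in the paper.

```python
def exhaustive_rollouts(num_tools: int, num_steps: int) -> int:
    """Number of full tool sequences if every tool is tried at every step."""
    return num_tools ** num_steps


def per_step_rollouts(num_tools: int, num_steps: int) -> int:
    """Number of tool runs if each step's tools are compared independently."""
    return num_tools * num_steps


# Illustrative values only: 5 candidate tools, 6 planned steps.
full = exhaustive_rollouts(5, 6)      # 5**6 = 15625 complete edit sequences
independent = per_step_rollouts(5, 6)  # 5*6 = 30 individual tool runs
```

Each of those full sequences would also need every tool call executed and the final image judged, which is why some form of approximation is needed.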
Further details, experiments, and implementation choices are described in the paper.
This website template is adapted from visii (NeurIPS 2023) and DreamFusion (ICLR 2023).