From Plans to Pixels
Learning to Plan and Orchestrate for Open-Ended Image Editing

Anirudh Sundara Rajan1     Krishna Kumar Singh2     Yong Jae Lee1,2
1. University of Wisconsin-Madison     2. Adobe Research

Brief Description

We tackle image editing from abstract requests, such as "make this advertisement more eco-friendly." Our system creates a step-by-step plan for the request and learns how to execute it with the right tools.


Our system produces coherent visual transformations for abstract editing goals that require multiple coordinated tool calls.

How does step-by-step editing look?


Abstract

Image editing systems are getting very good at making realistic local changes. However, they still struggle when the request is more like an abstract goal than a single concrete edit: "make this advertisement more eco-friendly." Such instructions require planning, judgment, and several coordinated transformations rather than a single model invocation.

We introduce an agentic tool-calling system for open-ended image editing. A planner breaks the user goal into executable steps, while an orchestrator decides which tools to use and where to apply them in the image to satisfy each planned step.

Because these abstract goals do not come with a single prescribed edit sequence, our system learns from experience. After each attempt, a multimodal LLM judge scores whether the result follows the instruction and remains visually plausible, and those scores are used to train the orchestrator. Directly training from exhaustive rollouts is intractable, so we introduce reward approximations that make learning practical.

The result is a system that can handle longer, more open-ended edits and produce more coherent, reliable visual transformations than single-step or rule-based editing pipelines.


Key Ideas

Checklist-Driven Plan Generation

Abstract edit requests rarely have one canonical plan, but they do have concrete requirements a good plan should satisfy.

  • A request like "target business travelers with added corporate benefits" may require loyalty perks, productivity messaging, premium support, or other valid editing choices.
  • Instead of supervising with one prescribed edit sequence, we specify the criteria that any acceptable plan must satisfy.
  • These criteria form a checklist, written by a human or generated by a strong multimodal LLM, that conditions our planner, Qwen3-VL-8B, as it proposes the edit sequence.
  • We then train this planner to reproduce its own checklist-conditioned plans, using self-distillation to teach the model this planning behavior.
Figure: Checklist-driven plan generation.
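The checklist-conditioned self-distillation described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `build_prompt`, `make_distill_example`, `DistillExample`, and `stub_planner` are hypothetical names, the prompt wording is invented, and the sketch assumes the training prompt omits the checklist so that the planner has to internalize it.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class DistillExample:
    prompt: str  # student input: request only (assumption: no checklist)
    target: str  # teacher output: plan generated with the checklist visible


def build_prompt(request: str, checklist: Optional[List[str]] = None) -> str:
    """Format a planning prompt, optionally conditioned on a checklist."""
    lines = [f"Edit request: {request}"]
    if checklist:
        lines.append("A good plan must satisfy:")
        lines.extend(f"- {item}" for item in checklist)
    lines.append("Write a numbered list of edit steps.")
    return "\n".join(lines)


def make_distill_example(
    request: str,
    checklist: List[str],
    generate: Callable[[str], str],
) -> DistillExample:
    """Generate a plan with the checklist, then pair it with the bare request."""
    plan = generate(build_prompt(request, checklist))
    return DistillExample(prompt=build_prompt(request), target=plan)


def stub_planner(prompt: str) -> str:
    """Stand-in for a real planner call; emits one step per checklist line."""
    items = [ln[2:] for ln in prompt.splitlines() if ln.startswith("- ")]
    return "\n".join(
        f"{i}. Edit the image to show: {item}" for i, item in enumerate(items, 1)
    )


example = make_distill_example(
    "target business travelers with added corporate benefits",
    ["mentions loyalty perks", "includes productivity messaging"],
    stub_planner,
)
```

In a real pipeline, `generate` would wrap a call to the planner model, and the resulting `(prompt, target)` pairs would feed a standard supervised fine-tuning loop.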

Tool Orchestration From Experience

Different editing tools specialize in different subtasks, so the orchestrator must learn when each tool should be used.

  • A local editing tool may be good at replacing objects, while another may be better at text edits, style changes, or background modifications.
  • Determining the best tool for a subtask is difficult from the instruction alone; it has to be learned from experience.
  • We therefore try candidate tools, inspect the resulting edited images, and use a strong multimodal LLM judge to score how well each edit satisfies the subtask.
  • The orchestrator learns from these judged outcomes, preferring the tools and regions that produce the strongest edits.
  • However, naive exploration over full multi-step edits is intractable, which motivates the reward approximations below.
Figure: Tool orchestration from experience.
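The try-and-judge loop above might look like the sketch below. It is a hedged illustration under several assumptions: images are represented as strings, the tool names and `stub_judge` are hypothetical stand-ins for real editing models and a multimodal LLM judge, and region selection is omitted. How the judged scores become a training signal for the orchestrator is not shown here.

```python
from typing import Callable, Dict, List, Tuple

Image = str  # placeholder: a real system would pass image arrays or file paths
Tool = Callable[[Image, str], Image]          # (image, subtask) -> edited image
Judge = Callable[[Image, Image, str], float]  # (before, after, subtask) -> score


def rank_tools(
    image: Image, subtask: str, tools: Dict[str, Tool], judge: Judge
) -> List[Tuple[str, float]]:
    """Run every candidate tool on the subtask and sort by judge score, best first."""
    scored = []
    for name, tool in tools.items():
        edited = tool(image, subtask)
        scored.append((name, judge(image, edited, subtask)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical stand-ins so the sketch runs without any model.
tools: Dict[str, Tool] = {
    "object_replacer": lambda img, task: img + "+object_edit",
    "text_editor": lambda img, task: img + "+text_edit",
}


def stub_judge(before: Image, after: Image, subtask: str) -> float:
    """Toy judge: favors the text tool on text subtasks, otherwise scores low."""
    return 0.9 if "text" in subtask and after.endswith("+text_edit") else 0.2


ranking = rank_tools("ad.png", "add the text 'recyclable packaging'", tools, stub_judge)
best_tool, best_score = ranking[0]
```

The ranked list is the raw material for learning: the orchestrator can be pushed toward the top-scoring tool for each kind of subtask.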

Reward Approximations

To train the orchestrator, we need to compare tools by running them and judging the results. Exhaustively doing this over full multi-step edits is intractable, so we use two approximations.

Approximation 1

Additive Subtask Rewards

  • Given a multi-step plan, one rollout requires calling a tool for each subtask and waiting until all edits finish, making exhaustive training intractable.
  • Instead, we use subtask-level supervision: when a tool performs one subtask, we score that subtask immediately.
  • This avoids waiting for the full rollout before assigning feedback to each tool decision.
total reward ≈ sum of subtask rewards
The final edit score is approximated by adding the scores of its subtasks.
  • Note that this approximation does not always hold: later steps can undo progress made in earlier steps. For instance, a background change can remove text added earlier.
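As a small worked instance of the additive approximation, the sketch below sums per-subtask judge scores and credits each tool decision with its own score. The score values and the [0, 1] range are assumptions made for illustration, and `approx_total_reward` is a hypothetical helper name.

```python
from typing import List


def approx_total_reward(subtask_scores: List[float]) -> float:
    """Additive approximation: sum the per-subtask judge scores."""
    return sum(subtask_scores)


# Hypothetical judge scores for a three-step plan (assumed range: 0 to 1).
subtask_scores = [1.0, 0.5, 0.75]
approx = approx_total_reward(subtask_scores)

# Each tool decision is credited with its own subtask score as soon as it is
# judged, rather than sharing one delayed score for the whole rollout.
per_step_credit = dict(enumerate(subtask_scores, start=1))
```

Here `approx` is 2.25 and step 2 is credited with 0.5 on its own, regardless of how steps 1 and 3 turned out.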
Approximation 2

Original-Image Criterion

  • Subtask-level rewards help, but they still do not make exploration tractable.
  • To score a later subtask exactly, we would need to first run every earlier edit that produces its input image.
  • We therefore use a second approximation: evaluate each candidate tool directly on the original image, rather than inside every possible partial edit sequence.
  • This is reasonable for our setting because plan subtasks usually refer to objects or regions already present in the original image.
  • The subtasks also tend to target different semantic concepts or spatial regions, except for global edits such as background or color palette changes.
Conjecture: if a tool performs a subtask well on the original image, it will usually perform well when the same subtask appears inside the full sequential edit.
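To see why the original-image criterion matters for cost, the sketch below counts tool calls under two simplifying assumptions that are ours, not the paper's: every tool is a candidate for every subtask, and region choices are ignored. The function names and problem sizes are illustrative only.

```python
def sequential_tool_calls(num_subtasks: int, num_tools: int) -> int:
    """Tool calls to try every tool choice in context, reusing shared prefixes."""
    return sum(num_tools ** depth for depth in range(1, num_subtasks + 1))


def original_image_tool_calls(num_subtasks: int, num_tools: int) -> int:
    """Tool calls when each (subtask, tool) pair is run once on the original image."""
    return num_subtasks * num_tools


# Illustrative sizes only; not taken from the paper.
in_context = sequential_tool_calls(num_subtasks=5, num_tools=4)
on_original = original_image_tool_calls(num_subtasks=5, num_tools=4)
print(in_context, on_original)
```

With 5 subtasks and 4 tools this prints `1364 20`: evaluating in context grows exponentially with plan length, while evaluating on the original image grows linearly.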

Further details, experiments, and implementation choices are described in the paper.


📚 Citation

@article{rajan2026plans,
  title={From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing},
  author={Rajan, Anirudh Sundara and Singh, Krishna Kumar and Lee, Yong Jae},
  journal={arXiv preprint arXiv:2605.15181},
  year={2026}
}
