From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

How does step-by-step editing look?

Abstract

Image editing systems are getting very good at making realistic local changes. However, they still struggle when the request is more like an abstract goal than a single concrete edit: "make this advertisement more eco-friendly." Such instructions require planning, judgment, and several coordinated transformations rather than a single model invocation.

We introduce an agentic tool-calling system for open-ended image editing. A planner breaks the user goal into executable steps, while an orchestrator decides which tools to use and where to apply them in the image to satisfy each planned step.

Because these abstract goals do not come with a single prescribed edit sequence, our system learns from experience. After each attempt, a multimodal LLM judge scores whether the result follows the instruction and remains visually plausible, and we use those scores to train the orchestrator. Training directly on exhaustive rollouts is intractable, so we introduce reward approximations that make learning practical.

The result is a system that can handle longer, more open-ended edits and produce more coherent, reliable visual transformations than single-step or rule-based editing pipelines.

Key Ideas

Checklist-Driven Plan Generation

Abstract edit requests rarely have one canonical plan, but they do have concrete requirements a good plan should satisfy.

A request like "target business travelers with added corporate benefits" may require loyalty perks, productivity messaging, premium support, or other valid editing choices.
Instead of supervising with one prescribed edit sequence, we specify the criteria that any acceptable plan must satisfy.
These criteria form a checklist, written by a human or generated by a strong multimodal LLM, that conditions our planner, Qwen3-VL-8B, as it proposes the edit sequence.
We then train this planner to reproduce its own checklist-conditioned plans, using self-distillation to teach the model this planning behavior.
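
To make the data flow concrete, here is a minimal sketch of how checklist-conditioned plans could be turned into self-distillation examples. Everything named here is hypothetical: `build_teacher_prompt`, `make_distillation_example`, and the `generate_plan` callable are illustrative stand-ins, not the project's code. The sketch also assumes the student prompt omits the checklist, which the description above does not state explicitly.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DistillationExample:
    prompt: str  # what the planner sees during training (assumed: no checklist)
    target: str  # the plan the same planner produced when given the checklist


def build_teacher_prompt(instruction: str, checklist: List[str]) -> str:
    """Condition the planner on the criteria any acceptable plan must satisfy."""
    items = "\n".join(f"- {item}" for item in checklist)
    return (
        f"Instruction: {instruction}\n"
        f"A good plan must satisfy:\n{items}\n"
        "Propose an ordered list of edit steps."
    )


def make_distillation_example(
    instruction: str,
    checklist: List[str],
    generate_plan: Callable[[str], str],  # hypothetical wrapper around the planner
) -> DistillationExample:
    """Generate a plan with the checklist, then pair it with a plain prompt."""
    plan = generate_plan(build_teacher_prompt(instruction, checklist))
    student_prompt = f"Instruction: {instruction}\nPropose an ordered list of edit steps."
    return DistillationExample(prompt=student_prompt, target=plan)


# Toy usage with a fake planner that ignores its prompt.
example = make_distillation_example(
    "target business travelers with added corporate benefits",
    ["mention loyalty perks", "keep the brand logo visible"],
    generate_plan=lambda prompt: "1. Add a loyalty-perks banner\n2. Keep the logo unchanged",
)
print(example.prompt)
print(example.target)
```

The checklist appears only in the prompt used to generate the plan, so the training target encodes the checklist's requirements without one hand-written edit sequence.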

Checklist driven plan generation diagram

Tool Orchestration From Experience

Different editing tools specialize in different subtasks, so the orchestrator must learn when each tool should be used.

A local editing tool may be good at replacing objects, while another may be better at text edits, style changes, or background modifications.
Determining the best tool for a subtask is difficult from the instruction alone; it has to be learned from experience.
We therefore try candidate tools, inspect the resulting edited images, and use a strong multimodal LLM judge to score how well each edit satisfies the subtask.
The orchestrator learns from these judged outcomes, preferring the tools and regions that produce the strongest edits.
However, naive exploration over full multi-step edits is intractable, which motivates the reward approximations below.
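
The try-and-judge loop described above can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: images are represented as strings, the tools and the judge are toy callables, and region selection is left out. The names `judge_candidates` and `pick_best` are hypothetical.

```python
from typing import Callable, Dict

Image = str  # stand-in type; a real system would pass pixel data
Tool = Callable[[Image, str], Image]
Judge = Callable[[Image, Image, str], float]


def judge_candidates(
    image: Image, subtask: str, tools: Dict[str, Tool], judge: Judge
) -> Dict[str, float]:
    """Run every candidate tool on the subtask and score each edited result."""
    scores = {}
    for name, tool in tools.items():
        edited = tool(image, subtask)
        scores[name] = judge(image, edited, subtask)
    return scores


def pick_best(scores: Dict[str, float]) -> str:
    """Return the tool whose edit received the highest judge score."""
    return max(scores, key=scores.get)


# Toy tools: each just records that it was applied.
tools = {
    "local_editor": lambda img, task: f"{img}+local({task})",
    "text_editor": lambda img, task: f"{img}+text({task})",
}


def toy_judge(original: Image, edited: Image, subtask: str) -> float:
    # Toy rule standing in for a multimodal LLM judge.
    return 0.9 if "+text(" in edited else 0.4


scores = judge_candidates("ad.png", "rewrite the headline", tools, toy_judge)
print(scores, pick_best(scores))
```

The judged scores, rather than the instruction text alone, determine which tool is preferred for a subtask.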

Tool orchestration from experience diagram

Reward Approximations

To train the orchestrator, we need to compare tools by running them and judging the results. Exhaustively doing this over full multi-step edits is intractable, so we use two approximations.

Approximation 1

Additive Subtask Rewards

Given a multi-step plan, one rollout requires calling a tool for each subtask and waiting until all edits finish, making exhaustive training intractable.
Instead, we use subtask-level supervision: when a tool performs one subtask, we score that subtask immediately.
This avoids waiting for the full rollout before assigning feedback to each tool decision.

total reward ≈ sum of subtask rewards

The final edit score is approximated by adding the scores of its subtasks.

Note that this is an approximation and does not always hold: later steps can undo progress made in earlier steps, for instance a background change can remove text added earlier.
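
The toy example below illustrates both the approximation and the caveat. It is a sketch under simplifying assumptions: an image is modeled as the set of requirements it currently satisfies, each subtask is scored 0 or 1 right after its tool runs, and the hypothetical `change_background` tool is written so that it wipes out previously added text.

```python
from typing import Callable, FrozenSet, List, Tuple

State = FrozenSet[str]  # toy image: the set of requirements it satisfies
Step = Tuple[str, Callable[[State], State]]  # (requirement, tool)


def add_text(state: State) -> State:
    return state | {"text"}


def change_background(state: State) -> State:
    # Toy assumption: repainting the background removes any text added earlier.
    return (state - {"text"}) | {"background"}


def rollout(plan: List[Step]) -> Tuple[List[float], float]:
    """Return per-subtask scores (judged immediately) and the final-image score."""
    state: State = frozenset()
    subtask_scores = []
    for requirement, tool in plan:
        state = tool(state)
        subtask_scores.append(1.0 if requirement in state else 0.0)
    final_score = float(sum(1 for requirement, _ in plan if requirement in state))
    return subtask_scores, final_score


text_first = [("text", add_text), ("background", change_background)]
background_first = [("background", change_background), ("text", add_text)]

for plan in (text_first, background_first):
    scores, final = rollout(plan)
    print(sum(scores), final)
```

When the text is added first, the summed subtask scores overestimate the final score because the background change undoes the earlier edit. When the background is changed first, the two quantities agree.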

Approximation 2

Original-Image Criterion

Subtask-level rewards help, but they still do not make exploration tractable.
To score a later subtask exactly, we would need to first run every earlier edit that produces its input image.
We therefore use a second approximation: evaluate each candidate tool directly on the original image, rather than inside every possible partial edit sequence.
This is reasonable for our setting because plan subtasks usually refer to objects or regions already present in the original image.
The subtasks also tend to target different semantic concepts or spatial regions, except for global edits such as background or color palette changes.

Conjecture: if a tool performs a subtask well on the original image, it will usually perform well when the same subtask appears inside the full sequential edit.
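
A rough count shows why this matters. The sketch below assumes a fixed subtask order, the same number of candidate tools for every subtask, and no region choices. None of these numbers come from the paper; the function names and the example values are hypothetical.

```python
def exhaustive_tool_calls(num_tools: int, num_subtasks: int) -> int:
    """Tool calls needed to score every tool inside every partial edit sequence.

    Before subtask t there are num_tools ** (t - 1) distinct intermediate images,
    and each must be edited by every tool, giving num_tools ** t calls at step t.
    """
    return sum(num_tools ** t for t in range(1, num_subtasks + 1))


def original_image_tool_calls(num_tools: int, num_subtasks: int) -> int:
    """Tool calls needed when every tool is scored on the original image only."""
    return num_tools * num_subtasks


num_tools, num_subtasks = 4, 5  # illustrative values
print(exhaustive_tool_calls(num_tools, num_subtasks))      # 1364
print(original_image_tool_calls(num_tools, num_subtasks))  # 20
```

Under these assumptions, the exact evaluation grows exponentially with plan length, while evaluation on the original image grows linearly.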

Further details, experiments, and implementation choices are described in the paper.
