Abstract
ADE-CoT extends Image-CoT methods to image editing, improving efficiency and performance through adaptive resource allocation, edit-specific verification, and opportunistic stopping.
Image Chain-of-Thought (Image-CoT) is a test-time scaling paradigm that improves image generation by extending inference time. Most Image-CoT methods focus on text-to-image (T2I) generation. Unlike T2I generation, image editing is goal-directed: the solution space is constrained by the source image and the instruction. This mismatch causes three challenges when applying Image-CoT to editing: inefficient resource allocation under fixed sampling budgets, unreliable early-stage verification based on general MLLM scores, and redundant edited results from large-scale sampling. To address these challenges, we propose ADaptive Edit-CoT (ADE-CoT), an on-demand test-time scaling framework that enhances editing efficiency and performance. It incorporates three key strategies: (1) difficulty-aware resource allocation that assigns dynamic budgets based on estimated edit difficulty; (2) edit-specific verification in early pruning that uses region localization and caption consistency to select promising candidates; and (3) depth-first opportunistic stopping, guided by an instance-specific verifier, that terminates once intent-aligned results are found. Extensive experiments on three SOTA editing models (Step1X-Edit, BAGEL, FLUX.1 Kontext) across three benchmarks show that ADE-CoT achieves superior performance-efficiency trade-offs. With comparable sampling budgets, ADE-CoT obtains better performance than Best-of-N with more than 2× speedup.
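The three strategies named in the abstract can be illustrated with a minimal control-flow sketch. This is not the authors' implementation: the function names (`allocate_budget`, `adaptive_edit_search`), the linear difficulty-to-budget rule, the `keep_ratio` and `stop_threshold` parameters, and the scorer callbacks are all hypothetical placeholders chosen for illustration. In a real system the scorers would presumably wrap the edit-specific and instance-specific verifiers the abstract describes.

```python
from typing import Callable, Optional, Tuple


def allocate_budget(difficulty: float, min_n: int = 2, max_n: int = 16) -> int:
    """Map an estimated difficulty in [0, 1] to a sampling budget.

    The linear rule is an assumption; the abstract only says budgets
    are dynamic and depend on estimated edit difficulty.
    """
    d = min(max(difficulty, 0.0), 1.0)  # clamp to [0, 1]
    return min_n + round(d * (max_n - min_n))


def adaptive_edit_search(
    difficulty: float,
    sample_candidate: Callable[[int], object],  # hypothetical: seed -> early candidate
    early_score: Callable[[object], float],     # hypothetical cheap early-stage verifier
    final_score: Callable[[object], float],     # hypothetical verifier on finished edits
    keep_ratio: float = 0.5,
    stop_threshold: float = 0.9,
) -> Tuple[Optional[object], int]:
    """Return (best candidate, number of candidates fully evaluated)."""
    # (1) Difficulty-aware allocation: harder edits get more samples.
    budget = allocate_budget(difficulty)
    candidates = [sample_candidate(seed) for seed in range(budget)]

    # (2) Early pruning: keep only the top-scoring fraction of candidates.
    ranked = sorted(candidates, key=early_score, reverse=True)
    kept = ranked[: max(1, int(len(ranked) * keep_ratio))]

    # (3) Opportunistic stopping: finish candidates one at a time and
    #     stop as soon as one clears the acceptance threshold.
    best, best_score, evaluated = None, float("-inf"), 0
    for cand in kept:
        score = final_score(cand)
        evaluated += 1
        if score > best_score:
            best, best_score = cand, score
        if score >= stop_threshold:
            break
    return best, evaluated


if __name__ == "__main__":
    # Toy stand-ins: a "candidate" is just a number that is also its own score.
    best, evaluated = adaptive_edit_search(
        difficulty=0.5,
        sample_candidate=lambda seed: seed / 10,
        early_score=lambda c: c,
        final_score=lambda c: c,
        stop_threshold=0.75,
    )
    print(best, evaluated)
```

In this toy run the search stops after the first finished candidate because it already clears the threshold, which is the source of the savings relative to a fixed Best-of-N loop that would evaluate every sample.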
Community
Adaptive Test-Time Scaling for Image Editing CVPR26
It is really the best paper I've seen so far