Abstract
Semantic image editing methods built on large-scale diffusion models have made significant strides in precise, controlled image editing guided by text prompts. However, these models struggle with complex images that contain hard-to-describe objects and/or multiple objects. In this work, we introduce Point2pix-Zero, a novel inference-time multi-object image editing strategy that edits a single object with the simple guidance of clicked points and the text of target objects. We employ an interactive methodology, point-discovery, as text-free guidance to identify the semantic information of the objects to be edited and to generate text prompts automatically. Instead of using the internal cross-attention maps of diffusion models as a guide, we inject external attention maps to rectify visual-semantic pairing mismatches in the cross-attention maps during the denoising process. Extensive empirical evaluations demonstrate that the proposed inference-time method ensures precise editing while maintaining image fidelity. Our method achieves superior performance in single- and multi-object image editing, positioning it as a new state of the art.
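The idea of injecting an external attention map into a cross-attention map can be illustrated with a toy example. The sketch below is not the Point2pix-Zero implementation: the function name `inject_external_attention`, the `weight` parameter, the linear blending rule, and the per-location renormalisation are all assumptions made for illustration. It only shows one plausible way an externally supplied spatial map (for instance, one derived from clicked points) could override the attention that image locations pay to a single text token.

```python
def inject_external_attention(cross_attn, external_mask, token_index, weight=1.0):
    """Blend an external spatial map into one token's cross-attention column.

    cross_attn:    list of rows, one per image location; each row holds the
                   attention weights over text tokens for that location.
    external_mask: one value in [0, 1] per image location (hypothetical input,
                   e.g. a mask obtained from clicked points).
    token_index:   index of the text token whose attention is being corrected.
    weight:        0 keeps the internal attention, 1 fully replaces it
                   (an assumed blending rule, not taken from the paper).
    """
    if len(cross_attn) != len(external_mask):
        raise ValueError("cross_attn and external_mask must cover the same locations")

    corrected = []
    for row, mask_value in zip(cross_attn, external_mask):
        new_row = list(row)
        # Pull the target token's attention toward the external map value.
        new_row[token_index] = (1.0 - weight) * row[token_index] + weight * mask_value
        # Renormalise so each location's weights over tokens sum to 1 again.
        total = sum(new_row)
        if total > 0:
            new_row = [value / total for value in new_row]
        corrected.append(new_row)
    return corrected


if __name__ == "__main__":
    # Two image locations, two text tokens; token 0 is the object to edit.
    internal = [[0.5, 0.5], [0.5, 0.5]]
    mask = [1.0, 0.0]  # the object covers only the first location
    print(inject_external_attention(internal, mask, token_index=0))
```

In a real diffusion pipeline such a correction would be applied inside the cross-attention layers at each denoising step; how, where, and for which steps the paper does this is not specified in the abstract.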
| Original language | English |
|---|---|
| Article number | 112041 |
| Journal | Pattern Recognition |
| Volume | 170 |
| DOIs | |
| Publication status | Published - Feb 2026 |
Keywords
- Diffusion model
- Image editing