arXiv Paper Digest

Canvas-to-Image: Compositional Image Generation with Multimodal Controls

Authors: Yusuf Dalva, Guocheng Gordon Qian, Maya Goldenberg, Tsai-Shien Chen, Kfir Aberman, Sergey Tulyakov, Pinar Yanardag, Kuan-Chieh Jackson Wang

First: 2025-11-26T18:59:56+00:00 · Latest: 2025-11-26T18:59:56+00:00

Comments: 24 pages; webpage: https://snap-research.github.io/canvas-to-image/

Abstract

While modern diffusion models excel at generating high-quality and diverse images, they still struggle with high-fidelity compositional and multimodal control, particularly when users simultaneously specify text prompts, subject references, spatial arrangements, pose constraints, and layout annotations. We introduce Canvas-to-Image, a unified framework that consolidates these heterogeneous controls into a single canvas interface, enabling users to generate images that faithfully reflect their intent. Our key idea is to encode diverse control signals into a single composite canvas image that the model can directly interpret for integrated visual-spatial reasoning. We further curate a suite of multi-task datasets and propose a Multi-Task Canvas Training strategy that optimizes the diffusion model to jointly understand and integrate heterogeneous controls into text-to-image generation within a unified learning paradigm. This joint training enables Canvas-to-Image to reason across multiple control modalities rather than relying on task-specific heuristics, and it generalizes well to multi-control scenarios during inference. Extensive experiments show that Canvas-to-Image significantly outperforms state-of-the-art methods in identity preservation and control adherence across challenging benchmarks, including multi-person composition, pose-controlled composition, layout-constrained generation, and multi-control generation.

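Translated Title / Abstract

Title: Canvas-to-Image: Compositional Image Generation under Multimodal Control

Although modern diffusion models excel at generating high-quality, diverse images, they still struggle with high-fidelity compositional and multimodal control, especially when users simultaneously specify text prompts, subject references, spatial arrangements, pose constraints, and layout annotations. We introduce Canvas-to-Image, a unified framework that merges these heterogeneous controls into a single canvas interface so that users can generate images that faithfully reflect their intent. Our main idea is to encode the various control signals into a single composite canvas image that the model can interpret directly for integrated visual-spatial reasoning. We further curate a set of multi-task datasets and propose a Multi-Task Canvas Training strategy, which optimizes the diffusion model to jointly understand and integrate heterogeneous controls into text-to-image generation within a unified learning framework. This joint training lets Canvas-to-Image reason across multiple control modalities instead of relying on task-specific heuristics, and it generalizes well to multi-control scenarios at inference time. Extensive experiments show that, across several challenging benchmarks, Canvas-to-Image significantly outperforms state-of-the-art methods in identity preservation and control adherence, including multi-person composition, pose-controlled composition, layout-constrained generation, and multi-control generation.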

Summary

Canvas-to-Image is a unified framework that integrates various multimodal controls into a single canvas interface for generating high-fidelity images. It encodes diverse control signals into a composite canvas image, enabling the model to perform integrated visual-spatial reasoning. Experiments demonstrate that Canvas-to-Image outperforms existing methods in identity preservation and control adherence across multiple benchmarks, including multi-person composition and layout-constrained generation.

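This work aims to improve compositional and multimodal control in image generation by introducing Canvas-to-Image, a unified framework that integrates diverse control signals into a single canvas interface. The method encodes the control signals into a composite canvas image for the model to interpret and proposes a Multi-Task Canvas Training strategy to optimize the diffusion model. Experiments show that Canvas-to-Image outperforms existing methods in identity preservation and control adherence on multiple benchmarks, including multi-person composition and layout-constrained generation.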

TraceGen: World Modeling in 3D Trace Space Enables Learning from Cross-Embodiment Videos

Authors: Seungjae Lee, Yoonkyo Jung, Inkook Chun, Yao-Chih Lee, Zikui Cai, Hongjia Huang, Aayush Talreja, Tan Dat Dao, Yongyuan Liang, Jia-Bin Huang, Furong Huang

First: 2025-11-26T18:59:55+00:00 · Latest: 2025-11-26T18:59:55+00:00