arXiv Paper Digest

Muses: Designing, Composing, Generating Nonexistent Fantasy 3D Creatures without Training

Authors: Hexiao Lu, Xiaokun Sun, Zeyu Cai, Hao Guo, Ying Tai, Jian Yang, Zhenyu Zhang

First: 2026-01-06T18:59:57+00:00 · Latest: 2026-01-06T18:59:57+00:00

Comments: Project page: https://luhexiao.github.io/Muses.github.io/

Abstract

We present Muses, the first training-free method for fantastic 3D creature generation in a feed-forward paradigm. Previous methods, which rely on part-aware optimization, manual assembly, or 2D image generation, often produce unrealistic or incoherent 3D assets due to the challenges of intricate part-level manipulation and limited out-of-domain generation. In contrast, Muses leverages the 3D skeleton, a fundamental representation of biological forms, to explicitly and rationally compose diverse elements. This skeletal foundation formalizes 3D content creation as a structure-aware pipeline of design, composition, and generation. Muses begins by constructing a creatively composed 3D skeleton with coherent layout and scale through graph-constrained reasoning. This skeleton then guides a voxel-based assembly process within a structured latent space, integrating regions from different objects. Finally, image-guided appearance modeling under skeletal conditions is applied to generate a style-consistent and harmonious texture for the assembled shape. Extensive experiments establish Muses' state-of-the-art performance in terms of visual fidelity and alignment with textual descriptions, and potential on flexible 3D object editing. Project page: https://luhexiao.github.io/Muses.github.io/.

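Title / abstract (translated from the Chinese)

Title: Muses: Designing, Composing, Generating Nonexistent Fantasy 3D Creatures without Training

We propose Muses, the first training-free feed-forward method for generating fantasy 3D creatures. Previous methods rely on part-aware optimization, manual assembly, or 2D image generation, and often produce unrealistic or incoherent 3D assets because of the difficulty of fine-grained part-level manipulation and the limits of out-of-domain generation. In contrast, Muses uses the 3D skeleton, a fundamental representation of biological forms, to compose diverse elements explicitly and rationally. This skeletal foundation formalizes 3D content creation as a structure-aware pipeline of design, composition, and generation. Muses first constructs a creatively composed 3D skeleton with coherent layout and scale through graph-constrained reasoning. The skeleton then guides a voxel-based assembly process within a structured latent space, integrating regions from different objects. Finally, image-guided appearance modeling under skeletal conditions generates a harmonious texture that is stylistically consistent with the assembled shape. Extensive experiments show that Muses achieves state-of-the-art visual fidelity and alignment with textual descriptions, and has potential for flexible 3D object editing. Project page: https://luhexiao.github.io/Muses.github.io/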

Summary

Muses is a training-free, feed-forward method for generating fantasy 3D creatures. It composes elements from different objects around a 3D skeleton, which sidesteps the part-level manipulation and out-of-domain generation problems that lead earlier methods to produce unrealistic or incoherent 3D assets. The pipeline has three stages: graph-constrained reasoning builds a coherent 3D skeleton, the skeleton guides voxel-based assembly in a structured latent space, and image-guided appearance modeling generates a style-consistent texture. According to the authors, experiments show that Muses outperforms existing methods in visual fidelity and alignment with textual descriptions and has potential for flexible 3D object editing.
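To make the idea of composing a creature at the skeleton level more concrete, here is a toy Python sketch. It is not the paper's graph-constrained reasoning, which the abstract does not specify; it only illustrates, under simple assumptions, how a sub-skeleton from one creature could be attached to a joint of another with a scale adjustment. The `Joint` type, the `graft` function, the `donor/` name prefix, and all joint names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Joint:
    parent: Optional[str]  # name of the parent joint, None for the root
    position: Vec3         # joint position in world space


Skeleton = Dict[str, Joint]


def graft(host: Skeleton, donor: Skeleton, donor_root: str,
          host_joint: str, scale: float) -> Skeleton:
    """Attach the donor subtree rooted at donor_root to host_joint (hypothetical helper)."""

    def in_subtree(name: Optional[str]) -> bool:
        # Walk up the donor hierarchy until we reach donor_root or run out of parents.
        while name is not None:
            if name == donor_root:
                return True
            name = donor[name].parent
        return False

    merged = dict(host)
    root_pos = donor[donor_root].position
    anchor = host[host_joint].position
    for name, joint in donor.items():
        if not in_subtree(name):
            continue
        # Offset from the donor root, rescaled to fit the host, then moved to the anchor.
        new_pos = tuple(a + (p - r) * scale
                        for a, p, r in zip(anchor, joint.position, root_pos))
        parent = host_joint if name == donor_root else f"donor/{joint.parent}"
        merged[f"donor/{name}"] = Joint(parent, new_pos)
    return merged


# Example: attach a wing to a shoulder joint at half its original size.
host = {"spine": Joint(None, (0.0, 0.0, 0.0)),
        "shoulder": Joint("spine", (0.0, 1.0, 0.0))}
donor = {"body": Joint(None, (0.0, 0.0, 0.0)),
         "wing_root": Joint("body", (1.0, 0.0, 0.0)),
         "wing_tip": Joint("wing_root", (3.0, 0.0, 0.0))}
creature = graft(host, donor, "wing_root", "shoulder", 0.5)
```

In this sketch the grafted root lands exactly on the host joint and every descendant keeps its relative offset, scaled by `scale`. A real system would also need to check the layout for collisions and plausibility, which is presumably where the constraints described in the abstract come in.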

Aligning Text, Images, and 3D Structure Token-by-Token

Authors: Aadarsh Sahoo, Vansh Tibrewal, Georgia Gkioxari

First: 2025-06-09T17:59:37+00:00 · Latest: 2026-01-06T18:58:50+00:00

Comments: Project webpage: https://glab-caltech.github.io/kyvo/


Abstract

Creating machines capable of understanding the world in 3D is essential in assisting designers that build and edit 3D environments and robots navigating and interacting within a three-dimensional space. Inspired by advances in language and image modeling, we investigate the potential of autoregressive models for a new modality: structured 3D scenes. To this end, we propose a unified LLM framework that aligns language, images, and 3D scenes and provide a detailed "cookbook" outlining critical design choices for achieving optimal training and performance addressing key questions related to data representation, modality-specific objectives, and more. We show how to tokenize complex 3D objects to incorporate into our structured 3D scene modality. We evaluate performance across four core 3D tasks -- rendering, recognition, instruction-following, and question-answering -- and four 3D datasets, synthetic and real-world. We show our model's effectiveness on reconstructing complete 3D scenes consisting of complex objects from a single image and on real-world 3D object recognition tasks. Project webpage: https://glab-caltech.github.io/kyvo/

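Title / abstract (translated from the Chinese)

Title: Aligning Text, Images, and 3D Structure Token-by-Token

Machines that understand the 3D world are essential for helping designers build and edit 3D environments and for helping robots navigate and interact in three-dimensional space. Inspired by advances in language and image modeling, we study the potential of autoregressive models for a new modality: structured 3D scenes. To this end, we propose a unified LLM framework that aligns language, images, and 3D scenes, and we detail the key design choices for achieving optimal training and performance, including data representation and modality-specific objectives. We show how to tokenize complex 3D objects so they can be incorporated into our structured 3D scene modality. We evaluate performance on four core 3D tasks (rendering, recognition, instruction following, and question answering) and four 3D datasets, both synthetic and real-world. We demonstrate the model's effectiveness at reconstructing complete 3D scenes containing complex objects from a single image and on real-world 3D object recognition tasks. Project webpage: https://glab-caltech.github.io/kyvo/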

Summary

The work targets machines that understand 3D environments, a capability the authors argue is essential for designers who build and edit 3D scenes and for robots that navigate them. The authors propose a unified LLM framework that aligns text, images, and structured 3D scenes, and evaluate it on four core 3D tasks (rendering, recognition, instruction following, and question answering) across four synthetic and real-world datasets. They report that the model reconstructs complete 3D scenes from a single image and performs well on real-world 3D object recognition.
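As a rough illustration of what treating a structured 3D scene as a token sequence can look like, the Python sketch below flattens a list of objects into tokens an autoregressive model could consume. It is written under assumptions, not taken from the paper: the special tokens, the `SceneObject` fields, the coordinate range, and the 64-bin quantization are all hypothetical choices.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SceneObject:
    shape: str                            # hypothetical categorical attribute
    position: Tuple[float, float, float]  # object center in scene coordinates


def quantize(value: float, lo: float = -3.0, hi: float = 3.0, bins: int = 64) -> int:
    """Map a coordinate in [lo, hi] to an integer bin index in [0, bins - 1]."""
    clamped = min(max(value, lo), hi)
    return min(int((clamped - lo) / (hi - lo) * bins), bins - 1)


def scene_to_tokens(objects: List[SceneObject]) -> List[str]:
    """Flatten a scene into a token list; the tag names are assumptions for illustration."""
    tokens = ["[SCENE-START]"]
    for obj in objects:
        tokens.append("[OBJECT-START]")
        tokens += ["[SHAPE]", obj.shape]
        tokens.append("[LOCATION]")
        # One discrete token per coordinate so positions share a small vocabulary.
        tokens += [f"<pos_{quantize(c)}>" for c in obj.position]
        tokens.append("[OBJECT-END]")
    tokens.append("[SCENE-END]")
    return tokens


tokens = scene_to_tokens([SceneObject("cube", (0.0, -3.0, 3.0))])
```

Quantizing coordinates trades spatial precision for a fixed vocabulary size; how to make that kind of trade-off is among the data-representation questions the abstract says the paper's "cookbook" addresses.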

InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields

Authors: Hao Yu, Haotong Lin, Jiawei Wang, Jiaxin Li, Yida Wang, Xueyang Zhang, Yue Wang, Xiaowei Zhou, Ruizhen Hu, Sida Peng

First: 2026-01-06T18:57:06+00:00 · Latest: 2026-01-06T18:57:06+00:00

Comments: 19 pages, 13 figures