Visual Spatial Tuning
Authors: Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, Yi Lin, Hengshuang Zhao
First: 2025-11-07T18:59:16+00:00 · Latest: 2025-11-07T18:59:16+00:00
Abstract
Capturing spatial relationships from visual inputs is a cornerstone of
human-like general intelligence. Several previous studies have tried to enhance
the spatial awareness of Vision-Language Models (VLMs) by adding extra expert
encoders, which brings extra overhead and usually harms general capabilities.
To enhance the spatial ability in general architectures, we introduce Visual
Spatial Tuning (VST), a comprehensive framework to cultivate VLMs with
human-like visuospatial abilities, from spatial perception to reasoning. We
first attempt to enhance spatial perception in VLMs by constructing a
large-scale dataset termed VST-P, which comprises 4.1 million samples spanning
19 skills across single views, multiple images, and videos. Then, we present
VST-R, a curated dataset with 135K samples that instruct models to reason in
space. In particular, we adopt a progressive training pipeline: supervised
fine-tuning to build foundational spatial knowledge, followed by reinforcement
learning to further improve spatial reasoning abilities. Without degrading
general capabilities, the proposed VST consistently achieves
state-of-the-art results on several spatial benchmarks, including 34.8% on
MMSI-Bench and 61.2% on VSIBench. The results also show that
Vision-Language-Action models can be significantly enhanced with the proposed
spatial tuning paradigm, paving the way for more physically grounded AI.
Summary
The research aims to enhance the spatial awareness of Vision-Language Models (VLMs) so that they acquire human-like visuospatial abilities. The authors introduce Visual Spatial Tuning (VST), a framework that includes VST-P, a large-scale dataset of 4.1 million samples for spatial perception, and VST-R, a curated dataset of 135K samples for spatial reasoning. Training is progressive: supervised fine-tuning first builds foundational spatial knowledge, and reinforcement learning then improves spatial reasoning. VST achieves state-of-the-art results on several spatial benchmarks, including 34.8% on MMSI-Bench and 61.2% on VSIBench, without harming general capabilities. The authors also report that the same spatial tuning paradigm significantly enhances Vision-Language-Action models, a step toward more physically grounded AI.