arXiv Paper Digest

2025-11-11 03:18
Snapshot: 20251111_0318
Visual Spatial Tuning
Authors: Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, Yi Lin, Hengshuang Zhao
First: 2025-11-07T18:59:16+00:00 · Latest: 2025-11-07T18:59:16+00:00
Abstract
Capturing spatial relationships from visual inputs is a cornerstone of human-like general intelligence. Several previous studies have tried to enhance the spatial awareness of Vision-Language Models (VLMs) by adding extra expert encoders, which brings extra overhead and usually harms general capabilities. To enhance the spatial ability in general architectures, we introduce Visual Spatial Tuning (VST), a comprehensive framework to cultivate VLMs with human-like visuospatial abilities, from spatial perception to reasoning. We first attempt to enhance spatial perception in VLMs by constructing a large-scale dataset termed VST-P, which comprises 4.1 million samples spanning 19 skills across single views, multiple images, and videos. Then, we present VST-R, a curated dataset with 135K samples that instruct models to reason in space. In particular, we adopt a progressive training pipeline: supervised fine-tuning to build foundational spatial knowledge, followed by reinforcement learning to further improve spatial reasoning abilities. Without the side-effect to general capabilities, the proposed VST consistently achieves state-of-the-art results on several spatial benchmarks, including $34.8\%$ on MMSI-Bench and $61.2\%$ on VSIBench. It turns out that the Vision-Language-Action models can be significantly enhanced with the proposed spatial tuning paradigm, paving the way for more physically grounded AI.
Summary
This work aims to give Vision-Language Models (VLMs) human-like visuospatial abilities within general architectures, rather than by adding extra expert encoders. The authors introduce Visual Spatial Tuning (VST), which comprises VST-P, a 4.1-million-sample dataset covering 19 spatial perception skills across single views, multiple images, and videos, and VST-R, a curated 135K-sample dataset for spatial reasoning. Training is progressive: supervised fine-tuning builds foundational spatial knowledge, and reinforcement learning then improves spatial reasoning. VST achieves state-of-the-art results on several spatial benchmarks, including 34.8% on MMSI-Bench and 61.2% on VSIBench, without harming general capabilities. The authors also report that the same tuning paradigm significantly enhances Vision-Language-Action models.