arXiv Paper Digest

Snapshot: 20260215_0332

Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment

Authors: Jacky Kwok, Xilun Zhang, Mengdi Xu, Yuejiang Liu, Azalia Mirhoseini, Chelsea Finn, Marco Pavone

First: 2026-02-12T18:59:59+00:00 · Latest: 2026-02-12T18:59:59+00:00

Abstract

The long-standing vision of general-purpose robots hinges on their ability to understand and act upon natural language instructions. Vision-Language-Action (VLA) models have made remarkable progress toward this goal, yet their generated actions can still misalign with the given instructions. In this paper, we investigate test-time verification as a means to shrink the "intention-action gap." We first characterize the test-time scaling law for embodied instruction following and demonstrate that jointly scaling the number of rephrased instructions and generated actions greatly increases test-time sample diversity, often recovering correct actions more efficiently than scaling each dimension independently. To capitalize on these scaling laws, we present CoVer, a contrastive verifier for vision-language-action alignment, and show that our architecture scales gracefully with additional computational resources and data. We then introduce "boot-time compute" and a hierarchical verification inference pipeline for VLAs. At deployment, our framework precomputes a diverse set of rephrased instructions from a Vision-Language Model (VLM), repeatedly generates action candidates for each instruction, and then uses a verifier to select the optimal high-level prompt and low-level action chunks. Compared to scaling policy pre-training on the same data, our verification approach yields 22% gains in-distribution and 13% out-of-distribution on the SIMPLER benchmark, with a further 45% improvement in real-world experiments. On the PolaRiS benchmark, CoVer achieves 14% gains in task progress and 9% in success rate.

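Title / Abstract (translated from Chinese)

Title: Scaling Verification Is More Effective than Scaling Policy Learning for Vision-Language-Action Alignment

The long-standing vision of general-purpose robots depends on their ability to understand and carry out natural language instructions. Vision-Language-Action (VLA) models have made notable progress toward this goal, but the actions they generate can still be inconsistent with the given instructions. In this paper, we study test-time verification as a way to narrow the "intention-action gap." We first characterize the test-time scaling law for embodied instruction following and show that scaling the number of rephrased instructions together with the number of generated actions greatly increases test-time sample diversity, and usually recovers correct actions more efficiently than scaling either dimension on its own. To exploit these scaling laws, we propose CoVer, a contrastive verifier for vision-language-action alignment, and show that our architecture scales smoothly as compute and data increase. We then introduce "boot-time compute" and a hierarchical verification inference pipeline for VLAs. At deployment, our framework precomputes a diverse set of rephrased instructions with a Vision-Language Model (VLM), repeatedly generates action candidates for each instruction, and then uses a verifier to select the best high-level prompt and low-level action chunks. Compared with scaling policy pre-training on the same data, our verification approach achieves gains of 22% in-distribution and 13% out-of-distribution on the SIMPLER benchmark, with a further 45% improvement in real-world experiments. On the PolaRiS benchmark, CoVer improves task progress by 14% and success rate by 9%.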

Summary

This paper explores test-time verification as a way to improve the alignment between actions and natural language instructions in vision-language-action (VLA) models. It shows that jointly scaling the number of rephrased instructions and the number of generated actions increases test-time sample diversity and often recovers correct actions more efficiently than scaling either dimension alone. The proposed contrastive verifier, CoVer, scales gracefully with additional compute and data. At deployment, the framework precomputes diverse rephrased instructions, generates action candidates for each, and uses the verifier to select the best high-level prompt and low-level action chunks. Compared to scaling policy pre-training on the same data, this yields gains of 22% in-distribution and 13% out-of-distribution on the SIMPLER benchmark, plus a further 45% improvement in real-world experiments. On the PolaRiS benchmark, CoVer shows 14% gains in task progress and 9% in success rate.

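This paper examines test-time verification as a method for improving the alignment between actions and natural language instructions in vision-language-action models. The study shows that increasing the number of rephrased instructions and the number of generated actions together raises test-time sample diversity, so that correct actions are recovered more efficiently. The proposed CoVer architecture scales smoothly as resources grow, and the framework precomputes diverse rephrased instructions and uses a verifier to select the best actions, which improves both in-distribution and out-of-distribution performance on the SIMPLER benchmark. On the PolaRiS benchmark, CoVer improves task progress by 14% and success rate by 9%.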
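
The deployment-time selection described in the abstract above (several rephrased instructions, several sampled actions per instruction, a verifier that picks the winner) can be sketched as a flat best-of-N×M search. This is only an illustration under stated assumptions, not the paper's implementation: `select_action`, `toy_policy`, and `toy_verifier` are hypothetical names, the policy and verifier are toy stand-ins, and the hierarchical prompt-then-action-chunk selection mentioned in the abstract is collapsed into a single loop.

```python
import random
from typing import Callable, Optional, Sequence, Tuple

Action = Tuple[float, ...]
Candidate = Tuple[str, Action, float]


def select_action(
    instructions: Sequence[str],
    sample_action: Callable[[str, random.Random], Action],
    score: Callable[[str, Action], float],
    actions_per_instruction: int,
    rng: random.Random,
) -> Candidate:
    """Return the (instruction, action, score) triple the verifier rates highest."""
    if not instructions or actions_per_instruction < 1:
        raise ValueError("need at least one instruction and one action sample")
    best: Optional[Candidate] = None
    for instruction in instructions:              # axis 1: rephrased instructions
        for _ in range(actions_per_instruction):  # axis 2: sampled actions
            action = sample_action(instruction, rng)
            s = score(instruction, action)
            if best is None or s > best[2]:
                best = (instruction, action, s)
    assert best is not None
    return best


def toy_policy(instruction: str, rng: random.Random) -> Action:
    # Hypothetical stand-in for a VLA policy: a noisy 1-D "action" whose mean
    # depends on the wording of the instruction.
    return (rng.gauss(len(instruction) / 10.0, 1.0),)


def toy_verifier(instruction: str, action: Action) -> float:
    # Hypothetical stand-in for a learned verifier: higher is better, and
    # actions close to 2.0 count as well aligned. It ignores the instruction.
    return -abs(action[0] - 2.0)


if __name__ == "__main__":
    rephrasings = ["pick up the cup", "grasp the mug and lift it"]
    print(select_action(rephrasings, toy_policy, toy_verifier, 8, random.Random(0)))
```

In this sketch the total number of verifier calls is the number of instructions times `actions_per_instruction`, which is the joint budget the abstract refers to when it talks about scaling both dimensions together.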

Stroke of Surprise: Progressive Semantic Illusions in Vector Sketching

Authors: Huai-Hsun Cheng, Siang-Ling Zhang, Yu-Lun Liu

First: 2026-02-12T18:59:54+00:00 · Latest: 2026-02-12T18:59:54+00:00

Comments: Project page: https://stroke-of-surprise.github.io/ Code: https://github.com/stroke-of-surprise/Stroke-Of-Surprise

Abstract

Visual illusions traditionally rely on spatial manipulations such as multi-view consistency. In this work, we introduce Progressive Semantic Illusions, a novel vector sketching task where a single sketch undergoes a dramatic semantic transformation through the sequential addition of strokes. We present Stroke of Surprise, a generative framework that optimizes vector strokes to satisfy distinct semantic interpretations at different drawing stages. The core challenge lies in the "dual-constraint": initial prefix strokes must form a coherent object (e.g., a duck) while simultaneously serving as the structural foundation for a second concept (e.g., a sheep) upon adding delta strokes. To address this, we propose a sequence-aware joint optimization framework driven by a dual-branch Score Distillation Sampling (SDS) mechanism. Unlike sequential approaches that freeze the initial state, our method dynamically adjusts prefix strokes to discover a "common structural subspace" valid for both targets. Furthermore, we introduce a novel Overlay Loss that enforces spatial complementarity, ensuring structural integration rather than occlusion. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art baselines in recognizability and illusion strength, successfully expanding visual anagrams from the spatial to the temporal dimension. Project page: https://stroke-of-surprise.github.io/

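Title / Abstract (translated from Chinese)

Title: Stroke of Surprise: Progressive Semantic Illusions in Vector Sketching

Visual illusions have traditionally relied on spatial manipulations such as multi-view consistency. In this work, we introduce Progressive Semantic Illusions, a new vector sketching task in which a single sketch undergoes a dramatic semantic shift as strokes are added step by step. We propose Stroke of Surprise, a generative framework that optimizes vector strokes so that they satisfy different semantic interpretations at different drawing stages. The core challenge is the "dual constraint": the initial prefix strokes must form a coherent object (for example, a duck) while also serving as the structural basis for a second concept (for example, a sheep) once the delta strokes are added. To address this, we propose a sequence-aware joint optimization framework driven by a dual-branch Score Distillation Sampling (SDS) mechanism. Unlike sequential approaches, which freeze the initial state, our method dynamically adjusts the prefix strokes to find a "common structural subspace" that works for both targets. In addition, we introduce a new Overlay Loss that enforces spatial complementarity, so that strokes integrate structurally rather than occlude one another. Extensive experiments show that our method significantly outperforms state-of-the-art baselines in recognizability and illusion strength, extending visual anagrams from the spatial to the temporal dimension. Project page: https://stroke-of-surprise.github.io/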

Summary

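This work introduces Progressive Semantic Illusions, a vector sketching task in which a single sketch changes meaning dramatically as strokes are added in sequence. The Stroke of Surprise framework optimizes vector strokes to satisfy a different semantic interpretation at each drawing stage. It addresses a dual constraint: the initial prefix strokes must form a coherent object on their own and also serve as the structural foundation for a second concept once the delta strokes are added. Rather than freezing the prefix, the method adjusts it jointly with the later strokes through a dual-branch Score Distillation Sampling (SDS) mechanism, and an Overlay Loss enforces spatial complementarity. Experiments show that the method outperforms state-of-the-art baselines in recognizability and illusion strength, extending visual anagrams from the spatial to the temporal dimension.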

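This study introduces Progressive Semantic Illusions in vector sketching, in which adding strokes in sequence makes a single sketch change dramatically from one stage to the next. The Stroke of Surprise framework optimizes strokes to satisfy different semantic interpretations, dynamically adjusting the initial strokes so that they fit both the initial concept and the subsequent one. Experiments show that the method outperforms existing techniques in recognizability and illusion strength, extending visual anagrams from the spatial dimension to the temporal dimension.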
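
The contrast the abstract draws between freezing the prefix and adjusting it jointly can be illustrated with a deliberately small numerical toy. Everything here is an assumption made for illustration: the two "targets" are fixed 2-D points rather than text-conditioned SDS losses, the prefix is a 2-D vector, the delta strokes can only change the second coordinate, and `losses` and `optimize` are hypothetical names. It is not the paper's method, only a sketch of why letting the prefix move can lower the combined objective.

```python
from typing import Tuple

Vec = Tuple[float, float]

TARGET_A: Vec = (1.0, 0.0)  # hypothetical target for the prefix alone
TARGET_B: Vec = (0.0, 1.0)  # hypothetical target for prefix + delta


def losses(p: Vec, d: float) -> Tuple[float, float]:
    """Squared-error stand-ins for the two branch losses."""
    full = (p[0], p[1] + d)  # delta can only move the second coordinate
    loss_a = (p[0] - TARGET_A[0]) ** 2 + (p[1] - TARGET_A[1]) ** 2
    loss_b = (full[0] - TARGET_B[0]) ** 2 + (full[1] - TARGET_B[1]) ** 2
    return loss_a, loss_b


def optimize(joint: bool, steps: int = 2000, lr: float = 0.05) -> Tuple[Vec, float]:
    """Plain gradient descent; with joint=False the prefix is fit to A and frozen."""
    p0, p1, d = 0.0, 0.0, 0.0
    if not joint:
        p0, p1 = TARGET_A  # sequential baseline: best prefix for A, then frozen
    for _ in range(steps):
        # Gradients of loss_b with respect to the full sketch coordinates.
        gb0 = 2.0 * (p0 - TARGET_B[0])
        gb1 = 2.0 * (p1 + d - TARGET_B[1])
        if joint:
            # Gradients of loss_a with respect to the prefix.
            ga0 = 2.0 * (p0 - TARGET_A[0])
            ga1 = 2.0 * (p1 - TARGET_A[1])
            # The prefix receives gradients from both branches.
            p0 -= lr * (ga0 + gb0)
            p1 -= lr * (ga1 + gb1)
        d -= lr * gb1  # the delta only ever sees the second branch
    return (p0, p1), d


if __name__ == "__main__":
    for mode in (False, True):
        p, d = optimize(joint=mode)
        print("joint" if mode else "sequential", p, d, losses(p, d))
```

In this toy the frozen prefix leaves the second-stage loss stuck with an error the delta cannot reach, while the joint run moves the prefix to a compromise position and ends with a lower sum of the two losses, at the cost of a slightly worse first-stage fit.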

UniT: Unified Multimodal Chain-of-Thought Test-time Scaling

Authors: Leon Liangyu Chen, Haoyu Ma, Zhipeng Fan, Ziqi Huang, Animesh Sinha, Xiaoliang Dai, Jialiang Wang, Zecheng He, Jianwei Yang, Chunyuan Li, Junzhe Sun, Chu Wang, Serena Yeung-Levy, Felix Juefei-Xu

First: 2026-02-12T18:59:49+00:00 · Latest: 2026-02-12T18:59:49+00:00