arXiv Paper Digest

2026-03-31 04:08
Snapshot: 20260331_0408
Detailed Geometry and Appearance from Opportunistic Motion
Authors: Ryosuke Hirai, Kohei Yamashita, Antoine Guédon, Ryo Kawahara, Vincent Lepetit, Ko Nishino
First: 2026-03-27T17:59:16+00:00 · Latest: 2026-03-27T17:59:16+00:00
Abstract
Reconstructing 3D geometry and appearance from a sparse set of fixed cameras is a foundational task with broad applications, yet it remains fundamentally constrained by the limited viewpoints. We show that this bound can be broken by exploiting opportunistic object motion: as a person manipulates an object (e.g., moving a chair or lifting a mug), the static cameras effectively "orbit" the object in its local coordinate frame, providing additional virtual viewpoints. Harnessing this object motion, however, poses two challenges: the tight coupling of object pose and geometry estimation and the complex appearance variations of a moving object under static illumination. We address these by formulating a joint pose and shape optimization using 2D Gaussian splatting with alternating minimization of 6DoF trajectories and primitive parameters, and by introducing a novel appearance model that factorizes diffuse and specular components with reflected directional probing within the spherical harmonics space. Extensive experiments on synthetic and real-world datasets with extremely sparse viewpoints demonstrate that our method recovers significantly more accurate geometry and appearance than state-of-the-art baselines.
Summary
This work reconstructs detailed 3D geometry and appearance from a small set of static cameras by exploiting opportunistic object motion. It handles the tight coupling of pose and shape estimation through joint optimization, and it models the complex appearance variations of a moving object with a new appearance model that factorizes diffuse and specular components in spherical-harmonics space. Experiments on synthetic and real-world data show more accurate geometry and appearance than state-of-the-art baselines, even with extremely sparse viewpoints.
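The "virtual viewpoint" idea in the abstract can be illustrated with a few lines of rigid-body algebra. The sketch below is not the authors' pipeline: it assumes the object pose is already known as a rotation `R` and translation `t` (both hypothetical inputs here) and only shows how a single fixed camera appears at different positions once expressed in the object's local frame.

```python
import math

def rot_z(theta):
    """3x3 rotation about the z axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def camera_in_object_frame(R, t, cam_center):
    """Map a world-space camera centre into the object's local frame.

    The object pose maps local points to world points as x_w = R x_o + t,
    so the inverse mapping is x_o = R^T (x_w - t).
    """
    d = [cam_center[i] - t[i] for i in range(3)]
    # Multiply by R^T: entry j sums R[i][j] * d[i] over i.
    return [sum(R[i][j] * d[i] for i in range(3)) for j in range(3)]

# One fixed camera; the object is rotated in place by 0, 90 and 180 degrees.
cam = [2.0, 0.0, 0.0]
origin = [0.0, 0.0, 0.0]
virtual = [camera_in_object_frame(rot_z(math.radians(a)), origin, cam)
           for a in (0, 90, 180)]
```

Each entry of `virtual` is a distinct camera centre in the object frame, which is the sense in which a static camera "orbits" a manipulated object.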
GaussianGPT: Towards Autoregressive 3D Gaussian Scene Generation
Authors: Nicolas von Lützow, Barbara Rössle, Katharina Schmid, Matthias Nießner
First: 2026-03-27T17:58:05+00:00 · Latest: 2026-03-27T17:58:05+00:00
Comments: Project page: https://nicolasvonluetzow.github.io/GaussianGPT/ - Project video: https://youtu.be/zVnMHkFzHDg
Abstract
Most recent advances in 3D generative modeling rely on diffusion or flow-matching formulations. We instead explore a fully autoregressive alternative and introduce GaussianGPT, a transformer-based model that directly generates 3D Gaussians via next-token prediction, thus facilitating full 3D scene generation. We first compress Gaussian primitives into a discrete latent grid using a sparse 3D convolutional autoencoder with vector quantization. The resulting tokens are serialized and modeled using a causal transformer with 3D rotary positional embedding, enabling sequential generation of spatial structure and appearance. Unlike diffusion-based methods that refine scenes holistically, our formulation constructs scenes step-by-step, naturally supporting completion, outpainting, controllable sampling via temperature, and flexible generation horizons. This formulation leverages the compositional inductive biases and scalability of autoregressive modeling while operating on explicit representations compatible with modern neural rendering pipelines, positioning autoregressive transformers as a complementary paradigm for controllable and context-aware 3D generation.
Summary
GaussianGPT is a fully autoregressive, transformer-based model that generates 3D Gaussian scenes by next-token prediction. It first compresses Gaussian primitives into a discrete latent grid with a sparse 3D convolutional autoencoder and vector quantization, then serializes the tokens and models them with a causal transformer that uses 3D rotary positional embedding. Step-by-step scene construction naturally supports completion, outpainting, temperature-controlled sampling, and flexible generation horizons, offering a complementary paradigm to diffusion-based 3D generation.
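Temperature-controlled next-token sampling, which the abstract lists as one of the benefits of the autoregressive formulation, can be sketched generically as follows. This is not GaussianGPT's code: `logits_fn` is a hypothetical stand-in for the causal transformer, and the token vocabulary is a toy one.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(logits, temperature, rng):
    """Draw one token index from the tempered distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate(logits_fn, prefix, steps, temperature, seed=0):
    """Extend `prefix` one token at a time; a non-empty prefix acts as completion context."""
    rng = random.Random(seed)
    tokens = list(prefix)
    for _ in range(steps):
        tokens.append(sample_next_token(logits_fn(tokens), temperature, rng))
    return tokens
```

Because generation only ever appends to an existing sequence, the same loop covers both unconditional generation (empty prefix) and completion of a partial scene (non-empty prefix), and `steps` sets the generation horizon.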
Ruka-v2: Tendon Driven Open-Source Dexterous Hand with Wrist and Abduction for Robot Learning
Authors: Xinqi Liu, Ruoxi Hu, Alejandro Ojeda Olarte, Zhuoran Chen, Kenny Ma, Charles Cheng Ji, Lerrel Pinto, Raunaq Bhirangi, Irmak Guzey
First: 2026-03-27T17:58:03+00:00 · Latest: 2026-03-27T17:58:03+00:00
Abstract
Lack of accessible and dexterous robot hardware has been a significant bottleneck to achieving human-level dexterity in robots. Last year, we released Ruka, a fully open-sourced, tendon-driven humanoid hand with 11 degrees of freedom - 2 per finger and 3 at the thumb - buildable for under $1,300. It was one of the first fully open-sourced humanoid hands, and introduced a novel data-driven approach to finger control that captures tendon dynamics within the control system. Despite these contributions, Ruka lacked two degrees of freedom essential for closely imitating human behavior: wrist mobility and finger adduction/abduction. In this paper, we introduce Ruka-v2: a fully open-sourced, tendon-driven humanoid hand featuring a decoupled 2-DOF parallel wrist and abduction/adduction at the fingers. The parallel wrist adds smooth, independent flexion/extension and radial/ulnar deviation, enabling manipulation in confined environments such as cabinets. Abduction enables motions such as grasping thin objects, in-hand rotation, and calligraphy. We present the design of Ruka-v2 and evaluate it against Ruka through user studies on teleoperated tasks, finding a 51.3% reduction in completion time and a 21.2% increase in success rate. We further demonstrate its full range of applications for robot learning: bimanual and single-arm teleoperation across 13 dexterous tasks, and autonomous policy learning on 3 tasks. All 3D print files, assembly instructions, controller software, and videos are available at https://ruka-hand-v2.github.io/ .
Summary
Ruka-v2 is a fully open-source, tendon-driven dexterous hand that adds a decoupled 2-DOF wrist and finger abduction/adduction to the original Ruka, addressing the shortage of accessible, dexterous robot hardware. In user studies on teleoperated tasks, Ruka-v2 showed a 51.3% reduction in completion time and a 21.2% increase in success rate compared to Ruka. The authors also demonstrate bimanual and single-arm teleoperation across 13 dexterous tasks and autonomous policy learning on 3 tasks.
Zero-Shot Depth from Defocus
Authors: Yiming Zuo, Hongyu Wen, Venkat Subramanian, Patrick Chen, Karhan Kayan, Mario Bijelic, Felix Heide, Jia Deng
First: 2026-03-27T17:56:26+00:00 · Latest: 2026-03-27T17:56:26+00:00
Abstract
Depth from Defocus (DfD) is the task of estimating a dense metric depth map from a focus stack. Unlike previous works overfitting to a certain dataset, this paper focuses on the challenging and practical setting of zero-shot generalization. We first propose a new real-world DfD benchmark ZEDD, which contains 8.3x more scenes and significantly higher quality images and ground-truth depth maps compared to previous benchmarks. We also design a novel network architecture named FOSSA. FOSSA is a Transformer-based architecture with novel designs tailored to the DfD task. The key contribution is a stack attention layer with a focus distance embedding, allowing efficient information exchange across the focus stack. Finally, we develop a new training data pipeline allowing us to utilize existing large-scale RGBD datasets to generate synthetic focus stacks. Experiment results on ZEDD and other benchmarks show a significant improvement over the baselines, reducing errors by up to 55.7%. The ZEDD benchmark is released at https://zedd.cs.princeton.edu. The code and checkpoints are released at https://github.com/princeton-vl/FOSSA.
Summary
This paper addresses Depth from Defocus (DfD) in the zero-shot generalization setting. It introduces ZEDD, a real-world benchmark with 8.3x more scenes and higher-quality images and ground-truth depth maps than previous benchmarks, and FOSSA, a Transformer-based architecture whose key component is a stack attention layer with a focus distance embedding. The authors also build a training data pipeline that generates synthetic focus stacks from existing RGBD datasets. On ZEDD and other benchmarks the method reduces errors by up to 55.7% compared to baselines.
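To make the link between depth and defocus concrete, the sketch below computes the thin-lens circle of confusion, the per-pixel blur size one would need when synthesizing a focus stack from an RGBD depth map. The abstract does not describe how the authors render their synthetic stacks, so the thin-lens model, the function names, and the default lens parameters are all assumptions made for illustration.

```python
def circle_of_confusion(depth_m, focus_m, focal_length_m, f_number):
    """Thin-lens blur-circle diameter on the sensor, in metres.

    depth_m is the scene depth, focus_m the focus distance (must exceed the
    focal length), and the aperture diameter is focal_length_m / f_number.
    """
    aperture = focal_length_m / f_number
    return (aperture * focal_length_m * abs(depth_m - focus_m)
            / (depth_m * (focus_m - focal_length_m)))

def coc_stack(depth_map, focus_distances, focal_length_m=0.05, f_number=2.0):
    """One blur-size map per focus distance: the geometry behind a focus stack."""
    return [[[circle_of_confusion(d, s, focal_length_m, f_number) for d in row]
             for row in depth_map]
            for s in focus_distances]

# A 1x3 depth map (metres) rendered at two hypothetical focus distances.
stack = coc_stack([[0.5, 1.0, 2.0]], focus_distances=[1.0, 2.0])
```

A pixel is sharp (zero blur) only in the slice whose focus distance matches its depth, which is the cue a DfD network reads across the stack.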
Tunable Soft Equivariance with Guarantees
Authors: Md Ashiqur Rahman, Lim Jun Hao, Jeremiah Jiang, Teck-Yian Lim, Raymond A. Yeh
First: 2026-03-27T17:56:25+00:00 · Latest: 2026-03-27T17:56:25+00:00
Abstract
Equivariance is a fundamental property in computer vision models, yet strict equivariance is rarely satisfied in real-world data, which can limit a model's performance. Controlling the degree of equivariance is therefore desirable. We propose a general framework for constructing soft equivariant models by projecting the model weights into a designed subspace. The method applies to any pre-trained architecture and provides theoretical bounds on the induced equivariance error. Empirically, we demonstrate the effectiveness of our method on multiple pre-trained backbones, including ViT and ResNet, across image classification, semantic segmentation, and human-trajectory prediction tasks. Notably, our approach improves the performance while simultaneously reducing equivariance error on the competitive ImageNet benchmark.
Summary
Strict equivariance is rarely satisfied in real-world data, which can limit model performance, so the paper proposes a way to control the degree of equivariance. Its framework constructs soft equivariant models by projecting model weights into a designed subspace; it applies to pre-trained architectures and comes with theoretical bounds on the induced equivariance error. Experiments with ViT and ResNet backbones cover image classification, semantic segmentation, and human-trajectory prediction; on ImageNet the method improves performance while also reducing equivariance error.
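The idea of tuning equivariance by projecting weights can be illustrated on a toy case. The sketch below is an assumption-laden stand-in, not the paper's construction: it uses the group of 90-degree rotations, takes group averaging as the projection onto the rotation-symmetric kernel subspace, and blends the original kernel with its projection through a hypothetical softness parameter `alpha`.

```python
import math

def rot90(kernel):
    """Rotate a square kernel (list of rows) by 90 degrees counter-clockwise."""
    n = len(kernel)
    return [[kernel[j][n - 1 - i] for j in range(n)] for i in range(n)]

def project_c4(kernel):
    """Average the kernel over its four rotations (projection onto symmetric kernels)."""
    rotations = [kernel]
    for _ in range(3):
        rotations.append(rot90(rotations[-1]))
    n = len(kernel)
    return [[sum(r[i][j] for r in rotations) / 4.0 for j in range(n)]
            for i in range(n)]

def soften(kernel, alpha):
    """Blend the kernel with its projection: alpha=0 keeps it, alpha=1 fully symmetrizes."""
    projected = project_c4(kernel)
    n = len(kernel)
    return [[(1.0 - alpha) * kernel[i][j] + alpha * projected[i][j]
             for j in range(n)] for i in range(n)]

def equivariance_error(kernel):
    """Frobenius distance between the kernel and its 90-degree rotation."""
    rotated = rot90(kernel)
    return math.sqrt(sum((a - b) ** 2
                         for row_a, row_b in zip(kernel, rotated)
                         for a, b in zip(row_a, row_b)))
```

In this toy setting the error shrinks by exactly a factor of `1 - alpha`, because the projected part is unchanged by rotation; that gives a simple, checkable analogue of a bound on the induced equivariance error.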
PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning
Authors: Shaoxuan Li, Zhixuan Zhao, Hanze Deng, Zirun Ma, Shulin Tian, Zuyan Liu, Yushi Hu, Haoning Wu, Yuhao Dong, Benlin Liu, Ziwei Liu, Ranjay Krishna
First: 2026-03-27T17:54:36+00:00 · Latest: 2026-03-27T17:54:36+00:00
Comments: Project Page: https://perceptioncomp.github.io
Abstract
We introduce PerceptionComp, a manually annotated benchmark for complex, long-horizon, perception-centric video reasoning. PerceptionComp is designed so that no single moment is sufficient: answering each question requires multiple temporally separated pieces of visual evidence and compositional constraints under conjunctive and sequential logic, spanning perceptual subtasks such as objects, attributes, relations, locations, actions, and events, and requiring skills including semantic recognition, visual correspondence, temporal reasoning, and spatial reasoning. The benchmark contains 1,114 highly complex questions on 279 videos from diverse domains including city walk tours, indoor villa tours, video games, and extreme outdoor sports, with 100% manual annotation. Human studies show that PerceptionComp requires substantial test-time thinking and repeated perception steps: participants take much longer than on prior benchmarks, and accuracy drops to near chance (18.97%) when rewatching is disallowed. State-of-the-art MLLMs also perform substantially worse on PerceptionComp than on existing benchmarks: the best model in our evaluation, Gemini-3-Flash, reaches only 45.96% accuracy in the five-choice setting, while open-source models remain below 40%. These results suggest that perception-centric long-horizon video reasoning remains a major bottleneck, and we hope PerceptionComp will help drive progress in perceptual reasoning.
Summary
PerceptionComp is a benchmark for complex, long-horizon, perception-centric video reasoning in which each question requires multiple temporally separated pieces of visual evidence and compositional constraints. It contains 1,114 questions on 279 diverse videos, with 100% manual annotation. Human studies show that answering requires substantial thinking and repeated perception steps, with accuracy dropping to near chance (18.97%) when rewatching is disallowed. The best evaluated model, Gemini-3-Flash, reaches only 45.96% accuracy in the five-choice setting and open-source models stay below 40%, indicating that perceptual reasoning remains a major bottleneck.
Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification
Authors: Zehai He, Wenyi Hong, Zhen Yang, Ziyang Pan, Mingdao Liu, Xiaotao Gu, Jie Tang
First: 2026-03-27T17:50:45+00:00 · Latest: 2026-03-27T17:50:45+00:00
Abstract
Recent advances in large language models have improved the capabilities of coding agents, yet systematic evaluation of complex, end-to-end website development remains limited. To address this gap, we introduce Vision2Web, a hierarchical benchmark for visual website development, spanning from static UI-to-code generation, interactive multi-page frontend reproduction, to long-horizon full-stack website development. The benchmark is constructed from real-world websites and comprises a total of 193 tasks across 16 categories, with 918 prototype images and 1,255 test cases. To support flexible, thorough and reliable evaluation, we propose a workflow-based agent verification paradigm based on two complementary components: a GUI agent verifier and a VLM-based judge. We evaluate multiple visual language models instantiated under different coding-agent frameworks, revealing substantial performance gaps at all task levels, with state-of-the-art models still struggling on full-stack development.
Summary
Vision2Web is a hierarchical benchmark for evaluating visual website development, covering static UI-to-code generation, interactive multi-page frontend reproduction, and full-stack website development. It includes 193 tasks across 16 categories, with 918 prototype images and 1,255 test cases. The benchmark uses a workflow-based agent verification paradigm, combining a GUI agent verifier and a VLM-based judge to support flexible, thorough, and reliable evaluation. Evaluations of multiple visual language models show substantial performance gaps at all task levels, with state-of-the-art models still struggling on full-stack development.
An LP-based Sampling Policy for Multi-Armed Bandits with Side-Observations and Stochastic Availability
Authors: Ashutosh Soni, Peizhong Ju, Atilla Eryilmaz, Ness B. Shroff
First: 2026-03-27T17:50:42+00:00 · Latest: 2026-03-27T17:50:42+00:00
Abstract
We study the stochastic multi-armed bandit (MAB) problem where an underlying network structure enables side-observations across related actions. We use a bipartite graph to link actions to a set of unknowns, such that selecting an action reveals observations for all the unknowns it is connected to. While previous works rely on the assumption that all actions are permanently accessible, we investigate the more practical setting of stochastic availability, where the set of feasible actions (the "activation set") varies dynamically in each round. This framework models real-world systems with both structural dependencies and volatility, such as social networks where users provide side-information about their peers' preferences, yet are not always online to be queried. To address this challenge, we propose UCB-LP-A, a novel policy that leverages a Linear Programming (LP) approach to optimize exploration-exploitation trade-offs under stochastic availability. Unlike standard network bandit algorithms that assume constant access, UCB-LP-A computes an optimal sampling distribution over the realizable activation sets, ensuring that the necessary observations are gathered using only the currently active arms. We derive a theoretical upper bound on the regret of our policy, characterizing the impact of both the network structure and the activation probabilities. Finally, we demonstrate through numerical simulations that UCB-LP-A significantly outperforms existing heuristics that ignore either the side-information or the availability constraints.
Summary
The paper addresses the stochastic multi-armed bandit problem with side-observations and stochastic availability, proposing UCB-LP-A, which uses a Linear Programming approach to optimize exploration-exploitation trade-offs. The policy computes an optimal sampling distribution over the realizable activation sets, so that the necessary observations are gathered using only the currently active arms. A theoretical regret upper bound characterizes the impact of the network structure and the activation probabilities, and numerical simulations show that UCB-LP-A outperforms existing heuristics that ignore either side-information or availability constraints.
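The setting described above (a bipartite graph from actions to unknowns, plus a per-round activation set) can be sketched with a plain UCB rule restricted to the available actions. This is explicitly not UCB-LP-A: the LP-based sampling distribution is omitted, and scoring an action by the largest index among the unknowns it reveals is a stand-in rule chosen only for illustration. All names are hypothetical.

```python
import math

def record_side_observations(graph, action, rewards, counts, sums):
    """Playing `action` reveals a sample for every unknown listed in graph[action]."""
    for u in graph[action]:
        counts[u] += 1
        sums[u] += rewards[u]

def ucb_index(u, t, counts, sums):
    """Standard UCB1 index; unobserved unknowns get infinite priority."""
    if counts[u] == 0:
        return math.inf
    return sums[u] / counts[u] + math.sqrt(2.0 * math.log(t) / counts[u])

def choose_action(available, graph, t, counts, sums):
    """Among currently available actions, pick the one whose best linked unknown
    has the largest index (ties broken by action name via the sorted order)."""
    return max(sorted(available),
               key=lambda a: max(ucb_index(u, t, counts, sums) for u in graph[a]))

# Toy instance: action "a" reveals two unknowns, "b" and "c" one each.
graph = {"a": ["u1", "u2"], "b": ["u2"], "c": ["u3"]}
counts = {"u1": 0, "u2": 0, "u3": 0}
sums = {"u1": 0.0, "u2": 0.0, "u3": 0.0}
record_side_observations(graph, "a", {"u1": 1.0, "u2": 0.0, "u3": 0.5}, counts, sums)
```

After the single play of "a", both `u1` and `u2` have one sample each while `u3` has none, so a round in which only "b" and "c" are available would select "c".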
Beyond Language: Grounding Referring Expressions with Hand Pointing in Egocentric Vision
Authors: Ling Li, Bowen Liu, Zinuo Zhan, Peng Jie, Jianhui Zhong, Kenglun Chang, Zhidong Deng
First: 2026-03-27T17:49:56+00:00 · Latest: 2026-03-27T17:49:56+00:00
Abstract
Traditional Visual Grounding (VG) predominantly relies on textual descriptions to localize objects, a paradigm that inherently struggles with linguistic ambiguity and often ignores non-verbal deictic cues prevalent in real-world interactions. In natural egocentric engagements, hand-pointing combined with speech forms the most intuitive referring mechanism. To bridge this gap, we introduce EgoPoint-Ground, the first large-scale multimodal dataset dedicated to egocentric deictic visual grounding. Comprising over 15k interactive samples in complex scenes, the dataset provides rich, multi-grained annotations including hand-target bounding box pairs and dense semantic captions. We establish a comprehensive benchmark for hand-pointing referring expression resolution, evaluating a wide spectrum of mainstream Multimodal Large Language Models (MLLMs) and state-of-the-art VG architectures. Furthermore, we propose SV-CoT, a novel baseline framework that reformulates grounding as a structured inference process, synergizing gestural and linguistic cues through a Visual Chain-of-Thought paradigm. Extensive experiments demonstrate that SV-CoT achieves an 11.7% absolute improvement over existing methods, effectively mitigating semantic ambiguity and advancing the capability of agents to comprehend multimodal physical intents. The dataset and code will be made publicly available.
Summary
Traditional Visual Grounding methods rely on textual descriptions, which struggle with linguistic ambiguity and neglect non-verbal cues. The study introduces EgoPoint-Ground, a large-scale multimodal dataset for egocentric deictic visual grounding with over 15,000 samples and detailed annotations, and uses it to benchmark a range of Multimodal Large Language Models and visual grounding architectures. The proposed SV-CoT framework treats grounding as a structured inference process that integrates gestural and linguistic cues through a Visual Chain-of-Thought paradigm, achieving an 11.7% absolute improvement over existing methods and reducing semantic ambiguity.
Automatic Laplace Collapsed Sampling: Scalable Marginalisation of Latent Parameters via Automatic Differentiation
Authors: Toby Lovick, David Yallup, Will Handley
First: 2026-03-27T17:47:45+00:00 · Latest: 2026-03-27T17:47:45+00:00
Comments: 28 Pages, 7 Figures. Comments welcome
Abstract
We present Automatic Laplace Collapsed Sampling (ALCS), a general framework for marginalising latent parameters in Bayesian models using automatic differentiation, which we combine with nested sampling to explore the hyperparameter space in a robust and efficient manner. At each nested sampling likelihood evaluation, ALCS collapses the high-dimensional latent variables $z$ to a scalar contribution via maximum a posteriori (MAP) optimisation and a Laplace approximation, both computed using autodiff. This reduces the effective dimension from $d_\theta + d_z$ to just $d_\theta$, making Bayesian evidence computation tractable for high-dimensional settings without hand-derived gradients or Hessians, and with minimal model-specific engineering. The MAP optimisation and Hessian evaluation are parallelised across live points on GPU hardware, making the method practical at scale. We also show that automatic differentiation enables local approximations beyond Laplace to parametric families such as the Student-$t$, which improves evidence estimates for heavy-tailed latents. We validate ALCS on a suite of benchmarks spanning hierarchical, time-series, and discrete-likelihood models and establish where the Gaussian approximation holds. This enables a post-hoc ESS diagnostic that localises failures across hyperparameter space without expensive joint sampling.
Summary
ALCS is a framework for marginalising latent parameters in Bayesian models using automatic differentiation combined with nested sampling. It reduces the effective dimensionality from $d_\theta + d_z$ to $d_\theta$ by collapsing the high-dimensional latent variables through MAP optimisation and a Laplace approximation, making Bayesian evidence computation scalable in high-dimensional settings. The MAP and Hessian computations are parallelised across live points on GPU, and local approximations beyond Laplace, such as the Student-$t$, improve evidence estimates for heavy-tailed latents. Experiments on hierarchical, time-series, and discrete-likelihood models validate ALCS and yield a post-hoc ESS diagnostic for localising failures in hyperparameter space.
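The collapse step, a MAP optimisation followed by a Laplace approximation, reduces in one dimension to the estimate $\log Z \approx f(z^*) + \tfrac{1}{2}\log 2\pi - \tfrac{1}{2}\log(-f''(z^*))$ for a log joint $f$. The sketch below illustrates that formula only. It is not the ALCS implementation: it handles a single scalar latent, uses Newton steps for the MAP, and substitutes central finite differences for automatic differentiation so that it runs with the standard library alone.

```python
import math

def laplace_log_evidence(log_joint, z0, steps=50, h=1e-3):
    """1-D Laplace estimate of log of the integral of exp(log_joint(z)) dz.

    Finite differences stand in for autodiff here (an assumption of this sketch).
    """
    def grad(z):
        return (log_joint(z + h) - log_joint(z - h)) / (2.0 * h)

    def hess(z):
        return (log_joint(z + h) - 2.0 * log_joint(z) + log_joint(z - h)) / (h * h)

    z = z0
    for _ in range(steps):
        curvature = hess(z)
        if curvature >= 0:
            raise ValueError("log joint is not locally concave; Newton step undefined")
        z -= grad(z) / curvature  # Newton ascent towards the MAP

    # Gaussian integral around the mode: exp(f(z*)) * sqrt(2*pi / -f''(z*)).
    return log_joint(z) + 0.5 * math.log(2.0 * math.pi) - 0.5 * math.log(-hess(z))

# Toy log joint: an unnormalised Gaussian with mean 1, variance 0.25, mass 3 at the peak.
def toy_log_joint(z):
    return math.log(3.0) - 0.5 * (z - 1.0) ** 2 / 0.25

estimate = laplace_log_evidence(toy_log_joint, z0=0.0)
```

For a Gaussian integrand the Laplace approximation is exact, so `estimate` should match `log(3 * sqrt(2 * pi * 0.25))` up to finite-difference error; for non-Gaussian latents the gap is what a diagnostic would need to detect.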
Make Geometry Matter for Spatial Reasoning
Authors: Shihua Zhang, Qiuhong Shen, Shizun Wang, Tianbo Pan, Xinchao Wang
First: 2026-03-27T17:45:12+00:00 · Latest: 2026-03-27T17:45:12+00:00
Abstract
Empowered by large-scale training, vision-language models (VLMs) achieve strong image and video understanding, yet their ability to perform spatial reasoning in both static scenes and dynamic videos remains limited. Recent advances try to handle this limitation by injecting geometry tokens from pretrained 3D foundation models into VLMs. Nevertheless, we observe that naive token fusion followed by standard fine-tuning in this line of work often leaves such geometric cues underutilized for spatial reasoning, as VLMs tend to rely heavily on 2D visual cues. In this paper, we propose GeoSR, a framework designed to make geometry matter by encouraging VLMs to actively reason with geometry tokens. GeoSR introduces two key components: (1) Geometry-Unleashing Masking, which strategically masks portions of 2D vision tokens during training to weaken non-geometric shortcuts and force the model to consult geometry tokens for spatial reasoning; and (2) Geometry-Guided Fusion, a gated routing mechanism that adaptively amplifies geometry token contributions in regions where geometric evidence is critical. Together, these designs unleash the potential of geometry tokens for spatial reasoning tasks. Extensive experiments on both static and dynamic spatial reasoning benchmarks demonstrate that GeoSR consistently outperforms prior methods and establishes new state-of-the-art performance by effectively leveraging geometric information. The project page is available at https://suhzhang.github.io/GeoSR/.
Summary
This paper addresses the limited spatial reasoning of vision-language models (VLMs) by proposing GeoSR, a framework that encourages models to actively reason with geometry tokens. GeoSR combines Geometry-Unleashing Masking, which masks part of the 2D vision tokens during training to weaken non-geometric shortcuts, with Geometry-Guided Fusion, a gated routing mechanism that adaptively amplifies geometry token contributions where geometric evidence is critical. Experiments show that GeoSR outperforms previous methods and sets new state-of-the-art performance on both static and dynamic spatial reasoning benchmarks.
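The two components named above can be mimicked on plain lists of floats. The abstract gives no formulas, so everything below is a hypothetical illustration rather than GeoSR itself: masking is modelled as zeroing a random subset of vision tokens, and the fusion gate as a sigmoid that scales the geometry token before adding it to the vision token.

```python
import math
import random

def mask_vision_tokens(tokens, mask_ratio, rng):
    """Zero out a random subset of 2D vision tokens (each token is a list of floats)."""
    n_mask = int(round(mask_ratio * len(tokens)))
    masked = set(rng.sample(range(len(tokens)), n_mask))
    return [[0.0] * len(tok) if i in masked else list(tok)
            for i, tok in enumerate(tokens)]

def gated_fusion(vision_tok, geometry_tok, gate_logit):
    """Add the geometry token to the vision token, scaled by a sigmoid gate in (0, 1)."""
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    return [v + gate * g for v, g in zip(vision_tok, geometry_tok)]

rng = random.Random(0)
vision = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
masked = mask_vision_tokens(vision, mask_ratio=0.5, rng=rng)
fused = gated_fusion([1.0, 1.0], [2.0, 2.0], gate_logit=0.0)
```

With half of the vision tokens zeroed, a model trained on `masked` cannot rely on 2D cues alone at those positions, which is the intuition behind forcing it to consult the geometry stream.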
Decoupling Exploration and Policy Optimization: Uncertainty Guided Tree Search for Hard Exploration
Authors: Zakaria Mhammedi, James Cohan
First: 2026-03-23T17:56:52+00:00 · Latest: 2026-03-27T17:44:46+00:00
Abstract
The process of discovery requires active exploration -- the act of collecting new and informative data. However, efficient autonomous exploration remains a major unsolved problem. The dominant paradigm addresses this challenge by using Reinforcement Learning (RL) to train agents with intrinsic motivation, maximizing a composite objective of extrinsic and intrinsic rewards. We suggest that this approach incurs unnecessary overhead: while policy optimization is necessary for precise task execution, employing such machinery solely to expand state coverage may be inefficient. In this paper, we propose a new paradigm that explicitly separates exploration from exploitation and bypasses RL during the exploration phase. Our method uses a tree-search strategy inspired by the Go-With-The-Winner algorithm, paired with a measure of epistemic uncertainty to systematically drive exploration. By removing the overhead of policy optimization, our approach explores an order of magnitude more efficiently than standard intrinsic motivation baselines on hard Atari benchmarks. Further, we demonstrate that the discovered trajectories can be distilled into deployable policies using existing supervised backward learning algorithms, achieving state-of-the-art scores by a wide margin on Montezuma's Revenge, Pitfall!, and Venture without relying on domain-specific knowledge. Finally, we demonstrate the generality of our framework in high-dimensional continuous action spaces by solving the MuJoCo Adroit dexterous manipulation and AntMaze tasks in a sparse-reward setting, directly from image observations and without expert demonstrations or offline datasets. To the best of our knowledge, this has not been achieved before for the Adroit tasks.
Summary / 总结
This paper addresses the challenge of efficient autonomous exploration in reinforcement learning by proposing a new paradigm that decouples exploration from policy optimization. The method uses a tree-search strategy with epistemic uncertainty to guide exploration, bypassing RL during the exploration phase. This approach significantly improves exploration efficiency on hard Atari benchmarks compared to standard intrinsic motivation baselines. Additionally, the discovered trajectories can be distilled into deployable policies, achieving state-of-the-art scores on Montezuma's Revenge, Pitfall!, and Venture. The framework is also demonstrated to be effective in high-dimensional continuous action spaces for solving MuJoCo Adroit and AntMaze tasks directly from image observations without expert demonstrations or offline datasets.
本文提出了一种新的范式,通过将探索与策略优化分离来解决强化学习中高效自主探索的挑战。该方法使用由认知不确定性引导的树搜索策略,在探索阶段绕过RL。与标准内在动机基线相比,这种方法在困难的Atari基准测试上的探索效率显著提高。此外,发现的轨迹可以被蒸馏成可部署的策略,在Montezuma's Revenge、Pitfall!和Venture上实现了最先进的得分。该框架还在高维连续动作空间中,直接从图像观测解决了MuJoCo Adroit和AntMaze任务,无需专家演示或离线数据集。据我们所知,这是首次在Adroit任务中实现这一点。
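The abstract above describes exploration as a population-style tree search driven by an uncertainty measure, with no policy optimization. A minimal sketch of that idea follows. It is not the authors' algorithm: the visitation-count novelty score is a swapped-in stand-in for their epistemic-uncertainty measure, and `explore`, `chain_step`, and the toy 1D chain environment are hypothetical names introduced only for illustration.

```python
import random
from collections import Counter

def explore(step, start, actions, n_particles=16, rounds=50, horizon=5, seed=0):
    """Go-With-The-Winners-style exploration with a count-based novelty score.

    Assumption: visitation counts stand in for epistemic uncertainty.
    n_particles is expected to be even so the population size stays fixed.
    """
    rng = random.Random(seed)
    visits = Counter([start])            # how often each state has been seen
    particles = [start] * n_particles
    for _round in range(rounds):
        # 1) Roll every particle forward with random actions (no policy learning).
        rolled = []
        for state in particles:
            for _step in range(horizon):
                state = step(state, rng.choice(actions))
                visits[state] += 1
            rolled.append(state)
        # 2) Rank particles by novelty: rarely visited end states come first.
        rolled.sort(key=lambda s: visits[s])
        # 3) Keep the most novel half and clone it over the other half.
        winners = rolled[: n_particles // 2]
        particles = winners + winners
    return visits

def chain_step(state, action):
    """Toy 1D chain on states 0..30; action is -1 or +1 (hypothetical environment)."""
    return max(0, min(30, state + action))

visits = explore(chain_step, start=0, actions=[-1, 1])
print(len(visits), "distinct states visited")
```

The cloning step is what separates this from independent random walks: particles that end in frequently visited states are discarded and replaced by copies of particles at the frontier.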
Drive-Through 3D Vehicle Exterior Reconstruction via Dynamic-Scene SfM and Distortion-Aware Gaussian Splatting
Authors: Nitin Kulkarni, Akhil Devarashetti, Charlie Cluss, Livio Forte, Philip Schneider, Chunming Qiao, Alina Vereshchaka
Venue: IROS 2026
First: 2026-03-27T17:42:42+00:00 · Latest: 2026-03-27T17:42:42+00:00
Comments: 8 pages, 7 figures, Submitted to IEEE IROS 2026 (under review)
Abstract
High-fidelity 3D reconstruction of vehicle exteriors improves buyer confidence in online automotive marketplaces, but generating these models in cluttered dealership drive-throughs presents severe technical challenges. Unlike static-scene photogrammetry, this setting features a dynamic vehicle moving against heavily cluttered, static backgrounds. This problem is further compounded by wide-angle lens distortion, specular automotive paint, and non-rigid wheel rotations that violate classical epipolar constraints. We propose an end-to-end pipeline utilizing a two-pillar camera rig. First, we resolve dynamic-scene ambiguities by coupling SAM 3 for instance segmentation with motion-gating to cleanly isolate the moving vehicle, explicitly masking out non-rigid wheels to enforce strict epipolar geometry. Second, we extract robust correspondences directly on raw, distorted 4K imagery using the RoMa v2 learned matcher guided by semantic confidence masks. Third, these matches are integrated into a rig-aware SfM optimization that utilizes CAD-derived relative pose priors to eliminate scale drift. Finally, we use a distortion-aware 3D Gaussian Splatting framework (3DGUT) coupled with a stochastic Markov Chain Monte Carlo (MCMC) densification strategy to render reflective surfaces. Evaluations on 25 real-world vehicles across 10 dealerships demonstrate that our full pipeline achieves a PSNR of 28.66 dB, an SSIM of 0.89, and an LPIPS of 0.21 on held-out views, representing a 3.85 dB improvement over standard 3D-GS, delivering inspection-grade interactive 3D models without controlled studio infrastructure.
中文标题/摘要
标题:基于动态场景SfM和畸变感知高斯泼溅的驶过式车辆外观3D重建
高保真汽车外部3D重建可提高在线汽车市场买家的信心,但在杂乱的经销商驶过式通道中生成这些模型面临着严峻的技术挑战。与静态场景摄影测量不同,此场景中动态车辆在高度杂乱的静态背景前移动。此外,广角镜头畸变、镜面反射的汽车漆面以及违反经典对极约束的非刚性车轮旋转使问题更加复杂。我们提出了一种端到端的流水线,利用双立柱相机阵列。首先,我们通过结合SAM 3实例分割与运动门控来解决动态场景的歧义性,清晰地分离出移动车辆,并显式掩蔽非刚性车轮以严格满足对极几何。其次,我们使用由语义置信度掩模引导的RoMa v2学习匹配器,直接在原始畸变的4K图像上提取鲁棒对应关系。第三,这些匹配被整合到一个感知相机阵列的SfM优化中,利用CAD导出的相对位姿先验来消除尺度漂移。最后,我们使用一种畸变感知的3D高斯泼溅框架(3DGUT)结合随机马尔可夫链蒙特卡洛(MCMC)稠密化策略来渲染反射表面。在10家经销商的25辆真实车辆上进行的评估表明,我们的完整流水线在留出视图上实现了28.66 dB的PSNR、0.89的SSIM和0.21的LPIPS,比标准3D-GS提高了3.85 dB,无需受控的摄影棚基础设施即可提供检测级的交互式3D模型。
Summary / 总结
The paper addresses the challenge of 3D vehicle exterior reconstruction in cluttered dealership drive-throughs, where dynamic scenes and wide-angle lens distortions complicate traditional photogrammetry methods. It proposes an end-to-end pipeline that first uses instance segmentation and motion-gating to isolate the moving vehicle, then extracts robust correspondences from raw, distorted imagery, integrates these into a rig-aware Structure-from-Motion (SfM) optimization, and finally renders reflective surfaces using a distortion-aware Gaussian Splatting framework. The pipeline achieves a PSNR of 28.66 dB, an SSIM of 0.89, and an LPIPS of 0.21, demonstrating a 3.85 dB improvement over standard 3D-GS methods.
该论文解决了在杂乱的经销商驶过式通道中进行3D车辆外部重建的挑战,其中动态场景和广角镜头畸变使过程复杂化。作者提出了一种端到端的流水线,包括实例分割、运动门控、鲁棒对应关系提取、感知相机阵列的运动恢复结构(SfM)优化以及畸变感知的高斯泼溅渲染框架。该流水线在留出视图上实现了28.66 dB的PSNR、0.89的SSIM和0.21的LPIPS,比标准3D-GS高出3.85 dB。
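To put the reported numbers in context, the sketch below computes PSNR from mean squared error and converts a dB gain into an MSE ratio. It assumes the 3.85 dB improvement refers to PSNR, as the abstract's wording suggests, which would imply a baseline of about 24.81 dB. The function names `psnr` and `mse_ratio` are illustrative and not taken from the paper.

```python
import math

def psnr(reference, rendered, max_value=1.0):
    """PSNR in dB between two equal-length sequences of pixel intensities."""
    if len(reference) != len(rendered):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_ratio(db_gain):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (db_gain / 10.0)

# Implied baseline PSNR, assuming the 3.85 dB gain is measured in PSNR.
print(round(28.66 - 3.85, 2))
# Corresponding reduction factor in mean squared error.
print(round(mse_ratio(3.85), 2))
```

Because PSNR is logarithmic, a gain of a few dB corresponds to a multiplicative reduction in pixel-level error, which is why dB differences are usually reported rather than raw MSE.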
INSIGHT: Enhancing Autonomous Driving Safety through Vision-Language Models on Context-Aware Hazard Detection and Edge Case Evaluation
Authors: Dianwei Chen, Zifan Zhang, Lei Cheng, Yuchen Liu, Xianfeng Terry Yang
First: 2025-02-01T01:43:53+00:00 · Latest: 2026-03-27T17:40:45+00:00
Abstract
Autonomous driving systems face significant challenges in handling unpredictable edge-case scenarios, such as adversarial pedestrian movements, dangerous vehicle maneuvers, and sudden environmental changes. Current end-to-end driving models struggle with generalization to these rare events due to limitations in traditional detection and prediction approaches. To address this, we propose INSIGHT (Integration of Semantic and Visual Inputs for Generalized Hazard Tracking), a hierarchical vision-language model (VLM) framework designed to enhance hazard detection and edge-case evaluation. By using multimodal data fusion, our approach integrates semantic and visual representations, enabling precise interpretation of driving scenarios and accurate forecasting of potential dangers. Through supervised fine-tuning of VLMs, we optimize spatial hazard localization using attention-based mechanisms and coordinate regression techniques. Experimental results on the BDD100K dataset demonstrate a substantial improvement in hazard prediction straightforwardness and accuracy over existing models, achieving a notable increase in generalization performance. This advancement enhances the robustness and safety of autonomous driving systems, ensuring improved situational awareness and potential decision-making in complex real-world scenarios.
中文标题/摘要
标题:INSIGHT:通过上下文感知危害检测和边缘案例评估中的视觉-语言模型提升自动驾驶安全性
自动驾驶系统在处理不可预测的边缘案例场景时面临重大挑战,如对抗性行人的运动、危险的车辆操作以及突然的环境变化。当前的端到端驾驶模型由于传统检测和预测方法的局限性,在这些罕见事件上的泛化能力有限。为了解决这一问题,我们提出了INSIGHT(语义和视觉输入的综合用于泛化危害跟踪),这是一种分层的视觉-语言模型(VLM)框架,旨在增强危害检测和边缘案例评估。通过多模态数据融合,我们的方法将语义和视觉表示相结合,使驾驶场景的精确解释和潜在危险的准确预测成为可能。通过监督微调VLMs,我们使用基于注意力的机制和坐标回归技术优化了空间危害定位。在BDD100K数据集上的实验结果表明,与现有模型相比,我们的方法在危害预测的清晰度和准确性上有了显著提高,泛化性能也得到了显著提升。这一进步增强了自动驾驶系统的稳健性和安全性,确保在复杂的真实世界场景中提高了态势感知和潜在决策能力。
Summary / 总结
The paper addresses the challenge of autonomous driving systems in handling unpredictable edge-case scenarios by proposing INSIGHT, a hierarchical vision-language model framework. This model integrates semantic and visual data to enhance hazard detection and edge-case evaluation, using attention-based mechanisms and coordinate regression for precise localization. Experiments on the BDD100K dataset show significant improvements in hazard prediction accuracy and generalization performance compared to existing models.
论文提出了一种名为INSIGHT的层次视觉-语言模型框架,以应对自动驾驶系统在处理不可预测的边缘案例场景时的挑战。该框架通过多模态数据融合和注意力机制整合语义和视觉表示,以增强危险检测和边缘案例评估。实验结果表明,该方法在BDD100K数据集上的危险预测准确性和泛化性能显著优于现有模型。
Large Language Models Can Perform Automatic Modulation Classification via Discretized Self-supervised Candidate Retrieval
Authors: Mohammad Rostami, Atik Faysal, Reihaneh Gh. Roshan, Huaxia Wang, Nikhil Muralidhar, Yu-Dong Yao
First: 2025-09-30T22:20:57+00:00 · Latest: 2026-03-27T17:33:28+00:00
Abstract
Identifying wireless modulation schemes is essential for cognitive radio, but standard supervised models often degrade under distribution shift, and training domain-specific wireless foundation models from scratch is computationally prohibitive. Large Language Models (LLMs) offer a promising training-free alternative via in-context learning, yet feeding raw floating-point signal statistics into LLMs overwhelms models with numerical noise and exhausts token budgets. We introduce DiSC-AMC, a framework that reformulates Automatic Modulation Classification (AMC) as an LLM reasoning task by combining aggressive feature discretization with nearest-neighbor retrieval over self-supervised embeddings. By mapping continuous features to coarse symbolic tokens, DiSC-AMC aligns abstract signal patterns with LLM reasoning capabilities and reduces prompt length by over $50$\%. Simultaneously, utilizing a DINOv2 visual encoder to retrieve the $k_\text{NN}$ most similar labeled exemplars provides highly relevant, query-specific context rather than generic class averages. On a 10-class benchmark, a fine-tuned 7B-parameter LLM using DiSC-AMC achieves $83.0$\% in-distribution accuracy ($-10$\,to\,$+10$\,dB) and $82.50$\% out-of-distribution (OOD) accuracy ($-11$\,to\,$-15$\,dB), outperforming supervised baselines. Comprehensive ablations on vanilla LLMs demonstrate the token efficiency of DiSC-AMC. A training-free $7$B LLM achieves $71$\% accuracy using only a $0.5$\,K-token prompt, surpassing a $200$B-parameter baseline that relies on a $2.9$K-token prompt. Furthermore, similarity-based exemplar retrieval outperforms naive class-average selection by over $20$\%. Finally, we identify a fundamental limitation of this pipeline. At extreme OOD noise levels ($-30$\,dB), the underlying self-supervised representations collapse, degrading retrieval quality and reducing classification to random chance.
中文标题/摘要
标题:大型语言模型可以通过离散化自监督候选检索自动执行调制分类
识别无线调制方案对于认知无线电至关重要,但标准的监督模型在分布偏移下往往会退化,从头训练特定领域的无线基础模型在计算上是不可行的。大型语言模型(LLMs)通过上下文学习提供了一种无训练的替代方案,然而将原始浮点信号统计直接输入LLMs会使模型受到数值噪声的困扰并耗尽令牌预算。我们提出了DiSC-AMC框架,通过结合激进的特征离散化和基于自监督嵌入的最近邻检索,将自动调制分类(AMC)重新表述为LLM推理任务。通过将连续特征映射为粗粒度的符号令牌,DiSC-AMC使抽象的信号模式与LLM的推理能力相匹配,并将提示长度减少了超过50%。同时,利用DINOv2视觉编码器检索$k_\text{NN}$个最相似的标记示例,提供了高度相关且查询特定的上下文,而不是通用的类别平均值。在10类基准测试中,使用DiSC-AMC微调的7B参数LLM实现了83.0%的分布内准确率(-10至+10 dB)和82.50%的分布外准确率(OOD,-11至-15 dB),超过了监督基线。对标准LLM的全面消融实验表明了DiSC-AMC的令牌效率。一个无训练的7B LLM仅使用0.5 K令牌提示就达到了71%的准确率,超过了依赖2.9 K令牌提示的200B参数基线。此外,基于相似性的示例检索比简单的类别平均选择高出超过20%。最后,我们确定了该管道的一个基本局限性。在极端的分布外噪声水平(-30 dB)下,底层的自监督表示会崩溃,导致检索质量下降,分类退化为随机猜测。
Summary / 总结
The paper addresses the challenge of automatic modulation classification in cognitive radio by proposing DiSC-AMC, a framework that leverages Large Language Models (LLMs) for reasoning tasks. It reformulates the problem by discretizing continuous signal features and using nearest-neighbor retrieval over self-supervised embeddings, reducing prompt length by over 50%. Experiments on a 10-class benchmark show that a fine-tuned 7B-parameter LLM using DiSC-AMC achieves 83.0% in-distribution accuracy and 82.50% out-of-distribution accuracy, outperforming supervised baselines. Ablations demonstrate the token efficiency of DiSC-AMC, and similarity-based exemplar retrieval outperforms class-average selection by over 20%. However, at extreme OOD noise levels, the performance degrades due to the collapse of self-supervised representations.
论文通过一种名为DiSC-AMC的新框架,利用大型语言模型(LLMs)解决认知无线电中的自动调制分类问题。该框架将调制分类重新表述为LLM的推理任务,通过激进的特征离散化和基于自监督嵌入的最近邻检索,减少了提示长度超过50%。在10类基准测试中,使用DiSC-AMC微调的7B参数LLM实现了83.0%的分布内准确率和82.50%的分布外准确率,优于监督基线。全面的消融实验显示了DiSC-AMC的令牌效率以及基于相似性的示例检索优于类平均选择的优越性。
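The two ingredients named in the abstract, coarse feature discretization and nearest-neighbor exemplar retrieval, can be sketched as follows. This is an illustration under stated assumptions, not the DiSC-AMC implementation: the bin edges, the token labels, the feature names (`kurtosis`, `skewness`), the prompt wording, and the 2-D toy embeddings are all hypothetical, and plain cosine similarity over hand-written vectors stands in for DINOv2 embeddings.

```python
import math

def discretize(value, edges, labels):
    """Map a continuous feature to a coarse symbolic token via ascending bin edges."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]  # value is at or above the last edge

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_emb, bank, k=2):
    """Return the k labeled exemplars whose embeddings are most similar to the query."""
    ranked = sorted(bank, key=lambda item: cosine(query_emb, item["emb"]), reverse=True)
    return ranked[:k]

def build_prompt(features, exemplars, edges, labels):
    """Assemble a short text prompt from retrieved labels and discretized features."""
    tokens = {name: discretize(v, edges, labels) for name, v in features.items()}
    lines = [f"Example: label={e['label']}" for e in exemplars]
    lines.append("Query features: " + ", ".join(f"{k}={t}" for k, t in tokens.items()))
    lines.append("Which modulation is this?")
    return "\n".join(lines)

# Hypothetical toy data: three labeled exemplars with 2-D embeddings.
bank = [
    {"emb": [1.0, 0.0], "label": "BPSK"},
    {"emb": [0.0, 1.0], "label": "QPSK"},
    {"emb": [0.9, 0.1], "label": "BPSK"},
]
edges, labels = [-1.0, 1.0], ["low", "mid", "high"]
neighbors = retrieve([1.0, 0.05], bank, k=2)
prompt = build_prompt({"kurtosis": 0.3, "skewness": -1.7}, neighbors, edges, labels)
print(prompt)
```

Replacing each floating-point statistic with one of a handful of symbols is what shortens the prompt, while retrieval keeps the in-context examples specific to the query rather than to a class average.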
StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos
Authors: Daeun Lee, Subhojyoti Mukherjee, Branislav Kveton, Ryan A. Rossi, Viet Dac Lai, Seunghyun Yoon, Trung Bui, Franck Dernoncourt, Mohit Bansal
Venue: CVPR 2026
First: 2025-12-01T14:15:44+00:00 · Latest: 2026-03-27T17:30:08+00:00
Comments: Accepted to CVPR 2026, Project page: https://streamgaze.github.io/
Abstract
Streaming video understanding requires models not only to process temporally incoming frames, but also to anticipate user intention for realistic applications such as Augmented Reality (AR) glasses. While prior streaming benchmarks evaluate temporal reasoning, none measure whether Multimodal Large Language Models (MLLMs) can interpret or leverage human gaze signals within a streaming setting. To fill this gap, we introduce StreamGaze, the first benchmark designed to evaluate how effectively MLLMs utilize gaze for temporal and proactive reasoning in streaming videos. StreamGaze introduces gaze-guided past, present, and proactive tasks that comprehensively assess streaming video understanding. These tasks evaluate whether models can use real-time gaze signals to follow shifting attention and infer user intentions based only on past and currently observed frames. To build StreamGaze, we develop a gaze-video Question Answering (QA) generation pipeline that aligns egocentric videos with raw gaze trajectories through fixation extraction, region-specific visual prompting, and scanpath construction. This pipeline produces spatio-temporally grounded QA pairs that reflect human perceptual dynamics. Across all StreamGaze tasks, we observe substantial performance gaps between state-of-the-art MLLMs and human performance, highlighting key limitations in gaze-based temporal reasoning, intention modeling, and proactive prediction. We further provide detailed analyses of gaze prompting strategies, reasoning behaviors, and task-specific failure modes, offering insights into current limitations and directions for future research. All data and code are publicly available to support continued research in gaze-guided streaming video understanding.
中文标题/摘要
标题:StreamGaze:基于凝视的时序推理与前瞻理解在流式视频中的应用
流式视频理解不仅需要模型处理按时间顺序到来的帧,还需要预测用户的意图以实现增强现实(AR)眼镜等现实应用。虽然先前的流式基准测试评估了时序推理,但没有衡量多模态大型语言模型(MLLMs)是否能在流式环境中解释或利用人类的凝视信号。为填补这一空白,我们引入了StreamGaze,这是第一个旨在评估MLLMs如何有效利用凝视进行流式视频中的时序和前瞻推理的基准测试。StreamGaze引入了凝视引导的过去、现在和前瞻任务,全面评估流式视频理解。这些任务评估模型是否能利用实时凝视信号跟随注意力的转移,并仅基于过去和当前观察到的帧推断用户意图。为了构建StreamGaze,我们开发了一种凝视-视频问答(QA)生成流水线,通过凝视提取、区域特定视觉提示和扫描路径构建,将第一人称视频与原始凝视轨迹对齐。该流水线生成时空定位的QA对,反映人类知觉动态。在所有StreamGaze任务中,我们观察到最先进的MLLMs与人类表现之间存在显著的性能差距,突显了基于凝视的时序推理、意图建模和前瞻预测的关键局限性。我们进一步详细分析了凝视提示策略、推理行为和任务特定的失败模式,提供了当前局限性和未来研究方向的见解。所有数据和代码均已公开,以支持持续的基于凝视的流式视频理解研究。
Summary / 总结
StreamGaze is a benchmark for evaluating how well MLLMs utilize gaze signals for temporal and proactive reasoning in streaming videos. It introduces gaze-guided past, present, and proactive tasks to assess models' ability to follow shifting attention and infer user intentions. The benchmark reveals significant performance gaps between state-of-the-art MLLMs and human performance, highlighting limitations in gaze-based temporal reasoning and proactive prediction. Detailed analyses of gaze prompting strategies and reasoning behaviors are provided to guide future research.
StreamGaze 是一个用于评估多模态大型语言模型 (MLLM) 如何利用人类注视信号进行流媒体视频中的时间与前瞻推理的基准。它引入了注视引导的过去、现在和前瞻任务,以评估模型跟随注意力转移和推断用户意图的能力。该基准揭示了最先进的 MLLM 在注视基于的时间推理、意图建模和前瞻预测方面的显著性能差距,提供了关于注视提示策略和推理行为的详细分析,为当前研究局限性和未来方向提供了见解。
Context-specific Credibility-aware Multimodal Fusion with Conditional Probabilistic Circuits
Authors: Pranuthi Tenali, Sahil Sidheekh, Saurabh Mathur, Erik Blasch, Kristian Kersting, Sriraam Natarajan
First: 2026-03-27T17:29:08+00:00 · Latest: 2026-03-27T17:29:08+00:00
Abstract
Multimodal fusion requires integrating information from multiple sources that may conflict depending on context. Existing fusion approaches typically rely on static assumptions about source reliability, limiting their ability to resolve conflicts when a modality becomes unreliable due to situational factors such as sensor degradation or class-specific corruption. We introduce C$^2$MF, a context-specific credibility-aware multimodal fusion framework that models per-instance source reliability using a Conditional Probabilistic Circuit (CPC). We formalize instance-level reliability through Context-Specific Information Credibility (CSIC), a KL-divergence-based measure computed exactly from the CPC. CSIC generalizes conventional static credibility estimates as a special case, enabling principled and adaptive reliability assessment. To evaluate robustness under cross-modal conflicts, we propose the Conflict benchmark, in which class-specific corruptions deliberately induce discrepancies between different modalities. Experimental results show that C$^2$MF improves predictive accuracy by up to 29% over static-reliability baselines in high-noise settings, while preserving the interpretability advantages of probabilistic circuit-based fusion.
中文标题/摘要
标题:基于条件概率电路的上下文特定可信度感知多模态融合
多模态融合需要整合来自多个可能在上下文中冲突的信息源。现有融合方法通常依赖于关于信息源可靠性的静态假设,限制了它们在由于传感器退化或类别特定的污染等因素导致模态变得不可靠时解决冲突的能力。我们引入了C$^2$MF,一种上下文特定的可信度感知多模态融合框架,使用条件概率电路(CPC)建模每个实例的信息源可靠性。我们通过上下文特定信息可信度(CSIC),一种基于KL散度的度量,精确计算CPC来形式化实例级别的可靠性。CSIC作为一种特殊情况,推广了传统的静态可信度估计,使可靠性评估更加原则化和适应性。为了在跨模态冲突下评估鲁棒性,我们提出了冲突基准,在该基准中,类别特定的污染故意导致不同模态之间的差异。实验结果表明,在高噪声环境中,C$^2$MF相比静态可靠性基线提高了高达29%的预测准确性,同时保持了基于概率电路融合的可解释性优势。
Summary / 总结
The research aims to address the limitations of static credibility assumptions in multimodal fusion by introducing C$^2$MF, which uses Conditional Probabilistic Circuits (CPC) to model per-instance source reliability. The framework computes Context-Specific Information Credibility (CSIC) to adaptively assess reliability, improving predictive accuracy by up to 29% in high-noise settings compared to static-reliability baselines. The Conflict benchmark evaluates robustness under cross-modal conflicts, showing that C$^2$MF outperforms static methods while maintaining interpretability.
研究旨在通过引入C$^2$MF框架解决静态可信度假设在多模态融合中的局限性,该框架使用条件概率电路(CPC)来建模实例级别的源可靠性。框架通过计算上下文特定信息可信度(CSIC)来适应性地评估可靠性,从而在高噪声环境中将预测准确性提高高达29%,优于静态可靠性基线。冲突基准测试了在跨模态冲突下的鲁棒性,表明C$^2$MF在保持解释性的同时优于静态方法。
EOGS++: Earth Observation Gaussian Splatting with Internal Camera Refinement and Direct Panchromatic Rendering
Authors: Pierrick Bournez, Luca Savant Aira, Thibaud Ehret, Gabriele Facciolo
First: 2025-11-20T16:54:17+00:00 · Latest: 2026-03-27T17:28:11+00:00
Comments: 8 pages, ISPRS
Abstract
Recently, 3D Gaussian Splatting has been introduced as a compelling alternative to NeRF for Earth observation, offering competitive reconstruction quality with significantly reduced training times. In this work, we extend the Earth Observation Gaussian Splatting (EOGS) framework to propose EOGS++, a novel method tailored for satellite imagery that directly operates on raw high-resolution panchromatic data without requiring external preprocessing. Furthermore, leveraging optical flow techniques, we embed bundle adjustment directly within the training process, avoiding reliance on external optimization tools while improving camera pose estimation. We also introduce several improvements to the original implementation, including early stopping and TSDF post-processing, all contributing to sharper reconstructions and better geometric accuracy. Experiments on the IARPA 2016 and DFC2019 datasets demonstrate that EOGS++ achieves state-of-the-art performance in terms of reconstruction quality and efficiency, outperforming the original EOGS method and other NeRF-based methods while maintaining the computational advantages of Gaussian Splatting. Our model reduces the mean MAE on buildings from 1.33 to 1.19 compared to the original EOGS models.
中文标题/摘要
标题:EOGS++:具有内部相机优化和直接全色渲染的地球观测高斯泼溅
最近,3D高斯泼溅被引入地球观测领域,成为NeRF的有力替代方案,在显著缩短训练时间的同时提供了有竞争力的重建质量。在本文中,我们扩展了地球观测高斯泼溅(EOGS)框架,提出了一种新的方法EOGS++,该方法专门针对卫星图像,可以直接在原始高分辨率全色数据上操作,无需外部预处理。此外,利用光流技术,我们将光束法平差直接嵌入到训练过程中,避免依赖外部优化工具,同时改进相机位姿估计。我们还对原始实现进行了多项改进,包括早停和TSDF后处理,所有这些改进都带来了更清晰的重建和更好的几何精度。在IARPA 2016和DFC2019数据集上的实验表明,EOGS++在重建质量和效率方面达到了最先进的性能,优于原始EOGS方法和其他基于NeRF的方法,同时保持了高斯泼溅的计算优势。与原始EOGS模型相比,我们的模型在建筑物上的平均MAE从1.33降低到1.19。
Summary / 总结
EOGS++ extends the Earth Observation Gaussian Splatting framework to directly process raw panchromatic satellite imagery, incorporating internal camera refinement and direct panchromatic rendering. It uses optical flow techniques to embed bundle adjustment within the training process, improving camera pose estimation. Experiments show EOGS++ outperforms the original EOGS and other NeRF-based methods with better reconstruction quality and efficiency, reducing mean MAE errors from 1.33 to 1.19 on buildings.
EOGS++ 扩展了地球观测高斯泼溅框架,直接处理原始高分辨率全色卫星图像,结合内部相机优化和直接全色渲染。通过在训练过程中嵌入光束法平差,并引入早停和TSDF后处理等改进,EOGS++ 达到了最先进的性能,几何精度更高,并且在建筑物上的平均 MAE 从 1.33 降低到 1.19,优于原始 EOGS 和其他基于 NeRF 的方法。
Towards single-shot coherent imaging via overlap-free ptychography
Authors: Oliver Hoidn, Aashwin Mishra, Steven Henke, Albert Vong, Matthew Seaberg
First: 2026-02-24T20:45:24+00:00 · Latest: 2026-03-27T17:27:45+00:00
Abstract
Ptychographic imaging at synchrotron and XFEL sources requires dense overlapping scans, limiting throughput and increasing dose. Extending coherent diffractive imaging to overlap-free operation on extended samples remains an open problem. Here, we extend PtychoPINN (O. Hoidn \emph{et al.}, \emph{Scientific Reports} \textbf{13}, 22789, 2023) to deliver \emph{overlap-free, single-shot} reconstructions in a Fresnel coherent diffraction imaging (CDI) geometry while also accelerating conventional multi-shot ptychography. The framework couples a differentiable forward model of coherent scattering with a Poisson photon-counting likelihood; real-space overlap enters as a tunable parameter via coordinate-based grouping rather than a hard requirement. On synthetic benchmarks, reconstructions remain accurate at low counts ($\sim\!10^4$ photons/frame), and overlap-free single-shot reconstruction with an experimental probe reaches amplitude structural similarity (SSIM) 0.904, compared with 0.968 for overlap-constrained reconstruction. Against a data-saturated supervised model with the same backbone (16,384 training images), PtychoPINN achieves higher SSIM with only 1,024 images and generalizes to unseen illumination profiles. Per-graphics processing unit (GPU) throughput is approximately $40\times$ that of least-squares maximum-likelihood (LSQ-ML) reconstruction at matched $128\times128$ resolution. These results, validated on experimental data from the Advanced Photon Source and the Linac Coherent Light Source, unify single-exposure Fresnel CDI and overlapped ptychography within one framework, supporting dose-efficient, high-throughput imaging at modern light sources.
中文标题/摘要
标题:通过无重叠ptychography实现单次曝光相干成像
同步辐射和XFEL光源上的ptychographic成像需要密集的重叠扫描,限制了吞吐量并增加了剂量。将相干衍射成像扩展到扩展样品上的无重叠操作仍然是一个开放问题。在这里,我们扩展了PtychoPINN(O. Hoidn等人,《Scientific Reports》13卷,22789,2023),使其在Fresnel相干衍射成像(CDI)几何结构中提供无重叠、单次曝光重建,同时加速传统的多次曝光ptychography。该框架将相干散射的可微分前向模型与泊松光子计数似然相结合;实空间重叠通过基于坐标的分组作为可调参数引入,而不是硬性要求。在合成基准测试中,重建在低计数(约10^4光子/帧)下仍然准确,使用实验探针的无重叠单次曝光重建达到幅度结构相似性(SSIM)0.904,而重叠约束重建为0.968。与具有相同骨干网络的数据饱和监督模型(16,384张训练图像)相比,PtychoPINN仅使用1,024张图像即可达到更高的SSIM,并且能够泛化到未见过的照明分布。在匹配的128×128分辨率下,每个图形处理单元(GPU)的吞吐量大约是最小二乘最大似然(LSQ-ML)重建的40倍。这些结果在先进光子源(Advanced Photon Source)和直线加速器相干光源(Linac Coherent Light Source)的实验数据上得到验证,将单次曝光Fresnel CDI和重叠ptychography统一在一个框架中,支持现代光源上剂量高效、高通量的成像。
Summary / 总结
This study addresses the challenge of dense overlapping scans in ptychographic imaging, which limit throughput and increase dose. The authors extend PtychoPINN to achieve overlap-free, single-shot reconstructions in Fresnel coherent diffraction imaging (CDI) while also accelerating conventional multi-shot ptychography. On synthetic benchmarks, the method maintains accuracy at low photon counts and achieves an SSIM of 0.904 for overlap-free single-shot reconstruction, compared to 0.968 for overlap-constrained reconstruction. The framework also demonstrates higher SSIM with fewer training images and better generalization to unseen illumination profiles, with processing throughput approximately 40 times faster than least-squares maximum-likelihood reconstruction.
该研究解决了密集重叠扫描在ptychographic成像中的限制,这些限制降低了吞吐量并增加了剂量。作者扩展了PtychoPINN,以实现Fresnel相干衍射成像(CDI)中的无重叠单曝光重建,并同时加速了传统的多曝光ptychography。在合成基准测试中,该方法在低光子计数下保持了准确性,并且无重叠单曝光重建的SSIM为0.904,而重叠约束重建的SSIM为0.968。该框架还展示了使用更少训练图像获得更高SSIM并更好地泛化到未见过的照明配置的能力,处理吞吐量大约是最小二乘最大似然重建的40倍。
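The abstract pairs a differentiable forward model with a Poisson photon-counting likelihood. The loss term alone can be sketched as below. This is a generic Poisson negative log-likelihood, not code from PtychoPINN: the function name `poisson_nll`, the `eps` guard, and the flat-list representation of a diffraction frame are assumptions made for illustration, and the forward model that would produce the predicted intensities is omitted.

```python
import math

def poisson_nll(counts, intensities, eps=1e-12):
    """Poisson negative log-likelihood of observed photon counts.

    counts:      measured photons per detector pixel
    intensities: predicted mean photons per pixel (from some forward model)
    The log(n!) term is dropped because it does not depend on the prediction.
    """
    return sum(lam - n * math.log(lam + eps) for n, lam in zip(counts, intensities))

observed = [4, 9]
print(poisson_nll(observed, [4.0, 9.0]))  # prediction matches the counts
print(poisson_nll(observed, [5.0, 8.0]))  # mismatched prediction
```

For each pixel, the term `lam - n * log(lam)` is minimized at `lam = n`, so the matching prediction yields the lower loss. Unlike a squared-error loss, this weighting follows counting statistics, which matters most at low photon counts such as the roughly 10^4 photons per frame regime mentioned above.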
Editable-DeepSC: Reliable Cross-Modal Semantic Communications for Facial Editing
Authors: Bin Chen, Wenbo Yu, Qinshan Zhang, Tianqu Zhuang, Hao Wu, Yong Jiang, Shu-Tao Xia
First: 2024-11-24T04:07:33+00:00 · Latest: 2026-03-27T17:19:35+00:00
Abstract
Interactive computer vision (CV) plays a crucial role in various real-world applications, whose performance is highly dependent on communication networks. Nonetheless, the data-oriented characteristics of conventional communications often do not align with the special needs of interactive CV tasks. To alleviate this issue, the recently emerged semantic communications only transmit task-related semantic information and exhibit a promising landscape to address this problem. However, the communication challenges associated with Semantic Facial Editing, one of the most important interactive CV applications on social media, still remain largely unexplored. In this paper, we fill this gap by proposing Editable-DeepSC, a novel cross-modal semantic communication approach for facial editing. Firstly, we theoretically discuss different transmission schemes that separately handle communications and editings, and emphasize the necessity of Joint Editing-Channel Coding (JECC) via iterative attributes matching, which integrates editings into the communication chain to preserve more semantic mutual information. To compactly represent the high-dimensional data, we leverage inversion methods via pre-trained StyleGAN priors for semantic coding. To tackle the dynamic channel noise conditions, we propose SNR-aware channel coding via model fine-tuning. Extensive experiments indicate that Editable-DeepSC can achieve superior editings while significantly saving the transmission bandwidth, even under high-resolution and out-of-distribution (OOD) settings.
中文标题/摘要
标题:Editable-DeepSC:可靠的跨模态语义通信在面部编辑中的应用
交互式计算机视觉(CV)在各种实际应用中发挥着重要作用,其性能高度依赖于通信网络。然而,传统通信的数据导向特性往往与交互CV任务的特殊需求不匹配。为了解决这一问题,新兴的语义通信仅传输与任务相关的语义信息,并展现出解决这一问题的前景。然而,与社交媒体上最重要的交互CV应用之一——语义面部编辑相关的通信挑战仍然很大程度上未被探索。在本文中,我们通过提出Editable-DeepSC,一种新的跨模态语义通信方法来填补这一空白,用于面部编辑。首先,我们从理论上讨论了不同的传输方案,分别处理通信和编辑,并强调了通过迭代属性匹配实现联合编辑-信道编码(JECC)的必要性,将编辑整合到通信链中以保留更多的语义互信息。为了紧凑地表示高维数据,我们利用预训练的StyleGAN先验进行语义编码。为了应对动态信道噪声条件,我们提出了基于模型微调的信噪比(SNR)感知信道编码。广泛的实验表明,Editable-DeepSC可以在显著节省传输带宽的同时实现更优的编辑,即使在高分辨率和离域(OOD)设置下也是如此。
Summary / 总结
The paper proposes Editable-DeepSC, a novel cross-modal semantic communication approach for facial editing to address the communication challenges in interactive computer vision tasks. It leverages pre-trained StyleGAN priors for semantic coding and SNR-aware channel coding via model fine-tuning to handle dynamic channel noise conditions. The approach achieves superior facial editing results while significantly reducing transmission bandwidth, even in high-resolution and out-of-distribution settings.
论文提出了Editable-DeepSC,一种用于面部编辑任务的新型跨模态语义通信方法。它解决了传统数据导向通信与交互式计算机视觉任务特定需求之间的不匹配问题。该方法涉及联合编辑-信道编码,将编辑过程集成到通信链中,并使用预训练的StyleGAN先验进行语义编码。此外,还提出了SNR感知信道编码以应对动态噪声条件。实验结果表明,Editable-DeepSC可以在高分辨率和离域分布场景下实现高质量的面部编辑,同时显著减少传输带宽。
Attention-Aligned Reasoning for Large Language Models
Authors: Hongxiang Zhang, Yuan Tian, Tianyi Zhang
First: 2025-10-03T17:56:33+00:00 · Latest: 2026-03-27T17:14:20+00:00
Abstract
Large Language Models (LLMs) tend to generate a long reasoning chain when solving complex tasks. However, as the reasoning chain extends, critical intermediate steps and the original prompt will be buried in the context, receiving insufficient attention and leading to errors. In this work, we present ATAR, a novel reasoning method that leverages the inherent reasoning structure to steer LLM attention. Our experiments show that ATAR outperforms SOTA methods across six benchmarks, achieving up to 15.39% absolute improvement. Furthermore, with ATAR, "non-reasoning" models achieve comparable or even better performance compared to reasoning models of the same size in most benchmarks. Finally, our ablation studies show that the attention alignment component contributes significantly, and that these improvements persist under different attention-steering backends.
中文标题/摘要
标题:注意力对齐推理用于大型语言模型
大型语言模型(LLMs)在解决复杂任务时倾向于生成长推理链。然而,随着推理链的延长,中间步骤和原始提示将被埋没在上下文中,受到不足的关注并导致错误。在本文中,我们提出了ATAR,这是一种新颖的推理方法,利用内在的推理结构引导LLM的注意力。我们的实验表明,ATAR在六个基准测试中均优于当前最佳方法,绝对改进幅度高达15.39%。此外,使用ATAR,“非推理”模型在大多数基准测试中的表现与相同规模的推理模型相当甚至更好。最后,我们的消融研究显示,注意力对齐组件贡献显著,并且这些改进在不同的注意力引导后端下仍然持续有效。
Summary / 总结
The research aims to address the issue of long reasoning chains in Large Language Models (LLMs) which can lead to errors due to insufficient attention to critical steps. ATAR, a novel reasoning method, is introduced to align attention with the reasoning structure. Experiments demonstrate that ATAR outperforms state-of-the-art methods across six benchmarks, with up to 15.39% improvement. Additionally, non-reasoning models using ATAR achieve comparable or better performance than reasoning models of the same size in most benchmarks. Ablation studies confirm the significant contribution of the attention alignment component and its effectiveness across different attention-steering backends.
研究旨在解决大型语言模型(LLMs)在解决复杂任务时生成长推理链路导致关键步骤得不到足够关注从而产生错误的问题。提出了ATAR,一种新的推理方法,通过与推理结构对齐注意力来解决这一问题。实验表明,ATAR在六个基准测试中优于最先进的方法,绝对改进幅度最高可达15.39%。此外,在大多数基准测试中,使用ATAR的非推理模型的性能与相同规模的推理模型相当甚至更好。消融研究显示,注意力对齐组件对于这些改进至关重要,并且这些改进在不同的注意力引导后端中仍然有效。
When to Think and When to Look: Uncertainty-Guided Lookback
Authors: Jing Bi, Filippos Bellos, Junjia Guo, Yayuan Li, Chao Huang, Yolo Y. Tang, Luchuan Song, Susan Liang, Zhongfei Mark Zhang, Jason J. Corso, Chenliang Xu
Venue: CVPR 2026
First: 2025-11-19T17:01:02+00:00 · Latest: 2026-03-27T17:10:24+00:00
Comments: Accepted to CVPR 2026
Abstract
Test-time thinking (that is, generating explicit intermediate reasoning chains) is known to boost performance in large language models and has recently shown strong gains for large vision language models (LVLMs). However, despite these promising results, there is still no systematic analysis of how thinking actually affects visual reasoning. We provide the first such analysis with a large scale, controlled comparison of thinking for LVLMs, evaluating ten variants from the InternVL3.5 and Qwen3-VL families on MMMU-val under generous token budgets and multi pass decoding. We show that more thinking is not always better; long chains often yield long wrong trajectories that ignore the image and underperform the same models run in standard instruct mode. A deeper analysis reveals that certain short lookback phrases, which explicitly refer back to the image, are strongly enriched in successful trajectories and correlate with better visual grounding. Building on this insight, we propose uncertainty guided lookback, a training free decoding strategy that combines an uncertainty signal with adaptive lookback prompts and breadth search. Our method improves overall MMMU performance, delivers the largest gains in categories where standard thinking is weak, and outperforms several strong decoding baselines, setting a new state of the art under fixed model families and token budgets. We further show that this decoding strategy generalizes, yielding consistent improvements on five additional benchmarks, including two broad multimodal suites and math focused visual reasoning datasets.
中文标题/摘要
标题:何时思考何时查看:基于不确定性回顾
测试时的思考(即生成明确的中间推理链)已被证明能提升大型语言模型的性能,并且最近在大型视觉语言模型(LVLM)中也显示出显著的提升。然而,尽管取得了这些有希望的结果,仍然没有系统地分析思考如何影响视觉推理。我们提供了首次此类分析,通过大规模、受控的比较,评估了来自InternVL3.5和Qwen3-VL家族的十个变体在MMMU-val上的表现,使用宽松的令牌预算和多轮解码。我们展示了更多的思考并不总是更好的;长链路往往导致错误的轨迹,忽视了图像,且表现不如标准指令模式运行的相同模型。更深入的分析表明,某些短回顾短语,明确地回溯到图像,强烈地丰富了成功的轨迹,并与更好的视觉定位相关。基于这一洞察,我们提出了基于不确定性回顾,一种无需训练的解码策略,结合不确定性信号和自适应回顾提示及广度搜索。我们的方法整体上提升了MMMU性能,在标准思考较弱的类别中带来了最大的收益,并优于几个强大的解码基线,设定了固定模型家族和令牌预算下的新最佳性能。我们进一步展示了这一解码策略的泛化能力,在五个额外的基准上产生了持续的改进,包括两个广泛的多模态套件和数学聚焦的视觉推理数据集。
Summary / 总结
This study investigates the impact of test-time thinking on visual reasoning in large vision language models (LVLMs). By comparing ten variants of InternVL3.5 and Qwen3-VL models, the researchers found that more thinking is not always beneficial, as long chains often lead to incorrect reasoning. They propose an uncertainty-guided lookback strategy that combines uncertainty signals with adaptive lookback prompts, which improves overall performance and outperforms several strong baselines, setting a new state-of-the-art on multiple benchmarks.
该研究探讨了测试时思考对大型视觉语言模型(LVLMs)视觉推理的影响。通过比较InternVL3.5和Qwen3-VL模型的十种变体,研究者发现更多的思考并不总是有益的,因为长的推理链往往会导致错误的推理。他们提出了一种基于不确定性指导的回溯策略,结合了不确定性信号与自适应回溯提示,这提高了整体性能,并在多个基准上优于多个强基线,设定了新的最先进水平。
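As a rough illustration of the uncertainty-gated lookback idea in the entry above, the sketch below inserts a short phrase that points back to the image whenever the next-token distribution is uncertain. The abstract does not say which uncertainty signal is used, so Shannon entropy, the `threshold` value, and the phrase text are all assumptions made here for illustration; the breadth search is omitted.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def maybe_insert_lookback(step_probs, threshold=1.0,
                          phrase="Looking back at the image, "):
    """Return a lookback phrase when the current decoding step is uncertain.

    The entropy signal, threshold, and phrase are hypothetical choices,
    not details taken from the paper.
    """
    return phrase if token_entropy(step_probs) > threshold else ""
```

A decoder loop would call `maybe_insert_lookback` on each step's distribution and append the returned text to the reasoning chain before continuing generation.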
Toward Efficient and Robust Behavior Models for Multi-Agent Driving Simulation
Authors: Fabian Konstantinidis, Moritz Sackmann, Ulrich Hofmann, Christoph Stiller
Venue: ICRA 2026
First: 2025-12-05T15:32:36+00:00 · Latest: 2026-03-27T17:10:19+00:00
Comments: This is the author's accepted version of a paper to appear in the IEEE International Conference on Robotics & Automation (ICRA 2026)
Abstract
Scalable multi-agent driving simulation requires behavior models that are both realistic and computationally efficient. We address this by optimizing the behavior model that controls individual traffic participants. To improve efficiency, we adopt an instance-centric scene representation, where each traffic participant and map element is modeled in its own local coordinate frame. This design enables efficient, viewpoint-invariant scene encoding and allows static map tokens to be reused across simulation steps. To model interactions, we employ a query-centric symmetric context encoder with relative positional encodings between local frames. We use Adversarial Inverse Reinforcement Learning to learn the behavior model and propose an adaptive reward transformation that automatically balances robustness and realism during training. Experiments demonstrate that our approach scales efficiently with the number of tokens, significantly reducing training and inference times, while outperforming several agent-centric baselines in terms of positional accuracy and robustness.
中文标题/摘要
标题:高效稳健的多智能体驾驶模拟行为模型研究
可扩展的多智能体驾驶模拟需要既现实又计算高效的行为模型。我们通过优化控制个体交通参与者的行为模型来解决这一问题。为了提高效率,我们采用以实例为中心的场景表示,其中每个交通参与者和地图元素都在自己的局部坐标系中建模。这种设计使得场景编码高效且视角不变,并允许静态地图标记在模拟步骤之间重用。为了建模交互,我们使用以查询为中心的对称上下文编码器,并在局部坐标系之间使用相对位置编码。我们使用对抗逆向强化学习来学习行为模型,并提出了一种自适应奖励转换,以在训练过程中自动平衡稳健性和现实性。实验表明,我们的方法随标记数量的增加能高效扩展,显著减少了训练和推理时间,并在位置准确性及稳健性方面优于几种以智能体为中心的基线方法。
Summary / 总结
The research aims to develop efficient and robust behavior models for multi-agent driving simulation. The method involves an instance-centric scene representation and a query-centric symmetric context encoder with relative positional encodings. The approach uses Adversarial Inverse Reinforcement Learning and an adaptive reward transformation to balance robustness and realism. Key findings show that the model scales efficiently with the number of tokens, reducing training and inference times, and outperforms agent-centric baselines in positional accuracy and robustness.
研究旨在开发适用于多智能体驾驶模拟的高效且稳健的行为模型。方法包括基于实例的场景表示和基于查询的对称上下文编码器,带有局部帧间的相对位置编码。该方法使用对抗逆强化学习和自适应奖励转换来平衡稳健性和现实性。实验表明,该模型能够高效地扩展到更多的标记物,减少训练和推理时间,并在位置精度和稳健性方面优于基于智能体的基线模型。
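The instance-centric representation in the entry above rests on expressing one element's pose relative to another element's local frame. A minimal sketch of that geometric step for planar (x, y, heading) poses follows; the function name and the tuple layout are assumptions for illustration, and the paper's actual positional encoding of these quantities is not reproduced here.

```python
import math

def relative_pose(frame_i, frame_j):
    """Pose of frame_j expressed in the local frame of frame_i.

    Each frame is (x, y, heading) in a shared world frame, heading in radians.
    The result does not change if both frames are moved by the same rigid
    transform, which is what makes the encoding viewpoint-invariant.
    """
    xi, yi, ti = frame_i
    xj, yj, tj = frame_j
    dx, dy = xj - xi, yj - yi
    c, s = math.cos(ti), math.sin(ti)
    # Rotate the world-frame offset by -ti to land in frame i's axes.
    rx = c * dx + s * dy
    ry = -s * dx + c * dy
    # Wrap the heading difference into (-pi, pi].
    dtheta = math.atan2(math.sin(tj - ti), math.cos(tj - ti))
    return rx, ry, dtheta
```

Because only relative quantities enter the encoder, static map elements keep the same local description from step to step, which is consistent with the token reuse the abstract describes.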
Benchmarking Tabular Foundation Models for Conditional Density Estimation in Regression
Authors: Rafael Izbicki, Pedro L. C. Rodrigues
First: 2026-03-27T17:07:21+00:00 · Latest: 2026-03-27T17:07:21+00:00
Abstract
Conditional density estimation (CDE) - recovering the full conditional distribution of a response given tabular covariates - is essential in settings with heteroscedasticity, multimodality, or asymmetric uncertainty. Recent tabular foundation models, such as TabPFN and TabICL, naturally produce predictive distributions, but their effectiveness as general-purpose CDE methods has not been systematically evaluated, unlike their performance for point prediction, which is well studied. We benchmark three tabular foundation model variants against a diverse set of parametric, tree-based, and neural CDE baselines on 39 real-world datasets, across training sizes from 50 to 20,000, using six metrics covering density accuracy, calibration, and computation time. Across all sample sizes, foundation models achieve the best CDE loss, log-likelihood, and CRPS on the large majority of datasets tested. Calibration is competitive at small sample sizes but, for some metrics and datasets, lags behind task-specific neural baselines at larger sample sizes, suggesting that post-hoc recalibration may be a valuable complement. In a photometric redshift case study using SDSS DR18, TabPFN exposed to 50,000 training galaxies outperforms all baselines trained on the full 500,000-galaxy dataset. Taken together, these results establish tabular foundation models as strong off-the-shelf conditional density estimators.
中文标题/摘要
标题:基于表格基础模型的回归条件密度估计基准测试
条件密度估计(CDE)- 在给定表格协变量的情况下恢复响应的完整条件分布 - 在异方差性、多模态或不对称不确定性的情况下至关重要。最近的表格基础模型,如TabPFN和TabICL,自然地生成预测分布,但它们作为通用目的CDE方法的有效性尚未系统评估,与它们在点预测方面的表现相比,后者的研究较为充分。我们使用6个涵盖密度准确性、校准和计算时间的指标,在39个真实世界数据集上,针对训练大小从50到20,000的多种表格基础模型变体,与参数化、树基和神经CDE基线进行基准测试。在所有样本量下,基础模型在大多数测试数据集上实现了最佳的CDE损失、对数似然和CRPS。校准在小样本量下具有竞争力,但在某些指标和数据集上,对于较大的样本量,落后于特定任务的神经基线,这表明事后校准可能是一个有价值的补充。在使用SDSS DR18的光度红移案例研究中,TabPFN在暴露于50,000个训练星系时,优于在完整500,000个星系数据集上训练的所有基线。综上所述,这些结果确立了表格基础模型作为强大的即用型条件密度估计器的地位。
Summary / 总结
This study benchmarks three tabular foundation model variants for conditional density estimation (CDE) against parametric, tree-based, and neural CDE baselines on 39 real-world datasets, with training sizes from 50 to 20,000. Across all sample sizes, the foundation models achieve the best CDE loss, log-likelihood, and CRPS on the large majority of datasets. Calibration is competitive at small sample sizes but, for some metrics and datasets, lags behind task-specific neural baselines at larger sample sizes, suggesting that post-hoc recalibration may be a useful complement. In a photometric redshift case study using SDSS DR18, TabPFN given 50,000 training galaxies outperforms all baselines trained on the full 500,000-galaxy dataset.
研究评估了表格基础模型在条件密度估计(CDE)方面的性能,比较了它们与其他参数、树基和神经CDE方法在多种数据集和样本大小上的表现。研究发现,在所有样本量下,表格基础模型在大多数数据集上的CDE损失、对数似然和CRPS均优于其他方法。然而,在较大样本量下,这些模型在部分指标和数据集上的校准落后于特定任务的神经基线模型,表明事后再校准可能是有价值的补充。SDSS DR18数据上的光度红移案例研究显示,使用50,000个训练星系的TabPFN优于在完整500,000个星系数据集上训练的所有基线模型。
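CRPS, one of the metrics named in the entry above, can be estimated directly from samples of a predictive distribution using the identity CRPS = E|X - y| - 0.5 E|X - X'|. The sketch below is a plain sample-based estimator offered as an illustration; it is not the benchmark's evaluation code, and the function name is an assumption.

```python
def crps_from_samples(samples, y):
    """Sample-based CRPS estimate for one observation y (lower is better).

    Uses CRPS = E|X - y| - 0.5 * E|X - X'|, with both expectations replaced
    by averages over the given predictive samples.
    """
    m = len(samples)
    accuracy = sum(abs(s - y) for s in samples) / m
    spread = sum(abs(a - b) for a in samples for b in samples) / (2 * m * m)
    return accuracy - spread
```

For a point forecast (a single sample) the estimate reduces to the absolute error, which is why CRPS is often read as a distributional generalization of MAE.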
Think over Trajectories: Leveraging Video Generation to Reconstruct GPS Trajectories from Cellular Signaling
Authors: Ruixing Zhang, Hanzhang Jiang, Leilei Sun, Liangzhe Han, Jibin Wang, Weifeng Lv
First: 2026-03-27T17:07:13+00:00 · Latest: 2026-03-27T17:07:13+00:00
Abstract
Mobile devices continuously interact with cellular base stations, generating massive volumes of signaling records that provide broad coverage for understanding human mobility. However, such records offer only coarse location cues (e.g., serving-cell identifiers), which limits their direct use in applications that require high-precision GPS trajectories. This paper studies the Sig2GPS problem: reconstructing GPS trajectories from cellular signaling. Inspired by how domain experts often lay the signaling trace on a map and sketch the corresponding GPS route, and unlike conventional solutions that rely on complex multi-stage engineering pipelines or regress coordinates, Sig2GPS is reframed as an image-to-video generation task that operates directly in the map-visual domain: signaling traces are rendered on a map, and a video generation model is trained to draw a continuous GPS path. To support this paradigm, a paired signaling-to-trajectory video dataset is constructed to fine-tune an open-source video model, and a trajectory-aware reinforcement learning-based optimization method is introduced to improve generation fidelity via rewards. Experiments on large-scale real-world datasets show substantial improvements over strong engineered and learning-based baselines, while additional results on next GPS prediction indicate scalability and cross-city transferability. Overall, these results suggest that map-visual video generation provides a practical interface for trajectory data mining by enabling direct generation and refinement of continuous paths under map constraints.
中文标题/摘要
标题:思考轨迹:利用视频生成从蜂窝信号重建GPS轨迹
移动设备不断与蜂窝基站交互,生成大量信号记录,这些记录广泛覆盖了人类移动的理解。然而,这些记录仅提供粗略的位置线索(例如,服务小区标识符),因此限制了它们在需要高精度GPS轨迹的应用中的直接使用。本文研究了Sig2GPS问题:从蜂窝信号重建GPS轨迹。受领域专家通常将信号轨迹放在地图上并勾勒相应的GPS路线的启发,不同于传统的依赖复杂多阶段工程管道或回归坐标的方法,Sig2GPS重新定义为一个图像到视频生成任务,直接在地图视觉域中操作:信号轨迹在地图上呈现,训练视频生成模型绘制连续的GPS路径。为了支持这一范式,构建了一个配对的信号到轨迹视频数据集,以微调开源视频模型,并引入了一种轨迹感知的强化学习优化方法,通过奖励提高生成精度。在大规模真实世界数据集上的实验显示,与强大的工程和基于学习的基线相比有显著改进,而额外的下一次GPS预测结果表明其可扩展性和跨城市转移性。总体而言,这些结果表明,地图视觉视频生成为轨迹数据挖掘提供了一个实用的接口,通过在地图约束下直接生成和优化连续路径。
Summary / 总结
This paper addresses the Sig2GPS problem, which involves reconstructing high-precision GPS trajectories from coarse cellular signaling records. It reframes this task as an image-to-video generation problem, where signaling traces are rendered on a map and a video generation model is trained to produce continuous GPS paths. The authors construct a paired signaling-to-trajectory video dataset and use a trajectory-aware reinforcement learning method to enhance the model's accuracy. Experiments on large-scale real-world datasets demonstrate significant improvements over existing methods, and the model shows scalability and cross-city transferability in next GPS prediction tasks.
本文解决了从粗略的蜂窝信号记录重建高精度GPS轨迹的Sig2GPS问题。它将任务重新定义为图像到视频生成问题,其中信号轨迹在地图上渲染,然后训练视频生成模型生成连续的GPS路径。作者构建了一个配对的信号到轨迹视频数据集,并引入了一种轨迹感知的强化学习方法来提高模型的准确性。实验结果显示该方法在现有方法上取得了显著改进,并且在下一个GPS预测任务中展示了可扩展性和跨城市转移性。
Hardware-Aware Tensor Networks for Real-Time Quantum-Inspired Anomaly Detection at Particle Colliders
Authors: Sagar Addepalli, Prajita Bhattarai, Abhilasha Dave, Julia Gonski
First: 2026-03-27T17:02:33+00:00 · Latest: 2026-03-27T17:02:33+00:00
Comments: 28 pages, 9 figures
Abstract
Quantum machine learning offers the ability to capture complex correlations in high-dimensional feature spaces, crucial for the challenge of detecting beyond the Standard Model physics in collider events, along with the potential for unprecedented computational efficiency in future quantum processors. Near-term utilization of these benefits can be achieved by developing quantum-inspired algorithms for deployment on classical hardware to enable applications at the "edge" of current scientific experiments. This work demonstrates the use of tensor networks for real-time anomaly detection in collider detectors. A spaced matrix product operator (SMPO) is developed that provides sensitivity to a variety of beyond the Standard Model benchmarks, and can be implemented in field programmable gate array hardware with resources and latency consistent with trigger deployment. The cascaded SMPO architecture is introduced as an SMPO variation that affords greater flexibility and efficiency in ways that are key to edge applications in resource-constrained environments. These results reveal the benefit and near-term feasibility of deploying quantum-inspired ML in high energy colliders.
中文标题/摘要
标题:硬件感知张量网络在粒子对撞机实时量子启发异常检测中的应用
量子机器学习能够捕捉高维特征空间中的复杂相关性,这对于在对撞事件中检测超出标准模型的物理现象至关重要,同时还有可能在未来量子处理器中实现前所未有的计算效率。通过开发可在经典硬件上部署的量子启发算法,可以在近期利用这些优势,从而支持当前科学实验“边缘”端的应用。本研究展示了张量网络在对撞机探测器中进行实时异常检测的应用。开发了一种间隔矩阵乘积算符(SMPO),它对多种超出标准模型的基准具有敏感性,并且可以在现场可编程门阵列硬件中实现,其资源和延迟与触发系统部署的要求一致。介绍了级联SMPO架构作为SMPO的一种变体,它在资源受限环境的边缘应用所需的关键方面提供了更大的灵活性和效率。这些结果揭示了在高能对撞机中部署量子启发的ML的益处及其近期可行性。
VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward
Authors: Zhaochong An, Orest Kupyn, Théo Uscidda, Andrea Colaco, Karan Ahuja, Serge Belongie, Mar Gonzalez-Franco, Marta Tintore Gazulla
First: 2026-03-27T16:57:51+00:00 · Latest: 2026-03-27T16:57:51+00:00
Comments: Project Page: https://zhaochongan.github.io/projects/VGGRPO
Abstract
Large-scale video diffusion models achieve impressive visual quality, yet often fail to preserve geometric consistency. Prior approaches improve consistency either by augmenting the generator with additional modules or applying geometry-aware alignment. However, architectural modifications can compromise the generalization of internet-scale pretrained models, while existing alignment methods are limited to static scenes and rely on RGB-space rewards that require repeated VAE decoding, incurring substantial compute overhead and failing to generalize to highly dynamic real-world scenes. To preserve the pretrained capacity while improving geometric consistency, we propose VGGRPO (Visual Geometry GRPO), a latent geometry-guided framework for geometry-aware video post-training. VGGRPO introduces a Latent Geometry Model (LGM) that stitches video diffusion latents to geometry foundation models, enabling direct decoding of scene geometry from the latent space. By constructing LGM from a geometry model with 4D reconstruction capability, VGGRPO naturally extends to dynamic scenes, overcoming the static-scene limitations of prior methods. Building on this, we perform latent-space Group Relative Policy Optimization with two complementary rewards: a camera motion smoothness reward that penalizes jittery trajectories, and a geometry reprojection consistency reward that enforces cross-view geometric coherence. Experiments on both static and dynamic benchmarks show that VGGRPO improves camera stability, geometry consistency, and overall quality while eliminating costly VAE decoding, making latent-space geometry-guided reinforcement an efficient and flexible approach to world-consistent video generation.
中文标题/摘要
标题:VGGRPO:通过4D潜在奖励实现世界一致的视频生成
大规模视频扩散模型在视觉质量方面取得了显著成就,但往往无法保持几何一致性。先前的方法通过在生成器中增加额外模块或应用几何感知对齐来提高一致性。然而,架构修改可能会损害互联网规模预训练模型的泛化能力,而现有的对齐方法仅适用于静态场景,并依赖于RGB空间奖励,这需要反复进行VAE解码,导致计算开销巨大且无法泛化到高度动态的真实世界场景。为了保持预训练能力的同时提高几何一致性,我们提出了VGGRPO(视觉几何GRPO),这是一种几何感知的视频后训练框架。VGGRPO引入了一个潜在几何模型(LGM),将视频扩散潜在变量与几何基础模型缝合在一起,从而可以直接从潜在空间解码场景几何。通过从具有4D重建能力的几何模型构建LGM,VGGRPO自然地扩展到动态场景,克服了先前方法的静态场景限制。在此基础上,我们使用两种互补奖励进行潜在空间组相对策略优化:一种是摄像机运动平滑奖励,惩罚抖动轨迹;另一种是几何再投影一致性奖励,确保跨视图几何一致性。在静态和动态基准上的实验表明,VGGRPO提高了摄像机稳定性、几何一致性和整体质量,同时消除了昂贵的VAE解码,使潜在空间几何引导强化成为一种高效且灵活的世界一致视频生成方法。
Summary / 总结
VGGRPO is a latent geometry-guided framework that improves geometric consistency in video generation by introducing a Latent Geometry Model (LGM) and using Group Relative Policy Optimization (GRPO) with two rewards: camera motion smoothness and geometry reprojection consistency. This approach enhances camera stability, geometry consistency, and overall quality without the need for VAE decoding, making it efficient and flexible for world-consistent video generation.
VGGRPO 是一种用于提高视频生成中几何一致性的潜空间几何引导框架。它引入了潜空间几何模型(LGM),将视频扩散潜空间与几何基础模型缝合在一起,从而可以直接从潜空间解码场景几何。通过使用具有4D重建能力的几何模型,VGGRPO 自然地扩展到动态场景。该框架还使用潜空间组相对策略优化,并采用两种奖励:一种用于惩罚摇晃的摄像机轨迹,另一种用于确保跨视图几何一致性。实验表明,VGGRPO 提升了摄像机稳定性、几何一致性和整体视频质量,同时无需进行昂贵的 VAE 解码。
From Static to Dynamic: Exploring Self-supervised Image-to-Video Representation Transfer Learning
Authors: Yang Liu, Qianqian Xu, Peisong Wen, Siran Dai, Xilin Zhao, Qingming Huang
Venue: CVPR 2026
First: 2026-03-27T16:56:50+00:00 · Latest: 2026-03-27T16:56:50+00:00
Comments: Accepted at CVPR 2026
Abstract
Recent studies have made notable progress in video representation learning by transferring image-pretrained models to video tasks, typically with complex temporal modules and video fine-tuning. However, fine-tuning heavy modules may compromise inter-video semantic separability, i.e., the essential ability to distinguish objects across videos, while reducing the tunable parameters hinders intra-video temporal consistency, which is required for stable representations of the same object within a video. This dilemma indicates a potential trade-off between intra-video temporal consistency and inter-video semantic separability during image-to-video transfer. To this end, we propose the Consistency-Separability Trade-off Transfer Learning (Co-Settle) framework, which applies a lightweight projection layer on top of the frozen image-pretrained encoder to adjust the representation space with a temporal cycle consistency objective and a semantic separability constraint. We further provide theoretical support showing that the optimized projection yields a better trade-off between the two properties under appropriate conditions. Experiments on eight image-pretrained models demonstrate consistent improvements across multiple levels of video tasks with only five epochs of self-supervised training. The code is available at https://github.com/yafeng19/Co-Settle.
中文标题/摘要
标题:从静态到动态:探索自监督图像到视频表示迁移学习
近期研究在通过将图像预训练模型转移到视频任务中进行视频表示学习方面取得了显著进展,通常使用复杂的时序模块和视频微调。然而,微调重模块可能会损害跨视频语义可分性,即区分不同视频中对象的基本能力。同时,减少可调参数会妨碍其在视频内的时序一致性,这是在视频内稳定表示同一对象所需的要求。这种困境表明,在图像到视频的转移过程中,时序一致性和跨视频语义可分性之间可能存在潜在的权衡。为此,我们提出了轻量级投影层的Consistency-Separability Trade-off Transfer Learning (Co-Settle)框架,在冻结的图像预训练编码器之上应用,通过时序循环一致性目标和语义可分性约束来调整表示空间。我们进一步提供了理论支持,表明在适当条件下优化的投影能够更好地在两种属性之间取得权衡。在八个图像预训练模型上的实验表明,仅通过五轮自监督训练即可在多个视频任务级别上实现一致的改进。代码可在https://github.com/yafeng19/Co-Settle/获取。
Summary / 总结
This paper addresses the trade-off between intra-video temporal consistency and inter-video semantic separability in image-to-video representation transfer learning. It proposes the Co-Settle framework, which uses a lightweight projection layer with a temporal cycle consistency objective and a semantic separability constraint to optimize the representation space. Experiments show consistent improvements across various video tasks with minimal self-supervised training epochs.
该研究解决了从图像到视频表示迁移学习中的内在视频时间一致性和跨视频语义可分性之间的权衡问题。提出了Co-Settle框架,使用轻量级投影层调整表示空间,并结合时间循环一致性目标和语义可分性约束。实验结果显示,在少量自我监督训练周期内,该方法在多种视频任务上表现出一致的改进。
Characterization and forecasting of national-scale solar power ramp events
Authors: Luca Lanzilao, Angela Meyer
First: 2026-03-27T16:56:46+00:00 · Latest: 2026-03-27T16:56:46+00:00
Abstract
The rapid growth of solar energy is reshaping power system operations and increasing the complexity of grid management. As photovoltaic (PV) capacity expands, short-term fluctuations in PV generation introduce substantial operational uncertainty. At the same time, solar power ramp events intensify risks of grid instability and unplanned outages due to sudden large power fluctuations. Accurate identification, forecasting and mitigation of solar ramp events are therefore critical to maintaining grid stability. In this study, we analyze two years of PV power production from 6434 PV stations at 15-minute resolution. We develop quantitative metrics to define solar ramp events and systematically characterize their occurrence, frequency, and magnitude at a national scale. Furthermore, we examine the meteorological drivers of ramp events, highlighting the role of mesoscale cloud systems. In particular, we observe that ramp-up events are typically associated with cloud dissipation during the morning, while ramp-down events commonly occur when cloud cover increases in the afternoon. Additionally, we adopt a recently developed spatiotemporal forecasting framework to evaluate both deterministic and probabilistic PV power forecasts derived from deep learning and physics-based models, including SolarSTEPS, SHADECast, IrradianceNet, and IFS-ENS. The results show that SHADECast is the most reliable model, achieving a CRPS 10.8% lower than that of SolarSTEPS at a two-hour lead time. Nonetheless, state-of-the-art nowcasting models struggle to capture ramp dynamics, with forecast RMSE increasing by up to 50% compared to normal operating conditions. Overall, these results emphasize the need for improved high-resolution spatiotemporal modelling to enhance ramp prediction skill and support the reliable integration of large-scale solar generation into power systems.
中文标题/摘要
标题:国家尺度太阳能功率骤变事件的表征与预测
太阳能的快速增长正在重塑电力系统运营并增加电网管理的复杂性。随着光伏(PV)容量的扩大,光伏发电的短期波动引入了重大的运营不确定性。同时,太阳能功率骤变事件由于突然的大功率波动加剧了电网不稳定性及非计划停电的风险。因此,准确识别、预测和缓解太阳能骤变事件对于维持电网稳定性至关重要。在本研究中,我们分析了来自6434个光伏站两年的15分钟分辨率的光伏电力生产数据。我们开发了定量指标来定义太阳能骤变事件,并系统地表征了其在全国尺度上的发生情况、频率和幅度。此外,我们还研究了骤变事件的气象驱动因素,突出了中尺度云系统的作用。特别是,我们观察到,功率上升事件通常与早晨云层消散有关,而功率下降事件通常发生在下午云量增加时。此外,我们采用了一种最近开发的时空预测框架来评估基于深度学习和物理模型的确定性和概率光伏功率预测,包括SolarSTEPS、SHADECast、IrradianceNet和IFS-ENS。结果表明,SHADECast是最可靠的模型,在两小时预报时效下,其CRPS比SolarSTEPS低10.8%。然而,最先进的临近预报模型难以捕捉骤变动态,与正常运行条件相比,预测RMSE增加了高达50%。总体而言,这些结果强调了需要改进高分辨率的时空建模以提高骤变预测技能,并支持大规模太阳能发电可靠地并入电力系统。
Summary / 总结
This study aims to characterize and forecast solar power ramp events to enhance grid stability. By analyzing two years of PV power data from 6434 stations, the researchers developed metrics to define and quantify ramp events, identifying meteorological drivers such as cloud systems. They also evaluated various forecasting models, finding SHADECast to be the most reliable, with a 10.8% lower Continuous Ranked Probability Score (CRPS) compared to SolarSTEPS at a two-hour lead time. However, existing models struggle to accurately capture ramp dynamics, with forecast errors increasing significantly during these events.
该研究通过分析来自6434个站点的两年15分钟分辨率的光伏电力数据,定义并表征了太阳能骤变事件,识别了其气象驱动因素,并评估了多种预测模型。研究发现,SHADECast是最可靠的模型,在两小时预报时效下其连续分级概率评分(CRPS)比SolarSTEPS低10.8%;但现有模型难以准确捕捉骤变动态,在这些事件期间预测误差明显增大。
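The entry above defines ramp events through quantitative metrics that the abstract does not spell out. One common way to flag ramps in a 15-minute series is to compare the capacity-normalized change in power over a fixed window against a threshold, sketched below. The one-hour window (`window=4` steps), the 20% threshold, and the function name are assumptions for illustration, not the authors' definition.

```python
def detect_ramps(power, capacity, window=4, threshold=0.2):
    """Flag ramp-up and ramp-down events in a regularly sampled power series.

    power:     list of power values at 15-minute resolution
    capacity:  installed capacity, in the same unit as power
    window:    number of steps to look back (4 steps = 1 hour, hypothetical)
    threshold: minimum |change| as a fraction of capacity (hypothetical)
    Returns a list of (index, "up" or "down", normalized_change).
    """
    events = []
    for t in range(window, len(power)):
        delta = (power[t] - power[t - window]) / capacity
        if delta >= threshold:
            events.append((t, "up", delta))
        elif delta <= -threshold:
            events.append((t, "down", delta))
    return events
```

Consecutive flagged steps belong to the same physical ramp, so a real analysis would merge adjacent detections before counting events.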
The Limits of Learning from Pictures and Text: Vision-Language Models and Embodied Scene Understanding
Authors: Gillian Rosenberg, Skylar Stadhard, Bruce C. Hansen, Michelle R. Greene
First: 2026-03-27T16:52:46+00:00 · Latest: 2026-03-27T16:52:46+00:00
Comments: 7 figures, 5 tables
Abstract
What information is sufficient to learn the full richness of human scene understanding? The distributional hypothesis holds that the statistical co-occurrence of language and images captures the conceptual knowledge underlying visual cognition. Vision-language models (VLMs) are trained on massive paired text-image corpora but lack embodied experience, making them an ideal test of the distributional hypothesis. We report two experiments comparing descriptions generated by 18 VLMs to those of over 2000 human observers across 15 high-level scene understanding tasks, spanning general knowledge, affordances, sensory experiences, affective responses, and future prediction. Because many tasks lack ground truth answers, we developed a Human-Calibrated Cosine Distance (HCD) metric that measures VLM output similarity to the distribution of human responses, scaled by within-human variability. In Experiment 1, VLMs approached human-level performance on general knowledge tasks, but showed a robust deficit for affordance tasks that resisted prompt engineering and did not improve with newer model releases. In Experiment 2, we tested six mechanistic hypotheses for explaining this affordance gap, finding that the deficit was structural rather than stylistic and was not resolved by providing explicit spatial information. Corpus analyses revealed that image captioning datasets contain sparse agent-addressed affordance language, consistent with Gricean accounts of why embodied knowledge may be systematically underrepresented in language. Together, these findings suggest that distributional learning from images and text is insufficient for affordance-based scene understanding, implying that some dimensions of human visual cognition may require the kind of agent-centered, three-dimensional experience that no photograph or caption can encode.
中文标题/摘要
标题:从图片和文本学习的局限性:视觉语言模型与具身场景理解
什么样的信息足以学习人类场景理解的全部丰富性?分布假设认为,语言和图像的统计共现捕捉了视觉认知的概念知识。视觉语言模型(VLMs)在大规模配对的文本-图像语料库上进行训练,但缺乏具身经验,使其成为检验分布假设的理想测试。我们报告了两项实验,比较了18个VLMs生成的描述与超过2000名人类观察者在15项高层场景理解任务中的描述,这些任务涵盖了常识、功能、感官体验、情感反应和未来预测。由于许多任务缺乏真实答案,我们开发了一种基于人类校准余弦距离(HCD)度量,衡量VLM输出与人类响应分布的相似性,按人类内部变异性缩放。在实验1中,VLMs在常识任务上接近人类水平的表现,但在功能任务上表现出明显的缺陷,这些任务抵抗了提示工程且在新模型版本中没有改善。在实验2中,我们测试了六个解释这种功能差距的机制假设,发现缺陷是结构性的而非风格性的,提供显式空间信息也无法解决。语料库分析表明,图像字幕数据集中包含稀疏的以代理为中心的功能语言,这与格赖斯关于为什么具身知识可能系统性地在语言中被低估的解释一致。这些发现共同表明,从图像和文本中进行分布学习不足以进行基于功能的场景理解,暗示人类视觉认知的一些维度可能需要像照片或字幕无法编码的以代理为中心的三维体验。
Summary / 总结
The study investigates whether vision-language models (VLMs) can achieve human-level scene understanding, particularly in tasks related to affordances. VLMs, trained on large text-image datasets, were compared against human responses across 15 tasks. Experiment 1 showed VLMs performed well on general knowledge tasks but struggled with affordance tasks, which did not improve with newer model versions. Experiment 2 tested six hypotheses and found the deficit was structural rather than stylistic, suggesting distributional learning is insufficient for understanding affordances. The findings imply that some aspects of visual cognition require embodied, three-dimensional experience that cannot be fully captured by images and text alone.
研究探讨了视觉语言模型(VLMs)是否能在场景理解任务中达到人类水平,特别是与功能相关的任务。VLMs基于大规模图文数据集进行训练,并与人类在15项任务上的反应进行了比较。实验1显示,VLMs在一般知识任务上表现良好,但在功能任务上却表现出明显不足,即使使用了更新的模型版本也未能改善。实验2测试了六个假设,发现不足是结构性的,而不是风格性的,并且通过提供空间信息也无法解决。研究结果表明,从图像和文本中进行分布学习不足以理解功能,暗示某些视觉认知维度可能需要类似行动者中心的三维体验,而这种体验无法通过照片或描述来编码。
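The Human-Calibrated Cosine Distance (HCD) in the entry above is described only as a VLM-to-human similarity measure scaled by within-human variability. One plausible reading, sketched below, divides the mean cosine distance between a model response embedding and the human response embeddings by the mean pairwise cosine distance among the humans. This formula, the function names, and the use of plain embedding vectors are assumptions; the paper's exact definition may differ.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def human_calibrated_distance(model_vec, human_vecs):
    """Model-to-human distance scaled by human-to-human distance.

    Values near 1 mean the model is about as far from people as people are
    from each other (under this hypothetical formulation). Requires at least
    two human vectors that are not all identical in direction.
    """
    model_to_human = sum(cosine_distance(model_vec, h)
                         for h in human_vecs) / len(human_vecs)
    pairs = [(i, j) for i in range(len(human_vecs))
             for j in range(i + 1, len(human_vecs))]
    within_human = sum(cosine_distance(human_vecs[i], human_vecs[j])
                       for i, j in pairs) / len(pairs)
    return model_to_human / within_human
```

Scaling by within-human variability keeps tasks with naturally diverse human answers from being counted against the model.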
From Synthetic Data to Real Restorations: Diffusion Model for Patient-specific Dental Crown Completion
Authors: Dávid Pukanec, Tibor Kubík, Michal Španěl
First: 2026-03-27T16:51:40+00:00 · Latest: 2026-03-27T16:51:40+00:00
Comments: VISAPP 2026 Conference
Abstract
We present ToothCraft, a diffusion-based model for the contextual generation of tooth crowns, trained on artificially created incomplete teeth. Building upon recent advancements in conditioned diffusion models for 3D shapes, we developed a model capable of an automated tooth crown completion conditioned on local anatomical context. To address the lack of training data for this task, we designed an augmentation pipeline that generates incomplete tooth geometries from a publicly available dataset of complete dental arches (3DS, ODD). By synthesising a diverse set of training examples, our approach enables robust learning across a wide spectrum of tooth defects. Experimental results demonstrate the strong capability of our model to reconstruct complete tooth crowns, achieving an intersection over union (IoU) of 81.8% and a Chamfer Distance (CD) of 0.00034 on synthetically damaged testing restorations. Our experiments demonstrate that the model can be applied directly to real-world cases, effectively filling in incomplete teeth, while generated crowns show minimal intersection with the opposing dentition, thus reducing the risk of occlusal interference. Access to the code, model weights, and dataset information will be available at: https://github.com/ikarus1211/VISAPP_ToothCraft
中文标题/摘要
标题:从合成数据到真实修复:基于扩散模型的患者特定牙冠完成
我们提出了ToothCraft,一种基于扩散的模型,用于结合上下文生成牙冠,该模型在人工创建的不完整牙齿上进行训练。基于最近在3D形状上条件扩散模型的进展,我们开发了一种能够以局部解剖上下文为条件自动补全牙冠的模型。为了解决该任务中训练数据不足的问题,我们设计了一条增强管道,从公开可用的完整牙弓数据集(3DS,ODD)中生成不完整的牙齿几何形状。通过合成多样化的训练示例,我们的方法能够在广泛的牙齿缺陷范围内实现稳健的学习。实验结果表明,我们的模型在重建完整牙冠方面具有强大的能力,在合成损坏的测试修复体上交并比(IoU)为81.8%,倒角距离(CD)为0.00034。我们的实验表明,该模型可以直接应用于实际病例,有效地填补不完整的牙齿,而生成的牙冠与对颌牙齿的交集最小,从而降低了咬合干扰的风险。代码、模型权重和数据集信息将在以下地址提供:https://github.com/ikarus1211/VISAPP_ToothCraft
Summary / 总结
ToothCraft is a diffusion-based model for generating complete tooth crowns from incomplete teeth, trained on synthetic data. It uses an augmentation pipeline to create diverse training examples from a public dataset of complete dental arches. The model achieves an IoU of 81.8% and a CD of 0.00034 on synthetically damaged testing restorations and can be applied directly to real-world cases, effectively filling in incomplete teeth, while the generated crowns show minimal intersection with the opposing dentition, reducing the risk of occlusal interference.
ToothCraft 是一种基于扩散的模型,用于从不完整牙齿生成完整的牙冠,训练数据来自合成数据。它使用一个扩增管道从一个完整的牙弓公共数据集中生成多样化的训练示例。该模型在合成测试修复上实现了81.8%的IoU和0.00034的CD,并且可以应用于实际病例,有效地填补不完整牙齿,同时生成的牙冠与对颌牙齿几乎没有交集,减少了咬合干扰的风险。
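The two metrics reported in the entry above can be sketched in a few lines. IoU is computed here on sets of occupied voxel indices, and Chamfer Distance as the sum of the two directed mean squared nearest-neighbor distances between point sets. Conventions for Chamfer Distance vary (squared or unsquared, summed or averaged), and the abstract does not say which one is used, so the choices and function names below are assumptions.

```python
def voxel_iou(a, b):
    """Intersection over union of two collections of occupied voxel indices."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def chamfer_distance(p, q):
    """Symmetric Chamfer Distance between two non-empty point lists.

    Uses squared Euclidean distance and sums the two directed means;
    this brute-force version is O(len(p) * len(q)).
    """
    def sq_dist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))

    p_to_q = sum(min(sq_dist(u, v) for v in q) for u in p) / len(p)
    q_to_p = sum(min(sq_dist(v, u) for u in p) for v in q) / len(q)
    return p_to_q + q_to_p
```

Because Chamfer Distance depends on the coordinate scale, a value such as 0.00034 is only meaningful together with the normalization applied to the shapes.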
MA-Bench: Towards Fine-grained Micro-Action Understanding
Authors: Kun Li, Jihao Gu, Fei Wang, Zhiliang Wu, Hehe Fan, Dan Guo
Venue: CVPR 2026
First: 2026-03-27T16:49:19+00:00 · Latest: 2026-03-27T16:49:19+00:00
Comments: Accepted by CVPR 2026
Abstract
With the rapid development of Multimodal Large Language Models (MLLMs), their potential in Micro-Action understanding, a vital role in human emotion analysis, remains unexplored due to the absence of specialized benchmarks. To tackle this issue, we present MA-Bench, a benchmark comprising 1,000 videos and a three-tier evaluation architecture that progressively examines micro-action perception, relational comprehension, and interpretive reasoning. MA-Bench contains 12,000 structured question-answer pairs, enabling systematic assessment of both recognition accuracy and action interpretation. The results of 23 representative MLLMs reveal that there are significant challenges in capturing motion granularity and fine-grained body-part dynamics. To address these challenges, we further construct MA-Bench-Train, a large-scale training corpus with 20.5K videos annotated with structured micro-action captions for fine-tuning MLLMs. The results of Qwen3-VL-8B fine-tuned on MA-Bench-Train show clear performance improvements across micro-action reasoning and explanation tasks. Our work aims to establish a foundation benchmark for advancing MLLMs in understanding subtle micro-action and human-related behaviors. Project Page: https://MA-Bench.github.io
中文标题/摘要
标题:MA-Bench:迈向精细微动作理解
随着多模态大型语言模型(MLLMs)的快速发展,它们在微动作理解中的潜力尚未得到探索,因为缺乏专门的基准测试。为了解决这一问题,我们提出了MA-Bench,这是一个包含1000个视频和三层评估架构的基准测试,该架构逐步检验微动作感知、关系理解以及解释推理。MA-Bench 包含12,000个结构化问题-答案对,使评估不仅包括识别准确性,还包括动作解释。23个代表性MLLMs的结果表明,在捕捉动作细节和精细的身体部分动态方面存在重大挑战。为应对这些挑战,我们进一步构建了MA-Bench-Train,这是一个包含20,500个视频的大规模训练语料库,这些视频被标注了结构化的微动作描述,用于微调MLLMs。经过MA-Bench-Train微调的Qwen3-VL-8B在微动作推理和解释任务中表现出明显的性能提升。我们的工作旨在为推动MLLMs在理解细微微动作和人类相关行为方面建立基础基准。项目页面:https://MA-Bench.github.io
Summary / 总结
MA-Bench is a benchmark for micro-action understanding, consisting of 1,000 videos and a three-tier evaluation system. It evaluates perception, relational comprehension, and interpretive reasoning, with 12,000 structured question-answer pairs. The study finds that MLLMs struggle with motion granularity and fine-grained body-part dynamics. To address this, MA-Bench-Train, a large-scale training corpus, was created, showing performance improvements in micro-action reasoning and explanation tasks after fine-tuning Qwen3-VL-8B on it. The work aims to advance MLLMs in understanding subtle human behaviors.
MA-Bench 是一个用于微动作理解的基准,旨在填补针对多模态大型语言模型(MLLMs)的专业基准的空白。它包含1,000个视频和一个三级评估系统。研究发现,在捕捉运动细节和细微的身体部位动态方面存在显著挑战。为此,MA-Bench-Train 作为一个包含20.5K个视频的训练语料库被构建,用于提升 MLLM 的性能。通过在 MA-Bench-Train 上微调 Qwen3-VL-8B,其在微动作推理和解释任务上的表现得到了改善。这项工作旨在推动 MLLMs 在理解细微的人类行为方面的发展。
Massive Redundancy in Gradient Transport Enables Sparse Online Learning
Authors: Aur Shalev Merin
First: 2026-03-16T12:32:55+00:00 · Latest: 2026-03-27T16:42:19+00:00
Comments: 26 pages, 5 figures, 14 tables
Abstract
Real-time recurrent learning (RTRL) computes exact online gradients by propagating a Jacobian tensor forward through recurrent dynamics, but at O(n^4) cost per step. Prior work has sought structured approximations (rank-1 compression, graph-based sparsity, Kronecker factorization). We show that, in the continuous error signal regime, the recurrent Jacobian is massively redundant: propagating through a random 6% of paths (k=4 of n=64) recovers 84 +/- 6% of full RTRL's adaptation ability across five seeds, and the absolute count k=4 remains effective from n=64 to n=256 (6% to 1.6%, recovery 84 to 78%), meaning sparse RTRL becomes relatively cheaper as networks grow. In RNNs, the recovery is selection-invariant (even adversarial path selection works) and exhibits a step-function transition from zero to any nonzero propagation. Spectral analysis reveals the mechanism: the Jacobian is full-rank but near-isotropic (condition numbers 2.6-6.5), so any random subset provides a directionally representative gradient estimate. On chaotic dynamics (Lorenz attractor), sparse propagation is more numerically stable than full RTRL (CV 13% vs. 88%), as subsampling avoids amplifying pathological spectral modes. The redundancy extends to LSTMs (k=4 matches full RTRL) and to transformers via sparse gradient transport (50% head sparsity outperforms the dense reference; 33% is borderline), with higher thresholds reflecting head specialization rather than isotropy. On real primate neural data, sparse RTRL (k=4) adapts online to cross-session electrode drift (80 +/- 11% recovery, 5 seeds), where sparse propagation is again more stable than full RTRL. Without continuous error signal, Jacobian propagation accumulates numerical drift and degrades all RTRL variants, a scope condition for all forward-mode methods. Results hold with SGD (92 +/- 1% recovery), suggesting independence from optimizer choice.
中文标题/摘要
标题:梯度传输中的大量冗余使稀疏在线学习成为可能
实时递归学习(RTRL)通过递归动力学前向传播雅可比张量来计算精确的在线梯度,但每步成本为O(n^4)。先前的工作寻求结构化近似(秩1压缩、基于图的稀疏性、克罗内克分解)。我们表明,在连续误差信号的情形下,递归雅可比矩阵具有大量冗余性:仅沿随机选取的6%的路径传播(n=64中的k=4)即可在五个种子中恢复84±6%的全RTRL的适应能力,并且绝对数量k=4从n=64到n=256(6%到1.6%,恢复84到78%)保持有效,这意味着稀疏RTRL随着网络的增长而相对更便宜。在RNN中,恢复是选择不变的(即使对抗性路径选择也有效),并且表现出从零到任何非零传播的阶跃函数转换。谱分析揭示了机制:雅可比矩阵是满秩但近各向同性的(条件数2.6-6.5),因此任何随机子集都能提供方向上代表性的梯度估计。在混沌动力学(洛伦兹吸引子)中,稀疏传播比全RTRL更具数值稳定性(变异系数13% vs. 88%),因为子采样避免放大病态谱模。这种冗余也扩展到LSTMs(k=4匹配全RTRL)和通过稀疏梯度传输的Transformer(50%头稀疏性优于密集参考;33%接近临界值),更高的阈值反映了头的专业化而不是各向同性。在真实灵长类神经数据中,稀疏RTRL(k=4)在线适应跨会话电极漂移(80±11%恢复,5个种子),其中稀疏传播再次比全RTRL更稳定。没有连续误差信号,雅可比传播累积数值漂移并恶化所有RTRL变体,这是所有前向模式方法的适用范围条件。结果在使用SGD(92±1%恢复)时仍然成立,表明与优化器选择无关。
Summary / 总结
The paper investigates redundancy in gradient transport for online learning with real-time recurrent learning (RTRL). Propagating the Jacobian through a random 6% of paths (k=4 of n=64) recovers 84 +/- 6% of full RTRL's adaptation ability, and the same absolute count k=4 still recovers 78% at n=256. In RNNs the recovery is invariant to path selection, even under adversarial selection. Spectral analysis indicates that the Jacobian is full-rank but near-isotropic, so any random subset gives a directionally representative gradient estimate. Sparse propagation is also more numerically stable than full RTRL on chaotic dynamics and on real primate neural data. The redundancy extends to LSTMs and, via sparse gradient transport, to transformers, where the higher sparsity thresholds reflect head specialization rather than isotropy. The results hold with SGD, suggesting independence from optimizer choice.
该研究探讨了递归神经网络(RNN)和长短期记忆(LSTM)网络中梯度传输的冗余性。通过只传播一小部分路径的梯度,研究发现稀疏RTRL可以在RNN中恢复84%的完整RTRL的适应能力,这一比例随着网络规模的增加而降低。结果表明,稀疏RTRL在混沌动力学和真实神经数据场景中比完整RTRL更具数值稳定性,且不易出现数值漂移。
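The RTRL recursion described in the abstract above can be sketched for a vanilla tanh RNN as follows. This is a minimal illustration, not the authors' implementation: the function names, the per-row 0/1 `mask`, and the reading of "k of n paths" as "each unit keeps k of its n incoming Jacobian entries" are all assumptions made here for the example.

```python
import numpy as np

def rtrl_step(W, U, h_prev, x, P_prev, mask=None):
    """One RTRL step for h = tanh(W @ h_prev + U @ x).

    P_prev[i, j, k] holds d h_prev[i] / d W[j, k].
    mask is an optional (n, n) 0/1 array; zeros drop Jacobian paths
    (an assumed form of sparsification, not taken from the paper).
    """
    h = np.tanh(W @ h_prev + U @ x)
    d = 1.0 - h ** 2                        # derivative of tanh at the new state
    J = W if mask is None else W * mask     # recurrent Jacobian before the tanh factor
    carried = np.einsum("il,ljk->ijk", J, P_prev)  # dense form costs O(n^4)
    immediate = np.zeros_like(P_prev)
    idx = np.arange(len(h))
    immediate[idx, idx, :] = h_prev         # d (W h_prev)[i] / d W[i, k] = h_prev[k]
    P = d[:, None, None] * (carried + immediate)
    return h, P

def random_path_mask(n, k, rng):
    """Keep k randomly chosen incoming paths per unit (hypothetical helper)."""
    mask = np.zeros((n, n))
    for i in range(n):
        mask[i, rng.choice(n, size=k, replace=False)] = 1.0
    return mask
```

With `mask=None` the step is exact RTRL. The masked variant here still performs the dense contraction, so it only illustrates the approximation; an implementation that actually saves compute would contract over the k kept entries per row instead.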
Scene Grounding In the Wild
Authors: Tamir Cohen, Leo Segre, Shay Shomer-Chai, Shai Avidan, Hadar Averbuch-Elor
First: 2026-03-27T16:41:20+00:00 · Latest: 2026-03-27T16:41:20+00:00
Comments: Project page at https://tau-vailab.github.io/SceneGround/
Abstract
Reconstructing accurate 3D models of large-scale real-world scenes from unstructured, in-the-wild imagery remains a core challenge in computer vision, especially when the input views have little or no overlap. In such cases, existing reconstruction pipelines often produce multiple disconnected partial reconstructions or erroneously merge non-overlapping regions into overlapping geometry. In this work, we propose a framework that grounds each partial reconstruction to a complete reference model of the scene, enabling globally consistent alignment even in the absence of visual overlap. We obtain reference models from dense, geospatially accurate pseudo-synthetic renderings derived from Google Earth Studio. These renderings provide full scene coverage but differ substantially in appearance from real-world photographs. Our key insight is that, despite this significant domain gap, both domains share the same underlying scene semantics. We represent the reference model using 3D Gaussian Splatting, augmenting each Gaussian with semantic features, and formulate alignment as an inverse feature-based optimization scheme that estimates a global 6DoF pose and scale while keeping the reference model fixed. Furthermore, we introduce the WikiEarth dataset, which registers existing partial 3D reconstructions with pseudo-synthetic reference models. We demonstrate that our approach consistently improves global alignment when initialized with various classical and learning-based pipelines, while mitigating failure modes of state-of-the-art end-to-end models. All code and data will be released.
中文标题/摘要
标题:野外场景定位
从无结构的野外图像中重建大规模真实世界场景的精确3D模型仍然是计算机视觉中的核心挑战,尤其是在输入视图几乎没有重叠的情况下。在这种情况下,现有的重建管道通常会产生多个断开的部分重建,或者错误地将非重叠区域合并为重叠几何结构。在本工作中,我们提出了一种框架,将每个部分重建定位到场景的完整参考模型中,即使在没有视觉重叠的情况下也能实现全局一致的对齐。我们从Google Earth Studio中获得参考模型,这些模型是密集的、地理上准确的伪合成渲染,这些渲染提供了完整的场景覆盖,但与真实世界的照片在外观上存在显著差异。我们的关键见解是,尽管存在显著的领域差距,但两个领域共享相同的场景语义。我们使用3D高斯点云表示参考模型,并为每个高斯添加语义特征,将对齐形式化为一种逆特征优化方案,该方案估计全局6自由度姿态和比例,同时保持参考模型固定。此外,我们引入了WikiEarth数据集,该数据集将现有的部分3D重建与伪合成参考模型进行注册。我们证明,当使用各种经典和基于学习的管道初始化时,我们的方法可以一致地提高全局对齐,同时缓解最先进的端到端模型的失败模式。所有代码和数据将被发布。
Summary / 总结
This work addresses the challenge of reconstructing accurate 3D models from unstructured in-the-wild imagery with little overlap. It proposes a framework that grounds partial reconstructions to a complete reference model derived from geospatially accurate pseudo-synthetic renderings. The method uses 3D Gaussian Splatting with semantic features and an inverse feature-based optimization scheme to achieve globally consistent alignment. Experiments show that this approach improves global alignment and mitigates failure modes of state-of-the-art models when initialized with various pipelines. The WikiEarth dataset is introduced to facilitate this process.
该研究解决了从少量重叠的无结构真实世界图像中重建准确的3D模型的挑战。它提出了一种框架,将部分重建与来自地理准确伪合成渲染的完整参考模型对齐。该方法使用带有语义特征的3D高斯散点图和逆特征优化来实现全局一致对齐。实验表明,该方法在各种管道初始化时能够提高全局对齐并缓解最先进的端到端模型的常见故障模式。还引入了WikiEarth数据集,将部分重建与伪合成模型进行注册,以促进该研究。
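To make concrete what "a global 6DoF pose and scale" means in the abstract above, the sketch below estimates a similarity transform (scale, rotation, translation) between two point sets with the closed-form Umeyama method. This is a swapped-in technique that assumes known point correspondences; the paper instead describes an inverse feature-based optimization against a Gaussian Splatting reference model, which is not reproduced here. The function name is hypothetical.

```python
import numpy as np

def similarity_align(src, dst):
    """Least-squares scale, R, t such that dst ≈ scale * R @ src + t.

    src, dst: (N, 3) arrays of corresponding points (correspondences assumed).
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)              # cross-covariance of centered sets
    U, D, Vt = np.linalg.svd(cov)
    sign = np.ones(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        sign[-1] = -1.0                     # avoid returning a reflection
    R = U @ np.diag(sign) @ Vt
    scale = (D * sign).sum() / (xs ** 2).sum(axis=1).mean()
    t = mu_d - scale * R @ mu_s
    return scale, R, t
```

The seven recovered numbers (three for rotation, three for translation, one for scale) are the same degrees of freedom the abstract refers to; only the way they are estimated differs.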
The Climber's Grip -- Personalized Deep Learning Models for Fear and Muscle Activity in Climbing
Authors: Matthias Boeker, Dana Swarbrick, Ulysse T. A. Côté-Allard, Marc T. P. Adam, Hugo L. Hammer, Pål Halvorsen
First: 2026-03-27T16:34:55+00:00 · Latest: 2026-03-27T16:34:55+00:00
Abstract
Climbing is a multifaceted sport that combines physical demands and emotional and cognitive challenges. Ascent styles differ in fall distance, with lead climbing involving larger falls than top rope climbing, which may result in different perceived risk and fear. In this study, we investigated the psychophysiological relationship between perceived fear and muscle activity in climbers using a combination of statistical modeling and deep learning techniques. We conducted an experiment with 19 climbers, collecting electromyography (EMG), electrocardiography (ECG) and arm motion data during lead and top rope climbing. Perceived fear ratings were collected for the different phases of the climb. Using a linear mixed-effects model, we analyzed the relationships between perceived fear and physiological measures. To capture the non-linear dynamics of this relationship, we extended our analysis to deep learning models and integrated random effects for a personalized modeling approach. Our results showed that random effects improved model performance as measured by mean squared error (MSE), mean absolute error (MAE) and root mean squared error (RMSE). The results also showed that muscle fatigue correlates significantly with increased fear during lead climbing. This study highlights the potential of combining statistical and deep learning approaches for modeling the interplay between psychological and physiological states during climbing.
中文标题/摘要
标题:攀岩者的握力——个性化深度学习模型在攀岩中的恐惧与肌肉活动
攀岩是一项多方面的运动,结合了身体需求和情感与认知挑战。攀登风格不同,导致坠落距离不同,领攀涉及更大的坠落距离,而顶绳攀登则较小,这可能导致不同的感知风险和恐惧。在本研究中,我们使用统计建模和深度学习技术研究了攀岩者感知恐惧与生理反应之间的心理生理关系。我们进行了一个涉及19名攀岩者的实验,在领攀和顶绳攀岩过程中收集了肌电图(EMG)、心电图(ECG)和手臂运动数据。收集了不同攀爬阶段的感知恐惧评分。通过线性混合效应模型,我们分析了感知恐惧与生理指标之间的关系。为了捕捉这种关系的非线性动态,我们将分析扩展到深度学习模型,并结合随机效应以实现个性化建模方法。结果显示,随机效应提高了均方误差(MSE)、绝对误差(MAE)和均方根误差(RMSE)的模型性能。结果显示,肌肉疲劳与领攀过程中的恐惧增加显著相关。本研究突显了结合统计和深度学习方法在建模攀岩过程中心理和生理状态相互作用的潜力。
Summary / 总结
This study investigates the relationship between perceived fear and muscle activity in climbers using statistical and deep learning methods. With 19 participants, the study collected EMG, ECG, and motion data during lead and top rope climbing. The results indicated that random effects in deep learning models improved model performance. Muscle fatigue was found to correlate significantly with increased fear during lead climbing, highlighting the potential of combining statistical and deep learning approaches to model the interplay between psychological and physiological states during climbing.
本研究使用统计和深度学习技术探讨攀岩者在攀岩过程中感知恐惧与生理指标之间的关系。通过收集19名攀岩者的EMG、ECG和手臂运动数据,研究发现随机效应在深度学习模型中的应用提高了模型性能,并且肌肉疲劳与领攀中的恐惧感显著相关。这表明结合统计和深度学习方法有助于理解心理和生理状态在攀岩过程中的相互作用。
Particulate: Feed-Forward 3D Object Articulation
Authors: Ruining Li, Yuxin Yao, Chuanxia Zheng, Christian Rupprecht, Joan Lasenby, Shangzhe Wu, Andrea Vedaldi
Venue: CVPR 2026
First: 2025-12-12T18:59:51+00:00 · Latest: 2026-03-27T16:33:22+00:00
Comments: CVPR 2026. Project page: https://ruiningli.com/particulate
Abstract
We introduce Particulate, a feed-forward model that, given a 3D mesh of an object, infers its articulations, including its 3D parts, their kinematic structure, and the motion constraints. The model is based on a transformer network, the Part Articulation Transformer, which predicts all these parameters for all joints. We train the network end-to-end on a diverse collection of articulated 3D assets from public datasets. During inference, Particulate maps the output of the network back to the input mesh, yielding a fully articulated 3D model in seconds, much faster than prior approaches that require per-object optimization. Particulate also works on AI-generated 3D assets, enabling the generation of articulated 3D objects from a single (real or synthetic) image when combined with an off-the-shelf image-to-3D model. We further introduce a new challenging benchmark for 3D articulation estimation curated from high-quality public 3D assets, and redesign the evaluation protocol to be more consistent with human preferences. Empirically, Particulate significantly outperforms state-of-the-art approaches.
中文标题/摘要
标题:Particulate:前馈3D物体关节化
我们引入了Particulate,这是一种前馈模型,给定一个物体的3D网格,可以推断出其关节,包括其3D部分、它们的运动学结构以及运动约束。该模型基于一种变压器网络——Part Articulation Transformer,它可以预测所有关节的所有参数。我们通过在公共数据集中的多样化3D资产上端到端训练网络。在推理过程中,Particulate将网络的输出映射回输入网格,几秒钟内生成一个完全关节化的3D模型,比之前需要针对每个物体进行优化的方法快得多。Particulate还可以处理AI生成的3D资产,结合现成的图像到3D模型时,可以从单个(真实或合成)图像生成关节化的3D物体。我们还引入了一个新的具有挑战性的3D关节化估计基准,该基准从高质量的公共3D资产中精心挑选而来,并重新设计了评估协议,使其更符合人类的偏好。实验上,Particulate显著优于最先进的方法。
Summary / 总结
Particulate is a feed-forward model that infers the articulations of 3D objects from their meshes using a transformer network called Part Articulation Transformer. The model predicts 3D parts, kinematic structure, and motion constraints for all joints and is trained end-to-end on diverse 3D assets. During inference, Particulate quickly generates a fully articulated 3D model, outperforming previous methods that require per-object optimization. It also works on AI-generated 3D assets, enabling the creation of articulated 3D objects from a single image. Empirical results show that Particulate significantly outperforms state-of-the-art approaches.
Particulate 是一种前馈模型,使用名为 Part Articulation Transformer 的变压器网络从 3D 网格中推断物体的关节。该模型预测所有关节的 3D 部件、运动结构和运动约束,并通过端到端训练在多样化的 3D 资产上进行训练。在推理过程中,它能够快速生成完全关节化的 3D 模型,优于需要逐对象优化的先前方法。Particulate 还可以处理 AI 生成的 3D 资产,结合现成的图像到 3D 模型可以生成关节化的 3D 对象。实验证明,Particulate 显著优于现有最佳方法。
Generation Is Compression: Zero-Shot Video Coding via Stochastic Rectified Flow
Authors: Ziyue Zeng, Xun Su, Haoyuan Liu, Bingyu Lu, Yui Tatsumi, Hiroshi Watanabe
First: 2026-03-27T16:33:20+00:00 · Latest: 2026-03-27T16:33:20+00:00
Comments: 9 pages, 3 figures
Abstract
Existing generative video compression methods use generative models only as post-hoc reconstruction modules atop conventional codecs. We propose Generative Video Codec (GVC), a zero-shot framework that turns a pretrained video generative model into the codec itself: the transmitted bitstream directly specifies the generative decoding trajectory, with no retraining required. To enable this, we convert the deterministic rectified-flow ODE of modern video foundation models into an equivalent SDE at inference time, unlocking per-step stochastic injection points for codebook-driven compression. Building on this unified backbone, we instantiate three complementary conditioning strategies: Image-to-Video (I2V) with adaptive tail-frame atom allocation, Text-to-Video (T2V) operating at near-zero side information as a pure generative prior, and First-Last-Frame-to-Video (FLF2V) with boundary-sharing GOP chaining for dual-anchor temporal control. Together, these variants span a principled trade-off space between spatial fidelity, temporal coherence, and compression efficiency. Experiments on standard benchmarks show that GVC achieves high-quality reconstruction below 0.002 bpp while supporting flexible bitrate control through a single hyperparameter.
中文标题/摘要
标题:生成即压缩:基于随机化流的零样本视频编码
现有的生成视频压缩方法仅将生成模型用作传统编解码器之上的后处理重建模块。我们提出了一种名为生成视频编解码器(GVC)的零样本框架,将预训练的视频生成模型直接转换为编解码器本身:传输的位流直接指定生成解码轨迹,无需重新训练。为了实现这一点,我们在推理时将现代视频基础模型的确定性整流流ODE转换为等效的SDE,解锁了每步的随机注入点,以实现基于码本的压缩。基于此统一的骨干,我们实例化了三种互补的条件策略:图像到视频(I2V)具有自适应尾帧原子分配,文本到视频(T2V)在接近零的侧信息下作为纯粹的生成先验,以及首尾帧到视频(FLF2V)具有边界共享GOP链式以实现双锚点的时间控制。这些变体共同构成了空间保真度、时间连贯性和压缩效率之间的原理性权衡空间。在标准基准上的实验表明,GVC在低于0.002 bpp的情况下实现了高质量的重建,同时通过单一超参数支持灵活的比特率控制。
Summary / 总结
The research aims to improve video compression by leveraging generative models directly as the codec. The method involves converting the deterministic rectified-flow ODE of video foundation models into an equivalent SDE for stochastic injection points, enabling codebook-driven compression. The key findings show that the proposed Generative Video Codec (GVC) achieves high-quality reconstruction below 0.002 bpp with flexible bitrate control through a single hyperparameter.
研究旨在通过将生成模型直接集成到编解码器中来改进视频压缩。方法是将视频基础模型中的确定性整流流ODE转换为随机微分方程,以实现基于码本的压缩。关键发现表明,提出的生成视频编解码器(GVC)在低于0.002 bpp的情况下实现了高质量的视频重建,并通过一个超参数支持灵活的比特率控制。
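The ODE-to-SDE conversion mentioned in the abstract above rests on a standard fact: adding the drift term (sigma^2 / 2) * score together with noise of strength sigma leaves the marginals of a probability-flow ODE unchanged. The toy sketch below checks this on a 1D Gaussian "data" distribution, where the velocity and score are available in closed form. The constants `MU` and `S`, the interpolation x_t = (1 - t) z + t x1, and all function names are assumptions for illustration; the paper's codebook-driven noise selection is not modeled.

```python
import numpy as np

MU, S = 2.0, 0.5  # toy data distribution N(MU, S^2) (assumption for the demo)

def velocity(x, t):
    """Exact rectified-flow velocity E[x1 - z | x_t = x] for x_t = (1 - t) z + t x1."""
    a, b = 1.0 - t, t
    var_t = a * a + b * b * S * S
    return MU + (b * S * S - a) / var_t * (x - b * MU)

def score(x, t):
    """Score of the Gaussian marginal p_t = N(t * MU, (1 - t)^2 + t^2 * S^2)."""
    a, b = 1.0 - t, t
    var_t = a * a + b * b * S * S
    return -(x - b * MU) / var_t

def sample(n, steps, sigma, rng):
    """Euler(-Maruyama) integration from noise (t=0) to data (t=1).

    sigma = 0 gives the deterministic ODE; sigma > 0 gives an SDE
    with the same marginals.
    """
    x = rng.standard_normal(n)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        drift = velocity(x, t) + 0.5 * sigma ** 2 * score(x, t)
        noise = rng.standard_normal(n)      # the per-step stochastic injection point
        x = x + drift * dt + sigma * np.sqrt(dt) * noise
    return x
```

In this sketch the `noise` draw is where a codec could, in principle, substitute a codebook-selected vector for a random one. With a pretrained model the score would have to be derived from the learned velocity rather than written down analytically.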
Defending Against Knowledge Poisoning Attacks During Retrieval-Augmented Generation
Authors: Kennedy Edemacu, Vinay M. Shashidhar, Micheal Tuape, Dan Abudu, Beakcheol Jang, Jong Wook Kim
First: 2025-08-04T19:03:52+00:00 · Latest: 2026-03-27T16:32:20+00:00
Comments: Preprint for Submission
Abstract
Retrieval-Augmented Generation (RAG) has emerged as a powerful approach to boost the capabilities of large language models (LLMs) by incorporating external, up-to-date knowledge sources. However, this introduces a potential vulnerability to knowledge poisoning attacks, where attackers can compromise the knowledge source to mislead the generation model. One such attack is PoisonedRAG, in which the injected adversarial texts steer the model to generate an attacker-chosen response to a target question. In this work, we propose novel defense methods, FilterRAG and ML-FilterRAG, to mitigate the PoisonedRAG attack. First, we propose a new property that differentiates adversarial texts from clean texts in the knowledge data source. Next, we employ this property to filter out adversarial texts from clean ones in the design of our proposed approaches. Evaluation of these methods using benchmark datasets demonstrates their effectiveness, with performance close to that of the original RAG systems.
中文标题/摘要
标题:防御检索增强生成中的知识投毒攻击
检索增强生成(RAG)已成为通过引入外部的最新知识源来增强大型语言模型(LLMs)能力的一种强大方法。然而,这引入了一种潜在的安全漏洞,即知识投毒攻击,攻击者可以通过篡改知识源来误导生成模型。其中一种攻击是PoisonedRAG,攻击者注入的对抗性文本会引导模型生成攻击者选择的答案。在本文中,我们提出了新的防御方法FilterRAG和ML-FilterRAG来缓解PoisonedRAG攻击。首先,我们提出了一种新特性来揭示区分对抗性和干净文本的独特属性。然后,我们利用这一特性在我们提出的方法设计中过滤掉对抗性文本。使用基准数据集对这些方法的评估表明,它们的有效性接近原始RAG系统的性能。
Summary / 总结
This paper addresses the vulnerability of Retrieval-Augmented Generation (RAG) models to knowledge poisoning attacks, where attackers can inject adversarial texts to mislead the model. The authors propose FilterRAG and ML-FilterRAG, which identify and filter out adversarial texts based on new properties of the knowledge data. Experimental results show that these methods effectively defend against PoisonedRAG attacks, with performance nearly matching that of the original RAG systems.
该研究针对检索增强生成(RAG)模型面临的知识投毒攻击漏洞,攻击者通过注入恶意文本误导模型。作者提出了FilterRAG和ML-FilterRAG方法,基于知识数据的新特性来识别和过滤恶意文本。实验结果表明,这些方法能够有效抵御PoisonedRAG攻击,性能接近原始的RAG系统。
Machine Unlearning under Retain-Forget Entanglement
Authors: Jingpu Cheng, Ping Liu, Qianxiao Li, Chi Zhang
Venue: ICLR 2026
First: 2026-03-27T16:32:09+00:00 · Latest: 2026-03-27T16:32:09+00:00
Comments: ICLR 2026 camera-ready
Abstract
Forgetting a subset in machine unlearning is rarely an isolated task. Often, retained samples that are closely related to the forget set can be unintentionally affected, particularly when they share correlated features from pretraining or exhibit strong semantic similarities. To address this challenge, we propose a novel two-phase optimization framework specifically designed to handle such retain-forget entanglements. In the first phase, an augmented Lagrangian method increases the loss on the forget set while preserving accuracy on less-related retained samples. The second phase applies a gradient projection step, regularized by the Wasserstein-2 distance, to mitigate performance degradation on semantically related retained samples without compromising the unlearning objective. We validate our approach through comprehensive experiments on multiple unlearning tasks, standard benchmark datasets, and diverse neural architectures, demonstrating that it achieves effective and reliable unlearning while outperforming existing baselines in both accuracy retention and removal fidelity.
中文标题/摘要
标题:机器去学习中的保留-遗忘纠缠
在机器去学习中忘记一个子集通常不是一个孤立的任务。通常,与遗忘集密切相关的保留样本可能会无意中受到影响,尤其是在它们具有预训练的关联特征或表现出强烈语义相似性的情况下。为了解决这一挑战,我们提出了一种新的两阶段优化框架,专门设计用于处理这种保留-遗忘纠缠。在第一阶段,通过增广拉格朗日方法增加遗忘集上的损失,同时保持与遗忘集关系较弱的保留样本的准确性。第二阶段应用梯度投影步骤,并通过Wasserstein-2距离正则化,以减轻与保留样本语义相关的性能下降,而不牺牲去学习目标。我们通过在多个去学习任务、标准基准数据集和多种神经网络架构上的全面实验验证了我们的方法,证明了它在保持准确性和删除保真度方面都优于现有基线。
Summary / 总结
The paper addresses the challenge of machine unlearning where retained samples can be unintentionally affected when forgetting a subset. It proposes a two-phase optimization framework that uses an augmented Lagrangian method in the first phase to increase loss on the forget set while preserving accuracy on less-related retained samples. The second phase applies a gradient projection step regularized by the Wasserstein-2 distance to mitigate performance degradation on semantically related retained samples. Experiments show that this approach effectively and reliably achieves unlearning with better accuracy retention and removal fidelity compared to existing methods.
论文研究了机器去学习中的一个问题:遗忘某个子集时,与遗忘集共享特征或语义相似的保留样本会无意中受到影响。提出了一种两阶段优化框架,首先使用增广拉格朗日方法增加遗忘集上的损失,同时保持与之关系较弱的保留样本的准确性。第二阶段采用梯度投影步骤,并通过Wasserstein-2距离进行正则化,以减轻对语义相关保留样本的性能下降,而不牺牲去学习目标。实验表明,该方法在保持准确性和删除保真度方面优于现有方法,实现了有效和可靠的去学习。
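The idea behind the second phase in the abstract above, repairing retained samples without undoing the forgetting, can be illustrated with a generic orthogonal gradient projection. This sketch is not the paper's algorithm: it omits the Wasserstein-2 regularizer and the augmented Lagrangian phase entirely, and the function names are hypothetical.

```python
import numpy as np

def project_out(update, protected, eps=1e-12):
    """Remove from `update` its component along `protected`."""
    denom = float(protected @ protected)
    if denom < eps:
        return update.copy()                # nothing to protect against
    return update - (update @ protected) / denom * protected

def retain_repair_step(theta, grad_retain, grad_forget, lr=0.1):
    """Descend the retain loss along a direction orthogonal to the forget gradient.

    To first order, the step leaves the forget loss unchanged while the
    retain loss does not increase.
    """
    direction = project_out(grad_retain, grad_forget)
    return theta - lr * direction
```

For example, with `grad_retain = [1, 2, 0]` and `grad_forget = [0, 1, 0]`, the projected direction is `[1, 0, 0]`: the component that would have altered the forget loss is dropped and the rest of the retain gradient is kept.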
Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering
Authors: Yoseph Berhanu Alebachew, Hunter Leary, Swanand Vaishampayan, Chris Brown
First: 2026-03-27T16:30:54+00:00 · Latest: 2026-03-27T16:30:54+00:00
Abstract
Large Language Models (LLMs) have shown impressive capabilities across software engineering tasks, including question answering (QA). However, most studies and benchmarks focus on isolated functions or single-file snippets, overlooking the challenges of real-world program comprehension, which often spans multiple files and system-level dependencies. In this work, we introduce StackRepoQA, the first multi-project, repository-level question answering dataset constructed from 1,318 real developer questions and accepted answers across 134 open-source Java projects. Using this dataset, we systematically evaluate two widely used LLMs (Claude 3.5 Sonnet and GPT-4o) under both direct prompting and agentic configurations. We compare baseline performance with retrieval-augmented generation methods that leverage file-level retrieval and graph-based representations of structural dependencies. Our results show that LLMs achieve moderate accuracy at baseline, with performance improving when structural signals are incorporated. Nonetheless, overall accuracy remains limited for repository-scale comprehension. The analysis reveals that high scores often result from verbatim reproduction of Stack Overflow answers rather than genuine reasoning. To our knowledge, this is the first empirical study to provide such evidence in repository-level QA. We release StackRepoQA to encourage further research into benchmarks, evaluation protocols, and augmentation strategies that disentangle memorization from reasoning, advancing LLMs as a reliable tool for repository-scale program comprehension.
中文标题/摘要
标题:超越代码片段:在仓库级别问题回答中评估LLMs
大型语言模型(LLMs)在软件工程任务中展示了令人印象深刻的性能,包括问题回答(QA)。然而,大多数研究和基准测试主要集中在孤立的功能或单个代码片段上,忽视了真实程序理解的挑战,这些挑战往往跨越多个文件和系统级依赖。在本研究中,我们引入了StackRepoQA,这是首个基于1,318个真实开发人员问题及其接受答案构建的多项目仓库级别问题回答数据集,涵盖134个开源Java项目。使用此数据集,我们系统地评估了两种广泛使用的LLM(Claude 3.5 Sonnet和GPT-4o)在直接提示和代理配置下的表现。我们将基线性能与利用文件级检索和基于图的结构依赖表示的检索增强生成方法进行了比较。结果显示,LLM在基线时达到中等准确度,当结构信号被纳入时,性能有所提高。然而,总体准确度在仓库规模理解方面仍然有限。分析表明,高分往往来自于逐字复制Stack Overflow答案,而不是真正的推理。据我们所知,这是首个在仓库级别QA中提供此类证据的经验研究。我们发布了StackRepoQA,以鼓励进一步研究基准测试、评估协议和分离记忆与推理的增强策略,推动LLM作为可靠工具用于仓库规模程序理解的发展。
Summary / 总结
This study introduces StackRepoQA, a new dataset for repository-level question answering in software engineering, addressing the limitations of previous benchmarks that focus on isolated code snippets. Using this dataset, the researchers evaluated two LLMs (Claude 3.5 Sonnet and GPT-4o) under different configurations and compared their performance with retrieval-augmented generation methods. The results indicate that while LLMs show moderate accuracy at baseline, incorporating structural signals improves performance but still falls short for comprehensive repository-level comprehension. The study provides evidence that high scores often result from memorization rather than reasoning, highlighting the need for benchmarks that disentangle these aspects.
该研究引入了StackRepoQA,这是一个针对软件工程中的仓库级问题回答的新数据集,解决了之前基准主要关注孤立代码片段的局限性。研究使用该数据集评估了LLM(Claude 3.5 Sonnet和GPT-4o),并与基于检索的生成方法进行了比较。结果显示,虽然LLM在基线时表现出中等的准确性,但在结合结构信号后性能有所提升,但仍不足以实现全面的仓库级理解。研究指出,高分数往往来自于直接复制Stack Overflow的答案,而不是真正的推理,强调了需要基准来区分记忆与推理能力的必要性。