arXiv 论文速递

2026-04-01 04:09
Snapshot: 20260401_0409
Gen-Searcher: Reinforcing Agentic Search for Image Generation
Authors: Kaituo Feng, Manyuan Zhang, Shuang Chen, Yunlong Lin, Kaixuan Fan, Yilei Jiang, Hongyu Li, Dian Zheng, Chenyang Wang, Xiangyu Yue
First: 2026-03-30T17:59:56+00:00 · Latest: 2026-03-30T17:59:56+00:00
Comments: Project page: https://gen-searcher.vercel.app Code: https://github.com/tulerfeng/Gen-Searcher
Abstract
Recent image generation models have shown strong capabilities in generating high-fidelity and photorealistic images. However, they are fundamentally constrained by frozen internal knowledge, thus often failing on real-world scenarios that are knowledge-intensive or require up-to-date information. In this paper, we present Gen-Searcher, as the first attempt to train a search-augmented image generation agent, which performs multi-hop reasoning and search to collect the textual knowledge and reference images needed for grounded generation. To achieve this, we construct a tailored data pipeline and curate two high-quality datasets, Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k, containing diverse search-intensive prompts and corresponding ground-truth synthesis images. We further introduce KnowGen, a comprehensive benchmark that explicitly requires search-grounded external knowledge for image generation and evaluates models from multiple dimensions. Based on these resources, we train Gen-Searcher with SFT followed by agentic reinforcement learning with dual reward feedback, which combines text-based and image-based rewards to provide more stable and informative learning signals for GRPO training. Experiments show that Gen-Searcher brings substantial gains, improving Qwen-Image by around 16 points on KnowGen and 15 points on WISE. We hope this work can serve as an open foundation for search agents in image generation, and we fully open-source our data, models, and code.
中文标题/摘要
标题:Gen-Searcher: 强化代理搜索以进行图像生成
近期的图像生成模型在生成高保真和照片级真实感图像方面表现出强大的能力。然而,它们本质上受限于固定的内部知识,因此在知识密集型或需要最新信息的现实场景中常常失败。在本文中,我们提出了Gen-Searcher,这是首次尝试训练一个搜索增强的图像生成代理,该代理进行多跳推理和搜索以收集用于基于场景生成所需的文本知识和参考图像。为了实现这一点,我们构建了一个定制的数据管道,并创建了两个高质量的数据集Gen-Searcher-SFT-10k和Gen-Searcher-RL-6k,包含多样化的搜索密集型提示及其对应的合成图像。我们还引入了KnowGen,这是一个全面的基准测试,明确要求搜索驱动的外部知识进行图像生成,并从多个维度评估模型。基于这些资源,我们使用SFT进行训练,然后使用双重奖励反馈的代理强化学习进行训练,结合文本和图像奖励以提供更稳定和信息丰富的学习信号以供GRPO训练。实验表明,Gen-Searcher 带来了显著的改进,使Qwen-Image在KnowGen上提高了约16分,在WISE上提高了约15分。我们希望这项工作可以作为图像生成中搜索代理的开放基础,并完全开源我们的数据、模型和代码。
Summary / 总结
Gen-Searcher is presented as the first attempt to train a search-augmented image generation agent: it performs multi-hop reasoning and search to gather the textual knowledge and reference images needed for grounded generation. It relies on a tailored data pipeline and two curated datasets (Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k), and is trained with SFT followed by agentic reinforcement learning whose dual reward combines text-based and image-based feedback for GRPO. Experiments show that Gen-Searcher improves Qwen-Image by around 16 points on KnowGen and 15 points on WISE.
Gen-Searcher 是训练搜索增强型图像生成代理的首次尝试,该代理通过多跳推理和搜索来收集生成所需的文本知识和参考图像。它基于定制的数据管道和两个高质量数据集,先进行 SFT,再进行带双重奖励反馈的代理强化学习。实验表明,Gen-Searcher 使 Qwen-Image 在 KnowGen 上提高约 16 分,在 WISE 上提高约 15 分。数据、模型和代码均已完全开源。
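The dual-reward GRPO step mentioned in the Gen-Searcher abstract can be illustrated with a minimal sketch. The abstract does not give the reward formula, so the weighted sum, the weights `w_text` and `w_image`, and the example scores below are all assumptions made for illustration; only the group-relative standardization is the generic GRPO idea.

```python
from statistics import mean, pstdev

def dual_reward(text_score, image_score, w_text=0.5, w_image=0.5):
    """Blend a text-based and an image-based score (weights are hypothetical)."""
    return w_text * text_score + w_image * image_score

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one prompt's rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four hypothetical rollouts for one prompt: (text_score, image_score) in [0, 1].
rollouts = [(0.9, 0.8), (0.4, 0.6), (0.7, 0.2), (0.1, 0.3)]
rewards = [dual_reward(t, i) for t, i in rollouts]
advantages = group_relative_advantages(rewards)
```

Rollouts that score above the group mean receive positive advantages and those below receive negative ones, so no separate value network is needed to form the learning signal.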
HandX: Scaling Bimanual Motion and Interaction Generation
Authors: Zimu Zhang, Yucheng Zhang, Xiyan Xu, Ziyin Wang, Sirui Xu, Kai Zhou, Bing Zhou, Chuan Guo, Jian Wang, Yu-Xiong Wang, Liang-Yan Gui
Venue: CVPR 2026
First: 2026-03-30T17:59:49+00:00 · Latest: 2026-03-30T17:59:49+00:00
Comments: CVPR 2026. Project Page: https://handx-project.github.io. Code: https://github.com/handx-project/HandX
Abstract
Synthesizing human motion has advanced rapidly, yet realistic hand motion and bimanual interaction remain underexplored. Whole-body models often miss the fine-grained cues that drive dexterous behavior (finger articulation, contact timing, and inter-hand coordination), and existing resources lack high-fidelity bimanual sequences that capture nuanced finger dynamics and collaboration. To fill this gap, we present HandX, a unified foundation spanning data, annotation, and evaluation. We consolidate and filter existing datasets for quality, and collect a new motion-capture dataset targeting underrepresented bimanual interactions with detailed finger dynamics. For scalable annotation, we introduce a decoupled strategy that extracts representative motion features, e.g., contact events and finger flexion, and then leverages reasoning from large language models to produce fine-grained, semantically rich descriptions aligned with these features. Building on the resulting data and annotations, we benchmark diffusion and autoregressive models with versatile conditioning modes. Experiments demonstrate high-quality dexterous motion generation, supported by our newly proposed hand-focused metrics. We further observe clear scaling trends: larger models trained on larger, higher-quality datasets produce more semantically coherent bimanual motion. Our dataset is released to support future research.
中文标题/摘要
标题:HandX:规模化双手运动与交互生成
人体运动合成进展迅速,但逼真的手部运动和双手交互仍未得到充分探索。全身模型往往忽略驱动灵巧行为的细粒度线索,即手指关节活动、接触时机和双手协调,而现有资源缺乏能够捕捉细腻手指动态和协作的高保真双手序列。为填补这一空白,我们提出了HandX,这是一个涵盖数据、标注和评估的统一基础。我们整合并筛选现有数据集以保证质量,并采集了一个新的动作捕捉数据集,专门针对代表性不足的双手交互,包含详细的手指动态。为了实现可扩展的标注,我们引入了一种解耦策略,先提取代表性运动特征(如接触事件和手指弯曲),再利用大型语言模型的推理生成与这些特征对齐的细粒度、语义丰富的描述。基于所得数据和标注,我们对支持多种条件模式的扩散模型和自回归模型进行了基准测试。实验展示了高质量的灵巧运动生成,并得到我们新提出的手部专用指标的支持。我们还观察到明显的规模化趋势:在更大、更高质量的数据集上训练的更大模型能产生语义更连贯的双手运动。我们的数据集已发布,以支持未来的研究。
Summary / 总结
The research aims to address the underexplored area of realistic hand motion and bimanual interaction synthesis. To achieve this, the authors present HandX, a unified foundation that includes data, annotation, and evaluation. They consolidate existing datasets, collect a new motion-capture dataset with detailed finger dynamics, and introduce a decoupled annotation strategy. Experiments show that larger models trained on high-quality data generate more semantically coherent bimanual motions, and the authors release their dataset to support future research.
HandX面向尚未充分探索的逼真手部运动和双手交互合成问题,引入了包含详细手指动态的新动作捕捉数据集、解耦标注策略以及扩散和自回归模型基准。实验结果显示了高质量的灵巧手部动作生成,并观察到规模效应:更大的模型在更大、更高质量的数据上训练,能产生语义更连贯的双手运动。
ViPRA: Video Prediction for Robot Actions
Authors: Sandeep Routray, Hengkai Pan, Unnat Jain, Shikhar Bahl, Deepak Pathak
Venue: ICLR 2026
First: 2025-11-11T01:33:03+00:00 · Latest: 2026-03-30T17:59:36+00:00
Comments: In ICLR 2026. Website: https://vipra-project.github.io
Abstract
Can we turn a video prediction model into a robot policy? Videos, including those of humans or teleoperated robots, capture rich physical interactions. However, most of them lack labeled actions, which limits their use in robot learning. We present Video Prediction for Robot Actions (ViPRA), a simple pretraining-finetuning framework that learns continuous robot control from these actionless videos. Instead of directly predicting actions, we train a video-language model to predict both future visual observations and motion-centric latent actions, which serve as intermediate representations of scene dynamics. We train these latent actions using perceptual losses and optical flow consistency to ensure they reflect physically grounded behavior. For downstream control, we introduce a chunked flow matching decoder that maps latent actions to robot-specific continuous action sequences, using only 100 to 200 teleoperated demonstrations. This approach avoids expensive action annotation, supports generalization across embodiments, and enables smooth, high-frequency continuous control up to 22 Hz via chunked action decoding. Unlike prior latent action works that treat pretraining as autoregressive policy learning, ViPRA explicitly models both what changes and how. Our method outperforms strong baselines, with a 16% gain on the SIMPLER benchmark and a 13% improvement across real-world manipulation tasks. We have released models and code at https://vipra-project.github.io.
中文标题/摘要
标题:ViPRA:机器人动作的视频预测
我们能否将视频预测模型转化为机器人策略?视频(包括人类或遥操作机器人的视频)捕捉了丰富的物理交互。然而,大多数视频缺乏动作标注,限制了它们在机器人学习中的应用。我们提出了Video Prediction for Robot Actions (ViPRA),一种简单的预训练-微调框架,从这些无动作标注的视频中学习连续的机器人控制。我们不是直接预测动作,而是训练一个视频-语言模型来同时预测未来的视觉观测和以运动为中心的潜在动作,后者作为场景动力学的中间表示。我们使用感知损失和光流一致性来训练这些潜在动作,以确保它们反映符合物理规律的行为。对于下游控制,我们引入了一种分块流匹配解码器,仅使用100到200个遥操作演示,将潜在动作映射到特定机器人的连续动作序列。这种方法避免了昂贵的动作标注,支持跨具身形态的泛化,并通过分块动作解码实现最高22 Hz的平滑、高频连续控制。与以往将预训练视为自回归策略学习的潜在动作工作不同,ViPRA明确建模了什么在变化以及如何变化。我们的方法优于强基线,在SIMPLER基准上获得16%的提升,在真实世界操作任务中提高13%。我们已在https://vipra-project.github.io发布了模型和代码。
Summary / 总结
ViPRA is a pretraining-finetuning framework that learns continuous robot control from actionless videos, that is, videos without action labels. It trains a video-language model to predict future visual observations and motion-centric latent actions, which a chunked flow matching decoder then maps to robot-specific continuous action sequences using only 100 to 200 teleoperated demonstrations. This avoids expensive action annotation and outperforms strong baselines, with a 16% gain on the SIMPLER benchmark and a 13% improvement across real-world manipulation tasks.
ViPRA 是一个框架,利用视频预测从未标注的视频中学习连续的机器人控制。它训练一个视频-语言模型来预测未来的视觉观察和潜在动作,然后使用分块流匹配解码器将潜在动作解码为机器人特定的动作。这种方法避免了动作标注的需求,并在模拟和真实世界操作任务中都取得了优于强基线的表现。
Geometry-aware similarity metrics for neural representations on Riemannian and statistical manifolds
Authors: N Alex Cayco Gajic, Arthur Pellegrino
First: 2026-03-30T17:59:22+00:00 · Latest: 2026-03-30T17:59:22+00:00
Abstract
Similarity measures are widely used to interpret the representational geometries used by neural networks to solve tasks. Yet, because existing methods compare the extrinsic geometry of representations in state space, rather than their intrinsic geometry, they may fail to capture subtle yet crucial distinctions between fundamentally different neural network solutions. Here, we introduce metric similarity analysis (MSA), a novel method which leverages tools from Riemannian geometry to compare the intrinsic geometry of neural representations under the manifold hypothesis. We show that MSA can be used to i) disentangle features of neural computations in deep networks with different learning regimes, ii) compare nonlinear dynamics, and iii) investigate diffusion models. Hence, we introduce a mathematically grounded and broadly applicable framework to understand the mechanisms behind neural computations by comparing their intrinsic geometries.
中文标题/摘要
标题:用于黎曼流形和统计流形上神经表示的几何感知相似性度量
相似性度量广泛用于解释神经网络用于解决任务的表示几何结构。然而,由于现有方法比较表示在状态空间中的外在几何结构,而不是其内在几何结构,它们可能无法捕捉到不同神经网络解决方案之间细微但至关重要的区别。在这里,我们引入了度量相似性分析(MSA),这是一种新颖的方法,利用黎曼几何工具比较流形假设下的神经表示的内在几何结构。我们展示了MSA可以用于i) 分离具有不同学习模式的深层网络中神经计算的特征,ii) 比较非线性动力学,iii) 探究扩散模型。因此,我们提出了一种基于数学原理且广泛适用的框架,通过比较其内在几何结构来理解神经计算背后的机制。
Summary / 总结
The research aims to improve the interpretation of neural network representations by focusing on their intrinsic geometry rather than extrinsic geometry. The method, metric similarity analysis (MSA), uses Riemannian geometry to compare the intrinsic geometry of neural representations. Key findings include the ability to disentangle features in deep networks with different learning regimes, compare nonlinear dynamics, and investigate diffusion models, providing a mathematically grounded framework for understanding neural computations.
研究旨在通过关注神经网络表示的内在几何结构而非外在几何结构来提高对其的解释。方法是使用黎曼几何工具进行度量相似性分析(MSA),以比较神经表示的内在几何结构。关键发现包括能够区分具有不同学习模式的深层网络中神经计算的特征、比较非线性动态以及研究扩散模型,提供了一个数学上扎实且广泛适用的框架来理解神经计算机制。
PoseDreamer: Scalable and Photorealistic Human Data Generation Pipeline with Diffusion Models
Authors: Lorenza Prospero, Orest Kupyn, Ostap Viniavskyi, João F. Henriques, Christian Rupprecht
First: 2026-03-30T17:59:18+00:00 · Latest: 2026-03-30T17:59:18+00:00
Abstract
Acquiring labeled datasets for 3D human mesh estimation is challenging due to depth ambiguities and the inherent difficulty of annotating 3D geometry from monocular images. Existing datasets are either real, with manually annotated 3D geometry and limited scale, or synthetic, rendered from 3D engines that provide precise labels but suffer from limited photorealism, low diversity, and high production costs. In this work, we explore a third path: generated data. We introduce PoseDreamer, a novel pipeline that leverages diffusion models to generate large-scale synthetic datasets with 3D mesh annotations. Our approach combines controllable image generation with Direct Preference Optimization for control alignment, curriculum-based hard sample mining, and multi-stage quality filtering. Together, these components naturally maintain correspondence between 3D labels and generated images, while prioritizing challenging samples to maximize dataset utility. Using PoseDreamer, we generate more than 500,000 high-quality synthetic samples, achieving a 76% improvement in image-quality metrics compared to rendering-based datasets. Models trained on PoseDreamer achieve performance comparable to or superior to those trained on real-world and traditional synthetic datasets. In addition, combining PoseDreamer with synthetic datasets results in better performance than combining real-world and synthetic datasets, demonstrating the complementary nature of our dataset. We will release the full dataset and generation code.
中文标题/摘要
标题:PoseDreamer:基于扩散模型的大规模和高保真人体数据生成管道
由于深度歧义和单目图像中三维几何标注的固有难度,获取用于3D人体网格估计的标注数据集具有挑战性。现有数据集要么是真实数据,带有手动标注的三维几何但规模有限,要么是合成数据,由三维引擎渲染生成,提供精确标签但缺乏高保真度、多样性低且生产成本高。在本文中,我们探索第三条路径:生成数据。我们介绍了PoseDreamer,一种新颖的管道,利用扩散模型生成带有三维网格标注的大规模合成数据集。我们的方法结合了可控图像生成与直接偏好优化以实现控制对齐、基于课程的学习困难样本挖掘以及多阶段质量筛选。这些组件共同自然地保持了三维标签与生成图像之间的对应关系,同时优先处理具有挑战性的样本以最大化数据集的实用性。使用PoseDreamer,我们生成了超过50万个高质量的合成样本,与基于渲染的数据集相比,在图像质量指标上提高了76%。在PoseDreamer上训练的模型在性能上与或优于在真实世界和传统合成数据集上训练的模型。此外,将PoseDreamer与合成数据集结合使用比将真实世界和合成数据集结合使用具有更好的性能,证明了我们数据集的互补性。我们将发布完整数据集和生成代码。
Summary / 总结
The paper addresses the challenge of acquiring large-scale labeled datasets for 3D human mesh estimation by introducing PoseDreamer, a pipeline that uses diffusion models to generate synthetic data with 3D mesh annotations. It combines controllable image generation with Direct Preference Optimization for control alignment, curriculum-based hard sample mining, and multi-stage quality filtering, which keeps 3D labels and images in correspondence while prioritizing challenging samples. The result is more than 500,000 synthetic samples with a 76% improvement in image-quality metrics over rendering-based datasets. Models trained on PoseDreamer match or exceed those trained on real-world and traditional synthetic datasets, and combining PoseDreamer with synthetic datasets outperforms combining real-world and synthetic datasets.
该研究通过引入PoseDreamer管道,利用扩散模型生成大规模带有3D网格标注的合成数据集,以应对获取3D人体网格标注数据的挑战。该方法结合可控图像生成、用于控制对齐的直接偏好优化、基于课程的困难样本挖掘和多阶段质量筛选,保持3D标签与图像之间的对应关系,并优先处理具有挑战性的样本。最终生成了超过500,000个高质量合成样本,图像质量指标比基于渲染的数据集提高76%。在PoseDreamer上训练的模型达到或超过在真实世界和传统合成数据集上训练的模型,且将PoseDreamer与合成数据集结合的效果优于将真实世界数据集与合成数据集结合,展示了该数据集的互补性。
On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers
Authors: Omer Dahary, Benaya Koren, Daniel Garibi, Daniel Cohen-Or
Venue: SIGGRAPH 2026
First: 2026-03-30T17:59:13+00:00 · Latest: 2026-03-30T17:59:13+00:00
Comments: Conditionally accepted to SIGGRAPH 2026. Project page: https://contextual-repulsion.github.io/
Abstract
Modern Text-to-Image (T2I) diffusion models have achieved remarkable semantic alignment, yet they often suffer from a significant lack of variety, converging on a narrow set of visual solutions for any given prompt. This typicality bias presents a challenge for creative applications that require a wide range of generative outcomes. We identify a fundamental trade-off in current approaches to diversity: modifying model inputs requires costly optimization to incorporate feedback from the generative path. In contrast, acting on spatially-committed intermediate latents tends to disrupt the forming visual structure, leading to artifacts. In this work, we propose to apply repulsion in the Contextual Space as a novel framework for achieving rich diversity in Diffusion Transformers. By intervening in the multimodal attention channels, we apply on-the-fly repulsion during the transformer's forward pass, injecting the intervention between blocks where text conditioning is enriched with emergent image structure. This allows for redirecting the guidance trajectory after it is structurally informed but before the composition is fixed. Our results demonstrate that repulsion in the Contextual Space produces significantly richer diversity without sacrificing visual fidelity or semantic adherence. Furthermore, our method is uniquely efficient, imposing a small computational overhead while remaining effective even in modern "Turbo" and distilled models where traditional trajectory-based interventions typically fail.
中文标题/摘要
标题:上下文空间中的即时排斥以实现扩散变换器的丰富多样性
现代文本到图像(T2I)扩散模型在语义对齐方面取得了显著成就,但往往缺乏多样性,对于任何给定的提示,它们往往会收敛于一组狭窄的视觉解决方案。这种典型性偏差对需要广泛生成结果的创意应用构成了挑战。我们发现当前多样性方法中的一个基本权衡:修改模型输入需要昂贵的优化来结合生成路径的反馈。相比之下,在空间上已承诺的中间潜在变量上采取行动往往会破坏正在形成的视觉结构,导致伪影。在本文中,我们提出在上下文空间中应用排斥作为实现扩散变换器丰富多样性的新框架。通过干预多模态注意力通道,在变换器的前向传递过程中即时应用排斥,将干预注入文本条件丰富了新兴图像结构的块之间。这允许在结构信息丰富但组成固定之前重新引导指导轨迹。我们的结果表明,上下文空间中的排斥能够显著增加多样性,而不牺牲视觉保真度或语义一致性。此外,我们的方法具有独特效率,施加较小的计算开销,即使在现代“涡轮”和精简模型中,传统轨迹干预通常失败时,仍然有效。
Summary / 总结
This paper addresses the lack of variety in Text-to-Image (T2I) diffusion models by applying repulsion in the Contextual Space of Diffusion Transformers. The method intervenes in the multimodal attention channels during the transformer's forward pass, between blocks where text conditioning is enriched with emergent image structure, to redirect the guidance trajectory before the composition is fixed. Results show richer diversity without sacrificing visual fidelity or semantic adherence, at a small computational overhead, and the method remains effective in modern "Turbo" and distilled models where trajectory-based interventions typically fail.
论文针对文本到图像(T2I)扩散模型缺乏多样性的问题,这些模型常对同一提示生成相似的图像。它提出在扩散Transformer的上下文空间中施加排斥,以增强多样性。该方法在Transformer前向传播过程中即时施加排斥,在轨迹已获得结构信息但构图尚未固定之前重新引导指导轨迹。结果表明,这种方法增加了多样性,同时没有牺牲视觉保真度或语义一致性;其计算开销很小,并且在现代“Turbo”和蒸馏模型中依然有效。
SHOW3D: Capturing Scenes of 3D Hands and Objects in the Wild
Authors: Patrick Rim, Kevin Harris, Braden Copple, Shangchen Han, Xu Xie, Ivan Shugurov, Sizhe An, He Wen, Alex Wong, Tomas Hodan, Kun He
Venue: CVPR 2026
First: 2026-03-30T17:58:27+00:00 · Latest: 2026-03-30T17:58:27+00:00
Comments: CVPR 2026
Abstract
Accurate 3D understanding of human hands and objects during manipulation remains a significant challenge for egocentric computer vision. Existing hand-object interaction datasets are predominantly captured in controlled studio settings, which limits both environmental diversity and the ability of models trained on such data to generalize to real-world scenarios. To address this challenge, we introduce a novel marker-less multi-camera system that allows for nearly unconstrained mobility in genuinely in-the-wild conditions, while still having the ability to generate precise 3D annotations of hands and objects. The capture system consists of a lightweight, back-mounted, multi-camera rig that is synchronized and calibrated with a user-worn VR headset. For 3D ground-truth annotation of hands and objects, we develop an ego-exo tracking pipeline and rigorously evaluate its quality. Finally, we present SHOW3D, the first large-scale dataset with 3D annotations that show hands interacting with objects in diverse real-world environments, including outdoor settings. Our approach significantly reduces the fundamental trade-off between environmental realism and accuracy of 3D annotations, which we validate with experiments on several downstream tasks. show3d-dataset.github.io
中文标题/摘要
标题:SHOW3D:在自然环境中的3D手部和物体捕捉
在操作过程中对人类手部和物体进行准确的3D理解仍然是自视点计算机视觉中的一个重大挑战。现有的手部-物体交互数据集主要是在受控的摄影棚环境中捕获的,这限制了环境的多样性,并且使得基于此类数据训练的模型难以泛化到真实世界的情景中。为了解决这一挑战,我们引入了一种新型的无标记多摄像头系统,该系统允许在真正自然环境中的几乎不受限制的移动,同时仍能够生成手部和物体的精确3D注释。捕获系统由一个轻量级、后背安装的多摄像头装置组成,该装置与用户佩戴的VR头显同步和校准。为了对手部和物体进行3D地面真实注释,我们开发了一种自我-外部跟踪流水线,并对其质量进行了严格的评估。最后,我们提出了SHOW3D,这是第一个具有3D注释的大规模数据集,展示了手部在多种真实环境中的交互,包括户外环境。我们的方法显著减少了环境现实性和3D注释准确性之间的基本权衡,并通过在几个下游任务上的实验进行了验证。show3d-dataset.github.io
Summary / 总结
The research aims to improve 3D understanding of hands and objects during manipulation in real-world settings, where existing studio-captured datasets generalize poorly. The method is a marker-less capture system built around a lightweight, back-mounted multi-camera rig synchronized and calibrated with a user-worn VR headset, together with an ego-exo tracking pipeline that produces 3D ground-truth annotations. The result is SHOW3D, described as the first large-scale dataset with 3D annotations of hands interacting with objects in diverse real-world environments, including outdoors. The authors report that the approach significantly reduces the trade-off between environmental realism and 3D annotation accuracy, validated on several downstream tasks.
研究旨在提高对真实世界中操作过程里人手和物体的3D理解。它引入了一个无标记的多摄像头系统,可在几乎不受限制的真实环境中捕捉手部和物体,克服了以往受控摄影棚数据集的局限。该系统与VR头显同步并校准,并通过自我-外部(ego-exo)跟踪流水线生成精确的3D标注。关键成果是SHOW3D数据集,它包含手部与物体在多种真实世界环境(包括户外)中交互的3D标注,有助于模型泛化到真实世界场景。多个下游任务上的实验验证了该方法的有效性。
FlowIt: Global Matching for Optical Flow with Confidence-Guided Refinement
Authors: Sadra Safadoust, Fabio Tosi, Matteo Poggi, Fatma Güney
First: 2026-03-30T17:58:12+00:00 · Latest: 2026-03-30T17:58:12+00:00
Abstract
We present FlowIt, a novel architecture for optical flow estimation designed to robustly handle large pixel displacements. At its core, FlowIt leverages a hierarchical transformer architecture that captures extensive global context, enabling the model to effectively model long-range correspondences. To overcome the limitations of localized matching, we formulate the flow initialization as an optimal transport problem. This formulation yields a highly robust initial flow field, alongside explicitly derived occlusion and confidence maps. These cues are then seamlessly integrated into a guided refinement stage, where the network actively propagates reliable motion estimates from high-confidence regions into ambiguous, low-confidence areas. Extensive experiments across the Sintel, KITTI, Spring, and LayeredFlow datasets validate the efficacy of our approach. FlowIt achieves state-of-the-art results on the competitive Sintel and KITTI benchmarks, while simultaneously establishing new state-of-the-art cross-dataset zero-shot generalization performance on Sintel, Spring, and LayeredFlow.
中文标题/摘要
标题:FlowIt:基于全局匹配与置信度引导细化的光流估计
我们提出了FlowIt,一种新颖的光流估计架构,旨在稳健处理大像素位移。FlowIt的核心是一种分层Transformer架构,能够捕捉广泛的全局上下文,使模型能够有效建模长距离对应关系。为克服局部匹配的局限性,我们将光流初始化形式化为最优传输问题。这种形式化不仅提供了高度鲁棒的初始光流场,还显式地导出了遮挡图和置信度图。这些线索随后被无缝集成到引导细化阶段,网络在此阶段主动将高置信度区域的可靠运动估计传播到模糊的低置信度区域。在Sintel、KITTI、Spring和LayeredFlow数据集上的广泛实验验证了我们方法的有效性。FlowIt在竞争激烈的Sintel和KITTI基准测试中达到了最先进的结果,同时在Sintel、Spring和LayeredFlow上取得了新的最先进的跨数据集零样本泛化性能。
Summary / 总结
FlowIt is a novel architecture for optical flow estimation that uses a hierarchical transformer to capture global context and handle large pixel displacements. It formulates flow initialization as an optimal transport problem, which yields a robust initial flow field together with explicitly derived occlusion and confidence maps. A guided refinement stage then propagates reliable motion estimates from high-confidence regions into low-confidence ones. FlowIt achieves state-of-the-art results on the Sintel and KITTI benchmarks and sets a new state of the art for cross-dataset zero-shot generalization on Sintel, Spring, and LayeredFlow.
FlowIt 是一种新颖的光流估计架构,利用分层Transformer捕捉全局上下文,以处理大像素位移。方法将光流初始化表述为最优传输问题,生成稳健的初始光流场以及显式推导的遮挡图和置信度图。这些线索被用于引导细化阶段,将高置信度区域的可靠运动估计传播到低置信度区域。FlowIt 在 Sintel 和 KITTI 基准上达到最先进的结果,并在 Sintel、Spring 和 LayeredFlow 上取得了新的最先进的跨数据集零样本泛化性能。
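To make the optimal-transport initialization in the FlowIt abstract more concrete, here is a minimal sketch using generic entropic optimal transport (Sinkhorn iterations) on a toy cost matrix. This is not the authors' formulation: the uniform marginals, the regularization value, the made-up costs, and the use of the row-normalized plan as a confidence score are all assumptions chosen for illustration.

```python
import math

def sinkhorn(cost, reg=0.1, n_iters=200):
    """Entropic optimal transport between uniform marginals (generic Sinkhorn iterations)."""
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-c / reg) for c in row] for row in cost]
    a, b = [1.0 / n] * n, [1.0 / m] * m
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def matches_with_confidence(plan):
    """For each source pixel, pick the most likely target and a confidence in [0, 1]."""
    out = []
    for row in plan:
        total = sum(row)
        j = max(range(len(row)), key=row.__getitem__)
        out.append((j, row[j] / total))
    return out

# Toy example: feature distances between 3 source and 3 target pixels (made-up values).
cost = [[0.9, 0.1, 0.8],
        [0.1, 0.9, 0.7],
        [0.8, 0.7, 0.1]]
plan = sinkhorn(cost)
matches = matches_with_confidence(plan)
```

Because the transport plan is computed jointly over all pixels, each match competes globally for mass, and a flat row in the plan signals an ambiguous pixel that a later refinement stage could treat as low confidence.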
SonoWorld: From One Image to a 3D Audio-Visual Scene
Authors: Derong Jin, Xiyi Chen, Ming C. Lin, Ruohan Gao
Venue: CVPR 2026
First: 2026-03-30T17:57:47+00:00 · Latest: 2026-03-30T17:57:47+00:00
Comments: Accepted by CVPR 2026, project page: https://humathe.github.io/sonoworld/
Abstract
Tremendous progress in visual scene generation now turns a single image into an explorable 3D world, yet immersion remains incomplete without sound. We introduce Image2AVScene, the task of generating a 3D audio-visual scene from a single image, and present SonoWorld, the first framework to tackle this challenge. From one image, our pipeline outpaints a 360° panorama, lifts it into a navigable 3D scene, places language-guided sound anchors, and renders ambisonics for point, areal, and ambient sources, yielding spatial audio aligned with scene geometry and semantics. Quantitative evaluations on a newly curated real-world dataset and a controlled user study confirm the effectiveness of our approach. Beyond free-viewpoint audio-visual rendering, we also demonstrate applications to one-shot acoustic learning and audio-visual spatial source separation. Project website: https://humathe.github.io/sonoworld/
中文标题/摘要
标题:SonoWorld:从一张图像到3D音视频场景
视觉场景生成的巨大进步如今可以将单张图像转化为可探索的3D世界,但没有声音,沉浸感仍不完整。我们提出了从单张图像生成3D音视频场景的任务Image2AVScene,并提出了SonoWorld,这是首个应对这一挑战的框架。从一张图像开始,我们的流水线外绘出360°全景,将其提升为可导航的3D场景,放置语言引导的声音锚点,并为点声源、面声源和环境声源渲染Ambisonics,生成与场景几何和语义对齐的空间音频。在新整理的真实世界数据集上的定量评估和受控用户研究均证实了我们方法的有效性。除了自由视角音视频渲染,我们还展示了其在单样本声学学习和音视频空间声源分离中的应用。项目网站:https://humathe.github.io/sonoworld/
Summary / 总结
The research aims to enhance the immersion of 3D scenes by integrating sound with visual elements. The method involves generating a 3D audio-visual scene from a single image through a pipeline that outpaints a 360° panorama, creates a navigable 3D environment, places sound anchors, and renders ambisonics. Key findings show that the approach effectively aligns spatial audio with scene geometry and semantics, as confirmed by quantitative evaluations and a user study. Beyond rendering, the framework also supports one-shot acoustic learning and audio-visual spatial source separation.
研究旨在通过结合声音元素来增强3D场景的沉浸感。方法是从单张图像生成3D音视频场景,包括外绘360°全景图、创建可导航的3D环境、放置声音锚点和渲染Ambisonics。关键发现表明,SonoWorld能够有效生成与场景几何和语义相匹配的空间音频,这得到了定量评估和用户研究的证实。除了渲染外,该框架还支持单样本声学学习和音视频空间声源分离的应用。
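The ambisonics rendering step for a point source in the SonoWorld abstract can be sketched with standard first-order ambisonic encoding. The sketch assumes the AmbiX convention (ACN channel order W, Y, Z, X with SN3D normalization), a coordinate frame with x forward, y left, and z up, and a simple clamped 1/distance gain; none of these choices are stated in the abstract, and areal and ambient sources are not covered.

```python
import math

def direction_to(listener, source):
    """Azimuth, elevation (radians) and distance from listener to source.

    Assumed frame: x forward, y left, z up.
    """
    dx, dy, dz = (s - l for s, l in zip(source, listener))
    distance = math.sqrt(dx * dx + dy * dy + dz * dz)
    azimuth = math.atan2(dy, dx)
    elevation = math.atan2(dz, math.hypot(dx, dy))
    return azimuth, elevation, distance

def encode_foa(sample, azimuth, elevation, distance=1.0):
    """Encode one mono sample into first-order ambisonics (W, Y, Z, X)."""
    gain = sample / max(distance, 1.0)  # hypothetical attenuation, clamped near the listener
    w = gain
    y = gain * math.sin(azimuth) * math.cos(elevation)
    z = gain * math.sin(elevation)
    x = gain * math.cos(azimuth) * math.cos(elevation)
    return (w, y, z, x)

# A source two metres to the listener's left, at ear height.
az, el, dist = direction_to((0.0, 0.0, 0.0), (0.0, 2.0, 0.0))
channels = encode_foa(1.0, az, el, dist)
```

Re-running the encoding as the listener position changes is what ties the rendered audio to the viewpoint in a navigable scene.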
Temporal Credit Is Free
Authors: Aur Shalev Merin
First: 2026-03-30T17:54:55+00:00 · Latest: 2026-03-30T17:54:55+00:00
Comments: 16 pages, 4 figures, 5 tables
Abstract
Recurrent networks do not need Jacobian propagation to adapt online. The hidden state already carries temporal credit through the forward pass; immediate derivatives suffice if you stop corrupting them with stale trace memory and normalize gradient scales across parameter groups. An architectural rule predicts when normalization is needed: β₂ is required when gradients must pass through a nonlinear state update with no output bypass, and unnecessary otherwise. Across ten architectures, real primate neural data, and streaming ML benchmarks, immediate derivatives with RMSprop match or exceed full RTRL, scaling to n = 1024 at 1000x less memory.
中文标题/摘要
标题:时间信用是免费的
循环网络不需要通过雅可比传播来在线适应。隐藏状态已经在前向传播过程中携带了时间信用;只要不再用过时的迹记忆污染即时导数,并在参数组之间归一化梯度尺度,即时导数就足够了。一个架构规则可以预测何时需要归一化:当梯度必须通过没有输出旁路的非线性状态更新时,需要β₂,否则不需要。在十种架构、真实灵长类神经数据和流式ML基准测试中,结合RMSprop的即时导数达到或超过完整RTRL的表现,并可扩展到n = 1024,内存占用约为其千分之一。
Summary / 总结
The study aims to demonstrate that recurrent networks can adapt online without Jacobian propagation by utilizing the hidden state to carry temporal credit during the forward pass. The method involves stopping the corruption of immediate derivatives with stale trace memory and normalizing gradient scales across parameter groups. Key findings show that immediate derivatives with RMSprop match or exceed full RTRL performance across various architectures, neural data, and ML benchmarks, and can scale to large networks with significantly less memory usage.
该研究表明,循环网络无需雅可比传播即可在线适应,因为隐藏状态已在前向传播中携带时间信用。做法是不再用过时的迹记忆污染即时导数,并在参数组之间归一化梯度尺度;仅当梯度必须通过没有输出旁路的非线性状态更新时才需要这种归一化。实验结果显示,结合RMSprop的即时导数在多种架构、神经数据和基准测试中达到或超过完整RTRL的表现,并能以少得多的内存扩展到大型网络。
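The update rule described in "Temporal Credit Is Free" can be illustrated with a minimal sketch: a scalar RNN trained online with immediate derivatives only, with per-parameter RMSprop scaling. The network size, the toy task, and the hyperparameters `lr` and `decay` are assumptions, and the task is too simple to test memory; the point is only to show that the previous hidden state is treated as a constant, so no Jacobian or trace is propagated through time.

```python
import math

def run_online(steps=3000, lr=0.01, decay=0.9, eps=1e-8):
    """Train a scalar RNN online using only immediate derivatives plus RMSprop."""
    params = {"w": 0.0, "u": 0.1, "v": 0.1}   # recurrent, input, output weights
    sq_avg = {name: 0.0 for name in params}    # RMSprop second-moment estimates
    h_prev, losses = 0.0, []
    for t in range(steps):
        x = 1.0 if t % 2 == 0 else -1.0        # toy input stream
        target = 0.5 * x                       # toy target (hypothetical task)
        h = math.tanh(params["w"] * h_prev + params["u"] * x)
        err = params["v"] * h - target
        losses.append(0.5 * err * err)
        # Immediate derivatives: h_prev is treated as a constant, so no Jacobian
        # or eligibility trace is carried across time steps.
        dh = err * params["v"] * (1.0 - h * h)
        grads = {"w": dh * h_prev, "u": dh * x, "v": err * h}
        for name, g in grads.items():
            sq_avg[name] = decay * sq_avg[name] + (1.0 - decay) * g * g
            params[name] -= lr * g / (math.sqrt(sq_avg[name]) + eps)
        h_prev = h
    return losses

losses = run_online()
early = sum(losses[:100]) / 100
late = sum(losses[-100:]) / 100
```

Memory use here is constant in the sequence length because only the current hidden state and one running average per parameter are stored, which is the property the abstract contrasts with full RTRL.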
To Augment or Not to Augment? Diagnosing Distributional Symmetry Breaking
Authors: Hannah Lawrence, Elyssa Hofgard, Vasco Portilheiro, Yuxuan Chen, Tess Smidt, Robin Walters
Venue: ICLR 2026
First: 2025-10-01T18:26:33+00:00 · Latest: 2026-03-30T17:52:45+00:00
Comments: Published as a conference paper at ICLR 2026. A short version of this paper appeared at the ICLR AI4Mat workshop in April 2025
Abstract
Symmetry-aware methods for machine learning, such as data augmentation and equivariant architectures, encourage correct model behavior on all transformations (e.g. rotations or permutations) of the original dataset. These methods can improve generalization and sample efficiency, under the assumption that the transformed datapoints are highly probable, or "important", under the test distribution. In this work, we develop a method for critically evaluating this assumption. In particular, we propose a metric to quantify the amount of symmetry breaking in a dataset, via a two-sample classifier test that distinguishes between the original dataset and its randomly augmented equivalent. We validate our metric on synthetic datasets, and then use it to uncover surprisingly high degrees of symmetry-breaking in several benchmark point cloud datasets, constituting a severe form of dataset bias. We show theoretically that distributional symmetry-breaking can prevent invariant methods from performing optimally even when the underlying labels are truly invariant, for invariant ridge regression in the infinite feature limit. Empirically, the implication for symmetry-aware methods is dataset-dependent: equivariant methods still impart benefits on some symmetry-biased datasets, but not others, particularly when the symmetry bias is predictive of the labels. Overall, these findings suggest that understanding equivariance -- both when it works, and why -- may require rethinking symmetry biases in the data.
中文标题/摘要
标题:增广还是不增广?诊断分布对称性破缺
面向机器学习的对称性感知方法,如数据增广和等变架构,鼓励模型在原始数据集的所有变换(例如旋转或置换)上表现正确。这些方法可以提高泛化能力和样本效率,但前提是变换后的数据点在测试分布下具有较高概率,即“重要”。在本研究中,我们开发了一种方法来批判性地评估这一假设。具体而言,我们提出了一种量化数据集中对称性破缺程度的度量,它基于两样本分类器检验,用于区分原始数据集与其随机增广后的版本。我们在合成数据集上验证了该度量,并用它揭示了若干基准点云数据集中出人意料的高度对称性破缺,这构成了一种严重的数据集偏差。理论上,我们针对无限特征极限下的不变岭回归证明,即使底层标签确实不变,分布层面的对称性破缺也会使不变方法无法达到最优。实证上,这对对称性感知方法的影响取决于数据集:等变方法在某些存在对称性偏差的数据集上仍然有益,但在另一些数据集上则不然,尤其是当对称性偏差能够预测标签时。总体而言,这些发现表明,要理解等变性何时有效以及为何有效,可能需要重新思考数据中的对称性偏差。
Summary / 总结
This work evaluates the assumption that augmented data is important under the test distribution, crucial for symmetry-aware methods like data augmentation and equivariant architectures. The authors propose a metric to measure symmetry breaking in datasets using a two-sample classifier test. They find high degrees of symmetry-breaking in benchmark point cloud datasets, indicating severe dataset bias. Theoretical and empirical results show that distributional symmetry-breaking can hinder the performance of invariant methods, and the benefits of equivariant methods depend on the dataset's symmetry bias.
研究提出了一种度量数据集对称性破缺程度的指标,通过两样本分类器检验区分原始数据与其随机增广版本。研究发现,基准点云数据集中的对称性破缺程度很高,表明存在严重的数据集偏差。理论和实验表明,分布层面的对称性破缺会妨碍不变方法达到最优;等变方法在某些数据集上仍然有益,但在另一些数据集上无效,尤其是当对称性偏差能够预测标签时。这表明理解等变性需要重新审视数据中的对称性偏差。
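The two-sample classifier test from the symmetry-breaking abstract can be sketched on toy 2D data. The sketch assumes rotations as the symmetry group and swaps in a nearest-centroid classifier for simplicity; the paper's actual classifier, datasets, and metric definition are not given in the abstract, so everything below is illustrative. A held-out accuracy near 0.5 means the data is indistinguishable from its randomly rotated copy; higher accuracy indicates symmetry breaking.

```python
import math
import random

def rotate(p, theta):
    """Rotate a 2D point counterclockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1])

def centroid(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def symmetry_breaking_score(data, rng):
    """Held-out accuracy of a classifier separating data from its randomly rotated copy."""
    augmented = [rotate(p, rng.uniform(0, 2 * math.pi)) for p in data]
    half = len(data) // 2
    # "Train" a nearest-centroid classifier on the first half of each sample.
    c_orig, c_aug = centroid(data[:half]), centroid(augmented[:half])

    def predict_original(p):
        return math.dist(p, c_orig) < math.dist(p, c_aug)

    # Evaluate on the held-out second half of each sample.
    correct = sum(predict_original(p) for p in data[half:])
    correct += sum(not predict_original(p) for p in augmented[half:])
    return correct / (2 * (len(data) - half))

rng = random.Random(0)
# Canonically oriented data: angles cluster near 0, so rotation symmetry is broken.
biased = [(math.cos(a), math.sin(a)) for a in (rng.gauss(0.0, 0.2) for _ in range(2000))]
# Rotation-symmetric data: angles are uniform on the circle.
symmetric = [(math.cos(a), math.sin(a)) for a in (rng.uniform(0, 2 * math.pi) for _ in range(2000))]
biased_score = symmetry_breaking_score(biased, rng)
symmetric_score = symmetry_breaking_score(symmetric, rng)
```

A stronger classifier would detect subtler asymmetries; the centroid rule is used here only because it keeps the example dependency-free.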
Equivariant symmetry-aware head pose estimation for fetal MRI
Authors: Ramya Muthukrishnan, Borjan Gagoski, Aryn Lee, P. Ellen Grant, Elfar Adalsteinsson, Benjamin Billot, Polina Golland
First: 2025-12-04T15:15:55+00:00 · Latest: 2026-03-30T17:52:11+00:00
Abstract
We present E(3)-Pose, a novel fast pose estimation method that jointly and explicitly models rotation equivariance and object symmetry. Our work is motivated by the challenging problem of accounting for fetal head motion during a diagnostic MRI scan. We aim to enable automatic adaptive prescription of diagnostic 2D MRI slices with 6-DoF head pose estimation, supported by rapid low-resolution 3D MRI volumes acquired before each 2D slice. Existing pose estimation methods struggle to generalize to clinical volumes due to pose ambiguities induced by inherent anatomical symmetries, as well as low resolution, noise, and artifacts. In contrast, E(3)-Pose captures anatomical symmetries and rigid pose equivariance by construction, and yields robust estimates of the fetal head pose. Our experiments on publicly available and representative clinical fetal MRI datasets demonstrate the superior robustness and generalization of our method across domains. Crucially, E(3)-Pose achieves state-of-the-art accuracy on clinical MRI volumes, supporting future clinical translation. Our implementation is publicly available at github.com/MedicalVisionGroup/E3-Pose.
中文标题/摘要
标题:面向胎儿MRI的等变对称感知头部姿态估计
我们提出了E(3)-Pose,这是一种新颖的快速姿态估计方法,能够同时和明确地建模旋转等变性和物体对称性。我们的工作旨在解决诊断MRI扫描中胎儿头部运动的挑战性问题。我们旨在通过支持快速低分辨率3D MRI体积的6-DoF头部姿态估计,实现自动适应性诊断2D MRI切片的处方。现有的姿态估计方法由于固有的解剖对称性引起的姿态歧义,以及低分辨率、噪声和伪影,难以在临床体积上泛化。相比之下,E(3)-Pose通过构造捕捉解剖对称性和刚性姿态等变性,从而提供胎儿头部姿态的稳健估计。我们在公开可用和代表性的临床胎儿MRI数据集上的实验表明,我们的方法在不同领域具有优越的稳健性和泛化能力。至关重要的是,E(3)-Pose在临床MRI体积上达到了最先进的准确性,支持未来的临床转化。我们的实现可在github.com/MedicalVisionGroup/E3-Pose获取。
Summary / 总结
E(3)-Pose is a novel pose estimation method that models rotation equivariance and object symmetry to address the challenge of fetal head motion during MRI scans. It aims to enable automatic 6-DoF head pose estimation for diagnostic 2D MRI slices using low-resolution 3D MRI volumes. Experiments show that E(3)-Pose outperforms existing methods in terms of robustness and generalization across different clinical datasets, achieving state-of-the-art accuracy on clinical MRI volumes.
E(3)-Pose 是一种新颖的姿态估计方法,通过建模旋转等变性和物体对称性来解决胎儿头部在 MRI 扫描中运动的挑战。它旨在使用低分辨率 3D MRI 体积自动进行 6-DoF 头部姿态估计,以支持诊断 2D MRI 切片。实验表明,E(3)-Pose 在不同临床数据集上的鲁棒性和泛化能力优于现有方法,并在临床 MRI 体积上达到了最先进的准确性。
See it to Place it: Evolving Macro Placements with Vision-Language Models
Authors: Ikechukwu Uchendu, Swati Goel, Karly Hou, Ebrahim Songhori, Kuang-Huei Lee, Joe Wenjie Jiang, Vijay Janapa Reddi, Vincent Zhuang
First: 2026-03-30T17:47:34+00:00 · Latest: 2026-03-30T17:47:34+00:00
Comments: 31 pages, 11 figures, 14 tables
Abstract
We propose using Vision-Language Models (VLMs) for macro placement in chip floorplanning, a complex optimization task that has recently shown promising advancements through machine learning methods. Because human designers rely heavily on spatial reasoning to arrange components on the chip canvas, we hypothesize that VLMs with strong visual reasoning abilities can effectively complement existing learning-based approaches. We introduce VeoPlace (Visual Evolutionary Optimization Placement), a novel framework that uses a VLM, without any fine-tuning, to guide the actions of a base placer by constraining them to subregions of the chip canvas. The VLM proposals are iteratively optimized through an evolutionary search strategy with respect to resulting placement quality. On open-source benchmarks, VeoPlace outperforms the best prior learning-based approach on 9 of 10 benchmarks with peak wirelength reductions exceeding 32%. We further demonstrate that VeoPlace generalizes to analytical placers, improving DREAMPlace performance on all 8 evaluated benchmarks with gains up to 4.3%. Our approach opens new possibilities for electronic design automation tools that leverage foundation models to solve complex physical design problems.
中文标题/摘要
标题:见之即置:利用视觉语言模型进行宏放置优化
我们提出使用视觉语言模型(VLMs)进行芯片布图规划中的宏放置,这是一个复杂的优化任务,最近通过机器学习方法显示出有希望的进步。由于人类设计师在安排芯片画布上的组件时高度依赖空间推理,我们假设具有强大视觉推理能力的VLMs可以有效地补充现有的基于学习的方法。我们引入了VeoPlace(Visual Evolutionary Optimization Placement,视觉进化优化放置)框架,该框架使用一个未经任何微调的VLM来指导基础放置器的动作,将其限制在芯片画布的子区域。VLM的提案通过进化搜索策略,依据所得的放置质量进行迭代优化。在开源基准测试中,VeoPlace在10个基准中的9个上优于此前最佳的基于学习的方法,线长降幅最高超过32%。我们进一步证明VeoPlace可以泛化到分析型放置器,在全部8个评估基准上提升了DREAMPlace的性能,最高增幅达4.3%。我们的方法为利用基础模型解决复杂物理设计问题的电子设计自动化工具打开了新的可能性。
Summary / 总结
The research aims to enhance chip floorplanning by using Vision-Language Models (VLMs) for macro placement, on the hypothesis that their visual reasoning can complement existing learning-based placers. VeoPlace, a novel framework, uses a VLM without any fine-tuning to guide a base placer by constraining its actions to subregions of the chip canvas, and iteratively optimizes the VLM proposals with an evolutionary search over placement quality. On open-source benchmarks, VeoPlace outperforms the best prior learning-based approach on 9 of 10 benchmarks, with peak wirelength reductions exceeding 32%. It also improves DREAMPlace, an analytical placer, on all 8 evaluated benchmarks, with gains of up to 4.3%.
论文提出使用视觉语言模型(VLMs)进行芯片布图中的宏放置,旨在通过利用VLMs的视觉推理能力来增强现有的基于学习的方法。VeoPlace是一种新颖的框架,使用VLMs来引导基础放置器的行动,将其限制在芯片画布的特定区域,并通过进化搜索策略迭代优化提案。结果表明,VeoPlace在10个基准中的9个上优于之前的基于学习的方法,线长降幅最高超过32%,并且也提高了分析型放置器的性能,展示了其泛化能力。
Pandora: Articulated 3D Scene Graphs from Egocentric Vision
Authors: Alan Yu, Yun Chang, Christopher Xie, Luca Carlone
First: 2026-03-30T17:47:07+00:00 · Latest: 2026-03-30T17:47:07+00:00
Comments: 14 pages, 5 figures. Presented at the 2025 British Machine Vision Conference (BMVC) in Sheffield, UK
Abstract
Robotic mapping systems typically approach building metric-semantic scene representations from the robot's own sensors and cameras. However, these "first person" maps inherit the robot's own limitations due to its embodiment or skillset, which may leave many aspects of the environment unexplored. For example, the robot might not be able to open drawers or access wall cabinets. In this sense, the map representation is not as complete, and requires a more capable robot to fill in the gaps. We narrow these blind spots in current methods by leveraging egocentric data captured as a human naturally explores a scene wearing Project Aria glasses, giving a way to directly transfer knowledge about articulation from the human to any deployable robot. We demonstrate that, by using simple heuristics, we can leverage egocentric data to recover models of articulate object parts, with quality comparable to those of state-of-the-art methods based on other input modalities. We also show how to integrate these models into 3D scene graph representations, leading to a better understanding of object dynamics and object-container relationships. We finally demonstrate that these articulated 3D scene graphs enhance a robot's ability to perform mobile manipulation tasks, showcasing an application where a Boston Dynamics Spot is tasked with retrieving concealed target items, given only the 3D scene graph as input.
中文标题/摘要
标题:Pandora:基于第一人称视觉的铰接式3D场景图
机器人制图系统通常从机器人的自身传感器和摄像头构建度量语义场景表示。然而,这些“第一人称”地图继承了机器人因其本体形态或技能而带来的局限性,可能导致环境中的许多方面未被探索。例如,机器人可能无法打开抽屉或够到壁柜。从这个意义上说,地图表示并不完整,需要能力更强的机器人来填补这些空白。我们利用人类佩戴Project Aria眼镜自然探索场景时捕获的第一人称数据,缩小了当前方法中的这些盲点,从而提供了一种将铰接关系(articulation)知识从人类直接迁移到任何可部署机器人的方式。我们证明,通过使用简单的启发式方法,可以利用第一人称数据恢复铰接物体部件的模型,其质量与基于其他输入模态的最先进方法相当。我们还展示了如何将这些模型整合到3D场景图表示中,从而更好地理解物体动态和物体-容器关系。最后,我们证明这些铰接式3D场景图能够增强机器人执行移动操作任务的能力,并展示了一个应用:仅以3D场景图作为输入,让波士顿动力公司的Spot机器人取回被遮蔽的目标物品。
Summary / 总结
This paper addresses the limitations of robotic mapping systems by proposing a method to create articulated 3D scene graphs using egocentric data from human exploration. The approach leverages simple heuristics to recover models of articulate object parts, comparable to state-of-the-art methods. The integration of these models into 3D scene graphs enhances a robot's ability to perform mobile manipulation tasks, as demonstrated by a Boston Dynamics Spot retrieving concealed target items based solely on the 3D scene graph.
该论文提出利用人类探索时采集的第一人称数据来创建铰接式3D场景图,以弥补机器人制图系统的局限性。该方法利用简单的启发式方法恢复铰接物体部件的模型,其质量与基于其他输入模态的最先进方法相当。将这些模型整合到3D场景图中,增强了机器人执行移动操作任务的能力,例如仅以3D场景图作为输入,让波士顿动力公司的Spot机器人取回被遮蔽的目标物品。
SAGAI-MID: A Generative AI-Driven Middleware for Dynamic Runtime Interoperability
Authors: Oliver Aleksander Larsen, Mahyar T. Moghaddam
First: 2026-03-30T17:46:41+00:00 · Latest: 2026-03-30T17:46:41+00:00
Comments: Accepted at SAGAI 2026, co-located with IEEE ICSA 2026. 8 pages
Abstract
Modern distributed systems integrate heterogeneous services, REST APIs with different schema versions, GraphQL endpoints, and IoT devices with proprietary payloads that suffer from persistent schema mismatches. Traditional static adapters require manual coding for every schema pair and cannot handle novel combinations at runtime. We present SAGAI-MID, a FastAPI-based middleware that uses large language models (LLMs) to dynamically detect and resolve schema mismatches at runtime. The system employs a five-layer pipeline: hybrid detection (structural diff plus LLM semantic analysis), dual resolution strategies (per-request LLM transformation and LLM-generated reusable adapter code), and a three-tier safeguard stack (validation, ensemble voting, rule-based fallback). We frame the architecture through Bass et al.'s interoperability tactics, transforming them from design-time artifacts into runtime capabilities. We evaluate SAGAI-MID on 10 interoperability scenarios spanning REST version migration, IoT-to-analytics bridging, and GraphQL protocol conversion across six LLMs from two providers. The best-performing configuration achieves 0.90 pass@1 accuracy. The CODEGEN strategy consistently outperforms DIRECT (0.83 vs 0.77 mean pass@1), while cost varies by over 30x across models with no proportional accuracy gain; the most accurate model is also the cheapest. We discuss implications for software architects adopting LLMs as runtime architectural components.
中文标题/摘要
标题:SAGAI-MID:一种生成式AI驱动的中间件,用于动态运行时互操作性
现代分布式系统整合了异构服务、具有不同模式版本的REST API、GraphQL端点以及具有专有负载的物联网设备,这些设备遭受持续的模式不匹配。传统的静态适配器需要为每对模式手动编写代码,并且无法在运行时处理新的组合。我们提出了SAGAI-MID,这是一种基于FastAPI的中间件,使用大型语言模型(LLMs)在运行时动态检测和解决模式不匹配问题。该系统采用五层流水线:混合检测(结构差异加上LLM语义分析)、双重解决策略(每请求LLM转换和LLM生成的可重用适配器代码)以及三层保障堆栈(验证、集成投票、基于规则的回退)。我们通过Bass等人提出的互操作性策略框架化该架构,将它们从设计时的产物转化为运行时的能力。我们在六种来自两家供应商的大型语言模型上评估了SAGAI-MID,涉及10个互操作性场景,包括REST版本迁移、物联网到分析的桥梁构建以及GraphQL协议转换。最佳配置的准确率为0.90 pass@1。CODEGEN策略始终优于DIRECT(0.83 vs 0.77平均pass@1),而成本在模型之间相差超过30倍,没有相应的准确率提升;最准确的模型也是最便宜的。我们讨论了软件架构师采用LLMs作为运行时架构组件的影响。
Summary / 总结
SAGAI-MID is a FastAPI-based middleware that uses large language models to dynamically detect and resolve schema mismatches between heterogeneous services at runtime. It employs a five-layer pipeline (hybrid detection, dual resolution strategies, and a three-tier safeguard stack) and is evaluated on 10 interoperability scenarios across six LLMs, achieving 0.90 pass@1 accuracy in the best configuration. The CODEGEN strategy outperforms DIRECT (0.83 vs 0.77 mean pass@1). Cost varies by over 30x across models with no proportional accuracy gain, and the most accurate model is also the cheapest.
SAGAI-MID 是一个基于 FastAPI 的中间件,使用大型语言模型在运行时动态检测和解决异构服务中的模式不匹配问题。它采用五层流水线,并将互操作性策略从设计时产物转化为运行时能力。在 10 个场景中的评估显示,最佳配置实现了 0.90 的 pass@1 准确率,CODEGEN 策略优于 DIRECT;不同模型的成本相差超过 30 倍,却没有带来相应的准确率提升,而最准确的模型同时也是最便宜的。
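The structural half of the hybrid detection layer described above can be illustrated with a plain field-by-field comparison. The sketch below is a minimal illustration and not SAGAI-MID's implementation: the flat `field -> type name` schema format, the `structural_diff` function, and the sample schemas are all assumptions made for this example, and the LLM semantic analysis step is left out entirely.

```python
from typing import Dict, List

# Hypothetical flat schema format: field name -> type name.
Schema = Dict[str, str]


def structural_diff(source: Schema, target: Schema) -> Dict[str, List]:
    """Compare two flat schemas and report structural mismatches."""
    # Fields the target expects but the source does not provide.
    missing = sorted(f for f in target if f not in source)
    # Fields the source provides but the target does not know.
    extra = sorted(f for f in source if f not in target)
    # Fields present on both sides whose declared types disagree.
    type_mismatch = sorted(
        (f, source[f], target[f])
        for f in source
        if f in target and source[f] != target[f]
    )
    return {"missing": missing, "extra": extra, "type_mismatch": type_mismatch}


# Illustrative (made-up) schemas for two versions of the same payload.
v1 = {"id": "int", "name": "str", "temp_c": "float"}
v2 = {"id": "str", "full_name": "str", "temp_c": "float"}
report = structural_diff(v1, v2)
print(report)
```

A report like this cannot tell that `name` and `full_name` carry the same value; pairing renamed fields of that kind is presumably where a semantic analysis stage would be needed.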
SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning
Authors: Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen, Stephen Hart, Ondrej Biza
First: 2026-03-30T17:46:31+00:00 · Latest: 2026-03-30T17:46:31+00:00
Abstract
Vision-language models (VLMs) have shown impressive capabilities across diverse tasks, motivating efforts to leverage these models to supervise robot learning. However, when used as evaluators in reinforcement learning (RL), today's strongest models often fail under partial observability and distribution shift, enabling policies to exploit perceptual errors rather than solve the task. To address this limitation, we introduce SOLE-R1 (Self-Observing LEarner), a video-language reasoning model explicitly designed to serve as the sole reward signal for online RL. Given only raw video observations and a natural-language goal, SOLE-R1 performs per-timestep spatiotemporal chain-of-thought (CoT) reasoning and produces dense estimates of task progress that can be used directly as rewards. To train SOLE-R1, we develop a large-scale video trajectory and reasoning synthesis pipeline that generates temporally grounded CoT traces aligned with continuous progress supervision. This data is combined with foundational spatial and multi-frame temporal reasoning, and used to train the model with a hybrid framework that couples supervised fine-tuning with RL from verifiable rewards. Across four different simulation environments and a real-robot setting, SOLE-R1 enables zero-shot online RL from random initialization: robots learn previously unseen manipulation tasks without ground-truth rewards, success indicators, demonstrations, or task-specific tuning. SOLE-R1 succeeds on 24 unseen tasks and substantially outperforms strong vision-language rewarders, including GPT-5 and Gemini-3-Pro, while exhibiting markedly greater robustness to reward hacking.
中文标题/摘要
标题:SOLE-R1:仅基于视频-语言推理的机器人在线强化学习奖励
视觉-语言模型(VLMs)在多种任务中展现了令人印象深刻的能力,激发了利用这些模型监督机器人学习的努力。然而,当作为强化学习(RL)中的评估器使用时,当今最强的模型往往在部分可观测性和分布偏移下失效,使策略得以利用感知错误而非真正解决任务。为解决这一局限,我们引入了SOLE-R1(Self-Observing LEarner),这是一种专门设计用于作为在线RL唯一奖励信号的视频-语言推理模型。仅给定原始视频观察和自然语言目标,SOLE-R1进行逐时间步的时空链式思考(CoT)推理,并生成可以直接用作奖励的密集任务进度估计。为了训练SOLE-R1,我们开发了一个大规模的视频轨迹和推理合成管道,生成与连续进度监督对齐、具有时间定位的CoT推理轨迹。这些数据与基础的空间推理和多帧时间推理数据相结合,并通过一个将监督微调与基于可验证奖励的RL相耦合的混合框架来训练模型。在四个不同的模拟环境和一个真实机器人设置中,SOLE-R1实现了从随机初始化开始的零样本在线RL:机器人在没有真实奖励、成功指标、演示或任务特定调整的情况下学习此前未见过的操作任务。SOLE-R1在24个未见过的任务上取得成功,并显著优于包括GPT-5和Gemini-3-Pro在内的强大视觉-语言奖励器,同时对奖励作弊表现出明显更强的鲁棒性。
Summary / 总结
SOLE-R1 is a video-language reasoning model designed to serve as the sole reward signal for online reinforcement learning. It performs per-timestep spatiotemporal chain-of-thought reasoning to produce dense estimates of task progress, and it is trained on data from a large-scale video trajectory and reasoning synthesis pipeline. Across four simulation environments and a real-robot setting, SOLE-R1 enables robots to learn unseen manipulation tasks from random initialization without ground-truth rewards, success indicators, or demonstrations, outperforming other vision-language rewarders and showing greater robustness to reward hacking.
SOLE-R1 是一种用于在线强化学习的视频-语言推理模型,它执行时空链式思考推理并生成密集的任务进度估计。在四个模拟环境和一个真实机器人设置中,SOLE-R1 能够在没有真实奖励、成功指标或演示的情况下学习此前未见过的操作任务,优于其他视觉-语言奖励器,同时对奖励作弊具有更强的鲁棒性。
Stepwise Credit Assignment for GRPO on Flow-Matching Models
Authors: Yash Savani, Branislav Kveton, Yuchen Liu, Yilin Wang, Jing Shi, Subhojyoti Mukherjee, Nikos Vlassis, Krishna Kumar Singh
Venue: CVPR
First: 2026-03-30T17:35:14+00:00 · Latest: 2026-03-30T17:35:14+00:00
Comments: Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026 Project page: https://stepwiseflowgrpo.com
Abstract
Flow-GRPO successfully applies reinforcement learning to flow models, but uses uniform credit assignment across all steps. This ignores the temporal structure of diffusion generation: early steps determine composition and content (low-frequency structure), while late steps resolve details and textures (high-frequency details). Moreover, assigning uniform credit based solely on the final image can inadvertently reward suboptimal intermediate steps, especially when errors are corrected later in the diffusion trajectory. We propose Stepwise-Flow-GRPO, which assigns credit based on each step's reward improvement. By leveraging Tweedie's formula to obtain intermediate reward estimates and introducing gain-based advantages, our method achieves superior sample efficiency and faster convergence. We also introduce a DDIM-inspired SDE that improves reward quality while preserving stochasticity for policy gradients.
中文标题/摘要
标题:面向流匹配模型的GRPO逐步信用分配
Flow-GRPO成功地将强化学习应用于流模型,但在所有步骤上采用均匀的信用分配。这忽略了扩散生成的时间结构:早期步骤决定构图和内容(低频结构),而后期步骤则完善细节和纹理(高频细节)。此外,仅基于最终图像均匀分配信用可能会无意中奖励次优的中间步骤,尤其是当错误在扩散轨迹后期被纠正时。我们提出了Stepwise-Flow-GRPO,根据每个步骤带来的奖励改进来分配信用。通过利用Tweedie公式获得中间奖励估计,并引入基于增益的优势,我们的方法实现了更高的样本效率和更快的收敛速度。我们还引入了一种受DDIM启发的SDE,在提高奖励质量的同时保留了策略梯度所需的随机性。
Summary / 总结
The research aims to improve the credit assignment in flow models by considering the temporal structure of diffusion generation. Stepwise-Flow-GRPO assigns credit based on each step's reward improvement, using Tweedie's formula and gain-based advantages. This method enhances sample efficiency and convergence speed compared to uniform credit assignment. The study introduces a DDIM-inspired SDE to improve reward quality while maintaining stochasticity for policy gradients.
研究旨在通过考虑扩散生成的时间结构来改进流模型中的信用分配。Stepwise-Flow-GRPO 根据每一步的奖励改进来分配信用,使用 Tweedie 公式和基于增益的优势。该方法相比均匀信用分配提高了样本效率和收敛速度。研究还引入了一种受 DDIM 启发的 SDE,以提高奖励质量同时保持策略梯度的随机性。
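The idea of crediting each step by its reward improvement can be made concrete with a small numeric sketch. The code below is a hedged illustration under stated assumptions, not the paper's algorithm: the intermediate reward estimates are taken as given inputs (the paper obtains them via Tweedie's formula, which is not reproduced here), and normalizing each step's gain across the sampled group is an assumed GRPO-style choice. The function names `stepwise_gains` and `gain_advantages` are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def stepwise_gains(rewards: List[List[float]]) -> List[List[float]]:
    """Per-step reward improvement r[t+1] - r[t] for every trajectory."""
    return [[r[t + 1] - r[t] for t in range(len(r) - 1)] for r in rewards]


def gain_advantages(rewards: List[List[float]], eps: float = 1e-8) -> List[List[float]]:
    """Normalize each step's gain across the group (assumed GRPO-style)."""
    gains = stepwise_gains(rewards)
    steps = len(gains[0])
    advantages = [[0.0] * steps for _ in gains]
    for t in range(steps):
        column = [g[t] for g in gains]
        mu, sigma = mean(column), pstdev(column)
        for i, g in enumerate(column):
            advantages[i][t] = (g - mu) / (sigma + eps)
    return advantages


# Made-up intermediate reward estimates: 3 trajectories, 2 denoising steps.
group_rewards = [
    [0.1, 0.4, 0.5],  # most of the improvement happens early
    [0.1, 0.2, 0.6],  # most of the improvement happens late
    [0.1, 0.3, 0.3],  # the last step adds nothing
]
gains = stepwise_gains(group_rewards)
advantages = gain_advantages(group_rewards)
print(gains)
print(advantages)
```

Because the gains telescope, their sum over a trajectory equals the final reward estimate minus the initial one, so the per-step signal redistributes the same total credit rather than inventing new reward.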
Dynamic Dual-Granularity Skill Bank for Agentic RL
Authors: Songjun Tu, Chengdong Xu, Qichao Zhang, Yaocheng Zhang, Xiangyuan Lan, Linjing Li, Dongbin Zhao
First: 2026-03-30T17:32:11+00:00 · Latest: 2026-03-30T17:32:11+00:00
Comments: 12 pages
Abstract
Agentic reinforcement learning (RL) can benefit substantially from reusable experience, yet existing skill-based methods mainly extract trajectory-level guidance and often lack principled mechanisms for maintaining an evolving skill memory. We propose D2Skill, a dynamic dual-granularity skill bank for agentic RL that organizes reusable experience into task skills for high-level guidance and step skills for fine-grained decision support and error correction. D2Skill jointly trains the policy and skill bank through paired baseline and skill-injected rollouts under the same policy, using their performance gap to derive hindsight utility signals for both skill updating and policy optimization. Built entirely from training-time experience, the skill bank is continuously expanded through reflection and maintained with utility-aware retrieval and pruning. Experiments on ALFWorld and WebShop with Qwen2.5-7B-Instruct and Qwen3-4B-Instruct-2507 show that D2Skill consistently improves success rates over skill-free baselines by 10-20 points. Further ablations and analyses show that both dual-granularity skill modeling and dynamic skill maintenance are critical to these gains, while the learned skills exhibit higher utility, transfer across evaluation settings, and introduce only modest training overhead.
中文标题/摘要
标题:动态双粒度技能库用于代理强化学习
代理强化学习(RL)可以从可重用的经验中受益匪浅,但现有的基于技能的方法主要提取轨迹级别的指导,往往缺乏维持不断演化的技能记忆的原理性机制。我们提出了D2Skill,一种用于代理强化学习的动态双粒度技能库,将可重用的经验组织成任务技能以提供高层次的指导,并组织成步骤技能以提供细粒度的决策支持和错误纠正。D2Skill在同一策略下进行成对的基线rollout和技能注入rollout,以此联合训练策略和技能库,并利用两者的性能差距推导出事后效用信号,用于技能更新和策略优化。技能库完全基于训练时的经验构建,通过反思不断扩展,并通过效用感知的检索和剪枝进行维护。在ALFWorld和WebShop上使用Qwen2.5-7B-Instruct和Qwen3-4B-Instruct-2507进行的实验表明,D2Skill相比无技能基线将成功率稳定提高了10-20个百分点。进一步的消融实验和分析表明,双粒度技能建模和动态技能维护对于这些改进都至关重要,而学习到的技能具有更高的效用,能够在不同的评估设置之间迁移,并且仅引入了适度的训练开销。
Summary / 总结
D2Skill is a dynamic dual-granularity skill bank for agentic reinforcement learning that enhances policy performance by organizing experience into task and step skills. It trains the policy and skill bank together using performance gaps to update skills and optimize the policy. Experiments show D2Skill improves success rates by 10-20 points over skill-free baselines on ALFWorld and WebShop tasks. The dual-granularity and dynamic maintenance of skills are crucial for these improvements, and the learned skills are transferable and have low training overhead.
D2Skill 是一种动态的双粒度技能库,通过将经验组织成任务技能和步骤技能来增强强化学习中的策略性能。它通过使用性能差距来共同训练策略和技能库以更新技能和优化策略。实验表明,D2Skill 在 ALFWorld 和 WebShop 任务上将成功率提高了 10-20 个百分点。双粒度和动态维护技能对于这些改进至关重要,且学习到的技能具有可迁移性且训练开销较低。
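The utility-aware maintenance loop described above can be sketched as a tiny data structure. This is an illustrative sketch under assumptions, not D2Skill's implementation: the `SkillBank` class, the exponential-moving-average update with `step_size`, and the skill names are all hypothetical, and the gap between a skill-injected rollout's return and the paired baseline return stands in for the hindsight utility signal.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Skill:
    text: str
    utility: float = 0.0  # running estimate of how much the skill helps
    uses: int = 0


class SkillBank:
    def __init__(self, capacity: int, step_size: float = 0.5) -> None:
        self.capacity = capacity
        self.step_size = step_size
        self.skills: Dict[str, Skill] = {}

    def add(self, name: str, text: str) -> None:
        self.skills[name] = Skill(text)

    def update(self, name: str, skill_return: float, baseline_return: float) -> None:
        """Move the utility toward the gap between paired rollouts."""
        skill = self.skills[name]
        gap = skill_return - baseline_return
        skill.utility += self.step_size * (gap - skill.utility)
        skill.uses += 1

    def retrieve(self, k: int) -> List[str]:
        """Return the names of the k highest-utility skills."""
        ranked = sorted(self.skills, key=lambda n: self.skills[n].utility, reverse=True)
        return ranked[:k]

    def prune(self) -> None:
        """Keep only the `capacity` highest-utility skills."""
        keep = self.retrieve(self.capacity)
        self.skills = {name: self.skills[name] for name in keep}


bank = SkillBank(capacity=2)
bank.add("open_before_take", "Open a container before trying to take from it.")
bank.add("revisit_rooms", "Revisit every room before acting.")
bank.add("read_price", "Read the price before adding an item to the cart.")
bank.update("open_before_take", skill_return=1.0, baseline_return=0.0)
bank.update("revisit_rooms", skill_return=0.0, baseline_return=1.0)
bank.update("read_price", skill_return=1.0, baseline_return=1.0)
bank.prune()
print(bank.retrieve(2))
```

A skill that made the paired rollout worse ends up with negative utility and is the first to be pruned once the bank exceeds its capacity.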
DreamLite: A Lightweight On-Device Unified Model for Image Generation and Editing
Authors: Kailai Feng, Yuxiang Wei, Bo Chen, Yang Pan, Hu Ye, Songwei Liu, Chenqian Yan, Yuan Gao
First: 2026-03-30T17:30:25+00:00 · Latest: 2026-03-30T17:30:25+00:00
Comments: https://carlofkl.github.io/dreamlite/
Abstract
Diffusion models have made significant progress in both text-to-image (T2I) generation and text-guided image editing. However, these models are typically built with billions of parameters, leading to high latency and increased deployment challenges. While on-device diffusion models improve efficiency, they largely focus on T2I generation and lack support for image editing. In this paper, we propose DreamLite, a compact unified on-device diffusion model (0.39B) that supports both T2I generation and text-guided image editing within a single network. DreamLite is built on a pruned mobile U-Net backbone and unifies conditioning through in-context spatial concatenation in the latent space. It concatenates images horizontally as input, using a (target | blank) configuration for generation tasks and (target | source) for editing tasks. To stabilize the training of this compact model, we introduce a task-progressive joint pretraining strategy that sequentially targets T2I, editing, and joint tasks. After high-quality SFT and reinforcement learning, DreamLite achieves GenEval (0.72) for image generation and ImgEdit (4.11) for image editing, outperforming existing on-device models and remaining competitive with several server-side models. By employing step distillation, we further reduce denoising to just 4 steps, enabling DreamLite to generate or edit a 1024 x 1024 image in less than 1s on a Xiaomi 14 smartphone. To the best of our knowledge, DreamLite is the first unified on-device diffusion model that supports both image generation and image editing.
中文标题/摘要
标题:DreamLite:一种轻量级的设备端统一模型,用于图像生成和编辑
扩散模型在文本到图像(T2I)生成和文本引导的图像编辑方面取得了显著进展。然而,这些模型通常包含数十亿个参数,导致高延迟和更大的部署挑战。虽然设备端的扩散模型提高了效率,但它们主要集中在T2I生成上,缺乏对图像编辑的支持。在本文中,我们提出了一种名为DreamLite的紧凑型统一设备端扩散模型(0.39B),该模型在单一网络中同时支持T2I生成和文本引导的图像编辑。DreamLite基于剪枝的移动端U-Net骨干,并通过潜在空间中的上下文空间拼接来统一条件输入。它将图像水平拼接作为输入,生成任务使用(目标 | 空白)配置,编辑任务使用(目标 | 来源)配置。为了稳定这种紧凑模型的训练,我们引入了一种任务渐进联合预训练策略,该策略依次针对T2I、编辑和联合任务。经过高质量的SFT和强化学习后,DreamLite在图像生成方面达到了GenEval(0.72)的得分,在图像编辑方面达到了ImgEdit(4.11)的得分,优于现有的设备端模型,并且与若干服务器端模型相比仍具竞争力。通过采用步骤蒸馏,我们进一步将去噪过程减少到仅4步,使DreamLite能够在小米14智能手机上以不到1秒的时间生成或编辑一张1024 x 1024的图像。据我们所知,DreamLite是第一个同时支持图像生成和图像编辑的统一设备端扩散模型。
Summary / 总结
DreamLite is a lightweight on-device unified model that supports both text-to-image generation and text-guided image editing with only 0.39 billion parameters. It uses a pruned mobile U-Net backbone and in-context spatial concatenation for conditioning. DreamLite was trained using a task-progressive joint pretraining strategy and further refined with SFT and reinforcement learning. It achieves GenEval scores of 0.72 for image generation and ImgEdit scores of 4.11 for image editing, outperforming existing on-device models and competing with some server-side models. DreamLite can generate or edit a 1024 x 1024 image in less than 1 second on a Xiaomi 14 smartphone using just 4 steps of denoising processing.
DreamLite 是一个轻量级的设备端扩散模型(0.39B 参数),支持文本到图像生成和文本引导的图像编辑。它使用精简的移动 U-Net 主干和任务渐进联合预训练策略。DreamLite 在图像生成上的 GenEval 得分为 0.72,在图像编辑上的 ImgEdit 得分为 4.11,优于现有设备端模型,并且与服务器端模型保持竞争力。它可以在小米 14 智能手机上以不到 1 秒的时间生成或编辑一个 1024 x 1024 的图像。
A Hyperbolic Perspective on Hierarchical Structure in Object-Centric Scene Representations
Authors: Neelu Madan, Àlex Pujol, Andreas Møgelmose, Sergio Escalera, Kamal Nasrollahi, Graham W. Taylor, Thomas B. Moeslund
Venue: CVPR
First: 2026-03-14T16:53:59+00:00 · Latest: 2026-03-30T17:29:33+00:00
Comments: accepted at CVPR Workshops 2026
Abstract
Slot attention has emerged as a powerful framework for unsupervised object-centric learning, decomposing visual scenes into a small set of compact vector representations called slots, each capturing a distinct region or object. However, these slots are learned in Euclidean space, which provides no geometric inductive bias for the hierarchical relationships that naturally structure visual scenes. In this work, we propose a simple post-hoc pipeline to project Euclidean slot embeddings onto the Lorentz hyperboloid of hyperbolic space, without modifying the underlying training pipeline. We construct five-level visual hierarchies directly from slot attention masks and analyse whether hyperbolic geometry reveals latent hierarchical structure that remains invisible in Euclidean space. Integrating our pipeline with SPOT (images), VideoSAUR (video), and SlotContrast (video), we find that hyperbolic projection exposes a consistent scene-level to object-level organisation, where coarse slots occupy greater manifold depth than fine slots, which is absent in Euclidean space. We further identify a "curvature-task tradeoff": low curvature (c = 0.2) matches or outperforms Euclidean on parent slot retrieval, while moderate curvature (c = 0.5) achieves better inter-level separation. Together, these findings suggest that slot representations already encode latent hierarchy that hyperbolic geometry reveals, motivating end-to-end hyperbolic training as a natural next step. Code and models are available at https://github.com/NeeluMadan/HHS.
中文标题/摘要
标题:双曲视角下的对象中心场景表示层次结构
槽注意力(slot attention)已成为无监督对象中心学习的强大框架,它将视觉场景分解为一小组紧凑的向量表示,称为“槽”(slots),每个槽捕捉一个独特的区域或对象。然而,这些槽是在欧几里得空间中学习的,而欧几里得空间无法为天然组织视觉场景的层次关系提供几何归纳偏置。在本文中,我们提出了一种简单的事后处理管道,在不修改底层训练管道的情况下,将欧几里得槽嵌入投影到双曲空间的洛伦兹双曲面上。我们直接从槽注意力掩码构建五级视觉层次结构,并分析双曲几何是否能揭示在欧几里得空间中不可见的潜在层次结构。将我们的管道与SPOT(图像)、VideoSAUR(视频)和SlotContrast(视频)结合后,我们发现双曲投影揭示了一致的从场景级到对象级的组织结构:粗粒度槽比细粒度槽占据更大的流形深度,而这种结构在欧几里得空间中并不存在。我们进一步发现了“曲率-任务权衡”:低曲率(c = 0.2)在父槽检索上与欧几里得空间持平或更优,而中等曲率(c = 0.5)实现了更好的层级间分离。这些发现共同表明,槽表示已经编码了可由双曲几何揭示的潜在层次结构,这使端到端双曲训练成为自然的下一步。代码和模型可在 https://github.com/NeeluMadan/HHS 获取。
Summary / 总结
This work explores the use of hyperbolic geometry to reveal hierarchical structures in object-centric scene representations learned via slot attention. The authors propose a post-hoc pipeline to project Euclidean slot embeddings into hyperbolic space and find that hyperbolic geometry exposes a consistent scene-to-object hierarchy, which is not visible in Euclidean space. They also observe a tradeoff between curvature and task performance, suggesting that slot representations already encode latent hierarchy that hyperbolic geometry can reveal, motivating end-to-end hyperbolic training.
该研究探讨了使用双曲几何揭示通过槽注意力学习的对象中心场景表示中的层次结构。作者提出了一种事后处理管道,将欧几里得槽嵌入投影到双曲空间的洛伦兹双曲面上,并基于五级视觉层次结构进行分析。研究发现,双曲空间揭示了一致的从场景到对象的层次结构,粗粒度槽比细粒度槽占据更大的流形深度,而这种结构在欧几里得空间中不可见。研究还发现了曲率-任务权衡:中等曲率增强了层级间的分离,而低曲率在父槽检索上与欧几里得空间持平或更优。
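One standard way to place a Euclidean vector on the Lorentz hyperboloid is the exponential map at the origin, sketched below. Whether the paper uses exactly this map is an assumption; the abstract only says embeddings are projected onto the hyperboloid. The function names, the `[time, space...]` coordinate layout, and the reading of "manifold depth" as geodesic distance from the origin are likewise assumptions made for illustration.

```python
import math
from typing import List


def exp_map_origin(v: List[float], c: float) -> List[float]:
    """Map a Euclidean vector (read as a tangent vector at the origin)
    onto the Lorentz hyperboloid of curvature -c.
    Returns coordinates as [time, space_1, ..., space_n]."""
    sqrt_c = math.sqrt(c)
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [1.0 / sqrt_c] + [0.0] * len(v)
    scale = math.sinh(sqrt_c * norm) / (sqrt_c * norm)
    return [math.cosh(sqrt_c * norm) / sqrt_c] + [scale * x for x in v]


def lorentz_inner(x: List[float], y: List[float]) -> float:
    """Lorentzian inner product: minus the time term plus the space terms."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def depth(x: List[float], c: float) -> float:
    """Geodesic distance from the hyperboloid origin (assumed depth measure)."""
    sqrt_c = math.sqrt(c)
    return math.acosh(max(1.0, sqrt_c * x[0])) / sqrt_c


slot = [0.3, 0.4]  # toy 2-D "slot embedding"
point = exp_map_origin(slot, c=0.5)
print(point, lorentz_inner(point, point), depth(point, c=0.5))
```

Every mapped point satisfies the hyperboloid constraint `lorentz_inner(x, x) == -1/c` up to rounding, and under this map the distance from the origin equals the Euclidean norm of the input vector.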
GPU-Accelerated Optimization of Transformer-Based Neural Networks for Real-Time Inference
Authors: Soutrik Mukherjee, Sangwhan Cha
First: 2026-03-30T17:27:33+00:00 · Latest: 2026-03-30T17:27:33+00:00
Comments: 10 pages, 8 figures, 15 tables
Abstract
This paper presents the design and evaluation of a GPU-accelerated inference pipeline for transformer models using NVIDIA TensorRT with mixed-precision optimization. We evaluate BERT-base (110M parameters) and GPT-2 (124M parameters) across batch sizes from 1 to 32 and sequence lengths from 32 to 512. The system achieves up to 64.4x speedup over CPU baselines, sub-10 ms latency for single-sample inference, and a 63 percent reduction in memory usage. We introduce a hybrid precision strategy that preserves FP32 for numerically sensitive operations such as softmax and layer normalization, while applying FP16 to linear layers. This approach maintains high numerical fidelity (cosine similarity >= 0.9998 relative to baseline outputs) and eliminates NaN instability. The pipeline is implemented as a modular, containerized system that enables reproducible benchmarking across more than 360 configurations. Cross-GPU validation on an NVIDIA A100 shows consistent FP16 speedup ratios between 1.84x and 2.00x, along with stable numerical behavior. Downstream evaluation on SST-2 demonstrates no accuracy degradation under hybrid precision. Validation on WikiText-2 shows that random inputs underestimate NaN instability by up to 6x for full FP16, while confirming the robustness of the hybrid approach (0.0 percent NaN, cosine similarity >= 0.9998). These results provide a detailed characterization of performance and accuracy trade-offs across GPU architectures and offer practical guidance for deploying transformer models in latency-critical environments.
中文标题/摘要
标题:面向实时推理的基于Transformer的神经网络GPU加速优化
本文介绍了使用NVIDIA TensorRT和混合精度优化的Transformer模型GPU加速推理管道的设计与评估。我们评估了BERT-base(110M参数)和GPT-2(124M参数),批量大小从1到32,序列长度从32到512。该系统相对CPU基线最高实现64.4倍加速,单样本推理的延迟低于10毫秒,并减少了63%的内存使用。我们提出了一种混合精度策略,对softmax和层归一化等数值敏感操作保留FP32,同时将FP16应用于线性层。这种方法保持了高数值保真度(相对于基线输出的余弦相似度>=0.9998),并消除了NaN不稳定性。该管道实现为模块化、容器化的系统,可在超过360种配置上进行可重复的基准测试。在NVIDIA A100上的跨GPU验证显示,FP16加速比稳定在1.84倍到2.00倍之间,且数值行为稳定。在SST-2上的下游评估表明,混合精度下没有准确率下降。在WikiText-2上的验证显示,对于全FP16,随机输入会将NaN不稳定性低估多达6倍,同时确认了混合方法的鲁棒性(0.0% NaN,余弦相似度>=0.9998)。这些结果详细刻画了不同GPU架构上性能与准确率之间的权衡,并为在延迟关键环境中部署Transformer模型提供了实用指导。
Summary / 总结
This paper designs and evaluates a GPU-accelerated inference pipeline for transformer models using NVIDIA TensorRT with mixed-precision optimization. The system achieves up to 64.4x speedup over CPU baselines, sub-10 ms latency for single-sample inference, and a 63 percent reduction in memory usage. A hybrid precision strategy is introduced, preserving FP32 for numerically sensitive operations and applying FP16 to linear layers, maintaining high numerical fidelity and eliminating NaN instability. The pipeline is modular and reproducible, validated across more than 360 configurations and on an NVIDIA A100, showing consistent FP16 speedups of 1.84x to 2.00x with no accuracy degradation on SST-2.
该论文提出了一种使用NVIDIA TensorRT进行混合精度优化的GPU加速推理管道,实现了高达64.4倍的CPU基线加速,单样本推理延迟低于10毫秒,并减少了63%的内存使用。采用混合精度策略,保留FP32进行数值敏感操作,而将FP16应用于线性层,保持了高数值精度并消除了NaN不稳定现象。该系统模块化且容器化,支持超过360种配置的可重复基准测试,具有一致的FP16加速比和稳定的数值行为。
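The fidelity criterion quoted above (cosine similarity of at least 0.9998 against baseline outputs) is easy to reproduce on toy data. The sketch below is an illustration only and does not use TensorRT or the paper's pipeline: it rounds a made-up output vector to half precision with Python's `struct` module and compares it with the original. The vector values and helper names are assumptions. It also checks one well-known reason softmax is numerically sensitive in FP16: the largest finite half-precision value is 65504, so an unshifted exponential overflows for inputs as small as 12.

```python
import math
import struct
from typing import List

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "baseline" outputs and their half-precision counterparts.
reference = [0.1, -2.5, 3.14159, 0.001, 7.25]
half = [to_fp16(x) for x in reference]
fidelity = cosine_similarity(reference, half)

# exp(12) already exceeds the FP16 range, so an unshifted softmax
# numerator cannot be represented as a finite half-precision value.
softmax_term_overflows = math.exp(12.0) > FP16_MAX

print(fidelity, softmax_term_overflows)
```

Rounding well-scaled values to FP16 barely changes the direction of the output vector, which is consistent with keeping only the overflow-prone operations in FP32.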
Vision-Language Agents for Interactive Forest Change Analysis
Authors: James Brock, Ce Zhang, Nantheera Anantrasirichai
First: 2026-01-08T02:02:36+00:00 · Latest: 2026-03-30T17:23:33+00:00
Comments: 5 pages, 4 figures, Accepted into IGARSS 2026
Abstract
Modern forest monitoring workflows increasingly benefit from the growing availability of high-resolution satellite imagery and advances in deep learning. Two persistent challenges in this context are accurate pixel-level change detection and meaningful semantic change captioning for complex forest dynamics. While large language models (LLMs) are being adapted for interactive data exploration, their integration with vision-language models (VLMs) for remote sensing image change interpretation (RSICI) remains underexplored. To address this gap, we introduce an LLM-driven agent for integrated forest change analysis that supports natural language querying across multiple RSICI tasks. The proposed system builds upon a multi-level change interpretation (MCI) vision-language backbone with LLM-based orchestration. To facilitate adaptation and evaluation in forest environments, we further introduce the Forest-Change dataset, which comprises bi-temporal satellite imagery, pixel-level change masks, and multi-granularity semantic change captions generated using a combination of human annotation and rule-based methods. Experimental results show that the proposed system achieves mIoU and BLEU-4 scores of 67.10% and 40.17% on the Forest-Change dataset, and 88.13% and 34.41% on LEVIR-MCI-Trees, a tree-focused subset of LEVIR-MCI benchmark for joint change detection and captioning. These results highlight the potential of interactive, LLM-driven RSICI systems to improve accessibility, interpretability, and efficiency of forest change analysis. All data and code are publicly available at https://github.com/JamesBrockUoB/ForestChat.
中文标题/摘要
标题:视觉-语言代理在交互式森林变化分析中的应用
现代森林监测工作流程越来越多地受益于高分辨率卫星图像的日益可用和深度学习的进步。在此背景下,准确的像素级变化检测和针对复杂森林动态的有意义的语义变化描述是两个持续存在的挑战。虽然大型语言模型(LLMs)正在被用于交互式数据探索,但它们与视觉-语言模型(VLMs)结合进行遥感图像变化解释(RSICI)的研究仍然不足。为了填补这一空白,我们引入了一个由LLM驱动的集成森林变化分析代理,该代理支持跨多个RSICI任务的自然语言查询。所提出的系统基于多级变化解释(MCI)视觉-语言骨干,并通过LLM进行编排。为了促进在森林环境中的适配和评估,我们进一步引入了Forest-Change数据集,该数据集包含双时相卫星图像、像素级变化掩码,以及结合人工标注和基于规则的方法生成的多粒度语义变化描述。实验结果表明,所提出的系统在Forest-Change数据集上的mIoU和BLEU-4得分分别为67.10%和40.17%,在LEVIR-MCI-Trees上分别为88.13%和34.41%;LEVIR-MCI-Trees是LEVIR-MCI基准中专注于树木的子集,用于联合变化检测和描述。这些结果突显了交互式、LLM驱动的RSICI系统在提高森林变化分析的可访问性、可解释性和效率方面的潜力。所有数据和代码均可在 https://github.com/JamesBrockUoB/ForestChat 上公开获取。
Summary / 总结
This paper addresses the challenges of accurate pixel-level change detection and semantic change captioning in forest monitoring using deep learning and large language models. It introduces a vision-language agent driven by an LLM for integrated forest change analysis, utilizing a multi-level change interpretation backbone. The system is evaluated on the Forest-Change dataset and achieves mIoU and BLEU-4 scores of 67.10% and 40.17%, respectively, demonstrating improved accessibility and interpretability in forest change analysis.
该论文利用深度学习和大型语言模型解决森林监测中的像素级变化检测和语义变化描述的挑战,引入了一个由LLM驱动的集成森林变化分析视觉-语言代理,采用多级变化解释骨干。系统在Forest-Change数据集上进行评估,分别实现了mIoU和BLEU-4得分67.10%和40.17%,展示了在森林变化分析中提高的可访问性和可解释性。
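For readers unfamiliar with the mIoU figure reported above, the sketch below computes it for a pair of tiny binary change masks. It follows the common definition (intersection over union per class, averaged over classes); the paper's exact evaluation protocol is not given in the abstract, so the class set, the handling of empty classes, and the function names are assumptions.

```python
from typing import List, Sequence

Mask = List[List[int]]  # 0 = unchanged, 1 = changed


def iou_for_class(pred: Mask, target: Mask, cls: int) -> float:
    """Intersection over union of the pixels labelled `cls`."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            in_pred, in_target = p == cls, t == cls
            intersection += int(in_pred and in_target)
            union += int(in_pred or in_target)
    # Assumed convention: a class absent from both masks counts as a perfect match.
    return intersection / union if union else 1.0


def mean_iou(pred: Mask, target: Mask, classes: Sequence[int] = (0, 1)) -> float:
    return sum(iou_for_class(pred, target, c) for c in classes) / len(classes)


# Toy 2x2 masks: the prediction marks one extra pixel as changed.
predicted = [[1, 1], [0, 0]]
ground_truth = [[1, 0], [0, 0]]
score = mean_iou(predicted, ground_truth)
print(score)
```

Here the "changed" class scores 1/2 and the "unchanged" class scores 2/3, so the mean is 7/12, roughly 58.3%.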
CoPE-VideoLM: Leveraging Codec Primitives For Efficient Video Language Modeling
Authors: Sayan Deb Sarkar, Rémi Pautrat, Ondrej Miksik, Marc Pollefeys, Iro Armeni, Mahdi Rad, Mihai Dusmanu
First: 2026-02-13T18:57:31+00:00 · Latest: 2026-03-30T17:19:32+00:00
Comments: Project Page: https://microsoft.github.io/CoPE
Abstract
Video Language Models (VideoLMs) enable AI systems to understand temporal dynamics in videos. To fit within the maximum context window constraint, current methods use keyframe sampling which often misses both macro-level events and micro-level details due to the sparse temporal coverage. Furthermore, processing full images and their tokens for each frame incurs substantial computational overhead. We address these limitations by leveraging video codec primitives (specifically motion vectors and residuals) which natively encode video redundancy and sparsity without requiring expensive full-image encoding for most frames. To this end, we introduce lightweight transformer-based encoders that aggregate codec primitives and align their representations with image encoder embeddings through a pre-training strategy that accelerates convergence during end-to-end fine-tuning. Our approach, CoPE-VideoLM, reduces the time-to-first-token by up to 86% and token usage by up to 93% compared to standard VideoLMs. Moreover, by varying the keyframe and codec primitive densities we maintain or exceed performance on 14 diverse video understanding benchmarks spanning general question answering, temporal and motion reasoning, long-form understanding, and spatial scene understanding.
中文标题/摘要
标题:CoPE-VideoLM:利用编解码器原语实现高效的视频语言建模
视频语言模型(VideoLMs)使AI系统能够理解视频中的时间动态。为了符合最大上下文窗口限制,当前方法使用关键帧采样,这往往由于时间覆盖稀疏而错过了宏观事件和微观细节。此外,处理每一帧的完整图像及其标记还会产生大量的计算开销。我们通过利用视频编解码器原语(特别是运动向量和残差)解决了这些限制,这些原语能够原生地编码视频冗余和稀疏性,而无需对大多数帧进行昂贵的完整图像编码。为此,我们引入了轻量级的基于Transformer的编码器,用于聚合编解码器原语,并通过预训练策略将其表示与图像编码器嵌入对齐,从而加速端到端微调期间的收敛。我们的方法CoPE-VideoLM与标准VideoLMs相比,将首个标记生成时间减少了高达86%,标记使用量减少了高达93%。此外,通过调整关键帧和编解码器原语的密度,我们在涵盖一般问题回答、时间与运动推理、长视频理解以及空间场景理解等14个不同视频理解基准测试中保持或超越了性能。
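The token saving described in the abstract comes from giving full-image tokens only to keyframes and cheaper codec-primitive tokens to the remaining frames. The sketch below is a back-of-the-envelope count under that reading. The function name `token_count` and every number in it (64 frames, 196 tokens per image, 8 tokens per codec frame) are hypothetical values chosen for illustration, not figures from the paper.

```python
import math

def token_count(n_frames, keyframe_every, tokens_per_image, tokens_per_codec_frame):
    """Tokens used when only keyframes receive full-image tokens."""
    n_key = math.ceil(n_frames / keyframe_every)   # frames encoded as full images
    n_codec = n_frames - n_key                     # frames encoded from codec primitives
    return n_key * tokens_per_image + n_codec * tokens_per_codec_frame

# Hypothetical numbers, chosen only for illustration.
dense = token_count(64, 1, 196, 8)    # every frame encoded as a full image
sparse = token_count(64, 8, 196, 8)   # one keyframe per 8 frames, codec tokens elsewhere
saving = 1 - sparse / dense           # fraction of tokens saved relative to dense encoding
```

With these made-up settings the sparse layout uses roughly a sixth of the dense token count; the real ratio depends on the keyframe and codec-primitive densities the paper varies.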
Understanding SAM's Robustness to Noisy Labels through Gradient Down-weighting
Authors: Hoang-Chau Luong, Quang-Thuc Nguyen, Dat Ba Tran, Minh-Triet Tran
First: 2024-11-26T05:54:12+00:00 · Latest: 2026-03-30T17:14:51+00:00
Abstract
Sharpness-Aware Minimization (SAM) was introduced to improve generalization by seeking flat minima, yet it also exhibits robustness to label noise, a phenomenon that remains only partially understood. Prior work has mainly attributed this effect to SAM's tendency to prolong the learning of clean samples. In this work, we provide a complementary explanation by analyzing SAM at the element-wise level. We show that when noisy gradients dominate a parameter direction, their influence is reduced by the stronger amplification of clean gradients. This slows the memorization of noisy labels while sustaining clean learning, offering a more complete account of SAM's robustness. Building on this insight, we propose SANER (Sharpness-Aware Noise-Explicit Reweighting), a simple variant of SAM that explicitly magnifies this down-weighting effect. Experiments on benchmark image classification tasks with noisy labels demonstrate that SANER significantly mitigates noisy-label memorization and improves generalization over both SAM and SGD. Moreover, since SANER is designed from the mechanism of SAM, it can also be seamlessly integrated into SAM-like variants, further boosting their robustness.
中文标题/摘要
标题:通过梯度降权理解SAM对噪声标签的鲁棒性
Sharpness-Aware Minimization (SAM) 被引入以通过寻找平坦的极小值来提高泛化能力,同时它还表现出对标签噪声的鲁棒性,这一现象尚未完全理解。先前的工作主要将这一效果归因于 SAM 倾向于延长对干净样本的学习时间。在本工作中,我们通过在元素级分析 SAM 提供了一个补充解释。我们表明,当噪声梯度主导一个参数方向时,它们的影响会因干净梯度的更强放大而被削弱。这减缓了对噪声标签的记忆,同时保持了干净学习,从而为 SAM 的鲁棒性提供了更完整的解释。基于这一见解,我们提出了 SANER(Sharpness-Aware Noise-Explicit Reweighting),这是 SAM 的一个简单变体,明确放大了这种降权效应。在具有噪声标签的基准图像分类任务上的实验表明,SANER 显著减轻了噪声标签的记忆,并在 SAM 和 SGD 上都提高了泛化能力。此外,由于 SANER 是从 SAM 机制设计的,因此也可以无缝集成到 SAM 类的变体中,进一步提高它们的鲁棒性。
Summary / 总结
This study investigates the robustness of Sharpness-Aware Minimization (SAM) to noisy labels by analyzing its gradient behavior at the element-wise level. It proposes that SAM reduces the influence of noisy gradients, thereby slowing the memorization of noisy labels while maintaining learning of clean labels. Based on this insight, the authors introduce SANER (Sharpness-Aware Noise-Explicit Reweighting), which further amplifies this effect. Experiments show that SANER improves generalization and mitigates noisy-label memorization compared to both SAM and Stochastic Gradient Descent (SGD).
该研究通过元素级分析探讨了Sharpness-Aware Minimization (SAM) 对噪声标签的鲁棒性。研究发现,SAM 通过减少噪声梯度的影响,减缓了噪声标签的记忆过程,同时保持了对干净标签的学习。基于这一见解,作者提出了SANER(Sharpness-Aware Noise-Explicit Reweighting),进一步放大了这一效果。实验表明,SANER 在泛化能力和噪声标签记忆方面优于 SAM 和随机梯度下降 (SGD)。
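For readers unfamiliar with the base optimizer, here is a minimal sketch of one standard SAM update: take the gradient, step to an approximate worst-case point inside an L2 ball of radius `rho`, and apply the gradient measured there to the original weights. It does not implement SANER, whose reweighting rule is not spelled out in the abstract. The names `sam_step`, `grad_fn`, `quad_grad`, `lr`, and `rho` are illustrative.

```python
import math

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM update on a list of floats w.

    grad_fn(w) returns the loss gradient at w as a list of floats.
    """
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0  # avoid division by zero
    # Ascent step: move to the approximate worst-case point within radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: apply the gradient taken at the perturbed point to the original weights.
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

# Toy quadratic loss L(w) = 0.5 * sum(w_i^2), whose gradient is w itself.
def quad_grad(w):
    return list(w)

w = [1.0, -2.0]
for _ in range(50):
    w = sam_step(w, quad_grad)
```

The element-wise analysis in the paper concerns how the perturbed gradient `g_adv` differs from `g` in each coordinate; the toy loss here only demonstrates the two-step structure.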
AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding
Authors: Haozhe Qi, Kevin Qu, Mahdi Rad, Rui Wang, Alexander Mathis, Marc Pollefeys
First: 2026-03-30T17:14:15+00:00 · Latest: 2026-03-30T17:14:15+00:00
Comments: Project page: https://haozheqi.github.io/adapt-token
Abstract
Long video understanding remains challenging for Multi-modal Large Language Models (MLLMs) due to high memory costs and context-length limits. Prior approaches mitigate this by scoring and selecting frames/tokens within short clips, but they lack a principled mechanism to (i) compare relevance across distant video clips and (ii) stop processing once sufficient evidence has been gathered. We propose AdaptToken, a training-free framework that turns an MLLM's self-uncertainty into a global control signal for long-video token selection. AdaptToken splits a video into groups, extracts cross-modal attention to rank tokens within each group, and uses the model's response entropy to estimate each group's prompt relevance. This entropy signal enables a global token budget allocation across groups and further supports early stopping (AdaptToken-Lite), skipping the remaining groups when the model becomes sufficiently certain. Across four long-video benchmarks (VideoMME, LongVideoBench, LVBench, and MLVU) and multiple base MLLMs (7B-72B), AdaptToken consistently improves accuracy (e.g., +6.7 on average over Qwen2.5-VL 7B) and continues to benefit from extremely long inputs (up to 10K frames), while AdaptToken-Lite reduces inference time by about half with comparable performance. Project page: https://haozheqi.github.io/adapt-token
中文标题/摘要
标题:AdaptToken:基于熵的自适应令牌选择方法用于MLLM长视频理解
由于高内存成本和上下文长度限制,多模态大型语言模型(MLLMs)在理解长视频方面仍然具有挑战性。先前的方法通过为短片段内的帧/令牌评分和选择来缓解这一问题,但它们缺乏一种原理性的机制来(i)比较相距较远的视频片段之间的相关性,以及(ii)在收集到足够证据后停止处理。我们提出了一种无需训练的框架AdaptToken,将MLLM的自我不确定性转化为长视频令牌选择的全局控制信号。AdaptToken将视频划分为组,提取跨模态注意力以在每组内对令牌进行排名,并使用模型的响应熵来估计每组与提示的相关性。该熵信号使令牌预算在组间进行全局分配,并进一步支持早期停止(AdaptToken-Lite),当模型变得足够确定时跳过剩余的组。在四个长视频基准(VideoMME、LongVideoBench、LVBench和MLVU)和多个基础MLLM(7B-72B)上,AdaptToken始终提高准确率(例如,相比Qwen2.5-VL 7B平均提高6.7),并且能够持续从极长输入中受益(多达10K帧),而AdaptToken-Lite将推理时间减少了约一半,性能相当。项目页面:https://haozheqi.github.io/adapt-token
Summary / 总结
AdaptToken is a training-free framework that improves long-video understanding in MLLMs by using the model's self-uncertainty to select relevant tokens. It splits a video into groups, ranks tokens within each group by cross-modal attention, and uses response entropy to estimate each group's relevance and allocate a global token budget. An early-stopping variant, AdaptToken-Lite, skips the remaining groups once the model is sufficiently certain, cutting inference time by about half with comparable performance. Across four long-video benchmarks and base models from 7B to 72B, AdaptToken consistently improves accuracy (e.g., +6.7 on average over Qwen2.5-VL 7B) and keeps benefiting from inputs of up to 10K frames.
AdaptToken 是一个无需训练的框架,利用 MLLM 的自我不确定性来选择长视频理解中的相关 token,解决高内存成本和上下文长度限制的问题。它将视频划分为组,在每组内使用跨模态注意力对 token 进行排序,并基于熵分配全局 token 预算。该方法在多种基准测试中提高了准确性,并且受益于极长的输入,而 AdaptToken-Lite 的推理时间减少了约一半,性能相当。
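The entropy-driven control loop can be sketched as follows. This is only an illustration of the idea, not the authors' procedure: the abstract does not give the mapping from entropy to budget, so the softmax-style weighting in `allocate_budget`, the `temperature` parameter, and the threshold rule in `groups_to_process` are all assumptions.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def allocate_budget(group_entropies, total_budget, temperature=1.0):
    """Split a token budget across groups; lower entropy gets a larger share (assumed rule)."""
    weights = [math.exp(-h / temperature) for h in group_entropies]
    z = sum(weights)
    return [int(total_budget * w / z) for w in weights]

def groups_to_process(group_entropies, stop_threshold):
    """Early-stopping sketch: stop after the first group whose entropy is below the threshold."""
    kept = []
    for i, h in enumerate(group_entropies):
        kept.append(i)
        if h < stop_threshold:
            break
    return kept
```

For example, `groups_to_process([1.0, 0.2, 0.9], 0.5)` processes groups 0 and 1 and skips group 2, mirroring the "skip the remaining groups once sufficiently certain" behaviour attributed to AdaptToken-Lite.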
Decoupling Exploration and Policy Optimization: Uncertainty Guided Tree Search for Hard Exploration
Authors: Zakaria Mhammedi, James Cohan
First: 2026-03-23T17:56:52+00:00 · Latest: 2026-03-30T17:14:06+00:00
Abstract
The process of discovery requires active exploration -- the act of collecting new and informative data. However, efficient autonomous exploration remains a major unsolved problem. The dominant paradigm addresses this challenge by using Reinforcement Learning (RL) to train agents with intrinsic motivation, maximizing a composite objective of extrinsic and intrinsic rewards. We suggest that this approach incurs unnecessary overhead: while policy optimization is necessary for precise task execution, employing such machinery solely to expand state coverage may be inefficient. In this paper, we propose a new paradigm that explicitly separates exploration from exploitation and bypasses RL during the exploration phase. Our method uses a tree-search strategy inspired by the Go-With-The-Winner algorithm, paired with a measure of epistemic uncertainty to systematically drive exploration. By removing the overhead of policy optimization, our approach explores an order of magnitude more efficiently than standard intrinsic motivation baselines on hard Atari benchmarks. Further, we demonstrate that the discovered trajectories can be distilled into deployable policies using existing supervised backward learning algorithms, achieving state-of-the-art scores by a wide margin on Montezuma's Revenge, Pitfall!, and Venture without relying on domain-specific knowledge. Finally, we demonstrate the generality of our framework in high-dimensional continuous action spaces by solving the MuJoCo Adroit dexterous manipulation and AntMaze tasks in a sparse-reward setting, directly from image observations and without expert demonstrations or offline datasets. To the best of our knowledge, this has not been achieved before for the Adroit tasks.
中文标题/摘要
标题:解耦探索与策略优化:基于不确定性引导的树搜索方法在困难探索中的应用
发现过程需要主动探索,即收集新的、有信息量的数据。然而,高效的自主探索仍然是一个主要未解决的问题。主流范式通过使用强化学习(RL)训练具有内在动机的代理,最大化外在奖励和内在奖励的复合目标来应对这一挑战。我们认为这种方法带来了不必要的开销:虽然策略优化对于精确执行任务是必要的,但仅为了扩展状态覆盖范围而使用这种机制可能是低效的。在本文中,我们提出了一种新的范式,明确地将探索与利用分离,并在探索阶段绕过RL。我们的方法使用了受Go-With-The-Winner算法启发的树搜索策略,并结合认知不确定性(epistemic uncertainty)度量,系统地驱动探索。通过去除策略优化的开销,我们的方法在困难的Atari基准测试中的探索效率比标准的内在动机基线高出一个数量级。此外,我们展示了发现的轨迹可以使用现有的监督反向学习算法蒸馏为可部署的策略,从而在Montezuma’s Revenge、Pitfall!和Venture上以大幅优势取得了最先进的得分,而无需依赖领域特定知识。最后,我们展示了该框架在高维连续动作空间中的通用性:在稀疏奖励设置下,直接从图像观察出发,无需专家演示或离线数据集,解决了MuJoCo Adroit灵巧操作和AntMaze任务。据我们所知,这在Adroit任务上此前尚未实现。
Summary / 总结
This paper addresses the challenge of efficient autonomous exploration in reinforcement learning by proposing a new paradigm that decouples exploration from policy optimization. The method uses a tree-search strategy with epistemic uncertainty to guide exploration, bypassing RL during the exploration phase. This approach explores an order of magnitude more efficiently than standard intrinsic motivation baselines on hard Atari benchmarks and achieves state-of-the-art scores on Montezuma's Revenge, Pitfall!, and Venture without relying on domain-specific knowledge. Additionally, it demonstrates the generality of the framework in high-dimensional continuous action spaces by solving MuJoCo Adroit dexterous manipulation and AntMaze tasks directly from image observations and without expert demonstrations or offline datasets.
本文提出了一种新的范式,通过将探索与策略优化分离来解决强化学习中的高效自主探索挑战。该方法使用结合认知不确定性(epistemic uncertainty)的树搜索策略来引导探索,并在探索阶段绕过RL。该方法在困难的Atari基准测试上的探索效率比标准内在动机基线高出一个数量级,并在Montezuma’s Revenge、Pitfall!和Venture上实现了最先进的得分,无需依赖领域特定知识。此外,该框架在高维连续动作空间中展示了其通用性:在稀疏奖励设置下直接从图像观察出发解决了MuJoCo Adroit和AntMaze任务,而无需专家演示或离线数据集。
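To make "exploration without policy optimization" concrete, here is a toy particle search in the spirit of go-with-the-winners: a population of particles takes random actions, and the least novel particles are replaced by clones of the most novel ones. Everything here is an assumption for illustration: the 1-D chain environment, the function name `explore_chain`, and visit counts standing in for the paper's epistemic-uncertainty measure.

```python
import random

def explore_chain(n_particles=8, steps=200, seed=0):
    """Particle search on a 1-D chain with a leftward drift; returns the farthest state reached."""
    rng = random.Random(seed)
    visits = {}                      # state -> visit count (stand-in for epistemic uncertainty)
    particles = [0] * n_particles
    farthest = 0
    for _ in range(steps):
        # Advance each particle with a random action; the walk is biased toward the start.
        particles = [max(0, s + (1 if rng.random() < 0.4 else -1)) for s in particles]
        for s in particles:
            visits[s] = visits.get(s, 0) + 1
        farthest = max(farthest, max(particles))
        # Rank by novelty (fewest visits first), then clone the novel half over the rest.
        particles.sort(key=lambda s: visits[s])
        half = n_particles // 2
        particles = particles[: n_particles - half] + particles[:half]
    return farthest
```

No policy is trained at any point; the population is steered only by resampling toward rarely visited states, which is the separation of exploration from policy optimization that the abstract argues for.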
Unleashing the Potential of Mamba: Boosting a LiDAR 3D Sparse Detector by Using Cross-Model Knowledge Distillation
Authors: Rui Yu, Runkai Zhao, Jiagen Li, Qingsong Zhao, HuaiCheng Yan, Meng Wang
First: 2024-09-17T09:30:43+00:00 · Latest: 2026-03-30T17:02:43+00:00
Abstract
The LiDAR 3D object detector that strikes a balance between accuracy and speed is crucial for achieving real-time perception in autonomous driving. However, many existing LiDAR detection models depend on complex feature transformations, leading to poor real-time performance and high resource consumption, which limits their practical effectiveness. In this work, we propose a faster LiDAR 3D object detector, a framework that adaptively aligns sparse voxels to enable efficient heterogeneous knowledge distillation, called FASD. We aim to distill the Transformer sequence modeling capability into Mamba models, significantly boosting accuracy through knowledge transfer. Specifically, we first design the architecture for cross-model knowledge distillation to impart the global contextual understanding capabilities of the Transformer to Mamba. Transformer-based teacher model employ a scale-adaptive attention mechanism to enhance multiscale fusion. In contrast, Mamba-based student model leverages feature alignment through spatial-based adapters, supervised with latent space feature and span-head distillation losses, leading to improved performance and efficiency. We evaluated the FASD on the Waymo and nuScenes datasets, achieving a 4x reduction in resource consumption and a 1-2% performance improvement over the baseline, while also delivering significant gains in accuracy and efficiency in real deployment.
中文标题/摘要
标题:释放Mamba的潜力:通过跨模型知识蒸馏提升LiDAR 3D稀疏检测器
能够在准确性和速度之间取得平衡的LiDAR 3D物体检测器对于实现自动驾驶的实时感知至关重要。然而,许多现有的LiDAR检测模型依赖于复杂的特征变换,导致实时性能差和资源消耗高,限制了它们的实际效果。在本工作中,我们提出了一种更快的LiDAR 3D物体检测器,一种能够自适应对齐稀疏体素以实现高效异构知识蒸馏的框架,称为FASD。我们旨在将Transformer的序列建模能力蒸馏到Mamba模型中,通过知识转移显著提升准确性。具体来说,我们首先设计了跨模型知识蒸馏的架构,以赋予Mamba全局上下文理解能力。基于Transformer的教师模型采用尺度自适应注意力机制来增强多尺度融合。相比之下,基于Mamba的学生模型通过基于空间的适配器利用特征对齐,并通过潜在空间特征和跨度头蒸馏损失进行监督,从而提高性能和效率。我们在Waymo和nuScenes数据集上评估了FASD,实现了4倍的资源消耗减少,并在基线基础上提高了1-2%的性能,同时在实际部署中也实现了显著的准确性和效率提升。
Summary / 总结
This work addresses the need for a LiDAR 3D object detector that balances accuracy and speed for real-time autonomous driving. The authors propose FASD, a framework that adaptively aligns sparse voxels and uses cross-model knowledge distillation to transfer the Transformer's global contextual understanding to a Mamba model. The Transformer-based teacher uses a scale-adaptive attention mechanism for multiscale fusion, while the Mamba-based student aligns features through spatial-based adapters and is supervised with latent-space feature and span-head distillation losses. Evaluations on the Waymo and nuScenes datasets show a 4x reduction in resource consumption and a 1-2% performance improvement over the baseline, with further accuracy and efficiency gains reported in real deployment.
本文旨在解决用于实时自动驾驶的LiDAR 3D目标检测器需要在准确性和速度之间取得平衡的问题。作者提出了一种名为FASD的框架,通过跨模型知识蒸馏将Transformer的能力注入到Mamba模型中。具体来说,FASD在教师模型中使用了尺度自适应注意力机制,在学生模型中使用了基于空间的特征对齐,从而实现了4倍的资源消耗减少和1-2%的性能提升,同时在实际部署中显著提高了准确性和效率。
Functional Natural Policy Gradients
Authors: Aurelien Bibaut, Houssam Zenati, Thibaud Rahier, Nathan Kallus
First: 2026-03-30T16:59:53+00:00 · Latest: 2026-03-30T16:59:53+00:00
Abstract
We propose a cross-fitted debiasing device for policy learning from offline data. A key consequence of the resulting learning principle is $\sqrt N$ regret even for policy classes with complexity greater than Donsker, provided a product-of-errors nuisance remainder is $O(N^{-1/2})$. The regret bound factors into a plug-in policy error factor governed by policy-class complexity and an environment nuisance factor governed by the complexity of the environment dynamics, making explicit how one may be traded against the other.
中文标题/摘要
标题:功能自然策略梯度
我们提出了一种用于从离线数据中学习策略的交叉拟合(cross-fitted)去偏方法。由此产生的学习原则的一个关键结果是:即使对于复杂度高于Donsker类的策略类,只要误差乘积形式的冗余(nuisance)余项为$O(N^{-1/2})$,也能获得$\sqrt N$的遗憾。该遗憾界可分解为由策略类复杂度决定的插件(plug-in)策略误差因子和由环境动力学复杂度决定的环境冗余因子,从而明确了两者之间如何相互权衡。
Summary / 总结
The paper introduces a cross-fitted debiasing method for learning policies from offline data, achieving a root-N regret rate even for complex policy classes. The method decomposes the regret into a policy error factor and an environment nuisance factor, highlighting the trade-off between them.
该研究提出了一种交叉拟合去偏方法,用于从离线数据中学习策略。即使策略类的复杂度高于Donsker类,只要误差乘积形式的冗余余项为O(N^{-1/2}),该方法也能达到根号N的遗憾。遗憾界可分解为两个因子:一个与策略类的复杂度相关,另一个与环境动力学的复杂度相关,从而明确了两者之间的权衡。
Subspace Optimization for Backpropagation-Free Continual Test-Time Adaptation
Authors: Damian Sójka, Sebastian Cygert, Marc Masana
First: 2026-03-30T16:58:13+00:00 · Latest: 2026-03-30T16:58:13+00:00
Abstract
We introduce PACE, a backpropagation-free continual test-time adaptation system that directly optimizes the affine parameters of normalization layers. Existing derivative-free approaches struggle to balance runtime efficiency with learning capacity, as they either restrict updates to input prompts or require continuous, resource-intensive adaptation regardless of domain stability. To address these limitations, PACE leverages the Covariance Matrix Adaptation Evolution Strategy with the Fastfood projection to optimize high-dimensional affine parameters within a low-dimensional subspace, leading to superior adaptive performance. Furthermore, we enhance the runtime efficiency by incorporating an adaptation stopping criterion and a domain-specialized vector bank to eliminate redundant computation. Our framework achieves state-of-the-art accuracy across multiple benchmarks under continual distribution shifts, reducing runtime by over 50% compared to existing backpropagation-free methods.
中文标题/摘要
标题:子空间优化以实现无反向传播持续测试时适应
我们引入了PACE,一种无反向传播的持续测试时适应系统,直接优化归一化层的仿射参数。现有无导数方法难以在运行时效率与学习能力之间取得平衡,因为它们要么仅将更新限制在输入提示上,要么需要持续、资源密集型的适应,而不考虑领域稳定性。为解决这些限制,PACE利用Covariance Matrix Adaptation Evolution Strategy与Fastfood投影,在低维子空间内优化高维仿射参数,从而实现更优的适应性能。此外,我们通过引入适应停止准则和领域专用向量库来消除冗余计算,进而提高运行时效率。我们的框架在持续分布偏移下的多个基准测试中实现了最先进的准确率,与现有无反向传播方法相比,运行时间减少了50%以上。
Summary / 总结
PACE is a backpropagation-free continual test-time adaptation system that directly optimizes the affine parameters of normalization layers. It uses the Covariance Matrix Adaptation Evolution Strategy with the Fastfood projection to optimize high-dimensional parameters within a low-dimensional subspace, improving adaptive performance. It also adds an adaptation stopping criterion and a domain-specialized vector bank to eliminate redundant computation. PACE achieves state-of-the-art accuracy across multiple benchmarks under continual distribution shifts and reduces runtime by over 50% compared to existing backpropagation-free methods.
PACE 是一种无需反向传播的持续测试时自适应系统,直接优化归一化层的仿射参数。它使用 Covariance Matrix Adaptation Evolution Strategy 结合 Fastfood 投影,在低维子空间中优化高维参数,提高自适应性能。此外,它还引入了自适应停止准则和领域专用向量库以消除冗余计算。PACE 在持续分布偏移下达到了最先进的准确率,与现有无反向传播方法相比,运行时间减少了 50% 以上。
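The core idea, searching a low-dimensional vector and projecting it up to the full parameter space without any gradients, can be sketched in a few lines. Two substitutions are made here and should not be read as the paper's method: a dense Gaussian matrix stands in for the structured Fastfood projection, and a greedy elitist evolution strategy stands in for CMA-ES. All function and parameter names are illustrative.

```python
import random

def make_projection(dim_high, dim_low, rng):
    """Dense Gaussian projection; a stand-in for the structured Fastfood transform."""
    scale = dim_low ** -0.5
    return [[rng.gauss(0, 1) * scale for _ in range(dim_low)] for _ in range(dim_high)]

def project(P, z):
    """Map a low-dimensional vector z to a high-dimensional parameter offset."""
    return [sum(p * zj for p, zj in zip(row, z)) for row in P]

def subspace_es(loss, dim_high, dim_low=4, pop=16, iters=60, sigma=0.3, seed=0):
    """Greedy elitist evolution strategy over z (not full CMA-ES); no gradients are used."""
    rng = random.Random(seed)
    P = make_projection(dim_high, dim_low, rng)
    z = [0.0] * dim_low
    best = loss(project(P, z))
    for _ in range(iters):
        for _ in range(pop):
            cand = [zj + sigma * rng.gauss(0, 1) for zj in z]
            val = loss(project(P, cand))
            if val < best:           # keep the best candidate seen so far
                z, best = cand, val
    return project(P, z), best
```

In a test-time adaptation setting, `loss` would be an unsupervised objective evaluated with forward passes only, and the projected vector would be added to the normalization layers' scale and shift parameters; that wiring is omitted here.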
AMIGO: Agentic Multi-Image Grounding Oracle Benchmark
Authors: Min Wang, Ata Mahjoubfar
First: 2026-03-30T16:48:51+00:00 · Latest: 2026-03-30T16:48:51+00:00
Abstract
Agentic vision-language models increasingly act through extended interactions, but most evaluations still focus on single-image, single-turn correctness. We introduce AMIGO (Agentic Multi-Image Grounding Oracle Benchmark), a long-horizon benchmark for hidden-target identification over galleries of visually similar images. In AMIGO, the oracle privately selects a target image, and the model must recover it by asking a sequence of attribute-focused Yes/No/Unsure questions under a strict protocol that penalizes invalid actions with Skip. This setting stresses (i) question selection under uncertainty, (ii) consistent constraint tracking across turns, and (iii) fine-grained discrimination as evidence accumulates. AMIGO also supports controlled oracle imperfections to probe robustness and verification behavior under inconsistent feedback. We instantiate AMIGO with Guess My Preferred Dress task and report metrics covering both outcomes and interaction quality, including identification success, evidence verification, efficiency, protocol compliance, noise tolerance, and trajectory-level diagnostics.
中文标题/摘要
标题:AMIGO:代理多图像定位 oracle 基准
代理型视觉-语言模型越来越多地通过扩展交互来行动,但大多数评估仍然集中在单张图像、单轮次的正确性上。我们引入了AMIGO(代理型多图像定位 oracle 基准),这是一个针对画廊中视觉相似图像的隐藏目标识别的长期基准。在AMIGO中,oracle私下选择一个目标图像,模型必须通过一系列属性导向的Yes/No/Unsure问题来逐步恢复它,且必须遵循严格的协议,对无效操作进行惩罚。这一设置强调了(i) 不确定性下的问题选择,(ii) 轮次间一致的约束跟踪,以及(iii) 随着证据积累的精细区分。AMIGO还支持控制oracle的不完美性,以探究在不一致反馈下的鲁棒性和验证行为。我们以Guess My Preferred Dress任务实例化AMIGO,并报告了涵盖结果和交互质量的指标,包括识别成功率、证据验证、效率、协议合规性、噪声容忍度和轨迹级诊断。
Summary / 总结
AMIGO is a long-horizon benchmark that evaluates agentic vision-language models over extended multi-image interactions rather than single-image, single-turn correctness. An oracle privately selects a target image from a gallery of visually similar images, and the model must identify it by asking a sequence of attribute-focused Yes/No/Unsure questions under a strict protocol that penalizes invalid actions with Skip. The setting stresses question selection under uncertainty, consistent constraint tracking across turns, and fine-grained discrimination as evidence accumulates, and it supports controlled oracle imperfections to probe robustness under inconsistent feedback. The benchmark is instantiated with the Guess My Preferred Dress task and reports identification success, evidence verification, efficiency, protocol compliance, noise tolerance, and trajectory-level diagnostics.
研究引入了AMIGO基准,用于评估视觉-语言模型在多张图片上的长程交互能力。oracle私下选定一张目标图片,模型需要通过一系列面向属性的是/否/不确定问题来识别它,无效操作会被以Skip惩罚。该设置着重考察不确定性下的问题选择、跨轮次的一致约束跟踪,以及随着证据积累进行的细粒度区分。AMIGO还支持受控的oracle不完美性,以考察模型在不一致反馈下的鲁棒性,并报告识别成功率、证据验证、效率、协议合规性、噪声容忍度和轨迹级诊断等指标。
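The interaction protocol can be pictured with a small simulation. This is a hypothetical sketch, not the benchmark's code: gallery items are plain attribute dictionaries, a question is an `(attribute, value)` tuple, and anything else is treated as an invalid action that costs a turn and is logged as `Skip`.

```python
def oracle_answer(target, attribute, value):
    """Oracle reply for an attribute question about the hidden target."""
    if attribute not in target:
        return "Unsure"
    return "Yes" if target[attribute] == value else "No"

def play(gallery, target, questions):
    """Filter candidates turn by turn; a malformed question consumes a turn ("Skip")."""
    candidates = list(gallery)
    log = []
    for q in questions:
        if not (isinstance(q, tuple) and len(q) == 2):
            log.append("Skip")       # invalid action: no information gained
            continue
        attr, value = q
        ans = oracle_answer(target, attr, value)
        log.append(ans)
        if ans == "Yes":
            candidates = [c for c in candidates if c.get(attr) == value]
        elif ans == "No":
            candidates = [c for c in candidates if c.get(attr) != value]
    return candidates, log
```

Tracking `candidates` across turns is the "consistent constraint tracking" the benchmark stresses; a noisy oracle could be simulated by flipping some answers inside `oracle_answer`.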
Industrial3D: A Terrestrial LiDAR Point Cloud Dataset and CrossParadigm Benchmark for Industrial Infrastructure
Authors: Chao Yin, Hongzhe Yue, Qing Han, Difeng Hu, Zhenyu Liang, Fangzhou Lin, Bing Sun, Boyu Wang, Mingkai Li, Wei Yao, Jack C. P. Cheng
First: 2026-03-30T16:46:40+00:00 · Latest: 2026-03-30T16:46:40+00:00
Comments: 49 pages, 8 figures, 14 tables
Abstract
Automated semantic understanding of dense point clouds is a prerequisite for Scan-to-BIM pipelines, digital twin construction, and as-built verification--core tasks in the digital transformation of the construction industry. Yet for industrial mechanical, electrical, and plumbing (MEP) facilities, this challenge remains largely unsolved: TLS acquisitions of water treatment plants, chiller halls, and pumping stations exhibit extreme geometric ambiguity, severe occlusion, and extreme class imbalance that architectural benchmarks (e.g., S3DIS or ScanNet) cannot adequately represent. We present Industrial3D, a terrestrial LiDAR dataset comprising 612 million expertly labelled points at 6 mm resolution from 13 water treatment facilities. At 6.6x the scale of the closest comparable MEP dataset, Industrial3D provides the largest and most demanding testbed for industrial 3D scene understanding to date. We further establish the first industrial cross-paradigm benchmark, evaluating nine representative methods across fully supervised, weakly supervised, unsupervised, and foundation model settings under a unified benchmark protocol. The best supervised method achieves 55.74% mIoU, whereas zero-shot Point-SAM reaches only 15.79%--a 39.95 percentage-point gap that quantifies the unresolved domain-transfer challenge for industrial TLS data. Systematic analysis reveals that this gap originates from a dual crisis: statistical rarity (215:1 imbalance, 3.5x more severe than S3DIS) and geometric ambiguity (tail-class points share cylindrical primitives with head-class pipes) that frequency-based re-weighting alone cannot resolve. Industrial3D, along with benchmark code and pre-trained models, will be publicly available at https://github.com/pointcloudyc/Industrial3D.
中文标题/摘要
标题:工业3D:一种用于工业基础设施的地面激光雷达点云数据集和跨范式基准
密集点云的自动语义理解是Scan-to-BIM流程、数字孪生构建和竣工核实的前提,这些都是建筑行业数字化转型的核心任务。然而,对于工业机械、电气和管道(MEP)设施,这一挑战在很大程度上仍未解决:水处理厂、制冷机房和泵站的TLS采集数据表现出极大的几何歧义性、严重的遮挡和极端的类别不平衡,而建筑基准(如S3DIS或ScanNet)无法充分体现这些特点。我们提出了Industrial3D,这是一个地面激光雷达数据集,包含来自13个水处理设施、分辨率为6毫米的6.12亿个专家标注点。Industrial3D的规模是最接近的同类MEP数据集的6.6倍,提供了迄今为止最大且最具挑战性的工业3D场景理解测试平台。我们进一步建立了首个工业跨范式基准,在统一的基准协议下评估了九种代表性方法在完全监督、弱监督、无监督和基础模型设置下的表现。最佳的监督方法实现了55.74%的mIoU,而零样本Point-SAM仅达到15.79%,这39.95个百分点的差距量化了工业TLS数据尚未解决的领域迁移挑战。系统分析表明,这一差距源于双重危机:统计稀有性(215:1的不平衡,比S3DIS严重3.5倍)和几何歧义(尾类点与头类管道共享圆柱形基元),仅靠基于频率的重新加权无法解决。Industrial3D连同基准代码和预训练模型,将在 https://github.com/pointcloudyc/Industrial3D 公开。
Summary / 总结
The research aims to address the challenge of automated semantic understanding of dense point clouds in industrial MEP facilities, which is crucial for Scan-to-BIM pipelines and digital transformation in construction. The study introduces Industrial3D, a large-scale terrestrial LiDAR dataset with 612 million points from 13 water treatment facilities, and establishes a cross-paradigm benchmark to evaluate nine methods under various supervision settings. The best supervised method achieves 55.74% mIoU, while zero-shot Point-SAM scores 15.79%, highlighting a significant domain-transfer challenge for industrial TLS data. The analysis indicates that this gap is due to statistical rarity and geometric ambiguity, which cannot be fully resolved by frequency-based re-weighting alone.
研究旨在解决工业MEP设施中密集点云的自动语义理解问题,这对于Scan-to-BIM流程和建筑行业的数字化转型至关重要。研究引入了Industrial3D,这是一个包含来自13个水处理设施的6.12亿点的大规模地面激光雷达数据集,并建立了跨范式基准来评估九种方法在不同监督设置下的表现。最佳监督方法的mIoU为55.74%,而零样本Point-SAM得分为15.79%,突显了工业TLS数据的领域迁移挑战。分析表明,这一差距源于统计稀有性和几何歧义,仅靠基于频率的重新加权无法完全解决。
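For reference, mIoU averages the per-class intersection-over-union with equal weight per class, which is why rare tail classes can pull the score down sharply on imbalanced data. The sketch below is a generic definition over flat label lists, not the benchmark's evaluation code; `mean_iou` and the choice to skip classes absent from both prediction and ground truth are assumptions.

```python
def mean_iou(pred, truth, num_classes):
    """Mean intersection-over-union across classes present in pred or truth."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, truth) if p == c or t == c)
        if union:                    # skip classes absent from both
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# The reported gap between the best supervised method and zero-shot Point-SAM.
gap = 55.74 - 15.79
```

The `gap` line simply reproduces the 39.95 percentage-point difference quoted in the abstract.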
Divide and Restore: A Modular Task-Decoupled Framework for Universal Image Restoration
Authors: Joanna Wiekiera, Martyna Zur
First: 2026-03-30T16:45:16+00:00 · Latest: 2026-03-30T16:45:16+00:00
Abstract
Restoring images affected by various types of degradation, such as noise, blur, or improper exposure, remains a significant challenge in computer vision. While recent trends favor complex monolithic all-in-one architectures, these models often suffer from negative task interference and require extensive joint training cycles on high-end computing clusters. In this paper, we propose a modular, task-decoupled image restoration framework based on an explicit diagnostic routing mechanism. The architecture consists of a lightweight Convolutional Neural Network (CNN) classifier that evaluates the input image and dynamically directs it to a specialized restoration node. A key advantage of this framework is its model-agnostic extensibility: while we demonstrate it using three independent U-Net experts, the system allows for the integration of any restoration method tailored to specific tasks. By isolating reconstruction paths, the framework prevents feature conflicts and significantly reduces training overhead. Unlike monolithic models, adding new degradation types in our framework only requires training a single expert and updating the router, rather than a full system retraining. Experimental results demonstrate that this computationally accessible approach offers a scalable and efficient solution for multi-degradation restoration on standard local hardware. The code will be published upon paper acceptance.
中文标题/摘要
标题:分割与恢复:一种模块化任务解耦的通用图像恢复框架
恢复受到各种类型退化影响的图像,如噪声、模糊或曝光不当,仍然是计算机视觉中的一个重大挑战。尽管最近的趋势倾向于使用复杂的单一架构,但这些模型往往存在任务干扰的问题,并且需要在高性能计算集群上进行长时间的联合训练。在本文中,我们提出了一种基于显式诊断路由机制的模块化、任务解耦图像恢复框架。该架构由一个轻量级的卷积神经网络(CNN)分类器组成,该分类器评估输入图像并动态将其导向专门的恢复节点。该框架的一个关键优势是其模型无关的可扩展性:虽然我们使用三个独立的U-Net专家进行了演示,但系统允许集成任何针对特定任务定制的恢复方法。通过隔离重建路径,该框架可以防止特征冲突并显著减少训练开销。与单一架构不同,在我们的框架中添加新的退化类型只需训练一个专家并更新路由器,而无需对整个系统进行重新训练。实验结果表明,这种方法提供了一种在标准本地硬件上具有可扩展性和高效性的多退化恢复解决方案。代码将在论文接受后发布。
Summary / 总结
The paper addresses the challenge of image restoration from various degradations such as noise and blur. It proposes a modular framework that uses a lightweight CNN classifier to route images to specialized restoration nodes, avoiding task interference. This approach allows for model-agnostic extensibility and reduces training overhead, making it scalable and efficient for multi-degradation restoration on standard hardware.
论文针对噪声和模糊等不同类型的退化图像恢复问题,提出了一种模块化框架,通过轻量级的CNN分类器将图像路由到专门的恢复节点,避免任务干扰。该方法具有模型通用的扩展性,并减少了训练开销,使其在标准硬件上对多退化恢复具有可扩展和高效的特点。
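The routing design can be sketched as a classifier that picks a label plus a table of independent experts keyed by that label. The code below is a toy stand-in on 1-D lists of pixel intensities: `classify` replaces the paper's CNN router with a brightness threshold, and `brighten` and `smooth` replace the U-Net experts. All names and thresholds are hypothetical.

```python
def mean(xs):
    return sum(xs) / len(xs)

def classify(image):
    """Toy diagnostic router: a stand-in for the paper's CNN classifier."""
    return "underexposed" if mean(image) < 0.3 else "noisy"

def brighten(image):
    """Toy exposure expert: double intensities, clipped to 1.0."""
    return [min(1.0, 2.0 * x) for x in image]

def smooth(image):
    """Toy denoising expert: 3-tap moving average with edge replication."""
    padded = [image[0]] + list(image) + [image[-1]]
    return [(padded[i] + padded[i + 1] + padded[i + 2]) / 3 for i in range(len(image))]

EXPERTS = {"underexposed": brighten, "noisy": smooth}

def restore(image, router=classify, experts=EXPERTS):
    """Route the image to the expert chosen by the router and return (label, output)."""
    label = router(image)
    return label, experts[label](image)
```

Supporting a new degradation type in this layout means adding one entry to `EXPERTS` and teaching the router one more label, which is the extensibility argument the abstract makes against monolithic retraining.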
CLMN: Concept based Language Models via Neural Symbolic Reasoning
Authors: Yibo Yang
First: 2025-10-11T06:58:44+00:00 · Latest: 2026-03-30T16:44:05+00:00
Comments: 7 pages, 2 figures
Abstract
Deep learning has advanced NLP, but interpretability remains limited, especially in healthcare and finance. Concept bottleneck models tie predictions to human concepts in vision, but NLP versions either use binary activations that harm text representations or latent concepts that weaken semantics, and they rarely model dynamic concept interactions such as negation and context. We introduce the Concept Language Model Network (CLMN), a neural-symbolic framework that keeps both performance and interpretability. CLMN represents concepts as continuous, human-readable embeddings and applies fuzzy-logic reasoning to learn adaptive interaction rules that state how concepts affect each other and the final decision. The model augments original text features with concept-aware representations and automatically induces interpretable logic rules. Across multiple datasets and pre-trained language models, CLMN achieves higher accuracy than existing concept-based methods while improving explanation quality. These results show that integrating neural representations with symbolic reasoning in a unified concept space can yield practical, transparent NLP systems.
中文标题/摘要
标题:CLMN:基于概念的神经符号语言模型
深度学习推动了自然语言处理(NLP)的发展,但可解释性仍然有限,尤其是在医疗保健和金融领域。概念瓶颈模型在视觉领域将预测与人类概念联系起来,但其NLP版本要么使用损害文本表示的二元激活,要么使用削弱语义的潜在概念,而且它们很少建模否定和上下文等动态概念交互。我们提出了概念语言模型网络(CLMN),这是一种兼顾性能与可解释性的神经符号框架。CLMN将概念表示为连续的、人类可读的嵌入,并应用模糊逻辑推理来学习自适应交互规则,这些规则说明概念如何相互影响并影响最终决策。该模型通过概念感知的表示增强原始文本特征,并自动归纳出可解释的逻辑规则。在多个数据集和预训练语言模型上,CLMN取得了高于现有基于概念的方法的准确率,同时提高了解释质量。这些结果表明,在统一的概念空间中结合神经表示和符号推理可以产生实用且透明的NLP系统。
Summary / 总结
The research aims to enhance the interpretability of language models in fields like healthcare and finance, where deep learning has advanced natural language processing (NLP) but lacks interpretability. The Concept Language Model Network (CLMN) is introduced, which uses continuous human-readable embeddings for concepts and fuzzy-logic reasoning to model concept interactions. CLMN outperforms existing concept-based methods in accuracy and explanation quality across various datasets and pre-trained models.
研究旨在提高语言模型在医疗和金融等领域的可解释性,尽管深度学习已提升NLP性能但缺乏透明度。CLMN是一种神经符号框架,将概念表示为连续嵌入,并使用模糊逻辑推理学习适应性交互规则。该模型在各种数据集和预训练语言模型中在准确性和解释质量上均优于现有概念基方法。
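Fuzzy-logic reasoning over continuous concept activations can be illustrated with standard operators. The sketch uses the product t-norm for AND, the probabilistic sum for OR, and `1 - a` for NOT; these are common textbook choices, not necessarily the operators CLMN learns. The rule `rule_positive` and the concept names `praise` and `negation` are hypothetical.

```python
def f_not(a):
    """Fuzzy negation."""
    return 1.0 - a

def f_and(a, b):
    """Fuzzy conjunction (product t-norm)."""
    return a * b

def f_or(a, b):
    """Fuzzy disjunction (probabilistic sum)."""
    return a + b - a * b

def rule_positive(concepts):
    """Hypothetical rule: praise counts toward a positive label unless it is negated."""
    return f_and(concepts["praise"], f_not(concepts["negation"]))
```

Because every operator is differentiable in its inputs, rules of this shape can sit on top of learned concept activations while remaining readable, which is the kind of negation handling the abstract points to.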
Mitigating Backdoor Attacks in Federated Learning Using PPA and MiniMax Game Theory
Authors: Osama Wehbi, Sarhad Arisdakessian, Omar Abdel Wahab, Anderson Avila, Azzam Mourad, Hadi Otrok
First: 2026-03-30T16:39:02+00:00 · Latest: 2026-03-30T16:39:02+00:00
Comments: 12 pages, 4 images, 2 tables, 2 algorithms, Regular Journal Paper
Abstract
Federated Learning (FL) is witnessing wider adoption due to its ability to benefit from large amounts of scattered data while preserving privacy. However, despite its advantages, federated learning suffers from several setbacks that directly impact the accuracy, and the integrity of the global model it produces. One of these setbacks is the presence of malicious clients who actively try to harm the global model by injecting backdoor data into their local models while trying to evade detection. The objective of such clients is to trick the global model into making false predictions during inference, thereby compromising the integrity and trustworthiness of the global model on which honest stakeholders rely. To mitigate such mischievous behavior, we propose FedBBA (Federated Backdoor and Behavior Analysis). The proposed model aims to dampen the effect of such clients on the final accuracy, creating more resilient federated learning environments. We engineer our approach through the combination of (1) a reputation system to evaluate and track client behavior, (2) an incentive mechanism to reward honest participation and penalize malicious behavior, and (3) game theoretical models with projection pursuit analysis (PPA) to dynamically identify and minimize the impact of malicious clients on the global model. Extensive simulations on the German Traffic Sign Recognition Benchmark (GTSRB) and Belgium Traffic Sign Classification (BTSC) datasets demonstrate that FedBBA reduces the backdoor attack success rate to approximately 1.1%--11% across various attack scenarios, significantly outperforming state-of-the-art defenses like RDFL and RoPE, which yielded attack success rates between 23% and 76%, while maintaining high normal task accuracy (~95%--98%).
中文标题/摘要
标题:使用PPA和最小最大博弈理论在联邦学习中缓解后门攻击
联邦学习(FL)因其能够利用分散的大数据集并同时保护隐私而得到更广泛的应用。然而,尽管FL具有这些优势,但它也存在多种缺陷,直接影响其所产生的全局模型的准确性和完整性。其中一种缺陷是恶意客户端试图通过在其本地模型中注入后门数据来损害全局模型,同时试图逃避检测。这些客户端的目标是在推理过程中欺骗全局模型做出错误预测,从而破坏诚实利益相关者所依赖的全局模型的完整性和可信度。为了缓解这种恶意行为,我们提出了FedBBA(联邦后门和行为分析)。该模型旨在减轻这些客户端对最终准确性的负面影响,创建更健壮的联邦学习环境。我们的方法结合了(1)用于评估和跟踪客户端行为的声誉系统,(2)奖励诚实参与并惩罚恶意行为的激励机制,以及(3)结合投影寻踪分析(PPA)的博弈论模型,用于动态识别并最小化恶意客户端对全局模型的影响。在德国交通标志识别基准(GTSRB)和比利时交通标志分类(BTSC)数据集上的广泛模拟表明,FedBBA在各种攻击场景下将后门攻击成功率降低到约1.1%至11%,显著优于RDFL和RoPE等最先进的防御措施(后者的攻击成功率在23%至76%之间),同时保持较高的正常任务准确率(约95%至98%)。
Summary / 总结
The paper addresses the issue of backdoor attacks in Federated Learning (FL) by proposing FedBBA, which combines a reputation system, incentive mechanisms, and game theoretical models with projection pursuit analysis to mitigate malicious client behavior. The approach significantly reduces the backdoor attack success rate to approximately 1.1%--11% across various scenarios, outperforming existing defenses like RDFL and RoPE, while maintaining high accuracy for normal tasks.
论文提出了一种名为FedBBA的方法,结合了声誉系统、激励机制以及带有投影寻踪分析(PPA)的博弈论模型,以缓解联邦学习中的后门攻击问题。在GTSRB和BTSC数据集上的广泛模拟表明,FedBBA将后门攻击的成功率降低到约1.1%至11%,优于RDFL和RoPE等现有防御方法,同时保持较高的正常任务准确率。
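One way to picture a reputation system feeding into aggregation is shown below. This is an assumed, simplified mechanism for illustration only: the exponential-moving-average update in `update_reputation`, the `rate` parameter, and the reputation-weighted mean in `aggregate` are not taken from the paper, and the game-theoretic and projection pursuit components are not modelled.

```python
def update_reputation(rep, behaved_well, rate=0.3):
    """Move a client's reputation toward 1 after good behaviour and toward 0 otherwise."""
    target = 1.0 if behaved_well else 0.0
    return (1 - rate) * rep + rate * target

def aggregate(updates, reputations):
    """Reputation-weighted average of client updates (each update is a list of floats)."""
    total = sum(reputations)
    if total <= 0:
        raise ValueError("at least one client must have positive reputation")
    dim = len(updates[0])
    return [sum(r * u[i] for u, r in zip(updates, reputations)) / total for i in range(dim)]
```

A client whose reputation has decayed to zero contributes nothing to the aggregated model, so a flagged backdoor update is damped rather than averaged in at full weight.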
3D CAVLA: Leveraging Depth and 3D Context to Generalize Vision Language Action Models for Unseen Tasks
Authors: Vineet Bhat, Yu-Hsiang Lan, Prashanth Krishnamurthy, Ramesh Karri, Farshad Khorrami
Venue: CVPR 2025
First: 2025-05-09T05:32:40+00:00 · Latest: 2026-03-30T16:30:59+00:00
Comments: Accepted at the 1st Workshop on 3D LLM/VLA, CVPR 2025. This work has been submitted to the IEEE for possible publication
Abstract
Robotic manipulation in 3D requires effective computation of N degree-of-freedom joint-space trajectories that enable precise and robust control. To achieve this, robots must integrate semantic understanding with visual perception to transform real-world observations into low-level control for object interaction. Recent advances in Vision-Language-Action (VLA) models have shown promise by mapping RGB images and language instructions to task space velocities, typically trained on large datasets of teleoperated demonstrations. However, these models often struggle with generalization beyond their training distributions. In this work, we introduce 3D-CAVLA, a novel finetuning framework that enhances task generalization of VLA policies by incorporating three key components: (i) chain-of-thought reasoning for structured decision-making, (ii) depth-aware perception for 3D spatial understanding, and (iii) task-oriented region-of-interest detection for focused manipulation. Extensive experiments in the LIBERO simulation environment demonstrate that 3D-CAVLA achieves an average success rate of 98.1% across diverse in-domain task suites. On unseen tasks, 3D-CAVLA delivers an absolute improvement of 8.8% in success rate, underscoring the benefits of 3D scene awareness for robust generalization. We validate our approach on real-world tabletop experiments demonstrating that the proposed model translates effectively from simulation to physical robots. 3D-CAVLA achieves over a 3X faster training convergence and delivers a 25% gain in success rate on unseen real world tasks. We will open-source our code and the unseen tasks dataset to promote community-driven research here: https://3d-cavla.github.io
中文标题/摘要
标题:3D CAVLA:利用深度和3D上下文提升视觉语言动作模型在未见任务中的泛化能力
三维机器人操作需要有效计算N自由度关节空间轨迹,以实现精确和稳健的控制。为此,机器人必须将语义理解与视觉感知相结合,将现实世界的观察转化为用于物体交互的低级控制。近期的视觉-语言-动作(VLA)模型通过将RGB图像和语言指令映射到任务空间速度展现出潜力,这类模型通常在大规模遥操作演示数据集上训练。然而,这些模型往往难以泛化到训练分布之外。在本研究中,我们提出了3D-CAVLA,这是一种新颖的微调框架,通过引入三个关键组件来增强VLA策略的任务泛化能力:(i) 用于结构化决策的思维链推理,(ii) 用于3D空间理解的深度感知,(iii) 用于聚焦操作的任务导向感兴趣区域检测。在LIBERO仿真环境中进行的大量实验表明,3D-CAVLA在多种域内任务套件上实现了98.1%的平均成功率。在未见任务上,3D-CAVLA的成功率绝对提升了8.8%,突显了3D场景感知对于稳健泛化的益处。我们通过真实世界桌面实验验证了该方法,表明所提出的模型能够有效地从仿真迁移到物理机器人。3D-CAVLA的训练收敛速度提升了3倍以上,并在未见的真实世界任务上将成功率提高了25%。我们将开源代码和未见任务数据集,以促进社区驱动的研究:https://3d-cavla.github.io
Summary / 总结
This work introduces 3D-CAVLA, a novel finetuning framework for Vision-Language-Action models that enhances task generalization by incorporating chain-of-thought reasoning, depth-aware perception, and task-oriented region-of-interest detection. Experiments in the LIBERO simulation environment show that 3D-CAVLA achieves an average success rate of 98.1% across diverse in-domain task suites and an absolute improvement of 8.8% in success rate on unseen tasks. The abstract also reports over 3X faster training convergence and a 25% gain in success rate on unseen real-world tasks in tabletop experiments.
本文提出了3D-CAVLA框架,通过引入思维链推理、深度感知和任务导向的感兴趣区域检测,增强视觉-语言-动作模型在机器人操作任务中的泛化能力。该模型在多样化的域内任务套件上平均成功率达到98.1%,在未见任务上成功率绝对提升8.8%,突显了3D场景感知的优势。在真实世界桌面实验中,该方法的训练收敛速度提升了3倍以上,并在未见的真实世界任务上将成功率提高了25%。
Ruka-v2: Tendon Driven Open-Source Dexterous Hand with Wrist and Abduction for Robot Learning
Authors: Xinqi Lucas Liu, Ruoxi Hu, Alejandro Ojeda Olarte, Zhuoran Chen, Kenny Ma, Charles Cheng Ji, Lerrel Pinto, Raunaq Bhirangi, Irmak Guzey
First: 2026-03-27T17:58:03+00:00 · Latest: 2026-03-30T16:30:40+00:00
Abstract
Lack of accessible and dexterous robot hardware has been a significant bottleneck to achieving human-level dexterity in robots. Last year, we released Ruka, a fully open-sourced, tendon-driven humanoid hand with 11 degrees of freedom - 2 per finger and 3 at the thumb - buildable for under $1,300. It was one of the first fully open-sourced humanoid hands, and introduced a novel data-driven approach to finger control that captures tendon dynamics within the control system. Despite these contributions, Ruka lacked two degrees of freedom essential for closely imitating human behavior: wrist mobility and finger adduction/abduction. In this paper, we introduce Ruka-v2: a fully open-sourced, tendon-driven humanoid hand featuring a decoupled 2-DOF parallel wrist and abduction/adduction at the fingers. The parallel wrist adds smooth, independent flexion/extension and radial/ulnar deviation, enabling manipulation in confined environments such as cabinets. Abduction enables motions such as grasping thin objects, in-hand rotation, and calligraphy. We present the design of Ruka-v2 and evaluate it against Ruka through user studies on teleoperated tasks, finding a 51.3% reduction in completion time and a 21.2% increase in success rate. We further demonstrate its full range of applications for robot learning: bimanual and single-arm teleoperation across 13 dexterous tasks, and autonomous policy learning on 3 tasks. All 3D print files, assembly instructions, controller software, and videos are available at https://ruka-hand-v2.github.io/ .
中文标题/摘要
标题:Ruka-v2:面向机器人学习的带腕部与外展功能的开源肌腱驱动灵巧手
缺乏易于获取且灵巧的机器人硬件一直是机器人实现人类水平灵巧性的主要瓶颈。去年,我们发布了Ruka,一个完全开源的肌腱驱动类人手,具有11个自由度(每个手指2个,拇指3个),搭建成本低于1300美元。它是最早的完全开源类人手之一,并引入了一种新颖的数据驱动手指控制方法,在控制系统中捕捉肌腱动力学。尽管有这些贡献,Ruka仍缺少两个对于逼真模仿人类行为至关重要的自由度:腕部活动能力和手指的内收/外展。在本文中,我们介绍了Ruka-v2:一个完全开源的肌腱驱动类人手,具有解耦的2-DOF并联腕部和手指的外展/内收。并联腕部增加了平滑、独立的屈曲/伸展和桡侧/尺侧偏移,使在柜子等受限环境中进行操作成为可能。外展使抓取薄物体、手内旋转和书法等动作成为可能。我们介绍了Ruka-v2的设计,并通过遥操作任务上的用户研究将其与Ruka进行比较,发现完成时间减少了51.3%,成功率提高了21.2%。我们进一步展示了其在机器人学习中的全部应用范围:13个灵巧任务上的双臂和单臂遥操作,以及3个任务上的自主策略学习。所有3D打印文件、组装说明、控制器软件和视频均可在https://ruka-hand-v2.github.io/ 获取。
Summary / 总结
Ruka-v2 addresses the limitations of its predecessor, Ruka, by adding wrist mobility and finger abduction/adduction, both essential for closely imitating human behavior. The hand is fully open-sourced and tendon-driven, with a decoupled 2-DOF parallel wrist. In user studies on teleoperated tasks, it showed a 51.3% reduction in completion time and a 21.2% increase in success rate compared to Ruka. Ruka-v2 was also demonstrated in bimanual and single-arm teleoperation across 13 dexterous tasks and in autonomous policy learning on 3 tasks.
研究旨在解决缺乏易于获取且灵巧的机器人硬件的问题,以实现人类级别的灵巧性。Ruka-v2 是一个完全开源的肌腱驱动类人手,在 Ruka 的基础上增加了解耦的 2-DOF 并联腕部和手指外展/内收,以增强操作能力。用户研究显示,与 Ruka 相比,任务完成时间减少了 51.3%,成功率提高了 21.2%。该手还用于 13 个灵巧任务上的双臂和单臂遥操作,以及 3 个任务上的自主策略学习。
FastVMT: Eliminating Redundancy in Video Motion Transfer
Authors: Yue Ma, Zhikai Wang, Tianhao Ren, Mingzhe Zheng, Hongyu Liu, Jiayi Guo, Kunyu Feng, Yuxuan Xue, Zixiang Zhao, Konrad Schindler, Qifeng Chen, Linfeng Zhang
First: 2026-02-05T11:15:59+00:00 · Latest: 2026-03-30T16:09:33+00:00
Comments: Accepted by ICLR 2026, Project page: fastvmt.github.io, Code: https://github.com/mayuelala/FastVMT
Abstract
Video motion transfer aims to synthesize videos by generating visual content according to a text prompt while transferring the motion pattern observed in a reference video. Recent methods predominantly use the Diffusion Transformer (DiT) architecture. To achieve satisfactory runtime, several methods attempt to accelerate the computations in the DiT, but fail to address structural sources of inefficiency. In this work, we identify and remove two types of computational redundancy in earlier work: motion redundancy arises because the generic DiT architecture does not reflect the fact that frame-to-frame motion is small and smooth; gradient redundancy occurs if one ignores that gradients change slowly along the diffusion trajectory. To mitigate motion redundancy, we mask the corresponding attention layers to a local neighborhood such that interaction weights are not computed unnecessarily for distant image regions. To exploit gradient redundancy, we design an optimization scheme that reuses gradients from previous diffusion steps and skips unwarranted gradient computations. On average, FastVMT achieves a 3.43x speedup without degrading the visual fidelity or the temporal consistency of the generated videos.
中文标题/摘要
标题:FastVMT:消除视频运动转移中的冗余
视频运动转移旨在根据文本提示生成视觉内容的同时,将参考视频中观察到的运动模式转移到新视频中。最近的方法主要使用扩散变换器(DiT)架构。为了实现满意的运行时,许多方法试图加速DiT中的计算,但未能解决结构上的低效问题。在本文中,我们识别并消除了早期工作中的两种计算冗余:运动冗余是因为通用的DiT架构没有反映帧间运动是小且平滑的事实;梯度冗余发生在忽略梯度沿扩散轨迹缓慢变化的情况。为了减轻运动冗余,我们对相应的注意力层进行掩码,使其仅与局部区域交互,避免不必要的远距离图像区域的交互权重计算。为了利用梯度冗余,我们设计了一种优化方案,该方案重用之前的扩散步骤中的梯度,并跳过不必要的梯度计算。平均而言,FastVMT在不降低生成视频的视觉保真度或时间一致性的情况下实现了3.43倍的加速。
Summary / 总结
FastVMT addresses the inefficiencies in video motion transfer by identifying and removing motion and gradient redundancies in the Diffusion Transformer architecture. It masks attention layers for local interactions to avoid unnecessary computations and reuses gradients from previous steps to skip unwarranted computations. As a result, FastVMT achieves a 3.43x speedup while maintaining visual fidelity and temporal consistency in generated videos.
FastVMT通过去除运动冗余和梯度冗余来解决视频运动转移方法中的低效问题。它通过局部掩蔽注意力层避免不必要的计算,并重用之前的步骤中的梯度以跳过不必要的梯度计算。因此,FastVMT实现了3.43倍的加速,同时保持生成视频的视觉保真度和时间一致性。
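The local-neighborhood masking mentioned in the FastVMT entry above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `local_attention_mask`, the row-major token grid, and the Chebyshev-distance radius are all assumptions chosen for illustration. It only shows the general idea of letting each query token attend to nearby tokens so that weights for distant regions need not be computed.

```python
def local_attention_mask(height, width, radius):
    """Boolean attention mask for tokens laid out on a height x width grid.

    mask[q][k] is True when key token k lies within `radius` grid cells
    (Chebyshev distance) of query token q. Tokens are indexed row-major.
    The grid layout and distance metric are illustrative assumptions.
    """
    coords = [(r, c) for r in range(height) for c in range(width)]
    return [
        [max(abs(qr - kr), abs(qc - kc)) <= radius for (kr, kc) in coords]
        for (qr, qc) in coords
    ]


if __name__ == "__main__":
    mask = local_attention_mask(3, 3, 1)
    # The corner token only sees its 2 x 2 neighborhood.
    print(sum(mask[0]))
```

In an attention layer, entries that are False in such a mask would be excluded before the softmax, so the cost scales with the neighborhood size rather than with the full token count.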
Dynamic Lookahead Distance via Reinforcement Learning-Based Pure Pursuit for Autonomous Racing
Authors: Mohamed Elgouhary, Amr S. El-Wakeel
First: 2026-03-30T16:09:26+00:00 · Latest: 2026-03-30T16:09:26+00:00
Abstract
Pure Pursuit (PP) is a widely used path-tracking algorithm in autonomous vehicles due to its simplicity and real-time performance. However, its effectiveness is sensitive to the choice of lookahead distance: shorter values improve cornering but can cause instability on straights, while longer values improve smoothness but reduce accuracy in curves. We propose a hybrid control framework that integrates Proximal Policy Optimization (PPO) with the classical Pure Pursuit controller to adjust the lookahead distance dynamically during racing. The PPO agent maps vehicle speed and multi-horizon curvature features to an online lookahead command. It is trained using Stable-Baselines3 in the F1TENTH Gym simulator with a KL penalty and learning-rate decay for stability, then deployed in a ROS2 environment to guide the controller. Experiments in simulation compare the proposed method against both fixed-lookahead Pure Pursuit and an adaptive Pure Pursuit baseline. Additional real-car experiments compare the learned controller against a fixed-lookahead Pure Pursuit controller. Results show that the learned policy improves lap-time performance and repeated lap completion on unseen tracks, while also transferring zero-shot to hardware. The learned controller adapts the lookahead by increasing it on straights and reducing it in curves, demonstrating effectiveness in augmenting a classical controller by online adaptation of a single interpretable parameter. On unseen tracks, the proposed method achieved 33.16 s on Montreal and 46.05 s on Yas Marina, while tolerating more aggressive speed-profile scaling than the baselines and achieving the best lap times among the tested settings. Initial real-car experiments further support sim-to-real transfer on a 1:10-scale autonomous racing platform.
中文标题/摘要
标题:基于强化学习纯追踪的动态前瞻距离在自主赛车中的应用
纯追踪(Pure Pursuit,PP)是一种广泛应用于自主车辆的路径跟踪算法,因其简单性和实时性能而受到青睐。然而,其效果对前瞻距离的选择非常敏感:较小的值可以提高过弯性能但可能导致直道上的不稳定,而较大的值可以提高平滑度但降低弯道中的精度。我们提出了一种结合Proximal Policy Optimization(PPO)与经典纯追踪控制器的混合控制框架,在比赛中动态调整前瞻距离。PPO代理将车辆速度和多级曲率特征映射到在线前瞻命令。该代理使用Stable-Baselines3在F1TENTH Gym模拟器中进行训练,采用KL惩罚和学习率衰减以确保稳定性,然后部署在ROS2环境中以指导控制器。模拟实验将所提方法与固定前瞻距离的纯追踪方法和自适应纯追踪基线进行比较。额外的实车实验将所学控制器与固定前瞻距离的纯追踪控制器进行比较。结果表明,所学策略在未见过的赛道上提高了圈速性能和重复圈完成情况,同时实现了零样本到硬件的迁移。所学控制器通过在直道上增加前瞻距离并在弯道上减少前瞻距离来进行自适应,展示了通过在线调整单一可解释参数来增强经典控制器的有效性。在未见过的赛道上,所提方法在蒙特利尔赛道上实现了33.16秒,在亚斯马纳赛道上实现了46.05秒,同时容忍了比基线更激进的速度剖面缩放,并在测试设置中实现了最佳圈速。初步的实车实验进一步支持了在1:10比例的自主赛车平台上的模拟到现实的迁移。
Summary / 总结
The paper addresses the sensitivity of Pure Pursuit (PP) to lookahead distance by proposing a hybrid control framework that dynamically adjusts lookahead distance using Proximal Policy Optimization (PPO). The PPO agent learns to map vehicle speed and curvature features to optimal lookahead commands. Experiments show that the proposed method outperforms both fixed-lookahead PP and an adaptive PP baseline, improving lap times and transferring well to unseen tracks and real-world scenarios.
论文提出了一种结合Proximal Policy Optimization (PPO)的混合控制框架,用于动态调整自主赛车Pure Pursuit (PP)算法中的前瞻距离。PPO代理学习将车辆速度和曲率特征映射到最优的前瞻距离命令,提高了未见过的赛道上的圈速性能和适应性。在蒙特利尔和亚斯马纳赛道上,所学控制器分别实现了33.16秒和46.05秒的成绩,优于固定前瞻距离和自适应基线,同时能够容忍更激进的速度配置。实车实验还证明了从仿真到现实的迁移能力。
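The lookahead trade-off described in the entry above can be made concrete with a minimal sketch of the classical Pure Pursuit steering law. The sketch pairs that law with a hand-written lookahead rule that stands in for the learned policy. The rule `heuristic_lookahead`, its gains, and its bounds are hypothetical placeholders, not the paper's PPO agent, which maps speed and multi-horizon curvature features to the lookahead command.

```python
import math


def pure_pursuit_steering(target_x, target_y, wheelbase):
    """Classical Pure Pursuit steering angle in radians.

    (target_x, target_y) is the lookahead point in the vehicle frame
    (x forward, y left). The arc through the origin and the target has
    curvature 2*y / Ld^2, where Ld is the distance to the target.
    """
    ld_sq = target_x ** 2 + target_y ** 2
    if ld_sq == 0.0:
        return 0.0
    curvature = 2.0 * target_y / ld_sq
    return math.atan(wheelbase * curvature)


def heuristic_lookahead(speed, curvature, ld_min=0.5, ld_max=3.0,
                        k_speed=0.3, k_curv=1.0):
    """Hypothetical stand-in for a learned lookahead policy.

    Grows the lookahead with speed (straights) and shrinks it with path
    curvature (corners), clamped to [ld_min, ld_max]. All gains and
    bounds are illustrative assumptions.
    """
    raw = ld_min + k_speed * speed - k_curv * abs(curvature)
    return max(ld_min, min(ld_max, raw))


if __name__ == "__main__":
    print(heuristic_lookahead(speed=8.0, curvature=0.0))  # straight
    print(heuristic_lookahead(speed=8.0, curvature=1.5))  # corner
    print(pure_pursuit_steering(1.0, 0.2, wheelbase=0.33))
```

Only the lookahead distance changes between the fixed and adaptive variants; the steering law itself stays the same, which is why a single interpretable parameter is enough to adapt the controller.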
Trust-Aware Routing for Distributed Generative AI Inference at the Edge
Authors: Chanh Nguyen, Erik Elmroth
First: 2026-03-30T16:07:11+00:00 · Latest: 2026-03-30T16:07:11+00:00
Comments: 11 pages, 10 figures. Preprint accepted at the 22nd Annual International Conference on Distributed Computing in Smart Systems and the Internet of Things (DCOSS-IoT 2026)
Abstract
Emerging deployments of Generative AI increasingly execute inference across decentralized and heterogeneous edge devices rather than on a single trusted server. In such environments, a single device failure or misbehavior can disrupt the entire inference process, making traditional best-effort peer-to-peer routing insufficient. Coordinating distributed generative inference therefore requires mechanisms that explicitly account for reliability, performance variability, and trust among participating peers. In this paper, we present G-TRAC, a trust-aware coordination framework that integrates algorithmic path selection with system-level protocol design to ensure robust distributed inference. First, we formulate the routing problem as a Risk-Bounded Shortest Path computation and introduce a polynomial-time solution that combines trust-floor pruning with Dijkstra's search, achieving sub-millisecond median routing latency at practical edge scales, and remaining below 10 ms at larger scales. Second, to operationally support the routing logic in dynamic environments, the framework employs a Hybrid Trust Architecture that maintains global reputation state at stable anchors while disseminating lightweight updates to edge peers via background synchronization. Experimental evaluation on a heterogeneous testbed of commodity devices demonstrates that G-TRAC significantly improves inference completion rates, effectively isolates unreliable peers, and sustains robust execution even under node failures and network partitions.
中文标题/摘要
标题:信任感知路由:边缘设备分布式生成AI推理
新兴部署的生成AI越来越多地在分散且异构的边缘设备上执行推理,而不是在单一可信服务器上。在这种环境中,单个设备的故障或不当行为会中断整个推理过程,使得传统的尽力而为的对等路由不足。因此,协调分布式生成推理需要机制来明确考虑可靠性、性能变异性以及参与对等方之间的信任。 在本文中,我们提出了G-TRAC,这是一种信任感知协调框架,将路径选择算法与系统级协议设计相结合,以确保稳健的分布式推理。首先,我们将路由问题表述为“风险限制最短路径”计算,并引入一种结合信任门槛剪枝与迪杰斯特拉搜索的多项式时间解决方案,在实际边缘规模下实现亚毫秒级中位路由延迟,并在更大规模下保持在10毫秒以下。其次,为了在动态环境中支持路由逻辑,该框架采用了一种“混合信任架构”,在稳定锚点处维护全局声誉状态,并通过后台同步向边缘对等方传播轻量级更新。 在异构测试床的商用设备上进行的实验评估表明,G-TRAC 显著提高了推理完成率,有效隔离了不可靠的对等方,并在节点故障和网络分区下保持了稳健执行。
Summary / 总结
G-TRAC is a trust-aware routing framework for distributed generative AI inference at the edge, addressing the need for reliability and performance in decentralized environments. It formulates the routing problem as a Risk-Bounded Shortest Path and uses a hybrid trust architecture to maintain global reputation while updating edge devices. Experimental results show that G-TRAC improves inference completion rates and maintains robust execution even under node failures and network partitions.
G-TRAC 是一种针对边缘分布式生成式 AI 推断的信任感知路由框架。它将路由问题表述为风险有界最短路径计算,并使用结合信任下限剪枝与迪杰斯特拉搜索的多项式时间解决方案,实现亚毫秒级的中位路由延迟。该框架还采用混合信任架构,在稳定锚点维护全局声誉状态,并通过后台同步向边缘节点传播轻量级更新。实验结果表明,G-TRAC 提高了推理完成率,隔离了不可靠节点,并在节点故障和网络分区下保持了稳健执行。
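The combination of trust-floor pruning with Dijkstra's search named in the entry above can be sketched roughly as follows. This is an illustrative reading and not G-TRAC's actual algorithm: the function `trust_pruned_shortest_path`, the per-node trust scores, and the single scalar `trust_floor` are assumptions, and the paper's risk bound may be defined differently (for example over whole paths).

```python
import heapq


def trust_pruned_shortest_path(graph, trust, source, target, trust_floor):
    """Dijkstra's search restricted to peers whose trust meets a floor.

    graph: dict mapping node -> list of (neighbor, latency) edges.
    trust: dict mapping node -> trust score (assumed to lie in [0, 1]).
    Returns (total_latency, path), or None if no admissible path exists.
    """
    # Pruning step: peers below the floor are never expanded or entered.
    allowed = {n for n, score in trust.items() if score >= trust_floor}
    if source not in allowed or target not in allowed:
        return None

    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            path = [node]
            while node in prev:
                node = prev[node]
                path.append(node)
            return d, path[::-1]
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, latency in graph.get(node, []):
            if neighbor not in allowed:
                continue
            candidate = d + latency
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    return None


if __name__ == "__main__":
    graph = {"a": [("b", 1.0), ("c", 5.0)], "b": [("d", 1.0)],
             "c": [("d", 1.0)], "d": []}
    trust = {"a": 0.9, "b": 0.2, "c": 0.8, "d": 0.9}
    print(trust_pruned_shortest_path(graph, trust, "a", "d", 0.5))
```

With the floor at 0.5 the low-trust peer `b` is pruned, so the search returns the slower but trusted route through `c`. Pruning before the search keeps the whole procedure polynomial, since it is a filter followed by a standard Dijkstra run.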
Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning
Authors: Ziqi Miao, Haonan Jia, Lijun Li, Chen Qian, Yuan Xiong, Wenting Yan, Jing Shao
First: 2026-03-30T16:03:56+00:00 · Latest: 2026-03-30T16:03:56+00:00
Comments: 21 pages, 15 figures, 6 tables
Abstract
Reinforcement learning with verifiable rewards (RLVR) has substantially enhanced the reasoning capabilities of multimodal large language models (MLLMs). However, existing RLVR approaches typically rely on outcome-driven optimization that updates both perception and reasoning using a shared reward based solely on the final answer. This shared reward blurs credit assignment, frequently improving reasoning patterns while failing to reliably enhance the accuracy of upstream visual evidence extraction. To address this perception bottleneck, we introduce PRCO (Perception-Reasoning Coevolution), a dual-role RLVR framework with a shared policy. PRCO consists of two cooperative roles: an Observer that generates an evidence caption tailored to the question and a Solver that predicts the final answer based on this caption. Crucially, PRCO employs role-specific reward signals: the Solver is optimized using verifiable outcome rewards on the final answer, while the Observer receives a utility reward derived from the Solver's downstream success. Extensive experiments across eight challenging multimodal reasoning benchmarks demonstrate that PRCO yields consistent improvements across model scales by over 7 points on average accuracy compared to the base model, outperforming prior open-source RL-tuned baselines.
中文标题/摘要
标题:与你共见:感知-推理协同进化在多模态推理中的应用
可验证奖励强化学习(RLVR)显著提升了多模态大型语言模型(MLLMs)的推理能力。然而,现有的RLVR方法通常依赖于结果驱动的优化,使用基于最终答案的单一共享奖励来同时更新感知和推理。这种共享奖励模糊了责任分配,虽然经常改善了推理模式,但未能可靠地增强上游视觉证据提取的准确性。为了解决这一感知瓶颈,我们引入了PRCO(感知-推理协同进化),这是一种具有共享策略的双重角色RLVR框架。PRCO包括两个合作角色:观察者生成与问题匹配的证据描述,解算器基于此描述预测最终答案。关键的是,PRCO使用角色特定的奖励信号:解算器使用最终答案的可验证结果奖励进行优化,而观察者则根据解算器下游的成功获得效用奖励。在八个具有挑战性的多模态推理基准测试中的广泛实验表明,与基线模型相比,PRCO在平均准确率上提高了7个百分点以上,优于先前的开源RL调优基线。
Summary / 总结
This paper addresses the limitations of existing reinforcement learning with verifiable rewards (RLVR) approaches in multimodal large language models (MLLMs) by introducing PRCO (Perception-Reasoning Coevolution). PRCO uses a dual-role framework with an Observer and a Solver, where the Observer generates evidence captions and the Solver predicts answers based on these captions. PRCO employs role-specific reward signals, optimizing the Solver with verifiable outcome rewards and the Observer with a utility reward. Experiments show that PRCO improves accuracy by over 7 points on average across eight benchmarks compared to the base model and outperforms prior open-source RL-tuned baselines.
论文提出了感知-推理协同进化(PRCO)框架,这是一种双角色强化学习与可验证奖励(RLVR)方法,旨在提高多模态大型语言模型的推理能力。PRCO 包含观察者和解算器两个角色,每个角色都有特定的奖励信号,以增强视觉证据提取的准确性和最终答案预测。实验结果显示,PRCO 在八个基准测试中将平均准确率提高了超过 7 个百分点,并优于之前的开源 RL 调整基线。