arXiv Paper Digest

Snapshot: 20260308_0331

Transformer-Based Inpainting for Real-Time 3D Streaming in Sparse Multi-Camera Setups

Authors: Leif Van Holland, Domenic Zingsheim, Mana Takhsha, Hannah Dröge, Patrick Stotko, Markus Plack, Reinhard Klein

First: 2026-03-05T18:59:59+00:00 · Latest: 2026-03-05T18:59:59+00:00

Comments: Project page: https://github.com/vc-bonn/transformer-based-inpainting

Abstract

High-quality 3D streaming from multiple cameras is crucial for immersive experiences in many AR/VR applications. The limited number of views - often due to real-time constraints - leads to missing information and incomplete surfaces in the rendered images. Existing approaches typically rely on simple heuristics for the hole filling, which can result in inconsistencies or visual artifacts. We propose to complete the missing textures using a novel, application-targeted inpainting method independent of the underlying representation as an image-based post-processing step after the novel view rendering. The method is designed as a standalone module compatible with any calibrated multi-camera system. For this we introduce a multi-view aware, transformer-based network architecture using spatio-temporal embeddings to ensure consistency across frames while preserving fine details. Additionally, our resolution-independent design allows adaptation to different camera setups, while an adaptive patch selection strategy balances inference speed and quality, allowing real-time performance. We evaluate our approach against state-of-the-art inpainting techniques under the same real-time constraints and demonstrate that our model achieves the best trade-off between quality and speed, outperforming competitors in both image and video-based metrics.

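Title / Abstract (translated from Chinese)

Title: Transformer-Based Inpainting for Real-Time 3D Streaming in Sparse Multi-Camera Setups

High-fidelity 3D streaming from multiple cameras is essential for immersive experiences in many AR/VR applications. Because real-time constraints limit the number of views, rendered images contain missing information and incomplete surfaces. Existing methods usually fill holes with simple heuristics, which can cause inconsistencies or visual artifacts. We propose a novel, application-oriented inpainting method that completes the missing textures as an image-based post-processing step after novel view rendering, independent of the underlying representation. The method is designed as a standalone module compatible with any calibrated multi-camera system. To this end, we introduce a multi-view-aware, Transformer-based network architecture that uses spatio-temporal embeddings to ensure consistency across frames while preserving fine details. In addition, the resolution-independent design adapts to different camera setups, and an adaptive patch selection strategy balances inference speed and quality, enabling real-time performance. We evaluate our method against state-of-the-art inpainting techniques under the same real-time constraints and show that our model achieves the best trade-off between quality and speed, outperforming competitors on both image and video metrics.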

Summary

The paper addresses the challenge of missing information in 3D streaming from multiple cameras by proposing a transformer-based inpainting method. This method fills in missing textures using a multi-view aware network that incorporates spatio-temporal embeddings, ensuring consistency and preserving fine details. The approach is designed to be resolution-independent and adaptable to different camera setups, with an adaptive patch selection strategy that balances speed and quality, enabling real-time performance. Experimental results show that the proposed method outperforms existing techniques in both image and video-based metrics under real-time constraints.

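The paper proposes a Transformer-based inpainting method to address missing information in multi-camera 3D streaming. The method fills in missing textures after novel view rendering while ensuring consistency and preserving detail. It uses spatio-temporal embeddings and an adaptive patch selection strategy to reach real-time performance, and it outperforms existing techniques on image and video metrics. The model is designed to be resolution-independent and compatible with any calibrated multi-camera system.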

FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning

Authors: Weijie Lyu, Ming-Hsuan Yang, Zhixin Shu

Venue: CVPR 2026

First: 2026-03-05T18:59:58+00:00 · Latest: 2026-03-05T18:59:58+00:00

Comments: Accepted by CVPR 2026. Project page: https://weijielyu.github.io/FaceCam

Abstract

We introduce FaceCam, a system that generates video under customizable camera trajectories for monocular human portrait video input. Recent camera control approaches based on large video-generation models have shown promising progress but often exhibit geometric distortions and visual artifacts on portrait videos due to scale-ambiguous camera representations or 3D reconstruction errors. To overcome these limitations, we propose a face-tailored scale-aware representation for camera transformations that provides deterministic conditioning without relying on 3D priors. We train a video generation model on both multi-view studio captures and in-the-wild monocular videos, and introduce two camera-control data generation strategies, synthetic camera motion and multi-shot stitching, to exploit stationary training cameras while generalizing to dynamic, continuous camera trajectories at inference time. Experiments on the Ava-256 dataset and diverse in-the-wild videos demonstrate that FaceCam achieves superior performance in camera controllability, visual quality, and identity and motion preservation.

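Title / Abstract (translated from Chinese)

Title: FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning

We present FaceCam, a system that generates video under customizable camera trajectories from monocular human portrait video input. Recent camera control methods built on large video generation models have made encouraging progress, but they often show geometric distortions and visual artifacts on portrait videos because of scale-ambiguous camera representations or 3D reconstruction errors. To overcome these limitations, we propose a face-tailored, scale-aware representation that provides deterministic conditioning without relying on 3D priors. We train the video generation model on multi-view studio captures and in-the-wild monocular videos, and introduce two strategies for generating camera-control data, synthetic camera motion and multi-shot stitching, which exploit stationary training cameras while generalizing to dynamic, continuous camera trajectories. Experiments on the Ava-256 dataset and diverse in-the-wild videos show that FaceCam achieves superior camera controllability, visual quality, and identity and motion preservation.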

Summary

FaceCam is a system that generates video under customizable camera trajectories from monocular human portrait video. It addresses geometric distortions and visual artifacts by using a face-tailored, scale-aware camera representation that does not rely on 3D priors. FaceCam is trained on both multi-view studio captures and in-the-wild videos, and uses synthetic camera motion and multi-shot stitching to generalize from stationary training cameras to dynamic camera trajectories. Experiments show that FaceCam outperforms other methods in camera controllability, visual quality, and preservation of identity and motion.

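FaceCam is a system that generates video under customizable camera trajectories from monocular portrait video. It addresses the limitations of prior methods by using a face-tailored, scale-aware representation that avoids geometric distortions and visual artifacts. The system is trained on studio and real-world videos and uses synthetic camera motion and multi-shot stitching to handle dynamic camera movement. Experiments show that FaceCam outperforms existing methods in camera controllability, visual quality, and identity and motion preservation.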

RoboPocket: Improve Robot Policies Instantly with Your Phone

Authors: Junjie Fang, Wendi Chen, Han Xue, Fangyuan Zhou, Tian Le, Yi Wang, Yuting Zhang, Jun Lv, Chuan Wen, Cewu Lu

First: 2026-03-05T18:59:38+00:00 · Latest: 2026-03-05T18:59:38+00:00

Comments: Project page: https://robo-pocket.github.io

Abstract

Scaling imitation learning is fundamentally constrained by the efficiency of data collection. While handheld interfaces have emerged as a scalable solution for in-the-wild data acquisition, they predominantly operate in an open-loop manner: operators blindly collect demonstrations without knowing the underlying policy's weaknesses, leading to inefficient coverage of critical state distributions. Conversely, interactive methods like DAgger effectively address covariate shift but rely on physical robot execution, which is costly and difficult to scale. To reconcile this trade-off, we introduce RoboPocket, a portable system that enables Robot-Free Instant Policy Iteration using single consumer smartphones. Its core innovation is a Remote Inference framework that visualizes the policy's predicted trajectory via Augmented Reality (AR) Visual Foresight. This immersive feedback allows collectors to proactively identify potential failures and focus data collection on the policy's weak regions without requiring a physical robot. Furthermore, we implement an asynchronous Online Finetuning pipeline that continuously updates the policy with incoming data, effectively closing the learning loop in minutes. Extensive experiments demonstrate that RoboPocket adheres to data scaling laws and doubles the data efficiency compared to offline scaling strategies, overcoming their long-standing efficiency bottleneck. Moreover, our instant iteration loop also boosts sample efficiency by up to 2$\times$ in distributed environments with a small number of interactive corrections per person. Project page and videos: https://robo-pocket.github.io.

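Title / Abstract (translated from Chinese)

Title: RoboPocket: Improve Robot Policies Instantly with Your Phone

Scaling imitation learning is fundamentally limited by the efficiency of data collection. Handheld interfaces have become a scalable solution for in-the-wild data acquisition, but they mostly run open-loop: operators collect demonstrations blindly, without knowing the policy's weaknesses, so critical state distributions are covered inefficiently. Interactive methods such as DAgger, by contrast, handle covariate shift effectively but depend on physical robot execution, which is expensive and hard to scale. To resolve this trade-off, we introduce RoboPocket, a portable system that uses a single consumer smartphone to enable Robot-Free Instant Policy Iteration. Its core innovation is a Remote Inference framework that visualizes the policy's predicted trajectory through Augmented Reality (AR) Visual Foresight. This immersive feedback lets collectors proactively identify potential failures and focus on the policy's weak regions without a physical robot. We also implement an asynchronous Online Finetuning pipeline that continuously updates the policy with new data, closing the learning loop within minutes. Extensive experiments show that RoboPocket follows data scaling laws and doubles data efficiency compared with offline scaling strategies, overcoming their long-standing efficiency bottleneck. Our instant iteration loop also improves sample efficiency by up to 2x in distributed environments, with only a small number of interactive corrections per person. Project page and videos: https://robo-pocket.github.io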

Summary

RoboPocket aims to improve robot policies using consumer smartphones for efficient data collection. It introduces a Remote Inference framework with Augmented Reality to visualize policy predictions, allowing operators to proactively collect data on policy weaknesses. This method doubles data efficiency compared to offline strategies and increases sample efficiency by up to 2x in distributed environments with minimal interactive corrections. The system effectively closes the learning loop in minutes without requiring physical robot execution.

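RoboPocket addresses the data-collection efficiency problem in imitation learning using a single consumer smartphone. It introduces a Remote Inference framework that uses Augmented Reality to visualize the policy's predicted trajectory, so operators can proactively collect data in the policy's weak regions. The system also includes an asynchronous Online Finetuning pipeline that quickly updates the policy with new data. Experiments show that RoboPocket doubles data efficiency relative to conventional offline scaling strategies and improves sample efficiency by up to 2x in distributed environments.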

Accelerating Text-to-Video Generation with Calibrated Sparse Attention

Authors: Shai Yehezkel, Shahar Yadin, Noam Elata, Yaron Ostrovsky-Berman, Bahjat Kawar

First: 2026-03-05T18:59:32+00:00 · Latest: 2026-03-05T18:59:32+00:00

Abstract

Recent diffusion models enable high-quality video generation, but suffer from slow runtimes. The large transformer-based backbones used in these models are bottlenecked by spatiotemporal attention. In this paper, we identify that a significant fraction of token-to-token connections consistently yield negligible scores across various inputs, and their patterns often repeat across queries. Thus, the attention computation in these cases can be skipped with little to no effect on the result. This observation continues to hold for connections among local token blocks. Motivated by this, we introduce CalibAtt, a training-free method that accelerates video generation via calibrated sparse attention. CalibAtt performs an offline calibration pass that identifies block-level sparsity and repetition patterns that are stable across inputs, and compiles these patterns into optimized attention operations for each layer, head, and diffusion timestep. At inference time, we compute the selected input-dependent connections densely, and skip the unselected ones in a hardware-efficient manner. Extensive experiments on Wan 2.1 14B, Mochi 1, and few-step distilled models at various resolutions show that CalibAtt achieves up to 1.58x end-to-end speedup, outperforming existing training-free methods while maintaining video generation quality and text-video alignment.

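Title / Abstract (translated from Chinese)

Title: Accelerating Text-to-Video Generation with Calibrated Sparse Attention

Recent diffusion models can generate high-quality video but run slowly. The large Transformer-based backbones in these models are bottlenecked by spatiotemporal attention. In this paper, we find that a large fraction of token-to-token connections consistently produce negligible scores across inputs, and that their patterns often repeat across queries. The attention computation can therefore be skipped in these cases with little effect on the result. The same observation holds for connections between local token blocks. Motivated by this, we introduce CalibAtt, a training-free method that accelerates video generation through calibrated sparse attention. CalibAtt runs an offline calibration pass that identifies block-level sparsity and repetition patterns that are stable across inputs, and compiles these patterns into optimized attention operations for each layer, head, and diffusion timestep. At inference time, the selected input-dependent connections are computed densely and the unselected ones are skipped in a hardware-efficient way. Extensive experiments on Wan 2.1 14B, Mochi 1, and few-step distilled models at various resolutions show that CalibAtt achieves up to 1.58x end-to-end speedup, outperforming existing training-free methods while maintaining video generation quality and text-video alignment.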

Summary

This paper addresses the slow runtime of diffusion models used for high-quality text-to-video generation. It introduces CalibAtt, a training-free method that accelerates video generation by identifying and skipping negligible token-to-token connections through offline calibration. Experiments show that CalibAtt achieves up to 1.58x speedup while maintaining video quality and text-video alignment.

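This paper addresses the slow runtime of diffusion models for high-quality video generation. It proposes CalibAtt, a training-free acceleration method that uses offline calibration to skip unimportant token-to-token connections. Experiments show that CalibAtt achieves up to 1.58x speedup while maintaining video quality and text-video alignment.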
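The calibrate-then-skip idea described in the CalibAtt abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the threshold `tau`, and the rule of keeping blocks by average attention mass are assumptions chosen for illustration, and a real system would use fused GPU kernels rather than a Python loop.

```python
import numpy as np

def attention_weights(q, k):
    """Row-wise softmax of scaled dot-product scores."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

def calibrate_block_mask(calib_pairs, block, tau):
    """Offline pass (hypothetical rule): keep a (query block, key block) pair
    if its attention mass, averaged over calibration inputs, is >= tau."""
    mass = None
    for q, k in calib_pairs:
        w = attention_weights(q, k)
        nb = w.shape[0] // block
        # Average attention mass that each query block sends to each key block.
        m = w.reshape(nb, block, nb, block).sum(axis=(1, 3)) / block
        mass = m if mass is None else mass + m
    mass = mass / len(calib_pairs)
    mask = mass >= tau
    np.fill_diagonal(mask, True)  # always keep the local block
    return mask

def block_sparse_attention(q, k, v, block_mask, block):
    """Inference: dense attention restricted to the kept key blocks."""
    out = np.zeros((q.shape[0], v.shape[1]))
    for i in range(block_mask.shape[0]):
        cols = np.flatnonzero(block_mask[i])
        idx = np.concatenate([np.arange(j * block, (j + 1) * block) for j in cols])
        rows = slice(i * block, (i + 1) * block)
        out[rows] = attention_weights(q[rows], k[idx]) @ v[idx]
    return out
```

With `tau = 0` every block is kept and the result matches dense attention; raising `tau` trades accuracy for fewer computed blocks.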

Cheap Thrills: Effective Amortized Optimization Using Inexpensive Labels

Authors: Khai Nguyen, Petros Ellinas, Anvita Bhagavathula, Priya Donti

First: 2026-03-05T18:58:39+00:00 · Latest: 2026-03-05T18:58:39+00:00

Comments: in submission

Abstract

To scale the solution of optimization and simulation problems, prior work has explored machine-learning surrogates that inexpensively map problem parameters to corresponding solutions. Commonly used approaches, including supervised and self-supervised learning with either soft or hard feasibility enforcement, face inherent challenges such as reliance on expensive, high-quality labels or difficult optimization landscapes. To address their trade-offs, we propose a novel framework that first collects "cheap" imperfect labels, then performs supervised pretraining, and finally refines the model through self-supervised learning to improve overall performance. Our theoretical analysis and merit-based criterion show that labeled data need only place the model within a basin of attraction, confirming that only modest numbers of inexact labels and training epochs are required. We empirically validate our simple three-stage strategy across challenging domains, including nonconvex constrained optimization, power-grid operation, and stiff dynamical systems, and show that it yields faster convergence; improved accuracy, feasibility, and optimality; and up to 59x reductions in total offline cost.

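Title / Abstract (translated from Chinese)

Title: Cheap Thrills: Effective Amortized Optimization Using Inexpensive Labels

To scale the solution of optimization and simulation problems, prior work has explored machine-learning surrogates that cheaply map problem parameters to the corresponding solutions. Common approaches, including supervised and self-supervised learning with soft or hard constraint enforcement, face inherent challenges such as reliance on expensive, high-quality labels or difficult optimization landscapes. To address these trade-offs, we propose a new framework that first collects "cheap" imperfect labels, then performs supervised pretraining, and finally refines the model through self-supervised learning to improve overall performance. Our theoretical analysis and merit-based criterion show that the labeled data only need to place the model inside a basin of attraction, which confirms that a modest number of inexact labels and training epochs is enough. We empirically validate this simple three-stage strategy on challenging domains, including nonconvex constrained optimization, power-grid operation, and stiff dynamical systems, and show that it yields faster convergence; improved accuracy, feasibility, and optimality; and up to 59x reductions in total offline cost.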

Summary

The paper aims to improve the efficiency of solving optimization and simulation problems by using machine-learning surrogates that require inexpensive labels. It proposes a three-stage framework: collecting cheap imperfect labels, performing supervised pretraining, and refining through self-supervised learning. The study shows that this approach can achieve faster convergence, better accuracy, and up to 59x reductions in offline cost compared to existing methods.

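The paper aims to make solving optimization and simulation problems more efficient by using machine-learning surrogates that can work with cheap labels. It proposes a three-stage framework: collect cheap imperfect labels, perform supervised pretraining, and refine the model through self-supervised learning. The results show faster convergence, better accuracy and feasibility, and up to 59x lower offline cost compared with existing methods.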

Deep FlexQP: Accelerated Nonlinear Programming via Deep Unfolding

Authors: Alex Oshin, Rahul Vodeb Ghosh, Augustinos D. Saravanos, Evangelos A. Theodorou

Venue: ICLR 2026

First: 2025-12-01T11:38:45+00:00 · Latest: 2026-03-05T18:54:48+00:00

Comments: Accepted to ICLR 2026

Abstract

We propose FlexQP, an always-feasible convex quadratic programming (QP) solver based on an $\ell_1$ elastic relaxation of the QP constraints. If the original constraints are feasible, FlexQP provably recovers the optimal solution. If the constraints are infeasible, FlexQP identifies a solution that minimizes the constraint violation while keeping the number of violated constraints sparse. Such infeasibilities arise naturally in sequential quadratic programming (SQP) subproblems due to the linearization of the constraints. We prove the convergence of FlexQP under mild coercivity assumptions, making it robust to both feasible and infeasible QPs. We then apply deep unfolding to learn LSTM-based, dimension-agnostic feedback policies for the algorithm parameters, yielding an accelerated Deep FlexQP. To preserve the exactness guarantees of the relaxation, we propose a normalized training loss that incorporates the Lagrange multipliers. We additionally design a log-scaled loss for PAC-Bayes generalization bounds that yields substantially tighter performance certificates, which we use to construct an accelerated SQP solver with guaranteed QP subproblem performance. Deep FlexQP outperforms state-of-the-art learned QP solvers on a suite of benchmarks including portfolio optimization, classification, and regression problems, and scales to dense QPs with over 10k variables and constraints via fine-tuning. When deployed within SQP, our approach solves nonlinear trajectory optimization problems 4-16x faster than SQP with OSQP while substantially improving success rates. On predictive safety filter problems, Deep FlexQP reduces safety violations by over 70\% and increases task completion by 43\% compared to existing methods.

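Title / Abstract (translated from Chinese)

Title: Deep FlexQP: Accelerated Nonlinear Programming via Deep Unfolding

We propose FlexQP, a convex quadratic programming (QP) solver based on an $\ell_1$ elastic relaxation of the QP constraints. If the original constraints are feasible, FlexQP recovers the optimal solution. If they are infeasible, FlexQP finds a solution that minimizes the constraint violation while keeping the number of violated constraints sparse. Such infeasibility arises naturally in sequential quadratic programming (SQP) subproblems because the constraints are linearized. We prove that FlexQP converges under mild coercivity assumptions, which makes it robust to both feasible and infeasible QPs. We then apply deep unfolding to learn LSTM-based, dimension-agnostic feedback policies for the algorithm parameters, yielding an accelerated Deep FlexQP. To preserve the exactness guarantees of the relaxation, we propose a normalized training loss that incorporates the Lagrange multipliers. We also design a log-scaled loss for PAC-Bayes generalization bounds that gives substantially tighter performance certificates, which we use to build an accelerated SQP solver with guaranteed QP subproblem performance. Deep FlexQP outperforms state-of-the-art learned QP solvers on a suite of benchmarks that includes portfolio optimization, classification, and regression problems, and scales through fine-tuning to dense QPs with more than 10,000 variables and constraints. When deployed within SQP, our approach is 4-16x faster than SQP with OSQP while substantially improving success rates. On predictive safety filter problems, Deep FlexQP reduces safety violations by more than 70% and increases task completion by 43% compared with existing methods.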
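As a reading aid for the FlexQP entry, the following is a generic textbook form of an $\ell_1$ elastic relaxation for an inequality-constrained convex QP. It is a sketch under stated assumptions, not the paper's exact formulation: the symbols $P$, $q$, $A$, $b$, the slack $s$, and the penalty weight $\mu$ are introduced here for illustration, and the paper may treat equality constraints or the penalty differently.

```latex
% Original convex QP (P positive semidefinite):
\begin{aligned}
\min_{x}\quad & \tfrac{1}{2} x^\top P x + q^\top x \\
\text{s.t.}\quad & A x \le b .
\end{aligned}

% \ell_1 elastic relaxation with slack s and penalty weight \mu > 0.
% Because s \ge 0, the term \mathbf{1}^\top s equals \|s\|_1.
\begin{aligned}
\min_{x,\, s}\quad & \tfrac{1}{2} x^\top P x + q^\top x + \mu \, \mathbf{1}^\top s \\
\text{s.t.}\quad & A x - s \le b, \qquad s \ge 0 .
\end{aligned}
```

The relaxed problem is always feasible, since a large enough $s$ satisfies any constraint. By the standard exact-penalty argument, if the original QP is feasible with an optimal multiplier $\lambda^\star$ and $\mu > \|\lambda^\star\|_\infty$, the relaxed problem is solved with $s = 0$ and returns the original optimum. The $\ell_1$ penalty also tends to concentrate any unavoidable violation on a few constraints, which matches the sparsity behavior the abstract describes.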

CBF-RL: Safety Filtering Reinforcement Learning in Training with Control Barrier Functions

Authors: Lizhi Yang, Blake Werner, Massimiliano de Sa, Aaron D. Ames

First: 2025-10-16T17:58:58+00:00 · Latest: 2026-03-05T18:52:25+00:00

Comments: 8 pages

Abstract

Reinforcement learning (RL), while powerful and expressive, can often prioritize performance at the expense of safety. Yet safety violations can lead to catastrophic outcomes in real-world deployments. Control Barrier Functions (CBFs) offer a principled method to enforce dynamic safety -- traditionally deployed online via safety filters. While the result is safe behavior, the fact that the RL policy does not have knowledge of the CBF can lead to conservative behaviors. This paper proposes CBF-RL, a framework for generating safe behaviors with RL by enforcing CBFs in training. CBF-RL has two key attributes: (1) minimally modifying a nominal RL policy to encode safety constraints via a CBF term, (2) and safety filtering of the policy rollouts in training. Theoretically, we prove that continuous-time safety filters can be deployed via closed-form expressions on discrete-time roll-outs. Practically, we demonstrate that CBF-RL internalizes the safety constraints in the learned policy -- both enforcing safer actions and biasing towards safer rewards -- enabling safe deployment without the need for an online safety filter. We validate our framework through ablation studies on navigation tasks and on the Unitree G1 humanoid robot, where CBF-RL enables safer exploration, faster convergence, and robust performance under uncertainty, enabling the humanoid robot to avoid obstacles and climb stairs safely in real-world settings without a runtime safety filter.

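Title / Abstract (translated from Chinese)

Title: CBF-RL: Safety Filtering Reinforcement Learning in Training with Control Barrier Functions

Reinforcement learning (RL), while powerful and expressive, often prioritizes performance at the expense of safety. In real-world deployments, however, safety violations can have catastrophic consequences. Control Barrier Functions (CBFs) offer a principled way to enforce dynamic safety and are traditionally deployed online through safety filters. The resulting behavior is safe, but because the RL policy has no knowledge of the CBF, it can be conservative. This paper proposes CBF-RL, a framework that generates safe behavior by enforcing CBFs during training. CBF-RL has two key attributes: (1) it minimally modifies a nominal RL policy to encode safety constraints through a CBF term, and (2) it safety-filters the policy rollouts during training. In theory, we prove that continuous-time safety filters can be deployed through closed-form expressions on discrete-time rollouts. In practice, we show that CBF-RL internalizes the safety constraints in the learned policy, both enforcing safer actions and biasing toward safer rewards, which enables safe deployment without an online safety filter. We validate the framework through ablation studies on navigation tasks and on the Unitree G1 humanoid robot, where CBF-RL enables safer exploration, faster convergence, and robust performance under uncertainty, allowing the robot to avoid obstacles and climb stairs safely in real-world settings without a runtime safety filter.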

Summary

CBF-RL is a framework that integrates Control Barrier Functions (CBFs) into reinforcement learning (RL) training to ensure safety without needing an online safety filter. It minimally modifies the RL policy to include CBF constraints and filters policy rollouts during training. Experiments show that CBF-RL enables safer exploration, faster convergence, and robust performance under uncertainty, allowing the humanoid robot to navigate safely in real-world settings.

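CBF-RL is a framework that integrates Control Barrier Functions (CBFs) into reinforcement learning (RL) training to ensure safety without an online safety filter. It minimally modifies the RL policy by adding a CBF term and filters the policy rollouts during training. The theoretical analysis shows that continuous-time safety filters can be applied to discrete-time rollouts. Experiments show that CBF-RL makes exploration safer, speeds up convergence, and improves robustness under uncertainty, allowing the robot to avoid obstacles and climb stairs safely without an online safety filter.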
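The kind of closed-form safety filter mentioned in the CBF-RL abstract can be sketched for the simplest case: a single affine CBF constraint a·u + b >= 0, where the minimally modifying action is a Euclidean projection. This is an illustrative example, not the paper's discrete-time result. The single-integrator dynamics (x' = u), the circular obstacle, and the names `cbf_filter` and `obstacle_cbf_terms` are assumptions.

```python
import numpy as np

def cbf_filter(u_nom, a, b):
    """Closest action to u_nom (in Euclidean norm) satisfying a @ u + b >= 0."""
    slack = a @ u_nom + b
    if slack >= 0.0:
        return u_nom                      # nominal action is already safe
    return u_nom - slack / (a @ a) * a    # project onto the constraint boundary

def obstacle_cbf_terms(x, center, radius, alpha):
    """CBF terms for x' = u with h(x) = |x - center|^2 - radius^2.

    The condition dh/dt + alpha * h >= 0 becomes a @ u + b >= 0 with
    a = grad h(x) and b = alpha * h(x). Assumes x != center."""
    h = np.sum((x - center) ** 2) - radius ** 2
    a = 2.0 * (x - center)
    b = alpha * h
    return a, b

# Usage: a nominal action that heads toward the obstacle gets slowed down.
x = np.array([2.0, 0.0])
a, b = obstacle_cbf_terms(x, center=np.zeros(2), radius=1.0, alpha=1.0)
u_safe = cbf_filter(np.array([-1.0, 0.0]), a, b)
```

In a training loop of the kind the abstract describes, such a filter would be applied to each action before stepping the environment, and the size of the correction could feed a penalty term in the reward; how the paper combines the two is not specified in this digest.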

Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline

Authors: Guo Chen, Lidong Lu, Yicheng Liu, Liangrui Dong, Lidong Zou, Jixin Lv, Zhenquan Li, Xinyi Mao, Baoqi Pei, Shihao Wang, Zhiqi Li, Karan Sapra, Fuxiao Liu, Yin-Dong Zheng, Yifei Huang, Limin Wang, Zhiding Yu, Andrew Tao, Guilin Liu, Tong Lu

First: 2026-03-05T18:52:12+00:00 · Latest: 2026-03-05T18:52:12+00:00

Abstract

While datasets for video understanding have scaled to hour-long durations, they typically consist of densely concatenated clips that differ from natural, unscripted daily life. To bridge this gap, we introduce MM-Lifelong, a dataset designed for Multimodal Lifelong Understanding. Comprising 181.1 hours of footage, it is structured across Day, Week, and Month scales to capture varying temporal densities. Extensive evaluations reveal two critical failure modes in current paradigms: end-to-end MLLMs suffer from a Working Memory Bottleneck due to context saturation, while representative agentic baselines experience Global Localization Collapse when navigating sparse, month-long timelines. To address this, we propose the Recursive Multimodal Agent (ReMA), which employs dynamic memory management to iteratively update a recursive belief state, significantly outperforming existing methods. Finally, we establish dataset splits designed to isolate temporal and domain biases, providing a rigorous foundation for future research in supervised learning and out-of-distribution generalization.

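Title / Abstract (translated from Chinese)

Title: Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline

Although video understanding datasets have scaled to hour-long durations, they usually consist of densely concatenated clips that differ from natural, unscripted daily life. To bridge this gap, we introduce MM-Lifelong, a dataset designed for Multimodal Lifelong Understanding. It contains 181.1 hours of footage structured across Day, Week, and Month scales to capture varying temporal densities. Extensive evaluations reveal two critical failure modes in current paradigms: end-to-end MLLMs suffer from a Working Memory Bottleneck caused by context saturation, while representative agentic baselines experience Global Localization Collapse when navigating sparse, month-long timelines. To address this, we propose the Recursive Multimodal Agent (ReMA), which uses dynamic memory management to iteratively update a recursive belief state and significantly outperforms existing methods. Finally, we establish dataset splits designed to isolate temporal and domain biases, providing a rigorous foundation for future research on supervised learning and out-of-distribution generalization.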

Summary

The research aims to bridge the gap between existing video understanding datasets and natural, unscripted daily life by introducing MM-Lifelong, a dataset with 181.1 hours of footage structured across Day, Week, and Month scales. The study evaluates current methods and finds that end-to-end MLLMs face a Working Memory Bottleneck and representative agentic baselines suffer from Global Localization Collapse. To address these issues, the Recursive Multimodal Agent (ReMA) is proposed, which uses dynamic memory management to iteratively update a recursive belief state, outperforming existing methods. The dataset also includes splits to isolate temporal and domain biases, providing a rigorous foundation for future research.

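The research aims to bridge the gap between existing video understanding datasets and natural, unscripted daily life by introducing MM-Lifelong, a dataset with 181.1 hours of footage structured across Day, Week, and Month scales. The study evaluates current methods and finds that end-to-end MLLMs face a Working Memory Bottleneck, while representative agentic baselines suffer Global Localization Collapse when navigating sparse, month-long timelines. To address these issues, it proposes the Recursive Multimodal Agent (ReMA), which uses dynamic memory management to iteratively update a recursive belief state and outperforms existing methods. The dataset also includes splits that isolate temporal and domain biases, providing a solid foundation for future research on supervised learning and out-of-distribution generalization.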

SurvHTE-Bench: A Benchmark for Heterogeneous Treatment Effect Estimation in Survival Analysis

Authors: Shahriar Noroozizadeh, Xiaobin Shen, Jeremy C. Weiss, George H. Chen

Venue: ICLR 2026

First: 2026-03-05T18:52:02+00:00 · Latest: 2026-03-05T18:52:02+00:00

Comments: The Fourteenth International Conference on Learning Representations (ICLR 2026)

Abstract

Estimating heterogeneous treatment effects (HTEs) from right-censored survival data is critical in high-stakes applications such as precision medicine and individualized policy-making. Yet, the survival analysis setting poses unique challenges for HTE estimation due to censoring, unobserved counterfactuals, and complex identification assumptions. Despite recent advances, from Causal Survival Forests to survival meta-learners and outcome imputation approaches, evaluation practices remain fragmented and inconsistent. We introduce SurvHTE-Bench, the first comprehensive benchmark for HTE estimation with censored outcomes. The benchmark spans (i) a modular suite of synthetic datasets with known ground truth, systematically varying causal assumptions and survival dynamics, (ii) semi-synthetic datasets that pair real-world covariates with simulated treatments and outcomes, and (iii) real-world datasets from a twin study (with known ground truth) and from an HIV clinical trial. Across synthetic, semi-synthetic, and real-world settings, we provide the first rigorous comparison of survival HTE methods under diverse conditions and realistic assumption violations. SurvHTE-Bench establishes a foundation for fair, reproducible, and extensible evaluation of causal survival methods. The data and code of our benchmark are available at: https://github.com/Shahriarnz14/SurvHTE-Bench .

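Title / Abstract (translated from Chinese)

Title: SurvHTE-Bench: A Benchmark for Heterogeneous Treatment Effect Estimation in Survival Analysis

Estimating heterogeneous treatment effects (HTEs) from right-censored survival data is critical in high-stakes applications such as precision medicine and individualized policy-making. However, censoring, unobserved counterfactuals, and complex identification assumptions make HTE estimation uniquely challenging in the survival analysis setting. Despite recent advances, from Causal Survival Forests to survival meta-learners and outcome imputation approaches, evaluation practices remain fragmented and inconsistent. We introduce SurvHTE-Bench, the first benchmark for HTE estimation with censored outcomes. The benchmark covers (i) a modular suite of synthetic datasets with known ground truth that systematically vary causal assumptions and survival dynamics, (ii) semi-synthetic datasets that pair real-world covariates with simulated treatments and outcomes, and (iii) real-world datasets from a twin study (with known ground truth) and from an HIV clinical trial. Across synthetic, semi-synthetic, and real-world settings, we provide the first rigorous comparison of survival HTE methods under diverse conditions and realistic assumption violations. SurvHTE-Bench lays a foundation for fair, reproducible, and extensible evaluation of causal survival methods. The benchmark data and code are available at: https://github.com/Shahriarnz14/SurvHTE-Bench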

Summary

The paper introduces SurvHTE-Bench, a comprehensive benchmark for evaluating heterogeneous treatment effect estimation in survival analysis. It addresses the unique challenges of censoring and unobserved counterfactuals by providing synthetic, semi-synthetic, and real-world datasets. The benchmark evaluates various methods including causal survival forests, survival meta-learners, and outcome imputation approaches under diverse conditions, offering a rigorous comparison and establishing a foundation for fair and reproducible evaluation.

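The research addresses the challenge of estimating heterogeneous treatment effects from right-censored survival data, which is critical for precision medicine and individualized policy-making. It introduces SurvHTE-Bench, a comprehensive benchmark with synthetic, semi-synthetic, and real-world datasets for rigorously evaluating survival HTE methods under varied conditions. The main finding is that different methods perform differently across these settings, which underlines the need for a standardized evaluation framework.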

FMint-SDE: A Multimodal Foundation Model for Accelerating Numerical Simulation of SDEs via Error Correction

Authors: Jiaxin Yuan, Haizhao Yang, Maria Cameron

First: 2025-10-31T04:49:41+00:00 · Latest: 2026-03-05T18:50:52+00:00

Abstract

Fast and accurate simulation of dynamical systems is a fundamental challenge across scientific and engineering domains. Traditional numerical integrators often face a trade-off between accuracy and computational efficiency, while existing neural network-based approaches typically require training a separate model for each case. To overcome these limitations, we introduce a novel multi-modal foundation model for large-scale simulations of differential equations: FMint-SDE (Foundation Model based on Initialization for stochastic differential equations). Based on a decoder-only transformer with in-context learning, FMint-SDE leverages numerical and textual modalities to learn a universal error-correction scheme. It is trained using prompted sequences of coarse solutions generated by conventional solvers, enabling broad generalization across diverse systems. We evaluate our models on a suite of challenging SDE benchmarks spanning applications in molecular dynamics, mechanical systems, finance, and biology. Experimental results show that our approach achieves a superior accuracy-efficiency tradeoff compared to classical solvers, underscoring the potential of FMint-SDE as a general-purpose simulation tool for dynamical systems.

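Title / Abstract (translated from Chinese)

Title: FMint-SDE: A Multimodal Foundation Model for Accelerating Numerical Simulation of SDEs via Error Correction

Fast and accurate simulation of dynamical systems is a fundamental challenge across science and engineering. Traditional numerical integrators usually trade accuracy against computational efficiency, while existing neural-network-based approaches typically need a separate model trained for each case. To overcome these limitations, we propose a new multimodal foundation model for large-scale simulation of differential equations: FMint-SDE (Foundation Model based on Initialization for stochastic differential equations). Built on a decoder-only Transformer with in-context learning, FMint-SDE uses numerical and textual modalities to learn a universal error-correction scheme. It is trained on prompted sequences of coarse solutions generated by conventional solvers, which enables broad generalization across diverse systems. We evaluate our models on a suite of challenging SDE benchmarks covering applications in molecular dynamics, mechanical systems, finance, and biology. Experimental results show that our approach achieves a better accuracy-efficiency trade-off than classical solvers, highlighting the potential of FMint-SDE as a general-purpose simulation tool for dynamical systems.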

Summary

The research aims to improve the accuracy and efficiency of simulating dynamical systems by introducing FMint-SDE, a multimodal foundation model. It uses a decoder-only transformer with in-context learning to learn an error-correction scheme from numerical and textual inputs. The model is trained on coarse solutions from conventional solvers and demonstrates a better accuracy-efficiency tradeoff compared to classical solvers across various applications such as molecular dynamics, mechanical systems, finance, and biology.

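The research addresses fast and accurate simulation of dynamical systems by introducing FMint-SDE, a multimodal foundation model. The model combines a decoder-only Transformer and in-context learning with numerical and textual data to learn a universal error-correction scheme. Trained on coarse solutions generated by conventional solvers, FMint-SDE shows a better accuracy-efficiency trade-off than classical solvers across applications in molecular dynamics, mechanical systems, finance, and biology, which makes it a strong candidate for a general-purpose simulation tool for dynamical systems.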

Towards 3D Scene Understanding of Gas Plumes in LWIR Hyperspectral Images Using Neural Radiance Fields

Authors: Scout Jarman, Zigfried Hampel-Arias, Adra Carr, Kevin R. Moon

First: 2026-03-05T18:44:45+00:00 · Latest: 2026-03-05T18:44:45+00:00

Comments: This manuscript was submitted to SPIE JARS and is under review. Code and Data can be found at https://github.com/lanl/HSI-Nerfstudio and https://zenodo.org/records/18626884 respectively. Video 1 and Video 2 can be found at https://github.com/lanl/HSI-Nerfstudio/blob/main/renders/paper/grid_Falsecolor.mp4 and https://github.com/lanl/HSI-Nerfstudio/blob/main/renders/paper/grid_ACE.mp4 respectively

Abstract

Hyperspectral images (HSI) have many applications, ranging from environmental monitoring to national security, and can be used for material detection and identification. Longwave infrared (LWIR) HSI can be used for gas plume detection and analysis. Oftentimes, only a few images of a scene of interest are available and are analyzed individually. The ability to combine information from multiple images into a single, cohesive representation could enhance analysis by providing more context on the scene's geometry and spectral properties. Neural radiance fields (NeRFs) create a latent neural representation of volumetric scene properties that enable novel-view rendering and geometry reconstruction, offering a promising avenue for hyperspectral 3D scene reconstruction. We explore the possibility of using NeRFs to create 3D scene reconstructions from LWIR HSI and demonstrate that the model can be used for the basic downstream analysis task of gas plume detection. The physics-based DIRSIG software suite was used to generate a synthetic multi-view LWIR HSI dataset of a simple facility with a strong sulfur hexafluoride gas plume. Our method, built on the standard Mip-NeRF architecture, combines state-of-the-art methods for hyperspectral NeRFs and sparse-view NeRFs, along with a novel adaptive weighted MSE loss. Our final NeRF method requires around 50% fewer training images than the standard Mip-NeRF and achieves an average PSNR of 39.8 dB with as few as 30 training images. Gas plume detection applied to NeRF-rendered test images using the adaptive coherence estimator achieves an average AUC of 0.821 when compared with detection masks generated from ground-truth test images.

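Title / Abstract (translated from Chinese)

Title: Towards 3D Scene Understanding of Gas Plumes in LWIR Hyperspectral Images Using Neural Radiance Fields

Hyperspectral images (HSI) are widely used in areas such as environmental monitoring and national security, and can be used for material detection and identification. Longwave infrared (LWIR) HSI can be used to detect and analyze gas plumes. Often only a few images of a scene of interest are available, and they are analyzed individually. Combining information from multiple images into a single coherent representation could improve analysis by providing more context on the scene's geometry and spectral properties. Neural radiance fields (NeRFs) create a latent neural representation of volumetric scene properties that enables novel-view rendering and geometry reconstruction, offering a promising route to hyperspectral 3D scene reconstruction. We explore using NeRFs to create 3D scene reconstructions from LWIR HSI and show that the model can be used for a basic downstream analysis task, gas plume detection. The physics-based DIRSIG software suite was used to generate a synthetic multi-view LWIR HSI dataset of a simple facility with a strong sulfur hexafluoride gas plume. Our method builds on the standard Mip-NeRF architecture and combines state-of-the-art methods for hyperspectral NeRFs and sparse-view NeRFs with a novel adaptive weighted MSE loss. The final NeRF method needs about 50% fewer training images than the standard Mip-NeRF and reaches an average PSNR of 39.8 dB with as few as 30 training images. Gas plume detection with the adaptive coherence estimator on NeRF-rendered test images achieves an average AUC of 0.821 when compared with detection masks generated from ground-truth test images.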

Summary

The research aims to enhance the analysis of gas plumes in longwave infrared hyperspectral images by creating 3D scene reconstructions using neural radiance fields (NeRFs). The method combines state-of-the-art techniques for hyperspectral NeRFs and sparse-view NeRFs, along with an adaptive weighted mean squared error loss. The model requires fewer training images and achieves an average PSNR of 39.8 dB with as few as 30 images, demonstrating its effectiveness in gas plume detection with an average AUC of 0.821.

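The research aims to improve the analysis of gas plumes in longwave infrared hyperspectral images by creating 3D scene reconstructions with neural radiance fields (NeRFs). The method combines state-of-the-art hyperspectral NeRF and sparse-view NeRF techniques with an adaptive weighted mean squared error loss. The model needs fewer training images and reaches an average PSNR of 39.8 dB with as few as 30 images, which supports its use for gas plume detection on the reconstructed scene.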
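The adaptive coherence estimator (ACE) named in the abstract is a standard hyperspectral target detector. A minimal NumPy sketch of its usual form is given below, assuming a known target signature and background statistics estimated from plume-free pixels. The function name and the choice to subtract the background mean from both pixel and target are illustrative assumptions; conventions vary, and the paper's exact preprocessing is not given here.

```python
import numpy as np

def ace_scores(pixels, target, bg_mean, bg_cov):
    """ACE score per pixel, in [0, 1]; higher means more target-like.

    pixels: (n, bands) spectra, target: (bands,) signature,
    bg_mean / bg_cov: background mean and covariance.
    A pixel exactly equal to bg_mean gives 0/0 and should be masked upstream."""
    x = pixels - bg_mean
    s = target - bg_mean
    cov_inv = np.linalg.inv(bg_cov)
    sx = x @ cov_inv @ s                              # s^T C^-1 x for each pixel
    ss = s @ cov_inv @ s                              # s^T C^-1 s
    xx = np.einsum("nb,bc,nc->n", x, cov_inv, x)      # x^T C^-1 x for each pixel
    return sx ** 2 / (ss * xx)
```

Thresholding these scores yields a detection mask; sweeping the threshold against a reference mask gives the ROC curve from which an AUC like the one reported can be computed.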

Kraus Constrained Sequence Learning For Quantum Trajectories from Continuous Measurement

Authors: Priyanshi Singh, Krishna Bhatia

Venue: ICLR 2026 Poster

First: 2026-03-05T18:37:05+00:00 · Latest: 2026-03-05T18:37:05+00:00

Comments: Poster at AI&PDE: ICLR 2026 Workshop on AI and Partial Differential Equations. 17 pages, 3 figures

Abs · PDF · Code1 · Code2

Abstract

Real-time reconstruction of conditional quantum states from continuous measurement records is a fundamental requirement for quantum feedback control, yet standard stochastic master equation (SME) solvers require exact model specification and known system parameters, and are sensitive to parameter mismatch. While neural sequence models can fit these stochastic dynamics, unconstrained predictors can violate physicality requirements such as positivity or trace constraints, leading to unstable rollouts and unphysical estimates. We propose a Kraus-structured output layer that converts the hidden representation of a generic sequence backbone into a completely positive, trace-preserving (CPTP) quantum operation, yielding physically valid state updates by construction. We instantiate this layer across diverse backbones (RNN, GRU, LSTM, TCN, ESN, and Mamba), with Neural ODE as a comparative baseline, on stochastic trajectories characterized by parameter drift. Our evaluation reveals distinct trade-offs between gating mechanisms, linear recurrence, and global attention. Across all models, Kraus-LSTM achieves the strongest results, improving state estimation quality by 7% over its unconstrained counterpart while guaranteeing physically valid predictions in non-stationary regimes.

中文标题/摘要

标题：Kraus 约束序列学习在连续测量量子轨迹中的应用

从连续测量记录实时重构条件量子态是量子反馈控制的基本要求，但标准随机主方程（SME）求解器需要精确的模型规格、已知的系统参数，并且对参数不匹配敏感。虽然神经序列模型可以拟合这些随机动力学，但未约束的预测器可能会违反物理性，如正性或迹约束，导致不稳定的滚动和不物理的估计。我们提出了一种 Kraus 结构的输出层，将通用序列主干的隐藏表示转换为完全正且迹保持（CPTP）的量子操作，从而通过构造生成物理上有效的状态更新。我们在 RNN、GRU、LSTM、TCN、ESN 和 Mamba 等不同的主干上实例化了这一层，包括将神经 ODE 作为基准进行比较，应用于参数漂移的随机轨迹。我们的评估揭示了门控机制、线性递归和全局注意力之间的不同权衡。在所有模型中，Kraus-LSTM 达到了最佳结果，相比其未约束的版本提高了 7% 的状态估计质量，同时在非稳态区域保证了物理上有效的预测。

Summary / 总结

This study addresses the challenge of reconstructing quantum states from continuous measurement data, which is crucial for quantum feedback control. It proposes a Kraus-structured output layer that ensures physically valid state updates by construction, addressing the limitations of unconstrained neural sequence models. The method was evaluated across various sequence models, and Kraus-LSTM was found to outperform its unconstrained counterpart by 7% in state estimation quality, while maintaining physical validity in non-stationary regimes.

研究旨在改进从连续测量数据实时重构量子态的方法，这对于量子反馈控制至关重要。方法引入了一个Kraus结构的输出层，确保了物理上有效的状态更新，解决了非约束神经序列模型的限制。关键实验发现表明，Kraus-LSTM模型在状态估计质量上比其非约束版本高出7%，并且在非稳态条件下保持了物理有效性。
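
To illustrate why a Kraus-form update keeps a state physical, here is a minimal sketch of a generic Kraus update followed by trace renormalization. It is not the authors' output layer: the abstract describes a CPTP construction, whereas this sketch simply applies arbitrary operators and renormalizes, as in a conditional measurement update. The helper names and the example operator are assumptions made for illustration.

```python
def matmul(a, b):
    """Product of two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def dagger(a):
    """Conjugate transpose."""
    return [[a[j][i].conjugate() for j in range(len(a))]
            for i in range(len(a[0]))]


def trace(a):
    return sum(a[i][i] for i in range(len(a)))


def kraus_update(rho, kraus_ops):
    """Return sum_k K rho K^dagger, renormalized to unit trace."""
    d = len(rho)
    out = [[0j] * d for _ in range(d)]
    for K in kraus_ops:
        term = matmul(matmul(K, rho), dagger(K))
        for i in range(d):
            for j in range(d):
                out[i][j] += term[i][j]
    norm = trace(out).real
    if norm <= 0.0:
        raise ValueError("update annihilated the state")
    return [[out[i][j] / norm for j in range(d)] for i in range(d)]


# Hypothetical single-qubit example: |+><+| and an arbitrary operator K.
rho = [[0.5, 0.5], [0.5, 0.5]]
K = [[1.0, 0.3], [0.2j, 0.5]]
new_rho = kraus_update(rho, [K])
print(trace(new_rho))  # unit trace up to rounding
```

Because each term has the form K ρ K†, the result stays Hermitian and positive semidefinite whatever values a network emits for K, which is the property an unconstrained predictor lacks.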

HALP: Detecting Hallucinations in Vision-Language Models without Generating a Single Token

Authors: Sai Akhil Kogilathota, Sripadha Vallabha E G, Luzhe Sun, Jiawei Zhou

Venue: The 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026)

First: 2026-03-05T18:36:31+00:00 · Latest: 2026-03-05T18:36:31+00:00

Abs · PDF · Code1 · Code2

Abstract

Hallucinations remain a persistent challenge for vision-language models (VLMs), which often describe nonexistent objects or fabricate facts. Existing detection methods typically operate after text generation, making intervention both costly and untimely. We investigate whether hallucination risk can instead be predicted before any token is generated by probing a model's internal representations in a single forward pass. Across a diverse set of vision-language tasks and eight modern VLMs, including Llama-3.2-Vision, Gemma-3, Phi-4-VL, and Qwen2.5-VL, we examine three families of internal representations: (i) visual-only features without multimodal fusion, (ii) vision-token representations within the text decoder, and (iii) query-token representations that integrate visual and textual information before generation. Probes trained on these representations achieve strong hallucination-detection performance without decoding, reaching up to 0.93 AUROC on Gemma-3-12B, Phi-4-VL 5.6B, and Molmo 7B. Late query-token states are the most predictive for most models, while visual or mid-layer features dominate in a few architectures (e.g., ~0.79 AUROC for Qwen2.5-VL-7B using visual-only features). These results demonstrate that (1) hallucination risk is detectable pre-generation, (2) the most informative layer and modality vary across architectures, and (3) lightweight probes have the potential to enable early abstention, selective routing, and adaptive decoding to improve both safety and efficiency.

中文标题/摘要

标题：HALP：无需生成单个词元即可检测视觉语言模型中的幻觉

幻觉仍然是视觉语言模型（VLMs）的一个持续性挑战，它们经常描述不存在的对象或编造事实。现有的检测方法通常在文本生成之后进行操作，使得干预既昂贵又不及时。我们研究了是否可以在生成任何词元之前通过探测模型的内部表示来预测幻觉风险，而只需进行一次前向传递。在一系列视觉语言任务和八个现代VLMs（包括Llama-3.2-Vision、Gemma-3、Phi-4-VL和Qwen2.5-VL）中，我们考察了三种内部表示家族：（i）仅视觉特征而不进行多模态融合，（ii）文本解码器中的视觉词元表示，以及（iii）在生成之前整合视觉和文本信息的查询词元表示。基于这些表示训练的探测器在无需解码的情况下实现了强大的幻觉检测性能，达到Gemma-3-12B、Phi-4-VL 5.6B和Molmo 7B上0.93的AUROC。大多数模型中，后期查询词元状态最具预测性，而在少数架构中，视觉或中间层特征占主导地位（例如，Qwen2.5-VL-7B使用仅视觉特征的AUROC约为0.79）。这些结果表明：（1）幻觉风险可以在生成之前检测到；（2）最具信息量的层和模态在不同架构中有所不同；（3）轻量级探测器有可能实现早期回避、选择性路由和自适应解码，以提高安全性和效率。

Summary / 总结

The research aims to detect hallucinations in vision-language models before any text generation, using internal model representations. Three types of internal representations are explored: visual-only features, vision-token representations, and query-token representations. The probes trained on these representations achieve strong hallucination-detection performance, with late query-token states being the most predictive for most models. The study shows that hallucination risk is detectable pre-generation, the most informative layer and modality vary across architectures, and lightweight probes can enable early intervention to improve safety and efficiency.

研究通过在生成任何词之前探测模型的内部表示来预测幻觉风险，以应对视觉语言模型中的幻觉问题。使用内部表示上的探针来检测幻觉，多个模型上达到了高达0.93 AUROC的性能。结果显示，大多数模型中晚期查询词表示是最具预测性的，而视觉或中间层特征对某些架构更具信息性，表明幻觉风险可以在生成前被检测到，并且轻量级探针可以实现早期干预以提高安全性和效率。
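
The AUROC values reported in this entry measure how well a probe's score ranks hallucinated examples above non-hallucinated ones. A minimal sketch of the metric follows, assuming binary labels (1 for hallucination) and one scalar probe score per example; the function name and sample data are illustrative only.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Labels must be 0 or 1.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical probe scores for four examples.
print(auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives context for the 0.93 and ~0.79 figures above.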

EdgeDAM: Real-time Object Tracking for Mobile Devices

Authors: Syed Muhammad Raza, Syed Murtaza Hussain Abidi, Khawar Islam, Muhammad Ibrahim, Ajmal Saeed Mian

First: 2026-03-05T18:35:25+00:00 · Latest: 2026-03-05T18:35:25+00:00

Comments: 10 pages

Abs · PDF · Code1 · Code2

Abstract

Single-object tracking (SOT) on edge devices is a critical computer vision task, requiring accurate and continuous target localization across video frames under occlusion, distractor interference, and fast motion. However, recent state-of-the-art distractor-aware memory mechanisms are largely built on segmentation-based trackers and rely on mask prediction and attention-driven memory updates, which introduce substantial computational overhead and limit real-time deployment on resource-constrained hardware; meanwhile, lightweight trackers sustain high throughput but are prone to drift when visually similar distractors appear. To address these challenges, we propose EdgeDAM, a lightweight detection-guided tracking framework that reformulates distractor-aware memory for bounding-box tracking under strict edge constraints. EdgeDAM introduces two key strategies: (1) Dual-Buffer Distractor-Aware Memory (DAM), which integrates a Recent-Aware Memory to preserve temporally consistent target hypotheses and a Distractor-Resolving Memory to explicitly store hard negative candidates and penalize their re-selection during recovery; and (2) Confidence-Driven Switching with Held-Box Stabilization, where tracker reliability and temporal consistency criteria adaptively activate detection and memory-guided re-identification during occlusion, while a held-box mechanism temporarily freezes and expands the estimate to suppress distractor contamination. Extensive experiments on five benchmarks, including the distractor-focused DiDi dataset, demonstrate improved robustness under occlusion and fast motion while maintaining real-time performance on mobile devices, achieving 88.2% accuracy on DiDi and 25 FPS on an iPhone 15. Code will be released.

中文标题/摘要

标题：EdgeDAM：移动设备上的实时目标跟踪

边缘设备上的单目标跟踪（SOT）是计算机视觉中的关键任务，需要在遮挡、干扰物干扰和快速运动的情况下，准确且连续地在视频帧中定位目标。然而，最近的基于分割的跟踪器中的干扰物感知记忆机制大多依赖于掩码预测和注意力驱动的记忆更新，这引入了大量计算开销，限制了在资源受限硬件上的实时部署；同时，轻量级跟踪器虽然保持了高吞吐量，但在出现视觉相似的干扰物时容易漂移。为了解决这些挑战，我们提出了一种轻量级的检测引导跟踪框架EdgeDAM，该框架在严格边缘约束下重新定义了干扰物感知记忆，用于边界框跟踪。EdgeDAM引入了两种关键策略：（1）双缓冲干扰物感知记忆（DAM），它结合了近期感知记忆以保留时间一致的目标假设，并结合了干扰物解决记忆以明确存储困难的负样本候选，并在恢复期间惩罚其重新选择；（2）置信驱动切换与持有框稳定化，其中跟踪器的可靠性和时间一致性标准在遮挡期间适应性地激活检测和记忆引导的再识别，而持有框机制暂时冻结并扩展估计以抑制干扰物污染。在包括干扰物重点的DiDi数据集在内的五个基准上的广泛实验表明，EdgeDAM在遮挡和快速运动下具有更好的鲁棒性，同时在移动设备上保持实时性能，DiDi数据集上的准确率为88.2%，iPhone 15上的帧率为25 FPS。代码将被发布。

Summary / 总结

EdgeDAM is a lightweight detection-guided tracking framework designed for real-time object tracking on edge devices. It addresses the challenges of occlusion, distractors, and fast motion by introducing Dual-Buffer Distractor-Aware Memory and Confidence-Driven Switching with Held-Box Stabilization. Experiments on five benchmarks, including the DiDi dataset, show that EdgeDAM improves robustness under occlusion and fast motion while maintaining real-time performance on mobile devices, achieving 88.2% accuracy on DiDi and 25 FPS on an iPhone 15.

EdgeDAM 是一种轻量级的检测引导跟踪框架，旨在实现在边缘设备上的实时目标跟踪。通过引入双缓冲干扰感知记忆和置信驱动切换与持有框稳定机制，该框架在遮挡、干扰和快速运动下提高了鲁棒性，同时保持实时性能，实现了在 DiDi 数据集上的 88.2% 准确率和 iPhone 15 上的 25 FPS。
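
The dual-buffer idea can be sketched as two bounded queues: one holding recent target boxes, one holding known distractors whose re-selection is penalized. This is a simplified illustration, not the EdgeDAM implementation: it scores candidates by box overlap alone, and the class name, buffer sizes, and penalty weight are all assumptions.

```python
from collections import deque


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


class DualBufferMemory:
    """Hypothetical two-buffer memory for re-identification after occlusion."""

    def __init__(self, recent_size=5, distractor_size=5, penalty=1.0):
        self.recent = deque(maxlen=recent_size)          # recent target boxes
        self.distractors = deque(maxlen=distractor_size)  # hard negatives
        self.penalty = penalty

    def score(self, box):
        # Reward agreement with recent targets, penalize known distractors.
        support = max((iou(box, r) for r in self.recent), default=0.0)
        conflict = max((iou(box, d) for d in self.distractors), default=0.0)
        return support - self.penalty * conflict

    def select(self, candidates):
        return max(candidates, key=self.score)


memory = DualBufferMemory()
memory.recent.append((0, 0, 10, 10))
memory.distractors.append((20, 0, 30, 10))
print(memory.select([(21, 0, 31, 10), (1, 0, 11, 10)]))  # (1, 0, 11, 10)
```

Using `deque(maxlen=...)` keeps memory constant, which matches the edge-device constraint the abstract emphasizes.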

Beyond Scattered Acceptance: Fast and Coherent Inference for DLMs via Longest Stable Prefixes

Authors: Pengxiang Li, Joey Tsai, Hongwei Xue, Kunyu Shi, Shilin Yan

Venue: ICLR 2026

First: 2026-03-05T18:25:26+00:00 · Latest: 2026-03-05T18:25:26+00:00

Comments: Accepted at ICLR 2026

Abs · PDF · Code1 · Code2

Abstract

Diffusion Language Models (DLMs) promise highly parallel text generation, yet their practical inference speed is often bottlenecked by suboptimal decoding schedulers. Standard approaches rely on 'scattered acceptance': committing high-confidence tokens at disjoint positions throughout the sequence. This approach inadvertently fractures the Key-Value (KV) cache, destroys memory locality, and forces the model into costly, repeated repairs across unstable token boundaries. To resolve this, we present the Longest Stable Prefix (LSP) scheduler, a training-free and model-agnostic inference paradigm based on monolithic prefix absorption. In each denoising step, LSP evaluates token stability via a single forward pass, dynamically identifies a contiguous left-aligned block of stable predictions, and snaps its boundary to natural linguistic or structural delimiters before an atomic commitment. This prefix-first topology yields dual benefits: systemically, it converts fragmented KV cache updates into efficient, contiguous appends; algorithmically, it preserves bidirectional lookahead over a geometrically shrinking active suffix, drastically reducing token flip rates and denoiser calls. Extensive evaluations on LLaDA-8B and Dream-7B demonstrate that LSP accelerates inference by up to 3.4x across rigorous benchmarks including mathematical reasoning, code generation, multilingual (CJK) tasks, and creative writing while matching or slightly improving output quality. By fundamentally restructuring the commitment topology, LSP bridges the gap between the theoretical parallelism of DLMs and practical hardware efficiency.

中文标题/摘要

标题：超越零散接受：通过最长稳定前缀实现DLMs的快速和连贯推理

扩散语言模型（DLMs）承诺实现高度并行的文本生成，但在实际推理速度上往往受限于次优的解码调度器。标准方法依赖于“零散接受”，即在序列中不连续的位置上提交高置信度的标记。这种方法无意中破坏了键值（KV）缓存，破坏了内存局部性，并迫使模型在不稳定的标记边界上进行昂贵的重复修复。为了解决这个问题，我们提出了最长稳定前缀（LSP）调度器，这是一种基于整体前缀吸收的无训练和模型无关的推理范式。在每个去噪步骤中，LSP 通过单次前向传递评估标记的稳定性，动态识别一个连续的左对齐的稳定预测块，并在原子提交前将其边界对齐到自然语言或结构分隔符。这种前缀优先的拓扑结构带来了双重好处：系统上，它将碎片化的KV缓存更新转换为高效的连续追加；算法上，它保留了对几何缩小的活动后缀的双向前瞻，大幅减少了标记翻转率和去噪器调用次数。在LLaDA-8B和Dream-7B上的广泛评估表明，LSP 在包括数学推理、代码生成、多语言（CJK）任务和创造性写作在内的严格基准测试中将推理加速了高达3.4倍，同时保持或略微提高了输出质量。通过从根本上重新结构化提交拓扑，LSP 桥接了DLMs的理论并行性和实际硬件效率之间的差距。

Summary / 总结

The paper addresses the issue of slow inference in Diffusion Language Models (DLMs) due to suboptimal decoding schedulers that commit tokens at disjoint positions, leading to inefficient KV cache updates and high token flip rates. It introduces the Longest Stable Prefix (LSP) scheduler, which evaluates token stability and commits a contiguous block of stable predictions, thereby improving memory locality and reducing the need for repeated repairs. Experiments on LLaDA-8B and Dream-7B show that LSP can accelerate inference by up to 3.4x across various benchmarks while maintaining or slightly improving output quality.

论文针对扩散语言模型(DLMs)由于解码调度器将高置信度的标记分散在不连续位置而导致的推理速度慢问题，提出了最长稳定前缀(LSP)调度器，该调度器通过评估标记的稳定性并一次性提交连续的稳定预测，从而保持内存局部性并减少标记翻转。在LLaDA-8B和Dream-7B上的实验表明，LSP可以将推理加速至最多3.4倍，同时保持或略微提高输出质量。
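
The prefix-selection step described in this entry can be sketched as follows. The sketch assumes that "stability" is a per-token confidence compared against a fixed threshold and that snapping means retreating to the last delimiter inside the stable block; the paper's actual stability measure and snapping rule may differ, and the function name, threshold, and delimiter set are illustrative.

```python
def longest_stable_prefix(tokens, confidences, threshold=0.9,
                          delimiters=(".", ",", ";", "\n")):
    """Return how many leading tokens to commit in this denoising step."""
    # 1. Find the contiguous left-aligned run of stable predictions.
    stable = 0
    for c in confidences:
        if c < threshold:
            break
        stable += 1
    # 2. Snap the boundary back to the last delimiter inside that run.
    for i in range(stable - 1, -1, -1):
        if tokens[i] in delimiters:
            return i + 1
    # Assumption: with no delimiter available, commit the whole stable run.
    return stable


tokens = ["The", "cat", ",", "sat", "on", "mat"]
confidences = [0.99, 0.95, 0.97, 0.92, 0.40, 0.99]
print(longest_stable_prefix(tokens, confidences))  # 3
```

Note that the high-confidence final token is not committed: only a contiguous prefix is accepted, which is what turns the KV cache update into a single append.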

Distributed Partial Information Puzzles: Examining Common Ground Construction Under Epistemic Asymmetry

Authors: Yifan Zhu, Mariah Bradford, Kenneth Lai, Timothy Obiso, Videep Venkatesha, James Pustejovsky, Nikhil Krishnaswamy

First: 2026-03-05T18:22:55+00:00 · Latest: 2026-03-05T18:22:55+00:00

Comments: 10 pages, 4 figures

Abs · PDF · Code1 · Code2

Abstract

Establishing common ground, a shared set of beliefs and mutually recognized facts, is fundamental to collaboration, yet remains a challenge for current AI systems, especially in multimodal, multiparty settings, where the collaborators bring different information to the table. We introduce the Distributed Partial Information Puzzle (DPIP), a collaborative construction task that elicits rich multimodal communication under epistemic asymmetry. We present a multimodal dataset of these interactions, annotated and temporally aligned across speech, gesture, and action modalities to support reasoning over propositional content and belief dynamics. We then evaluate two paradigms for modeling common ground (CG): (1) state-of-the-art large language models (LLMs), prompted to infer shared beliefs from multimodal updates, and (2) an axiomatic pipeline grounded in Dynamic Epistemic Logic (DEL) that incrementally performs the same task. Results on the annotated DPIP data indicate that it poses a challenge to modern LLMs' abilities to track both task progression and belief state.

中文标题/摘要

标题：分布式部分信息谜题：在知识不对称下的共同知识构建研究

建立共同知识，即共享的一组信念和相互认可的事实，是协作的基础，但在当前的AI系统中仍然是一个挑战，尤其是在多模态、多参与者的场景中，参与者带来不同的信息。我们引入了分布式部分信息谜题（DPIP），这是一种在知识不对称下的协作构建任务，能够引发丰富的多模态交流。我们提供了一个多模态的交互数据集，这些数据在语音、手势和动作模态上进行了注释和时间对齐，以支持对命题内容和信念动态的推理。然后，我们评估了两种建模共同知识（CG）的范式：（1）最先进的大型语言模型（LLMs），被提示从多模态更新中推断共享的信念，以及（2）一个基于动态知识逻辑（DEL）的公理化管道，逐步执行相同的任务。对注释的DPIP数据的评估结果表明，这给现代LLMs追踪任务进展和信念状态的能力带来了挑战。

RealWonder: Real-Time Physical Action-Conditioned Video Generation

Authors: Wei Liu, Ziyu Chen, Zizhang Li, Yue Wang, Hong-Xing Yu, Jiajun Wu

First: 2026-03-05T18:22:54+00:00 · Latest: 2026-03-05T18:22:54+00:00

Comments: The first two authors contributed equally. The last two authors advised equally. Project website: https://liuwei283.github.io/RealWonder/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Current video generation models cannot simulate physical consequences of 3D actions like forces and robotic manipulations, as they lack structural understanding of how actions affect 3D scenes. We present RealWonder, the first real-time system for action-conditioned video generation from a single image. Our key insight is using physics simulation as an intermediate bridge: instead of directly encoding continuous actions, we translate them through physics simulation into visual representations (optical flow and RGB) that video models can process. RealWonder integrates three components: 3D reconstruction from single images, physics simulation, and a distilled video generator requiring only 4 diffusion steps. Our system achieves 13.2 FPS at 480x832 resolution, enabling interactive exploration of forces, robot actions, and camera controls on rigid objects, deformable bodies, fluids, and granular materials. We envision that RealWonder opens new opportunities to apply video models in immersive experiences, AR/VR, and robot learning. Our code and model weights are publicly available on our project website: https://liuwei283.github.io/RealWonder/

中文标题/摘要

标题：RealWonder：实时物理动作条件化视频生成

当前的视频生成模型无法模拟3D动作如力和机器人操作的物理后果，因为它们缺乏对动作如何影响3D场景的理解。我们提出了RealWonder，这是首个从单张图像生成动作条件化视频的实时系统。我们的核心见解是使用物理模拟作为中介桥梁：我们不是直接编码连续动作，而是通过物理模拟将它们转换为视频模型可以处理的视觉表示（光流和RGB）。RealWonder集成了三个组件：单张图像的3D重建、物理模拟和仅需4个扩散步骤的精简视频生成器。我们的系统在480x832分辨率下达到13.2 FPS，能够实现实时探索力、机器人动作和摄像机控制，应用于刚体、可变形体、流体和颗粒材料。我们设想RealWonder将为视频模型在沉浸式体验、AR/VR和机器人学习中的应用开辟新的机会。我们的代码和模型权重在项目网站上公开：https://liuwei283.github.io/RealWonder/

Summary / 总结

RealWonder is a real-time system for generating action-conditioned videos from a single image. It uses physics simulation as an intermediate step to translate continuous actions into visual representations that video models can process. The system integrates 3D reconstruction, physics simulation, and a distilled video generator requiring only 4 diffusion steps, achieving 13.2 FPS at 480x832 resolution, enabling interactive exploration of various physical actions and materials. This opens new opportunities for immersive experiences and robot learning.

RealWonder 是一个从单张图像生成条件动作视频的实时系统。它通过物理模拟将 3D 动作转化为视频模型可处理的视觉表示。该系统整合了 3D 重建、物理模拟和仅需 4 步扩散的精简视频生成器。RealWonder 达到 13.2 FPS 的帧率，在 480x832 分辨率下，能够交互探索各种材料上的物理动作。这为沉浸式体验、AR/VR 和机器人学习开辟了新机会。

Residual RL-MPC for Robust Microrobotic Cell Pushing Under Time-Varying Flow

Authors: Yanda Yang, Sambeeta Das

First: 2026-03-05T18:20:07+00:00 · Latest: 2026-03-05T18:20:07+00:00

Comments: 8 pages, 8 figures

Abs · PDF · Code1 · Code2

Abstract

Contact-rich micromanipulation in microfluidic flow is challenging because small disturbances can break pushing contact and induce large lateral drift. We study planar cell pushing with a magnetic rolling microrobot that tracks a waypoint-sampled reference curve under time-varying Poiseuille flow. We propose a hybrid controller that augments a nominal MPC with a learned residual policy trained by SAC. The policy outputs a bounded 2D velocity correction that is contact-gated, so residual actions are applied only during robot-cell contact, preserving reliable approach behavior and stabilizing learning. All methods share the same actuation interface and speed envelope for fair comparisons. Experiments show improved robustness and tracking accuracy over pure MPC and PID under nonstationary flow, with generalization from a clover training curve to unseen circle and square trajectories. A residual-bound sweep identifies an intermediate correction limit as the best trade-off, which we use in all benchmarks.

中文标题/摘要

标题：残差RL-MPC在时变流场中鲁棒微机器人细胞推送

在微流体流中进行接触丰富的微操作具有挑战性，因为小的干扰可能会破坏推送接触并引起大的横向漂移。我们研究了在时变泊肃叶流中，磁滚动微机器人跟踪由路径点采样的参考曲线的平面细胞推送。我们提出了一种混合控制器，该控制器在名义MPC的基础上增加了通过SAC训练得到的残差策略。该策略输出一个有界且受接触门控的2D速度修正，因此仅在机器人-细胞接触时应用残差动作，从而保持可靠的接近行为并稳定学习。所有方法共享相同的执行接口和速度范围，以进行公平比较。实验表明，在非稳态流中，与纯MPC和PID相比，该方法具有更好的鲁棒性和跟踪精度，并且可以从训练的四叶曲线推广到未见过的圆形和方形轨迹。残差边界扫描确定了一个中间修正限制作为最佳权衡，并在所有基准测试中使用。

Summary / 总结

The research addresses the challenge of robust microrobotic cell pushing in time-varying microfluidic flow, where small disturbances can disrupt contact and cause large lateral drift. A hybrid controller combining model predictive control (MPC) with a learned residual policy trained by soft actor-critic (SAC) is proposed. The residual policy provides a bounded 2D velocity correction that is applied only during contact, ensuring reliable approach behavior and stabilizing learning. Experiments demonstrate improved robustness and tracking accuracy compared to pure MPC and PID under nonstationary flow, with generalization from a training clover curve to unseen circle and square trajectories. An intermediate residual bound is found to be the best trade-off for performance.

研究解决了微流体流动中微操纵的挑战，其中小扰动会破坏细胞推送。提出了一种结合模型预测控制（MPC）和由软动作评论家（SAC）训练的残差策略的混合控制器，以提高跟踪准确性和鲁棒性。实验表明，在非稳态流动下，该方法优于纯MPC和PID，并成功泛化到未见过的轨迹。
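
The contact-gated, bounded residual described in this entry can be sketched in a few lines. The sketch assumes the bound is applied to the Euclidean norm of the 2D correction and that the shared speed envelope is a cap on the final speed; both choices, and all names and numbers, are assumptions rather than details from the paper.

```python
import math


def clip_norm(v, max_norm):
    """Scale a 2D vector down so its Euclidean norm is at most max_norm."""
    n = math.hypot(v[0], v[1])
    if n <= max_norm or n == 0.0:
        return v
    s = max_norm / n
    return (v[0] * s, v[1] * s)


def residual_mpc_command(u_mpc, residual, in_contact,
                         residual_bound, speed_limit):
    """Combine a nominal MPC velocity with a gated, bounded learned residual."""
    # The residual is applied only while the robot touches the cell.
    correction = clip_norm(residual, residual_bound) if in_contact else (0.0, 0.0)
    u = (u_mpc[0] + correction[0], u_mpc[1] + correction[1])
    # All controllers share the same speed envelope.
    return clip_norm(u, speed_limit)


print(residual_mpc_command((1.0, 0.0), (0.0, 3.0), False, 0.5, 2.0))  # (1.0, 0.0)
print(residual_mpc_command((1.0, 0.0), (0.0, 3.0), True, 0.5, 2.0))   # (1.0, 0.5)
```

During the approach phase the command is exactly the MPC output, which is how gating preserves the nominal controller's behavior before contact.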

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Authors: Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, Aditya Grover

First: 2026-01-26T17:56:50+00:00 · Latest: 2026-03-05T18:19:57+00:00

Comments: Code is released here: https://github.com/siyan-zhao/OPSD

Abs · PDF · Code1 · Code2 · Code3

Abstract

Knowledge distillation improves large language model (LLM) reasoning by compressing the knowledge of a teacher LLM to train smaller LLMs. On-policy distillation advances this approach by having the student sample its own trajectories while a teacher LLM provides dense token-level supervision, addressing the distribution mismatch between training and inference in off-policy distillation methods. However, on-policy distillation typically requires a separate, often larger, teacher LLM and does not explicitly leverage ground-truth solutions available in reasoning datasets. Inspired by the intuition that a sufficiently capable LLM can rationalize external privileged reasoning traces and teach its weaker self (i.e., the version without access to privileged information), we introduce On-Policy Self-Distillation (OPSD), a framework where a single model acts as both teacher and student by conditioning on different contexts. The teacher policy conditions on privileged information (e.g., verified reasoning traces) while the student policy sees only the question; training minimizes the per-token divergence between these distributions over the student's own rollouts. We demonstrate the efficacy of our method on multiple mathematical reasoning benchmarks, achieving 8-12x token efficiency compared to reinforcement learning methods such as GRPO and superior performance over off-policy distillation methods.

中文标题/摘要

标题：自我蒸馏推理器：面向大型语言模型的在线自我蒸馏

知识蒸馏通过压缩教师大型语言模型的知识来训练较小的大型语言模型，从而改善大型语言模型的推理能力。在线蒸馏通过让学生在教师大型语言模型提供密集的标记级监督的同时，自行采样其自身的轨迹，来推进这一方法，从而解决了脱机蒸馏方法中训练与推理之间的分布不匹配问题。然而，在线蒸馏通常需要一个单独的、通常更大的教师大型语言模型，并且不明确利用推理数据集中可用的真实解决方案。受足够强大的大型语言模型能够理性化外部特权推理轨迹并教导其较弱的自我（即，不访问特权信息的版本）这一直觉的启发，我们引入了在线自我蒸馏（OPSD）框架，其中单个模型同时充当教师和学生，通过不同的上下文进行条件化。教师策略通过特权信息（例如，验证的推理轨迹）进行条件化，而学生策略仅看到问题；训练通过最小化学生自身模拟过程中这些分布之间的每个标记差异来进行。我们通过多个数学推理基准展示了该方法的有效性，与GRPO等强化学习方法相比，实现了8-12倍的标记效率，并且在脱机蒸馏方法上表现更优。

Summary / 总结

The research aims to improve large language model reasoning through on-policy self-distillation, where a single model acts as both teacher and student. The method conditions the teacher on privileged information and the student on the question alone, and training minimizes the per-token divergence between the two distributions over the student's own rollouts. Experiments on mathematical reasoning benchmarks show an 8-12x improvement in token efficiency over reinforcement learning methods such as GRPO and better performance than off-policy distillation methods.

研究旨在通过让单个模型同时充当教师和学生的自蒸馏方法提高大型语言模型的推理能力。该方法让教师基于特权信息进行条件化，而学生仅看到问题，训练时在学生自我采样的轨迹上最小化两种分布之间的逐token散度。实验表明，OPSD相比GRPO等强化学习方法在数学推理基准测试中实现了8-12倍的token效率，并且优于离策略（off-policy）蒸馏方法。
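
The training signal described in this entry is a per-token divergence between two next-token distributions from the same model under different contexts. The sketch below averages KL(teacher || student) over the positions of one rollout; the abstract does not state which divergence or direction is used, so that choice, along with the function name and toy distributions, is an assumption.

```python
import math


def per_token_kl(teacher_dists, student_dists, eps=1e-12):
    """Mean KL(teacher || student) over the token positions of one rollout.

    Each argument is a list with one probability vector per position.
    """
    if len(teacher_dists) != len(student_dists) or not teacher_dists:
        raise ValueError("need the same non-zero number of positions")
    total = 0.0
    for p, q in zip(teacher_dists, student_dists):
        # eps guards against log(0) when the student assigns zero mass.
        total += sum(pi * math.log((pi + eps) / (qi + eps))
                     for pi, qi in zip(p, q) if pi > 0.0)
    return total / len(teacher_dists)


# Toy two-token vocabulary, one position.
teacher = [[0.5, 0.5]]   # conditioned on question plus privileged trace
student = [[0.25, 0.75]]  # conditioned on the question only
print(per_token_kl(teacher, student))
```

In the on-policy setting the positions come from a rollout sampled by the student, so the loss is evaluated on the states the student actually visits at inference time.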

NaiLIA: Multimodal Nail Design Retrieval Based on Dense Intent Descriptions and Palette Queries

Authors: Kanon Amemiya, Daichi Yashima, Kei Katsumata, Takumi Komatsu, Ryosuke Korekata, Seitaro Otsuki, Komei Sugiura

Venue: CVPR 2026

First: 2026-03-05T18:12:29+00:00 · Latest: 2026-03-05T18:12:29+00:00

Comments: Accepted to CVPR 2026 Findings

Abs · PDF · Code1 · Code2

Abstract

We focus on the task of retrieving nail design images based on dense intent descriptions, which represent multi-layered user intent for nail designs. This is challenging because such descriptions specify unconstrained painted elements and pre-manufactured embellishments as well as visual characteristics, themes, and overall impressions. In addition to these descriptions, we assume that users provide palette queries by specifying zero or more colors via a color picker, enabling the expression of subtle and continuous color nuances. Existing vision-language foundation models often struggle to incorporate such descriptions and palettes. To address this, we propose NaiLIA, a multimodal retrieval method for nail design images, which comprehensively aligns with dense intent descriptions and palette queries during retrieval. Our approach introduces a relaxed loss based on confidence scores for unlabeled images that can align with the descriptions. To evaluate NaiLIA, we constructed a benchmark consisting of 10,625 images collected from people with diverse cultural backgrounds. The images were annotated with long and dense intent descriptions given by over 200 annotators. Experimental results demonstrate that NaiLIA outperforms standard methods.

中文标题/摘要

标题：NaiLIA：基于密集意图描述和调色板查询的多模态指甲设计检索

我们专注于基于密集意图描述检索指甲设计图像的任务，这些描述代表了用户对指甲设计的多层意图。这具有挑战性，因为这些描述指定了未受约束的绘画元素和预先制造的装饰品，以及视觉特征、主题和整体印象。除了这些描述之外，我们假设用户通过颜色拾取器指定零个或多个颜色来提供调色板查询，这使得微妙和连续的颜色细微差别得以表达。现有的视觉-语言基础模型往往难以结合这些描述和调色板。为了解决这个问题，我们提出了NaiLIA，一种针对指甲设计图像的多模态检索方法，在检索过程中全面对齐密集意图描述和调色板查询。我们的方法引入了一种基于未标注图像置信分数的宽松损失，可以与描述对齐。为了评估NaiLIA，我们构建了一个基准，包含来自不同文化背景的10,625张图像。这些图像由超过200名注释者标注了长且密集的意图描述。实验结果表明，NaiLIA优于标准方法。

Summary / 总结

The research aims to develop a method for retrieving nail design images based on detailed user intent descriptions and color palette queries. The proposed NaiLIA method uses a relaxed loss based on confidence scores to align with dense descriptions and palette queries. Experiments show that NaiLIA outperforms standard methods in this task.

研究旨在开发一种基于详细用户意图描述和颜色调色板查询的指甲设计图像检索方法。提出的NaiLIA方法使用基于置信分数的松弛损失来与密集描述和调色板查询对齐。实验表明，NaiLIA在该任务中优于标准方法。

Agentic Very Long Video Understanding

Authors: Aniket Rege, Arka Sadhu, Yuliang Li, Kejie Li, Ramya Korlakai Vinayak, Yuning Chai, Yong Jae Lee, Hyo Jin Kim

First: 2026-01-26T05:20:47+00:00 · Latest: 2026-03-05T18:12:22+00:00

Comments: 27 pages, 7 figures, 8 tables

Abs · PDF · Code1 · Code2 · Code3

Abstract

The advent of always-on personal AI assistants, enabled by all-day wearable devices such as smart glasses, demands a new level of contextual understanding, one that goes beyond short, isolated events to encompass the continuous, longitudinal stream of egocentric video. Achieving this vision requires advances in long-horizon video understanding, where systems must interpret and recall visual and audio information spanning days or even weeks. Existing methods, including large language models and retrieval-augmented generation, are constrained by limited context windows and lack the ability to perform compositional, multi-hop reasoning over very long video streams. In this work, we address these challenges through EGAgent, an enhanced agentic framework centered on entity scene graphs, which represent people, places, objects, and their relationships over time. Our system equips a planning agent with tools for structured search and reasoning over these graphs, as well as hybrid visual and audio search capabilities, enabling detailed, cross-modal, and temporally coherent reasoning. Experiments on the EgoLifeQA and Video-MME (Long) datasets show that our method achieves state-of-the-art performance on EgoLifeQA (57.5%) and competitive performance on Video-MME (Long) (74.1%) for complex longitudinal video understanding tasks. Code is available at https://github.com/facebookresearch/egagent.

中文标题/摘要

标题：代理型非常长视频理解

随着全天候可穿戴设备如智能眼镜所支持的始终在线个人AI助手的出现，需要一种新的上下文理解水平，这种理解超越了短暂孤立的事件，涵盖了持续的、纵向的自我中心视频流。实现这一愿景需要在长时视频理解方面取得进展，其中系统必须解释和回忆跨越数天甚至数周的视觉和音频信息。现有的方法，包括大型语言模型和检索增强生成，受限于有限的上下文窗口，并且缺乏在非常长的视频流上进行组合式、多跳推理的能力。在这项工作中，我们通过EGAgent解决这些挑战，这是一种增强的代理框架，以实体场景图为中心，这些图表示随着时间推移的人、地点、物体及其关系。我们的系统为规划代理配备了在这些图上进行结构化搜索和推理的工具，以及混合视觉和音频搜索能力，使跨模态和时间上一致的推理成为可能。在EgoLifeQA和Video-MME（长）数据集上的实验表明，我们的方法在EgoLifeQA上达到了最先进的性能（57.5%），在Video-MME（长）上达到了竞争性性能（74.1%），用于复杂的纵向视频理解任务。代码可在https://github.com/facebookresearch/egagent/获取。

Summary / 总结

This work addresses the need for long-horizon video understanding in the context of always-on personal AI assistants, proposing EGAgent, an enhanced agentic framework that uses entity scene graphs and hybrid visual and audio search capabilities. The system enables detailed, cross-modal, and temporally coherent reasoning over extended video streams. Experiments show that EGAgent outperforms existing methods on EgoLifeQA with 57.5% accuracy and achieves competitive performance on Video-MME (Long) with 74.1% accuracy.

本文针对始终在线的个人AI助手对长时段视频理解的需求，提出了EGAgent，一种增强的代理框架，使用实体场景图来表示和推理时间跨度内的人物、地点和物体。该系统包括结构化搜索和推理的工具，以及混合视觉和音频搜索能力。实验结果表明，在复杂的长时段视频理解任务上，EGAgent在EgoLifeQA上达到最先进的性能（57.5%），在Video-MME（长）上达到具有竞争力的性能（74.1%）。
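
An entity scene graph over time can be pictured as a set of time-stamped relations that a planning agent queries by entity and time window. The following is a minimal sketch of such a structure; the class names, fields, and example data are hypothetical and are not taken from the EGAgent code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    subject: str
    relation: str
    obj: str
    start: float  # seconds since the start of the recording
    end: float


class EntitySceneGraph:
    """Hypothetical store of time-stamped (subject, relation, object) edges."""

    def __init__(self):
        self.edges = []

    def add(self, subject, relation, obj, start, end):
        self.edges.append(Edge(subject, relation, obj, start, end))

    def query(self, entity, t0, t1):
        """Edges that mention `entity` and overlap the window [t0, t1]."""
        return [e for e in self.edges
                if entity in (e.subject, e.obj)
                and e.start <= t1 and e.end >= t0]


graph = EntitySceneGraph()
graph.add("Alice", "holds", "mug", 10.0, 25.0)
graph.add("mug", "on", "desk", 30.0, 400.0)
graph.add("Bob", "enters", "kitchen", 50.0, 55.0)
print([e.relation for e in graph.query("mug", 0.0, 60.0)])  # ['holds', 'on']
```

Chaining such queries, for example finding where an object went after a person put it down, is the kind of multi-hop reasoning a fixed context window struggles with.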

Latent Wasserstein Adversarial Imitation Learning

Authors: Siqi Yang, Kai Yan, Alexander G. Schwing, Yu-Xiong Wang

Venue: ICLR 2026

First: 2026-03-05T18:01:49+00:00 · Latest: 2026-03-05T18:01:49+00:00

Comments: 10 pages, accepted to ICLR 2026

Abs · PDF · Code1 · Code2

Abstract

Imitation Learning (IL) enables agents to mimic expert behavior by learning from demonstrations. However, traditional IL methods require large amounts of medium-to-high-quality demonstrations as well as actions of expert demonstrations, both of which are often unavailable. To reduce this need, we propose Latent Wasserstein Adversarial Imitation Learning (LWAIL), a novel adversarial imitation learning framework that focuses on state-only distribution matching. It benefits from the Wasserstein distance computed in a dynamics-aware latent space. This dynamics-aware latent space differs from prior work and is obtained via a pre-training stage, where we train the Intention Conditioned Value Function (ICVF) to capture a dynamics-aware structure of the state space using a small set of randomly generated state-only data. We show that this enhances the policy's understanding of state transitions, enabling the learning process to use only one or a few state-only expert episodes to achieve expert-level performance. Through experiments on multiple MuJoCo environments, we demonstrate that our method outperforms prior Wasserstein-based IL methods and prior adversarial IL methods, achieving better results across various tasks.

中文标题/摘要

标题：潜隐 Wasserstein 对抗模仿学习

模仿学习（IL）使代理能够通过学习演示来模仿专家行为。然而，传统的IL方法需要大量中等到高质量的演示以及专家演示的动作，这两种情况通常都不易获得。为了减少这种需求，我们提出了潜隐 Wasserstein 对抗模仿学习（LWAIL），这是一种新颖的对抗模仿学习框架，专注于状态分布匹配。它得益于在动态感知潜隐空间中计算的 Wasserstein 距离。这种动态感知潜隐空间不同于先前的工作，并通过预训练阶段获得，其中我们使用少量随机生成的状态数据训练意图条件价值函数（ICVF），以捕捉状态空间的动态感知结构。我们表明，这增强了策略对状态转换的理解，使学习过程能够仅使用一个或几个状态演示来达到专家水平的表现。通过在多个 MuJoCo 环境中的实验，我们证明了我们的方法优于先前的基于 Wasserstein 的 IL 方法和先前的对抗 IL 方法，在各种任务上取得了更好的结果。

Summary / 总结

The research aims to address the challenge of traditional Imitation Learning (IL) requiring large amounts of expert demonstrations and actions. The proposed Latent Wasserstein Adversarial Imitation Learning (LWAIL) framework focuses on state-only distribution matching in a dynamics-aware latent space, which is pre-trained using a small set of randomly generated state-only data. Experiments on MuJoCo environments show that LWAIL outperforms previous Wasserstein-based and adversarial IL methods, achieving expert-level performance with fewer expert demonstrations.

论文提出了Latent Wasserstein Adversarial Imitation Learning (LWAIL)，以解决传统模仿学习需要大量专家演示数据的问题。LWAIL 在一个动态感知的潜在空间中进行状态分布匹配，并通过少量随机生成的状态数据进行预训练。这种方法使得策略可以从少量专家演示中学习，实现多个MuJoCo环境中的专家级表现。实验表明，LWAIL 在性能上优于基于Wasserstein的距离方法和对抗模仿学习方法。

OSPO: Object-Centric Self-Improving Preference Optimization for Text-to-Image Generation

Authors: Yoonjin Oh, Yongjin Kim, Hyomin Kim, Donghwan Chi, Sungwoong Kim

First: 2025-05-28T03:45:42+00:00 · Latest: 2026-03-05T18:01:26+00:00

Comments: 11 pages, 6 figures

Abs · PDF · Code1 · Code2

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have enabled unified multimodal understanding and generation. However, they still struggle with fine-grained text-image alignment, often failing to faithfully depict objects with correct attributes such as color, shape, and spatial relations. To mitigate this issue, previous studies have explored preference optimization methods such as DPO and GRPO, but these approaches incur substantial computational cost, both in constructing preference data and in performing optimization. This has motivated self-improving preference optimization approaches, in which the MLLM autonomously generates its own training data, self-estimates preference feedback, and self-optimizes using the resulting self-constructed preference pairs. However, existing self-improving methods still overlook fine-grained, object-level semantics, allowing object hallucination to persist. To tackle this problem, we propose Object-centric Self-improving Preference Optimization (OSPO), a self-improving framework designed to enhance object-level text-image alignment. OSPO explicitly constructs object-centric preference data without relying on any external data or external models. We also introduce a new approach that leverages attention-based object masks together with an object-weighted SimPO loss to enhance object-specific fidelity. Extensive experiments on three compositional image generation benchmarks demonstrate that OSPO significantly improves fine-grained alignment and reduces object hallucination, outperforming prior self-improving methods and even specialized diffusion-based text-to-image models.

中文标题/摘要

标题：OSPO：面向对象的自我改进偏好优化以实现文本到图像生成

近期多模态大型语言模型（MLLMs）的发展使统一的多模态理解和生成成为可能。然而，它们仍然难以实现精细的文本-图像对齐，经常无法准确描绘具有正确属性（如颜色、形状和空间关系）的对象。为解决这一问题，先前的研究探索了偏好优化方法，如DPO和GRPO，但这些方法在构建偏好数据和执行优化方面会带来巨大的计算成本。这促使了自我改进的偏好优化方法的发展，在这些方法中，MLLM自主生成自己的训练数据，自我估计偏好反馈，并使用由此产生的自我构建的偏好对进行自我优化。然而，现有的自我改进方法仍然忽略了精细的、对象级别的语义，允许对象幻觉持续存在。为解决这一问题，我们提出了面向对象的自我改进偏好优化（OSPO），这是一种旨在增强对象级别文本-图像对齐的自我改进框架。OSPO明确构建了面向对象的偏好数据，不依赖于任何外部数据和外部模型。我们还引入了一种新的方法，利用基于注意力的对象掩码与对象加权SimPO损失相结合，以增强对象特定的保真度。在三个组合图像生成基准上的广泛实验表明，OSPO显著提高了精细对齐并减少了对象幻觉，优于先前的自我改进方法，甚至优于专门的基于扩散的文本到图像模型。

Summary / 总结

The research aims to improve fine-grained text-image alignment and reduce object hallucination in text-to-image generation. It proposes OSPO, a self-improving preference optimization framework that explicitly constructs object-centric preference data and combines attention-based object masks with an object-weighted SimPO loss to enhance object-specific fidelity. Experiments on three compositional image generation benchmarks show that OSPO outperforms previous self-improving methods, and even specialized diffusion-based text-to-image models, in fine-grained alignment and reduction of object hallucination.

研究通过提出OSPO，一种自我改进的偏好优化框架，解决了多模态生成中的细粒度图文对齐问题。OSPO自主生成训练数据并优化偏好，不依赖外部数据或模型，专注于对象级语义。实验表明，OSPO在细粒度对齐和减少对象幻觉方面优于之前的自我改进方法和专门的文本到图像模型。
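
As a rough illustration of the object-weighted SimPO loss mentioned above, the sketch below starts from the standard SimPO objective (length-normalized log-probabilities, a scaling factor `beta`, and a target margin `gamma`) and gives extra weight to tokens flagged by an object mask. The weighting scheme, the `object_weight` parameter, and all function names are assumptions made for illustration; the abstract does not say how OSPO actually combines the masks with the loss.

```python
import math


def weighted_avg_logprob(token_logps, object_mask, object_weight=2.0):
    """Average token log-probability, with object tokens counted more heavily.

    token_logps: per-token log-probabilities of one generated sample.
    object_mask: 1 where a token belongs to an object region, else 0.
    object_weight: hypothetical up-weighting factor for object tokens.
    """
    weights = [object_weight if m else 1.0 for m in object_mask]
    total = sum(w * lp for w, lp in zip(weights, token_logps))
    return total / sum(weights)


def object_weighted_simpo_loss(chosen_logps, chosen_mask,
                               rejected_logps, rejected_mask,
                               beta=2.0, gamma=0.5, object_weight=2.0):
    """SimPO-style loss: -log(sigmoid(beta * (r_chosen - r_rejected) - gamma))."""
    r_chosen = weighted_avg_logprob(chosen_logps, chosen_mask, object_weight)
    r_rejected = weighted_avg_logprob(rejected_logps, rejected_mask, object_weight)
    margin = beta * (r_chosen - r_rejected) - gamma
    # Numerically stable softplus(-margin)
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

With `object_weight=1.0` the function reduces to the plain SimPO loss, which makes it easy to compare the weighted and unweighted variants on the same preference pair.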

Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model

Authors: Dongwon Kim, Gawon Seo, Jinsung Lee, Minsu Cho, Suha Kwak

Venue: CVPR 2026

First: 2026-03-05T18:00:02+00:00 · Latest: 2026-03-05T18:00:02+00:00

Comments: CVPR 2026

Abstract

World models provide a powerful framework for simulating environment dynamics conditioned on actions or instructions, enabling downstream tasks such as action planning or policy learning. Recent approaches leverage world models as learned simulators, but their application to decision-time planning remains computationally prohibitive for real-time control. A key bottleneck lies in latent representations: conventional tokenizers encode each observation into hundreds of tokens, making planning both slow and resource-intensive. To address this, we propose CompACT, a discrete tokenizer that compresses each observation into as few as 8 tokens, drastically reducing computational cost while preserving essential information for planning. An action-conditioned world model that employs the CompACT tokenizer achieves competitive planning performance with orders-of-magnitude faster planning, offering a practical step toward real-world deployment of world models.

中文标题/摘要

标题：在8个标记中规划：一种紧凑的离散标记器用于潜在世界模型

世界模型提供了一种强大的框架，用于根据动作或指令模拟环境动力学，从而支持诸如动作规划或策略学习之类的下游任务。最近的方法将世界模型用作学习得到的模拟器，但将其应用于决策时规划时，计算开销对于实时控制而言仍然过高。一个关键瓶颈在于潜在表示：传统的标记器将每个观察值编码为数百个标记，这使得规划既缓慢又资源密集。为了解决这个问题，我们提出了一种名为CompACT的离散标记器，它可以将每个观察值压缩为少至8个标记，大幅降低计算成本，同时保留规划所需的关键信息。采用CompACT标记器的动作条件世界模型取得了有竞争力的规划性能，同时规划速度提高了几个数量级，为世界模型在现实世界中的部署迈出了切实可行的一步。

Summary / 总结

The paper addresses the computational cost of using world models for decision-time planning in real-time control by proposing CompACT, a compact discrete tokenizer that compresses each observation into as few as 8 tokens instead of the hundreds produced by conventional tokenizers. An action-conditioned world model built on CompACT achieves competitive planning performance while planning orders of magnitude faster, a practical step toward real-world deployment.

研究旨在解决传统标记器带来的计算瓶颈，提高使用世界模型进行决策时规划的效率。CompACT是一种离散标记器，可将每个观察压缩为少至8个标记，大幅降低计算成本，同时保留规划所需的关键信息。基于该标记器的动作条件世界模型在保持有竞争力的规划性能的同时，将规划速度提升了数个数量级，为世界模型的实际部署铺平了道路。

SAIL: Similarity-Aware Guidance and Inter-Caption Augmentation-based Learning for Weakly-Supervised Dense Video Captioning

Authors: Ye-Chan Kim, SeungJu Cha, Si-Woo Kim, Minju Jeon, Hyungee Kim, Dong-Jin Kim

Venue: CVPR 2026

First: 2026-03-05T17:59:58+00:00 · Latest: 2026-03-05T17:59:58+00:00

Comments: Accepted to CVPR 2026

Abstract

Weakly-Supervised Dense Video Captioning aims to localize and describe events in videos using only caption annotations for training, without temporal boundaries. Prior work introduced an implicit supervision paradigm based on Gaussian masking and complementary captioning. However, the existing method focuses merely on generating non-overlapping masks without considering their semantic relationship to corresponding events, resulting in simplistic, uniformly distributed masks that fail to capture semantically meaningful regions. Moreover, relying solely on ground-truth captions leads to sub-optimal performance due to the inherent sparsity of existing datasets. In this work, we propose SAIL, which constructs semantically-aware masks through cross-modal alignment. Our similarity-aware training objective guides masks to emphasize video regions with high similarity to their corresponding event captions. Furthermore, to guide more accurate mask generation under sparse annotation settings, we introduce an LLM-based augmentation strategy that generates synthetic captions to provide additional alignment signals. These synthetic captions are incorporated through an inter-mask mechanism, providing auxiliary guidance for precise temporal localization without degrading the main objective. Experiments on ActivityNet Captions and YouCook2 demonstrate state-of-the-art performance on both captioning and localization metrics.

中文标题/摘要

标题：SAIL：基于相似性感知引导与跨字幕增强学习的弱监督密集视频描述

弱监督密集视频描述旨在仅基于字幕注解训练时，定位并描述视频中的事件，而无需时间边界。先前的工作引入了一种基于高斯掩码和互补字幕的隐式监督范式。然而，现有方法仅关注生成不重叠的掩码，而未考虑其与相应事件的语义关系，导致生成简单且均匀分布的掩码，无法捕捉到语义上有意义的区域。此外，仅依赖真实字幕会导致性能不佳，因为现有数据集的稀疏性。在本工作中，我们提出了SAIL，通过跨模态对齐构建语义感知的掩码。我们的相似性感知训练目标引导掩码强调与相应事件字幕高度相似的视频区域。此外，为了在稀疏注解设置下引导更准确的掩码生成，我们引入了一种基于LLM的增强策略，生成合成字幕以提供额外的对齐信号。这些合成字幕通过跨掩码机制整合，为精确的时间定位提供辅助指导，而不损害主要目标。在ActivityNet Captions和YouCook2上的实验表明，SAIL在描述和定位指标上均达到了最先进的性能。

Summary / 总结

The research aims to improve weakly-supervised dense video captioning by addressing the limitations of existing methods that generate simplistic, uniformly distributed masks without considering semantic relationships. SAIL proposes a similarity-aware training objective to emphasize semantically relevant video regions and introduces an LLM-based augmentation strategy to generate synthetic captions, which are used to provide additional alignment signals through an inter-mask mechanism. Experiments show SAIL outperforms existing methods on both captioning and localization metrics on ActivityNet Captions and YouCook2 datasets.

论文通过提出SAIL方法解决了弱监督密集视频字幕生成的挑战，该方法通过跨模态对齐构建语义感知的掩码。SAIL使用相似性感知的训练目标来强调与事件字幕语义相似的视频区域。此外，它引入了基于LLM的增强策略来生成合成字幕，这些合成字幕通过跨掩码机制提供辅助指导，以实现精确的时间定位。实验结果表明，SAIL在ActivityNet Captions和YouCook2数据集上的字幕生成和定位指标上优于现有方法。

RA-QA: A Benchmarking System for Respiratory Audio Question Answering Under Real-World Heterogeneity

Authors: Gaia A. Bertolino, Yuwei Zhang, Tong Xia, Domenico Talia, Cecilia Mascolo

First: 2026-02-04T13:25:47+00:00 · Latest: 2026-03-05T17:54:01+00:00

Abstract

As conversational multimodal AI tools are increasingly adopted to process patient data for health assessment, robust benchmarks are needed to measure progress and expose failure modes under realistic conditions. Despite the importance of respiratory audio for mobile health screening, respiratory audio question answering remains underexplored, with existing studies evaluated narrowly and lacking real-world heterogeneity across modalities, devices, and question types. We hence introduce the Respiratory-Audio Question-Answering (RA-QA) benchmark, including a standardized data generation pipeline, a comprehensive multimodal QA collection, and a unified evaluation protocol. RA-QA harmonizes public RA datasets into a collection of 9 million format-diverse QA pairs covering diagnostic and contextual attributes. We benchmark classical ML baselines alongside multimodal audio-language models, establishing reproducible reference points and showing how current approaches fail under heterogeneity.

中文标题/摘要

标题：RA-QA：在现实世界异质性条件下呼吸音频问答基准系统

随着对话型多模态AI工具被越来越多地用于处理患者数据以进行健康评估，需要稳健的基准来衡量在现实条件下的进步并揭示失败模式。尽管呼吸音频对于移动健康筛查至关重要，但呼吸音频问答仍被研究不足，现有研究评估狭窄且缺乏跨模态、设备和问题类型的现实世界异质性。因此，我们引入了呼吸音频问答（RA-QA）基准，包括标准化的数据生成管道、全面的多模态问答集合以及统一的评估协议。RA-QA将公共呼吸音频数据集整合为包含900万对格式多样的问答对，涵盖诊断和上下文属性。我们基准测试了经典机器学习基线和多模态音频-语言模型，建立了可重复的参考点，并展示了当前方法在异质性条件下失效的情况。

Summary / 总结

The RA-QA benchmark was created to evaluate the performance of respiratory audio question answering systems under real-world conditions. It includes a standardized data generation pipeline, a comprehensive multimodal QA collection, and a unified evaluation protocol. The benchmark covers 9 million format-diverse QA pairs and demonstrates that current approaches struggle with heterogeneity.

研究旨在建立一个基准，以评估真实世界条件下呼吸音频问答系统的性能。该基准包括标准化的数据生成管道、包含900万对问答的全面多模态问答集合以及统一的评估协议。主要发现表明，当前的方法在异质性条件下表现不佳。

RelaxFlow: Text-Driven Amodal 3D Generation

Authors: Jiayin Zhu, Guoji Fu, Xiaolu Liu, Qiyuan He, Yicong Li, Angela Yao

First: 2026-03-05T17:45:47+00:00 · Latest: 2026-03-05T17:45:47+00:00

Comments: Code: https://github.com/viridityzhu/RelaxFlow

Abstract

Image-to-3D generation faces inherent semantic ambiguity under occlusion, where partial observation alone is often insufficient to determine object category. In this work, we formalize text-driven amodal 3D generation, where text prompts steer the completion of unseen regions while strictly preserving input observation. Crucially, we identify that these objectives demand distinct control granularities: rigid control for the observation versus relaxed structural control for the prompt. To this end, we propose RelaxFlow, a training-free dual-branch framework that decouples control granularity via a Multi-Prior Consensus Module and a Relaxation Mechanism. Theoretically, we prove that our relaxation is equivalent to applying a low-pass filter on the generative vector field, which suppresses high-frequency instance details to isolate geometric structure that accommodates the observation. To facilitate evaluation, we introduce two diagnostic benchmarks, ExtremeOcc-3D and AmbiSem-3D. Extensive experiments demonstrate that RelaxFlow successfully steers the generation of unseen regions to match the prompt intent without compromising visual fidelity.

中文标题/摘要

标题：RelaxFlow：文本驱动的非模态（amodal）3D生成

从图像到3D的生成在遮挡下面临固有的语义模糊性，仅凭部分观察往往不足以确定物体类别。在本文中，我们形式化了文本驱动的非模态3D生成任务，其中文本提示引导未见区域的补全，同时严格保留输入观察。关键的是，我们发现这些目标需要不同的控制粒度：对观察进行刚性控制，而对提示进行放松的结构控制。为此，我们提出了RelaxFlow，这是一种无需训练的双分支框架，通过多先验共识模块（Multi-Prior Consensus Module）和放松机制（Relaxation Mechanism）解耦控制粒度。理论上，我们证明了我们的放松等价于对生成向量场施加低通滤波，它抑制高频的实例细节，从而分离出与观察相容的几何结构。为了便于评估，我们引入了两个诊断基准，ExtremeOcc-3D和AmbiSem-3D。大量实验表明，RelaxFlow能够成功引导未见区域的生成以匹配提示意图，而不牺牲视觉保真度。

Summary / 总结

The research addresses the challenge of generating 3D models from partial observations, using text prompts to guide the unseen parts while preserving the observed parts. It introduces RelaxFlow, a training-free framework that uses a Multi-Prior Consensus Module and a Relaxation Mechanism to achieve this. Theoretical analysis shows that the relaxation process filters out high-frequency details, focusing on geometric structure. Experiments show that RelaxFlow can generate unseen regions that align with the text prompt without losing visual quality.

研究解决了从部分观察生成3D模型的挑战，通过文本提示引导未观察区域的完成，同时保留已观察的部分。提出的RelaxFlow框架采用双分支方法，结合多先验一致性模块和放松机制实现这一目标。理论分析表明，放松过程过滤掉高频细节，专注于几何结构。实验表明，RelaxFlow可以生成与文本提示相匹配的未观察区域，同时保持视觉质量。

An interpretable prototype parts-based neural network for medical tabular data

Authors: Jacek Karolczak, Jerzy Stefanowski

First: 2026-03-05T17:43:32+00:00 · Latest: 2026-03-05T17:43:32+00:00

Comments: Proc. of EXPLIMED at ECAI 2025

Abstract

The ability to interpret machine learning model decisions is critical in such domains as healthcare, where trust in model predictions is as important as their accuracy. Inspired by the development of prototype parts-based deep neural networks in computer vision, we propose a new model for tabular data, specifically tailored to medical records, that requires discretization of diagnostic result norms. Unlike the original vision models that rely on the spatial structure, our method employs trainable patching over features describing a patient, to learn meaningful prototypical parts from structured data. These parts are represented as binary or discretized feature subsets. This allows the model to express prototypes in human-readable terms, enabling alignment with clinical language and case-based reasoning. Our proposed neural network is inherently interpretable and offers interpretable concept-based predictions by comparing the patient's description to learned prototypes in the latent space of the network. In experiments, we demonstrate that the model achieves classification performance competitive to widely used baseline models on medical benchmark datasets, while also offering transparency, bridging the gap between predictive performance and interpretability in clinical decision support.

中文标题/摘要

标题：一种用于医疗表格数据的可解释原型部分神经网络

在医疗等领域，能够解释机器学习模型决策的能力至关重要，因为模型预测的信任度与其准确性一样重要。受计算机视觉中原型部分基于深度神经网络发展的启发，我们提出了一种针对医疗记录的新型模型，特别适用于表格数据，需要对诊断结果标准进行离散化。与原始的视觉模型依赖于空间结构不同，我们的方法通过在描述患者的特征上进行可训练的分割，来从结构化数据中学习有意义的原型部分。这些部分表示为二元或离散特征子集。这使得模型能够用人类可读的术语表达原型，从而与临床语言和案例推理对齐。我们提出的神经网络本质上是可解释的，并通过将患者的描述与网络潜在空间中学习到的原型进行比较，提供可解释的概念预测。在实验中，我们证明该模型在医疗基准数据集上的分类性能与广泛使用的基线模型相当，同时提供了透明度，弥合了预测性能与临床决策支持中的可解释性之间的差距。

Summary / 总结

The paper proposes an interpretable prototype parts-based neural network for medical tabular data, inspired by computer vision models. It discretizes diagnostic norms and uses trainable patching over patient features to learn meaningful prototypical parts. The model achieves competitive classification performance on medical datasets while providing transparent, human-readable predictions, thus bridging the gap between predictive accuracy and interpretability in clinical decision support.

论文提出了一种用于医疗表格数据的可解释原型部分神经网络，灵感来源于计算机视觉模型。它对诊断标准进行了离散化，并从患者特征中学习有意义的原型部分，表示为二元或离散特征子集。该模型在医疗基准数据集上实现了与广泛使用的基线模型相当的分类性能，同时通过与网络潜空间中学习到的原型进行比较，提供了透明且可解释的预测。

The Spatial and Temporal Resolution of Motor Intention in Multi-Target Prediction

Authors: Marie Dominique Schmidt, Ioannis Iossifidis

First: 2026-03-05T17:40:30+00:00 · Latest: 2026-03-05T17:40:30+00:00

Abstract

Reaching for, grasping, and manipulating objects are essential motor functions in everyday life. Decoding human motor intentions is a central challenge for rehabilitation and assistive technologies. This study focuses on predicting intentions by inferring movement direction and target location from multichannel electromyography (EMG) signals, and investigating how spatially and temporally accurate such information can be detected relative to movement onset. We present a computational pipeline that combines data-driven temporal segmentation with classical and deep learning classifiers in order to analyse EMG data recorded during the planning, early execution, and target contact phases of a delayed reaching task. Early intention prediction enables devices to anticipate user actions, improving responsiveness and supporting active motor recovery in adaptive rehabilitation systems. Random Forest achieves $80\%$ accuracy and Convolutional Neural Network $75\%$ accuracy across $25$ spatial targets, each separated by $14^\circ$ azimuth/altitude. Furthermore, a systematic evaluation of EMG channels, feature sets, and temporal windows demonstrates that motor intention can be efficiently decoded even with drastically reduced data. This work sheds light on the temporal and spatial evolution of motor intention, paving the way for anticipatory control in adaptive rehabilitation systems and driving advancements in computational approaches to motor neuroscience.

中文标题/摘要

标题：多目标预测中运动意图的空间和时间分辨率

伸手够取、抓握和操作物体是日常生活中必不可少的运动功能。解码人类的运动意图是康复和辅助技术中的一个核心挑战。本研究专注于通过从多通道肌电图(EMG)信号中推断运动方向和目标位置来预测意图，并探讨相对于运动起始时刻，此类信息在空间和时间上能够被检测到的精度。我们提出了一种计算管道，结合数据驱动的时间分割与经典和深度学习分类器，以分析延迟够取任务中计划、早期执行和目标接触阶段记录的EMG数据。早期意图预测使设备能够预判用户动作，提高响应性并支持自适应康复系统的主动运动恢复。随机森林在25个空间目标上实现了80%的准确率，卷积神经网络则为75%，每个目标间隔14°方位/高度。此外，对EMG通道、特征集和时间窗口的系统评估表明，即使数据大幅减少，运动意图也可以高效地被解码。本研究揭示了运动意图的时间和空间演变，为自适应康复系统的预判控制铺平了道路，并推动了运动神经科学计算方法的发展。

Summary / 总结

This study aims to predict human motor intentions by analyzing multichannel EMG signals during a delayed reaching task, focusing on the spatial and temporal accuracy relative to movement onset. The authors use a computational pipeline combining temporal segmentation with classical and deep learning classifiers, achieving 80% accuracy with Random Forest and 75% with Convolutional Neural Network across 25 spatial targets. The research highlights the efficient decoding of motor intention even with reduced data, contributing to anticipatory control in adaptive rehabilitation systems.

本研究旨在通过分析延迟够取任务中规划、早期执行和目标接触阶段的多通道EMG信号来预测人类的运动意图。计算管道结合了时间分割与经典和深度学习分类器，使用随机森林实现80%的准确率，使用卷积神经网络实现75%的准确率，针对25个间隔为14°方位/高度的目标。研究强调即使在数据减少的情况下也能高效解码运动意图，为自适应康复系统的预判控制做出了贡献。

Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance

Authors: Yiqi Lin, Guoqiang Liang, Ziyun Zeng, Zechen Bai, Yanzhe Chen, Mike Zheng Shou

First: 2026-03-02T18:46:28+00:00 · Latest: 2026-03-05T17:36:07+00:00

Comments: Project page: https://showlab.github.io/Kiwi-Edit/; Huggingface Demo: https://huggingface.co/spaces/linyq/KiwiEdit

Abstract

Instruction-based video editing has witnessed rapid progress, yet current methods often struggle with precise visual control, as natural language is inherently limited in describing complex visual nuances. Although reference-guided editing offers a robust solution, its potential is currently bottlenecked by the scarcity of high-quality paired training data. To bridge this gap, we introduce a scalable data generation pipeline that transforms existing video editing pairs into high-fidelity training quadruplets, leveraging image generative models to create synthesized reference scaffolds. Using this pipeline, we construct RefVIE, a large-scale dataset tailored for instruction-reference-following tasks, and establish RefVIE-Bench for comprehensive evaluation. Furthermore, we propose a unified editing architecture, Kiwi-Edit, that synergizes learnable queries and latent visual features for reference semantic guidance. Our model achieves significant gains in instruction following and reference fidelity via a progressive multi-stage training curriculum. Extensive experiments demonstrate that our data and architecture establish a new state-of-the-art in controllable video editing. All datasets, models, and code are released at https://github.com/showlab/Kiwi-Edit.

中文标题/摘要

标题：Kiwi-Edit：基于指令和参考指导的多功能视频编辑

基于指令的视频编辑已经取得了快速进展，但当前的方法往往难以实现精确的视觉控制，因为自然语言在描述复杂的视觉细微差别方面是有限的。尽管参考指导编辑提供了一个强大的解决方案，但其潜力目前受到高质量配对训练数据稀缺的限制。为了解决这一问题，我们引入了一种可扩展的数据生成管道，将现有的视频编辑配对转换为高保真度的训练四元组，利用图像生成模型创建合成的参考支架。使用此管道，我们构建了RefVIE，一个针对指令-参考跟随任务的大规模数据集，并建立了RefVIE-Bench进行全面评估。此外，我们提出了一种统一的编辑架构Kiwi-Edit，该架构结合了可学习查询和潜在视觉特征，以实现参考语义指导。通过逐步多阶段训练课程，我们的模型在指令跟随和参考保真度方面取得了显著的提升。广泛的实验表明，我们的数据和架构在可控视频编辑方面建立了新的最先进的水平。所有数据集、模型和代码均发布在https://github.com/showlab/Kiwi-Edit/。

Summary / 总结

The research aims to improve the precision of instruction-based video editing by addressing the limitations of natural language in describing visual nuances and the scarcity of training data. The method involves a scalable data generation pipeline that transforms existing video editing pairs into high-fidelity training quadruplets, and a unified editing architecture, Kiwi-Edit, which combines learnable queries and latent visual features for reference semantic guidance. The model demonstrates significant improvements in instruction following and reference fidelity, setting a new state-of-the-art in controllable video editing.

Kiwi-Edit针对基于指令的视频编辑难以实现精确视觉控制的问题，提出了一个可扩展的数据生成管道，用于创建高保真训练四元组。研究利用该管道构建了大规模数据集RefVIE，并提出了统一的编辑架构Kiwi-Edit，显著提高了指令跟随和参考保真度。大量实验表明，Kiwi-Edit在可控视频编辑方面达到了新的最先进水平。

Video-based Locomotion Analysis for Fish Health Monitoring

Authors: Timon Palm, Clemens Seibold, Anna Hilsmann, Peter Eisert

First: 2026-03-05T17:32:46+00:00 · Latest: 2026-03-05T17:32:46+00:00

Comments: Accepted at VISAPP 2026

Abstract

Monitoring the health conditions of fish is essential, as it enables the early detection of disease, safeguards animal welfare, and contributes to sustainable aquaculture practices. Physiological and pathological conditions of cultivated fish can be inferred by analyzing locomotion activities. In this paper, we present a system that estimates the locomotion activities from videos using multi-object tracking. The core of our approach is a YOLOv11 detector embedded in a tracking-by-detection framework. We investigate various configurations of the YOLOv11 architecture as well as extensions that incorporate multiple frames to improve detection accuracy. Our system is evaluated on a manually annotated dataset of Sulawesi ricefish recorded in a home-aquarium-like setup, demonstrating its ability to reliably measure swimming direction and speed for fish health monitoring. The dataset will be made publicly available upon publication.

中文标题/摘要

标题：基于视频的鱼类健康监测运动分析

监测鱼类的健康状况至关重要，因为它能够实现疾病的早期检测，保障动物福利，并促进可持续的水产养殖实践。通过分析运动活动，可以推断出养殖鱼类的生理和病理状况。在本文中，我们提出了一种系统，该系统使用多目标跟踪从视频中估计运动活动。我们方法的核心是一个嵌入在检测式跟踪框架中的YOLOv11检测器。我们研究了YOLOv11架构的各种配置以及结合多帧以提高检测准确性的扩展。我们的系统在一种人工水族箱环境下记录的苏拉威西稻鱼手动标注数据集上进行了评估，展示了其可靠地测量鱼类游泳方向和速度的能力，以用于健康监测。数据集将在发表后公开。

Summary / 总结

The research aims to develop a system for monitoring fish health by analyzing their locomotion activities from video footage. The method involves using a YOLOv11 detector within a tracking-by-detection framework to estimate swimming direction and speed. The system was evaluated on a dataset of Sulawesi ricefish, showing its capability to reliably measure these parameters for health monitoring purposes.

研究旨在通过视频分析鱼类的运动来监测其健康状况。方法是将YOLOv11检测器嵌入基于检测的跟踪（tracking-by-detection）框架，以估计游泳方向和速度。该系统在苏拉威西稻鱼（Sulawesi ricefish）的数据集上进行了评估，展示了其可靠地测量这些参数以进行健康监测的能力。
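
Once a tracker has linked detections into per-fish tracks, swimming speed and direction follow from simple geometry on consecutive box centers. The sketch below assumes one centroid per consecutive frame, pixel units, and a known frame rate; the function name and the input layout are hypothetical and are not taken from the paper.

```python
import math


def locomotion_from_track(centroids, fps):
    """Return a list of (speed, heading_deg) pairs for one fish track.

    centroids: (x, y) box centers in pixels, one per consecutive frame.
    fps: video frame rate, used to convert per-frame displacement to px/s.
    Heading is measured in image coordinates (y grows downward), in [0, 360).
    """
    dt = 1.0 / fps
    steps = []
    for (x0, y0), (x1, y1) in zip(centroids, centroids[1:]):
        dx, dy = x1 - x0, y1 - y0
        speed = math.hypot(dx, dy) / dt
        heading = math.degrees(math.atan2(dy, dx)) % 360.0
        steps.append((speed, heading))
    return steps
```

Converting pixels to physical units would additionally need a camera calibration or a known reference length in the tank, which this sketch leaves out.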

Judge Reliability Harness: Stress Testing the Reliability of LLM Judges

Authors: Sunishchal Dev, Andrew Sloan, Joshua Kavner, Nicholas Kong, Morgan Sandler

Venue: ICLR 2026

First: 2026-03-05T17:27:07+00:00 · Latest: 2026-03-05T17:27:07+00:00

Comments: Accepted at Agents in the Wild: Safety, Security, and Beyond Workshop at ICLR 2026 - April 26, 2026, Rio de Janeiro, Brazil

Abstract

We present the Judge Reliability Harness, an open source library for constructing validation suites that test the reliability of LLM judges. As LLM-based scoring is widely deployed in AI benchmarks, more tooling is needed to efficiently assess the reliability of these methods. Given a benchmark dataset and an LLM judge configuration, the harness generates reliability tests that evaluate both binary judgment accuracy and ordinal grading performance for free-response and agentic task formats. We evaluate four state-of-the-art judges across four benchmarks spanning safety, persuasion, misuse, and agentic behavior, and find meaningful variation in performance across models and perturbation types, highlighting opportunities to improve the robustness of LLM judges. No judge that we evaluated is uniformly reliable across benchmarks using our harness. For example, our preliminary experiments revealed consistency issues, measured as the accuracy with which a judge assesses another LLM's ability to complete a task, under simple text formatting changes, paraphrasing, changes in verbosity, and flipping of the ground-truth label in LLM-produced responses. The code for this tool is available at: https://github.com/RANDCorporation/judge-reliability-harness

中文标题/摘要

标题：评判模型可靠性测试框架：对LLM评判模型可靠性的压力测试

我们提出了Judge Reliability Harness，这是一个开源库，用于构建验证套件以测试LLM评判模型（LLM judge）的可靠性。由于基于LLM的评分在AI基准测试中被广泛采用，因此需要更多的工具来高效评估这些方法的可靠性。给定基准数据集和一个LLM评判模型配置，该框架会生成可靠性测试，评估自由回答和代理任务格式下的二元判断准确性和等级评分性能。我们在涵盖安全、说服、滥用和代理行为的四个基准上评估了四个最先进的评判模型，发现不同模型和扰动类型之间的性能存在明显差异，突显了提高LLM评判模型稳健性的空间。在使用我们的框架评估的评判模型中，没有任何一个在所有基准上都一致可靠。例如，我们对评判模型的初步实验表明，在简单的文本格式更改、改写、详略程度变化以及翻转LLM生成响应中的真实标签的情况下，评判模型判断另一个LLM完成任务能力的准确率会出现一致性问题。该工具的代码可在以下链接获取：https://github.com/RANDCorporation/judge-reliability-harness

Summary / 总结

The study introduces the Judge Reliability Harness, an open-source library for testing the reliability of LLM judges. It evaluates four state-of-the-art judges across four benchmarks and finds significant variations in performance, indicating room for improvement in the robustness of LLM judges. The harness generates tests to assess binary judgment accuracy and ordinal grading for free-response and agentic task formats, revealing consistency issues due to minor text changes and label flips.

研究介绍了Judge Reliability Harness，一个用于测试LLM评判模型可靠性的开源库。它在四个基准上评估了四个最先进的评判模型，发现了显著的性能差异，表明LLM评判模型的稳健性仍有提升空间。该工具生成测试以评估自由回答和代理任务格式中的二元判断准确性和等级评分性能，并揭示了由细微文本改动和标签翻转引起的一致性问题。
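
The idea of a perturbation-based reliability test can be sketched in a few lines: apply a meaning-preserving change to each response and count how often the judge's verdict stays the same. Everything below is a hypothetical illustration, including the function names and the toy keyword judge standing in for an LLM call; it does not reproduce the API of the Judge Reliability Harness.

```python
def add_formatting_noise(text):
    """Meaning-preserving perturbation: a bullet prefix and extra line breaks."""
    return "- " + text.replace(". ", ".\n\n")


def consistency_rate(judge, responses, perturb):
    """Fraction of responses whose verdict is unchanged after perturbation.

    judge: callable mapping a response string to a verdict (e.g. True/False).
    perturb: callable mapping a response string to a perturbed string.
    """
    if not responses:
        raise ValueError("need at least one response")
    unchanged = sum(judge(r) == judge(perturb(r)) for r in responses)
    return unchanged / len(responses)


def brittle_judge(response):
    """Toy stand-in for an LLM judge that is sensitive to leading formatting."""
    return response.startswith("The answer")
```

A judge that scores well on a benchmark but shows a low consistency rate under such perturbations is reacting to surface form rather than content, which is the kind of failure the abstract describes.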

OPPO: Accelerating PPO-based RLHF via Pipeline Overlap

Authors: Kaizhuo Yan, Yingjie Yu, Yifan Yu, Haizhong Zheng, Fan Lai

First: 2025-09-30T04:26:17+00:00 · Latest: 2026-03-05T17:25:37+00:00

Abstract

Proximal Policy Optimization (PPO)-based reinforcement learning from human feedback (RLHF) is a widely adopted paradigm for aligning large language models (LLMs) with human preferences. However, its training pipeline suffers from substantial inefficiencies due to sequential multi-model dependencies (e.g., reward model depends on actor outputs) and long-tail response lengths, where a few long responses straggle the stage completion. We present OPPO, a novel, lightweight, and model-agnostic PPO-based RLHF framework that improves training efficiency by overlapping pipeline execution. OPPO introduces two novel techniques: (1) Intra-step overlap, which streams upstream model outputs (e.g., actor model) in right-sized chunks, enabling the downstream model (e.g., reward) to begin prefill while the upstream continues decoding; and (2) Inter-step overlap, which adaptively overcommits a few prompts and defers long generations to future steps, mitigating tail latency without discarding partial work. OPPO integrates easily with existing PPO implementations with a lightweight wrapper. Extensive evaluations show that OPPO accelerates PPO-based RLHF training by $1.8\times$--$2.8\times$ and improves GPU utilization by $1.4\times$--$2.1\times$ without compromising training convergence.

中文标题/摘要

标题：OPPO：通过流水线重叠加速基于PPO的RLHF

基于近端策略优化（PPO）的人类反馈强化学习（RLHF）是一种被广泛采用的范式，用于使大型语言模型（LLMs）与人类偏好对齐。然而，其训练流水线因顺序的多模型依赖（例如，奖励模型依赖于actor模型的输出）和长尾的响应长度而效率低下，其中少数长响应会拖慢整个阶段的完成。我们提出了OPPO，这是一种新颖、轻量级且与模型无关的基于PPO的RLHF框架，通过重叠流水线执行来提高训练效率。OPPO引入了两种新技术：(1) 步骤内重叠，以大小合适的块流式传输上游模型（例如actor模型）的输出，使下游模型（例如奖励模型）在上游继续解码的同时开始预填充；(2) 步骤间重叠，自适应地超额提交少量提示，并将长生成推迟到后续步骤，从而在不丢弃部分工作的情况下减轻尾部延迟。OPPO只需一个轻量级包装器即可轻松集成到现有的PPO实现中。广泛的评估表明，OPPO将基于PPO的RLHF训练加速了1.8倍至2.8倍，并将GPU利用率提高了1.4倍至2.1倍，而不会影响训练收敛。

Summary / 总结

OPPO is a framework that accelerates PPO-based RLHF training by overlapping pipeline execution, achieving $1.8\times$--$2.8\times$ speedup and $1.4\times$--$2.1\times$ GPU utilization improvement. It introduces two techniques: intra-step overlap for streaming upstream model outputs in chunks and inter-step overlap for adaptive prompt overcommitment and deferred long generations.

OPPO是一种通过流水线重叠加速基于PPO的RLHF训练的框架，引入了两种技术：步骤内重叠，分块流式传输上游模型输出，使下游模型可以提前开始处理；以及步骤间重叠，自适应地超额提交少量提示，并将长生成推迟到后续步骤。实验表明，OPPO可将训练速度提升1.8至2.8倍，同时将GPU利用率提高1.4至2.1倍，而不影响收敛性。
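
A toy timing model shows why intra-step overlap helps. It assumes the actor emits its output in equal-cost chunks and the reward model can prefill each chunk as soon as it arrives; the function names and the constant per-chunk costs are assumptions for illustration, not measurements or code from OPPO.

```python
def sequential_time(n_chunks, t_decode, t_prefill):
    """Reward model waits for the full actor output before starting."""
    return n_chunks * (t_decode + t_prefill)


def overlapped_time(n_chunks, t_decode, t_prefill):
    """Reward model prefills chunk i while the actor decodes chunk i + 1."""
    actor_done = 0.0   # time at which the actor finishes the current chunk
    reward_done = 0.0  # time at which the reward model finishes its prefill
    for _ in range(n_chunks):
        actor_done += t_decode
        # Prefill starts once the chunk exists and the reward model is free.
        reward_done = max(actor_done, reward_done) + t_prefill
    return reward_done
```

For example, with 4 chunks, 1.0 time unit of decoding and 0.5 of prefill per chunk, the sequential schedule takes 6.0 units while the overlapped one takes 4.5, since only the last prefill is left exposed after decoding ends.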

Loop Closure via Maximal Cliques in 3D LiDAR-Based SLAM

Authors: Javier Laserna, Saurabh Gupta, Oscar Martinez Mozos, Cyrill Stachniss, Pablo San Segundo

First: 2026-03-05T17:24:44+00:00 · Latest: 2026-03-05T17:24:44+00:00

Comments: Accepted in the 2025 European Conference on Mobile Robots (ECMR). This is the author's version of the work

Abstract

Reliable loop closure detection remains a critical challenge in 3D LiDAR-based SLAM, especially under sensor noise, environmental ambiguity, and viewpoint variation conditions. RANSAC is often used in the context of loop closures for geometric model fitting in the presence of outliers. However, this approach may fail, leading to map inconsistency. We introduce a novel deterministic algorithm, CliReg, for loop closure validation that replaces RANSAC verification with a maximal clique search over a compatibility graph of feature correspondences. This formulation avoids random sampling and increases robustness in the presence of noise and outliers. We integrated our approach into a real-time pipeline employing binary 3D descriptors and a Hamming distance embedding binary search tree-based matching. We evaluated it on multiple real-world datasets featuring diverse LiDAR sensors. The results demonstrate that our proposed technique consistently achieves a lower pose error and more reliable loop closures than RANSAC, especially in sparse or ambiguous conditions. Additional experiments on 2D projection-based maps confirm its generality across spatial domains, making our approach a robust and efficient alternative for loop closure detection.

中文标题/摘要

标题：基于3D LiDAR的SLAM中最大团的环路闭合

可靠的环路闭合检测仍然是基于3D LiDAR的SLAM中的一个关键挑战，尤其是在传感器噪声、环境模糊和视角变化的情况下。RANSAC通常用于环路闭合中的几何模型拟合，以应对离群值。然而，这种方法可能会失败，导致地图不一致。我们提出了一种新颖的确定性算法CliReg，用于环路闭合验证，用特征对应兼容图上的最大团搜索替代RANSAC验证。这种形式避免了随机采样，并在噪声和离群值存在的情况下增加了鲁棒性。我们将该方法集成到一个实时流水线中，使用二进制3D描述符和基于汉明距离嵌入的二叉搜索树匹配。我们在多个包含不同LiDAR传感器的现实世界数据集上进行了评估。结果表明，与RANSAC相比，我们提出的技术在姿态误差和环路闭合的可靠性方面始终更优，尤其是在稀疏或模糊条件下。另外，基于2D投影的地图上的实验进一步证实了其在不同空间域中的通用性，使我们的方法成为环路闭合检测的稳健且高效的替代方案。

Summary / 总结

The paper addresses the challenge of reliable loop closure detection in 3D LiDAR-based SLAM by proposing a deterministic algorithm called CliReg, which uses a maximal clique search over a compatibility graph of feature correspondences instead of RANSAC. The method integrates binary 3D descriptors and a Hamming distance embedding binary search tree-based matching into a real-time pipeline. Experimental results on multiple real-world datasets show that CliReg achieves lower pose errors and more reliable loop closures compared to RANSAC, particularly in sparse or ambiguous conditions.

论文提出了一种名为CliReg的确定性算法，用于基于3D LiDAR的SLAM中的环路闭合验证，它以特征对应关系兼容图上的最大团搜索取代RANSAC验证。该方法将二进制3D描述符和基于汉明距离嵌入二叉搜索树的匹配集成到实时流水线中。实验结果表明，与RANSAC相比，CliReg实现了更低的姿态误差和更可靠的环路闭合，在稀疏或模糊条件下尤为明显。
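
The clique-based validation idea can be sketched as follows: treat each feature correspondence as a graph node, connect two nodes when they agree with a common rigid motion, and keep the largest mutually consistent set. The compatibility test used here (pairwise distances preserved within a tolerance `eps`) is a common choice and an assumption on my part, as the abstract does not state CliReg's criterion. The search is a plain Bron-Kerbosch with pivoting, not the authors' solver.

```python
import math
from itertools import combinations


def compatibility_graph(src, dst, eps=0.1):
    """Adjacency sets over correspondences (src[i] <-> dst[i]).

    Two correspondences are compatible if the distance between their source
    points matches the distance between their target points within eps,
    which any rigid transform must preserve.
    """
    adj = {i: set() for i in range(len(src))}
    for i, j in combinations(range(len(src)), 2):
        d_src = math.dist(src[i], src[j])
        d_dst = math.dist(dst[i], dst[j])
        if abs(d_src - d_dst) < eps:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def maximum_clique(adj):
    """Largest clique via Bron-Kerbosch with pivoting and a size bound."""
    best = []

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = sorted(r)
            return
        if len(r) + len(p) <= len(best):
            return  # cannot beat the best clique found so far
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return best
```

The correspondences in the returned clique would then be passed to a closed-form rigid alignment, and the loop closure accepted only if the clique is large enough. Maximum clique search is exponential in the worst case, so this sketch is only suitable for small correspondence sets.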

On the Necessity of Learnable Sheaf Laplacians

Authors: Ferran Hernandez Caralt, Mar Gonzàlez i Català, Adrián Bazaga, Pietro Liò

First: 2026-03-05T17:24:13+00:00 · Latest: 2026-03-05T17:24:13+00:00

Abstract

Sheaf Neural Networks (SNNs) were introduced as an extension of Graph Convolutional Networks to address oversmoothing on heterophilous graphs by attaching a sheaf to the input graph and replacing the adjacency-based operator with a sheaf Laplacian defined by (learnable) restriction maps. Prior work motivates this design through theoretical properties of sheaf diffusion and the kernel of the sheaf Laplacian, suggesting that suitable non-identity restriction maps can avoid representations converging to constants across connected components. Since oversmoothing can also be mitigated through residual connections and normalization, we revisit a trivial sheaf construction to ask whether the additional complexity of learning restriction maps is necessary. We introduce an Identity Sheaf Network baseline, where all restriction maps are fixed to the identity, and use it to ablate the empirical improvements reported by sheaf-learning architectures. Across five popular heterophilic benchmarks, the identity baseline achieves comparable performance to a range of SNN variants. Finally, we introduce the Rayleigh quotient as a normalized measure for comparing oversmoothing across models and show that, in trained networks, the behavior predicted by the diffusion-based analysis of SNNs is not reflected empirically. In particular, Identity Sheaf Networks do not appear to suffer more significant oversmoothing than their SNN counterparts.

中文标题/摘要

标题：可学习层拉普拉斯的必要性

层神经网络（SNNs）被引入作为图卷积网络的扩展，通过在输入图上附加一个层（sheaf），并用由（可学习的）限制映射定义的层拉普拉斯算子替换基于邻接的算子，来解决异质图上的过度平滑问题。先前的工作通过层扩散的理论性质和层拉普拉斯算子的核来论证这种设计，表明适当的非恒等限制映射可以避免表示在各连通分量上收敛到常数。由于残差连接和归一化也可以缓解过度平滑，我们重新审视一个平凡的层构造，以探究学习限制映射所带来的额外复杂性是否必要。我们引入了一个恒等层网络（Identity Sheaf Network）基线，其中所有限制映射都固定为恒等映射，并用它对层学习架构所报告的经验性改进进行消融检验。在五个流行的异质基准测试中，恒等基线实现了与多种SNN变体相当的性能。最后，我们引入瑞利商作为比较不同模型过度平滑程度的归一化度量，并表明，在训练后的网络中，基于扩散的SNN分析所预测的行为并未在实验中得到体现。特别是，恒等层网络似乎并不比对应的SNN遭受更严重的过度平滑。

Summary / 总结

This paper investigates whether learnable sheaf Laplacians are necessary in Sheaf Neural Networks (SNNs) by comparing them with an Identity Sheaf Network baseline in which all restriction maps are fixed to the identity. Across five heterophilic benchmarks, the identity baseline performs comparably to a range of SNN variants. The paper also introduces the Rayleigh quotient as a normalized measure for comparing oversmoothing across models and finds that, in trained networks, Identity Sheaf Networks do not appear to oversmooth more than their SNN counterparts, so the behavior predicted by the diffusion-based analysis of SNNs is not reflected empirically.

本文通过将层神经网络（SNN）与所有限制映射固定为恒等映射的恒等层网络基线进行比较，探讨了可学习层拉普拉斯算子的必要性。它重新审视了SNN的设计，以确定是否需要学习这些映射来避免异配图上的过度平滑。在五个异配基准上，恒等基线的表现与多种SNN变体相当，而瑞利商分析表明，恒等层网络并不比SNN遭受更严重的过度平滑。
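
As a rough illustration of the Rayleigh-quotient idea, the sketch below computes a Dirichlet-energy ratio on a toy graph. It assumes identity restriction maps, in which case the sheaf Laplacian reduces to the ordinary graph Laplacian applied to each feature channel. The unnormalized Laplacian, the trace-based normalization, and the function names are assumptions made for this example; the paper's exact definition may differ.

```python
import numpy as np

def graph_laplacian(num_nodes, edges):
    """Unnormalized Laplacian L = D - A of an undirected graph."""
    lap = np.zeros((num_nodes, num_nodes))
    for u, v in edges:
        lap[u, u] += 1.0
        lap[v, v] += 1.0
        lap[u, v] -= 1.0
        lap[v, u] -= 1.0
    return lap

def rayleigh_quotient(lap, features):
    """trace(X^T L X) / trace(X^T X); 0 means X is constant on every component."""
    num = np.trace(features.T @ lap @ features)
    den = np.trace(features.T @ features)
    return float(num / den)

# Toy path graph 0-1-2-3 with two feature channels (hypothetical data).
edges = [(0, 1), (1, 2), (2, 3)]
lap = graph_laplacian(4, edges)

smooth = np.ones((4, 2))                       # fully oversmoothed features
rough = np.array([[1.0, 0.0], [-1.0, 0.0],
                  [1.0, 0.0], [-1.0, 0.0]])    # features alternating along the path

rq_smooth = rayleigh_quotient(lap, smooth)
rq_rough = rayleigh_quotient(lap, rough)
print(rq_smooth, rq_rough)
```

Because the quotient divides by the feature norm, it can be compared across layers or models whose representations differ in scale, which is the property the abstract attributes to a normalized measure.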

Fusion-CAM: Integrating Gradient and Region-Based Class Activation Maps for Robust Visual Explanations

Authors: Hajar Dekdegue, Moncef Garouani, Josiane Mothe, Jordan Bernigaud

First: 2026-03-05T17:16:29+00:00 · Latest: 2026-03-05T17:16:29+00:00

Abstract

Interpreting the decision-making process of deep convolutional neural networks remains a central challenge in achieving trustworthy and transparent artificial intelligence. Explainable AI (XAI) techniques, particularly Class Activation Map (CAM) methods, are widely adopted to visualize the input regions influencing model predictions. Gradient-based approaches (e.g. Grad-CAM) provide highly discriminative, fine-grained details by computing gradients of class activations but often yield noisy and incomplete maps that emphasize only the most salient regions rather than the complete objects. Region-based approaches (e.g. Score-CAM) aggregate information over larger areas, capturing broader object coverage at the cost of over-smoothing and reduced sensitivity to subtle features. We introduce Fusion-CAM, a novel framework that bridges this explanatory gap by unifying both paradigms through a dedicated fusion mechanism to produce robust and highly discriminative visual explanations. Our method first denoises gradient-based maps, yielding cleaner and more focused activations. It then combines the refined gradient map with region-based maps using contribution weights to enhance class coverage. Finally, we propose an adaptive similarity-based pixel-level fusion that evaluates the agreement between both paradigms and dynamically adjusts the fusion strength. This adaptive mechanism reinforces consistent activations while softly blending conflicting regions, resulting in richer, context-aware, and input-adaptive visual explanations. Extensive experiments on standard benchmarks show that Fusion-CAM consistently outperforms existing CAM variants in both qualitative visualization and quantitative evaluation, providing a robust and flexible tool for interpreting deep neural networks.

中文标题/摘要

标题：Fusion-CAM：融合基于梯度与基于区域的类激活图以实现稳健的视觉解释

在实现可信赖和透明的人工智能方面，解释深度卷积神经网络的决策过程仍然是一个核心挑战。可解释人工智能（XAI）技术，特别是类激活图（CAM）方法，广泛用于可视化影响模型预测的输入区域。基于梯度的方法（例如Grad-CAM）通过计算类激活的梯度提供高度区分性的、精细的细节，但通常会产生噪声大且不完整的图，仅强调最显著的区域而非整个对象。基于区域的方法（例如Score-CAM）通过聚合较大区域的信息来捕捉更广泛的对象覆盖范围，但代价是过度平滑和对细微特征的敏感度降低。我们提出了Fusion-CAM，这是一种新颖的框架，通过专门的融合机制将这两种范式结合起来，以产生稳健且高度区分性的视觉解释。该方法首先去噪基于梯度的图，生成更清洁、更集中的激活。然后，它使用贡献权重将精炼的梯度图与基于区域的图结合起来，以增强类覆盖范围。最后，我们提出了一种基于相似性的自适应像素级融合方法，该方法评估两种范式的共识并动态调整融合强度。这种自适应机制强化了一致的激活，同时柔和地融合冲突区域，从而产生更丰富、上下文感知和输入自适应的视觉解释。在标准基准上的广泛实验表明，Fusion-CAM在定性和定量评估中均优于现有CAM变体，提供了一种稳健且灵活的工具，用于解释深度神经网络。

Summary / 总结

Fusion-CAM unifies gradient-based and region-based Class Activation Maps to produce robust, highly discriminative visual explanations for deep convolutional neural networks. It first denoises the gradient-based map, then combines the refined map with region-based maps using contribution weights, and finally applies an adaptive similarity-based pixel-level fusion that reinforces consistent activations and softly blends conflicting regions. Experiments on standard benchmarks show that Fusion-CAM outperforms existing CAM variants in both qualitative and quantitative evaluations.

Fusion-CAM 将基于梯度和基于区域的类激活图结合起来，为深度卷积神经网络提供稳健且详细的视觉解释。它首先对基于梯度的图去噪，再用贡献权重将其与基于区域的图结合，最后通过基于相似性的自适应像素级融合得到更全面、上下文感知的视觉解释。实验表明，Fusion-CAM 在定性和定量评估中均优于现有 CAM 变体。
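
The abstract does not give the fusion formula, so the following is only one possible reading of "adaptive similarity-based pixel-level fusion". In this sketch the per-pixel agreement between two normalized maps decides how much to reinforce (take the maximum) versus softly blend (take the mean). The agreement measure, both fusion rules, and all function names are assumptions made for illustration, not the authors' method.

```python
import numpy as np

def normalize(cam):
    """Min-max scale a map to [0, 1]; a flat map becomes all zeros."""
    cam = np.asarray(cam, dtype=float)
    span = cam.max() - cam.min()
    return (cam - cam.min()) / span if span > 0 else np.zeros_like(cam)

def fuse(grad_cam, region_cam):
    """Agreement-weighted pixel-level fusion (illustrative rule only)."""
    g, r = normalize(grad_cam), normalize(region_cam)
    agreement = 1.0 - np.abs(g - r)   # 1 where the maps agree, 0 where they conflict
    reinforced = np.maximum(g, r)     # keep the stronger activation where they agree
    blended = 0.5 * (g + r)           # soft average where they conflict
    return agreement * reinforced + (1.0 - agreement) * blended

# Hypothetical 2x2 maps: they agree everywhere except the bottom-right pixel.
grad_map = np.array([[0.0, 1.0], [0.5, 0.0]])
region_map = np.array([[0.0, 1.0], [0.5, 1.0]])
fused = fuse(grad_map, region_map)
print(fused)
```

Where the two maps agree, the fused value equals the shared activation; where they fully conflict, it falls back to their average, which mirrors the "reinforce consistent, softly blend conflicting" behavior described in the abstract.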

ORMOT: A Dataset and Framework for Omnidirectional Referring Multi-Object Tracking

Authors: Sijia Chen, Zihan Zhou, Yanqiu Yu, En Yu, Wenbing Tao

First: 2026-03-05T17:15:01+00:00 · Latest: 2026-03-05T17:15:01+00:00

Comments: https://github.com/chen-si-jia/ORMOT

Abstract

Multi-Object Tracking (MOT) is a fundamental task in computer vision, aiming to track targets across video frames. Existing MOT methods perform well in general visual scenes, but face significant challenges and limitations when extended to visual-language settings. To bridge this gap, the task of Referring Multi-Object Tracking (RMOT) has recently been proposed, which aims to track objects that correspond to language descriptions. However, current RMOT methods are primarily developed on datasets captured by conventional cameras, which suffer from limited field of view. This constraint often causes targets to move out of the frame, leading to fragmented tracking and loss of contextual information. In this work, we propose a novel task, called Omnidirectional Referring Multi-Object Tracking (ORMOT), which extends RMOT to omnidirectional imagery, aiming to overcome the field-of-view (FoV) limitation of conventional datasets and improve the model's ability to understand long-horizon language descriptions. To advance the ORMOT task, we construct ORSet, an Omnidirectional Referring Multi-Object Tracking dataset, which contains 27 diverse omnidirectional scenes, 848 language descriptions, and 3,401 annotated objects, providing rich visual, temporal, and language information. Furthermore, we propose ORTrack, a Large Vision-Language Model (LVLM)-driven framework tailored for Omnidirectional Referring Multi-Object Tracking. Extensive experiments on the ORSet dataset demonstrate the effectiveness of our ORTrack framework. The dataset and code will be open-sourced at https://github.com/chen-si-jia/ORMOT.

中文标题/摘要

标题：ORMOT：全方位引用多目标跟踪的数据集和框架

多目标跟踪（MOT）是计算机视觉中的一个基本任务，旨在跨视频帧跟踪目标。现有的MOT方法在一般视觉场景中表现良好，但在扩展到视觉语言设置时面临重大挑战和限制。为了解决这一差距，最近提出了引用多目标跟踪（RMOT）任务，旨在跟踪与语言描述对应的物体。然而，当前的RMOT方法主要是在由传统相机拍摄的数据集上开发的，这些数据集存在视野有限的限制。这种限制往往导致目标移出画面，从而导致跟踪片段化并丢失上下文信息。在本文中，我们提出了一项新的任务，称为全方位引用多目标跟踪（ORMOT），该任务将RMOT扩展到全方位图像，旨在克服传统数据集的视野限制，并提高模型理解长时语言描述的能力。为了推进ORMOT任务，我们构建了ORSet，一个全方位引用多目标跟踪数据集，包含27个多样化的全方位场景、848个语言描述和3,401个标注物体，提供了丰富的视觉、时间和语言信息。此外，我们提出了ORTrack，一种针对全方位引用多目标跟踪的大型视觉-语言模型驱动框架。在ORSet数据集上的广泛实验表明，我们的ORTrack框架的有效性。数据集和代码将在https://github.com/chen-si-jia/ORMOT开放。

Summary / 总结

The research addresses the limitations of existing Referring Multi-Object Tracking (RMOT) methods, which are developed on conventional-camera datasets with a limited field of view, by proposing a new task called Omnidirectional Referring Multi-Object Tracking (ORMOT). The authors construct ORSet, a dataset with 27 omnidirectional scenes, 848 language descriptions, and 3,401 annotated objects, and propose ORTrack, a framework driven by a Large Vision-Language Model. Experiments on ORSet demonstrate the effectiveness of ORTrack. The dataset and code will be open-sourced at https://github.com/chen-si-jia/ORMOT.

研究旨在通过提出全方位引用多目标跟踪（ORMOT）这一新任务，解决现有多目标跟踪（MOT）方法在视觉语言场景中的局限性。作者构建了ORSet数据集，包含27个多样化的全方位场景、848个语言描述和3,401个标注物体，并提出了由大型视觉语言模型驱动的ORTrack框架。在ORSet上的实验表明了ORTrack的有效性。

OpenFrontier: General Navigation with Visual-Language Grounded Frontiers

Authors: Esteban Padilla, Boyang Sun, Marc Pollefeys, Hermann Blum

First: 2026-03-05T17:02:22+00:00 · Latest: 2026-03-05T17:02:22+00:00

Abstract

Open-world navigation requires robots to make decisions in complex everyday environments while adapting to flexible task requirements. Conventional navigation approaches often rely on dense 3D reconstruction and hand-crafted goal metrics, which limits their generalization across tasks and environments. Recent advances in vision-language navigation (VLN) and vision-language-action (VLA) models enable end-to-end policies conditioned on natural language, but typically require interactive training, large-scale data collection, or task-specific fine-tuning with a mobile agent. We formulate navigation as a sparse subgoal identification and reaching problem and observe that providing visual anchoring targets for high-level semantic priors enables highly efficient goal-conditioned navigation. Based on this insight, we select navigation frontiers as semantic anchors and propose OpenFrontier, a training-free navigation framework that seamlessly integrates diverse vision-language prior models. OpenFrontier enables efficient navigation with a lightweight system design, without dense 3D mapping, policy training, or model fine-tuning. We evaluate OpenFrontier across multiple navigation benchmarks and demonstrate strong zero-shot performance, as well as effective real-world deployment on a mobile robot.

中文标题/摘要

标题：OpenFrontier：基于视觉-语言引导边界的通用导航

开放世界导航要求机器人在复杂日常环境中做出决策并适应灵活的任务需求。传统导航方法通常依赖密集的3D重建和手工制作的目标度量标准，这限制了它们在不同任务和环境中的泛化能力。视觉-语言导航（VLN）和视觉-语言-动作（VLA）模型的最新进展使基于自然语言的端到端策略成为可能，但通常需要交互式训练、大规模数据收集或针对移动代理的任务特定微调。我们将导航问题形式化为稀疏子目标识别和到达问题，并观察到提供视觉锚定目标以支持高层语义先验可以实现高效的基于目标的导航。基于这一洞察，我们选择导航边界作为语义锚点，并提出OpenFrontier，这是一种无需训练的导航框架，能够无缝集成多种视觉-语言先验模型。OpenFrontier 通过轻量级系统设计实现了高效的导航，无需密集的3D映射、策略训练或模型微调。我们在多个导航基准上评估了OpenFrontier，并展示了其强大的零样本性能，以及在移动机器人上的有效实际部署。

Summary / 总结

The research aims to enable robots to navigate complex everyday environments under flexible task requirements. The method formulates navigation as a sparse subgoal identification and reaching problem and uses navigation frontiers as semantic anchors for vision-language priors. OpenFrontier achieves strong zero-shot performance across multiple benchmarks and deploys effectively on a real mobile robot, without dense 3D mapping, policy training, or model fine-tuning.

论文旨在解决机器人在复杂环境中、面对灵活任务需求时的开放世界导航问题。提出了一种名为OpenFrontier的无需训练的导航框架，利用由视觉-语言先验赋予语义的导航边界作为语义锚点，实现高效的目标导向导航。实验表明，OpenFrontier在多个基准测试中取得了强大的零样本性能，并且无需密集3D建图、策略训练或模型微调即可有效部署在移动机器人上。
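
To make the "frontiers as sparse subgoals" idea concrete, here is a minimal sketch using the classic occupancy-grid notion of a frontier: a free cell that borders unexplored space. How OpenFrontier itself detects frontiers is not stated in the abstract, so the 2D grid, the cell encoding, and the `score_fn` callable (a stand-in for whatever vision-language prior ranks the candidates) are all assumptions of this example.

```python
import numpy as np

FREE, OCCUPIED, UNKNOWN = 0, 1, -1  # hypothetical cell encoding

def find_frontiers(grid):
    """Return (row, col) of free cells with at least one unknown 4-neighbor."""
    rows, cols = grid.shape
    frontiers = []
    for r in range(rows):
        for c in range(cols):
            if grid[r, c] != FREE:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr, nc] == UNKNOWN:
                    frontiers.append((r, c))
                    break
    return frontiers

def pick_subgoal(frontiers, score_fn):
    """Choose the frontier with the highest score from the (assumed) semantic prior."""
    return max(frontiers, key=score_fn)

grid = np.array([
    [0, 0, -1],
    [0, 1, -1],
    [0, 0, 0],
])
frontiers = find_frontiers(grid)

# Toy scoring rule: prefer frontiers close to a hinted cell. A real system
# would query a vision-language model here instead.
hint = (2, 2)
subgoal = pick_subgoal(
    frontiers,
    lambda cell: -(abs(cell[0] - hint[0]) + abs(cell[1] - hint[1])),
)
print(frontiers, subgoal)
```

The point of the design is that the expensive semantic model only has to rank a handful of candidate cells rather than produce a full policy, which is consistent with the lightweight, training-free framing in the abstract.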

Robust Node Affinities via Jaccard-Biased Random Walks and Rank Aggregation

Authors: Bastian Pfeifer, Michael G. Schimek

First: 2026-03-05T17:00:59+00:00 · Latest: 2026-03-05T17:00:59+00:00

Abstract

Estimating node similarity is a fundamental task in network analysis and graph-based machine learning, with applications in clustering, community detection, classification, and recommendation. We propose TopKGraphs, a method based on start-node-anchored random walks that bias transitions toward nodes with structurally similar neighborhoods, measured via Jaccard similarity. Rather than computing stationary distributions, walks are treated as stochastic neighborhood samplers, producing partial node rankings that are aggregated using robust rank aggregation to construct interpretable node-to-node affinity matrices. TopKGraphs provides a non-parametric, interpretable, and general-purpose representation of node similarity that can be applied in both network analysis and machine learning workflows. We evaluate the method on synthetic graphs (stochastic block models, Lancichinetti-Fortunato-Radicchi benchmark graphs), k-nearest-neighbor graphs from tabular datasets, and a curated high-confidence protein-protein interaction network. Across all scenarios, TopKGraphs achieves competitive or superior performance compared to standard similarity measures (Jaccard, Dice), a diffusion-based method (personalized PageRank), and an embedding-based approach (Node2Vec), demonstrating robustness in sparse, noisy, or heterogeneous networks. These results suggest that TopKGraphs is a versatile and interpretable tool for bridging simple local similarity measures with more complex embedding-based approaches, facilitating both data mining and network analysis applications.

中文标题/摘要

标题：基于Jaccard偏差随机游走和排名聚合的稳健节点亲和性

节点相似性估计是网络分析和图基机器学习中的基本任务，应用于聚类、社区检测、分类和推荐。我们提出TopKGraphs方法，基于起始节点锚定的随机游走，偏向于结构相似的邻域节点，通过Jaccard相似性衡量。不同于计算稳态分布，游走被视为随机邻域采样器，生成部分节点排名，通过稳健的排名聚合构建可解释的节点到节点亲和矩阵。TopKGraphs提供了一种非参数化、可解释且通用的节点相似性表示，适用于网络分析和机器学习工作流。我们在合成图（随机块模型、兰其钦尼蒂-福图纳托-拉迪奇基准图）、表格数据集的k-最近邻图以及一个高置信度的蛋白质-蛋白质相互作用网络上评估了该方法。在所有场景中，TopKGraphs在与标准相似性度量（Jaccard、Dice）、基于扩散的方法（个性化PageRank）和基于嵌入的方法（Node2Vec）相比时，表现出竞争力或更优性能，证明了其在稀疏、嘈杂或异构网络中的鲁棒性。这些结果表明，TopKGraphs是一种将简单的局部相似性度量与更复杂的嵌入方法相结合的多功能且可解释的工具，有助于数据挖掘和网络分析应用。

Summary / 总结

TopKGraphs estimates node similarity in networks using Jaccard-biased random walks and rank aggregation. It provides a non-parametric, interpretable, and general-purpose representation of node similarity and achieves competitive or superior performance compared with standard similarity measures, personalized PageRank, and Node2Vec on synthetic graphs, k-nearest-neighbor graphs, and a protein-protein interaction network.

TopKGraphs 是一种估计网络中节点相似性的方法：它通过 Jaccard 偏向的随机游走生成部分节点排名，再经排名聚合形成可解释的亲和矩阵。在各类图上，它与标准相似性度量以及个性化 PageRank、Node2Vec 等方法相比取得了相当或更优的性能，并在稀疏和嘈杂网络中表现出鲁棒性。
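
The pipeline described above (biased walks, partial rankings, aggregation) can be sketched as follows. This is not the TopKGraphs implementation: biasing each step by Jaccard similarity to the current node, using closed neighborhoods, and replacing robust rank aggregation with a simple Borda-style score are all assumptions chosen to keep the example short.

```python
import random
from collections import defaultdict

def jaccard(adj, u, v):
    """Jaccard similarity of the closed neighborhoods of u and v."""
    a, b = adj[u] | {u}, adj[v] | {v}
    return len(a & b) / len(a | b)

def biased_walk(adj, start, length, rng):
    """Walk from `start`, picking neighbors in proportion to Jaccard similarity."""
    order, current = [], start
    for _ in range(length):
        nbrs = sorted(adj[current])
        weights = [jaccard(adj, current, n) for n in nbrs]
        current = rng.choices(nbrs, weights=weights, k=1)[0]
        if current != start and current not in order:
            order.append(current)  # first-visit order acts as a partial ranking
    return order

def affinities(adj, start, walks=200, length=4, seed=0):
    """Aggregate partial rankings with a Borda-style score (stand-in aggregation)."""
    rng = random.Random(seed)
    score = defaultdict(float)
    n = len(adj)
    for _ in range(walks):
        for rank, node in enumerate(biased_walk(adj, start, length, rng)):
            score[node] += (n - rank) / walks
    return dict(score)

# Two triangles joined by the bridge 2-3 (hypothetical toy graph).
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
aff = affinities(adj, start=0)
print(sorted(aff.items(), key=lambda kv: -kv[1]))
```

Running `affinities` once per start node and stacking the results would give one row of a node-to-node affinity matrix per node; the walks act as neighborhood samplers rather than as a route to a stationary distribution.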

Revolutionizing Mixed Precision Quantization: Towards Training-free Automatic Proxy Discovery via Large Language Models

Authors: Haidong Kang, Jun Du, Lihong Lin

First: 2025-12-08T10:52:55+00:00 · Latest: 2026-03-05T16:57:43+00:00

Abstract

Mixed-Precision Quantization (MPQ) liberates Deep Neural Networks (DNNs) from the Out-Of-Memory (OOM) bottleneck and has garnered increasing research attention. However, conventional methods either rely on costly differentiable optimization search, which is neither efficient nor flexible, or learn a quantized DNN from a proxy (e.g., HAWQ) manually designed by human experts, which is labor-intensive and requires extensive expert knowledge. Can we design a proxy without involving any human experts or training? In this paper, we provide an affirmative answer by proposing a novel Large Language Model (LLM)-driven Training-free Automatic Proxy (dubbed TAP) discovery framework. It reforms the design paradigm of MPQ by utilizing LLMs and evolutionary search strategies to automatically find superior TAP tailored for MPQ. In addition, to bridge the gap between black-box LLMs and the challenging MPQ task, we introduce a lightweight Direct Preference Optimization (DPO)-based strategy controller that dynamically reweights the selection probabilities of the three prompt templates for evolutionary search strategies according to fitness signals, without fine-tuning the LLM. This forms a task-aware feedback loop that improves proxy generation across evolutions. Extensive experiments on mainstream benchmarks demonstrate that TAP achieves state-of-the-art performance. Finally, we believe that our TAP will significantly contribute to the MPQ community by providing a new perspective on LLM-driven design algorithms.

中文标题/摘要

标题：混合精度量化革命：通过大型语言模型实现无需训练的自动代理发现

混合精度量化（MPQ）使深度神经网络（DNNs）摆脱了内存不足（OOM）的瓶颈，并引起了越来越多的研究关注。然而，传统方法要么依赖于昂贵的可微优化搜索，这既不高效也不灵活，要么从人类专家手动设计的代理（例如HAWQ）中学习量化DNN，这既耗时又需要大量专家知识。我们能否设计一个无需任何人类专家或训练的代理？在本文中，我们通过提出一种新颖的大型语言模型（LLM）驱动的无需训练的自动代理（简称TAP）发现框架，给出了肯定的答案。该框架通过利用LLM和进化搜索策略，自动发现适用于MPQ的优质TAP，改革了MPQ的设计范式。此外，为了弥合黑盒LLM与挑战性的MPQ任务之间的差距，我们引入了一种轻量级的直接偏好优化（DPO）为基础的策略控制器，根据适应度信号动态调整进化搜索策略中三种提示模板的选择概率，无需微调LLM。这形成了一种任务感知的反馈循环，提高了代理生成的性能。在主流基准上的广泛实验表明，TAP达到了最先进的性能。最后，我们认为，我们的TAP将通过提供一种LLM驱动设计算法的新视角，对MPQ社区产生重大贡献。

Summary / 总结

The paper asks whether a proxy for Mixed-Precision Quantization (MPQ) can be designed without human experts or training, both of which conventional methods rely on. It introduces TAP, a framework that uses Large Language Models (LLMs) and evolutionary search strategies to automatically discover superior proxies tailored for MPQ. A lightweight Direct Preference Optimization (DPO)-based strategy controller dynamically reweights the selection probabilities of three prompt templates according to fitness signals, without fine-tuning the LLM. Experiments on mainstream benchmarks show that TAP achieves state-of-the-art performance.

本文解决了无需人工干预或训练来设计混合精度量化（MPQ）代理的问题。它提出了一种TAP框架，利用大型语言模型（LLMs）和进化搜索策略自动发现最优代理。TAP框架包含一个轻量级的直接偏好优化（DPO）策略控制器，根据适应度信号动态调整提示模板的选择概率，从而增强代理生成过程。实验表明，TAP在主流基准上优于现有方法，为MPQ设计提供了新的思路。
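
The feedback loop in which fitness signals reweight the selection probabilities of three prompt templates can be illustrated with a very small controller. The sketch below deliberately swaps the paper's DPO-based controller for a plain softmax preference update, since the abstract does not specify the DPO objective. The logits, the baseline, the learning rate, and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Convert preference logits into selection probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def update(logits, chosen, fitness, baseline, lr=1.0):
    """Raise the chosen template's logit if its fitness beats the baseline, else lower it."""
    new = list(logits)
    new[chosen] += lr * (fitness - baseline)
    return new

# Three hypothetical prompt templates, initially equally likely.
logits = [0.0, 0.0, 0.0]

# Suppose template 1 produced a proxy with fitness 0.9 against a baseline of 0.5.
logits = update(logits, chosen=1, fitness=0.9, baseline=0.5)
probs = softmax(logits)
print(probs)
```

Only the small preference vector is updated here, never the language model itself, which matches the abstract's claim that the controller works without fine-tuning the LLM.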
