arXiv 论文速递

Snapshot: 20260212_0404

SAGE: Scalable Agentic 3D Scene Generation for Embodied AI

Authors: Hongchi Xia, Xuan Li, Zhaoshuo Li, Qianli Ma, Jiashu Xu, Ming-Yu Liu, Yin Cui, Tsung-Yi Lin, Wei-Chiu Ma, Shenlong Wang, Shuran Song, Fangyin Wei

First: 2026-02-10T18:59:55+00:00 · Latest: 2026-02-10T18:59:55+00:00

Comments: Project Page: https://nvlabs.github.io/sage

Abstract

Real-world data collection for embodied agents remains costly and unsafe, calling for scalable, realistic, and simulator-ready 3D environments. However, existing scene-generation systems often rely on rule-based or task-specific pipelines, yielding artifacts and physically invalid scenes. We present SAGE, an agentic framework that, given a user-specified embodied task (e.g., "pick up a bowl and place it on the table"), understands the intent and automatically generates simulation-ready environments at scale. The agent couples multiple generators for layout and object composition with critics that evaluate semantic plausibility, visual realism, and physical stability. Through iterative reasoning and adaptive tool selection, it self-refines the scenes until meeting user intent and physical validity. The resulting environments are realistic, diverse, and directly deployable in modern simulators for policy training. Policies trained purely on this data exhibit clear scaling trends and generalize to unseen objects and layouts, demonstrating the promise of simulation-driven scaling for embodied AI. Code, demos, and the SAGE-10k dataset can be found on the project page here: https://nvlabs.github.io/sage.

中文标题/摘要

标题：SAGE：面向具身AI的可扩展智能体式3D场景生成

现实世界的数据收集对具身代理来说仍然代价高昂且不安全，因此需要可扩展、逼真且可模拟的3D环境。然而，现有的场景生成系统往往依赖于基于规则或特定任务的管道，导致生成不自然且物理上无效的场景。我们提出了SAGE，这是一种代理框架，给定用户指定的具身任务（例如，“拿起一个碗并把它放在桌子上”），它能够理解意图并自动大规模生成可模拟的环境。代理结合了用于布局和对象组合的多个生成器以及评估语义合理性、视觉逼真性和物理稳定性的批评者。通过迭代推理和适应性工具选择，它不断自我完善场景，直到满足用户意图和物理有效性。生成的环境是逼真、多样且可以直接部署在现代模拟器中进行策略训练的。仅基于此数据训练的策略表现出明显的扩展趋势，并能泛化到未见过的对象和布局，展示了基于模拟的扩展对具身AI的潜力。项目代码、演示和SAGE-10k数据集可以在项目页面上找到：https://nvlabs.github.io/sage。

Summary / 总结

SAGE is an agentic framework designed to generate realistic 3D environments for embodied AI tasks. It uses multiple generators and critics to iteratively create and refine scenes based on user-specified tasks, ensuring semantic plausibility, visual realism, and physical stability. The resulting environments are diverse, realistic, and directly usable in modern simulators for training policies, which show clear scaling trends and generalization to unseen objects and layouts.

SAGE 是一个基于代理的框架，用于为具身AI任务生成逼真的3D环境。它通过多个生成器和批评者迭代创建和优化场景，确保语义合理性、视觉真实性和物理稳定性。生成的环境多样且逼真，可以直接在现代模拟器中用于训练策略；这些策略表现出明显的扩展趋势，并能泛化到未见过的对象和布局。
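
The generator-and-critic refinement loop described in the SAGE abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `Scene` class, the two critics, and the repair rules are hypothetical stand-ins, not SAGE's actual components or API.

```python
from dataclasses import dataclass, field


@dataclass
class Scene:
    # Hypothetical scene state: object name -> gap (metres) above its support.
    gaps: dict = field(default_factory=dict)


def semantic_critic(scene, required):
    """Return task-required objects that are missing from the scene."""
    return [name for name in required if name not in scene.gaps]


def physics_critic(scene, tol=1e-3):
    """Return objects that float above their support by more than tol."""
    return [name for name, gap in scene.gaps.items() if gap > tol]


def refine(scene, required, max_rounds=5):
    """Alternate critique and repair until both critics pass."""
    for round_idx in range(max_rounds):
        missing = semantic_critic(scene, required)
        floating = physics_critic(scene)
        if not missing and not floating:
            return scene, round_idx
        for name in missing:
            # Toy "generator": add the object, deliberately slightly floating.
            scene.gaps[name] = 0.05
        for name in floating:
            # Toy "repair": settle the object onto its support.
            scene.gaps[name] = 0.0
    return scene, max_rounds


scene, rounds = refine(Scene({"table": 0.0}), ["table", "bowl"])
print(rounds, scene.gaps)
```

In this toy run the first round adds the missing bowl, the second settles it, and the third check passes, so the loop stops once both the semantic and the physical critic are satisfied.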

Quantum Multiple Rotation Averaging

Authors: Shuteng Wang, Natacha Kuete Meli, Michael Möller, Vladislav Golyanik

Venue: International Conference on 3D Vision (3DV) 2026

First: 2026-02-10T18:59:54+00:00 · Latest: 2026-02-10T18:59:54+00:00

Comments: 16 pages, 13 figures, 4 tables; project page: https://4dqv.mpi-inf.mpg.de/QMRA/

Abstract

Multiple rotation averaging (MRA) is a fundamental optimization problem in 3D vision and robotics that aims to recover globally consistent absolute rotations from noisy relative measurements. Established classical methods, such as L1-IRLS and Shonan, face limitations including local minima susceptibility and reliance on convex relaxations that fail to preserve the exact manifold geometry, leading to reduced accuracy in high-noise scenarios. We introduce IQARS (Iterative Quantum Annealing for Rotation Synchronization), the first algorithm that reformulates MRA as a sequence of local quadratic non-convex sub-problems executable on quantum annealers after binarization, to leverage inherent hardware advantages. IQARS removes convex relaxation dependence and better preserves non-Euclidean rotation manifold geometry while leveraging quantum tunneling and parallelism for efficient solution space exploration. We evaluate IQARS's performance on synthetic and real-world datasets. While current annealers remain in their nascent phase and only support solving problems of limited scale with constrained performance, we observed that IQARS on D-Wave annealers can already achieve ca. 12% higher accuracy than Shonan, i.e., the best-performing classical method evaluated empirically.

中文标题/摘要

标题：量子多重旋转平均

多重旋转平均（MRA）是3D视觉和机器人技术中的一项基本优化问题，旨在从噪声相对测量中恢复全局一致的绝对旋转。现有的经典方法，如L1-IRLS和Shonan，存在局部极小值的倾向，并依赖于无法精确保持流形几何的凸松弛，导致在高噪声场景中准确性降低。我们提出了IQARS（迭代量子退火旋转同步），这是第一个将MRA重新表述为可在量子退火器上执行的二元化后的局部二次非凸子问题序列的算法，以利用硬件固有的优势。IQARS消除了对凸松弛的依赖，更好地保持了非欧几里得旋转流形的几何结构，同时利用量子隧道效应和并行性高效探索解空间。我们在合成和真实数据集上评估了IQARS的性能。尽管当前的退火器仍处于初级阶段，仅支持解决有限规模且性能受限的问题，但我们在D-Wave退火器上观察到，IQARS的准确性比Shonan（评估中表现最佳的经典方法）高出约12%。

Summary / 总结

The research addresses the limitations of classical methods for multiple rotation averaging (MRA) by introducing IQARS, an algorithm that reformulates MRA as a sequence of local quadratic non-convex sub-problems that, after binarization, can be solved on quantum annealers. IQARS removes the dependence on convex relaxation and better preserves the non-Euclidean geometry of the rotation manifold. The method is evaluated on synthetic and real-world datasets; on D-Wave annealers it achieves approximately 12% higher accuracy than Shonan, the best-performing classical method in the evaluation, despite the limited scale and performance of current quantum annealers.

论文通过引入IQARS，将多重旋转平均（MRA）问题重新表述为二值化后可在量子退火器上求解的局部二次非凸子问题序列，从而避免了凸松弛，并更好地保持了旋转流形的非欧几里得几何结构。该方法在合成和真实世界数据集上进行了评估；在D-Wave退火器上，其准确性比评估中表现最佳的经典方法Shonan高出约12%。
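
To make the rotation averaging problem concrete, the following is a minimal classical sketch on planar rotations (single angles), assuming noise-free relative measurements. It is not IQARS: the discrete choice among three candidate updates per node only loosely mirrors the idea of solving a small discrete local sub-problem, which IQARS would hand to a quantum annealer after binarization. The function names and the step-halving schedule are illustrative assumptions.

```python
import math


def wrap(angle):
    """Wrap an angle to the interval (-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


def cost(theta, edges):
    """Sum of squared residuals over measurements (i, j, rel), rel ~ theta[j] - theta[i]."""
    return sum(wrap(theta[j] - theta[i] - rel) ** 2 for i, j, rel in edges)


def local_search(theta, edges, step=0.5, rounds=200):
    """Per-node discrete updates; the step is halved when no node moves."""
    theta = list(theta)
    for _ in range(rounds):
        moved = False
        for k in range(1, len(theta)):  # node 0 stays fixed to remove the global gauge freedom
            def cost_with(delta):
                trial = theta[:k] + [theta[k] + delta] + theta[k + 1:]
                return cost(trial, edges)

            # Discrete local sub-problem: stay, step back, or step forward.
            best = min((0.0, -step, step), key=cost_with)
            if best != 0.0:
                theta[k] += best
                moved = True
        if not moved:
            step /= 2.0
    return theta


# Ground truth [0.0, 0.5, 1.0] expressed as relative measurements.
edges = [(0, 1, 0.5), (1, 2, 0.5), (0, 2, 1.0)]
estimate = local_search([0.0, 0.0, 0.0], edges)
print([round(t, 3) for t in estimate], cost(estimate, edges))
```

Real MRA operates on 3D rotations in SO(3), where the manifold geometry that the abstract emphasizes becomes essential; the planar case is used here only to keep the sketch dependency-free.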

ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation

Authors: Mingyang Wu, Ashirbad Mishra, Soumik Dey, Shuo Xing, Naveen Ravipati, Hansi Wu, Binbin Li, Zhengzhong Tu

First: 2026-02-10T18:59:51+00:00 · Latest: 2026-02-10T18:59:51+00:00

Comments: Project page: https://myangwu.github.io/ConsID-Gen

Abstract

Image-to-Video generation (I2V) animates a static image into a temporally coherent video sequence following textual instructions, yet preserving fine-grained object identity under changing viewpoints remains a persistent challenge. Unlike text-to-video models, existing I2V pipelines often suffer from appearance drift and geometric distortion, artifacts we attribute to the sparsity of single-view 2D observations and weak cross-modal alignment. Here we address this problem from both data and model perspectives. First, we curate ConsIDVid, a large-scale object-centric dataset built with a scalable pipeline for high-quality, temporally aligned videos, and establish ConsIDVid-Bench, where we present a novel benchmarking and evaluation framework for multi-view consistency using metrics sensitive to subtle geometric and appearance deviations. We further propose ConsID-Gen, a view-assisted I2V generation framework that augments the first frame with unposed auxiliary views and fuses semantic and structural cues via a dual-stream visual-geometric encoder as well as a text-visual connector, yielding unified conditioning for a Diffusion Transformer backbone. Experiments across ConsIDVid-Bench demonstrate that ConsID-Gen consistently outperforms in multiple metrics, with the best overall performance surpassing leading video generation models like Wan2.1 and HunyuanVideo, delivering superior identity fidelity and temporal coherence under challenging real-world scenarios. We will release our model and dataset at https://myangwu.github.io/ConsID-Gen.

中文标题/摘要

标题：ConsID-Gen: 视图一致且身份保留的图像到视频生成

图像到视频生成（I2V）根据文本指令将静态图像动画化为时间上连贯的视频序列，但在视角变化下保持细粒度的对象身份仍然是一项持续的挑战。与文本到视频模型不同，现有的I2V管道往往遭受外观漂移和几何失真等缺陷，我们认为这些缺陷源于单视角2D观测的稀疏性和跨模态对齐的薄弱性。我们从数据和模型两个方面来解决这个问题。首先，我们构建了ConsIDVid，这是一个大规模的以对象为中心的数据集，使用可扩展的管道构建高质量、时间对齐的视频，并建立了ConsIDVid-Bench，其中我们提出了一种新的多视角一致性基准测试和评估框架，使用对细微几何和外观偏差敏感的度量标准。我们进一步提出了ConsID-Gen，这是一种视图辅助的I2V生成框架，它用无位姿标注的辅助视图增强第一帧，并通过双流视觉-几何编码器以及文本-视觉连接器融合语义和结构线索，为扩散Transformer骨干网络提供统一的条件。ConsIDVid-Bench上的实验表明，ConsID-Gen在多个指标上一致优于其他模型，整体性能最佳，超越了Wan2.1和HunyuanVideo等领先的视频生成模型，在具有挑战性的现实场景中提供更好的身份保真度和时间连贯性。我们将在 https://myangwu.github.io/ConsID-Gen 发布我们的模型和数据集。

Summary / 总结

The research addresses the challenge of preserving object identity in image-to-video generation while maintaining temporal coherence under changing viewpoints. It introduces ConsID-Gen, a view-assisted generation framework that augments the first frame with unposed auxiliary views and uses a dual-stream visual-geometric encoder and a text-visual connector to condition a Diffusion Transformer. Experiments show that ConsID-Gen outperforms existing models such as Wan2.1 and HunyuanVideo on multiple metrics, achieving superior identity fidelity and temporal coherence. The work also presents ConsIDVid, a large-scale object-centric video dataset, and ConsIDVid-Bench, a benchmark for evaluating multi-view consistency in I2V generation.

研究旨在解决图像到视频生成在视角变化下难以保持物体身份和时间连贯性的问题。提出了一种视图辅助的生成框架ConsID-Gen，该框架使用双流视觉-几何编码器和文本-视觉连接器为扩散Transformer提供条件。实验表明，ConsID-Gen 在多项指标上优于Wan2.1和HunyuanVideo等现有模型，在具有挑战性的真实世界场景中具有更好的身份保真度和时间连贯性。

Olaf-World: Orienting Latent Actions for Video World Modeling

Authors: Yuxin Jiang, Yuchao Gu, Ivor W. Tsang, Mike Zheng Shou

First: 2026-02-10T18:58:41+00:00 · Latest: 2026-02-10T18:58:41+00:00

Comments: Project page: https://showlab.github.io/Olaf-World/ Code: https://github.com/showlab/Olaf-World

Abstract

Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across contexts: they entangle scene-specific cues and lack a shared coordinate system. This occurs because standard objectives operate only within each clip, providing no mechanism to align action semantics across contexts. Our key insight is that although actions are unobserved, their semantic effects are observable and can serve as a shared reference. We introduce SeqΔ-REPA, a sequence-level control-effect alignment objective that anchors integrated latent action to temporal feature differences from a frozen, self-supervised video encoder. Building on this, we present Olaf-World, a pipeline that pretrains action-conditioned video world models from large-scale passive video. Extensive experiments demonstrate that our method learns a more structured latent action space, leading to stronger zero-shot action transfer and more data-efficient adaptation to new control interfaces than state-of-the-art baselines.

中文标题/摘要

标题：Olaf-World：为视频世界建模定向潜在动作

扩展动作可控的世界模型受到动作标签稀缺性的限制。虽然潜在动作学习有望从未标记的视频中提取控制接口，但学习到的潜在动作往往无法在不同上下文之间迁移：它们会纠缠场景特定的线索，缺乏共享的坐标系统。这是因为标准目标仅在每个片段内操作，无法提供在不同上下文中对齐动作语义的机制。我们的关键见解是，尽管动作未被观察到，但它们的语义效果是可以观察到的，并可以作为共享参考。我们引入了SeqΔ-REPA，这是一种序列级控制效果对齐目标，将累积的潜在动作锚定到冻结的自监督视频编码器的时序特征差异上。在此基础上，我们提出了Olaf-World，一种从大规模被动视频中预训练动作条件化视频世界模型的流水线。广泛的实验表明，与最先进的基线相比，我们的方法学习到一个更结构化的潜在动作空间，从而实现更强的零样本动作迁移，并能以更高的数据效率适应新的控制接口。

Summary / 总结

The research addresses the challenge of scaling action-controllable world models by introducing SeqΔ-REPA, a sequence-level control-effect alignment objective. This method aligns latent actions to temporal feature differences from a frozen, self-supervised video encoder, enabling better transfer of actions across contexts. Experiments show that Olaf-World, a pipeline using this approach, learns a more structured latent action space, improving zero-shot action transfer and data efficiency compared to existing methods.

研究旨在通过解决动作标签稀缺的问题来扩展动作可控的世界模型。它引入了SeqΔ-REPA，一种序列级控制效果对齐目标，以在不同上下文之间对齐潜在动作。该方法使用大规模被动视频数据预训练动作条件下的视频世界模型，从而获得一个更结构化的潜在动作空间，能够实现更好的零样本动作转移和更高效的对新控制界面的适应，优于现有方法。
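
One way to picture a sequence-level control-effect alignment objective like the one in the Olaf-World abstract is as a similarity loss between the summed latent actions of a clip and the feature change observed by a frozen encoder. The sketch below assumes a cosine-based loss and plain Python lists for vectors; the paper's actual formulation of SeqΔ-REPA may differ, and all function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def seq_delta_alignment_loss(latent_actions, feat_first, feat_last):
    """1 - cosine between integrated latent actions and the temporal feature difference.

    latent_actions: per-step latent action vectors for one clip.
    feat_first, feat_last: frozen-encoder features of the first and last frame.
    """
    integrated = [sum(column) for column in zip(*latent_actions)]
    delta = [last - first for first, last in zip(feat_first, feat_last)]
    return 1.0 - cosine(integrated, delta)


aligned = seq_delta_alignment_loss([[1.0, 0.0], [1.0, 0.0]], [0.0, 0.0], [2.0, 0.0])
orthogonal = seq_delta_alignment_loss([[1.0, 0.0], [1.0, 0.0]], [0.0, 0.0], [0.0, 3.0])
print(aligned, orthogonal)
```

Because the reference comes from a frozen encoder shared across clips, latent actions from different contexts are pushed toward the same coordinate system, which is the property the abstract identifies as missing from per-clip objectives.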

VideoWorld 2: Learning Transferable Knowledge from Real-world Videos

Authors: Zhongwei Ren, Yunchao Wei, Xiao Yu, Guixun Luo, Yao Zhao, Bingyi Kang, Jiashi Feng, Xiaojie Jin

First: 2026-02-10T18:58:19+00:00 · Latest: 2026-02-10T18:58:19+00:00

Comments: Code and models are released at: https://maverickren.github.io/VideoWorld2.github.io/

Abstract

Learning transferable knowledge from unlabeled video data and applying it in new environments is a fundamental capability of intelligent agents. This work presents VideoWorld 2, which extends VideoWorld and offers the first investigation into learning transferable knowledge directly from raw real-world videos. At its core, VideoWorld 2 introduces a dynamic-enhanced Latent Dynamics Model (dLDM) that decouples action dynamics from visual appearance: a pretrained video diffusion model handles visual appearance modeling, enabling the dLDM to learn latent codes that focus on compact and meaningful task-related dynamics. These latent codes are then modeled autoregressively to learn task policies and support long-horizon reasoning. We evaluate VideoWorld 2 on challenging real-world handcraft making tasks, where prior video generation and latent-dynamics models struggle to operate reliably. Remarkably, VideoWorld 2 achieves up to 70% improvement in task success rate and produces coherent long execution videos. In robotics, we show that VideoWorld 2 can acquire effective manipulation knowledge from the Open-X dataset, which substantially improves task performance on CALVIN. This study reveals the potential of learning transferable world knowledge directly from raw videos, with all code, data, and models to be open-sourced for further research.

中文标题/摘要

标题：VideoWorld 2：从真实世界视频中学习可迁移的知识

从未标记的视频数据中学习可迁移知识并将其应用于新环境，是智能代理的一项基本能力。这项工作介绍了VideoWorld 2，它扩展了VideoWorld，并首次研究了直接从原始真实世界视频中学习可迁移知识。其核心在于引入了一种动态增强的潜在动力学模型（dLDM），该模型将动作动力学与视觉外观分离：预训练的视频扩散模型处理视觉外观建模，使dLDM能够学习专注于紧凑且有意义的任务相关动力学的潜在代码。这些潜在代码随后被自回归建模以学习任务策略并支持长期推理。我们在具有挑战性的真实世界手工制作任务上评估了VideoWorld 2，而先前的视频生成和潜在动力学模型在这些任务上难以可靠地运行。值得注意的是，VideoWorld 2在任务成功率上提高了高达70%，并生成了连贯的长执行视频。在机器人学中，我们展示了VideoWorld 2可以从Open-X数据集中获得有效的操作知识，这显著提高了在CALVIN上的任务性能。这项研究揭示了直接从原始视频中学习可迁移世界知识的潜力，所有代码、数据和模型都将开源以供进一步研究。

Summary / 总结

VideoWorld 2 aims to learn transferable knowledge from raw real-world videos and apply it in new environments. It introduces a dynamic-enhanced Latent Dynamics Model (dLDM) that decouples action dynamics from visual appearance, allowing the model to focus on task-related dynamics. Evaluation on handcraft making tasks shows VideoWorld 2 improves task success rate by up to 70% and produces coherent long execution videos. In robotics, it enhances task performance on CALVIN by acquiring effective manipulation knowledge from the Open-X dataset.

研究旨在开发一种能够从原始真实世界视频中学习可转移知识并应用于新环境的智能代理。VideoWorld 2引入了一种动态增强的潜在动力模型（dLDM），将动作动力学与视觉外观分离，使模型能够专注于任务相关的动力学。该模型在复杂的现实世界手工艺任务中实现了高达70%的任务成功率的提升，并生成了连贯的长执行视频。在机器人领域，VideoWorld 2通过从Open-X数据集中获取有效的操作知识，显著提高了CALVIN上的任务性能。

VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model

Authors: Jingwen Sun, Wenyao Zhang, Zekun Qi, Shaojie Ren, Zezhi Liu, Hanxin Zhu, Guangzhong Sun, Xin Jin, Zhibo Chen

First: 2026-02-10T18:58:01+00:00 · Latest: 2026-02-10T18:58:01+00:00

Abstract

Pretraining Vision-Language-Action (VLA) policies on internet-scale video is appealing, yet current latent-action objectives often learn the wrong thing: they remain anchored to pixel variation rather than action-relevant state transitions, making them vulnerable to appearance bias, nuisance motion, and information leakage. We introduce VLA-JEPA, a JEPA-style pretraining framework that sidesteps these pitfalls by design. The key idea is leakage-free state prediction: a target encoder produces latent representations from future frames, while the student pathway sees only the current observation; future information is used solely as supervision targets, never as input. By predicting in latent space rather than pixel space, VLA-JEPA learns dynamics abstractions that are robust to camera motion and irrelevant background changes. This yields a simple two-stage recipe (JEPA pretraining followed by action-head fine-tuning) without the multi-stage complexity of prior latent-action pipelines. Experiments on LIBERO, LIBERO-Plus, SimplerEnv and real-world manipulation tasks show that VLA-JEPA achieves consistent gains in generalization and robustness over existing methods.

中文标题/摘要

标题：VLA-JEPA: 通过潜在世界模型增强视觉-语言-行动模型

在互联网规模的视频上预训练视觉-语言-行动（VLA）策略很有吸引力，但当前的潜在行动目标往往学到错误的东西：它们仍然锚定在像素变化上，而不是行动相关的状态转换，这使它们容易受到外观偏差、无关运动和信息泄露的影响。我们引入了VLA-JEPA，这是一种JEPA风格的预训练框架，从设计上避免了这些陷阱。关键思想是“无泄漏状态预测”：目标编码器从未来帧中生成潜在表示，而学生路径仅看到当前观察；未来信息仅用作监督目标，从不作为输入。通过在潜在空间而不是像素空间中进行预测，VLA-JEPA学会了对摄像机运动和无关背景变化具有鲁棒性的动力学抽象。这带来了一个简单的两阶段方案（先进行JEPA预训练，再进行动作头微调），而无需先前潜在行动管道的多阶段复杂性。在LIBERO、LIBERO-Plus、SimplerEnv和真实世界的操作任务上的实验表明，VLA-JEPA相比现有方法在泛化和鲁棒性上取得了一致的提升。

Summary / 总结

VLA-JEPA is designed to enhance VLA models by addressing the limitations of current latent-action objectives, which are prone to appearance bias, nuisance motion, and information leakage. It introduces leakage-free state prediction: a target encoder produces latent representations from future frames, which serve only as supervision targets, while the student pathway sees only the current observation. Experiments on LIBERO, LIBERO-Plus, SimplerEnv, and real-world manipulation tasks show consistent gains in generalization and robustness over existing methods.

研究针对VLA模型因像素变化学习错误动作的问题，提出了一种JEPA风格的VLA-JEPA框架，通过使用无泄漏状态预测来避免外观偏差和多余运动。通过在潜在空间中进行预测，VLA-JEPA增强了鲁棒性和泛化能力，在包括LIBERO、LIBERO-Plus、SimplerEnv和真实世界操作任务等多种任务中实现了对现有方法的一致改进。
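
The leakage-free state prediction idea from the VLA-JEPA abstract can be sketched with toy linear encoders. This is a minimal illustration under stated assumptions: the linear `encode` function, the separate predictor weights, and the mean-squared latent loss are hypothetical simplifications, and the paper's architecture and loss may differ.

```python
def encode(vector, weights):
    """Toy linear map: multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def jepa_loss(current, future, student_w, predictor_w, target_w):
    """Mean squared error between predicted and target latents.

    The student pathway receives only `current`; `future` enters solely
    through the target encoder, whose output is treated as a fixed target.
    """
    z_current = encode(current, student_w)      # student sees the current frame only
    z_predicted = encode(z_current, predictor_w)  # prediction made in latent space
    z_target = encode(future, target_w)         # supervision target from the future frame
    return sum((p - t) ** 2 for p, t in zip(z_predicted, z_target)) / len(z_target)


identity = [[1.0, 0.0], [0.0, 1.0]]
print(jepa_loss([1.0, 0.0], [1.0, 0.0], identity, identity, identity))
print(jepa_loss([1.0, 0.0], [0.0, 1.0], identity, identity, identity))
```

The point of the structure is visible in the signature: there is no code path by which the future frame reaches the student or predictor, so the model cannot shortcut the prediction by copying its input.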

Causality in Video Diffusers is Separable from Denoising

Authors: Xingjian Bai, Guande He, Zhengqi Li, Eli Shechtman, Xun Huang, Zongze Wu

First: 2026-02-10T18:57:21+00:00 · Latest: 2026-02-10T18:57:21+00:00

Abstract

Causality -- referring to temporal, uni-directional cause-effect relationships between components -- underlies many complex generative processes, including videos, language, and robot trajectories. Current causal diffusion models entangle temporal reasoning with iterative denoising, applying causal attention across all layers, at every denoising step, and over the entire context. In this paper, we show that the causal reasoning in these models is separable from the multi-step denoising process. Through systematic probing of autoregressive video diffusers, we uncover two key regularities: (1) early layers produce highly similar features across denoising steps, indicating redundant computation along the diffusion trajectory; and (2) deeper layers exhibit sparse cross-frame attention and primarily perform intra-frame rendering. Motivated by these findings, we introduce Separable Causal Diffusion (SCD), a new architecture that explicitly decouples once-per-frame temporal reasoning, via a causal transformer encoder, from multi-step frame-wise rendering, via a lightweight diffusion decoder. Extensive experiments on both pretraining and post-training tasks across synthetic and real benchmarks show that SCD significantly improves throughput and per-frame latency while matching or surpassing the generation quality of strong causal diffusion baselines.

中文标题/摘要

标题：视频扩散器中的因果性可与去噪分离

因果性（指组件之间时间上单向的因果关系）是许多复杂生成过程的基础，包括视频、语言和机器人轨迹。当前的因果扩散模型将时间推理与迭代去噪交织在一起，在所有层、每个去噪步骤以及整个上下文中应用因果注意力。在本文中，我们展示了这些模型中的因果推理可以与多步去噪过程分离。通过对自回归视频扩散器的系统探查，我们发现了两个关键规律：(1) 早期层在各去噪步骤中产生高度相似的特征，表明沿扩散轨迹存在冗余计算；(2) 深层表现出稀疏的跨帧注意力，并主要执行帧内渲染。受这些发现的启发，我们引入了可分离因果扩散（SCD），这是一种新的架构，显式地将每帧一次的时间推理（通过因果Transformer编码器）与多步逐帧渲染（通过轻量级扩散解码器）解耦。在合成和真实基准的预训练和后训练任务上的广泛实验表明，SCD 显著提高了吞吐量并降低了每帧延迟，同时能够匹配或超越强因果扩散基线的生成质量。

Summary / 总结

This paper investigates the separability of causality from denoising in video diffusers. Through systematic analysis, it reveals that early layers in these models perform redundant computations, while deeper layers focus on intra-frame rendering. Motivated by these findings, the authors propose Separable Causal Diffusion (SCD), which decouples temporal reasoning from denoising. Experiments show that SCD enhances throughput and reduces per-frame latency without compromising generation quality compared to existing causal diffusion models.

该论文研究了视频扩散器中的因果推理与去噪过程的可分离性。通过对自回归视频扩散器进行探查，作者发现早期层在各去噪步骤中产生相似的特征，而深层主要进行帧内渲染。基于这些观察，他们提出了可分离因果扩散（SCD），该模型将每帧一次的时序推理与多步渲染过程分离。实验表明，SCD在提高吞吐量并降低每帧延迟的同时，能够保持或超越现有因果扩散模型的生成质量。
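
The decoupling that Separable Causal Diffusion describes, temporal reasoning once per frame and rendering over many denoising steps, can be shown as a control-flow sketch. The encoder and decoder below are toy scalar functions chosen only so that the call counts are observable; they are assumptions for illustration and say nothing about the real SCD networks.

```python
def generate(num_frames, denoise_steps, encoder, decoder):
    """Autoregressive generation with one encoder call and many decoder calls per frame."""
    context, frames = [], []
    for _ in range(num_frames):
        cond = encoder(context)          # once per frame: causal temporal reasoning
        x = 0.0                          # stand-in for the initial noise sample
        for step in range(denoise_steps):
            x = decoder(x, cond, step)   # many per frame: lightweight rendering
        frames.append(x)
        context.append(x)                # finished frame joins the causal context
    return frames


calls = {"encoder": 0, "decoder": 0}


def toy_encoder(context):
    calls["encoder"] += 1
    return sum(context) / len(context) if context else 0.0


def toy_decoder(x, cond, step):
    calls["decoder"] += 1
    return x + 0.5 * (cond + 1.0 - x)


frames = generate(num_frames=3, denoise_steps=4, encoder=toy_encoder, decoder=toy_decoder)
print(calls)
```

In a conventional causal diffusion model the expensive causal attention would run inside the inner loop; moving it to the outer loop is where the throughput and latency gains reported in the abstract would come from.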

4RC: 4D Reconstruction via Conditional Querying Anytime and Anywhere

Authors: Yihang Luo, Shangchen Zhou, Yushi Lan, Xingang Pan, Chen Change Loy

First: 2026-02-10T18:57:04+00:00 · Latest: 2026-02-10T18:57:04+00:00

Comments: Project page: https://yihangluo.com/projects/4RC/

Abstract

We present 4RC, a unified feed-forward framework for 4D reconstruction from monocular videos. Unlike existing approaches that typically decouple motion from geometry or produce limited 4D attributes such as sparse trajectories or two-view scene flow, 4RC learns a holistic 4D representation that jointly captures dense scene geometry and motion dynamics. At its core, 4RC introduces a novel encode-once, query-anywhere and anytime paradigm: a transformer backbone encodes the entire video into a compact spatio-temporal latent space, from which a conditional decoder can efficiently query 3D geometry and motion for any query frame at any target timestamp. To facilitate learning, we represent per-view 4D attributes in a minimally factorized form by decomposing them into base geometry and time-dependent relative motion. Extensive experiments demonstrate that 4RC outperforms prior and concurrent methods across a wide range of 4D reconstruction tasks.

中文标题/摘要

标题：4RC：通过条件查询随时随地进行4D重建

我们提出了4RC，这是一种统一的前馈框架，用于从单目视频中进行4D重建。与现有方法通常将运动与几何分离或仅生成有限的4D属性（如稀疏轨迹或两视图场景流）不同，4RC学习了一个整体的4D表示，可以同时捕捉密集的场景几何和运动动力学。4RC的核心是一种新颖的一次编码、随时随地查询的范式：Transformer骨干网络将整个视频编码到一个紧凑的时空潜在空间中，条件解码器可以从中高效地查询任何查询帧在任何目标时间戳的3D几何和运动。为了促进学习，我们将每个视图的4D属性分解为基本几何和时间依赖的相对运动，以最小因子化的形式表示。广泛的实验表明，4RC在多种4D重建任务中均优于先前和同期的方法。

Summary / 总结

4RC is a unified feed-forward framework for 4D reconstruction from monocular videos. It learns a holistic 4D representation that captures dense scene geometry and motion dynamics, unlike previous methods that decouple motion from geometry or produce limited 4D attributes. 4RC uses a transformer backbone to encode the entire video into a compact spatio-temporal latent space, allowing efficient querying of 3D geometry and motion for any frame and timestamp. Experiments show that 4RC outperforms prior and concurrent methods in various 4D reconstruction tasks.

4RC 是一种统一的前馈框架，用于从单目视频进行4D重建。它学习一个同时捕捉密集场景几何和运动动态的整体4D表示，不同于以往方法将运动与几何分离或仅生成稀疏轨迹。关键方法是将整个视频编码到一个紧凑的时空潜在空间中，并使用条件解码器查询任意帧在任意时间戳的3D几何和运动。实验表明，4RC 在各种4D重建任务中优于现有方法。

Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning

Authors: Zhaoyang Wang, Canwen Xu, Boyi Liu, Yite Wang, Siwei Han, Zhewei Yao, Huaxiu Yao, Yuxiong He

First: 2026-02-10T18:55:41+00:00 · Latest: 2026-02-10T18:55:41+00:00

Comments: 41 pages

Abstract

Recent advances in large language models (LLMs) have empowered autonomous agents to perform complex tasks that require multi-turn interactions with tools and environments. However, scaling such agent training is limited by the lack of diverse and reliable environments. In this paper, we propose Agent World Model (AWM), a fully synthetic environment generation pipeline. Using this pipeline, we scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets (35 tools per environment on average) and obtain high-quality observations. Notably, these environments are code-driven and backed by databases, providing more reliable and consistent state transitions than environments simulated by LLMs. Moreover, they enable more efficient agent interaction compared with collecting trajectories from realistic environments. To demonstrate the effectiveness of this resource, we perform large-scale reinforcement learning for multi-turn tool-use agents. Thanks to the fully executable environments and accessible database states, we can also design reliable reward functions. Experiments on three benchmarks show that training exclusively in synthetic environments, rather than benchmark-specific ones, yields strong out-of-distribution generalization. The code is available at https://github.com/Snowflake-Labs/agent-world-model.

中文标题/摘要

标题：代理世界模型：面向智能体强化学习的无限合成环境

大型语言模型（LLM）的最新进展使自主代理能够执行需要与工具和环境进行多轮交互的复杂任务。然而，这种代理训练的扩展受到缺乏多样性和可靠环境的限制。在本文中，我们提出了代理世界模型（AWM），这是一种完全合成环境生成管道。使用此管道，我们扩展到涵盖日常场景的1,000个环境，在这些环境中，代理可以与丰富的工具集（每个环境平均35种工具）进行交互并获得高质量的观察结果。值得注意的是，这些环境是通过代码驱动并依托数据库的，提供了比LLM模拟环境更可靠和一致的状态转换。此外，它们与从现实环境中收集轨迹相比，使代理交互更加高效。为了展示此资源的有效性，我们对多轮工具使用代理进行了大规模强化学习。得益于完全可执行的环境和可访问的数据库状态，我们还可以设计可靠的奖励函数。在三个基准上的实验表明，仅在合成环境中训练，而不是在特定基准环境中训练，可以实现更强的分布外泛化。代码可在https://github.com/Snowflake-Labs/agent-world-model/ 获取。

Summary / 总结

The paper addresses the lack of diverse and reliable environments for scaling agent training by proposing Agent World Model (AWM), a fully synthetic environment generation pipeline. AWM generates 1,000 code-driven, database-backed environments with rich toolsets and reliable state transitions, enabling efficient agent interaction and reliable reward design. Experiments on three benchmarks show that training exclusively in these synthetic environments yields strong out-of-distribution generalization, demonstrating the effectiveness of AWM for agentic reinforcement learning.

研究旨在通过提出Agent World Model (AWM) 合成环境生成管道来解决在复杂环境中扩展代理训练的挑战。AWM 生成了1,000个多样化的环境，配备了丰富的工具集，使代理能够进行多轮交互并获得高质量的观察。实验表明，在这些合成环境中进行训练可以实现三个基准上的强泛化性能，证明了该资源的有效性。代码可在GitHub上获得。
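
The AWM abstract's point that code-driven, database-backed environments give consistent state transitions and checkable rewards can be illustrated with a tiny example built on Python's standard `sqlite3` module. The `TodoEnv` class, its two tools, and its reward rule are hypothetical; they are not taken from the AWM codebase.

```python
import sqlite3


class TodoEnv:
    """Toy tool-use environment whose state lives entirely in a database."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE tasks (id INTEGER PRIMARY KEY, title TEXT, done INTEGER DEFAULT 0)"
        )
        self.db.execute("INSERT INTO tasks (title) VALUES ('buy milk')")
        self.db.commit()

    def list_tasks(self):
        """Tool: return all tasks as (id, title, done) rows."""
        return self.db.execute("SELECT id, title, done FROM tasks ORDER BY id").fetchall()

    def complete_task(self, task_id):
        """Tool: mark a task as done and report how many rows changed."""
        cursor = self.db.execute("UPDATE tasks SET done = 1 WHERE id = ?", (task_id,))
        self.db.commit()
        return {"updated": cursor.rowcount}

    def reward(self):
        """Reward read directly from database state: 1.0 when no task remains open."""
        (remaining,) = self.db.execute(
            "SELECT COUNT(*) FROM tasks WHERE done = 0"
        ).fetchone()
        return 1.0 if remaining == 0 else 0.0


env = TodoEnv()
print(env.list_tasks(), env.reward())
print(env.complete_task(1), env.reward())
```

Because the reward is a query over the final database state rather than a judgment produced by a language model, the same trajectory always yields the same reward, which is the reliability property the abstract highlights.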

Story-Iter: A Training-free Iterative Paradigm for Long Story Visualization

Authors: Jiawei Mao, Xiaoke Huang, Yunfei Xie, Yuanqi Chang, Mude Hui, Bingjie Xu, Zeyu Zheng, Zirui Wang, Cihang Xie, Yuyin Zhou

First: 2024-10-08T17:59:30+00:00 · Latest: 2026-02-10T18:53:34+00:00

Comments: 31 pages, 33 figures, The project page and associated code can be accessed via https://jwmao1.github.io/storyiter/

Abstract

This paper introduces Story-Iter, a new training-free iterative paradigm to enhance long-story generation. Unlike existing methods that rely on fixed reference images to construct a complete story, our approach features a novel external iterative paradigm, extending beyond the internal iterative denoising steps of diffusion models, to continuously refine each generated image by incorporating all reference images from the previous round. To achieve this, we propose a plug-and-play, training-free global reference cross-attention (GRCA) module, modeling all reference frames with global embeddings, ensuring semantic consistency in long sequences. By progressively incorporating holistic visual context and text constraints, our iterative paradigm enables precise generation with fine-grained interactions, optimizing the story visualization step-by-step. Extensive experiments on the official story visualization dataset and our long story benchmark demonstrate that Story-Iter achieves state-of-the-art performance in long-story visualization (up to 100 frames), excelling in both semantic consistency and fine-grained interactions.

中文标题/摘要

标题：Story-Iter：一种用于长故事可视化的无需训练迭代范式

本文介绍了Story-Iter，一种新的无需训练的迭代范式，用于增强长故事生成。与现有方法依赖固定参考图像构建完整故事不同，我们的方法采用了一种新颖的外部迭代范式，超越了扩散模型内部的去噪迭代步骤，通过结合上一轮所有参考图像来不断细化生成的每一幅图像。为此，我们提出了一种即插即用、无需训练的全局参考交叉注意力（GRCA）模块，使用全局嵌入表示所有参考帧，确保长序列中的语义一致性。通过逐步引入整体视觉上下文和文本约束，我们的迭代范式能够实现精确生成，逐步优化故事可视化。在官方故事可视化数据集和我们的长故事基准中的广泛实验表明，Story-Iter在长故事可视化（多达100帧）方面的性能处于领先地位，同时在语义一致性和细粒度交互方面表现出色。

Summary / 总结

Story-Iter is a training-free iterative method for enhancing long-story generation. Unlike existing methods that use fixed reference images, Story-Iter continuously refines each generated image by incorporating all reference images from the previous round through a global reference cross-attention module. This approach ensures semantic consistency and fine-grained interactions, leading to state-of-the-art performance in long-story visualization with up to 100 frames.

Story-Iter 是一种无需训练的迭代方法，用于增强长故事生成。不同于现有方法使用固定参考图像，Story-Iter 通过全局参考交叉注意力模块在每轮生成中不断融合上一轮的所有参考图像，确保语义一致性和细粒度交互，从而在长达100帧的故事可视化中达到最先进的性能。

CODE-SHARP: Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs

Authors: Richard Bornemann, Pierluigi Vito Amadori, Antoine Cully

First: 2026-02-10T18:51:39+00:00 · Latest: 2026-02-10T18:51:39+00:00

Comments: Preprint

Abstract

Developing agents capable of open-endedly discovering and learning novel skills is a grand challenge in Artificial Intelligence. While reinforcement learning offers a powerful framework for training agents to master complex skills, it typically relies on hand-designed reward functions. This is infeasible for open-ended skill discovery, where the set of meaningful skills is not known a priori. While recent methods have shown promising results towards automating reward function design, they remain limited to refining rewards for pre-defined tasks. To address this limitation, we introduce Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs (CODE-SHARP), a novel framework leveraging Foundation Models (FM) to open-endedly expand and refine a hierarchical skill archive, structured as a directed graph of executable reward functions in code. We show that a goal-conditioned agent trained exclusively on the rewards generated by the discovered SHARP skills learns to solve increasingly long-horizon goals in the Craftax environment. When composed by a high-level FM-based planner, the discovered skills enable a single goal-conditioned agent to solve complex, long-horizon tasks, outperforming both pretrained agents and task-specific expert policies by over 134% on average. We will open-source our code and provide additional videos at https://sites.google.com/view/code-sharp/homepage.

中文标题/摘要

标题：CODE-SHARP：连续开放发现和演化技能作为层次奖励程序

开发能够以开放式方式不断发现和学习新技能的代理是人工智能领域的重大挑战。虽然强化学习为训练代理掌握复杂技能提供了强大的框架，但它通常依赖于手工设计的奖励函数。对于开放式技能发现而言，由于有意义的技能集事先未知，这种方法是不可行的。虽然最近的方法在自动设计奖励函数方面取得了令人鼓舞的结果，但它们仍然局限于对预定义任务的奖励进行细化。为了解决这一局限，我们提出了连续开放发现和演化技能作为层次奖励程序（CODE-SHARP）的新框架，该框架利用基础模型（FM）以开放式方式扩展和细化层次技能档案，该档案被组织为由可执行的代码奖励函数构成的有向图。我们展示了仅在发现的SHARP技能生成的奖励上进行训练的目标条件代理学会了在Craftax环境中解决时间跨度越来越长的目标。当由基于FM的高层规划器组合时，发现的技能使单个目标条件代理能够解决复杂的长时序任务，平均性能比预训练代理和任务特定专家策略高出134%以上。我们将开源我们的代码，并在此提供额外的视频：https://sites.google.com/view/code-sharp/homepage

Summary / 总结

CODE-SHARP is a framework that uses Foundation Models to open-endedly expand and refine a hierarchical archive of skills, each expressed as an executable reward function in code. A goal-conditioned agent trained exclusively on rewards generated by the discovered skills learns to solve increasingly long-horizon goals in the Craftax environment. When the skills are composed by a high-level FM-based planner, the agent solves complex, long-horizon tasks and outperforms both pretrained agents and task-specific expert policies by over 134% on average.

该论文提出了CODE-SHARP框架，利用层次奖励程序进行开放式的技能发现和进化。该框架利用基础模型扩展和细化技能档案，使智能体能够在没有预定义奖励的情况下学习复杂的长期任务。训练于发现技能的智能体在Craftax环境中平均性能比预训练智能体和任务特定专家策略高出134%以上。
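
A skill archive "structured as a directed graph of executable reward functions in code", as the CODE-SHARP abstract puts it, might look roughly like the sketch below. The `Skill` class, the prerequisite rule, and the state keys `wood` and `table` are assumptions made for illustration; the abstract does not specify how SHARP skills are actually represented or composed.

```python
class Skill:
    """A node in a hypothetical skill graph: a reward check plus prerequisite skills."""

    def __init__(self, name, check, parents=()):
        self.name = name
        self.check = check            # callable: state dict -> bool
        self.parents = tuple(parents)  # edges to prerequisite skills


def skill_reward(skill, state):
    """Return 1.0 only if every prerequisite and the skill's own check are satisfied."""
    if any(skill_reward(parent, state) < 1.0 for parent in skill.parents):
        return 0.0
    return 1.0 if skill.check(state) else 0.0


collect_wood = Skill("collect_wood", lambda s: s.get("wood", 0) >= 1)
place_table = Skill("place_table", lambda s: s.get("table", 0) >= 1, parents=[collect_wood])

print(skill_reward(place_table, {"wood": 1}))
print(skill_reward(place_table, {"wood": 1, "table": 1}))
```

Expressing rewards as code makes each skill directly executable during training, and the graph edges give a planner an explicit ordering over which simpler skills a longer-horizon goal depends on.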

CyberExplorer: Benchmarking LLM Offensive Security Capabilities in a Real-World Attacking Simulation Environment

Authors: Nanda Rani, Kimberly Milner, Minghao Shao, Meet Udeshi, Haoran Xi, Venkata Sai Charan Putrevu, Saksham Aggarwal, Sandeep K. Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Muhammad Shafique, Ramesh Karri

First: 2026-02-08T15:56:22+00:00 · Latest: 2026-02-10T18:48:10+00:00

Abstract

Real-world offensive security operations are inherently open-ended: attackers explore unknown attack surfaces, revise hypotheses under uncertainty, and operate without guaranteed success. Existing LLM-based offensive agent evaluations rely on closed-world settings with predefined goals and binary success criteria. To address this gap, we introduce CyberExplorer, an evaluation suite with two core components: (1) an open-environment benchmark built on a virtual machine hosting 40 vulnerable web services derived from real-world CTF challenges, where agents autonomously perform reconnaissance, target selection, and exploitation without prior knowledge of vulnerability locations; and (2) a reactive multi-agent framework supporting dynamic exploration without predefined plans. CyberExplorer enables fine-grained evaluation beyond flag recovery, capturing interaction dynamics, coordination behavior, failure modes, and vulnerability discovery signals, bridging the gap between benchmarks and realistic multi-target attack scenarios.

中文标题/摘要

标题：CyberExplorer：在实际攻击模拟环境中评估LLM的进攻性安全能力

实际的进攻性安全操作本质上是开放式的：攻击者探索未知的攻击面，在不确定性下修订假设，并且没有保证的成功。现有的基于LLM的进攻性代理评估依赖于封闭世界设置，具有预定义的目标和二元的成功标准。为了解决这一差距，我们引入了CyberExplorer，这是一个评估套件，包含两个核心组件：(1) 一个基于虚拟机的开放环境基准，该虚拟机托管了40个源自真实CTF挑战的漏洞Web服务，代理在没有先验漏洞位置知识的情况下自主进行侦察、目标选择和利用；(2) 一个反应式多代理框架，支持动态探索而无需预定义计划。CyberExplorer使评估超越了旗帜恢复，捕捉交互动态、协调行为、失败模式和漏洞发现信号，填补了基准与现实多目标攻击场景之间的差距。

Summary / 总结

The research aims to evaluate the offensive security capabilities of LLMs in a realistic, open-ended environment. CyberExplorer is an evaluation suite consisting of an open-environment benchmark and a reactive multi-agent framework. In the benchmark, agents autonomously perform reconnaissance, target selection, and exploitation against 40 vulnerable web services derived from real-world CTF challenges, without prior knowledge of vulnerability locations. Rather than scoring flag recovery alone, the suite captures interaction dynamics, coordination behavior, failure modes, and vulnerability discovery signals, narrowing the gap between benchmarks and realistic multi-target attack scenarios.

CyberExplorer 用于评估LLM在开放环境中的进攻性安全能力。它包括一个基于40个源自真实CTF挑战的漏洞Web服务的开放环境基准和一个反应式多代理框架。代理在不知道漏洞位置的情况下自主进行侦察、目标选择和利用；评估不限于旗帜获取，还捕捉交互动态、协调行为、失败模式和漏洞发现信号。

Anagent For Enhancing Scientific Table & Figure Analysis

Authors: Xuehang Guo, Zhiyong Lu, Tom Hope, Qingyun Wang

First: 2026-02-10T18:46:28+00:00 · Latest: 2026-02-10T18:46:28+00:00

Abs · PDF · Code1 · Code2 · Project1

Abstract

In scientific research, analysis requires accurately interpreting complex multimodal knowledge, integrating evidence from different sources, and drawing inferences grounded in domain-specific knowledge. However, current artificial intelligence (AI) systems struggle to consistently demonstrate such capabilities. The complexity and variability of scientific tables and figures, combined with heterogeneous structures and long-context requirements, pose fundamental obstacles to scientific table & figure analysis. To quantify these challenges, we introduce AnaBench, a large-scale benchmark featuring 63,178 instances from nine scientific domains, systematically categorized along seven complexity dimensions. To tackle these challenges, we propose Anagent, a multi-agent framework for enhanced scientific table & figure analysis through four specialized agents: Planner decomposes tasks into actionable subtasks, Expert retrieves task-specific information through targeted tool execution, Solver synthesizes information to generate coherent analysis, and Critic performs iterative refinement through five-dimensional quality assessment. We further develop modular training strategies that leverage supervised finetuning and specialized reinforcement learning to optimize individual capabilities while maintaining effective collaboration. Comprehensive evaluation across 170 subdomains demonstrates that Anagent achieves substantial improvements, up to 13.43% in training-free settings and 42.12% with finetuning, while revealing that task-oriented reasoning and context-aware problem-solving are essential for high-quality scientific table & figure analysis. Our project page: https://xhguo7.github.io/Anagent/.

中文标题/摘要

标题：增强科学表格与图表分析的代理工具

在科学研究中，分析需要准确解读复杂的多模态知识，整合来自不同来源的证据，并基于领域特定知识得出推断。然而，当前的人工智能（AI）系统在持续展示这些能力方面存在困难。科学表格与图表的复杂性和变异性，以及异构结构和长上下文需求，构成了科学表格与图表分析的基本障碍。为了量化这些挑战，我们引入了AnaBench，这是一个包含来自九个科学领域的63,178个实例的大规模基准，系统地按七个复杂维度分类。为应对这些挑战，我们提出了Anagent，这是一种通过四个专门代理增强科学表格与图表分析的多代理框架：规划者将任务分解为可执行的子任务，专家通过有针对性的工具执行检索特定任务信息，解决者综合信息生成连贯的分析，评论家通过五维质量评估进行迭代优化。我们进一步开发了模块化训练策略，利用监督微调和专门的强化学习来优化个体能力，同时保持有效的协作。在170个子领域进行全面评估表明，Anagent实现了显著的改进，在无微调设置中最高可达13.43%的提升，在微调设置中可达42.12%，同时揭示了任务导向的推理和上下文感知的问题解决对于高质量的科学表格与图表分析至关重要。我们的项目页面：https://xhguo7.github.io/Anagent/.

Summary / 总结

The paper addresses the challenge of accurately analyzing complex scientific tables and figures, which current AI systems often struggle with. It introduces AnaBench, a benchmark with 63,178 instances, and proposes Anagent, a multi-agent framework comprising Planner, Expert, Solver, and Critic agents. Anagent shows significant improvements in analysis quality, up to 13.43% in training-free settings and 42.12% with finetuning, highlighting the importance of task-oriented reasoning and context-aware problem-solving in scientific analysis.

论文旨在解决当前AI系统在分析复杂科学表格和图表时遇到的挑战，这些挑战源于其多样性和复杂性。为了量化这些挑战，作者引入了包含63,178个实例的AnaBench基准，这些实例来自九个科学领域。他们提出了Anagent多代理框架，包括分解任务的Planner、检索相关信息的Expert、综合分析的Solver和迭代改进的Critic。Anagent在无训练和有训练的设置中分别展示了高达13.43%和42.12%的分析质量提升，强调了任务导向的推理和上下文感知问题解决在科学分析中的重要性。

Can Image Splicing and Copy-Move Forgery Be Detected by the Same Model? Forensim: An Attention-Based State-Space Approach

Authors: Soumyaroop Nandi, Prem Natarajan

First: 2026-02-10T18:46:04+00:00 · Latest: 2026-02-10T18:46:04+00:00

Abs · PDF · Code1 · Code2

Abstract

We introduce Forensim, an attention-based state-space framework for image forgery detection that jointly localizes both manipulated (target) and source regions. Unlike traditional approaches that rely solely on artifact cues to detect spliced or forged areas, Forensim is designed to capture duplication patterns crucial for understanding context. In scenarios such as protest imagery, detecting only the forged region, for example a duplicated act of violence inserted into a peaceful crowd, can mislead interpretation, highlighting the need for joint source-target localization. Forensim outputs three-class masks (pristine, source, target) and supports detection of both splicing and copy-move forgeries within a unified architecture. We propose a visual state-space model that leverages normalized attention maps to identify internal similarities, paired with a region-based block attention module to distinguish manipulated regions. This design enables end-to-end training and precise localization. Forensim achieves state-of-the-art performance on standard benchmarks. We also release CMFD-Anything, a new dataset addressing limitations of existing copy-move forgery datasets.

中文标题/摘要

标题：图像拼接和复制移动伪造能否由同一模型检测？Forensim：一种基于注意力的状态空间方法

我们介绍了Forensim，一种基于注意力的状态空间框架，用于图像伪造检测，同时定位被篡改（目标）区域和源区域。与传统方法仅依赖于伪迹线索检测拼接或伪造区域不同，Forensim旨在捕捉理解上下文至关重要的复制模式。例如，在抗议图像中，仅检测伪造区域（例如，暴力行为被插入到和平人群中的复制行为）可能会误导解释，突显了联合源-目标定位的必要性。Forensim输出三类掩码（原始、源、目标），并在统一架构中支持拼接和复制移动伪造的检测。我们提出了一种视觉状态空间模型，利用归一化注意力图识别内部相似性，并配以基于区域的块注意力模块以区分篡改区域。这种设计使端到端训练和精确定位成为可能。Forensim在标准基准测试中达到了最先进的性能。我们还发布了CMFD-Anything，一个新数据集，解决了现有复制移动伪造数据集的局限性。

Summary / 总结

Forensim is an attention-based state-space framework for detecting image splicing and copy-move forgery by jointly localizing both manipulated and source regions. Unlike traditional methods that focus on artifact cues, Forensim captures duplication patterns to better understand the context. It outputs three-class masks and supports the detection of both splicing and copy-move forgeries within a unified architecture, achieving state-of-the-art performance on standard benchmarks. The model uses a visual state-space model with normalized attention maps and a region-based block attention module for precise localization. A new dataset, CMFD-Anything, is also introduced to address limitations of existing copy-move forgery datasets.

Forensim 是一种基于注意力的状态空间框架，旨在通过联合定位篡改和源区域来检测拼接和复制移动伪造的图像。不同于传统方法仅依赖于特征，Forensim 捕捉复制模式以提供更准确的上下文。该模型输出三类掩码并支持在统一架构中检测拼接和复制移动伪造。它使用视觉状态空间模型和归一化注意力图以及区域块注意力模块，实现了在基准上的最先进性能。此外，还引入了一个新数据集 CMFD-Anything，以解决现有复制移动伪造数据集的局限性。

Designing Multi-Robot Ground Video Sensemaking with Public Safety Professionals

Authors: Puqi Zhou, Ali Asgarov, Aafiya Hussain, Wonjoon Park, Amit Paudyal, Sameep Shrestha, Chia-wei Tang, Michael F. Lighthiser, Michael R. Hieb, Xuesu Xiao, Chris Thomas, Sungsoo Ray Hong

First: 2026-02-09T16:43:37+00:00 · Latest: 2026-02-10T18:41:40+00:00

Abs · PDF · Code1 · Code2

Abstract

Videos from fleets of ground robots can advance public safety by providing scalable situational awareness and reducing professionals' burden. Yet little is known about how to design and integrate multi-robot videos into public safety workflows. Collaborating with six police agencies, we examined how such videos could be made practical. In Study 1, we presented the first testbed for multi-robot ground video sensemaking. The testbed includes 38 events-of-interest (EoI) relevant to public safety, a dataset of 20 robot patrol videos (10 day/night pairs) covering EoI types, and 6 design requirements aimed at improving current video sensemaking practices. In Study 2, we built MRVS, a tool that augments multi-robot patrol video streams with a prompt-engineered video understanding model. Participants reported reduced manual workload and greater confidence with LLM-based explanations, while noting concerns about false alarms and privacy. We conclude with implications for designing future multi-robot video sensemaking tools.

中文标题/摘要

标题：多机器人地面视频智能分析设计与公共安全专业人员合作

来自地面机器人编队的视频可以提供可扩展的态势感知并减轻专业人员的负担，从而促进公共安全。然而，关于如何设计多机器人视频并将其整合到公共安全工作流程中，人们知之甚少。我们与六家警务机构合作，研究了如何使这些视频变得实用。在研究1中，我们提出了第一个多机器人地面视频智能分析测试床。该测试床包括38个与公共安全相关的关注事件（EoI）、一个涵盖各类EoI的20个机器人巡逻视频（10组昼夜配对）的数据集，以及6项旨在改进当前视频智能分析实践的设计要求。在研究2中，我们构建了MRVS工具，该工具利用经过提示工程设计的视频理解模型来增强多机器人巡逻视频流。参与者报告称，借助基于LLM的解释，手动工作量减少且信心增强，但也提到了对误报和隐私的担忧。最后，我们总结了对设计未来多机器人视频智能分析工具的启示。

Summary / 总结

The research aims to enhance public safety by designing and integrating multi-robot ground video sensemaking into professional workflows. In Study 1, the authors developed a testbed comprising 38 events-of-interest, 20 robot patrol videos (10 day/night pairs), and 6 design requirements for improving current video sensemaking practices. In Study 2, they built MRVS, which augments multi-robot patrol video streams with a prompt-engineered video understanding model; participants reported reduced manual workload and greater confidence with LLM-based explanations, while raising concerns about false alarms and privacy.

研究旨在通过多机器人地面视频感知技术提升公共安全，解决大规模态势感知的需求。研究构建了一个包含38个关注事件（EoI）和20个机器人巡逻视频的测试床，并提出了6项设计要求。第二项研究引入了MRVS工具，该工具结合了视频理解模型以辅助视频感知。参与者发现MRVS减少了手动工作量并增加了对基于LLM解释的信心，但也提出了关于误报和隐私的担忧。

Features as Rewards: Scalable Supervision for Open-Ended Tasks via Interpretability

Authors: Aaditya Vikram Prasad, Connor Watts, Jack Merullo, Dhruvil Gala, Owen Lewis, Thomas McGrath, Ekdeep Singh Lubana

First: 2026-02-10T18:33:45+00:00 · Latest: 2026-02-10T18:33:45+00:00

Abs · PDF · Code1 · Code2

Abstract

Language models trained on large-scale datasets have been shown to learn features that encode abstract concepts such as factuality or intent. Such features are traditionally used for test-time monitoring or steering. We present an alternative affordance: features as scalable supervision for open-ended tasks. We consider the case of hallucination-reduction as a desirable, yet open-ended behavior and design a reinforcement learning (RL) pipeline, titled RLFR (Reinforcement Learning from Feature Rewards), that uses features as reward functions. Grounded in a novel probing framework that identifies candidate hallucinated claims, our pipeline teaches a model to intervene and correct its completions when it is uncertain of their factuality. Furthermore, the pipeline enables scalable test-time compute, guided once more by our reward features. This end-to-end process operationalized on Gemma-3-12B-IT results in a policy that is 58% less likely to hallucinate compared to the original model, while preserving performance on standard benchmarks. Taken together, by grounding supervision in the language of features, this paper introduces a novel paradigm in the use of interpretability for learning open-ended tasks.
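The idea of using a feature as a reward can be illustrated with a toy probe. The sketch below is not the RLFR pipeline: it assumes a hypothetical linear probe (`probe_weights`) read off a hidden activation, squashes it to a score in (0, 1), and uses that score to pick among candidate completions (a best-of-n stand-in for reward-guided test-time compute). All names, vectors, and numbers are illustrative assumptions.

```python
import math


def feature_reward(activation, probe_weights, probe_bias=0.0):
    """Score in (0, 1) from a hypothetical linear probe on a hidden activation."""
    logit = sum(a * w for a, w in zip(activation, probe_weights)) + probe_bias
    return 1.0 / (1.0 + math.exp(-logit))


def best_of_n(candidates, probe_weights):
    """Return the candidate whose activation the probe scores highest."""
    return max(
        candidates,
        key=lambda c: feature_reward(c["activation"], probe_weights),
    )


# Toy data: in a real system the activations would come from the model itself.
probe = [1.5, -2.0]
candidates = [
    {"text": "claim A", "activation": [0.2, 0.9]},
    {"text": "claim B", "activation": [1.0, 0.1]},
]
chosen = best_of_n(candidates, probe)
```

In an RL setting the same scalar would be fed to the policy-gradient update as the reward instead of being used for selection; how the paper does this is not specified in the abstract.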

中文标题/摘要

标题：特征作为奖励：通过可解释性实现开放任务的可扩展监督

大规模数据集训练的语言模型已被证明能够学习编码抽象概念（如事实性或意图）的特征。这些特征传统上用于测试时的监控或引导。我们提出了一种替代方案：将特征作为开放任务的可扩展监督。我们将减少幻觉视为一种理想但开放式的行为，并设计了一个强化学习（RL）管道，名为RLFR（基于特征奖励的强化学习），该管道使用特征作为奖励函数。基于一个用于识别候选幻觉声明的新颖探针框架，我们的管道教会模型在不确定其事实性时干预并纠正其生成内容。此外，该管道还在奖励特征的引导下实现了可扩展的测试时计算。整个端到端过程在Gemma-3-12B-IT上实施后得到一个策略，该策略与原始模型相比，产生幻觉的可能性降低了58%，同时在标准基准测试上保持了性能。通过将监督建立在特征的语言之上，本文引入了一种将可解释性用于学习开放任务的新范式。

Summary / 总结

The paper explores the use of learned features as scalable supervision for open-ended tasks, specifically focusing on reducing hallucinations in language models. It introduces RLFR (Reinforcement Learning from Feature Rewards), a reinforcement learning pipeline that uses features as reward functions to teach models to correct uncertain completions. The approach, grounded in a novel probing framework, results in a 58% reduction in hallucinations compared to the original model while maintaining performance on standard benchmarks.

该论文探讨了使用学习到的特征作为开放任务的可扩展监督方法，特别是减少语言模型中的幻觉。它引入了RLFR（基于特征奖励的强化学习）管道，使用特征作为奖励函数来训练模型纠正不确定的完成。该方法在Gemma-3-12B-IT上实施，减少了58%的幻觉，同时不牺牲标准基准上的性能。

From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors

Authors: Zhengshen Zhang, Hao Li, Yalun Dai, Zhengbang Zhu, Lei Zhou, Chenchen Liu, Dong Wang, Francis E. H. Tay, Sijin Chen, Ziwei Liu, Yuxiao Liu, Xinghang Li, Pan Zhou

Venue: ICLR 2026

First: 2025-10-20T11:26:45+00:00 · Latest: 2026-02-10T18:32:44+00:00

Comments: ICLR 2026, Project page: https://falcon-vla.github.io/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Existing vision-language-action (VLA) models act in the 3D real world but are typically built on 2D encoders, leaving a spatial reasoning gap that limits generalization and adaptability. Recent 3D integration techniques for VLAs either require specialized sensors and transfer poorly across modalities, or inject weak cues that lack geometry and degrade vision-language alignment. In this work, we introduce FALCON (From Spatial to Action), a novel paradigm that injects rich 3D spatial tokens into the action head. FALCON leverages spatial foundation models to deliver strong geometric priors from RGB alone, and includes an Embodied Spatial Model that can optionally fuse depth or pose for higher fidelity when available, without retraining or architectural changes. To preserve language reasoning, spatial tokens are consumed by a Spatial-Enhanced Action Head rather than being concatenated into the vision-language backbone. These designs enable FALCON to address limitations in spatial representation, modality transferability, and alignment. In comprehensive evaluations across three simulation benchmarks and eleven real-world tasks, our proposed FALCON achieves state-of-the-art performance, consistently surpasses competitive baselines, and remains robust under clutter, spatial-prompt conditioning, and variations in object scale and height.

中文标题/摘要

标题：从空间到行动：在空间基础先验中接地的视觉-语言-行动模型

现有的视觉-语言-行动（VLA）模型在3D真实世界中行动，但通常基于2D编码器，这导致了空间推理的缺口，限制了泛化能力和适应性。最近的3D集成技术要么需要专门的传感器并且在不同模态间转移效果差，要么注入的弱线索缺乏几何信息并降低视觉-语言对齐。在本工作中，我们引入了FALCON（从空间到行动），这是一种新颖的范式，将丰富的3D空间令牌注入到行动头中。FALCON 利用空间基础模型，仅从RGB输入即可提供强大的几何先验，并包含一个具身空间模型，可在深度或姿态可用时选择性地融合它们以提高精度，而无需重新训练或架构更改。为了保持语言推理，空间令牌由空间增强的行动头使用，而不是被拼接到视觉-语言主干中。这些设计使FALCON能够解决空间表示、模态可迁移性和对齐方面的限制。在三个模拟基准和十一个真实世界任务的全面评估中，我们提出的FALCON达到了最先进的性能，始终超越了竞争性基线，并且在杂乱、空间提示条件、物体尺度和高度变化下保持稳健。

Summary / 总结

FALCON addresses the spatial reasoning gap in vision-language-action models by integrating rich 3D spatial tokens into the action head, leveraging spatial foundation models to provide strong geometric priors from RGB inputs. It includes an Embodied Spatial Model that can optionally fuse depth or pose for higher fidelity. Comprehensive evaluations show that FALCON outperforms competitive baselines across various simulation and real-world tasks, demonstrating robustness under clutter and variations in object scale and height.

FALCON 通过将丰富的 3D 空间令牌集成到动作头中，并利用空间基础模型从 RGB 输入中提供强大的几何先验，解决了视觉-语言-动作模型中的空间推理缺口。结合可选融合深度或姿态数据的具身空间模型，FALCON 在各种基准测试和真实世界任务中表现出色，即使在杂乱环境中和物体尺度变化的情况下也能保持稳健性。

Chain of Mindset: Reasoning with Adaptive Cognitive Modes

Authors: Tianyi Jiang, Arctanx An, Hengyi Feng, Naixin Zhai, Haodong Li, Xiaomin Yu, Jiahui Liu, Hanwen Du, Shuo Zhang, Zhi Yang, Jie Huang, Yuhua Li, Yongxin Ni, Huacan Wang, Ronghao Chen

First: 2026-02-10T18:31:47+00:00 · Latest: 2026-02-10T18:31:47+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Human problem-solving is never the repetition of a single mindset, by which we mean a distinct mode of cognitive processing. When tackling a specific task, we do not rely on a single mindset; instead, we integrate multiple mindsets within a single solution process. However, existing LLM reasoning methods fall into a common trap: they apply the same fixed mindset across all steps, overlooking that different stages of solving the same problem require fundamentally different mindsets. This single-minded assumption prevents models from reaching the next level of intelligence. To address this limitation, we propose Chain of Mindset (CoM), a training-free agentic framework that enables step-level adaptive mindset orchestration. CoM decomposes reasoning into four functionally heterogeneous mindsets: Spatial, Convergent, Divergent, and Algorithmic. A Meta-Agent dynamically selects the optimal mindset based on the evolving reasoning state, while a bidirectional Context Gate filters cross-module information flow to maintain effectiveness and efficiency. Experiments across six challenging benchmarks spanning mathematics, code generation, scientific QA, and spatial reasoning demonstrate that CoM achieves state-of-the-art performance, outperforming the strongest baseline by 4.96% and 4.72% in overall accuracy on Qwen3-VL-32B-Instruct and Gemini-2.0-Flash, respectively, while balancing reasoning efficiency. Our code is publicly available at https://github.com/QuantaAlpha/chain-of-mindset.
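The step-level orchestration described in the abstract can be pictured as a small control loop. The sketch below is only a schematic: `select_mindset` is a keyword rule standing in for the LLM-based Meta-Agent, and `context_gate` simply keeps the most recent entries rather than implementing the bidirectional gate. Both functions and the example steps are hypothetical.

```python
MINDSETS = ("Spatial", "Convergent", "Divergent", "Algorithmic")


def select_mindset(state):
    """Stand-in for the Meta-Agent: a keyword rule instead of an LLM call."""
    step = state["pending_step"].lower()
    if "diagram" in step or "layout" in step:
        return "Spatial"
    if "brainstorm" in step:
        return "Divergent"
    if "compute" in step or "implement" in step:
        return "Algorithmic"
    return "Convergent"


def context_gate(history, keep_last=2):
    """Stand-in for the Context Gate: pass only the most recent entries."""
    return history[-keep_last:]


def run_chain(steps):
    """Choose a mindset per step and record the sequence of choices."""
    history, trace = [], []
    for step in steps:
        state = {"pending_step": step, "context": context_gate(history)}
        mindset = select_mindset(state)
        trace.append(mindset)
        history.append(f"[{mindset}] {step}")
    return trace


trace = run_chain(
    ["Brainstorm approaches", "Compute the sum", "Pick the final answer"]
)
```

The point of the structure is that the selection happens inside the loop, once per step, rather than once per problem.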

中文标题/摘要

标题：思维链：适应性认知模式推理

人类解决问题绝非单一思维模式的重复，我们指的是特定的认知处理模式。面对特定任务时，我们不会依赖单一思维模式，而是将多种思维模式整合到单一的解决方案过程中。然而，现有的大语言模型推理方法往往陷入一个常见陷阱：它们在所有步骤中应用相同的固定思维模式，忽视了解决同一问题的不同阶段需要根本不同的思维模式。这种单一思维模式的假设阻碍了模型达到更高层次的智能。为解决这一局限，我们提出了一种无需训练的代理框架——思维链（CoM），该框架能够实现步骤级别的适应性思维模式编排。CoM 将推理分解为四个功能异质的思维模式：空间思维、收敛思维、发散思维和算法思维。一个元代理根据推理状态的演变动态选择最优思维模式，而双向上下文门控则过滤模块间的信息流，以保持有效性和效率。跨六个涵盖数学、代码生成、科学问答和空间推理的挑战性基准实验表明，CoM 达到了最先进的性能，在 Qwen3-VL-32B-Instruct 和 Gemini-2.0-Flash 上的整体准确率分别比最强基线高出 4.96% 和 4.72%，同时平衡了推理效率。我们的代码已公开发布于 https://github.com/QuantaAlpha/chain-of-mindset。

Summary / 总结

The paper addresses the limitation of existing large language model (LLM) reasoning methods that apply a single fixed mindset throughout the problem-solving process, which overlooks the need for different mindsets at different stages. It introduces Chain of Mindset (CoM), a training-free framework that dynamically selects among four heterogeneous mindsets (Spatial, Convergent, Divergent, and Algorithmic) based on the evolving reasoning state. Experiments show CoM outperforms strong baselines by 4.96% and 4.72% in overall accuracy on Qwen3-VL-32B-Instruct and Gemini-2.0-Flash, respectively, while balancing efficiency across six benchmarks in mathematics, code generation, scientific QA, and spatial reasoning.

研究旨在通过解决现有模型在整个问题解决过程中使用单一固定思维模式的局限性，改进语言模型的推理能力。提出的Chain of Mindset (CoM)框架根据推理状态的演变动态选择四种异质思维模式——空间、收敛、发散和算法。实验表明，CoM在Qwen3-VL-32B-Instruct和Gemini-2.0-Flash上分别在总体准确率上比现有最佳基线高出4.96%和4.72%，同时在数学、代码生成、科学问答和空间推理等各个基准上保持了效率和有效性。

Vendi Novelty Scores for Out-of-Distribution Detection

Authors: Amey P. Pasarkar, Adji Bousso Dieng

First: 2026-02-10T18:30:29+00:00 · Latest: 2026-02-10T18:30:29+00:00

Abs · PDF · Code1 · Code2

Abstract

Out-of-distribution (OOD) detection is critical for the safe deployment of machine learning systems. Existing post-hoc detectors typically rely on model confidence scores or likelihood estimates in feature space, often under restrictive distributional assumptions. In this work, we introduce a third paradigm and formulate OOD detection from a diversity perspective. We propose the Vendi Novelty Score (VNS), an OOD detector based on the Vendi Scores (VS), a family of similarity-based diversity metrics. VNS quantifies how much a test sample increases the VS of the in-distribution feature set, providing a principled notion of novelty that does not require density modeling. VNS is linear-time, non-parametric, and naturally combines class-conditional (local) and dataset-level (global) novelty signals. Across multiple image classification benchmarks and network architectures, VNS achieves state-of-the-art OOD detection performance. Remarkably, VNS retains this performance when computed using only 1% of the training data, enabling deployment in memory- or access-constrained settings.
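A rough illustration of "how much a test sample increases the VS" follows. It is a sketch under assumptions, not the paper's VNS: it uses the order-2 Vendi Score, which for a kernel matrix K over n points equals n² divided by the sum of squared kernel entries (the eigenvalues of K/n have squared sum equal to the squared Frobenius norm of K/n), so no eigendecomposition is needed. The RBF kernel, the order q = 2, and the plain difference used as the novelty value are all choices made here for illustration; the class-conditional and dataset-level combination mentioned in the abstract is not reproduced.

```python
import math


def rbf(x, y, gamma=1.0):
    """RBF similarity; equals 1 when x == y."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def vendi_q2(points, kernel=rbf):
    """Order-2 Vendi Score: n^2 / sum_ij K_ij^2."""
    n = len(points)
    squared_sum = sum(kernel(p, q) ** 2 for p in points for q in points)
    return n * n / squared_sum


def novelty(test_point, reference, kernel=rbf):
    """Change in the order-2 Vendi Score when test_point joins the reference set."""
    return vendi_q2(reference + [test_point], kernel) - vendi_q2(reference, kernel)


# Toy in-distribution features clustered near the origin.
reference = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
near = novelty([0.05, 0.05], reference)  # sample inside the cluster
far = novelty([5.0, 5.0], reference)     # sample far from the cluster
```

This version recomputes the full kernel sum for clarity; caching the reference sum would make each query cost one kernel evaluation per reference point.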

中文标题/摘要

标题：Vendi新颖性评分在分布外检测中的应用

分布外（OOD）检测对于机器学习系统的安全部署至关重要。现有的后验检测器通常依赖于模型置信度评分或特征空间中的似然估计，往往需要限制性的分布假设。在本工作中，我们引入了第三种范式，并从多样性角度对OOD检测进行建模。我们提出了Vendi新颖性评分（VNS），这是一种基于Vendi评分（VS）的OOD检测器，Vendi评分是一系列基于相似性的多样性度量。VNS量化了测试样本使分布内特征集的VS增加了多少，提供了一种无需密度建模的原理性新颖性概念。VNS是线性时间的、非参数的，并且自然地结合了类条件（局部）和数据集级（全局）的新颖性信号。在多个图像分类基准和网络架构上，VNS实现了最先进的OOD检测性能。值得注意的是，当仅使用训练数据的1%计算时，VNS仍能保持这一性能，使其能够在内存或访问受限的环境中部署。

Summary / 总结

The research aims to improve out-of-distribution (OOD) detection for machine learning systems by introducing a novel approach based on diversity metrics. The Vendi Novelty Score (VNS) is proposed, which quantifies how much a test sample increases the Vendi Score (VS) of the in-distribution feature set. This method does not require density modeling and provides state-of-the-art OOD detection performance across various benchmarks and network architectures. Notably, VNS performs well even when using only 1% of the training data, making it suitable for resource-constrained environments.

研究旨在通过引入基于多样性度量的新方法来提高机器学习系统的域外（OOD）检测能力。提出了Vendi Novelty Score (VNS)，该方法量化测试样本增加训练数据集中Vendi Score (VS) 的程度，无需进行密度建模，并在多个基准测试和网络架构上实现了最先进的OOD检测性能。特别地，即使仅使用训练数据的1%，VNS 也能保持良好的性能，适用于资源受限的环境。

WildCat: Near-Linear Attention in Theory and Practice

Authors: Tobias Schröder, Lester Mackey

First: 2026-02-10T18:22:32+00:00 · Latest: 2026-02-10T18:22:32+00:00

Abs · PDF · Code1 · Code2

Abstract

We introduce WildCat, a high-accuracy, low-cost approach to compressing the attention mechanism in neural networks. While attention is a staple of modern network architectures, it is also notoriously expensive to deploy due to resource requirements that scale quadratically with the input sequence length $n$. WildCat avoids these quadratic costs by only attending over a small weighted coreset. Crucially, we select the coreset using a fast but spectrally-accurate subsampling algorithm -- randomly pivoted Cholesky -- and weight the elements optimally to minimise reconstruction error. Remarkably, given bounded inputs, WildCat approximates exact attention with super-polynomial $O(n^{-\sqrt{\log(\log(n))}})$ error decay while running in near-linear $O(n^{1+o(1)})$ time. In contrast, prior practical approximations either lack error guarantees or require quadratic runtime to guarantee such high fidelity. We couple this advance with a GPU-optimized PyTorch implementation and a suite of benchmark experiments demonstrating the benefits of WildCat for image generation, image classification, and language model KV cache compression.
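The subsampling step named in the abstract, randomly pivoted Cholesky, can be sketched on its own. The code below applies it to a positive semidefinite kernel matrix built from toy key vectors; using the exponential kernel between keys is an assumption made here, and the optimal coreset weighting and the attention approximation itself are not reproduced. `rp_cholesky` is a hypothetical helper name.

```python
import numpy as np


def rp_cholesky(A, k, seed=0):
    """Randomly pivoted Cholesky: rank-k factor F with A ~= F @ F.T, plus pivots."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    F = np.zeros((n, k))
    diag = np.diag(A).astype(float).copy()  # diagonal of the current residual
    pivots = []
    for i in range(k):
        # Sample a pivot with probability proportional to the residual diagonal.
        s = int(rng.choice(n, p=diag / diag.sum()))
        pivots.append(s)
        # Residual column at the pivot, then a rank-one Cholesky update.
        g = A[:, s] - F[:, :i] @ F[s, :i]
        F[:, i] = g / np.sqrt(g[s])
        diag = np.clip(diag - F[:, i] ** 2, 0.0, None)
        diag[s] = 0.0  # a chosen pivot is never sampled again
    return F, pivots


# Toy keys; exp(<k_i, k_j> / sqrt(d)) is a positive semidefinite kernel.
keys = np.random.default_rng(1).normal(size=(64, 4))
A = np.exp(keys @ keys.T / np.sqrt(keys.shape[1]))
F, pivots = rp_cholesky(A, k=16)
relative_error = np.linalg.norm(A - F @ F.T) / np.linalg.norm(A)
```

The pivots play the role of a coreset: each iteration touches one column of the matrix, so the full n-by-n matrix never has to be formed in a practical implementation (it is formed here only to measure the error).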

Summary / 总结

WildCat is a method for compressing the attention mechanism in neural networks, reducing computational cost while maintaining high accuracy. It selects a small weighted coreset using a fast subsampling algorithm, randomly pivoted Cholesky, and weights the coreset elements optimally to minimize reconstruction error. Given bounded inputs, WildCat approximates exact attention with super-polynomial error decay while running in near-linear time, whereas prior practical approximations either lack error guarantees or require quadratic runtime for comparable fidelity. Benchmark experiments cover image generation, image classification, and language model KV cache compression.

WildCat 是一种压缩神经网络中注意力机制的方法，在降低计算成本的同时保持高准确性。它使用快速子采样算法（随机主元 Cholesky 分解，randomly pivoted Cholesky）选择一个小的加权核心集（coreset），并对其元素进行最优加权以最小化重构误差。在输入有界的条件下，WildCat 以超多项式速率的误差衰减逼近精确注意力，且运行时间接近线性；相比之下，以往的实用近似方法要么缺乏误差保证，要么需要二次运行时间才能达到同等保真度。

From Moderation to Mediation: Can LLMs Serve as Mediators in Online Flame Wars?

Authors: Dawei Li, Abdullah Alnaibari, Arslan Bisharat, Manny Sandoval, Deborah Hall, Yasin Silva, Huan Liu

First: 2025-12-02T18:31:18+00:00 · Latest: 2026-02-10T18:19:05+00:00

Comments: Under review

Abs · PDF · Code1 · Code2

Abstract

The rapid advancement of large language models (LLMs) has opened new possibilities for AI for good applications. As LLMs increasingly mediate online communication, their potential to foster empathy and constructive dialogue becomes an important frontier for responsible AI research. This work explores whether LLMs can serve not only as moderators that detect harmful content, but as mediators capable of understanding and de-escalating online conflicts. Our framework decomposes mediation into two subtasks: judgment, where an LLM evaluates the fairness and emotional dynamics of a conversation, and steering, where it generates empathetic, de-escalatory messages to guide participants toward resolution. To assess mediation quality, we construct a large Reddit-based dataset and propose a multi-stage evaluation pipeline combining principle-based scoring, user simulation, and human comparison. Experiments show that API-based models outperform open-source counterparts in both reasoning and intervention alignment when doing mediation. Our findings highlight both the promise and limitations of current LLMs as emerging agents for online social mediation.

中文标题/摘要

标题：从审查到调解：大语言模型能否作为在线争吵的调解人？

大型语言模型（LLMs）的迅速发展为AI向善的应用开辟了新的可能性。随着LLMs越来越多地调解在线交流，它们促进同理心和建设性对话的潜力成为负责任AI研究的重要前沿。本文探讨LLMs是否不仅能作为检测有害内容的审查员，还能作为理解并缓解在线冲突的调解人。我们的框架将调解分解为两个子任务：判断，即LLM评估对话的公平性和情感动态；引导，即生成同理心、缓解冲突的消息，引导参与者走向解决。为了评估调解质量，我们构建了一个基于Reddit的大规模数据集，并提出了一种结合原则评分、用户模拟和人工对比的多阶段评估管道。实验表明，基于API的模型在调解时在推理和干预一致性方面优于开源版本。我们的研究结果突显了当前LLMs作为在线社会调解新兴代理的潜力和局限性。

Summary / 总结

This study investigates whether large language models (LLMs) can act as mediators in online conflicts, focusing on their ability to understand and de-escalate tensions. The research decomposes mediation into judgment and steering tasks, and evaluates LLMs using a multi-stage pipeline. Experiments indicate that API-based models outperform open-source models in both reasoning and intervention alignment during mediation, suggesting both the potential and current limitations of LLMs in this role.

研究探讨了大型语言模型（LLMs）是否能在在线冲突中充当调解者的角色，重点在于它们理解和缓解紧张局势的能力。研究将调解分解为判断和引导两个子任务，并使用多阶段评估管道进行评估。实验表明，基于API的模型在推理和干预一致性方面优于开源模型，这既显示了LLMs在这一角色中的潜力，也揭示了当前的局限性。

Spatio-Temporal Attention for Consistent Video Semantic Segmentation in Automated Driving

Authors: Serin Varghese, Kevin Ross, Fabian Hueger, Kira Maag

First: 2026-02-10T18:18:37+00:00 · Latest: 2026-02-10T18:18:37+00:00

Abs · PDF · Code1 · Code2

Abstract

Deep neural networks, especially transformer-based architectures, have achieved remarkable success in semantic segmentation for environmental perception. However, existing models process video frames independently, thus failing to leverage temporal consistency, which could significantly improve both accuracy and stability in dynamic scenes. In this work, we propose a Spatio-Temporal Attention (STA) mechanism that extends transformer attention blocks to incorporate multi-frame context, enabling robust temporal feature representations for video semantic segmentation. Our approach modifies standard self-attention to process spatio-temporal feature sequences while maintaining computational efficiency and requiring minimal changes to existing architectures. STA demonstrates broad applicability across diverse transformer architectures and remains effective across both lightweight and larger-scale models. A comprehensive evaluation on the Cityscapes and BDD100k datasets shows substantial improvements of 9.20 percentage points in temporal consistency metrics and up to 1.76 percentage points in mean intersection over union compared to single-frame baselines. These results demonstrate STA as an effective architectural enhancement for video-based semantic segmentation applications.
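One way to read "extending attention blocks to incorporate multi-frame context" is to let the current frame's tokens attend over tokens from several frames. The sketch below shows that reading with single-head attention in NumPy; the shapes, the projection matrices, and the choice to draw queries only from the last frame are assumptions for illustration, since the abstract does not specify the exact STA design.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def spatio_temporal_attention(frames, Wq, Wk, Wv):
    """frames: (T, N, D) token features, last frame is the current one.

    Queries come from the current frame; keys and values come from the
    tokens of all T frames, so each token sees multi-frame context.
    """
    current = frames[-1]                            # (N, D)
    context = frames.reshape(-1, frames.shape[-1])  # (T * N, D)
    q = current @ Wq
    k = context @ Wk
    v = context @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])         # (N, T * N)
    return softmax(scores) @ v                      # (N, D_out)


rng = np.random.default_rng(0)
frames = rng.normal(size=(3, 5, 8))                 # 3 frames, 5 tokens, dim 8
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
out = spatio_temporal_attention(frames, Wq, Wk, Wv)
```

With T = 1 this reduces to ordinary single-frame self-attention, which is why such a change can be dropped into an existing block with few modifications.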

中文标题/摘要

标题：时空注意力机制在自动驾驶中一致视频语义分割的应用

深度神经网络，尤其是基于变换器的架构，在环境感知的语义分割方面取得了显著的成功。然而，现有的模型独立处理视频帧，未能利用时间一致性，这在动态场景中可以显著提高准确性和稳定性。在本文中，我们提出了一种时空注意力（STA）机制，将变换器注意力块扩展到多帧上下文，以实现视频语义分割的稳健时间特征表示。我们的方法修改了标准的自我注意力机制，以处理时空特征序列，同时保持计算效率并仅对现有架构进行少量修改。STA在各种变换器架构中具有广泛的适用性，并且在轻量级和更大规模的模型中仍然有效。在Cityscapes和BDD100k数据集上的全面评估显示，与单帧基线相比，STA在时间一致性指标上的改进达到了9.20个百分点，在平均交并比上的改进达到了1.76个百分点。这些结果表明，STA是视频基语义分割应用的有效架构增强。

Summary / 总结

This work addresses the limitation of existing models that process video frames independently, thereby missing out on temporal consistency. It introduces a Spatio-Temporal Attention (STA) mechanism to extend transformer attention blocks, allowing for the incorporation of multi-frame context. The evaluation on Cityscapes and BDD100k datasets shows STA improves temporal consistency by 9.20 percentage points and mean intersection over union by up to 1.76 percentage points compared to single-frame baselines.

这项工作针对现有模型独立处理视频帧、忽视动态场景中至关重要的时序一致性的问题。它引入了一种时空注意力（STA）机制，以增强基于Transformer的架构在视频语义分割中的表现。STA通过引入多帧上下文来改进时序特征表示，同时保持计算效率。在Cityscapes和BDD100k数据集上的评估显示，与单帧基线模型相比，STA将时序一致性提升了9.20个百分点，并将平均交并比提高了最多1.76个百分点。

Long Chain-of-Thought Compression via Fine-Grained Group Policy Optimization

Authors: Xinchen Han, Hossam Afifi, Michel Marot, Xilu Wang, Lu Yin

Venue: ICASSP

First: 2026-02-10T18:15:58+00:00 · Latest: 2026-02-10T18:15:58+00:00

Comments: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2026

Abs · PDF · Code1 · Code2

Abstract

Large Language Models (LLMs) often generate unnecessarily verbose Chain-of-Thought (CoT) reasoning that increases computational costs and latency without proportional performance gains. In this paper, we propose Fine-grained Group policy Optimization (FGO), a Reinforcement Learning (RL) algorithm that refines group responses by subdividing them and assigning appropriate weights based on length and entropy, thereby enabling effective CoT compression. Meanwhile, as an enhanced variant of Group Relative Policy Optimization (GRPO), FGO successfully addresses two major limitations of GRPO: inefficient data utilization and entropy collapse. We evaluate FGO on multiple reasoning LLMs and benchmarks, including MATH500, AIME24, AMC23, and Minerva. Experimental results show that FGO achieves efficient CoT compression without degrading performance, and simultaneously resolves the key limitations of GRPO.
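For orientation, the group-relative advantage used by GRPO normalizes each response's reward by the mean and standard deviation of its group. The sketch below computes that baseline and then applies a purely hypothetical length weighting to the positively scored subgroup, so that shorter successful responses receive more credit. The weighting rule is an invented stand-in: the abstract does not give FGO's actual weights, and the entropy term is omitted entirely.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: reward standardized within the sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def length_weighted_advantages(rewards, lengths):
    """Hypothetical weighting: scale positive advantages by mean_len / len."""
    advantages = group_relative_advantages(rewards)
    positive_lengths = [n for a, n in zip(advantages, lengths) if a > 0]
    if not positive_lengths:
        return advantages
    reference_length = mean(positive_lengths)
    return [
        a * reference_length / n if a > 0 else a
        for a, n in zip(advantages, lengths)
    ]


# Four sampled responses: two correct (reward 1), two incorrect (reward 0).
rewards = [1.0, 1.0, 0.0, 0.0]
lengths = [100, 300, 200, 200]   # token counts of each response
weighted = length_weighted_advantages(rewards, lengths)
```

Under this toy rule the 100-token correct response ends up with a larger advantage than the 300-token one, which is the direction a CoT-compression objective would push.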

中文标题/摘要

标题：通过细粒度组策略优化实现长链思考压缩

大型语言模型（LLMs）经常生成不必要的冗长链思考（CoT）推理，这增加了计算成本和延迟，但没有相应的性能提升。本文提出了一种细粒度组策略优化（FGO）算法，这是一种强化学习（RL）算法，通过细分和分配适当的权重来精炼组响应，基于长度和熵，从而实现有效的CoT压缩。同时，作为组相对策略优化（GRPO）的增强变体，FGO成功解决了GRPO的两个主要局限性：数据利用效率低和熵坍塌。我们在多个推理LLMs和基准上评估了FGO，包括MATH500、AIME24、AMC23和Minerva。实验结果表明，FGO在不降低性能的情况下实现了有效的CoT压缩，并同时解决了GRPO的关键局限性。

Summary / 总结

This paper addresses the issue of verbose Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs), which increases computational costs without significant performance gains. The authors propose Fine-grained Group Policy Optimization (FGO), a Reinforcement Learning algorithm that subdivides and assigns weights to group responses based on length and entropy, effectively compressing CoT. FGO improves upon Group Relative Policy Optimization (GRPO) by addressing its limitations of inefficient data utilization and entropy collapse. Experiments on MATH500, AIME24, AMC23, and Minerva show that FGO can compress CoT efficiently without performance degradation.

本文针对大型语言模型（LLMs）中冗长的链式思考（CoT）推理问题，这种推理增加了计算成本但并未显著提升性能。作者提出了一种细粒度分组优化（FGO）算法，该算法通过细分并根据长度和熵分配权重来有效压缩CoT。FGO改进了分组相对策略优化（GRPO），解决了数据利用效率低和熵坍塌的问题。实验结果表明，FGO可以在不牺牲性能的情况下有效压缩CoT。

How to Purchase Labels? A Cost-Effective Approach Using Active Learning Markets

Authors: Xiwen Huang, Pierre Pinson

First: 2025-11-25T18:34:33+00:00 · Latest: 2026-02-10T18:12:04+00:00

Comments: Accepted for publication in INFORMS Journal on Data Science (IJDS). This is the authors' preprint

Abs · PDF · Code1 · Code2

Abstract

We introduce and analyse active learning markets as a way to purchase labels, in situations where analysts aim to acquire additional data to improve model fitting, or to better train models for predictive analytics applications. This comes in contrast to the many proposals that already exist to purchase features and examples. By originally formalising the market clearing as an optimisation problem, we integrate budget constraints and improvement thresholds into the label acquisition process. We focus on a single-buyer-multiple-seller setup and propose the use of two active learning strategies (variance based and query-by-committee based), paired with distinct pricing mechanisms. They are compared to benchmark baselines including random sampling and a greedy knapsack heuristic. The proposed strategies are validated on real-world datasets from two critical application domains: real estate pricing and energy forecasting. Results demonstrate the robustness of our approach, consistently achieving superior performance with fewer labels acquired compared to conventional methods. Our proposal comprises an easy-to-implement practical solution for optimising data acquisition in resource-constrained environments.
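Budget-constrained label purchasing can be made concrete with the greedy knapsack heuristic the abstract lists as a baseline. In the sketch below each offer carries a seller's asking price and a usefulness score (for example, a predictive-variance estimate); the offers, field names, and numbers are invented for illustration, and the paper's market-clearing optimisation and pricing mechanisms are not reproduced.

```python
def greedy_purchase(offers, budget):
    """Buy labels in decreasing score-per-price order while the budget allows."""
    chosen, spent = [], 0.0
    ranked = sorted(offers, key=lambda o: o["score"] / o["price"], reverse=True)
    for offer in ranked:
        if spent + offer["price"] <= budget:
            chosen.append(offer["id"])
            spent += offer["price"]
    return chosen, spent


# Hypothetical offers: 'score' stands in for expected model improvement.
offers = [
    {"id": "a", "score": 0.9, "price": 3.0},
    {"id": "b", "score": 0.5, "price": 1.0},
    {"id": "c", "score": 0.4, "price": 2.0},
    {"id": "d", "score": 0.1, "price": 1.0},
]
chosen, spent = greedy_purchase(offers, budget=4.0)
```

An active learning strategy would differ mainly in how `score` is produced, for instance recomputing it after every purchase as the model is refit, rather than ranking once up front.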

中文标题/摘要

标题：如何购买标签？使用主动学习市场的一种成本效益方法

我们介绍了主动学习市场作为一种购买标签的方法，适用于分析师希望获取额外数据以改进模型拟合或更好地训练用于预测分析的应用模型的情况。这与已经存在的许多购买特征和示例的提议形成了对比。通过将市场出清形式化为优化问题，我们将预算约束和改进阈值整合到标签获取过程中。我们专注于单买家多卖家的设置，并提出使用两种主动学习策略（基于方差和基于委员会查询），并配以不同的定价机制。这些策略与随机抽样和贪婪背包启发式方法的基准进行了比较。所提出的方法在两个关键应用领域的真实世界数据集上进行了验证：房地产定价和能源预测。结果表明，我们的方法具有鲁棒性，与传统方法相比，使用更少的标签实现了更优的性能。我们的提议提供了一种在资源受限环境中优化数据获取的易于实现的实用解决方案。

Summary / 总结

The paper introduces active learning markets as a cost-effective method for acquiring labels, integrating budget constraints and improvement thresholds. It compares variance-based and query-by-committee-based strategies with pricing mechanisms against random sampling and a greedy knapsack heuristic. The approach is validated on real-world datasets from real estate pricing and energy forecasting, showing superior performance with fewer labels acquired compared to conventional methods.

论文提出了一种使用主动学习市场以成本效益的方式获取标签的方法，整合了预算限制和性能阈值。它将基于方差和基于委员会的主动学习策略与定价机制与随机抽样和贪婪背包启发式算法进行了比较。该方法在房地产定价和能源预测的实际数据集上进行了验证，显示了在获取更少标签的情况下比传统方法具有更优性能。
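
As a rough illustration of the greedy knapsack baseline mentioned in this entry, the sketch below buys labels in decreasing utility-per-price order until a budget is exhausted. The names `Offer` and `greedy_purchase`, the utility scores, and the prices are all hypothetical; this is not the authors' market-clearing formulation, and prices are assumed to be strictly positive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    seller: str
    price: float    # asking price for one label (hypothetical units, must be > 0)
    utility: float  # buyer's estimated usefulness, e.g. a predictive variance


def greedy_purchase(offers, budget):
    """Buy labels in decreasing utility-per-price order while the budget allows."""
    ranked = sorted(offers, key=lambda o: o.utility / o.price, reverse=True)
    bought, spent = [], 0.0
    for offer in ranked:
        if spent + offer.price <= budget:
            bought.append(offer)
            spent += offer.price
    return bought, spent


offers = [
    Offer("a", price=4.0, utility=2.0),  # ratio 0.5
    Offer("b", price=1.0, utility=1.5),  # ratio 1.5
    Offer("c", price=2.0, utility=2.0),  # ratio 1.0
    Offer("d", price=3.0, utility=0.3),  # ratio 0.1
]
bought, spent = greedy_purchase(offers, budget=5.0)
print([o.seller for o in bought], spent)  # prints ['b', 'c'] 3.0
```

An active learning strategy would differ mainly in how `utility` is computed (for example from model variance or committee disagreement) and in re-estimating it after each purchase.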

Optimistic World Models: Efficient Exploration in Model-Based Deep Reinforcement Learning

Authors: Akshay Mete, Shahid Aamir Sheikh, Tzu-Hsiang Lin, Dileep Kalathil, P. R. Kumar

First: 2026-02-10T18:11:00+00:00 · Latest: 2026-02-10T18:11:00+00:00

Abs · PDF · Code1 · Code2

Abstract

Efficient exploration remains a central challenge in reinforcement learning (RL), particularly in sparse-reward environments. We introduce Optimistic World Models (OWMs), a principled and scalable framework for optimistic exploration that brings classical reward-biased maximum likelihood estimation (RBMLE) from adaptive control into deep RL. In contrast to upper confidence bound (UCB)-style exploration methods, OWMs incorporate optimism directly into model learning by augmentation with an optimistic dynamics loss that biases imagined transitions toward higher-reward outcomes. This fully gradient-based loss requires neither uncertainty estimates nor constrained optimization. Our approach is plug-and-play with existing world model frameworks, preserving scalability while requiring only minimal modifications to standard training procedures. We instantiate OWMs within two state-of-the-art world model architectures, leading to Optimistic DreamerV3 and Optimistic STORM, which demonstrate significant improvements in sample efficiency and cumulative return compared to their baseline counterparts.

中文标题/摘要

标题：乐观的世界模型：基于模型的深度强化学习中的高效探索

高效的探索仍然是强化学习（RL）中的一个核心挑战，尤其是在稀疏奖励环境中。我们提出了乐观的世界模型（OWMs），这是一种原理上和可扩展的乐观探索框架，将自适应控制中的奖励偏向最大似然估计（RBMLE）引入到深度RL中。与基于上置信界（UCB）的探索方法不同，OWMs 通过增加乐观动力学损失直接将乐观性融入到模型学习中，使想象中的过渡偏向高奖励结果。这种完全基于梯度的损失不需要不确定性估计也不需要约束优化。我们的方法可以无缝集成到现有的世界模型框架中，保持可扩展性，只需对标准训练过程进行少量修改。我们将在两种最先进的世界模型架构中实例化OWMs，分别得到乐观DreamerV3和乐观STORM，它们在样本效率和累计回报方面相对于基线模型有显著改进。

Summary / 总结

The paper addresses the challenge of efficient exploration in reinforcement learning, especially in sparse-reward environments. It introduces Optimistic World Models (OWMs), which incorporate optimism directly into model learning through an optimistic dynamics loss. Unlike UCB-style exploration, this gradient-based loss requires neither uncertainty estimates nor constrained optimization. Instantiated as Optimistic DreamerV3 and Optimistic STORM, the method improves sample efficiency and cumulative return over the corresponding baselines. OWMs are compatible with existing world model frameworks and require minimal modifications to standard training procedures.

论文针对强化学习中稀疏奖励环境下的高效探索问题，提出了一种名为Optimistic World Models (OWMs)的方法，通过增加乐观动力学损失直接将乐观性引入模型学习中。这种方法提高了样本效率和累计回报，无需估计不确定性或使用约束优化。OWMs与现有的世界模型框架兼容，使其具有可扩展性和易于实现的特点。
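
The idea of biasing a learned model toward higher-reward outcomes can be shown with a one-parameter toy problem. In the sketch below, the model predicts a single next state `theta`, the fit term is a mean squared error, and the reward of a predicted state is assumed to be the state itself. These choices, the function names, and the weight `alpha` are illustrative assumptions and not the loss used in the paper.

```python
def optimistic_loss_grad(theta, observations, alpha):
    """Gradient of mean((theta - x)^2) - alpha * reward(theta), with reward(theta) = theta."""
    n = len(observations)
    fit_grad = sum(2.0 * (theta - x) for x in observations) / n
    return fit_grad - alpha


def fit(observations, alpha, lr=0.1, steps=500):
    """Plain gradient descent on the toy optimistic loss."""
    theta = 0.0
    for _ in range(steps):
        theta -= lr * optimistic_loss_grad(theta, observations, alpha)
    return theta


observations = [1.0, 2.0, 3.0]
print(round(fit(observations, alpha=0.0), 6))  # prints 2.0 (plain fit: the sample mean)
print(round(fit(observations, alpha=0.4), 6))  # prints 2.2 (shifted by alpha / 2)
```

With `alpha = 0` the estimate is the sample mean; a positive `alpha` shifts the learned prediction toward higher reward by `alpha / 2`, which is the closed-form minimizer of this toy loss.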

Simple Image Processing and Similarity Measures Can Link Data Samples across Databases through Brain MRI

Authors: Gaurang Sharma, Harri Polonen, Juha Pajula, Jutta Suksi, Jussi Tohka

First: 2026-02-10T18:10:12+00:00 · Latest: 2026-02-10T18:10:12+00:00

Abs · PDF · Code1 · Code2

Abstract

Head Magnetic Resonance Imaging (MRI) is routinely collected and shared for research under strict regulatory frameworks. These frameworks require removing potential identifiers before sharing. But, even after skull stripping, the brain parenchyma contains unique signatures that can match other MRIs from the same participants across databases, posing a privacy risk if additional data features are available. Current regulatory frameworks often mandate evaluating such risks based on the assessment of a certain level of reasonableness. Prior studies have already suggested that a brain MRI could enable participant linkage, but they have relied on training-based or computationally intensive methods. Here, we demonstrate that linking an individual's skull-stripped T1-weighted MRI, which may lead to re-identification if other identifiers are available, is possible using standard preprocessing followed by image similarity computation. Nearly perfect linkage accuracy was achieved in matching data samples across various time intervals, scanner types, spatial resolutions, and acquisition protocols, despite potential cognitive decline, simulating MRI matching across databases. These results aim to contribute meaningfully to the development of thoughtful, forward-looking policies in medical data sharing.

中文标题/摘要

标题：简单的图像处理和相似性度量可以通过脑MRI链接数据库中的数据样本

头部磁共振成像(MRI)在严格监管框架下被常规收集和共享用于研究。这些框架要求在共享前去除潜在的标识符。但在去头骨处理后，脑实质仍包含独特的特征，可以在同一参与者跨数据库的其他MRI之间匹配，如果存在其他数据特征，这将构成隐私风险。当前的监管框架通常要求基于一定合理性的评估来评估此类风险。先前的研究已经表明，脑MRI可以实现参与者链接，但它们依赖于基于训练的方法或计算密集型方法。在这里，我们证明了使用标准预处理后通过图像相似性计算可以链接个体的去头骨T1加权MRI，如果存在其他标识符，这可能导致重新识别。尽管存在潜在的认知衰退，我们模拟了跨数据库的MRI匹配，实现了几乎完美的链接准确性，匹配了不同时间间隔、扫描类型、空间分辨率和采集协议的数据样本。这些结果旨在为医疗数据共享的发展提供有意义的政策建议。

Summary / 总结

The study aims to highlight the privacy risks associated with sharing brain MRI data even after skull stripping, as unique signatures in the brain parenchyma can still match other MRIs from the same participants. The researchers used standard preprocessing and image similarity measures to link MRI samples across different databases, achieving nearly perfect accuracy despite variations in time intervals, scanner types, spatial resolutions, and acquisition protocols. The authors intend the results to inform thoughtful, forward-looking policies for medical data sharing.

该研究探讨了即使移除了潜在标识符，通过脑MRI数据重新识别个体仍存在的隐私风险。通过使用简单的图像处理和相似性度量方法，作者展示了在不同时间间隔、扫描器类型、空间分辨率和采集协议下，几乎完美地链接了MRI样本。这强调了在医疗数据共享中需要制定更为严格的政策以降低重新识别风险的重要性。
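
A minimal sketch of similarity-based linkage, assuming images have already been preprocessed and flattened to equal-length intensity lists. Pearson correlation is used here as a stand-in similarity measure; the entry does not say which measure the authors use, and the identifiers and toy data below are hypothetical. Constant images would cause a division by zero and are not handled.

```python
import math


def correlation(a, b):
    """Pearson correlation between two equally sized, flattened images."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (norm_a * norm_b)


def link(queries, gallery):
    """For each query id, return the gallery id with the highest similarity."""
    return {
        qid: max(gallery, key=lambda gid: correlation(qimg, gallery[gid]))
        for qid, qimg in queries.items()
    }


gallery = {
    "p1": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    "p2": [5.0, 1.0, 4.0, 0.0, 3.0, 2.0],
}
queries = {
    "scanA": [0.1, 1.0, 2.2, 2.9, 4.1, 5.0],   # noisy copy of p1
    "scanB": [10.0, 2.0, 8.5, 0.5, 6.0, 4.2],  # rescaled, noisy copy of p2
}
print(link(queries, gallery))  # prints {'scanA': 'p1', 'scanB': 'p2'}
```

Because correlation is invariant to intensity offset and scale, the rescaled `scanB` still matches `p2`.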

Fake-HR1: Rethinking reasoning of vision language model for synthetic image detection

Authors: Changjiang Jiang, Xinkuan Sha, Fengchang Yu, Jingjing Liu, Jian Liu, Mingqi Fang, Chenfeng Zhang, Wei Lu

Venue: ICASSP 2026

First: 2026-02-10T18:10:08+00:00 · Latest: 2026-02-10T18:10:08+00:00

Comments: Accepted by ICASSP 2026

Abs · PDF · Code1 · Code2

Abstract

Recent studies have demonstrated that incorporating Chain-of-Thought (CoT) reasoning into the detection process can enhance a model's ability to detect synthetic images. However, excessively lengthy reasoning incurs substantial resource overhead, including token consumption and latency, which is particularly redundant when handling obviously generated forgeries. To address this issue, we propose Fake-HR1, a large-scale hybrid-reasoning model that, to the best of our knowledge, is the first to adaptively determine whether reasoning is necessary based on the characteristics of the generative detection task. To achieve this, we design a two-stage training framework: we first perform Hybrid Fine-Tuning (HFT) for cold-start initialization, followed by online reinforcement learning with Hybrid-Reasoning Grouped Policy Optimization (HGRPO) to implicitly learn when to select an appropriate reasoning mode. Experimental results show that Fake-HR1 adaptively performs reasoning across different types of queries, surpassing existing LLMs in both reasoning ability and generative detection performance, while significantly improving response efficiency.

中文标题/摘要

标题：Fake-HR1：重新思考视觉语言模型在合成图像检测中的推理机制

近期研究表明，在检测过程中引入链式思考（CoT）推理可以增强模型检测合成图像的能力。然而，过长的推理过程会带来显著的资源开销，包括令牌消耗和延迟，特别是在处理明显伪造的图像时尤为冗余。为解决这一问题，我们提出了一种大规模混合推理模型Fake-HR1，据我们所知，这是首个能够根据生成检测任务的特性自适应地决定是否需要推理的模型。为此，我们设计了一个两阶段训练框架：首先进行混合微调（HFT）以进行冷启动初始化，然后通过混合推理组策略优化（HGRPO）进行在线强化学习，以隐式学习何时选择合适的推理模式。实验结果表明，Fake-HR1能够在不同类型的查询中自适应地进行推理，不仅在推理能力和生成检测性能上超越现有语言模型，还在响应效率上显著提升。

Summary / 总结

The research aims to improve the ability of vision-language models to detect synthetic images by incorporating adaptive reasoning. The proposed Fake-HR1 model uses a two-stage training framework, including Hybrid Fine-Tuning and Hybrid-Reasoning Grouped Policy Optimization, to determine when reasoning is necessary. Experimental results demonstrate that Fake-HR1 outperforms existing models in both reasoning ability and generative detection performance while enhancing response efficiency.

论文针对检测合成图像时视觉-语言模型中过度推理导致的高资源消耗问题，提出了一个能够根据任务特性自适应决定是否需要推理的混合推理模型Fake-HR1。通过两阶段训练框架，Fake-HR1在推理能力和生成检测性能上均优于现有模型，同时显著提高了响应效率。

LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models

Authors: Nam V. Nguyen, Thong T. Doan, Luong Tran, Van Nguyen, Quang Pham

First: 2024-11-01T14:04:36+00:00 · Latest: 2026-02-10T18:09:04+00:00

Comments: 40 pages

Abs · PDF · Code1 · Code2 · Code3

Abstract

Mixture of experts (MoE) architectures have become a cornerstone for scaling up and are a key component in most large language models such as GPT-OSS, DeepSeek-V3, Llama-4, and Gemini-2.5. However, systematic research on MoE remains severely constrained by the prohibitive computational costs of training and evaluation, restricting large-scale studies accessible to most researchers. We introduce LibMoE, a unified framework for reproducible, efficient, and extensible MoE research that supports both pretraining and sparse-upcycling regimes. Beyond unified implementations, the framework provides transparent analytical tools for probing routing and expert dynamics. Leveraging this foundation, we conduct a comprehensive analysis along three dimensions: (i) routing dynamics, covering expert selection patterns, routing stability and optimality, and how routing entropy reveals task specialization and expert diversity; (ii) the effect of lightweight initialization on load balancing, demonstrating how subtle changes in router initialization shape early expert utilization; and (iii) training regime differences, revealing how sparse upcycling and full pretraining exhibit distinct routing patterns and stability profiles. By lowering the barrier to entry and standardizing evaluation, along with our comprehensive analysis, LibMoE broadens access to MoE research and establishes a reliable benchmark to guide future innovations. GitHub: https://github.com/Fsoft-AIC/LibMoE.

中文标题/摘要

标题：LIBMoE：大规模语言模型中混合专家库的全面基准测试

混合专家（MoE）架构已成为扩展的关键基石，并且是大多数大型语言模型（如GPT-OSS、DeepSeek-V3、Llama-4和Gemini-2.5）的重要组成部分。然而，由于训练和评估的高昂计算成本，系统性的MoE研究受到严重限制，限制了大多数研究人员可访问的大规模研究。我们引入了LibMoE，这是一个统一的框架，支持可重复、高效和可扩展的MoE研究，涵盖预训练和稀疏上行编译两种模式。除了统一的实现，该框架还提供了透明的分析工具，用于探究路由和专家动态。基于这一基础，我们从三个维度进行了全面分析：(i) 路由动态，涵盖专家选择模式、路由稳定性和最优性，以及路由熵如何揭示任务专业化和专家多样性；(ii) 轻量级初始化对负载均衡的影响，展示了路由器初始化的细微变化如何影响早期专家的利用；(iii) 训练模式差异，揭示了稀疏上行编译和全预训练表现出不同的路由模式和稳定性特征。通过降低进入门槛、标准化评估，并结合我们的全面分析，LibMoE 扩大了MoE研究的可访问性，并建立了可靠的基准，以指导未来创新。GitHub：https://github.com/Fsoft-AIC/LibMoE。

Summary / 总结

The research introduces LibMoE, a unified framework for MoE (mixture of experts) research in large language models, addressing the computational challenges by providing reproducible, efficient, and extensible implementations. The study analyzes routing dynamics, the impact of lightweight initialization on load balancing, and the differences in training regimes, revealing insights into expert selection patterns, routing stability, and task specialization. By lowering the barrier to entry, LibMoE enhances access to MoE research and establishes a benchmark for future innovations.

研究旨在通过引入LibMoE统一框架解决MoE架构研究中的计算挑战，该框架支持预训练和稀疏升级，并提供分析路由动态和专家行为的工具。关键发现包括专家选择模式、路由稳定性和路由器初始化对负载平衡的影响，以及稀疏升级和全预训练之间路由模式的差异。
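
Routing entropy, one of the quantities analysed in this entry, can be sketched as the entropy of the average routing distribution over a batch of tokens. The function names and the choice to average probabilities before taking the entropy are assumptions made for illustration; this does not use LibMoE's actual API.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def routing_entropy(router_logits):
    """Entropy (in nats) of the mean routing distribution over a batch of tokens."""
    probs = [softmax(row) for row in router_logits]
    n_experts = len(probs[0])
    mean = [sum(p[e] for p in probs) / len(probs) for e in range(n_experts)]
    return -sum(p * math.log(p) for p in mean if p > 0.0)


balanced = [[0.0, 0.0, 0.0, 0.0]] * 3   # every expert equally likely
collapsed = [[10.0, 0.0, 0.0, 0.0]]     # almost all mass on expert 0
print(routing_entropy(balanced) > routing_entropy(collapsed))  # prints True
```

High entropy indicates that tokens are spread evenly across experts, while a value near zero indicates that routing has collapsed onto a few experts.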

Perception with Guarantees: Certified Pose Estimation via Reachability Analysis

Authors: Tobias Ladner, Yasser Shoukry, Matthias Althoff

First: 2026-02-10T17:55:49+00:00 · Latest: 2026-02-10T17:55:49+00:00

Abs · PDF · Code1 · Code2

Abstract

Agents in cyber-physical systems are increasingly entrusted with safety-critical tasks. Ensuring safety of these agents often requires localizing the pose for subsequent actions. Pose estimates can, e.g., be obtained from various combinations of lidar sensors, cameras, and external services such as GPS. Crucially, in safety-critical domains, a rough estimate is insufficient to formally determine safety, i.e., guaranteeing safety even in the worst-case scenario, and external services might additionally not be trustworthy. We address this problem by presenting a certified pose estimation in 3D solely from a camera image and a well-known target geometry. This is realized by formally bounding the pose, which is computed by leveraging recent results from reachability analysis and formal neural network verification. Our experiments demonstrate that our approach efficiently and accurately localizes agents in both synthetic and real-world experiments.

中文标题/摘要

标题：有保障的感知：基于可达性分析的认证姿态估计

在计算物理系统中的代理越来越多地承担关键安全任务。确保这些代理的安全通常需要对其姿态进行定位以便后续操作。姿态估计可以从激光雷达传感器、摄像头以及外部服务（如GPS）等多种组合中获得。关键在于，在关键安全领域，粗略的估计不足以正式确定安全性，即在最坏情况下保证安全性，而外部服务也可能不可靠。我们通过仅从摄像头图像和已知目标几何学出发，提出了一种认证的3D姿态估计方法来解决这一问题。这通过正式限制姿态实现，姿态是通过利用最近的可达性分析和形式神经网络验证的最新成果来计算的。我们的实验表明，该方法在合成和真实世界实验中都能高效且准确地定位代理。

Summary / 总结

The research aims to ensure the safety of agents in cyber-physical systems by providing certified pose estimation, which guarantees safety in worst-case scenarios. The method uses reachability analysis and formal neural network verification to compute the pose from a camera image and a known target geometry. The experiments show that the approach accurately localizes agents in both synthetic and real-world settings.

研究旨在通过提供认证的姿态估计来确保网络物理系统中代理的安全性，这种估计在最坏情况下也能保证安全。方法利用可达性分析和形式神经网络验证来从相机图像和已知目标几何中计算姿态。实验表明，该方法在合成和真实世界环境中都能准确地定位代理。
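
To give a feel for what formally bounding a network output means, the sketch below propagates an input box through one affine layer and a ReLU using interval arithmetic. This is a basic, generic bounding technique chosen for illustration; it is not the reachability method used by the authors, and the weights and input ranges are made up.

```python
def affine_bounds(lower, upper, weights, bias):
    """Propagate an input box [lower, upper] through y = W x + b."""
    out_lo, out_hi = [], []
    for row, b in zip(weights, bias):
        lo = hi = b
        for w, l, u in zip(row, lower, upper):
            # A non-negative weight maps lower to lower and upper to upper;
            # a negative weight swaps the roles of the two input bounds.
            lo += w * l if w >= 0 else w * u
            hi += w * u if w >= 0 else w * l
        out_lo.append(lo)
        out_hi.append(hi)
    return out_lo, out_hi


def relu_bounds(lower, upper):
    """ReLU is monotone, so it can be applied to each bound separately."""
    return [max(0.0, l) for l in lower], [max(0.0, u) for u in upper]


lo, hi = affine_bounds(
    lower=[0.0, -1.0], upper=[1.0, 1.0],
    weights=[[1.0, -1.0], [2.0, 1.0]], bias=[0.0, -1.0],
)
print(lo, hi)                # prints [-1.0, -2.0] [2.0, 2.0]
print(relu_bounds(lo, hi))   # prints ([0.0, 0.0], [2.0, 2.0])
```

Every output the layer can produce for inputs in the box is guaranteed to lie inside the returned bounds, which is the kind of worst-case guarantee a rough point estimate cannot give.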

ALIVE: Animate Your World with Lifelike Audio-Video Generation

Authors: Ying Guo, Qijun Gan, Yifu Zhang, Jinlai Liu, Yifei Hu, Pan Xie, Dongjun Qian, Yu Zhang, Ruiqi Li, Yuqi Zhang, Ruibiao Lu, Xiaofeng Mei, Bo Han, Xiang Yin, Bingyue Peng, Zehuan Yuan

First: 2026-02-09T14:06:03+00:00 · Latest: 2026-02-10T17:53:52+00:00

Comments: Technical report for ALIVE. Bytedance ALIVE Team. Homepage: https://foundationvision.github.io/Alive/

Abs · PDF · Code1 · Code2 · Code3 · Project1

Abstract

Video generation is rapidly evolving towards unified audio-video generation. In this paper, we present ALIVE, a generation model that adapts a pretrained Text-to-Video (T2V) model to Sora-style audio-video generation and animation. In particular, the model unlocks the Text-to-Video&Audio (T2VA) and Reference-to-Video&Audio (animation) capabilities compared to the T2V foundation models. To support the audio-visual synchronization and reference animation, we augment the popular MMDiT architecture with a joint audio-video branch which includes TA-CrossAttn for temporally-aligned cross-modal fusion and UniTemp-RoPE for precise audio-visual alignment. Meanwhile, a comprehensive data pipeline consisting of audio-video captioning, quality control, etc., is carefully designed to collect high-quality finetuning data. Additionally, we introduce a new benchmark to perform a comprehensive model test and comparison. After continue pretraining and finetuning on million-level high-quality data, ALIVE demonstrates outstanding performance, consistently outperforming open-source models and matching or surpassing state-of-the-art commercial solutions. With detailed recipes and benchmarks, we hope ALIVE helps the community develop audio-video generation models more efficiently. Official page: https://github.com/FoundationVision/Alive.

中文标题/摘要

标题：ALIVE：以逼真音视频生成技术激活世界

音视频生成正迅速向统一的音视频生成发展。本文介绍了ALIVE，一种将预训练的文本到视频（T2V）模型适应为Sora风格的音视频生成和动画的生成模型。特别是，该模型相比T2V基础模型解锁了文本到视频和音频（T2VA）以及参考到视频和音频（动画）的能力。为了支持音视频同步和参考动画，我们扩展了流行的MMDiT架构，加入了一个联合音视频分支，包括TA-CrossAttn进行时间对齐的跨模态融合和UniTemp-RoPE进行精确的音视频对齐。同时，我们精心设计了一个包含音视频字幕、质量控制等的全面数据管道，以收集高质量的微调数据。此外，我们引入了一个新的基准测试，以进行全面的模型测试和比较。经过数百万级高质量数据的持续预训练和微调，ALIVE展示了卓越的性能，持续超越开源模型，并与最先进的商业解决方案相当或超越。通过详细的配方和基准测试，我们希望ALIVE能够帮助社区更高效地开发音视频生成模型。官方网站：https://github.com/FoundationVision/Alive.

Summary / 总结

ALIVE is a generation model that adapts a pretrained Text-to-Video (T2V) model to support Text-to-Video&Audio (T2VA) and Reference-to-Video&Audio (animation) generation. It augments the MMDiT architecture with a joint audio-video branch, using TA-CrossAttn for temporally aligned cross-modal fusion and UniTemp-RoPE for audio-visual alignment. After continued pretraining and finetuning on million-level high-quality data, ALIVE consistently outperforms open-source models and matches or surpasses state-of-the-art commercial solutions on the authors' new benchmark.

ALIVE 是一个增强的 Text-to-Video 模型，能够实现音频视频生成和动画。它使用了修改后的 MMDiT 架构，包含联合音频视频分支以提高同步和对齐效果。经过大量高质量数据的训练后，ALIVE 展现出优越的性能，超越开源模型并达到或超过商业解决方案的水平。还引入了一个新的基准测试以进行全面的模型评估和比较。

Decoupled Reasoning with Implicit Fact Tokens (DRIFT): A Dual-Model Framework for Efficient Long-Context Inference

Authors: Wenxuan Xie, Yujia Wang, Xin Tan, Chaochao Lu, Xia Hu, Xuhong Wang

First: 2026-02-10T17:42:31+00:00 · Latest: 2026-02-10T17:42:31+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

The integration of extensive, dynamic knowledge into Large Language Models (LLMs) remains a significant challenge due to the inherent entanglement of factual data and reasoning patterns. Existing solutions, ranging from non-parametric Retrieval-Augmented Generation (RAG) to parametric knowledge editing, are often constrained in practice by finite context windows, retriever noise, or the risk of catastrophic forgetting. In this paper, we propose DRIFT, a novel dual-model architecture designed to explicitly decouple knowledge extraction from the reasoning process. Unlike static prompt compression, DRIFT employs a lightweight knowledge model to dynamically compress document chunks into implicit fact tokens conditioned on the query. These dense representations are projected into the reasoning model's embedding space, replacing raw, redundant text while maintaining inference accuracy. Extensive experiments show that DRIFT significantly improves performance on long-context tasks, outperforming strong baselines among comparably sized models. Our approach provides a scalable and efficient paradigm for extending the effective context window and reasoning capabilities of LLMs. Our code is available at https://github.com/Lancelot-Xie/DRIFT.

中文标题/摘要

标题：解耦推理与隐式事实标记（DRIFT）：一种高效的长上下文推理双模型框架

将大量动态知识整合到大型语言模型（LLMs）中仍然是一个重大挑战，因为事实数据和推理模式之间存在固有的纠缠。现有的解决方案，从非参数检索增强生成（RAG）到参数知识编辑，通常由于有限的上下文窗口、检索器噪声或灾难性遗忘的风险而受到实践中的限制。在本文中，我们提出了一种名为DRIFT的新颖双模型架构，旨在显式地将知识提取与推理过程解耦。与静态提示压缩不同，DRIFT使用一个轻量级的知识模型，根据查询动态压缩文档片段为隐式事实标记。这些密集表示被投影到推理模型的嵌入空间中，替代原始冗余文本，同时保持推理准确性。大量实验表明，DRIFT在长上下文任务上的性能显著提高，优于同等规模模型的强基线。我们的方法为扩展LLMs的有效上下文窗口和推理能力提供了一种可扩展且高效的范式。我们的代码可在https://github.com/Lancelot-Xie/DRIFT获取。

Summary / 总结

The paper proposes DRIFT, a dual-model framework that decouples knowledge extraction from reasoning to address the limitations of existing knowledge integration methods in Large Language Models. DRIFT uses a lightweight knowledge model to dynamically compress document chunks into implicit fact tokens, which are then projected into the reasoning model's embedding space. Experiments demonstrate that DRIFT outperforms strong baselines on long-context tasks while maintaining inference accuracy and providing a scalable solution for extending the effective context window of LLMs.

论文提出了一种名为DRIFT的双模型框架，通过将知识提取与推理过程解耦来解决现有知识集成方法在大型语言模型中的局限性。DRIFT使用轻量级的知识模型动态压缩文档片段为隐式事实令牌，然后将这些密集表示投影到推理模型的嵌入空间中。实验表明，DRIFT在长上下文任务上优于强基线模型，同时保持推理准确性，并提供了一种扩展大型语言模型有效上下文窗口和推理能力的可扩展解决方案。

ADORA: Training Reasoning Models with Dynamic Advantage Estimation on Reinforcement Learning

Authors: Qingnan Ren, Shiting Huang, Zhen Fang, Zehui Chen, Lin Chen, Lijun Li, Feng Zhao

First: 2026-02-10T17:40:39+00:00 · Latest: 2026-02-10T17:40:39+00:00

Abs · PDF · Code1 · Code2

Abstract

Reinforcement learning has become a cornerstone technique for developing reasoning models in complex tasks, ranging from mathematical problem-solving to imaginary reasoning. The optimization of these models typically relies on policy gradient methods, whose efficacy hinges on the accurate estimation of an advantage function. However, prevailing methods typically employ static advantage estimation, a practice that leads to inefficient credit assignment by neglecting the dynamic utility of training samples over time. This limitation results in suboptimal policy updates, which in turn manifest as slower convergence rates and increased learning instability, as models fail to adapt to evolving sample utilities effectively. To address this problem, we introduce ADORA (Advantage Dynamics via Online Rollout Adaptation), a novel framework for policy optimization. ADORA dynamically adjusts the advantage function's weighting by adaptively categorizing training data into temporarily advantageous and disadvantageous samples, based on their evolving utility during online model rollouts. This tailored data differentiation strategy allows ADORA to be seamlessly integrated into existing policy optimization algorithms without significant architectural modifications, enabling the policy to prioritize learning from more informative experiences and thereby achieve more efficient policy updates. Extensive evaluations across diverse model families and varying data scales demonstrate that ADORA is a robust and efficient framework. It significantly enhances long reasoning in both geometric and mathematical tasks, consistently achieving notable performance gains without requiring sensitive hyperparameter tuning.

中文标题/摘要

标题：ADORA：通过动态优势估计训练推理模型

强化学习已成为开发复杂任务中推理模型的核心技术，从数学问题求解到想象推理。这些模型的优化通常依赖于策略梯度方法，其效果取决于优势函数的准确估计。然而，现有的方法通常采用静态优势估计，这导致了信用分配效率低下，忽略了随时间变化的训练样本的动态效用。这一限制导致了次优的策略更新，进而表现为收敛速度较慢和学习不稳定，因为模型无法有效地适应样本效用的变化。为了解决这个问题，我们提出了ADORA（基于在线滚动适应的优势动态），一种新的策略优化框架。ADORA通过在线模型滚动期间临时优势和劣势样本的自适应分类，动态调整优势函数的权重。这种定制化的数据分类策略使ADORA能够无缝集成到现有的策略优化算法中，无需显著的架构修改，从而使策略能够优先学习更有信息量的经验，从而实现更高效的策略更新。在不同模型家族和数据规模下的广泛评估表明，ADORA是一个稳健且高效的框架。它显著提高了几何和数学任务中的长推理能力，且无需敏感的超参数调整即可实现持续的性能提升。

Summary / 总结

The paper introduces ADORA, a framework for policy optimization in reinforcement learning that dynamically adjusts the advantage function's weighting. It addresses the inefficiency of static advantage estimation by categorizing training data into temporarily advantageous and disadvantageous samples based on their evolving utility during online rollouts. ADORA can be integrated into existing policy optimization algorithms without significant architectural modifications. Evaluations across diverse model families and data scales show consistent gains in long reasoning on geometric and mathematical tasks, without sensitive hyperparameter tuning.

论文提出了ADORA方法，通过在线模型滚动过程中动态调整优势函数的权重，将训练数据分为暂时有利和不利样本。这种方法通过优先学习更有信息量的经验来改进策略更新，从而加快收敛速度并提高稳定性。评估结果显示，ADORA在几何和数学任务中的长期推理方面表现出色，且在不同模型家族和数据规模下均能获得一致的性能提升。

RoboSubtaskNet: Temporal Sub-task Segmentation for Human-to-Robot Skill Transfer in Real-World Environments

Authors: Dharmendra Sharma, Archit Sharma, John Reberio, Vaibhav Kesharwani, Peeyush Thakur, Narendra Kumar Dhar, Laxmidhar Behera

First: 2026-02-10T17:37:35+00:00 · Latest: 2026-02-10T17:37:35+00:00

Abs · PDF · Code1 · Code2

Abstract

Temporally locating and classifying fine-grained sub-task segments in long, untrimmed videos is crucial to safe human-robot collaboration. Unlike generic activity recognition, collaborative manipulation requires sub-task labels that are directly robot-executable. We present RoboSubtaskNet, a multi-stage human-to-robot sub-task segmentation framework that couples attention-enhanced I3D features (RGB plus optical flow) with a modified MS-TCN employing a Fibonacci dilation schedule to better capture short-horizon transitions such as reach-pick-place. The network is trained with a composite objective comprising cross-entropy and temporal regularizers (truncated MSE and a transition-aware term) to reduce over-segmentation and to encourage valid sub-task progressions. To close the gap between vision benchmarks and control, we introduce RoboSubtask, a dataset of healthcare and industrial demonstrations annotated at the sub-task level and designed for deterministic mapping to manipulator primitives. Empirically, RoboSubtaskNet outperforms MS-TCN and MS-TCN++ on GTEA and our RoboSubtask benchmark (boundary-sensitive and sequence metrics), while remaining competitive on the long-horizon Breakfast benchmark. Specifically, RoboSubtaskNet attains F1@50 = 79.5%, Edit = 88.6%, Acc = 78.9% on GTEA; F1@50 = 30.4%, Edit = 52.0%, Acc = 53.5% on Breakfast; and F1@50 = 94.2%, Edit = 95.6%, Acc = 92.2% on RoboSubtask. We further validate the full perception-to-execution pipeline on a 7-DoF Kinova Gen3 manipulator, achieving reliable end-to-end behavior in physical trials (overall task success approximately 91.25%). These results demonstrate a practical path from sub-task level video understanding to deployed robotic manipulation in real-world settings.

中文标题/摘要

标题：RoboSubtaskNet：在实际环境中的类人到机器人技能转移的时序子任务分割

在长且未剪辑的视频中，准确地定位和分类细粒度的子任务片段对于安全的人机协作至关重要。与通用活动识别不同，协作操作需要可以直接由机器人执行的子任务标签。我们提出了RoboSubtaskNet，这是一种多阶段的人机子任务分割框架，结合了注意力增强的I3D特征（RGB加上光流）与修改后的MS-TCN，并采用斐波那契扩张计划来捕捉更好的短期过渡，如伸手-拾取-放置。该网络通过复合目标训练，包括交叉熵和时间正则化（截断均方误差和一个过渡感知项），以减少过度分割并鼓励有效的子任务进展。为了缩小视觉基准与控制之间的差距，我们引入了RoboSubtask数据集，该数据集包含医疗保健和工业演示，并在子任务级别进行了注释，旨在确定性地映射到操作器原语。实验中，RoboSubtaskNet在GTEA和我们的RoboSubtask基准测试（边界敏感和序列指标）上优于MS-TCN和MS-TCN++，在长时序Breakfast基准测试上保持竞争力。具体而言，RoboSubtaskNet在GTEA上达到F1@50=79.5%，编辑距离=88.6%，准确率=78.9%；在Breakfast上达到F1@50=30.4%，编辑距离=52.0%，准确率=53.5%；在RoboSubtask上达到F1@50=94.2%，编辑距离=95.6%，准确率=92.2%。我们进一步在7自由度Kinova Gen3操作器上验证了完整的感知到执行管道，在物理试验中实现了可靠的端到端行为（总体任务成功率约为91.25%）。这些结果表明，从子任务级别视频理解到实际环境中的部署机器人操作的实用路径。

Summary / 总结

The research aims to develop a framework for segmenting fine-grained sub-tasks in long videos to enable safe human-robot collaboration. RoboSubtaskNet uses a multi-stage approach with attention-enhanced I3D features and a modified MS-TCN with a Fibonacci dilation schedule. The network is trained with a composite objective to reduce over-segmentation and encourage valid sub-task progressions. It outperforms MS-TCN and MS-TCN++ on GTEA (F1@50 = 79.5%, Edit = 88.6%, Acc = 78.9%) and on the new RoboSubtask benchmark (F1@50 = 94.2%, Edit = 95.6%, Acc = 92.2%), while remaining competitive on the long-horizon Breakfast benchmark (F1@50 = 30.4%, Edit = 52.0%, Acc = 53.5%). Physical trials on a 7-DoF Kinova Gen3 manipulator show reliable end-to-end behavior with an overall task success rate of about 91.25%.

研究旨在开发一种框架，用于在长视频中分割细粒度子任务，以实现安全的人机协作。RoboSubtaskNet 使用结合了注意力增强的 I3D 特征和修改后的 MS-TCN 的多阶段方法来准确分类子任务。该框架使用复合目标进行训练，以减少过度分割并鼓励有效的子任务进展。在基准测试中，RoboSubtaskNet 在 F1 分数和编辑距离方面优于 MS-TCN 和 MS-TCN++。在 7 自由度 manipulator 的物理试验中，展示了可靠的端到端行为。
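
The effect of a Fibonacci dilation schedule can be sketched by comparing receptive fields. The code below assumes the schedule starts at 1, 2, 3, 5, ... and uses kernel size 3 with stride 1; the entry does not state the exact starting values or layer count, so these are illustrative assumptions. The doubling schedule shown for comparison is the one used by the original MS-TCN.

```python
def fibonacci_dilations(num_layers):
    """Dilation rates 1, 2, 3, 5, 8, ... with one rate per layer (assumed start)."""
    rates, a, b = [], 1, 2
    for _ in range(num_layers):
        rates.append(a)
        a, b = b, a + b
    return rates


def receptive_field(dilations, kernel_size=3):
    """Frames covered by a stack of stride-1 dilated 1-D convolutions."""
    return 1 + (kernel_size - 1) * sum(dilations)


fib = fibonacci_dilations(6)
doubling = [2 ** i for i in range(6)]
print(fib, receptive_field(fib))            # prints [1, 2, 3, 5, 8, 13] 65
print(doubling, receptive_field(doubling))  # prints [1, 2, 4, 8, 16, 32] 127
```

For the same depth, the Fibonacci schedule grows more slowly than doubling, so more layers operate at small temporal scales, which is consistent with the stated goal of capturing short-horizon transitions.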

Discovering High Level Patterns from Simulation Traces

Authors: Sean Memery, Kartic Subr

First: 2026-02-10T17:31:39+00:00 · Latest: 2026-02-10T17:31:39+00:00

Abs · PDF · Code1 · Code2

Abstract

Artificial intelligence (AI) agents embedded in environments with physics-based interaction face many challenges including reasoning, planning, summarization, and question answering. This problem is exacerbated when a human user wishes to either guide or interact with the agent in natural language. Although the use of Language Models (LMs) is the default choice, as an AI tool, they struggle with tasks involving physics. The LM's capability for physical reasoning is learned from observational data, rather than being grounded in simulation. A common approach is to include simulation traces as context, but this suffers from poor scalability as simulation traces contain larger volumes of fine-grained numerical and semantic data. In this paper, we propose a natural language guided method to discover coarse-grained patterns (e.g., 'rigid-body collision', 'stable support', etc.) from detailed simulation logs. Specifically, we synthesize programs that operate on simulation logs and map them to a series of high level activated patterns. We show, through two physics benchmarks, that this annotated representation of the simulation log is more amenable to natural language reasoning about physical systems. We demonstrate how this method enables LMs to generate effective reward programs from goals specified in natural language, which may be used within the context of planning or supervised learning.

中文标题/摘要

标题：从仿真轨迹中发现高级模式

嵌入具有物理交互环境的人工智能（AI）代理面临许多挑战，包括推理、规划、总结和问答。当人类用户希望指导或以自然语言与代理交互时，这一问题会进一步加剧。尽管语言模型（LMs）是默认选择，但它们在涉及物理的任务上表现不佳。LMs的物理推理能力是从观察数据中学习的，而不是基于模拟。一种常见方法是将仿真轨迹作为上下文包含进来，但这种方法在可扩展性方面存在问题，因为仿真轨迹包含大量细粒度的数值和语义数据。在本文中，我们提出了一种自然语言引导的方法，从详细的仿真日志中发现粗粒度模式（例如，“刚体碰撞”、“稳定支撑”等）。具体来说，我们合成了操作仿真日志的程序，并将它们映射到一系列高级激活模式。通过两个物理基准，我们展示了这种注释的仿真日志表示形式更易于进行关于物理系统的自然语言推理。我们展示了这种方法如何使LMs能够从自然语言指定的目标中生成有效的奖励程序，这些程序可以在规划或监督学习的上下文中使用。

Summary / 总结

This paper addresses the challenge of enabling AI agents to reason about physical interactions in simulation environments, particularly when guided by natural language. The authors propose a method that synthesizes programs from simulation logs to discover high-level patterns, such as 'rigid-body collision' and 'stable support'. Through two physics benchmarks, they demonstrate that this approach makes the simulation data more accessible for natural language reasoning, allowing language models to generate effective reward programs from natural language goals.

本文旨在解决AI代理在基于物理的环境中进行推理和规划的挑战，特别是在通过自然语言与人类交互时。作者提出了一种从仿真日志中发现高层模式的方法，然后将这些模式映射为一系列粗粒度模式。通过两个物理基准测试，他们展示了这种方法使仿真日志更适合自然语言对物理系统的推理，从而使语言模型能够从自然语言目标生成有效的奖励程序。
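
A hand-written example of the kind of program that maps a detailed simulation log to coarse patterns is sketched below. The log format (one dictionary per step, mapping a contact pair to its relative speed), the thresholds, and the simplified pattern definitions are all hypothetical; in the paper such programs are synthesized rather than written by hand.

```python
def detect_patterns(trace, speed_eps=0.05, min_support_steps=3):
    """Map per-step contact records to (step, pattern, pair) events."""
    events, resting = [], {}
    previous_contacts = set()
    for step, frame in enumerate(trace):
        contacts = set(frame)
        for pair, rel_speed in frame.items():
            # A new contact with noticeable relative speed counts as a collision.
            if pair not in previous_contacts and rel_speed > speed_eps:
                events.append((step, "rigid-body collision", pair))
            # A contact that stays nearly motionless long enough counts as support.
            if rel_speed <= speed_eps:
                resting[pair] = resting.get(pair, 0) + 1
                if resting[pair] == min_support_steps:
                    events.append((step, "stable support", pair))
            else:
                resting[pair] = 0
        # Forget resting counters for pairs that are no longer in contact.
        for pair in list(resting):
            if pair not in contacts:
                del resting[pair]
        previous_contacts = contacts
    return events


pair = ("box", "floor")
trace = [{}, {pair: 1.2}, {pair: 0.3}, {pair: 0.0}, {pair: 0.0}, {pair: 0.0}]
for event in detect_patterns(trace):
    print(event)
# prints (1, 'rigid-body collision', ('box', 'floor'))
# prints (5, 'stable support', ('box', 'floor'))
```

Six steps of numeric data collapse into two named events, which is the compact, annotated form the entry argues is easier for a language model to reason over.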

A Collaborative Safety Shield for Safe and Efficient CAV Lane Changes in Congested On-Ramp Merging

Authors: Bharathkumar Hegde, Melanie Bouroche

First: 2026-02-10T17:30:09+00:00 · Latest: 2026-02-10T17:30:09+00:00

Comments: Accepted in IEEE IV 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Lane changing in dense traffic is a significant challenge for Connected and Autonomous Vehicles (CAVs). Existing lane change controllers primarily either ensure safety or collaboratively improve traffic efficiency, but do not consider these conflicting objectives together. To address this, we propose the Multi-Agent Safety Shield (MASS), designed using Control Barrier Functions (CBFs) to enable safe and collaborative lane changes. The MASS enables collaboration by capturing multi-agent interactions among CAVs through interaction topologies constructed as a graph using a simple algorithm. Further, a state-of-the-art Multi-Agent Reinforcement Learning (MARL) lane change controller is extended by integrating MASS to ensure safety and defining a customised reward function to prioritise efficiency improvements. As a result, we propose a lane change controller, known as MARL-MASS, and evaluate it in a congested on-ramp merging simulation. The results demonstrate that MASS enables collaborative lane changes with safety guarantees by strictly respecting the safety constraints. Moreover, the proposed custom reward function improves the stability of MARL policies trained with a safety shield. Overall, by encouraging the exploration of a collaborative lane change policy while respecting safety constraints, MARL-MASS effectively balances the trade-off between ensuring safety and improving traffic efficiency in congested traffic. The code for MARL-MASS is available with an open-source licence at https://github.com/hkbharath/MARL-MASS

中文标题/摘要

标题：一种协作安全防护罩以确保拥堵入匝道并流中连接车辆的安全高效变道

在密集交通中变道是连接和自主车辆（CAVs）面临的重要挑战。现有的变道控制器主要确保安全或协作提高交通效率，但没有同时考虑这些相互冲突的目标。为了解决这一问题，我们提出了多智能体安全防护罩（MASS），该防护罩利用控制障碍函数（CBFs）来实现安全和协作的变道。MASS通过使用简单算法构建图来捕捉CAVs之间的多智能体交互，从而实现协作。进一步地，通过将MASS集成到最先进的多智能体强化学习（MARL）变道控制器中，并定义自定义奖励函数以优先提高效率，扩展了MARL变道控制器。因此，我们提出了一种变道控制器，称为MARL-MASS，并在拥堵入匝道并流中进行了仿真评估。结果表明，MASS通过严格遵守安全约束，实现了具有安全保证的协作变道。此外，提出的自定义奖励函数提高了使用安全防护罩训练的MARL策略的稳定性。总体而言，通过鼓励探索协作变道策略同时遵守安全约束，MARL-MASS有效地在拥堵交通中平衡了确保安全和提高交通效率之间的权衡。MARL-MASS的代码在开源许可下可在https://github.com/hkbharath/MARL-MASS上获得

Summary / 总结

The research addresses the challenge of safe and efficient lane changes for Connected and Autonomous Vehicles (CAVs) in dense traffic. It proposes the Multi-Agent Safety Shield (MASS) using Control Barrier Functions to enable safe and collaborative lane changes. The method integrates a state-of-the-art Multi-Agent Reinforcement Learning (MARL) controller with MASS and a custom reward function. Experimental results show that MARL-MASS ensures safety while improving traffic efficiency in congested on-ramp merging scenarios.

论文针对密集交通中网联与自动驾驶车辆（CAVs）变道的安全与效率问题，提出了多智能体安全防护罩（MASS），利用控制障碍函数（CBFs）实现安全且协作的变道。该方法将MASS集成到最先进的多智能体强化学习（MARL）控制器中，并通过自定义奖励函数优先考虑效率。实验结果表明，MARL-MASS在拥堵的入口匝道合流仿真中确保了安全并提高了交通效率。
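
As a rough illustration of how a CBF-based safety shield can override a learned action, the sketch below filters a nominal speed command in a one-dimensional car-following setting. It is not the MASS formulation from the paper: the single-integrator gap model, the barrier `h = gap - d_safe`, the gain `alpha`, and the names `shield_speed` and `simulate` are all assumptions chosen for brevity.

```python
def shield_speed(v_nominal, v_lead, gap, d_safe, alpha):
    """Return the speed closest to v_nominal that satisfies a CBF condition.

    Assumed model (not from the paper): d(gap)/dt = v_lead - v_ego.
    Barrier: h = gap - d_safe, and the condition dh/dt + alpha * h >= 0
    reduces to v_ego <= v_lead + alpha * h.
    """
    h = gap - d_safe
    v_max_safe = v_lead + alpha * h
    # With one scalar control and one upper bound, the usual CBF quadratic
    # program (stay as close as possible to v_nominal) has this closed form.
    return min(v_nominal, v_max_safe)


def simulate(gap, v_lead, v_nominal, d_safe, alpha, dt, steps):
    """Euler-integrate the gap while the shield filters a fixed nominal speed."""
    gaps = [gap]
    for _ in range(steps):
        v_ego = shield_speed(v_nominal, v_lead, gap, d_safe, alpha)
        gap += dt * (v_lead - v_ego)
        gaps.append(gap)
    return gaps
```

For example, with a lead vehicle at 20 m/s, a 12 m gap, `d_safe = 10` and `alpha = 0.5`, a nominal command of 30 m/s is reduced to 21 m/s, while a command of 15 m/s passes through unchanged. The shield only intervenes when the nominal action would violate the barrier condition, which is the property that lets an RL policy keep exploring inside the safe set.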

Answer First, Reason Later: Aligning Search Relevance via Mode-Balanced Reinforcement Learning

Authors: Shijie Zhang, Xiang Guo, Rujun Guo, Shaoyu Liu, Xiaozhao Wang, Guanjun Jiang, Kevin Zhang

First: 2026-02-10T17:28:12+00:00 · Latest: 2026-02-10T17:28:12+00:00

Abstract

Building a search relevance model that achieves both low latency and high performance is a long-standing challenge in the search industry. To satisfy the millisecond-level response requirements of online systems while retaining the interpretable reasoning traces of Large Language Models (LLMs), we propose a novel Answer-First, Reason Later (AFRL) paradigm. This paradigm requires the model to output the definitive relevance score in the very first token, followed by a structured logical explanation. Inspired by the success of reasoning models, we adopt a "Supervised Fine-Tuning (SFT) + Reinforcement Learning (RL)" pipeline to achieve AFRL. However, directly applying existing RL training often leads to mode collapse in the search relevance task, where the model forgets complex long-tail rules in pursuit of high rewards. From an information theory perspective, RL inherently minimizes the Reverse KL divergence, which tends to seek probability peaks (mode-seeking) and is prone to "reward hacking." On the other hand, SFT minimizes the Forward KL divergence, forcing the model to cover the data distribution (mode-covering) and effectively anchoring expert rules. Based on this insight, we propose a Mode-Balanced Optimization strategy, incorporating an SFT auxiliary loss into Stepwise-GRPO training to balance these two properties. Furthermore, we construct an automated instruction evolution system and a multi-stage curriculum to ensure expert-level data quality. Extensive experiments demonstrate that our 32B teacher model achieves state-of-the-art performance. Moreover, the AFRL architecture enables efficient knowledge distillation, successfully transferring expert-level logic to a 0.6B model, thereby reconciling reasoning depth with deployment latency.

中文标题/摘要

标题：先回答，后解释：通过模式平衡强化学习对齐搜索相关性

在搜索引擎领域，构建既能实现低延迟又能保持高性能的搜索相关性模型一直是一个长期挑战。为满足在线系统毫秒级响应要求的同时保留大型语言模型（LLMs）的可解释推理痕迹，我们提出了一种新颖的“先回答，后解释（AFRL）”范式。该范式要求模型在第一个词就输出确定的相关性得分，随后给出结构化的逻辑解释。受推理模型成功的启发，我们采用“监督微调（SFT）+强化学习（RL）”管道来实现AFRL。然而，直接应用现有的RL训练往往会导致搜索相关性任务中的“模式崩溃”，即模型为了获得高奖励而忘记复杂的长尾规则。从信息论角度看，RL本质上最小化“逆KL散度”，倾向于寻找概率峰值（模式寻求），容易导致“奖励作弊”。另一方面，SFT最小化“前向KL散度”，迫使模型覆盖数据分布（模式覆盖），并有效锚定专家规则。基于这一洞察，我们提出了一种“模式平衡优化”策略，在逐步-GRPO训练中结合SFT辅助损失来平衡这两种特性。此外，我们构建了一个自动指令进化系统和多阶段课程，以确保专家级数据质量。大量实验表明，我们的32B教师模型达到了最先进的性能。此外，AFRL架构使知识蒸馏变得高效，成功地将专家级逻辑转移到了0.6B模型上，从而在推理深度与部署延迟之间达成平衡。

Summary / 总结

This paper addresses the challenge of building a search relevance model that combines low latency with high performance. It introduces the Answer-First, Reason Later (AFRL) paradigm, in which the model outputs a definitive relevance score in its very first token, followed by a structured logical explanation. To achieve this, the authors use a Supervised Fine-Tuning (SFT) plus Reinforcement Learning (RL) pipeline and propose a Mode-Balanced Optimization strategy that adds an SFT auxiliary loss to Stepwise-GRPO training to avoid mode collapse, so the model keeps covering the data distribution while seeking high rewards. Experiments show that the 32B teacher model achieves state-of-the-art performance, and the AFRL architecture allows efficient knowledge distillation to a 0.6B model, reconciling reasoning depth with deployment latency.

论文解决了构建同时具备低延迟和高性能的搜索相关性模型的挑战。提出了Answer-First, Reason Later (AFRL)范式，该范式要求模型在第一个token中输出确定的相关性得分，随后给出结构化的逻辑解释。为实现这一目标，作者使用了监督微调(SFT)加上强化学习(RL)的管道，并提出了一种模式平衡优化策略来防止模式崩溃。该策略结合SFT和RL，确保模型覆盖数据分布的同时寻求高奖励。AFRL架构还支持高效的知识蒸馏，能够将专家级逻辑转移到较小的模型中，从而平衡推理深度和部署延迟。广泛的实验表明，所提出的模型在性能上优于现有方法。
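
The mode-seeking versus mode-covering contrast in the abstract can be checked on a toy discrete example. The sketch below is only an illustration: the three-bin distributions are made up, and `mode_balanced_loss` with its `sft_weight` parameter is a hypothetical stand-in for "RL objective plus SFT auxiliary loss", not the paper's Stepwise-GRPO objective.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy bimodal "data" distribution with two equally likely modes.
p_data = [0.495, 0.01, 0.495]
# Candidate that concentrates on a single mode (mode-seeking behaviour).
q_peak = [0.98, 0.01, 0.01]
# Candidate that spreads its mass over every bin (mode-covering behaviour).
q_spread = [1 / 3, 1 / 3, 1 / 3]

candidates = [("peak", q_peak), ("spread", q_spread)]
# Forward KL(p_data || q): the quantity SFT-style likelihood training minimizes.
forward = {name: kl_divergence(p_data, q) for name, q in candidates}
# Reverse KL(q || p_data): the direction the abstract associates with RL.
reverse = {name: kl_divergence(q, p_data) for name, q in candidates}


def mode_balanced_loss(rl_loss, sft_loss, sft_weight=0.1):
    """Hypothetical combination: RL loss plus a weighted SFT auxiliary term."""
    return rl_loss + sft_weight * sft_loss
```

On these numbers the forward KL prefers the spread candidate (roughly 0.36 versus 1.59), while the reverse KL prefers the peaked one (roughly 0.63 versus 0.91). That asymmetry is the motivation the abstract gives for mixing the two objectives.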

ESTAR: Early-Stopping Token-Aware Reasoning For Efficient Inference

Authors: Junda Wang, Zhichao Yang, Dongxu Zhang, Sanjit Singh Batra, Robert E. Tillman

First: 2026-02-10T17:27:26+00:00 · Latest: 2026-02-10T17:27:26+00:00

Abstract

Large reasoning models (LRMs) achieve state-of-the-art performance by generating long chains-of-thought, but often waste computation on redundant reasoning after the correct answer has already been reached. We introduce Early-Stopping for Token-Aware Reasoning (ESTAR), which detects and reduces such reasoning redundancy to improve efficiency without sacrificing accuracy. Our method combines (i) a trajectory-based classifier that identifies when reasoning can be safely stopped, (ii) supervised fine-tuning to teach LRMs to propose self-generated <stop> signals, and (iii) <stop>-aware reinforcement learning that truncates rollouts at self-generated stop points with compute-aware rewards. Experiments on four reasoning datasets show that ESTAR reduces reasoning length by about 3.7x (from 4,799 to 1,290) while preserving accuracy (74.9% vs. 74.2%), with strong cross-domain generalization. These results highlight early stopping as a simple yet powerful mechanism for improving reasoning efficiency in LRMs.

中文标题/摘要

标题：ESTAR：面向高效推理的早停词元感知推理

大型推理模型（LRMs）通过生成长思维链达到最先进的性能，但往往在已经得出正确答案之后仍把计算浪费在冗余推理上。我们提出了面向词元感知推理的早停方法（ESTAR），它检测并减少这类推理冗余，在不牺牲准确性的前提下提高效率。我们的方法结合了：(i) 一个基于轨迹的分类器，用于识别何时可以安全地停止推理；(ii) 监督微调，用于教会LRMs主动给出自生成的<stop>信号；以及(iii) <stop>感知的强化学习，它在自生成的停止点处截断rollout，并采用计算感知的奖励。在四个推理数据集上的实验表明，ESTAR将推理长度减少了约3.7倍（从4,799减少到1,290），同时保持了准确性（74.9% vs. 74.2%），并具有很强的跨域泛化能力。这些结果表明，早停是提升LRMs推理效率的一种简单而强大的机制。

Summary / 总结

ESTAR is designed to reduce unnecessary reasoning in large reasoning models (LRMs) by detecting when the correct answer has been reached and stopping further computation. It uses a trajectory-based classifier to identify safe stopping points, supervised fine-tuning to teach LRMs to generate self-stop signals, and reinforcement learning to truncate reasoning at these points. Experiments on four datasets show that ESTAR can reduce reasoning length by about 3.7 times while maintaining accuracy, demonstrating strong generalization across domains.

研究旨在通过在找到正确答案后减少冗余推理来提高大型推理模型（LRMs）的效率。方法Early-Stopping for Token-Aware Reasoning (ESTAR) 包括基于轨迹的分类器来检测何时可以停止推理、监督微调来教会LRMs生成自我停止信号，以及使用带有计算感知奖励的强化学习来在这些停止点处截断推理。实验结果显示，ESTAR 将推理长度减少了约3.7倍，同时保持了准确性，并且具有良好的跨域泛化能力。
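
The truncation step and the reported length reduction can be sketched in a few lines. This is a minimal illustration under stated assumptions: the token-list representation, the helper `truncate_at_stop`, and the reward shape in `compute_aware_reward` (including `length_penalty`) are hypothetical and are not taken from the paper.

```python
def truncate_at_stop(tokens, stop_token="<stop>"):
    """Keep tokens up to and including the first stop token, if any."""
    if stop_token in tokens:
        return tokens[: tokens.index(stop_token) + 1]
    return list(tokens)


def compute_aware_reward(is_correct, num_tokens, max_tokens, length_penalty=0.5):
    """Hypothetical reward: correctness minus a penalty that grows with length."""
    correctness = 1.0 if is_correct else 0.0
    return correctness - length_penalty * min(num_tokens / max_tokens, 1.0)


# Average reasoning lengths reported in the abstract.
baseline_length, estar_length = 4799, 1290
reduction_factor = baseline_length / estar_length
```

Here `truncate_at_stop(["so", "x=2", "<stop>", "wait"])` returns `["so", "x=2", "<stop>"]`, and `reduction_factor` is about 3.72, consistent with the "about 3.7x" figure. Under the assumed reward, a correct answer reached in fewer tokens scores higher than the same answer with a longer trace.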

Faster-GS: Analyzing and Improving Gaussian Splatting Optimization

Authors: Florian Hahlbohm, Linus Franke, Martin Eisemann, Marcus Magnor

First: 2026-02-10T17:22:59+00:00 · Latest: 2026-02-10T17:22:59+00:00

Comments: Project page: https://fhahlbohm.github.io/faster-gaussian-splatting

Abstract

Recent advances in 3D Gaussian Splatting (3DGS) have focused on accelerating optimization while preserving reconstruction quality. However, many proposed methods entangle implementation-level improvements with fundamental algorithmic modifications or trade performance for fidelity, leading to a fragmented research landscape that complicates fair comparison. In this work, we consolidate and evaluate the most effective and broadly applicable strategies from prior 3DGS research and augment them with several novel optimizations. We further investigate underexplored aspects of the framework, including numerical stability, Gaussian truncation, and gradient approximation. The resulting system, Faster-GS, provides a rigorously optimized algorithm that we evaluate across a comprehensive suite of benchmarks. Our experiments demonstrate that Faster-GS achieves up to 5× faster training while maintaining visual quality, establishing a new cost-effective and resource-efficient baseline for 3DGS optimization. Furthermore, we demonstrate that optimizations can be applied to 4D Gaussian reconstruction, leading to efficient non-rigid scene optimization.

中文标题/摘要

标题：Faster-GS：分析与改进高斯泼溅优化

近年来，3D高斯泼溅（3DGS）的研究进展集中于在保持重建质量的同时加速优化。然而，许多已提出的方法将实现层面的改进与根本性的算法修改纠缠在一起，或者以性能换取保真度，导致研究格局支离破碎，难以进行公平比较。在本文中，我们整合并评估了先前3DGS研究中最有效且适用范围最广的策略，并补充了若干新的优化。我们进一步研究了该框架中尚未充分探索的方面，包括数值稳定性、高斯截断和梯度近似。由此得到的系统Faster-GS提供了一个经过严格优化的算法，我们在一套全面的基准测试上对其进行了评估。实验表明，Faster-GS在保持视觉质量的同时，训练速度最高可提升至5倍，为3DGS优化建立了新的高性价比、资源高效的基线。此外，我们还证明了这些优化可以应用于4D高斯重建，从而实现高效的非刚性场景优化。

Summary / 总结

This work addresses the fragmented landscape of 3D Gaussian Splatting (3DGS) acceleration methods by consolidating the most effective and broadly applicable strategies from prior research and adding several novel optimizations. The resulting system, Faster-GS, is evaluated across a comprehensive suite of benchmarks. Experiments show that Faster-GS trains up to 5 times faster while maintaining visual quality, establishing a new cost-effective and resource-efficient baseline for 3DGS optimization. The optimizations also carry over to 4D Gaussian reconstruction, enabling efficient non-rigid scene optimization.

研究旨在通过整合和评估先前工作中有效的策略，并引入新的优化措施，来改进3D高斯泼溅（3DGS）的优化。研究探讨了数值稳定性、高斯截断和梯度近似等方面。最终得到的Faster-GS系统在保持视觉质量的同时，训练速度最高可提升至5倍，成为一种高性价比且资源高效的3DGS优化基线。此外，这些优化措施还适用于4D高斯重建，能够实现高效的非刚性场景优化。

Bridging Past and Future: Distribution-Aware Alignment for Time Series Forecasting

Authors: Yifan Hu, Jie Yang, Tian Zhou, Peiyuan Liu, Yujin Tang, Rong Jin, Liang Sun

First: 2025-09-17T17:12:39+00:00 · Latest: 2026-02-10T17:08:41+00:00

Abstract

Although contrastive and other representation-learning methods have long been explored in vision and NLP, their adoption in modern time series forecasters remains limited. We believe they hold strong promise for this domain. To unlock this potential, we explicitly align past and future representations, thereby bridging the distributional gap between input histories and future targets. To this end, we introduce TimeAlign, a lightweight, plug-and-play framework that establishes a new representation paradigm, distinct from contrastive learning, by aligning auxiliary features via a simple reconstruction task and feeding them back into any base forecaster. Extensive experiments across eight benchmarks verify its superior performance. Further studies indicate that the gains arise primarily from correcting frequency mismatches between historical inputs and future outputs. Additionally, we provide two theoretical justifications for how reconstruction improves forecasting generalization and how alignment increases the mutual information between learned representations and predicted targets. The code is available at https://github.com/TROUBADOUR000/TimeAlign.

中文标题/摘要

标题：连接过去与未来：面向时间序列预测的分布感知对齐

尽管对比和其他表示学习方法在视觉和自然语言处理中已被长期探索，但在现代时间序列预测器中的应用仍然有限。我们认为它们在这个领域具有巨大的潜力。为了释放这种潜力，我们明确地对齐过去和未来的表示，从而弥合输入历史与未来目标之间的分布差距。为此，我们引入了TimeAlign，这是一个轻量级、即插即用的框架，通过简单的重构任务对齐辅助特征，并将它们反馈到任何基础预测器中，从而建立一种不同于对比学习的新表示范式。在八个基准上的广泛实验验证了其优越性能。进一步的研究表明，这些收益主要来自于纠正历史输入与未来输出之间的频率不匹配。此外，我们还提供了两种理论依据，说明重构如何提高预测泛化能力，以及对齐如何增加学习表示与预测目标之间的互信息。代码可在https://github.com/TROUBADOUR000/TimeAlign 获取。

Summary / 总结

This paper addresses the limited use of contrastive and other representation-learning methods in time series forecasting. It introduces TimeAlign, a lightweight framework that aligns past and future representations to bridge the distributional gap. Experiments across eight benchmarks show superior performance, with gains mainly attributed to correcting frequency mismatches. Theoretical justifications support how reconstruction enhances forecasting generalization and alignment increases mutual information between learned representations and predicted targets.

研究旨在通过对齐过去和未来表示来增强时间序列预测，以弥合分布差距。方法引入了TimeAlign框架，使用简单的重建任务对齐辅助特征，并将其集成到任何基预测器中。在八个基准上的广泛实验表明，其性能优越主要是因为纠正了历史输入与未来输出之间的频率不匹配。理论证明支持如何重建可以提高预测泛化能力，以及对齐如何增加学习表示与预测目标之间的互信息。

Supervised Metric Regularization Through Alternating Optimization for Multi-Regime Physics-Informed Neural Networks

Authors: Enzo Nicolas Spotorno, Josafat Ribeiro Leal, Antonio Augusto Frohlich

First: 2026-02-10T17:06:57+00:00 · Latest: 2026-02-10T17:06:57+00:00

Comments: 5 pages, 1 figure

Abstract

Standard Physics-Informed Neural Networks (PINNs) often face challenges when modeling parameterized dynamical systems with sharp regime transitions, such as bifurcations. In these scenarios, the continuous mapping from parameters to solutions can result in spectral bias or "mode collapse", where the network averages distinct physical behaviors. We propose a Topology-Aware PINN (TAPINN) that aims to mitigate this challenge by structuring the latent space via Supervised Metric Regularization. Unlike standard parametric PINNs that map physical parameters directly to solutions, our method conditions the solver on a latent state optimized to reflect the metric-based separation between regimes, showing ~49% lower physics residual (0.082 vs. 0.160). We train this architecture using a phase-based Alternating Optimization (AO) schedule to manage gradient conflicts between the metric and physics objectives. Preliminary experiments on the Duffing Oscillator demonstrate that while standard baselines suffer from spectral bias and high-capacity Hypernetworks overfit (memorizing data while violating physics), our approach achieves stable convergence with 2.18x lower gradient variance than a multi-output Sobolev Error baseline, and 5x fewer parameters than a hypernetwork-based alternative.

中文标题/摘要

标题：基于交替优化的监督度量正则化，用于多区域物理信息神经网络

标准物理信息神经网络（PINNs）在建模具有急剧区域转换（如分岔）的参数化动力系统时常常面临挑战。在这类场景中，从参数到解的连续映射可能导致频谱偏差或“模式崩溃”，即网络把不同的物理行为平均化。我们提出了一种拓扑感知PINN（TAPINN），通过监督度量正则化来组织潜在空间，旨在缓解这一挑战。与直接将物理参数映射到解的标准参数化PINNs不同，我们的方法让求解器以一个潜在状态为条件，该潜在状态经过优化以反映各区域之间基于度量的分离，物理残差降低约49%（0.082 vs. 0.160）。我们使用基于阶段的交替优化（AO）调度来训练该架构，以管理度量目标与物理目标之间的梯度冲突。在杜芬振子上的初步实验表明，标准基线受频谱偏差影响，高容量超网络则会过拟合（记住数据却违反物理规律），而我们的方法实现了稳定收敛，梯度方差比多输出Sobolev误差基线低2.18倍，参数数量比基于超网络的替代方案少5倍。

Summary / 总结

The paper addresses the challenge of modeling parameterized dynamical systems with sharp regime transitions, such as bifurcations, using Physics-Informed Neural Networks (PINNs). It introduces the Topology-Aware PINN (TAPINN), which uses Supervised Metric Regularization to condition the solver on a latent state that reflects the metric-based separation between regimes, lowering the physics residual by about 49% (0.082 vs. 0.160). TAPINN is trained with a phase-based Alternating Optimization (AO) schedule to manage gradient conflicts between the metric and physics objectives. In preliminary experiments on the Duffing Oscillator it converges stably, with 2.18x lower gradient variance than a multi-output Sobolev Error baseline and 5x fewer parameters than a hypernetwork-based alternative.

研究针对使用物理信息神经网络（PINNs）建模具有尖锐阶段转换的参数化动力系统时遇到的挑战。提出了一种拓扑感知PINN（TAPINN），通过监督度量正则化优化潜在空间，相比标准PINNs将物理残差降低了49%。该方法采用交替优化（AO）训练计划来处理梯度冲突，实验表明在Duffing振子上实现了稳定的收敛，并且梯度方差比多输出Sobolev误差基线低2.18倍，参数量也比基于超网络的替代方案少5倍。
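
To make "physics residual" concrete, the sketch below evaluates the residual of the standard Duffing equation, x'' + δx' + αx + βx³ = γcos(ωt), on a sampled trajectory. It is an assumption-laden illustration: the abstract does not say how the residual is defined or discretized, so the mean-squared form, the finite-difference derivatives, and the name `duffing_residual_mse` are choices made here, not the paper's.

```python
import math


def duffing_residual_mse(t, x, delta, alpha, beta, gamma, omega):
    """Mean squared residual of x'' + delta*x' + alpha*x + beta*x^3 - gamma*cos(omega*t).

    Derivatives use central finite differences on a uniform time grid, so the
    two endpoints are skipped.
    """
    dt = t[1] - t[0]
    total = 0.0
    for i in range(1, len(t) - 1):
        x_dot = (x[i + 1] - x[i - 1]) / (2.0 * dt)
        x_ddot = (x[i + 1] - 2.0 * x[i] + x[i - 1]) / (dt * dt)
        forcing = gamma * math.cos(omega * t[i])
        r = x_ddot + delta * x_dot + alpha * x[i] + beta * x[i] ** 3 - forcing
        total += r * r
    return total / (len(t) - 2)


# Sanity check on a case with a known solution: with delta = beta = gamma = 0
# and alpha = 1 the equation becomes x'' + x = 0, which x(t) = cos(t) satisfies.
t_grid = [0.01 * i for i in range(1000)]
x_exact = [math.cos(ti) for ti in t_grid]
mse_exact = duffing_residual_mse(t_grid, x_exact, 0.0, 1.0, 0.0, 0.0, 1.0)

# Relative reduction implied by the reported residuals (0.082 vs. 0.160).
relative_reduction = (0.160 - 0.082) / 0.160
```

The exact solution yields a residual close to zero (limited only by discretization error), whereas switching on the cubic term for the same trajectory gives a clearly non-zero residual. `relative_reduction` evaluates to 0.4875, which matches the "~49% lower" figure quoted in the abstract.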
