arXiv Paper Digest

EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards

Authors: Omkar Thawakar, Shravan Venkatraman, Ritesh Thawkar, Abdelrahman Shaker, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Khan

First: 2025-11-20T18:59:54+00:00 · Latest: 2025-11-20T18:59:54+00:00

Comments: 9 Pages, 6 Figures, 4 Tables

Abstract

Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting their autonomy and scalability. In this work, we strive to improve LMM reasoning capabilities in a purely unsupervised fashion (without any annotated data or reward distillation). To this end, we propose a self-evolving framework, named EvoLMM, that instantiates two cooperative agents from a single backbone model: a Proposer, which generates diverse, image-grounded questions, and a Solver, which solves them through internal consistency, where learning proceeds through a continuous self-rewarding process. This dynamic feedback encourages both the generation of informative queries and the refinement of structured reasoning without relying on ground-truth or human judgments. When using the popular Qwen2.5-VL as the base model, our EvoLMM yields consistent gains of up to ~3% on multimodal math-reasoning benchmarks, including ChartQA, MathVista, and MathVision, using only raw training images. We hope our simple yet effective approach will serve as a solid baseline easing future research in self-improving LMMs in a fully unsupervised fashion. Our code and models are available at https://github.com/mbzuai-oryx/EvoLMM.
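
One way to picture a continuous reward derived from internal consistency is the share of sampled Solver answers that agree with the most common answer. The sketch below is an illustrative assumption, not the reward defined by the EvoLMM authors; the abstract does not give the formula, and `consistency_reward` and its input format are hypothetical.

```python
from collections import Counter


def consistency_reward(answers):
    """Return the fraction of sampled answers that match the majority answer.

    Hypothetical helper: the abstract does not specify the reward formula.
    The result lies in [0, 1], so it is a continuous signal, not a 0/1 label.
    """
    if not answers:
        return 0.0
    # Light normalization so that "A" and "a " count as the same answer.
    normalized = [a.strip().lower() for a in answers]
    _, top_count = Counter(normalized).most_common(1)[0]
    return top_count / len(normalized)
```

For example, four sampled answers of which three agree would receive a reward of 0.75 under this assumed definition.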

Summary

EvoLMM is a self-evolving framework that improves the reasoning of large multimodal models (LMMs) without annotated data or external reward models. It instantiates two cooperative agents from a single backbone model: a Proposer that generates diverse, image-grounded questions, and a Solver that solves them through internal consistency. Learning proceeds through a continuous self-rewarding process that improves both question generation and structured reasoning. With Qwen2.5-VL as the base model, EvoLMM achieves consistent gains of up to about 3% on multimodal math-reasoning benchmarks, including ChartQA, MathVista, and MathVision, using only raw training images.

NoPo-Avatar: Generalizable and Animatable Avatars from Sparse Inputs without Human Poses

Authors: Jing Wen, Alexander G. Schwing, Shenlong Wang

Venue: NeurIPS

First: 2025-11-20T18:59:54+00:00 · Latest: 2025-11-20T18:59:54+00:00

Comments: NeurIPS'25; project page: https://wenj.github.io/NoPo-Avatar/

Abstract

We tackle the task of recovering an animatable 3D human avatar from a single or a sparse set of images. For this task, beyond a set of images, many prior state-of-the-art methods use accurate "ground-truth" camera poses and human poses as input to guide reconstruction at test-time. We show that pose-dependent reconstruction degrades results significantly if pose estimates are noisy. To overcome this, we introduce NoPo-Avatar, which reconstructs avatars solely from images, without any pose input. By removing the dependence of test-time reconstruction on human poses, NoPo-Avatar is not affected by noisy human pose estimates, making it more widely applicable. Experiments on challenging THuman2.0, XHuman, and HuGe100K data show that NoPo-Avatar outperforms existing baselines in practical settings (without ground-truth poses) and delivers comparable results in lab settings (with ground-truth poses).

Summary

The paper tackles recovering an animatable 3D human avatar from a single image or a sparse set of images. Because pose-dependent reconstruction degrades significantly when pose estimates are noisy, NoPo-Avatar reconstructs avatars from images alone, with no pose input at test time. Experiments on THuman2.0, XHuman, and HuGe100K show that NoPo-Avatar outperforms existing baselines in practical settings (without ground-truth poses) and delivers comparable results in lab settings (with ground-truth poses).

Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation

Authors: Ziyu Guo, Renrui Zhang, Hongyu Li, Manyuan Zhang, Xinyan Chen, Sifan Wang, Yan Feng, Peng Pei, Pheng-Ann Heng

First: 2025-11-20T18:59:52+00:00 · Latest: 2025-11-20T18:59:52+00:00

Comments: Project Page: https://think-while-gen.github.io Code: https://github.com/ZiyuGuo99/Thinking-while-Generating

Abstract

Recent advances in visual generation have increasingly explored the integration of reasoning capabilities. They incorporate textual reasoning, i.e., think, either before (as pre-planning) or after (as post-refinement) the generation process, yet they lack on-the-fly multimodal interaction during the generation itself. In this preliminary study, we introduce Thinking-while-Generating (TwiG), the first interleaved framework that enables co-evolving textual reasoning throughout the visual generation process. As visual content is progressively generated, textual reasoning is interleaved to both guide upcoming local regions and reflect on previously synthesized ones. This dynamic interplay produces more context-aware and semantically rich visual outputs. To unveil the potential of this framework, we investigate three candidate strategies: zero-shot prompting, supervised fine-tuning (SFT) on our curated TwiG-50K dataset, and reinforcement learning (RL) via a customized TwiG-GRPO strategy, each offering unique insights into the dynamics of interleaved reasoning. We hope this work inspires further research into interleaving textual reasoning for enhanced visual generation. Code will be released at: https://github.com/ZiyuGuo99/Thinking-while-Generating.

Summary

This preliminary study introduces Thinking-while-Generating (TwiG), a framework that interleaves textual reasoning throughout the visual generation process, both guiding upcoming local regions and reflecting on previously synthesized ones, to produce more context-aware and semantically rich outputs. The authors investigate three strategies: zero-shot prompting, supervised fine-tuning (SFT) on the curated TwiG-50K dataset, and reinforcement learning (RL) with a customized TwiG-GRPO strategy. The work aims to inspire further research on interleaved reasoning for visual generation.

Learning to Think Fast and Slow for Visual Language Models

Authors: Chenyu Lin, Cheng Chi, Jinlin Wu, Sharon Li, Kaiyang Zhou

First: 2025-11-20T18:59:48+00:00 · Latest: 2025-11-20T18:59:48+00:00

Abstract

When confronted with complex problems, we tend to think slowly; conversely, for simple questions, we think quickly. Such a two-system thinking mechanism allows us to efficiently allocate cognitive resources, enabling quick decision-making for straightforward issues while reserving deeper analytical thinking for more intricate challenges. However, existing reasoning-oriented visual language models (VLMs), whether trained with explicit chain-of-thought annotations or rule-based RL rewards, mainly pursue lengthy, detailed reasoning chains, which often lead to excessive computational costs. In this work, we propose a simple RL approach, which enables VLMs to automatically switch between fast and slow thinking modes depending on task difficulty. The approach consists of two stages: in the first stage, we label data as either requiring fast thinking or slow thinking based on the model output length, which is inspired by the observation that pre-trained VLMs typically produce answers of varying lengths for different types of questions; in the second stage, we train the model using GRPO along with the thinking mode labels to develop dual-mode thinking. Despite its simplicity, our model, named DualMindVLM, significantly outperforms the base model and achieves performance on par with state-of-the-art visual reasoning models, while maintaining exceptionally high token efficiency.
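
The first stage described above, labeling data by model output length, can be sketched as a simple threshold rule. This is a minimal illustration under stated assumptions: the abstract does not say how the cutoff is chosen, so `threshold` and the function name `label_thinking_mode` are hypothetical.

```python
def label_thinking_mode(output_lengths, threshold):
    """Label each sample 'fast' or 'slow' from the base model's output length.

    `output_lengths` holds token counts of the pre-trained model's answers;
    `threshold` is an assumed cutoff, not a value taken from the paper.
    """
    return ["slow" if n > threshold else "fast" for n in output_lengths]
```

With an assumed threshold of 100 tokens, outputs of length 12, 340, and 80 would be labeled fast, slow, and fast respectively; these labels would then accompany the GRPO training of the second stage.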

Summary

This paper addresses the excessive computational cost of reasoning-oriented visual language models (VLMs), which mainly pursue lengthy reasoning chains regardless of task difficulty. It proposes DualMindVLM, trained with a simple two-stage RL approach: data are first labeled as requiring fast or slow thinking based on the model's output length, and the model is then trained with GRPO using these thinking-mode labels. DualMindVLM significantly outperforms the base model and performs on par with state-of-the-art visual reasoning models while maintaining high token efficiency.

Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO

Authors: Junhao Cheng, Liang Hou, Xin Tao, Jing Liao

First: 2025-11-20T18:59:44+00:00 · Latest: 2025-11-20T18:59:44+00:00

Comments: Project page: https://video-as-answer.github.io/

Abstract

While language models have become impactful in many real-world applications, video generation remains largely confined to entertainment. Motivated by video's inherent capacity to demonstrate physical-world information that is difficult to convey through language alone (e.g., imagine teaching someone to tie a tie using only text), we identify an underutilized opportunity to extend video as a new answer modality for Next-Event Prediction (NEP), formalized as Video-Next-Event Prediction (VNEP). While the established NEP task takes a video with a procedural or predictive question as input to predict the next event in text, VNEP requires dynamic video responses. This shift from telling to showing unlocks more intuitive and customized answers for procedural learning and creative exploration. However, this task remains challenging for existing models, as it demands an understanding of multimodal input, instruction-conditioned reasoning, and the generation of video with visual and semantic consistency. To address this, we introduce VANS, a model that leverages reinforcement learning to align a Vision-Language Model (VLM) with a Video Diffusion Model (VDM) for VNEP. The core of VANS is our proposed Joint-GRPO that orchestrates the VLM and VDM to function as a unit. Driven by a shared reward on their respective outputs, it optimizes the VLM to produce captions that are both accurate and easy to visualize, while guiding the VDM to generate videos that are faithful to these captions and the input visual context. To enable this learning, we craft VANS-Data-100K, a dedicated dataset for the VNEP task. Experiments on procedural and predictive benchmarks demonstrate that VANS achieves state-of-the-art performance in both video event prediction and visualization. Code is released at https://github.com/KlingTeam/VANS.
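
For readers unfamiliar with GRPO, its core step scores each sampled output relative to the other outputs in its group. The sketch below shows that standard group-relative normalization only. How Joint-GRPO shares a reward between the VLM and the VDM is not detailed in the abstract, so none of that is modeled here; the function name and the use of the population standard deviation are assumptions of this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps).

    `rewards` holds the scalar rewards of several outputs sampled for the
    same input. `eps` avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Outputs that score above their group's mean receive positive advantages and the rest negative ones, so no separate value model is needed to form the baseline.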

V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models

Authors: Yang Luo, Xuanlei Zhao, Baijiong Lin, Lingting Zhu, Liyao Tang, Yuqi Liu, Ying-Cong Chen, Shengju Qian, Xin Wang, Yang You

First: 2025-11-20T18:59:42+00:00 · Latest: 2025-11-20T18:59:42+00:00

Comments: Project Page: https://oahzxl.github.io/VReasonBench

Abstract

Recent progress in generative video models, such as Veo-3, has shown surprising zero-shot reasoning abilities, creating a growing need for systematic and reliable evaluation. We introduce V-ReasonBench, a benchmark designed to assess video reasoning across four key dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics. The benchmark is built from both synthetic and real-world image sequences and provides a diverse set of answer-verifiable tasks that are reproducible, scalable, and unambiguous. Evaluations of six state-of-the-art video models reveal clear dimension-wise differences, with strong variation in structured, spatial, pattern-based, and physical reasoning. We further compare video models with strong image models, analyze common hallucination behaviors, and study how video duration affects Chain-of-Frames reasoning. Overall, V-ReasonBench offers a unified and reproducible framework for measuring video reasoning and aims to support the development of models with more reliable, human-aligned reasoning skills.

Summary

V-ReasonBench evaluates video reasoning across four dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics. It is built from both synthetic and real-world image sequences and provides a diverse set of answer-verifiable tasks. Evaluations of six state-of-the-art video models reveal clear dimension-wise differences in reasoning ability. The benchmark also compares video models with strong image models, analyzes common hallucination behaviors, and studies how video duration affects Chain-of-Frames reasoning. V-ReasonBench aims to support the development of video generation models with more reliable, human-aligned reasoning skills.

SceneDesigner: Controllable Multi-Object Image Generation with 9-DoF Pose Manipulation

Authors: Zhenyuan Qin, Xincheng Shuai, Henghui Ding

Venue: NeurIPS 2025 Spotlight

First: 2025-11-20T18:59:25+00:00 · Latest: 2025-11-20T18:59:25+00:00

Comments: NeurIPS 2025 (Spotlight), Project Page: https://henghuiding.com/SceneDesigner/

Abstract

Controllable image generation has attracted increasing attention in recent years, enabling users to manipulate visual content such as identity and style. However, achieving simultaneous control over the 9D poses (location, size, and orientation) of multiple objects remains an open challenge. Despite recent progress, existing methods often suffer from limited controllability and degraded quality, falling short of comprehensive multi-object 9D pose control. To address these limitations, we propose SceneDesigner, a method for accurate and flexible multi-object 9-DoF pose manipulation. SceneDesigner incorporates a branched network to the pre-trained base model and leverages a new representation, CNOCS map, which encodes 9D pose information from the camera view. This representation exhibits strong geometric interpretation properties, leading to more efficient and stable training. To support training, we construct a new dataset, ObjectPose9D, which aggregates images from diverse sources along with 9D pose annotations. To further address data imbalance issues, particularly performance degradation on low-frequency poses, we introduce a two-stage training strategy with reinforcement learning, where the second stage fine-tunes the model using a reward-based objective on rebalanced data. At inference time, we propose Disentangled Object Sampling, a technique that mitigates insufficient object generation and concept confusion in complex multi-object scenes. Moreover, by integrating user-specific personalization weights, SceneDesigner enables customized pose control for reference subjects. Extensive qualitative and quantitative experiments demonstrate that SceneDesigner significantly outperforms existing approaches in both controllability and quality. Code is publicly available at https://github.com/FudanCVL/SceneDesigner.

Summary

SceneDesigner is a method for controllable multi-object image generation with 9-DoF pose manipulation (location, size, and orientation). It adds a branched network to a pre-trained base model and uses a new CNOCS map representation that encodes 9D pose information from the camera view. A two-stage training strategy with reinforcement learning addresses data imbalance and improves performance on low-frequency poses, and Disentangled Object Sampling at inference time mitigates insufficient object generation and concept confusion. Experiments show that SceneDesigner outperforms existing approaches in both controllability and quality.

Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter

Authors: Qinghao Hu, Shang Yang, Junxian Guo, Xiaozhe Yao, Yujun Lin, Yuxian Gu, Han Cai, Chuang Gan, Ana Klimovic, Song Han

First: 2025-11-20T18:59:25+00:00 · Latest: 2025-11-20T18:59:25+00:00

Abstract

The emergence of Large Language Models (LLMs) with strong reasoning capabilities marks a significant milestone, unlocking new frontiers in complex problem-solving. However, training these reasoning models, typically using Reinforcement Learning (RL), encounters critical efficiency bottlenecks: response generation during RL training exhibits a persistent long-tail distribution, where a few very long responses dominate execution time, wasting resources and inflating costs. To address this, we propose TLT, a system that accelerates reasoning RL training losslessly by integrating adaptive speculative decoding. Applying speculative decoding in RL is challenging due to the dynamic workloads, evolving target model, and draft model training overhead. TLT overcomes these obstacles with two synergistic components: (1) Adaptive Drafter, a lightweight draft model trained continuously on idle GPUs during long-tail generation to maintain alignment with the target model at no extra cost; and (2) Adaptive Rollout Engine, which maintains a memory-efficient pool of pre-captured CUDAGraphs and adaptively selects suitable SD strategies for each input batch. Evaluations demonstrate that TLT achieves over 1.7x end-to-end RL training speedup over state-of-the-art systems, preserves model accuracy, and yields a high-quality draft model as a free byproduct suitable for efficient deployment. Code is released at https://github.com/mit-han-lab/fastrl.
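
To make the "lossless" property of speculative decoding concrete, here is a minimal sketch of one greedy speculative step. It is a textbook-style illustration, not TLT's implementation: `draft_next` and `target_next` are hypothetical callables that return the next token for a given context, and a real system would verify all draft tokens in a single batched forward pass of the target model instead of the sequential loop used here for clarity.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """Run one greedy speculative-decoding step and return the new tokens.

    The draft model proposes k tokens; the target model keeps the longest
    prefix it agrees with, then contributes one token of its own. The output
    is identical to what greedy decoding with the target alone would produce.
    """
    # Draft model proposes k tokens autoregressively.
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # Target model checks each proposal in order.
    accepted = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target_next(ctx)
        if tok != expected:
            # First mismatch: take the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # Every proposal was accepted, so the target adds one bonus token.
    accepted.append(target_next(ctx))
    return accepted
```

The better the draft model tracks the target, the more tokens each step accepts, which is why the abstract stresses keeping the drafter aligned with an evolving target model.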

Summary

The paper addresses an efficiency bottleneck in RL training of reasoning models: response lengths follow a long-tail distribution, so a few very long responses dominate execution time. It introduces TLT, which applies adaptive speculative decoding through two components: the Adaptive Drafter, a lightweight draft model trained continuously on idle GPUs, and the Adaptive Rollout Engine, which selects a suitable speculative decoding strategy for each input batch. TLT achieves over 1.7x end-to-end RL training speedup over state-of-the-art systems, preserves model accuracy, and yields a high-quality draft model suitable for efficient deployment.

TriDiff-4D: Fast 4D Generation through Diffusion-based Triplane Re-posing

Authors: Eddie Pokming Sheung, Qihao Liu, Wufei Ma, Prakhar Kaushik, Jianwen Xie, Alan Yuille

First: 2025-11-20T18:59:03+00:00 · Latest: 2025-11-20T18:59:03+00:00

Comments: 8 pages, 10 figures, Under review at a conference

Abstract

With the increasing demand for 3D animation, generating high-fidelity, controllable 4D avatars from textual descriptions remains a significant challenge. Despite notable efforts in 4D generative modeling, existing methods exhibit fundamental limitations that impede their broader applicability, including temporal and geometric inconsistencies, perceptual artifacts, motion irregularities, high computational costs, and limited control over dynamics. To address these challenges, we propose TriDiff-4D, a novel 4D generative pipeline that employs diffusion-based triplane re-posing to produce high-quality, temporally coherent 4D avatars. Our model adopts an auto-regressive strategy to generate 4D sequences of arbitrary length, synthesizing each 3D frame with a single diffusion process. By explicitly learning 3D structure and motion priors from large-scale 3D and motion datasets, TriDiff-4D enables skeleton-driven 4D generation that excels in temporal consistency, motion accuracy, computational efficiency, and visual fidelity. Specifically, TriDiff-4D first generates a canonical 3D avatar and a corresponding motion sequence from a text prompt, then uses a second diffusion model to animate the avatar according to the motion sequence, supporting arbitrarily long 4D generation. Experimental results demonstrate that TriDiff-4D significantly outperforms existing methods, reducing generation time from hours to seconds by eliminating the optimization process, while substantially improving the generation of complex motions with high-fidelity appearance and accurate 3D geometry.

Summary

TriDiff-4D generates high-fidelity, controllable 4D avatars from textual descriptions using diffusion-based triplane re-posing. It first generates a canonical 3D avatar and a corresponding motion sequence from a text prompt, then uses a second diffusion model to animate the avatar according to that motion. An auto-regressive strategy produces 4D sequences of arbitrary length, with each 3D frame synthesized by a single diffusion process. The model learns 3D structure and motion priors from large-scale 3D and motion datasets. Experiments show that TriDiff-4D reduces generation time from hours to seconds by eliminating the optimization process and improves the generation of complex motions with high-fidelity appearance and accurate 3D geometry.

Dexterity from Smart Lenses: Multi-Fingered Robot Manipulation with In-the-Wild Human Demonstrations

Authors: Irmak Guzey, Haozhi Qi, Julen Urain, Changhao Wang, Jessica Yin, Krishna Bodduluri, Mike Lambeta, Lerrel Pinto, Akshara Rai, Jitendra Malik, Tingfan Wu, Akash Sharma, Homanga Bharadhwaj

First: 2025-11-20T18:59:02+00:00 · Latest: 2025-11-20T18:59:02+00:00

Abstract

Learning multi-fingered robot policies from humans performing daily tasks in natural environments has long been a grand goal in the robotics community. Achieving this would mark significant progress toward generalizable robot manipulation in human environments, as it would reduce the reliance on labor-intensive robot data collection. Despite substantial efforts, progress toward this goal has been bottle-necked by the embodiment gap between humans and robots, as well as by difficulties in extracting relevant contextual and motion cues that enable learning of autonomous policies from in-the-wild human videos. We claim that with simple yet sufficiently powerful hardware for obtaining human data and our proposed framework AINA, we are now one significant step closer to achieving this dream. AINA enables learning multi-fingered policies from data collected by anyone, anywhere, and in any environment using Aria Gen 2 glasses. These glasses are lightweight and portable, feature a high-resolution RGB camera, provide accurate on-board 3D head and hand poses, and offer a wide stereo view that can be leveraged for depth estimation of the scene. This setup enables the learning of 3D point-based policies for multi-fingered hands that are robust to background changes and can be deployed directly without requiring any robot data (including online corrections, reinforcement learning, or simulation). We compare our framework against prior human-to-robot policy learning approaches, ablate our design choices, and demonstrate results across nine everyday manipulation tasks. Robot rollouts are best viewed on our website: https://aina-robot.github.io.

Summary

The research aims to learn multi-fingered robot manipulation policies from human demonstrations in natural environments. Human data are collected with Aria Gen 2 glasses, which provide a high-resolution RGB camera, on-board 3D head and hand poses, and a wide stereo view usable for depth estimation. The AINA framework uses these data to learn 3D point-based policies that are robust to background changes and can be deployed directly without any robot data (including online corrections, reinforcement learning, or simulation). Results are demonstrated across nine everyday manipulation tasks.

PartUV: Part-Based UV Unwrapping of 3D Meshes

Authors: Zhaoning Wang, Xinyue Wei, Ruoxi Shi, Xiaoshuai Zhang, Hao Su, Minghua Liu

First: 2025-11-20T18:58:39+00:00 · Latest: 2025-11-20T18:58:39+00:00

Comments: project page: https://www.zhaoningwang.com/PartUV

Abstract

UV unwrapping flattens 3D surfaces to 2D with minimal distortion, often requiring the complex surface to be decomposed into multiple charts. Although extensively studied, existing UV unwrapping methods frequently struggle with AI-generated meshes, which are typically noisy, bumpy, and poorly conditioned. These methods often produce highly fragmented charts and suboptimal boundaries, introducing artifacts and hindering downstream tasks. We introduce PartUV, a part-based UV unwrapping pipeline that generates significantly fewer, part-aligned charts while maintaining low distortion. Built on top of a recent learning-based part decomposition method PartField, PartUV combines high-level semantic part decomposition with novel geometric heuristics in a top-down recursive framework. It ensures each chart's distortion remains below a user-specified threshold while minimizing the total number of charts. The pipeline integrates and extends parameterization and packing algorithms, incorporates dedicated handling of non-manifold and degenerate meshes, and is extensively parallelized for efficiency. Evaluated across four diverse datasets, including man-made, CAD, AI-generated, and Common Shapes, PartUV outperforms existing tools and recent neural methods in chart count and seam length, achieves comparable distortion, exhibits high success rates on challenging meshes, and enables new applications like part-specific multi-tiles packing. Our project page is at https://www.zhaoningwang.com/PartUV.
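
The top-down recursive idea, keeping a part as one chart if its distortion is acceptable and otherwise splitting it further, can be sketched as follows. This is a schematic under stated assumptions and not PartUV's actual algorithm: `distortion` and `split` are hypothetical callables standing in for the parameterization and part-decomposition steps, and the fallback for unsplittable parts is a choice made for this example.

```python
def unwrap_recursive(part, distortion, split, threshold):
    """Return a list of charts, splitting top-down while distortion is too high.

    `distortion(part)` measures the flattening distortion of a candidate chart
    and `split(part)` returns its sub-parts; both are assumed helpers.
    """
    if distortion(part) <= threshold:
        return [part]  # good enough: keep the whole part as one chart
    children = split(part)
    if len(children) <= 1:
        return [part]  # cannot split further; keep the part as-is
    charts = []
    for child in children:
        charts.extend(unwrap_recursive(child, distortion, split, threshold))
    return charts
```

Stopping as soon as a part meets the threshold is what keeps the chart count low, since a part is only subdivided when flattening it whole would distort too much.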

Summary

PartUV is a part-based UV unwrapping pipeline designed to handle AI-generated meshes, which are typically noisy, bumpy, and poorly conditioned. It combines semantic part decomposition (built on PartField) with geometric heuristics in a top-down recursive framework to produce fewer, part-aligned charts while keeping each chart's distortion below a user-specified threshold. Across four datasets, PartUV outperforms existing tools and recent neural methods in chart count and seam length, achieves comparable distortion, shows high success rates on challenging meshes, and enables new applications such as part-specific multi-tile packing.

Distance-Preserving Representations for Genomic Spatial Reconstruction

Authors: Wenbin Zhou, Jin-Hong Du

First: 2024-08-01T21:04:27+00:00 · Latest: 2025-11-20T18:57:06+00:00

Abs · PDF · Code1 · Code2

Abstract

The spatial context of single-cell gene expression data is crucial for many downstream analyses, yet often remains inaccessible due to practical and technical limitations, restricting the utility of such datasets. In this paper, we propose a generic representation learning and transfer learning framework dp-VAE, capable of reconstructing the spatial coordinates associated with the provided gene expression data. Central to our approach is a distance-preserving regularizer integrated into the loss function during training, ensuring the model effectively captures and utilizes spatial context signals from reference datasets. During the inference stage, the produced latent representation of the model can be used to reconstruct or impute the spatial context of the provided gene expression by solving a constrained optimization problem. We also explore the theoretical connections between distance-preserving loss, distortion, and the bi-Lipschitz condition within generative models. Finally, we demonstrate the effectiveness of dp-VAE in different tasks involving training robustness, out-of-sample evaluation, and transfer learning inference applications by testing it over 27 publicly available datasets. This underscores its applicability to a wide range of genomics studies that were previously hindered by the absence of spatial data.

Summary / 总结

The paper proposes dp-VAE, a framework for reconstructing spatial coordinates from gene expression data by integrating a distance-preserving regularizer into the loss function. The model effectively captures spatial context signals from reference datasets and can be used to reconstruct or impute spatial context during inference. Experiments on 27 datasets show that dp-VAE performs well in various tasks, including robust training, out-of-sample evaluation, and transfer learning applications, thereby enhancing the utility of gene expression data in genomics studies.

该研究提出dp-VAE框架，通过将距离保持正则化器集成到损失函数中，从单细胞基因表达数据中重建空间坐标。模型能够有效捕捉空间上下文信号，并在推理阶段用于重建或补充提供的基因表达的空间上下文。在27个公开数据集上的实验表明，dp-VAE在各种任务中表现良好，包括鲁棒训练、离样本评估和迁移学习应用，从而提高了缺乏空间数据的基因表达数据的实用性。
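A distance-preserving regularizer can be sketched as a penalty on the mismatch between pairwise distances in latent space and pairwise distances between spatial coordinates. The abstract does not give the exact form used by dp-VAE, so the squared-difference penalty below and the name `distance_preserving_penalty` are assumptions made for illustration.

```python
import math
from itertools import combinations


def distance_preserving_penalty(latents, coords):
    """Mean squared gap between latent and spatial pairwise distances.

    Assumed form for illustration; the paper's regularizer may differ.
    """
    pairs = list(combinations(range(len(latents)), 2))
    total = 0.0
    for i, j in pairs:
        d_latent = math.dist(latents[i], latents[j])
        d_space = math.dist(coords[i], coords[j])
        total += (d_latent - d_space) ** 2
    return total / len(pairs)


coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
rotated = [(0.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]   # same geometry, rotated 90 degrees
collapsed = [(0.0, 0.0)] * 3                      # all cells mapped to one point

print(distance_preserving_penalty(rotated, coords))    # zero: distances preserved
print(distance_preserving_penalty(collapsed, coords))  # positive: geometry lost
```

Because the penalty depends only on pairwise distances, it is invariant to rotations and translations of the latent space, which is why a separate step is needed at inference time to recover actual coordinates.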

Solving Spatial Supersensing Without Spatial Supersensing

Authors: Vishaal Udandarao, Shyamgopal Karthik, Surabhi S. Nath, Andreas Hochlehnert, Matthias Bethge, Ameya Prabhu

First: 2025-11-20T18:57:05+00:00 · Latest: 2025-11-20T18:57:05+00:00

Comments: Tech Report

Abs · PDF · Code1 · Code2 · Code3

Abstract

Cambrian-S aims to take the first steps towards improving video world models with spatial supersensing by introducing (i) two benchmarks, VSI-Super-Recall (VSR) and VSI-Super-Counting (VSC), and (ii) bespoke predictive sensing inference strategies tailored to each benchmark. In this work, we conduct a critical analysis of Cambrian-S across both these fronts. First, we introduce a simple baseline, NoSense, which discards almost all temporal structure and uses only a bag-of-words SigLIP model, yet near-perfectly solves VSR, achieving 95% accuracy even on 4-hour videos. This shows that benchmarks like VSR can be nearly solved without spatial cognition, world modeling, or spatial supersensing. Second, we hypothesize that the tailored inference methods proposed by Cambrian-S likely exploit shortcut heuristics in the benchmark. We illustrate this with a simple sanity check on the VSC benchmark, called VSC-Repeat: we concatenate each video with itself 1-5 times, which does not change the number of unique objects. However, this simple perturbation entirely collapses the mean relative accuracy of Cambrian-S from 42% to 0%. A system that performs spatial supersensing and integrates information across experiences should recognize views of the same scene and keep object-count predictions unchanged; instead, the Cambrian-S inference algorithm relies largely on a shortcut in the VSC benchmark: rooms are never revisited. Taken together, our findings suggest that (i) current VSI-Super benchmarks do not yet reliably measure spatial supersensing, and (ii) predictive-sensing inference recipes used by Cambrian-S improve performance by inadvertently exploiting shortcuts rather than through robust spatial supersensing. We include the response from the Cambrian-S authors (in Appendix A) to provide a balanced perspective alongside our claims. We release our code at https://github.com/bethgelab/supersanity

中文标题/摘要

标题：无需空间超感知解决空间超感知

Cambrian-S旨在通过引入(i)两个基准，VSI-Super-Recall (VSR)和VSI-Super-Counting (VSC)，以及(ii)针对每个基准量身定制的预测感知推理策略，迈出改善视频世界模型的第一步。在这项工作中，我们从这两个方面对Cambrian-S进行了关键分析。首先，我们引入了一个简单的基线，NoSense，它几乎忽略了所有的时间结构，仅使用了SigLIP词袋模型，却几乎完美地解决了VSR，即使在4小时的视频中也达到了95%的准确率。这表明，像VSR这样的基准几乎可以在没有空间认知、世界建模或空间超感知的情况下解决。其次，我们假设Cambrian-S提出的定制推理方法可能利用了基准中的捷径。我们通过在VSC基准上进行一个简单的合理性检查来说明这一点，称为VSC-Repeat：我们将每个视频与其自身连接1-5次，这不会改变唯一物体的数量。然而，这种简单的扰动使Cambrian-S的平均相对准确率从42%降至0%。一个进行空间超感知并跨体验整合信息的系统应该识别相同的场景视图并保持物体计数预测不变；相反，Cambrian-S的推理算法主要依赖于VSC基准中的一个捷径，即房间从未被重新访问。综上所述，我们的发现表明：(i) 当前的VSI-Super基准尚不能可靠地衡量空间超感知；(ii) Cambrian-S使用的预测感知推理配方通过无意中利用捷径而非稳健的空间超感知来提高性能。我们附上了Cambrian-S作者的回应（附录A），以提供一个平衡的观点。我们发布了我们的代码：https://github.com/bethgelab/supersanity
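The logic of the VSC-Repeat check can be shown on a toy example. Here a video is modeled as a list of frames, each a set of object identifiers; this representation, the function names, and the `no_revisit_count` counter are all hypothetical, and the counter is a deliberately naive stand-in, not the Cambrian-S algorithm. The point is only that repetition leaves the true unique-object count unchanged while a counter that assumes scenes are never revisited inflates.

```python
def repeat_video(frames, copies):
    """Concatenate a video with itself so that it appears `copies` times in total."""
    return list(frames) * copies


def true_unique_count(frames):
    """Ground truth: number of distinct objects seen anywhere in the video."""
    seen = set()
    for objects in frames:
        seen |= set(objects)
    return len(seen)


def no_revisit_count(frames):
    """Toy shortcut counter: treats every frame as a never-before-seen room."""
    return sum(len(set(objects)) for objects in frames)


video = [{"chair_1", "lamp_1"}, {"sofa_1"}, {"table_1", "chair_2"}]
repeated = repeat_video(video, copies=3)

print(true_unique_count(video), true_unique_count(repeated))  # 5 5
print(no_revisit_count(video), no_revisit_count(repeated))    # 5 15
```

On the original video both counters agree, so the shortcut is invisible; only the repeated version separates them.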

LightFusion: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation

Authors: Zeyu Wang, Zilong Chen, Chenhui Gou, Feng Li, Chaorui Deng, Deyao Zhu, Kunchang Li, Weihao Yu, Haoqin Tu, Haoqi Fan, Cihang Xie

First: 2025-10-27T02:59:57+00:00 · Latest: 2025-11-20T18:56:11+00:00

Comments: Preprint. Work in progress

Abs · PDF · Code1 · Code2

Abstract

Unified multimodal models have recently shown remarkable gains in both capability and versatility, yet most leading systems are still trained from scratch and require substantial computational resources. In this paper, we show that competitive performance can be obtained far more efficiently by strategically fusing publicly available models specialized for either generation or understanding. Our key design is to retain the original blocks while additionally interleaving multimodal self-attention blocks throughout the networks. This double fusion mechanism (1) effectively enables rich multi-modal fusion while largely preserving the original strengths of the base models, and (2) catalyzes synergistic fusion of high-level semantic representations from the understanding encoder with low-level spatial signals from the generation encoder. By training with only ~ 35B tokens, this approach achieves strong results across multiple benchmarks: 0.91 on GenEval for compositional text-to-image generation, 82.16 on DPG-Bench for complex text-to-image generation, 6.06 on GEditBench, and 3.77 on ImgEdit-Bench for image editing. By fully releasing the entire suite of code, model weights, and datasets, we hope to support future research on unified multimodal modeling.

中文标题/摘要

标题：LightFusion：一种轻量级双融合框架，用于统一多模态理解和生成

统一多模态模型最近在能力和灵活性方面取得了显著进展，但大多数领先系统仍然从头开始训练，需要大量计算资源。在本文中，我们通过战略性地融合专门用于生成或理解的公共模型，展示了可以更高效地获得竞争性性能。我们的关键设计是在保留原始模块的同时，在网络中交错插入多模态自注意力模块。这种双融合机制（1）有效地实现了丰富的多模态融合，同时保留了基础模型的原始优势；（2）促进了理解编码器的高层语义表示与生成编码器的低级空间信号的协同融合。通过仅使用约350亿个令牌进行训练，这种方法在多个基准测试中取得了良好的结果：GenEval上的合成文本到图像生成得分为0.91，DPG-Bench上的复杂文本到图像生成得分为82.16，GEditBench得分为6.06，ImgEdit-Bench上的图像编辑得分为3.77。通过完全释放整个代码套件、模型权重和数据集，我们希望支持未来统一多模态建模的研究。

Summary / 总结

The research aims to develop a lightweight framework for unified multimodal understanding and generation by fusing existing specialized models. The method involves interleaving multimodal self-attention blocks with the original blocks of the base models, achieving strong results across various benchmarks without requiring extensive training data. Key findings include a score of 0.91 on GenEval for text-to-image generation, 82.16 on DPG-Bench for complex text-to-image generation, 6.06 on GEditBench, and 3.77 on ImgEdit-Bench for image editing, all achieved with only ~35B tokens of training data.

研究旨在通过融合现有专门化的模型来开发一种轻量级的统一多模态理解和生成框架。方法是将多模态自注意力模块与基模型的原始模块交错，从而实现丰富的多模态融合，同时保留其优势。关键实验结果表明，这种方法在多个基准测试中表现出色，包括GenEval的0.91分、DPG-Bench的82.16分、GEditBench的6.06分和ImgEdit-Bench的3.77分，所有这些都仅使用了约350亿个训练令牌。

Evolution Strategies at the Hyperscale

Authors: Bidipta Sarkar, Mattie Fellows, Juan Agustin Duque, Alistair Letcher, Antonio León Villares, Anya Sims, Dylan Cope, Jarek Liesen, Lukas Seier, Theo Wolf, Uljad Berdica, Alexander David Goldie, Aaron Courville, Karin Sevegnani, Shimon Whiteson, Jakob Nicolaus Foerster

First: 2025-11-20T18:56:05+00:00 · Latest: 2025-11-20T18:56:05+00:00

Comments: 48 pages, 12 figures, Website at https://eshyperscale.github.io/

Abs · PDF · Code1 · Code2 · Project1

Abstract

We introduce Evolution Guided General Optimization via Low-rank Learning (EGGROLL), an evolution strategies (ES) algorithm designed to scale backprop-free optimization to large population sizes for modern large neural network architectures with billions of parameters. ES is a set of powerful blackbox optimisation methods that can handle non-differentiable or noisy objectives with excellent scaling potential through parallelisation. Naïve ES becomes prohibitively expensive at scale due to the computational and memory costs associated with generating matrix perturbations $E\in\mathbb{R}^{m\times n}$ and the batched matrix multiplications needed to compute per-member forward passes. EGGROLL overcomes these bottlenecks by generating random matrices $A\in \mathbb{R}^{m\times r},\ B\in \mathbb{R}^{n\times r}$ with $r\ll \min(m,n)$ to form a low-rank matrix perturbation $A B^\top$ that is used in place of the full-rank perturbation $E$. As the overall update is an average across a population of $N$ workers, this still results in a high-rank update but with significant memory and computation savings, reducing the auxiliary storage from $mn$ to $r(m+n)$ per layer and the cost of a forward pass from $\mathcal{O}(mn)$ to $\mathcal{O}(r(m+n))$ when compared to full-rank ES. A theoretical analysis reveals our low-rank update converges to the full-rank update at a fast $\mathcal{O}\left(\frac{1}{r}\right)$ rate. Our experiments show that (1) EGGROLL does not compromise the performance of ES in tabula-rasa RL settings, despite being faster, (2) it is competitive with GRPO as a technique for improving LLM reasoning, and (3) EGGROLL enables stable pre-training of nonlinear recurrent language models that operate purely in integer datatypes.

Summary / 总结

EGGROLL is an evolution strategies algorithm designed to scale backprop-free optimization for large neural networks. It uses low-rank matrix perturbations to reduce computational and memory costs, enabling efficient optimization at large population sizes. Experiments show that EGGROLL maintains performance in reinforcement learning tasks, is competitive with GRPO for language model reasoning, and supports stable pre-training of nonlinear recurrent language models using integer datatypes.

EGGROLL 是一种针对大规模神经网络优化的进化策略算法。它通过使用低秩矩阵扰动来解决传统 ES 的计算和内存成本问题，从而降低存储和前向传递的成本。实验表明，EGGROLL 在强化学习中保持了性能，并且在语言模型推理方面与 GRPO 竞争，同时能够使用整数数据类型稳定预训练非线性递归语言模型。
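The saving from a low-rank perturbation follows from the identity $(W + \sigma A B^\top)x = Wx + \sigma A(B^\top x)$: the $m\times n$ perturbation never has to be materialized. The sketch below checks this identity numerically in pure Python. It is an illustration of the algebra only, not the EGGROLL codebase; the helper names, the Gaussian sampling, and the scale `sigma` are assumptions.

```python
import random


def matvec(M, x):
    """Dense matrix-vector product for list-of-lists matrices."""
    return [sum(row[k] * x[k] for k in range(len(x))) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def forward_lowrank(W, A, B, x, sigma):
    """Compute (W + sigma * A B^T) x without forming the m x n perturbation."""
    base = matvec(W, x)             # W x
    proj = matvec(transpose(B), x)  # B^T x, length r, costs O(r n)
    delta = matvec(A, proj)         # A (B^T x), length m, costs O(r m)
    return [b + sigma * d for b, d in zip(base, delta)]


def forward_dense(W, A, B, x, sigma):
    """Reference: materialize E = A B^T and perturb W explicitly."""
    m, n, r = len(W), len(W[0]), len(A[0])
    E = [[sum(A[i][k] * B[j][k] for k in range(r)) for j in range(n)]
         for i in range(m)]
    W_pert = [[W[i][j] + sigma * E[i][j] for j in range(n)] for i in range(m)]
    return matvec(W_pert, x)


rng = random.Random(0)
m, n, r, sigma = 6, 5, 2, 0.1
W, A, B = random_matrix(m, n, rng), random_matrix(m, r, rng), random_matrix(n, r, rng)
x = [rng.gauss(0.0, 1.0) for _ in range(n)]

y_fast = forward_lowrank(W, A, B, x, sigma)
y_dense = forward_dense(W, A, B, x, sigma)
print(max(abs(a - b) for a, b in zip(y_fast, y_dense)))  # tiny floating-point gap
print(r * (m + n), "stored values instead of", m * n)
```

In this sketch the perturbation term costs $\mathcal{O}(r(m+n))$ per member, matching the storage and compute figures quoted in the abstract.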

Teacher-Guided One-Shot Pruning via Context-Aware Knowledge Distillation

Authors: Md. Samiul Alim, Sharjil Khan, Amrijit Biswas, Fuad Rahman, Shafin Rahman, Nabeel Mohammed

Venue: IEEE BigData 2025

First: 2025-11-20T18:56:05+00:00 · Latest: 2025-11-20T18:56:05+00:00

Comments: Accepted at 2025 IEEE International Conference on Big Data (IEEE BigData 2025)

Abs · PDF · Code1 · Code2

Abstract

Unstructured pruning remains a powerful strategy for compressing deep neural networks, yet it often demands iterative train-prune-retrain cycles, resulting in significant computational overhead. To address this challenge, we introduce a novel teacher-guided pruning framework that tightly integrates Knowledge Distillation (KD) with importance score estimation. Unlike prior approaches that apply KD as a post-pruning recovery step, our method leverages gradient signals informed by the teacher during importance score calculation to identify and retain parameters most critical for both task performance and knowledge transfer. Our method facilitates a one-shot global pruning strategy that efficiently eliminates redundant weights while preserving essential representations. After pruning, we employ sparsity-aware retraining with and without KD to recover accuracy without reactivating pruned connections. Comprehensive experiments across multiple image classification benchmarks, including CIFAR-10, CIFAR-100, and TinyImageNet, demonstrate that our method consistently achieves high sparsity levels with minimal performance degradation. Notably, our approach outperforms state-of-the-art baselines such as EPG and EPSD at high sparsity levels, while offering a more computationally efficient alternative to iterative pruning schemes like COLT. The proposed framework offers a computation-efficient, performance-preserving solution well suited for deployment in resource-constrained environments.

中文标题/摘要

标题：基于上下文感知知识蒸馏的教师引导单次剪枝

无结构剪枝仍然是压缩深度神经网络的强大策略，但通常需要迭代的训练-剪枝-再训练循环，导致显著的计算开销。为了解决这一挑战，我们提出了一种新颖的教师引导剪枝框架，该框架将知识蒸馏（KD）与重要性评分估计紧密集成。与先前方法将KD作为后剪枝恢复步骤不同，我们的方法在计算重要性评分时利用教师提供的梯度信号来识别和保留对任务性能和知识转移都至关重要的参数。我们的方法促进了高效的全局单次剪枝策略，能够高效地消除冗余权重同时保留关键表示。剪枝后，我们使用稀疏感知再训练，有和无KD，来恢复准确率而不重新激活剪枝连接。在包括CIFAR-10、CIFAR-100和TinyImageNet在内的多个图像分类基准测试中，我们的方法在高稀疏度水平下始终能够实现高稀疏度水平，同时性能下降最小。值得注意的是，我们的方法在高稀疏度水平上优于最先进的基线方法如EPG和EPSD，同时提供了一种比COLT等迭代剪枝方案更计算高效的替代方案。所提出的框架提供了一种计算高效、性能保持的解决方案，适用于资源受限环境的部署。

Summary / 总结

This paper introduces a teacher-guided pruning framework that integrates Knowledge Distillation (KD) with importance score estimation to achieve one-shot pruning of deep neural networks. Unlike previous methods, this approach uses gradient signals from the teacher model to identify and retain critical parameters during the pruning process. Experiments on image classification benchmarks show that the method can achieve high sparsity levels with minimal performance loss, outperforming state-of-the-art baselines like EPG and EPSD at high sparsity levels and offering a more efficient alternative to iterative pruning schemes.

研究旨在通过引入一种教师引导的剪枝框架来解决迭代训练-剪枝-重新训练循环中的计算开销问题，该框架将知识蒸馏（KD）与重要性评分估计紧密结合。该方法在剪枝过程中识别并保留关键参数，实现了一次性全局剪枝策略。实验结果显示，该方法在各种基准测试中实现了高稀疏度水平，并且性能下降最小，同时在高稀疏度水平上优于最先进的基线方法。
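One-shot global pruning driven by gradient-informed importance scores can be sketched as follows. The score $|w \cdot g|$ is a common first-order saliency and is used here as an assumption; the abstract does not state the paper's exact score. The gradients `grads` are taken as given and are assumed to come from a loss that includes a distillation term from the teacher. All names are hypothetical.

```python
def global_one_shot_mask(weights, grads, sparsity):
    """Build per-layer binary masks that prune the globally lowest-scoring weights.

    weights, grads: dict mapping layer name -> list of floats (same shapes).
    sparsity: fraction of all weights to remove in a single pass.
    """
    # Assumed importance score: |weight * gradient|.
    scores = {
        name: [abs(w * g) for w, g in zip(weights[name], grads[name])]
        for name in weights
    }
    flat = sorted(s for layer in scores.values() for s in layer)
    n_prune = int(sparsity * len(flat))
    if n_prune == 0:
        return {name: [1] * len(layer) for name, layer in scores.items()}
    # One global threshold shared by every layer; ties at the threshold are pruned.
    threshold = flat[n_prune - 1]
    return {
        name: [1 if s > threshold else 0 for s in layer]
        for name, layer in scores.items()
    }


weights = {"fc1": [0.5, -0.1, 0.3], "fc2": [0.05, -0.8, 0.2]}
grads = {"fc1": [0.2, 0.1, 0.0], "fc2": [1.0, 0.5, 0.1]}  # assumed task + KD gradients
mask = global_one_shot_mask(weights, grads, sparsity=0.5)
print(mask)  # {'fc1': [1, 0, 0], 'fc2': [1, 1, 0]}
```

Keeping the mask fixed during retraining, for example by multiplying weights by it after every optimizer step, is one way to recover accuracy without reactivating pruned connections.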

Late-decoupled 3D Hierarchical Semantic Segmentation with Semantic Prototype Discrimination based Bi-branch Supervision

Authors: Shuyu Cao, Chongshou Li, Jie Xu, Tianrui Li, Na Zhao

First: 2025-11-20T18:54:31+00:00 · Latest: 2025-11-20T18:54:31+00:00

Abs · PDF · Code1 · Code2

Abstract

3D hierarchical semantic segmentation (3DHS) is crucial for embodied intelligence applications that demand a multi-grained and multi-hierarchy understanding of 3D scenes. Despite recent progress, previous 3DHS methods have overlooked the following two challenges: I) multi-label learning with a parameter-sharing model can lead to multi-hierarchy conflicts in cross-hierarchy optimization, and II) class imbalance is inevitable across multiple hierarchies of 3D scenes, which lets major classes dominate model performance. To address these issues, we propose a novel framework with a primary 3DHS branch and an auxiliary discrimination branch. Specifically, to alleviate the multi-hierarchy conflicts, we propose a late-decoupled 3DHS framework which employs multiple decoders with coarse-to-fine hierarchical guidance and consistency. The late-decoupled architecture can mitigate the underfitting and overfitting conflicts among multiple hierarchies and can also constrain the class imbalance problem in each individual hierarchy. Moreover, we introduce a 3DHS-oriented semantic prototype based bi-branch supervision mechanism, which additionally learns class-wise discriminative point cloud features and performs mutual supervision between the auxiliary and 3DHS branches, to improve segmentation under class imbalance. Extensive experiments on multiple datasets and backbones demonstrate that our approach achieves state-of-the-art 3DHS performance, and its core components can also be used as a plug-and-play enhancement to improve previous methods.

中文标题/摘要

标题：基于语义原型鉴别双分支监督的晚期解耦3D分层语义分割

3D分层语义分割（3DHS）对于需要对3D场景进行多粒度和多层次理解的体态智能应用至关重要。尽管取得了进展，但之前的3DHS方法忽视了两个挑战：I）参数共享模型的多标签学习会导致跨层次优化中的多层次冲突，II）3D场景的多个层次不可避免地存在类别不平衡问题，这使得模型性能主要由主要类别主导。为了解决这些问题，我们提出了一种新的框架，包含一个主3DHS分支和一个辅助鉴别分支。具体来说，为了缓解多层次冲突，我们提出了一种晚期解耦3DHS框架，采用自上而下的层次指导和一致性，使用多个解码器。晚期解耦架构可以缓解多个层次之间的欠拟合和过拟合冲突，也可以约束每个层次内的类别不平衡问题。此外，我们引入了一种面向3DHS的语义原型基于双分支监督机制，该机制还学习类别区分的点云特征，并在辅助分支和3DHS分支之间进行相互监督，以增强类别不平衡分割。在多个数据集和骨干网络上的广泛实验表明，我们的方法在3DHS性能上达到了最先进的水平，其核心组件也可以作为即插即用增强来改进先前的方法。

Summary / 总结

This paper addresses the challenges of multi-hierarchy conflicts and class imbalance in 3D hierarchical semantic segmentation by proposing a late-decoupled 3DHS framework with a primary segmentation branch and an auxiliary discrimination branch. The framework employs multiple decoders for coarse-to-fine hierarchical guidance and consistency, mitigating conflicts and constraining class imbalance. Additionally, a bi-branch supervision mechanism is introduced to enhance discriminative point cloud feature learning and mutual supervision between branches. Experiments show that the proposed method outperforms existing approaches on multiple datasets and backbones.

该论文通过提出一种晚解耦的3D层次语义分割框架，结合主分支和辅助鉴别分支，解决了3D层次语义分割（3DHS）中的多层次冲突和类别不平衡问题。框架使用多解码器和从粗到细的层次指导来缓解冲突并约束类别不平衡。此外，引入了一种面向3DHS的语义原型双分支监督机制，以增强类别区分的点云特征学习和分支之间的相互监督。实验表明，所提出的方法在多个数据集和骨干网络上优于现有方法，并且其组件可以提高先前的方法。

TRIM: Scalable 3D Gaussian Diffusion Inference with Temporal and Spatial Trimming

Authors: Zeyuan Yin, Xiaoming Liu

Venue: NeurIPS 2025

First: 2025-11-20T18:49:09+00:00 · Latest: 2025-11-20T18:49:09+00:00

Comments: NeurIPS 2025

Abs · PDF · Code1 · Code2 · Code3

Abstract

Recent advances in 3D Gaussian diffusion models suffer from time-intensive denoising and post-denoising processing due to the massive number of Gaussian primitives, resulting in slow generation and limited scalability along sampling trajectories. To improve the efficiency of 3D diffusion models, we propose TRIM (Trajectory Reduction and Instance Mask denoising), a post-training approach that incorporates both temporal and spatial trimming strategies, to accelerate inference without compromising output quality while supporting the inference-time scaling for Gaussian diffusion models. Instead of scaling denoising trajectories in a costly end-to-end manner, we develop a lightweight selector model to evaluate latent Gaussian primitives derived from multiple sampled noises, enabling early trajectory reduction by selecting candidates with high-quality potential. Furthermore, we introduce instance mask denoising to prune learnable Gaussian primitives by filtering out redundant background regions, reducing inference computation at each denoising step. Extensive experiments and analysis demonstrate that TRIM significantly improves both the efficiency and quality of 3D generation. Source code is available at https://github.com/zeyuanyin/TRIM.

中文标题/摘要

标题：TRIM：具有时间和空间修剪的可扩展3D高斯扩散推断

近年来，3D高斯扩散模型由于大量的高斯基元导致耗时的去噪和去噪后处理，从而生成速度慢且沿采样轨迹可扩展性有限。为了提高3D扩散模型的效率，我们提出了一种名为$\textbf{TRIM}$（$\textbf{T}$轨迹$\textbf{R}$缩减和$\textbf{I}$实例$\textbf{M}$掩码去噪）的后训练方法，该方法结合了时间和空间修剪策略，在不牺牲输出质量的情况下加速推理，并支持高斯扩散模型的推理时可扩展性。我们不是以昂贵的端到端方式扩展去噪轨迹，而是开发了一个轻量级选择器模型来评估从多个采样噪声中导出的潜在高斯基元，从而通过选择具有高质量潜力的候选者实现早期轨迹缩减。此外，我们引入了实例掩码去噪，通过过滤掉冗余的背景区域来修剪可学习的高斯基元，从而在每次去噪步骤中减少推理计算量。广泛的实验和分析表明，TRIM显著提高了3D生成的效率和质量。源代码可在$\href{https://github.com/zeyuanyin/TRIM}{链接}$获取。

Summary / 总结

TRIM is a post-training approach that enhances the efficiency of 3D Gaussian diffusion models by incorporating temporal and spatial trimming strategies. It uses a lightweight selector model to reduce denoising trajectories and an instance mask denoising technique to prune redundant Gaussian primitives, thereby accelerating inference without compromising output quality. Experiments show that TRIM significantly improves both the efficiency and quality of 3D generation.

TRIM 是一种后训练方法，通过结合时空修剪策略来提高 3D 高斯扩散模型的效率，允许更快的推理而不牺牲输出质量。它使用一个轻量级的选择器模型来减少去噪轨迹，并引入实例掩码去噪来过滤掉冗余的背景区域，从而减少每次去噪步骤的推理计算量。实验表明，TRIM 显著提高了 3D 生成的效率和质量。
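The two trimming steps can be pictured as a ranking step followed by a filtering step. The sketch below is a schematic under stated assumptions, not the TRIM implementation: the candidate records, the precomputed `score` field standing in for the selector model's output, and the boolean foreground mask are all hypothetical.

```python
def reduce_trajectories(candidates, score_fn, keep):
    """Temporal trimming: keep only the `keep` best-scoring candidates."""
    ranked = sorted(candidates, key=score_fn, reverse=True)
    return ranked[:keep]


def mask_primitives(primitives, is_foreground):
    """Spatial trimming: drop primitives flagged as background."""
    return [p for p, fg in zip(primitives, is_foreground) if fg]


# Hypothetical candidates from three sampled noises; `score` stands in for
# whatever quality estimate a selector model would produce.
candidates = [
    {"seed": 0, "score": 0.31},
    {"seed": 1, "score": 0.87},
    {"seed": 2, "score": 0.55},
]
kept = reduce_trajectories(candidates, score_fn=lambda c: c["score"], keep=1)

primitives = ["g0", "g1", "g2", "g3"]
foreground = [True, False, True, False]
active = mask_primitives(primitives, foreground)

print(kept[0]["seed"], active)  # 1 ['g0', 'g2']
```

Both steps shrink the work done by later denoising steps: fewer trajectories to continue, and fewer primitives per trajectory.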

Sigma: Semantically Informative Pre-training for Skeleton-based Sign Language Understanding

Authors: Muxin Pu, Mei Kuan Lim, Chun Yong Chong, Chen Change Loy

First: 2025-09-25T14:28:34+00:00 · Latest: 2025-11-20T18:44:59+00:00

Abs · PDF · Code1 · Code2

Abstract

Pre-training has proven effective for learning transferable features in sign language understanding (SLU) tasks. Recently, skeleton-based methods have gained increasing attention because they can robustly handle variations in subjects and backgrounds without being affected by appearance or environmental factors. Current SLU methods continue to face three key limitations: 1) weak semantic grounding, as models often capture low-level motion patterns from skeletal data but struggle to relate them to linguistic meaning; 2) imbalance between local details and global context, with models either focusing too narrowly on fine-grained cues or overlooking them for broader context; and 3) inefficient cross-modal learning, as constructing semantically aligned representations across modalities remains difficult. To address these, we propose Sigma, a unified skeleton-based SLU framework featuring: 1) a sign-aware early fusion mechanism that facilitates deep interaction between visual and textual modalities, enriching visual features with linguistic context; 2) a hierarchical alignment learning strategy that jointly maximises agreements across different levels of paired features from different modalities, effectively capturing both fine-grained details and high-level semantic relationships; and 3) a unified pre-training framework that combines contrastive learning, text matching and language modelling to promote semantic consistency and generalisation. Sigma achieves new state-of-the-art results on isolated sign language recognition, continuous sign language recognition, and gloss-free sign language translation on multiple benchmarks spanning different sign and spoken languages, demonstrating the impact of semantically informative pre-training and the effectiveness of skeletal data as a stand-alone solution for SLU.

中文标题/摘要

标题：Sigma：基于骨架的 sign 语言理解语义信息预训练

预训练已被证明对 sign 语言理解（SLU）任务中学习可迁移特征非常有效。最近，基于骨架的方法越来越受到关注，因为它们能够稳健地处理主体和背景的变化，而不受外观或环境因素的影响。当前的 SLU 方法仍然面临三个关键限制：1）语义联系薄弱，模型通常从骨架数据中捕捉低级运动模式，但难以将它们与语言意义联系起来；2）局部细节与全局上下文之间的不平衡，模型要么过于关注细微线索，要么忽视它们以获取更广泛的上下文；3）跨模态学习效率低下，因为跨模态构建语义对齐表示仍然具有挑战性。为了解决这些问题，我们提出了 Sigma，这是一种统一的基于骨架的 SLU 框架，包括：1）一种手语意识的早期融合机制，促进视觉和文本模态之间的深层交互，用语言上下文丰富视觉特征；2）一种分层对齐学习策略，联合最大化不同模态配对特征在不同层次上的一致性，有效地捕捉细微线索和高层次的语义关系；3）一种统一的预训练框架，结合对比学习、文本匹配和语言建模，促进语义一致性和泛化。Sigma 在多个基准上的孤立手语识别、连续手语识别和无词汇手语翻译任务中达到了新的最佳结果，证明了语义信息预训练的影响以及骨架数据作为 SLU 独立解决方案的有效性。

Summary / 总结

Sigma is a unified framework for sign language understanding that addresses the limitations of current methods by incorporating a sign-aware early fusion mechanism, hierarchical alignment learning, and a unified pre-training framework. It achieves state-of-the-art results in isolated sign language recognition, continuous sign language recognition, and gloss-free sign language translation across multiple benchmarks, highlighting the benefits of semantically informative pre-training and the use of skeletal data alone for SLU tasks.

Sigma 是一个统一的框架，通过引入符号感知的早期融合机制、层次对齐学习以及统一的预训练框架来解决当前方法中的关键限制。它在多个基准上的孤立手语识别、连续手语识别和无词手语翻译任务中达到了最先进的结果，突显了语义信息性预训练的好处以及骨骼数据作为手语理解任务独立解决方案的有效性。

Modular Jump Gaussian Processes

Authors: Anna R. Flowers, Christopher T. Franck, Mickaël Binois, Chiwoo Park, Robert B. Gramacy

First: 2025-05-21T14:16:56+00:00 · Latest: 2025-11-20T18:42:59+00:00

Comments: 19 pages, 13 figures

Abs · PDF · Code1 · Code2

Abstract

Gaussian processes (GPs) furnish accurate nonlinear predictions with well-calibrated uncertainty. However, the typical GP setup has a built-in stationarity assumption, making it ill-suited for modeling data from processes with sudden changes, or "jumps" in the output variable. The "jump GP" (JGP) was developed for modeling data from such processes, combining local GPs and latent "level" variables under a joint inferential framework. But joint modeling can be fraught with difficulty. We aim to simplify by suggesting a more modular setup, eschewing joint inference but retaining the main JGP themes: (a) learning optimal neighborhood sizes that locally respect manifolds of discontinuity; and (b) a new cluster-based (latent) feature to capture regions of distinct output levels on both sides of the manifold. We show that each of (a) and (b) separately leads to dramatic improvements when modeling processes with jumps. In tandem (but without requiring joint inference) that benefit is compounded, as illustrated on real and synthetic benchmark examples from the recent literature.

中文标题/摘要

标题：模块化跳跃高斯过程

高斯过程（GPs）提供准确的非线性预测并具有校准良好的不确定性。然而，典型的GP设置内置了平稳性假设，使其不适合建模具有突然变化或“跳跃”的输出变量的过程。跳跃高斯过程（JGP）旨在建模此类过程的数据，结合局部GP和潜在的“水平”变量，在联合推断框架下进行建模。但联合建模可能会带来困难。我们旨在简化，建议采用更模块化的设置，避免联合推断，但保留JGP的主要主题：(a) 学习最优的邻域大小，以局部尊重不连续性的流形；(b) 一种基于聚类的（潜在）特征，以捕捉流形两侧不同输出水平的区域。我们展示了(a)和(b)分别单独建模跳跃过程时带来的显著改进。当两者结合使用（但不需要联合推断）时，这种改进会进一步增强，如在最近文献中的真实和合成基准示例中所展示。

SurvAgent: Hierarchical CoT-Enhanced Case Banking and Dichotomy-Based Multi-Agent System for Multimodal Survival Prediction

Authors: Guolin Huang, Wenting Chen, Jiaqi Yang, Xinheng Lyu, Xiaoling Luo, Sen Yang, Xiaohan Xing, Linlin Shen

First: 2025-11-20T18:41:44+00:00 · Latest: 2025-11-20T18:41:44+00:00

Comments: 20 pages

Abs · PDF · Code1 · Code2

Abstract

Survival analysis is critical for cancer prognosis and treatment planning, yet existing methods lack the transparency essential for clinical adoption. While recent pathology agents have demonstrated explainability in diagnostic tasks, they face three limitations for survival prediction: inability to integrate multimodal data, ineffective region-of-interest exploration, and failure to leverage experiential learning from historical cases. We introduce SurvAgent, the first hierarchical chain-of-thought (CoT)-enhanced multi-agent system for multimodal survival prediction. SurvAgent consists of two stages: (1) WSI-Gene CoT-Enhanced Case Bank Construction employs hierarchical analysis through Low-Magnification Screening, Cross-Modal Similarity-Aware Patch Mining, and Confidence-Aware Patch Mining for pathology images, while Gene-Stratified analysis processes six functional gene categories. Both generate structured reports with CoT reasoning, storing complete analytical processes for experiential learning. (2) Dichotomy-Based Multi-Expert Agent Inference retrieves similar cases via RAG and integrates multimodal reports with expert predictions through progressive interval refinement. Extensive experiments on five TCGA cohorts demonstrate SurvAgent's superiority over conventional methods, proprietary MLLMs, and medical agents, establishing a new paradigm for explainable AI-driven survival prediction in precision oncology.

中文标题/摘要

标题：SurvAgent：层次化CoT增强案例银行及二分法多智能体系统用于多模态生存预测

生存分析对于癌症预后和治疗规划至关重要，但现有方法缺乏临床应用所需的透明度。虽然最近的病理智能体在诊断任务中展示了可解释性，但在生存预测方面仍面临三个限制：无法整合多模态数据、无效的感兴趣区域探索以及无法利用历史案例的经验学习。我们介绍了SurvAgent，这是第一个用于多模态生存预测的层次化链式思维（CoT）增强多智能体系统。SurvAgent 包含两个阶段：(1) WSI-基因CoT增强案例银行构建通过低倍筛选、跨模态相似性感知补丁挖掘和置信度感知补丁挖掘对病理图像进行层次化分析，同时对六种功能基因类别进行基因分层分析。两者均生成结构化报告，包含CoT推理，存储完整的分析过程以供经验学习。(2) 基于二分法的多专家智能体推理通过RAG检索相似案例，并通过逐步区间细化整合多模态报告与专家预测。在TCGA五个队列的广泛实验中，SurvAgent 显示出优于传统方法、专有MLLMs和医疗智能体的优势，确立了可解释AI驱动的精准肿瘤学生存预测的新范式。

Summary / 总结

SurvAgent is a hierarchical multi-agent system designed for multimodal survival prediction in cancer prognosis. It addresses the limitations of existing methods by integrating multimodal data, exploring regions of interest effectively, and leveraging experiential learning from historical cases. The system consists of two stages: WSI-Gene CoT-Enhanced Case Bank Construction and Dichotomy-Based Multi-Expert Agent Inference. The first stage constructs structured reports with CoT reasoning, while the second stage retrieves similar cases and integrates multimodal reports with expert predictions. Experiments on five TCGA cohorts show that SurvAgent outperforms conventional methods and proprietary models, setting a new standard for explainable AI-driven survival prediction in precision oncology.

SurvAgent 是一个多代理系统，旨在多模态生存预测中用于癌症预后。该系统通过整合多模态数据、有效探索感兴趣区域以及借鉴历史案例的经验学习来解决现有方法的局限性。系统分为两个阶段：WSI-基因 CoT 增强案例银行构建和二分法多专家代理推理。第一阶段通过分层分析和置信度感知挖掘构建结构化报告，而第二阶段通过检索相似案例并整合多模态报告与专家预测。在五个 TCGA 队列上的实验表明，SurvAgent 在解释性 AI 驱动的生存预测方面优于传统方法和专有模型，为精准肿瘤学中的生存预测设定了新标准。

Beyond Patches: Mining Interpretable Part-Prototypes for Explainable AI

Authors: Mahdi Alehdaghi, Rajarshi Bhattacharya, Pourya Shamsolmoali, Rafael M. O. Cruz, Maguelonne Heritier, Eric Granger

First: 2025-04-16T15:48:21+00:00 · Latest: 2025-11-20T18:37:09+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

As AI systems grow more capable, it becomes increasingly important that their decisions remain understandable and aligned with human expectations. A key challenge is the limited interpretability of deep models. Post-hoc methods like GradCAM offer heatmaps but provide limited conceptual insight, while prototype-based approaches offer example-based explanations but often rely on rigid region selection and lack semantic consistency. To address these limitations, we propose PCMNet, a part-prototypical concept mining network that learns human-comprehensible prototypes from meaningful image regions without additional supervision. By clustering these prototypes into concept groups and extracting concept activation vectors, PCMNet provides structured, concept-level explanations and enhances robustness to occlusion and challenging conditions, which are both critical for building reliable and aligned AI systems. Experiments across multiple image classification benchmarks show that PCMNet outperforms state-of-the-art methods in interpretability, stability, and robustness. This work contributes to AI alignment by enhancing transparency, controllability, and trustworthiness in AI systems. Our code is available at: https://github.com/alehdaghi/PCMNet.

中文标题/摘要

标题：超越补丁：挖掘可解释的部分原型以实现可解释的人工智能

随着人工智能系统的功能越来越强大，确保其决策的可理解性和与人类期望的一致性变得越来越重要。一个关键挑战是深度模型的解释性有限。后验方法如GradCAM提供热图，但提供的概念洞察有限，而基于原型的方法提供基于示例的解释，但通常依赖于僵硬的区域选择，缺乏语义一致性。为了解决这些局限性，我们提出了一种部分原型概念挖掘网络PCMNet，该网络在无需额外监督的情况下从有意义的图像区域中学习人类可理解的原型。通过将这些原型聚类成概念组并提取概念激活向量，PCMNet提供了结构化的、概念级别的解释，并增强了对遮挡和挑战性条件的鲁棒性，这对于构建可靠和对齐的人工智能系统至关重要。在多个图像分类基准上的实验表明，PCMNet在可解释性、稳定性和鲁棒性方面均优于现有方法。这项工作通过增强人工智能系统的透明性、可控性和可信度，促进了人工智能的对齐。我们的代码可在：https://github.com/alehdaghi/PCMNet 获取。

Summary / 总结

The research aims to improve the interpretability of deep learning models by developing PCMNet, a network that learns human-comprehensible prototypes from image regions without additional supervision. PCMNet clusters these prototypes into concept groups and extracts concept activation vectors, providing structured, concept-level explanations. Experiments show that PCMNet outperforms existing methods in interpretability, stability, and robustness, contributing to the alignment and reliability of AI systems.

研究旨在通过开发PCMNet，一种部分原型概念挖掘网络，从图像区域中无额外监督地学习人类可理解的原型，以提高深度学习模型的可解释性。PCMNet将这些原型聚类成概念组并提取概念激活向量，提供结构化的概念级解释。实验表明，PCMNet在可解释性、稳定性和鲁棒性方面优于现有方法，增强了AI系统的透明度和可信度。

Stabilizing Policy Gradient Methods via Reward Profiling

Authors: Shihab Ahmed, El Houcine Bergou, Aritra Dutta, Yue Wang

First: 2025-11-20T18:35:51+00:00 · Latest: 2025-11-20T18:35:51+00:00

Abs · PDF · Code1 · Code2

Abstract

Policy gradient methods, which have been extensively studied in the last decade, offer an effective and efficient framework for reinforcement learning problems. However, their performance is often unsatisfactory, suffering from unreliable reward improvements and slow convergence due to high variance in gradient estimates. In this paper, we propose a universal reward profiling framework that can be seamlessly integrated with any policy gradient algorithm, where we selectively update the policy based on high-confidence performance estimations. We theoretically justify that our technique will not slow down the convergence of the baseline policy gradient methods, but with high probability, will result in stable and monotonic improvements of their performance. Empirically, on eight continuous-control benchmarks (Box2D and MuJoCo/PyBullet), our profiling yields up to 1.5x faster convergence to near-optimal returns and up to 1.75x reduction in return variance on some setups. Our profiling approach offers a general, theoretically grounded path to more reliable and efficient policy learning in complex environments.

中文标题/摘要

标题：通过奖励剖析稳定策略梯度方法

在过去十年中被广泛研究的策略梯度方法为强化学习问题提供了一个有效且高效的框架。然而，它们的表现往往不尽如人意，受到不可靠的奖励改进和缓慢收敛的影响，这主要是由于梯度估计的高方差。在本文中，我们提出了一种通用的奖励剖析（reward profiling）框架，可以无缝集成到任何策略梯度算法中，其中我们根据高置信度的性能估计选择性地更新策略。我们从理论上证明，我们的技术不会减慢基线策略梯度方法的收敛速度，并且会以高概率带来稳定且单调的性能改进。在八个连续控制基准（Box2D和MuJoCo/PyBullet）上，我们的剖析方法可使收敛到接近最优回报的速度最高提升至1.5倍，并在某些设置中使回报方差最多减少1.75倍。我们的剖析方法为在复杂环境中实现更可靠、更高效的策略学习提供了一条通用且有理论依据的路径。

Summary / 总结

This paper addresses the limitations of policy gradient methods in reinforcement learning, such as unreliable reward improvements and slow convergence. It introduces a reward profiling framework that selectively updates the policy based on high-confidence performance estimations and that, with high probability, yields stable and monotonic performance improvements without slowing down convergence. Experiments on eight continuous-control benchmarks show up to 1.5x faster convergence and, on some setups, up to a 1.75x reduction in return variance.

本文提出了一种奖励剖析（reward profiling）框架，以解决强化学习中策略梯度方法存在的奖励改进不可靠和收敛缓慢等问题。该方法基于高置信度性能估计选择性地更新策略，在不减慢收敛速度的同时，以高概率实现稳定且单调的性能提升。在八个连续控制基准测试上的实验显示，该方法的收敛速度最高可达基线方法的1.5倍，并在某些设置下使回报方差最多减少1.75倍。
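The selective-update idea in this entry can be illustrated with a small sketch. This is not the authors' algorithm: the function names (`lower_confidence_bound`, `accept_update`), the normal-approximation bound, and the `z` parameter are assumptions chosen for illustration. A candidate policy is accepted only when a lower confidence bound on its sampled returns exceeds the current policy's mean return.

```python
import math
import statistics

def lower_confidence_bound(returns, z=1.645):
    """Normal-approximation lower bound on the mean of sampled returns."""
    mean = statistics.fmean(returns)
    # Standard error of the mean; requires at least two samples.
    sem = statistics.stdev(returns) / math.sqrt(len(returns))
    return mean - z * sem

def accept_update(current_returns, candidate_returns, z=1.645):
    """Hypothetical acceptance rule: keep the candidate policy only if its
    lower confidence bound beats the current policy's mean return."""
    return lower_confidence_bound(candidate_returns, z) > statistics.fmean(current_returns)

current = [10.0, 11.0, 9.5, 10.5]
clearly_better = [14.0, 15.0, 14.5, 15.5]
noisy = [2.0, 25.0, 4.0, 20.0]  # higher mean than `current`, but very uncertain

print(accept_update(current, clearly_better))  # → True
print(accept_update(current, noisy))           # → False
```

The noisy candidate has a higher average return than the current policy, yet it is rejected because its estimate is not confident enough. That is the kind of high-variance update a profiling step is meant to filter out.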

MedBayes-Lite: Bayesian Uncertainty Quantification for Safe Clinical Decision Support

Authors: Elias Hossain, Md Mehedi Hasan Nipu, Maleeha Sheikh, Rajib Rana, Subash Neupane, Niloofar Yousefi

First: 2025-11-20T18:33:12+00:00 · Latest: 2025-11-20T18:33:12+00:00

Abs · PDF · Code1 · Code2

Abstract

We propose MedBayes-Lite, a lightweight Bayesian enhancement for transformer-based clinical language models designed to produce reliable, uncertainty-aware predictions. Although transformers show strong potential for clinical decision support, they remain prone to overconfidence, especially in ambiguous medical cases where calibrated uncertainty is critical. MedBayes-Lite embeds uncertainty quantification directly into existing transformer pipelines without any retraining or architectural rewiring, adding no new trainable layers and keeping parameter overhead under 3 percent. The framework integrates three components: (i) Bayesian Embedding Calibration using Monte Carlo dropout for epistemic uncertainty, (ii) Uncertainty-Weighted Attention that marginalizes over token reliability, and (iii) Confidence-Guided Decision Shaping inspired by clinical risk minimization. Across biomedical QA and clinical prediction benchmarks (MedQA, PubMedQA, MIMIC-III), MedBayes-Lite consistently improves calibration and trustworthiness, reducing overconfidence by 32 to 48 percent. In simulated clinical settings, it can prevent up to 41 percent of diagnostic errors by flagging uncertain predictions for human review. These results demonstrate its effectiveness in enabling reliable uncertainty propagation and improving interpretability in medical AI systems.

中文标题/摘要

标题：MedBayes-Lite：临床决策支持中的贝叶斯不确定性量化

我们提出MedBayes-Lite，这是一种轻量级的贝叶斯增强方法，用于基于变换器的临床语言模型，旨在生成可靠且具有不确定性意识的预测。尽管变换器在临床决策支持方面表现出强大的潜力，但它们仍然容易过度自信，尤其是在具有歧义性的医疗案例中，而经过校准的不确定性在这种情况下至关重要。MedBayes-Lite 将不确定性量化直接嵌入现有的变换器流水线中，无需重新训练或改动架构，不增加新的可训练层，并保持参数开销低于3%。该框架整合了三个组件：（i）贝叶斯嵌入校准，使用蒙特卡洛dropout估计认知不确定性（epistemic uncertainty）；（ii）不确定性加权注意力，对标记可靠性进行边缘化；（iii）受临床风险最小化启发的信心引导决策塑造。在生物医学问答和临床预测基准测试（MedQA、PubMedQA、MIMIC-III）中，MedBayes-Lite 一致地提高了校准和可信度，将过度自信降低了32%至48%。在模拟的临床环境中，它可以通过标记不确定的预测供人类审查，从而防止多达41%的诊断错误。这些结果表明，它能够有效实现可靠的不确定性传播，并提高医疗AI系统的可解释性。

Summary / 总结

MedBayes-Lite is a lightweight Bayesian enhancement for transformer-based clinical language models to produce reliable and uncertainty-aware predictions. It integrates three components: Bayesian Embedding Calibration, Uncertainty-Weighted Attention, and Confidence-Guided Decision Shaping. Across various biomedical QA and clinical prediction benchmarks, MedBayes-Lite reduces overconfidence by 32 to 48 percent and can prevent up to 41 percent of diagnostic errors by flagging uncertain predictions for human review.

MedBayes-Lite 是一种轻量级的贝叶斯增强方法，用于基于变换器的临床语言模型，以生成可靠且具有不确定性意识的预测。它整合了三种组件：贝叶斯嵌入校准、不确定性加权注意力和信心引导决策塑造。在各种生物医学问答和临床预测基准测试中，MedBayes-Lite 将过度自信降低了32%至48%，并且可以通过标记不确定的预测供人类审查来防止多达41%的诊断错误。
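The general Monte Carlo dropout idea mentioned in this entry can be sketched on a toy model. This does not reproduce MedBayes-Lite: the logistic model, the function names, the dropout rate, and the review threshold are all assumptions for illustration. Dropout stays active at inference, the prediction is repeated many times, and a large spread across passes flags the case for human review.

```python
import math
import random

def stochastic_forward(features, weights, drop_prob, rng):
    """One forward pass of a toy logistic model with dropout kept active."""
    keep = 1.0 - drop_prob
    logit = 0.0
    for x, w in zip(features, weights):
        if rng.random() < keep:
            logit += x * w / keep  # inverted-dropout scaling
    return 1.0 / (1.0 + math.exp(-logit))

def mc_dropout_predict(features, weights, n_samples=200, drop_prob=0.3, seed=0):
    """Mean and standard deviation of the predicted probability over passes."""
    rng = random.Random(seed)
    probs = [stochastic_forward(features, weights, drop_prob, rng)
             for _ in range(n_samples)]
    mean = sum(probs) / n_samples
    var = sum((p - mean) ** 2 for p in probs) / n_samples
    return mean, math.sqrt(var)

def needs_human_review(std, threshold=0.15):
    """Hypothetical rule: defer to a clinician when the spread is large."""
    return std > threshold

mean, std = mc_dropout_predict([1.0, -2.0, 0.5], [0.8, 0.4, -1.2])
print(f"p={mean:.3f} std={std:.3f} review={needs_human_review(std)}")
```

With `drop_prob=0.0` every pass is identical and the spread collapses to zero, so the spread measures only the disagreement introduced by dropout.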

SAM 3D: 3Dfy Anything in Images

Authors: SAM 3D Team, Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, Aohan Lin, Jiawei Liu, Ziqi Ma, Anushka Sagar, Bowen Song, Xiaodong Wang, Jianing Yang, Bowen Zhang, Piotr Dollár, Georgia Gkioxari, Matt Feiszli, Jitendra Malik

First: 2025-11-20T18:31:46+00:00 · Latest: 2025-11-20T18:31:46+00:00

Comments: Website: https://ai.meta.com/sam3d/

Abs · PDF · Code1 · Code2

Abstract

We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D "data barrier". We obtain significant gains over recent work, with at least a 5:1 win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.

中文标题/摘要

标题：SAM 3D：从单张图像中3D化任何内容

我们提出了SAM 3D，这是一种基于视觉的3D物体重建生成模型，可以从单张图像中预测几何形状、纹理和布局。SAM 3D在自然图像中表现出色，这类图像中遮挡和场景杂乱十分常见，来自上下文的视觉识别线索也发挥着更大的作用。我们通过一个人与模型共同参与的标注流程来标注物体形状、纹理和姿态，从而以前所未有的规模提供基于视觉的3D重建数据。我们使用一个现代的多阶段训练框架来学习这些数据，该框架结合了合成预训练和现实世界的对齐，打破了3D“数据壁垒”。与近期工作相比，我们取得了显著的改进，在真实物体和场景的人类偏好测试中胜率至少为5:1。我们将发布我们的代码和模型权重、在线演示以及一个新的具有挑战性的基准测试，用于真实场景（in-the-wild）下的3D物体重建。

Leveraging Reinforcement Learning, Genetic Algorithms and Transformers for background determination in particle physics

Authors: Guillermo Hijano Mendizabal, Davide Lancierini, Alex Marshall, Andrea Mauri, Patrick Haworth Owen, Mitesh Patel, Konstantinos Petridis, Shah Rukh Qasim, Nicola Serra, William Sutcliffe, Hanae Tilquin

First: 2025-09-18T12:17:25+00:00 · Latest: 2025-11-20T18:30:59+00:00

Comments: 34 pages, 12 figures

Abs · PDF · Code1 · Code2

Abstract

Experimental studies of beauty hadron decays face significant challenges due to a wide range of backgrounds arising from the numerous possible decay channels with similar final states. For a particular signal decay, the process for ascertaining the most relevant background processes necessitates a detailed analysis of final state particles, potential misidentifications, and kinematic overlaps, which, due to computational limitations, is restricted to the simulation of only the most relevant backgrounds. Moreover, this process typically relies on the physicist's intuition and expertise, as no systematic method exists. This paper has two primary goals. First, from a particle physics perspective, we present a novel approach that utilises Reinforcement Learning (RL) to overcome the aforementioned challenges by systematically determining the critical backgrounds affecting beauty hadron decay measurements. While beauty hadron physics serves as the case study in this work, the proposed strategy is broadly adaptable to other types of particle physics measurements. Second, from a Machine Learning perspective, we introduce a novel algorithm which exploits the synergy between RL and Genetic Algorithms (GAs) for environments with highly sparse rewards and a large trajectory space. This strategy leverages GAs to efficiently explore the trajectory space and identify successful trajectories, which are used to guide the RL agent's training. Our method also incorporates a transformer architecture for the RL agent to handle token sequences representing decays.

中文标题/摘要

标题：利用强化学习、遗传算法和变换器进行粒子物理中背景确定

由于众多可能的衰变通道具有相似的末态并由此产生种类繁多的背景，美强子（beauty hadron）衰变的实验研究面临重大挑战。对于特定信号衰变，确定最相关背景过程需要对末态粒子、潜在误识别和运动学重叠进行详细分析，但由于计算限制，通常仅模拟最相关的背景。此外，这一过程通常依赖于物理学家的直觉和专业知识，因为没有系统的方法。本文有两个主要目标。首先，从粒子物理的角度来看，我们提出了一种利用强化学习（RL）的新方法，通过系统地确定影响美强子衰变测量的关键背景来克服上述挑战。虽然美强子物理是本文的研究案例，但提出的策略广泛适用于其他类型的粒子物理测量。其次，从机器学习的角度来看，我们引入了一种新的算法，该算法利用强化学习和遗传算法（GAs）之间的协同作用，以应对奖励高度稀疏且轨迹空间巨大的环境。该策略利用遗传算法高效地探索轨迹空间并识别成功轨迹，这些轨迹用于指导RL代理的训练。我们的方法还结合了变换器架构，使RL代理能够处理表示衰变的标记序列。

Summary / 总结

This paper addresses the challenge of identifying the relevant background processes in beauty hadron decay measurements, a task that currently relies on physicists' intuition because no systematic method exists. It proposes an approach that combines Reinforcement Learning (RL), Genetic Algorithms (GAs), and a transformer architecture. The GAs efficiently explore the large trajectory space and identify successful trajectories under highly sparse rewards, these trajectories guide the RL agent's training, and the transformer lets the agent handle token sequences representing decays. Beauty hadron physics serves as the case study, and the authors describe the strategy as broadly adaptable to other particle physics measurements.

该论文提出了一种结合强化学习（RL）、遗传算法（GAs）和变换器的新方法，以系统地确定影响美强子（beauty hadron）衰变测量的关键背景，克服了传统做法依赖物理学家直觉的局限。在奖励高度稀疏、轨迹空间巨大的环境中，该方法利用GAs高效地探索轨迹空间并识别成功轨迹，再用这些轨迹指导RL代理的训练；变换器架构使代理能够处理表示衰变的标记序列。作者指出，该策略可广泛适用于其他类型的粒子物理测量。

SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking

Authors: Haofeng Liu, Ziyue Wang, Sudhanshu Mishra, Mingqi Gao, Guanyi Qin, Chang Han Low, Alex Y. W. Kong, Yueming Jin

First: 2025-11-20T18:18:49+00:00 · Latest: 2025-11-20T18:18:49+00:00

Comments: 11 pages, 4 figures

Abs · PDF · Code1 · Code2 · Project1

Abstract

Surgical video segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as Segment Anything Model 2 (SAM2) provide prompt-based flexibility beyond methods with predefined categories, but face challenges in surgical scenarios due to the domain gap and limited long-term tracking. To address these limitations, we construct SA-SV, the largest surgical iVOS benchmark with instance-level spatio-temporal annotations (masklets) spanning eight procedure types (61k frames, 1.6k masklets), enabling comprehensive development and evaluation for long-term tracking and zero-shot generalization. Building on SA-SV, we propose SAM2S, a foundation model enhancing SAM2 for Surgical iVOS through: (1) DiveMem, a trainable diverse memory mechanism for robust long-term tracking; (2) temporal semantic learning for instrument understanding; and (3) ambiguity-resilient learning to mitigate annotation inconsistencies across multi-source datasets. Extensive experiments demonstrate that fine-tuning on SA-SV enables substantial performance gains, with SAM2 improving by 12.99 average $\mathcal{J}$\&$\mathcal{F}$ over vanilla SAM2. SAM2S further advances performance to 80.42 average $\mathcal{J}$\&$\mathcal{F}$, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points respectively, while maintaining 68 FPS real-time inference and strong zero-shot generalization. Code and dataset will be released at https://jinlab-imvr.github.io/SAM2S.

中文标题/摘要

标题：SAM2S: 通过语义长期跟踪在手术视频中分割一切

手术视频分割对于计算机辅助手术至关重要，能够实现精确的器械和组织定位与跟踪。交互式视频对象分割(iVOS)模型，如分割一切模型2(SAM2)，提供了基于提示的灵活性，超越了具有预定义类别的方法，但在手术场景中由于领域差距和长期跟踪能力有限而面临挑战。为了解决这些限制，我们构建了SA-SV，这是最大的手术iVOS基准数据集，包含跨八种手术类型（61000帧，1600个masklet）的实例级时空注释，使长期跟踪和零样本泛化的全面开发与评估成为可能。基于SA-SV，我们提出了SAM2S，一种面向手术iVOS、对SAM2进行增强的基础模型，其通过以下方式实现：(1) DiveMem，一种可训练的多样化记忆机制，以实现稳健的长期跟踪；(2) 时间语义学习以理解器械；(3) 抗歧义学习以缓解多源数据集中的注释不一致性。大量实验表明，在SA-SV上微调能够带来显著的性能提升，SAM2的平均$\mathcal{J}$\&$\mathcal{F}$相较于原始SAM2提高了12.99分。SAM2S进一步将性能提升至80.42平均$\mathcal{J}$\&$\mathcal{F}$，分别比原始SAM2和微调后的SAM2高出17.10和4.11分，同时保持了68 FPS的实时推理和强大的零样本泛化能力。代码和数据集将在https://jinlab-imvr.github.io/SAM2S/发布。

Summary / 总结

The research aims to improve surgical video segmentation for computer-assisted surgery by addressing the limitations of existing models in long-term tracking and zero-shot generalization. SAM2S, a foundation model enhanced with DiveMem for robust long-term tracking, temporal semantic learning, and ambiguity-resilient learning, significantly improves performance. Fine-tuning on SA-SV, a new surgical iVOS benchmark, results in a 12.99 average $\mathcal{J}$\&$\mathcal{F}$ improvement over vanilla SAM2, and SAM2S achieves 80.42 average $\mathcal{J}$\&$\mathcal{F}$, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points respectively, while maintaining real-time inference speed and zero-shot generalization capabilities.

研究旨在通过解决现有模型在长期跟踪和零样本泛化方面的局限性，提高手术视频分割技术，以支持计算机辅助手术。方法包括构建包含实例级时空注释的SA-SV大型手术iVOS基准，并提出SAM2S，该模型通过DiveMem增强SAM2，实现稳健的长期跟踪、时间语义学习和抗歧义学习。实验结果表明，SAM2S在平均J&F指标上分别比原始SAM2和微调后的SAM2高出17.10和4.11分，同时保持实时推理速度和零样本泛化能力。
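The three reported margins in this entry are mutually consistent, which a few lines of arithmetic confirm. The absolute scores for vanilla and fine-tuned SAM2 below are not stated in the abstract; they are derived here from the reported differences, and the variable names are illustrative.

```python
# Values quoted in the abstract (average J&F points).
sam2s = 80.42
gain_over_vanilla = 17.10
gain_over_finetuned = 4.11
finetune_gain = 12.99

# Implied (derived, not reported) baseline scores.
vanilla = round(sam2s - gain_over_vanilla, 2)
finetuned = round(sam2s - gain_over_finetuned, 2)

print(vanilla, finetuned, round(finetuned - vanilla, 2))  # → 63.32 76.31 12.99
```

The implied fine-tuning gain matches the reported 12.99, so about three quarters of the total improvement over vanilla SAM2 comes from fine-tuning on SA-SV and the remainder from the SAM2S components.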

Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark

Authors: Minhui Zhu, Minyang Tian, Xiaocheng Yang, Tianci Zhou, Lifan Yuan, Penghao Zhu, Eli Chertkov, Shengyan Liu, Yufeng Du, Ziming Ji, Indranil Das, Junyi Cao, Yufeng Du, Jiabin Yu, Peixue Wu, Jinchen He, Yifan Su, Yikun Jiang, Yujie Zhang, Chang Liu, Ze-Min Huang, Weizhen Jia, Yunkai Wang, Farshid Jafarpour, Yong Zhao, Xinan Chen, Jessie Shelton, Aaron W. Young, John Bartolotta, Wenchao Xu, Yue Sun, Anjun Chu, Victor Colussi, Chris Akers, Nathan Brooks, Wenbo Fu, Jinchao Zhao, Marvin Qi, Anqi Mu, Yubo Yang, Allen Zang, Yang Lyu, Peizhi Mai, Christopher Wilson, Xuefei Guo, Juntai Zhou, Daniel Inafuku, Chi Xue, Luyu Gao, Ze Yang, Yaïr Hein, Yonatan Kahn, Kevin Zhou, Di Luo, John Drew Wilson, Jarrod T. Reilly, Dmytro Bandak, Ofir Press, Liang Yang, Xueying Wang, Hao Tong, Nicolas Chia, Eliu Huerta, Hao Peng

First: 2025-09-30T17:34:03+00:00 · Latest: 2025-11-20T18:01:52+00:00

Comments: 39 pages, 6 figures, 6 tables

Abs · PDF · Code1 · Code2

Abstract

While large language models (LLMs) with reasoning capabilities are progressing rapidly on high-school math competitions and coding, can they reason effectively through complex, open-ended challenges found in frontier physics research? And crucially, what kinds of reasoning tasks do physicists want LLMs to assist with? To address these questions, we present the CritPt (Complex Research using Integrated Thinking - Physics Test, pronounced "critical point"), the first benchmark designed to test LLMs on unpublished, research-level reasoning tasks that broadly cover modern physics research areas, including condensed matter, quantum physics, atomic, molecular & optical physics, astrophysics, high energy physics, mathematical physics, statistical physics, nuclear physics, nonlinear dynamics, fluid dynamics and biophysics. CritPt consists of 71 composite research challenges designed to simulate full-scale research projects at the entry level, which are also decomposed into 190 simpler checkpoint tasks for more fine-grained insights. All problems are newly created by 50+ active physics researchers based on their own research. Every problem is hand-curated to admit a guess-resistant and machine-verifiable answer and is evaluated by an automated grading pipeline heavily customized for advanced physics-specific output formats. We find that while current state-of-the-art LLMs show early promise on isolated checkpoints, they remain far from being able to reliably solve full research-scale challenges: the best average accuracy among base models is only 5.7%, achieved by GPT-5 (high), moderately rising to around 10% when equipped with coding tools. Through the realistic yet standardized evaluation offered by CritPt, we highlight a large disconnect between current model capabilities and realistic physics research demands, offering a foundation to guide the development of scientifically grounded AI tools.

中文标题/摘要

标题：探究人工智能推理的临界点（CritPt）：前沿物理研究基准

虽然具备推理能力的大规模语言模型（LLMs）在高中数学竞赛和编程方面取得了快速进展，它们能否有效应对前沿物理研究中复杂、开放性的挑战？更重要的是，物理学家希望LLMs在哪些类型的推理任务上提供帮助？为了解答这些问题，我们提出了CritPt（Complex Research using Integrated Thinking - Physics Test，即“运用综合思维的复杂研究：物理测试”，发音同“critical point”，意为“临界点”），这是首个旨在测试LLMs在未发表的研究级推理任务上表现的基准，这些任务广泛涵盖了现代物理学研究领域，包括凝聚态物理、量子物理、原子、分子与光学物理、天体物理学、高能物理、数学物理、统计物理、核物理、非线性动力学、流体力学和生物物理。CritPt 包含71个综合研究挑战，旨在模拟入门级的全规模研究项目，同时这些挑战被分解为190个更简单的检查点任务，以获得更细粒度的洞察。所有问题均由50多位活跃的物理研究人员基于自己的研究新创建。每个问题都经过人工精心设计，以确保答案难以猜中且可由机器验证，并由一个针对高级物理特定输出格式深度定制的自动化评分管道进行评估。我们发现，尽管当前最先进的LLMs在孤立的检查点上显示出早期的潜力，但它们仍然远不能可靠地解决全规模的研究挑战：基础模型中的最佳平均准确率仅为5.7%，由GPT-5 (high)实现，当配备编程工具时，这一数字适度上升至约10%。通过CritPt提供的现实而标准化的评估，我们突显了当前模型能力与实际物理研究需求之间的巨大差距，为指导有科学依据的人工智能工具的发展提供了基础。

Summary / 总结

The research aims to evaluate the reasoning capabilities of large language models (LLMs) on complex, open-ended challenges in frontier physics research. The CritPt benchmark, consisting of 71 composite research challenges decomposed into 190 simpler checkpoint tasks, is introduced to test LLMs on unpublished, research-level reasoning tasks across various physics disciplines. The study finds that current state-of-the-art LLMs struggle to solve full-scale research challenges: the best base model, GPT-5 (high), reaches only 5.7% average accuracy, rising to around 10% when equipped with coding tools. This highlights a significant gap between current model capabilities and the demands of realistic physics research.

论文介绍了CritPt，这是一个用于评估大型语言模型在物理学复杂研究级推理任务上的基准，涵盖了多个子领域。它包括71个综合挑战，并分解为190个更简单的检查点任务，均由活跃的物理研究人员创建。研究发现，当前的LLM在全规模研究挑战上的表现不佳，最佳基础模型GPT-5 (high)的平均准确率仅为5.7%，配备编程工具后上升至约10%。这突显了当前模型能力与实际物理研究需求之间的巨大差距。

Measuring and Controlling Solution Degeneracy across Task-Trained Recurrent Neural Networks

Authors: Ann Huang, Satpreet H. Singh, Flavio Martinelli, Kanaka Rajan

First: 2024-10-04T23:23:55+00:00 · Latest: 2025-11-20T17:58:39+00:00

Abs · PDF · Code1 · Code2

Abstract

Task-trained recurrent neural networks (RNNs) are widely used in neuroscience and machine learning to model dynamical computations. To gain mechanistic insight into how neural systems solve tasks, prior work often reverse-engineers individual trained networks. However, different RNNs trained on the same task and achieving similar performance can exhibit strikingly different internal solutions, a phenomenon known as solution degeneracy. Here, we develop a unified framework to systematically quantify and control solution degeneracy across three levels: behavior, neural dynamics, and weight space. We apply this framework to 3,400 RNNs trained on four neuroscience-relevant tasks: flip-flop memory, sine wave generation, delayed discrimination, and path integration, while systematically varying task complexity, learning regime, network size, and regularization. We find that higher task complexity and stronger feature learning reduce degeneracy in neural dynamics but increase it in weight space, with mixed effects on behavior. In contrast, larger networks and structural regularization reduce degeneracy at all three levels. These findings empirically validate the Contravariance Principle and provide practical guidance for researchers seeking to tune the variability of RNN solutions, either to uncover shared neural mechanisms or to model the individual variability observed in biological systems. This work provides a principled framework for quantifying and controlling solution degeneracy in task-trained RNNs, offering new tools for building more interpretable and biologically grounded models of neural computation.

中文标题/摘要

标题：跨任务训练递归神经网络中的解空间退化的测量与控制

任务训练递归神经网络（RNNs）在神经科学和机器学习中广泛用于建模动力学计算。为了获得有关神经系统如何解决任务的机制性见解，先前的工作通常会对单个训练好的网络进行逆向工程。然而，不同RNN在相同任务上训练且达到类似性能时，其内部解决方案可能会表现出显著差异，这种现象称为解空间退化。在此，我们开发了一个统一框架，系统地在行为、神经动力学和权重空间三个层面量化和控制解空间退化。我们将此框架应用于3,400个RNN，这些网络在四个与神经科学相关的任务上训练：触发器（flip-flop）记忆、正弦波生成、延迟辨别和路径整合，同时系统地改变任务复杂度、学习机制、网络规模和正则化。我们发现，更高的任务复杂度和更强的特征学习会减少神经动力学中的退化，但会增加权重空间中的退化，对行为的影响则不一。相比之下，更大的网络和结构正则化会减少所有三个层面的退化。这些发现从实证上验证了逆变原理（Contravariance Principle），并为研究人员提供实用指导，以调整RNN解的变异性，无论是为了揭示共享的神经机制，还是为了模拟生物系统中观察到的个体变异性。本研究提供了一个原则性的框架，用于量化和控制任务训练RNN中的解空间退化，为构建更可解释、更具生物学基础的神经计算模型提供了新的工具。

Bridging VLMs and Embodied Intelligence with Deliberate Practice Policy Optimization

Authors: Yi Zhang, Che Liu, Xiancong Ren, Hanchu Ni, Yingji Zhang, Shuai Zhang, Zeyuan Ding, Jiayu Hu, Haozhe Shan, Junbo Qi, Yan Bai, Dengjie Li, Jiachen Luo, Yidong Wang, Yong Dai, Zenglin Xu, Bin Shen, Qifan Wang, Jian Tang, Xiaozhu Ju

First: 2025-11-20T17:58:04+00:00 · Latest: 2025-11-20T17:58:04+00:00

Abs · PDF · Code1 · Code2

Abstract

Developing a universal and versatile embodied intelligence system presents two primary challenges: the critical embodied data bottleneck, where real-world data is scarce and expensive, and the algorithmic inefficiency of existing methods, which are resource-prohibitive. To address these limitations, we introduce Deliberate Practice Policy Optimization (DPPO), a metacognitive "Metaloop" training framework that dynamically alternates between supervised fine-tuning (competence expansion) and reinforcement learning (skill refinement). This enables automatic weakness identification and targeted resource allocation, specifically designed to maximize learning efficiency from sparse, finite data. Theoretically, DPPO can be formalised as a unified preference-learning framework. Empirically, training a vision-language embodied model with DPPO, referred to as Pelican-VL 1.0, yields a 20.3% performance improvement over the base model and surpasses open-source models at the 100B-parameter scale by 10.6%. We are open-sourcing both the models and code, providing the first systematic framework that alleviates the data and resource bottleneck and enables the community to build versatile embodied agents efficiently.

中文标题/摘要

标题：用刻意练习策略优化连接VLM和具身智能

开发通用且多功能的具身智能系统面临两大挑战：具身数据瓶颈，即现实世界数据稀缺且昂贵，以及现有方法的算法低效性，这些方法资源消耗巨大。为解决这些限制，我们引入了刻意练习策略优化（DPPO），这是一种元认知的“元循环”（Metaloop）训练框架，动态交替进行监督微调（能力扩展）和强化学习（技能精炼）。这使得自动识别弱点和有针对性的资源分配成为可能，其设计目标是最大化从稀疏、有限数据中学习的效率。理论上，DPPO 可以被形式化为统一的偏好学习框架。实验上，使用 DPPO 训练的视觉语言具身模型 Pelican-VL 1.0 相较于基础模型性能提升 20.3%，并以 10.6% 的优势超越 100B 参数规模的开源模型。我们开源了模型和代码，提供了第一个系统框架，缓解了数据和资源瓶颈，使社区能够高效地构建多功能具身代理。

You Only Forward Once: An Efficient Compositional Judging Paradigm

Authors: Tianlong Zhang, Hongwei Xue, Shilin Yan, Di Wu, Chen Xu, Yunyun Yang

First: 2025-11-20T17:55:21+00:00 · Latest: 2025-11-20T17:55:21+00:00

Abs · PDF · Code1 · Code2

Abstract

Multimodal large language models (MLLMs) show strong potential as judges. However, existing approaches face a fundamental trade-off: adapting MLLMs to output a single score misaligns with the generative nature of MLLMs and limits fine-grained requirement understanding, whereas autoregressively generating judging analyses is prohibitively slow in high-throughput settings. Observing that judgment reduces to verifying whether inputs satisfy a set of structured requirements, we propose YOFO, a template-conditioned method that judges all requirements in a single forward pass. Built on an autoregressive model, YOFO accepts a structured requirement template and, in one inference step, produces a binary yes/no decision for each requirement by reading the logits of the final token associated with that requirement. This design yields orders-of-magnitude speedups while preserving interpretability. Extensive experiments show that YOFO not only achieves state-of-the-art results on standard recommendation datasets, but also supports dependency-aware analysis (where subsequent judgments are conditioned on previous ones) and further benefits from post-hoc CoT.

中文标题/摘要

标题：你只需前向一次：一种高效的组合评判范式

多模态大型语言模型（MLLMs）作为评判者显示出强大的潜力。然而，现有方法面临一个根本性的权衡：将MLLMs调整为输出单一评分与MLLMs的生成性质相矛盾，并限制了对细粒度需求的理解，而自回归生成评判分析在高吞吐量场景中速度过慢，难以接受。观察到评判归结为验证输入是否满足一组结构化需求，我们提出YOFO，一种模板条件化方法，在单次前向传播中评判所有需求。基于自回归模型，YOFO接受一个结构化需求模板，并在一次推理步骤中，通过读取与每个需求相关联的最终标记的logits，为该需求生成二元的是/否决定。此设计带来了数量级的速度提升，同时保持可解释性。大量实验表明，YOFO不仅在标准推荐数据集上达到了最先进的结果，还支持依赖性感知分析（后续评判以先前的评判为条件），并且能进一步从事后CoT中受益。

Summary / 总结

The paper proposes YOFO, a method for efficient multimodal large language model (MLLM) judging by using a single forward pass to evaluate structured requirements. This approach balances the need for fine-grained requirement understanding with computational efficiency, achieving state-of-the-art results on recommendation datasets and supporting dependency-aware analysis. The design provides orders-of-magnitude speedups while maintaining interpretability.

论文旨在高效利用多模态大型语言模型（MLLMs）进行评判。提出了一种模板条件化方法YOFO，能够在单次前向传播中评判所有结构化需求，在保持可解释性的同时带来数量级的速度提升。实验表明，YOFO在推荐数据集上达到了最先进的结果，支持依赖性感知分析，并能进一步从事后CoT（思维链）中获益。
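The logit-reading step described in this entry can be sketched as follows. This is a toy illustration rather than the YOFO implementation: the function name, the two-way softmax over only the yes/no logits, the 0.5 cutoff, and the token ids are assumptions. Given the logits from one forward pass, each requirement's decision is read at the position of that requirement's final token.

```python
import math

def judge_requirements(logits, requirement_positions, yes_id, no_id):
    """Turn the logits of a single forward pass into per-requirement decisions.

    logits: one list of vocabulary logits per sequence position.
    requirement_positions: index of the final token of each requirement.
    """
    decisions = []
    for pos in requirement_positions:
        yes, no = logits[pos][yes_id], logits[pos][no_id]
        # Two-way softmax restricted to the yes/no logits.
        p_yes = 1.0 / (1.0 + math.exp(no - yes))
        decisions.append((p_yes >= 0.5, p_yes))
    return decisions

# Toy logits for a 4-token sequence over a 3-token vocabulary.
logits = [
    [0.1, 0.2, 0.3],
    [2.0, -1.0, 0.0],   # final token of requirement 0
    [0.0, 0.0, 0.0],
    [-0.5, 1.5, 0.0],   # final token of requirement 1
]
result = judge_requirements(logits, [1, 3], yes_id=0, no_id=1)
print([passed for passed, _ in result])  # → [True, False]
```

Because every decision comes from the same logits tensor, the cost does not grow with the number of generated tokens, which is where the speedup over autoregressive judging would come from.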

Bridging the Gap in XAI-Why Reliable Metrics Matter for Explainability and Compliance

Authors: Pratinav Seth, Vinay Kumar Sankarapu

First: 2025-02-07T06:54:48+00:00 · Latest: 2025-11-20T17:50:45+00:00

Comments: Accepted at first EurIPS Workshop on Private AI Governance

Abs · PDF · Code1 · Code2

Abstract

Reliable explainability is not only a technical goal but also a cornerstone of private AI governance. As AI models enter high-stakes sectors, private actors such as auditors, insurers, certification bodies, and procurement agencies require standardized evaluation metrics to assess trustworthiness. However, current XAI evaluation metrics remain fragmented and prone to manipulation, which undermines accountability and compliance. We argue that standardized metrics can function as governance primitives, embedding auditability and accountability within AI systems for effective private oversight. Building upon prior work in XAI benchmarking, we identify key limitations in ensuring faithfulness, tamper resistance, and regulatory alignment. Furthermore, interpretability can directly support model alignment by providing a verifiable means of ensuring behavioral integrity in General Purpose AI (GPAI) systems. This connection between interpretability and alignment positions XAI metrics as both technical and regulatory instruments that help prevent alignment faking, a growing concern among oversight bodies. We propose a Governance by Metrics paradigm that treats explainability evaluation as a central mechanism of private AI governance. Our framework introduces a hierarchical model linking transparency, tamper resistance, scalability, and legal alignment, extending evaluation from model introspection toward systemic accountability. Through conceptual synthesis and alignment with governance standards, we outline a roadmap for integrating explainability metrics into continuous AI assurance pipelines that serve both private oversight and regulatory needs.

中文标题/摘要

标题：弥合XAI中的差距——可靠指标为何对于可解释性和合规性至关重要

可靠的可解释性不仅是技术目标，也是私人人工智能治理的基石。随着AI模型进入高风险领域，审计员、保险公司、认证机构和采购机构需要标准化评估指标来评估可信度。然而，当前的XAI评估指标仍然支离破碎且容易被操纵，这削弱了问责制和合规性。我们认为标准化指标可以作为治理原语，将审计性和问责性嵌入AI系统，以实现有效的私人监督。基于先前的XAI基准研究，我们指出了确保忠实性、抗篡改性和法规一致性方面的关键局限性。此外，可解释性可以直接支持模型对齐，通过提供一种可验证的方法来确保通用人工智能（GPAI）系统的行为一致性。可解释性与对齐之间的这种联系将XAI指标定位为既是技术又是监管工具，有助于防止对齐欺骗，这是监管机构日益关注的问题。我们提出了一种基于指标的治理范式，将可解释性评估视为私人AI治理的核心机制。我们的框架引入了一个分层模型，将透明性、抗篡改性、可扩展性和法律一致性联系起来，将评估从模型内省扩展到系统问责。通过概念综合并与治理标准对齐，我们概述了一条将可解释性指标整合到同时满足私人监督和监管需求的持续AI保障管道中的路线图。

Summary / 总结

The paper addresses the need for reliable explainability metrics in AI governance, especially in high-stakes sectors. It proposes a Governance by Metrics paradigm that integrates explainability evaluation into private AI oversight and regulatory frameworks. Key findings include the identification of limitations in current XAI metrics and the proposal of a hierarchical model linking transparency, tamper resistance, scalability, and legal alignment to enhance accountability and compliance.

论文探讨了在高风险领域中需要可靠的解释性指标以实现AI治理的问题，并提出了一种通过指标进行治理的框架，将解释性评估整合到私人AI监督中。关键发现包括识别当前XAI指标的局限性，并提出一个将透明度、抗篡改性、可扩展性和法律一致性联系起来的层级模型，以增强AI的责任和合规性。

TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding

Authors: Boshen Xu, Zihan Xiao, Jiaze Li, Jianzhong Ju, Zhenbo Luo, Jian Luan, Qin Jin

First: 2025-11-20T17:48:21+00:00 · Latest: 2025-11-20T17:48:21+00:00

Comments: Project page: https://xuboshen.github.io/TimeViper

Abs · PDF · Code1 · Code2 · Project1

Abstract

We introduce TimeViper, a hybrid vision-language model designed to tackle challenges of long video understanding. Processing long videos demands both an efficient model architecture and an effective mechanism for handling extended temporal contexts. To this end, TimeViper adopts a hybrid Mamba-Transformer backbone that combines the efficiency of state-space models with the expressivity of attention mechanisms. Through this hybrid design, we reveal the vision-to-text information aggregation phenomenon, where information progressively flows from vision tokens to text tokens across increasing LLM depth, resulting in severe vision token redundancy. Motivated by this observation, we propose TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens while maintaining multimodal understanding capabilities. This design enables TimeViper to process hour-long videos exceeding 10,000 frames. Extensive experiments across multiple benchmarks demonstrate that TimeViper competes with state-of-the-art models while extending frame numbers. We further analyze attention behaviors of both Mamba and Transformer layers, offering new insights into hybrid model interpretability. This work represents an initial step towards developing, interpreting, and compressing hybrid Mamba-Transformer architectures.

中文标题/摘要

标题：TimeViper：一种混合Mamba-Transformer视觉语言模型，用于高效理解长视频

我们介绍了TimeViper，一种混合视觉语言模型，旨在解决长视频理解的挑战。处理长视频需要高效的模型架构和有效的机制来处理扩展的时间上下文。为此，TimeViper采用了一种混合Mamba-Transformer骨干，结合了状态空间模型的效率和注意力机制的表达能力。通过这种混合设计，我们揭示了视觉到文本信息聚合的现象，其中信息随着LLM深度增加，从视觉标记逐渐流向文本标记，导致视觉标记冗余严重。受此观察的启发，我们提出了TransV，一种标记信息传输模块，将视觉标记转换并压缩为指令标记，同时保持多模态理解能力。这种设计使TimeViper能够处理超过10,000帧的长达一小时的视频。在多个基准上的广泛实验表明，TimeViper在与最先进的模型竞争的同时，扩展了帧数。我们进一步分析了Mamba和Transformer层的注意力行为，提供了关于混合模型可解释性的新见解。这项工作代表了开发、解释和压缩混合Mamba-Transformer架构的初步步骤。

Summary / 总结

TimeViper is a hybrid Mamba-Transformer model designed for efficient long video understanding. It combines the efficiency of state-space models with the expressivity of attention mechanisms. The model reveals a vision-to-text information aggregation phenomenon, leading to severe redundancy in vision tokens. To address this, TimeViper introduces TransV, a token information transfer module that compresses vision tokens into instruction tokens while maintaining multimodal understanding. Experiments show that TimeViper can process hour-long videos with over 10,000 frames, competing with state-of-the-art models while extending frame numbers. The work also provides insights into the attention behaviors of hybrid models.

TimeViper 是一种结合了状态空间模型效率和注意力机制表达性的混合 Mamba-Transformer 模型，旨在高效处理长视频理解任务。该模型揭示了视觉信息向文本信息的逐步聚合现象，并提出了一种 Token 信息传输模块 TransV，将视觉 Token 压缩为指令 Token 同时保持多模态理解能力。实验表明，TimeViper 可以处理超过 10,000 帧的小时级视频，并与现有最佳模型竞争。此外，该工作还提供了 Mamba 和 Transformer 层注意力行为的新见解。

Green Resilience of Cyber-Physical Systems: Doctoral Dissertation

Authors: Diaeddin Rimawi

First: 2025-11-20T17:46:41+00:00 · Latest: 2025-11-20T17:46:41+00:00

Abs · PDF · Code1 · Code2

Abstract

Cyber-physical systems (CPS) combine computational and physical components. Online Collaborative AI System (OL-CAIS) is a type of CPS that learns online in collaboration with humans to achieve a common goal, which makes it vulnerable to disruptive events that degrade performance. Decision-makers must therefore restore performance while limiting energy impact, creating a trade-off between resilience and greenness. This research addresses how to balance these two properties in OL-CAIS. It aims to model resilience for automatic state detection, develop agent-based policies that optimize the greenness-resilience trade-off, and understand catastrophic forgetting to maintain performance consistency. We model OL-CAIS behavior through three operational states: steady, disruptive, and final. To support recovery during disruptions, we introduce the GResilience framework, which provides recovery strategies through multi-objective optimization (one-agent), game-theoretic decision-making (two-agent), and reinforcement learning (RL-agent). We also design a measurement framework to quantify resilience and greenness. Empirical evaluation uses real and simulated experiments with a collaborative robot learning object classification from human demonstrations. Results show that the resilience model captures performance transitions during disruptions, and that GResilience policies improve green recovery by shortening recovery time, stabilizing performance, and reducing human dependency. RL-agent policies achieve the strongest results, although with a marginal increase in CO2 emissions. We also observe catastrophic forgetting after repeated disruptions, while our policies help maintain steadiness. A comparison with containerized execution shows that containerization cuts CO2 emissions by half. Overall, this research provides models, metrics, and policies that ensure the green recovery of OL-CAIS.

中文标题/摘要

标题：网络物理系统绿色韧性：博士论文

网络物理系统（CPS）结合了计算和物理组件。在线协作人工智能系统（OL-CAIS）是一种CPS，它与人类协作在线学习以实现共同目标，使其容易受到破坏性事件的影响，这些事件会降低性能。决策者因此必须在恢复性能的同时限制能源影响，从而在韧性与绿色性之间产生权衡。本研究探讨了如何在OL-CAIS中平衡这两种属性。其目标是建模韧性以实现自动状态检测，开发基于代理的政策以优化绿色性-韧性权衡，并理解灾难性遗忘以保持性能一致性。我们通过三种操作状态建模OL-CAIS的行为：稳定、破坏性和最终状态。为了在破坏期间支持恢复，我们引入了GResilience框架，该框架通过多目标优化（单代理）、博弈论决策（双代理）和强化学习（RL代理）提供恢复策略。我们还设计了一个度量框架来量化韧性和绿色性。实证评估使用了与人类演示学习物体分类的协作机器人进行的真实和模拟实验。结果表明，韧性模型捕捉了破坏期间的性能过渡，而GResilience策略通过缩短恢复时间、稳定性能和减少人类依赖来提高绿色恢复。尽管RL代理策略的CO2排放略有增加，但它们取得了最强的结果。我们还观察到在重复破坏后出现灾难性遗忘，而我们的策略有助于保持稳定性。与容器化执行的比较表明，容器化将CO2排放量减少了一半。总体而言，本研究提供了确保OL-CAIS绿色恢复的模型、度量和策略。

Summary / 总结

This research addresses the trade-off between resilience and greenness in Online Collaborative AI Systems (OL-CAIS): restoring performance after disruptive events while limiting energy impact. It introduces the GResilience framework, which derives recovery strategies through multi-objective optimization, game-theoretic decision-making, and reinforcement learning. Empirical evaluations show that GResilience policies improve green recovery by shortening recovery time, stabilizing performance, and reducing human dependency, with RL-agent policies achieving the strongest results despite a marginal increase in CO2 emissions. The research also observes catastrophic forgetting after repeated disruptions and reports that containerized execution cuts CO2 emissions by half.

研究旨在通过开发使用多目标优化、博弈论决策和强化学习的GResilience框架来平衡OL-CAIS的韧性和绿色性。通过实际和模拟实验评估该框架，结果显示GResilience策略通过缩短恢复时间并稳定性能来提高绿色恢复，其中RL代理策略效果最佳，尽管二氧化碳排放略有增加。研究还发现灾难性遗忘现象，并表明容器化执行可以显著减少二氧化碳排放。

gfnx: Fast and Scalable Library for Generative Flow Networks in JAX

Authors: Daniil Tiapkin, Artem Agarkov, Nikita Morozov, Ian Maksimov, Askar Tsyganov, Timofei Gritsaev, Sergey Samsonov

First: 2025-11-20T17:44:45+00:00 · Latest: 2025-11-20T17:44:45+00:00

Comments: GitHub: https://github.com/d-tiapkin/gfnx | Documentation: https://gfnx.readthedocs.io

Abs · PDF · Code1 · Code2 · Code3 · Project1 · Project2

Abstract

In this paper, we present gfnx, a fast and scalable package for training and evaluating Generative Flow Networks (GFlowNets) written in JAX. gfnx provides an extensive set of environments and metrics for benchmarking, accompanied by single-file implementations of core objectives for training GFlowNets. We include synthetic hypergrids, multiple sequence generation environments with various editing regimes and particular reward designs for molecular generation, phylogenetic tree construction, Bayesian structure learning, and sampling from the Ising model energy. Across different tasks, gfnx achieves significant wall-clock speedups compared to PyTorch-based benchmarks (such as the torchgfn library) and author implementations. For example, gfnx achieves up to 55 times speedup on CPU-based sequence generation environments, and up to 80 times speedup with the GPU-based Bayesian network structure learning setup. Our package provides a diverse set of benchmarks and aims to standardize empirical evaluation and accelerate research and applications of GFlowNets. The library is available on GitHub (https://github.com/d-tiapkin/gfnx) and on PyPI (https://pypi.org/project/gfnx/). Documentation is available at https://gfnx.readthedocs.io.
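The "core objectives" mentioned above are not listed in the abstract. As an illustration only, the sketch below computes the trajectory balance loss, one widely used GFlowNet training objective, for a single trajectory. It is written in plain Python rather than JAX so that it runs without dependencies, and it does not use or reflect the gfnx API. The function name, argument names, and toy numbers are all assumptions made for this example.

```python
import math


def trajectory_balance_loss(log_z, log_pf_steps, log_pb_steps, log_reward):
    """Squared trajectory-balance residual for one trajectory.

    log_z         -- learned estimate of the log partition function
    log_pf_steps  -- log forward-policy probabilities along the trajectory
    log_pb_steps  -- log backward-policy probabilities along the trajectory
    log_reward    -- log reward of the terminal object
    """
    forward = log_z + sum(log_pf_steps)
    backward = log_reward + sum(log_pb_steps)
    return (forward - backward) ** 2


# Toy two-step trajectory in a tree-structured environment, where every
# state has a single parent and the backward probabilities are all 1.
log_pf = [math.log(0.5), math.log(0.5)]
log_pb = [0.0, 0.0]
log_reward = math.log(2.0)

# With Z = 8 the flow is balanced: 8 * 0.5 * 0.5 == 2, so the loss vanishes.
balanced = trajectory_balance_loss(math.log(8.0), log_pf, log_pb, log_reward)
# With a wrong estimate Z = 1 the residual is nonzero.
unbalanced = trajectory_balance_loss(0.0, log_pf, log_pb, log_reward)
```

In a JAX setting, a function of this shape would typically be vectorized over a batch of trajectories and differentiated with respect to the policy parameters and `log_z`.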

中文标题/摘要

标题：gfnx：基于JAX的快速可扩展生成流网络库

在本文中，我们介绍了gfnx，这是一个用JAX编写的快速可扩展的生成流网络（GFlowNets）训练和评估包。gfnx提供了广泛的环境和指标用于基准测试，并附带了用于训练GFlowNets的核心目标的单文件实现。我们包括了合成超格子、多种序列生成环境（具有不同的编辑制度和特定的奖励设计），用于分子生成、系统发育树构建、贝叶斯结构学习以及从Ising模型能量采样。在不同任务中，gfnx相比基于Pytorch的基准（如torchgfn库）和作者实现，实现了显著的墙钟速度提升。例如，在基于CPU的序列生成环境中，gfnx实现了高达55倍的速度提升；在基于GPU的贝叶斯网络结构学习设置中，实现了高达80倍的速度提升。我们的包提供了一组多样化的基准，旨在标准化实证评估并加速GFlowNets的研究和应用。该库可在GitHub（https://github.com/d-tiapkin/gfnx）和pypi（https://pypi.org/project/gfnx/）上获得。文档可在https://gfnx.readthedocs.io上找到。

Summary / 总结

The paper introduces gfnx, a JAX-based library for training and evaluating Generative Flow Networks (GFlowNets) with a wide range of environments and metrics. It achieves significant wall-clock speedups over PyTorch-based benchmarks: up to 55 times on CPU-based sequence generation and up to 80 times on GPU-based Bayesian network structure learning. The library includes synthetic hypergrids and environments for tasks such as molecular generation, phylogenetic tree construction, and Bayesian structure learning, aiming to standardize GFlowNet evaluation and accelerate research.

本文介绍了gfnx，这是一个基于JAX的库，用于训练和评估生成流网络（GFlowNets），提供了广泛的环境和指标。它包括合成超网格和各种序列生成环境，用于分子生成、系统发育树构建、贝叶斯结构学习和伊辛模型采样。gfnx在CPU基于的序列生成和GPU基于的贝叶斯网络结构学习上分别比Pytorch基准快55倍和80倍。

D-GARA: A Dynamic Benchmarking Framework for GUI Agent Robustness in Real-World Anomalies

Authors: Sen Chen, Tong Zhao, Yi Bin, Fei Ma, Wenqi Shao, Zheng Wang

Venue: AAAI 2026

First: 2025-11-20T17:43:46+00:00 · Latest: 2025-11-20T17:43:46+00:00

Comments: Accepted to AAAI 2026

Abs · PDF · Code1 · Code2

Abstract

Developing intelligent agents capable of operating a wide range of Graphical User Interfaces (GUIs) with human-level proficiency is a key milestone on the path toward Artificial General Intelligence. However, most existing datasets and benchmarks for training and evaluating GUI agents are static and idealized, failing to reflect the complexity and unpredictability of real-world environments, particularly the presence of anomalies. To bridge this research gap, we propose D-GARA, a dynamic benchmarking framework, to evaluate Android GUI agent robustness in real-world anomalies. D-GARA introduces a diverse set of real-world anomalies that GUI agents commonly face in practice, including interruptions such as permission dialogs, battery warnings, and update prompts. Based on the D-GARA framework, we construct and annotate a benchmark featuring commonly used Android applications with embedded anomalies to support broader community research. Comprehensive experiments and results demonstrate substantial performance degradation in state-of-the-art GUI agents when exposed to anomaly-rich environments, highlighting the need for robustness-aware learning. D-GARA is modular and extensible, supporting the seamless integration of new tasks, anomaly types, and interaction scenarios to meet specific evaluation goals.

中文标题/摘要

标题：D-GARA：一种针对真实世界异常的GUI代理鲁棒性动态基准框架

开发能够在各种图形用户界面（GUI）上以人类水平的专业能力操作的智能代理是通向通用人工智能的关键里程碑。然而，现有的大多数用于训练和评估GUI代理的数据集和基准都是静态和理想化的，未能反映真实环境的复杂性和不可预测性，特别是异常的存在。为弥合这一研究缺口，我们提出了D-GARA，一种动态基准框架，用于评估Android GUI代理在真实世界异常中的鲁棒性。D-GARA引入了一组GUI代理在实践中通常面临的多样化的真实世界异常，包括中断如权限对话框、电池警告和更新提示。基于D-GARA框架，我们构建并标注了一个基准，其中包括嵌入异常的常用Android应用程序，以支持更广泛的社区研究。全面的实验和结果表明，最先进的GUI代理在异常丰富的环境中会遭受显著的性能下降，突显了鲁棒性意识学习的必要性。D-GARA是模块化和可扩展的，支持无缝集成新的任务、异常类型和交互场景，以满足特定的评估目标。

Summary / 总结

D-GARA is a dynamic benchmarking framework designed to evaluate the robustness of GUI agents in real-world anomalies. It introduces a variety of real-world interruptions such as permission dialogs and update prompts. Experiments show significant performance drops in state-of-the-art GUI agents when faced with these anomalies, emphasizing the need for robustness-aware learning. D-GARA is modular and extensible, allowing for the integration of new tasks and anomaly types.

研究旨在开发能够以人类水平熟练处理各种GUI的智能代理。为了解决现有基准中缺乏现实世界复杂性的问题，作者提出了D-GARA，这是一个动态框架，用于评估Android GUI代理在现实世界异常中的鲁棒性。该框架引入了各种现实世界的异常，如权限对话框和更新提示。实验表明，最先进的GUI代理在面对这些异常时会表现出显著的性能下降，强调了需要具备鲁棒性的学习。D-GARA是模块化的，可以扩展，支持新任务和异常类型的无缝集成。

Interpretability as Alignment: Making Internal Understanding a Design Principle

Authors: Aadit Sengupta, Pratinav Seth, Vinay Kumar Sankarapu

First: 2025-09-10T13:45:59+00:00 · Latest: 2025-11-20T17:40:35+00:00

Comments: Accepted at the first EurIPS Workshop on Private AI Governance

Abs · PDF · Code1 · Code2

Abstract

Frontier AI systems require governance mechanisms that can verify internal alignment, not just behavioral compliance. Private governance mechanisms (audits, certification, insurance, and procurement) are emerging to complement public regulation, but they require technical substrates that generate verifiable causal evidence about model behavior. This paper argues that mechanistic interpretability provides this substrate. We frame interpretability not as post-hoc explanation but as a design constraint embedding auditability, provenance, and bounded transparency within model architectures. Integrating causal abstraction theory and empirical benchmarks such as MIB and LoBOX, we outline how interpretability-first models can underpin private assurance pipelines and role-calibrated transparency frameworks. This reframing situates interpretability as infrastructure for private AI governance, bridging the gap between technical reliability and institutional accountability.

中文标题/摘要

标题：可解释性即对齐：将内部理解作为设计原则

前沿AI系统需要能够验证内部对齐的治理机制，而不仅仅是行为合规。私人治理机制审计、认证、保险和采购正在兴起以补充公共监管，但它们需要能够生成可验证因果证据的技术基础，以说明模型行为。本文认为，机制可解释性提供了这种基础。我们将可解释性不视为事后解释，而是作为一种设计约束，将审计性、来源和有限透明度嵌入模型架构中。结合因果抽象理论和MIB、LoBOX等实证基准，我们概述了如何使以可解释性为主导的模型成为私人保证管道和角色校准透明框架的基础。这种重新定位将可解释性置于私人AI治理的基础设施，填补了技术可靠性和机构问责制之间的差距。

Summary / 总结

The paper argues that governance of frontier AI systems must verify internal alignment, not just behavioral compliance. It proposes mechanistic interpretability as the technical substrate for generating verifiable causal evidence about model behavior. The authors frame interpretability as a design constraint that embeds auditability, provenance, and bounded transparency within model architectures. Drawing on causal abstraction theory and benchmarks such as MIB and LoBOX, they outline how interpretability-first models can underpin private assurance pipelines and role-calibrated transparency frameworks, bridging the gap between technical reliability and institutional accountability.

本文强调，在AI系统中需要治理机制来验证内部一致性，而不仅仅是行为合规。作者提出，机制化可解释性是这些机制所需的技术基础。研究者将可解释性视为一种设计约束，将审计性、溯源性和有限透明性嵌入到模型架构中。关键发现包括将因果抽象理论与MIB和LoBOX等实证基准相结合，以概述如何通过可解释性优先的模型支持私有保证管道和角色校准透明框架，从而弥合技术可靠性和机构问责制之间的差距。

On Geometry-Enhanced Parameter-Efficient Fine-Tuning for 3D Scene Segmentation

Authors: Liyao Tang, Zhe Chen, Dacheng Tao

Venue: NeurIPS 2025

First: 2025-05-28T15:08:36+00:00 · Latest: 2025-11-20T17:35:54+00:00

Comments: NeurIPS 2025; available at https://github.com/LiyaoTang/GEM

Abs · PDF · Code1 · Code2 · Code3

Abstract

The emergence of large-scale pre-trained point cloud models has significantly advanced 3D scene understanding, but adapting these models to specific downstream tasks typically demands full fine-tuning, incurring high computational and storage costs. Parameter-efficient fine-tuning (PEFT) techniques, successful in natural language processing and 2D vision tasks, underperform when naively applied to 3D point cloud models due to significant geometric and spatial distribution shifts. Existing PEFT methods commonly treat points as orderless tokens, neglecting important local spatial structures and global geometric contexts in 3D modeling. To bridge this gap, we introduce the Geometric Encoding Mixer (GEM), a novel geometry-aware PEFT module specifically designed for 3D point cloud transformers. GEM explicitly integrates fine-grained local positional encodings with a lightweight latent attention mechanism to capture comprehensive global context, thereby effectively addressing the spatial and geometric distribution mismatch. Extensive experiments demonstrate that GEM achieves performance comparable to, and sometimes exceeding, full fine-tuning while updating only 1.6% of the model's parameters, fewer than other PEFT methods. With significantly reduced training time and memory requirements, our approach thus sets a new benchmark for efficient, scalable, and geometry-aware fine-tuning of large-scale 3D point cloud models. Code is available at https://github.com/LiyaoTang/GEM.

中文标题/摘要

标题：几何增强参数高效微调在3D场景分割中的应用

大规模预训练点云模型的出现显著推进了3D场景理解，但将这些模型适应特定下游任务通常需要全面微调，这会带来高昂的计算和存储成本。参数高效微调(PEFT)技术在自然语言处理和2D视觉任务中取得成功，但在直接应用于3D点云模型时会表现不佳，因为存在显著的几何和空间分布差异。现有PEFT方法通常将点视为无序标记，忽略了3D建模中的重要局部空间结构和全局几何上下文。为解决这一问题，我们提出了几何编码混合器(GEM)，这是一种新型的几何感知PEFT模块，专门设计用于3D点云变换器。GEM明确地将细粒度的局部位置编码与轻量级的潜在注意力机制结合，以捕捉全面的全局上下文，从而有效解决了空间和几何分布不匹配的问题。大量实验表明，GEM在性能上可与甚至超过全面微调，同时仅更新模型参数的1.6%，少于其他PEFT方法。通过显著减少训练时间和内存需求，我们的方法为大规模3D点云模型的高效、可扩展和几何感知微调设定了新的基准。代码可在https://github.com/LiyaoTang/GEM获取。

Summary / 总结

This paper addresses the challenge of efficiently fine-tuning large-scale pre-trained 3D point cloud models for specific tasks. It introduces the Geometric Encoding Mixer (GEM), a geometry-aware parameter-efficient fine-tuning module that captures both local positional information and global geometric context. Experimental results show that GEM achieves performance comparable to full fine-tuning with only 1.6% of the parameters updated, significantly reducing computational and storage costs.

本文提出了一种新的几何感知参数高效微调方法GEM，以解决大规模预训练点云模型适应特定任务的问题。GEM通过结合局部位置编码和轻量级注意力机制来捕捉全局上下文，减少了全微调的需求。实验表明，GEM在更新模型参数的1.6%的情况下达到了与全微调相当的性能，显著减少了训练时间和内存需求。

ECPv2: Fast, Efficient, and Scalable Global Optimization of Lipschitz Functions

Authors: Fares Fourati, Mohamed-Slim Alouini, Vaneet Aggarwal

Venue: AAAI 2026

First: 2025-11-20T17:30:55+00:00 · Latest: 2025-11-20T17:30:55+00:00

Comments: Accepted at AAAI 2026 (main technical track), extended version

Abs · PDF · Code1 · Code2

Abstract

We propose ECPv2, a scalable and theoretically grounded algorithm for global optimization of Lipschitz-continuous functions with unknown Lipschitz constants. Building on the Every Call is Precious (ECP) framework, which ensures that each accepted function evaluation is potentially informative, ECPv2 addresses key limitations of ECP, including high computational cost and overly conservative early behavior. ECPv2 introduces three innovations: (i) an adaptive lower bound to avoid vacuous acceptance regions, (ii) a Worst-m memory mechanism that restricts comparisons to a fixed-size subset of past evaluations, and (iii) a fixed random projection to accelerate distance computations in high dimensions. We theoretically show that ECPv2 retains ECP's no-regret guarantees with optimal finite-time bounds and expands the acceptance region with high probability. We further empirically validate these findings through extensive experiments and ablation studies. Using principled hyperparameter settings, we evaluate ECPv2 across a wide range of high-dimensional, non-convex optimization problems. Across benchmarks, ECPv2 consistently matches or outperforms state-of-the-art optimizers, while significantly reducing wall-clock time.
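The abstract does not spell out the acceptance test, so the following is a minimal sketch of a Lipschitz-based acceptance rule in the spirit of the description, assuming a maximization problem. A candidate is evaluated only if the Lipschitz upper bound implied by past evaluations, under a working constant `eps`, still allows it to reach the best value seen so far. The `m` argument is a hypothetical reading of the Worst-m memory as "compare only against the m lowest-valued past points". The function name, the parameters, and this interpretation are assumptions, and the adaptive lower bound and random projection are not shown.

```python
import math


def accepts(candidate, history, eps, m=None):
    """Return True if `candidate` could still match the best value seen so far.

    history -- list of (point, value) pairs from past evaluations
    eps     -- working Lipschitz constant (the true one is unknown)
    m       -- optional memory size; if set, only the m lowest-valued
               past points are compared (assumed Worst-m reading)
    """
    best = max(value for _, value in history)
    if m:
        pool = sorted(history, key=lambda pv: pv[1])[:m]
    else:
        pool = history
    # If f is eps-Lipschitz, then f(x) <= f(x_i) + eps * ||x - x_i|| for all i.
    upper_bound = min(value + eps * math.dist(candidate, point)
                      for point, value in pool)
    return upper_bound >= best


history = [((0.0, 0.0), 1.0), ((1.0, 0.0), 3.0)]

near_bad_point = accepts((0.1, 0.0), history, eps=1.0)   # bound 1.1 < 3
far_from_both = accepts((2.5, 0.0), history, eps=1.0)    # bound 3.5 >= 3
with_memory = accepts((2.5, 0.0), history, eps=1.0, m=1)
```

Restricting the pool to a fixed size makes each test cost O(m) distance computations instead of growing with the number of evaluations, which is consistent with the computational-cost motivation stated in the abstract.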

中文标题/摘要

标题：ECPv2：快速、高效且可扩展的Lipschitz连续函数全局优化算法

我们提出了ECPv2，一种用于优化未知Lipschitz常数的Lipschitz连续函数的可扩展且理论依据充分的算法。基于每次调用都珍贵（ECP）框架，该框架确保每次接受的函数评估都有可能提供信息，ECPv2解决了ECP的关键局限性，包括高计算成本和早期过于保守的行为。ECPv2引入了三项创新：(i) 一种自适应下界以避免空洞的接受区域，(ii) 一种Worst-m记忆机制，限制比较仅限于过去评估的固定大小子集，(iii) 一种固定随机投影以加速高维中的距离计算。我们理论证明ECPv2保留了ECP的无遗憾保证，并且具有最优的有限时间界，同时以高概率扩展接受区域。我们通过广泛的实验和消融研究进一步实证验证了这些发现。通过合理的超参数设置，我们在广泛的高维非凸优化问题上评估了ECPv2。在基准测试中，ECPv2始终能够匹配或超越最先进的优化器，同时显著减少墙钟时间。

Summary / 总结

ECPv2 is a scalable algorithm for global optimization of Lipschitz-continuous functions with unknown Lipschitz constants. It builds on the ECP framework and addresses its limitations by introducing an adaptive lower bound, a Worst-m memory mechanism, and a fixed random projection. Theoretical analysis shows that ECPv2 retains ECP's no-regret guarantees and expands the acceptance region with high probability. Empirical results show that ECPv2 consistently matches or outperforms state-of-the-art optimizers on high-dimensional, non-convex problems while significantly reducing wall-clock time.

ECPv2 是一种用于优化具有未知Lipschitz常数的Lipschitz连续函数的可扩展算法，通过引入自适应下界、Worst-m 记忆机制和固定随机投影来解决其前身的局限性。理论分析表明，ECPv2 维持无遗憾保证，并且以高概率扩展接受区域。实验证明，ECPv2 在高维非凸优化问题上优于最先进的优化器，同时显著减少了计算时间。

POMA-3D: The Point Map Way to 3D Scene Understanding

Authors: Ye Mao, Weixun Luo, Ranran Huang, Junpeng Jing, Krystian Mikolajczyk

First: 2025-11-20T17:22:51+00:00 · Latest: 2025-11-20T17:22:51+00:00

Comments: 11 pages, 6 tables, 5 figures

Abs · PDF · Code1 · Code2 · Project1

Abstract

In this paper, we introduce POMA-3D, the first self-supervised 3D representation model learned from point maps. Point maps encode explicit 3D coordinates on a structured 2D grid, preserving global 3D geometry while remaining compatible with the input format of 2D foundation models. To transfer rich 2D priors into POMA-3D, a view-to-scene alignment strategy is designed. Moreover, as point maps are view-dependent with respect to a canonical space, we introduce POMA-JEPA, a joint embedding-predictive architecture that enforces geometrically consistent point map features across multiple views. Additionally, we introduce ScenePoint, a point map dataset constructed from 6.5K room-level RGB-D scenes and 1M 2D image scenes to facilitate large-scale POMA-3D pretraining. Experiments show that POMA-3D serves as a strong backbone for both specialist and generalist 3D understanding. It benefits diverse tasks, including 3D question answering, embodied navigation, scene retrieval, and embodied localization, all achieved using only geometric inputs (i.e., 3D coordinates). Overall, our POMA-3D explores a point map way to 3D scene understanding, addressing the scarcity of pretrained priors and limited data in 3D representation learning. Project Page: https://matchlab-imperial.github.io/poma3d/
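To make the point map representation concrete, the sketch below back-projects a depth image into an H x W grid of 3D coordinates using the standard pinhole camera model. The abstract does not say how ScenePoint's point maps are computed, so this is only one plausible construction. The function name, the intrinsics `fx`, `fy`, `cx`, `cy`, and the toy depth values are assumptions for illustration, and the output is in the camera frame rather than a canonical scene frame.

```python
def depth_to_point_map(depth, fx, fy, cx, cy):
    """Back-project an H x W depth image into an H x W grid of (X, Y, Z).

    depth          -- nested list of depth values, indexed as depth[v][u]
    fx, fy, cx, cy -- pinhole intrinsics (focal lengths and principal point)
    """
    point_map = []
    for v, row in enumerate(depth):
        out_row = []
        for u, z in enumerate(row):
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            out_row.append((x, y, z))
        point_map.append(out_row)
    return point_map


# Toy 2 x 2 depth image: the top row is 2 units away, the bottom row 4 units.
depth = [[2.0, 2.0],
         [4.0, 4.0]]
point_map = depth_to_point_map(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

The result keeps the 2D grid layout of the input image while each cell stores an explicit 3D coordinate, which is the property the abstract credits for compatibility with 2D foundation models.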

中文标题/摘要

标题：POMA-3D：点图方式的3D场景理解

在本文中，我们介绍了POMA-3D，这是第一个从点图中学习的自监督3D表示模型。点图在结构化的2D网格上编码显式的3D坐标，保留全局3D几何结构的同时与2D基础模型的输入格式兼容。为了将丰富的2D先验知识转移到POMA-3D中，设计了一种视图到场景对齐策略。此外，由于点图相对于标准空间是视图相关的，我们引入了POMA-JEPA，这是一种联合嵌入预测架构，确保了多视图中点图特征的一致性。此外，我们还引入了ScenePoint数据集，该数据集由6500个房间级别的RGB-D场景和100万个2D图像场景构建而成，以促进大规模POMA-3D预训练。实验表明，POMA-3D可以作为3D理解的强骨干，无论是专业领域还是通用领域。它能够服务于多种任务，包括3D问答、体感导航、场景检索和体感定位，所有这些任务仅使用几何输入（即3D坐标）即可实现。总体而言，我们的POMA-3D探索了一种点图方式的3D场景理解方法，解决了3D表示学习中预训练先验知识稀缺和数据有限的问题。项目页面：https://matchlab-imperial.github.io/poma3d/

Summary / 总结

POMA-3D is a self-supervised 3D representation model trained on point maps, which encode 3D coordinates on a structured 2D grid. It uses a view-to-scene alignment strategy and a joint embedding-predictive architecture (POMA-JEPA) to enforce geometric consistency. Experiments demonstrate that POMA-3D excels in various 3D tasks such as question answering, navigation, and localization, using only geometric inputs. This approach addresses the scarcity of pretrained priors and limited data in 3D representation learning.

POMA-3D是一种从点图学习的自监督3D表示模型，点图在结构化的2D网格上编码显式的3D坐标。它使用视图到场景的对齐策略来转移2D先验，并引入联合嵌入预测架构（POMA-JEPA）以确保多视图中的几何一致性。实验表明，POMA-3D在诸如问答、导航和定位等各类3D任务中表现出色，仅使用3D坐标即可实现。这种方法解决了3D表示学习中先验稀缺和数据有限的问题。