arXiv Paper Digest

Unique Lives, Shared World: Learning from Single-Life Videos

Authors: Tengda Han, Sayna Ebrahimi, Dilara Gokay, Li Yang Ku, Maks Ovsjanikov, Iva Babukova, Daniel Zoran, Viorica Patraucean, Joao Carreira, Andrew Zisserman, Dima Damen

First: 2025-12-03T18:59:57+00:00 · Latest: 2025-12-03T18:59:57+00:00

Abstract

We introduce the "single-life" learning paradigm, where we train a distinct vision model exclusively on egocentric videos captured by one individual. We leverage the multiple viewpoints naturally captured within a single life to learn a visual encoder in a self-supervised manner. Our experiments demonstrate three key findings. First, models trained independently on different lives develop a highly aligned geometric understanding. We demonstrate this by training visual encoders on distinct datasets, each capturing a different life, both indoors and outdoors, and by introducing a novel cross-attention-based metric to quantify the functional alignment of the internal representations developed by different models. Second, we show that single-life models learn generalizable geometric representations that effectively transfer to downstream tasks, such as depth estimation, in unseen environments. Third, we demonstrate that training on up to 30 hours from one week of the same person's life leads to comparable performance to training on 30 hours of diverse web data, highlighting the strength of single-life representation learning. Overall, our results establish that the shared structure of the world both leads to consistency in models trained on individual lives and provides a powerful signal for visual representation learning.

中文标题/摘要

标题：独特的人生，共享的世界：从单人生活视频中学习

我们介绍了“单人生活”学习范式，其中我们专门在一个个体拍摄的主观视频上训练一个独特的视觉模型。我们利用单人生活中自然捕捉到的多个视角来以自监督的方式学习一个视觉编码器。我们的实验展示了三个关键发现。首先，独立训练在不同生活中模型发展出高度一致的几何理解。我们通过在各自捕捉不同生活的不同数据集上训练视觉编码器来证明这一点，并引入了一种新的基于交叉注意力的度量来量化不同模型内部表示的功能一致性。其次，我们展示了单人生活模型学习到的可泛化的几何表示能够有效地转移到下游任务，如深度估计，在未见过的环境中。第三，我们证明了在一个个体一周生活中的最多30小时数据上进行训练，其性能与在多样化的网络数据上进行30小时训练相当，突显了单人生活表示学习的强大力量。总体而言，我们的结果表明，共享的世界结构不仅导致在个体生活中训练的模型的一致性，还为视觉表示学习提供了强大的信号。

PosterCopilot: Toward Layout Reasoning and Controllable Editing for Professional Graphic Design

Authors: Jiazhe Wei, Ken Li, Tianyu Lao, Haofan Wang, Liang Wang, Caifeng Shan, Chenyang Si

First: 2025-12-03T18:59:37+00:00 · Latest: 2025-12-03T18:59:37+00:00

Comments: Project page: https://postercopilot.github.io/

Abstract

Graphic design forms the cornerstone of modern visual communication, serving as a vital medium for promoting cultural and commercial events. Recent advances have explored automating this process using Large Multimodal Models (LMMs), yet existing methods often produce geometrically inaccurate layouts and lack the iterative, layer-specific editing required in professional workflows. To address these limitations, we present PosterCopilot, a framework that advances layout reasoning and controllable editing for professional graphic design. Specifically, we introduce a progressive three-stage training strategy that equips LMMs with geometric understanding and aesthetic reasoning for layout design, consisting of Perturbed Supervised Fine-Tuning, Reinforcement Learning for Visual-Reality Alignment, and Reinforcement Learning from Aesthetic Feedback. Furthermore, we develop a complete workflow that couples the trained LMM-based design model with generative models, enabling layer-controllable, iterative editing for precise element refinement while maintaining global visual consistency. Extensive experiments demonstrate that PosterCopilot achieves geometrically accurate and aesthetically superior layouts, offering unprecedented controllability for professional iterative design.

中文标题/摘要

标题：PosterCopilot：朝专业图形设计的布局推理和可控编辑迈进

图形设计是现代视觉交流的基础，是促进文化与商业活动的重要媒介。近年来，通过大型多模态模型（LMMs）自动化这一过程的研究取得了进展，但现有方法往往生成几何不准确的布局，并缺乏专业工作流程中所需的逐层迭代编辑。为解决这些限制，我们提出了PosterCopilot框架，以推进专业图形设计中的布局推理和可控编辑。具体而言，我们引入了一种渐进的三阶段训练策略，使LMMs具备几何理解和美学推理能力，包括扰动监督微调、视觉现实对齐的强化学习以及基于美学反馈的强化学习。此外，我们开发了一个完整的流程，将训练好的LMM基础设计模型与生成模型耦合，实现逐层可控、迭代编辑，以实现精确的元素细化，同时保持全局视觉一致性。大量实验表明，PosterCopilot能够生成几何准确且美学优越的布局，为专业迭代设计提供了前所未有的可控性。

Summary / 总结

The paper addresses the limitations of existing methods in automating graphic design by introducing PosterCopilot, a framework that enhances layout reasoning and controllable editing. It employs a three-stage training strategy involving perturbed supervised fine-tuning, reinforcement learning for visual-reality alignment, and reinforcement learning from aesthetic feedback. The framework integrates a trained LMM-based design model with generative models to enable precise, iterative editing while maintaining global visual consistency. Experimental results show that PosterCopilot produces geometrically accurate and aesthetically superior layouts with enhanced controllability for professional design workflows.

研究旨在通过大型多模态模型（LMM）提高自动化图形设计的准确性和可控性。PosterCopilot提出了一种三阶段训练策略，包括扰动监督微调、视觉现实对齐的强化学习和美学反馈的强化学习，以增强几何理解和美学推理。该框架还支持分层可控的迭代编辑，允许精确细化同时保持全局视觉一致性。实验结果表明，PosterCopilot生成了几何准确且美学优越的布局，并为专业设计工作流程提供了前所未有的可控性。

Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training

Authors: Brian Bartoldson, Siddarth Venkatraman, James Diffenderfer, Moksh Jain, Tal Ben-Nun, Seanie Lee, Minsu Kim, Johan Obando-Ceron, Yoshua Bengio, Bhavya Kailkhura

Venue: NeurIPS 2025

First: 2025-03-24T17:51:39+00:00 · Latest: 2025-12-03T18:56:50+00:00

Comments: NeurIPS 2025; 27 pages

Abstract

Reinforcement learning (RL) is a critical component of large language model (LLM) post-training. However, on-policy algorithms used for post-training are not naturally robust to the diverse contents of experience replay buffers, which asynchronous off-policy actors can efficiently populate in parallel with training. We propose efficiently learning on such off-policy data via Trajectory Balance with Asynchrony (TBA), an approach to asynchronous RL for LLMs that leverages the principled off-policy TB objective. On math, preference-tuning, and automated red-teaming tasks, we post-train models ranging from Pythia 410M to Qwen 2.5 7B, finding TBA offers speed and performance boosts over strong baselines like Online DPO and Dr. GRPO. Beyond TBA's performance benefits (high accuracy even as asynchrony grows) and speedups ($4\times$ or more), we show its reward- and recency-prioritizing sampling enables further gains as data generation is scaled. Our code is available at https://github.com/bbartoldson/TBA.

中文标题/摘要

标题：轨迹平衡与异步性：解耦探索与学习以实现快速、可扩展的大语言模型后训练

强化学习（RL）是大语言模型（LLM）后训练中的关键组成部分。然而，用于后训练的在线策略算法不自然地对经验回放缓冲区中多样化的内容具有鲁棒性，而异步的离策略演员可以并行填充这些内容。我们提出了一种通过轨迹平衡与异步性（TBA）高效地利用离策略数据的方法，这是一种利用原理上离策略TB目标的异步RL方法。在数学、偏好调优和自动化红队任务中，我们对从Pythia 410M到Qwen 2.5 7B的模型进行后训练，发现TBA在速度和性能上都优于如在线DPO和Dr. GRPO等强大的基线。除了TBA的性能优势（即使异步性增加，准确性依然很高）和加速效果（$4\times$或更多），我们还展示了其奖励和近期性优先的采样在数据生成规模扩大时能带来进一步的收益。我们的代码可在https://github.com/bbartoldson/TBA获取。

Summary / 总结

The paper addresses the challenge of using on-policy algorithms in large language model (LLM) post-training, which are not robust to diverse experience replay buffers. It introduces Trajectory Balance with Asynchrony (TBA), an approach that leverages the off-policy Trajectory Balance (TB) objective for asynchronous reinforcement learning. Experiments on various tasks show that TBA outperforms strong baselines like Online DPO and Dr. GRPO, offering both speed and performance improvements. Additionally, TBA's reward- and recency-prioritizing sampling enables further gains when scaling data generation.

论文针对在大规模语言模型后训练中使用在线策略算法面临的挑战，这些算法对多样化的经验回放缓冲不鲁棒。它提出了轨迹平衡与异步性相结合的方法（TBA），利用离策略的轨迹平衡目标进行异步强化学习。在数学、偏好调优和自动化红队任务上的实验表明，TBA在性能和速度上都优于强基线如在线DPO和Dr. GRPO。进一步扩大数据生成规模还能通过TBA的奖励和近期性优先采样方法获得更好的效果。
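
To make the off-policy objective in this entry concrete, here is a minimal sketch of a Trajectory Balance loss for autoregressive generation, paired with a replay sampler that favours high reward and recent insertion. It assumes the standard TB form from the GFlowNet literature, (log Z + Σ log π(token) − log R)², with a trivial backward policy because each token sequence has a single parent. The function names, the buffer layout, the weighting scheme, and the way `log_z` is supplied are all hypothetical and are not taken from the TBA code.

```python
import math
import random

def tb_loss(log_z, token_logprobs, log_reward):
    """Squared trajectory-balance residual for one generated sequence.

    Assumes autoregressive generation, so the backward policy is
    deterministic and contributes log P_B = 0.
    """
    residual = log_z + sum(token_logprobs) - log_reward
    return residual ** 2

def sample_prioritized(buffer, k, rng, recency_weight=0.5):
    """Draw k items with replacement, favouring reward and recency.

    buffer: list of dicts with 'log_reward' and 'step' keys.
    The weighting formula is a hypothetical illustration.
    """
    newest = max(item["step"] for item in buffer)
    weights = [
        math.exp(item["log_reward"]) + recency_weight / (1 + newest - item["step"])
        for item in buffer
    ]
    return rng.choices(buffer, weights=weights, k=k)

# Usage: a tiny buffer filled by (imagined) asynchronous actors.
buffer = [
    {"token_logprobs": [-0.1, -0.2], "log_reward": -0.3, "step": 0},
    {"token_logprobs": [-1.0, -0.7], "log_reward": -0.1, "step": 4},
]
batch = sample_prioritized(buffer, k=4, rng=random.Random(0))
mean_loss = sum(tb_loss(0.0, b["token_logprobs"], b["log_reward"]) for b in batch) / len(batch)
```

Because the loss depends only on the stored log-probabilities under the current policy and the reward, it can in principle be evaluated on sequences produced by stale actors, which is the property the abstract relies on.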

SkillFactory: Self-Distillation For Learning Cognitive Behaviors

Authors: Zayne Sprague, Jack Lu, Manya Wadhwa, Sedrick Keh, Mengye Ren, Greg Durrett

First: 2025-12-03T18:54:53+00:00 · Latest: 2025-12-03T18:54:53+00:00

Abstract

Reasoning models leveraging long chains of thought employ various cognitive skills, such as verification of their answers, backtracking, retrying by an alternate method, and more. Previous work has shown that when a base language model exhibits these skills, training that model further with reinforcement learning (RL) can learn to leverage them. How can we get models to leverage skills that aren't exhibited by base models? Our work, SkillFactory, is a method for fine-tuning models to roughly learn these skills during a supervised fine-tuning (SFT) stage prior to RL. Our approach does not rely on distillation from a stronger model, but instead uses samples from the model itself, rearranged to provide training data in the format of those skills. These "silver" SFT traces may be imperfect, but are nevertheless effective for priming a model to acquire skills during RL. Our evaluation shows that (1) starting from SkillFactory SFT initialization helps a model to generalize to harder variants of a task post-RL, despite lower performance pre-RL; (2) cognitive skills are indeed used by the model; (3) RLed SkillFactory models are more robust to regression on out-of-domain tasks than RLed base models. Our work suggests that inductive biases learned prior to RL help models learn robust cognitive skill use.

中文标题/摘要

标题：SkillFactory：自我蒸馏以学习认知行为

利用长链推理的推理模型会运用各种认知技能，例如验证答案、回溯、使用其他方法重试等。先前的研究表明，当基础语言模型表现出这些技能时，通过强化学习（RL）进一步训练该模型可以学会利用这些技能。我们如何让模型利用基础模型未表现出的技能？我们的工作，SkillFactory，是一种方法，在强化学习（RL）之前的监督微调（SFT）阶段，通过自我蒸馏粗略学习这些技能。我们的方法不依赖于从更强的模型进行蒸馏，而是使用模型自身的样本，重新排列以提供符合这些技能的训练数据。这些“银质”的SFT轨迹可能不完美，但仍然有效，可以为模型在RL期间获取技能提供引导。我们的评估表明：（1）从SkillFactory SFT初始化开始有助于模型在RL后泛化到更难的任务，尽管在RL前的性能较低；（2）认知技能确实被模型使用；（3）经过RL的SkillFactory模型在跨域任务上的鲁棒性比基础模型更好。我们的工作表明，在RL之前学习的归纳偏置有助于模型学习稳健的认知技能使用。

Summary / 总结

SkillFactory is a method for fine-tuning language models to learn cognitive skills such as verification and backtracking before reinforcement learning. It uses self-distillation from model samples to provide training data for these skills. Experiments show that starting with SkillFactory initialization helps models generalize better to harder tasks post-RL, and that RLed SkillFactory models are more robust to out-of-domain task regression compared to RLed base models.

SkillFactory 是一种方法，在强化学习之前通过模型样本进行自我蒸馏来教授模型认知技能，如验证和回溯。评估结果显示，使用 SkillFactory 初始化的模型在强化学习后对更难的任务变体表现更好，使用了认知技能，并且在处理域外任务时更不易出现退化。

DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling

Authors: Kairun Wen, Yuzhi Huang, Runyu Chen, Hui Zheng, Yunlong Lin, Panwang Pan, Chenxin Li, Wenyan Cong, Jian Zhang, Junbin Lu, Chenguo Lin, Dilin Wang, Zhicheng Yan, Hongyu Xu, Justin Theiss, Yue Huang, Xinghao Ding, Rakesh Ranjan, Zhiwen Fan

First: 2025-12-02T18:24:27+00:00 · Latest: 2025-12-03T18:51:37+00:00

Abstract

Understanding the dynamic physical world, characterized by its evolving 3D structure, real-world motion, and semantic content with textual descriptions, is crucial for human-agent interaction and enables embodied agents to perceive and act within real environments with human-like capabilities. However, existing datasets are often derived from limited simulators or utilize traditional Structure-from-Motion for up-to-scale annotation and offer limited descriptive captioning, which restricts the capacity of foundation models to accurately interpret real-world dynamics from monocular videos, commonly sourced from the internet. To bridge these gaps, we introduce DynamicVerse, a physical-scale, multimodal 4D world modeling framework for dynamic real-world video. We employ large vision, geometric, and multimodal models to interpret metric-scale static geometry, real-world dynamic motion, instance-level masks, and holistic descriptive captions. By integrating window-based Bundle Adjustment with global optimization, our method converts long real-world video sequences into a comprehensive 4D multimodal format. DynamicVerse delivers a large-scale dataset consisting of 100K+ videos with 800K+ annotated masks and 10M+ frames from internet videos. Experimental evaluations on three benchmark tasks, namely video depth estimation, camera pose estimation, and camera intrinsics estimation, demonstrate that our 4D modeling achieves superior performance in capturing physical-scale measurements with greater global accuracy than existing methods.

中文标题/摘要

标题：DynamicVerse：一种物理感知的多模态框架用于4D世界建模

理解动态物理世界，其特征为不断演变的三维结构、真实世界的运动以及带有文本描述的语义内容，对于人类代理交互至关重要，并使具身代理能够以类人能力感知和行动于真实环境中。然而，现有数据集往往源自有限的模拟器或利用传统的结构从运动进行缩放标注，提供的描述性标题有限，这限制了基础模型从单目视频中准确解释真实世界动态的能力，这些视频通常来自互联网。为弥合这些差距，我们引入了DynamicVerse，一种物理尺度的多模态4D世界建模框架，用于动态真实世界视频。我们利用大规模视觉、几何和多模态模型来解释度量级静态几何、真实世界动态运动、实例级掩码和整体描述性标题。通过结合基于窗口的束调整与全局优化，我们的方法将长时间的真实世界视频序列转换为全面的4D多模态格式。DynamicVerse提供了一个大规模数据集，包含10万多个视频、80多万个标注掩码和1000多万帧来自互联网视频。在三个基准任务（即视频深度估计、相机姿态估计和相机内参估计）上的实验评估表明，我们的4D建模在捕捉物理尺度测量方面具有更高的全局准确性，优于现有方法。

Summary / 总结

DynamicVerse is a framework designed to model the dynamic physical world with evolving 3D structures, real-world motion, and semantic content. It uses large vision, geometric, and multimodal models to interpret static geometry, dynamic motion, instance-level masks, and descriptive captions. By integrating window-based Bundle Adjustment with global optimization, DynamicVerse converts long video sequences into a 4D multimodal format. The framework provides a large-scale dataset with over 100,000 videos, 800,000 annotated masks, and 10 million frames. Experimental evaluations show that DynamicVerse outperforms existing methods in video depth estimation, camera pose estimation, and camera intrinsics estimation, achieving greater global accuracy in physical-scale measurements.

DynamicVerse 是一个框架，旨在建模具有演变3D结构、真实世界运动和语义内容的动态物理世界。它使用大型视觉、几何和多模态模型来解释静态几何、动态运动、实例级掩码和描述性说明。通过结合窗口式Bundle Adjustment与全局优化，DynamicVerse 将长视频序列转换为4D多模态格式。实验表明，DynamicVerse 在视频深度估计、相机姿态估计和相机内参估计等基准任务中优于现有方法，实现了更高的全局精度和物理尺度测量。

SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL

Authors: Siyi Chen, Mikaela Angelina Uy, Chan Hee Song, Faisal Ladhak, Adithyavairavan Murali, Qing Qu, Stan Birchfield, Valts Blukis, Jonathan Tremblay

First: 2025-12-03T18:50:04+00:00 · Latest: 2025-12-03T18:50:04+00:00

Abstract

Vision Language Models (VLMs) demonstrate strong qualitative visual understanding, but struggle with metrically precise spatial reasoning required for embodied applications. The agentic paradigm promises that VLMs can use a wide variety of tools that could augment these capabilities, such as depth estimators, segmentation models, and pose estimators. Yet it remains an open challenge how to realize this vision without solely relying on handcrafted prompting strategies or enforcing fixed, predefined tool pipelines that limit VLMs' ability to discover optimal tool-use patterns. Reinforcement Learning could overcome this gap, but has so far been limited to reasoning with a single visual tool due to the large search space in multi-tool reasoning. We introduce Double Interactive Reinforcement Learning (DIRL), a two-phase training framework where VLMs learn to coordinate multiple tools through interactive exploration and feedback. In the teaching phase, we combine demonstrations from a single tool specialist trained via interactive RL with traces from a frontier model using all tools. In the exploration phase, the model further refines multi-tool coordination through continued RL. Our model, SpaceTools, with tool-augmented spatial reasoning ability, achieves state-of-the-art performance on spatial understanding benchmarks (RoboSpatial-Home, BLINK, BOP-ASK) and demonstrates reliable real-world manipulation using a 7-DOF robot as a tool. DIRL provides substantial improvements over the vanilla SFT (+12% on RoboSpatial) and RL (+16% on RoboSpatial) baselines. Project page: https://spacetools.github.io/.

中文标题/摘要

标题：SpaceTools: 通过双交互式强化学习增强的空间推理工具

视觉语言模型（VLMs）展示了强大的定性视觉理解能力，但在需要精确度量的空间推理方面存在困难，这正是为实体应用所需的能力。代理范式表明，VLMs 可以利用各种工具来增强这些能力，例如深度估计器、分割模型和姿态估计器。然而，如何在不依赖于手工制作的提示策略或强制执行固定预定义工具管道的情况下实现这一愿景仍是一个开放的挑战。强化学习可以弥补这一差距，但由于多工具推理中的搜索空间庞大，它目前仅限于处理单个视觉工具的推理。我们引入了双交互式强化学习（DIRL），这是一种两阶段训练框架，其中VLMs通过交互探索和反馈学习协调多种工具。在教学阶段，我们将通过交互式RL训练的单工具专家的演示与使用所有工具的前沿模型的轨迹结合起来。在探索阶段，模型进一步通过继续RL细化多工具协调。我们的模型SpaceTools，具有工具增强的空间推理能力，在空间理解基准测试（RoboSpatial-Home、BLINK、BOP-ASK）上实现了最先进的性能，并使用7-DOF机器人作为工具展示了可靠的现实世界操作。DIRL在vanilla SFT（+12%）和RL（+16%）基线上提供了显著的改进。项目页面：https://spacetools.github.io/

Summary / 总结

SpaceTools uses Double Interactive Reinforcement Learning (DIRL) to enable Vision Language Models (VLMs) to coordinate multiple tools for precise spatial reasoning. On RoboSpatial, DIRL improves over the vanilla SFT baseline by 12% and over the RL baseline by 16%. This approach allows VLMs to discover tool-use patterns without relying solely on handcrafted prompting or fixed pipelines.

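SpaceTools uses Double Interactive Reinforcement Learning (DIRL) to let Vision Language Models coordinate multiple tools for precise spatial reasoning, improving results on spatial understanding benchmarks and on real-world manipulation tasks. On RoboSpatial, the model exceeds the vanilla SFT and RL baselines by 12% and 16% respectively, and it demonstrates reliable manipulation using a 7-DOF robot as a tool.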

Eval Factsheets: A Structured Framework for Documenting AI Evaluations

Authors: Florian Bordes, Candace Ross, Justine T Kao, Evangelia Spiliopoulou, Adina Williams

First: 2025-12-03T18:46:50+00:00 · Latest: 2025-12-03T18:46:50+00:00

Abstract

The rapid proliferation of benchmarks has created significant challenges in reproducibility, transparency, and informed decision-making. However, unlike datasets and models -- which benefit from structured documentation frameworks like Datasheets and Model Cards -- evaluation methodologies lack systematic documentation standards. We introduce Eval Factsheets, a structured, descriptive framework for documenting AI system evaluations through a comprehensive taxonomy and questionnaire-based approach. Our framework organizes evaluation characteristics across five fundamental dimensions: Context (Who made the evaluation and when?), Scope (What does it evaluate?), Structure (What is the evaluation built with?), Method (How does it work?) and Alignment (In what ways is it reliable/valid/robust?). We implement this taxonomy as a practical questionnaire spanning five sections with mandatory and recommended documentation elements. Through case studies on multiple benchmarks, we demonstrate that Eval Factsheets effectively capture diverse evaluation paradigms -- from traditional benchmarks to LLM-as-judge methodologies -- while maintaining consistency and comparability. We hope Eval Factsheets are incorporated into both existing and newly released evaluation frameworks and lead to more transparency and reproducibility.

中文标题/摘要

标题：Eval 事实简报：一种结构化框架以记录AI评估

基准的迅速增长造成了可重复性、透明度和知情决策的重大挑战。然而，与受益于结构化文档框架（如数据简报和模型卡片）的数据集和模型不同，评估方法缺乏系统化的文档标准。我们引入了Eval 事实简报，这是一种通过全面的分类和基于问卷的方法来记录AI系统评估的结构化、描述性框架。我们的框架将评估特征组织在五个基本维度上：背景（谁在何时进行了评估？）、范围（评估什么？）、结构（评估由什么构成？）、方法（如何运作？）和对齐（在哪些方面可靠/有效/稳健？）。我们将这种分类作为五个部分的实用问卷实施，包含强制性和推荐的文档元素。通过多个基准的案例研究，我们证明了Eval 事实简报有效地捕捉了各种评估范式——从传统基准到LLM作为评判者的方法——同时保持了一致性和可比性。我们希望Eval 事实简报能够被纳入现有的和新发布的评估框架中，从而提高透明度和可重复性。

Summary / 总结

The paper introduces Eval Factsheets, a structured framework for documenting AI system evaluations to enhance reproducibility and transparency. It organizes evaluation characteristics into five dimensions: Context, Scope, Structure, Method, and Alignment. Through case studies, the framework demonstrates its ability to capture various evaluation paradigms consistently and comparably, aiming to improve informed decision-making in AI evaluations.

论文提出了Eval Factsheets，这是一种结构化的框架，用于记录AI系统评估，以提高可重复性和透明度。该框架将评估特征组织为五个维度：背景、范围、结构、方法和一致性。通过案例研究，该框架展示了其能够一致且比较地捕捉各种评估范式的能力，旨在改善AI评估中的知情决策。
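
One way to picture the five-dimension questionnaire is as a small record type with a completeness check. The sketch below is only an illustration: the five dimension names come from the abstract, while every question key, the `MANDATORY` table, and the `EvalFactsheet` class are hypothetical and do not reproduce the actual questionnaire.

```python
from dataclasses import dataclass, field

# Dimension names follow the abstract; every question key is hypothetical.
MANDATORY = {
    "context": ["authors", "release_date"],
    "scope": ["capability_evaluated"],
    "structure": ["data_sources"],
    "method": ["scoring_procedure"],
    "alignment": ["known_limitations"],
}

@dataclass
class EvalFactsheet:
    name: str
    context: dict = field(default_factory=dict)
    scope: dict = field(default_factory=dict)
    structure: dict = field(default_factory=dict)
    method: dict = field(default_factory=dict)
    alignment: dict = field(default_factory=dict)

    def missing_mandatory(self):
        """Return 'dimension.key' for mandatory answers that are absent or empty."""
        missing = []
        for dimension, keys in MANDATORY.items():
            answers = getattr(self, dimension)
            missing += [f"{dimension}.{key}" for key in keys if not answers.get(key)]
        return missing

# Usage: a partially filled factsheet for an imaginary benchmark.
sheet = EvalFactsheet(
    name="demo-benchmark",
    context={"authors": "Example Lab", "release_date": "2025-01-01"},
    method={"scoring_procedure": "exact match"},
)
gaps = sheet.missing_mandatory()
```

Separating mandatory keys from the free-form answer dictionaries leaves room for recommended elements to be added without changing the completeness check.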

Stable Signer: Hierarchical Sign Language Generative Model

Authors: Sen Fang, Yalin Feng, Hongbin Zhong, Yanxin Zhang, Dimitris N. Metaxas

First: 2025-12-03T18:33:40+00:00 · Latest: 2025-12-03T18:33:40+00:00

Comments: 12 pages, 7 figures. More Demo at https://stablesigner.github.io

Abstract

Sign Language Production (SLP) is the process of converting complex input text into a real video. Most previous works focused on the Text2Gloss, Gloss2Pose, and Pose2Vid stages, and some concentrated on the Prompt2Gloss and Text2Avatar stages. However, the field has progressed slowly because inaccuracies in text conversion, pose generation, and the rendering of poses into real human videos accumulate across these stages. In this paper, we therefore streamline the traditionally redundant structure, simplify and optimize the task objective, and design a new sign language generative model called Stable Signer. It redefines SLP as a hierarchical end-to-end generation task that includes only text understanding (Prompt2Gloss, Text2Gloss) and Pose2Vid. Text understanding is performed by our proposed Sign Language Understanding Linker (SLUL), and hand gestures are generated by an SLP-MoE hand gesture rendering expert block, so that high-quality, multi-style sign language videos are produced end to end. SLUL is trained using the newly developed Semantic-Aware Gloss Masking Loss (SAGM Loss). Performance improves by 48.6% compared to current SOTA generation methods.

中文标题/摘要

标题：稳定手语签名者：层次化手语生成模型

手语生产（SLP）是将复杂的输入文本转换为真实视频的过程。大多数先前的工作集中在Text2Gloss、Gloss2Pose、Pose2Vid阶段，有些则集中在Prompt2Gloss和Text2Avatar阶段。但由于这些阶段中文本转换、姿态生成以及姿态渲染成真实人类视频的不准确性，导致累积了越来越多的错误，因此这一领域进展缓慢。因此，在本文中，我们简化了传统的冗余结构，简化并优化了任务目标，并设计了一个新的手语生成模型，称为稳定手语签名者。它重新定义了SLP任务为一个从文本理解（Prompt2Gloss、Text2Gloss）到Pose2Vid的端到端层次生成任务，并通过我们提出的新的手语理解链接器（SLUL）执行文本理解，通过命名的SLP-MoE手部姿态渲染专家模块生成手部动作，从而端到端生成高质量和多风格的手语视频。SLUL使用新开发的语义感知手语掩码损失（SAGM损失）进行训练。与当前最先进的生成方法相比，其性能提高了48.6%。

Summary / 总结

The paper addresses the challenges in Sign Language Production by proposing a new hierarchical generative model called Stable Signer. It simplifies the traditional multi-stage process and focuses on text understanding and pose-to-video generation. The model uses a new Sign Language Understanding Linker (SLUL) and an SLP-MoE block to generate high-quality sign language videos. The performance of Stable Signer is improved by 48.6% compared to current state-of-the-art methods.

研究旨在通过简化传统的手语生成（SLP）层次结构来提高手语视频生成的准确性和质量。提出的Stable Signer模型将SLP任务重新定义为一个端到端的层次生成过程，专注于文本理解和姿态到视频的生成。该模型使用新的Sign Language Understanding Linker（SLUL）和SLP-MoE手部动作渲染专家模块来生成高质量的手语视频。模型的性能显著提升，比当前最先进的方法提高了48.6%。

Polarization by Design: How Elites Could Shape Mass Preferences as AI Reduces Persuasion Costs

Authors: Nadav Kunievsky

First: 2025-12-03T18:33:26+00:00 · Latest: 2025-12-03T18:33:26+00:00

Abstract

In democracies, major policy decisions typically require some form of majority or consensus, so elites must secure mass support to govern. Historically, elites could shape support only through limited instruments like schooling and mass media; advances in AI-driven persuasion sharply reduce the cost and increase the precision of shaping public opinion, making the distribution of preferences itself an object of deliberate design. We develop a dynamic model in which elites choose how much to reshape the distribution of policy preferences, subject to persuasion costs and a majority rule constraint. With a single elite, any optimal intervention tends to push society toward more polarized opinion profiles (a "polarization pull"), and improvements in persuasion technology accelerate this drift. When two opposed elites alternate in power, the same technology also creates incentives to park society in "semi-lock" regions where opinions are more cohesive and harder for a rival to overturn, so advances in persuasion can either heighten or dampen polarization depending on the environment. Taken together, cheaper persuasion technologies recast polarization as a strategic instrument of governance rather than a purely emergent social byproduct, with important implications for democratic stability as AI capabilities advance.

中文标题/摘要

标题：设计中的极化：AI降低说服成本后精英如何塑造大众偏好

在民主国家中，重大政策决策通常需要某种形式的多数或共识，因此精英必须获得大众支持才能执政。历史上，精英只能通过有限的工具如教育和大众媒体来塑造支持；AI驱动的说服技术的进步大大降低了塑造公众意见的成本并提高了其精确度，使得偏好分布本身成为刻意设计的对象。我们发展了一个动态模型，在该模型中，精英选择重塑政策偏好分布的多少，受到说服成本和多数规则约束。在单一精英的情况下，任何最优干预往往会将社会推向更加极化的意见配置——一种“极化拉力”——并且说服技术的进步会加速这一趋势。当两个对立的精英轮流执政时，同样的技术也会创造激励，使社会停留在“半锁定”区域，在这些区域，意见更加一致且更难被对手推翻，因此，随着说服技术的进步，极化可能会加剧或减弱，具体取决于环境。总体而言，更便宜的说服技术将极化重新定义为治理的战略工具，而不是纯粹的社会副产品，随着AI能力的提升，这对民主稳定具有重要影响。

SDPose: Exploiting Diffusion Priors for Out-of-Domain and Robust Pose Estimation

Authors: Shuang Liang, Jing He, Chuanmeizhi Wang, Lejun Liao, Guo Zhang, Yingcong Chen, Yuan Yuan

First: 2025-09-29T16:09:03+00:00 · Latest: 2025-12-03T18:31:29+00:00

Comments: 20 pages, 10 figures, 7 tables

Abstract

Pre-trained diffusion models provide rich multi-scale latent features and are emerging as powerful vision backbones. While recent works such as Marigold and Lotus adapt diffusion priors for dense prediction with strong cross-domain generalization, their potential for structured outputs remains underexplored. In this paper, we propose SDPose, a fine-tuning framework built upon Stable Diffusion to fully exploit pre-trained diffusion priors for human pose estimation. First, rather than modifying cross-attention modules or introducing learnable embeddings, we directly predict keypoint heatmaps in the SD U-Net's image latent space to preserve the original generative priors. Second, we map these latent features into keypoint heatmaps through a lightweight convolutional pose head, which avoids disrupting the pre-trained backbone. Finally, to prevent overfitting and enhance out-of-distribution robustness, we incorporate an auxiliary RGB reconstruction branch that preserves domain-transferable generative semantics. To evaluate robustness under domain shift, we further construct COCO-OOD, a style-transferred variant of COCO with preserved annotations. With just one-fifth of the training schedule used by Sapiens on COCO, SDPose attains parity with Sapiens-1B/2B on the COCO validation set and establishes a new state of the art on the cross-domain benchmarks HumanArt and COCO-OOD. Extensive ablations highlight the importance of diffusion priors, RGB reconstruction, and multi-scale SD U-Net features for cross-domain generalization, and t-SNE analyses further explain SD's domain-invariant latent structure. We also show that SDPose serves as an effective zero-shot pose annotator for controllable image and video generation.

中文标题/摘要

标题：SDPose：利用扩散先验进行跨域和鲁棒姿态估计

预训练的扩散模型提供了丰富的多尺度潜在特征，正逐渐成为强大的视觉骨干网络。尽管Marigold和Lotus等近期工作利用扩散先验进行密集预测并表现出强大的跨域泛化能力，但它们在结构化输出方面的潜力尚未得到充分探索。本文提出SDPose，一种基于Stable Diffusion的微调框架，旨在充分利用预训练的扩散先验进行人体姿态估计。首先，我们直接在SD U-Net的图像潜在空间中预测关键点热图，而不是修改交叉注意力模块或引入可学习嵌入，以保留原始生成先验。其次，我们通过一个轻量级的卷积姿态头将这些潜在特征映射到关键点热图，避免破坏预训练的骨干网络。最后，为了防止过拟合并增强跨域鲁棒性，我们引入了一个辅助的RGB重建分支，以保留域转移的生成语义。为了评估在域转移下的鲁棒性，我们进一步构建了COCO-OOD，这是一种保留注释的风格转移变体COCO。仅使用Sapiens在COCO上训练时间的五分之一，SDPose在COCO验证集上与Sapiens-1B/2B达到同等性能，并在跨域基准HumanArt和COCO-OOD上建立了新的性能记录。广泛的消融实验强调了扩散先验、RGB重建和多尺度SD U-Net特征对于跨域泛化的关键作用，t-SNE分析进一步解释了SD的域不变潜在结构。我们还展示了SDPose作为可控图像和视频生成的零样本姿态注释器的有效性。

Summary / 总结

SDPose is a fine-tuning framework based on Stable Diffusion for human pose estimation, which directly predicts keypoint heatmaps in the U-Net's latent space to preserve generative priors and uses a lightweight convolutional pose head for feature mapping. It also includes an auxiliary RGB reconstruction branch to enhance robustness. SDPose achieves competitive performance on COCO and establishes a new state-of-the-art on cross-domain benchmarks, demonstrating the importance of diffusion priors and RGB reconstruction for cross-domain generalization.

SDPose 是一个基于 Stable Diffusion 的人体姿态估计框架，直接在 U-Net 的潜空间中预测关键点热图以保留生成先验，并使用轻量级的卷积姿态头进行特征映射。它还包含一个辅助的 RGB 重建分支以增强鲁棒性。SDPose 在 COCO 上达到竞争力的表现，并在跨域基准 HumanArt 和 COCO-OOD 上建立了新的状态-of-the-art，展示了生成先验和 RGB 重建对跨域泛化的重要性。
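
Since the pose head in this entry outputs keypoint heatmaps, a short sketch of how a heatmap is commonly turned into coordinates may help. It uses plain argmax decoding, a generic technique that is assumed here for illustration; the abstract does not say how SDPose decodes its heatmaps, and the `decode_heatmaps` name and `stride` parameter are hypothetical.

```python
def decode_heatmaps(heatmaps, stride=1):
    """Return one (x, y, score) tuple per keypoint heatmap.

    heatmaps: list of 2D grids (lists of rows of floats), one per keypoint.
    stride: factor that maps heatmap cells back to input-image pixels.
    Ties are resolved in favour of the first maximum in row-major order.
    """
    keypoints = []
    for heatmap in heatmaps:
        score, x, y = max(
            ((value, col, row_idx)
             for row_idx, row in enumerate(heatmap)
             for col, value in enumerate(row)),
            key=lambda cell: cell[0],
        )
        keypoints.append((x * stride, y * stride, score))
    return keypoints

# Usage: one 3x3 heatmap whose peak sits at column 2, row 1.
demo = [[[0.0, 0.0, 0.0],
         [0.0, 0.0, 0.9],
         [0.1, 0.0, 0.0]]]
peaks = decode_heatmaps(demo, stride=4)
```

Argmax decoding is limited to the heatmap's grid resolution; sub-pixel refinements exist but are left out to keep the sketch short.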

NVRC: Neural Video Representation Compression

Authors: Ho Man Kwan, Ge Gao, Fan Zhang, Andrew Gower, David Bull

First: 2024-09-11T16:57:12+00:00 · Latest: 2025-12-03T18:31:05+00:00

Abstract

Recent advances in implicit neural representation (INR)-based video coding have demonstrated its potential to compete with both conventional and other learning-based approaches. With INR methods, a neural network is trained to overfit a video sequence, with its parameters compressed to obtain a compact representation of the video content. However, although promising results have been achieved, the best INR-based methods are still outperformed by the latest standard codecs, such as VVC VTM, partially due to the simple model compression techniques employed. In this paper, rather than focusing on representation architectures as in many existing works, we propose a novel INR-based video compression framework, Neural Video Representation Compression (NVRC), targeting compression of the representation. Based on the novel entropy coding and quantization models proposed, NVRC is, for the first time, able to optimize an INR-based video codec in a fully end-to-end manner. To further minimize the additional bitrate overhead introduced by the entropy models, we also propose a new model compression framework for coding all the network, quantization, and entropy model parameters hierarchically. Our experiments show that NVRC outperforms many conventional and learning-based benchmark codecs, with a 24% average coding gain over VVC VTM (Random Access) on the UVG dataset, measured in PSNR. As far as we are aware, this is the first time an INR-based video codec has achieved such performance. The implementation of NVRC will be released.

中文标题/摘要

标题：NVRC：神经视频表示压缩

基于隐式神经表示（INR）的视频编码最近取得了进展，显示出其与传统方法和其它基于学习的方法竞争的潜力。通过INR方法，一个神经网络被训练以拟合视频序列，其参数被压缩以获得视频内容的紧凑表示。然而，尽管取得了有希望的结果，最好的INR方法仍然无法超越最新的标准编解码器，如VVC VTM，部分原因是采用了简单的模型压缩技术。在本文中，我们没有像许多现有工作那样专注于表示架构，而是提出了一种新的基于INR的视频压缩框架——神经视频表示压缩（NVRC），旨在压缩表示。基于我们提出的新型熵编码和量化模型，NVRC首次能够以端到端的方式优化基于INR的视频编解码器。为了进一步减少由熵模型引入的额外比特率开销，我们还提出了一种新的模型压缩框架，用于分层编码网络、量化和熵模型的所有参数。我们的实验表明，NVRC在PSNR指标上比VVC VTM（随机访问）在UVG数据集上的平均编码增益高出24%，超过了多种传统和基于学习的基准编解码器。据我们所知，这是首次实现具有如此性能的基于INR的视频编解码器。NVRC的实现将被发布。

Summary / 总结

The research aims to improve the performance of INR-based video compression by proposing a novel framework, NVRC, which optimizes the representation in an end-to-end manner. NVRC uses novel entropy coding and quantization models and a hierarchical model compression framework to minimize additional bitrate overhead. Experiments show that NVRC outperforms conventional and learning-based benchmark codecs, achieving a 24% average coding gain over VVC VTM on the UVG dataset in PSNR.

NVRC 是一种新型的基于 INR 的视频压缩框架，通过新的熵编码和量化模型在端到端的方式下优化表示。它在 UVG 数据集上以 PSNR 为指标，相对于 VVC VTM 的平均压缩增益达到 24%，这是首次实现基于 INR 的视频编解码器达到如此高的性能水平。
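
The trade-off behind compressing network parameters can be sketched with uniform quantization and a simple bit estimate. This is a simplified stand-in, not NVRC's method: the abstract describes learned quantization and entropy models optimized end to end, whereas the sketch below uses a fixed step size and the empirical entropy of the quantized symbols. All function names are hypothetical.

```python
import math
from collections import Counter

def quantize(params, step):
    """Map each parameter to an integer symbol using a uniform step size."""
    return [round(p / step) for p in params]

def dequantize(symbols, step):
    """Reconstruct approximate parameter values from integer symbols."""
    return [s * step for s in symbols]

def empirical_bits(symbols):
    """Total bits under an ideal code matched to the symbol frequencies.

    A stand-in for a learned entropy model; it ignores the cost of
    transmitting the frequency table itself.
    """
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c * math.log2(c / n) for c in counts.values())

# Usage: a coarser step gives fewer distinct symbols and a larger error.
params = [0.04, 0.26, -0.13, 0.31, 0.02, -0.18]
for step in (0.05, 0.2):
    symbols = quantize(params, step)
    recon = dequantize(symbols, step)
    max_err = max(abs(p - r) for p, r in zip(params, recon))
    bits = empirical_bits(symbols)
```

In an end-to-end codec the rate term (bits) and the distortion term (reconstruction error) are optimized jointly rather than swept by hand as in this loop.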

RELIC: Interactive Video World Model with Long-Horizon Memory

Authors: Yicong Hong, Yiqun Mei, Chongjian Ge, Yiran Xu, Yang Zhou, Sai Bi, Yannick Hold-Geoffroy, Mike Roberts, Matthew Fisher, Eli Shechtman, Kalyan Sunkavalli, Feng Liu, Zhengqi Li, Hao Tan

First: 2025-12-03T18:29:20+00:00 · Latest: 2025-12-03T18:29:20+00:00

Comments: 22 pages

Abstract

A truly interactive world model requires three key ingredients: real-time long-horizon streaming, consistent spatial memory, and precise user control. However, most existing approaches address only one of these aspects in isolation, as achieving all three simultaneously is highly challenging; for example, long-term memory mechanisms often degrade real-time performance. In this work, we present RELIC, a unified framework that tackles these three challenges together. Given a single image and a text description, RELIC enables memory-aware, long-duration exploration of arbitrary scenes in real time. Built upon recent autoregressive video-diffusion distillation techniques, our model represents long-horizon memory using highly compressed historical latent tokens encoded with both relative actions and absolute camera poses within the KV cache. This compact, camera-aware memory structure supports implicit 3D-consistent content retrieval and enforces long-term coherence with minimal computational overhead. In parallel, we fine-tune a bidirectional teacher video model to generate sequences beyond its original 5-second training horizon, and transform it into a causal student generator using a new memory-efficient self-forcing paradigm that enables full-context distillation over long-duration teacher as well as long student self-rollouts. Implemented as a 14B-parameter model and trained on a curated Unreal Engine-rendered dataset, RELIC achieves real-time generation at 16 FPS while demonstrating more accurate action following, more stable long-horizon streaming, and more robust spatial-memory retrieval compared with prior work. These capabilities establish RELIC as a strong foundation for the next generation of interactive world modeling.

中文标题/摘要

标题：RELIC：具有长时记忆的互动视频世界模型

一个真正互动的世界模型需要三个关键要素：实时长时序流式传输、一致的空间记忆和精确的用户控制。然而，大多数现有方法仅在单一方面进行处理，因为同时实现这三个方面是非常具有挑战性的——例如，长期记忆机制往往会影响实时性能。在本工作中，我们提出了RELIC，这是一种统一框架，可以同时解决这三个挑战。给定一张图片和一段文本描述，RELIC能够实现具有记忆意识的、长时间的场景探索。基于最近的自回归视频扩散蒸馏技术，我们的模型使用高度压缩的历史隐态令牌来表示长时序记忆，这些令牌在KV缓存中编码了相对动作和绝对相机姿态。这种紧凑的、相机感知的记忆结构支持隐式的3D一致内容检索，并在最小的计算开销下确保长期一致性。同时，我们对双向教师视频模型进行微调，以生成超出其原始5秒训练窗口的序列，并使用一种新的内存高效自我强迫范式将其转换为因果学生生成器，该范式可以在长时间的教师序列以及长时间的学生自我滚动中实现全面上下文蒸馏。作为140亿参数模型实现，并在精心挑选的Unreal Engine渲染数据集上进行训练，RELIC实现了每秒16帧的实时生成，同时展示了比先前工作更准确的动作跟随、更稳定的长时序流式传输和更稳健的空间记忆检索。这些能力使RELIC成为下一代互动世界建模的强大基础。

Summary / 总结

RELIC is a unified framework that addresses the challenges of real-time long-horizon streaming, consistent spatial memory, and precise user control in interactive world models. It uses autoregressive video-diffusion distillation techniques to represent long-term memory with compressed latent tokens and a camera-aware structure, enabling real-time generation at 16 FPS. Compared to previous methods, RELIC shows more accurate action following, more stable long-horizon streaming, and more robust spatial-memory retrieval.

研究旨在开发一个结合实时长时序流、一致的空间记忆和精确用户控制的世界模型。RELIC 统一框架通过使用压缩的历史隐状态令牌和摄像机感知的记忆结构来应对这些挑战。该模型能够实时生成长时间段的场景，显示出比以往方法更好的动作跟踪、稳定性和空间记忆检索能力。

Sat2Flow: A Structure-Aware Diffusion Framework for Human Flow Generation from Satellite Imagery

Authors: Xiangxu Wang, Tianhong Zhao, Wei Tu, Bowen Zhang, Guanzhou Chen, Jinzhou Cao

First: 2025-08-27T01:05:37+00:00 · Latest: 2025-12-03T18:28:49+00:00

Comments: 9 pages, 5 figures

Abs · PDF · Code1 · Code2

Abstract

Origin-Destination (OD) flow matrices are critical for urban mobility analysis, supporting traffic forecasting, infrastructure planning, and policy design. Existing methods face two key limitations: (1) reliance on costly auxiliary features (e.g., Points of Interest, socioeconomic statistics) with limited spatial coverage, and (2) fragility to spatial topology changes, where reordering urban regions disrupts the structural coherence of generated flows. We propose Sat2Flow, a structure-aware diffusion framework that generates structurally coherent OD flows using only satellite imagery. Our approach employs a multi-kernel encoder to capture diverse regional interactions and a permutation-aware diffusion process that maintains consistency across regional orderings. Through joint contrastive training linking satellite features with OD patterns and equivariant diffusion training enforcing structural invariance, Sat2Flow ensures topological robustness under arbitrary regional reindexing. Experiments on real-world datasets show that Sat2Flow outperforms physics-based and data-driven baselines in accuracy while preserving flow distributions and spatial structures under index permutations. Sat2Flow offers a globally scalable solution for OD flow generation in data-scarce environments, eliminating region-specific auxiliary data dependencies while maintaining structural robustness for reliable mobility modeling.

中文标题/摘要

标题：Sat2Flow：一种基于卫星影像的人流生成结构感知扩散框架

起讫点（OD）流量矩阵是城市交通分析的关键，支持交通预测、基础设施规划和政策设计。现有方法面临两个主要限制：（1）依赖于成本高昂的辅助特征（例如，兴趣点、社会经济统计数据），这些特征的空间覆盖范围有限；（2）对空间拓扑变化的脆弱性，其中重新排序的城市区域会破坏生成流量的结构连贯性。我们提出了一种结构感知的扩散框架Sat2Flow，仅使用卫星影像生成结构连贯的OD流量。我们的方法采用多核编码器捕获多样化的区域交互，并采用排列感知的扩散过程，以在区域顺序变化中保持一致性。通过将卫星特征与OD模式联合对比训练，并通过等变扩散训练强制结构不变性，Sat2Flow确保在任意区域重新索引下具有拓扑鲁棒性。在真实数据集上的实验表明，Sat2Flow在准确度上优于基于物理和数据驱动的基线模型，同时在索引重新排序下保持流量分布和空间结构。Sat2Flow为数据稀缺环境中的OD流量生成提供了一种全球可扩展的解决方案，消除了区域特定的辅助数据依赖性，同时保持结构鲁棒性以实现可靠的交通建模。

Summary / 总结

Sat2Flow is a structure-aware diffusion framework that generates Origin-Destination (OD) flow matrices using only satellite imagery, addressing the limitations of existing methods that rely on costly auxiliary features and are fragile to spatial topology changes. It uses a multi-kernel encoder to capture regional interactions and a permutation-aware diffusion process to maintain structural coherence. Experiments show that Sat2Flow outperforms existing baselines in accuracy while preserving flow distributions and spatial structures under index permutations, making it a reliable solution for OD flow generation in data-scarce environments.

Sat2Flow 是一种结构感知的扩散框架，仅使用卫星影像生成 Origin-Destination (OD) 流量矩阵，解决了现有方法依赖昂贵的辅助特征和对空间拓扑变化敏感的问题。它使用多核编码器捕获不同的区域交互，并使用排列感知的扩散过程保持一致性。实验表明，Sat2Flow 在保持流量分布和空间结构不变的情况下，在准确性上优于现有基线，使其成为在数据稀缺环境中可靠的 OD 流量生成解决方案。
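
The Sat2Flow abstract above requires that generated OD flows stay consistent when regions are reindexed. A minimal sketch of that property follows, using a toy gravity-style generator as a stand-in (an assumption for illustration; it is not Sat2Flow's diffusion model, and the names `gravity_od`, `reindex_regions`, and `reindex_od` are hypothetical). It checks that permuting the regions first and then generating flows gives the same matrix as generating flows first and then permuting rows and columns.

```python
# Hypothetical toy example: a gravity-style OD generator, NOT the Sat2Flow model.

def gravity_od(mass, coords):
    """Toy OD matrix: flow[i][j] = mass[i] * mass[j] / (1 + squared distance)."""
    n = len(mass)
    flow = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # no self-flow in this toy model
            dx = coords[i][0] - coords[j][0]
            dy = coords[i][1] - coords[j][1]
            flow[i][j] = mass[i] * mass[j] / (1.0 + dx * dx + dy * dy)
    return flow


def reindex_regions(values, perm):
    """Reorder per-region values so that new index k holds old region perm[k]."""
    return [values[p] for p in perm]


def reindex_od(flow, perm):
    """Apply the same reordering to both the rows and the columns of an OD matrix."""
    return [[flow[pi][pj] for pj in perm] for pi in perm]


mass = [5.0, 1.0, 3.0]                          # made-up region "sizes"
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]   # made-up region centroids
perm = [2, 0, 1]                                # an arbitrary reindexing

od_then_permute = reindex_od(gravity_od(mass, coords), perm)
permute_then_od = gravity_od(reindex_regions(mass, perm), reindex_regions(coords, perm))
```

A generator with this property produces the same flows regardless of how the regions happen to be numbered, which is the behaviour the abstract describes as robustness under index permutations.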

Ergodic Risk Measures: Towards a Risk-Aware Foundation for Continual Reinforcement Learning

Authors: Juan Sebastian Rojas, Chi-Guhn Lee

First: 2025-10-03T12:40:03+00:00 · Latest: 2025-12-03T18:21:45+00:00

Abs · PDF · Code1 · Code2

Abstract

Continual reinforcement learning (continual RL) seeks to formalize the notions of lifelong learning and endless adaptation in RL. In particular, the aim of continual RL is to develop RL agents that can maintain a careful balance between retaining useful information and adapting to new situations. To date, continual RL has been explored almost exclusively through the lens of risk-neutral decision-making, in which the agent aims to optimize the expected long-run performance. In this work, we present the first formal theoretical treatment of continual RL through the lens of risk-aware decision-making, in which the behaviour of the agent is directed towards optimizing a measure of long-run performance beyond the mean. In particular, we show that the classical theory of risk measures, widely used as a theoretical foundation in non-continual risk-aware RL, is, in its current form, incompatible with continual learning. Then, building on this insight, we extend risk measure theory into the continual setting by introducing a new class of ergodic risk measures that are compatible with continual learning. Finally, we provide a case study of risk-aware continual learning, along with empirical results, which show the intuitive appeal of ergodic risk measures in continual settings.

中文标题/摘要

标题：遍历风险度量：迈向持续强化学习的风险意识基础

持续强化学习（持续RL）旨在形式化终身学习和无尽适应的概念。特别是，持续RL的目标是开发能够保持有用信息和适应新情况之间微妙平衡的RL代理。迄今为止，持续RL几乎完全通过无风险决策的角度进行探索，其中代理的目标是优化长期预期性能。在本文中，我们首次通过风险意识决策的角度对持续RL进行形式化的理论研究，其中代理的行为旨在优化超出均值的长期性能度量。特别是，我们证明了广泛用于非持续风险意识RL的古典风险度量理论，以当前形式与持续学习不兼容。然后，基于这一见解，我们通过引入与持续学习兼容的新类遍历风险度量，将风险度量理论扩展到持续环境中。最后，我们提供了一个风险意识持续学习的案例研究，以及实验结果，表明在持续环境中遍历风险度量具有直观的吸引力。
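
To make "a measure of long-run performance beyond the mean" concrete, here is a small sketch using a sample-based lower-tail CVaR, a standard risk measure. This is an assumption chosen for illustration: it is not the ergodic risk measure introduced in the paper, and the reward streams are made up.

```python
import math


def mean(rewards):
    """Risk-neutral objective: the average reward."""
    return sum(rewards) / len(rewards)


def cvar_lower_tail(rewards, alpha):
    """Sample CVaR: the average of the worst ceil(alpha * n) rewards."""
    k = max(1, math.ceil(alpha * len(rewards)))
    worst = sorted(rewards)[:k]
    return sum(worst) / k


# Two made-up reward streams with the same mean but different tail behaviour.
steady = [1.0] * 10
risky = [2.0] * 9 + [-8.0]

steady_mean, risky_mean = mean(steady), mean(risky)
steady_cvar = cvar_lower_tail(steady, alpha=0.2)
risky_cvar = cvar_lower_tail(risky, alpha=0.2)
```

A risk-neutral agent is indifferent between the two streams because their means match, while the tail-sensitive measure separates them.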

Domain Feature Collapse: Implications for Out-of-Distribution Detection and Solutions

Authors: Hong Yang, Devroop Kar, Qi Yu, Alex Ororbia, Travis Desell

First: 2025-12-03T18:17:49+00:00 · Latest: 2025-12-03T18:17:49+00:00

Abs · PDF · Code1 · Code2

Abstract

Why do state-of-the-art OOD detection methods exhibit catastrophic failure when models are trained on single-domain datasets? We provide the first theoretical explanation for this phenomenon through the lens of information theory. We prove that supervised learning on single-domain data inevitably produces domain feature collapse -- representations where I(x_d; z) = 0, meaning domain-specific information is completely discarded. This is a fundamental consequence of information bottleneck optimization: models trained on single domains (e.g., medical images) learn to rely solely on class-specific features while discarding domain features, leading to catastrophic failure when detecting out-of-domain samples (e.g., achieving only 53% FPR@95 on MNIST). We extend our analysis using Fano's inequality to quantify partial collapse in practical scenarios. To validate our theory, we introduce Domain Bench, a benchmark of single-domain datasets, and demonstrate that preserving I(x_d; z) > 0 through domain filtering (using pretrained representations) resolves the failure mode. While domain filtering itself is conceptually straightforward, its effectiveness provides strong empirical evidence for our information-theoretic framework. Our work explains a puzzling empirical phenomenon, reveals fundamental limitations of supervised learning in narrow domains, and has broader implications for transfer learning and when to fine-tune versus freeze pretrained models.

中文标题/摘要

标题：领域特征坍缩：超分布检测的含义与解决方案

为什么最先进的超分布检测方法在使用单领域数据集训练模型时会出现灾难性失败？我们通过信息论的视角首次提供了这一现象的理论解释。我们证明，单领域数据上的监督学习不可避免地会导致领域特征坍缩——特征表示I(x_d; z) = 0，这意味着领域特定的信息完全被丢弃。这是信息瓶颈优化的基本结果：在单领域（例如，医学图像）上训练的模型学会仅依赖于类别特定特征，而丢弃领域特征，导致在检测超领域样本时出现灾难性失败（例如，在MNIST上仅达到53%的FPR@95）。我们使用Fano不等式扩展了我们的分析，以在实际场景中量化部分坍缩。为了验证我们的理论，我们引入了领域基准（Domain Bench），这是一个单领域数据集基准，并证明通过领域过滤（使用预训练表示）保留I(x_d; z) > 0可以解决这一失败模式。虽然领域过滤本身概念上很简单，但其有效性为我们的信息论框架提供了强有力的实证证据。我们的工作解释了一个令人困惑的实证现象，揭示了监督学习在狭窄领域中的基本局限性，并对迁移学习以及何时微调预训练模型何时冻结预训练模型具有更广泛的影响。

Summary / 总结

This paper explains why state-of-the-art out-of-distribution (OOD) detection methods fail when trained on single-domain datasets. It proves that supervised learning on single-domain data leads to domain feature collapse, where domain-specific information is discarded. This results in poor OOD detection performance, as shown by achieving only 53% FPR@95 on MNIST. The authors introduce Domain Bench to validate their theory and demonstrate that preserving domain-specific information through domain filtering can improve OOD detection. This work provides a theoretical framework for understanding OOD detection failures and has implications for transfer learning and model fine-tuning strategies.

该论文解释了为什么在单一领域数据上训练的先进离域（OOD）检测方法会失效。它表明，单一领域数据上的监督学习会导致领域特征坍缩，即丢弃领域特定的信息。这导致了较差的OOD检测性能。作者使用信息论来证明这一点，并通过Domain Bench验证其理论，表明通过领域过滤保留领域特定信息可以改善OOD检测。
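
The FPR@95 figure quoted above is easier to read with the metric spelled out. The sketch below assumes a common convention: in-distribution (ID) samples are the positive class, higher scores mean "more in-distribution", and the threshold is set so that 95% of ID samples are accepted. The paper's exact evaluation protocol may differ, and the scores here are made up.

```python
import math


def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted at the threshold that keeps `tpr` of ID samples."""
    ranked = sorted(id_scores, reverse=True)
    # Number of ID samples that must score at or above the threshold.
    # The small epsilon guards against floating-point round-up in tpr * n.
    k = math.ceil(tpr * len(ranked) - 1e-9)
    threshold = ranked[k - 1]
    false_positives = sum(1 for s in ood_scores if s >= threshold)
    return false_positives / len(ood_scores)


id_scores = list(range(1, 21))        # 20 made-up ID scores: 1..20
separated_ood = [0, 0, 1, 1]          # OOD scores below almost all ID scores
collapsed_ood = [1, 5, 9, 14]         # OOD scores that overlap the ID range

fpr_separated = fpr_at_tpr(id_scores, separated_ood)
fpr_collapsed = fpr_at_tpr(id_scores, collapsed_ood)
```

Lower is better: a detector whose OOD scores overlap the ID range accepts most OOD samples at this threshold, which is the failure pattern the abstract attributes to domain feature collapse.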

Jina-VLM: Small Multilingual Vision Language Model

Authors: Andreas Koukounas, Georgios Mastrapas, Florian Hönicke, Sedigheh Eslami, Guillaume Roncari, Scott Martens, Han Xiao

First: 2025-12-03T18:13:41+00:00 · Latest: 2025-12-03T18:13:41+00:00

Comments: 18 pages, 1-7 main content

Abs · PDF · Code1 · Code2

Abstract

We present Jina-VLM, a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. Across standard VQA benchmarks and multilingual evaluations, Jina-VLM outperforms comparable models while preserving competitive text-only performance.

中文标题/摘要

标题：Jina-VLM：小型多语言视觉语言模型

我们提出了Jina-VLM，这是一种参数量为24亿的视觉-语言模型，在开放的2B规模的视觉语言模型中，其在多语言视觉问答方面达到了最先进的水平。该模型通过一种注意力池化连接器将SigLIP2视觉编码器与Qwen3语言骨干网络耦合在一起，从而能够高效处理任意分辨率的图像。在标准的视觉问答基准测试和多语言评估中，Jina-VLM 在保持与同类模型相当的纯文本性能的同时，表现更优。

PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation

Authors: Xiaolong Li, Youping Gu, Xi Lin, Weijie Wang, Bohan Zhuang

First: 2025-12-03T18:02:11+00:00 · Latest: 2025-12-03T18:02:11+00:00

Comments: Tech report

Abs · PDF · Code1 · Code2

Abstract

Attention mechanisms are the core of foundation models, but their quadratic complexity remains a critical bottleneck for scaling. This challenge has driven the development of efficient attention mechanisms, with sparsity emerging as the dominant paradigm. Current methods typically retain or discard entire key-value blocks with binary masks, resulting in substantial information loss under high sparsity. To close this gap, we present Pyramid Sparse Attention (PSA), a versatile module applicable to both video understanding and generation tasks. Instead of binary masking, PSA introduces multi-level pooled KV representations, enabling finer mask granularity. Specifically, each query block dynamically allocates lower pooling levels to critical KV blocks and higher levels to less important ones, creating an informative interpolation between full retention and complete pruning. This design, analogous to fixed-point quantization and classical feature pyramid networks in computer vision, effectively mitigates information loss while preserving computational efficiency under a low compute budget. It works with a native, hardware-friendly kernel that leverages decoupled block-tile design to ensure efficient execution. Across video understanding and generation benchmarks, PSA preserves contextual information and visual fidelity, consistently outperforming or performing comparably to existing sparse attention baselines with superior efficiency-quality trade-offs. Our code and model weights are publicly available at: http://ziplab.co/PSA

中文标题/摘要

标题：PSA：金字塔稀疏注意机制在视频理解和生成中的高效应用

注意机制是基础模型的核心，但其二次复杂性仍然是扩展的关键瓶颈。这一挑战推动了高效注意机制的发展，稀疏性已成为主导范式。当前方法通常使用二元掩码保留或丢弃整个键值块，导致在高稀疏度下信息损失严重。为缓解这一差距，我们提出了金字塔稀疏注意机制（PSA），该机制适用于视频理解和生成任务。PSA 不使用二元掩码，而是引入多级池化键值表示，允许更细粒度的掩码。具体而言，每个查询块动态分配较低的池化级别给关键的键值块，而将较高的级别分配给不太重要的键值块，从而在完全保留和完全剪枝之间创建一个信息丰富的插值。该设计类似于固定点量化和计算机视觉中的经典特征金字塔网络，有效地缓解了信息损失，同时在低计算预算下保持计算效率。它与一个原生的、硬件友好的内核兼容，该内核利用解耦的块-瓷砖设计确保高效执行。在视频理解和生成基准测试中，PSA 保留了上下文信息和视觉保真度，始终优于现有稀疏注意机制基线，具有更优的效率-质量权衡。我们的代码和模型权重可在以下网址获取：http://ziplab.co/PSA

Summary / 总结

The research aims to address the computational bottleneck of attention mechanisms in foundation models by proposing Pyramid Sparse Attention (PSA), which introduces multi-level pooled key-value representations to mitigate information loss under high sparsity. PSA dynamically allocates pooling levels to different key-value blocks based on their importance, providing a finer granularity of mask control. Experimental results show that PSA outperforms existing sparse attention methods in both video understanding and generation tasks, maintaining high efficiency and quality.

论文提出了金字塔稀疏注意（PSA），通过引入多级池化键值表示来缓解高稀疏度下的信息丢失问题。PSA适用于视频理解和生成任务，并根据键值块的重要性动态分配池化级别。实验结果表明，PSA在各种基准测试中优于现有稀疏注意方法，在效率和性能方面具有优势，同时保持了上下文信息和视觉保真度。
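
The multi-level pooling idea in the PSA entry above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the released kernel: `mean_pool` and `assign_levels` are hypothetical helpers, mean pooling over groups of 2**level tokens is assumed, and the rank-based level assignment is a made-up rule standing in for PSA's dynamic allocation.

```python
def mean_pool(block, level):
    """Mean-pool consecutive groups of 2**level token vectors within one KV block."""
    group = 2 ** level
    pooled = []
    for start in range(0, len(block), group):
        chunk = block[start:start + group]
        dim = len(chunk[0])
        pooled.append([sum(tok[d] for tok in chunk) / len(chunk) for d in range(dim)])
    return pooled


def assign_levels(importance, max_level):
    """Hypothetical rule: rank blocks by importance; rank r gets level min(r, max_level)."""
    order = sorted(range(len(importance)), key=lambda i: -importance[i])
    levels = [0] * len(importance)
    for rank, idx in enumerate(order):
        levels[idx] = min(rank, max_level)
    return levels


# Three made-up KV blocks of four 2-dimensional tokens each.
blocks = [
    [[1.0, 0.0], [3.0, 0.0], [5.0, 0.0], [7.0, 0.0]],
    [[0.0, 2.0], [0.0, 4.0], [0.0, 6.0], [0.0, 8.0]],
    [[1.0, 1.0], [1.0, 1.0], [3.0, 3.0], [3.0, 3.0]],
]
importance = [0.9, 0.1, 0.5]  # made-up per-block importance for one query block

levels = assign_levels(importance, max_level=2)
pooled = [mean_pool(block, level) for block, level in zip(blocks, levels)]
kept_tokens = sum(len(p) for p in pooled)
```

Here the most important block keeps all four tokens, the middle one is halved, and the least important one is reduced to a single averaged token, so 7 of the 12 tokens remain instead of a binary keep-or-drop decision per block.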

C3G: Learning Compact 3D Representations with 2K Gaussians

Authors: Honggyu An, Jaewoo Jung, Mungyeom Kim, Sunghwan Hong, Chaehyun Kim, Kazumi Fukuda, Minkyeong Jeon, Jisang Han, Takuya Narihira, Hyuna Ko, Junsu Kim, Yuki Mitsufuji, Seungryong Kim

First: 2025-12-03T17:59:05+00:00 · Latest: 2025-12-03T17:59:05+00:00

Comments: Project Page : https://cvlab-kaist.github.io/C3G/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Reconstructing and understanding 3D scenes from unposed sparse views in a feed-forward manner remains a challenging task in 3D computer vision. Recent approaches use per-pixel 3D Gaussian Splatting for reconstruction, followed by a 2D-to-3D feature lifting stage for scene understanding. However, they generate excessive redundant Gaussians, causing high memory overhead and sub-optimal multi-view feature aggregation, leading to degraded novel view synthesis and scene understanding performance. We propose C3G, a novel feed-forward framework that estimates compact 3D Gaussians only at essential spatial locations, minimizing redundancy while enabling effective feature lifting. We introduce learnable tokens that aggregate multi-view features through self-attention to guide Gaussian generation, ensuring each Gaussian integrates relevant visual features across views. We then exploit the learned attention patterns for Gaussian decoding to efficiently lift features. Extensive experiments on pose-free novel view synthesis, 3D open-vocabulary segmentation, and view-invariant feature aggregation demonstrate our approach's effectiveness. Results show that a compact yet geometrically meaningful representation is sufficient for high-quality scene reconstruction and understanding, achieving superior memory efficiency and feature fidelity compared to existing methods.

中文标题/摘要

标题：C3G：使用2K高斯分布学习紧凑的3D表示

从未摆姿势的稀疏视角以前馈方式重建和理解3D场景仍然是3D计算机视觉中的一个具有挑战性的任务。最近的方法使用逐像素3D高斯散点进行重建，随后通过2D到3D特征提升阶段进行场景理解。然而，它们生成了过多的冗余高斯分布，导致高内存开销和多视图特征聚合的次优性能，从而降低了新颖视角合成和场景理解的效果。我们提出了一种名为C3G的新型前馈框架，仅在关键空间位置估计紧凑的3D高斯分布，从而最小化冗余并使特征提升更有效。我们引入了可学习的令牌，通过自注意力聚合多视图特征以指导高斯分布的生成，确保每个高斯分布整合了来自不同视角的相关视觉特征。然后利用学习到的注意力模式进行高斯解码，以高效地提升特征。在无姿态新颖视角合成、3D开放词汇分割和视不变特征聚合方面的广泛实验表明了我们方法的有效性。结果表明，一个紧凑且几何上有意义的表示足以实现高质量的场景重建和理解，相比现有方法具有更高的内存效率和特征保真度。

Summary / 总结

C3G addresses the challenge of reconstructing and understanding 3D scenes from unposed sparse views by proposing a compact 3D Gaussian representation. It uses learnable tokens to guide the generation of essential spatial Gaussians, reducing redundancy and improving feature aggregation. Experiments show that C3G outperforms existing methods in terms of memory efficiency and feature fidelity, achieving high-quality novel view synthesis and scene understanding.

论文旨在通过前馈方法从未定位的稀疏视图中重建和理解3D场景。提出了一种名为C3G的方法，该方法在关键空间位置估计紧凑的3D高斯分布，以减少冗余并改善特征聚合。该方法使用可学习的令牌和自注意力来引导高斯生成和特征提升，从而提高新颖视图合成和场景理解的效果。实验结果显示，C3G在内存效率和特征保真度方面优于现有方法。

Ultra-lightweight Neural Video Representation Compression

Authors: Ho Man Kwan, Tianhao Peng, Ge Gao, Fan Zhang, Mike Nilsson, Andrew Gower, David Bull

First: 2025-12-03T17:56:44+00:00 · Latest: 2025-12-03T17:56:44+00:00

Abs · PDF · Code1 · Code2

Abstract

Recent works have demonstrated the viability of utilizing over-fitted implicit neural representations (INRs) as alternatives to autoencoder-based models for neural video compression. Among these INR-based video codecs, Neural Video Representation Compression (NVRC) was the first to adopt a fully end-to-end compression framework that compresses INRs, achieving state-of-the-art performance. Moreover, some recently proposed lightweight INRs have shown comparable performance to their baseline codecs with computational complexity lower than 10kMACs/pixel. In this work, we extend NVRC toward lightweight representations, and propose NVRC-Lite, which incorporates two key changes. Firstly, we integrate multi-scale feature grids into our lightweight neural representation, and the use of higher resolution grids significantly improves the performance of INRs at low complexity. Secondly, we address the issue that existing INRs typically leverage autoregressive models for entropy coding: these are effective but impractical due to their slow coding speed. In this work, we propose an octree-based context model for entropy coding high-dimensional feature grids, which accelerates the entropy coding module of the model. Our experimental results demonstrate that NVRC-Lite outperforms C3, one of the best lightweight INR-based video codecs, with up to 21.03% and 23.06% BD-rate savings when measured in PSNR and MS-SSIM, respectively, while achieving 8.4x encoding and 2.5x decoding speedup. The implementation of NVRC-Lite will be made available.

中文标题/摘要

标题：超轻量神经视频表示压缩

近期研究表明，可以利用过拟合的隐式神经表示（INRs）作为自编码器模型的替代方案，用于神经视频压缩。在这些基于INR的视频编解码器中，神经视频表示压缩（NVRC）是第一个采用完全端到端压缩框架的编解码器，该框架压缩INRs，实现了最先进的性能。此外，一些最近提出的轻量级INRs在计算复杂度低于10kMACs/像素的情况下，与基线编解码器具有可比的性能。在本文中，我们将NVRC扩展到轻量级表示，并提出了NVRC-Lite，其中包含两个关键变化。首先，我们将多尺度特征网格集成到我们的轻量级神经表示中，使用更高分辨率的网格在低复杂度下显著提高了INRs的性能。其次，我们解决了现有INRs通常使用自回归模型进行熵编码的问题：这些模型虽然有效，但由于编码速度慢而不实用。在本文中，我们提出了一种基于八叉树的上下文模型，用于熵编码高维特征网格，从而加速了模型的熵编码模块。我们的实验结果表明，与C3（一种性能最佳的轻量级INR基视频编解码器）相比，NVRC-Lite在PSNR和MS-SSIM下的BD率节省分别高达21.03%和23.06%，同时实现了8.4倍的编码速度和2.5倍的解码速度提升。NVRC-Lite的实现将可供使用。

Summary / 总结

This work aims to improve the performance of neural video compression by developing a lightweight version of NVRC, called NVRC-Lite. NVRC-Lite incorporates multi-scale feature grids and an octree-based context model for entropy coding, which enhances performance at low complexity and accelerates the coding process. Experimental results show that NVRC-Lite outperforms C3 with up to 21.03% and 23.06% BD-rate savings in PSNR and MS-SSIM, respectively, while achieving significant speedup in both encoding and decoding processes.

该研究旨在通过开发NVRC的轻量级版本NVRC-Lite来提高神经视频压缩的性能。NVRC-Lite结合了多尺度特征网格和基于八叉树的上下文模型进行熵编码，这在低计算复杂度下提升了性能。实验结果表明，NVRC-Lite在PSNR和MS-SSIM中分别比C3节省了高达21.03%和23.06%的BD率，并且在编码和解码过程中实现了显著的速度提升。

Learning Group Actions In Disentangled Latent Image Representations

Authors: Farhana Hossain Swarnali, Miaomiao Zhang, Tonmoy Hossain

First: 2025-12-03T17:52:24+00:00 · Latest: 2025-12-03T17:52:24+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Modeling group actions on latent representations enables controllable transformations of high-dimensional image data. Prior works applying group-theoretic priors or modeling transformations typically operate in the high-dimensional data space, where group actions apply uniformly across the entire input, making it difficult to disentangle the subspace that varies under transformations. While latent-space methods offer greater flexibility, they still require manual partitioning of latent variables into equivariant and invariant subspaces, limiting the ability to robustly learn and operate group actions within the representation space. To address this, we introduce a novel end-to-end framework that for the first time learns group actions on latent image manifolds, automatically discovering transformation-relevant structures without manual intervention. Our method uses learnable binary masks with straight-through estimation to dynamically partition latent representations into transformation-sensitive and invariant components. We formulate this within a unified optimization framework that jointly learns latent disentanglement and group transformation mappings. The framework can be seamlessly integrated with any standard encoder-decoder architecture. We validate our approach on five 2D/3D image datasets, demonstrating its ability to automatically learn disentangled latent factors for group actions in diverse data, while downstream classification tasks confirm the effectiveness of the learned representations. Our code is publicly available at https://github.com/farhanaswarnali/Learning-Group-Actions-In-Disentangled-Latent-Image-Representations .

中文标题/摘要

标题：学习离散潜图像表示中的群组动作

在潜表示上建模群组动作能够使高维图像数据的可控变换成为可能。先前的工作应用群论先验或建模变换通常在高维数据空间中进行，其中群组动作在整个输入上均匀应用，使得难以分离在变换下变化的子空间。虽然潜空间方法提供了更大的灵活性，但它们仍然需要手动将潜变量划分为协变和不变子空间，限制了在表示空间内稳健学习和操作群组动作的能力。为了解决这个问题，我们引入了一种全新的端到端框架，该框架首次能够在潜图像流形上学习群组动作，无需手动干预即可自动发现与变换相关的结构。我们的方法使用可学习的二进制掩码和直通估计来动态地将潜表示划分为变换敏感和不变分量。我们在此统一优化框架中联合学习潜分离和群变换映射。该框架可以无缝集成到任何标准编码器-解码器架构中。我们在五个2D/3D图像数据集上验证了我们的方法，证明了其能够自动学习适用于群组动作的潜分离因子的能力，而下游分类任务则证实了所学习表示的有效性。我们的代码已公开发布在https://github.com/farhanaswarnali/Learning-Group-Actions-In-Disentangled-Latent-Image-Representations 。

Summary / 总结

This paper introduces a novel end-to-end framework for learning group actions on latent image representations, automatically partitioning latent variables into transformation-sensitive and invariant components without manual intervention. The method uses learnable binary masks and straight-through estimation to achieve this, and is integrated within a unified optimization framework. Experiments on five 2D/3D image datasets show that the approach can learn disentangled latent factors for group actions, and the learned representations improve downstream classification tasks.

该论文提出了一种新型端到端框架，用于在解纠缠的潜在图像表示中学习群组动作，该框架无需手动干预即可自动将潜在变量划分为对变换敏感和不变的组件。该方法使用可学习的二进制掩码和直通估计在统一的优化框架中联合学习潜在解纠缠和群组变换映射。在五个2D/3D图像数据集上的实验表明，该方法可以有效地学习用于群组动作的解纠缠潜在因子，并且下游分类任务验证了所学习表示的有效性。
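
The "learnable binary masks with straight-through estimation" mentioned above can be illustrated with a hand-written forward and backward pass. This is a sketch of one common straight-through variant (hard threshold in the forward pass, sigmoid gradient in the backward pass), assumed here for illustration; the paper's exact estimator and all names below are not taken from its code.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hard_mask_forward(logits):
    """Forward pass: threshold the sigmoid of each logit to get a binary mask."""
    return [1.0 if sigmoid(l) > 0.5 else 0.0 for l in logits]


def split_latent(z, mask):
    """Partition a latent vector into transformation-sensitive and invariant parts."""
    sensitive = [zi * mi for zi, mi in zip(z, mask)]
    invariant = [zi * (1.0 - mi) for zi, mi in zip(z, mask)]
    return sensitive, invariant


def straight_through_grad(upstream, logits):
    """Backward pass (assumed variant): ignore the threshold and use the sigmoid's gradient."""
    return [g * sigmoid(l) * (1.0 - sigmoid(l)) for g, l in zip(upstream, logits)]


logits = [2.0, -1.0, 0.3]   # made-up learnable mask parameters
z = [0.5, -1.5, 2.0]        # made-up latent code

mask = hard_mask_forward(logits)
sensitive, invariant = split_latent(z, mask)
grads = straight_through_grad([1.0, 1.0, 1.0], logits)
```

The true derivative of the hard threshold is zero almost everywhere, so without the straight-through substitution the mask parameters would receive no learning signal; with it, every logit gets a nonzero gradient while the forward pass still uses a strictly binary partition.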

Emergent Outlier View Rejection in Visual Geometry Grounded Transformers

Authors: Jisang Han, Sunghwan Hong, Jaewoo Jung, Wooseok Jang, Honggyu An, Qianqian Wang, Seungryong Kim, Chen Feng

First: 2025-12-03T17:48:25+00:00 · Latest: 2025-12-03T17:48:25+00:00

Comments: Project page: https://cvlab-kaist.github.io/RobustVGGT/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Reliable 3D reconstruction from in-the-wild image collections is often hindered by "noisy" images: irrelevant inputs with little or no view overlap with others. While traditional Structure-from-Motion pipelines handle such cases through geometric verification and outlier rejection, feed-forward 3D reconstruction models lack these explicit mechanisms, leading to degraded performance under in-the-wild conditions. In this paper, we discover that an existing feed-forward reconstruction model such as VGGT, despite lacking explicit outlier-rejection mechanisms or noise-aware training, can inherently distinguish distractor images. Through an in-depth analysis under varying proportions of synthetic distractors, we identify a specific layer that naturally exhibits outlier-suppressing behavior. Further probing reveals that this layer encodes discriminative internal representations that enable an effective noise-filtering capability, which we simply leverage to perform outlier-view rejection in feed-forward 3D reconstruction without any additional fine-tuning or supervision. Extensive experiments on both controlled and in-the-wild datasets demonstrate that this implicit filtering mechanism is consistent and generalizes well across diverse scenarios.

中文标题/摘要

标题：视觉几何导向变换器中的自适应异常视图拒绝

从野外图像集合中可靠地重建3D结构往往受到“噪声”图像的阻碍——这些无关输入与其它图像几乎没有视图重叠。传统结构从运动管道通过几何验证和异常值拒绝来处理此类情况，而前馈3D重建模型缺乏这些显式机制，在野外条件下导致性能下降。在本文中，我们发现现有的前馈重建模型，例如VGGT，尽管缺乏显式的异常值拒绝机制或噪声感知训练，却能够自然地区分干扰图像。通过在不同比例的合成干扰图像下进行深入分析，我们确定了一层自然表现出异常值抑制行为。进一步探究表明，该层编码了具有区分性的内部表示，使其具备有效的噪声过滤能力，我们简单地利用这一机制在前馈3D重建中进行异常视图拒绝，无需任何额外的微调或监督。在受控和野外数据集上的广泛实验表明，这种隐式的过滤机制是一致且在多种场景下具有良好泛化能力的。

Summary / 总结

This paper addresses the challenge of 3D reconstruction from in-the-wild image collections, where irrelevant images (outliers) can degrade performance. The authors discovered that the existing feed-forward VGGT model can inherently distinguish these outliers without explicit mechanisms for outlier rejection. By analyzing the model's behavior under varying proportions of synthetic distractors, they identified a specific layer that naturally suppresses outliers. This layer encodes discriminative internal representations that enable effective noise filtering, allowing the model to perform outlier-view rejection without additional fine-tuning. Experiments on both controlled and in-the-wild datasets show that this implicit filtering mechanism is consistent and generalizes well across different scenarios.

论文解决了来自野外图像集合的3D重建问题，这些图像往往包含无关图像。研究发现，现有的前馈模型VGGT，虽然没有明确的离群值拒绝机制，但能够自然地区分和拒绝这些干扰图像。实验表明，VGGT中的特定层能够自然抑制离群值，从而实现有效的噪声过滤，并在各种场景中保持一致的性能。

Neural Radiance and Gaze Fields for Visual Attention Modeling in 3D Environments

Authors: Andrei Chubarau, Yinan Wang, James J. Clark

First: 2025-03-10T20:18:42+00:00 · Latest: 2025-12-03T17:46:03+00:00

Comments: 11 pages, 8 figures

Abs · PDF · Code1 · Code2

Abstract

We introduce Neural Radiance and Gaze Fields (NeRGs), a novel approach for representing visual attention in complex environments. Much like how Neural Radiance Fields (NeRFs) perform novel view synthesis, NeRGs reconstruct gaze patterns from arbitrary viewpoints, implicitly mapping visual attention to 3D surfaces. We achieve this by augmenting a standard NeRF with an additional network that models local egocentric gaze probability density, conditioned on scene geometry and observer position. The output of a NeRG is a rendered view of the scene alongside a pixel-wise salience map representing the conditional probability that a given observer fixates on visible surfaces. Unlike prior methods, our system is lightweight and enables visualization of gaze fields at interactive framerates. Moreover, NeRGs allow the observer perspective to be decoupled from the rendering camera and correctly account for gaze occlusion due to intervening geometry. We demonstrate the effectiveness of NeRGs using head pose from skeleton tracking as a proxy for gaze, employing our proposed gaze probes to aggregate noisy rays into robust probability density targets for supervision.

中文标题/摘要

标题：神经辐射场和凝视场在3D环境中的视觉注意力建模

我们提出了神经辐射场和凝视场（NeRGs），这是一种用于表示复杂环境中视觉注意力的新方法。就像神经辐射场（NeRFs）进行新颖视图合成一样，NeRGs 从任意视角重建凝视模式，隐式地将视觉注意力映射到3D表面。我们通过在标准NeRF中增加一个额外的网络来实现这一点，该网络模型局部自我的凝视概率密度，条件依赖于场景几何和观察者位置。NeRG的输出是场景的渲染视图以及一个像素级的显著性图，表示给定观察者注视可见表面的条件概率。与先前的方法不同，我们的系统轻量且能够在交互帧率下可视化凝视场。此外，NeRGs 允许观察者视角与渲染相机解耦，并正确地考虑到由于中间几何体引起的凝视遮挡。我们使用骨架跟踪提供的头部姿态作为凝视的代理，并使用我们提出的凝视探针将嘈杂的光线聚合为监督的稳健概率密度目标。

Summary / 总结

Neural Radiance and Gaze Fields (NeRGs) are introduced for modeling visual attention in 3D environments. By augmenting standard Neural Radiance Fields (NeRFs) with an additional network that models local egocentric gaze probability density, NeRGs can reconstruct gaze patterns from arbitrary viewpoints. The system outputs a rendered view of the scene along with a salience map indicating the probability of gaze fixation. NeRGs enable interactive visualization of gaze fields and can decouple the observer perspective from the rendering camera, accurately accounting for gaze occlusion. Experiments using head pose from skeleton tracking as a proxy for gaze show that NeRGs can aggregate noisy rays into robust probability density targets for supervision.

研究引入了Neural Radiance和Gaze Fields（NeRGs），扩展了Neural Radiance Fields（NeRFs）以在3D环境中建模视觉注意力。通过添加一个预测注视概率密度的网络，NeRGs可以渲染带有注意力热点图的场景，显示观察者可能聚焦的位置。该方法允许观察者的视角与渲染相机分离，并准确处理视线被遮挡的情况。实验表明，使用骨架追踪的头部姿态数据，NeRGs能够有效建模注视模式并生成用于训练的稳健概率密度目标。

Diagonalizing the Softmax: Hadamard Initialization for Tractable Cross-Entropy Dynamics

Authors: Connall Garrod, Jonathan P. Keating, Christos Thrampoulidis

First: 2025-12-03T17:45:09+00:00 · Latest: 2025-12-03T17:45:09+00:00

Abs · PDF · Code1 · Code2

Abstract

Cross-entropy (CE) training loss dominates deep learning practice, yet existing theory often relies on simplifications, either replacing it with squared loss or restricting to convex models, that miss essential behavior. CE and squared loss generate fundamentally different dynamics, and convex linear models cannot capture the complexities of non-convex optimization. We provide an in-depth characterization of multi-class CE optimization dynamics beyond the convex regime by analyzing a canonical two-layer linear neural network with standard-basis vectors as inputs: the simplest non-convex extension for which the implicit bias remained unknown. This model coincides with the unconstrained features model used to study neural collapse, making our work the first to prove that gradient flow on CE converges to the neural collapse geometry. We construct an explicit Lyapunov function that establishes global convergence, despite the presence of spurious critical points in the non-convex landscape. A key insight underlying our analysis is an inconspicuous finding: Hadamard Initialization diagonalizes the softmax operator, freezing the singular vectors of the weight matrices and reducing the dynamics entirely to their singular values. This technique opens a pathway for analyzing CE training dynamics well beyond the specific setting considered here.

中文标题/摘要

标题：对角化Softmax：可处理交叉熵动力学的哈达玛初始化

交叉熵（CE）训练损失主导了深度学习实践，但现有理论往往依赖于简化，要么将其替换为平方损失，要么限制为凸模型，这些简化忽略了关键行为。CE和平方损失生成了根本不同的动力学，而凸线性模型无法捕捉非凸优化的复杂性。我们通过分析一个标准基向量作为输入的两层线性神经网络，提供了多类CE优化动力学的深入表征，超越了凸域。该模型与用于研究神经塌缩的无约束特征模型一致，使我们的工作成为第一个证明梯度流在CE上的收敛性，收敛到神经塌缩几何结构。我们构建了一个显式的李亚普诺夫函数，尽管存在非凸景观中的虚假临界点，仍证明了全局收敛。我们分析的核心见解是不显眼的发现：哈达玛初始化对角化了Softmax操作符，冻结了权重矩阵的奇异向量，并将动力学完全简化为它们的奇异值。该技术为分析CE训练动力学打开了通往更广泛研究领域的途径，超越了我们在此考虑的具体设置。

Summary / 总结

The paper aims to provide a deeper understanding of cross-entropy (CE) training dynamics in non-convex settings by analyzing a two-layer linear neural network with standard basis vectors as inputs. The authors show that Hadamard Initialization diagonalizes the softmax operator, which freezes the singular vectors of the weight matrices and reduces the dynamics to their singular values. Key findings include an explicit Lyapunov function and a proof that gradient flow on CE converges globally to the neural collapse geometry, despite the presence of spurious critical points.

该论文旨在通过分析以标准基向量作为输入的两层线性神经网络，提供对非凸设置下交叉熵（CE）优化动力学的更深入理解。方法涉及使用Hadamard初始化，该初始化将softmax操作符对角化，将动力学简化为奇异值。主要发现包括证明梯度流在CE上的收敛性会达到神经塌缩几何，并建立了全局收敛的Lyapunov函数，尽管存在伪临界点。
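
For readers unfamiliar with the "neural collapse geometry" named above: in the neural collapse literature it is usually described as a simplex equiangular tight frame (ETF), in which the K class vectors have equal norm and every pair has cosine -1/(K-1). The sketch below builds the standard construction sqrt(K/(K-1)) * (I - (1/K) * ones) and computes its Gram matrix. It only illustrates the target geometry, it does not reproduce the paper's gradient-flow analysis, and the function names are hypothetical.

```python
import math


def simplex_etf(k):
    """Rows are the k vectors of a simplex ETF: sqrt(k/(k-1)) * (I - (1/k) * ones)."""
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


K = 4
vectors = simplex_etf(K)
# Gram matrix: ones on the diagonal, -1/(K-1) everywhere else.
gram = [[dot(vectors[i], vectors[j]) for j in range(K)] for i in range(K)]
```

The value -1/(K-1) is the most negative pairwise cosine that K unit vectors can share, which is why this configuration is read as maximally separated classes.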

SMP: Reusable Score-Matching Motion Priors for Physics-Based Character Control

Authors: Yuxuan Mu, Ziyu Zhang, Yi Shi, Minami Matsumoto, Kotaro Imamura, Guy Tevet, Chuan Guo, Michael Taylor, Chang Shu, Pengcheng Xi, Xue Bin Peng

First: 2025-12-02T18:54:12+00:00 · Latest: 2025-12-03T17:44:54+00:00

Comments: 14 pages, 9 figures

Abs · PDF · Code1 · Code2

Abstract

Data-driven motion priors that can guide agents toward producing naturalistic behaviors play a pivotal role in creating life-like virtual characters. Adversarial imitation learning has been a highly effective method for learning motion priors from reference motion data. However, adversarial priors, with few exceptions, need to be retrained for each new controller, thereby limiting their reusability and necessitating the retention of the reference motion data when training on downstream tasks. In this work, we present Score-Matching Motion Priors (SMP), which leverages pre-trained motion diffusion models and score distillation sampling (SDS) to create reusable task-agnostic motion priors. SMPs can be pre-trained on a motion dataset, independent of any control policy or task. Once trained, SMPs can be kept frozen and reused as general-purpose reward functions to train policies to produce naturalistic behaviors for downstream tasks. We show that a general motion prior trained on large-scale datasets can be repurposed into a variety of style-specific priors. Furthermore, SMP can compose different styles to synthesize new styles not present in the original dataset. Our method produces high-quality motion comparable to state-of-the-art adversarial imitation learning methods through reusable and modular motion priors. We demonstrate the effectiveness of SMP across a diverse suite of control tasks with physically simulated humanoid characters. Video demo available at https://youtu.be/ravlZJteS20

中文标题/摘要

标题：SMP：用于基于物理的角色控制的可重用得分匹配运动先验

能够引导智能体产生自然行为的数据驱动运动先验，在创建逼真虚拟角色中起着关键作用。对抗性模仿学习是一种从参考运动数据中学习运动先验的非常有效的方法。然而，除了少数例外情况，对抗性先验需要为每个新控制器重新训练，从而限制了它们的可重用性，并且在下游任务训练时需要保留参考运动数据。在本文中，我们提出了得分匹配运动先验（SMP），它利用预训练的运动扩散模型和得分蒸馏采样（SDS）来创建可重用的任务无关运动先验。SMP可以在运动数据集上进行预训练，且不依赖任何控制策略或任务。一旦训练完成，SMP可以保持冻结并作为通用奖励函数重用，以训练策略生成下游任务中的自然行为。我们展示了在大规模数据集上训练的通用运动先验可以转化为各种特定风格的先验。此外，SMP可以组合不同的风格以合成原始数据集中不存在的新风格。我们的方法通过可重用和模块化的运动先验生成高质量的运动，与最先进的对抗性模仿学习方法相当。我们展示了SMP在多种物理模拟的人形角色控制任务中的有效性。视频演示可在 https://youtu.be/ravlZJteS20 观看

Summary / 总结

This paper introduces Score-Matching Motion Priors (SMP), a method for creating reusable motion priors that can guide agents to produce natural behaviors without retraining for each new controller. SMP leverages pre-trained motion diffusion models and score distillation sampling to generate task-agnostic priors that can be used as reward functions for training policies on various downstream tasks. The method demonstrates the ability to repurpose a general motion prior into style-specific priors and to synthesize new styles not present in the original dataset, achieving high-quality motion comparable to state-of-the-art methods.

本文介绍了Score-Matching Motion Priors (SMP) 方法，该方法可以创建可重用的运动先验，用于引导代理产生自然行为，而无需为每个新控制器重新训练。SMP 利用预训练的运动扩散模型和得分蒸馏采样生成任务无关的先验，这些先验可以作为训练各种下游任务策略的奖励函数。该方法展示了通用运动先验可以适应不同的风格，甚至可以生成原始数据集中不存在的新风格，产生与最先进的方法相当的高质量运动。
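
The idea of reusing a frozen diffusion model as a reward can be illustrated with a minimal sketch. The abstract does not give SMP's exact reward, so everything below is an assumption: the function names, the reward form (negative noise-prediction error, in the spirit of score distillation sampling), and the toy denoiser, whose prior is concentrated on a single reference motion.

```python
import numpy as np

def make_point_prior_denoiser(reference):
    """Toy stand-in for a pre-trained motion diffusion model (hypothetical).

    For a prior concentrated on one reference motion, the ideal noise
    prediction is the noise that maps the reference to the noisy input.
    """
    def denoiser(noisy, alpha_bar):
        return (noisy - np.sqrt(alpha_bar) * reference) / np.sqrt(1.0 - alpha_bar)
    return denoiser

def sds_style_reward(motion, denoiser, alpha_bar, rng):
    """Assumed reward: negative noise-prediction error of a frozen denoiser."""
    eps = rng.standard_normal(motion.shape)
    # Forward-diffuse the policy's motion to noise level alpha_bar.
    noisy = np.sqrt(alpha_bar) * motion + np.sqrt(1.0 - alpha_bar) * eps
    eps_hat = denoiser(noisy, alpha_bar)
    # Motions the prior considers likely make the injected noise easy to recover.
    return -float(np.mean((eps_hat - eps) ** 2))

rng = np.random.default_rng(0)
reference = np.linspace(-1.0, 1.0, 8)          # stand-in for a motion clip
prior = make_point_prior_denoiser(reference)
reward_on = sds_style_reward(reference, prior, 0.5, rng)
reward_off = sds_style_reward(reference + 1.0, prior, 0.5, rng)
```

In this toy setting the motion that matches the prior receives a reward near zero, while the shifted motion is penalized. The denoiser is never updated, which mirrors the abstract's point that the prior stays frozen during policy training.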

Physics-Embedded Gaussian Process for Traffic State Estimation

Authors: Yanlin Chen, Kehua Chen, Yinhai Wang

First: 2025-12-03T17:43:40+00:00 · Latest: 2025-12-03T17:43:40+00:00

Abstract

Traffic state estimation (TSE) becomes challenging when probe-vehicle penetration is low and observations are spatially sparse. Pure data-driven methods lack physical explanations and have poor generalization when observed data is sparse. In contrast, physical models have difficulty integrating uncertainties and capturing the real complexity of traffic. To bridge this gap, recent studies have explored combining them by embedding physical structure into Gaussian processes. These approaches typically introduce the governing equations as soft constraints through pseudo-observations, enabling the integration of model structure within a variational framework. However, these methods rely heavily on penalty tuning and lack principled uncertainty calibration, which makes them sensitive to model mis-specification. In this work, we address these limitations by presenting a novel Physics-Embedded Gaussian Process (PEGP), designed to integrate domain knowledge with data-driven methods in traffic state estimation. Specifically, we design two multi-output kernels informed by classic traffic flow models, constructed via the explicit application of the linearized differential operator. Experiments on HighD and NGSIM show consistent improvements over non-physics baselines. PEGP-ARZ proves more reliable under sparse observation, while PEGP-LWR achieves lower errors with denser observation. An ablation study further reveals that PEGP-ARZ residuals align closely with physics and yield calibrated, interpretable uncertainty, whereas PEGP-LWR residuals are more orthogonal and produce nearly constant variance fields. The PEGP framework combines physical priors with uncertainty quantification and can provide reliable support for TSE.

中文标题/摘要

标题：嵌入物理的高斯过程用于交通状态估计

当探针车辆渗透率低且观测数据空间稀疏时，交通状态估计（TSE）变得具有挑战性。纯数据驱动的方法缺乏物理解释，在观测数据稀疏时泛化能力较差。相比之下，物理模型难以整合不确定性并捕捉交通的真实复杂性。为弥合这一差距，最近的研究探索了将它们结合起来的方法，即将物理结构嵌入高斯过程。这些方法通常通过伪观测引入控制方程作为软约束，使模型结构能够在变分框架内进行整合。然而，这些方法严重依赖于惩罚调优，缺乏原理性的不确定性校准，这使得它们对模型误设非常敏感。在本工作中，我们通过提出一种新颖的嵌入物理的高斯过程（PEGP），解决了这些限制，旨在将领域知识与数据驱动方法结合用于交通状态估计。具体而言，我们设计了两种多输出核，由经典的交通流模型启发，通过线性化微分算子的显式应用构建。在HighD和NGSIM上的实验表明，该方法相比非物理基线有一致的提升。PEGP-ARZ在稀疏观测下更可靠，而PEGP-LWR在密集观测下误差更低。消融研究进一步表明，PEGP-ARZ的残差与物理高度一致，提供校准且可解释的不确定性，而PEGP-LWR的残差更正交，产生几乎恒定的方差场。此PEGP框架结合了物理先验和不确定性量化，可以为TSE提供可靠的支撑。

Summary / 总结

This paper addresses the challenge of traffic state estimation (TSE) in scenarios with low probe-vehicle penetration and sparse observations. It proposes a Physics-Embedded Gaussian Process (PEGP) to integrate domain knowledge with data-driven methods. Two multi-output kernels are designed based on classic traffic flow models, using linearized differential operators. Experiments on HighD and NGSIM demonstrate that PEGP-ARZ is more reliable under sparse observations, while PEGP-LWR achieves lower errors with denser observations. An ablation study shows that PEGP-ARZ provides calibrated and interpretable uncertainty, whereas PEGP-LWR produces nearly constant variance fields.

该论文通过提出一种物理嵌入高斯过程（PEGP），解决了低探针车辆渗透率和稀疏观测下的交通状态估计（TSE）问题。该方法将交通流模型的知识嵌入到高斯过程框架中，使用两个由线性化微分算子导出的多输出核。实验结果表明，PEGP-ARZ 在稀疏观测下表现更好，而 PEGP-LWR 在密集观测下误差更低。消融研究进一步表明，PEGP-ARZ 提供了校准且可解释的不确定性，而 PEGP-LWR 产生的方差场几乎不变。
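
Building a multi-output kernel by applying a linear differential operator to a base kernel can be sketched in a few lines. This is not the PEGP-LWR or PEGP-ARZ kernel: as a stand-in for the linearized traffic-flow operators, the sketch assumes the simplest linear operator, a first derivative in one spatial variable, applied to an RBF kernel. The names `rbf` and `joint_kernel` are illustrative.

```python
import numpy as np

def rbf(x1, x2, ell):
    """RBF kernel matrix and the pairwise differences d = x - x'."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * d**2 / ell**2), d

def joint_kernel(x, ell=1.0):
    """Covariance of [f(x), g(x)] with f ~ GP(0, RBF) and g = df/dx.

    Because differentiation is linear, g is also a GP, and every block
    follows by differentiating the base kernel k(x, x').
    """
    k, d = rbf(x, x, ell)
    k_fg = k * d / ell**2                       # cov(f(x), g(x')) = dk/dx'
    k_gf = -k * d / ell**2                      # cov(g(x), f(x')) = dk/dx
    k_gg = k * (1.0 / ell**2 - d**2 / ell**4)   # cov(g(x), g(x')) = d^2 k / dx dx'
    return np.block([[k, k_fg], [k_gf, k_gg]])

x = np.linspace(0.0, 3.0, 6)
K = joint_kernel(x, ell=0.8)
```

The resulting matrix is a valid joint covariance (symmetric and positive semi-definite), so the operator relationship between the outputs holds by construction rather than through a penalty term, which is the contrast with soft-constraint approaches that the abstract draws.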

Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding

Authors: Jialuo Li, Bin Li, Jiahao Li, Yan Lu

First: 2025-12-03T17:36:06+00:00 · Latest: 2025-12-03T17:36:06+00:00

Abstract

The application of Large Multimodal Models (LMMs) to long-form video understanding is constrained by limited context lengths and the computationally prohibitive cost of processing dense video tokens. Consequently, recent research has focused on query-aware frame selection, methods that often incur significant computational overhead. This paper challenges the assumption that such complex search mechanisms are universally necessary. We first identify and validate a query typology distinguishing between global queries and localized queries. We demonstrate that while uniform sampling is both effective and efficient for global queries, localized queries indeed necessitate query-aware selection for optimal performance. Building on this insight, we propose DIG, a training-free frame selection framework that adapts its strategy based on the query type. Specifically, DIG employs efficient uniform sampling for global queries while activating a specialized pipeline to extract query-relevant frames for localized queries. Experiments on three long-form video understanding benchmarks demonstrate that DIG consistently outperforms existing baselines and robustly improves LMM performance, even when scaling the input frame count to 256.

中文标题/摘要

标题：先划分，再定位：根据查询类型自适应帧选择的长视频理解

大型多模态模型（LMMs）在长视频理解中的应用受到上下文长度有限和处理密集视频 token 的计算成本高昂的限制。因此，最近的研究集中在查询感知的帧选择方法上，这些方法通常会带来显著的计算开销。本文挑战了复杂搜索机制在所有情况下的必要性。我们首先识别并验证了一种查询类型学，区分全局查询和局部查询。我们证明，均匀采样对于全局查询既有效又高效，而对于局部查询，确实需要查询感知的选择以获得最佳性能。基于这一见解，我们提出了DIG，这是一种无需训练的帧选择框架，其策略根据查询类型进行调整。具体而言，DIG 对全局查询使用高效的均匀采样，而对局部查询则激活专门的流程以提取与查询相关的帧。在三个长视频理解基准上的实验表明，DIG 一致地优于现有基线，并且即使将输入帧数扩展到 256 时，也能稳健地提高 LMM 的性能。
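
The dispatch described in the abstract (uniform sampling for global queries, query-aware selection for localized ones) can be sketched as below. The abstract does not describe how DIG classifies queries or scores frames, so the `query_type` label and the per-frame `relevance` scores are assumed inputs, and top-k selection is a simple placeholder for the paper's specialized pipeline.

```python
def uniform_indices(num_frames, k):
    """Pick k roughly evenly spaced frame indices (bin centres)."""
    if k <= 0 or num_frames <= 0:
        return []
    k = min(k, num_frames)
    step = num_frames / k
    return [int(step * i + step / 2) for i in range(k)]

def select_frames(num_frames, k, query_type, relevance=None):
    """Route frame selection by query type (hypothetical interface).

    query_type: "global" or "localized", assumed to come from an upstream classifier.
    relevance:  optional per-frame query-relevance scores, also assumed given.
    """
    if query_type == "global" or relevance is None:
        return uniform_indices(num_frames, k)
    # Localized query: keep the k most relevant frames, restored to temporal order.
    top = sorted(range(num_frames), key=lambda i: relevance[i], reverse=True)[:k]
    return sorted(top)

global_pick = select_frames(100, 4, "global")
local_pick = select_frames(6, 2, "localized", [0.1, 0.9, 0.2, 0.8, 0.0, 0.3])
```

The global branch never touches the relevance scores, so the cost of computing them is only paid for localized queries.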

Highly Efficient Test-Time Scaling for T2I Diffusion Models with Text Embedding Perturbation

Authors: Hang Xu, Linjiang Huang, Feng Zhao

First: 2025-12-03T17:27:53+00:00 · Latest: 2025-12-03T17:27:53+00:00

Abstract

Test-time scaling (TTS) aims to achieve better results by increasing random sampling and evaluating samples based on rules and metrics. However, in text-to-image (T2I) diffusion models, most related works focus on search strategies and reward models, yet the impact of the stochastic characteristic of noise in T2I diffusion models on the method's performance remains unexplored. In this work, we analyze the effects of randomness in T2I diffusion models and explore a new format of randomness for TTS: text embedding perturbation, which couples with existing randomness like SDE-injected noise to enhance generative diversity and quality. We start with a frequency-domain analysis of these formats of randomness and their impact on generation, and find that these two forms of randomness exhibit complementary behavior in the frequency domain: spatial noise favors low-frequency components (early steps), while text embedding perturbation enhances high-frequency details (later steps), thereby compensating for the potential limitations of spatial noise randomness in high-frequency manipulation. Concurrently, text embedding demonstrates varying levels of tolerance to perturbation across different dimensions of the generation process. Specifically, our method consists of two key designs: (1) Introducing step-based text embedding perturbation, combining frequency-guided noise schedules with spatial noise perturbation. (2) Adapting the perturbation intensity selectively based on their frequency-specific contributions to generation and tolerance to perturbation. Our approach can be seamlessly integrated into existing TTS methods and demonstrates significant improvements on multiple benchmarks with almost no additional computation. Code is available at https://github.com/xuhang07/TEP-Diffusion.

中文标题/摘要

标题：具有文本嵌入扰动的T2I扩散模型测试时高效缩放

测试时缩放（TTS）旨在通过增加随机采样并基于规则和指标评估样本来获得更好的结果。然而，在文本到图像（T2I）扩散模型中，大多数相关工作集中在搜索策略和奖励模型上，而噪声的随机特性对方法性能的影响尚未被探索。在本文中，我们分析了T2I扩散模型中的随机性影响，并探索了一种新的TTS随机性格式：文本嵌入扰动，它与现有的随机性（如SDE注入噪声）相结合，以增强生成的多样性和质量。我们从频域分析了这些随机性格式及其对生成的影响，并发现这两种随机性在频域中表现出互补行为：空间噪声倾向于低频成分（早期步骤），而文本嵌入扰动增强了高频细节（后期步骤），从而弥补了空间噪声随机性在高频操作中的潜在局限性。同时，文本嵌入在生成过程的不同维度上对扰动的容忍度各不相同。具体而言，我们的方法包括两个关键设计：（1）引入基于步骤的文本嵌入扰动，结合频率导向的噪声调度与空间噪声扰动。（2）根据其对生成的频率特异性贡献和对扰动的容忍度选择性地调整扰动强度。我们的方法可以无缝集成到现有的TTS方法中，并在多个基准测试上显示出显著的改进，几乎不需要额外的计算。代码可在https://github.com/xuhang07/TEP-Diffusion 获取。

Summary / 总结

This work addresses the limitations of test-time scaling (TTS) in text-to-image (T2I) diffusion models by introducing text embedding perturbation, which complements spatial noise randomness. The authors analyze the frequency-domain effects of these randomizations and propose a step-based text embedding perturbation method that enhances generative diversity and quality. The approach integrates seamlessly into existing TTS methods and shows significant improvements on multiple benchmarks with minimal additional computational cost.

该研究分析了文本到图像扩散模型中随机性的影响，并引入了文本嵌入扰动作为新的TTS（测试时缩放）形式的随机性。作者在频域中分析了空间噪声和文本嵌入扰动的互补行为，并提出了一种结合频率引导噪声调度的基于步骤的文本嵌入扰动方法。该方法增强了生成的多样性和质量，并可以无缝集成到现有的TTS方法中，在多个基准测试中显示出显著的改进，几乎不需要额外的计算。
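
A step-dependent perturbation of the text embedding can be sketched as follows. The abstract states only that text embedding perturbation matters more for high-frequency detail in later steps; the linear schedule, the Gaussian perturbation, and the `max_scale` value below are assumptions chosen for illustration, not the paper's frequency-guided schedule.

```python
import numpy as np

def perturbation_scale(step, num_steps, max_scale=0.1):
    """Assumed linear schedule: zero at the first step, max_scale at the last."""
    return max_scale * step / max(num_steps - 1, 1)

def perturb_text_embedding(embedding, step, num_steps, rng, max_scale=0.1):
    """Add Gaussian noise to the text embedding with a step-dependent scale."""
    sigma = perturbation_scale(step, num_steps, max_scale)
    return embedding + sigma * rng.standard_normal(embedding.shape)

rng = np.random.default_rng(0)
embedding = np.ones((4, 8))                     # stand-in for a prompt embedding
early = perturb_text_embedding(embedding, 0, 10, rng)
late = perturb_text_embedding(embedding, 9, 10, rng)
```

Because the perturbation is a single addition per denoising step, it adds almost no computation on top of the sampler, consistent with the abstract's claim about overhead.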

Artificial Microsaccade Compensation: Stable Vision for an Ornithopter

Authors: Levi Burner, Guido de Croon, Yiannis Aloimonos

First: 2025-12-03T17:24:02+00:00 · Latest: 2025-12-03T17:24:02+00:00

Comments: 29 pages, 5 figures, 2 tables, under review

Abstract

Animals with foveated vision, including humans, experience microsaccades, small, rapid eye movements that they are not aware of. Inspired by this phenomenon, we develop a method for "Artificial Microsaccade Compensation". It can stabilize video captured by a tailless ornithopter that has resisted attempts to use camera-based sensing because it shakes at 12-20 Hz. Our approach minimizes changes in image intensity by optimizing over 3D rotation represented in SO(3). This results in a stabilized video, computed in real time, suitable for human viewing, and free from distortion. When adapted to hold a fixed viewing orientation, up to occasional saccades, it can dramatically reduce inter-frame motion while also benefiting from an efficient recursive update. When compared to Adobe Premiere Pro's warp stabilizer, which is widely regarded as the best commercial video stabilization software available, our method achieves higher quality results while also running in real time.

中文标题/摘要

标题：人工微眼跳补偿：扑翼飞行器的稳定视觉

具有中央凹视觉的动物，包括人类，会经历微眼跳，这是一种它们自身察觉不到的小而快速的眼球运动。受此现象启发，我们开发了一种“人工微眼跳补偿”方法。它可以稳定由无尾扑翼飞行器拍摄的视频，这种飞行器因以 12-20 Hz 的频率抖动而一直难以使用基于相机的传感技术。我们的方法通过在 SO(3) 中优化 3D 旋转来最小化图像强度的变化。这产生了一段实时计算的稳定视频，适合人类观看且无失真。当调整为保持固定视角时（偶尔的眼跳除外），它能显著减少帧间运动，同时还能从高效的递归更新中受益。与被广泛认为是最佳商业视频稳定软件的 Adobe Premiere Pro 的变形稳定器相比，我们的方法在实时运行的同时还实现了更高质量的结果。

Summary / 总结

The research aims to stabilize video captured by a tailless ornithopter, which shakes at 12-20 Hz and has been challenging to stabilize using camera-based methods. The method developed minimizes changes in image intensity by optimizing 3D rotations, resulting in real-time stabilized video suitable for human viewing without distortion. Compared to Adobe Premiere Pro's warp stabilizer, the method provides higher quality results and runs in real time.

研究旨在稳定无尾扑翼飞行器拍摄的视频，该飞行器以 12-20 Hz 的频率抖动，难以使用基于相机的传感方法。通过优化3D旋转来最小化图像强度的变化，“人工微眼跳补偿”方法实现了适合人类观看且无失真的实时稳定视频。与Adobe Premiere Pro的变形稳定器相比，该方法不仅效果更佳，还能实时运行。
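
Optimizing over rotations in SO(3) typically relies on two standard building blocks, sketched below: the exponential map from a rotation vector to a rotation matrix (Rodrigues' formula), and the image warp induced by a pure camera rotation. The paper's photometric objective and recursive update are not reproduced here; the intrinsics matrix `K_cam` and the warp convention are assumptions for illustration.

```python
import numpy as np

def so3_exp(w):
    """Rodrigues' formula: rotation vector (axis * angle) to a rotation matrix."""
    theta = np.linalg.norm(w)
    W = np.array([[0.0, -w[2], w[1]],
                  [w[2], 0.0, -w[0]],
                  [-w[1], w[0], 0.0]])          # skew-symmetric matrix of w
    if theta < 1e-12:
        return np.eye(3) + W                    # first-order approximation near zero
    return (np.eye(3)
            + np.sin(theta) / theta * W
            + (1.0 - np.cos(theta)) / theta**2 * (W @ W))

def rotation_homography(K_cam, R):
    """Homography induced by a pure camera rotation (one common convention)."""
    return K_cam @ R @ np.linalg.inv(K_cam)

R = so3_exp(np.array([0.0, 0.0, np.pi / 2]))    # 90 degrees about the optical axis
K_cam = np.array([[500.0, 0.0, 320.0],
                  [0.0, 500.0, 240.0],
                  [0.0, 0.0, 1.0]])             # hypothetical camera intrinsics
H = rotation_homography(K_cam, R)
```

A stabilizer in this style would search over the three components of `w` for the rotation whose warp minimizes the intensity difference between frames. Parameterizing by `w` keeps every candidate a valid rotation, so the warp stays free of the shear and scale distortion that a general homography fit could introduce.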

Training-Free Policy Violation Detection via Activation-Space Whitening in LLMs

Authors: Oren Rachmil, Roy Betser, Itay Gershon, Omer Hofman, Nitay Yakoby, Yuval Meron, Idan Yankelev, Asaf Shabtai, Yuval Elovici, Roman Vainshtein

Venue: AAAI 2026

First: 2025-12-03T17:23:39+00:00 · Latest: 2025-12-03T17:23:39+00:00

Comments: Accepted to the AAAI 2026 Deployable AI (DAI) Workshop

Abstract

Aligning proprietary large language models (LLMs) with internal organizational policies has become an urgent priority as organizations increasingly deploy LLMs in sensitive domains such as legal support, finance, and medical services. Beyond generic safety filters, enterprises require reliable mechanisms to detect policy violations within their regulatory and operational frameworks, where breaches can trigger legal and reputational risks. Existing content moderation frameworks, such as guardrails, remain largely confined to the safety domain and lack the robustness to capture nuanced organizational policies. LLM-as-a-judge and fine-tuning approaches, though flexible, introduce significant latency and lack interpretability. To address these limitations, we propose a training-free and efficient method that treats policy violation detection as an out-of-distribution (OOD) detection problem. Inspired by whitening techniques, we apply a linear transformation to decorrelate the model's hidden activations and standardize them to zero mean and unit variance, yielding a near-identity covariance matrix. In this transformed space, we use the Euclidean norm as a compliance score to detect policy violations. The method requires only the policy text and a small number of illustrative samples, which makes it lightweight and easily deployable. On a challenging policy benchmark, our approach achieves state-of-the-art results, surpassing both existing guardrails and fine-tuned reasoning models. This work provides organizations with a practical and statistically grounded framework for policy-aware oversight of LLMs, advancing the broader goal of deployable AI governance. Code is available at: https://tinyurl.com/policy-violation-detection

中文标题/摘要

标题：通过LLM激活空间白化实现免训练的策略违规检测

随着组织越来越多地在法律支持、金融和医疗服务等敏感领域部署专有大型语言模型（LLMs），使LLMs与内部组织政策保持一致已成为当务之急。除了通用的安全过滤器外，企业还需要可靠的机制来在其监管和运营框架内检测策略违规，因为违规可能会引发法律和声誉风险。现有的内容审核框架，如护栏，主要局限于安全领域，缺乏捕捉细微组织政策的稳健性。LLM-as-a-judge（以LLM为评判者）和微调方法虽然灵活，但会引入显著的延迟并缺乏可解释性。为了解决这些限制，我们提出了一种免训练且高效的方法，将策略违规检测视为分布外（OOD）检测问题。受白化技术的启发，我们应用线性变换对模型的隐藏激活去相关，并将其标准化为零均值和单位方差，得到接近单位阵的协方差矩阵。在该变换空间中，我们使用欧几里得范数作为合规评分来检测策略违规。该方法仅需策略文本和少量示例样本，使其轻量级且易于部署。在一项具有挑战性的策略基准测试中，我们的方法达到了最先进的效果，超越了现有的护栏和微调推理模型。这项工作为企业提供了一种实用且有统计依据的框架，用于LLM的策略感知监督，推动了可部署AI治理的更广泛目标。代码可在：https://tinyurl.com/policy-violation-detection 获取

Summary / 总结

This paper addresses the challenge of aligning large language models (LLMs) with organizational policies by proposing a training-free method for detecting policy violations. The method uses activation-space whitening to decorrelate model activations and standardize them, transforming the space to detect policy violations using the Euclidean norm. The approach requires minimal resources and outperforms existing guardrails and fine-tuned models on a policy benchmark, providing a practical solution for AI governance.

论文提出了一种无需训练的方法，通过将政策违规检测视为异常分布检测问题来检测大型语言模型中的政策违规。该方法通过对模型隐藏激活进行线性变换来标准化它们，然后使用欧几里得范数作为合规得分。该方法只需要政策文本和少量示例样本，使其轻量级且易于部署。它在具有挑战性的政策基准测试中优于现有的护栏和微调推理模型。
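
The whitening-then-norm scoring described in the abstract can be sketched with plain NumPy. Several details are assumptions: reference activations are taken to be a matrix with one row per in-policy sample, ZCA whitening via an eigendecomposition is one of several valid whitening transforms, and a larger norm is read as "further from the reference distribution". Which layer the activations come from and how a threshold is chosen are left open.

```python
import numpy as np

def fit_whitener(acts, eps=1e-6):
    """Estimate the mean and a ZCA whitening matrix from reference activations."""
    mean = acts.mean(axis=0)
    cov = np.cov(acts, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    # Rescale each principal direction to unit variance; eps guards tiny eigenvalues.
    W = evecs @ np.diag(1.0 / np.sqrt(evals + eps)) @ evecs.T
    return mean, W

def compliance_score(x, mean, W):
    """Euclidean norm of the whitened activation (larger = more out-of-distribution)."""
    return float(np.linalg.norm((x - mean) @ W))

rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0, 0.0],
                   [0.0, 1.0, 0.3, 0.0],
                   [0.0, 0.0, 1.5, 0.2],
                   [0.0, 0.0, 0.0, 1.0]])
acts = rng.standard_normal((500, 4)) @ mixing + 3.0   # correlated toy activations
mean, W = fit_whitener(acts)
whitened = (acts - mean) @ W
far_score = compliance_score(mean + 100.0, mean, W)
```

After the transform the reference activations have approximately identity covariance, so the Euclidean norm in the whitened space equals the Mahalanobis distance in the original space. That equivalence is what gives the score its statistical grounding.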

DIQ-H: Evaluating Hallucination Persistence in VLMs Under Temporal Visual Degradation

Authors: Zexin Lin, Hawen Wan, Yebin Zhong, Xiaoqiang

First: 2025-12-03T17:22:29+00:00 · Latest: 2025-12-03T17:22:29+00:00

Abstract

Vision-Language Models (VLMs) deployed in safety-critical applications such as autonomous driving must handle continuous visual streams under imperfect conditions. However, existing benchmarks focus on static, high-quality images and ignore temporal degradation and error propagation, which are critical failure modes where transient visual corruption induces hallucinations that persist across subsequent frames. We introduce DIQ-H, the first benchmark for evaluating VLM robustness under dynamic visual degradation in temporal sequences. DIQ-H applies physics-based corruptions including motion blur, sensor noise, and compression artifacts, and measures hallucination persistence, error recovery, and temporal consistency through multi-turn question-answering tasks. To enable scalable annotation, we propose Uncertainty-Guided Iterative Refinement (UIR), which generates reliable pseudo-ground-truth using lightweight VLMs with uncertainty filtering, achieving a 15.3 percent accuracy improvement. Experiments on 16 state-of-the-art VLMs reveal substantial robustness gaps: even advanced models such as GPT-4o achieve only a 78.5 percent recovery rate, while open-source models struggle with temporal consistency at less than 60 percent. DIQ-H provides a comprehensive platform for evaluating VLM reliability in real-world deployments.

中文标题/摘要

标题：DIQ-H：评估在时间视觉退化下VLM的幻觉持久性

部署在自动驾驶等安全关键应用中的视觉-语言模型（VLMs）必须在不完美的视觉流下处理连续的视觉信息。然而，现有的基准测试主要关注静态、高质量的图像，而忽略了时间退化和错误传播，这是关键故障模式之一，即瞬态视觉损坏会引发持续到后续帧的幻觉。我们引入了DIQ-H，这是第一个用于评估VLM在时间序列中动态视觉退化下的鲁棒性的基准测试。DIQ-H 应用了基于物理的损坏，包括运动模糊、传感器噪声和压缩伪影，并通过多轮问答任务来衡量幻觉持久性、错误恢复和时间一致性。为了实现可扩展的注释，我们提出了基于不确定性迭代细化（UIR）的方法，该方法使用具有不确定性过滤的轻量级VLM生成可靠的伪地面真值，实现了15.3%的准确率提升。在16个最先进的VLM上的实验揭示了显著的鲁棒性差距：即使是先进的模型如GPT-4o也只能实现78.5%的恢复率，而开源模型在时间一致性方面低于60%。DIQ-H 提供了一个全面的平台，用于评估VLM在实际部署中的可靠性。

Summary / 总结

The research aims to evaluate the robustness of Vision-Language Models (VLMs) under dynamic visual degradation, focusing on hallucination persistence and error recovery. DIQ-H, the introduced benchmark, applies physics-based corruptions and measures these aspects through multi-turn question-answering tasks. Experiments show significant robustness gaps among state-of-the-art models, with even advanced models like GPT-4o achieving only 78.5 percent recovery rate, and open-source models struggling with temporal consistency below 60 percent. To enable scalable annotation, the Uncertainty-Guided Iterative Refinement (UIR) method is proposed, improving accuracy by 15.3 percent.

研究旨在评估视觉-语言模型（VLMs）在动态视觉降级条件下的鲁棒性，这对于自动驾驶等安全关键应用至关重要。研究引入了DIQ-H基准，该基准对时间序列应用物理基础的破坏，并测量幻觉持续性、错误恢复和时间一致性。实验表明，即使是先进的模型如GPT-4o也只能实现78.5％的恢复率，开源模型在时间一致性方面表现不佳。为了实现注释的可扩展性，研究提出了不确定性引导迭代改进（UIR），该方法通过轻量级VLM和不确定性过滤提高了15.3％的准确性。

DirectDrag: High-Fidelity, Mask-Free, Prompt-Free Drag-based Image Editing via Readout-Guided Feature Alignment

Authors: Sheng-Hao Liao, Shang-Fu Chen, Tai-Ming Huang, Wen-Huang Cheng, Kai-Lung Hua

First: 2025-12-03T17:12:00+00:00 · Latest: 2025-12-03T17:12:00+00:00

Abstract

Drag-based image editing using generative models provides intuitive control over image structures. However, existing methods rely heavily on manually provided masks and textual prompts to preserve semantic fidelity and motion precision. Removing these constraints creates a fundamental trade-off: visual artifacts without masks and poor spatial control without prompts. To address these limitations, we propose DirectDrag, a novel mask- and prompt-free editing framework. DirectDrag enables precise and efficient manipulation with minimal user input while maintaining high image fidelity and accurate point alignment. DirectDrag introduces two key innovations. First, we design an Auto Soft Mask Generation module that intelligently infers editable regions from point displacement, automatically localizing deformation along movement paths while preserving contextual integrity through the generative model's inherent capacity. Second, we develop a Readout-Guided Feature Alignment mechanism that leverages intermediate diffusion activations to maintain structural consistency during point-based edits, substantially improving visual fidelity. Despite operating without manual mask or prompt, DirectDrag achieves superior image quality compared to existing methods while maintaining competitive drag accuracy. Extensive experiments on DragBench and real-world scenarios demonstrate the effectiveness and practicality of DirectDrag for high-quality, interactive image manipulation. Project Page: https://frakw.github.io/DirectDrag/. Code is available at: https://github.com/frakw/DirectDrag.

中文标题/摘要

标题：DirectDrag：基于读出引导特征对齐的高保真、无遮罩、无提示拖拽式图像编辑

使用生成模型的拖拽式图像编辑提供了对图像结构的直观控制。然而，现有方法严重依赖手动提供的遮罩和文本提示来保持语义保真度和运动精度。去除这些限制会产生根本性的权衡：没有遮罩的视觉伪影和没有提示的空间控制不佳。为了解决这些限制，我们提出了DirectDrag，一种新颖的无遮罩、无提示编辑框架。DirectDrag能够在最少用户输入的情况下实现精确和高效的操控，同时保持高图像保真度和准确的点对齐。DirectDrag引入了两项关键创新。首先，我们设计了一个自动软遮罩生成模块，该模块能够从点位移中智能推断可编辑区域，自动沿运动路径定位变形，同时通过生成模型的固有能力保持上下文完整性。其次，我们开发了一种读出引导特征对齐机制，利用中间扩散激活来保持基于点的编辑过程中的结构一致性，显著提高视觉保真度。尽管没有手动遮罩或提示，DirectDrag在图像质量上优于现有方法，同时保持了竞争性的拖拽精度。在DragBench和实际场景上的大量实验表明，DirectDrag在高质量、交互式图像操作方面具有有效性和实用性。项目页面：https://frakw.github.io/DirectDrag/。代码可在：https://github.com/frakw/DirectDrag/获取。

Summary / 总结

DirectDrag is a mask- and prompt-free image editing framework that uses drag-based manipulation to achieve high-fidelity image editing. It introduces an Auto Soft Mask Generation module to automatically infer editable regions from point displacement and a Readout-Guided Feature Alignment mechanism to maintain structural consistency. Experiments show that DirectDrag achieves superior image quality compared to existing methods while maintaining competitive drag accuracy, with minimal user input.

DirectDrag 是一个无需掩码和提示的图像编辑框架，通过拖拽操作实现高保真的图像编辑。它引入了自动软掩码生成模块，从点位移中推断可编辑区域，并开发了读出引导特征对齐机制以保持结构一致性。实验表明，DirectDrag 在图像质量上优于现有方法，同时保持了有竞争力的拖拽精度，且只需极少的用户输入。
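
One simple way to picture a soft mask inferred from point displacement is a smooth falloff around the movement path. The abstract does not specify how the Auto Soft Mask Generation module computes its mask, so the sketch below is purely illustrative: it assumes a single handle/target pair, a straight-line path, and a Gaussian falloff with a hand-picked `sigma`.

```python
import numpy as np

def soft_mask(height, width, handle, target, sigma=5.0):
    """Gaussian falloff around the segment from handle to target; points are (x, y)."""
    ys, xs = np.mgrid[0:height, 0:width].astype(float)
    pixels = np.stack([xs, ys], axis=-1)                 # (H, W, 2) pixel coordinates
    a = np.asarray(handle, dtype=float)
    b = np.asarray(target, dtype=float)
    ab = b - a
    denom = max(float(ab @ ab), 1e-12)                   # guards a zero-length drag
    # Project each pixel onto the segment and clamp to its endpoints.
    t = np.clip(((pixels - a) @ ab) / denom, 0.0, 1.0)
    closest = a + t[..., None] * ab
    dist2 = np.sum((pixels - closest) ** 2, axis=-1)
    return np.exp(-0.5 * dist2 / sigma**2)

mask = soft_mask(32, 32, handle=(5, 16), target=(25, 16))
```

Values near 1 mark pixels on the drag path and values near 0 mark regions to preserve. A soft mask of this kind avoids the hard boundary artifacts of a binary mask without asking the user to draw anything.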

BlurDM: A Blur Diffusion Model for Image Deblurring

Authors: Jin-Ting He, Fu-Jen Tsai, Yan-Tsung Peng, Min-Hung Chen, Chia-Wen Lin, Yen-Yu Lin

Venue: NeurIPS 2025

First: 2025-12-03T17:10:44+00:00 · Latest: 2025-12-03T17:10:44+00:00

Comments: NeurIPS 2025

Abstract

Diffusion models show promise for dynamic scene deblurring; however, existing studies often fail to leverage the intrinsic nature of the blurring process within diffusion models, limiting their full potential. To address this, we present a Blur Diffusion Model (BlurDM), which seamlessly integrates the blur formation process into diffusion for image deblurring. Observing that motion blur stems from continuous exposure, BlurDM implicitly models the blur formation process through a dual-diffusion forward scheme, diffusing both noise and blur onto a sharp image. During the reverse generation process, we derive a dual denoising and deblurring formulation, enabling BlurDM to recover the sharp image by simultaneously denoising and deblurring, given pure Gaussian noise conditioned on the blurred image as input. Additionally, to efficiently integrate BlurDM into deblurring networks, we perform BlurDM in the latent space, forming a flexible prior generation network for deblurring. Extensive experiments demonstrate that BlurDM significantly and consistently enhances existing deblurring methods on four benchmark datasets. The source code is available at https://github.com/Jin-Ting-He/BlurDM.

中文标题/摘要

标题：BlurDM：一种用于图像去模糊的模糊扩散模型

扩散模型在动态场景去模糊方面显示出潜力；然而，现有研究往往未能充分利用扩散模型中的模糊过程本质，限制了其全部潜力。为了解决这一问题，我们提出了一种模糊扩散模型（BlurDM），该模型将模糊形成过程无缝地整合到扩散中以进行图像去模糊。我们观察到运动模糊源自连续曝光，BlurDM 通过双重扩散前向方案隐式地建模模糊形成过程，将噪声和模糊扩散到锐利图像上。在反生成过程中，我们推导出双重去噪和去模糊公式，使BlurDM能够在给定受模糊图像条件的纯高斯噪声输入的情况下，同时去噪和去模糊以恢复锐利图像。此外，为了高效地将BlurDM集成到去模糊网络中，我们在潜在空间中执行BlurDM，形成一个灵活的先验生成网络以进行去模糊。广泛的实验表明，BlurDM在四个基准数据集上显著且一致地增强了现有的去模糊方法。源代码可在https://github.com/Jin-Ting-He/BlurDM获取。

Summary / 总结

BlurDM is a diffusion model designed for image deblurring that integrates the blur formation process into the diffusion framework. Using a dual-diffusion forward scheme, BlurDM diffuses both noise and blur onto a sharp image, and during the reverse generation process it simultaneously denoises and deblurs. BlurDM operates in the latent space and serves as a prior generation network for deblurring networks. Experiments show that it significantly and consistently enhances existing deblurring methods on four benchmark datasets.

BlurDM 是一种新颖的模糊扩散模型，通过将模糊形成过程整合到扩散模型中来提升图像去模糊的效果。它使用双扩散前向方案来建模运动模糊，并在反向生成过程中采用双重去噪和去模糊公式。实验表明，BlurDM 在四个基准数据集上显著且一致地提升了现有去模糊方法的性能。

Refining Machine Learning Potentials through Thermodynamic Theory of Phase Transitions

Authors: Paul Fuchs, Julija Zavadlav

First: 2025-12-03T17:06:26+00:00 · Latest: 2025-12-03T17:06:26+00:00

Abstract

Foundational Machine Learning Potentials can resolve the accuracy and transferability limitations of classical force fields. They enable microscopic insights into material behavior through Molecular Dynamics simulations, which can crucially expedite material design and discovery. However, insufficiently broad and systematically biased reference data affect the predictive quality of the learned models. Often, these models exhibit significant deviations from experimentally observed phase transition temperatures, on the order of several hundred kelvins. Thus, fine-tuning is necessary to achieve adequate accuracy in many practical problems. This work proposes a fine-tuning strategy via top-down learning, directly correcting the wrongly predicted transition temperatures to match the experimental reference data. Our approach leverages the Differentiable Trajectory Reweighting algorithm to minimize the free energy differences between phases at the experimental target pressures and temperatures. We demonstrate that our approach can accurately correct the phase diagram of pure Titanium in a pressure range of up to 5 GPa, matching the experimental reference within tenths of kelvins and improving the liquid-state diffusion constant. Our approach is model-agnostic, applicable to multi-component systems with solid-solid and solid-liquid transitions, and compliant with top-down training on other experimental properties. Therefore, our approach can serve as an essential step towards highly accurate application-specific and foundational machine learning potentials.

中文标题/摘要

标题：通过相变热力学理论精炼机器学习势能

基础机器学习势能可以解决经典力场的准确性和可转移性限制。它们通过分子动力学模拟提供材料行为的微观洞察，从而加速材料设计和发现。然而，参考数据不够广泛且存在系统性偏差，会影响所学模型的预测质量。通常，这些模型在相变温度上与实验观察结果存在显著偏差，偏差可达几百开尔文。因此，需要进行微调才能在许多实际问题中达到足够的准确性。本工作提出了一种基于自上而下学习的微调策略，直接纠正错误预测的相变温度以匹配实验参考数据。我们的方法利用可微轨迹重加权算法，在实验目标压力和温度下最小化各相之间的自由能差。我们证明，我们的方法可以在高达5 GPa的压力范围内准确修正纯钛的相图，与实验参考数据的偏差在十分之几开尔文以内，并改善液态扩散常数。我们的方法是模型无关的，适用于具有固-固和固-液转变的多组分系统，并且可与针对其他实验性质的自上而下训练相结合。因此，我们的方法可以作为实现高度准确的应用特定和基础机器学习势能的重要步骤。

Summary / 总结

This study aims to enhance the accuracy and transferability of machine learning potentials in predicting phase transitions by addressing systematic biases in reference data. The method employs a top-down learning strategy using the Differentiable Trajectory Reweighting algorithm to correct predicted transition temperatures to match experimental data. The approach successfully refined the phase diagram of pure titanium, achieving accuracy within tenths of kelvins and improving the liquid-state diffusion constant. This method is model-agnostic and can be applied to various systems with solid-solid and solid-liquid transitions, making it a valuable step towards more accurate machine learning potentials.

该研究旨在通过将模型调整以匹配实验数据来提高机器学习势在预测相变方面的准确性和可转移性。方法使用可微轨迹重加权算法来纠正预测的相变温度，确保与实验相变温度有更好的一致。关键发现包括在5 GPa的压力范围内准确修正纯钛的相图，改进了液态扩散常数，并在十分之几开尔文的范围内与实验参考数据匹配。
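
Reweighting-based methods of this kind build on the thermodynamic perturbation identity, which estimates how a free energy changes when the potential changes without running a new simulation. The sketch below shows only that generic identity (the Zwanzig estimator and the matching sample weights); it is not the paper's training loop, and the inputs are assumed to be per-sample energy differences between a perturbed and a reference potential, evaluated on samples from the reference ensemble.

```python
import math

def reweighting_weights(delta_u, kT):
    """Normalized weights w_i proportional to exp(-dU_i / kT), computed stably."""
    shift = min(delta_u)
    raw = [math.exp(-(du - shift) / kT) for du in delta_u]
    total = sum(raw)
    return [r / total for r in raw]

def free_energy_difference(delta_u, kT):
    """Zwanzig estimator: dF = -kT * ln < exp(-dU / kT) > over reference samples."""
    shift = min(delta_u)
    mean_exp = sum(math.exp(-(du - shift) / kT) for du in delta_u) / len(delta_u)
    return shift - kT * math.log(mean_exp)

uniform_shift = free_energy_difference([0.3] * 5, kT=0.5)
mixed = free_energy_difference([0.0, 1.0], kT=1.0)
weights = reweighting_weights([0.3] * 5, kT=0.5)
```

Both functions are smooth in the energy differences, so when those differences come from a differentiable potential, the estimate can be differentiated with respect to the potential's parameters. That is what makes it possible to push a free energy difference between two phases toward a target value by gradient descent.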

Guided Flow Policy: Learning from High-Value Actions in Offline Reinforcement Learning

Authors: Franki Nguimatsia Tiofack, Théotime Le Hellard, Fabian Schramm, Nicolas Perrin-Gilbert, Justin Carpentier

First: 2025-12-03T17:05:58+00:00 · Latest: 2025-12-03T17:05:58+00:00

Abstract

Offline reinforcement learning often relies on behavior regularization that constrains policies to remain close to the dataset distribution. However, such approaches fail to distinguish between high-value and low-value actions in their regularization components. We introduce Guided Flow Policy (GFP), which couples a multi-step flow-matching policy with a distilled one-step actor. The actor directs the flow policy through weighted behavior cloning to focus on cloning high-value actions from the dataset rather than indiscriminately imitating all state-action pairs. In turn, the flow policy constrains the actor to remain aligned with the dataset's best transitions while maximizing the critic. This mutual guidance enables GFP to achieve state-of-the-art performance across 144 state and pixel-based tasks from the OGBench, Minari, and D4RL benchmarks, with substantial gains on suboptimal datasets and challenging tasks. Webpage: https://simple-robotics.github.io/publications/guided-flow-policy/

中文标题/摘要

标题：引导流动策略：在离线强化学习中学习高价值动作

离线强化学习通常依赖于行为正则化，以确保策略保持接近数据集分布。然而，此类方法在其正则化组件中无法区分高价值和低价值动作。我们引入了引导流动策略（GFP），它将多步流动匹配策略与提炼的一步演员结合在一起。演员通过加权行为克隆引导流动策略，专注于克隆数据集中的高价值动作，而不是无差别地模仿所有状态-动作对。反过来，流动策略约束演员保持与数据集最佳过渡的对齐，同时最大化评论家。这种相互指导使GFP在OGBench、Minari和D4RL基准的144个状态和像素基任务中实现了最先进的性能，特别是在次优数据集和具有挑战性的任务上取得了显著的提升。网页：https://simple-robotics.github.io/publications/guided-flow-policy/

Summary / 总结

The paper addresses the limitation of traditional behavior regularization in offline reinforcement learning by introducing Guided Flow Policy (GFP), which focuses on cloning high-value actions from the dataset. GFP combines a multi-step flow-matching policy with a distilled one-step actor, where the actor guides the flow policy to clone high-value actions, and the flow policy constrains the actor to remain aligned with the dataset's best transitions. This approach leads to state-of-the-art performance across various benchmarks, particularly on suboptimal datasets and challenging tasks.

论文通过引入Guided Flow Policy (GFP)解决了传统行为正则化在离线强化学习中的局限性，GFP专注于从数据集中克隆高价值动作。GFP结合了一个多步流匹配策略和一个提炼的一步演员，其中演员引导流策略克隆高价值动作，而流策略则约束演员保持与数据集中最佳过渡的一致性。这种方法在各种基准测试中表现出色，特别是在次优数据集和具有挑战性的任务上取得了显著的提升。
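
Weighted behavior cloning can be sketched independently of the flow-matching machinery. The abstract does not state how GFP derives its weights, so the exponential advantage weighting below (in the style of advantage-weighted regression), the `temperature` and `max_weight` parameters, and the scalar-action squared-error loss are all assumptions for illustration.

```python
import math

def bc_weights(advantages, temperature=1.0, max_weight=100.0):
    """Assumed exponential advantage weights, clipped so no sample dominates."""
    cap = math.log(max_weight)
    return [math.exp(min(a / temperature, cap)) for a in advantages]

def weighted_bc_loss(predicted, dataset_actions, weights):
    """Weighted mean squared error between predicted and dataset actions (scalars)."""
    total = sum(w * (p - a) ** 2
                for w, p, a in zip(weights, predicted, dataset_actions))
    return total / sum(weights)

weights = bc_weights([1.0, -1.0])
loss = weighted_bc_loss([0.0, 0.0], [1.0, 0.0], [3.0, 1.0])
```

With uniform weights this reduces to plain behavior cloning. Raising the weight on high-advantage pairs is what shifts the cloning target from "all dataset actions" to "the dataset's better actions".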

Technical Report on Text Dataset Distillation

Authors: Keith Ando Ogawa, Bruno Lopes Yamamoto, Lucas Lauton de Alcantara, Victor Zacarias, Edson Bollis, Lucas Pellicer, Rosimeire Pereira Costa, Anna Helena Reali Costa, Artur Jordao

First: 2025-12-03T16:58:44+00:00 · Latest: 2025-12-03T16:58:44+00:00

Abstract

In the vision domain, dataset distillation arises as a technique to condense a large dataset into a smaller synthetic one that exhibits a similar result in the training process. While there is an extensive literature on distillation methods for image data, text dataset distillation has received comparatively less attention. Text dataset distillation initially grew as an adaptation of efforts from the vision domain; as the particularities of the modality became clear obstacles, it developed into a separate branch of research. Several milestones mark the development of this area, such as the introduction of methods that use transformer models, the generation of discrete synthetic text, and the scaling to decoder-only models with over 1B parameters. Despite major advances in modern approaches, the field remains in a maturing phase, with room for improvement on benchmarking standardization, approaches to overcome the discrete nature of text, handling complex tasks, and providing explicit examples of real-world applications. In this report, we review past and recent advances in dataset distillation for text, highlighting different distillation strategies, key contributions, and general challenges.

中文标题/摘要

标题：技术报告：文本数据集蒸馏

在视觉领域，数据集蒸馏作为一种技术出现，用于将大型数据集压缩成一个较小的合成数据集，该数据集在训练过程中表现出相似的结果。尽管图像数据在蒸馏方法方面有大量的文献，但文本数据集蒸馏的工作相对较少。文本数据集蒸馏最初是对视觉领域工作的改编；随着该模态的特殊性成为明显的障碍，它发展成为研究的一个独立分支。该领域的几个里程碑包括使用Transformer模型的方法的引入、离散合成文本的生成，以及扩展到超过10亿参数的仅解码器模型。尽管现代方法取得了重大进展，但该领域仍处于逐步成熟的阶段，在基准标准化、克服文本的离散性质、处理复杂任务以及提供现实世界应用的明确示例方面仍有改进空间。在本报告中，我们回顾了文本数据集蒸馏的过去和近期进展，重点介绍不同的蒸馏策略、关键贡献和一般挑战。

Summary / 总结

This technical report responds to the relative scarcity of work on text dataset distillation compared with the extensive literature in the vision domain. It reviews past and recent advances, covering milestones such as methods built on transformer models, the generation of discrete synthetic text, and scaling to decoder-only models with over 1 billion parameters. It also highlights different distillation strategies and open challenges, including benchmarking standardization, the discrete nature of text, and handling complex tasks.

本报告的动机是文本数据集蒸馏的研究相比视觉领域的大量文献较少。报告并未提出单一方法，而是回顾了过去和近期的方法，指出了若干里程碑，例如采用 Transformer 模型、生成离散的合成文本，以及扩展到超过10亿参数的纯解码器模型。报告还指出了尚待解决的挑战，包括基准标准化、文本的离散性质、处理复杂任务以及提供现实世界应用的明确示例。

Training for Identity, Inference for Controllability: A Unified Approach to Tuning-Free Face Personalization

Authors: Lianyu Pang, Ji Zhou, Qiping Wang, Baoquan Zhao, Zhenguo Yang, Qing Li, Xudong Mao

First: 2025-12-03T16:57:50+00:00 · Latest: 2025-12-03T16:57:50+00:00

Comments: 17 pages, 13 figures

Abs · PDF · Code1 · Code2 · Code3

Abstract

Tuning-free face personalization methods have developed along two distinct paradigms: text embedding approaches that map facial features into the text embedding space, and adapter-based methods that inject features through auxiliary cross-attention layers. While both paradigms have shown promise, existing methods struggle to simultaneously achieve high identity fidelity and flexible text controllability. We introduce UniID, a unified tuning-free framework that synergistically integrates both paradigms. Our key insight is that when merging these approaches, they should mutually reinforce only identity-relevant information while preserving the original diffusion prior for non-identity attributes. We realize this through a principled training-inference strategy: during training, we employ an identity-focused learning scheme that guides both branches to capture identity features exclusively; at inference, we introduce a normalized rescaling mechanism that recovers the text controllability of the base diffusion model while enabling complementary identity signals to enhance each other. This principled design enables UniID to achieve high-fidelity face personalization with flexible text controllability. Extensive experiments against six state-of-the-art methods demonstrate that UniID achieves superior performance in both identity preservation and text controllability. Code will be available at https://github.com/lyuPang/UniID

中文标题/摘要

标题：身份训练，推理控制：一种无调优的统一面部个性化方法

无调优的面部个性化方法沿着两个不同的范式发展：文本嵌入方法将面部特征映射到文本嵌入空间，以及基于适配器的方法通过辅助交叉注意力层注入特征。虽然这两种范式都显示出潜力，但现有方法难以同时实现高身份保真度和灵活的文本控制。我们引入了UniID，这是一种统一的无调优框架，将这两种范式协同整合。我们的关键见解是，在合并这些方法时，它们应该仅强化与身份相关的信息，同时保留非身份属性的原始扩散先验。我们通过一个原则性的训练-推理策略实现这一点：在训练期间，我们采用一种以身份为中心的学习方案，引导两个分支仅捕捉身份特征；在推理期间，我们引入一种归一化缩放机制，恢复基础扩散模型的文本控制能力，同时使互补的身份信号相互增强。这种原则性设计使UniID能够实现高保真度的面部个性化和灵活的文本控制。与六种最先进的方法的广泛实验表明，UniID在身份保留和文本控制方面均表现出更优的性能。代码将在https://github.com/lyuPang/UniID上提供

Summary / 总结

The paper introduces UniID, a unified tuning-free framework for face personalization that integrates text embedding and adapter-based methods. During training, an identity-focused learning scheme guides both branches to capture only identity features; at inference, a normalized rescaling mechanism recovers the text controllability of the base diffusion model while letting the complementary identity signals reinforce each other. Experiments show that UniID outperforms six state-of-the-art methods in both identity preservation and text controllability.

论文提出了UniID，这是一种结合了文本嵌入和适配器方法的统一无调优人脸个性化框架。在训练过程中，以身份为中心的学习方案引导两个分支仅捕捉身份特征；在推理时，归一化重缩放机制恢复基础扩散模型的文本可控性，同时使互补的身份信号相互增强。实验表明，UniID 在身份保真度和文本可控性方面均优于六种最先进的方法。

TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning

Authors: Tao Wu, Li Yang, Gen Zhan, Yiting Liao, Junlin Li, Deliang Fu, Li Zhang, Limin Wang

First: 2025-12-03T16:57:00+00:00 · Latest: 2025-12-03T16:57:00+00:00

Abs · PDF · Code1 · Code2

Abstract

Enhancing the temporal understanding of Multimodal Large Language Models (MLLMs) is essential for advancing long-form video analysis, enabling tasks such as temporal localization, action detection, and time-sensitive question answering. While reinforcement learning (RL) has recently been explored for improving temporal reasoning, existing approaches are often confined to limited task types and data, restricting their generalization across diverse temporal understanding scenarios. To address this challenge, we present TempR1, a temporal-aware multi-task reinforcement learning framework that systematically strengthens MLLMs' temporal comprehension. We curate a multi-task corpus that exposes the model to diverse temporal structures and semantics, and build upon the Group Relative Policy Optimization (GRPO) algorithm to achieve stable and effective cross-task optimization. Specifically, we categorize temporal tasks into three correspondence types between predicted intervals and ground-truth instances, and design tailored localization rewards for each, enabling TempR1 to capture fine-grained temporal dependencies and adapt to different temporal patterns. Extensive experiments demonstrate that TempR1 attains state-of-the-art performance across multiple benchmarks. Moreover, its joint optimization over complementary tasks yields a strong synergistic effect, enhancing both generalization and single-task performance, establishing a scalable and principled paradigm for temporal reasoning in MLLMs.

中文标题/摘要

标题：TempR1：通过时间意识多任务强化学习提高MLLM的时间理解

提高多模态大型语言模型（MLLM）的时间理解对于推进长视频分析至关重要，这使得时间定位、动作检测和时间敏感的问题回答等任务成为可能。虽然强化学习（RL）最近被探索用于提高时间推理能力，但现有方法通常局限于有限的任务类型和数据，限制了它们在不同时间理解场景中的泛化能力。为了解决这一挑战，我们提出了TempR1，这是一种时间意识的多任务强化学习框架，系统地增强了MLLM的时间理解能力。我们构建了一个多任务语料库，使模型接触到各种各样的时间结构和语义，并在此基础上利用Group Relative Policy Optimization（GRPO）算法实现跨任务的稳定和有效的优化。具体而言，我们将时间任务分为三种预测区间与真实实例之间对应类型，并为每种类型设计定制化的定位奖励，使TempR1能够捕捉到细微的时间依赖关系并适应不同的时间模式。广泛的实验表明，TempR1在多个基准测试中达到了最先进的性能。此外，其在互补任务上的联合优化产生了强大的协同效应，增强了泛化能力和单任务性能，为MLLM中的时间推理建立了可扩展和原则性的范式。

Summary / 总结

The research aims to improve the temporal understanding of Multimodal Large Language Models (MLLMs) for long-form video analysis. TempR1, a temporal-aware multi-task reinforcement learning framework, is proposed to enhance temporal comprehension by exposing the model to diverse temporal structures and using the GRPO algorithm for stable cross-task optimization. The framework categorizes temporal tasks into three types and designs tailored rewards, achieving state-of-the-art performance across multiple benchmarks and demonstrating a synergistic effect in generalization and single-task performance.

研究旨在通过增强多模态大型语言模型（MLLMs）的时序理解能力，提高长视频分析的能力。提出了一种时序感知的多任务强化学习框架TempR1，通过让模型接触多样化的时序结构，并为不同类型的时序任务设计定制化的奖励系统来提升时序推理能力。实验表明，TempR1在多个基准测试中表现出色，并通过互补任务的联合优化提高了泛化能力和单任务性能。
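The TempR1 abstract mentions localization rewards between predicted intervals and ground-truth instances, combined with GRPO. The snippet below is only a rough illustration of those two ingredients, not the paper's reward design: it assumes temporal IoU as the localization reward and uses the commonly described GRPO step of normalizing rewards within a group of sampled responses. The function names, the example intervals, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def temporal_iou(pred, gt):
    """IoU between two time intervals given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses (GRPO-style).

    Assumption: population standard deviation plus a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical example: one ground-truth segment, three sampled predictions.
gt = (10.0, 20.0)
candidates = [(10.0, 20.0), (15.0, 25.0), (30.0, 40.0)]
rewards = [temporal_iou(c, gt) for c in candidates]
advantages = group_relative_advantages(rewards)
```

In this toy group, the exact match receives the highest advantage and the non-overlapping prediction the lowest; how TempR1 tailors the reward to each of its three correspondence types is not specified in the abstract and is not modeled here.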

All that structure matches does not glitter

Authors: Maya M. Martirossyan, Thomas Egg, Philipp Hoellmer, George Karypis, Mark Transtrum, Adrian Roitberg, Mingjie Liu, Richard G. Hennig, Ellad B. Tadmor, Stefano Martiniani

Venue: NeurIPS

First: 2025-09-15T17:41:16+00:00 · Latest: 2025-12-03T16:56:41+00:00

Comments: Accepted at Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS)

Abs · PDF · Code1 · Code2

Abstract

Generative models for materials, especially inorganic crystals, hold potential to transform the theoretical prediction of novel compounds and structures. Advancement in this field depends on robust benchmarks and minimal, information-rich datasets that enable meaningful model evaluation. This paper critically examines common datasets and reported metrics for a crystal structure prediction task, namely generating the most likely structures given the chemical composition of a material. We focus on three key issues: First, materials datasets should contain unique crystal structures; for example, we show that the widely-utilized carbon-24 dataset only contains $\approx$40% unique structures. Second, materials datasets should not be split randomly if polymorphs of many different compositions are numerous, which we find to be the case for the perov-5 and MP-20 datasets. Third, benchmarks can mislead if used uncritically, e.g., reporting a match rate metric without considering the structural variety exhibited by identical building blocks. To address these oft-overlooked issues, we introduce several fixes. We provide revised versions of the carbon-24 dataset: one with duplicates removed, one deduplicated and split by number of atoms $N$, one with enantiomorphs, and two containing only identical structures but with different unit cells. We also propose new splits for datasets with polymorphs, ensuring that polymorphs are grouped within each split subset, setting a more sensible standard for benchmarking model performance. Finally, we present METRe and cRMSE, new model evaluation metrics that can correct existing issues with the match rate metric.

中文标题/摘要

标题：所有结构匹配并非皆有价值

材料生成模型，尤其是无机晶体，具有潜力彻底改变新型化合物和结构的理论预测。该领域的发展依赖于稳健的基准和最少但信息丰富的数据集，以实现有意义的模型评估。本文批判性地审视了晶体结构预测任务中常见的数据集和报告的指标。我们关注三个关键问题：首先，材料数据集应包含独特的晶体结构；例如，我们展示了广泛使用的碳-24数据集仅包含约40%的独特结构。其次，如果不同组成的多形体数量众多，则不应随机划分数据集，我们发现这适用于perov-5和MP-20数据集。第三，如果未经批判性使用基准，可能会导致误导，例如仅报告匹配率指标而不考虑相同构建块展示的结构多样性。为解决这些常被忽视的问题，我们提出了一些修正措施。我们提供了碳-24数据集的修订版本：一个去除了重复项的版本，一个去重并按原子数N划分的版本，一个包含对映异构体的版本，以及两个仅包含相同结构但具有不同晶胞的版本。我们还为具有多形体的数据集提出了新的划分方法，确保多形体在每个划分子集中分组，为模型性能基准设定更合理的标准。最后，我们提出了METRe和cRMSE，两种新的模型评估指标，可以纠正现有匹配率指标的问题。

Summary / 总结

This paper addresses critical issues in datasets and evaluation metrics for crystal structure prediction, focusing on the carbon-24, perov-5, and MP-20 datasets. It shows that carbon-24 contains only about 40% unique structures, that random splits are problematic for perov-5 and MP-20 because polymorphs are numerous, and that the match rate metric can mislead when used uncritically. The authors introduce revised versions of the carbon-24 dataset and propose new splits that keep polymorphs grouped within each subset for more sensible benchmarking. Additionally, they introduce new metrics, METRe and cRMSE, to correct issues with the match rate metric.

本文针对晶体结构预测中数据集和评估指标的关键问题，重点关注碳-24、perov-5和MP-20数据集。指出这些数据集包含重复结构、不恰当的多形体分割以及误导性的匹配率指标。作者提出了修订后的碳-24数据集版本，提出了多形体数据集的新分割方法，并引入了METRe和cRMSE作为改进的评估指标。这些改进旨在为材料科学中的生成模型提供更稳健的基准。
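The abstract's point about not splitting randomly when polymorphs are numerous can be illustrated with a group-aware split. The sketch below is a minimal illustration, not the authors' released splits: it assumes each entry carries a composition string under a hypothetical `formula` key and keeps every entry with the same formula in the same subset. The entry layout, the formulas, and the split fractions are all made up for the example.

```python
import random
from collections import defaultdict


def grouped_split(entries, key, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split entries so that all entries sharing key(entry) land in one subset."""
    groups = defaultdict(list)
    for entry in entries:
        groups[key(entry)].append(entry)
    # Shuffle group keys, not individual entries, so groups stay intact.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    n_train = int(fractions[0] * len(keys))
    n_val = int(fractions[1] * len(keys))
    parts = (keys[:n_train], keys[n_train:n_train + n_val], keys[n_train + n_val:])
    return tuple([e for k in part for e in groups[k]] for part in parts)


# Hypothetical toy dataset: ten compositions with two polymorphs each.
formulas = ["C", "SiO2", "TiO2", "CaTiO3", "SrTiO3",
            "BaTiO3", "ZnS", "Fe2O3", "Al2O3", "NaCl"]
entries = [{"id": f"{f}-{i}", "formula": f} for f in formulas for i in range(2)]

train, val, test = grouped_split(entries, key=lambda e: e["formula"])
```

Because the fractions apply to groups rather than to individual structures, subset sizes are only approximate when group sizes vary; that trade-off is the price of keeping polymorphs out of both the training and test subsets at once.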

Tada-DIP: Input-adaptive Deep Image Prior for One-shot 3D Image Reconstruction

Authors: Evan Bell, Shijun Liang, Ismail Alkhouri, Saiprasad Ravishankar

First: 2025-12-03T16:56:38+00:00 · Latest: 2025-12-03T16:56:38+00:00

Comments: 6 pages, 8 figures, 2025 Asilomar Conference on Signals, Systems, and Computers. Code is available at github.com/evanbell02/Tada-DIP/

Abs · PDF · Code1 · Code2 · Code3

Abstract

Deep Image Prior (DIP) has recently emerged as a promising one-shot neural-network based image reconstruction method. However, DIP has seen limited application to 3D image reconstruction problems. In this work, we introduce Tada-DIP, a highly effective and fully 3D DIP method for solving 3D inverse problems. By combining input-adaptation and denoising regularization, Tada-DIP produces high-quality 3D reconstructions while avoiding the overfitting phenomenon that is common in DIP. Experiments on sparse-view X-ray computed tomography reconstruction validate the effectiveness of the proposed method, demonstrating that Tada-DIP produces much better reconstructions than training-data-free baselines and achieves reconstruction performance on par with a supervised network trained using a large dataset with fully-sampled volumes.

中文标题/摘要

标题：Tada-DIP：输入自适应深度图像先验用于单次3D图像重建

深度图像先验（DIP）最近作为一种有前途的单次（one-shot）神经网络图像重建方法而崭露头角。然而，DIP在3D图像重建问题中的应用有限。在本文中，我们引入了Tada-DIP，这是一种高效且完全3D的DIP方法，用于解决3D逆问题。通过结合输入自适应和去噪正则化，Tada-DIP能够生成高质量的3D重建图像，同时避免了DIP中常见的过拟合现象。在稀疏视图X射线计算机断层扫描重建实验中，验证了所提出方法的有效性，表明Tada-DIP产生的重建效果远优于无训练数据基线，并且达到了与使用包含完全采样体数据的大型数据集训练的监督网络相当的重建性能。

Summary / 总结

Tada-DIP is a novel 3D DIP method that combines input-adaptation and denoising regularization to improve one-shot 3D image reconstruction. It addresses the overfitting issue common in DIP and achieves high-quality 3D reconstructions, outperforming training-data-free baselines and matching the performance of a supervised network trained on a large dataset of fully-sampled volumes. Experiments on sparse-view X-ray computed tomography reconstruction validate its effectiveness.

Tada-DIP 是一种结合输入自适应和去噪正则化的新型 3D DIP 方法，用于提高单次（one-shot）3D 图像重建效果。它解决了 DIP 中常见的过拟合问题，并生成高质量的 3D 重建图像。实验表明，Tada-DIP 在稀疏视图 X 射线计算机断层扫描中的表现优于无训练数据基线，并且与在大量数据集上训练的监督网络具有相当的重建性能。

Benchmark for Planning and Control with Large Language Model Agents: Blocksworld with Model Context Protocol

Authors: Niklas Jobs, Luis Miguel Vieira da Silva, Jayanth Somashekaraiah, Maximilian Weigand, David Kube, Felix Gehlhoff

First: 2025-12-03T16:49:14+00:00 · Latest: 2025-12-03T16:49:14+00:00

Comments: This work has been submitted to IFAC for possible publication

Abs · PDF · Code1 · Code2

Abstract

Industrial automation increasingly requires flexible control strategies that can adapt to changing tasks and environments. Agents based on Large Language Models (LLMs) offer potential for such adaptive planning and execution but lack standardized benchmarks for systematic comparison. We introduce a benchmark with an executable simulation environment representing the Blocksworld problem providing five complexity categories. By integrating the Model Context Protocol (MCP) as a standardized tool interface, diverse agent architectures can be connected to and evaluated against the benchmark without implementation-specific modifications. A single-agent implementation demonstrates the benchmark's applicability, establishing quantitative metrics for comparison of LLM-based planning and execution approaches.

中文标题/摘要

标题：大型语言模型代理的规划与控制基准：基于模型上下文协议的积木世界问题

工业自动化越来越需要能够适应变化任务和环境的灵活控制策略。基于大型语言模型（LLMs）的代理具有实现这种适应性规划和执行的潜力，但缺乏用于系统比较的标准化基准。我们引入了一个基准，其中包含一个可执行的模拟环境，代表积木世界问题，提供五个复杂度类别。通过将模型上下文协议（MCP）作为标准化工具接口集成，不同的代理架构无需针对具体实现进行修改即可连接到该基准并接受评估。单个代理的实现证明了基准的适用性，并为比较基于LLMs的规划和执行方法建立了定量指标。
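For readers unfamiliar with Blocksworld, the sketch below shows a minimal executable version of the problem. It is not the benchmark's simulation environment, and it does not use the Model Context Protocol: the class name, the three action methods, and the example plan are hypothetical stand-ins for the kind of tools an agent might call. Each action returns `False` instead of changing state when its precondition fails.

```python
class Blocksworld:
    """Toy Blocksworld: stacks are lists ordered bottom to top, one gripper."""

    def __init__(self, stacks):
        self.stacks = [list(s) for s in stacks]
        self.holding = None

    def pick_up(self, block):
        """Pick up a block only if the gripper is empty and the block is on top."""
        if self.holding is not None:
            return False
        for stack in self.stacks:
            if stack and stack[-1] == block:
                self.holding = stack.pop()
                return True
        return False

    def put_on_table(self):
        """Place the held block on the table as a new stack."""
        if self.holding is None:
            return False
        self.stacks.append([self.holding])
        self.holding = None
        return True

    def stack_on(self, target):
        """Place the held block on top of a block that is currently clear."""
        if self.holding is None:
            return False
        for stack in self.stacks:
            if stack and stack[-1] == target:
                stack.append(self.holding)
                self.holding = None
                return True
        return False

    def state(self):
        """Canonical view of the world: sorted non-empty stacks."""
        return sorted(tuple(s) for s in self.stacks if s)


# Hypothetical task: A sits on C, B is on the table; goal is the tower C, B, A.
world = Blocksworld([["C", "A"], ["B"]])
blocked = world.pick_up("C")  # C is covered by A, so this is rejected
plan = [("pick_up", "A"), ("put_on_table",), ("pick_up", "B"),
        ("stack_on", "C"), ("pick_up", "A"), ("stack_on", "B")]
results = [getattr(world, name)(*args) for name, *args in plan]
solved = world.state() == [("C", "B", "A")]
```

A planner, LLM-based or otherwise, would produce the `plan` list; the environment's job is only to validate and apply each step, which is what makes success measurable.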