arXiv 论文速递

Light-X: Generative 4D Video Rendering with Camera and Illumination Control

Authors: Tianqi Liu, Zhaoxi Chen, Zihao Huang, Shaocong Xu, Saining Zhang, Chongjie Ye, Bohan Li, Zhiguo Cao, Wei Li, Hao Zhao, Ziwei Liu

First: 2025-12-04T18:59:57+00:00 · Latest: 2025-12-04T18:59:57+00:00

Comments: Project Page: https://lightx-ai.github.io/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Recent advances in illumination control extend image-based methods to video, yet they still face a trade-off between lighting fidelity and temporal consistency. Moving beyond relighting, a key step toward generative modeling of real-world scenes is the joint control of camera trajectory and illumination, since visual dynamics are inherently shaped by both geometry and lighting. To this end, we present Light-X, a video generation framework that enables controllable rendering from monocular videos with both viewpoint and illumination control. 1) We propose a disentangled design that decouples geometry and lighting signals: geometry and motion are captured via dynamic point clouds projected along user-defined camera trajectories, while illumination cues are provided by a relit frame consistently projected into the same geometry. These explicit, fine-grained cues enable effective disentanglement and guide high-quality illumination. 2) To address the lack of paired multi-view and multi-illumination videos, we introduce Light-Syn, a degradation-based pipeline with inverse-mapping that synthesizes training pairs from in-the-wild monocular footage. This strategy yields a dataset covering static, dynamic, and AI-generated scenes, ensuring robust training. Extensive experiments show that Light-X outperforms baseline methods in joint camera-illumination control and surpasses prior video relighting methods under both text- and background-conditioned settings.

中文标题/摘要

标题：Light-X：基于相机和照明控制的4D视频生成

近期在照明控制方面的进展将基于图像的方法扩展到了视频，但仍面临照明保真度和时间一致性之间的权衡。超越重新照明，生成现实场景的关键步骤是同时控制相机轨迹和照明，因为视觉动态是由几何形状和照明共同塑造的。为此，我们提出了Light-X，一种能够从单目视频中实现可控渲染的视频生成框架，同时控制视角和照明。1) 我们提出了一种解耦设计，将几何和照明信号分离：几何和运动通过沿用户定义的相机轨迹投影的动态点云捕获，而照明线索由一致投影到相同几何形状的重新照明帧提供。这些明确的细粒度线索有助于有效的解耦，并引导高质量的照明。2) 为了解决缺乏多视角和多照明视频配对的问题，我们引入了Light-Syn，一种基于退化的方法，通过逆映射从野外单目视频中合成训练配对。这种方法生成了一个涵盖静态、动态和AI生成场景的数据集，确保了稳健的训练。广泛的实验表明，Light-X 在同时控制相机和照明方面优于基线方法，并且在文本和背景条件设置下超过了先前的视频重新照明方法。

Summary / 总结

Light-X is a video generation framework that controls both camera trajectory and illumination to achieve high-quality 4D rendering from monocular videos. It uses a disentangled design to capture geometry and motion via dynamic point clouds and provides illumination cues through relit frames. Light-Syn, a degradation-based pipeline, synthesizes training pairs from monocular footage to address the lack of multi-view and multi-illumination data. Experiments show that Light-X outperforms baseline methods in joint camera-illumination control and surpasses prior video relighting methods under various conditions.

Light-X 是一种能够同时控制摄像机轨迹和照明的视频生成框架，以实现高质量的4D渲染。它采用分离设计分别捕捉几何和照明信息，并引入 Light-Syn，一种基于退化的方法，从单目视频中合成训练数据。实验表明，Light-X 在联合摄像机-照明控制方面优于基线方法，并在各种设置下超越了之前的视频重新照明方法。
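
The abstract describes geometry cues as dynamic point clouds projected along a user-defined camera trajectory. The following is a minimal sketch of that projection step only, assuming a standard pinhole camera model; the function names (`project_points`, `render_trajectory`) and the pose format are hypothetical and are not taken from the Light-X code.

```python
def project_points(points_world, rotation, translation, fx, fy, cx, cy):
    """Project 3D world points into one camera view (pinhole model, assumed).

    rotation is a 3x3 nested list and translation a length-3 list mapping
    world coordinates to camera coordinates: x_cam = R @ x_world + t.
    Returns (u, v, depth) per point, or None for points behind the camera.
    """
    projected = []
    for point in points_world:
        cam = [
            sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
            for i in range(3)
        ]
        if cam[2] <= 0:  # behind the image plane, not visible
            projected.append(None)
            continue
        u = fx * cam[0] / cam[2] + cx
        v = fy * cam[1] / cam[2] + cy
        projected.append((u, v, cam[2]))
    return projected


def render_trajectory(frames, poses, fx, fy, cx, cy):
    """Project the point cloud of each video frame with that frame's target pose.

    frames: list of per-frame point lists (the dynamic point cloud).
    poses:  list of (rotation, translation) pairs, one per frame.
    """
    return [
        project_points(points, rot, trans, fx, fy, cx, cy)
        for points, (rot, trans) in zip(frames, poses)
    ]


if __name__ == "__main__":
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    frames = [[(0.0, 0.0, 2.0)], [(0.0, 0.0, 2.0)]]
    # Second pose shifts the scene 0.5 units along x in camera coordinates.
    poses = [(identity, [0, 0, 0]), (identity, [0.5, 0, 0])]
    print(render_trajectory(frames, poses, 100.0, 100.0, 50.0, 40.0))
```

In the framework described above, projections like these would serve as conditioning cues for the video generator; how the relit frame is projected into the same geometry is not specified in the abstract and is left out here.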

Splannequin: Freezing Monocular Mannequin-Challenge Footage with Dual-Detection Splatting

Authors: Hao-Jen Chien, Yi-Chuan Huang, Chung-Ho Wu, Wei-Lun Chao, Yu-Lun Liu

Venue: WACV 2025

First: 2025-12-04T18:59:53+00:00 · Latest: 2025-12-04T18:59:53+00:00

Comments: WACV 2025. Project page: https://chien90190.github.io/splannequin/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Synthesizing high-fidelity frozen 3D scenes from monocular Mannequin-Challenge (MC) videos is a unique problem distinct from standard dynamic scene reconstruction. Instead of focusing on modeling motion, our goal is to create a frozen scene while strategically preserving subtle dynamics to enable user-controlled instant selection. To achieve this, we introduce a novel application of dynamic Gaussian splatting: the scene is modeled dynamically, which retains nearby temporal variation, and a static scene is rendered by fixing the model's time parameter. However, under this usage, monocular capture with sparse temporal supervision introduces artifacts like ghosting and blur for Gaussians that become unobserved or occluded at weakly supervised timestamps. We propose Splannequin, an architecture-agnostic regularization that detects two states of Gaussian primitives, hidden and defective, and applies temporal anchoring. Under predominantly forward camera motion, hidden states are anchored to their recent well-observed past states, while defective states are anchored to future states with stronger supervision. Our method integrates into existing dynamic Gaussian pipelines via simple loss terms, requires no architectural changes, and adds zero inference overhead. This results in markedly improved visual quality, enabling high-fidelity, user-selectable frozen-time renderings, validated by a 96% user preference. Project page: https://chien90190.github.io/splannequin/

中文标题/摘要

标题：Splannequin：冻结单目人形挑战视频中的场景双检测斑点化

从单目人形挑战（MC）视频中合成高保真冻结3D场景是一个独特的问题，不同于标准的动态场景重建。我们不专注于建模运动，而是旨在创建一个冻结的场景，同时战略性地保留微妙的动力学，以使用户能够即时选择。为此，我们引入了一种动态高斯斑点化的新应用：场景动态建模，保留了附近的时间变化，通过固定模型的时间参数渲染静态场景。然而，在这种使用方式下，单目捕获在稀疏时间监督下引入了高斯体未观察到或被遮挡时的鬼影和模糊等伪影。我们提出了一种架构无关的正则化方法Splannequin，该方法检测高斯原语的两种状态：隐藏状态和缺陷状态，并应用时间锚定。在主要向前摄像机运动下，隐藏状态被锚定到其最近的观察良好的过去状态，而缺陷状态被锚定到未来具有更强监督的状态。我们的方法通过简单的损失项集成到现有的动态高斯管道中，无需架构更改，并且不增加推理开销。这导致了显著改进的视觉质量，使用户能够选择高保真、可选的冻结时间渲染，经96%的用户偏好验证。项目页面：https://chien90190.github.io/splannequin/

Summary / 总结

The research aims to synthesize high-fidelity frozen 3D scenes from monocular Mannequin-Challenge videos by strategically preserving subtle dynamics. The method uses dynamic Gaussian splatting to model the scene and render a static scene by fixing the time parameter. Splannequin, an architecture-agnostic regularization, detects hidden and defective Gaussian states and applies temporal anchoring to improve visual quality, achieving 96% user preference. The method integrates into existing pipelines with simple loss terms and no additional inference overhead.

研究旨在通过战略性地保留微妙动态，从单目Mannequin-Challenge视频中合成高保真3D场景。方法使用动态Gaussian splatting来建模场景，同时保留附近的时间变化，并通过固定模型的时间参数来渲染静态场景。Splannequin是一种架构无关的正则化技术，可以检测隐藏和缺陷的Gaussian状态，并应用时间锚定以提高视觉质量，从而实现高保真、用户可选的冻结时间渲染，用户偏好度达到96%。

DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation

Authors: Dongzhi Jiang, Renrui Zhang, Haodong Li, Zhuofan Zong, Ziyu Guo, Jun He, Claire Guo, Junyan Ye, Rongyao Fang, Weijia Li, Rui Liu, Hongsheng Li

First: 2025-12-04T18:59:53+00:00 · Latest: 2025-12-04T18:59:53+00:00

Comments: Project Page: https://github.com/CaraJ7/DraCo

Abs · PDF · Code1 · Code2 · Code3

Abstract

Recent unified multimodal large language models (MLLMs) have shown impressive capabilities, incorporating chain-of-thought (CoT) reasoning for enhanced text-to-image generation. However, existing approaches remain limited, either treating the model merely as a standalone generator or relying on abstract textual planning. To this end, we propose Draft-as-CoT (DraCo), a novel interleaved reasoning paradigm that fully leverages both textual and visual contents in CoT for better planning and verification. Our method first generates a low-resolution draft image as a preview, providing more concrete and structural visual planning and guidance. Then, we employ the model's inherent understanding capability to verify potential semantic misalignments between the draft and the input prompt, and perform refinement through selective corrections with super-resolution. In this way, our approach addresses two fundamental challenges: the coarse-grained nature of textual planning and the difficulty in generating rare attribute combinations. To support training, we curate DraCo-240K, aiming to enhance three atomic capabilities spanning general correction, instance manipulation, and layout reorganization. Supported by DraCo-CFG, a specialized classifier-free guidance (CFG) strategy for interleaved reasoning, DraCo achieves a tremendous increase on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%), significantly outperforming direct generation and other generation methods empowered by CoT.

中文标题/摘要

标题：DraCo：草图作为CoT以实现文本到图像预览和稀有概念生成

近期统一的多模态大型语言模型（MLLMs）展示了令人印象深刻的性能，通过链式推理（CoT）增强了文本到图像生成能力。然而，现有方法仍然有限，要么仅将模型视为独立生成器，要么依赖于抽象的文本规划。为此，我们提出了一种名为Draft-as-CoT（DraCo）的新颖交替推理范式，该范式充分利用了CoT中的文本和视觉内容，以更好地进行规划和验证。我们的方法首先生成低分辨率的草图图像作为预览，提供更具体的视觉规划和指导。然后，我们利用模型的内在理解能力验证草图与输入提示之间潜在的语义不一致，并通过选择性修正进行超分辨率细化。这样，我们的方法解决了两个基本挑战：文本规划的粗粒度性质和生成稀有属性组合的难度。为了支持训练，我们收集了DraCo-240K，旨在增强一般修正、实例操作和布局重组的三种原子能力。借助DraCo-CFG，一种专门的交替推理无分类器引导（CFG）策略，DraCo在GenEval上取得了8%的巨大提升，在Imagine-Bench上提升了0.91，在GenEval++上提升了3%，显著优于直接生成和其他借助CoT增强的生成方法。

Summary / 总结

DraCo is a novel method that uses a draft image as a chain-of-thought (CoT) for text-to-image generation, addressing the limitations of existing approaches. It generates a low-resolution draft image first, which provides concrete visual planning and guidance, and then refines it through selective corrections with super-resolution. DraCo improves on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%) by effectively handling the challenges of textual planning and rare attribute generation.

DraCo 提出了一种名为 Draft-as-CoT 的新型交错推理范式，通过在链式思维过程中利用文本和视觉内容来提升文本到图像的生成能力。它首先生成一个低分辨率的草图作为预览，然后根据输入提示进行验证和细化。DraCo 在 GenEval (+8%)、Imagine-Bench (+0.91) 和 GenEval++ (+3%) 上显著优于直接生成和其他基于 CoT 的生成方法，解决了粗粒度文本规划和稀有属性生成的挑战。
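
The abstract names DraCo-CFG as a specialized classifier-free guidance strategy but does not give its formula. For orientation, the sketch below shows only the standard CFG combination that such strategies build on; it is not DraCo-CFG, and the function name and list-based inputs are illustrative assumptions.

```python
def classifier_free_guidance(cond_pred, uncond_pred, scale):
    """Standard CFG: push the unconditional prediction toward the conditional one.

    cond_pred / uncond_pred: model outputs (flat lists of floats) with and
    without the conditioning signal. scale = 0 returns the unconditional
    prediction, scale = 1 the conditional one, scale > 1 extrapolates past it.
    """
    return [u + scale * (c - u) for c, u in zip(cond_pred, uncond_pred)]


if __name__ == "__main__":
    cond = [1.0, 2.0]
    uncond = [0.0, 1.0]
    for scale in (0.0, 1.0, 2.0):
        print(scale, classifier_free_guidance(cond, uncond, scale))
```

An interleaved setting has more than one conditioning signal (prompt, draft image, verification text), which is presumably why a specialized variant is needed; the details are in the paper rather than in this digest.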

ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning

Authors: Shengyuan Ding, Xinyu Fang, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiangyu Zhao, Haodong Duan, Xiaoyi Dong, Jianze Liang, Bin Wang, Conghui He, Dahua Lin, Jiaqi Wang

First: 2025-12-04T18:59:52+00:00 · Latest: 2025-12-04T18:59:52+00:00

Abs · PDF · Code1 · Code2

Abstract

Reward models are critical for aligning vision-language systems with human preferences, yet current approaches suffer from hallucination, weak visual grounding, and an inability to use tools for verification, limiting their reliability on complex multimodal reasoning tasks. We present ARM-Thinker, an Agentic multimodal Reward Model that autonomously invokes external tools (e.g., image cropping, doc page retrieval) to ground judgments in verifiable evidence, replacing static, non-interactive reward scoring. This enables the model to verify fine-grained visual details, cross-reference multi-page evidence, and validate reasoning claims, which are capabilities absent in existing reward models. We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy. To evaluate agentic reward modeling, we introduce ARMBench-VL, comprising three benchmarks that assess fine-grained visual grounding (image-level tools), multi-page document understanding (retrieval tools), and instruction following (text-level verification). ARM-Thinker achieves +16.2% average improvement on reward modeling benchmarks, +9.6% on tool-use tasks, and outperforms baselines on multimodal math and logical reasoning benchmarks. Our results demonstrate that agentic capabilities significantly enhance both accuracy and interpretability of reward models.

中文标题/摘要

标题：ARM-Thinker: 使用自主工具使用和视觉推理强化多模态生成奖励模型

奖励模型对于使视觉语言系统与人类偏好保持一致至关重要，但当前方法存在幻觉、视觉定位弱以及无法使用工具进行验证的问题，限制了它们在复杂多模态推理任务中的可靠性。我们提出了ARM-Thinker，这是一种自主多模态奖励模型，能够自主调用外部工具（例如，图像裁剪、文档页面检索）以基于可验证的证据进行判断，替代静态、非交互式的奖励评分。这使模型能够验证细微的视觉细节、交叉引用多页证据并验证推理声明，而这些能力在现有的奖励模型中是不存在的。我们使用多阶段强化学习训练ARM-Thinker，联合优化工具调用决策和判断准确性。为了评估自主奖励建模，我们引入了ARMBench-VL，包含三个基准测试，分别评估细微的视觉定位（图像级工具）、多页文档理解（检索工具）和指令遵循（文本级验证）。ARM-Thinker 在奖励模型基准测试中平均提高了16.2%，在工具使用任务中提高了9.6%，并在多模态数学和逻辑推理基准测试中优于基线模型。我们的结果表明，自主能力显著提高了奖励模型的准确性和可解释性。

Summary / 总结

ARM-Thinker is designed to address the limitations of current reward models by incorporating agentic tool use and visual reasoning. It uses multi-stage reinforcement learning to autonomously invoke external tools for verifying visual details and cross-referencing evidence, improving its reliability in complex multimodal reasoning tasks. ARM-Thinker shows a 16.2% average improvement on reward modeling benchmarks and outperforms baselines on multimodal math and logical reasoning tasks.

ARM-Thinker 通过引入自主工具使用和视觉推理来解决当前奖励模型的局限性。它使用多阶段强化学习自主调用外部工具来验证视觉细节和交叉引用证据，提高其在复杂多模态推理任务中的可靠性。ARM-Thinker 在奖励模型基准测试中的平均改进幅度为 16.2%，并在多模态数学和逻辑推理基准测试中优于基线模型。

ShadowDraw: From Any Object to Shadow-Drawing Compositional Art

Authors: Rundong Luo, Noah Snavely, Wei-Chiu Ma

First: 2025-12-04T18:59:51+00:00 · Latest: 2025-12-04T18:59:51+00:00

Comments: Project page: https://red-fairy.github.io/ShadowDraw/

Abs · PDF · Code1 · Code2 · Project1

Abstract

We introduce ShadowDraw, a framework that transforms ordinary 3D objects into shadow-drawing compositional art. Given a 3D object, our system predicts scene parameters, including object pose and lighting, together with a partial line drawing, such that the cast shadow completes the drawing into a recognizable image. To this end, we optimize scene configurations to reveal meaningful shadows, employ shadow strokes to guide line drawing generation, and adopt automatic evaluation to enforce shadow-drawing coherence and visual quality. Experiments show that ShadowDraw produces compelling results across diverse inputs, from real-world scans and curated datasets to generative assets, and naturally extends to multi-object scenes, animations, and physical deployments. Our work provides a practical pipeline for creating shadow-drawing art and broadens the design space of computational visual art, bridging the gap between algorithmic design and artistic storytelling. Check out our project page https://red-fairy.github.io/ShadowDraw/ for more results and an end-to-end real-world demonstration of our pipeline!

中文标题/摘要

标题：ShadowDraw：从任意物体到影子绘画组合艺术

我们介绍了ShadowDraw，一个将普通3D物体转换为影子绘画组合艺术的框架。给定一个3D物体，我们的系统预测场景参数，包括物体姿态和照明，以及部分线稿，使得投射的阴影完成绘画并形成可识别的图像。为此，我们优化场景配置以揭示有意义的阴影，利用阴影笔触引导线稿生成，并采用自动评估以确保影子绘画的一致性和视觉质量。实验表明，ShadowDraw在从真实世界扫描、精选数据集到生成资产的各种输入中均能产生引人注目的结果，并自然扩展到多物体场景、动画和物理部署。我们的工作提供了一条实用的创作影子绘画艺术的管道，并拓宽了计算视觉艺术的设计空间，弥合了算法设计与艺术叙事之间的差距。请访问我们的项目页面https://red-fairy.github.io/ShadowDraw/查看更多结果和我们管道的端到端现实世界演示！

Summary / 总结

ShadowDraw is a framework that converts 3D objects into shadow-drawing compositional art by predicting scene parameters and optimizing shadow configurations. It generates partial line drawings that, when combined with the cast shadows, form recognizable images. Experiments demonstrate that ShadowDraw can produce compelling results for various inputs, including real-world scans, curated datasets, and generative assets, and can be extended to multi-object scenes and animations. The work offers a practical pipeline for creating shadow-drawing art and expands the design space of computational visual art, connecting algorithmic design with artistic storytelling.

ShadowDraw 是一个框架，通过预测场景参数和优化阴影配置，将 3D 对象转换为阴影绘画组合艺术。系统生成部分线稿，结合投射的阴影形成可识别的图像。实验表明，ShadowDraw 可以为各种输入，包括真实世界的扫描、精选数据集和生成资产，产生引人注目的结果，并扩展到多对象场景和物理部署。这项工作提供了一种创建阴影绘画艺术的实用管道，并扩展了计算视觉艺术的设计空间，将算法设计与艺术叙事联系起来。

NeuralRemaster: Phase-Preserving Diffusion for Structure-Aligned Generation

Authors: Yu Zeng, Charles Ochoa, Mingyuan Zhou, Vishal M. Patel, Vitor Guizilini, Rowan McAllister

First: 2025-12-04T18:59:18+00:00 · Latest: 2025-12-04T18:59:18+00:00

Abs · PDF · Code1 · Code2 · Project1

Abstract

Standard diffusion corrupts data using Gaussian noise whose Fourier coefficients have random magnitudes and random phases. While effective for unconditional or text-to-image generation, corrupting phase components destroys spatial structure, making it ill-suited for tasks requiring geometric consistency, such as re-rendering, simulation enhancement, and image-to-image translation. We introduce Phase-Preserving Diffusion (φ-PD), a model-agnostic reformulation of the diffusion process that preserves input phase while randomizing magnitude, enabling structure-aligned generation without architectural changes or additional parameters. We further propose Frequency-Selective Structured (FSS) noise, which provides continuous control over structural rigidity via a single frequency-cutoff parameter. φ-PD adds no inference-time cost and is compatible with any diffusion model for images or videos. Across photorealistic and stylized re-rendering, as well as sim-to-real enhancement for driving planners, φ-PD produces controllable, spatially aligned results. When applied to the CARLA simulator, φ-PD improves CARLA-to-Waymo planner performance by 50%. The method is complementary to existing conditioning approaches and broadly applicable to image-to-image and video-to-video generation. Videos, additional examples, and code are available on our project page: https://yuzeng-at-tri.github.io/ppd-page/

中文标题/摘要

标题：NeuralRemaster：相位保持扩散以实现结构对齐生成

标准扩散使用具有随机幅度和随机相位的高斯噪声破坏数据。虽然对于无条件或文本到图像生成有效，但破坏相位分量会破坏空间结构，使其不适合需要几何一致性的任务，如重新渲染、仿真增强和图像到图像转换。我们引入了相位保持扩散（φ-PD），这是一种模型无关的扩散过程重新表述，能够保持输入相位的同时随机化幅度，从而在无需架构更改或额外参数的情况下实现结构对齐生成。我们还提出了频率选择性结构（FSS）噪声，通过单一的频率截止参数提供连续的结构刚性控制。φ-PD 不增加推理时间成本，并且可以与任何图像或视频的扩散模型兼容。在逼真和风格化重新渲染、以及从仿真到现实的增强驾驶规划中，φ-PD 生成了可控且空间对齐的结果。当应用于CARLA仿真器时，φ-PD 将CARLA到Waymo规划器性能提高了50%。该方法与现有的条件方法互补，并广泛适用于图像到图像和视频到视频生成。有关视频、额外示例和代码，请参阅我们的项目页面：https://yuzeng-at-tri.github.io/ppd-page/

Summary / 总结

NeuralRemaster introduces Phase-Preserving Diffusion (φ-PD), which preserves input phase while randomizing magnitude, enabling structure-aligned generation without architectural changes. It proposes Frequency-Selective Structured (FSS) noise for continuous control over structural rigidity. φ-PD yields controllable, spatially aligned results for re-rendering and sim-to-real enhancement, and improves CARLA-to-Waymo planner performance by 50% when applied to the CARLA simulator, demonstrating its effectiveness in tasks requiring geometric consistency. Videos and code are available on the project page.

研究针对标准扩散模型中相位破坏的问题，该问题破坏了空间结构，不适合需要几何一致性的任务。研究引入了相位保持扩散（φ-PD），它保持输入相位的同时随机化幅度，无需额外参数即可实现结构对齐的生成。φ-PD 在重新渲染、模拟到现实增强以及 CARLA 模拟器中的规划器性能提升 50%，展示了其在各种应用中生成可控且空间对齐结果的有效性。
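
The core idea in the abstract, noise that keeps the input's Fourier phase while randomizing magnitude, can be illustrated on a 1D signal with the standard library alone. This is a minimal sketch under stated assumptions, not the authors' implementation: the magnitudes are borrowed from a Gaussian noise sample, and the `cutoff` behaviour (input phase kept at or below the cutoff, noise phase above it) is one plausible reading of FSS noise. All names are hypothetical; a practical version would use a 2D FFT library on images.

```python
import cmath
import math
import random


def dft(values):
    """Naive O(n^2) discrete Fourier transform (fine for a short demo signal)."""
    n = len(values)
    return [
        sum(values[k] * cmath.exp(-2j * math.pi * f * k / n) for k in range(n))
        for f in range(n)
    ]


def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [
        sum(spectrum[f] * cmath.exp(2j * math.pi * f * k / n) for f in range(n)) / n
        for k in range(n)
    ]


def phase_preserving_noise(signal, rng, cutoff=None):
    """Noise with Gaussian-noise magnitudes and the input signal's phases.

    cutoff (assumed FSS-style knob): frequencies farther than `cutoff` bins
    from DC take the noise phase instead, loosening structural alignment.
    cutoff=None keeps the input phase at every frequency.
    """
    n = len(signal)
    spec_signal = dft(signal)
    spec_noise = dft([rng.gauss(0.0, 1.0) for _ in range(n)])
    mixed = []
    for f in range(n):
        distance_from_dc = min(f, n - f)  # keeps the spectrum conjugate-symmetric
        source = spec_signal if cutoff is None or distance_from_dc <= cutoff else spec_noise
        mixed.append(cmath.rect(abs(spec_noise[f]), cmath.phase(source[f])))
    # The mixed spectrum is conjugate-symmetric, so the imaginary part is round-off.
    return [value.real for value in idft(mixed)]


if __name__ == "__main__":
    rng = random.Random(0)
    signal = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0, 3.0, 6.0]
    print(phase_preserving_noise(signal, rng))
    print(phase_preserving_noise(signal, rng, cutoff=1))
```

Because the phase carries most of the spatial layout, noise built this way stays aligned with the input structure, which matches the motivation given in the abstract.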

Semantic Soft Bootstrapping: Long Context Reasoning in LLMs without Reinforcement Learning

Authors: Purbesh Mitra, Sennur Ulukus

First: 2025-12-04T18:59:18+00:00 · Latest: 2025-12-04T18:59:18+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Long context reasoning in large language models (LLMs) has demonstrated enhancement of their cognitive capabilities via chain-of-thought (CoT) inference. Training such models is usually done via reinforcement learning with verifiable rewards (RLVR) on reasoning-based problems, like math and programming. However, RLVR is limited by several bottlenecks, such as a lack of dense rewards and inadequate sample efficiency. As a result, it requires significant compute resources in the post-training phase. To overcome these limitations, in this work, we propose Semantic Soft Bootstrapping (SSB), a self-distillation technique in which the same base language model plays the role of both teacher and student, but receives different semantic contexts about the correctness of its outcome at training time. The model is first prompted with a math problem and several rollouts are generated. From them, the correct and most common incorrect responses are filtered, and then provided to the model in context to produce a more robust, step-by-step explanation with a verified final answer. This pipeline automatically curates a paired teacher-student training set from raw problem-answer data, without any human intervention. This generation process also produces a sequence of logits, which is what the student model tries to match in the training phase from the bare question alone. In our experiments, we train Qwen2.5-3B-Instruct on the GSM8K dataset via parameter-efficient fine-tuning. We then test its accuracy on the MATH500 and AIME2024 benchmarks. Our experiments show accuracy improvements of 10.6% and 10%, respectively, over group relative policy optimization (GRPO), which is a commonly used RLVR algorithm. Our code is available at https://github.com/purbeshmitra/semantic-soft-bootstrapping, and the model and curated dataset are available at https://huggingface.co/purbeshmitra/semantic-soft-bootstrapping.

中文标题/摘要

标题：语义软自举：无需强化学习的LLM长上下文推理

大型语言模型（LLM）中的长上下文推理通过链式思考（CoT）推理展示了其认知能力的增强。这类模型的训练通常通过基于推理的问题中的可验证奖励（RLVR）强化学习来完成，例如数学和编程问题。然而，RLVR受到一些瓶颈的限制，如稀疏奖励和样本效率不足。因此，在训练后阶段需要大量计算资源。为克服这些限制，本文提出了一种语义软自举（SSB），这是一种自蒸馏技术，其中同一个基础语言模型在训练时扮演教师和学生的角色，但接收不同的关于其结果正确性的语义上下文。模型首先被提示一个数学问题，生成多个展开。从中筛选出正确的和最常见的错误回答，然后提供给模型以产生更稳健、逐步的解释和验证的最终答案。该流程自动从原始问题-答案数据中生成教师-学生训练集，无需任何人工干预。此生成过程还产生了一组logits，学生模型仅从原始问题中尝试匹配这些logits。在我们的实验中，我们使用参数高效微调在GSM8K数据集上对Qwen2.5-3B-Instruct进行训练。然后在MATH500和AIME2024基准测试上测试其准确性。我们的实验分别在准确率上比常用的RLVR算法GRPO提高了10.6%和10%。我们的代码可在https://github.com/purbeshmitra/semantic-soft-bootstrapping获取，模型和整理的数据集可在https://huggingface.co/purbeshmitra/semantic-soft-bootstrapping获取。

Summary / 总结

The research aims to enhance long context reasoning in large language models (LLMs) through a self-distillation technique called Semantic Soft Bootstrapping (SSB), which avoids the limitations of reinforcement learning with verifiable rewards (RLVR). SSB uses the same base model as both teacher and student, providing different semantic contexts about the correctness of its outcomes during training. This method automatically curates a training set from raw problem-answer data without human intervention, improving accuracy on MATH500 and AIME2024 benchmarks by 10.6% and 10% respectively, compared to GRPO, a commonly used RLVR algorithm.

研究旨在通过提出语义软自举（SSB）技术来增强大型语言模型（LLMs）在长上下文推理中的能力，该技术通过自蒸馏方法提高模型的稳健性和准确性，而不依赖于带有验证奖励的强化学习（RLVR）。该方法包括基模型为数学问题生成多个推理展开（rollouts），筛选出正确的和最常见的错误响应，并在上下文中提供这些信息以生成更详细和验证的解释。模型在GSM8K数据集上训练后，实验显示，与常用的GRPO算法相比，在MATH500和AIME2024基准测试上的准确率分别提高了10.6%和10%。
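
Two steps from the abstract lend themselves to a small sketch: filtering the rollouts down to one correct response plus the most common incorrect one, and training the student to match the teacher's logits. The code below is an illustration under assumptions, not the released implementation: rollouts are modelled as `(text, final_answer)` pairs, and the matching loss is taken to be a per-token KL divergence, which the abstract does not specify. All function names are hypothetical.

```python
import math
from collections import Counter


def pick_correct_and_common_wrong(rollouts, gold_answer):
    """Select one correct rollout and one rollout with the most common wrong answer.

    rollouts: list of (response_text, final_answer) pairs.
    Returns None when the group has no correct or no incorrect rollout,
    since no contrastive context can be built from it.
    """
    correct = next((r for r in rollouts if r[1] == gold_answer), None)
    wrong = [r for r in rollouts if r[1] != gold_answer]
    if correct is None or not wrong:
        return None
    most_common_answer, _ = Counter(answer for _, answer in wrong).most_common(1)[0]
    common_wrong = next(r for r in wrong if r[1] == most_common_answer)
    return correct, common_wrong


def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_loss(teacher_logits, student_logits):
    """Mean per-token KL(teacher || student); the loss form is an assumption."""
    total = 0.0
    for teacher_row, student_row in zip(teacher_logits, student_logits):
        p = softmax(teacher_row)
        q = softmax(student_row)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(teacher_logits)


if __name__ == "__main__":
    rollouts = [("...so 12", "12"), ("...so 15", "15"), ("...hence 15", "15"), ("...so 9", "9")]
    print(pick_correct_and_common_wrong(rollouts, gold_answer="12"))
    print(soft_label_loss([[2.0, 0.5, 0.1]], [[1.0, 1.0, 1.0]]))
```

The teacher logits would come from the model prompted with the question plus the selected pair, and the student logits from the same model prompted with the question alone.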

EvoIR: Towards All-in-One Image Restoration via Evolutionary Frequency Modulation

Authors: Jiaqi Ma, Shengkai Hu, Jun Wan, Jiaxing Huang, Lefei Zhang, Salman Khan

First: 2025-12-04T18:59:10+00:00 · Latest: 2025-12-04T18:59:10+00:00

Abs · PDF · Code1 · Code2

Abstract

All-in-One Image Restoration (AiOIR) tasks often involve diverse degradations that require robust and versatile strategies. However, most existing approaches lack explicit frequency modeling and rely on fixed or heuristic optimization schedules, which limits generalization across heterogeneous degradations. To address these limitations, we propose EvoIR, an AiOIR-specific framework that introduces evolutionary frequency modulation for dynamic and adaptive image restoration. Specifically, EvoIR employs the Frequency-Modulated Module (FMM), which explicitly decomposes features into high- and low-frequency branches and adaptively modulates them to enhance both structural fidelity and fine-grained details. Central to EvoIR, an Evolutionary Optimization Strategy (EOS) iteratively adjusts frequency-aware objectives through a population-based evolutionary process, dynamically balancing structural accuracy and perceptual fidelity. Its evolutionary guidance further mitigates gradient conflicts across degradations and accelerates convergence. By synergizing FMM and EOS, EvoIR yields greater improvements than using either component alone, underscoring their complementary roles. Extensive experiments on multiple benchmarks demonstrate that EvoIR outperforms state-of-the-art AiOIR methods.

中文标题/摘要

标题：EvoIR：通过进化频率调制实现一站式图像恢复

一站式图像恢复（AiOIR）任务通常涉及多种退化，需要稳健且通用的策略。然而，大多数现有方法通常缺乏显式的频率建模，并依赖于固定的或启发式的优化计划，这限制了其在异构退化中的泛化能力。为了解决这些限制，我们提出了EvoIR，这是一种针对AiOIR的特定框架，引入了进化频率调制以实现动态和自适应的图像恢复。具体而言，EvoIR 使用频率调制模块（FMM），以显式方式将特征分解为高频和低频分支，并自适应地调制它们以增强结构保真度和细粒度细节。EvoIR 的核心是一种进化优化策略（EOS），通过基于群体的进化过程迭代调整频率感知目标，动态平衡结构准确性和感知保真度。其进化的指导进一步缓解了退化之间的梯度冲突并加速了收敛。通过结合FMM和EOS，EvoIR 取得了比单独使用任一组件更大的改进，突显了它们的互补作用。在多个基准上的广泛实验表明，EvoIR 优于最先进的一站式图像恢复方法。

Summary / 总结

EvoIR is designed to handle diverse image degradation by introducing evolutionary frequency modulation. It uses a Frequency-Modulated Module (FMM) to decompose features into high- and low-frequency branches and adaptively modulate them. An Evolutionary Optimization Strategy (EOS) dynamically adjusts frequency-aware objectives, balancing structural accuracy and perceptual fidelity. Experiments show that EvoIR outperforms existing methods on multiple benchmarks.

EvoIR 通过引入进化频率调制来处理多样化的图像退化问题。它使用频率调制模块（FMM）将特征分解为高频和低频分支，并进行自适应调制。进化优化策略（EOS）动态调整频率感知目标，平衡结构准确性和感知保真度。实验表明，EvoIR 在多个基准测试中优于现有方法。
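
Two ingredients named above can be sketched generically: an explicit split of a signal into low- and high-frequency parts, and a population-based update of loss weights. The code below is a toy illustration under assumptions, not the FMM or EOS from the paper: the split uses a moving-average low-pass filter on a 1D signal, and the evolutionary step is plain truncation selection with Gaussian mutation. All names are hypothetical.

```python
import random


def split_frequencies(signal, radius=1):
    """Split a 1D signal into a smooth (low) part and a residual (high) part.

    low is a moving average over a window of up to 2 * radius + 1 samples;
    high = signal - low, so low + high reconstructs the input exactly.
    """
    low = []
    for i in range(len(signal)):
        window = signal[max(0, i - radius): i + radius + 1]
        low.append(sum(window) / len(window))
    high = [s - l for s, l in zip(signal, low)]
    return low, high


def evolve_weights(population, fitness, rng, sigma=0.1, keep=2):
    """One generation of a simple evolutionary search over loss weights.

    population: list of weight tuples, e.g. (w_structure, w_perceptual).
    fitness:    callable scoring a weight tuple, higher is better.
    The `keep` fittest tuples survive; the rest are replaced by mutated copies.
    """
    parents = sorted(population, key=fitness, reverse=True)[:keep]
    children = []
    while len(parents) + len(children) < len(population):
        base = rng.choice(parents)
        children.append(tuple(max(0.0, w + rng.gauss(0.0, sigma)) for w in base))
    return parents + children


if __name__ == "__main__":
    low, high = split_frequencies([1.0, 3.0, 5.0, 3.0, 1.0])
    print(low, high)
    rng = random.Random(0)
    population = [(1.0, 1.0), (0.5, 2.0), (2.0, 0.5), (0.1, 0.3)]
    # Toy fitness: prefer balanced weights (stand-in for a validation metric).
    print(evolve_weights(population, lambda w: -abs(w[0] - w[1]), rng))
```

In a real training loop the fitness would be a validation score obtained with each candidate weighting; how EvoIR defines it is not stated in this digest.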

TV2TV: A Unified Framework for Interleaved Language and Video Generation

Authors: Xiaochuang Han, Youssef Emad, Melissa Hall, John Nguyen, Karthik Padthe, Liam Robbins, Amir Bar, Delong Chen, Michal Drozdzal, Maha Elbayad, Yushi Hu, Shang-Wen Li, Sreya Dutta Roy, Jakob Verbeek, XuDong Wang, Marjan Ghazvininejad, Luke Zettlemoyer, Emily Dinan

First: 2025-12-04T18:59:09+00:00 · Latest: 2025-12-04T18:59:09+00:00

Abs · PDF · Code1 · Code2

Abstract

Video generation models are rapidly advancing, but can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should happen next. In this paper, we introduce a new class of omni video-text models that integrate ideas from recent LM reasoning advances to address this challenge. More specifically, we present TV2TV, a unified generative modeling framework which decomposes video generation into an interleaved text and video generation process. TV2TV jointly learns language modeling (next-token prediction) and video flow matching (next-frame prediction) using a Mixture-of-Transformers (MoT) architecture. At inference time, TV2TV decides when to alternate between generating text and video frames, allowing the model to "think in words" about subsequent content before "acting in pixels" to produce frames. This design offloads much of the responsibility for deciding what should happen next to the language modeling tower, enabling improved visual quality and prompt alignment of generated videos. It also enables fine-grained controllability, allowing users to modify the video generation trajectory through text interventions at any point in the process. In controlled experiments on video game data, TV2TV demonstrates substantial improvements in both visual quality and controllability. TV2TV also scales to natural videos, as we show by augmenting sports videos with interleaved natural language action descriptions using vision-language models (VLMs). Training TV2TV on this corpus yields strong visual quality and prompt alignment, showcasing the model's ability to reason about and generate complex real-world action sequences. Together, these results highlight TV2TV as a promising step toward video generation with open-ended textual reasoning and control.

中文标题/摘要

标题：TV2TV：一种统一的交错语言和视频生成框架

视频生成模型正在迅速发展，但仍可能在需要大量语义分支或反复进行下一步该发生什么的高层推理的复杂视频输出上遇到困难。在本文中，我们介绍了一类新的全能视频-文本模型，这些模型结合了最近语言模型推理进展的想法，以应对这一挑战。具体来说，我们提出了TV2TV，这是一种统一的生成建模框架，将视频生成分解为交错的语言和视频生成过程。TV2TV 使用混合的变换器（MoT）架构联合学习语言建模（下一个标记预测）和视频流匹配（下一个帧预测）。在推理时，TV2TV 决定何时在生成文本和视频帧之间交替，使模型能够在“用像素行动”生成帧之前，先“用语言思考”后续内容。这种设计将决定下一步该发生什么的责任大部分转移到了语言建模塔上，从而提高了生成视频的视觉质量和提示对齐。它还使细粒度的可控性成为可能，允许用户通过文本干预在过程中的任何点修改视频生成轨迹。在对视频游戏数据的受控实验中，TV2TV 在视觉质量和可控性方面都取得了显著的改进。TV2TV 还扩展到自然视频，正如我们通过使用视觉-语言模型（VLMs）交替自然语言动作描述来增强体育视频所展示的那样。在该语料库上训练 TV2TV 产生了强大的视觉质量和提示对齐，展示了该模型能够推理和生成复杂的现实世界动作序列的能力。这些结果共同突显了 TV2TV 作为视频生成中具有开放性文本推理和控制的有希望的一步。

Summary / 总结

TV2TV is a unified generative modeling framework that interleaves text and video generation processes, using a Mixture-of-Transformers architecture to jointly learn language modeling and video flow matching. At inference, TV2TV alternates between generating text and video frames, allowing for improved visual quality and better alignment with prompts. Experiments on video game data show significant improvements in visual quality and controllability, while scaling to natural videos demonstrates the model's ability to reason about and generate complex real-world action sequences.

TV2TV 是一种统一的生成模型框架，通过交错文本和视频生成过程来解决生成复杂视频输出的挑战。它使用混合变换器架构同时学习语言建模和视频流匹配，使模型能够在‘用像素行动’之前先‘用词思考’。在视频游戏数据上的实验显示，在视觉质量和可控性方面有显著改进，并通过视觉语言模型将动作描述集成到自然视频中。TV2TV 在视觉质量和提示对齐方面表现出色，表明其在视频生成中的开放文本推理和控制潜力。

BioAnalyst: A Foundation Model for Biodiversity

Authors: Athanasios Trantas, Martino Mensio, Stylianos Stasinos, Sebastian Gribincea, Taimur Khan, Damian Podareanu, Aliene van der Veen

First: 2025-07-11T23:56:08+00:00 · Latest: 2025-12-04T18:58:55+00:00

Abs · PDF · Code1 · Code2

Abstract

Multimodal Foundation Models (FMs) offer a path to learn general-purpose representations from heterogeneous ecological data, easily transferable to downstream tasks. However, practical biodiversity modelling remains fragmented; separate pipelines and models are built for each dataset and objective, which limits reuse across regions and taxa. In response, we present BioAnalyst, to our knowledge the first multimodal Foundation Model tailored to biodiversity analysis and conservation planning in Europe at 0.25° spatial resolution, targeting regional to national-scale applications. BioAnalyst employs a transformer-based architecture, pre-trained on extensive multimodal datasets that align species occurrence records with remote sensing indicators, climate and environmental variables. After pre-training, the model is adapted via lightweight roll-out fine-tuning to a range of downstream tasks, including joint species distribution modelling, biodiversity dynamics and population trend forecasting. The model is evaluated on two representative downstream use cases: (i) joint species distribution modelling with 500 vascular plant species and (ii) monthly climate linear probing with temperature and precipitation data. Our findings show that BioAnalyst can provide a strong baseline both for biotic and abiotic tasks, acting as a macroecological simulator with a yearly forecasting horizon and monthly resolution, offering the first application of this type of modelling in the biodiversity domain. We have open-sourced the model weights, training and fine-tuning pipelines to advance AI-driven ecological research.

中文标题/摘要

标题：BioAnalyst：生物多样性基础模型

多模态基础模型（FMs）提供了一条从异质生态数据中学习通用表示的途径，这些表示可以轻松地转移到下游任务中。然而，实际的生物多样性建模仍然支离破碎；每个数据集和目标都需要单独构建管道和模型，这限制了跨地区和类群的重用。为应对这一挑战，我们提出了BioAnalyst，据我们所知，这是第一个针对欧洲区域至国家规模应用的生物多样性分析和保护规划的多模态基础模型，空间分辨率为0.25°。BioAnalyst采用基于变换器的架构，预训练在广泛的多模态数据集上，将物种分布记录与遥感指标、气候和环境变量对齐。预训练后，通过轻量级滚动微调，模型适应到一系列下游任务，包括联合物种分布建模、生物多样性动态和种群趋势预测。该模型在两个代表性下游应用场景中进行了评估：（i）联合物种分布建模，涉及500种维管植物物种；（ii）月度气候线性探测，使用温度和降水量数据。我们的研究结果表明，BioAnalyst可以为生物和非生物任务提供强大的基线，作为宏观生态模拟器，具有每年的预测时间范围和月度分辨率，这是此类建模在生物多样性领域的首次应用。我们已开源了模型权重、训练和微调管道，以促进AI驱动的生态学研究。

Summary / 总结

BioAnalyst is a multimodal Foundation Model designed for biodiversity analysis and conservation planning in Europe, addressing the fragmented nature of existing modelling pipelines. It uses a transformer-based architecture pre-trained on extensive multimodal datasets, aligning species occurrence records with remote sensing and environmental variables. After pre-training, BioAnalyst is adapted via lightweight roll-out fine-tuning and evaluated on joint species distribution modelling and monthly climate linear probing. The model provides a strong baseline for both biotic and abiotic tasks, with a yearly forecasting horizon and monthly resolution, marking the first application of this type of modelling in the biodiversity domain.

BioAnalyst 是一个针对欧洲生物多样性分析和保护规划设计的多模态基础模型，旨在解决现有生物多样性建模管道的碎片化问题。它采用基于变换器的架构，预训练在广泛的多模态数据集上，将物种分布记录与遥感和环境变量对齐。预训练后，BioAnalyst 细调用于各种任务，如联合物种分布建模和气候线性探测。该模型在生物和非生物任务中表现出色，提供一年的预测窗口和月度分辨率，并且是该领域的首个此类建模应用。模型及其训练过程已开源，以促进进一步的AI驱动生态学研究。

Structured Document Translation via Format Reinforcement Learning

Authors: Haiyue Song, Johannes Eschbach-Dymanus, Hour Kaing, Sumire Honda, Hideki Tanaka, Bianka Buschbeck, Masao Utiyama

First: 2025-12-04T18:58:30+00:00 · Latest: 2025-12-04T18:58:30+00:00

Comments: IJCNLP-AACL 2025 Main (Oral)

Abs · PDF · Code1 · Code2

Abstract

Recent works on structured text translation remain limited to the sentence level, as they struggle to effectively handle the complex document-level XML or HTML structures. To address this, we propose Format Reinforcement Learning (FormatRL), which employs Group Relative Policy Optimization on top of a supervised fine-tuning model to directly optimize novel structure-aware rewards: 1) TreeSim, which measures structural similarity between predicted and reference XML trees, and 2) Node-chrF, which measures translation quality at the level of XML nodes. Additionally, we apply StrucAUC, a fine-grained metric distinguishing between minor errors and major structural failures. Experiments on the SAP software-documentation benchmark demonstrate improvements across six metrics, and an analysis further shows how different reward functions contribute to improvements in both structural and translation quality.

中文标题/摘要

标题：基于格式强化学习的结构化文档翻译

近期的结构化文本翻译工作主要集中在句子层面，因为它们难以有效处理复杂的文档级XML或HTML结构。为了解决这一问题，我们提出了**格式强化学习（FormatRL）**，它在监督微调模型的基础上使用组相对策略优化，直接优化新颖的结构感知奖励：1）TreeSim，用于测量预测和参考XML树之间的结构相似性；2）Node-chrF，用于衡量XML节点级别的翻译质量。此外，我们还应用了StrucAUC，这是一种细粒度的度量标准，能够区分轻微错误和重大结构失败。在SAP软件文档基准测试上的实验表明，在六个度量标准上均有所改进，并且进一步的分析表明不同的奖励函数如何在结构质量和翻译质量上都带来改进。

Summary / 总结

The research aims to improve structured text translation beyond the sentence level by addressing the challenges of XML or HTML document structures. It introduces Format Reinforcement Learning (FormatRL), which applies Group Relative Policy Optimization on top of a supervised fine-tuned model to optimize two structure-aware rewards, TreeSim and Node-chrF. Experiments on the SAP software-documentation benchmark show improvements across six metrics, indicating better structural and translation quality. StrucAUC is additionally applied as a fine-grained evaluation metric that distinguishes minor errors from major structural failures.

研究旨在通过解决现有方法仅关注句子级翻译且难以处理复杂文档级结构的问题，来改进结构化文本翻译。提出的Format Reinforcement Learning (FormatRL) 方法使用Group Relative Policy Optimization来优化结构感知奖励，包括用于结构相似度的TreeSim和用于XML节点级翻译质量的Node-chrF。研究还使用StrucAUC来区分轻微和重大错误。实验表明，在六个指标上均有所改进，突出了不同奖励函数在提高结构和翻译质量方面的有效性。
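
The FormatRL abstract above mentions Group Relative Policy Optimization over structure-aware rewards. As a rough illustration of the group-relative step only, the sketch below normalizes rewards within one group of sampled translations of the same source document. The function name, the example reward values, the `eps` constant, and the use of the population standard deviation are assumptions for illustration; the paper's actual reward combination and training loop are not reproduced here.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same input.

    Each sample's advantage is its reward minus the group mean, divided by
    the group standard deviation (eps avoids division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical scalar rewards for four sampled translations of one document,
# e.g. some mix of a structural score and a node-level translation score.
rewards = [0.9, 0.5, 0.7, 0.3]
advantages = group_relative_advantages(rewards)
```

Samples that score above the group mean receive positive advantages and the rest negative ones, so no separate value network is needed to provide a baseline.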

SA-IQA: Redefining Image Quality Assessment for Spatial Aesthetics with Multi-Dimensional Rewards

Authors: Yuan Gao, Jin Song

First: 2025-12-04T18:58:18+00:00 · Latest: 2025-12-04T18:58:18+00:00

Abs · PDF · Code1 · Code2

Abstract

In recent years, Image Quality Assessment (IQA) for AI-generated images (AIGI) has advanced rapidly; however, existing methods primarily target portraits and artistic images, lacking a systematic evaluation of interior scenes. We introduce Spatial Aesthetics, a paradigm that assesses the aesthetic quality of interior images along four dimensions: layout, harmony, lighting, and distortion. We construct SA-BENCH, the first benchmark for spatial aesthetics, comprising 18,000 images and 50,000 precise annotations. Employing SA-BENCH, we systematically evaluate current IQA methodologies and develop SA-IQA, through MLLM fine-tuning and a multidimensional fusion approach, as a comprehensive reward framework for assessing spatial aesthetics. We apply SA-IQA to two downstream tasks: (1) serving as a reward signal integrated with GRPO reinforcement learning to optimize the AIGC generation pipeline, and (2) Best-of-N selection to filter high-quality images and improve generation quality. Experiments indicate that SA-IQA significantly outperforms existing methods on SA-BENCH, setting a new standard for spatial aesthetics evaluation. Code and dataset will be open-sourced to advance research and applications in this domain.

中文标题/摘要

标题：SA-IQA：以多维度奖励重新定义空间美学的图像质量评估

近年来，人工智能生成图像（AIGI）的图像质量评估（IQA）取得了快速进展；然而，现有方法主要针对肖像和艺术图像，缺乏对室内场景的系统评估。我们引入了空间美学这一范式，从布局、和谐、照明和失真四个维度评估室内图像的美学质量。我们构建了SA-BENCH，这是首个空间美学基准，包含18,000张图像和50,000个精确注释。利用SA-BENCH，我们系统地评估了当前的IQA方法，并通过MLLM微调和多维度融合方法开发了SA-IQA，作为全面的奖励框架来评估空间美学。我们应用SA-IQA到两个下游任务：（1）作为与GRPO强化学习集成的奖励信号，优化AIGC生成流水线；（2）Best-of-N选择，筛选高质量图像，提高生成质量。实验表明，SA-IQA在SA-BENCH上显著优于现有方法，为空间美学评估设立了新标准。代码和数据集将开源，以促进该领域的研究和应用。

Summary / 总结

The research aims to evaluate the aesthetic quality of interior images by introducing a new paradigm called Spatial Aesthetics, which assesses images along four dimensions: layout, harmony, lighting, and distortion. The authors developed SA-BENCH, a benchmark dataset with 18,000 images and 50,000 annotations, and used it to evaluate existing IQA methods. They then created SA-IQA, a multidimensional reward framework, to assess spatial aesthetics more comprehensively. Experiments showed that SA-IQA outperformed existing methods on SA-BENCH, setting a new standard for spatial aesthetics evaluation. The method was applied to optimize AIGC generation and improve image quality through Best-of-N selection.

研究旨在通过引入新的美学评价范式Spatial Aesthetics，解决现有IQA方法对室内图像缺乏系统评价的问题，该范式从布局、和谐、照明和失真四个维度评估图像。研究构建了包含18,000张图像和50,000个标注的SA-BENCH基准，评估现有IQA方法并开发了SA-IQA多维度奖励框架。SA-IQA被应用于优化AIGC生成和筛选高质量图像，实验表明其在SA-BENCH上的表现优于现有方法。
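
The SA-IQA entry above describes scoring interior images along four dimensions and using the result for Best-of-N selection. A minimal sketch of that selection step follows, assuming each candidate already has a score in [0, 1] per dimension (higher is better, including for distortion) and that fusion is a plain weighted mean. The weights, scores, and function names are hypothetical; the paper's fusion approach may differ.

```python
DIMENSIONS = ("layout", "harmony", "lighting", "distortion")

def fused_score(scores, weights=None):
    """Weighted mean over the four dimension scores (equal weights by default)."""
    weights = weights or {d: 1.0 for d in DIMENSIONS}
    total = sum(weights[d] for d in DIMENSIONS)
    return sum(weights[d] * scores[d] for d in DIMENSIONS) / total

def best_of_n(candidates, weights=None):
    """Return the id of the candidate with the highest fused score."""
    return max(candidates, key=lambda cid: fused_score(candidates[cid], weights))

# Hypothetical per-dimension scores for three generated images.
candidates = {
    "img_a": {"layout": 0.8, "harmony": 0.6, "lighting": 0.7, "distortion": 0.9},
    "img_b": {"layout": 0.9, "harmony": 0.9, "lighting": 0.8, "distortion": 0.5},
    "img_c": {"layout": 0.5, "harmony": 0.7, "lighting": 0.6, "distortion": 0.8},
}

winner = best_of_n(candidates)
# Emphasizing the distortion dimension can change which image is selected.
strict = best_of_n(candidates, {"layout": 1, "harmony": 1, "lighting": 1, "distortion": 3})
```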

From Generated Human Videos to Physically Plausible Robot Trajectories

Authors: James Ni, Zekai Wang, Wei Lin, Amir Bar, Yann LeCun, Trevor Darrell, Jitendra Malik, Roei Herzig

First: 2025-12-04T18:56:03+00:00 · Latest: 2025-12-04T18:56:03+00:00

Comments: For project website, see https://genmimic.github.io

Abs · PDF · Code1 · Code2 · Project1

Abstract

Video generation models are rapidly improving in their ability to synthesize human actions in novel contexts, holding the potential to serve as high-level planners for contextual robot control. To realize this potential, a key research question remains open: how can a humanoid execute the human actions from generated videos in a zero-shot manner? This challenge arises because generated videos are often noisy and exhibit morphological distortions that make direct imitation difficult compared to real video. To address this, we introduce a two-stage pipeline. First, we lift video pixels into a 4D human representation and then retarget to the humanoid morphology. Second, we propose GenMimic, a physics-aware reinforcement learning policy conditioned on 3D keypoints and trained with symmetry regularization and keypoint-weighted tracking rewards. As a result, GenMimic can mimic human actions from noisy, generated videos. We curate GenMimicBench, a synthetic human-motion dataset generated using two video generation models across a spectrum of actions and contexts, establishing a benchmark for assessing zero-shot generalization and policy robustness. Extensive experiments demonstrate improvements over strong baselines in simulation and confirm coherent, physically stable motion tracking on a Unitree G1 humanoid robot without fine-tuning. This work offers a promising path to realizing the potential of video generation models as high-level policies for robot control.

中文标题/摘要

标题：从生成的人类视频到物理上可信的机器人轨迹

视频生成模型在合成新颖情境下的人类动作方面的能力正在迅速提高，这为作为上下文机器人控制的高级规划者提供了潜在用途。为了实现这一潜力，一个关键的研究问题仍然悬而未决：如何以零样本的方式让类人机器人执行生成视频中的人类动作？这一挑战源于生成的视频通常噪声较大且存在形态失真，使得直接模仿比真实视频更困难。为了解决这一问题，我们引入了一个两阶段的流水线。首先，我们将视频像素提升到4D的人类表示，然后重新定向到类人形态。其次，我们提出了一种GenMimic——一种基于3D关键点的物理感知强化学习策略，并通过对称正则化和关键点加权跟踪奖励进行训练。因此，GenMimic可以从噪声较大的生成视频中模仿人类动作。我们构建了GenMimicBench，这是一个使用两种视频生成模型生成的合成人类动作数据集，涵盖了各种动作和情境，为评估零样本泛化能力和策略鲁棒性建立了基准。广泛的实验表明，与强大的基线相比，在模拟中有所改进，并且在无需微调的情况下，GenMimic能够在Unitree G1类人机器人上实现连贯且物理稳定的运动跟踪。这项工作为实现视频生成模型作为机器人控制高级策略的潜力提供了一条有希望的道路。

Summary / 总结

The research aims to enable robots to execute human actions from generated videos in a zero-shot manner, addressing the challenges of noise and morphological distortions in generated videos. A two-stage pipeline is introduced, first lifting video pixels into a 4D human representation and then retargeting to the humanoid morphology. The GenMimic policy, trained with symmetry regularization and keypoint-weighted tracking rewards, successfully mimics human actions from noisy videos. Experiments show improvements over strong baselines in simulation and on a Unitree G1 humanoid robot, confirming coherent and physically stable motion tracking without fine-tuning.

该研究解决了将生成视频中的人类动作转化为物理上可行的机器人轨迹的挑战。它提出了一种两阶段管道：首先将视频像素提升到4D人体表示并重新目标到人形形态，然后提出GenMimic，一种基于3D关键点的物理感知强化学习策略。实验表明，GenMimic可以从嘈杂的生成视频中有效模仿人类动作，实现与Unitree G1人形机器人的一致且物理稳定的运动跟踪，无需微调。该工作建立了GenMimicBench，一个合成数据集，用于评估零样本泛化和策略鲁棒性。

Foundations of Diffusion Models in General State Spaces: A Self-Contained Introduction

Authors: Vincent Pauline, Tobias Höppe, Kirill Neklyudov, Alexander Tong, Stefan Bauer, Andrea Dittadi

First: 2025-12-04T18:55:36+00:00 · Latest: 2025-12-04T18:55:36+00:00

Abs · PDF · Code1 · Code2

Abstract

Although diffusion models now occupy a central place in generative modeling, introductory treatments commonly assume Euclidean data and seldom clarify their connection to discrete-state analogues. This article is a self-contained primer on diffusion over general state spaces, unifying continuous domains and discrete/categorical structures under one lens. We develop the discrete-time view (forward noising via Markov kernels and learned reverse dynamics) alongside its continuous-time limits -- stochastic differential equations (SDEs) in $\mathbb{R}^d$ and continuous-time Markov chains (CTMCs) on finite alphabets -- and derive the associated Fokker--Planck and master equations. A common variational treatment yields the ELBO that underpins standard training losses. We make explicit how forward corruption choices -- Gaussian processes in continuous spaces and structured categorical transition kernels (uniform, masking/absorbing and more) in discrete spaces -- shape reverse dynamics and the ELBO. The presentation is layered for three audiences: newcomers seeking a self-contained intuitive introduction; diffusion practitioners wanting a global theoretical synthesis; and continuous-diffusion experts looking for an analogy-first path into discrete diffusion. The result is a unified roadmap to modern diffusion methodology across continuous domains and discrete sequences, highlighting a compact set of reusable proofs, identities, and core theoretical principles.

中文标题/摘要

标题：一般状态空间扩散模型的基础：一个自包含的介绍

尽管扩散模型现在在生成建模中占据中心地位，但入门性介绍通常假设欧几里得数据，并且很少阐明其与离散状态对应物的联系。本文是关于一般状态空间上扩散的一个自包含入门，在同一视角下统一了连续域和离散/分类结构。我们阐述了离散时间视角（通过马尔可夫核的前向加噪和学习到的反向动力学）及其连续时间极限，即 $\mathbb{R}^d$ 中的随机微分方程（SDEs）和有限字母表上的连续时间马尔可夫链（CTMCs），并推导了相关的福克-普朗克方程和主方程。一种共同的变分处理给出了支撑标准训练损失的ELBO。我们明确说明了前向污染选择（连续空间中的高斯过程，以及离散空间中的结构化分类转换核，如均匀、遮蔽/吸收等）如何塑造反向动力学和ELBO。本文分层面向三类受众：寻求自包含直观介绍的新手；希望获得全局理论综合的扩散实践者；以及希望从类比入手了解离散扩散的连续扩散专家。结果是跨连续域和离散序列的现代扩散方法论的统一路线图，突显了一组紧凑的可重用证明、恒等式和核心理论原则。

Summary / 总结

This paper provides a comprehensive introduction to diffusion models in general state spaces, unifying continuous and discrete structures. It develops the discrete-time and continuous-time views of diffusion processes, including forward noising via Markov kernels and reverse dynamics learned from data, and derives the associated Fokker-Planck and master equations. Key findings include how different forward corruption choices in continuous and discrete spaces affect reverse dynamics and the evidence lower bound (ELBO) used for training. The paper serves as a foundational resource for newcomers, practitioners, and continuous-diffusion experts, offering a unified theoretical framework for diffusion models.

本文旨在提供连续和离散状态空间中扩散模型的全面介绍，解决现有处理中的不清晰问题。作者结合离散时间和连续时间的观点，推导出福克-普朗克方程和主方程，并解释不同的前向污染选择如何影响反向动力学和证据下界（ELBO）。该工作针对初学者、扩散模型从业者和连续扩散专家，提供了一个清晰而连贯的理论基础。
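
For orientation, the two evolution equations named in the abstract above can be written in their standard textbook forms. The notation here (drift $f$, scalar diffusion coefficient $g$, rate matrix $Q_t$ with $Q_t(x, y)$ the rate of jumping from $x$ to $y$) is an assumption for illustration and may not match the paper's conventions.

```latex
% Forward SDE on R^d and the Fokker--Planck equation for its marginal density p_t
\begin{align}
  \mathrm{d}X_t &= f(X_t, t)\,\mathrm{d}t + g(t)\,\mathrm{d}W_t, \\
  \partial_t p_t(x) &= -\nabla \cdot \big( f(x, t)\, p_t(x) \big)
                       + \tfrac{1}{2}\, g(t)^2\, \Delta p_t(x).
\end{align}
% CTMC on a finite alphabet: master (Kolmogorov forward) equation,
% with each row of Q_t summing to zero, \sum_y Q_t(x, y) = 0
\begin{align}
  \partial_t p_t(y) &= \sum_{x} p_t(x)\, Q_t(x, y).
\end{align}
```

In both cases the equation describes how the marginal distribution of the forward (noising) process evolves, which is the sense in which the continuous and discrete settings are analogous.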

Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark

Authors: Haobo Yuan, Yueyi Sun, Yanwei Li, Tao Zhang, Xueqing Deng, Henghui Ding, Lu Qi, Anran Wang, Xiangtai Li, Ming-Hsuan Yang

First: 2025-12-04T18:55:34+00:00 · Latest: 2025-12-04T18:55:34+00:00

Comments: Technical Report; Project Page: https://harboryuan.github.io/visual-reasoning-tracer

Abs · PDF · Code1 · Code2 · Project1

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved performance on tasks such as visual grounding and visual question answering. However, the reasoning processes of these models remain largely opaque; they typically output only final predictions without revealing the intermediate steps or fine-grained evidence (e.g., pixels, locations) that lead to the result. This contrasts with human intelligence, which naturally operates through a chain of visual reasoning. To address this limitation, we introduce the Visual Reasoning Tracer (VRT) task, which requires models to not only localize the target object but also explicitly predict the intermediate objects that form the reasoning path. To advance research in this area, we contribute: (1) VRT-Bench, a human-annotated benchmark for evaluating visual reasoning; (2) a new metric for assessing the quality of reasoning traces; and (3) VRT-80k, a large-scale dataset for reasoning model training. Our experiments reveal that while existing models often produce the correct final output, they struggle to ground their intermediate reasoning. In contrast, models trained on VRT-80k achieve substantial improvements in tracing the reasoning path.

中文标题/摘要

标题：视觉推理追踪器：基于对象的视觉推理基准

近年来，多模态大型语言模型（MLLMs）在视觉定位和视觉问答等任务上的表现显著提升。然而，这些模型的推理过程仍然相当不透明；它们通常只输出最终预测，而不揭示导致结果的中间步骤或细粒度证据（例如，像素、位置）。这与人类智能自然通过视觉推理链进行运作的方式形成对比。为解决这一局限，我们引入了视觉推理追踪器（VRT）任务，要求模型不仅要定位目标对象，还要明确预测形成推理路径的中间对象。为了推动该领域的研究，我们贡献了：(1) VRT-Bench，一个由人工标注的视觉推理基准；(2) 一种评估推理轨迹质量的新指标；以及(3) VRT-80k，一个大规模数据集用于推理模型训练。我们的实验表明，尽管现有模型通常能产生正确的最终输出，但在定位中间推理方面却存在困难。相比之下，使用VRT-80k训练的模型在追踪推理路径方面取得了显著进步。

Summary / 总结

The research aims to enhance the transparency of Multimodal Large Language Models (MLLMs) by introducing the Visual Reasoning Tracer (VRT) task, which requires models to explicitly predict intermediate reasoning steps. The study contributes VRT-Bench, a benchmark for evaluating visual reasoning, a new reasoning trace metric, and VRT-80k, a large dataset for training reasoning models. Experiments show that existing models often produce correct final outputs but struggle to ground intermediate reasoning, while models trained on VRT-80k show significant improvements in tracing the reasoning path.

研究旨在通过引入视觉推理追踪（VRT）任务来增强多模态大型语言模型（MLLMs）的透明度，该任务要求模型预测中间推理步骤。研究贡献了VRT-Bench，一个用于评估视觉推理的基准，一个新的推理轨迹质量评估指标，以及VRT-80k，一个用于训练推理模型的大规模数据集。实验表明，现有模型通常能产生正确的最终输出，但在追踪中间推理方面存在困难，而使用VRT-80k训练的模型在这一方面显示出显著的改进。

Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression

Authors: Jung Yi, Wooseok Jang, Paul Hyunbin Cho, Jisu Nam, Heeji Yoon, Seungryong Kim

First: 2025-12-04T18:46:44+00:00 · Latest: 2025-12-04T18:46:44+00:00

Comments: Project Page: https://cvlab-kaist.github.io/DeepForcing/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Recent advances in autoregressive video diffusion have enabled real-time frame streaming, yet existing solutions still suffer from temporal repetition, drift, and motion deceleration. We find that naively applying StreamingLLM-style attention sinks to video diffusion leads to fidelity degradation and motion stagnation. To overcome this, we introduce Deep Forcing, which consists of two training-free mechanisms that address this without any fine-tuning. Specifically, 1) Deep Sink dedicates half of the sliding window to persistent sink tokens and re-aligns their temporal RoPE phase to the current timeline, stabilizing global context during long rollouts. 2) Participative Compression performs importance-aware KV cache pruning that preserves only tokens actively participating in recent attention while safely discarding redundant and degraded history, minimizing error accumulation under out-of-distribution length generation. Together, these components enable over 12x extrapolation (e.g. 5s-trained to 60s+ generation) with better imaging quality than LongLive, better aesthetic quality than RollingForcing, almost maintaining overall consistency, and substantial gains in dynamic degree, all while maintaining real-time generation. Our results demonstrate that training-free KV-cache management can match or exceed training-based approaches for autoregressively streaming long-video generation.

中文标题/摘要

标题：深度强迫：无需训练的长视频生成方法

近期自回归视频扩散技术的进步使得实时帧流成为可能，但现有解决方案仍然存在时间重复、漂移和运动减速的问题。我们发现，简单地将类似于StreamingLLM的注意力池应用于视频扩散会导致保真度下降和运动停滞。为克服这一问题，我们引入了深度强迫，这是一种无需训练的机制，能够解决这一问题而不进行任何微调。具体来说，1) 深度池化将滑动窗口的一半用于持久池化令牌，并重新对齐它们的时空RoPE相位以匹配当前时间线，从而在长时间生成过程中稳定全局上下文。2) 参与式压缩执行重要性感知的KV缓存剪枝，仅保留最近参与注意力的活跃令牌，同时安全地丢弃冗余和退化的历史记录，从而在生成超出分布长度时最小化误差累积。这些组件结合在一起，使生成能力提高了超过12倍（例如，5秒训练到60秒以上的生成），同时保持了更好的成像质量，更好的美学质量，几乎保持了整体一致性，并在动态程度上取得了显著进步，同时保持了实时生成。我们的结果表明，无需训练的KV缓存管理可以与基于训练的方法相媲美或超越自回归流式长视频生成。

Summary / 总结

Deep Forcing is a training-free method for long video generation that addresses the temporal repetition, drift, and motion deceleration seen in existing autoregressive video diffusion solutions. It introduces two mechanisms: Deep Sink, which dedicates half of the sliding window to persistent sink tokens and re-aligns their temporal RoPE phase to stabilize global context, and Participative Compression, which prunes the KV cache to keep only tokens actively participating in recent attention. Together they enable over 12x extrapolation (e.g., 5s-trained to 60s+ generation) with better imaging quality than LongLive and better aesthetic quality than RollingForcing, while nearly maintaining overall consistency, improving dynamic degree, and preserving real-time generation.

Deep Forcing 是一种无需训练的方法，用于解决现有长视频生成方案中的时间重复和运动减速问题。它引入了两种机制：Deep Sink 通过重新对齐持久的 sink 标记来稳定全局上下文，而 Participative Compression 则通过保留仅参与最近注意力的标记来修剪 KV 缓存，从而减少误差累积。这种方法使视频生成的倍数超过 12 倍，具有更好的成像质量和美学质量，同时保持实时生成能力。
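
The two cache mechanisms described in the Deep Forcing entry above can be pictured with a toy pruning routine. This is only a sketch under stated assumptions: sink tokens are taken to be the earliest cached tokens, `importance` is a hypothetical per-token score (for example, attention mass received from recent queries), and the RoPE phase re-alignment is not modeled at all. It is not the authors' implementation.

```python
def prune_kv_cache(token_ids, importance, window):
    """Keep at most `window` tokens from a cache listed oldest-first.

    The first half of the window is reserved for persistent sink tokens
    (assumed here to be the earliest tokens); the remaining slots go to the
    non-sink tokens with the highest importance, kept in temporal order.
    """
    if len(token_ids) <= window:
        return list(token_ids)
    n_sink = window // 2
    sinks = list(token_ids[:n_sink])
    rest = list(token_ids[n_sink:])
    n_keep = window - n_sink
    # Select the most important non-sink tokens, then restore temporal order.
    top = set(sorted(rest, key=lambda t: importance[t], reverse=True)[:n_keep])
    return sinks + [t for t in rest if t in top]

# Hypothetical cache of ten token positions with made-up importance scores.
importance = {0: 0.5, 1: 0.4, 2: 0.3, 3: 0.1, 4: 0.9,
              5: 0.2, 6: 0.05, 7: 0.8, 8: 0.3, 9: 0.7}
kept = prune_kv_cache(list(range(10)), importance, window=6)
```

With a window of six, positions 0 to 2 stay as sinks and positions 4, 7, and 9 are retained because they carry the highest scores among the rest.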

OMTRA: A Multi-Task Generative Model for Structure-Based Drug Design

Authors: Ian Dunn, Liv Toft, Tyler Katz, Juhi Gupta, Riya Shah, Ramith Hettiarachchi, David R. Koes

First: 2025-12-04T18:46:35+00:00 · Latest: 2025-12-04T18:46:35+00:00

Comments: Presented at the Machine Learning for Structural Biology Workshop, 2025

Abs · PDF · Code1 · Code2 · Code3

Abstract

Structure-based drug design (SBDD) focuses on designing small-molecule ligands that bind to specific protein pockets. Computational methods are integral in modern SBDD workflows and often make use of virtual screening methods via docking or pharmacophore search. Modern generative modeling approaches have focused on improving novel ligand discovery by enabling de novo design. In this work, we recognize that these tasks share a common structure and can therefore be represented as different instantiations of a consistent generative modeling framework. We propose a unified approach in OMTRA, a multi-modal flow matching model that flexibly performs many tasks relevant to SBDD, including some with no analogue in conventional workflows. Additionally, we curate a dataset of 500M 3D molecular conformers, complementing protein-ligand data and expanding the chemical diversity available for training. OMTRA obtains state-of-the-art performance on pocket-conditioned de novo design and docking; however, the effects of large-scale pretraining and multi-task training are modest. All code, trained models, and dataset for reproducing this work are available at https://github.com/gnina/OMTRA

中文标题/摘要

标题：OMTRA：基于结构的药物设计的多任务生成模型

基于结构的药物设计（SBDD）专注于设计与特定蛋白质口袋结合的小分子配体。现代SBDD工作流程中的计算方法至关重要，通常利用对接或药效团搜索的虚拟筛选方法。现代生成模型方法侧重于通过支持从头设计来提高新颖配体的发现。在本研究中，我们认识到这些任务具有共同结构，因此可以作为一致的生成建模框架的不同实例来表示。我们提出了一种统一的方法，即OMTRA，这是一种多模态流匹配模型，灵活地执行与SBDD相关的多种任务，包括一些在传统工作流程中没有对应的任务。此外，我们整理了一个包含5亿个3D分子构象的数据集，补充了蛋白质-配体数据，扩展了可用于训练的化学多样性。OMTRA在口袋条件下的从头设计和对接方面取得了最先进的性能；然而，大规模预训练和多任务训练的效果有限。所有代码、训练模型和数据集均可在https://github.com/gnina/OMTRA下载。

Summary / 总结

OMTRA is a multi-task generative model designed for structure-based drug design, integrating various tasks such as de novo ligand design and docking. It leverages a unified framework to handle these tasks efficiently. The model was trained on a large dataset of 500M 3D molecular conformers, enhancing chemical diversity. While OMTRA achieves state-of-the-art performance on pocket-conditioned de novo design and docking, the benefits of large-scale pretraining and multi-task training are limited.

OMTRA 是一种用于结构基于药物设计的多任务生成模型，整合了如从头设计配体和对接等任务。它利用统一框架高效执行这些任务。模型通过包含5亿个3D分子构象的大数据集进行训练，增强化学多样性。尽管 OMTRA 在口袋条件下的从头设计和对接方面达到了最先进的性能，但大规模预训练和多任务训练的效果有限。

Object Reconstruction under Occlusion with Generative Priors and Contact-induced Constraints

Authors: Minghan Zhu, Zhiyi Wang, Qihang Sun, Maani Ghaffari, Michael Posa

First: 2025-12-04T18:45:14+00:00 · Latest: 2025-12-04T18:45:14+00:00

Comments: Project page: https://contactgen3d.github.io/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Object geometry is key information for robot manipulation. Yet, object reconstruction is a challenging task because cameras only capture partial observations of objects, especially when occlusion occurs. In this paper, we leverage two extra sources of information to reduce the ambiguity of vision signals. First, generative models learn priors of the shapes of commonly seen objects, allowing us to make reasonable guesses of the unseen part of geometry. Second, contact information, which can be obtained from videos and physical interactions, provides sparse constraints on the boundary of the geometry. We combine the two sources of information through contact-guided 3D generation. The guidance formulation is inspired by drag-based editing in generative models. Experiments on synthetic and real-world data show that our approach improves the reconstruction compared to pure 3D generation and contact-based optimization.

中文标题/摘要

标题：遮挡下基于生成先验和接触诱导约束的物体重建

物体几何形状是机器人操作的关键信息。然而，物体重建是一个具有挑战性的任务，因为相机只能捕捉到物体的部分观察结果，尤其是在发生遮挡时。在本文中，我们利用两种额外的信息来源来减少视觉信号的不确定性。首先，生成模型学习常见物体形状的先验，使我们能够合理猜测未见部分的几何形状。其次，可以从视频和物理交互中获得的接触信息为几何体的边界提供了稀疏约束。我们通过接触引导的3D生成将两种信息源结合起来。指导公式受到生成模型中基于拖动编辑的启发。在合成和真实世界数据上的实验表明，我们的方法在重建方面优于纯3D生成和基于接触的优化。

Summary / 总结

This paper addresses the challenge of reconstructing object geometry under occlusion for robot manipulation. It proposes a method that combines generative priors from commonly seen objects and contact-induced constraints to reduce the ambiguity of partial observations. Experiments show that this approach outperforms pure 3D generation and contact-based optimization methods in both synthetic and real-world scenarios.

研究旨在通过结合生成先验和接触诱导约束来改善遮挡情况下的物体重建。方法结合了学习常见物体形状的生成模型与来自视频和物理交互的接触信息来引导3D生成。实验表明，该方法在合成和真实世界数据中均优于单纯的3D生成和基于接触的优化方法。

BulletTime: Decoupled Control of Time and Camera Pose for Video Generation

Authors: Yiming Wang, Qihang Zhang, Shengqu Cai, Tong Wu, Jan Ackermann, Zhengfei Kuang, Yang Zheng, Frano Rajič, Siyu Tang, Gordon Wetzstein

First: 2025-12-04T18:40:52+00:00 · Latest: 2025-12-04T18:40:52+00:00

Comments: Project Page: https://19reborn.github.io/Bullet4D/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Emerging video diffusion models achieve high visual fidelity but fundamentally couple scene dynamics with camera motion, limiting their ability to provide precise spatial and temporal control. We introduce a 4D-controllable video diffusion framework that explicitly decouples scene dynamics from camera pose, enabling fine-grained manipulation of both scene dynamics and camera viewpoint. Our framework takes continuous world-time sequences and camera trajectories as conditioning inputs, injecting them into the video diffusion model through a 4D positional encoding in the attention layer and adaptive normalizations for feature modulation. To train this model, we curate a unique dataset in which temporal and camera variations are independently parameterized; this dataset will be made public. Experiments show that our model achieves robust real-world 4D control across diverse timing patterns and camera trajectories, while preserving high generation quality and outperforming prior work in controllability. See our website for video results: https://19reborn.github.io/Bullet4D/

中文标题/摘要

标题：BulletTime：解耦时间与摄像机姿态控制的视频生成框架

新兴的视频扩散模型实现了高视觉保真度，但本质上将场景动态与摄像机运动耦合在一起，限制了其提供精确的空间和时间控制的能力。我们提出了一种4D可控视频扩散框架，明确地将场景动态与摄像机姿态解耦，从而能够对场景动态和摄像机视角进行精细操控。该框架以连续的世界时间和摄像机轨迹作为条件输入，通过注意力层中的4D位置编码和自适应归一化将它们注入视频扩散模型中。为了训练该模型，我们创建了一个独特的数据集，其中时间和摄像机变化独立参数化；该数据集将公开发布。实验表明，我们的模型能够在多种时间模式和摄像机轨迹下实现稳健的4D控制，同时保持高质量的生成效果，并在可控性方面优于先前的工作。请访问我们的网站查看视频结果：https://19reborn.github.io/Bullet4D/

Summary / 总结

The research aims to decouple scene dynamics from camera motion in video generation to achieve precise control over time and camera pose. The method uses a 4D-controllable video diffusion framework that incorporates continuous world-time sequences and camera trajectories through 4D positional encoding and adaptive normalizations. Key findings show that the model can robustly control 4D aspects of videos across various timing patterns and camera trajectories while maintaining high generation quality and outperforming previous methods in controllability. The authors state that the training dataset, in which temporal and camera variations are independently parameterized, will be made public.

研究旨在通过解耦场景动态和摄像机运动来实现精确的空间和时间控制。方法是采用一个4D可控的视频扩散框架，使用连续的世界时间和摄像机轨迹作为条件输入，通过4D位置编码和自适应归一化注入。实验表明，该模型可以稳健地控制时间模式和摄像机轨迹，同时保持高质量的生成效果，并在可控性方面优于先前的方法。

David vs. Goliath: Can Small Models Win Big with Agentic AI in Hardware Design?

Authors: Shashwat Shankar, Subhranshu Pandey, Innocent Dengkhw Mochahari, Bhabesh Mali, Animesh Basak Chowdhury, Sukanta Bhattacharjee, Chandan Karfa

First: 2025-12-04T18:37:29+00:00 · Latest: 2025-12-04T18:37:29+00:00

Abs · PDF · Code1 · Code2

Abstract

Large Language Model (LLM) inference demands massive compute and energy, making domain-specific tasks expensive and unsustainable. As foundation models keep scaling, we ask: Is bigger always better for hardware design? Our work tests this by evaluating Small Language Models coupled with a curated agentic AI framework on NVIDIA's Comprehensive Verilog Design Problems (CVDP) benchmark. Results show that agentic workflows, through task decomposition, iterative feedback, and correction, not only unlock near-LLM performance at a fraction of the cost but also create learning opportunities for agents, paving the way for efficient, adaptive solutions in complex design tasks.

中文标题/摘要

标题：大卫与歌利亚：小模型在硬件设计中借助代理AI能否赢得胜利？

大型语言模型（LLM）推理需要巨大的计算能力和能源，使得特定领域任务昂贵且不可持续。随着基础模型不断扩展，我们不禁质疑：对于硬件设计而言，更大是否总是更好？我们的研究通过在NVIDIA的综合Verilog设计问题（CVDP）基准测试上评估小语言模型与定制代理AI框架的结合来检验这一问题。结果表明，通过任务分解、迭代反馈和修正，代理工作流程不仅能够在极低的成本下实现接近LLM的性能，还能为代理提供学习机会，从而为复杂设计任务开辟高效、自适应的解决方案之路。

Summary / 总结

This study investigates whether small models can rival large models in hardware design by leveraging an agentic AI framework. The research evaluates small language models on NVIDIA's CVDP benchmark and demonstrates that these models, through task decomposition, iterative feedback, and correction, can achieve near-LLM performance at a fraction of the cost and provide learning opportunities for agents, suggesting a path to efficient and adaptive solutions in complex design tasks.

研究探讨了是否可以通过使用代理AI框架，使较小的模型在硬件设计中达到与大型模型相当的性能。通过在NVIDIA的CVDP基准测试上评估小型语言模型，研究发现这些模型通过任务分解、迭代反馈和修正，可以实现接近大型模型的性能，同时为代理提供学习机会，表明在复杂设计任务中可能实现高效和自适应的解决方案。

MORPH: PDE Foundation Models with Arbitrary Data Modality

Authors: Mahindra Singh Rautela, Alexander Most, Siddharth Mansingh, Bradley C. Love, Ayan Biswas, Diane Oyen, Earl Lawrence

First: 2025-09-25T22:38:36+00:00 · Latest: 2025-12-04T18:36:29+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

We introduce MORPH, a modality-agnostic, autoregressive foundation model for partial differential equations (PDEs). MORPH is built on a convolutional vision transformer backbone that seamlessly handles heterogeneous spatiotemporal datasets of varying data modality (1D--3D) at different resolutions, and multiple fields with mixed scalar and vector components. The architecture combines (i) component-wise convolution, which jointly processes scalar and vector channels to capture local interactions, (ii) inter-field cross-attention, which models and selectively propagates information between different physical fields, (iii) axial attentions, which factorize full spatiotemporal self-attention along individual spatial and temporal axes to reduce computational burden while retaining expressivity. We pretrain multiple model variants on a diverse collection of heterogeneous PDE datasets and evaluate transfer to a range of downstream prediction tasks. Using both full-model fine-tuning and parameter-efficient low-rank adapters (LoRA), MORPH outperforms models trained from scratch. Across extensive evaluations, MORPH matches or surpasses strong baselines and recent state-of-the-art models. Collectively, these capabilities present a flexible and powerful backbone for learning from the heterogeneous and multimodal nature of scientific observations, charting a path toward scalable and data-efficient scientific machine learning. The source code, datasets, and models are publicly available at https://github.com/lanl/MORPH.

中文标题/摘要

标题：MORPH：任意数据模态的偏微分方程基础模型

我们介绍了MORPH，一种针对偏微分方程（PDEs）的模态无关自回归基础模型。MORPH基于卷积视觉变换器骨干网络，能够无缝处理不同数据模态（1D-3D）和不同分辨率的异质时空数据集，以及具有混合标量和矢量分量的多个字段。该架构结合了(i) 组件卷积，联合处理标量和矢量通道以捕捉局部交互，(ii) 交叉注意力，建模并选择性地传播不同物理场之间的信息，(iii) 轴向注意力，沿个体空间和时间轴分解全时空自注意力，以减少计算负担同时保持表达能力。我们使用多样化的异质PDE数据集对多个模型变体进行预训练，并评估其在一系列下游预测任务中的迁移性能。使用全模型微调和参数高效低秩适配器（LoRA），MORPH优于从头训练的模型。在广泛的评估中，MORPH匹配或超越了强大的基线和最近的先进模型。这些能力共同展示了学习科学观测的异构和多模态性质的灵活而强大的基础架构，为可扩展和数据高效的科学机器学习铺平了道路。源代码、数据集和模型可在https://github.com/lanl/MORPH/ 获取。

Summary / 总结

MORPH is an autoregressive foundation model for partial differential equations (PDEs) that can handle various data modalities and resolutions. It uses a convolutional vision transformer backbone with component-wise convolution, inter-field cross-attention, and axial attentions to capture local interactions and model information between different physical fields. MORPH outperforms models trained from scratch and matches or surpasses strong baselines in diverse PDE prediction tasks, demonstrating its capability in learning from heterogeneous and multimodal scientific data efficiently. The source code, datasets, and models are publicly available.

MORPH 是一个用于偏微分方程 (PDE) 的自回归基础模型，能够处理多种数据模态和分辨率。它使用卷积视觉变换器骨干网络，并结合组件级卷积、跨域交叉注意力和轴向注意力来捕捉局部交互并建模不同物理域之间的信息。MORPH 在各种 PDE 预测任务中表现出色，超越了从头开始训练的模型，并且能够匹配或超越强大的基线模型，展示了其在学习异构和多模态科学数据方面的高效能力。源代码、数据集和模型已公开可用。

Control Consistency Losses for Diffusion Bridges

Authors: Samuel Howard, Nikolas Nüsken, Jakiw Pidstrigach

Venue: NeurIPS 2025 Workshop (Oral)

First: 2025-12-04T18:31:39+00:00 · Latest: 2025-12-04T18:31:39+00:00

Comments: Frontiers in Probabilistic Inference: Sampling Meets Learning Workshop at NeurIPS 2025 (Oral)

Abs · PDF · Code1 · Code2

Abstract

Simulating the conditioned dynamics of diffusion processes, given their initial and terminal states, is an important but challenging problem in the sciences. The difficulty is particularly pronounced for rare events, for which the unconditioned dynamics rarely reach the terminal state. In this work, we leverage a self-consistency property of the conditioned dynamics to learn the diffusion bridge in an iterative online manner, and demonstrate promising empirical results in a range of settings.

中文标题/摘要

标题：扩散桥的控制一致性损失

给定初始和终止状态模拟扩散过程的条件动态是一个在科学中既重要又具有挑战性的问题。对于罕见事件而言，这一困难尤为突出，因为未条件化的动态过程很少能达到终止状态。在本文中，我们利用条件动态的自一致性性质，以迭代在线的方式学习扩散桥梁，并在多种设置中展示了有希望的实证结果。

Summary / 总结

This paper addresses the challenge of simulating conditioned dynamics of diffusion processes given initial and terminal states, especially for rare events. The authors propose an iterative online method leveraging the self-consistency property of conditioned dynamics to learn the diffusion bridge. Their approach shows promising empirical results across various settings.

本文解决了给定初始和终端状态下的扩散过程条件动态模拟问题，特别是对于罕见事件。作者提出了一种方法，利用条件动态的自一致性性质，以迭代方式学习扩散桥梁，并在多种设置中取得了令人鼓舞的结果。

Minimum Weighted Feedback Arc Sets for Ranking from Pairwise Comparisons

Authors: Soroush Vahidi, Ioannis Koutis

First: 2024-12-10T16:51:11+00:00 · Latest: 2025-12-04T18:29:06+00:00

Comments: This is a preliminary paper

Abs · PDF · Code1 · Code2

Abstract

The Minimum Weighted Feedback Arc Set (MWFAS) problem is closely related to the task of deriving a global ranking from pairwise comparisons. Recent work by He et al. (ICML 2022) advanced the state of the art on ranking benchmarks using learning-based methods, but did not examine the underlying connection to MWFAS. In this paper, we investigate this relationship and introduce efficient combinatorial algorithms for solving MWFAS as a means of addressing the ranking problem. Our experimental results show that these simple, learning-free methods achieve substantially faster runtimes than recent learning-based approaches, while also delivering competitive, and in many cases superior, ranking accuracy. These findings suggest that lightweight combinatorial techniques offer a scalable and effective alternative to deep learning for large-scale ranking tasks.

中文标题/摘要

标题：最小加权反馈弧集在从成对比较推导排名中的应用

最小加权反馈弧集（MWFAS）问题与从成对比较推导全局排名的任务密切相关。He等人（ICML 2022）使用基于学习的方法在排名基准上取得了最先进的成果，但没有探讨其与MWFAS的内在联系。本文研究了这种关系，并引入了高效的组合算法来解决MWFAS问题，以应对排名问题。实验结果表明，这些简单、无学习的方法在运行时间上比最近的基于学习的方法快得多，同时在排名准确性上也具有竞争力，甚至在许多情况下更优。这些发现表明，轻量级的组合技术为大规模排名任务提供了一种可扩展且有效的替代方案。

Summary / 总结

The research investigates the connection between the Minimum Weighted Feedback Arc Set (MWFAS) problem and ranking from pairwise comparisons. It introduces efficient combinatorial algorithms for solving MWFAS and shows that these methods achieve faster runtimes compared to recent learning-based approaches, while also providing competitive ranking accuracy. This suggests that lightweight combinatorial techniques can be a scalable and effective alternative to deep learning for large-scale ranking tasks.

本文探讨了最小加权反馈弧集（MWFAS）问题与从成对比较中推导全局排名之间的关系。它引入了解决MWFAS的高效组合算法，并展示了这些方法（不依赖于学习）比最近的基于学习的方法具有更快的运行时间，同时保持了竞争力的排名准确性。这表明，轻量级的组合技术可以是大规模排名任务中一种可扩展且有效的替代方案，而不是深度学习。
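
The link between MWFAS and ranking can be made concrete with a small sketch. It is not the authors' algorithm: the helper names `feedback_weight` and `rank_by_net_wins` and the toy comparison data are assumptions made for illustration. A ranking's feedback weight is the total weight of comparisons it contradicts, and a simple baseline heuristic orders items by weighted wins minus weighted losses.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (winner, loser) of one pairwise comparison


def feedback_weight(ranking: List[str], weights: Dict[Edge, float]) -> float:
    """Total weight of the comparisons that the ranking contradicts."""
    position = {item: i for i, item in enumerate(ranking)}
    # An arc (winner, loser) is a feedback arc when the loser is ranked above the winner.
    return sum(w for (winner, loser), w in weights.items()
               if position[winner] > position[loser])


def rank_by_net_wins(weights: Dict[Edge, float]) -> List[str]:
    """Baseline heuristic (an assumption, not the paper's method):
    sort items by weighted wins minus weighted losses."""
    score: Dict[str, float] = {}
    for (winner, loser), w in weights.items():
        score[winner] = score.get(winner, 0.0) + w
        score[loser] = score.get(loser, 0.0) - w
    # Ties are broken alphabetically so the output is deterministic.
    return sorted(score, key=lambda item: (-score[item], item))


# Hypothetical data containing one cycle: A beats B, B beats C, C beats A.
comparisons = {("A", "B"): 3.0, ("B", "C"): 2.0, ("C", "A"): 1.0, ("A", "D"): 1.0}
ranking = rank_by_net_wins(comparisons)
print(ranking, feedback_weight(ranking, comparisons))
```

Any ranking must contradict at least one arc of the cycle, so the best achievable feedback weight on this toy input is the weight of its lightest cycle arc.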

Hybrid Quantum-Classical Autoencoders for Unsupervised Network Intrusion Detection

Authors: Mohammad Arif Rasyidi, Omar Alhussein, Sami Muhaidat, Ernesto Damiani

First: 2025-12-04T18:29:05+00:00 · Latest: 2025-12-04T18:29:05+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Unsupervised anomaly-based intrusion detection requires models that can generalize to attack patterns not observed during training. This work presents the first large-scale evaluation of hybrid quantum-classical (HQC) autoencoders for this task. We construct a unified experimental framework that iterates over key quantum design choices, including quantum-layer placement, measurement approach, variational and non-variational formulations, and latent-space regularization. Experiments across three benchmark NIDS datasets show that HQC autoencoders can match or exceed classical performance in their best configurations, although they exhibit higher sensitivity to architectural decisions. Under zero-day evaluation, well-configured HQC models provide stronger and more stable generalization than classical and supervised baselines. Simulated gate-noise experiments reveal early performance degradation, indicating the need for noise-aware HQC designs. These results provide the first data-driven characterization of HQC autoencoder behavior for network intrusion detection and outline key factors that govern their practical viability. All experiment code and configurations are available at https://github.com/arasyi/hqcae-network-intrusion-detection.

中文标题/摘要

标题：混合量子-经典自编码器在无监督网络入侵检测中的应用

无监督异常检测需要能够泛化到训练期间未观察到的攻击模式的模型。本研究首次对混合量子-经典（HQC）自编码器在该任务中的大规模性能进行了评估。我们构建了一个统一的实验框架，迭代了关键的量子设计选择，包括量子层的位置、测量方法、变分和非变分形式以及潜在空间正则化。在三个基准NIDS数据集上的实验表明，HQC自编码器在最佳配置下可以匹配或超越经典性能，尽管它们对架构决策更为敏感。在零日评估中，配置良好的HQC模型提供了比经典和监督基线更强且更稳定的泛化能力。模拟门噪声实验揭示了早期性能下降，表明需要噪声感知的HQC设计。这些结果提供了HQC自编码器在网络入侵检测中行为的首个数据驱动表征，并概述了其实际可行性的关键因素。所有实验代码和配置均可在https://github.com/arasyi/hqcae-network-intrusion-detection/获取。

Summary / 总结

This work presents the first large-scale evaluation of hybrid quantum-classical (HQC) autoencoders for unsupervised network intrusion detection. The study sweeps key quantum design choices (quantum-layer placement, measurement approach, variational versus non-variational formulation, and latent-space regularization) across three benchmark NIDS datasets. In their best configurations, HQC autoencoders match or exceed classical performance, and under zero-day evaluation well-configured HQC models generalize more strongly and more stably than classical and supervised baselines. However, they are more sensitive to architectural decisions and degrade early under simulated gate noise, indicating the need for noise-aware designs.

该研究评估了首个大规模使用的混合量子-经典（HQC）自编码器在无监督网络入侵检测中的应用。研究探索了多种量子设计选择，并显示在最佳配置下，HQC自编码器可以匹配或超过经典性能，在零日评估中表现出更强和更稳定的泛化能力，优于经典和监督基线。然而，它们在模拟门噪声条件下表现出早期性能退化，强调了需要噪声感知的设计。

Path Channels and Plan Extension Kernels: a Mechanistic Description of Planning in a Sokoban RNN

Authors: Mohammad Taufeeque, Aaron David Tucker, Adam Gleave, Adrià Garriga-Alonso

Venue: NeurIPS 2025

First: 2025-06-11T19:36:17+00:00 · Latest: 2025-12-04T18:28:33+00:00

Comments: Presented at the Mechanistic Interpretability Workshop at NeurIPS 2025. 34 pages, 26 figures

Abs · PDF · Code1 · Code2

Abstract

We partially reverse-engineer a convolutional recurrent neural network (RNN) trained with model-free reinforcement learning to play the box-pushing game Sokoban. We find that the RNN stores future moves (plans) as activations in particular channels of the hidden state, which we call path channels. A high activation in a particular location means that, when a box is in that location, it will get pushed in the channel's assigned direction. We examine the convolutional kernels between path channels and find that they encode the change in position resulting from each possible action, thus representing part of a learned transition model. The RNN constructs plans by starting at the boxes and goals. These kernels extend activations in path channels forwards from boxes and backwards from the goal. Negative values are placed in channels at obstacles. This causes the extension kernels to propagate the negative value in reverse, thus pruning the last few steps and letting an alternative plan emerge, a form of backtracking. Our work shows that a precise understanding of the plan representation allows us to directly understand the bidirectional planning-like algorithm learned by model-free training in more familiar terms.

中文标题/摘要

标题：路径通道和计划扩展核：使用无模型强化学习训练的卷积递归神经网络的规划机制描述

我们部分地逆向工程了一个使用无模型强化学习训练的卷积递归神经网络（RNN），使其能够玩推箱子游戏Sokoban。我们发现，RNN将未来的动作（计划）存储在隐藏状态的特定通道中，我们称之为路径通道。特定位置的高激活意味着当箱子位于该位置时，它将被推到该通道指定的方向。我们检查了路径通道之间的卷积核，发现它们编码了每种可能动作导致的位置变化，从而代表了学习到的转移模型的一部分。RNN通过从箱子和目标开始构建计划。这些核将激活从箱子向前扩展到路径通道，从目标向后扩展。在障碍物处放置负值。这导致扩展核反向传播负值，从而修剪最后几步，让另一种计划浮现；一种形式的回溯。我们的工作表明，对计划表示的精确理解使我们能够直接用更熟悉的术语理解无模型训练中学习到的双向规划算法。

Summary / 总结

The study aims to understand the planning mechanism in a Sokoban RNN trained with model-free reinforcement learning. By analyzing the hidden state, the researchers identified specific channels, termed path channels, which store future moves. These channels indicate the direction a box will be pushed when it is at a particular location. The convolutional kernels between path channels encode the position change resulting from each action, representing part of a learned transition model. The RNN constructs plans by extending activations forwards from the boxes and backwards from the goals, while negative values at obstacles propagate in reverse to prune the last few steps of a plan and let alternative paths emerge, a form of backtracking. This work provides insights into the bidirectional planning-like algorithm learned by the RNN.

研究部分逆向工程了一个用于解Sokoban的卷积RNN，发现网络使用隐藏状态中的特定通道来存储未来动作计划，称为路径通道。这些通道指示如果箱子位于某个位置，它将被推的方向。这些通道之间的卷积核编码每个动作后的位置变化，代表了一个学习到的转移模型。RNN通过从箱子向前扩展激活值，并从目标向后扩展来构建计划，障碍物处的负值导致回溯的出现。这项工作表明，理解计划表示有助于解释RNN通过无模型训练学习到的双向规划算法。

Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models

Authors: Shuai Wang, Zhenhua Liu, Jiaheng Wei, Xuanwu Yin, Dong Li, Emad Barsoum

First: 2025-06-11T09:01:59+00:00 · Latest: 2025-12-04T18:28:33+00:00

Abs · PDF · Code1 · Code2

Abstract

We present Athena-PRM, a multimodal process reward model (PRM) designed to evaluate the reward score for each step in solving complex reasoning problems. Developing high-performance PRMs typically demands significant time and financial investment, primarily because reasoning steps require step-level annotations. Conventional automated labeling methods, such as Monte Carlo estimation, often produce noisy labels and incur substantial computational costs. To efficiently generate high-quality process-labeled data, we propose leveraging prediction consistency between weak and strong completers as a criterion for identifying reliable process labels. Remarkably, Athena-PRM demonstrates outstanding effectiveness across various scenarios and benchmarks with just 5,000 samples. We also develop two effective strategies to improve the performance of PRMs: ORM initialization and up-sampling of negative data. We validate our approach in three specific scenarios: verification for test-time scaling, direct evaluation of reasoning step correctness, and reward-ranked fine-tuning. Athena-PRM consistently achieves superior performance across multiple benchmarks and scenarios. Notably, when using Qwen2.5-VL-7B as the policy model, Athena-PRM enhances performance by 10.2 points on WeMath and 7.1 points on MathVista for test-time scaling. Furthermore, Athena-PRM sets state-of-the-art (SoTA) results on VisualProcessBench, outperforming the previous SoTA by 3.9 F1 points and showcasing its robust capability to accurately assess the correctness of reasoning steps. Additionally, using Athena-PRM as the reward model, we develop Athena-7B with reward-ranked fine-tuning, which outperforms the baseline by a significant margin on five benchmarks.

中文标题/摘要

标题：Athena：通过数据高效过程奖励模型增强多模态推理

我们提出了Athena-PRM，这是一种多模态过程奖励模型（PRM），旨在评估解决复杂推理问题的每一步的奖励分数。开发高性能的PRM通常需要大量的时间和财务投资，主要是因为需要对推理步骤进行逐级注释。传统的自动标注方法，如蒙特卡洛估计，通常会产生噪声标签并导致巨大的计算成本。为了高效地生成高质量的过程标注数据，我们提出利用弱完成者和强完成者之间预测一致性作为可靠过程标签的识别标准。令人惊讶的是，Athena-PRM仅使用5,000个样本就在各种场景和基准测试中表现出色。此外，我们还开发了两种有效策略以提高PRM的性能：初始化ORM和对负样本进行上采样。我们通过三种特定场景验证了我们的方法：测试时缩放的验证、直接评估推理步骤的正确性以及奖励排名微调。Athena-PRM在多个基准测试和场景中始终表现出色。值得注意的是，当使用Qwen2.5-VL-7B作为策略模型时，Athena-PRM在测试时缩放上提高了WeMath 10.2分和MathVista 7.1分。此外，Athena-PRM在VisualProcessBench中达到了最先进的技术水平，并且在F1分数上比之前最先进的技术水平高出3.9分，展示了其准确评估推理步骤正确性的强大能力。此外，利用Athena-PRM作为奖励模型，我们开发了Athena-7B并使用奖励排名微调，其在五个基准测试中显著优于基线。

Summary / 总结

Athena-PRM is a multimodal process reward model designed to evaluate the reward score for each step in solving complex reasoning problems. It uses prediction consistency between weak and strong completers to generate reliable process labels efficiently. Athena-PRM shows excellent performance with just 5,000 samples across various scenarios and benchmarks. With Qwen2.5-VL-7B as the policy model, it enhances performance by 10.2 points on WeMath and 7.1 points on MathVista for test-time scaling, and it sets state-of-the-art results on VisualProcessBench with a 3.9-point F1 improvement over the previous best.

Athena-PRM 是一种多模态过程奖励模型，用于评估复杂问题中的推理步骤。它通过弱完成者和强完成者之间的预测一致性来高效生成可靠的过程标签，仅需 5,000 个样本。Athena-PRM 在各种基准和场景中表现出色，将 WeMath 上的测试时间缩放性能提升 10.2 点，MathVista 上提升 7.1 点。此外，它在 VisualProcessBench 中达到最先进的结果，比之前的方法提高了 3.9 F1 分数。
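
The consistency criterion can be illustrated with a minimal sketch. The abstract does not say how consistency is computed, so everything here is an assumption: the hypothetical function `consistent_step_labels` takes, for each reasoning step, the fraction of sampled completions that reach the correct answer under a weak and a strong completer (Monte Carlo estimates), converts each to a binary label, and keeps a label only when both completers agree.

```python
from typing import List, Optional


def consistent_step_labels(weak_rates: List[float],
                           strong_rates: List[float],
                           threshold: float = 0.0) -> List[Optional[int]]:
    """Keep a step label only when the weak and strong completers agree on it.

    Each rate is the fraction of sampled completions from that step that
    reach the correct final answer (a Monte Carlo estimate).
    The threshold rule is an illustrative assumption.
    """
    if len(weak_rates) != len(strong_rates):
        raise ValueError("both completers must score the same steps")
    labels: List[Optional[int]] = []
    for weak, strong in zip(weak_rates, strong_rates):
        weak_label = int(weak > threshold)
        strong_label = int(strong > threshold)
        # Disagreement is treated as an unreliable label and dropped (None).
        labels.append(weak_label if weak_label == strong_label else None)
    return labels


# Hypothetical success rates for a three-step solution.
labels = consistent_step_labels([0.5, 0.0, 0.0], [0.75, 0.25, 0.0])
print(labels)
```

In this toy input the completers agree on the first and third steps, so only the second step is discarded as unreliable.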

4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer

Authors: Xianfeng Wu, Yajing Bai, Minghan Li, Xianzu Wu, Xueqi Zhao, Zhongyuan Lai, Wenyu Liu, Xinggang Wang

First: 2025-12-04T18:15:27+00:00 · Latest: 2025-12-04T18:15:27+00:00

Comments: Code: https://github.com/hustvl/4DLangVGGT, Webpage: https://hustvl.github.io/4DLangVGGT

Abs · PDF · Code1 · Code2 · Code3 · Project1

Abstract

Constructing 4D language fields is crucial for embodied AI, augmented/virtual reality, and 4D scene understanding, as they provide enriched semantic representations of dynamic environments and enable open-vocabulary querying in complex scenarios. However, existing approaches to 4D semantic field construction primarily rely on scene-specific Gaussian splatting, which requires per-scene optimization, exhibits limited generalization, and is difficult to scale to real-world applications. To address these limitations, we propose 4DLangVGGT, the first Transformer-based feed-forward unified framework for 4D language grounding, which jointly integrates geometric perception and language alignment within a single architecture. 4DLangVGGT has two key components: the 4D Visual Geometry Transformer, StreamVGGT, which captures spatio-temporal geometric representations of dynamic scenes; and the Semantic Bridging Decoder (SBD), which projects geometry-aware features into a language-aligned semantic space, thereby enhancing semantic interpretability while preserving structural fidelity. Unlike prior methods that depend on costly per-scene optimization, 4DLangVGGT can be jointly trained across multiple dynamic scenes and directly applied during inference, achieving both deployment efficiency and strong generalization. This design significantly improves the practicality of large-scale deployment and establishes a new paradigm for open-vocabulary 4D scene understanding. Experiments on the HyperNeRF and Neu3D datasets demonstrate that our approach not only generalizes effectively but also achieves state-of-the-art performance, with gains of up to 2% under per-scene training and 1% under multi-scene training. Our code is released at https://github.com/hustvl/4DLangVGGT

中文标题/摘要

标题：4DLangVGGT：四维语言视觉几何接地变换器

构建四维语言场对于具身人工智能、增强/虚拟现实以及四维场景理解至关重要，因为它们提供了动态环境的丰富语义表示，并在复杂场景中支持开放词汇查询。然而，现有的四维语义场构建方法主要依赖于场景特定的高斯点积，这需要针对每个场景进行优化，表现出有限的泛化能力，并难以扩展到实际应用中。为了解决这些限制，我们提出了4DLangVGGT，这是一种基于变换器的前馈统一框架，用于四维语言接地，该框架在单一架构中联合整合了几何感知和语言对齐。4DLangVGGT有两个关键组件：四维视觉几何变换器StreamVGGT，用于捕获动态场景的时空几何表示；以及语义桥梁解码器（SBD），将几何感知特征投影到语言对齐的语义空间，从而增强语义可解释性并保持结构保真度。与依赖于昂贵的场景特定优化的先前方法不同，4DLangVGGT可以在多个动态场景上联合训练，并在推理时直接应用，实现部署效率和强大的泛化能力。这种设计显著提高了大规模部署的实用性，并建立了开放词汇四维场景理解的新范式。在HyperNeRF和Neu3D数据集上的实验表明，我们的方法不仅泛化效果良好，还在场景训练和多场景训练下分别实现了高达2%和1%的性能提升。我们的代码发布在https://github.com/hustvl/4DLangVGGT

Summary / 总结

The research aims to improve 4D language fields for embodied AI and 4D scene understanding by proposing 4DLangVGGT, a Transformer-based framework that integrates geometric perception and language alignment. This approach uses a 4D Visual Geometry Transformer and a Semantic Bridging Decoder to capture spatio-temporal geometric representations and project them into a language-aligned semantic space, respectively. Experiments show that 4DLangVGGT outperforms existing methods, achieving up to 2% gains under per-scene training and 1% improvements under multi-scene training, while enhancing semantic interpretability and preserving structural fidelity. The framework can be jointly trained across multiple scenes and deployed efficiently, addressing limitations of previous scene-specific optimization methods. Code is available at https://github.com/hustvl/4DLangVGGT.

研究旨在通过提出4DLangVGGT，一种结合几何感知和语言对齐的Transformer基座框架，来改进4D语言领域，以支持嵌入式AI和4D场景理解。该方法使用4D视觉几何变换器和语义桥梁解码器分别捕捉时空几何表示并将其投影到语言对齐的语义空间。实验表明，4DLangVGGT 在单场景训练中可获得高达2%的性能提升，在多场景训练中可获得1%的改进，同时增强语义可解释性并保持结构保真度。该框架可以在多个场景上联合训练，并且部署效率高，解决了先前基于场景的优化方法的局限性。代码可在 https://github.com/hustvl/4DLangVGGT 获取。

Improving Graph Neural Network Training, Defense, and Hypergraph Partitioning via Adversarial Robustness Evaluation

Authors: Yongyu Wang

First: 2024-12-19T11:10:48+00:00 · Latest: 2025-12-04T18:10:42+00:00

Abs · PDF · Code1 · Code2

Abstract

Graph Neural Networks (GNNs) are a highly effective neural network architecture for processing graph-structured data. Unlike traditional neural networks that rely solely on the features of the data as input, GNNs leverage both the graph structure, which represents the relationships between data points, and the feature matrix of the data to optimize their feature representation. This unique capability enables GNNs to achieve superior performance across various tasks. However, it also makes GNNs more susceptible to noise from both the graph structure and data features, which can significantly increase the training difficulty and degrade their performance. To address this issue, this paper proposes a novel method for selecting noise-sensitive training samples from the original training set to construct a smaller yet more effective training set for model training. These samples are used to help improve the model's ability to correctly process data in noisy environments. We have evaluated our approach on three classic GNN models (GCN, GAT, and GraphSAGE) and three widely used benchmark datasets (Cora, Citeseer, and PubMed). Our experiments demonstrate that the proposed method can substantially boost the training of Graph Neural Networks compared to using either randomly sampled training sets of the same size or the larger original full training set. We further propose a robust-node-based hypergraph partitioning method, an adversarial-robustness-based graph pruning method for GNN defense, and a related spectral edge attack method.

中文标题/摘要

标题：通过对抗鲁棒性评估提高图神经网络训练、防御及超图分区

图神经网络（GNNs）是一种高效的神经网络架构，用于处理图结构数据。与传统神经网络仅依赖数据特征作为输入不同，GNNs 利用图结构，即数据点之间的关系，以及数据的特征矩阵来优化其特征表示。这种独特的能力使GNNs在各种任务中表现出色。然而，这也使得GNNs更容易受到图结构和数据特征噪声的影响，这会显著增加训练难度并降低其性能。为了解决这一问题，本文提出了一种新颖的方法，从原始训练集中选择对噪声敏感的训练样本，构建一个更小但更有效的训练集用于模型训练。这些样本用于帮助提高模型在噪声环境中正确处理数据的能力。我们已在三种最经典的GNN模型GCN、GAT和GraphSAGE以及三种广泛使用的基准数据集Cora、Citeseer和PubMed上评估了我们的方法。实验结果表明，与从原始训练集随机采样的相同大小的训练集和更大的原始完整训练集相比，所提出的方法可以显著提高图神经网络的训练效果。我们还提出了一种基于鲁棒节点的超图分区方法、一种基于对抗鲁棒性的图神经网络防御的图剪枝方法以及相关的谱边缘攻击方法。

Summary / 总结

This paper addresses the challenges of training and defending Graph Neural Networks (GNNs) by proposing a method to select noise-sensitive training samples, which improves model robustness. The method was evaluated on three GNN models (GCN, GAT, GraphSAGE) using three benchmark datasets (Cora, Citeseer, PubMed), where the selected subset trained GNNs substantially better than both a random subset of the same size and the larger full training set. Additionally, the paper introduces a robust-node-based hypergraph partitioning method and an adversarial-robustness-based graph pruning method for GNN defense, along with a spectral edge attack method.

本文通过提出一种选择噪声敏感训练样本的方法来应对图神经网络（GNN）的训练和防御挑战，以提高模型在噪声环境下的鲁棒性。该方法在GCN、GAT和GraphSAGE模型以及Cora、Citeseer和PubMed数据集上进行了评估，显示了与随机采样相比在训练效率上的显著提升。此外，论文还提出了基于鲁棒节点的超图划分方法和基于对抗鲁棒性的图剪枝方法用于GNN防御，并提出了相关谱边缘攻击方法。

Joint 3D Geometry Reconstruction and Motion Generation for 4D Synthesis from a Single Image

Authors: Yanran Zhang, Ziyi Wang, Wenzhao Zheng, Zheng Zhu, Jie Zhou, Jiwen Lu

First: 2025-12-04T17:59:10+00:00 · Latest: 2025-12-04T17:59:10+00:00

Comments: 18 Pages

Abs · PDF · Code1 · Code2 · Code3

Abstract

Generating interactive and dynamic 4D scenes from a single static image remains a core challenge. Most existing generate-then-reconstruct and reconstruct-then-generate methods decouple geometry from motion, causing spatiotemporal inconsistencies and poor generalization. To address these, we extend the reconstruct-then-generate framework to jointly perform Motion generation and geometric Reconstruction for 4D Synthesis (MoRe4D). We first introduce TrajScene-60K, a large-scale dataset of 60,000 video samples with dense point trajectories, addressing the scarcity of high-quality 4D scene data. Based on this, we propose a diffusion-based 4D Scene Trajectory Generator (4D-STraG) to jointly generate geometrically consistent and motion-plausible 4D point trajectories. To leverage single-view priors, we design a depth-guided motion normalization strategy and a motion-aware module for effective geometry and dynamics integration. We then propose a 4D View Synthesis Module (4D-ViSM) to render videos with arbitrary camera trajectories from 4D point track representations. Experiments show that MoRe4D generates high-quality 4D scenes with multi-view consistency and rich dynamic details from a single image. Code: https://github.com/Zhangyr2022/MoRe4D.

中文标题/摘要

标题：单张图像的4D合成联合三维几何重建与运动生成

从单张静态图像生成交互和动态的4D场景仍然是一个核心挑战。大多数现有的生成-然后重建和重建-然后生成方法将几何与运动分离，导致时空不一致性和较差的泛化能力。为了解决这些问题，我们扩展了重建-然后生成框架，以联合执行运动生成和几何重建，用于4D合成（MoRe4D）。我们首先引入了TrajScene-60K，这是一个包含60,000个视频样本的大规模数据集，每个样本都有密集的点轨迹，解决了高质量4D场景数据稀缺的问题。基于此，我们提出了一种基于扩散的4D场景轨迹生成器（4D-STraG），以联合生成几何一致性和运动合理的4D点轨迹。为了利用单视角先验，我们设计了一种深度引导的运动归一化策略和一种运动感知模块，以实现有效的几何和动力学集成。然后，我们提出了一种4D视图合成模块（4D-ViSM），从4D点轨迹表示中渲染具有任意摄像机轨迹的视频。实验表明，MoRe4D可以从单张图像生成具有多视角一致性和丰富动态细节的高质量4D场景。代码：https://github.com/Zhangyr2022/MoRe4D。

Summary / 总结

The research aims to generate interactive and dynamic 4D scenes from a single static image, addressing the limitations of existing methods that decouple geometry from motion. The MoRe4D framework jointly performs motion generation and geometric reconstruction, using a large-scale dataset TrajScene-60K and a diffusion-based 4D Scene Trajectory Generator (4D-STraG) to produce geometrically consistent and motion-plausible 4D point trajectories. The method also includes a depth-guided motion normalization strategy and a motion-aware module for effective integration of geometry and dynamics, as well as a 4D View Synthesis Module (4D-ViSM) for rendering videos from 4D point track representations. Experiments demonstrate that MoRe4D can generate high-quality 4D scenes with multi-view consistency and rich dynamic details from a single image.

研究旨在从单张静态图像生成交互和动态的4D场景，解决现有方法将几何和运动分离而导致的空间时间不一致和泛化能力差的问题。MoRe4D框架联合进行运动生成和几何重建。该研究引入了包含60,000个视频样本的TrajScene-60K大型数据集，并提出了一种4D场景轨迹生成器（4D-STraG），用于生成几何上一致且运动合理的4D点轨迹。该方法还包括深度引导的运动归一化策略和运动感知模块，以有效集成几何和动力学。实验表明，MoRe4D可以从单张图像生成具有多视角一致性和丰富动态细节的高质量4D场景。

Semantic-Guided Two-Stage GAN for Face Inpainting with Hybrid Perceptual Encoding

Authors: Abhigyan Bhattacharya, Hiranmoy Roy

Venue: CVPR

First: 2025-12-04T17:56:08+00:00 · Latest: 2025-12-04T17:56:08+00:00

Comments: Submitted for review CVPR-2025

Abs · PDF · Code1 · Code2

Abstract

Facial image inpainting aims to restore missing or corrupted regions in face images while preserving identity, structural consistency, and photorealistic image quality, a task central to photo restoration. Despite many recent advances in deep generative models, existing methods struggle with large irregular masks, often producing blurry textures at the edges of the masked region, semantic inconsistencies, or unconvincing facial structures, because they synthesize pixels directly and make limited use of facial priors. In this paper, we propose a novel architecture that addresses these challenges through semantic-guided hierarchical synthesis. Our approach first synthesizes a semantic layout and then refines the texture, so the facial structure is established before detailed images are generated. In the first stage, we combine CNNs, which capture local features, with Vision Transformers, which capture global features, to produce clear and detailed semantic layouts. In the second stage, we use a Multi-Modal Texture Generator to refine these layouts by drawing on information from multiple scales, keeping the result cohesive and consistent. The architecture naturally handles arbitrary mask configurations through dynamic attention without mask-specific training. Experiments on two datasets, CelebA-HQ and FFHQ, show that our model outperforms other state-of-the-art methods, with improvements in metrics such as LPIPS, PSNR, and SSIM. It produces visually striking results with better semantic preservation in challenging large-area inpainting situations.

中文标题/摘要

标题：基于语义指导的两阶段GAN在具有混合感知编码的面部修复中的应用

面部图像修复的目标是在保留身份、结构一致性和照片写实度的同时恢复面部图像中的缺失或损坏区域，这是一个专门用于照片修复的任务。尽管近年来在深度生成模型方面取得了许多进展，但现有方法在处理大不规则遮罩时仍存在问题，经常在遮罩区域边缘产生模糊的纹理、语义不一致或由于直接像素级合成方法和面部先验利用有限导致不令人信服的面部结构。在本文中，我们提出了一种新颖的架构，通过语义指导的分层合成来解决上述挑战。我们的方法首先基于意义组织和合成信息，然后细化纹理。这个过程在我们开始创建详细图像之前，提供了对面部结构的清晰见解。在第一阶段，我们结合了两种技术：一种使用CNN关注局部特征，另一种使用视觉变换器关注全局特征。这帮助我们创建了清晰且详细的语义布局。在第二阶段，我们使用多模态纹理生成器通过从不同尺度中拉取信息来细化这些布局，确保一切看起来连贯且一致。该架构通过动态注意力自然处理任意遮罩配置，无需针对特定遮罩进行训练。在CelebA-HQ和FFHQ两个数据集上的实验表明，我们的模型优于其他最先进的方法，在LPIPS、PSNR和SSIM等指标上有所提升。在具有挑战性的大面积修复情况下，它产生了视觉上引人注目的结果，具有更好的语义保留。

Summary / 总结

The research aims to improve face image inpainting by addressing issues such as blurriness, semantic inconsistencies, and structural problems. The method employs a two-stage GAN with semantic-guided hierarchical synthesis, using CNNs and Vision Transformers for local and global feature synthesis in the first stage, and a Multi-Modal Texture Generator for texture refinement in the second stage. The model outperforms existing methods on CelebA-HQ and FFHQ datasets, showing improvements in LPIPS, PSNR, and SSIM metrics and producing visually striking results with better semantic preservation in large-area inpainting scenarios.

研究旨在通过解决模糊和语义不一致等问题来改进面部修复。提出了一种两阶段的GAN，使用语义引导和混合感知编码。第一阶段结合了CNN和Vision Transformers进行局部和全局特征合成，第二阶段使用多模态纹理生成器进行细化，确保各部分协调一致。实验在CelebA-HQ和FFHQ数据集上显示，该模型在LPIPS、PSNR和SSIM等指标上优于现有方法，并在大面积修复场景中产生了视觉上引人注目的结果，语义保留更好。

Arbitrage: Efficient Reasoning via Advantage-Aware Speculation

Authors: Monishwaran Maheswaran, Rishabh Tiwari, Yuezhou Hu, Kerem Dilmen, Coleman Hooper, Haocheng Xi, Nicholas Lee, Mehrdad Farajtabar, Michael W. Mahoney, Kurt Keutzer, Amir Gholami

First: 2025-12-04T17:50:53+00:00 · Latest: 2025-12-04T17:50:53+00:00

Comments: 22 pages

Abs · PDF · Code1 · Code2

Abstract

Modern Large Language Models achieve impressive reasoning capabilities with long chains of thought, but they incur substantial computational cost during inference, which motivates techniques to improve the performance-cost ratio. Among these techniques, Speculative Decoding accelerates inference by employing a fast but inaccurate draft model to autoregressively propose tokens, which are then verified in parallel by a more capable target model. However, due to unnecessary rejections caused by token mismatches in semantically equivalent steps, traditional token-level Speculative Decoding struggles in reasoning tasks. Although recent works have shifted to step-level semantic verification, which improves efficiency by accepting or rejecting entire reasoning steps, existing step-level methods still regenerate many rejected steps with little improvement, wasting valuable target compute. To address this challenge, we propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically based on the relative advantage between draft and target models. Instead of applying a fixed acceptance threshold, Arbitrage uses a lightweight router trained to predict when the target model is likely to produce a meaningfully better step. This routing approximates an ideal Arbitrage Oracle that always chooses the higher-quality step, achieving near-optimal efficiency-accuracy trade-offs. Across multiple mathematical reasoning benchmarks, Arbitrage consistently surpasses prior step-level Speculative Decoding baselines, reducing inference latency by up to ~2x at matched accuracy.

中文标题/摘要

标题：套利：基于优势感知的高效推理

现代大型语言模型凭借长链推理能力取得了令人印象深刻的推理效果，但在推理过程中会带来巨大的计算成本，这促使人们开发技术以提高性能与成本的比例。这些技术中，推测性解码通过使用快速但不准确的草稿模型自回归地提出令牌，然后由更强大的目标模型并行验证，从而加速推理。然而，由于在语义等价步骤中由于令牌不匹配导致不必要的拒绝，传统的令牌级推测性解码在推理任务中表现不佳。尽管最近的研究转向了步骤级语义验证，这通过接受或拒绝整个推理步骤提高了效率，但现有的步骤级方法仍然会重新生成许多被拒绝的步骤，浪费了宝贵的目标计算资源。为了解决这一挑战，我们提出了套利，这是一种新颖的步骤级推测生成框架，根据草稿模型和目标模型之间的相对优势动态路由生成。与应用固定接受阈值不同，套利使用一个轻量级路由器，该路由器被训练以预测目标模型何时可能生成更有意义的步骤。这种路由近似于理想的套利Oracle，它总是选择质量更高的步骤，从而实现接近最优的效率-准确度权衡。在多个数学推理基准测试中，套利始终超越了先前的步骤级推测性解码基线，将匹配准确度下的推理延迟降低多达约2倍。

Summary / 总结

The paper introduces Arbitrage, a step-level speculative generation framework designed to improve the efficiency of Large Language Models in reasoning tasks. Instead of a fixed acceptance threshold, it uses a lightweight router to predict when the target model is likely to produce a meaningfully better step than the draft model, so that target compute is not wasted regenerating steps with little improvement. Experiments on multiple mathematical reasoning benchmarks show that Arbitrage outperforms previous step-level speculative decoding baselines, reducing inference latency by up to roughly 2x at matched accuracy.

研究旨在通过减少推理过程中的不必要的拒绝来提高大型语言模型的效率。提出了一种名为Arbitrage的新颖步骤级推测生成框架，该框架根据草稿模型和目标模型之间的相对优势动态路由生成。该方法在保持准确性的前提下，相比之前的步骤级推测解码技术，实现了高达2倍的推理延迟减少。
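
The routing idea described above can be sketched as a simple control loop. This is a toy illustration under stated assumptions, not the authors' implementation: `route_steps`, its callable arguments, and the `min_gain` threshold are hypothetical names, and the models and router are replaced by stand-in lambdas.

```python
from typing import Callable, List


def route_steps(draft_step: Callable[[List[str]], str],
                target_step: Callable[[List[str]], str],
                predicted_gain: Callable[[List[str], str], float],
                num_steps: int,
                min_gain: float = 0.5) -> List[str]:
    """Build a reasoning trace, calling the target only when the router expects a gain."""
    trace: List[str] = []
    for _ in range(num_steps):
        candidate = draft_step(trace)
        # Escalate to the expensive target model only if the router predicts
        # that its step would be meaningfully better than the draft's step.
        if predicted_gain(trace, candidate) > min_gain:
            candidate = target_step(trace)
        trace.append(candidate)
    return trace


# Toy stand-ins: this "router" flags every second step as worth escalating.
trace = route_steps(
    draft_step=lambda t: f"draft-{len(t)}",
    target_step=lambda t: f"target-{len(t)}",
    predicted_gain=lambda t, step: 1.0 if len(t) % 2 == 1 else 0.0,
    num_steps=4,
)
print(trace)
```

The design point the sketch captures is that the decision depends on the predicted advantage of the target over the draft, so a draft step is kept whenever regenerating it is not expected to help.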

Dual-Path Region-Guided Attention Network for Ground Reaction Force and Moment Regression

Authors: Xuan Li, Samuel Bello

First: 2025-12-04T17:47:01+00:00 · Latest: 2025-12-04T17:47:01+00:00

Abs · PDF · Code1 · Code2

Abstract

Accurate estimation of three-dimensional ground reaction forces and moments (GRFs/GRMs) is crucial for both biomechanics research and clinical rehabilitation evaluation. In this study, we focus on insole-based GRF/GRM estimation and further validate our approach on a public walking dataset. We propose a Dual-Path Region-Guided Attention Network that integrates anatomy-inspired spatial priors and temporal priors into a region-level attention mechanism, while a complementary path captures context from the full sensor field. The two paths are trained jointly and their outputs are combined to produce the final GRF/GRM predictions. Conclusions: Our model outperforms strong baseline models, including CNN and CNN-LSTM architectures, on two datasets, achieving the lowest six-component average NRMSE of 5.78% on the insole dataset and 1.42% for the vertical ground reaction force on the public dataset. This demonstrates robust performance for ground reaction force and moment estimation.

中文标题/摘要

标题：基于双路径区域引导注意力网络的地面反作用力和力矩回归

三维地面反作用力和力矩（GRFs/GRMs）的准确估计对于生物力学研究和临床康复评估至关重要。在本研究中，我们专注于基于鞋垫的GRF/GRM估计，并进一步在公共行走数据集上验证了我们的方法。我们提出了一种双路径区域引导注意力网络，将解剖学启发的空间先验和时间先验整合到区域级注意力机制中，而互补路径则从整个传感器场中捕获上下文。两个路径联合训练，其输出结合生成最终的GRF/GRM预测。结论：我们的模型在两个数据集上均优于包括CNN和CNN-LSTM架构在内的强基线模型，分别在鞋垫数据集上六分量平均NRMSE为5.78%，在公共数据集上垂直地面反作用力为1.42%。这表明该模型在地面反作用力和力矩估计方面具有稳健的性能。

Summary / 总结

This study aims to improve the accuracy of ground reaction force and moment estimation using insole data. The authors propose a Dual-Path Region-Guided Attention Network that incorporates spatial and temporal priors into a region-level attention mechanism. The model outperforms CNN and CNN-LSTM architectures, achieving the lowest six-component average NRMSE of 5.78% on the insole dataset and 1.42% for the vertical ground reaction force on a public dataset. This indicates robust performance for ground reaction force and moment estimation.

该研究旨在提高地面反作用力和力矩估计的准确性，以支持生物力学研究和临床应用。作者提出了一种结合空间和时间先验的区域级注意力机制的双路径区域引导注意力网络。该网络在内底数据集上的六分量平均NRMSE最低，为5.78%，在公共数据集上的垂直地面反作用力NRMSE为1.42%，优于CNN和CNN-LSTM模型。
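
For readers unfamiliar with the reported metric, a minimal sketch of NRMSE follows. The abstract does not state the normalization, so dividing the RMSE by the range of the measured signal is an assumption (normalizing by the mean or by body weight is also common), and `nrmse_percent` is a hypothetical helper name. A six-component average would be the mean of this value over the three force and three moment components.

```python
import math
from typing import Sequence


def nrmse_percent(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """RMSE divided by the range of the measured signal, as a percentage.

    Range normalization is an assumption; the paper may use another convention.
    """
    if len(predicted) != len(measured) or len(measured) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - m) ** 2 for p, m in zip(predicted, measured)) / len(measured)
    value_range = max(measured) - min(measured)
    if value_range == 0:
        raise ValueError("measured signal has zero range")
    return 100.0 * math.sqrt(mse) / value_range


# Hypothetical vertical-force samples: every prediction is off by exactly 1 unit.
measured_force = [0.0, 10.0, 20.0, 30.0]
predicted_force = [1.0, 9.0, 21.0, 29.0]
score = nrmse_percent(predicted_force, measured_force)
print(round(score, 2))
```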

RAMEN: Resolution-Adjustable Multimodal Encoder for Earth Observation

Authors: Nicolas Houdré, Diego Marcos, Hugo Riffaud de Turckheim, Dino Ienco, Laurent Wendling, Camille Kurtz, Sylvain Lobry

First: 2025-12-04T17:40:17+00:00 · Latest: 2025-12-04T17:40:17+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Earth observation (EO) data spans a wide range of spatial, spectral, and temporal resolutions, from high-resolution optical imagery to low-resolution multispectral products or radar time series. While recent foundation models have improved multimodal integration for learning meaningful representations, they often expect fixed input resolutions or are based on sensor-specific encoders, limiting generalization across heterogeneous EO modalities. To overcome these limitations, we introduce RAMEN, a resolution-adjustable multimodal encoder that learns a shared visual representation across EO data in a fully sensor-agnostic manner. RAMEN treats the modality and the spatial and temporal resolutions as key input data features, enabling coherent analysis across modalities within a unified latent space. Its main methodological contribution is to define spatial resolution as a controllable output parameter, giving users direct control over the desired level of detail at inference and allowing explicit trade-offs between spatial precision and computational cost. We train a single, unified transformer encoder to reconstruct masked multimodal EO data drawn from diverse sources, ensuring generalization across sensors and resolutions. Once pretrained, RAMEN transfers effectively to both known and unseen sensor configurations and outperforms larger state-of-the-art models on the community-standard PANGAEA benchmark, which contains various multi-sensor and multi-resolution downstream tasks. Our code and pretrained model are available at https://github.com/nicolashoudre/RAMEN.

中文标题/摘要

标题：RAMEN：适用于地球观测的可调节分辨率多模态编码器

地球观测（EO）数据涵盖了广泛的空域、光谱和时间分辨率，从高分辨率光学图像到低分辨率多光谱产品或雷达时间序列。虽然最近的预训练模型提高了多模态集成以学习有意义的表示，但它们通常期望固定输入分辨率或基于特定传感器编码器，限制了在异构EO模态之间的泛化能力。为克服这些限制，我们引入了RAMEN，这是一种可调节分辨率的多模态编码器，能够在完全传感器无关的方式下学习EO数据的共享视觉表示。RAMEN 将模态、空域和时域分辨率视为关键输入数据特征，使在统一的潜在空间内跨模态进行一致分析成为可能。其主要方法论贡献在于将空间分辨率定义为可控的输出参数，使用户在推理时能够直接控制所需的细节水平，并允许在空间精度与计算成本之间进行显式权衡。我们训练了一个单一的统一变换器编码器，用于重建来自多种来源的掩码多模态EO数据，确保在传感器和分辨率之间的一般化能力。一旦预训练完成，RAMEN 可以有效地转移到已知和未知的传感器配置中，并在社区标准PANGAEA基准测试中优于更大的最先进的模型，该基准测试包含各种多传感器和多分辨率下游任务。我们的代码和预训练模型可在https://github.com/nicolashoudre/RAMEN/ 获取。

Summary / 总结

RAMEN is a resolution-adjustable multimodal encoder designed to handle Earth observation (EO) data with varying resolutions, from high-resolution optical imagery to low-resolution multispectral products and radar time series. By treating the modality and the spatial and temporal resolutions as key input features, RAMEN learns a shared representation across EO modalities in a sensor-agnostic manner. Its key methodological contribution is to expose spatial resolution as a controllable output parameter, allowing explicit trade-offs between spatial detail and computational cost at inference. RAMEN outperforms larger state-of-the-art models on the PANGAEA benchmark, which contains various multi-sensor and multi-resolution downstream tasks.

RAMEN旨在解决不同分辨率的地球观测数据多模态集成的挑战。它引入了一种分辨率可调的多模态编码器，能够在不依赖特定传感器编码器的情况下学习不同EO数据类型之间的共享表示。RAMEN允许用户在推理时控制细节水平，平衡空间精度与计算成本。实验表明，RAMEN在PANGAEA基准测试中优于更大规模的最新模型，展示了其在多种多传感器和多分辨率任务中的有效性。

Triangle Multiplication Is All You Need For Biomolecular Structure Representations

Authors: Jeffrey Ouyang-Zhang, Pranav Murugan, Daniel J. Diaz, Gianluca Scarpellini, Richard Strong Bowen, Nate Gruver, Adam Klivans, Philipp Krähenbühl, Aleksandra Faust, Maruan Al-Shedivat

First: 2025-10-21T17:59:02+00:00 · Latest: 2025-12-04T17:39:54+00:00

Comments: Preprint

Abs · PDF · Code1 · Code2 · Code3

Abstract

AlphaFold has transformed protein structure prediction, but emerging applications such as virtual ligand screening, proteome-wide folding, and de novo binder design demand predictions at a massive scale, where runtime and memory costs become prohibitive. A major bottleneck lies in the Pairformer backbone of AlphaFold3-style models, which relies on computationally expensive triangular primitives, especially triangle attention, for pairwise reasoning. We introduce Pairmixer, a streamlined alternative that eliminates triangle attention while preserving higher-order geometric reasoning capabilities that are critical for structure prediction. Pairmixer substantially improves computational efficiency: it matches state-of-the-art structure predictors across folding and docking benchmarks while delivering up to 4x faster inference on long sequences and reducing training cost by 34%. Its efficiency alleviates the computational burden of downstream applications such as modeling large protein complexes, high-throughput ligand and binder screening, and hallucination-based design. Within BoltzDesign, for example, Pairmixer delivers over 2x faster sampling and scales to sequences ~30% longer than the memory limits of Pairformer. Code is available at https://github.com/genesistherapeutics/pairmixer.
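For readers unfamiliar with the triangular primitives mentioned above, the following is a minimal NumPy sketch of an "outgoing" triangle multiplicative update on a pair representation, in the simplified form familiar from AlphaFold-style models. It is an illustration only and not Pairmixer's implementation: the function name, the projection matrices `w_a`, `w_b`, and `w_out`, and the omission of gating and layer normalization are all assumptions made for brevity.

```python
import numpy as np

def triangle_multiply_outgoing(z, w_a, w_b, w_out):
    """Simplified outgoing triangle multiplicative update (hypothetical helper).

    z:     pair representation, shape (n, n, c)
    w_a:   projection for the first set of edges,  shape (c, h)
    w_b:   projection for the second set of edges, shape (c, h)
    w_out: output projection,                      shape (h, c)
    """
    a = z @ w_a  # (n, n, h)
    b = z @ w_b  # (n, n, h)
    # For every pair (i, j), combine edges (i, k) and (j, k) over all third nodes k.
    tri = np.einsum("ikh,jkh->ijh", a, b)
    # Residual update of the pair representation.
    return z + tri @ w_out

rng = np.random.default_rng(0)
n, c, h = 5, 8, 4
z = rng.normal(size=(n, n, c))
w_a = rng.normal(size=(c, h))
w_b = rng.normal(size=(c, h))
w_out = rng.normal(size=(h, c))

z_new = triangle_multiply_outgoing(z, w_a, w_b, w_out)
print(z_new.shape)  # (5, 5, 8)
```

The contraction visits every triple (i, j, k), so its cost grows cubically with the number of tokens n, which is why these primitives dominate runtime on long sequences.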

中文标题/摘要

标题：三角乘法即为生物分子结构表示所需的一切

AlphaFold 已经彻底改变了蛋白质结构预测，但新兴的应用，如虚拟配体筛选、全蛋白质组折叠和从头设计结合蛋白，需要大规模的预测，这使得运行时间和内存成本变得难以承受。AlphaFold3 类型模型的 Pairformer 主干是一个主要瓶颈，它依赖于计算密集型的三角形基础元素（特别是三角注意力）来进行成对推理。我们引入了 Pairmixer，这是一种简化替代方案，它消除了三角注意力，同时保留了对于结构预测至关重要的高阶几何推理能力。Pairmixer 显著提高了计算效率，在折叠和对接基准测试中与最先进的结构预测器持平，同时在长序列上实现高达 4 倍的更快推理速度，训练成本降低 34%。其效率减轻了下游应用如大型蛋白质复合物建模、高通量配体和结合蛋白筛选以及基于幻觉的设计的计算负担。例如，在 BoltzDesign 中，Pairmixer 提供了超过 2 倍的采样速度，并扩展到比 Pairformer 内存限制长约 30% 的序列。代码可在 https://github.com/genesistherapeutics/pairmixer 获取。

Summary / 总结

The research addresses the computational challenges of large-scale biomolecular structure predictions, particularly in virtual ligand screening and de novo binder design. It introduces Pairmixer, which removes triangle attention from the Pairformer backbone while maintaining geometric reasoning capabilities. Pairmixer significantly enhances computational efficiency, matching state-of-the-art predictors and reducing training costs by 34%, with up to 4x faster inference on long sequences and improved scalability for complex models and high-throughput screenings.

研究旨在解决大规模蛋白质结构预测中的计算挑战，特别是Pairformer骨干在AlphaFold3模型中的低效问题。研究引入了Pairmixer，它去除了三角注意力的同时保持了必要的几何推理能力。Pairmixer显著提高了计算效率，匹配了最先进的预测器，并将训练成本降低了34%，在长序列上可实现高达4倍的更快推理速度，同时提高了下游应用的可扩展性。

SAGE: Spatial-visual Adaptive Graph Exploration for Visual Place Recognition

Authors: Shunpeng Chen, Changwei Wang, Rongtao Xu, Xingtian Pei, Yukun Song, Jinzhou Lin, Wenhao Xu, Jingyi Zhang, Li Guo, Shibiao Xu

First: 2025-09-30T03:34:40+00:00 · Latest: 2025-12-04T17:29:56+00:00

Comments: 23 pages

Abs · PDF · Code1 · Code2

Abstract

Visual Place Recognition (VPR) requires robust retrieval of geotagged images despite large appearance, viewpoint, and environmental variation. Prior methods focus on descriptor fine-tuning or fixed sampling strategies yet neglect the dynamic interplay between spatial context and visual similarity during training. We present SAGE (Spatial-visual Adaptive Graph Exploration), a unified training pipeline that enhances granular spatial-visual discrimination by jointly improving local feature aggregation, sample organization during training, and hard sample mining. We introduce a lightweight Soft Probing module that learns residual weights from training data for patch descriptors before bilinear aggregation, boosting distinctive local cues. During training we reconstruct an online geo-visual graph that fuses geographic proximity and current visual similarity so that candidate neighborhoods reflect the evolving embedding landscape. To concentrate learning on the most informative place neighborhoods, we seed clusters from high-affinity anchors and iteratively expand them with a greedy weighted clique expansion sampler. Implemented with a frozen DINOv2 backbone and parameter-efficient fine-tuning, SAGE achieves SOTA across eight benchmarks. It attains 98.9%, 95.8%, 94.5%, and 96.0% Recall@1 on SPED, Pitts30k-test, MSLS-val, and Nordland, respectively. Notably, our method obtains 100% Recall@10 on SPED using only 4096D global descriptors. Code and models will be released upon acceptance.
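The Recall@1 and Recall@10 figures above follow the usual retrieval convention: a query counts as a hit if at least one of its top-K retrieved database images is a correct match. The sketch below shows that computation on toy data. The function name, the toy descriptors, and the `positives` ground-truth sets are hypothetical; in VPR benchmarks the positives are typically database images within a dataset-specific geographic distance of the query.

```python
import numpy as np

def recall_at_k(query_desc, db_desc, positives, k):
    """Fraction of queries with at least one correct match in the top-k.

    query_desc: (q, d) L2-normalized query descriptors
    db_desc:    (m, d) L2-normalized database descriptors
    positives:  list of sets; positives[i] holds the database indices
                counted as correct for query i (hypothetical ground truth)
    """
    sims = query_desc @ db_desc.T             # cosine similarity matrix, (q, m)
    top_k = np.argsort(-sims, axis=1)[:, :k]  # indices of the k most similar images
    hits = [bool(positives[i] & set(top_k[i].tolist()))
            for i in range(len(positives))]
    return float(np.mean(hits))

# Toy data: 3 queries, 4 database images, 2-D descriptors.
db = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
queries = np.array([[0.9, 0.1], [0.1, 0.9], [-0.7, -0.7]])
queries /= np.linalg.norm(queries, axis=1, keepdims=True)
positives = [{0}, {1}, {0}]

r1 = recall_at_k(queries, db, positives, k=1)
r4 = recall_at_k(queries, db, positives, k=4)
print(r1, r4)
```

In this toy setup the third query points away from its only positive, so it misses at K=1 and is recovered only once K covers the whole database.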

中文标题/摘要

标题：SAGE：空间视觉自适应图探索在视觉地点识别中的应用

视觉地点识别（VPR）需要在存在大量外观、视角和环境变化的情况下，稳健地检索地理标记图像。先前的方法集中在描述符微调或固定采样策略上，但忽略了训练过程中空间上下文和视觉相似性之间的动态交互。我们提出了SAGE（空间视觉自适应图探索），这是一种统一的训练管道，通过联合改进局部特征聚合、训练期间组织样本和困难样本挖掘，增强粒度的空间视觉区分能力。我们引入了一个轻量级的Soft Probing模块，在双线性聚合之前从训练数据中学习补丁描述符的残差权重，增强独特的局部线索。在训练过程中，我们重建了一个在线的地理视觉图，融合了地理邻近性和当前的视觉相似性，使得候选邻域反映了不断变化的嵌入景观。为了集中学习最有信息量的地点邻域，我们从高亲和力锚点中初始化簇，并使用贪婪加权团簇扩展采样器迭代扩展它们。使用冻结的DINOv2主干和参数高效的微调实现，SAGE在八个基准测试中达到了SOTA。它分别在SPED、Pitts30k-test、MSLS-val和Nordland上达到了98.9%、95.8%、94.5%和96.0%的Recall@1。值得注意的是，仅使用4096D全局描述符，我们的方法在SPED上达到了100%的Recall@10。代码和模型将在接受后发布。

Summary / 总结

SAGE is designed to improve Visual Place Recognition by addressing the dynamic interplay between spatial context and visual similarity. It uses a unified training pipeline that enhances local feature aggregation, organizes training samples, and mines hard samples. SAGE introduces a Soft Probing module to boost local cues and constructs an online geo-visual graph to reflect the evolving embedding landscape. SAGE achieves state-of-the-art results on eight benchmarks, with Recall@1 scores of 98.9%, 95.8%, 94.5%, and 96.0% on SPED, Pitts30k-test, MSLS-val, and Nordland, respectively. Notably, it reaches 100% Recall@10 on SPED using 4096D global descriptors.

SAGE 通过解决空间上下文和视觉相似性之间的动态交互来提升视觉地点识别。它引入了轻量级的 Soft Probing 模块和在线地理视觉图，以增强局部特征聚合和样本组织。SAGE 在八个基准测试中取得了最先进的结果，分别在 SPED、Pitts30k-test、MSLS-val 和 Nordland 上实现了 98.9%、95.8%、94.5% 和 96.0% 的 Recall@1。值得注意的是，仅使用 4096D 全局描述符，SAGE 在 SPED 上达到了 100% 的 Recall@10。通过冻结 DINOv2 主干和参数高效微调实现，SAGE 显著提升了 VPR 系统的鲁棒性。

Generative Neural Video Compression via Video Diffusion Prior

Authors: Qi Mao, Hao Cheng, Tinghan Yang, Libiao Jin, Siwei Ma

First: 2025-12-04T17:27:32+00:00 · Latest: 2025-12-04T17:27:32+00:00

Abs · PDF · Code1 · Code2

Abstract

We present GNVC-VD, the first DiT-based generative neural video compression framework built upon an advanced video generation foundation model, where spatio-temporal latent compression and sequence-level generative refinement are unified within a single codec. Existing perceptual codecs primarily rely on pre-trained image generative priors to restore high-frequency details, but their frame-wise nature lacks temporal modeling and inevitably leads to perceptual flickering. To address this, GNVC-VD introduces a unified flow-matching latent refinement module that leverages a video diffusion transformer to jointly enhance intra- and inter-frame latents through sequence-level denoising, ensuring consistent spatio-temporal details. Instead of denoising from pure Gaussian noise as in video generation, GNVC-VD initializes refinement from decoded spatio-temporal latents and learns a correction term that adapts the diffusion prior to compression-induced degradation. A conditioning adaptor further injects compression-aware cues into intermediate DiT layers, enabling effective artifact removal while maintaining temporal coherence under extreme bitrate constraints. Extensive experiments show that GNVC-VD surpasses both traditional and learned codecs in perceptual quality and significantly reduces the flickering artifacts that persist in prior generative approaches, even below 0.01 bpp, highlighting the promise of integrating video-native generative priors into neural codecs for next-generation perceptual video compression.

中文标题/摘要

标题：基于视频扩散先验的生成神经视频压缩

我们提出了GNVC-VD，这是第一个基于DiT的生成神经视频压缩框架，它建立在先进的视频生成基础模型之上，其中时空潜空间压缩和序列级生成细化在单一编解码器中统一。现有的感知编解码器主要依赖于预训练的图像生成先验来恢复高频细节，但它们的帧级性质缺乏时间建模，不可避免地导致感知闪烁。为了解决这个问题，GNVC-VD 引入了一个统一的流动匹配潜空间细化模块，利用视频扩散变换器在序列级去噪的同时联合增强帧内和帧间潜空间，确保时空细节的一致性。与视频生成中从纯高斯噪声去噪不同，GNVC-VD 从解码的时空潜空间开始细化，并学习一个适应压缩引起的退化的校正项。进一步的条件适配器将压缩感知线索注入中间的DiT层，使在极端比特率约束下仍能有效去除伪影并保持时间连贯性。广泛的实验表明，GNVC-VD 在感知质量上超过了传统和学习编解码器，并显著减少了先前生成方法中持续存在的闪烁伪影，甚至在低于0.01 bpp的情况下，突显了将视频原生生成先验整合到神经编解码器中进行下一代感知视频压缩的潜力。

Summary / 总结

GNVC-VD is a novel generative neural video compression framework that integrates a video diffusion transformer to unify spatio-temporal latent compression and sequence-level generative refinement. It addresses the limitations of existing perceptual codecs by introducing a flow-matching latent refinement module that enhances intra- and inter-frame latents through sequence-level denoising, ensuring consistent spatio-temporal details. Experimental results demonstrate that GNVC-VD outperforms traditional and learned codecs in perceptual quality and significantly reduces flickering artifacts, even at very low bitrates.

GNVC-VD 是一种新颖的生成神经视频压缩框架，通过序列级去噪增强帧内和帧间潜变量，解决了现有编解码器的感知闪烁问题。它引入了一个统一的流动匹配潜变量精炼模块和一个条件适配器，以提高时间连贯性和减少伪影。实验结果表明，GNVC-VD 在感知质量上优于传统和学习型编解码器，并且在极低比特率下减少了闪烁伪影。

Convergence of Stochastic Gradient Langevin Dynamics in the Lazy Training Regime

Authors: Noah Oberweis, Semih Cayci

First: 2025-10-24T08:28:53+00:00 · Latest: 2025-12-04T17:25:34+00:00

Abs · PDF · Code1 · Code2

Abstract

Continuous-time models provide important insights into the training dynamics of optimization algorithms in deep learning. In this work, we establish a non-asymptotic convergence analysis of stochastic gradient Langevin dynamics (SGLD), which is an Itô stochastic differential equation (SDE) approximation of stochastic gradient descent in continuous time, in the lazy training regime. We show that, under regularity conditions on the Hessian of the loss function, SGLD with multiplicative and state-dependent noise (i) yields a non-degenerate kernel throughout the training process with high probability, and (ii) achieves exponential convergence to the empirical risk minimizer in expectation, and we establish finite-time and finite-width bounds on the optimality gap. We corroborate our theoretical findings with numerical examples in the regression setting.
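As a rough illustration of the kind of dynamics discussed above, the sketch below applies Euler-Maruyama steps to a Langevin-type SDE, dθ = −∇L(θ) dt + σ dW, with mini-batch gradients on a small linear regression problem. It rests on several simplifying assumptions that differ from the paper's setting: the noise scale `sigma` is constant rather than multiplicative and state-dependent, the model is linear rather than a wide neural network, and all function names and hyperparameters are hypothetical.

```python
import numpy as np

def sgld_linear_regression(x, y, dt=0.05, sigma=0.01, steps=2000, batch=10, seed=0):
    """Euler-Maruyama steps for d(theta) = -grad L(theta) dt + sigma dW.

    Gradients use random mini-batches; sigma is a constant noise scale
    (an assumption made here, not the noise model analyzed in the paper).
    """
    rng = np.random.default_rng(seed)
    n, d = x.shape
    theta = np.zeros(d)
    for _ in range(steps):
        idx = rng.choice(n, size=batch, replace=False)
        residual = x[idx] @ theta - y[idx]
        grad = x[idx].T @ residual / batch  # gradient of 0.5 * mean squared error
        noise = sigma * np.sqrt(dt) * rng.normal(size=d)
        theta = theta - dt * grad + noise
    return theta

def mse_loss(x, y, theta):
    """Half the mean squared error on the full dataset."""
    return 0.5 * np.mean((x @ theta - y) ** 2)

rng = np.random.default_rng(1)
x = rng.normal(size=(50, 3))
theta_star = np.array([1.0, -2.0, 0.5])
y = x @ theta_star  # noiseless targets, so theta_star minimizes the empirical risk

theta_hat = sgld_linear_regression(x, y)
initial_loss = mse_loss(x, y, np.zeros(3))
final_loss = mse_loss(x, y, theta_hat)
print(initial_loss, final_loss)
```

With a small `sigma`, the iterate settles in a narrow neighborhood of the empirical risk minimizer; raising `sigma` widens that neighborhood.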

中文标题/摘要

标题：随机梯度朗之万动力学在懒训练机制下的收敛性

连续时间模型为优化算法在深度学习中的训练动力学提供了重要的见解。在本文中，我们建立了随机梯度朗之万动力学（SGLD）在懒训练机制下的非渐近收敛分析，SGLD是随机梯度下降在连续时间下的伊藤随机微分方程（SDE）近似。我们证明，在损失函数海森矩阵的正则性条件下，具有乘性和状态依赖噪声的SGLD (i) 在训练过程中以高概率产生非退化核，(ii) 在期望意义下以指数速度收敛到经验风险最小化器，并建立了最优性间隙的有限时间与有限宽度界。我们通过回归设置中的数值例子验证了我们的理论发现。

Summary / 总结

The research aims to analyze the convergence of stochastic gradient Langevin dynamics (SGLD) in the lazy training regime, providing insights into the training dynamics of deep learning optimization algorithms. The study shows that SGLD with multiplicative and state-dependent noise ensures a non-degenerate kernel throughout training and achieves exponential convergence to the empirical risk minimizer in expectation. Theoretical findings are supported by numerical examples in the regression setting.

该研究探讨了SGLD在懒训练阶段的收敛性质。研究证明，在损失函数Hessian矩阵满足一定条件的情况下，SGLD（一种SDE近似）能够保持非退化核，并以指数速度收敛到经验风险最小化器。研究还提供了优化间隙的有限时间和有限宽度界。回归设置中的数值示例支持了这些理论发现。

Detecting Perspective Shifts in Multi-agent Systems

Authors: Eric Bridgeford, Hayden Helm

First: 2025-12-04T17:24:56+00:00 · Latest: 2025-12-04T17:24:56+00:00

Abs · PDF · Code1 · Code2

Abstract

Generative models augmented with external tools and update mechanisms (or agents) have demonstrated capabilities beyond intelligent prompting of base models. As agent use proliferates, dynamic multi-agent systems have naturally emerged. Recent work has investigated the theoretical and empirical properties of low-dimensional representations of agents based on query responses at a single time point. This paper introduces the Temporal Data Kernel Perspective Space (TDKPS), which jointly embeds agents across time, and proposes several novel hypothesis tests for detecting behavioral change at the agent- and group-level in black-box multi-agent systems. We characterize the empirical properties of our proposed tests, including their sensitivity to key hyperparameters, in simulations motivated by a multi-agent system of evolving digital personas. Finally, we demonstrate via natural experiment that our proposed tests detect changes that correlate sensitively, specifically, and significantly with a real exogenous event. As far as we are aware, TDKPS is the first principled framework for monitoring behavioral dynamics in black-box multi-agent systems, a critical capability as generative agent deployment continues to scale.

中文标题/摘要

标题：多智能体系统中的视角转变检测

增强生成模型并结合外部工具和更新机制（或称为智能体）已经展示了超越基础模型智能提示的能力。随着智能体的广泛应用，动态多智能体系统自然地出现了。近期研究已经探讨了基于查询响应的单一时点智能体低维表示的理论和实证性质。本文引入了时间数据核视角空间（TDKPS），该方法联合嵌入了跨时间的智能体，并提出了一些用于检测黑盒多智能体系统中智能体和群体层面行为变化的若干新颖假设检验方法。我们通过模拟多智能体系统中不断演变的数字人物来表征我们提出的检验方法的实证性质，包括它们对关键超参数的敏感性。最后，通过自然实验表明，我们提出的检验方法能够检测到与真实外生事件高度敏感、具体和显著相关的变化。据我们所知，TDKPS是第一个用于监控黑盒多智能体系统中行为动态的原理性框架——随着生成智能体部署的不断扩大，这一能力至关重要。

Experience Replay with Random Reshuffling

Authors: Yasuhiro Fujita

First: 2025-03-04T04:37:22+00:00 · Latest: 2025-12-04T17:20:31+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Experience replay is a key component in reinforcement learning for stabilizing learning and improving sample efficiency. Its typical implementation samples transitions with replacement from a replay buffer. In contrast, in supervised learning with a fixed dataset, it is a common practice to shuffle the dataset every epoch and consume data sequentially, which is called random reshuffling (RR). RR enjoys theoretically better convergence properties and has been shown to outperform with-replacement sampling empirically. To leverage the benefits of RR in reinforcement learning, we propose sampling methods that extend RR to experience replay, both in uniform and prioritized settings, and analyze their properties via theoretical analysis and simulations. We evaluate our sampling methods on Atari benchmarks, demonstrating their effectiveness in deep reinforcement learning. Code is available at https://github.com/pfnet-research/errr.
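The contrast between with-replacement sampling and random reshuffling can be sketched in a few lines. The class below is a deliberately simplified illustration that assumes a static buffer whose contents never change; a real replay buffer grows and overwrites old transitions, which is the setting the paper's methods address and which this sketch does not attempt to reproduce. The class and method names are hypothetical.

```python
import random

class ReshuffledReplayBuffer:
    """Static buffer sampled by random reshuffling (simplified sketch)."""

    def __init__(self, transitions, seed=0):
        self.transitions = list(transitions)
        self.rng = random.Random(seed)
        self.order = []  # indices still to be consumed in the current epoch

    def sample(self, batch_size):
        """Consume the shuffled order sequentially, reshuffling when it runs out."""
        batch = []
        while len(batch) < batch_size:
            if not self.order:
                # Start a new epoch: shuffle all indices once.
                self.order = list(range(len(self.transitions)))
                self.rng.shuffle(self.order)
            batch.append(self.transitions[self.order.pop()])
        return batch

    def sample_with_replacement(self, batch_size):
        """Typical baseline: independent uniform draws from the buffer."""
        return [self.rng.choice(self.transitions) for _ in range(batch_size)]

buffer = ReshuffledReplayBuffer(range(8))
epoch = buffer.sample(4) + buffer.sample(4)
print(sorted(epoch))  # every transition appears exactly once per epoch
```

Under reshuffling, each transition is visited exactly once before any is repeated, whereas independent draws can repeat some transitions and skip others within the same number of samples.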

中文标题/摘要

标题：随机重洗的经验重放

经验重放是强化学习中的关键组件，用于稳定学习并提高样本效率。其典型实现是从重放缓冲区中带放回地采样过渡。相比之下，在监督学习中，固定数据集每轮重新洗牌并顺序消耗数据的做法称为随机重洗（RR）。理论分析表明，RR 具有更优的收敛性质，并且在实验中表现出优于带放回采样的性能。为了在强化学习中利用 RR 的优势，我们提出了扩展 RR 到经验重放的采样方法，包括均匀和优先级设置，并通过理论分析和模拟研究了这些方法的性质。我们在 Atari 基准上评估了我们的采样方法，展示了其在深度强化学习中的有效性。代码可在 https://github.com/pfnet-research/errr 获取。

Summary / 总结

The paper extends random reshuffling (RR), the common supervised-learning practice of shuffling a dataset every epoch and consuming it sequentially, to experience replay in reinforcement learning. It proposes RR-based sampling methods for both uniform and prioritized experience replay and analyzes their properties through theoretical analysis and simulations. Evaluations on Atari benchmarks demonstrate the effectiveness of these methods in deep reinforcement learning.

论文研究了在强化学习的经验重放中应用随机重洗（RR）的方法，旨在提高学习的稳定性和样本效率。提出了适用于均匀和优先级设置的RR采样方法，并在Atari基准上进行了评估，展示了其在深度强化学习中的有效性。理论分析和模拟支持了所提出的方法。代码可在 https://github.com/pfnet-research/errr 获取。

Reflection Removal through Efficient Adaptation of Diffusion Transformers

Authors: Daniyar Zakarin, Thiemo Wandel, Anton Obukhov, Dengxin Dai

First: 2025-12-04T17:12:39+00:00 · Latest: 2025-12-04T17:12:39+00:00

Abs · PDF · Code1 · Code2

Abstract

We introduce a diffusion-transformer (DiT) framework for single-image reflection removal that leverages the generalization strengths of foundation diffusion models in the restoration setting. Rather than relying on task-specific architectures, we repurpose a pre-trained DiT-based foundation model by conditioning it on reflection-contaminated inputs and guiding it toward clean transmission layers. We systematically analyze existing reflection removal data sources for diversity, scalability, and photorealism. To address the shortage of suitable data, we construct a physically based rendering (PBR) pipeline in Blender, built around the Principled BSDF, to synthesize realistic glass materials and reflection effects. Efficient LoRA-based adaptation of the foundation model, combined with the proposed synthetic data, achieves state-of-the-art performance on in-domain and zero-shot benchmarks. These results demonstrate that pretrained diffusion transformers, when paired with physically grounded data synthesis and efficient adaptation, offer a scalable and high-fidelity solution for reflection removal. Project page: https://hf.co/spaces/huawei-bayerlab/windowseat-reflection-removal-web
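The LoRA-based adaptation mentioned above keeps the pre-trained weights frozen and trains only a low-rank update added to selected linear layers. The NumPy sketch below shows the idea for a single layer. It is a generic illustration, not the authors' training code: the class name, rank, scaling factor `alpha`, and initialization scale are assumptions chosen for the example.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                  # frozen, never updated
        self.a = rng.normal(scale=0.01, size=(rank, in_dim))  # trainable
        self.b = np.zeros((out_dim, rank))                    # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # x: (batch, in_dim) -> (batch, out_dim)
        return x @ self.weight.T + self.scale * (x @ self.a.T) @ self.b.T

    def trainable_params(self):
        return self.a.size + self.b.size

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 128))
layer = LoRALinear(w, rank=4)
x = rng.normal(size=(2, 128))

print(np.allclose(layer(x), x @ w.T))    # True: zero-initialized B leaves the output unchanged
print(layer.trainable_params(), w.size)  # 768 trainable vs 8192 frozen
```

Because `b` starts at zero, the adapted layer initially reproduces the foundation model exactly, and fine-tuning only has to learn the task-specific correction.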

中文标题/摘要

标题：通过高效适应扩散变换器进行单图像反射去除

我们提出了一种基于扩散变换器（DiT）的单图像反射去除框架，该框架利用了基础扩散模型在恢复设置中的泛化优势。我们不是依赖于特定任务的架构，而是通过将预训练的DiT基础模型重新用于反射污染的输入，并引导其生成清洁的传输层来重新利用它。我们系统地分析了现有的反射去除数据源的多样性、可扩展性和逼真度。为了解决适合数据的短缺，我们在Blender中构建了一个基于物理基础渲染（PBR）的管道，围绕Principled BSDF来合成真实的玻璃材料和反射效果。结合提出的合成数据和高效的LoRA基础模型适应，实现了在领域内和零样本基准上的最佳性能。这些结果表明，当与物理基础的数据合成和高效的适应相结合时，预训练的扩散变换器提供了一种可扩展且高保真的反射去除解决方案。项目页面：https://hf.co/spaces/huawei-bayerlab/windowseat-reflection-removal-web

Summary / 总结

The research aims to develop an efficient method for single-image reflection removal using a diffusion-transformer (DiT) framework. The method repurposes a pre-trained DiT model by conditioning it on reflection-contaminated images and guiding it towards clean transmission layers. The study constructs a physically based rendering pipeline to synthesize realistic glass materials and reflection effects, and demonstrates that this approach achieves state-of-the-art performance on in-domain and zero-shot benchmarks, showcasing the scalability and high fidelity of the solution for reflection removal.

论文提出了一种基于扩散变换器（DiT）的单图像反光去除框架，利用预训练的DiT基础模型通过条件化反射污染输入来适应清洁的传输层。作者在Blender中构建了一个基于物理渲染（PBR）的管道，以合成真实的玻璃材料和反射效果，解决数据稀缺问题。通过高效LoRA基适应基础模型与合成数据的结合，实现了在域内和零样本基准上的最新性能，展示了预训练扩散变换器在反光去除中的可扩展性和高保真度。