arXiv Paper Digest

Light-X: Generative 4D Video Rendering with Camera and Illumination Control

Authors: Tianqi Liu, Zhaoxi Chen, Zihao Huang, Shaocong Xu, Saining Zhang, Chongjie Ye, Bohan Li, Zhiguo Cao, Wei Li, Hao Zhao, Ziwei Liu

First: 2025-12-04T18:59:57+00:00 · Latest: 2025-12-04T18:59:57+00:00

Comments: Project Page: https://lightx-ai.github.io/

Abstract

Recent advances in illumination control extend image-based methods to video, yet they still face a trade-off between lighting fidelity and temporal consistency. Moving beyond relighting, a key step toward generative modeling of real-world scenes is the joint control of camera trajectory and illumination, since visual dynamics are inherently shaped by both geometry and lighting. To this end, we present Light-X, a video generation framework that enables controllable rendering from monocular videos with both viewpoint and illumination control. 1) We propose a disentangled design that decouples geometry and lighting signals: geometry and motion are captured via dynamic point clouds projected along user-defined camera trajectories, while illumination cues are provided by a relit frame consistently projected into the same geometry. These explicit, fine-grained cues enable effective disentanglement and guide high-quality illumination. 2) To address the lack of paired multi-view and multi-illumination videos, we introduce Light-Syn, a degradation-based pipeline with inverse-mapping that synthesizes training pairs from in-the-wild monocular footage. This strategy yields a dataset covering static, dynamic, and AI-generated scenes, ensuring robust training. Extensive experiments show that Light-X outperforms baseline methods in joint camera-illumination control and surpasses prior video relighting methods under both text- and background-conditioned settings.
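The geometric cue in the abstract above, a point cloud projected along a user-defined camera trajectory, can be illustrated with a standard pinhole projection. This is only a sketch under stated assumptions: the function name `project_points`, the intrinsics, and the world-to-camera pose convention are hypothetical, and the abstract does not describe the renderer Light-X actually uses.

```python
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]


def project_points(points: List[Vec3], R: Mat3, t: Vec3,
                   fx: float, fy: float, cx: float, cy: float
                   ) -> List[Optional[Tuple[float, float]]]:
    """Project world-space points to pixel coordinates with a pinhole model.

    Assumed convention: R and t map world to camera coordinates,
    X_cam = R @ X + t. Points at or behind the camera plane give None.
    """
    pixels: List[Optional[Tuple[float, float]]] = []
    for p in points:
        x, y, z = (sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
        if z <= 0.0:
            pixels.append(None)
            continue
        pixels.append((fx * x / z + cx, fy * y / z + cy))
    return pixels


IDENTITY: Mat3 = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))

# One pose per output frame. The camera slides along +x, so the
# world-to-camera translation moves along -x.
trajectory = [(-0.1 * k, 0.0, 0.0) for k in range(3)]
frame_points = [(0.0, 0.0, 2.0), (0.5, 0.0, 2.0)]
rendered = [project_points(frame_points, IDENTITY, t, 500.0, 500.0, 320.0, 240.0)
            for t in trajectory]
```

For a dynamic scene the point list would change per frame as well; the sketch keeps it fixed to isolate the effect of the camera path.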

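Title/Abstract (translated from Chinese)

Title: Light-X: Generative 4D Video Rendering with Camera and Illumination Control

Recent progress in illumination control has extended image-based methods to video, but these methods still face a trade-off between lighting fidelity and temporal consistency. Going beyond relighting, a key step toward generating real-world scenes is joint control of camera trajectory and illumination, because visual dynamics are shaped by both geometry and lighting. To this end, we present Light-X, a video generation framework that enables controllable rendering from monocular video with both viewpoint and illumination control. 1) We propose a disentangled design that separates geometry and lighting signals: geometry and motion are captured by dynamic point clouds projected along user-defined camera trajectories, while illumination cues come from a relit frame consistently projected into the same geometry. These explicit, fine-grained cues support effective disentanglement and guide high-quality illumination. 2) To address the lack of paired multi-view, multi-illumination videos, we introduce Light-Syn, a degradation-based pipeline that uses inverse mapping to synthesize training pairs from in-the-wild monocular footage. This yields a dataset covering static, dynamic, and AI-generated scenes, ensuring robust training. Extensive experiments show that Light-X outperforms baseline methods in joint camera-illumination control and surpasses prior video relighting methods under both text- and background-conditioned settings.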

Summary

Light-X is a video generation framework that controls both camera trajectory and illumination to achieve high-quality 4D rendering from monocular videos. It uses a disentangled design to capture geometry and motion via dynamic point clouds and provides illumination cues through relit frames. Light-Syn, a degradation-based pipeline, synthesizes training pairs from monocular footage to cover various scenes. Experiments show that Light-X outperforms baseline methods in joint camera-illumination control and surpasses prior video relighting methods in both text- and background-conditioned settings.

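Light-X is a video generation framework that controls camera trajectory and illumination simultaneously to achieve high-quality 4D video rendering. It uses a disentangled design that captures geometry and motion through dynamic point clouds and provides explicit illumination cues. It also introduces Light-Syn, a degradation-based method that synthesizes training pairs from monocular video to ensure robust training across varied scenes. Experiments show that Light-X outperforms baseline methods in joint camera-illumination control and surpasses prior video relighting methods under text- and background-conditioned settings.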

Splannequin: Freezing Monocular Mannequin-Challenge Footage with Dual-Detection Splatting

Authors: Hao-Jen Chien, Yi-Chuan Huang, Chung-Ho Wu, Wei-Lun Chao, Yu-Lun Liu

Venue: WACV 2025

First: 2025-12-04T18:59:53+00:00 · Latest: 2025-12-04T18:59:53+00:00

Comments: WACV 2025. Project page: https://chien90190.github.io/splannequin/

Abstract

Synthesizing high-fidelity frozen 3D scenes from monocular Mannequin-Challenge (MC) videos is a unique problem distinct from standard dynamic scene reconstruction. Instead of focusing on modeling motion, our goal is to create a frozen scene while strategically preserving subtle dynamics to enable user-controlled instant selection. To achieve this, we introduce a novel application of dynamic Gaussian splatting: the scene is modeled dynamically, which retains nearby temporal variation, and a static scene is rendered by fixing the model's time parameter. However, under this usage, monocular capture with sparse temporal supervision introduces artifacts like ghosting and blur for Gaussians that become unobserved or occluded at weakly supervised timestamps. We propose Splannequin, an architecture-agnostic regularization that detects two states of Gaussian primitives, hidden and defective, and applies temporal anchoring. Under predominantly forward camera motion, hidden states are anchored to their recent well-observed past states, while defective states are anchored to future states with stronger supervision. Our method integrates into existing dynamic Gaussian pipelines via simple loss terms, requires no architectural changes, and adds zero inference overhead. This results in markedly improved visual quality, enabling high-fidelity, user-selectable frozen-time renderings, validated by a 96% user preference. Project page: https://chien90190.github.io/splannequin/

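Title/Abstract (translated from Chinese)

Title: Splannequin: Freezing Monocular Mannequin-Challenge Footage with Dual-Detection Splatting

Synthesizing high-fidelity frozen 3D scenes from monocular Mannequin-Challenge (MC) videos is a distinct problem from standard dynamic scene reconstruction. Rather than modeling motion, our goal is to create a frozen scene while strategically preserving subtle dynamics so that users can select the instant to freeze. To this end, we introduce a new application of dynamic Gaussian splatting: modeling the scene dynamically retains nearby temporal variation, and a static scene is rendered by fixing the model's time parameter. Used this way, however, monocular capture with sparse temporal supervision introduces artifacts such as ghosting and blur for Gaussians that become unobserved or occluded at weakly supervised timestamps. We propose Splannequin, an architecture-agnostic regularization that detects two states of Gaussian primitives, hidden and defective, and applies temporal anchoring. Under predominantly forward camera motion, hidden states are anchored to their most recent well-observed past states, while defective states are anchored to future states with stronger supervision. Our method integrates into existing dynamic Gaussian pipelines through simple loss terms, requires no architectural changes, and adds no inference overhead. The result is markedly better visual quality and high-fidelity, user-selectable frozen-time renderings, validated by a 96% user preference. Project page: https://chien90190.github.io/splannequin/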

Summary

The research aims to synthesize high-fidelity frozen 3D scenes from monocular Mannequin-Challenge videos by strategically preserving subtle dynamics. The method uses dynamic Gaussian splatting to model the scene while retaining nearby temporal variation and rendering a static scene by fixing the time parameter. Splannequin, an architecture-agnostic regularization, detects hidden and defective Gaussian states and applies temporal anchoring to mitigate artifacts like ghosting and blur, resulting in improved visual quality and user-selectable frozen-time renderings with 96% user preference.

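The research aims to synthesize high-fidelity frozen 3D scenes from monocular Mannequin-Challenge videos, with an emphasis on preserving subtle dynamics for user-controlled instant selection. The method models the scene with dynamic Gaussian splatting and renders a static scene by fixing the time parameter. Splannequin is a regularization technique that detects hidden and defective Gaussian states and applies temporal anchoring to reduce artifacts such as ghosting and blur. It integrates into existing pipelines through simple loss terms, adds no inference overhead, and its markedly improved visual quality is validated by a 96% user preference.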

DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation

Authors: Dongzhi Jiang, Renrui Zhang, Haodong Li, Zhuofan Zong, Ziyu Guo, Jun He, Claire Guo, Junyan Ye, Rongyao Fang, Weijia Li, Rui Liu, Hongsheng Li

First: 2025-12-04T18:59:53+00:00 · Latest: 2025-12-04T18:59:53+00:00

Comments: Project Page: https://github.com/CaraJ7/DraCo

Abstract

Recent unified multimodal large language models (MLLMs) have shown impressive capabilities, incorporating chain-of-thought (CoT) reasoning for enhanced text-to-image generation. However, existing approaches remain limited, either treating the model merely as a standalone generator or relying on abstract textual planning. To this end, we propose Draft-as-CoT (DraCo), a novel interleaved reasoning paradigm that fully leverages both textual and visual contents in CoT for better planning and verification. Our method first generates a low-resolution draft image as a preview, providing more concrete and structural visual planning and guidance. Then, we employ the model's inherent understanding capability to verify potential semantic misalignments between the draft and input prompt, and perform refinement through selective corrections with super-resolution. In this way, our approach addresses two fundamental challenges: the coarse-grained nature of textual planning and the difficulty in generating rare attribute combinations. To support training, we curate DraCo-240K, aiming to enhance three atomic capabilities spanning general correction, instance manipulation, and layout reorganization. Supported by DraCo-CFG, a specialized classifier-free guidance (CFG) strategy for interleaved reasoning, DraCo achieves a tremendous increase on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%), significantly outperforming direct generation and other generation methods empowered by CoT.

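Title/Abstract (translated from Chinese)

Title: DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation

Recent unified multimodal large language models (MLLMs) have shown impressive capabilities, using chain-of-thought (CoT) reasoning to enhance text-to-image generation. Existing approaches remain limited, however: they either treat the model merely as a standalone generator or rely on abstract textual planning. To this end, we propose Draft-as-CoT (DraCo), a novel interleaved reasoning paradigm that fully leverages both textual and visual content in CoT for better planning and verification. Our method first generates a low-resolution draft image as a preview, providing more concrete visual planning and guidance. We then use the model's inherent understanding capability to verify potential semantic misalignments between the draft and the input prompt, and refine the draft through selective corrections with super-resolution. In this way, our approach addresses two fundamental challenges: the coarse-grained nature of textual planning and the difficulty of generating rare attribute combinations. To support training, we curate DraCo-240K, which aims to strengthen three atomic capabilities: general correction, instance manipulation, and layout reorganization. Supported by DraCo-CFG, a classifier-free guidance (CFG) strategy specialized for interleaved reasoning, DraCo achieves large gains on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%), significantly outperforming direct generation and other CoT-empowered generation methods.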

Summary

DraCo proposes a novel interleaved reasoning paradigm called Draft-as-CoT to enhance text-to-image generation by leveraging both textual and visual contents in the chain-of-thought process. It generates a low-resolution draft image as a preview, which provides concrete visual planning and guidance, and then verifies and refines the draft through selective corrections with super-resolution. DraCo significantly improves generation quality on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%) compared to direct generation and other CoT-empowered methods.

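DraCo addresses the limitations of existing text-to-image generation methods by proposing a novel interleaved reasoning paradigm, Draft-as-CoT (DraCo). It first generates a low-resolution draft as a preview, providing concrete visual planning and guidance, and then refines the draft through selective corrections with super-resolution. DraCo significantly outperforms direct generation and other CoT-based generation methods on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%).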

ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning

Authors: Shengyuan Ding, Xinyu Fang, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiangyu Zhao, Haodong Duan, Xiaoyi Dong, Jianze Liang, Bin Wang, Conghui He, Dahua Lin, Jiaqi Wang

First: 2025-12-04T18:59:52+00:00 · Latest: 2025-12-04T18:59:52+00:00

Abstract

Reward models are critical for aligning vision-language systems with human preferences, yet current approaches suffer from hallucination, weak visual grounding, and an inability to use tools for verification, limiting their reliability on complex multimodal reasoning tasks. We present ARM-Thinker, an Agentic multimodal Reward Model that autonomously invokes external tools (e.g., image cropping, doc page retrieval) to ground judgments in verifiable evidence, replacing static, non-interactive reward scoring. This enables the model to verify fine-grained visual details, cross-reference multi-page evidence, and validate reasoning claims, which are capabilities absent in existing reward models. We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy. To evaluate agentic reward modeling, we introduce ARMBench-VL, comprising three benchmarks that assess fine-grained visual grounding (image-level tools), multi-page document understanding (retrieval tools), and instruction following (text-level verification). ARM-Thinker achieves +16.2% average improvement on reward modeling benchmarks, +9.6% on tool-use tasks, and outperforms baselines on multimodal math and logical reasoning benchmarks. Our results demonstrate that agentic capabilities significantly enhance both accuracy and interpretability of reward models.

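Title/Abstract (translated from Chinese)

Title: ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning

Reward models are critical for aligning vision-language systems with human preferences, yet current approaches suffer from hallucination, weak visual grounding, and an inability to use tools for verification, which limits their reliability on complex multimodal reasoning tasks. We present ARM-Thinker, an agentic multimodal reward model that autonomously invokes external tools (e.g., image cropping, document page retrieval) to ground its judgments in verifiable evidence, replacing static, non-interactive reward scoring. This lets the model verify fine-grained visual details, cross-reference multi-page evidence, and validate reasoning claims, capabilities that existing reward models lack. We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy. To evaluate agentic reward modeling, we introduce ARMBench-VL, which comprises three benchmarks assessing fine-grained visual grounding (image-level tools), multi-page document understanding (retrieval tools), and instruction following (text-level verification). ARM-Thinker achieves a 16.2% average improvement on reward modeling benchmarks and a 9.6% improvement on tool-use tasks, and outperforms baselines on multimodal math and logical reasoning benchmarks. Our results show that agentic capabilities significantly improve both the accuracy and the interpretability of reward models.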

Summary

ARM-Thinker is a multimodal reward model that autonomously uses external tools for visual reasoning and verification, addressing limitations of current models in visual grounding and tool use. It is trained with multi-stage reinforcement learning to optimize tool-calling decisions and judgment accuracy. ARM-Thinker shows significant improvements in reward modeling and tool-use tasks, and outperforms baselines in multimodal reasoning benchmarks, indicating enhanced accuracy and interpretability.

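ARM-Thinker is a multimodal reward model that autonomously uses external tools for visual reasoning and verification, addressing the limitations of current models in visual grounding and tool use. It is trained with multi-stage reinforcement learning to optimize tool-calling decisions and judgment accuracy. ARM-Thinker shows significant improvements on reward modeling and tool-use tasks and outperforms baseline models on multimodal reasoning benchmarks, indicating improved accuracy and interpretability.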

ShadowDraw: From Any Object to Shadow-Drawing Compositional Art

Authors: Rundong Luo, Noah Snavely, Wei-Chiu Ma

First: 2025-12-04T18:59:51+00:00 · Latest: 2025-12-04T18:59:51+00:00

Comments: Project page: https://red-fairy.github.io/ShadowDraw/

Abstract

We introduce ShadowDraw, a framework that transforms ordinary 3D objects into shadow-drawing compositional art. Given a 3D object, our system predicts scene parameters, including object pose and lighting, together with a partial line drawing, such that the cast shadow completes the drawing into a recognizable image. To this end, we optimize scene configurations to reveal meaningful shadows, employ shadow strokes to guide line drawing generation, and adopt automatic evaluation to enforce shadow-drawing coherence and visual quality. Experiments show that ShadowDraw produces compelling results across diverse inputs, from real-world scans and curated datasets to generative assets, and naturally extends to multi-object scenes, animations, and physical deployments. Our work provides a practical pipeline for creating shadow-drawing art and broadens the design space of computational visual art, bridging the gap between algorithmic design and artistic storytelling. Check out our project page https://red-fairy.github.io/ShadowDraw/ for more results and an end-to-end real-world demonstration of our pipeline!

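Title/Abstract (translated from Chinese)

Title: ShadowDraw: From Any Object to Shadow-Drawing Compositional Art

We introduce ShadowDraw, a framework that transforms ordinary 3D objects into shadow-drawing compositional art. Given a 3D object, our system predicts scene parameters, including object pose and lighting, together with a partial line drawing, such that the cast shadow completes the drawing into a recognizable image. To this end, we optimize scene configurations to reveal meaningful shadows, use shadow strokes to guide line drawing generation, and adopt automatic evaluation to enforce shadow-drawing coherence and visual quality. Experiments show that ShadowDraw produces compelling results across diverse inputs, from real-world scans and curated datasets to generative assets, and extends naturally to multi-object scenes, animations, and physical deployments. Our work provides a practical pipeline for creating shadow-drawing art and broadens the design space of computational visual art, bridging the gap between algorithmic design and artistic storytelling. Visit our project page https://red-fairy.github.io/ShadowDraw/ for more results and an end-to-end real-world demonstration of our pipeline!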

Summary

ShadowDraw is a framework that converts 3D objects into shadow-drawing compositional art by predicting scene parameters and optimizing shadow configurations. It generates partial line drawings that, when combined with the cast shadows, form recognizable images. Experiments demonstrate that ShadowDraw can produce compelling results across various inputs and extends to multi-object scenes and animations, offering a practical pipeline for creating shadow-drawing art and expanding the design space of computational visual art.

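ShadowDraw is a framework that converts 3D objects into shadow-drawing compositional art by predicting scene parameters and optimizing shadow configurations. It generates partial line drawings that, combined with the cast shadows, form recognizable images. Experiments show that ShadowDraw produces compelling results for a variety of inputs, such as real-world scans, curated datasets, and generative assets, and extends to multi-object scenes and physical deployments.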

Semantic Soft Bootstrapping: Long Context Reasoning in LLMs without Reinforcement Learning

Authors: Purbesh Mitra, Sennur Ulukus

First: 2025-12-04T18:59:18+00:00 · Latest: 2025-12-04T18:59:18+00:00

Abstract

Long context reasoning in large language models (LLMs) has demonstrated enhancement of their cognitive capabilities via chain-of-thought (CoT) inference. Training such models is usually done via reinforcement learning with verifiable rewards (RLVR) on reasoning-based problems, like math and programming. However, RLVR is limited by several bottlenecks, such as the lack of dense rewards and inadequate sample efficiency. As a result, it requires significant compute resources in the post-training phase. To overcome these limitations, in this work, we propose Semantic Soft Bootstrapping (SSB), a self-distillation technique in which the same base language model plays the role of both teacher and student, but receives different semantic contexts about the correctness of its outcome at training time. The model is first prompted with a math problem and several rollouts are generated. From them, the correct and most common incorrect responses are filtered, and then provided to the model in context to produce a more robust, step-by-step explanation with a verified final answer. This pipeline automatically curates a paired teacher-student training set from raw problem-answer data, without any human intervention. This generation process also produces a sequence of logits, which is what the student model tries to match in the training phase from the bare question alone. In our experiments, we train Qwen2.5-3B-Instruct on the GSM8K dataset via parameter-efficient fine-tuning. We then test its accuracy on the MATH500 and AIME2024 benchmarks. Our experiments show accuracy improvements of 10.6% and 10%, respectively, over group relative policy optimization (GRPO), which is a commonly used RLVR algorithm. Our code is available at https://github.com/purbeshmitra/semantic-soft-bootstrapping, and the model and curated dataset are available at https://huggingface.co/purbeshmitra/semantic-soft-bootstrapping.
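The two mechanical steps in the abstract above, choosing the in-context responses and matching teacher logits, might look roughly like the sketch below. Everything here is an assumption made for illustration: the function names are hypothetical, rollouts are modelled as (response text, final answer) pairs, and KL divergence is one plausible reading of "tries to match", which the abstract does not pin down.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple


def pick_context(rollouts: List[Tuple[str, str]], gold: str
                 ) -> Tuple[Optional[str], Optional[str]]:
    """Return one correct response and one response that carries the most
    common incorrect final answer (None when no such response exists)."""
    correct = next((text for text, ans in rollouts if ans == gold), None)
    wrong_counts = Counter(ans for _, ans in rollouts if ans != gold)
    if not wrong_counts:
        return correct, None
    top_wrong = wrong_counts.most_common(1)[0][0]
    incorrect = next(text for text, ans in rollouts if ans == top_wrong)
    return correct, incorrect


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distill_loss(teacher_logits: List[List[float]],
                 student_logits: List[List[float]]) -> float:
    """Mean per-token KL(teacher || student); an assumed matching objective."""
    loss = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        p, q = softmax(t_row), softmax(s_row)
        loss += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return loss / len(teacher_logits)
```

In this reading the teacher logits come from the model conditioned on the selected responses, while the student logits come from the same model given only the question.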

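Title/Abstract (translated from Chinese)

Title: Semantic Soft Bootstrapping: Long Context Reasoning in LLMs without Reinforcement Learning

Long context reasoning in large language models (LLMs) has shown that chain-of-thought (CoT) inference enhances their cognitive capabilities. Such models are usually trained with reinforcement learning with verifiable rewards (RLVR) on reasoning-based problems such as math and programming. However, RLVR is limited by several bottlenecks, such as sparse rewards and inadequate sample efficiency, so it requires substantial compute in the post-training phase. To overcome these limitations, this work proposes Semantic Soft Bootstrapping (SSB), a self-distillation technique in which the same base language model plays the roles of both teacher and student at training time but receives different semantic contexts about the correctness of its outcome. The model is first prompted with a math problem and generates several rollouts. From these, the correct response and the most common incorrect response are filtered out and provided to the model to produce a more robust, step-by-step explanation with a verified final answer. The pipeline automatically builds a teacher-student training set from raw problem-answer data without any human intervention. The generation process also produces a sequence of logits, which the student model tries to match from the bare question alone. In our experiments, we train Qwen2.5-3B-Instruct on the GSM8K dataset with parameter-efficient fine-tuning and then test its accuracy on the MATH500 and AIME2024 benchmarks. The results show accuracy improvements of 10.6% and 10%, respectively, over GRPO, a commonly used RLVR algorithm. Our code is available at https://github.com/purbeshmitra/semantic-soft-bootstrapping, and the model and curated dataset are available at https://huggingface.co/purbeshmitra/semantic-soft-bootstrapping.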

Summary

The research aims to enhance long context reasoning in large language models (LLMs) through a self-distillation technique called Semantic Soft Bootstrapping (SSB), which avoids the need for reinforcement learning with verifiable rewards (RLVR). SSB involves the base model acting as both teacher and student, receiving different semantic contexts about the correctness of its outcomes. This process generates a robust, step-by-step explanation and improves accuracy on MATH500 and AIME2024 benchmarks by 10.6% and 10% respectively, compared to the commonly used RLVR algorithm GRPO.

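The paper proposes Semantic Soft Bootstrapping (SSB), a self-distillation technique that strengthens long context reasoning in large language models (LLMs) without reinforcement learning with verifiable rewards (RLVR). SSB prompts the model with a math problem and generates rollouts, filters the correct and incorrect responses from them, and provides these to the model to produce a more robust, step-by-step explanation. The model is trained on the bootstrapped dataset generated from raw problem-answer data, without human intervention. With training on the GSM8K dataset, experiments show accuracy gains of 10.6% on MATH500 and 10% on AIME2024 over the commonly used GRPO algorithm.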

NeuralRemaster: Phase-Preserving Diffusion for Structure-Aligned Generation

Authors: Yu Zeng, Charles Ochoa, Mingyuan Zhou, Vishal M. Patel, Vitor Guizilini, Rowan McAllister

First: 2025-12-04T18:59:18+00:00 · Latest: 2025-12-04T18:59:18+00:00

Abstract

Standard diffusion corrupts data using Gaussian noise whose Fourier coefficients have random magnitudes and random phases. While effective for unconditional or text-to-image generation, corrupting phase components destroys spatial structure, making it ill-suited for tasks requiring geometric consistency, such as re-rendering, simulation enhancement, and image-to-image translation. We introduce Phase-Preserving Diffusion (φ-PD), a model-agnostic reformulation of the diffusion process that preserves input phase while randomizing magnitude, enabling structure-aligned generation without architectural changes or additional parameters. We further propose Frequency-Selective Structured (FSS) noise, which provides continuous control over structural rigidity via a single frequency-cutoff parameter. φ-PD adds no inference-time cost and is compatible with any diffusion model for images or videos. Across photorealistic and stylized re-rendering, as well as sim-to-real enhancement for driving planners, φ-PD produces controllable, spatially aligned results. When applied to the CARLA simulator, φ-PD improves CARLA-to-Waymo planner performance by 50%. The method is complementary to existing conditioning approaches and broadly applicable to image-to-image and video-to-video generation. Videos, additional examples, and code are available on our project page: https://yuzeng-at-tri.github.io/ppd-page/
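The core idea stated above, keeping the input's Fourier phase while taking the magnitude from Gaussian noise, can be shown on a 1-D signal with a naive DFT. This is a minimal sketch under assumptions, not the authors' implementation: the function names are hypothetical, the example works on a short list rather than an image, and it omits the noise schedule and the FSS frequency cutoff.

```python
import cmath
import random
from typing import List


def dft(x: List[complex]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X: List[complex]) -> List[complex]:
    """Naive inverse discrete Fourier transform."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def phase_preserving_noise(signal: List[float], rng: random.Random) -> List[float]:
    """Noise whose Fourier phase matches `signal` and whose Fourier
    magnitude is taken from white Gaussian noise of the same length."""
    spectrum = dft([complex(v) for v in signal])
    noise_spectrum = dft([complex(rng.gauss(0.0, 1.0)) for _ in signal])
    mixed = [abs(n_k) * cmath.exp(1j * cmath.phase(s_k))
             for s_k, n_k in zip(spectrum, noise_spectrum)]
    # Both inputs are real, so `mixed` is conjugate-symmetric and the
    # inverse transform is real up to floating-point error.
    return [v.real for v in idft(mixed)]
```

Because only magnitudes are randomized, edges and other phase-encoded structure in the input stay at the same positions in the noise.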

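Title/Abstract (translated from Chinese)

Title: NeuralRemaster: Phase-Preserving Diffusion for Structure-Aligned Generation

Standard diffusion corrupts data with Gaussian noise that has random magnitudes and random phases. Although this is effective for unconditional or text-to-image generation, corrupting the phase components destroys spatial structure, which makes it ill-suited to tasks that require geometric consistency, such as re-rendering, simulation enhancement, and image-to-image translation. We introduce Phase-Preserving Diffusion (φ-PD), a model-agnostic reformulation of the diffusion process that preserves the input phase while randomizing magnitude, enabling structure-aligned generation without architectural changes or additional parameters. We also propose Frequency-Selective Structured (FSS) noise, which provides continuous control over structural rigidity through a single frequency-cutoff parameter. φ-PD adds no inference-time cost and is compatible with any diffusion model for images or videos. In photorealistic and stylized re-rendering, and in sim-to-real enhancement for driving planners, φ-PD produces controllable, spatially aligned results. When applied to the CARLA simulator, φ-PD improves CARLA-to-Waymo planner performance by 50%. The method is complementary to existing conditioning approaches and broadly applicable to image-to-image and video-to-video generation. Videos, additional examples, and code are available on our project page: https://yuzeng-at-tri.github.io/ppd-page/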

Summary

NeuralRemaster introduces Phase-Preserving Diffusion (φ-PD), which preserves input phase while randomizing magnitude, enabling structure-aligned generation without architectural changes. It proposes Frequency-Selective Structured (FSS) noise for continuous control over structural rigidity. φ-PD produces spatially aligned results in re-rendering and sim-to-real enhancement and, when applied to the CARLA simulator, improves CARLA-to-Waymo planner performance by 50%, without additional parameters or inference-time cost.

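NeuralRemaster proposes Phase-Preserving Diffusion (φ-PD), which preserves the input phase while randomizing magnitude and achieves structure-aligned generation without architectural changes. It also proposes Frequency-Selective Structured (FSS) noise for continuous control over structural rigidity. φ-PD is applied to re-rendering and sim-to-real enhancement and improves CARLA-to-Waymo planner performance by 50% on the CARLA simulator, demonstrating its effectiveness on tasks that require geometric consistency, with no additional parameters or inference-time cost.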

EvoIR: Towards All-in-One Image Restoration via Evolutionary Frequency Modulation

Authors: Jiaqi Ma, Shengkai Hu, Jun Wan, Jiaxing Huang, Lefei Zhang, Salman Khan

First: 2025-12-04T18:59:10+00:00 · Latest: 2025-12-04T18:59:10+00:00

Abstract

All-in-One Image Restoration (AiOIR) tasks often involve diverse degradations that require robust and versatile strategies. However, most existing approaches lack explicit frequency modeling and rely on fixed or heuristic optimization schedules, which limit generalization across heterogeneous degradations. To address these limitations, we propose EvoIR, an AiOIR-specific framework that introduces evolutionary frequency modulation for dynamic and adaptive image restoration. Specifically, EvoIR employs the Frequency-Modulated Module (FMM), which explicitly decomposes features into high- and low-frequency branches and adaptively modulates them to enhance both structural fidelity and fine-grained details. Central to EvoIR, an Evolutionary Optimization Strategy (EOS) iteratively adjusts frequency-aware objectives through a population-based evolutionary process, dynamically balancing structural accuracy and perceptual fidelity. Its evolutionary guidance further mitigates gradient conflicts across degradations and accelerates convergence. By synergizing FMM and EOS, EvoIR yields greater improvements than using either component alone, underscoring their complementary roles. Extensive experiments on multiple benchmarks demonstrate that EvoIR outperforms state-of-the-art AiOIR methods.
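The two ingredients named in the abstract above, an explicit high/low frequency split and a population-based search over an objective weight, can be illustrated in miniature. This is a hedged sketch only: the moving-average split, the function names, and the scalar weight with a caller-supplied `fitness` are stand-ins chosen for illustration, not the FMM or EOS designs, which the abstract does not detail.

```python
import random
from typing import Callable, List, Tuple


def split_frequencies(x: List[float], radius: int = 1
                      ) -> Tuple[List[float], List[float]]:
    """Split a 1-D signal into a low-frequency part (moving average with
    clamped edges) and the high-frequency residual, so low + high == x."""
    n = len(x)
    low = []
    for i in range(n):
        window = [x[min(max(j, 0), n - 1)]
                  for j in range(i - radius, i + radius + 1)]
        low.append(sum(window) / len(window))
    high = [xi - li for xi, li in zip(x, low)]
    return low, high


def evolve_weight(fitness: Callable[[float], float], rng: random.Random,
                  population: int = 8, generations: int = 20,
                  sigma: float = 0.1) -> float:
    """Population-based search for a weight in [0, 1] that maximises
    `fitness`: keep the best half, refill with mutated copies."""
    pop = [rng.random() for _ in range(population)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: population // 2]
        children = [min(1.0, max(0.0, p + rng.gauss(0.0, sigma)))
                    for p in parents]
        pop = parents + children
    return max(pop, key=fitness)
```

In a restoration setting the evolved weight could, for example, trade a structural loss on the low band against a perceptual loss on the high band, with `fitness` measured on validation data.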

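Title/Abstract (translated from Chinese)

Title: EvoIR: Towards All-in-One Image Restoration via Evolutionary Frequency Modulation

All-in-One Image Restoration (AiOIR) tasks usually involve many kinds of degradation and call for robust, general strategies. Most existing methods, however, lack explicit frequency modeling and rely on fixed or heuristic optimization schedules, which limits their generalization across heterogeneous degradations. To address these limitations, we propose EvoIR, an AiOIR-specific framework that introduces evolutionary frequency modulation for dynamic and adaptive image restoration. Specifically, EvoIR uses a Frequency-Modulated Module (FMM) that explicitly decomposes features into high- and low-frequency branches and adaptively modulates them to enhance structural fidelity and fine-grained detail. At the core of EvoIR is an Evolutionary Optimization Strategy (EOS), which iteratively adjusts frequency-aware objectives through a population-based evolutionary process, dynamically balancing structural accuracy and perceptual fidelity. Its evolutionary guidance further mitigates gradient conflicts between degradations and accelerates convergence. By combining FMM and EOS, EvoIR performs better than with either component alone, underscoring their complementary roles. Extensive experiments on multiple benchmarks show that EvoIR outperforms state-of-the-art AiOIR methods.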

Summary

EvoIR is designed to handle diverse image restoration tasks by introducing evolutionary frequency modulation, which dynamically adjusts frequency-aware objectives through a population-based evolutionary process. This approach enhances both structural fidelity and fine-grained details, leading to better performance compared to existing methods. Experiments show that EvoIR outperforms state-of-the-art all-in-one image restoration methods across multiple benchmarks.

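EvoIR is designed to address the limitations of existing All-in-One Image Restoration (AiOIR) methods by introducing evolutionary frequency modulation. It uses a Frequency-Modulated Module (FMM) to decompose features into high- and low-frequency branches and adaptively modulates them for better structural fidelity and fine-grained detail. An Evolutionary Optimization Strategy (EOS) dynamically adjusts frequency-aware objectives, balancing structural accuracy and perceptual fidelity. Experiments show that EvoIR outperforms state-of-the-art AiOIR methods on multiple benchmarks.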

TV2TV: A Unified Framework for Interleaved Language and Video Generation

Authors: Xiaochuang Han, Youssef Emad, Melissa Hall, John Nguyen, Karthik Padthe, Liam Robbins, Amir Bar, Delong Chen, Michal Drozdzal, Maha Elbayad, Yushi Hu, Shang-Wen Li, Sreya Dutta Roy, Jakob Verbeek, XuDong Wang, Marjan Ghazvininejad, Luke Zettlemoyer, Emily Dinan

First: 2025-12-04T18:59:09+00:00 · Latest: 2025-12-04T18:59:09+00:00

Abstract

Video generation models are rapidly advancing, but can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should happen next. In this paper, we introduce a new class of omni video-text models that integrate ideas from recent LM reasoning advances to address this challenge. More specifically, we present TV2TV, a unified generative modeling framework which decomposes video generation into an interleaved text and video generation process. TV2TV jointly learns language modeling (next-token prediction) and video flow matching (next-frame prediction) using a Mixture-of-Transformers (MoT) architecture. At inference time, TV2TV decides when to alternate between generating text and video frames, allowing the model to "think in words" about subsequent content before "acting in pixels" to produce frames. This design offloads much of the responsibility for deciding what should happen next to the language modeling tower, enabling improved visual quality and prompt alignment of generated videos. It also enables fine-grained controllability, allowing users to modify the video generation trajectory through text interventions at any point in the process. In controlled experiments on video game data, TV2TV demonstrates substantial improvements in both visual quality and controllability. TV2TV also scales to natural videos, as we show by augmenting sports videos with interleaved natural language action descriptions using vision-language models (VLMs). Training TV2TV on this corpus yields strong visual quality and prompt alignment, showcasing the model's ability to reason about and generate complex real-world action sequences. Together, these results highlight TV2TV as a promising step toward video generation with open-ended textual reasoning and control.

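Title/Abstract (translated from Chinese)

Title: TV2TV: A Unified Framework for Interleaved Language and Video Generation

Video generation models are advancing rapidly, but they can still struggle with complex video outputs that require substantial semantic branching or repeated high-level reasoning about what should happen next. In this paper, we introduce a new class of omni video-text models that incorporate ideas from recent advances in language model reasoning to address this challenge. Specifically, we present TV2TV, a unified generative modeling framework that decomposes video generation into an interleaved text and video generation process. TV2TV jointly learns language modeling (next-token prediction) and video flow matching (next-frame prediction) using a Mixture-of-Transformers (MoT) architecture. At inference time, TV2TV decides when to alternate between generating text and generating video frames, so the model can "think in words" about upcoming content before "acting in pixels" to produce frames. This design shifts much of the responsibility for deciding what should happen next to the language modeling tower, which improves the visual quality and prompt alignment of generated videos. It also enables fine-grained controllability: users can modify the video generation trajectory through text interventions at any point in the process. In controlled experiments on video game data, TV2TV achieves substantial improvements in both visual quality and controllability. TV2TV also scales to natural videos, which we show by augmenting sports videos with interleaved natural language action descriptions produced by vision-language models (VLMs). Training TV2TV on this corpus yields strong visual quality and prompt alignment, demonstrating the model's ability to reason about and generate complex real-world action sequences. Together, these results highlight TV2TV as a promising step toward video generation with open-ended textual reasoning and control.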

Summary

The paper introduces TV2TV, a unified generative modeling framework that interleaves text and video generation to address the challenge of complex video outputs. It uses a Mixture-of-Transformers architecture to jointly learn language modeling and video flow matching, allowing the model to 'think in words' before 'acting in pixels.' Experiments on video game data show improvements in visual quality and controllability, and the model scales to natural videos by augmenting sports videos with text descriptions, demonstrating strong visual quality and prompt alignment.

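TV2TV is a unified generative modeling framework that addresses the challenge of generating complex videos by integrating the language and video generation processes. It uses a Mixture-of-Transformers architecture to learn language modeling and video flow matching jointly, so the model can "think in words" before "acting in pixels". Experiments show that TV2TV improves the visual quality and controllability of video generation on video game data, and that it applies to natural videos when sports videos are augmented with text descriptions from vision-language models, yielding strong visual quality and prompt alignment.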

BioAnalyst: A Foundation Model for Biodiversity

Authors: Athanasios Trantas, Martino Mensio, Stylianos Stasinos, Sebastian Gribincea, Taimur Khan, Damian Podareanu, Aliene van der Veen

First: 2025-07-11T23:56:08+00:00 · Latest: 2025-12-04T18:58:55+00:00

Abstract

Multimodal Foundation Models (FMs) offer a path to learn general-purpose representations from heterogeneous ecological data, easily transferable to downstream tasks. However, practical biodiversity modelling remains fragmented; separate pipelines and models are built for each dataset and objective, which limits reuse across regions and taxa. In response, we present BioAnalyst, to our knowledge the first multimodal Foundation Model tailored to biodiversity analysis and conservation planning in Europe at 0.25° spatial resolution, targeting regional to national-scale applications. BioAnalyst employs a transformer-based architecture, pre-trained on extensive multimodal datasets that align species occurrence records with remote sensing indicators, climate and environmental variables. After pre-training, the model is adapted via lightweight roll-out fine-tuning to a range of downstream tasks, including joint species distribution modelling, biodiversity dynamics and population trend forecasting. The model is evaluated on two representative downstream use cases: (i) joint species distribution modelling with 500 vascular plant species and (ii) monthly climate linear probing with temperature and precipitation data. Our findings show that BioAnalyst can provide a strong baseline both for biotic and abiotic tasks, acting as a macroecological simulator with a yearly forecasting horizon and monthly resolution, offering the first application of this type of modelling in the biodiversity domain. We have open-sourced the model weights, training and fine-tuning pipelines to advance AI-driven ecological research.

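Title/Abstract (translated from Chinese)

Title: BioAnalyst: A Foundation Model for Biodiversity

Multimodal Foundation Models (FMs) offer a way to learn general-purpose representations from heterogeneous ecological data that transfer easily to downstream tasks. In practice, however, biodiversity modelling remains fragmented: separate pipelines and models are built for each dataset and objective, which limits reuse across regions and taxa. In response, we present BioAnalyst, to our knowledge the first multimodal Foundation Model for biodiversity analysis and conservation planning in Europe, aimed at regional to national-scale applications at 0.25° spatial resolution. BioAnalyst uses a transformer-based architecture pre-trained on extensive multimodal datasets that align species occurrence records with remote sensing indicators and climate and environmental variables. After pre-training, the model is adapted through lightweight roll-out fine-tuning to a range of downstream tasks, including joint species distribution modelling, biodiversity dynamics, and population trend forecasting. The model is evaluated on two representative downstream use cases: (i) joint species distribution modelling with 500 vascular plant species and (ii) monthly climate linear probing with temperature and precipitation data. Our findings show that BioAnalyst provides a strong baseline for both biotic and abiotic tasks and acts as a macroecological simulator with a yearly forecasting horizon and monthly resolution, the first application of this type of modelling in the biodiversity domain. We have open-sourced the model weights and the training and fine-tuning pipelines to advance AI-driven ecological research.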

Summary

BioAnalyst is a multimodal Foundation Model designed for biodiversity analysis and conservation planning in Europe, addressing the fragmented nature of existing models. It uses a transformer-based architecture pre-trained on extensive multimodal datasets, aligning species occurrence records with remote sensing and environmental variables. After pre-training, BioAnalyst is fine-tuned for various tasks such as joint species distribution modeling and biodiversity dynamics forecasting. The model demonstrates strong performance in both biotic and abiotic tasks, providing a macroecological simulator with yearly forecasting and monthly resolution, marking the first application of this type in the biodiversity domain.

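BioAnalyst is a multimodal foundation model for biodiversity analysis and conservation planning in Europe, intended to address the fragmentation of existing biodiversity modelling pipelines. It uses a transformer-based architecture pre-trained on extensive multimodal datasets that align species occurrence records with remote sensing indicators and climate and environmental variables. After pre-training, BioAnalyst is fine-tuned for a range of downstream tasks, including joint species distribution modelling and biodiversity dynamics. The model performs well on both biotic and abiotic tasks, offers a one-year forecasting horizon at monthly resolution, and has been open-sourced to advance AI-driven ecological research.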

Structured Document Translation via Format Reinforcement Learning

Authors: Haiyue Song, Johannes Eschbach-Dymanus, Hour Kaing, Sumire Honda, Hideki Tanaka, Bianka Buschbeck, Masao Utiyama

First: 2025-12-04T18:58:30+00:00 · Latest: 2025-12-04T18:58:30+00:00

Comments: IJCNLP-AACL 2025 Main (Oral)

Abstract

Recent works on structured text translation remain limited to the sentence level, as they struggle to effectively handle the complex document-level XML or HTML structures. To address this, we propose Format Reinforcement Learning (FormatRL), which employs Group Relative Policy Optimization on top of a supervised fine-tuning model to directly optimize novel structure-aware rewards: 1) TreeSim, which measures structural similarity between predicted and reference XML trees and 2) Node-chrF, which measures translation quality at the level of XML nodes. Additionally, we apply StrucAUC, a fine-grained metric distinguishing between minor errors and major structural failures. Experiments on the SAP software-documentation benchmark demonstrate improvements across six metrics and an analysis further shows how different reward functions contribute to improvements in both structural and translation quality.
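To make the idea of a structure-aware reward concrete, the sketch below scores how closely the tag structure of a predicted XML fragment matches a reference. It is a simple stand-in, not the paper's TreeSim: the abstract does not define that metric, so the function names and the choice of a sequence-matching ratio over open/close tag events are assumptions for illustration.

```python
import difflib
import xml.etree.ElementTree as ET
from typing import List


def tag_sequence(xml_text: str) -> List[str]:
    """Pre-order list of open/close tag events, which encodes the tree shape."""
    events: List[str] = []

    def walk(node: ET.Element) -> None:
        events.append(f"<{node.tag}>")
        for child in node:
            walk(child)
        events.append(f"</{node.tag}>")

    walk(ET.fromstring(xml_text))
    return events


def structure_similarity(predicted: str, reference: str) -> float:
    """Similarity in [0, 1] between two XML trees.

    Malformed predictions score 0, so a policy trained against this
    reward is pushed toward well-formed output first.
    """
    try:
        pred_events = tag_sequence(predicted)
    except ET.ParseError:
        return 0.0
    ref_events = tag_sequence(reference)
    matcher = difflib.SequenceMatcher(None, pred_events, ref_events, autojunk=False)
    return matcher.ratio()
```

A translation that keeps every tag in place scores 1.0 regardless of the translated text, which is why a separate node-level text metric such as the Node-chrF reward mentioned above is still needed.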

中文标题/摘要

标题：基于格式强化学习的结构化文档翻译

近期的结构化文本翻译工作主要集中在句子层面，因为它们难以有效处理复杂的文档级XML或HTML结构。为了解决这一问题，我们提出了格式强化学习（FormatRL），该方法在监督微调模型的基础上使用组相对策略优化，直接优化新颖的结构感知奖励：1）TreeSim，用于测量预测和参考XML树之间的结构相似性；2）Node-chrF，用于衡量XML节点级别的翻译质量。此外，我们还应用了StrucAUC，这是一种细粒度的度量标准，能够区分轻微错误和重大结构失败。在SAP软件文档基准测试上的实验表明，在六个度量标准上均有所改进，并且进一步的分析表明不同的奖励函数如何在结构质量和翻译质量上都带来改进。

Summary / 总结

The research aims to improve structured text translation at the document level by addressing the limitations of existing sentence-level approaches. It introduces Format Reinforcement Learning (FormatRL), which uses Group Relative Policy Optimization to optimize structure-aware rewards such as TreeSim and Node-chrF. The experiments on the SAP software-documentation benchmark show improvements across six metrics, indicating better structural and translation quality.

研究旨在通过解决处理复杂XML或HTML文档结构的挑战，将结构化文本翻译提升到超出句子级别的水平。提出的Format Reinforcement Learning (FormatRL)方法使用Group Relative Policy Optimization来优化结构感知奖励，如TreeSim和Node-chrF，并应用StrucAUC来区分轻微错误和重大结构失败。实验表明，在SAP软件文档基准测试上，六个指标均有改进，表明结构和翻译质量都有提升。
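
The group-relative step that Group Relative Policy Optimization is named after can be illustrated with a minimal sketch. It assumes each sampled translation of a document already has a structural score and a translation score in [0, 1]; the function names, the weighting in `combined_reward`, and the sample scores are hypothetical and are not taken from the paper, whose TreeSim and Node-chrF definitions are not reproduced here.

```python
from statistics import mean, pstdev


def combined_reward(structure_score, translation_score, w_structure=0.5):
    """Hypothetical weighted mix of a structural and a translation score in [0, 1]."""
    return w_structure * structure_score + (1.0 - w_structure) * translation_score


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sample's reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled translations of one document: (structure score, translation score).
# These numbers are made up for illustration.
samples = [(1.0, 0.6), (0.5, 0.7), (1.0, 0.8), (0.0, 0.9)]
rewards = [combined_reward(s, t) for s, t in samples]
advantages = group_relative_advantages(rewards)

for (s, t), a in zip(samples, advantages):
    print(f"structure={s:.1f} translation={t:.1f} advantage={a:+.2f}")
```

In this toy group, the sample with the best translation score but a broken structure ends up with a negative advantage, which is the kind of trade-off a structure-aware reward is meant to expose.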

SA-IQA: Redefining Image Quality Assessment for Spatial Aesthetics with Multi-Dimensional Rewards

Authors: Yuan Gao, Jin Song

First: 2025-12-04T18:58:18+00:00 · Latest: 2025-12-04T18:58:18+00:00

Abs · PDF · Code1 · Code2

Abstract

In recent years, Image Quality Assessment (IQA) for AI-generated images (AIGI) has advanced rapidly; however, existing methods primarily target portraits and artistic images, lacking a systematic evaluation of interior scenes. We introduce Spatial Aesthetics, a paradigm that assesses the aesthetic quality of interior images along four dimensions: layout, harmony, lighting, and distortion. We construct SA-BENCH, the first benchmark for spatial aesthetics, comprising 18,000 images and 50,000 precise annotations. Employing SA-BENCH, we systematically evaluate current IQA methodologies and develop SA-IQA, through MLLM fine-tuning and a multidimensional fusion approach, as a comprehensive reward framework for assessing spatial aesthetics. We apply SA-IQA to two downstream tasks: (1) serving as a reward signal integrated with GRPO reinforcement learning to optimize the AIGC generation pipeline, and (2) Best-of-N selection to filter high-quality images and improve generation quality. Experiments indicate that SA-IQA significantly outperforms existing methods on SA-BENCH, setting a new standard for spatial aesthetics evaluation. Code and dataset will be open-sourced to advance research and applications in this domain.

中文标题/摘要

标题：SA-IQA：以多维度奖励重新定义空间美学的图像质量评估

近年来，人工智能生成图像（AIGI）的图像质量评估（IQA）取得了快速进展；然而，现有方法主要针对肖像和艺术图像，缺乏对室内场景的系统评估。我们引入了空间美学这一范式，从布局、和谐、照明和失真四个维度评估室内图像的美学质量。我们构建了SA-BENCH，这是首个空间美学基准，包含18,000张图像和50,000个精确注释。利用SA-BENCH，我们系统地评估了当前的IQA方法，并通过MLLM微调和多维度融合方法开发了SA-IQA，作为全面的空间美学评估奖励框架。我们应用SA-IQA到两个下游任务：（1）作为与GRPO强化学习集成的奖励信号，优化AIGC生成流水线；（2）Best-of-N选择，筛选高质量图像，提高生成质量。实验表明，SA-IQA在SA-BENCH上的表现显著优于现有方法，为空间美学评估设立了新标准。代码和数据集将开源，以促进该领域的研究和应用。

Summary / 总结

The research aims to evaluate the aesthetic quality of interior images by introducing a new paradigm called Spatial Aesthetics, which assesses images based on layout, harmony, lighting, and distortion. The study constructs SA-BENCH, a benchmark dataset with 18,000 images and 50,000 annotations, and develops SA-IQA using MLLM fine-tuning and multidimensional fusion to evaluate spatial aesthetics. The method outperforms existing IQA methods on SA-BENCH and is applied to optimize AIGC generation and filter high-quality images, demonstrating its effectiveness in improving image generation quality.

研究引入了空间美学这一新的评估范式，从布局、和谐、照明和失真四个维度评估室内图像的美学质量。研究构建了包含18,000张图像和50,000个精确注释的SA-BENCH基准数据集，用它系统地评估了现有IQA方法，并通过MLLM微调和多维度融合开发了奖励框架SA-IQA。实验结果表明，SA-IQA在SA-BENCH上显著优于现有方法；该方法还被用于优化AIGC生成和筛选高质量图像。

From Generated Human Videos to Physically Plausible Robot Trajectories

Authors: James Ni, Zekai Wang, Wei Lin, Amir Bar, Yann LeCun, Trevor Darrell, Jitendra Malik, Roei Herzig

First: 2025-12-04T18:56:03+00:00 · Latest: 2025-12-04T18:56:03+00:00

Comments: For project website, see https://genmimic.github.io

Abs · PDF · Code1 · Code2 · Project1

Abstract

Video generation models are rapidly improving in their ability to synthesize human actions in novel contexts, holding the potential to serve as high-level planners for contextual robot control. To realize this potential, a key research question remains open: how can a humanoid execute the human actions from generated videos in a zero-shot manner? This challenge arises because generated videos are often noisy and exhibit morphological distortions that make direct imitation difficult compared to real video. To address this, we introduce a two-stage pipeline. First, we lift video pixels into a 4D human representation and then retarget to the humanoid morphology. Second, we propose GenMimic, a physics-aware reinforcement learning policy conditioned on 3D keypoints and trained with symmetry regularization and keypoint-weighted tracking rewards. As a result, GenMimic can mimic human actions from noisy, generated videos. We curate GenMimicBench, a synthetic human-motion dataset generated using two video generation models across a spectrum of actions and contexts, establishing a benchmark for assessing zero-shot generalization and policy robustness. Extensive experiments demonstrate improvements over strong baselines in simulation and confirm coherent, physically stable motion tracking on a Unitree G1 humanoid robot without fine-tuning. This work offers a promising path to realizing the potential of video generation models as high-level policies for robot control.

中文标题/摘要

标题：从生成的人类视频到物理上可信的机器人轨迹

视频生成模型在合成新颖情境下的人类动作方面的能力正在迅速提高，有可能作为上下文机器人控制的高层规划者。为了实现这一潜力，一个关键的研究问题仍然悬而未决：如何以零样本的方式让类人机器人执行生成视频中的人类动作？这一挑战源于生成的视频通常噪声较大且存在形态失真，使得直接模仿比真实视频更困难。为了解决这一问题，我们引入了一个两阶段的流水线。首先，我们将视频像素提升到4D的人类表示，然后重新定位到类人形态。其次，我们提出了GenMimic——一种基于3D关键点的物理感知强化学习策略，并通过对称正则化和关键点加权跟踪奖励进行训练。因此，GenMimic可以从噪声较大的生成视频中模仿人类动作。我们构建了GenMimicBench，这是一个使用两种视频生成模型生成的合成人类动作数据集，涵盖了各种动作和情境，为评估零样本泛化能力和策略鲁棒性建立了基准。广泛的实验表明，与强大的基线相比，在模拟中有所改进，并且在无需微调的情况下，Unitree G1类人机器人上实现了连贯且物理稳定的运动跟踪。这项工作为实现视频生成模型作为机器人控制高层策略的潜力提供了一条有希望的道路。

Summary / 总结

The research aims to enable humanoid robots to execute human actions from generated videos in a zero-shot manner, addressing the noise and morphological distortions in such videos. The method is a two-stage pipeline: video pixels are lifted into a 4D human representation and retargeted to the humanoid morphology, and a physics-aware reinforcement learning policy is then trained with symmetry regularization and keypoint-weighted tracking rewards. Experiments show that GenMimic improves over strong baselines in simulation and, without fine-tuning, achieves coherent, physically stable motion tracking on a Unitree G1 humanoid robot.

研究旨在使类人机器人能够以零样本方式执行生成视频中的人类动作，解决生成视频中的噪声和形态失真问题。方法采用两阶段流水线：先将视频像素提升为4D人体表示并重定向到类人形态，再训练一个带有对称正则化和关键点加权跟踪奖励的物理感知强化学习策略。实验表明，GenMimic在仿真中优于强基线，并且无需微调即可在Unitree G1类人机器人上实现连贯且物理稳定的运动跟踪。

Foundations of Diffusion Models in General State Spaces: A Self-Contained Introduction

Authors: Vincent Pauline, Tobias Höppe, Kirill Neklyudov, Alexander Tong, Stefan Bauer, Andrea Dittadi

First: 2025-12-04T18:55:36+00:00 · Latest: 2025-12-04T18:55:36+00:00

Abs · PDF · Code1 · Code2

Abstract

Although diffusion models now occupy a central place in generative modeling, introductory treatments commonly assume Euclidean data and seldom clarify their connection to discrete-state analogues. This article is a self-contained primer on diffusion over general state spaces, unifying continuous domains and discrete/categorical structures under one lens. We develop the discrete-time view (forward noising via Markov kernels and learned reverse dynamics) alongside its continuous-time limits -- stochastic differential equations (SDEs) in $\mathbb{R}^d$ and continuous-time Markov chains (CTMCs) on finite alphabets -- and derive the associated Fokker--Planck and master equations. A common variational treatment yields the ELBO that underpins standard training losses. We make explicit how forward corruption choices -- Gaussian processes in continuous spaces and structured categorical transition kernels (uniform, masking/absorbing and more) in discrete spaces -- shape reverse dynamics and the ELBO. The presentation is layered for three audiences: newcomers seeking a self-contained intuitive introduction; diffusion practitioners wanting a global theoretical synthesis; and continuous-diffusion experts looking for an analogy-first path into discrete diffusion. The result is a unified roadmap to modern diffusion methodology across continuous domains and discrete sequences, highlighting a compact set of reusable proofs, identities, and core theoretical principles.

中文标题/摘要

标题：一般状态空间扩散模型的基础：一个自包含的介绍

尽管扩散模型现在在生成建模中占据中心地位，但入门性介绍通常假设欧几里得数据，并且很少阐明其与离散状态对应物的联系。本文是对一般状态空间扩散的自包含入门，在同一视角下统一了连续域和离散/分类结构。我们阐述了离散时间视角（通过马尔可夫核的前向加噪和学习的反向动力学）及其连续时间极限，即$\mathbb{R}^d$中的随机微分方程（SDEs）和有限字母表上的连续时间马尔可夫链（CTMCs），并推导了相关的福克-普朗克方程和主方程。一种共同的变分处理给出了支撑标准训练损失的ELBO。我们明确说明了前向污染选择（连续空间中的高斯过程，以及离散空间中的结构化分类转移核，如均匀、遮蔽/吸收等）如何塑造反向动力学和ELBO。本文分层面向三类受众：寻求自包含直观介绍的新手；希望获得全局理论综合的扩散实践者；以及希望从类比入手了解离散扩散的连续扩散专家。结果是跨连续域和离散序列的现代扩散方法论的统一路线图，突显了一组紧凑的可重用证明、恒等式和核心理论原则。

Summary / 总结

This paper provides a comprehensive introduction to diffusion models in general state spaces, addressing the lack of clarity in the connection between continuous and discrete-state models. It develops the discrete-time and continuous-time views of diffusion, including forward noising via Markov kernels and learned reverse dynamics, and derives the associated Fokker--Planck and master equations. Key findings include the explicit relationship between forward corruption choices and reverse dynamics, and the derivation of the ELBO that underpins standard training losses for both continuous and discrete diffusion models.

本文对一般状态空间中的扩散模型给出了统一而全面的介绍，将连续结构和离散结构纳入同一框架。它阐述了离散时间和连续时间两种视角，包括马尔可夫核、学习的反向动力学和随机微分方程。主要内容包括福克-普朗克方程和主方程的推导，以及前向污染选择与反向动力学之间的显式联系。该工作旨在为初学者、从业者和连续扩散专家提供一个基础资源，提供了一个统一的理论框架来理解扩散模型。
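
The forward corruption choices named in the abstract (Gaussian noising in continuous spaces, uniform and masking/absorbing kernels in discrete spaces) can be written down in a few lines. The sketch below is a standard-library illustration under assumed conventions: a single noise level `beta` per step, the variance-preserving form of the Gaussian step, and the last state index used as the mask state. These conventions and all function names are assumptions for illustration, not the paper's notation.

```python
import math
import random


def gaussian_forward_step(x, beta, rng):
    """One Gaussian noising step: x_t = sqrt(1 - beta) * x_{t-1} + sqrt(beta) * eps."""
    return [math.sqrt(1.0 - beta) * xi + math.sqrt(beta) * rng.gauss(0.0, 1.0) for xi in x]


def uniform_kernel(k, beta):
    """Row-stochastic matrix: keep the state w.p. 1 - beta, else resample uniformly."""
    return [[(1.0 - beta) * (i == j) + beta / k for j in range(k)] for i in range(k)]


def absorbing_kernel(k, beta, mask=None):
    """Row-stochastic matrix: keep the state w.p. 1 - beta, else jump to the mask state."""
    mask = k - 1 if mask is None else mask
    rows = []
    for i in range(k):
        row = [0.0] * k
        if i == mask:
            row[mask] = 1.0  # the mask state is never left
        else:
            row[i] = 1.0 - beta
            row[mask] = beta
        rows.append(row)
    return rows


def categorical_forward_step(state, kernel, rng):
    """Sample the next discrete state from the kernel row of the current state."""
    return rng.choices(range(len(kernel)), weights=kernel[state])[0]


rng = random.Random(0)
print(gaussian_forward_step([1.0, -2.0], beta=0.1, rng=rng))
print(categorical_forward_step(0, uniform_kernel(4, beta=0.3), rng))
print(categorical_forward_step(0, absorbing_kernel(4, beta=0.3), rng))
```

Both discrete kernels have rows that sum to one; they differ only in where the probability mass `beta` goes, and that choice is what shapes the reverse dynamics a model has to learn.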

Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark

Authors: Haobo Yuan, Yueyi Sun, Yanwei Li, Tao Zhang, Xueqing Deng, Henghui Ding, Lu Qi, Anran Wang, Xiangtai Li, Ming-Hsuan Yang

First: 2025-12-04T18:55:34+00:00 · Latest: 2025-12-04T18:55:34+00:00

Comments: Technical Report; Project Page: https://harboryuan.github.io/visual-reasoning-tracer

Abs · PDF · Code1 · Code2 · Project1

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved performance on tasks such as visual grounding and visual question answering. However, the reasoning processes of these models remain largely opaque; they typically output only final predictions without revealing the intermediate steps or fine-grained evidence (e.g., pixels, locations) that lead to the result. This contrasts with human intelligence, which naturally operates through a chain of visual reasoning. To address this limitation, we introduce the Visual Reasoning Tracer (VRT) task, which requires models to not only localize the target object but also explicitly predict the intermediate objects that form the reasoning path. To advance research in this area, we contribute: (1) VRT-Bench, a human-annotated benchmark for evaluating visual reasoning; (2) a new metric for assessing the quality of reasoning traces; and (3) VRT-80k, a large-scale dataset for reasoning model training. Our experiments reveal that while existing models often produce the correct final output, they struggle to ground their intermediate reasoning. In contrast, models trained on VRT-80k achieve substantial improvements in tracing the reasoning path.

中文标题/摘要

标题：视觉推理追踪器：基于对象的视觉推理基准

近年来，多模态大型语言模型（MLLMs）在视觉定位和视觉问答等任务上的表现显著提升。然而，这些模型的推理过程仍然相当不透明；它们通常只输出最终预测，而不揭示导致结果的中间步骤或细粒度证据（例如，像素、位置）。这与人类智能自然通过视觉推理链进行运作的方式形成对比。为解决这一局限，我们引入了视觉推理追踪器（VRT）任务，要求模型不仅要定位目标对象，还要明确预测形成推理路径的中间对象。为了推进该领域的研究，我们贡献了：(1) VRT-Bench，一个经过人工标注的视觉推理基准；(2) 一种评估推理轨迹质量的新指标；以及(3) VRT-80k，一个大规模数据集用于推理模型训练。我们的实验表明，尽管现有模型通常能产生正确的最终输出，但在定位中间推理方面却存在困难。相比之下，使用VRT-80k训练的模型在追踪推理路径方面取得了显著进步。

Summary / 总结

The research aims to enhance the transparency of Multimodal Large Language Models (MLLMs) by introducing the Visual Reasoning Tracer (VRT) task, which requires models to explicitly predict intermediate reasoning steps. The study contributes VRT-Bench, a benchmark for evaluating visual reasoning, a new metric for reasoning trace quality, and VRT-80k, a large dataset for training reasoning models. Experiments show that existing models often produce correct final outputs but struggle to ground their intermediate reasoning, whereas models trained on VRT-80k show significant improvements in tracing the reasoning path.

研究旨在通过引入视觉推理追踪（VRT）任务来增强多模态大型语言模型（MLLMs）的透明度，该任务要求模型显式预测中间推理步骤。研究贡献了VRT-Bench，一个用于评估视觉推理的基准，一个新的推理轨迹质量评估指标，以及VRT-80k，一个用于训练推理模型的大规模数据集。实验表明，现有模型往往能给出正确的最终输出，却难以对中间推理进行定位，而使用VRT-80k训练的模型在追踪推理路径方面取得了显著改进。

Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression

Authors: Jung Yi, Wooseok Jang, Paul Hyunbin Cho, Jisu Nam, Heeji Yoon, Seungryong Kim

First: 2025-12-04T18:46:44+00:00 · Latest: 2025-12-04T18:46:44+00:00

Comments: Project Page: https://cvlab-kaist.github.io/DeepForcing/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Recent advances in autoregressive video diffusion have enabled real-time frame streaming, yet existing solutions still suffer from temporal repetition, drift, and motion deceleration. We find that naively applying StreamingLLM-style attention sinks to video diffusion leads to fidelity degradation and motion stagnation. To overcome this, we introduce Deep Forcing, which consists of two training-free mechanisms that address this without any fine-tuning. Specifically, 1) Deep Sink dedicates half of the sliding window to persistent sink tokens and re-aligns their temporal RoPE phase to the current timeline, stabilizing global context during long rollouts. 2) Participative Compression performs importance-aware KV cache pruning that preserves only tokens actively participating in recent attention while safely discarding redundant and degraded history, minimizing error accumulation under out-of-distribution length generation. Together, these components enable over 12x extrapolation (e.g. 5s-trained to 60s+ generation) with better imaging quality than LongLive, better aesthetic quality than RollingForcing, almost maintaining overall consistency, and substantial gains in dynamic degree, all while maintaining real-time generation. Our results demonstrate that training-free KV-cache management can match or exceed training-based approaches for autoregressively streaming long-video generation.

中文标题/摘要

标题：Deep Forcing：基于Deep Sink和Participative Compression的免训练长视频生成

近期自回归视频扩散技术的进步使得实时帧流式生成成为可能，但现有解决方案仍然存在时间重复、漂移和运动减速的问题。我们发现，将StreamingLLM式的注意力汇点（attention sink）直接应用于视频扩散会导致保真度下降和运动停滞。为解决这一问题，我们提出了Deep Forcing，它由两种无需训练、无需任何微调的机制组成。具体来说，1) Deep Sink将滑动窗口的一半用于持久的汇点令牌，并将其时间RoPE相位重新对齐到当前时间线，从而在长时间生成过程中稳定全局上下文。2) Participative Compression执行重要性感知的KV缓存剪枝，仅保留在近期注意力中活跃参与的令牌，同时安全地丢弃冗余和退化的历史记录，从而在超出训练分布长度的生成中最小化误差累积。这些组件结合在一起，实现了超过12倍的时长外推（例如，从5秒训练长度外推到60秒以上的生成），成像质量优于LongLive，美学质量优于RollingForcing，整体一致性几乎保持不变，动态程度显著提升，同时保持实时生成。我们的结果表明，在自回归流式长视频生成中，无需训练的KV缓存管理可以媲美甚至超越基于训练的方法。

Summary / 总结

Deep Forcing is a training-free method for long video generation that addresses temporal repetition and motion deceleration issues in existing solutions. It introduces two mechanisms: Deep Sink, which stabilizes global context by re-aligning persistent sink tokens, and Participative Compression, which prunes the KV cache to preserve only active tokens. These components enable over 12x extrapolation with better imaging and aesthetic quality compared to previous methods, maintaining consistency and dynamic degree while supporting real-time generation.

Deep Forcing通过引入两种无需训练的机制（Deep Sink和Participative Compression）来解决自回归视频扩散中的时间重复、漂移和运动减速问题。Deep Sink在长时间生成过程中稳定全局上下文，而Participative Compression仅保留参与近期注意力的令牌，并安全地丢弃冗余和退化的历史，以减少误差累积。这些机制实现了超过12倍的时长外推，成像质量和美学质量优于先前方法，整体一致性几乎保持不变，动态程度显著提升，并支持实时生成。
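
The general idea of keeping persistent sink tokens while pruning the rest of a KV cache by recent attention can be sketched as follows. This is a simplified illustration, not the paper's implementation: the importance score (attention mass summed over recent queries), the function name `prune_kv_cache`, and the toy attention weights are assumptions, and the temporal RoPE re-alignment that Deep Sink performs is not modeled.

```python
def prune_kv_cache(attn, n_sink, budget):
    """Return the sorted indices of cached tokens to keep.

    attn:   one row per recent query; attn[q][t] is the attention weight
            that query q places on cached token t.
    n_sink: number of leading tokens that are always kept as sinks.
    budget: total number of tokens to keep, sinks included.
    """
    n_tokens = len(attn[0])
    # Importance of a cached token = attention mass it received from recent queries.
    importance = [sum(row[t] for row in attn) for t in range(n_tokens)]
    sinks = list(range(min(n_sink, n_tokens)))
    candidates = sorted(
        range(len(sinks), n_tokens), key=lambda t: importance[t], reverse=True
    )
    kept = sinks + candidates[: max(budget - len(sinks), 0)]
    return sorted(kept)  # preserve temporal order in the pruned cache


# Two recent queries attending over six cached tokens (made-up weights).
attn = [
    [0.10, 0.10, 0.05, 0.40, 0.05, 0.30],
    [0.20, 0.10, 0.10, 0.10, 0.10, 0.40],
]
print(prune_kv_cache(attn, n_sink=2, budget=4))
```

With two sinks and a budget of four, the sketch keeps tokens 0 and 1 unconditionally and fills the remaining slots with the two tokens that received the most recent attention.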

OMTRA: A Multi-Task Generative Model for Structure-Based Drug Design

Authors: Ian Dunn, Liv Toft, Tyler Katz, Juhi Gupta, Riya Shah, Ramith Hettiarachchi, David R. Koes

First: 2025-12-04T18:46:35+00:00 · Latest: 2025-12-04T18:46:35+00:00

Comments: Presented at the Machine Learning for Structural Biology Workshop, 2025

Abs · PDF · Code1 · Code2 · Code3

Abstract

Structure-based drug design (SBDD) focuses on designing small-molecule ligands that bind to specific protein pockets. Computational methods are integral in modern SBDD workflows and often make use of virtual screening methods via docking or pharmacophore search. Modern generative modeling approaches have focused on improving novel ligand discovery by enabling de novo design. In this work, we recognize that these tasks share a common structure and can therefore be represented as different instantiations of a consistent generative modeling framework. We propose a unified approach in OMTRA, a multi-modal flow matching model that flexibly performs many tasks relevant to SBDD, including some with no analogue in conventional workflows. Additionally, we curate a dataset of 500M 3D molecular conformers, complementing protein-ligand data and expanding the chemical diversity available for training. OMTRA obtains state of the art performance on pocket-conditioned de novo design and docking; however, the effects of large-scale pretraining and multi-task training are modest. All code, trained models, and dataset for reproducing this work are available at https://github.com/gnina/OMTRA

中文标题/摘要

标题：OMTRA：基于结构的药物设计的多任务生成模型

基于结构的药物设计（SBDD）专注于设计能够与特定蛋白质口袋结合的小分子配体。计算方法在现代SBDD工作流程中不可或缺，并且经常利用通过对接或药效团搜索的虚拟筛选方法。现代生成模型方法侧重于通过支持从头设计来提高新颖配体的发现。在本研究中，我们认识到这些任务具有共同的结构，因此可以被表示为一致的生成建模框架的不同实例。我们提出了一种统一的方法，即OMTRA，这是一种多模态流匹配模型，可以灵活地执行与SBDD相关的多种任务，包括一些在传统工作流程中没有对应的任务。此外，我们整理了一个包含5亿个3D分子构象的数据集，补充了蛋白质-配体数据，扩展了可用于训练的化学多样性。OMTRA在口袋条件下的从头设计和对接方面取得了最先进的性能；然而，大规模预训练和多任务训练的效果有限。所有代码、训练模型和数据集均可在https://github.com/gnina/OMTRA获取

Summary / 总结

OMTRA is a multi-task generative model designed for structure-based drug design, integrating various tasks such as de novo ligand design and docking. It leverages a unified framework to handle these tasks efficiently. The model was trained on a large dataset of 500M 3D molecular conformers, enhancing chemical diversity. While OMTRA achieves state-of-the-art performance on pocket-conditioned de novo design and docking, the benefits of large-scale pretraining and multi-task training are limited.

OMTRA 是一种用于基于结构的药物设计的多任务生成模型，整合了诸如从头设计配体和对接等任务。它利用统一框架高效处理这些任务。该模型在包含5亿个3D分子构象的大数据集上进行训练，增强了化学多样性。尽管 OMTRA 在口袋条件下的从头设计和对接方面达到了最先进的性能，但大规模预训练和多任务训练的效果有限。

Object Reconstruction under Occlusion with Generative Priors and Contact-induced Constraints

Authors: Minghan Zhu, Zhiyi Wang, Qihang Sun, Maani Ghaffari, Michael Posa

First: 2025-12-04T18:45:14+00:00 · Latest: 2025-12-04T18:45:14+00:00

Comments: Project page: https://contactgen3d.github.io/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Object geometry is key information for robot manipulation. Yet, object reconstruction is a challenging task because cameras only capture partial observations of objects, especially when occlusion occurs. In this paper, we leverage two extra sources of information to reduce the ambiguity of vision signals. First, generative models learn priors of the shapes of commonly seen objects, allowing us to make reasonable guesses of the unseen part of geometry. Second, contact information, which can be obtained from videos and physical interactions, provides sparse constraints on the boundary of the geometry. We combine the two sources of information through contact-guided 3D generation. The guidance formulation is inspired by drag-based editing in generative models. Experiments on synthetic and real-world data show that our approach improves the reconstruction compared to pure 3D generation and contact-based optimization.

中文标题/摘要

标题：遮挡下基于生成先验和接触诱导约束的物体重建

物体几何形状是机器人操作的关键信息。然而，物体重建是一个具有挑战性的任务，因为相机只能捕捉到物体的部分观察结果，尤其是在发生遮挡时。在本文中，我们利用两种额外的信息源来减少视觉信号的不确定性。首先，生成模型学习常见物体形状的先验，使我们能够合理猜测未见部分的几何形状。其次，可以从视频和物理交互中获得的接触信息为几何边界提供了稀疏约束。我们通过接触引导的3D生成将两种信息源结合起来。指导公式受到生成模型中基于拖拽编辑的启发。在合成和真实数据上的实验表明，我们的方法在重建方面优于纯3D生成和基于接触的优化。

Summary / 总结

The research aims to improve object reconstruction in scenarios with partial observations and occlusions. It uses generative priors to guess the unseen parts of objects and contact-induced constraints to provide sparse boundary information. The method combines these two sources through contact-guided 3D generation, showing better reconstruction results than pure 3D generation or contact-based optimization in both synthetic and real-world data experiments.

研究旨在通过利用生成先验和接触诱导约束来提高遮挡情况下的物体重建。方法结合了生成模型提供的形状先验和接触信息来引导3D生成，从而提高重建精度。实验表明，该方法优于纯3D生成和基于接触的优化方法。

BulletTime: Decoupled Control of Time and Camera Pose for Video Generation

Authors: Yiming Wang, Qihang Zhang, Shengqu Cai, Tong Wu, Jan Ackermann, Zhengfei Kuang, Yang Zheng, Frano Rajič, Siyu Tang, Gordon Wetzstein

First: 2025-12-04T18:40:52+00:00 · Latest: 2025-12-04T18:40:52+00:00

Comments: Project Page: https://19reborn.github.io/Bullet4D/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Emerging video diffusion models achieve high visual fidelity but fundamentally couple scene dynamics with camera motion, limiting their ability to provide precise spatial and temporal control. We introduce a 4D-controllable video diffusion framework that explicitly decouples scene dynamics from camera pose, enabling fine-grained manipulation of both scene dynamics and camera viewpoint. Our framework takes continuous world-time sequences and camera trajectories as conditioning inputs, injecting them into the video diffusion model through a 4D positional encoding in the attention layer and adaptive normalizations for feature modulation. To train this model, we curate a unique dataset in which temporal and camera variations are independently parameterized; this dataset will be made public. Experiments show that our model achieves robust real-world 4D control across diverse timing patterns and camera trajectories, while preserving high generation quality and outperforming prior work in controllability. See our website for video results: https://19reborn.github.io/Bullet4D/

中文标题/摘要

标题：BulletTime：解耦时间与相机姿态控制的视频生成框架

新兴的视频扩散模型实现了高视觉保真度，但本质上将场景动态与相机运动耦合在一起，限制了其提供精确的空间和时间控制的能力。我们提出了一种4D可控的视频扩散框架，明确地将场景动态与相机姿态解耦，从而能够对场景动态和相机视角进行精细操控。该框架以连续的世界时间序列和相机轨迹作为条件输入，通过注意力层中的4D位置编码和自适应归一化将它们注入视频扩散模型。为了训练该模型，我们创建了一个独特的数据集，其中时间和相机变化独立参数化；该数据集将公开发布。实验表明，我们的模型能够在多种时间模式和相机轨迹下实现稳健的4D控制，同时保持高质量的生成效果，并在可控性方面优于先前的工作。请访问我们的网站查看视频结果：https://19reborn.github.io/Bullet4D/

Summary / 总结

The research aims to address the limitations of existing video diffusion models that couple scene dynamics with camera motion, thereby restricting precise control over spatial and temporal aspects. The authors propose a 4D-controllable video diffusion framework that decouples scene dynamics from camera pose, using continuous world-time sequences and camera trajectories as conditioning inputs. Experiments demonstrate that the model achieves robust 4D control across various timing patterns and camera trajectories while maintaining high generation quality and outperforming previous methods in controllability. The authors state that the training dataset will be made public.

研究旨在通过解耦场景动态和相机运动，在视频生成中实现精确的空间和时间控制。方法采用一个4D可控制的视频扩散框架，使用连续的世界时间和相机轨迹作为条件输入，通过4D位置编码和自适应归一化注入。实验表明，该模型在各种时间模式和相机轨迹下实现了稳健的4D控制，同时保持了高质量的生成效果，并在可控性方面优于先前的方法。作者表示训练数据集将公开发布；视频结果见https://19reborn.github.io/Bullet4D/。

David vs. Goliath: Can Small Models Win Big with Agentic AI in Hardware Design?

Authors: Shashwat Shankar, Subhranshu Pandey, Innocent Dengkhw Mochahari, Bhabesh Mali, Animesh Basak Chowdhury, Sukanta Bhattacharjee, Chandan Karfa

First: 2025-12-04T18:37:29+00:00 · Latest: 2025-12-04T18:37:29+00:00

Abs · PDF · Code1 · Code2

Abstract

Large Language Model (LLM) inference demands massive compute and energy, making domain-specific tasks expensive and unsustainable. As foundation models keep scaling, we ask: Is bigger always better for hardware design? Our work tests this by evaluating Small Language Models coupled with a curated agentic AI framework on NVIDIA's Comprehensive Verilog Design Problems (CVDP) benchmark. Results show that agentic workflows, through task decomposition, iterative feedback, and correction, not only unlock near-LLM performance at a fraction of the cost but also create learning opportunities for agents, paving the way for efficient, adaptive solutions in complex design tasks.

中文标题/摘要

标题：大卫与歌利亚：小模型在硬件设计中借助代理AI能否赢得胜利？

大型语言模型（LLM）推理需要巨大的计算能力和能源，使得特定领域任务昂贵且不可持续。随着基础模型不断扩展，我们不禁质疑：对于硬件设计而言，更大是否总是更好？我们的研究通过在NVIDIA的综合Verilog设计问题（CVDP）基准测试上评估小语言模型与定制代理AI框架的结合来检验这一问题。结果表明，通过任务分解、迭代反馈和纠正，代理工作流程不仅能够在极低的成本下实现接近LLM的性能，还能为代理提供学习机会，从而为复杂设计任务开辟高效、自适应的解决方案之路。

Summary / 总结

This study investigates whether smaller models can match the performance of larger models in hardware design tasks by leveraging an agentic AI framework. Using NVIDIA's CVDP benchmark, the research demonstrates that small language models, when combined with task decomposition, iterative feedback, and correction, can achieve near-LLM performance at a significantly lower cost and provide learning opportunities for agents, suggesting a more efficient approach for complex design tasks.

研究探讨了借助代理AI框架，小模型能否在硬件设计任务中媲美大模型。使用NVIDIA的CVDP基准测试，研究显示，当小语言模型与任务分解、迭代反馈和纠正相结合时，可以在显著降低成本的同时达到接近大模型的性能，并为代理提供学习机会，从而为复杂设计任务提供高效和自适应的解决方案。

MORPH: PDE Foundation Models with Arbitrary Data Modality

Authors: Mahindra Singh Rautela, Alexander Most, Siddharth Mansingh, Bradley C. Love, Ayan Biswas, Diane Oyen, Earl Lawrence

First: 2025-09-25T22:38:36+00:00 · Latest: 2025-12-04T18:36:29+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

We introduce MORPH, a modality-agnostic, autoregressive foundation model for partial differential equations (PDEs). MORPH is built on a convolutional vision transformer backbone that seamlessly handles heterogeneous spatiotemporal datasets of varying data modality (1D--3D) at different resolutions, and multiple fields with mixed scalar and vector components. The architecture combines (i) component-wise convolution, which jointly processes scalar and vector channels to capture local interactions, (ii) inter-field cross-attention, which models and selectively propagates information between different physical fields, (iii) axial attentions, which factorize full spatiotemporal self-attention along individual spatial and temporal axes to reduce computational burden while retaining expressivity. We pretrain multiple model variants on a diverse collection of heterogeneous PDE datasets and evaluate transfer to a range of downstream prediction tasks. Using both full-model fine-tuning and parameter-efficient low-rank adapters (LoRA), MORPH outperforms models trained from scratch. Across extensive evaluations, MORPH matches or surpasses strong baselines and recent state-of-the-art models. Collectively, these capabilities present a flexible and powerful backbone for learning from the heterogeneous and multimodal nature of scientific observations, charting a path toward scalable and data-efficient scientific machine learning. The source code, datasets, and models are publicly available at https://github.com/lanl/MORPH.

中文标题/摘要

标题：MORPH：任意数据模态的偏微分方程基础模型

我们介绍了MORPH，一种模态无关的自回归偏微分方程（PDE）基础模型。MORPH基于卷积视觉变换器骨干网络，能够无缝处理不同数据模态（1D-3D）和不同分辨率的异质时空数据集，以及具有混合标量和向量分量的多个字段。该架构结合了(i) 组件卷积，联合处理标量和向量通道以捕捉局部交互，(ii) 交叉场注意力，建模并选择性地传播不同物理场之间的信息，(iii) 轴向注意力，沿个体空间和时间轴分解全时空自注意力，以减少计算负担同时保留表达能力。我们使用多样化的异质PDE数据集对多个模型变体进行预训练，并评估其在一系列下游预测任务中的迁移性能。使用全模型微调和参数高效低秩适配器（LoRA），MORPH优于从头训练的模型。在广泛的评估中，MORPH匹配或超越了强大的基线和最近的先进模型。这些能力共同展示了学习科学观测的异构和多模态性质的灵活而强大的基础架构，为可扩展和数据高效的科学机器学习铺平了道路。源代码、数据集和模型可在https://github.com/lanl/MORPH/ 公开获取。

Summary / 总结

MORPH is a modality-agnostic autoregressive foundation model for partial differential equations, designed to handle heterogeneous spatiotemporal datasets. It uses a convolutional vision transformer backbone with component-wise convolution, inter-field cross-attention, and axial attentions to capture local interactions and model information between different physical fields. MORPH outperforms models trained from scratch and matches or surpasses strong baselines and recent state-of-the-art models across various downstream prediction tasks, demonstrating its effectiveness for scientific machine learning with heterogeneous data. The source code and datasets are publicly available.

MORPH 是一个处理异构时空数据的模态无关自回归基础模型，用于偏微分方程。它使用卷积视觉变换器骨干网络，并结合组件卷积、跨域交叉注意力和轴向注意力来捕捉局部交互并建模不同物理域之间的信息。MORPH 在各种下游预测任务中表现出色，超越了从头训练的模型，并且与强大的基线和最新的先进模型相当或超越，展示了其在处理异构数据的科学机器学习中的灵活性和强大性。源代码和数据集已公开。

Control Consistency Losses for Diffusion Bridges

Authors: Samuel Howard, Nikolas Nüsken, Jakiw Pidstrigach

Venue: NeurIPS 2025 Workshop (Oral)

First: 2025-12-04T18:31:39+00:00 · Latest: 2025-12-04T18:31:39+00:00

Comments: Frontiers in Probabilistic Inference: Sampling Meets Learning Workshop at NeurIPS 2025 (Oral)

Abs · PDF · Code1 · Code2

Abstract

Simulating the conditioned dynamics of diffusion processes, given their initial and terminal states, is an important but challenging problem in the sciences. The difficulty is particularly pronounced for rare events, for which the unconditioned dynamics rarely reach the terminal state. In this work, we leverage a self-consistency property of the conditioned dynamics to learn the diffusion bridge in an iterative online manner, and demonstrate promising empirical results in a range of settings.

中文标题/摘要

标题：扩散桥的控制一致性损失

给定初始和终端状态模拟扩散过程的条件动态是一个在科学中既重要又具有挑战性的问题。对于罕见事件而言，这一困难尤为突出，因为未条件化的动态过程很少能达到终端状态。在本文中，我们利用条件动态的自一致性性质，以迭代在线的方式学习扩散桥梁，并在多种设置中展示了有希望的实证结果。

Summary / 总结

This paper addresses the challenge of simulating conditioned dynamics of diffusion processes given initial and terminal states, especially for rare events. The authors propose an iterative online method based on the self-consistency property of conditioned dynamics to learn the diffusion bridge. Their approach shows promising empirical results across various settings.

本文解决了给定初始和终端状态下的扩散过程条件动态模拟问题，特别是对于罕见事件。作者提出了一种利用条件动态的自一致性性质的迭代在线方法来学习扩散桥梁。他们的方法在多种设置中表现出有希望的实验结果。

Minimum Weighted Feedback Arc Sets for Ranking from Pairwise Comparisons

Authors: Soroush Vahidi, Ioannis Koutis

First: 2024-12-10T16:51:11+00:00 · Latest: 2025-12-04T18:29:06+00:00

Comments: This is a preliminary paper

Abs · PDF · Code1 · Code2

Abstract

The Minimum Weighted Feedback Arc Set (MWFAS) problem is closely related to the task of deriving a global ranking from pairwise comparisons. Recent work by He et al. (ICML 2022) advanced the state of the art on ranking benchmarks using learning based methods, but did not examine the underlying connection to MWFAS. In this paper, we investigate this relationship and introduce efficient combinatorial algorithms for solving MWFAS as a means of addressing the ranking problem. Our experimental results show that these simple, learning free methods achieve substantially faster runtimes than recent learning based approaches, while also delivering competitive, and in many cases superior, ranking accuracy. These findings suggest that lightweight combinatorial techniques offer a scalable and effective alternative to deep learning for large scale ranking tasks.

中文标题/摘要

标题：最小加权反馈弧集在从成对比较推导排名中的应用

最小加权反馈弧集（MWFAS）问题与从成对比较推导全局排名的任务密切相关。He等人（ICML 2022）使用基于学习的方法在排名基准上取得了最先进的成果，但没有探讨其与MWFAS的内在联系。本文研究了这种关系，并引入了高效的组合算法来解决MWFAS问题，以应对排名问题。实验结果表明，这些简单、无学习的方法在运行时间上比最近的基于学习的方法快得多，同时也能提供具有竞争力，甚至在许多情况下更优的排名准确性。这些发现表明，轻量级的组合技术为大规模排名任务提供了一种可扩展且有效的替代方案。

Summary / 总结

This paper investigates the connection between the Minimum Weighted Feedback Arc Set (MWFAS) problem and the task of deriving a global ranking from pairwise comparisons. It introduces efficient combinatorial algorithms for solving MWFAS and shows that these methods achieve faster runtimes compared to recent learning-based approaches, while also providing competitive ranking accuracy. The results indicate that lightweight combinatorial techniques can be a scalable and effective alternative to deep learning for large-scale ranking tasks.

研究探讨了最小加权反馈弧集（MWFAS）问题与基于成对比较的排名之间的关系。引入了求解MWFAS的高效组合算法，并展示了这些方法在运行时间上比最近的基于学习的方法更快，同时还能提供具有竞争力的排名准确性。研究结果表明，在大规模排名任务中，轻量级组合技术可以作为深度学习之外的一种可扩展且有效的替代方案。
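The link between ranking and MWFAS can be illustrated with a small sketch: every pairwise comparison that a ranking contradicts is a "backward" arc, and the total weight of those arcs is the feedback arc set weight that the ranking pays. The code below is only an illustration under stated assumptions. The function names are hypothetical, and the net-weight ordering is a simple baseline heuristic, not the combinatorial algorithm proposed in the paper.

```python
from collections import defaultdict

def feedback_arc_weight(ranking, arcs):
    """Total weight of arcs (u, v, w) that the ranking contradicts.

    An arc (u, v, w) means "u beat v with weight w". It is violated
    when u is placed after v in the ranking.
    """
    position = {item: i for i, item in enumerate(ranking)}
    return sum(w for u, v, w in arcs if position[u] > position[v])

def rank_by_net_weight(arcs):
    """Baseline heuristic (an assumption, not the paper's method):
    order items by weighted wins minus weighted losses, descending."""
    score = defaultdict(float)
    for u, v, w in arcs:
        score[u] += w
        score[v] -= w
    # Ties are broken by item name so the result is deterministic.
    return sorted(score, key=lambda item: (-score[item], str(item)))

# A 3-cycle: any ranking must violate at least one arc.
arcs = [("a", "b", 3.0), ("b", "c", 2.0), ("c", "a", 1.0)]
ranking = rank_by_net_weight(arcs)
violated = feedback_arc_weight(ranking, arcs)
```

On this toy cycle the heuristic places the lightest arc backwards, which is the cheapest way to break the cycle.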

Hybrid Quantum-Classical Autoencoders for Unsupervised Network Intrusion Detection

Authors: Mohammad Arif Rasyidi, Omar Alhussein, Sami Muhaidat, Ernesto Damiani

First: 2025-12-04T18:29:05+00:00 · Latest: 2025-12-04T18:29:05+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Unsupervised anomaly-based intrusion detection requires models that can generalize to attack patterns not observed during training. This work presents the first large-scale evaluation of hybrid quantum-classical (HQC) autoencoders for this task. We construct a unified experimental framework that iterates over key quantum design choices, including quantum-layer placement, measurement approach, variational and non-variational formulations, and latent-space regularization. Experiments across three benchmark NIDS datasets show that HQC autoencoders can match or exceed classical performance in their best configurations, although they exhibit higher sensitivity to architectural decisions. Under zero-day evaluation, well-configured HQC models provide stronger and more stable generalization than classical and supervised baselines. Simulated gate-noise experiments reveal early performance degradation, indicating the need for noise-aware HQC designs. These results provide the first data-driven characterization of HQC autoencoder behavior for network intrusion detection and outline key factors that govern their practical viability. All experiment code and configurations are available at https://github.com/arasyi/hqcae-network-intrusion-detection.

中文标题/摘要

标题：混合量子-经典自编码器在无监督网络入侵检测中的应用

无监督异常检测需要能够泛化到训练期间未观察到的攻击模式的模型。本研究首次对混合量子-经典（HQC）自编码器在该任务中的大规模性能进行了评估。我们构建了一个统一的实验框架，迭代了关键的量子设计选择，包括量子层的位置、测量方法、变分和非变分形式以及潜在空间正则化。在三个基准NIDS数据集上的实验表明，HQC自编码器在最佳配置下可以匹配或超越经典性能，尽管它们对架构决策的敏感性更高。在零日评估中，配置良好的HQC模型提供了比经典和监督基线更强且更稳定的泛化能力。模拟门噪声实验揭示了早期性能下降，表明需要噪声感知的HQC设计。这些结果提供了HQC自编码器在网络入侵检测中行为的首个数据驱动表征，并概述了影响其实用性的关键因素。所有实验代码和配置均可在https://github.com/arasyi/hqcae-network-intrusion-detection/获取。

Summary / 总结

This work evaluates hybrid quantum-classical autoencoders for unsupervised network intrusion detection, presenting a unified experimental framework that explores various quantum design choices. Across three benchmark datasets, HQC autoencoders match or exceed classical performance in their best configurations, showing stronger and more stable generalization under zero-day attacks compared to classical and supervised baselines. However, they are more sensitive to architectural decisions and exhibit early performance degradation under simulated gate noise, indicating the need for noise-aware designs. These results provide insights into the practical viability of HQC autoencoders for network intrusion detection.

这项工作评估了混合量子-经典自编码器在无监督网络入侵检测中的应用，通过一个统一的实验框架探索了各种量子设计选择。在三个基准数据集上，HQC自编码器在最佳配置下能够匹配或超越经典模型的表现，并在零日攻击下展现出更强和更稳定的泛化能力，优于经典和监督基线。然而，它们对架构决策更为敏感，并在模拟的门噪声下表现出早期性能下降，表明需要噪声感知的设计。这些结果为混合量子-经典自编码器在网络入侵检测中的实际可行性提供了数据驱动的表征，并指出了其关键影响因素。

Path Channels and Plan Extension Kernels: a Mechanistic Description of Planning in a Sokoban RNN

Authors: Mohammad Taufeeque, Aaron David Tucker, Adam Gleave, Adrià Garriga-Alonso

Venue: NeurIPS 2025

First: 2025-06-11T19:36:17+00:00 · Latest: 2025-12-04T18:28:33+00:00

Comments: Presented at the Mechanistic Interpretability Workshop at NeurIPS 2025. 34 pages, 26 figures

Abs · PDF · Code1 · Code2

Abstract

We partially reverse-engineer a convolutional recurrent neural network (RNN) trained with model-free reinforcement learning to play the box-pushing game Sokoban. We find that the RNN stores future moves (plans) as activations in particular channels of the hidden state, which we call path channels. A high activation in a particular location means that, when a box is in that location, it will get pushed in the channel's assigned direction. We examine the convolutional kernels between path channels and find that they encode the change in position resulting from each possible action, thus representing part of a learned transition model. The RNN constructs plans by starting at the boxes and goals. These kernels extend activations in path channels forwards from boxes and backwards from the goal. Negative values are placed in channels at obstacles. This causes the extension kernels to propagate the negative value in reverse, thus pruning the last few steps and letting an alternative plan emerge; a form of backtracking. Our work shows that a precise understanding of the plan representation allows us to directly understand the bidirectional planning-like algorithm learned by model-free training in more familiar terms.

中文标题/摘要

标题：路径通道和计划扩展核：Sokoban RNN规划的机理描述

我们对一个通过无模型强化学习训练来玩推箱子游戏Sokoban的卷积递归神经网络（RNN）进行了部分逆向工程。我们发现，RNN将未来的动作（计划）以激活的形式存储在隐藏状态的特定通道中，我们称之为路径通道。特定位置的高激活意味着当箱子位于该位置时，它将被推向该通道指定的方向。我们检查了路径通道之间的卷积核，发现它们编码了每种可能动作导致的位置变化，从而代表了学习到的转移模型的一部分。RNN从箱子和目标出发构建计划。这些核将路径通道中的激活从箱子向前扩展，并从目标向后扩展。在障碍物处，通道中会被置入负值，这使得扩展核反向传播该负值，从而修剪最后几步，让另一种计划浮现，这是一种形式的回溯。我们的工作表明，对计划表示的精确理解使我们能够直接用更熟悉的术语理解无模型训练所学到的类似双向规划的算法。

Summary / 总结

The study aims to understand the planning mechanism of a convolutional recurrent neural network (RNN) trained for the Sokoban game. The RNN uses specific channels in its hidden state to store future moves, termed path channels, where high activation indicates a future move in a particular direction. The convolutional kernels between these channels encode the effects of actions, representing a learned transition model. The RNN constructs plans by extending activations from the boxes and goals, with obstacles causing negative values that trigger backtracking, allowing alternative plans to emerge. This research demonstrates that understanding the plan representation helps in interpreting the bidirectional planning algorithm learned by the RNN.

研究旨在理解为 Sokoban 游戏训练的卷积递归神经网络 (RNN) 的规划机制。RNN 使用隐藏状态中的特定通道来存储未来动作，称为路径通道，其中高激活表示特定方向的未来动作。这些通道之间的卷积核编码了动作的效果，代表了一个学习到的转移模型。RNN 通过从箱子和目标扩展激活来构建计划，障碍物导致负值，触发回溯，从而允许替代计划的出现。这项研究显示，理解计划表示有助于解释 RNN 通过无模型训练学习到的双向规划算法。

Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models

Authors: Shuai Wang, Zhenhua Liu, Jiaheng Wei, Xuanwu Yin, Dong Li, Emad Barsoum

First: 2025-06-11T09:01:59+00:00 · Latest: 2025-12-04T18:28:33+00:00

Abs · PDF · Code1 · Code2

Abstract

We present Athena-PRM, a multimodal process reward model (PRM) designed to evaluate the reward score for each step in solving complex reasoning problems. Developing high-performance PRMs typically demands significant time and financial investment, primarily due to the necessity for step-level annotations of reasoning steps. Conventional automated labeling methods, such as Monte Carlo estimation, often produce noisy labels and incur substantial computational costs. To efficiently generate high-quality process-labeled data, we propose leveraging prediction consistency between weak and strong completers as a criterion for identifying reliable process labels. Remarkably, Athena-PRM demonstrates outstanding effectiveness across various scenarios and benchmarks with just 5,000 samples. We also develop two effective strategies to improve the performance of PRMs: ORM initialization and up-sampling for negative data. We validate our approach in three specific scenarios: verification for test-time scaling, direct evaluation of reasoning step correctness, and reward-ranked fine-tuning. Athena-PRM consistently achieves superior performance across multiple benchmarks and scenarios. Notably, when using Qwen2.5-VL-7B as the policy model, Athena-PRM enhances performance by 10.2 points on WeMath and 7.1 points on MathVista for test-time scaling. Furthermore, Athena-PRM sets state-of-the-art (SoTA) results on VisualProcessBench, outperforming the previous SoTA by 3.9 in F1 score and showcasing its robust capability to accurately assess the correctness of reasoning steps. Additionally, using Athena-PRM as the reward model, we develop Athena-7B with reward-ranked fine-tuning, which outperforms the baseline by a significant margin on five benchmarks.

中文标题/摘要

标题：Athena：通过数据高效过程奖励模型增强多模态推理

我们提出了Athena-PRM，这是一种多模态过程奖励模型（PRM），旨在评估解决复杂推理问题的每一步的奖励分数。开发高性能的PRM通常需要大量的时间和财务投资，主要是因为需要对推理步骤进行逐级注释。传统的自动化标签方法，如蒙特卡洛估计，通常会产生噪声标签并导致巨大的计算成本。为了高效地生成高质量的过程标注数据，我们提出利用弱完成者和强完成者之间预测一致性作为识别可靠过程标签的标准。令人惊讶的是，Athena-PRM仅使用5,000个样本就在各种场景和基准测试中表现出色。此外，我们还开发了两种有效策略以提高PRM的性能：初始化ORM和对负样本进行上采样。我们在三个特定场景中验证了我们的方法：测试时间缩放的验证、直接评估推理步骤的正确性以及奖励排名微调。我们的Athena-PRM在多个基准测试和场景中始终表现出色。值得注意的是，当使用Qwen2.5-VL-7B作为策略模型时，Athena-PRM在测试时间缩放上的WeMath和MathVista上分别提高了10.2分和7.1分。此外，Athena-PRM在VisualProcessBench中达到了最先进的（SoTA）结果，并且在F1分数上比之前的SoTA高出3.9分，展示了其准确评估推理步骤正确性的强大能力。此外，利用Athena-PRM作为奖励模型，我们开发了Athena-7B并使用奖励排名微调，其在五个基准测试中显著优于基线。

Summary / 总结

Athena-PRM is a multimodal process reward model designed to evaluate the reward score for each step in solving complex reasoning problems. It leverages prediction consistency between weak and strong completers to generate reliable process labels efficiently. Athena-PRM shows outstanding performance with only 5,000 samples across various scenarios and benchmarks. It improves performance by 10.2 points on WeMath and 7.1 points on MathVista for test time scaling, and sets the state-of-the-art results in VisualProcessBench with a 3.9 F1-score improvement over previous methods.

Athena-PRM 是一种多模态过程奖励模型，用于评估解决复杂推理问题的每一步的奖励分数。它通过弱完成者和强完成者之间的预测一致性来高效生成可靠的过程标签。Athena-PRM 在各种场景和基准测试中仅使用 5,000 个样本就表现出色。它在 WeMath 和 MathVista 的测试时间缩放中分别提高了 10.2 分和 7.1 分的性能，并在 VisualProcessBench 中达到了最先进的结果，比之前的方法提高了 3.9 个 F1 分数。
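The consistency criterion can be sketched as a filter: a step label is kept only when a weak completer and a strong completer agree on it. The sketch below is a minimal illustration under assumptions not stated in the abstract. In particular, each step is labeled with a hard Monte Carlo estimate (a step counts as correct if any rollout continued from it reaches the right answer), and the function names are hypothetical.

```python
def mc_label(rollout_outcomes):
    """Hard Monte Carlo label (assumed here): a step is marked correct
    if at least one rollout continued from it reaches the right answer."""
    return any(rollout_outcomes)

def consistent_labels(weak_rollouts, strong_rollouts):
    """Keep (step_index, label) only for steps where the weak and the
    strong completer produce the same label; drop the rest as unreliable."""
    kept = []
    for i, (weak, strong) in enumerate(zip(weak_rollouts, strong_rollouts)):
        weak_label, strong_label = mc_label(weak), mc_label(strong)
        if weak_label == strong_label:
            kept.append((i, weak_label))
    return kept

# Rollout outcomes per reasoning step (True = final answer was correct).
weak = [[True, False], [False, False], [False, False]]
strong = [[True, True], [True, False], [False, False]]
labels = consistent_labels(weak, strong)
```

Step 1 is discarded because only the strong completer can recover from it, so its label depends on the completer rather than on the step itself.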

4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer

Authors: Xianfeng Wu, Yajing Bai, Minghan Li, Xianzu Wu, Xueqi Zhao, Zhongyuan Lai, Wenyu Liu, Xinggang Wang

First: 2025-12-04T18:15:27+00:00 · Latest: 2025-12-04T18:15:27+00:00

Comments: Code: https://github.com/hustvl/4DLangVGGT, Webpage: https://hustvl.github.io/4DLangVGGT

Abs · PDF · Code1 · Code2 · Code3 · Project1

Abstract

Constructing 4D language fields is crucial for embodied AI, augmented/virtual reality, and 4D scene understanding, as they provide enriched semantic representations of dynamic environments and enable open-vocabulary querying in complex scenarios. However, existing approaches to 4D semantic field construction primarily rely on scene-specific Gaussian splatting, which requires per-scene optimization, exhibits limited generalization, and is difficult to scale to real-world applications. To address these limitations, we propose 4DLangVGGT, the first Transformer-based feed-forward unified framework for 4D language grounding, which jointly integrates geometric perception and language alignment within a single architecture. 4DLangVGGT has two key components: the 4D Visual Geometry Transformer, StreamVGGT, which captures spatio-temporal geometric representations of dynamic scenes; and the Semantic Bridging Decoder (SBD), which projects geometry-aware features into a language-aligned semantic space, thereby enhancing semantic interpretability while preserving structural fidelity. Unlike prior methods that depend on costly per-scene optimization, 4DLangVGGT can be jointly trained across multiple dynamic scenes and directly applied during inference, achieving both deployment efficiency and strong generalization. This design significantly improves the practicality of large-scale deployment and establishes a new paradigm for open-vocabulary 4D scene understanding. Experiments on the HyperNeRF and Neu3D datasets demonstrate that our approach not only generalizes effectively but also achieves state-of-the-art performance, with gains of up to 2% under per-scene training and 1% under multi-scene training. Our code is released at https://github.com/hustvl/4DLangVGGT

中文标题/摘要

标题：4DLangVGGT：四维语言-视觉几何接地变换器

构建四维语言场对于具身人工智能、增强/虚拟现实以及四维场景理解至关重要，因为它们提供了动态环境的丰富语义表示，并在复杂场景中支持开放词汇查询。然而，现有的四维语义场构建方法主要依赖于场景特定的高斯泼溅（Gaussian splatting），这需要逐场景优化，泛化能力有限，难以扩展到实际应用。为了解决这些限制，我们提出了4DLangVGGT，这是首个基于变换器的前馈统一框架，用于四维语言接地，该框架在单一架构中联合整合了几何感知和语言对齐。4DLangVGGT有两个关键组件：四维视觉几何变换器StreamVGGT，用于捕获动态场景的时空几何表示；以及语义桥接解码器（SBD），将几何感知特征投影到语言对齐的语义空间，从而增强语义可解释性同时保持结构保真度。与依赖于昂贵的逐场景优化的先前方法不同，4DLangVGGT可以在多个动态场景上联合训练，并在推理时直接应用，实现了部署效率和强大的泛化能力。这种设计显著提高了大规模部署的实用性，并建立了开放词汇四维场景理解的新范式。在HyperNeRF和Neu3D数据集上的实验表明，我们的方法不仅泛化效果良好，还实现了最先进的性能，在逐场景训练下最高获得2%的提升，在多场景训练下获得1%的提升。我们的代码发布在https://github.com/hustvl/4DLangVGGT

Summary / 总结

The research aims to improve 4D language fields for embodied AI and 4D scene understanding by proposing 4DLangVGGT, a Transformer-based unified framework that integrates geometric perception and language alignment. It includes StreamVGGT for capturing spatio-temporal geometric representations and SBD for projecting geometry-aware features into a language-aligned semantic space. Experiments show that 4DLangVGGT generalizes well and achieves state-of-the-art performance, with up to 2% gains under per-scene training and 1% improvements under multi-scene training. This design enhances practicality and scalability for large-scale deployment in 4D scene understanding.

研究旨在通过提出4DLangVGGT（一种结合几何感知和语言对齐的Transformer统一框架）来改进4D语言场的构建，以支持具身AI和4D场景理解。该框架包括StreamVGGT用于捕捉时空几何表示，以及SBD用于将几何感知特征投影到语言对齐的语义空间。实验表明，4DLangVGGT具有良好的泛化能力，并实现了最先进的性能，在逐场景训练中获得高达2%的提升，在多场景训练中获得1%的改进。这种设计提高了在4D场景理解中大规模部署的实用性和可扩展性。

Improving Graph Neural Network Training, Defense, and Hypergraph Partitioning via Adversarial Robustness Evaluation

Authors: Yongyu Wang

First: 2024-12-19T11:10:48+00:00 · Latest: 2025-12-04T18:10:42+00:00

Abs · PDF · Code1 · Code2

Abstract

Graph Neural Networks (GNNs) are a highly effective neural network architecture for processing graph-structured data. Unlike traditional neural networks that rely solely on the features of the data as input, GNNs leverage both the graph structure, which represents the relationships between data points, and the feature matrix of the data to optimize their feature representation. This unique capability enables GNNs to achieve superior performance across various tasks. However, it also makes GNNs more susceptible to noise from both the graph structure and data features, which can significantly increase the training difficulty and degrade their performance. To address this issue, this paper proposes a novel method for selecting noise-sensitive training samples from the original training set to construct a smaller yet more effective training set for model training. These samples are used to help improve the model's ability to correctly process data in noisy environments. We evaluate our approach on three classical GNN models (GCN, GAT, and GraphSAGE) and three widely used benchmark datasets (Cora, Citeseer, and PubMed). Our experiments demonstrate that the proposed method can substantially boost the training of Graph Neural Networks compared to using randomly sampled training sets of the same size from the original training set and the larger original full training set. We further propose a robust-node-based hypergraph partitioning method, an adversarial-robustness-based graph pruning method for GNN defenses, and a related spectral edge attack method.

中文标题/摘要

标题：通过对抗鲁棒性评估提高图神经网络训练、防御及超图分区

图神经网络（GNNs）是一种高效的神经网络架构，用于处理图结构数据。与传统神经网络仅依赖数据特征作为输入不同，GNNs 利用图结构，即数据点之间的关系，以及数据的特征矩阵来优化其特征表示。这种独特的能力使 GNNs 在各种任务中表现出色。然而，这也使 GNNs 更容易受到图结构和数据特征噪声的影响，这会显著增加训练难度并降低其性能。为了解决这一问题，本文提出了一种新颖的方法，从原始训练集中选择对噪声敏感的训练样本，以构建一个更小但更有效的训练集用于模型训练。这些样本用于帮助提高模型在噪声环境中正确处理数据的能力。我们已在三种最经典的 GNN 模型 GCN、GAT 和 GraphSAGE 以及三种广泛使用的基准数据集 Cora、Citeseer 和 PubMed 上评估了我们的方法。实验结果表明，与从原始训练集随机采样的相同大小的训练集和更大的原始完整训练集相比，所提出的方法可以显著提高图神经网络的训练效果。我们还提出了一种基于鲁棒节点的超图分区方法、一种基于对抗鲁棒性的图神经网络防御的图剪枝方法以及相关谱边缘攻击方法。

Summary / 总结

This paper addresses the challenge of noise sensitivity in Graph Neural Networks (GNNs) by proposing a method to select noise-sensitive training samples, thereby improving model training and performance. The method evaluates adversarial robustness to identify critical samples for training. Experiments on GCN, GAT, and GraphSAGE with Cora, Citeseer, and PubMed datasets show that this approach significantly enhances training efficiency compared to random sampling. Additionally, the paper introduces a robust-node based hypergraph partitioning method and an adversarial robustness-based graph pruning method for GNN defenses.

本文通过提出一种选择噪声敏感训练样本的方法来解决图神经网络（GNN）的训练和防御问题，该方法有助于提高模型的鲁棒性。该方法在三个GNN模型（GCN、GAT和GraphSAGE）和三个基准数据集（Cora、Citeseer和PubMed）上进行了评估，显示了与随机采样相比在训练效率上的显著提升。此外，论文还提出了基于鲁棒节点的超图划分方法、基于对抗鲁棒性的图剪枝方法以及相关频谱边缘攻击方法。

Joint 3D Geometry Reconstruction and Motion Generation for 4D Synthesis from a Single Image

Authors: Yanran Zhang, Ziyi Wang, Wenzhao Zheng, Zheng Zhu, Jie Zhou, Jiwen Lu

First: 2025-12-04T17:59:10+00:00 · Latest: 2025-12-04T17:59:10+00:00

Comments: 18 Pages

Abs · PDF · Code1 · Code2 · Code3

Abstract

Generating interactive and dynamic 4D scenes from a single static image remains a core challenge. Most existing generate-then-reconstruct and reconstruct-then-generate methods decouple geometry from motion, causing spatiotemporal inconsistencies and poor generalization. To address these, we extend the reconstruct-then-generate framework to jointly perform Motion generation and geometric Reconstruction for 4D Synthesis (MoRe4D). We first introduce TrajScene-60K, a large-scale dataset of 60,000 video samples with dense point trajectories, addressing the scarcity of high-quality 4D scene data. Based on this, we propose a diffusion-based 4D Scene Trajectory Generator (4D-STraG) to jointly generate geometrically consistent and motion-plausible 4D point trajectories. To leverage single-view priors, we design a depth-guided motion normalization strategy and a motion-aware module for effective geometry and dynamics integration. We then propose a 4D View Synthesis Module (4D-ViSM) to render videos with arbitrary camera trajectories from 4D point track representations. Experiments show that MoRe4D generates high-quality 4D scenes with multi-view consistency and rich dynamic details from a single image. Code: https://github.com/Zhangyr2022/MoRe4D.

中文标题/摘要

标题：单张图像的4D合成联合三维几何重建与运动生成

从单张静态图像生成交互和动态的4D场景仍然是一个核心挑战。大多数现有的生成-然后重建和重建-然后生成方法将几何与运动分离，导致时空不一致性和较差的泛化能力。为了解决这些问题，我们扩展了重建-然后生成框架，以联合执行运动生成和几何重建，用于4D合成（MoRe4D）。我们首先引入了TrajScene-60K，这是一个包含60,000个视频样本的大规模数据集，每个样本都有密集的点轨迹，解决了高质量4D场景数据稀缺的问题。基于此，我们提出了一种基于扩散的4D场景轨迹生成器（4D-STraG），以联合生成几何上一致且运动上合理的4D点轨迹。为了利用单视角先验，我们设计了一种深度引导的运动归一化策略和一种运动感知模块，以实现有效的几何和动力学集成。然后，我们提出了一种4D视图合成模块（4D-ViSM），从4D点轨迹表示中渲染具有任意摄像机轨迹的视频。实验表明，MoRe4D可以从单张图像生成具有多视角一致性和丰富动态细节的高质量4D场景。代码：https://github.com/Zhangyr2022/MoRe4D。

Summary / 总结

The research aims to generate interactive and dynamic 4D scenes from a single static image, addressing the limitations of existing methods that decouple geometry and motion. MoRe4D jointly performs motion generation and geometric reconstruction using a diffusion-based 4D Scene Trajectory Generator and a depth-guided motion normalization strategy. The method generates high-quality 4D scenes with multi-view consistency and rich dynamic details from a single image, outperforming existing approaches in terms of spatiotemporal consistency and generalization. Code: https://github.com/Zhangyr2022/MoRe4D.

研究旨在从单张静态图像生成交互性和动态性的4D场景，解决现有方法将几何和运动分离导致的空间时间不一致问题。MoRe4D框架联合进行运动生成和几何重建。引入了TrajScene-60K，这是一个包含60,000个视频样本的大规模数据集，每个样本都有密集的点轨迹。提出了4D场景轨迹生成器（4D-STraG）来生成一致且合理的4D点轨迹。方法还包括基于深度的运动归一化策略和运动感知模块，以有效整合几何和动力学。4D视图合成模块（4D-ViSM）可以从4D点轨迹表示中渲染具有任意摄像机轨迹的视频。实验表明，MoRe4D能够从单张图像生成高质量的4D场景，具有多视角一致性和丰富的动态细节。

Semantic-Guided Two-Stage GAN for Face Inpainting with Hybrid Perceptual Encoding

Authors: Abhigyan Bhattacharya, Hiranmoy Roy

Venue: CVPR

First: 2025-12-04T17:56:08+00:00 · Latest: 2025-12-04T17:56:08+00:00

Comments: Submitted for review CVPR-2025

Abs · PDF · Code1 · Code2

Abstract

Facial image inpainting aims to restore missing or corrupted regions in face images while preserving identity, structural consistency, and photorealistic image quality, a task that arises in particular in photo restoration. Despite many recent advances in deep generative models, existing methods struggle with large irregular masks, often producing blurry textures at the edges of the masked region, semantic inconsistencies, or unconvincing facial structures, owing to direct pixel-level synthesis and limited exploitation of facial priors. In this paper, we propose a novel architecture that addresses these challenges through semantic-guided hierarchical synthesis. Our approach first organizes and synthesizes information based on meaning and then refines the texture. This process gives clear insight into the facial structure before detailed images are created. In the first stage, we blend two techniques: CNNs, which focus on local features, and Vision Transformers, which focus on global features. This helps us create clear and detailed semantic layouts. In the second stage, we use a Multi-Modal Texture Generator to refine these layouts by pulling in information from different scales, ensuring that everything looks cohesive and consistent. The architecture naturally handles arbitrary mask configurations through dynamic attention without mask-specific training. Experiments on two datasets, CelebA-HQ and FFHQ, show that our model outperforms other state-of-the-art methods, with improvements in metrics such as LPIPS, PSNR, and SSIM. It produces visually striking results with better semantic preservation in challenging large-area inpainting situations.

中文标题/摘要

标题：基于语义指导的两阶段GAN在具有混合感知编码的面部图像修复中的应用

面部图像修复的目标是在保持身份、结构一致性和照片真实感的同时恢复面部图像中的缺失或损坏区域，这是一个专门用于照片修复的任务。尽管近年来在深度生成模型方面取得了许多进展，但现有方法在处理大不规则遮罩时仍存在问题，经常在遮罩区域边缘产生模糊的纹理、语义不一致或由于直接的像素级合成方法和面部先验的有限利用导致不令人信服的面部结构。在本文中，我们提出了一种新颖的架构，通过语义指导的分层合成来解决上述挑战。我们的方法首先基于意义组织和合成信息，然后细化纹理。这个过程在我们开始创建详细图像之前，为我们提供了清晰的面部结构见解。在第一阶段，我们结合了两种技术：一种使用CNN关注局部特征，另一种使用视觉变换器关注全局特征。这帮助我们创建了清晰且详细的语义布局。在第二阶段，我们使用多模态纹理生成器通过从不同尺度中拉取信息来细化这些布局，确保一切看起来连贯且一致。该架构通过动态注意力自然处理任意的遮罩配置，无需针对特定遮罩进行训练。在CelebA-HQ和FFHQ两个数据集上的实验表明，我们的模型优于其他最先进的方法，在LPIPS、PSNR和SSIM等指标上有所提高。在具有挑战性的大面积修复情况下，它产生了视觉上引人注目的结果，具有更好的语义保留。

Summary / 总结

This paper addresses the challenge of facial image inpainting by proposing a semantic-guided two-stage GAN that uses hybrid perceptual encoding. The method first synthesizes semantic layouts using a combination of CNNs and Vision Transformers, followed by texture refinement with a Multi-Modal Texture Generator. Experiments on CelebA-HQ and FFHQ demonstrate that this approach outperforms existing methods in terms of LPIPS, PSNR, and SSIM, and produces visually striking results with better semantic preservation in large-area inpainting scenarios.

论文提出了一种基于语义的两阶段GAN，使用混合感知编码来解决面部图像修复问题。该方法首先使用CNN和Vision Transformers的组合来合成语义布局，然后使用多模态纹理生成器进行纹理细化。实验结果表明，该方法在LPIPS、PSNR和SSIM等指标上优于现有方法，并在大面积修复场景中产生了视觉上引人注目的结果，同时保持了更好的语义一致性。

Arbitrage: Efficient Reasoning via Advantage-Aware Speculation

Authors: Monishwaran Maheswaran, Rishabh Tiwari, Yuezhou Hu, Kerem Dilmen, Coleman Hooper, Haocheng Xi, Nicholas Lee, Mehrdad Farajtabar, Michael W. Mahoney, Kurt Keutzer, Amir Gholami

First: 2025-12-04T17:50:53+00:00 · Latest: 2025-12-04T17:50:53+00:00

Comments: 22 pages

Abs · PDF · Code1 · Code2

Abstract

Modern Large Language Models achieve impressive reasoning capabilities with long chains of thought, but they incur substantial computational cost during inference, which motivates techniques to improve the performance-cost ratio. Among these techniques, Speculative Decoding accelerates inference by employing a fast but inaccurate draft model to autoregressively propose tokens, which are then verified in parallel by a more capable target model. However, due to unnecessary rejections caused by token mismatches in semantically equivalent steps, traditional token-level Speculative Decoding struggles in reasoning tasks. Although recent works have shifted to step-level semantic verification, which improves efficiency by accepting or rejecting entire reasoning steps, existing step-level methods still regenerate many rejected steps with little improvement, wasting valuable target compute. To address this challenge, we propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically based on the relative advantage between draft and target models. Instead of applying a fixed acceptance threshold, Arbitrage uses a lightweight router trained to predict when the target model is likely to produce a meaningfully better step. This routing approximates an ideal Arbitrage Oracle that always chooses the higher-quality step, achieving near-optimal efficiency-accuracy trade-offs. Across multiple mathematical reasoning benchmarks, Arbitrage consistently surpasses prior step-level Speculative Decoding baselines, reducing inference latency by up to roughly 2× at matched accuracy.

中文标题/摘要

标题：套利：通过优势意识的推测实现高效推理

现代大型语言模型凭借长链推理能力取得了令人印象深刻的推理能力，但在推理过程中会产生巨大的计算成本，这促使人们开发技术以提高性能与成本的比例。这些技术中，推测解码通过使用快速但不准确的草稿模型自回归地提出令牌，然后由更强大的目标模型并行验证，从而加速推理。然而，由于在语义等价步骤中由于令牌不匹配导致不必要的拒绝，传统的令牌级推测解码在推理任务中表现不佳。尽管最近的研究转向了步骤级语义验证，这提高了效率，通过接受或拒绝整个推理步骤，但现有的步骤级方法仍然会重新生成许多被拒绝的步骤，浪费了宝贵的目标计算资源。为了解决这一挑战，我们提出了套利，这是一种新颖的步骤级推测生成框架，根据草稿模型和目标模型之间的相对优势动态路由生成。与应用固定接受阈值不同，套利使用一个轻量级路由器，该路由器被训练以预测目标模型何时可能产生更有意义的步骤。这种路由近似于理想的套利Oracle，它总是选择质量更高的步骤，从而实现接近最优的效率-准确度权衡。在多个数学推理基准测试中，套利始终超越了先前的步骤级推测解码基线，将匹配准确度下的推理延迟降低多达约2倍。

Summary / 总结

The research aims to improve the efficiency of reasoning tasks by avoiding wasted target-model regeneration in step-level speculative decoding. The method, Arbitrage, uses a lightweight router to predict when the target model will produce a meaningfully better step, dynamically routing generation based on the relative advantage between draft and target models. Experiments show that Arbitrage outperforms previous step-level speculative decoding methods, reducing inference latency by up to roughly 2× at matched accuracy across several mathematical reasoning benchmarks.

研究旨在通过减少推测解码中的不必要的拒绝，提高大型语言模型在推理任务中的效率。提出的Arbitrage框架根据草稿模型和目标模型之间的相对优势动态路由生成，使用一个轻量级路由器来预测目标模型何时更可能生成更好的步骤。实验表明，Arbitrage在保持匹配准确率的同时，比之前的步骤级推测解码方法更优，可将推理延迟最多减少2倍。
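The routing idea can be sketched as a loop in which the draft model proposes each reasoning step and the expensive target model is called only when a router predicts a meaningful gain. This is a minimal sketch under stated assumptions, not the authors' implementation: `draft_step`, `target_step`, and `router` are hypothetical callables, and comparing the router score to a fixed `threshold` is an illustrative simplification.

```python
def generate_with_routing(draft_step, target_step, router, prefix,
                          max_steps, threshold=0.5):
    """Step-level speculative generation with an advantage-aware router.

    draft_step / target_step: callables mapping the steps so far to the
    next step (or None when generation is finished).
    router: callable returning the estimated probability that the target
    model would produce a meaningfully better step than the draft.
    """
    steps, target_calls = list(prefix), 0
    for _ in range(max_steps):
        candidate = draft_step(steps)
        if candidate is None:
            break
        # Spend target compute only where the router expects a real gain.
        if router(steps, candidate) > threshold:
            candidate = target_step(steps)
            target_calls += 1
        steps.append(candidate)
    return steps, target_calls

# Toy stand-ins for the two models and the router (all hypothetical).
def toy_draft(steps):
    return f"d{len(steps)}" if len(steps) < 4 else None

def toy_target(steps):
    return f"t{len(steps)}"

def toy_router(steps, candidate):
    return 0.9 if len(steps) % 2 == 1 else 0.1

steps, target_calls = generate_with_routing(
    toy_draft, toy_target, toy_router, prefix=[], max_steps=10)
```

In this toy run the target model is invoked for two of the four steps. A verifier that rejects and regenerates without estimating the expected gain could spend target compute on steps where the draft was already good enough.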

Dual-Path Region-Guided Attention Network for Ground Reaction Force and Moment Regression

Authors: Xuan Li, Samuel Bello

First: 2025-12-04T17:47:01+00:00 · Latest: 2025-12-04T17:47:01+00:00

Abs · PDF · Code1 · Code2

Abstract

Accurate estimation of three-dimensional ground reaction forces and moments (GRFs/GRMs) is crucial for both biomechanics research and clinical rehabilitation evaluation. In this study, we focus on insole-based GRF/GRM estimation and further validate our approach on a public walking dataset. We propose a Dual-Path Region-Guided Attention Network that integrates anatomy-inspired spatial priors and temporal priors into a region-level attention mechanism, while a complementary path captures context from the full sensor field. The two paths are trained jointly and their outputs are combined to produce the final GRF/GRM predictions. Our model outperforms strong baseline models, including CNN and CNN-LSTM architectures, on two datasets, achieving the lowest six-component average NRMSE of 5.78% on the insole dataset and 1.42% for the vertical ground reaction force on the public dataset. This demonstrates robust performance for ground reaction force and moment estimation.

中文标题/摘要

标题：基于双路径区域引导注意力网络的地面反作用力和力矩回归

准确估计三维地面反作用力和力矩（GRFs/GRMs）对于生物力学研究和临床康复评估至关重要。在本研究中，我们专注于基于鞋垫的GRF/GRM估计，并进一步在公共行走数据集上验证了我们的方法。我们提出了一种双路径区域引导注意力网络，将解剖学启发的空间先验和时间先验整合到区域级注意力机制中，而互补路径则从整个传感器场中捕获上下文。两个路径联合训练，其输出结合生成最终的GRF/GRM预测。结论：我们的模型在两个数据集上均优于包括CNN和CNN-LSTM架构在内的强基线模型，在鞋垫数据集上六分量平均NRMSE最低为5.78%，在公共数据集上垂直地面反作用力的NRMSE为1.42%。这表明该模型在地面反作用力和力矩估计中具有稳健的性能。

Summary / 总结

The study aims to improve the accuracy of ground reaction force and moment estimation for biomechanics research and clinical applications. It introduces a Dual-Path Region-Guided Attention Network that combines spatial and temporal priors with a region-level attention mechanism. The model outperforms CNN and CNN-LSTM architectures, achieving the lowest six-component average NRMSE of 5.78% on an insole dataset and 1.42% for vertical ground reaction force on a public dataset.

本研究旨在提高地面反作用力和力矩估计的准确性，以支持生物力学研究和临床应用。作者提出了一种结合空间和时间先验的双路径区域引导注意力网络，并采用区域级别的注意力机制。该网络在鞋垫数据集上实现了最低的六分量平均NRMSE（5.78%），在公共数据集上垂直地面反作用力的NRMSE为1.42%，优于CNN和CNN-LSTM模型。
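For readers unfamiliar with the metric, the sketch below shows one common way to compute a normalized RMSE and average it over force and moment components. The abstract does not state the normalization used. Dividing by the range of the ground-truth signal is an assumption here (other conventions divide by the mean or the standard deviation), and the function names are hypothetical.

```python
import math

def nrmse_percent(y_true, y_pred):
    """RMSE divided by the ground-truth range, in percent.
    Range normalization is an assumption; the paper may use another one."""
    squared_error = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    rmse = math.sqrt(squared_error / len(y_true))
    return 100.0 * rmse / (max(y_true) - min(y_true))

def mean_nrmse_percent(components):
    """Average NRMSE over components, e.g. three forces and three moments.
    `components` is a list of (y_true, y_pred) pairs."""
    scores = [nrmse_percent(t, p) for t, p in components]
    return sum(scores) / len(scores)

# Toy signal: every prediction is off by exactly 1 over a range of 20.
truth = [0.0, 10.0, 20.0, 10.0]
pred = [1.0, 9.0, 21.0, 11.0]
score = nrmse_percent(truth, pred)
average = mean_nrmse_percent([(truth, pred), (truth, truth)])
```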

RAMEN: Resolution-Adjustable Multimodal Encoder for Earth Observation

Authors: Nicolas Houdré, Diego Marcos, Hugo Riffaud de Turckheim, Dino Ienco, Laurent Wendling, Camille Kurtz, Sylvain Lobry

First: 2025-12-04T17:40:17+00:00 · Latest: 2025-12-04T17:40:17+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Earth observation (EO) data spans a wide range of spatial, spectral, and temporal resolutions, from high-resolution optical imagery to low-resolution multispectral products or radar time series. While recent foundation models have improved multimodal integration for learning meaningful representations, they often expect fixed input resolutions or are based on sensor-specific encoders, limiting generalization across heterogeneous EO modalities. To overcome these limitations, we introduce RAMEN, a resolution-adjustable multimodal encoder that learns a shared visual representation across EO data in a fully sensor-agnostic manner. RAMEN treats the modality and the spatial and temporal resolutions as key input data features, enabling coherent analysis across modalities within a unified latent space. Its main methodological contribution is to define spatial resolution as a controllable output parameter, giving users direct control over the desired level of detail at inference and allowing explicit trade-offs between spatial precision and computational cost. We train a single, unified transformer encoder to reconstruct masked multimodal EO data drawn from diverse sources, ensuring generalization across sensors and resolutions. Once pretrained, RAMEN transfers effectively to both known and unseen sensor configurations and outperforms larger state-of-the-art models on the community-standard PANGAEA benchmark, which contains various multi-sensor and multi-resolution downstream tasks. Our code and pretrained model are available at https://github.com/nicolashoudre/RAMEN.

中文标题/摘要

标题：RAMEN：适用于地球观测的分辨率可调多模态编码器

地球观测（EO）数据涵盖了广泛的空域、光谱和时间分辨率，从高分辨率光学图像到低分辨率多光谱产品或雷达时间序列。虽然最近的预训练模型在多模态集成学习有意义的表示方面有所改进，但它们通常需要固定输入分辨率或基于特定传感器的编码器，这限制了在异构EO模态之间的泛化能力。为克服这些限制，我们引入了RAMEN，这是一种分辨率可调的多模态编码器，能够在完全传感器无关的方式下学习EO数据的共享视觉表示。RAMEN 将模态、空域和时域分辨率视为关键输入数据特征，使在统一的潜在空间内跨模态进行一致分析成为可能。其主要方法论贡献在于将空域分辨率定义为可控的输出参数，使用户在推理时能够直接控制所需的细节水平，并允许在空间精度与计算成本之间进行显式权衡。我们训练了一个单一的统一变压器编码器，用于重建来自多种来源的掩码多模态EO数据，确保在传感器和分辨率之间的一般化能力。一旦预训练完成，RAMEN 可以有效地转移到已知和未知的传感器配置中，并在社区标准PANGAEA基准测试中优于更大的最先进的模型，该基准测试包含各种多传感器和多分辨率下游任务。我们的代码和预训练模型可在https://github.com/nicolashoudre/RAMEN/ 获取。

Summary / 总结

RAMEN is a resolution-adjustable multimodal encoder designed for Earth observation data, which spans various resolutions and modalities. It learns a shared representation across different EO data types without relying on sensor-specific encoders. By treating spatial resolution as a controllable parameter, RAMEN allows users to adjust the level of detail during inference. Experiments show that RAMEN outperforms larger state-of-the-art models on the PANGAEA benchmark, demonstrating its effectiveness in handling multi-sensor and multi-resolution tasks.

RAMEN旨在解决不同分辨率的地球观测数据多模态集成的挑战。它提出了一种分辨率可调的多模态编码器，能够在不依赖特定传感器编码器的情况下学习不同EO数据类型之间的共享表示。主要发现是，RAMEN在PANGAEA基准测试中优于更大规模的先进模型，展示了对已知和未知传感器配置的有效迁移性，并在推理过程中直接控制空间精度和计算成本。

Triangle Multiplication Is All You Need For Biomolecular Structure Representations

Authors: Jeffrey Ouyang-Zhang, Pranav Murugan, Daniel J. Diaz, Gianluca Scarpellini, Richard Strong Bowen, Nate Gruver, Adam Klivans, Philipp Krähenbühl, Aleksandra Faust, Maruan Al-Shedivat

First: 2025-10-21T17:59:02+00:00 · Latest: 2025-12-04T17:39:54+00:00

Comments: Preprint

Abs · PDF · Code1 · Code2 · Code3

Abstract

AlphaFold has transformed protein structure prediction, but emerging applications such as virtual ligand screening, proteome-wide folding, and de novo binder design demand predictions at a massive scale, where runtime and memory costs become prohibitive. A major bottleneck lies in the Pairformer backbone of AlphaFold3-style models, which relies on computationally expensive triangular primitives, especially triangle attention, for pairwise reasoning. We introduce Pairmixer, a streamlined alternative that eliminates triangle attention while preserving higher-order geometric reasoning capabilities that are critical for structure prediction. Pairmixer substantially improves computational efficiency, matching state-of-the-art structure predictors across folding and docking benchmarks, delivering up to 4x faster inference on long sequences while reducing training cost by 34%. Its efficiency alleviates the computational burden of downstream applications such as modeling large protein complexes, high-throughput ligand and binder screening, and hallucination-based design. Within BoltzDesign, for example, Pairmixer delivers over 2x faster sampling and scales to sequences ~30% longer than the memory limits of Pairformer. Code is available at https://github.com/genesistherapeutics/pairmixer.

中文标题/摘要

标题：三角乘法即为生物分子结构表示所需的一切

AlphaFold 已经彻底改变了蛋白质结构预测，但虚拟配体筛选、全蛋白质组折叠和从头结合剂设计等新兴应用需要大规模预测，这使得运行时间和内存成本变得难以承受。AlphaFold3 类型模型的 Pairformer 主干是一个主要瓶颈，它依赖计算开销高昂的三角运算基元（尤其是三角注意力）来进行成对推理。我们引入了 Pairmixer，这是一种精简的替代方案，它去除了三角注意力，同时保留了对结构预测至关重要的高阶几何推理能力。Pairmixer 显著提高了计算效率，在折叠和对接基准测试中与最先进的结构预测器持平，长序列推理速度最高提升 4 倍，同时将训练成本降低 34%。其效率减轻了大型蛋白质复合物建模、高通量配体与结合剂筛选以及基于幻觉的设计等下游应用的计算负担。例如，在 BoltzDesign 中，Pairmixer 的采样速度提升超过 2 倍，并可扩展到比 Pairformer 内存上限长约 30% 的序列。代码可在 https://github.com/genesistherapeutics/pairmixer 获取。

Summary / 总结

The research addresses the computational cost of protein structure prediction at massive scale, in particular the inefficiency of the Pairformer backbone in AlphaFold3-style models. It introduces Pairmixer, which removes triangle attention while retaining the higher-order geometric reasoning needed for structure prediction. Pairmixer matches state-of-the-art predictors on folding and docking benchmarks, delivers up to 4x faster inference on long sequences, reduces training cost by 34%, and scales to longer sequences than Pairformer.

研究旨在解决大规模蛋白质结构预测中的计算挑战，特别是Pairformer骨干网络在AlphaFold3风格模型中的效率问题。研究引入了Pairmixer，它去除了三角注意力，但仍保留了关键的几何推理能力。Pairmixer显著提高了计算效率，匹配了最先进的性能，并将训练成本降低了34%，同时在长序列推理上达到4倍的加速，并在下游应用如配体筛选和设计中支持更长序列的建模。

SAGE: Spatial-visual Adaptive Graph Exploration for Visual Place Recognition

Authors: Shunpeng Chen, Changwei Wang, Rongtao Xu, Xingtian Pei, Yukun Song, Jinzhou Lin, Wenhao Xu, Jingyi Zhang, Li Guo, Shibiao Xu

First: 2025-09-30T03:34:40+00:00 · Latest: 2025-12-04T17:29:56+00:00

Comments: 23 pages

Abs · PDF · Code1 · Code2

Abstract

Visual Place Recognition (VPR) requires robust retrieval of geotagged images despite large appearance, viewpoint, and environmental variation. Prior methods focus on descriptor fine-tuning or fixed sampling strategies yet neglect the dynamic interplay between spatial context and visual similarity during training. We present SAGE (Spatial-visual Adaptive Graph Exploration), a unified training pipeline that enhances granular spatial-visual discrimination by jointly improving local feature aggregation, sample organization during training, and hard sample mining. We introduce a lightweight Soft Probing module that learns residual weights from training data for patch descriptors before bilinear aggregation, boosting distinctive local cues. During training we reconstruct an online geo-visual graph that fuses geographic proximity and current visual similarity so that candidate neighborhoods reflect the evolving embedding landscape. To concentrate learning on the most informative place neighborhoods, we seed clusters from high-affinity anchors and iteratively expand them with a greedy weighted clique expansion sampler. Implemented with a frozen DINOv2 backbone and parameter-efficient fine-tuning, SAGE achieves SOTA across eight benchmarks. It attains 98.9%, 95.8%, 94.5%, and 96.0% Recall@1 on SPED, Pitts30k-test, MSLS-val, and Nordland, respectively. Notably, our method obtains 100% Recall@10 on SPED using only 4096D global descriptors. Code and models will be released upon acceptance.
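
The graph construction and cluster expansion described above might look roughly like the sketch below. This is a loose illustration, not the authors' sampler: the exponential distance kernel, the mixing weight `alpha`, the `radius` scale, and both function names are assumptions introduced here.

```python
import numpy as np

def geo_visual_affinity(coords, desc, radius=25.0, alpha=0.5):
    """Fuse geographic proximity and visual similarity into one affinity matrix.

    coords: (N, 2) positions in metres; desc: (N, D) L2-normalised descriptors.
    radius and alpha are hypothetical knobs, not values from the paper.
    """
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    geo = np.exp(-dist / radius)        # 1 when co-located, decays with distance
    vis = (desc @ desc.T + 1.0) / 2.0   # cosine similarity mapped to [0, 1]
    aff = alpha * geo + (1.0 - alpha) * vis
    np.fill_diagonal(aff, 0.0)          # no self-edges
    return aff

def greedy_expand(aff, seed, size):
    """Grow a cluster from `seed` by repeatedly adding the node with the
    largest summed affinity to the current members."""
    members = [seed]
    while len(members) < size:
        score = aff[:, members].sum(axis=1)
        score[members] = -np.inf        # never re-pick a current member
        members.append(int(np.argmax(score)))
    return members

# Two places about 1 km apart, three images each (toy data).
coords = np.array([[0, 0], [3, 0], [0, 4],
                   [1000, 0], [1003, 0], [1000, 4]], dtype=float)
rng = np.random.default_rng(0)
desc = rng.normal(size=(6, 16))
desc /= np.linalg.norm(desc, axis=1, keepdims=True)

aff = geo_visual_affinity(coords, desc, alpha=0.8)
cluster = greedy_expand(aff, seed=0, size=3)
print(sorted(cluster))  # [0, 1, 2]
```

In an online setting the descriptors change as the model trains, so the affinity matrix would be rebuilt periodically; the sketch computes it once.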

中文标题/摘要

标题：SAGE：空间视觉自适应图探索在视觉地点识别中的应用

视觉地点识别（VPR）需要在存在大量外观、视角和环境变化的情况下，稳健地检索地理标记图像。先前的方法集中在描述符微调或固定采样策略上，但忽略了训练过程中空间上下文和视觉相似性之间的动态交互。我们提出了SAGE（空间视觉自适应图探索），这是一种统一的训练管道，通过联合改进局部特征聚合、训练期间组织样本和困难样本挖掘，增强粒度的空间视觉区分能力。我们引入了一个轻量级的Soft Probing模块，在双线性聚合之前从训练数据中学习补丁描述符的残差权重，增强独特的局部线索。在训练过程中，我们重建了一个在线的地理视觉图，融合了地理邻近性和当前的视觉相似性，使得候选邻域反映了不断演变的嵌入景观。为了集中学习最有信息量的地点邻域，我们从高亲和力锚点中初始化簇，并使用贪婪加权团簇扩展采样器迭代扩展它们。使用冻结的DINOv2主干和参数高效的微调实现，SAGE在八个基准测试中达到了SOTA。它分别在SPED、Pitts30k-test、MSLS-val和Nordland上达到了98.9%、95.8%、94.5%和96.0%的Recall@1。值得注意的是，仅使用4096D全局描述符，我们的方法在SPED上达到了100%的Recall@10。代码和模型将在接受后发布。

Summary / 总结

SAGE is designed to improve Visual Place Recognition by integrating spatial and visual information during training. It uses a Soft Probing module to enhance local feature aggregation and a geo-visual graph to organize training samples based on geographic and visual similarities. SAGE also employs a greedy weighted clique expansion sampler to focus on the most informative place neighborhoods. The method achieves state-of-the-art performance across eight benchmarks, with Recall@1 scores of 98.9%, 95.8%, 94.5%, and 96.0% on SPED, Pitts30k-test, MSLS-val, and Nordland, respectively. Notably, it reaches 100% Recall@10 on SPED using only 4096D global descriptors.

SAGE旨在通过建模空间上下文与视觉相似性之间的动态交互来提升视觉地点识别（VPR）。它提出了一种统一的训练流程，联合改进局部特征聚合、训练样本组织和困难样本挖掘。SAGE使用Soft Probing模块学习残差权重以增强patch描述符，并重建一个反映嵌入空间演变的在线地理视觉图。该方法基于冻结的DINOv2主干和参数高效微调实现，在八个基准测试上取得了最先进的结果，在SPED、Pitts30k-test、MSLS-val和Nordland上的Recall@1分别为98.9%、95.8%、94.5%和96.0%。值得注意的是，仅使用4096D全局描述符，它在SPED上达到了100%的Recall@10。

Generative Neural Video Compression via Video Diffusion Prior

Authors: Qi Mao, Hao Cheng, Tinghan Yang, Libiao Jin, Siwei Ma

First: 2025-12-04T17:27:32+00:00 · Latest: 2025-12-04T17:27:32+00:00

Abs · PDF · Code1 · Code2

Abstract

We present GNVC-VD, the first DiT-based generative neural video compression framework built upon an advanced video generation foundation model, where spatio-temporal latent compression and sequence-level generative refinement are unified within a single codec. Existing perceptual codecs primarily rely on pre-trained image generative priors to restore high-frequency details, but their frame-wise nature lacks temporal modeling and inevitably leads to perceptual flickering. To address this, GNVC-VD introduces a unified flow-matching latent refinement module that leverages a video diffusion transformer to jointly enhance intra- and inter-frame latents through sequence-level denoising, ensuring consistent spatio-temporal details. Instead of denoising from pure Gaussian noise as in video generation, GNVC-VD initializes refinement from decoded spatio-temporal latents and learns a correction term that adapts the diffusion prior to compression-induced degradation. A conditioning adaptor further injects compression-aware cues into intermediate DiT layers, enabling effective artifact removal while maintaining temporal coherence under extreme bitrate constraints. Extensive experiments show that GNVC-VD surpasses both traditional and learned codecs in perceptual quality and significantly reduces the flickering artifacts that persist in prior generative approaches, even below 0.01 bpp, highlighting the promise of integrating video-native generative priors into neural codecs for next-generation perceptual video compression.

中文标题/摘要

标题：基于视频扩散先验的生成神经视频压缩

我们提出了GNVC-VD，这是第一个基于DiT的生成神经视频压缩框架，它建立在先进的视频生成基础模型之上，其中时空潜空间压缩和序列级生成细化在单一编解码器中统一。现有的感知编解码器主要依赖于预训练的图像生成先验来恢复高频细节，但它们的帧级性质缺乏时间建模，不可避免地导致感知闪烁。为了解决这个问题，GNVC-VD 引入了一个统一的流动匹配潜空间细化模块，利用视频扩散变换器在序列级去噪的同时联合增强帧内和帧间潜空间，确保时空细节的一致性。与视频生成中从纯高斯噪声去噪不同，GNVC-VD 从解码的时空潜空间开始细化，并学习一个适应压缩引起的退化的校正项。进一步的条件适配器将压缩感知线索注入中间的DiT层，使在极端比特率约束下有效去除伪影并保持时间连贯性。广泛的实验表明，GNVC-VD 在感知质量上超过了传统和学习编解码器，并显著减少了先前生成方法中持续存在的闪烁伪影，甚至在低于0.01 bpp的情况下，突显了将视频原生生成先验整合到神经编解码器中进行下一代感知视频压缩的潜力。

Summary / 总结

GNVC-VD is a novel generative neural video compression framework that integrates a video diffusion transformer to enhance both intra- and inter-frame latents through sequence-level denoising, addressing the perceptual flickering issue of existing codecs. It introduces a unified flow-matching latent refinement module and a conditioning adaptor to improve temporal coherence and reduce artifacts, achieving superior perceptual quality even at very low bitrates.

GNVC-VD 是一种将视频扩散变换器集成到框架中的生成神经视频压缩方法，通过序列级去噪同时增强帧内和帧间潜在特征，解决了现有编解码器的视觉闪烁问题。它在感知质量上超越了传统和学习型编解码器，并通过统一的流动匹配潜在精炼模块和注入压缩感知线索的条件适配器，即使在极低比特率下也减少了闪烁伪影。

Convergence of Stochastic Gradient Langevin Dynamics in the Lazy Training Regime

Authors: Noah Oberweis, Semih Cayci

First: 2025-10-24T08:28:53+00:00 · Latest: 2025-12-04T17:25:34+00:00

Abs · PDF · Code1 · Code2

Abstract

Continuous-time models provide important insights into the training dynamics of optimization algorithms in deep learning. In this work, we establish a non-asymptotic convergence analysis of stochastic gradient Langevin dynamics (SGLD), which is an Itô stochastic differential equation (SDE) approximation of stochastic gradient descent in continuous time, in the lazy training regime. We show that, under regularity conditions on the Hessian of the loss function, SGLD with multiplicative and state-dependent noise (i) yields a non-degenerate kernel throughout the training process with high probability, and (ii) achieves exponential convergence to the empirical risk minimizer in expectation, and we establish finite-time and finite-width bounds on the optimality gap. We corroborate our theoretical findings with numerical examples in the regression setting.
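
To make the continuous-time object concrete, the sketch below discretizes a Langevin SDE with the Euler-Maruyama scheme on a small linear regression problem. It is a simplification under stated assumptions: the noise here is additive with constant scale (the paper analyzes multiplicative, state-dependent noise), the full gradient is used rather than a minibatch estimate, and the step size `dt`, inverse temperature `beta`, and function name are illustrative choices.

```python
import numpy as np

def langevin_euler_maruyama(X, y, steps=2000, dt=1e-2, beta=1e4, seed=0):
    """Euler-Maruyama discretization of
    d(theta) = -grad L(theta) dt + sqrt(2 / beta) dW
    for the loss L(theta) = 0.5 * mean((X @ theta - y) ** 2)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(steps):
        grad = X.T @ (X @ theta - y) / n         # full-batch gradient
        noise = rng.normal(size=theta.shape)     # Brownian increment / sqrt(dt)
        theta = theta - dt * grad + np.sqrt(2.0 * dt / beta) * noise
    return theta

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true                               # noiseless targets
theta_hat = langevin_euler_maruyama(X, y)
print(np.round(theta_hat, 2))
```

With a well-conditioned Hessian the iterates contract toward the empirical risk minimizer at an exponential rate and then fluctuate around it at a scale set by `beta`.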

中文标题/摘要

标题：随机梯度朗之万动力学在惰性训练机制下的收敛性

连续时间模型为深度学习中优化算法的训练动力学提供了重要见解。在本文中，我们建立了随机梯度朗之万动力学（SGLD）在惰性训练机制下的非渐近收敛分析，SGLD是随机梯度下降在连续时间下的伊藤随机微分方程（SDE）近似。我们证明，在损失函数海森矩阵满足正则性条件时，带有乘性且状态依赖噪声的SGLD（i）在整个训练过程中以高概率产生非退化核，（ii）在期望意义下以指数速率收敛到经验风险最小化器；我们还建立了最优性差距的有限时间与有限宽度界。我们通过回归设置中的数值示例验证了我们的理论发现。

Summary / 总结

This study investigates the convergence properties of stochastic gradient Langevin dynamics (SGLD) in the lazy training regime. SGLD is an Itô stochastic differential equation approximation of stochastic gradient descent in continuous time. The research demonstrates that under certain conditions on the Hessian of the loss function, SGLD with multiplicative and state-dependent noise ensures a non-degenerate kernel throughout training with high probability and achieves exponential convergence to the empirical risk minimizer in expectation. The study also provides finite-time and finite-width bounds on the optimality gap. Numerical examples in the regression setting support these theoretical findings.

该研究分析了随机梯度朗之万动力学（SGLD）在惰性训练机制下的收敛性，在损失函数海森矩阵满足正则性条件时给出了非渐近收敛分析。带有乘性且状态依赖噪声的SGLD在整个训练过程中以高概率保持非退化核，并在期望意义下以指数速率收敛到经验风险最小化器，同时给出了最优性差距的有限时间与有限宽度界。回归设置下的数值示例支持这些理论发现。

Detecting Perspective Shifts in Multi-agent Systems

Authors: Eric Bridgeford, Hayden Helm

First: 2025-12-04T17:24:56+00:00 · Latest: 2025-12-04T17:24:56+00:00

Abs · PDF · Code1 · Code2

Abstract

Generative models augmented with external tools and update mechanisms (or agents) have demonstrated capabilities beyond intelligent prompting of base models. As agent use proliferates, dynamic multi-agent systems have naturally emerged. Recent work has investigated the theoretical and empirical properties of low-dimensional representations of agents based on query responses at a single time point. This paper introduces the Temporal Data Kernel Perspective Space (TDKPS), which jointly embeds agents across time, and proposes several novel hypothesis tests for detecting behavioral change at the agent- and group-level in black-box multi-agent systems. We characterize the empirical properties of our proposed tests, including their sensitivity to key hyperparameters, in simulations motivated by a multi-agent system of evolving digital personas. Finally, we demonstrate via natural experiment that our proposed tests detect changes that correlate sensitively, specifically, and significantly with a real exogenous event. As far as we are aware, TDKPS is the first principled framework for monitoring behavioral dynamics in black-box multi-agent systems, a critical capability as generative agent deployment continues to scale.

中文标题/摘要

标题：多智能体系统中视角转换的检测

配备外部工具和更新机制的生成模型（即智能体）已经展示出超越对基础模型进行智能提示的能力。随着智能体的广泛应用，动态多智能体系统自然而然地出现了。近期研究探讨了基于单一时间点查询响应的智能体低维表示的理论和实证性质。本文引入了时间数据核视角空间（TDKPS），它将智能体跨时间联合嵌入，并提出了若干新的假设检验，用于在黑盒多智能体系统中检测智能体层面和群体层面的行为变化。我们在以不断演化的数字人格多智能体系统为背景的模拟中刻画了所提检验的实证性质，包括其对关键超参数的敏感性。最后，我们通过自然实验表明，所提检验检测到的变化与一个真实外生事件之间的相关性兼具灵敏性、特异性和显著性。据我们所知，TDKPS是第一个用于监控黑盒多智能体系统行为动态的原理性框架；随着生成式智能体部署规模不断扩大，这一能力至关重要。

Experience Replay with Random Reshuffling

Authors: Yasuhiro Fujita

First: 2025-03-04T04:37:22+00:00 · Latest: 2025-12-04T17:20:31+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Experience replay is a key component in reinforcement learning for stabilizing learning and improving sample efficiency. Its typical implementation samples transitions with replacement from a replay buffer. In contrast, in supervised learning with a fixed dataset, it is a common practice to shuffle the dataset every epoch and consume data sequentially, which is called random reshuffling (RR). RR enjoys theoretically better convergence properties and has been shown to outperform with-replacement sampling empirically. To leverage the benefits of RR in reinforcement learning, we propose sampling methods that extend RR to experience replay, both in uniform and prioritized settings, and analyze their properties via theoretical analysis and simulations. We evaluate our sampling methods on Atari benchmarks, demonstrating their effectiveness in deep reinforcement learning. Code is available at https://github.com/pfnet-research/errr.
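
The difference between the two sampling schemes can be sketched for the simplest case of a fixed-size buffer of indices. This is only a minimal illustration: a real replay buffer grows and overwrites old transitions during training, which the paper's methods address and this sketch does not. The class and function names are hypothetical.

```python
import random

class ReshufflingSampler:
    """Shuffle the index set once per epoch and consume it sequentially,
    so every index appears exactly once before any index repeats."""

    def __init__(self, size, seed=0):
        self.size = size
        self.rng = random.Random(seed)
        self.order = []                      # indices not yet used this epoch

    def sample(self, batch_size):
        batch = []
        while len(batch) < batch_size:
            if not self.order:               # epoch finished: reshuffle
                self.order = list(range(self.size))
                self.rng.shuffle(self.order)
            batch.append(self.order.pop())
        return batch

def sample_with_replacement(size, batch_size, rng):
    """Typical replay sampling: each draw is independent, so repeats are possible."""
    return [rng.randrange(size) for _ in range(batch_size)]

sampler = ReshufflingSampler(size=8)
epoch = sampler.sample(4) + sampler.sample(4)
print(sorted(epoch))  # [0, 1, 2, 3, 4, 5, 6, 7]

iid_batch = sample_with_replacement(8, 4, random.Random(0))
print(len(iid_batch))  # 4
```

Over one pass the reshuffling sampler visits each stored transition the same number of times, whereas with-replacement sampling leaves some transitions unvisited and repeats others.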

中文标题/摘要

标题：随机重洗的经验重放

经验重放是强化学习中的关键组件，用于稳定学习并提高样本效率。其典型实现是从重放缓冲区中有放回地采样转移样本。相比之下，在使用固定数据集的监督学习中，常见做法是每个轮次打乱数据集并按顺序读取数据，这被称为随机重洗（RR）。理论上，RR 具有更好的收敛性质，并且实验证明它优于有放回采样。为了在强化学习中利用 RR 的优势，我们提出了将 RR 扩展到经验重放的采样方法，涵盖均匀和优先级两种设置，并通过理论分析和模拟分析了它们的性质。我们在 Atari 基准上评估了我们的采样方法，证明了它们在深度强化学习中的有效性。代码可在 https://github.com/pfnet-research/errr 获取。

Summary / 总结

The paper explores the application of random reshuffling (RR) in experience replay for reinforcement learning, aiming to improve learning stability and sample efficiency. It introduces RR-based sampling methods for both uniform and prioritized experience replay, supported by theoretical analysis and simulations. The proposed methods are evaluated on Atari benchmarks, showing their effectiveness in deep reinforcement learning.

论文研究了在强化学习中的经验重放中应用随机重洗（RR）的方法，以提高学习的稳定性和样本效率。它提出了适用于均匀和优先级经验重放的RR采样方法，并通过理论分析和模拟支持这些方法。所提出的方法在Atari基准测试中进行了评估，展示了其在深度强化学习中的有效性。

Reflection Removal through Efficient Adaptation of Diffusion Transformers

Authors: Daniyar Zakarin, Thiemo Wandel, Anton Obukhov, Dengxin Dai

First: 2025-12-04T17:12:39+00:00 · Latest: 2025-12-04T17:12:39+00:00

Abs · PDF · Code1 · Code2

Abstract

We introduce a diffusion-transformer (DiT) framework for single-image reflection removal that leverages the generalization strengths of foundation diffusion models in the restoration setting. Rather than relying on task-specific architectures, we repurpose a pre-trained DiT-based foundation model by conditioning it on reflection-contaminated inputs and guiding it toward clean transmission layers. We systematically analyze existing reflection removal data sources for diversity, scalability, and photorealism. To address the shortage of suitable data, we construct a physically based rendering (PBR) pipeline in Blender, built around the Principled BSDF, to synthesize realistic glass materials and reflection effects. Efficient LoRA-based adaptation of the foundation model, combined with the proposed synthetic data, achieves state-of-the-art performance on in-domain and zero-shot benchmarks. These results demonstrate that pretrained diffusion transformers, when paired with physically grounded data synthesis and efficient adaptation, offer a scalable and high-fidelity solution for reflection removal. Project page: https://hf.co/spaces/huawei-bayerlab/windowseat-reflection-removal-web
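
For readers unfamiliar with LoRA, the sketch below shows the general low-rank adaptation idea on a single linear layer: the pre-trained weight stays frozen and only two small matrices are trained. Which DiT layers the authors adapt, and with what rank and scaling, is not stated in the abstract; the `rank`, `alpha`, and class name here are illustrative assumptions.

```python
import numpy as np

class LoRALinear:
    """Linear layer y = x W^T with a trainable low-rank update (alpha / r) * B A."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                   # frozen pre-trained weight
        self.A = rng.normal(scale=0.01, size=(rank, in_dim))   # trainable
        self.B = np.zeros((out_dim, rank))                     # trainable, zero-initialised
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T                     # rank-r correction
        return base + self.scale * update

    def trainable_params(self):
        return self.A.size + self.B.size

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
layer = LoRALinear(W)
x = rng.normal(size=(2, 64))

# B starts at zero, so the adapted layer initially reproduces the frozen one.
print(np.allclose(layer(x), x @ W.T))      # True
print(layer.trainable_params(), W.size)    # 512 4096
```

Because only `A` and `B` receive gradients, the number of trainable parameters per layer scales with the rank rather than with the full weight matrix.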

中文标题/摘要

标题：通过高效适应扩散变换器去除反射

我们提出了一种基于扩散变换器（DiT）的单图像反射去除框架，该框架在图像复原场景中利用基础扩散模型的泛化优势。我们不依赖特定任务的架构，而是复用预训练的DiT基础模型：以受反射污染的输入作为条件，引导其生成干净的透射层。我们系统地分析了现有反射去除数据源的多样性、可扩展性和照片真实感。为了解决合适数据短缺的问题，我们在Blender中围绕Principled BSDF构建了一个基于物理的渲染（PBR）管线，用于合成逼真的玻璃材质和反射效果。对基础模型进行基于LoRA的高效适配，并结合所提出的合成数据，在域内和零样本基准上实现了最先进的性能。这些结果表明，预训练的扩散变换器在与基于物理的数据合成和高效适配相结合时，为反射去除提供了可扩展且高保真的解决方案。项目页面：https://hf.co/spaces/huawei-bayerlab/windowseat-reflection-removal-web

Summary / 总结

This paper introduces a diffusion-transformer (DiT) framework for single-image reflection removal that repurposes a pre-trained DiT-based foundation model by conditioning it on reflection-contaminated inputs. To address data scarcity, the authors construct a physically based rendering (PBR) pipeline in Blender that synthesizes realistic glass materials and reflection effects. Efficient LoRA-based adaptation of the foundation model, combined with the synthetic data, achieves state-of-the-art performance on in-domain and zero-shot benchmarks, demonstrating the scalability and high fidelity of pretrained diffusion transformers for reflection removal.

论文提出了一种基于扩散变换器（DiT）的单图像反射去除框架，通过以受反射污染的输入作为条件来复用预训练的DiT基础模型。作者在Blender中构建了一个基于物理的渲染（PBR）合成管线，用于合成逼真的玻璃材质和反射效果，以解决数据稀缺问题。基于LoRA的高效适配与合成数据相结合，使该方法在域内和零样本基准上达到了最先进的性能，展示了预训练扩散变换器在反射去除中的可扩展性和高保真度。