arXiv 论文速递

MagicQuillV2: Precise and Interactive Image Editing with Layered Visual Cues

Authors: Zichen Liu, Yue Yu, Hao Ouyang, Qiuyu Wang, Shuailei Ma, Ka Leong Cheng, Wen Wang, Qingyan Bai, Yuxuan Zhang, Yanhong Zeng, Yixuan Li, Xing Zhu, Yujun Shen, Qifeng Chen

First: 2025-12-02T18:59:58+00:00 · Latest: 2025-12-02T18:59:58+00:00

Comments: Code and demo available at https://magicquill.art/v2/

Abstract

We propose MagicQuill V2, a novel system that introduces a layered composition paradigm to generative image editing, bridging the gap between the semantic power of diffusion models and the granular control of traditional graphics software. While diffusion transformers excel at holistic generation, their use of singular, monolithic prompts fails to disentangle distinct user intentions for content, position, and appearance. To overcome this, our method deconstructs creative intent into a stack of controllable visual cues: a content layer for what to create, a spatial layer for where to place it, a structural layer for how it is shaped, and a color layer for its palette. Our technical contributions include a specialized data generation pipeline for context-aware content integration, a unified control module to process all visual cues, and a fine-tuned spatial branch for precise local editing, including object removal. Extensive experiments validate that this layered approach effectively resolves the user intention gap, granting creators direct, intuitive control over the generative process.

中文标题/摘要

标题：MagicQuillV2：具有分层视觉提示的精确和交互式图像编辑

我们提出了一种名为MagicQuill V2的新系统，该系统引入了分层合成范式，将生成图像编辑与扩散模型的语义能力和传统图形软件的精细控制相结合。虽然扩散变换器在整体生成方面表现出色，但它们使用单一的、单一的提示无法区分用户对内容、位置和外观的不同意图。为了解决这个问题，我们的方法将创意意图分解为可控的视觉提示堆栈：内容层用于创建什么，空间层用于放置位置，结构层用于形状，颜色层用于调色板。我们的技术贡献包括一种专门的数据生成流水线，用于上下文感知的内容集成，一个统一的控制模块来处理所有视觉提示，以及一个微调的空间分支，用于精确的局部编辑，包括对象删除。广泛的实验验证了这种分层方法有效地解决了用户意图差距，使创作者能够直接、直观地控制生成过程。

Summary / 总结

MagicQuill V2 introduces a layered composition paradigm for generative image editing, addressing the limitations of diffusion models by disentangling user intentions into content, spatial, structural, and color layers. The system includes a specialized data generation pipeline, a unified control module, and a fine-tuned spatial branch for precise local editing. Experiments show that this approach effectively resolves user intention gaps, providing creators with direct and intuitive control over the generative process.

MagicQuill V2 提出了一种分层合成范式，用于生成图像编辑，通过将用户意图分解为内容、空间、结构和颜色层来解决扩散模型的局限性。该系统包括一个专门的数据生成管道、一个统一的控制模块和一个用于精确局部编辑的微调空间分支。实验表明，这种方法有效地解决了用户意图的缺口，为创作者提供了直接和直观的控制生成过程的能力。

OneThinker: All-in-one Reasoning Model for Image and Video

Authors: Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, Yan Feng, Peng Pei, Xunliang Cai, Xiangyu Yue

First: 2025-12-02T18:59:52+00:00 · Latest: 2025-12-02T18:59:52+00:00

Comments: Project page: https://github.com/tulerfeng/OneThinker

Abstract

Reinforcement learning (RL) has recently achieved remarkable success in eliciting visual reasoning within Multimodal Large Language Models (MLLMs). However, existing approaches typically train separate models for different tasks and treat image and video reasoning as disjoint domains. This results in limited scalability toward a multimodal reasoning generalist, which restricts practical versatility and hinders potential knowledge sharing across tasks and modalities. To this end, we propose OneThinker, an all-in-one reasoning model that unifies image and video understanding across diverse fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. To achieve this, we construct the OneThinker-600k training corpus covering all these tasks and employ commercial models for CoT annotation, resulting in OneThinker-SFT-340k for SFT cold start. Furthermore, we propose EMA-GRPO to handle reward heterogeneity in multi-task RL by tracking task-wise moving averages of reward standard deviations for balanced optimization. Extensive experiments on diverse visual benchmarks show that OneThinker delivers strong performance on 31 benchmarks, across 10 fundamental visual understanding tasks. Moreover, it exhibits effective knowledge transfer between certain tasks and preliminary zero-shot generalization ability, marking a step toward a unified multimodal reasoning generalist. All code, model, and data are released.
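The abstract describes EMA-GRPO only at a high level: task-wise moving averages of reward standard deviations are tracked so that tasks with very different reward scales are optimized in a balanced way. A minimal sketch of that idea follows. The class name `EmaStdTracker`, the `decay` and `eps` parameters, and the exact normalization are assumptions made for illustration and may differ from the authors' formulation.

```python
from statistics import mean, pstdev


class EmaStdTracker:
    """Keeps an exponential moving average of the reward std per task (hypothetical helper)."""

    def __init__(self, decay=0.9, eps=1e-6):
        self.decay = decay   # assumed smoothing factor
        self.eps = eps       # avoids division by zero
        self.ema_std = {}    # task name -> smoothed reward std

    def update(self, task, rewards):
        """Fold the std of one rollout group into the task's moving average."""
        batch_std = pstdev(rewards)
        if task not in self.ema_std:
            self.ema_std[task] = batch_std
        else:
            previous = self.ema_std[task]
            self.ema_std[task] = self.decay * previous + (1 - self.decay) * batch_std
        return self.ema_std[task]

    def advantages(self, task, rewards):
        """Group-relative advantages, scaled by the task-level smoothed std."""
        scale = self.update(task, rewards) + self.eps
        baseline = mean(rewards)
        return [(r - baseline) / scale for r in rewards]
```

Dividing by a smoothed, task-level scale rather than the std of each individual group is one way to keep a single noisy group from dominating the update; whether this matches the paper's exact design is not stated in the abstract.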

中文标题/摘要

标题：OneThinker：统一的图像和视频推理模型

强化学习（RL）最近在多模态大型语言模型（MLLMs）中的视觉推理方面取得了显著成功。然而，现有方法通常为不同的任务训练单独的模型，并将图像和视频推理视为分离的领域。这限制了向多模态推理通才的扩展能力，限制了实际的多功能性和妨碍了任务和模态之间的知识共享。为此，我们提出了一种名为OneThinker的统一推理模型，该模型统一了图像和视频理解，涵盖了包括问答、描述、空间和时间定位、跟踪和分割在内的多种基本视觉任务。为了实现这一点，我们构建了涵盖所有这些任务的OneThinker-600k训练语料库，并使用商业模型进行CoT注释，从而得到OneThinker-SFT-340k用于SFT冷启动。此外，我们提出了EMA-GRPO来处理多任务RL中的奖励异质性，通过跟踪每个任务的奖励标准差的移动平均值来实现平衡优化。在多种视觉基准上的广泛实验表明，OneThinker在31个基准上表现出强大的性能，涵盖了10个基本的视觉理解任务。此外，它在某些任务之间表现出有效的知识迁移和初步的零样本泛化能力，标志着向统一的多模态推理通才迈进了一步。所有代码、模型和数据均已发布。

Summary / 总结

OneThinker is an all-in-one reasoning model that unifies image and video understanding across fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. Rather than training separate models per task, it trains a single model on the OneThinker-600k corpus, with the CoT-annotated OneThinker-SFT-340k used for the SFT cold start, and applies EMA-GRPO to balance heterogeneous rewards in multi-task RL. Experiments report strong performance on 31 benchmarks across 10 fundamental visual understanding tasks, along with effective knowledge transfer between certain tasks and preliminary zero-shot generalization.

OneThinker 是一个统一的推理模型，能够跨多种视觉任务（如问答和字幕生成）统一处理图像和视频理解。它通过训练单一模型来应对多个任务和模态的限制，从而实现更好的可扩展性和知识共享。实验表明，OneThinker 在 31 个基准测试中的 10 个基本视觉理解任务上表现出色，并展示了初步的零样本泛化能力。该模型使用 EMA-GRPO 方法来处理多任务强化学习中的奖励异质性。所有代码、模型和数据均已公开发布。

PPTArena: A Benchmark for Agentic PowerPoint Editing

Authors: Michael Ofengenden, Yunze Man, Ziqi Pang, Yu-Xiong Wang

First: 2025-12-02T18:59:50+00:00 · Latest: 2025-12-02T18:59:50+00:00

Comments: 25 pages, 26 figures

Abstract

We introduce PPTArena, a benchmark for PowerPoint editing that measures reliable modifications to real slides under natural-language instructions. In contrast to image-PDF renderings or text-to-slide generation, PPTArena focuses on in-place editing across 100 decks, 2125 slides, and over 800 targeted edits covering text, charts, tables, animations, and master-level styles. Each case includes a ground-truth deck, a fully specified target outcome, and a dual VLM-as-judge pipeline that separately scores instruction following and visual quality using both structural diffs and slide images. Building on this setting, we propose PPTPilot, a structure-aware slide-editing agent that plans semantic edit sequences, routes between high-level programmatic tools and deterministic XML operations for precise control, and verifies outputs through an iterative plan-edit-check loop against task-specific constraints. In our experiments, PPTPilot outperforms strong proprietary agents and frontier VLM systems by over 10 percentage points on compound, layout-sensitive, and cross-slide edits, with particularly large gains in visual fidelity and deck-wide consistency. Despite these improvements, existing agents still underperform on long-horizon, document-scale tasks in PPTArena, highlighting the remaining challenges in reliable PPT editing.
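To make "deterministic XML operations" and the check step of a plan-edit-check loop concrete, the sketch below edits text runs in a DrawingML fragment with Python's standard library and then verifies the result. It is an illustration only: `replace_run_text`, `contains_text`, and the sample fragment are hypothetical, and PPTPilot's actual tools are not described in the abstract.

```python
import xml.etree.ElementTree as ET

# DrawingML main namespace, used for text runs (<a:t>) inside slide XML.
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"

# Tiny stand-in for one paragraph of a slide; real slide XML is much larger.
SAMPLE = (
    f'<a:p xmlns:a="{A_NS}">'
    "<a:r><a:t>Q3 revenue</a:t></a:r>"
    "<a:r><a:t>Outlook</a:t></a:r>"
    "</a:p>"
)


def replace_run_text(xml_text, old, new):
    """Edit step: replace `old` with `new` in every text run; return XML and a change count."""
    root = ET.fromstring(xml_text)
    changed = 0
    for node in root.iter(f"{{{A_NS}}}t"):
        if node.text and old in node.text:
            node.text = node.text.replace(old, new)
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


def contains_text(xml_text, required):
    """Check step: does any text run contain `required`?"""
    root = ET.fromstring(xml_text)
    return any(required in (node.text or "") for node in root.iter(f"{{{A_NS}}}t"))
```

An agent loop could call `replace_run_text`, then `contains_text` against the task constraint, and re-plan if the check fails.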

中文标题/摘要

标题：PPTArena：一种代理型PowerPoint编辑基准

我们介绍了PPTArena，这是一种PowerPoint编辑基准，用于在自然语言指令下对真实幻灯片进行可靠的修改。与图像-PDF渲染或文本到幻灯片生成不同，PPTArena关注在100个演示文稿、2125张幻灯片和超过800个针对文本、图表、表格、动画和母版样式的目标编辑中的就地编辑。每个案例包括一个真实幻灯片副本、一个完全指定的目标结果以及一个双VLM作为裁判的管道，该管道分别使用结构差异和幻灯片图像对指令遵循和视觉质量进行评分。在此基础上，我们提出了PPTPilot，这是一种结构感知的幻灯片编辑代理，它计划语义编辑序列，通过高级程序化工具和确定性XML操作之间的路由实现精确控制，并通过迭代计划-编辑-检查循环与特定任务约束进行验证。在我们的实验中，PPTPilot在复合、布局敏感和跨幻灯片编辑方面比强大的专有代理和前沿VLM系统高出10个百分点以上，在视觉保真度和演示文稿范围一致性方面尤其表现出色。尽管这些改进，现有的代理在PPTArena中的长周期、文档规模任务中仍然表现不佳，突显了可靠PowerPoint编辑中仍存在的挑战。

Summary / 总结

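PPTArena is a benchmark for PowerPoint editing that measures how reliably agents modify real slides under natural-language instructions. It covers 100 decks, 2125 slides, and over 800 targeted edits spanning text, charts, tables, animations, and master-level styles, scored by a dual VLM-as-judge pipeline for instruction following and visual quality. PPTPilot, a structure-aware slide-editing agent, plans semantic edit sequences, routes between high-level programmatic tools and deterministic XML operations, and verifies outputs through an iterative plan-edit-check loop. In the reported experiments, PPTPilot outperforms strong proprietary agents and frontier VLM systems by over 10 percentage points on compound, layout-sensitive, and cross-slide edits, with particularly large gains in visual fidelity and deck-wide consistency, though long-horizon, document-scale tasks remain challenging.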

PPTArena 是一个 PowerPoint 编辑基准，评估自然语言驱动的真实幻灯片修改。它涉及 100 个幻灯片集、2125 张幻灯片和超过 800 个针对文本、图表、表格、动画和母版样式的编辑。每个案例包括真实幻灯片和目标结果，并使用双重 VLM 作为评判管道分别评估指令遵循和视觉质量。PPTPilot，一种结构感知的幻灯片编辑代理，比现有代理和 VLM 系统在复杂编辑上高出超过 10 个百分点，特别是在视觉保真度和幻灯片集一致性方面，但仍面临长期任务的挑战。

MultiShotMaster: A Controllable Multi-Shot Video Generation Framework

Authors: Qinghe Wang, Xiaoyu Shi, Baolu Li, Weikang Bian, Quande Liu, Huchuan Lu, Xintao Wang, Pengfei Wan, Kun Gai, Xu Jia

First: 2025-12-02T18:59:48+00:00 · Latest: 2025-12-02T18:59:48+00:00

Comments: Project Page: https://qinghew.github.io/MultiShotMaster

Abstract

Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, coherent narrative, and controllability beyond text prompts. To tackle these challenges, we propose MultiShotMaster, a framework for highly controllable multi-shot video generation. We extend a pretrained single-shot model by integrating two novel variants of RoPE. First, we introduce Multi-Shot Narrative RoPE, which applies explicit phase shift at shot transitions, enabling flexible shot arrangement while preserving the temporal narrative order. Second, we design Spatiotemporal Position-Aware RoPE to incorporate reference tokens and grounding signals, enabling spatiotemporal-grounded reference injection. In addition, to overcome data scarcity, we establish an automated data annotation pipeline to extract multi-shot videos, captions, cross-shot grounding signals and reference images. Our framework leverages the intrinsic architectural properties to support multi-shot video generation, featuring text-driven inter-shot consistency, customized subject with motion control, and background-driven customized scene. Both shot count and duration are flexibly configurable. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework.
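The phase shift at shot transitions can be pictured as an extra offset on the temporal position index fed to a rotary embedding: frames still count upward in narrative order, but each shot boundary adds a jump. The sketch below is a simplified reading of the abstract. The helper names, the constant `shift`, and the use of a plain 1D RoPE are assumptions, not the paper's exact formulation.

```python
import math


def narrative_positions(shot_lengths, shift):
    """Temporal index per frame: running frame count plus `shift` per shot boundary crossed."""
    positions = []
    frame = 0
    for shot_idx, length in enumerate(shot_lengths):
        for _ in range(length):
            positions.append(frame + shot_idx * shift)
            frame += 1
    return positions


def rope_rotate(vec, position, base=10000.0):
    """Standard rotary embedding: rotate consecutive pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out
```

Because rotary attention scores depend only on position differences, frames within a shot keep their usual relative phase, while frames on opposite sides of a cut are pushed further apart by `shift`.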

中文标题/摘要

标题：MultiShotMaster：一种可控多镜头视频生成框架

当前的视频生成技术在单镜头片段方面表现出色，但在生成具有灵活镜头排列、连贯叙事和超出文本提示的控制能力的多镜头叙事视频方面存在困难。为了解决这些挑战，我们提出了MultiShotMaster，一种高度可控的多镜头视频生成框架。我们通过集成两种新型的RoPE变体扩展了预训练的单镜头模型。首先，我们引入了多镜头叙事RoPE，在镜头转换处应用显式相位偏移，从而实现灵活的镜头排列同时保持时间叙事顺序。其次，我们设计了时空位置感知RoPE，以结合参考标记和接地信号，实现时空定位参考注入。此外，为了克服数据稀缺性，我们建立了一个自动数据注释流水线，以提取多镜头视频、字幕、跨镜头接地信号和参考图像。我们的框架利用内在的架构特性支持多镜头视频生成，具备文本驱动的跨镜头一致性、定制主题与运动控制以及背景驱动的定制场景。镜头数量和持续时间均可灵活配置。大量实验表明，我们的框架具有优越的性能和出色的可控性。

Summary / 总结

MultiShotMaster is a framework for controllable multi-shot video generation that extends a pretrained single-shot model with two RoPE variants. Multi-Shot Narrative RoPE applies an explicit phase shift at shot transitions to allow flexible shot arrangement while preserving temporal narrative order, and Spatiotemporal Position-Aware RoPE incorporates reference tokens and grounding signals for spatiotemporally grounded reference injection. An automated data annotation pipeline extracts multi-shot videos, captions, cross-shot grounding signals, and reference images to address data scarcity. The authors report superior performance and controllability in their experiments.

研究旨在解决当前视频生成技术在生成叙事多镜头视频方面的局限性。提出了MultiShotMaster框架，结合了两种新型RoPE变体，以实现灵活的镜头排列和连贯的叙事。该框架还包括一个自动数据注释流水线来处理数据稀缺问题。实验结果表明，MultiShotMaster在可控性和性能方面优于现有方法。

Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation

Authors: Zeqi Xiao, Yiwei Zhao, Lingxiao Li, Yushi Lan, Yu Ning, Rahul Garg, Roshni Cooper, Mohammad H. Taghavi, Xingang Pan

First: 2025-12-02T18:59:44+00:00 · Latest: 2025-12-02T18:59:44+00:00

Comments: Project page at https://xizaoqu.github.io/video4spatial/

Abstract

We investigate whether video generative models can exhibit visuospatial intelligence, a capability central to human cognition, using only visual data. To this end, we present Video4Spatial, a framework showing that video diffusion models conditioned solely on video-based scene context can perform complex spatial tasks. We validate on two tasks: scene navigation - following camera-pose instructions while remaining consistent with 3D geometry of the scene, and object grounding - which requires semantic localization, instruction following, and planning. Both tasks use video-only inputs, without auxiliary modalities such as depth or poses. With simple yet effective design choices in the framework and data curation, Video4Spatial demonstrates strong spatial understanding from video context: it plans navigation and grounds target objects end-to-end, follows camera-pose instructions while maintaining spatial consistency, and generalizes to long contexts and out-of-domain environments. Taken together, these results advance video generative models toward general visuospatial reasoning.

中文标题/摘要

标题：Video4Spatial：基于视频上下文引导的视频生成与空间智能

我们研究视频生成模型是否能够表现出空间智能，这是一种人类认知的核心能力，仅使用视觉数据。为此，我们提出了Video4Spatial框架，该框架表明，仅基于视频场景上下文条件的视频扩散模型可以执行复杂的空间任务。我们在两个任务上进行了验证：场景导航 - 在遵循摄像机姿态指令的同时保持与场景三维几何的一致性，以及对象定位 - 这需要语义定位、指令跟随和规划。两个任务都使用视频输入，不使用深度或姿态等辅助模态。通过框架中简单而有效的设计选择和数据整理，Video4Spatial展示了强大的空间理解能力：它从视频上下文进行端到端的导航规划和目标对象定位，遵循摄像机姿态指令并保持空间一致性，并且能够泛化到长上下文和跨域环境。这些结果共同推动了视频生成模型向通用空间智能推理的发展。

Summary / 总结

The research explores whether video generative models can exhibit visuospatial intelligence using only visual data. The Video4Spatial framework shows that video diffusion models conditioned solely on video-based scene context can perform complex spatial tasks, namely scene navigation and object grounding, without auxiliary modalities such as depth or poses. The model plans navigation and grounds target objects end-to-end, follows camera-pose instructions while maintaining spatial consistency, and generalizes to long contexts and out-of-domain environments.

研究旨在探索视频生成模型是否能够仅使用视觉数据表现出空间智能。Video4Spatial框架表明，基于视频场景上下文的视频扩散模型可以执行复杂的空间任务，如场景导航和物体定位。该模型能够规划导航、定位目标物体、遵循摄像机姿态指令，并且能够泛化到长上下文和跨域环境，展示了从视频上下文中的强大空间理解能力。

Amortized Sampling with Transferable Normalizing Flows

Authors: Charlie B. Tan, Majdi Hassan, Leon Klein, Saifuddin Syed, Dominique Beaini, Michael M. Bronstein, Alexander Tong, Kirill Neklyudov

Venue: NeurIPS 2025

First: 2025-08-25T16:28:18+00:00 · Latest: 2025-12-02T18:58:37+00:00

Comments: Presented at NeurIPS 2025

Abstract

Efficient equilibrium sampling of molecular conformations remains a core challenge in computational chemistry and statistical inference. Classical approaches such as molecular dynamics or Markov chain Monte Carlo inherently lack amortization; the computational cost of sampling must be paid in full for each system of interest. The widespread success of generative models has inspired interest towards overcoming this limitation through learning sampling algorithms. Despite performing competitively with conventional methods when trained on a single system, learned samplers have so far demonstrated limited ability to transfer across systems. We demonstrate that deep learning enables the design of scalable and transferable samplers by introducing Prose, a 285 million parameter all-atom transferable normalizing flow trained on a corpus of peptide molecular dynamics trajectories up to 8 residues in length. Prose draws zero-shot uncorrelated proposal samples for arbitrary peptide systems, achieving the previously intractable transferability across sequence length, whilst retaining the efficient likelihood evaluation of normalizing flows. Through extensive empirical evaluation we demonstrate the efficacy of Prose as a proposal for a variety of sampling algorithms, finding a simple importance sampling-based finetuning procedure to achieve competitive performance to established methods such as sequential Monte Carlo. We open-source the Prose codebase, model weights, and training dataset, to further stimulate research into amortized sampling methods and finetuning objectives.
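The role of a flow as a "proposal" can be illustrated with self-normalized importance sampling: draw samples from a distribution with a tractable density, then reweight them toward the target. In the sketch below a 1D Gaussian stands in for the normalizing flow and a toy quadratic energy stands in for the molecular target. All names are illustrative, and nothing here reproduces Prose itself or its finetuning procedure.

```python
import math
import random


def log_normal_pdf(x, mean, std):
    """Log density of a 1D Gaussian (stand-in for a flow's exact log-likelihood)."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)


def snis_expectation(f, log_target, sample_proposal, log_proposal, n, rng):
    """Self-normalized importance sampling estimate of E_target[f(x)].

    `log_target` may be unnormalized (for example, a negative energy).
    Returns the estimate and the effective sample size (ESS).
    """
    xs = [sample_proposal(rng) for _ in range(n)]
    log_w = [log_target(x) - log_proposal(x) for x in xs]
    shift = max(log_w)  # subtract the max for numerical stability
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    estimate = sum(wi * f(x) for wi, x in zip(w, xs)) / total
    ess = total ** 2 / sum(wi * wi for wi in w)
    return estimate, ess


if __name__ == "__main__":
    rng = random.Random(0)
    est, ess = snis_expectation(
        f=lambda x: x,
        log_target=lambda x: -0.5 * (x - 1.0) ** 2,   # toy energy, mean 1
        sample_proposal=lambda r: r.gauss(0.0, 1.5),
        log_proposal=lambda x: log_normal_pdf(x, 0.0, 1.5),
        n=20000,
        rng=rng,
    )
    print(est, ess)
```

The ESS gives a rough measure of how well the proposal covers the target; a proposal that transfers well to a new system should keep it high without retraining.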

中文标题/摘要

标题：转移可及流

分子构象的有效平衡采样仍然是计算化学和统计推断中的核心挑战。经典方法如分子动力学或马尔可夫链蒙特卡洛方法本身缺乏自动化；每个感兴趣系统的采样计算成本必须全额支付。生成模型的广泛成功激发了通过学习采样算法来克服这一限制的兴趣。尽管在单个系统上训练时与传统方法竞争，但学习采样器迄今为止在系统间转移方面表现出有限的能力。我们通过引入Prose，一种基于肽分子动力学轨迹训练的2.85亿参数的原子级转移可及流，证明了深度学习能够设计出可扩展且转移的采样器。Prose能够零样本生成任意肽系统的未相关提议样本，实现了序列长度上的转移性，同时保留了可及流的高效似然评估。通过广泛的实证评估，我们证明了Prose作为各种采样算法的提议的有效性，发现一种基于重要性采样的微调程序可实现与顺序蒙特卡洛等传统方法相当的性能。我们开源了Prose代码库、模型权重和训练数据集，以进一步刺激对自动化采样方法和微调目标的研究。

Summary / 总结

This paper addresses efficient equilibrium sampling of molecular conformations in computational chemistry and statistical inference. It introduces Prose, a 285 million parameter all-atom transferable normalizing flow trained on peptide molecular dynamics trajectories up to 8 residues in length, which draws zero-shot uncorrelated proposal samples for arbitrary peptide systems. Prose transfers across sequence lengths while retaining the efficient likelihood evaluation of normalizing flows. Used as a proposal, with a simple importance sampling-based finetuning procedure, it achieves performance competitive with established methods such as sequential Monte Carlo. The codebase, model weights, and training dataset are open-sourced.

研究旨在开发高效且可迁移的分子构象采样方法，解决经典采样技术的局限性。引入了Prose，这是一种2.85亿参数的归一化流，能够实现不同肽系统的零样本采样。Prose在序列长度上实现了可迁移性，同时保持了归一化流的高效似然评估。实验证明，通过简单的重要性采样微调，Prose与序列蒙特卡洛等传统方法具有竞争力。

Provable Scaling Laws of Feature Emergence from Learning Dynamics of Grokking

Authors: Yuandong Tian

First: 2025-09-25T20:08:09+00:00 · Latest: 2025-12-02T18:56:34+00:00

Comments: Find new mechanism that $G_F$ carries useful signals also at initial stage and thus remove theory's dependency on weight decay. Also add experiments on zero-init output layers, showing the technique is effective in accelerating grokking

Abstract

While the phenomenon of grokking, i.e., delayed generalization, has been studied extensively, it remains an open problem whether there is a mathematical framework that characterizes what kind of features will emerge, how and under which conditions this happens, and how it relates to the gradient dynamics of training, for complex structured inputs. We propose a novel framework, named $\mathbf{Li}_2$, that captures three key stages for the grokking behavior of 2-layer nonlinear networks: (I) lazy learning, (II) independent feature learning and (III) interactive feature learning. At the lazy learning stage, the top layer overfits to random hidden representations and the model appears to memorize, and at the same time, the backpropagated gradient $G_F$ from the top layer now carries information about the target label, with a specific structure that enables each hidden node to learn its representation independently. Interestingly, the independent dynamics follows exactly the gradient ascent of an energy function $E$, and its local maxima are precisely the emerging features. We study whether these local-optima induced features are generalizable, their representation power, and how they change with sample size, in group arithmetic tasks. When hidden nodes start to interact in the later stage of learning, we provably show how $G_F$ changes to focus on missing features that need to be learned. Our study sheds light on the roles played by key hyperparameters such as weight decay, learning rate and sample sizes in grokking, leads to provable scaling laws of feature emergence, memorization and generalization, and reveals why recent optimizers such as Muon can be effective, from the first principles of gradient dynamics. Our analysis can be extended to multi-layers. The code is available at https://github.com/yuandong-tian/understanding/tree/main/ssl/real-dataset/cogo.

中文标题/摘要

标题：可证明的特征涌现缩放定律从学习动力学的grokking

虽然grokking现象，即延迟泛化，已经被广泛研究，但仍然没有一个数学框架能够描述什么样的特征会涌现，它如何发生以及在什么条件下发生，且与复杂结构输入的梯度动态密切相关。我们提出了一种新的框架，称为$\mathbf{Li}_2$，捕捉了2层非线性网络grokking行为的三个关键阶段：(I) 懒学习，(II) 独立特征学习和(III) 交互特征学习。在懒学习阶段，顶层过拟合到随机隐藏表示，模型似乎在记忆，同时，从顶层反向传播的梯度$G_F$现在包含了目标标签的信息，具有特定结构，使每个隐藏节点能够独立学习其表示。有趣的是，独立动态遵循能量函数$E$的梯度上升，其局部最大值正是涌现的特征。我们研究了这些局部最优解诱导的特征是否具有泛化能力，它们的表示能力以及在样本数量变化时如何变化，特别是在分组算术任务中。当学习后期隐藏节点开始相互作用时，我们证明了$G_F$如何变化以关注需要学习的缺失特征。我们的研究揭示了关键超参数（如权重衰减、学习率和样本数量）在grokking中的作用，导出了特征涌现、记忆和泛化的可证明缩放定律，并揭示了为什么最近的优化器（如Muon）有效。我们的分析可以扩展到多层。代码可在https://github.com/yuandong-tian/understanding/tree/main/ssl/real-dataset/cogo/获取。

Summary / 总结

The study investigates grokking, proposing a mathematical framework that characterizes feature emergence during the training of 2-layer nonlinear networks. The framework, $\mathbf{Li}_2$, captures three stages: lazy learning, independent feature learning, and interactive feature learning. During the lazy learning stage, the top layer overfits to random hidden representations, while the backpropagated gradient $G_F$ carries information about the target label, enabling each hidden node to learn its representation independently by following gradient ascent on an energy function $E$. The study also shows how $G_F$ changes to focus on missing features in the later stage, leading to provable scaling laws of feature emergence, memorization, and generalization. The analysis reveals the roles of key hyperparameters and explains why recent optimizers like Muon can be effective. The generalizability, representation power, and sample-size dependence of the emerging features are studied on group arithmetic tasks.

研究探讨了2层非线性网络中的“grokking”现象，提出了一种名为$\mathbf{Li}_2$的框架，涵盖了三个阶段：懒学习、独立特征学习和交互特征学习。研究显示，在初始阶段，模型会记忆随机隐藏表示，但反向传播的梯度$G_F$包含了目标标签的信息，从而实现独立特征学习。研究还展示了$G_F$如何在后期阶段聚焦于需要学习的缺失特征，从而得出特征涌现、记忆和泛化的可证明的缩放定律。分析揭示了权重衰减、学习率和样本大小等超参数在“grokking”中的作用，并解释了为什么像Muon这样的优化器可以有效。零初始化输出层的实验进一步验证了该技术在加速“grokking”方面的有效性。

ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation

Authors: Mengchen Zhang, Qi Chen, Tong Wu, Zihan Liu, Dahua Lin

First: 2025-12-02T18:56:12+00:00 · Latest: 2025-12-02T18:56:12+00:00

Abstract

Despite progress in video-to-audio generation, the field focuses predominantly on mono output, lacking spatial immersion. Existing binaural approaches remain constrained by a two-stage pipeline that first generates mono audio and then performs spatialization, often resulting in error accumulation and spatio-temporal inconsistencies. To address this limitation, we introduce the task of end-to-end binaural spatial audio generation directly from silent video. To support this task, we present the BiAudio dataset, comprising approximately 97K video-binaural audio pairs spanning diverse real-world scenes and camera rotation trajectories, constructed through a semi-automated pipeline. Furthermore, we propose ViSAudio, an end-to-end framework that employs conditional flow matching with a dual-branch audio generation architecture, where two dedicated branches model the audio latent flows. Integrated with a conditional spacetime module, it balances consistency between channels while preserving distinctive spatial characteristics, ensuring precise spatio-temporal alignment between audio and the input video. Comprehensive experiments demonstrate that ViSAudio outperforms existing state-of-the-art methods across both objective metrics and subjective evaluations, generating high-quality binaural audio with spatial immersion that adapts effectively to viewpoint changes, sound-source motion, and diverse acoustic environments. Project website: https://kszpxxzmc.github.io/ViSAudio-project.
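Conditional flow matching trains a network to predict the velocity that carries a noise sample to a data sample along a chosen path. The sketch below uses the common straight-line path and a mean squared error objective, summed over two per-channel branches. The linear path, the function names, and the way the branches are combined are assumptions for illustration, since the abstract does not specify ViSAudio's exact objective.

```python
def cfm_training_pair(x0, x1, t):
    """Point on the straight path from noise x0 to data x1 at time t, and its target velocity."""
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return xt, velocity


def cfm_loss(predict_velocity, x0, x1, t, cond):
    """Mean squared error between the predicted and target velocity."""
    xt, target = cfm_training_pair(x0, x1, t)
    pred = predict_velocity(xt, t, cond)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def binaural_cfm_loss(branches, noise, latents, t, cond):
    """Sum of per-channel losses; `branches` maps 'left'/'right' to a velocity model."""
    return sum(
        cfm_loss(branches[ch], noise[ch], latents[ch], t, cond)
        for ch in ("left", "right")
    )
```

Here `cond` stands for whatever video-derived conditioning the model receives; in this sketch it is simply passed through to each branch.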

中文标题/摘要

标题：ViSAudio：端到端基于视频的双耳三维声场生成

尽管在视频到音频生成方面取得了进展，该领域仍主要关注单声道输出，缺乏空间沉浸感。现有的双耳方法仍然受限于两阶段管道，首先生成单声道音频，然后进行空间化处理，这通常会导致误差累积和时空不一致。为解决这一限制，我们提出了直接从静音视频生成端到端双耳三维声场的任务。为了支持这一任务，我们提出了BiAudio数据集，包含约97K个视频-双耳音频对，覆盖了多样化的现实场景和摄像机旋转轨迹，通过半自动化管道构建。此外，我们提出了ViSAudio，这是一种端到端框架，采用条件流匹配和双分支音频生成架构，其中两个专用分支建模音频潜在流。结合条件时空模块，它在保持独特空间特征的同时平衡通道一致性，确保音频与输入视频在时空上的精确对齐。全面的实验表明，ViSAudio在客观指标和主观评估中均优于现有最先进的方法，生成高质量的双耳音频，具有空间沉浸感，能够有效适应视角变化、声源运动和各种声学环境。项目网站：https://kszpxxzmc.github.io/ViSAudio-project.

Summary / 总结

The research aims to generate high-quality binaural spatial audio directly from silent video to enhance spatial immersion. It introduces ViSAudio, an end-to-end framework using a dual-branch audio generation architecture and a conditional spacetime module to balance consistency and spatial characteristics. Experiments show that ViSAudio outperforms existing methods in both objective and subjective evaluations, producing precise and adaptive binaural audio.

研究旨在直接从静音视频生成高质量的立体声空间音频，以增强空间沉浸感。ViSAudio 是一个端到端框架，采用双分支音频生成架构和条件时空模块来平衡一致性和保留空间特征。实验表明，ViSAudio 在客观和主观评估中均优于现有方法，生成精确且适应性强的立体声音频。

Learning Physically Consistent Lagrangian Control Models Without Acceleration Measurements

Authors: Ibrahim Laiche, Mokrane Boudaoud, Patrick Gallinari, Pascal Morin

First: 2025-12-02T18:56:02+00:00 · Latest: 2025-12-02T18:56:02+00:00

Comments: Submitted to the L4DC 2026

Abstract

This article investigates the modeling and control of Lagrangian systems involving non-conservative forces using a hybrid method that does not require acceleration calculations. It focuses in particular on the derivation and identification of physically consistent models, which are essential for model-based control synthesis. Lagrangian or Hamiltonian neural networks provide useful structural guarantees but the learning of such models often leads to inconsistent models, especially on real physical systems where training data are limited, partial and noisy. Motivated by this observation and the objective to exploit these models for model-based nonlinear control, a learning algorithm relying on an original loss function is proposed to improve the physical consistency of Lagrangian systems. A comparative analysis of different learning-based modeling approaches with the proposed solution shows significant improvements in terms of physical consistency of the learned models, on both simulated and experimental systems. The model's consistency is then exploited to demonstrate, on an experimental benchmark, the practical relevance of the proposed methodology for feedback linearization and energy-based control techniques.
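To see why accelerations normally appear and how they can be avoided, consider the Euler-Lagrange equations with a non-conservative generalized force $\tau_{\mathrm{nc}}$. Expanding the time derivative introduces $\ddot q$, but integrating over one sampling interval leaves only positions and velocities. The derivation below is a standard identity offered as a sketch; the abstract does not state that the paper's loss takes this form, and the symbols $p$ and $\tau_{\mathrm{nc}}$ are introduced here for illustration.

```latex
\begin{aligned}
&\frac{d}{dt}\frac{\partial L}{\partial \dot q}(q,\dot q)
  - \frac{\partial L}{\partial q}(q,\dot q) = \tau_{\mathrm{nc}}, \\[4pt]
&p(t) := \frac{\partial L}{\partial \dot q}\bigl(q(t),\dot q(t)\bigr), \\[4pt]
&p(t_{k+1}) - p(t_k)
  = \int_{t_k}^{t_{k+1}} \Bigl(\frac{\partial L}{\partial q}(q,\dot q)
  + \tau_{\mathrm{nc}}\Bigr)\,dt .
\end{aligned}
```

A residual built from the last line can be evaluated from measured $q$ and $\dot q$ alone, with the integral approximated by a quadrature rule over the sampling interval.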

中文标题/摘要

标题：学习不依赖加速度测量的物理一致拉格朗日控制模型

本文研究了使用混合方法建模和控制涉及非保守力的拉格朗日系统，该方法无需进行加速度计算。特别关注推导和识别物理一致模型，这些模型对于基于模型的控制综合至关重要。拉格朗日或哈密顿神经网络提供了有用的结构保证，但学习此类模型通常会导致不一致的模型，尤其是在训练数据有限、部分和嘈杂的实际物理系统中。基于这一观察和利用这些模型进行基于模型的非线性控制的目标，提出了一种依赖于原始损失函数的学习算法，以提高拉格朗日系统的物理一致性。与提出的解决方案相比，不同学习建模方法的比较分析显示，在模拟和实验系统中，所学习模型的物理一致性有了显著提高。然后利用模型的一致性，在实验基准上展示了所提出方法在反馈线性化和能量控制技术中的实际相关性。

Summary / 总结

This study addresses the modeling and control of Lagrangian systems with non-conservative forces without requiring acceleration calculations. It proposes a hybrid method to derive and identify physically consistent Lagrangian models, which are essential for model-based control synthesis. A learning algorithm based on an original loss function improves the physical consistency of the learned models, especially on real systems where training data are limited, partial, and noisy. Compared with other learning-based modeling approaches, the method shows significant improvements in physical consistency on both simulated and experimental systems. The consistency of the learned model is then exploited on an experimental benchmark to demonstrate the practical relevance of the methodology for feedback linearization and energy-based control.

该研究解决了无需加速度测量即可建模具有非保守力的拉格朗日系统的挑战。它提出了一种混合方法，使用拉格朗日或哈密顿神经网络来推导物理上一致的模型，这些模型对于基于模型的控制至关重要。该方法引入了一个原始的损失函数来增强物理一致性，特别是在有限和嘈杂的训练数据场景中。实验结果表明，该方法在模拟和真实系统中显著提高了模型的一致性，并通过反馈线性化和能量控制技术进一步验证了其有效性。

MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation

Authors: Youxin Pang, Jiajun Liu, Lingfeng Tan, Yong Zhang, Feng Gao, Xiang Deng, Zhuoliang Kang, Xiaoming Wei, Yebin Liu

First: 2025-12-02T18:55:53+00:00 · Latest: 2025-12-02T18:55:53+00:00

Comments: Our project website is https://carlyx.github.io/MAViD/

Abstract

We propose MAViD, a novel Multimodal framework for Audio-Visual Dialogue understanding and generation. Existing approaches primarily focus on non-interactive systems and are limited to producing constrained and unnatural human speech. The primary challenge of this task lies in effectively integrating understanding and generation capabilities, as well as achieving seamless multimodal audio-video fusion. To solve these problems, we propose a Conductor-Creator architecture that divides the dialogue system into two primary components. The Conductor is tasked with understanding, reasoning, and generating instructions by breaking them down into motion and speech components, thereby enabling fine-grained control over interactions. The Creator then delivers interactive responses based on these instructions. Furthermore, to address the difficulty of generating long videos with consistent identity, timbre, and tone using dual DiT structures, the Creator adopts a structure that combines autoregressive (AR) and diffusion models. The AR model is responsible for audio generation, while the diffusion model ensures high-quality video generation. Additionally, we propose a novel fusion module to enhance connections between contextually consecutive clips and modalities, enabling synchronized long-duration audio-visual content generation. Extensive experiments demonstrate that our framework can generate vivid and contextually coherent long-duration dialogue interactions and accurately interpret users' multimodal queries.

中文标题/摘要

标题：MAViD：一种多模态音频-视觉对话理解和生成框架

我们提出了一种名为MAViD的新型多模态框架，用于音频-视觉对话理解和生成。现有方法主要集中在非交互系统上，只能生成受限且不自然的人类语音。该任务的主要挑战在于有效整合理解和生成能力，以及实现无缝的多模态音频-视频融合。为了解决这些问题，我们提出了一种指挥者-创造者架构，将对话系统分为两个主要组件。指挥者负责将指令分解为动作和语音组件，以实现对交互的精细控制。创造者则根据这些指令生成互动响应。此外，为了应对使用双DiT结构生成具有一致身份、音色和语调的长视频的难度，创造者采用了结合自回归（AR）和扩散模型的结构。AR模型负责音频生成，而扩散模型确保高质量的视频生成。此外，我们还提出了一种新的融合模块，以增强上下文连续片段和模态之间的连接，从而实现同步的长时音频-视觉内容生成。大量实验表明，我们的框架可以生成生动且上下文连贯的长时对话交互，并准确解释用户的多模态查询。

Summary / 总结

MAViD is a multimodal framework for audio-visual dialogue understanding and generation. It addresses the limitations of existing non-interactive systems with a Conductor-Creator architecture. The Conductor handles understanding and reasoning and generates instructions broken down into motion and speech components, while the Creator delivers interactive responses based on them. Instead of dual DiT structures, which make it difficult to keep identity, timbre, and tone consistent over long videos, the Creator combines an autoregressive model for audio with a diffusion model for video, and a fusion module strengthens the connections between consecutive clips and modalities. Experiments show that MAViD can produce vivid, contextually coherent long-duration dialogue interactions and accurately interpret multimodal queries.

MAViD 是一个新颖的多模态框架，用于音频-视觉对话理解和生成。它通过提出指挥者-创作者架构来解决整合理解和生成能力的挑战。指挥者将指令分解为动作和语音组件，而创作者使用双 DiT 结构和融合模块生成互动响应。实验表明，MAViD 能够生成生动且上下文连贯的长时间对话交互，并准确解释多模态查询。

SMP: Reusable Score-Matching Motion Priors for Physics-Based Character Control

Authors: Yuxuan Mu, Ziyu Zhang, Yi Shi, Minami Matsumoto, Kotaro Imamura, Guy Tevet, Chuan Guo, Michael Taylor, Chang Shu, Pengcheng Xi, Xue Bin Peng

First: 2025-12-02T18:54:12+00:00 · Latest: 2025-12-02T18:54:12+00:00

Comments: 14 pages, 9 figures

Abstract

Data-driven motion priors that can guide agents toward producing naturalistic behaviors play a pivotal role in creating life-like virtual characters. Adversarial imitation learning has been a highly effective method for learning motion priors from reference motion data. However, adversarial priors, with few exceptions, need to be retrained for each new controller, thereby limiting their reusability and necessitating the retention of the reference motion data when training on downstream tasks. In this work, we present Score-Matching Motion Priors (SMP), which leverages pre-trained motion diffusion models and score distillation sampling (SDS) to create reusable task-agnostic motion priors. SMPs can be pre-trained on a motion dataset, independent of any control policy or task. Once trained, SMPs can be kept frozen and reused as general-purpose reward functions to train policies to produce naturalistic behaviors for downstream tasks. We show that a general motion prior trained on large-scale datasets can be repurposed into a variety of style-specific priors. Furthermore, SMP can compose different styles to synthesize new styles not present in the original dataset. Our method produces high-quality motion comparable to state-of-the-art adversarial imitation learning methods through reusable and modular motion priors. We demonstrate the effectiveness of SMP across a diverse suite of control tasks with physically simulated humanoid characters. Video demo available at https://youtu.be/ravlZJteS20

中文标题/摘要

标题：SMP：可重用的评分匹配运动先验用于基于物理的角色控制

数据驱动的运动先验可以引导代理产生自然行为，在创建逼真虚拟角色中起着关键作用。对抗性模仿学习是一种从参考运动数据中学习运动先验的非常有效的方法。然而，除了少数例外，对抗性先验需要为每个新控制器重新训练，从而限制了它们的可重用性，并且在下游任务训练时需要保留参考运动数据。在本文中，我们提出了评分匹配运动先验（SMP），它利用预训练的运动扩散模型和评分蒸馏采样（SDS）来创建可重用的任务无关运动先验。SMP可以在与任何控制策略或任务无关的运动数据集上进行预训练。一旦训练完成，SMP可以保持冻结并作为通用奖励函数重用，以训练策略生成自然行为。我们展示了大规模数据集上训练的一般运动先验可以重新用于各种风格特定的先验。此外，SMP可以组合不同的风格以合成原始数据集中不存在的新风格。我们的方法通过可重用和模块化的运动先验生成高质量的运动，与最先进的对抗性模仿学习方法相当。我们展示了SMP在多种物理模拟的人形角色控制任务中的有效性。视频演示可在https://youtu.be/ravlZJteS20观看

Summary / 总结

This paper introduces Score-Matching Motion Priors (SMP), which combine pre-trained motion diffusion models with score distillation sampling (SDS) to build reusable, task-agnostic motion priors. SMPs are pre-trained independently of any control policy or task, then kept frozen and reused as general-purpose reward functions for training policies that produce naturalistic behaviors. A general prior can be repurposed into style-specific priors, and styles can be composed into new ones absent from the original dataset. The method produces motion quality comparable to state-of-the-art adversarial imitation learning methods across diverse control tasks with physically simulated humanoid characters.

本文介绍了 Score-Matching Motion Priors (SMP) 方法，用于创建可重用的运动先验，引导智能体产生自然行为，而无需为每个新控制器重新训练。SMP 利用预训练的运动扩散模型和得分蒸馏采样（SDS）生成任务无关的先验，这些先验可以用作训练策略的奖励函数。通用运动先验可以转化为特定风格的先验，还可以组合不同风格以合成原始数据集中不存在的新风格，生成的运动质量与最先进的对抗性模仿学习方法相当。实验表明，该方法在物理模拟人形角色的多种控制任务中均有效。
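
For readers unfamiliar with score distillation sampling, the block below restates the standard SDS gradient as it is usually written for image diffusion models (the DreamFusion formulation). It is background only: the abstract does not say how SMP converts this signal into a reward, so the symbols $x$, $\theta$, $w(t)$, and $\hat{\epsilon}_\phi$ follow the generic formulation and are assumptions here, not the paper's notation.

```latex
% Forward noising of a sample x produced by parameters \theta
x_t = \alpha_t\, x + \sigma_t\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)

% Standard SDS gradient with a frozen, pre-trained noise predictor \hat{\epsilon}_\phi
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right]
```

The residual $\hat{\epsilon}_\phi(x_t; t) - \epsilon$ is small when the frozen model finds $x$ plausible, which is the property that lets a frozen diffusion prior score samples without access to the original reference data.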

A process algebraic framework for multi-agent dynamic epistemic systems

Authors: Alessandro Aldini

First: 2024-07-24T08:35:50+00:00 · Latest: 2025-12-02T18:53:04+00:00

Abs · PDF · Code1 · Code2

Abstract

This paper combines the classical model of labeled transition systems with the epistemic model for reasoning about knowledge. The result is a unifying framework for modeling and analyzing multi-agent, knowledge-based, dynamic systems. On the modeling side, we propose a process algebraic, agent-oriented specification language that makes such a framework easy to use for practical purposes. On the verification side, we define a modal logic encompassing temporal and epistemic operators.

中文标题/摘要

标题：多智能体动态知识系统的过程代数框架

本文将经典的标记转换系统模型与关于知识推理的模型相结合。结果是一个统一的框架，用于建模和分析多智能体、基于知识的动态系统。在建模方面，我们提出了一种过程代数的、面向智能体的规范语言，使得这种框架易于实际使用。在验证方面，我们定义了一种模态逻辑，其中包括时间性和知识性操作符。

Summary / 总结

This paper develops a framework that integrates labeled transition systems with epistemic models to model and analyze multi-agent, knowledge-based, dynamic systems. For modeling, it introduces a process algebraic, agent-oriented specification language intended to make the framework practical to use. For verification, it defines a modal logic with temporal and epistemic operators.

本文提出了一种将标记转换系统与知识推理模型相结合的框架，用于建模和分析多智能体、基于知识的动态系统。在建模方面，该框架提供一种过程代数的、面向智能体的规范语言；在验证方面，定义了一种包含时态和知识操作符的模态逻辑。其主要贡献是为此类系统的建模与验证提供了统一的框架。
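
To make the combination of labeled transitions and epistemic reasoning concrete, here is a minimal sketch using textbook Kripke-style semantics: an agent knows a proposition in a state if it holds in every state the agent cannot distinguish from it. This is not the paper's specification language or logic; every state, action, agent, and proposition name below is made up for illustration.

```python
from typing import Dict, List, Set, Tuple

# Toy labeled transition system: (state, action) -> next state.
# All names are hypothetical and only serve the example.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("red_hidden", "announce"): "red_public",
    ("blue_hidden", "announce"): "blue_public",
}

# Atomic propositions that hold in each state.
VALUATION: Dict[str, Set[str]] = {
    "red_hidden": {"card_is_red"},
    "blue_hidden": set(),
    "red_public": {"card_is_red"},
    "blue_public": set(),
}

# For each agent, a partition of the states into cells it cannot tell apart.
INDISTINGUISHABLE: Dict[str, List[Set[str]]] = {
    "alice": [{"red_hidden"}, {"blue_hidden"}, {"red_public"}, {"blue_public"}],
    "bob": [{"red_hidden", "blue_hidden"}, {"red_public"}, {"blue_public"}],
}


def step(state: str, action: str) -> str:
    """Follow one labeled transition."""
    return TRANSITIONS[(state, action)]


def knows(agent: str, state: str, prop: str) -> bool:
    """Agent knows prop iff prop holds in every state it considers possible."""
    cell = next(c for c in INDISTINGUISHABLE[agent] if state in c)
    return all(prop in VALUATION[s] for s in cell)
```

In this toy model Alice sees the card and Bob does not, so Bob only learns the colour after the `announce` transition makes the two outcomes distinguishable to him.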

The Moral Consistency Pipeline: Continuous Ethical Evaluation for Large Language Models

Authors: Saeid Jamshidi, Kawser Wazed Nafi, Arghavan Moradi Dakhel, Negar Shahabi, Foutse Khomh

First: 2025-12-02T18:52:29+00:00 · Latest: 2025-12-02T18:52:29+00:00

Abs · PDF · Code1 · Code2

Abstract

The rapid advancement and adaptability of Large Language Models (LLMs) highlight the need for moral consistency, the capacity to maintain ethically coherent reasoning across varied contexts. Existing alignment frameworks, structured approaches designed to align model behavior with human ethical and social norms, often rely on static datasets and post-hoc evaluations, offering limited insight into how ethical reasoning may evolve across different contexts or temporal scales. This study presents the Moral Consistency Pipeline (MoCoP), a dataset-free, closed-loop framework for continuously evaluating and interpreting the moral stability of LLMs. MoCoP combines three supporting layers: (i) lexical integrity analysis, (ii) semantic risk estimation, and (iii) reasoning-based judgment modeling within a self-sustaining architecture that autonomously generates, evaluates, and refines ethical scenarios without external supervision. Our empirical results on GPT-4-Turbo and DeepSeek suggest that MoCoP effectively captures longitudinal ethical behavior, revealing a strong inverse relationship between ethical and toxicity dimensions (correlation rET = -0.81, p value less than 0.001) and a near-zero association with response latency (correlation rEL approximately equal to 0). These findings demonstrate that moral coherence and linguistic safety tend to emerge as stable and interpretable characteristics of model behavior rather than short-term fluctuations. Furthermore, by reframing ethical evaluation as a dynamic, model-agnostic form of moral introspection, MoCoP offers a reproducible foundation for scalable, continuous auditing and advances the study of computational morality in autonomous AI systems.

中文标题/摘要

标题：道德一致性管道：大型语言模型的持续伦理评估

大型语言模型（LLMs）的快速进步和适应性突显了道德一致性的重要性，即在不同情境下保持伦理推理连贯一致的能力。现有的对齐框架（即旨在使模型行为与人类伦理和社会规范保持一致的结构化方法）通常依赖于静态数据集和事后评估，因此难以揭示伦理推理在不同情境或时间尺度上如何演变。本研究提出了道德一致性管道（MoCoP），这是一种无需数据集、闭环的框架，用于持续评估和解释LLMs的道德稳定性。MoCoP 结合了三个支持层：（i）词汇完整性分析，（ii）语义风险估计，（iii）基于推理的判断建模，这些都在一个自维持架构中，该架构能够自主生成、评估和改进伦理场景，无需外部监督。我们在GPT-4-Turbo和DeepSeek上的实证结果表明，MoCoP 有效地捕捉了纵向伦理行为，揭示了伦理和毒性维度之间存在强烈负相关（相关系数rET = -0.81，p值小于0.001），与响应延迟几乎无关（相关系数rEL约等于0）。这些发现表明，道德连贯性和语言安全性更可能作为模型行为的稳定和可解释特征出现，而不是短期波动。此外，通过将伦理评估重新构想为一种动态的、模型无关的道德反省形式，MoCoP 为可扩展的、持续的审计提供了一个可重复的基础，并推动了自主人工智能系统中计算伦理学的研究。

Summary / 总结

This study introduces the Moral Consistency Pipeline (MoCoP), a framework for continuously evaluating the moral stability of Large Language Models (LLMs) without relying on static datasets. MoCoP uses lexical integrity analysis, semantic risk estimation, and reasoning-based judgment modeling to autonomously generate, evaluate, and refine ethical scenarios. Empirical results on GPT-4-Turbo and DeepSeek show that MoCoP effectively captures longitudinal ethical behavior, with a strong inverse relationship between ethical and toxicity dimensions and a near-zero association with response latency. This suggests that moral coherence and linguistic safety are stable characteristics of model behavior. By framing ethical evaluation as a dynamic process, MoCoP provides a scalable foundation for continuous auditing of LLMs.

研究引入了道德一致性管道（MoCoP），这是一种无需依赖静态数据集来连续评估大型语言模型（LLMs）道德稳定性的框架。MoCoP 结合了词汇完整性分析、语义风险估计和基于推理的判断建模，以自主生成、评估和改进道德场景。在 GPT-4-Turbo 和 DeepSeek 上的实验证明，MoCoP 有效地捕捉了纵向道德行为，伦理和毒性维度之间存在强烈的负相关关系，而与响应延迟几乎无关。这表明道德一致性和语言安全性是模型行为的稳定且可解释的特征。通过将道德评估重新定义为动态过程，MoCoP 为 LLM 的连续审计提供了可扩展的基础。
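
The MoCoP abstract reports its main findings as correlation coefficients (rET and rEL). As a reminder of what such a number measures, here is a minimal sketch of the Pearson correlation, assuming that is the statistic used (the abstract does not name it). The function name and the toy scores in the usage note are invented; they are not data from the paper.

```python
from math import sqrt
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)
```

Calling `pearson_r([0.9, 0.8, 0.6, 0.4, 0.2], [0.1, 0.2, 0.4, 0.6, 0.8])` on made-up ethics and toxicity scores that are exact mirror images gives a value at the negative extreme of the scale, which is the direction of the inverse relationship the abstract describes.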

LORE: A Large Generative Model for Search Relevance

Authors: Chenji Lu, Zhuo Chen, Hui Zhao, Zhiyuan Zeng, Gang Zhao, Junjie Ren, Ruicong Xu, Haoran Li, Songyan Liu, Pengjie Wang, Jian Xu, Bo Zheng

First: 2025-12-02T18:50:42+00:00 · Latest: 2025-12-02T18:50:42+00:00

Abs · PDF · Code1 · Code2

Abstract

Achievement. We introduce LORE, a systematic framework for Large Generative Model-based relevance in e-commerce search. Deployed and iterated over three years, LORE achieves a cumulative +27% improvement in online GoodRate metrics. This report shares the valuable experience gained throughout its development lifecycle, spanning data, features, training, evaluation, and deployment. Insight. While existing works apply Chain-of-Thought (CoT) to enhance relevance, they often hit a performance ceiling. We argue this stems from treating relevance as a monolithic task, lacking principled deconstruction. Our key insight is that relevance comprises distinct capabilities: knowledge and reasoning, multi-modal matching, and rule adherence. We contend that a qualitative-driven decomposition is essential for breaking through current performance bottlenecks. Contributions. LORE provides a complete blueprint for the LLM relevance lifecycle. Key contributions include: (1) A two-stage training paradigm combining progressive CoT synthesis via SFT with human preference alignment via RL. (2) A comprehensive benchmark, RAIR, designed to evaluate these core capabilities. (3) A query frequency-stratified deployment strategy that efficiently transfers offline LLM capabilities to the online system. LORE serves as both a practical solution and a methodological reference for other vertical domains.

中文标题/摘要

标题：LORE：一种大规模生成模型用于搜索相关性

成果。我们介绍了LORE，一种基于大规模生成模型的电子商务搜索相关性系统框架。经过三年的部署和迭代，LORE在线GoodRate指标累计提高了27%。本报告分享了其开发生命周期中的宝贵经验，涵盖数据、特征、训练、评估和部署。见解。虽然现有工作利用链式思维（CoT）来增强相关性，但往往遇到性能瓶颈。我们认为这源于将相关性视为单一任务，缺乏原理性的分解。我们的关键见解是，相关性包含不同的能力：知识和推理、多模态匹配以及规则遵守。我们认为，定性驱动的分解对于突破当前性能瓶颈至关重要。贡献。LORE提供了LLM相关性生命周期的完整蓝图。关键贡献包括：(1) 一种两阶段训练范式，结合渐进式CoT合成通过SFT与通过RL的人类偏好对齐。(2) 一个全面的基准RAIR，用于评估这些核心能力。(3) 一种基于查询频率的部署策略，高效地将离线LLM能力转移到在线系统。LORE既是实际解决方案，也是其他垂直领域的参考方法论。

Summary / 总结

LORE is a framework for Large Generative Model-based relevance in e-commerce search; deployed and iterated over three years, it achieves a cumulative +27% improvement in the online GoodRate metric. It decomposes relevance into knowledge and reasoning, multi-modal matching, and rule adherence, arguing that this decomposition is needed to break through current performance bottlenecks. Its contributions are a two-stage training paradigm (progressive CoT synthesis via SFT, then human preference alignment via RL), the RAIR benchmark for evaluating these capabilities, and a query frequency-stratified deployment strategy that transfers offline LLM capabilities to the online system.

LORE 是一个基于大规模生成模型的电子商务搜索相关性框架，经过三年的部署和迭代，在线 GoodRate 指标累计提升 27%。它将相关性分解为知识与推理、多模态匹配和规则遵守三种能力。主要贡献包括：结合渐进式链式思考（CoT）合成（SFT）与人类偏好对齐（RL）的两阶段训练范式、用于评估上述核心能力的综合基准 RAIR，以及按查询频率分层的部署策略。

TokenPowerBench: Benchmarking the Power Consumption of LLM Inference

Authors: Chenxu Niu, Wei Zhang, Jie Li, Yongjian Zhao, Tongyang Wang, Xi Wang, Yong Chen

Venue: AAAI

First: 2025-12-02T18:50:17+00:00 · Latest: 2025-12-02T18:50:17+00:00

Comments: Accepted by the AAAI'26 Conference Main Track

Abs · PDF · Code1 · Code2

Abstract

Large language model (LLM) services now answer billions of queries per day, and industry reports show that inference, not training, accounts for more than 90% of total power consumption. However, existing benchmarks focus on either training/fine-tuning or performance of inference and provide little support for power consumption measurement and analysis of inference. We introduce TokenPowerBench, the first lightweight and extensible benchmark designed for LLM-inference power consumption studies. The benchmark combines (i) a declarative configuration interface covering model choice, prompt set, and inference engine, (ii) a measurement layer that captures GPU-, node-, and system-level power without specialized power meters, and (iii) a phase-aligned metrics pipeline that attributes energy to the prefill and decode stages of every request. These elements make it straight-forward to explore the power consumed by an LLM inference run; furthermore, by varying batch size, context length, parallelism strategy and quantization, users can quickly assess how each setting affects joules per token and other energy-efficiency metrics. We evaluate TokenPowerBench on four of the most widely used model series (Llama, Falcon, Qwen, and Mistral). Our experiments cover from 1 billion parameters up to the frontier-scale Llama3-405B model. Furthermore, we release TokenPowerBench as open source to help users to measure power consumption, forecast operating expenses, and meet sustainability targets when deploying LLM services.

中文标题/摘要

标题：TokenPowerBench：评估LLM推理的功耗

大型语言模型（LLM）服务现在每天回答数十亿次查询，行业报告显示，推理而非训练占总功耗的90%以上。然而，现有的基准测试要么关注训练/微调，要么关注推理性能，对推理的功耗测量和分析支持甚少。我们介绍了TokenPowerBench，这是第一个专为LLM推理功耗研究设计的轻量级且可扩展的基准测试。该基准测试结合了(i) 一个声明性的配置接口，涵盖模型选择、提示集和推理引擎，(ii) 一个能够捕获GPU、节点和系统级功耗的测量层，无需专用的功率计，以及(iii) 一个与推理阶段对齐的度量流水线，将能量分配到每个请求的预填充和解码阶段。这些元素使得探索LLM推理运行的功耗变得简单；此外，通过改变批量大小、上下文长度、并行策略和量化，用户可以快速评估每种设置对每令牌焦耳和其它能效指标的影响。我们在四个最广泛使用的模型系列（Llama、Falcon、Qwen和Mistral）上评估了TokenPowerBench。我们的实验涵盖了从10亿参数到前沿规模的Llama3-405B模型。此外，我们还开源了TokenPowerBench，以帮助用户测量功耗、预测运营成本并满足部署LLM服务时的可持续发展目标。

Summary / 总结

TokenPowerBench is a lightweight, extensible benchmark for measuring the power consumption of LLM inference. It combines a declarative configuration interface, a measurement layer that captures GPU-, node-, and system-level power without specialized power meters, and a phase-aligned metrics pipeline that attributes energy to the prefill and decode stages of each request. The evaluation covers four widely used model series (Llama, Falcon, Qwen, and Mistral), from 1 billion parameters up to Llama3-405B, and users can vary batch size, context length, parallelism strategy, and quantization to see how each affects joules per token. The benchmark is released as open source to help users measure power consumption, forecast operating expenses, and meet sustainability targets.

TokenPowerBench 是一个用于测量 LLM 推断功耗的基准，解决了现有基准在功耗测量方面的不足。它包含一个声明式的配置接口、一个多层级的功耗捕获测量层以及一个用于将能量分配到特定推断阶段的指标管道。该基准评估了四个广泛使用的模型系列和前沿规模的 Llama3-405B 模型，展示了不同设置对能源效率的影响。它作为开源发布，以帮助用户测量和预测 LLM 服务的功耗。
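
The phase-aligned accounting described in the TokenPowerBench abstract can be illustrated with a small sketch: integrate sampled power over the prefill and decode windows and divide by the number of generated tokens. This is not TokenPowerBench's code or API. The function names, the `(timestamp, watts)` sample format, and the trapezoidal integration are assumptions chosen for the example.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, power in watts)


def power_at(samples: List[Sample], t: float) -> float:
    """Linearly interpolate power at time t; clamp outside the sampled range.

    Assumes samples are sorted by strictly increasing timestamp.
    """
    times = [ts for ts, _ in samples]
    if t <= times[0]:
        return samples[0][1]
    if t >= times[-1]:
        return samples[-1][1]
    i = bisect_right(times, t)
    (t_a, p_a), (t_b, p_b) = samples[i - 1], samples[i]
    return p_a + (p_b - p_a) * (t - t_a) / (t_b - t_a)


def energy_joules(samples: List[Sample], start: float, end: float) -> float:
    """Trapezoidal integral of power over [start, end], in joules."""
    points = [(start, power_at(samples, start))]
    points += [(ts, p) for ts, p in samples if start < ts < end]
    points.append((end, power_at(samples, end)))
    return sum((p0 + p1) / 2.0 * (t1 - t0)
               for (t0, p0), (t1, p1) in zip(points, points[1:]))


def phase_energy(samples: List[Sample], prefill_start: float,
                 decode_start: float, decode_end: float,
                 n_output_tokens: int) -> Dict[str, float]:
    """Split one request's energy into prefill and decode phases."""
    prefill_j = energy_joules(samples, prefill_start, decode_start)
    decode_j = energy_joules(samples, decode_start, decode_end)
    return {
        "prefill_j": prefill_j,
        "decode_j": decode_j,
        "j_per_output_token": (prefill_j + decode_j) / n_output_tokens,
    }
```

A real measurement layer would also have to subtract idle power and align device clocks with request timestamps; both are left out to keep the sketch short.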

Morphling: Fast, Fused, and Flexible GNN Training at Scale

Authors: Anubhab, Rupesh Nasre

First: 2025-12-01T13:45:03+00:00 · Latest: 2025-12-02T18:50:09+00:00

Abs · PDF · Code1 · Code2

Abstract

Graph Neural Networks (GNNs) present a fundamental hardware challenge by fusing irregular, memory-bound graph traversals with regular, compute-intensive dense matrix operations. While frameworks such as PyTorch Geometric (PyG) and Deep Graph Library (DGL) prioritize high-level usability, they fail to address these divergent execution characteristics. As a result, they rely on generic kernels that suffer from poor cache locality, excessive memory movement, and substantial intermediate allocations. To address these limitations, we present Morphling, a domain-specific code synthesizer designed to bridge this gap. Morphling compiles high-level GNN specifications into portable, backend-specialized implementations targeting OpenMP, CUDA, and MPI. It achieves this by instantiating a library of optimized, architecture-aware primitives tailored to each execution environment. Morphling also incorporates a runtime sparsity-aware execution engine that dynamically selects dense or sparse execution paths using input feature statistics, reducing unnecessary computation on zero-valued entries. We evaluate Morphling on eleven real-world datasets spanning diverse graph structures, feature dimensionalities, and sparsity regimes. The results show that Morphling improves per-epoch training throughput by an average of 20X on CPUs and 19X on GPUs over PyG and DGL, with peak speedups reaching 66X. Morphling's memory-efficient layouts further reduce peak memory consumption by up to 15X, enabling large-scale GNN training on commodity hardware. These findings demonstrate that specialized, architecture-aware code synthesis provides an effective and scalable path toward high-performance GNN execution across diverse parallel and distributed platforms.

中文标题/摘要

标题：Morphling：大规模下快速、融合且灵活的图神经网络训练

图神经网络（GNNs）通过将不规则的、内存绑定的图遍历与规则的、计算密集型的密集矩阵操作融合，提出了基本的硬件挑战。尽管像PyTorch Geometric（PyG）和Deep Graph Library（DGL）这样的框架优先考虑高级易用性，但它们未能解决这些不同的执行特性。因此，它们依赖于通用内核，这些内核遭受着较差的缓存局部性、过度的内存移动和大量的中间分配。为了解决这些限制，我们提出了Morphling，这是一种领域特定的代码合成器，旨在弥合这一差距。Morphling将高级GNN规范编译为针对OpenMP、CUDA和MPI的后端特定实现。它通过为每个执行环境实例化一个优化的、架构感知的原语库来实现这一点。Morphling还包含一个运行时稀疏感知执行引擎，该引擎根据输入特征统计信息动态选择密集或稀疏执行路径，从而减少不必要的零值项计算。我们在涵盖不同图结构、特征维度和稀疏性的11个真实数据集上评估了Morphling。结果显示，与PyG和DGL相比，Morphling在CPU上的每轮训练吞吐量平均提高了20倍，在GPU上的提高了19倍，峰值加速比达到66倍。Morphling的内存高效布局进一步将峰值内存消耗减少了高达15倍，使大规模GNN训练能够在普通硬件上实现。这些发现表明，专门的、架构感知的代码合成为跨各种并行和分布式平台实现高性能GNN执行提供了一条有效且可扩展的途径。

Summary / 总结

Morphling is designed to address the hardware challenges posed by Graph Neural Networks (GNNs) by compiling high-level GNN specifications into backend-specialized implementations for OpenMP, CUDA, and MPI. It uses optimized, architecture-aware primitives and a runtime sparsity-aware execution engine to reduce unnecessary computation. Evaluations on eleven real-world datasets show that Morphling improves per-epoch training throughput by an average of 20X on CPUs and 19X on GPUs over PyG and DGL, with peak speedups reaching 66X. Memory-efficient layouts also reduce peak memory consumption by up to 15X, enabling large-scale GNN training on commodity hardware.

Morphling 是一种领域特定的代码合成器，旨在应对图神经网络 (GNN) 带来的硬件挑战，它将高级 GNN 规范编译成针对 OpenMP、CUDA 和 MPI 的后端特定实现。它使用优化的、架构感知的原语和运行时稀疏性感知执行引擎来减少不必要的计算。在 11 个真实世界数据集上的评估显示，与 PyG 和 DGL 相比，Morphling 在 CPU 上将每轮训练吞吐量平均提高了 20 倍，在 GPU 上平均提高了 19 倍，最高加速比达到 66 倍。其内存高效布局还将峰值内存消耗最多减少 15 倍，使大规模 GNN 训练能够在普通硬件上实现。
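
The runtime sparsity-aware dispatch mentioned in the Morphling abstract can be pictured with a toy version: measure the fraction of zero entries in the input features, then pick a dense or a zero-skipping code path. This pure-Python sketch only illustrates the idea. The 0.7 threshold and all function names are assumptions, and Morphling itself generates specialized OpenMP, CUDA, and MPI code rather than anything resembling this.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def zero_fraction(features: Matrix) -> float:
    """Share of entries in the feature matrix that are exactly zero."""
    total = sum(len(row) for row in features)
    zeros = sum(1 for row in features for v in row if v == 0)
    return zeros / total


def dense_matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain row-by-column product that touches every entry of x."""
    n_out = len(w[0])
    return [[sum(v * w[k][j] for k, v in enumerate(row)) for j in range(n_out)]
            for row in x]


def sparse_matmul(x: Matrix, w: Matrix) -> Matrix:
    """Same product, but rows of w are only read for non-zero entries of x."""
    n_out = len(w[0])
    out = []
    for row in x:
        acc = [0.0] * n_out
        for k, v in enumerate(row):
            if v == 0:
                continue  # skip work on zero-valued features
            for j, w_kj in enumerate(w[k]):
                acc[j] += v * w_kj
        out.append(acc)
    return out


def transform(x: Matrix, w: Matrix, threshold: float = 0.7) -> Tuple[str, Matrix]:
    """Choose the execution path from input statistics, then run it."""
    path = "sparse" if zero_fraction(x) >= threshold else "dense"
    kernel = sparse_matmul if path == "sparse" else dense_matmul
    return path, kernel(x, w)
```

Both paths compute the same product, so the choice affects only how much work is done, which is the property a sparsity-aware engine relies on.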

TabTune: A Unified Library for Inference and Fine-Tuning Tabular Foundation Models

Authors: Aditya Tanna, Pratinav Seth, Mohamed Bouadi, Utsav Avaiya, Vinay Kumar Sankarapu

First: 2025-11-04T18:25:17+00:00 · Latest: 2025-12-02T18:48:39+00:00

Comments: The library is open source and available at https://github.com/Lexsi-Labs/TabTune

Abs · PDF · Code1 · Code2 · Code3

Abstract

Tabular foundation models represent a growing paradigm in structured data learning, extending the benefits of large-scale pretraining to tabular domains. However, their adoption remains limited due to heterogeneous preprocessing pipelines, fragmented APIs, inconsistent fine-tuning procedures, and the absence of standardized evaluation for deployment-oriented metrics such as calibration and fairness. We present TabTune, a unified library that standardizes the complete workflow for tabular foundation models through a single interface. TabTune provides consistent access to seven state-of-the-art models supporting multiple adaptation strategies, including zero-shot inference, meta-learning, supervised fine-tuning (SFT), and parameter-efficient fine-tuning (PEFT). The framework automates model-aware preprocessing, manages architectural heterogeneity internally, and integrates evaluation modules for performance, calibration, and fairness. Designed for extensibility and reproducibility, TabTune enables consistent benchmarking of adaptation strategies of tabular foundation models.

中文标题/摘要

标题：TabTune：统一的表格基础模型推理与微调库

表格基础模型代表了结构化数据学习中不断增长的范式，将大规模预训练的优势扩展到表格领域。然而，由于异构预处理管道、碎片化的API、不一致的微调程序以及缺乏针对部署导向度量（如校准和公平性）的标准评估，其采用仍然受到限制。我们提出了TabTune，这是一个统一的库，通过单一接口标准化了表格基础模型的完整工作流程。TabTune 提供了对七种最先进的模型的一致访问，支持多种适应策略，包括零样本推理、元学习、监督微调（SFT）和参数高效微调（PEFT）。该框架自动化了模型感知的预处理，内部管理了架构异质性，并集成了性能、校准和公平性评估模块。TabTune 设计用于扩展性和可重复性，使表格基础模型的适应策略的一致基准测试成为可能。

Summary / 总结

TabTune is a unified library that standardizes the workflow for tabular foundation models by providing a consistent interface for accessing seven state-of-the-art models and supporting various adaptation strategies. It automates preprocessing, manages architectural differences, and integrates evaluation modules for performance, calibration, and fairness. TabTune aims to address the limitations of heterogeneous preprocessing pipelines and inconsistent fine-tuning procedures, making it easier to deploy these models in real-world applications.

TabTune 是一个统一的库，通过提供一致的接口访问七种最先进的模型并支持多种适应策略来标准化表格基础模型的工作流程。它自动化预处理，管理架构差异，并集成性能、校准和公平性的评估模块。TabTune 的目标是解决异构预处理管道和不一致的微调程序的局限性，使其更容易在实际应用中部署这些模型。
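
The TabTune abstract lists calibration among its deployment-oriented metrics. As an illustration of what such a metric computes, here is a minimal sketch of expected calibration error (ECE) with equal-width confidence bins. It is a generic textbook definition, not TabTune's API; the function name and signature are hypothetical, and the library may report a different calibration measure.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """ECE: weighted average gap between confidence and accuracy per bin."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("need two non-empty sequences of equal length")
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin rather than overflowing.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece
```

A model that reports 90% confidence on four predictions and gets three of them right has a gap of about 0.15 in that bin, which is what the function returns for that input.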

Training a Scientific Reasoning Model for Chemistry

Authors: Siddharth M. Narayanan, James D. Braza, Ryan-Rhys Griffiths, Albert Bou, Geemi Wellawatte, Mayk Caldas Ramos, Ludovico Mitchener, Samuel G. Rodriques, Andrew D. White

First: 2025-06-04T17:57:18+00:00 · Latest: 2025-12-02T18:47:24+00:00

Abs · PDF · Code1 · Code2

Abstract

Reasoning models are large language models that emit a long chain-of-thought before answering, providing both higher accuracy and explicit reasoning for their response. A major question has been whether language model reasoning generalizes beyond mathematics, programming, and logic, where most previous work has focused. We demonstrate that reasoning models can be post-trained for chemistry without additional domain pretraining, and require substantially less data compared to contemporary domain-specific models. We report ether0, a 24B parameter LLM (based on Mistral-Small-24B) that can reason in natural language and respond with chemical structures. This reasoning model was trained with reinforcement learning on 640,730 experimentally-grounded chemistry problems across 375 tasks ranging from synthesizability, to blood-brain barrier permeability, to human receptor activity, to scent. Our model exceeds general-purpose chemistry models, frontier models, and human experts on molecular design tasks. It is also more data efficient relative to specialized models. We anticipate that this method can be applied to train data-efficient language models specialized for tasks across a wide variety of scientific domains.

中文标题/摘要

标题：训练化学科学推理模型

推理模型是大型语言模型，在回答问题前会发出一个长的推理链，提供更高的准确性和明确的推理。一个主要问题是语言模型推理是否能超越数学、编程和逻辑，而此前大部分工作都集中在这些领域。我们证明了推理模型可以在不进行额外领域预训练的情况下，针对化学进行后训练，并且所需数据量远少于当代的领域特定模型。我们报告了基于Mistral-Small-24B的240亿参数LLM（ether0），它可以以自然语言进行推理并以化学结构形式作出回应。该推理模型通过强化学习在640,730个实验验证的化学问题上进行了训练，这些问题涵盖了375个任务，从合成可行性到血脑屏障渗透性，再到人类受体活性，再到气味。我们的模型在分子设计任务上超过了通用化学模型、前沿模型和人类专家。它在数据效率上也优于专门化模型。我们预计这种方法可以应用于训练针对广泛科学领域任务的数据高效语言模型。

Summary / 总结

The research asks whether language-model reasoning generalizes beyond mathematics, programming, and logic to chemistry. The authors post-train ether0, a 24B-parameter LLM based on Mistral-Small-24B, with reinforcement learning on 640,730 experimentally grounded chemistry problems across 375 tasks, without additional domain pretraining. The model reasons in natural language and responds with chemical structures. It exceeds general-purpose chemistry models, frontier models, and human experts on molecular design tasks, while being more data-efficient than specialized models.

研究旨在开发一个用于化学的推理模型，能够提供详细的解释和准确的答案。该模型ether0通过强化学习训练了640,730个多样化的化学问题，并在分子设计任务上超过了通用模型和人类专家。此外，它所需的数据量比专门模型少，展示了其高效性。

Distribution-Calibrated Inference time compute for Thinking LLM-as-a-Judge

Authors: Hamid Dadkhahi, Firas Trabelsi, Parker Riley, Juraj Juraska, Mehdi Mirzazadeh

First: 2025-12-02T18:46:47+00:00 · Latest: 2025-12-02T18:46:47+00:00

Abs · PDF · Code1 · Code2

Abstract

Thinking Large Language Models (LLMs) used as judges for pairwise preferences remain noisy at the single-sample level, and common aggregation rules (majority vote, soft self-consistency, or instruction-based self-aggregation) are inconsistent when ties are allowed. We study inference-time compute (ITC) for evaluators that generate n independent thinking-rating samples per item, and propose a principled, distribution-calibrated aggregation scheme. Our method models three-way preferences with a Bradley-Terry-Davidson formulation on rating counts, leveraging both polarity (margin among non-ties) and decisiveness (non-tie rate) to distinguish narrow margins from strong consensus. Across various evaluation benchmarks, our approach consistently reduces MAE and increases pairwise accuracy versus standard baselines, and when evaluated against human-consensus meta-labels, matches or exceeds individual human raters. These results show that carefully allocating ITC and aggregating with distribution-aware methods turns noisy individual model judgments into reliable ratings for evaluation.

中文标题/摘要

标题：面向思考型 LLM-as-a-Judge 的分布校准推理时计算

作为成对偏好法官的大型语言模型（LLMs）在单样本级别仍然存在噪声，而常见的聚合规则（多数投票、软自一致性或基于指令的自我聚合）在允许平局时是不一致的。我们研究了评估器的推理时间计算（ITC），这些评估器每项生成n个独立的思考评分样本，并提出了一种原理上合理的、分布校准的聚合方案。我们的方法使用Bradley-Terry-Davidson公式对评分计数进行三分类偏好建模，利用极性（非平局之间的差距）和决断性（非平局率）来区分狭窄差距和强烈共识。在各种评估基准上，我们的方法相对于标准基线在降低MAE和提高成对准确性方面表现一致，并且在与人类共识元标签进行评估时，能够匹配或超过个别的人类评分者。这些结果表明，仔细分配ITC并使用分布感知方法进行聚合可以将嘈杂的单个模型判断转化为可靠的评估评分。
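
The abstract above models three-way preferences (A wins, B wins, tie) with a Bradley-Terry-Davidson formulation on rating counts. The sketch below shows the classical Davidson tie extension for a single pair, fitted in closed form from smoothed counts: `lam` plays the role of polarity (log-odds among non-ties) and `nu` controls how often ties occur (lower means more decisive). This is a simplified illustration, not the paper's aggregation scheme; the function names and the additive smoothing constant `alpha` are assumptions.

```python
from math import exp, log, sqrt
from typing import Tuple


def davidson_probs(lam: float, nu: float) -> Tuple[float, float, float]:
    """Return (P(A wins), P(B wins), P(tie)) under the Davidson model.

    lam is the log strength ratio log(pi_A / pi_B); nu >= 0 is the tie parameter.
    """
    a = exp(lam / 2.0)
    b = exp(-lam / 2.0)
    z = a + b + nu
    return a / z, b / z, nu / z


def fit_davidson(n_a: int, n_b: int, n_tie: int,
                 alpha: float = 0.5) -> Tuple[float, float]:
    """Closed-form fit for one pair from counts, with additive smoothing.

    With two parameters and three outcome categories the model is saturated,
    so the fit reproduces the smoothed outcome proportions exactly.
    """
    a, b, t = n_a + alpha, n_b + alpha, n_tie + alpha
    if min(a, b) <= 0 or t < 0:
        raise ValueError("smoothed win counts must be positive")
    lam = log(a / b)       # polarity: margin among non-tie votes
    nu = t / sqrt(a * b)   # tie propensity: lower means more decisive
    return lam, nu
```

For example, counts of (6, 2, 2) and (2, 0, 6) both favour A, but the second yields a much larger `nu`, which is the kind of distinction between a narrow margin and strong consensus that majority voting discards.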

Instant Video Models: Universal Adapters for Stabilizing Image-Based Networks

Authors: Matthew Dutson, Nathan Labiosa, Yin Li, Mohit Gupta

Venue: NeurIPS 2025

First: 2025-12-02T18:41:10+00:00 · Latest: 2025-12-02T18:41:10+00:00

Comments: NeurIPS 2025

Abs · PDF · Code1 · Code2

Abstract

When applied sequentially to video, frame-based networks often exhibit temporal inconsistency - for example, outputs that flicker between frames. This problem is amplified when the network inputs contain time-varying corruptions. In this work, we introduce a general approach for adapting frame-based models for stable and robust inference on video. We describe a class of stability adapters that can be inserted into virtually any architecture and a resource-efficient training process that can be performed with a frozen base network. We introduce a unified conceptual framework for describing temporal stability and corruption robustness, centered on a proposed accuracy-stability-robustness loss. By analyzing the theoretical properties of this loss, we identify the conditions where it produces well-behaved stabilizer training. Our experiments validate our approach on several vision tasks including denoising (NAFNet), image enhancement (HDRNet), monocular depth (Depth Anything v2), and semantic segmentation (DeepLabv3+). Our method improves temporal stability and robustness against a range of image corruptions (including compression artifacts, noise, and adverse weather), while preserving or improving the quality of predictions.

中文标题/摘要

标题：即时视频模型：用于稳定图像网络的通用适配器

当基于帧的网络依次应用于视频时，往往会表现出时间不一致性，例如输出在帧之间闪烁。当网络输入包含随时间变化的干扰时，这个问题会被放大。在本文中，我们介绍了一种通用方法，使基于帧的模型能够在视频上进行稳定且鲁棒的推理。我们描述了一类可以插入几乎任何架构中的稳定性适配器，以及一种资源高效的训练过程，该过程可以在冻结基础网络的情况下执行。我们引入了一个统一的概念框架，用于描述时间稳定性和抗干扰鲁棒性，其核心是我们提出的准确度-稳定性-鲁棒性损失。通过分析这种损失的理论性质，我们确定了它能使稳定器训练表现良好的条件。我们的实验在包括去噪（NAFNet）、图像增强（HDRNet）、单目深度（Depth Anything v2）和语义分割（DeepLabv3+）等几个视觉任务上验证了我们的方法。我们的方法在一系列图像干扰（包括压缩伪影、噪声和恶劣天气）下提高了时间稳定性和鲁棒性，同时保持或提高了预测的质量。

Summary / 总结

This work addresses the issue of temporal inconsistency in frame-based networks when applied to video, especially in the presence of time-varying corruptions. The authors propose a general approach using stability adapters that can be inserted into various architectures, along with a resource-efficient training process. They introduce an accuracy-stability-robustness loss to improve temporal stability and robustness against various corruptions while maintaining or enhancing prediction quality. Experiments on tasks such as denoising, image enhancement, monocular depth estimation, and semantic segmentation demonstrate the effectiveness of their method.

该研究解决了基于帧的网络应用于视频时出现的时间不一致性问题，特别是在输入存在随时间变化的干扰时。作者引入了可以插入各种架构中的稳定性适配器，并提出了一种资源高效的训练过程。他们提出了准确度-稳定性-鲁棒性损失函数，以提高时间稳定性和抗干扰能力，同时保持或提升预测质量。实验结果显示，在去噪、图像增强、单目深度估计和语义分割等任务上，稳定性与鲁棒性都有所提升。
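
The abstract does not give the form of the accuracy-stability-robustness loss, so the following is only a generic two-term illustration of why a stability term matters. It adds a penalty on frame-to-frame changes in the prediction that are not present in the target. The function names, the `lam` weight, and the choice of mean squared error are all assumptions, and robustness enters only implicitly, by computing predictions from corrupted inputs.

```python
from typing import List

Frame = List[float]


def mse(a: Frame, b: Frame) -> float:
    """Mean squared error between two equally sized frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def frame_delta(curr: Frame, prev: Frame) -> Frame:
    """Element-wise change from the previous frame to the current one."""
    return [c - p for c, p in zip(curr, prev)]


def toy_stability_loss(preds: List[Frame], targets: List[Frame],
                       lam: float = 1.0) -> float:
    """Per-frame accuracy term plus a penalty on spurious temporal change."""
    accuracy = sum(mse(p, t) for p, t in zip(preds, targets)) / len(preds)
    if len(preds) < 2:
        return accuracy
    stability = sum(
        mse(frame_delta(preds[i], preds[i - 1]),
            frame_delta(targets[i], targets[i - 1]))
        for i in range(1, len(preds))
    ) / (len(preds) - 1)
    return accuracy + lam * stability
```

Two predictors with the same per-frame error receive different losses here if one of them flickers around a static target, which is the temporal inconsistency the paper sets out to reduce.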

In-Context Sync-LoRA for Portrait Video Editing

Authors: Sagi Polaczek, Or Patashnik, Ali Mahdavi-Amiri, Daniel Cohen-Or

First: 2025-12-02T18:40:35+00:00 · Latest: 2025-12-02T18:40:35+00:00

Comments: Project page: https://sagipolaczek.github.io/Sync-LoRA/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Editing portrait videos is a challenging task that requires flexible yet precise control over a wide range of modifications, such as appearance changes, expression edits, or the addition of objects. The key difficulty lies in preserving the subject's original temporal behavior, demanding that every edited frame remains precisely synchronized with the corresponding source frame. We present Sync-LoRA, a method for editing portrait videos that achieves high-quality visual modifications while maintaining frame-accurate synchronization and identity consistency. Our approach uses an image-to-video diffusion model, where the edit is defined by modifying the first frame and then propagated to the entire sequence. To enable accurate synchronization, we train an in-context LoRA using paired videos that depict identical motion trajectories but differ in appearance. These pairs are automatically generated and curated through a synchronization-based filtering process that selects only the most temporally aligned examples for training. This training setup teaches the model to combine motion cues from the source video with the visual changes introduced in the edited first frame. Trained on a compact, highly curated set of synchronized human portraits, Sync-LoRA generalizes to unseen identities and diverse edits (e.g., modifying appearance, adding objects, or changing backgrounds), robustly handling variations in pose and expression. Our results demonstrate high visual fidelity and strong temporal coherence, achieving a robust balance between edit fidelity and precise motion preservation.

中文标题/摘要

标题：用于肖像视频编辑的上下文 Sync-LoRA

肖像视频编辑是一项具有挑战性的任务，需要对多种修改（如外观变化、表情编辑或添加物体）进行灵活而精确的控制。关键难点在于保持主体原始的时间行为，要求每个编辑帧都与相应的源帧精确同步。我们提出了Sync-LoRA，一种能够在保持帧准确同步和身份一致性的同时实现高质量视觉修改的肖像视频编辑方法。我们的方法使用图像到视频的扩散模型，其中编辑由修改第一帧定义，并传播到整个序列。为了实现准确的同步，我们使用配对视频训练上下文LoRA，这些配对视频具有相同的运动轨迹但外观不同。这些配对通过基于同步的过滤过程自动生成和筛选，仅选择最时间对齐的示例进行训练。这种训练设置教会模型结合源视频中的运动线索与编辑第一帧引入的视觉变化。Sync-LoRA在紧凑的、高度筛选的同步人类肖像数据集上进行训练，能够泛化到未见过的身份和各种编辑（如修改外观、添加物体或改变背景），并能稳健地处理姿态和表情的变化。我们的结果展示了高视觉保真度和强时间一致性，实现了编辑保真度和精确运动保留之间的稳健平衡。

Summary / 总结

The research aims to develop a method for editing portrait videos with high-quality visual modifications while preserving synchronization and identity consistency. The method, Sync-LoRA, uses an image-to-video diffusion model and trains an in-context LoRA on paired videos with identical motion but different appearances. This approach enables accurate synchronization and generalizes well to unseen identities and diverse edits. Key findings include high visual fidelity and strong temporal coherence, with robust handling of pose and expression variations.

Sync-LoRA 是一种用于编辑肖像视频的方法，能够保持时间同步和身份一致性。它使用从第一帧开始传播编辑的图像到视频的扩散模型。该方法通过具有相同运动但外观不同的配对视频训练上下文 LoRA，确保准确的同步。实验结果显示高度的视觉保真度和强时间一致性，能够稳健处理各种编辑，如外观更改、添加对象和背景修改。

SurfFill: Completion of LiDAR Point Clouds via Gaussian Surfel Splatting

Authors: Svenja Strobel, Matthias Innmann, Bernhard Egger, Marc Stamminger, Linus Franke

First: 2025-12-02T18:35:54+00:00 · Latest: 2025-12-02T18:35:54+00:00

Comments: Project page: https://lfranke.github.io/surffill

Abs · PDF · Code1 · Code2 · Project1

Abstract

LiDAR-captured point clouds are often considered the gold standard in active 3D reconstruction. While their accuracy is exceptional in flat regions, the capturing is susceptible to miss small geometric structures and may fail with dark, absorbent materials. Alternatively, capturing multiple photos of the scene and applying 3D photogrammetry can infer these details as they often represent feature-rich regions. However, the accuracy of LiDAR for featureless regions is rarely reached. Therefore, we suggest combining the strengths of LiDAR and camera-based capture by introducing SurfFill: a Gaussian surfel-based LiDAR completion scheme. We analyze LiDAR capturings and attribute LiDAR beam divergence as a main factor for artifacts, manifesting mostly at thin structures and edges. We use this insight to introduce an ambiguity heuristic for completed scans by evaluating the change in density in the point cloud. This allows us to identify points close to missed areas, which we can then use to grow additional points from to complete the scan. For this point growing, we constrain Gaussian surfel reconstruction [Huang et al. 2024] to focus optimization and densification on these ambiguous areas. Finally, Gaussian primitives of the reconstruction in ambiguous areas are extracted and sampled for points to complete the point cloud. To address the challenges of large-scale reconstruction, we extend this pipeline with a divide-and-conquer scheme for building-sized point cloud completion. We evaluate on the task of LiDAR point cloud completion of synthetic and real-world scenes and find that our method outperforms previous reconstruction methods.

中文标题/摘要

标题：SurfFill：通过高斯面元泼溅补全LiDAR点云

LiDAR捕获的点云通常被认为是主动三维重建的黄金标准。虽然其在平坦区域的精度极高，但采集过程容易遗漏细小的几何结构，并且在深色吸光材料上可能失败。另一种做法是拍摄场景的多张照片并应用三维摄影测量，由于这些细节通常位于特征丰富的区域，摄影测量可以将其推断出来。然而，它在无特征区域很少能达到LiDAR的精度。因此，我们建议结合LiDAR和基于相机的采集各自的优势，引入SurfFill：一种基于高斯面元的LiDAR补全方案。我们分析了LiDAR采集结果，认为LiDAR光束发散是产生伪影的主要因素，伪影主要出现在细薄结构和边缘处。我们利用这一洞察，通过评估点云密度的变化，为扫描结果引入一种模糊性启发式方法。这使我们能够识别靠近遗漏区域的点，并从这些点生长出额外的点以补全扫描。在点生长过程中，我们约束高斯面元重建 [Huang et al. 2024]，使优化和致密化集中在这些模糊区域。最后，提取模糊区域中重建得到的高斯基元并从中采样，以补全点云。为了应对大规模重建的挑战，我们用分而治之的方案扩展了该流水线，以实现建筑物规模的点云补全。我们在合成和真实场景的LiDAR点云补全任务上进行了评估，发现我们的方法优于之前的重建方法。

Summary / 总结

SurfFill is a Gaussian surfel-based method for completing LiDAR point clouds. It addresses the limitations of LiDAR in capturing small geometric structures and dark materials by combining LiDAR data with photogrammetry. The method identifies ambiguous areas through changes in point cloud density and uses Gaussian surfel reconstruction to grow points in these regions. On both synthetic and real-world scenes, SurfFill outperforms existing reconstruction methods.

SurfFill 是一种基于高斯面元的 LiDAR 点云补全方法。它结合 LiDAR 数据和摄影测量，以弥补 LiDAR 在捕捉细小几何结构和深色材料方面的局限性。该方法通过点云密度的变化识别模糊区域，并从这些区域生长额外的点以补全扫描。在合成和真实场景的点云补全任务中，该方法优于之前的重建方法。

From Moderation to Mediation: Can LLMs Serve as Mediators in Online Flame Wars?

Authors: Dawei Li, Abdullah Alnaibari, Arslan Bisharat, Manny Sandoval, Deborah Hall, Yasin Silva, Huan Liu

First: 2025-12-02T18:31:18+00:00 · Latest: 2025-12-02T18:31:18+00:00

Comments: Under review

Abs · PDF · Code1 · Code2

Abstract

The rapid advancement of large language models (LLMs) has opened new possibilities for AI-for-good applications. As LLMs increasingly mediate online communication, their potential to foster empathy and constructive dialogue becomes an important frontier for responsible AI research. This work explores whether LLMs can serve not only as moderators that detect harmful content, but as mediators capable of understanding and de-escalating online conflicts. Our framework decomposes mediation into two subtasks: judgment, where an LLM evaluates the fairness and emotional dynamics of a conversation, and steering, where it generates empathetic, de-escalatory messages to guide participants toward resolution. To assess mediation quality, we construct a large Reddit-based dataset and propose a multi-stage evaluation pipeline combining principle-based scoring, user simulation, and human comparison. Experiments show that API-based models outperform open-source counterparts in both reasoning and intervention alignment on mediation tasks. Our findings highlight both the promise and limitations of current LLMs as emerging agents for online social mediation.

中文标题/摘要

标题：从审查到调解：大语言模型能否在在线争吵中担任调解人？

大型语言模型（LLMs）的迅速发展为AI向善的应用开辟了新的可能性。随着LLMs越来越多地调解在线交流，它们促进同理心和建设性对话的潜力成为负责任AI研究的重要前沿。本研究探讨LLMs是否不仅能作为检测有害内容的审查者，还能作为能够理解并缓解在线冲突的调解人。我们的框架将调解分解为两个子任务：判断，即LLM评估对话的公平性和情感动态；引导，即生成同理心、缓解冲突的消息，引导参与者走向解决。为了评估调解质量，我们构建了一个基于Reddit的大规模数据集，并提出了一种结合原则评分、用户模拟和人工对比的多阶段评估管道。实验表明，基于API的模型在进行调解时，在推理和干预一致性方面优于开源版本。我们的研究结果突显了当前LLMs作为在线社会调解新兴代理的潜力和局限性。

Summary / 总结

This study investigates whether large language models (LLMs) can act as mediators in online conflicts, beyond their role as moderators. The research decomposes mediation into judgment and steering tasks and evaluates LLMs using a multi-stage pipeline. Experiments indicate that API-based models outperform open-source models in both reasoning and intervention alignment during mediation, suggesting both the potential and current limitations of LLMs in online social mediation.

这项研究探讨了大型语言模型（LLMs）是否可以在在线冲突中扮演调解者的角色，而不仅仅是作为内容过滤器。研究将调解分解为判断和引导两个子任务，并使用多阶段评估管道进行评估。实验表明，基于API的模型在推理和干预一致性方面优于开源模型，这既显示了LLMs在线社会调解中的潜力，也揭示了当前的局限性。

DGGT: Feedforward 4D Reconstruction of Dynamic Driving Scenes using Unposed Images

Authors: Xiaoxue Chen, Ziyi Xiong, Yuantao Chen, Gen Li, Nan Wang, Hongcheng Luo, Long Chen, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Hongyang Li, Ya-Qin Zhang, Hao Zhao

First: 2025-12-02T18:29:18+00:00 · Latest: 2025-12-02T18:29:18+00:00

Abs · PDF · Code1 · Code2

Abstract

Autonomous driving needs fast, scalable 4D reconstruction and re-simulation for training and evaluation, yet most methods for dynamic driving scenes still rely on per-scene optimization, known camera calibration, or short frame windows, making them slow and impractical. We revisit this problem from a feedforward perspective and introduce Driving Gaussian Grounded Transformer (DGGT), a unified framework for pose-free dynamic scene reconstruction. We note that existing formulations, which treat camera pose as a required input, limit flexibility and scalability. Instead, we reformulate pose as an output of the model, enabling reconstruction directly from sparse, unposed images and supporting an arbitrary number of views for long sequences. Our approach jointly predicts per-frame 3D Gaussian maps and camera parameters, disentangles dynamics with a lightweight dynamic head, and preserves temporal consistency with a lifespan head that modulates visibility over time. A diffusion-based rendering refinement further reduces motion/interpolation artifacts and improves novel-view quality under sparse inputs. The result is a single-pass, pose-free algorithm that achieves state-of-the-art performance and speed. Trained and evaluated on large-scale driving benchmarks (Waymo, nuScenes, Argoverse2), our method outperforms prior work both when trained on each dataset and in zero-shot transfer across datasets, and it scales well as the number of input frames increases.

中文标题/摘要

标题：DGGT：使用未摆姿图像进行动态驾驶场景的前馈4D重建

自动驾驶需要快速且可扩展的4D重建和重模拟来进行训练和评估，但大多数动态驾驶场景的方法仍然依赖于场景优化、已知的相机校准或短帧窗口，这使得它们速度慢且不实用。我们从前馈的角度重新审视了这个问题，并引入了**驾驶高斯接地变换器（DGGT）**，这是一种无需姿态的统一框架，用于动态场景重建。我们注意到现有的公式将相机姿态视为必需的输入，这限制了灵活性和可扩展性。相反，我们将姿态重新定义为模型的输出，从而可以直接从稀疏的未摆姿图像进行重建，并支持长序列中的任意数量的视角。我们的方法联合预测每帧的3D高斯图和相机参数，通过一个轻量级的动力学头部解耦动态，并通过一个寿命头部随着时间调整可见性来保持时间一致性。基于扩散的渲染细化进一步减少了运动/插值伪影，并在稀疏输入下提高了新视角的质量。结果是一个单次通过、无需姿态的算法，实现了最先进的性能和速度。在大规模驾驶基准数据集（Waymo、nuScenes、Argoverse2）上进行训练和评估，我们的方法在每个数据集上训练时都优于先前的工作，并且在跨数据集的零样本迁移中也表现出色，随着输入帧数的增加，其可扩展性也很好。

Summary / 总结

The research aims to address the need for fast and scalable 4D reconstruction for autonomous driving training and evaluation. The authors introduce DGGT, a feedforward framework that reconstructs dynamic scenes directly from unposed images without requiring camera poses. Key findings show that DGGT outperforms previous methods in both single-dataset training and cross-dataset zero-shot transfer, and scales well with more input frames.

研究旨在解决自主驾驶中快速且可扩展的4D重建需求，当前方法受限于场景优化和相机校准。作者提出了DGGT，这是一种统一的前馈框架，可以直接从未标定的图像中重建动态场景，无需将相机姿态作为输入。主要发现包括相比现有方法，该方法在性能和速度上有所提升，并能处理长序列和任意视角，展示了在大规模驾驶基准上的领先成果。

DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling

Authors: Kairun Wen, Yuzhi Huang, Runyu Chen, Hui Zheng, Yunlong Lin, Panwang Pan, Chenxin Li, Wenyan Cong, Jian Zhang, Junbin Lu, Chenguo Lin, Dilin Wang, Zhicheng Yan, Hongyu Xu, Justin Theiss, Yue Huang, Xinghao Ding, Rakesh Ranjan, Zhiwen Fan

First: 2025-12-02T18:24:27+00:00 · Latest: 2025-12-02T18:24:27+00:00

Abs · PDF · Code1 · Code2

Abstract

Understanding the dynamic physical world, characterized by its evolving 3D structure, real-world motion, and semantic content with textual descriptions, is crucial for human-agent interaction and enables embodied agents to perceive and act within real environments with human-like capabilities. However, existing datasets are often derived from limited simulators or use traditional Structure-from-Motion for up-to-scale annotation and offer limited descriptive captioning, which restricts the capacity of foundation models to accurately interpret real-world dynamics from monocular videos, commonly sourced from the internet. To bridge these gaps, we introduce DynamicVerse, a physical-scale, multimodal 4D world modeling framework for dynamic real-world video. We employ large vision, geometric, and multimodal models to interpret metric-scale static geometry, real-world dynamic motion, instance-level masks, and holistic descriptive captions. By integrating window-based Bundle Adjustment with global optimization, our method converts long real-world video sequences into a comprehensive 4D multimodal format. DynamicVerse delivers a large-scale dataset consisting of 100K+ videos with 800K+ annotated masks and 10M+ frames from internet videos. Experimental evaluations on three benchmark tasks, namely video depth estimation, camera pose estimation, and camera intrinsics estimation, demonstrate that our 4D modeling achieves superior performance in capturing physical-scale measurements with greater global accuracy than existing methods.

中文标题/摘要

标题：DynamicVerse：一种物理感知的多模态4D世界建模框架

理解动态物理世界，其特征为不断演变的三维结构、真实世界的运动以及带有文本描述的语义内容，对于人类代理交互至关重要，使具身代理能够以类人能力感知和行动于真实环境中。然而，现有数据集往往源自有限的模拟器或利用传统的Structure-from-Motion进行缩放标注，提供有限的描述性注释，这限制了基础模型从单目视频中准确解释真实世界动态的能力，这些视频通常来自互联网。为弥合这些差距，我们引入了DynamicVerse，一种物理尺度的多模态4D世界建模框架，用于动态现实视频。我们利用大规模的视觉、几何和多模态模型来解释度量级的静态几何结构、真实世界的动态运动、实例级掩码和整体描述性注释。通过结合基于窗口的Bundle Adjustment与全局优化，我们的方法将长时间的真实世界视频序列转换为全面的4D多模态格式。DynamicVerse提供了一个大规模数据集，包含来自互联网视频的10万多段视频、80多万个注释掩码和1000多万帧。在三个基准任务（即视频深度估计、相机姿态估计和相机内参估计）上的实验评估表明，我们的4D建模在捕捉物理尺度测量方面具有更高的全局准确性，优于现有方法。

Summary / 总结

DynamicVerse is a physically-aware multimodal framework for 4D world modeling, addressing the limitations of existing datasets by providing a large-scale dataset with 100K+ videos, 800K+ instance-level masks, and 10M+ frames. It uses large vision, geometric, and multimodal models to interpret static geometry, dynamic motion, instance-level masks, and descriptive captions. By integrating window-based Bundle Adjustment with global optimization, DynamicVerse converts long video sequences into a 4D multimodal format, achieving superior performance in video depth estimation, camera pose estimation, and camera intrinsics estimation compared to existing methods.

DynamicVerse 是一个物理感知的多模态 4D 世界建模框架，旨在解决现有数据集在捕捉真实世界动态方面的局限性。该方法使用大型视觉、几何和多模态模型来解释静态几何、动态运动、实例级掩码和描述性标题。该方法结合了基于窗口的 Bundle Adjustment 和全局优化，将长视频序列转换为 4D 多模态格式。实验结果表明，DynamicVerse 在视频深度估计、相机姿态估计和相机内参估计等基准任务上优于现有方法，实现了更高的全局精度和物理尺度测量。

LLM-NAS: LLM-driven Hardware-Aware Neural Architecture Search

Authors: Hengyi Zhu, Grace Li Zhang, Shaoyi Huang

First: 2025-10-01T21:29:20+00:00 · Latest: 2025-12-02T18:21:26+00:00

Abs · PDF · Code1 · Code2

Abstract

Hardware-Aware Neural Architecture Search (HW-NAS) requires joint optimization of accuracy and latency under device constraints. Traditional supernet-based methods require multiple GPU days per dataset. Large Language Model (LLM)-driven approaches avoid training a large supernet and can provide quick feedback, but we observe an exploration bias: the LLM repeatedly proposes neural network designs within a limited part of the search space and fails to discover architectures across different latency ranges in the entire search space. To address this issue, we propose LLM-NAS: an LLM-driven Neural Architecture Search that can generate neural networks with high accuracy and low latency at reduced search cost. Our proposed LLM-NAS has three key components: 1) a complexity-driven partitioning engine that divides the search space by complexity to enforce diversity and mitigate exploration bias; 2) an LLM-powered architecture prompt co-evolution operator, in which the LLM first updates a knowledge base of design heuristics based on results from the previous round, then performs a guided evolution algorithm on architectures with prompts that incorporate this knowledge base. Prompts and designs improve together across rounds, which avoids random guesswork and improves efficiency; 3) a zero-cost predictor to avoid training a large number of candidates from scratch. Experimental results show that on HW-NAS-Bench, LLM-NAS can achieve overall higher HV, lower IGD, and up to 54% lower latency than baselines at similar accuracy. Meanwhile, the search cost drops from days to minutes compared with traditional supernet baselines.

中文标题/摘要

标题：LLM-NAS：由大语言模型驱动的硬件感知神经架构搜索

硬件感知神经架构搜索（HW-NAS）需要在设备约束下同时优化准确性和延迟。传统基于超网络的方法需要对每个数据集进行多天的GPU训练。由大语言模型（LLM）驱动的方法可以避免训练大型超网络并能提供快速反馈，但我们观察到探索偏差：LLM 重复提出复杂度有限的神经网络设计，并未能在整个搜索空间中发现不同延迟范围内的架构。为了解决这一问题，我们提出了LLM-NAS：一种由大语言模型驱动的神经架构搜索方法，可以以较低的搜索成本生成具有高准确性和低延迟的神经网络。我们提出的LLM-NAS具有三个关键组件：1）一个复杂性驱动的分区引擎，通过复杂性将搜索空间划分为多个部分，以促进多样性和减轻探索偏差；2）一个由大语言模型支持的架构提示协同进化算子，在此算子中，大语言模型首先根据上一轮的结果更新设计启发式知识库，然后使用包含此知识库的提示对架构进行引导式进化算法。提示和设计在各轮中共同改进，避免了随机猜测并提高了效率；3）一个零成本预测器，以避免从头开始训练大量候选模型。实验结果表明，与基准方法相比，LLM-NAS在相似准确性的条件下，总体HV更高，IGD更低，延迟最多降低54%。同时，与传统的超网络基准方法相比，搜索成本从天级降低到分钟级。

Summary / 总结

LLM-NAS is designed to address the exploration bias in LLM-driven HW-NAS by introducing a complexity-driven partitioning engine, an LLM-powered architecture prompt co-evolution operator, and a zero-cost predictor. This method achieves higher Hypervolume (HV) and lower Inverted Generational Distance (IGD) on HW-NAS-Bench, with up to 54% lower latency compared to baselines at similar accuracy, while significantly reducing the search cost from days to minutes.

LLM-NAS 通过引入复杂性驱动的分区引擎、LLM 助力的架构提示协同进化算子以及零成本预测器来解决 LLM 驱动的 HW-NAS 中的探索偏差问题。该方法在 HW-NAS-Bench 上实现了更高的 Hypervolume (HV) 和更低的 Inverted Generational Distance (IGD)，在相似准确度下最高可降低 54% 的延迟，并将搜索成本从天级降低到分钟级。
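
Two ideas from the LLM-NAS entry lend themselves to a short sketch: splitting candidates by complexity, and scoring a set of (latency, accuracy) results by hypervolume (HV). The code below is a hedged illustration only. The `flops` field, the equal-width binning rule, and all function names are assumptions, not details taken from the paper; the hypervolume routine is the standard two-objective form with latency minimized and accuracy maximized.

```python
def partition_by_complexity(candidates, n_bins):
    """Split candidates into equal-width bins over a complexity value (here: FLOPs)."""
    lo = min(c["flops"] for c in candidates)
    hi = max(c["flops"] for c in candidates)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal
    bins = [[] for _ in range(n_bins)]
    for c in candidates:
        idx = min(int((c["flops"] - lo) / width), n_bins - 1)
        bins[idx].append(c)
    return bins

def pareto_front(points):
    """Keep (latency, accuracy) pairs that no other pair dominates."""
    front = []
    for lat, acc in points:
        dominated = any(
            l2 <= lat and a2 >= acc and (l2 < lat or a2 > acc)
            for l2, a2 in points
        )
        if not dominated:
            front.append((lat, acc))
    return sorted(set(front))  # latency ascending, so accuracy is ascending too

def hypervolume_2d(points, ref_latency, ref_accuracy):
    """Area dominated by the front, bounded by a reference point that is
    worse than every point (higher latency, lower accuracy)."""
    area, prev_acc = 0.0, ref_accuracy
    for lat, acc in pareto_front(points):
        area += (ref_latency - lat) * (acc - prev_acc)
        prev_acc = acc
    return area

# Hypothetical candidates and results, for illustration only.
candidates = [{"flops": f} for f in (10, 20, 55, 90, 100)]
bins = partition_by_complexity(candidates, n_bins=3)
results = [(1.0, 0.6), (2.0, 0.8), (3.0, 0.7)]  # (latency, accuracy)
hv = hypervolume_2d(results, ref_latency=4.0, ref_accuracy=0.5)
print([len(b) for b in bins], round(hv, 3))
```

A larger hypervolume means the front covers more of the latency/accuracy trade-off space, which is why searching each complexity bin separately can help: it pushes candidates toward different latency ranges instead of clustering them in one.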

TEXTRIX: Latent Attribute Grid for Native Texture Generation and Beyond

Authors: Yifei Zeng, Yajie Bao, Jiachen Qian, Shuang Wu, Youtian Lin, Hao Zhu, Buyu Li, Feihu Zhang, Xun Cao, Yao Yao

Venue: www

First: 2025-12-02T18:18:20+00:00 · Latest: 2025-12-02T18:18:20+00:00

Comments: Project Page: https://www.neural4d.com/research-page/textrix

Abs · PDF · Code1 · Code2

Abstract

Prevailing 3D texture generation methods, which often rely on multi-view fusion, are frequently hindered by inter-view inconsistencies and incomplete coverage of complex surfaces, limiting the fidelity and completeness of the generated content. To overcome these challenges, we introduce TEXTRIX, a native 3D attribute generation framework for high-fidelity texture synthesis and downstream applications such as precise 3D part segmentation. Our approach constructs a latent 3D attribute grid and leverages a Diffusion Transformer equipped with sparse attention, enabling direct coloring of 3D models in volumetric space and fundamentally avoiding the limitations of multi-view fusion. Built upon this native representation, the framework naturally extends to high-precision 3D segmentation by training the same architecture to predict semantic attributes on the grid. Extensive experiments demonstrate state-of-the-art performance on both tasks, producing seamless, high-fidelity textures and accurate 3D part segmentation with precise boundaries.

中文标题/摘要

标题：TEXTRIX：用于原生纹理生成及更多应用的潜在属性网格

现有的3D纹理生成方法通常依赖多视角融合，但经常受到视角间不一致性和复杂表面覆盖不完全的限制，影响生成内容的准确性和完整性。为克服这些挑战，我们提出了TEXTRIX，一种用于高保真纹理合成和精确3D部分分割等下游应用的原生3D属性生成框架。我们的方法构建了一个潜在的3D属性网格，并利用具有稀疏注意力机制的扩散变换器，能够在体空间直接着色3D模型，从根本上避免了多视角融合的局限性。基于这种原生表示，框架自然扩展到高精度3D分割，通过训练相同的架构在网格上预测语义属性。大量实验表明，该框架在两个任务上均表现出最先进的性能，生成无缝、高保真的纹理和精确边界准确的3D部分分割。

Summary / 总结

The research aims to address the limitations of existing 3D texture generation methods, particularly inter-view inconsistencies and incomplete surface coverage. TEXTRIX introduces a native 3D attribute generation framework using a latent 3D attribute grid and a Diffusion Transformer with sparse attention, which directly colors 3D models in volumetric space. The method outperforms existing techniques, generating seamless, high-fidelity textures and achieving precise 3D part segmentation with accurate boundaries.

研究动机是解决现有3D纹理生成方法的局限性，如多视角融合中的不一致性及覆盖不完整，这些都限制了生成内容的保真度和完整性。主要方法是引入TEXTRIX，一种使用潜3D属性网格和具有稀疏注意力机制的扩散变换器的原生3D属性生成框架，允许直接在体空间中着色3D模型。关键实验发现表明，这种方法在生成无缝、高保真纹理和精确的3D部分分割方面优于现有方法，具有精确的边界。

Pre-trained Language Models Improve the Few-shot Prompt Ability of Decision Transformer

Authors: Yu Yang, Pan Xu

Venue: Transactions on Machine Learning Research, 2025

First: 2024-08-02T17:25:34+00:00 · Latest: 2025-12-02T18:17:16+00:00

Comments: 2 figures, 10 tables. Published in Transactions on Machine Learning Research (TMLR)

Abs · PDF · Code1 · Code2

Abstract

Decision Transformer (DT) has emerged as a promising class of algorithms in offline reinforcement learning (RL) tasks, leveraging pre-collected datasets and the Transformer's capability to model long sequences. Recent works have demonstrated that using parts of trajectories from training tasks as prompts in DT enhances its performance on unseen tasks, giving rise to Prompt-DT methods. However, collecting data from specific environments can be both costly and unsafe in many scenarios, leading to suboptimal performance and limited few-shot prompt abilities due to the data-hungry nature of Transformer-based models. Additionally, the limited datasets used in pre-training make it challenging for Prompt-DT-type methods to distinguish between various RL tasks through prompts alone. To address these challenges, we introduce the Language model-initialized Prompt Decision Transformer (LPDT) framework, which leverages pre-trained language models that provide rich prior knowledge for RL tasks and fine-tunes the sequence model using Low-rank Adaptation (LoRA) for meta-RL problems. We further incorporate prompt regularization to effectively differentiate between tasks based on prompt feature representations. Comprehensive empirical studies demonstrate that initializing with a pre-trained language model provides useful prior knowledge and achieves performance similar to Prompt-DT with only 10% of the data in some MuJoCo control tasks. We also provide a thorough ablation study to validate the effectiveness of each component, including sequence modeling, language models, prompt regularizations, and prompt strategies.

中文标题/摘要

标题：预训练语言模型提高决策变换器的少量示例提示能力

决策变换器（DT）已成为离线强化学习（RL）任务中的一种有前途的算法类别，利用预收集的数据集和Transformer建模长序列的能力。最近的研究表明，使用训练任务的部分轨迹作为DT的提示可以提高其在未见任务上的性能，从而产生了提示-DT方法。然而，在许多场景中从特定环境中收集数据既昂贵又不安全，导致性能欠佳和少量示例提示能力有限，因为基于Transformer的模型需要大量数据。此外，预训练数据集的限制使得提示-DT类型的方法仅通过提示难以区分各种RL任务。为了解决这些挑战，我们引入了语言模型初始化的提示决策变换器（LPDT）框架，该框架利用预训练语言模型提供的丰富先验知识，并使用低秩适应（LoRA）微调序列模型以解决元RL问题。我们进一步引入了提示正则化，以有效地根据提示特征表示区分任务。全面的经验研究表明，在某些MuJoCo控制任务中，使用预训练语言模型初始化仅需10%的数据即可获得与提示-DT相当的性能。我们还提供了详尽的消融研究以验证每个组件的有效性，包括序列建模、语言模型、提示正则化和提示策略。

Summary / 总结

The research aims to enhance the few-shot prompt ability of Decision Transformer (DT) by leveraging pre-trained language models. The study introduces the Language model-initialized Prompt Decision Transformer (LPDT) framework, which uses pre-trained language models to provide rich prior knowledge and fine-tunes the sequence model using Low-rank Adaptation (LoRA). Comprehensive experiments show that LPDT can achieve similar performance to Prompt-DT with only 10% of the data in MuJoCo control tasks. The study also includes an ablation analysis to validate the effectiveness of each component.

研究旨在通过利用预训练语言模型来提升决策变换器（DT）的少量提示能力。研究引入了语言模型初始化的提示决策变换器（LPDT）框架，利用预训练语言模型提供丰富的先验知识，并使用低秩适应（LoRA）微调序列模型。实验结果表明，LPDT在MuJoCo控制任务中仅使用10%的数据即可达到与提示DT相当的性能，解决了数据收集和少量提示能力有限的问题。
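
The LPDT entry relies on Low-rank Adaptation (LoRA) to fine-tune a pre-trained sequence model. As a reminder of what that means, the sketch below shows the general LoRA update, W' = W + (alpha / r) * B A, on a toy weight matrix in plain Python. The matrix sizes, the `alpha` value, and the helper names are illustrative assumptions and say nothing about how LPDT configures its adapters.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank."""
    rank = len(a)          # A has shape (r, d_in)
    delta = matmul(b, a)   # B has shape (d_out, r), so B @ A is (d_out, d_in)
    scale = alpha / rank
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]

d_out, d_in, r = 6, 8, 2
rng = random.Random(0)
w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen weights
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable
b = [[0.0] * r for _ in range(d_out)]                               # trainable, zero-initialized

full_params = d_out * d_in        # parameters updated by full fine-tuning
lora_params = r * (d_in + d_out)  # parameters updated by LoRA
adapted = lora_weight(w, a, b, alpha=16)
print(full_params, lora_params, adapted == w)
```

Because B starts at zero, the adapted weights equal the pre-trained ones before any training step, so fine-tuning begins from the language model's behavior while updating far fewer parameters than the full matrix.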

Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions

Authors: Saurav Sengupta, Nazanin Moradinasab, Jiebei Liu, Donald E. Brown

First: 2025-11-21T19:18:41+00:00 · Latest: 2025-12-02T18:13:35+00:00

Abs · PDF · Code1 · Code2

Abstract

Recent research suggests that Vision Language Models (VLMs) often rely on inherent biases learned during training when responding to queries about visual properties of images. These biases are exacerbated when VLMs are asked highly specific questions that require them to focus on particular areas of the image in tasks such as counting. We build upon this research by developing a synthetic benchmark dataset and evaluation framework to systematically determine how counting performance varies as image and prompt properties change. Using open-source VLMs, we then analyze how attention allocation fluctuates with varying input parameters (e.g., number of objects in the image, object color, background color, object texture, background texture, and prompt specificity). We further implement attention-based interventions to modulate focus on visual tokens at different layers and evaluate their impact on counting performance across a range of visual conditions. Our experiments reveal that while VLM counting performance remains challenging, especially under high visual or linguistic complexity, certain attention interventions can lead to modest gains in counting performance.

中文标题/摘要

标题：视觉语言模型能数数吗？一种合成基准和注意力干预分析

近期研究表明，视觉语言模型（VLMs）在回答有关图像视觉属性的问题时，往往会依赖于训练过程中学到的固有偏见。当VLMs被要求回答需要它们关注图像特定区域的高具体问题时，这种偏见会加剧，例如在计数任务中。我们在此基础上开发了一个合成基准数据集和评估框架，以系统地确定计数性能如何随着图像和提示属性的变化而变化。使用开源VLMs，我们分析了注意力分配如何随着输入参数的变化（例如图像中的物体数量、物体颜色、背景颜色、物体纹理、背景纹理和提示的具体性）而波动。我们进一步实施了基于注意力的干预措施，以调节不同层面上对视觉标记的关注，并评估其对计数性能的影响。我们的实验表明，尽管VLM计数性能仍然具有挑战性，尤其是在高视觉或语言复杂性的情况下，某些注意力干预措施可以带来适度的计数性能提升。

Summary / 总结

This study investigates the counting capabilities of Vision-Language Models (VLMs) by developing a synthetic benchmark and analyzing attention-based interventions. The research aims to understand how VLMs handle specific visual tasks and the impact of various image and prompt properties on their performance. Key findings show that VLMs struggle with counting under complex visual conditions, but certain attention interventions can improve their performance modestly.

研究旨在探讨视觉语言模型（VLMs）的计数能力及其在不同图像和提示属性下的表现。研究构建了一个合成基准来评估VLMs在各种条件下的计数性能，并使用开源VLMs分析注意力分配。实验表明，尽管VLMs在复杂视觉和语言输入下表现不佳，但特定的注意力干预可以适度提高计数性能。
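
The abstract above does not say which attention interventions were used, so the following is only one simple possibility, sketched under that assumption: multiply the attention weights that fall on visual tokens by a factor and renormalize so the row still sums to 1. The function name, the mask, and the factor are hypothetical.

```python
def scale_visual_attention(weights, is_visual, factor):
    """Multiply attention on visual tokens by `factor`, then renormalize to sum to 1."""
    scaled = [w * factor if v else w for w, v in zip(weights, is_visual)]
    total = sum(scaled)
    return [s / total for s in scaled]

# One attention row over four tokens; the first two are visual tokens (toy values).
weights = [0.10, 0.20, 0.30, 0.40]
is_visual = [True, True, False, False]

boosted = scale_visual_attention(weights, is_visual, factor=2.0)
visual_mass = sum(w for w, v in zip(boosted, is_visual) if v)
print(boosted, visual_mass)
```

With a factor of 2.0, the share of attention on visual tokens rises from 0.30 to roughly 0.46. In a real model such a change would be applied per head and per layer, which is where the choice of layer mentioned in the abstract comes in.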

PrITTI: Primitive-based Generation of Controllable and Editable 3D Semantic Urban Scenes

Authors: Christina Ourania Tze, Daniel Dauner, Yiyi Liao, Dzmitry Tsishkou, Andreas Geiger

First: 2025-06-23T20:47:18+00:00 · Latest: 2025-12-02T18:07:53+00:00

Comments: Project page: https://raniatze.github.io/pritti/

Abs · PDF · Code1 · Code2 · Project1 · Project2

Abstract

Existing approaches to 3D semantic urban scene generation predominantly rely on voxel-based representations, which are bound by fixed resolution, challenging to edit, and memory-intensive in their dense form. In contrast, we advocate for a primitive-based paradigm where urban scenes are represented using compact, semantically meaningful 3D elements that are easy to manipulate and compose. To this end, we introduce PrITTI, a latent diffusion model that leverages vectorized object primitives and rasterized ground surfaces for generating diverse, controllable, and editable 3D semantic urban scenes. This hybrid representation yields a structured latent space that facilitates object- and ground-level manipulation. Experiments on KITTI-360 show that primitive-based representations unlock the full capabilities of diffusion transformers, achieving state-of-the-art 3D scene generation quality with lower memory requirements, faster inference, and greater editability than voxel-based methods. Beyond generation, PrITTI supports a range of downstream applications, including scene editing, inpainting, outpainting, and photo-realistic street-view synthesis. Code and models are publicly available at https://raniatze.github.io/pritti/.

中文标题/摘要

标题：PrITTI：基于原型的可控和可编辑3D语义城市场景生成

现有的3D语义城市场景生成方法主要依赖于体素表示，这种表示方式受到固定分辨率的限制，难以编辑且在密集形式下占用大量内存。相比之下，我们提倡一种基于原型的范式，其中城市场景由紧凑且具有语义意义的3D元素表示，这些元素易于操作和组合。为此，我们引入了PrITTI，这是一种利用矢量化对象原型和栅格化地面表面的潜在扩散模型，用于生成多样、可控和可编辑的3D语义城市场景。这种混合表示方式产生了一个结构化的潜在空间，便于对象和地面级别的操作。在KITTI-360上的实验表明，基于原型的表示方式释放了扩散变换器的全部能力，实现了在较低内存需求、更快推理速度和更高可编辑性方面优于体素表示方法的3D场景生成质量。除了生成之外，PrITTI 还支持一系列下游应用，包括场景编辑、修复、扩展和逼真的街景合成。代码和模型可在https://raniatze.github.io/pritti/ 获取。

Summary / 总结

PrITTI is a latent diffusion model that uses primitive-based representations to generate diverse, controllable, and editable 3D semantic urban scenes, overcoming the limitations of voxel-based methods. It leverages vectorized object primitives and rasterized ground surfaces, providing a structured latent space for manipulation. Experiments on KITTI-360 demonstrate that PrITTI achieves superior 3D scene generation quality with lower memory usage, faster inference, and greater editability compared to voxel-based methods, supporting applications like scene editing and photo-realistic synthesis.

研究旨在通过提出基于基本元素的方法来解决基于体素的3D城市场景生成的局限性。PrITTI 使用带有矢量化对象基本元素和栅格化地面表面的潜扩散模型来生成多样、可控和可编辑的3D语义城市场景。KITTI-360上的实验表明，该方法在内存效率、推理速度和可编辑性方面优于基于体素的方法，同时实现了最先进的场景生成质量。

Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos

Authors: Ziren Gong, Xiaohan Li, Fabio Tosi, Jiawei Han, Stefano Mattoccia, Jianfei Cai, Matteo Poggi

First: 2025-07-29T17:55:58+00:00 · Latest: 2025-12-02T18:05:40+00:00

Abs · PDF · Code1 · Code2

Abstract

We present Ov3R, a novel framework for open-vocabulary semantic 3D reconstruction from RGB video streams, designed to advance Spatial AI. The system features two key components: CLIP3R, a CLIP-informed 3D reconstruction module that predicts dense point maps from overlapping clips while embedding object-level semantics; and 2D-3D OVS, a 2D-3D open-vocabulary semantic module that lifts 2D features into 3D by learning fused descriptors integrating spatial, geometric, and semantic cues. Unlike prior methods, Ov3R incorporates CLIP semantics directly into the reconstruction process, enabling globally consistent geometry and fine-grained semantic alignment. Our framework achieves state-of-the-art performance in both dense 3D reconstruction and open-vocabulary 3D segmentation, marking a step forward toward real-time, semantics-aware Spatial AI.

中文标题/摘要

标题：Ov3R：基于RGB视频流的开放词汇语义三维重建框架

我们提出了Ov3R，一种基于RGB视频流的开放词汇语义三维重建框架，旨在推动空间人工智能的发展。该系统包含两个关键组件：CLIP3R，一个受CLIP启发的三维重建模块，能够从重叠片段中预测密集点图并嵌入对象级语义；以及2D-3D OVS，一个2D-3D开放词汇语义模块，通过学习融合空间、几何和语义线索的特征来将2D特征提升到3D。与先前的方法不同，Ov3R直接将CLIP语义融入重建过程，从而实现全局一致的几何结构和精细的语义对齐。我们的框架在密集三维重建和开放词汇三维分割方面均达到了最先进的性能，标志着向实时、语义感知的空间人工智能迈进了一步。

Summary / 总结

Ov3R is a novel framework for open-vocabulary semantic 3D reconstruction from RGB videos, featuring CLIP3R, which predicts dense point maps with object-level semantics, and 2D-3D OVS, which integrates spatial, geometric, and semantic cues for 3D lifting. This system directly incorporates CLIP semantics into the reconstruction process, achieving state-of-the-art performance in dense 3D reconstruction and open-vocabulary 3D segmentation, advancing real-time, semantics-aware Spatial AI.

Ov3R 是一种从 RGB 视频流进行开放词汇语义 3D 重建的新型框架，包含 CLIP3R 预测具有对象级语义的密集点云和 2D-3D OVS 将 2D 特征提升到 3D 通过学习融合的空间、几何和语义线索。该系统直接将 CLIP 语义融入重建过程，实现了在密集 3D 重建和开放词汇 3D 分割方面的领先性能，推动了实时、语义感知的 Spatial AI 的发展。
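
Open-vocabulary 3D segmentation, as in the Ov3R entry, is commonly framed as matching per-point features against text embeddings of the candidate class names. The sketch below shows that matching step with cosine similarity. It is a generic illustration, not Ov3R's pipeline: the three-dimensional toy vectors stand in for real CLIP-style embeddings, and all names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def label_points(point_features, text_embeddings):
    """Assign each point the class name whose text embedding is most similar."""
    return [
        max(text_embeddings, key=lambda name: cosine(feat, text_embeddings[name]))
        for feat in point_features
    ]

# Toy stand-ins for text embeddings of an arbitrary, user-chosen vocabulary.
text_embeddings = {"chair": [1.0, 0.0, 0.0], "table": [0.0, 1.0, 0.0]}
# Toy stand-ins for fused per-point descriptors.
point_features = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]

labels = label_points(point_features, text_embeddings)
print(labels)
```

Because the vocabulary is just a dictionary of embedded names, new classes can be added at query time without retraining, which is what "open-vocabulary" refers to.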

GraphFusion3D: Dynamic Graph Attention Convolution with Adaptive Cross-Modal Transformer for 3D Object Detection

Authors: Md Sohag Mia, Md Nahid Hasan, Tawhid Ahmed, Muhammad Abdullah Adnan

First: 2025-12-02T18:05:02+00:00 · Latest: 2025-12-02T18:05:02+00:00

Abs · PDF · Code1 · Code2

Abstract

Despite significant progress in 3D object detection, point clouds remain challenging due to sparse data, incomplete structures, and limited semantic information. Capturing contextual relationships between distant objects presents additional difficulties. To address these challenges, we propose GraphFusion3D, a unified framework combining multi-modal fusion with advanced feature learning. Our approach introduces the Adaptive Cross-Modal Transformer (ACMT), which adaptively integrates image features into point representations to enrich both geometric and semantic information. For proposal refinement, we introduce the Graph Reasoning Module (GRM), a novel mechanism that models neighborhood relationships to simultaneously capture local geometric structures and global semantic context. The module employs multi-scale graph attention to dynamically weight both spatial proximity and feature similarity between proposals. We further employ a cascade decoder that progressively refines detections through multi-stage predictions. Extensive experiments on SUN RGB-D (70.6% AP$_{25}$ and 51.2% AP$_{50}$) and ScanNetV2 (75.1% AP$_{25}$ and 60.8% AP$_{50}$) demonstrate a substantial performance improvement over existing approaches.

中文标题/摘要

标题：GraphFusion3D：动态图注意力卷积与自适应跨模态变换器结合的3D物体检测

尽管在3D物体检测方面取得了显著进展，但由于点云数据稀疏、结构不完整和语义信息有限，点云仍然具有挑战性。捕捉远距离物体之间的上下文关系增加了额外的难度。为了解决这些挑战，我们提出了GraphFusion3D，这是一种结合多模态融合和高级特征学习的统一框架。我们的方法引入了自适应跨模态变换器（ACMT），它能够自适应地将图像特征整合到点表示中，以丰富几何和语义信息。对于提案细化，我们引入了图推理模块（GRM），这是一种新颖的机制，用于建模邻域关系以同时捕获局部几何结构和全局语义上下文。该模块采用多尺度图注意力机制，动态加权提案之间的空间接近性和特征相似性。我们进一步采用级联解码器，通过多阶段预测逐步细化检测结果。在SUN RGB-D（70.6% AP$_{25}$和51.2% AP$_{50}$）和ScanNetV2（75.1% AP$_{25}$和60.8% AP$_{50}$）上的广泛实验表明，与现有方法相比，性能有了显著提高。

Summary / 总结

GraphFusion3D addresses the challenges of 3D object detection in point clouds by integrating multi-modal fusion and advanced feature learning. It introduces the Adaptive Cross-Modal Transformer (ACMT) to enrich point representations with image features, and the Graph Reasoning Module (GRM) to model neighborhood relationships for capturing local and global context. The approach achieves significant performance improvements, with AP$_{25}$ scores of 70.6% and 75.1% on SUN RGB-D and ScanNetV2, respectively, and AP$_{50}$ scores of 51.2% and 60.8% on these datasets.

GraphFusion3D 是一个统一框架，通过整合图像和点云特征来解决稀疏和不完整点云的挑战。它使用了自适应跨模态变换器来丰富点云表示，并使用图推理模块来建模邻域关系，同时捕捉局部几何结构和全局语义上下文。该方法还包括级联解码器进行逐级细化。实验结果显示在 SUN RGB-D 和 ScanNetV2 数据集上显著优于现有方法。
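
The AP$_{25}$ and AP$_{50}$ figures quoted in the GraphFusion3D entry are average precision at 3D IoU thresholds of 0.25 and 0.5: a predicted box counts as a match only if its intersection-over-union with a ground-truth box reaches the threshold. The sketch below computes 3D IoU for axis-aligned boxes. Treating boxes as axis-aligned is a simplifying assumption for illustration (benchmarks may use oriented boxes), and the function names are hypothetical.

```python
def box_volume(box):
    """Volume of an axis-aligned box (x0, y0, z0, x1, y1, z1); zero if degenerate."""
    x0, y0, z0, x1, y1, z1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0) * max(0.0, z1 - z0)

def iou_3d(a, b):
    """Intersection-over-union of two axis-aligned 3D boxes."""
    overlap = (
        max(a[0], b[0]), max(a[1], b[1]), max(a[2], b[2]),
        min(a[3], b[3]), min(a[4], b[4]), min(a[5], b[5]),
    )
    inter = box_volume(overlap)
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0

ground_truth = (0.0, 0.0, 0.0, 1.0, 1.0, 1.0)
prediction = (0.5, 0.0, 0.0, 1.5, 1.0, 1.0)  # unit cube shifted by 0.5 along x

iou = iou_3d(prediction, ground_truth)
print(iou, iou >= 0.25, iou >= 0.5)
```

Here the IoU is 1/3, so the prediction would count as correct under the 0.25 threshold but not under 0.5. This is why AP$_{50}$ is the stricter of the two numbers and is lower on both datasets.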

ProteinPNet: Prototypical Part Networks for Concept Learning in Spatial Proteomics

Authors: Louis McConnell, Jieran Sun, Theo Maffei, Raphael Gottardo, Marianna Rapsomaniki

First: 2025-12-02T18:00:03+00:00 · Latest: 2025-12-02T18:00:03+00:00

Abs · PDF · Code1 · Code2

Abstract

Understanding the spatial architecture of the tumor microenvironment (TME) is critical to advance precision oncology. We present ProteinPNet, a novel framework based on prototypical part networks that discovers TME motifs from spatial proteomics data. Unlike traditional post-hoc explainability models, ProteinPNet directly learns discriminative, interpretable, faithful spatial prototypes through supervised training. We validate our approach on synthetic datasets with ground truth motifs, and further test it on a real-world lung cancer spatial proteomics dataset. ProteinPNet consistently identifies biologically meaningful prototypes aligned with different tumor subtypes. Through graphical and morphological analyses, we show that these prototypes capture interpretable features pointing to differences in immune infiltration and tissue modularity. Our results highlight the potential of prototype-based learning to reveal interpretable spatial biomarkers within the TME, with implications for mechanistic discovery in spatial omics.

中文标题/摘要

标题：ProteinPNet: 原型部分网络在空间蛋白质组学概念学习中的应用

理解肿瘤微环境（TME）的空间架构对于推进精准肿瘤学至关重要。我们提出了ProteinPNet，一种基于原型部分网络的新框架，可以从空间蛋白质组学数据中发现TME模式。与传统的事后解释模型不同，ProteinPNet直接通过监督训练学习具有区分性、可解释性和忠实性的空间原型。我们在具有真实模式的合成数据集上验证了该方法，并进一步在真实的肺癌空间蛋白质组学数据集上进行了测试。ProteinPNet一致地识别出与不同肿瘤亚型相关的生物学上有意义的原型。通过图形和形态学分析，我们展示了这些原型捕捉到了可解释的特征，这些特征指向了免疫浸润和组织模块性的差异。我们的结果突显了基于原型学习揭示TME中可解释的空间生物标志物的潜力，这对空间组学中的机制发现具有重要意义。

Summary / 总结

ProteinPNet is a novel framework that uses prototypical part networks to discover motifs in spatial proteomics data for understanding the tumor microenvironment. Unlike traditional models, ProteinPNet directly learns discriminative and interpretable spatial prototypes through supervised training. The method was validated on synthetic and real-world datasets, consistently identifying biologically meaningful prototypes that align with different tumor subtypes and capture interpretable features related to immune infiltration and tissue modularity.

ProteinPNet 是一种使用原型部分网络来发现空间蛋白质组学数据中肿瘤微环境模式的新框架。与传统模型不同，ProteinPNet 通过监督训练直接学习具有区分性和可解释性的空间原型。该方法在合成数据集和真实世界数据集上进行了验证，一致地识别出了与不同肿瘤亚型相关的具有生物学意义的原型，并捕获了与免疫浸润和组织模块性相关的可解释特征。

U4D: Uncertainty-Aware 4D World Modeling from LiDAR Sequences

Authors: Xiang Xu, Ao Liang, Youquan Liu, Linfeng Li, Lingdong Kong, Ziwei Liu, Qingshan Liu

First: 2025-12-02T17:59:57+00:00 · Latest: 2025-12-02T17:59:57+00:00

Comments: Preprint; 19 pages, 7 figures, 8 tables

Abs · PDF · Code1 · Code2

Abstract

Modeling dynamic 3D environments from LiDAR sequences is central to building reliable 4D worlds for autonomous driving and embodied AI. Existing generative frameworks, however, often treat all spatial regions uniformly, overlooking the varying uncertainty across real-world scenes. This uniform generation leads to artifacts in complex or ambiguous regions, limiting realism and temporal stability. In this work, we present U4D, an uncertainty-aware framework for 4D LiDAR world modeling. Our approach first estimates spatial uncertainty maps from a pretrained segmentation model to localize semantically challenging regions. It then performs generation in a "hard-to-easy" manner through two sequential stages: (1) uncertainty-region modeling, which reconstructs high-entropy regions with fine geometric fidelity, and (2) uncertainty-conditioned completion, which synthesizes the remaining areas under learned structural priors. To further ensure temporal coherence, U4D incorporates a mixture of spatio-temporal (MoST) block that adaptively fuses spatial and temporal representations during diffusion. Extensive experiments show that U4D produces geometrically faithful and temporally consistent LiDAR sequences, advancing the reliability of 4D world modeling for autonomous perception and simulation.

中文标题/摘要

标题：U4D：基于LiDAR序列的不确定性感知4D世界建模

从LiDAR序列建模动态3D环境，是为自动驾驶和具身AI构建可靠4D世界的核心。然而，现有的生成框架通常对所有空间区域一视同仁，忽视了现实场景中不确定性的差异。这种统一生成会导致复杂或模糊区域出现伪影，限制了真实感和时间稳定性。在本文中，我们提出了U4D，一种面向4D LiDAR世界建模的不确定性感知框架。我们的方法首先利用预训练的分割模型估计空间不确定性图，以定位语义上具有挑战性的区域。然后，通过两个连续阶段以“由难到易”的方式进行生成：（1）不确定性区域建模，以精细的几何保真度重建高熵区域；（2）不确定性条件补全，在学习到的结构先验下合成剩余区域。为了进一步确保时间一致性，U4D引入了一种时空混合（MoST）块，在扩散过程中自适应地融合空间和时间表示。大量实验表明，U4D生成了几何上忠实且时间上一致的LiDAR序列，提升了4D世界建模在自主感知和仿真中的可靠性。

Summary / 总结

The research aims to improve the realism and temporal stability of 4D LiDAR world modeling for autonomous driving and embodied AI by addressing the uniform treatment of spatial regions in existing generative frameworks. U4D estimates spatial uncertainty maps and generates LiDAR sequences in a 'hard-to-easy' manner, first modeling high-entropy regions and then synthesizing the remaining areas under learned priors. The framework also incorporates a spatio-temporal block to ensure temporal coherence. Experiments demonstrate that U4D produces geometrically faithful and temporally consistent LiDAR sequences.

研究旨在通过解决现有生成框架对所有空间区域一视同仁的问题，提高4D LiDAR世界建模的真实感和时间稳定性，减少复杂或模糊区域的伪影。U4D框架首先估计空间不确定性图以识别具有挑战性的区域，然后以高几何保真度重建这些高熵区域，并在学习到的结构先验下合成剩余区域。时空混合（MoST）块用于确保时间连贯性。实验表明，U4D生成的LiDAR序列在几何上忠实且在时间上一致。
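
The "hard-to-easy" split described above can be illustrated with a minimal sketch. It assumes the uncertainty map is the per-cell Shannon entropy of a segmentation model's class probabilities and that a fixed threshold separates hard from easy regions; the abstract does not give the exact formula or threshold, so the function names and toy values below are hypothetical.

```python
import math

def entropy_map(probs):
    """Per-cell Shannon entropy (in nats) of class probabilities.

    probs: list of rows; each cell is a list of per-class probabilities.
    """
    return [
        [-sum(p * math.log(p) for p in cell if p > 0.0) for cell in row]
        for row in probs
    ]

def split_hard_easy(entropy, threshold):
    """Boolean mask: True marks 'hard' (high-entropy) cells to model first."""
    return [[h > threshold for h in row] for row in entropy]

# Toy 2x2 grid with 3 classes per cell (hypothetical values).
probs = [
    [[0.98, 0.01, 0.01], [0.34, 0.33, 0.33]],
    [[0.70, 0.20, 0.10], [1.00, 0.00, 0.00]],
]
H = entropy_map(probs)
hard = split_hard_easy(H, threshold=0.5)  # threshold is an assumed value
```

In this toy grid the near-uniform cell and the 0.7/0.2/0.1 cell are flagged as hard, while the two confident cells are left for the completion stage.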

InEx: Hallucination Mitigation via Introspection and Cross-Modal Multi-Agent Collaboration

Authors: Zhongyu Yang, Yingfang Yuan, Xuanming Jiang, Baoyi An, Wei Pang

Venue: AAAI 2026

First: 2025-12-02T17:59:52+00:00 · Latest: 2025-12-02T17:59:52+00:00

Comments: Published in AAAI 2026

Abs · PDF · Code1 · Code2

Abstract

Hallucination remains a critical challenge in large language models (LLMs), hindering the development of reliable multimodal LLMs (MLLMs). Existing solutions often rely on human intervention or underutilize the agent's ability to autonomously mitigate hallucination. To address these limitations, we draw inspiration from how humans make reliable decisions in the real world. They begin with introspective reasoning to reduce uncertainty and form an initial judgment, then rely on external verification from diverse perspectives to reach a final decision. Motivated by this cognitive paradigm, we propose InEx, a training-free, multi-agent framework designed to autonomously mitigate hallucination. InEx introduces internal introspective reasoning, guided by entropy-based uncertainty estimation, to improve the reliability of the decision agent's reasoning process. The agent first generates a response, which is then iteratively verified and refined through external cross-modal multi-agent collaboration with the editing agent and self-reflection agents, further enhancing reliability and mitigating hallucination. Extensive experiments show that InEx consistently outperforms existing methods, achieving 4%-27% gains on general and hallucination benchmarks, and demonstrating strong robustness.

中文标题/摘要

标题：InEx：通过内省和跨模态多智能体协作减轻幻觉

幻觉仍然是大型语言模型（LLMs）中的一个关键挑战，阻碍了可靠多模态LLMs（MLLMs）的发展。现有解决方案往往依赖于人工干预或未能充分利用智能体自主减轻幻觉的能力。为了解决这些限制，我们从人类在现实世界中做出可靠决策的方式中汲取灵感。他们首先通过内省推理来减少不确定性并形成初步判断，然后依靠来自不同视角的外部验证来做出最终决策。受这一认知范式启发，我们提出了InEx，这是一种无需训练的多智能体框架，旨在自主减轻幻觉。InEx引入了基于熵不确定性估计的内省推理，以提高决策智能体推理过程的可靠性。智能体首先生成响应，然后通过与编辑智能体和自我反思智能体进行外部跨模态多智能体协作的迭代验证和优化，进一步提高可靠性和减轻幻觉。广泛的实验表明，InEx在通用和幻觉基准测试中始终优于现有方法，实现了4%-27%的性能提升，并且表现出很强的鲁棒性。

Summary / 总结

InEx is a training-free, multi-agent framework that autonomously mitigates hallucination in multimodal large language models (MLLMs) by combining internal introspective reasoning with external cross-modal multi-agent collaboration. It uses entropy-based uncertainty estimation to improve the reliability of the decision agent's reasoning and iteratively verifies and refines responses through collaboration with editing and self-reflection agents. Experiments show that InEx consistently outperforms existing methods, achieving 4%-27% gains on general and hallucination benchmarks and demonstrating strong robustness.

InEx 是一个无需训练的多智能体框架，旨在自主减轻多模态大型语言模型（MLLMs）中的幻觉。它通过基于熵的不确定性估计引导的内省推理，以及与编辑智能体和自我反思智能体的外部跨模态多智能体协作来提高可靠性。实验表明，InEx 在通用和幻觉基准测试中始终优于现有方法，实现了 4%-27% 的提升，并且表现出很强的鲁棒性。
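
The iterative "verify, then refine" collaboration can be pictured as a small control loop. This is only a sketch under assumptions: the abstract does not specify how the agents are called or when iteration stops, so `refine_until_verified`, its parameters, and the toy agents are all hypothetical stand-ins rather than the InEx implementation.

```python
from typing import Callable, List

def refine_until_verified(
    draft: str,
    verifiers: List[Callable[[str], bool]],
    edit: Callable[[str], str],
    max_rounds: int = 3,
) -> str:
    """Repeat verify-then-edit until every verifier accepts or rounds run out."""
    answer = draft
    for _ in range(max_rounds):
        if all(check(answer) for check in verifiers):
            break
        answer = edit(answer)
    return answer

# Toy stand-ins (hypothetical): a verifier that rejects a claim the image
# does not support, and an editor that removes that claim.
def no_unicorn(text: str) -> bool:
    return "unicorn" not in text

def strip_unicorn(text: str) -> str:
    return text.replace(" and a unicorn", "")

result = refine_until_verified(
    "A dog and a unicorn on grass.", [no_unicorn], strip_unicorn
)
```

In a real system the verifiers and editor would be model-backed agents; plain functions keep the loop structure visible.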

Rethinking Generalized BCIs: Benchmarking 340,000+ Unique Algorithmic Configurations for EEG Mental Command Decoding

Authors: Paul Barbaste, Olivier Oullier, Xavier Vasques

First: 2025-12-02T17:56:46+00:00 · Latest: 2025-12-02T17:56:46+00:00

Comments: 28 pages, 8 figures, 2 tables

Abs · PDF · Code1 · Code2

Abstract

Robust decoding and classification of brain patterns measured with electroencephalography (EEG) remains a major challenge for real-world (i.e., outside scientific labs and medical facilities) brain-computer interface (BCI) applications due to well-documented inter- and intra-participant variability. Here, we present a large-scale benchmark evaluating over 340,000 unique combinations of spatial and nonlinear EEG classification. Our methodological pipeline consists of combinations of Common Spatial Patterns (CSP), Riemannian geometry, functional connectivity, and fractal- or entropy-based features across three open-access EEG datasets. Unlike prior studies, our analysis operates at the per-participant level and across multiple frequency bands (8-15 Hz and 8-30 Hz), enabling direct assessment of both group-level performance and individual variability. Covariance tangent space projection (cov-tgsp) and CSP consistently achieved the highest average classification accuracies. However, their effectiveness was strongly dataset-dependent, and marked participant-level differences persisted, particularly in the most heterogeneous of the datasets. Importantly, nonlinear methods outperformed spatial approaches for specific individuals, underscoring the need for personalized pipeline selection. Our findings highlight that no universal 'one-size-fits-all' method can optimally decode EEG motor imagery patterns across all users or datasets. Future work will require adaptive, multimodal, and possibly novel approaches to fully address neurophysiological variability in practical BCI applications where the system can automatically adapt to what makes each user unique.

中文标题/摘要

标题：重新思考通用BCI：评估超过340,000种独特算法配置的EEG脑命令解码基准

使用脑电图（EEG）测量的脑模式的稳健解码和分类仍然是现实世界（即，科学实验室和医疗设施之外）脑-计算机接口（BCI）应用中的主要挑战，这主要是由于已记录的参与者间和参与者内变异性。在这里，我们提出了一项大规模基准测试，评估了超过340,000种独特的空间和非线性EEG分类组合。我们的方法论管道包括组合共空间模式（CSP）、黎曼几何、功能性连接以及跨越三个开放访问EEG数据集的分形或基于熵的特征。与先前的研究不同，我们的分析在参与者层面和多个频率带（8-15 Hz和8-30 Hz）上进行，这使得可以对群体水平的性能和个体变异性进行直接评估。协方差切空间投影（cov-tgsp）和CSP在平均分类准确性上始终表现最佳。然而，它们的有效性强烈依赖于数据集，而且在最异质的数据集中，参与者之间的差异仍然显著。重要的是，非线性方法在特定个体中优于空间方法，强调了个性化管道选择的必要性。我们的研究结果表明，没有一种通用的‘一刀切’方法可以在所有用户或数据集中最优地解码EEG运动想象模式。未来的工作将需要适应性、多模态的方法，甚至可能是新颖的方法，以全面解决实际BCI应用中的神经生理学变异性。

Summary / 总结

This study benchmarks over 340,000 unique EEG classification configurations, focusing on spatial and nonlinear methods across three open datasets. The analysis evaluates performance at both group and individual levels across multiple frequency bands. While CSP and covariance tangent space projection (cov-tgsp) showed high average accuracies, their effectiveness varied by dataset, and individual differences were significant, especially in the most heterogeneous dataset. Nonlinear methods outperformed spatial approaches for some individuals, indicating the need for personalized BCI pipelines.

研究旨在通过评估超过340,000种独特的算法配置来解决基于EEG的脑-计算机接口(BCI)应用中的可变性问题。方法结合了空间和非线性EEG分类技术，包括共空间模式(CSP)、黎曼几何、功能性连接以及分形或熵基特征。分析结果显示，cov-tgsp和CSP在平均分类准确率上表现最佳，但其效果在不同数据集和参与者之间存在差异。非线性方法在某些个体中表现更优，表明需要个性化的BCI管道。研究发现，没有一种方法能在所有用户或数据集上都能最优地解码EEG运动想象模式，强调了在实际BCI应用中采用自适应、多模态方法的重要性。
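
The gap between group-level and per-participant pipeline selection can be made concrete with a small sketch. The accuracy values below are made-up toy numbers for illustration only, not results from the paper, and the helper names are hypothetical.

```python
def best_on_average(acc):
    """Method with the highest mean accuracy across all participants."""
    methods = list(next(iter(acc.values())).keys())
    n = len(acc)
    return max(methods, key=lambda m: sum(acc[p][m] for p in acc) / n)

def best_per_participant(acc):
    """Highest-scoring method for each participant individually."""
    return {p: max(scores, key=scores.get) for p, scores in acc.items()}

# Made-up toy accuracies (NOT from the paper), keyed by participant then method.
acc = {
    "P01": {"csp": 0.80, "cov-tgsp": 0.85, "entropy": 0.60},
    "P02": {"csp": 0.75, "cov-tgsp": 0.78, "entropy": 0.62},
    "P03": {"csp": 0.55, "cov-tgsp": 0.58, "entropy": 0.70},
}
group_choice = best_on_average(acc)
personal_choice = best_per_participant(acc)
```

With these toy numbers the group-level winner is a spatial method, yet one participant is better served by a nonlinear feature, which is the kind of pattern that motivates personalized selection.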

Contextual Image Attack: How Visual Context Exposes Multimodal Safety Vulnerabilities

Authors: Yuan Xiong, Ziqi Miao, Lijun Li, Chen Qian, Jie Li, Jing Shao

First: 2025-12-02T17:51:02+00:00 · Latest: 2025-12-02T17:51:02+00:00

Abs · PDF · Code1 · Code2

Abstract

While Multimodal Large Language Models (MLLMs) show remarkable capabilities, their safety alignments are susceptible to jailbreak attacks. Existing attack methods typically focus on text-image interplay, treating the visual modality as a secondary prompt. This approach underutilizes the unique potential of images to carry complex, contextual information. To address this gap, we propose a new image-centric attack method, Contextual Image Attack (CIA), which employs a multi-agent system to subtly embed harmful queries into seemingly benign visual contexts using four distinct visualization strategies. To further enhance the attack's efficacy, the system incorporates contextual element enhancement and automatic toxicity obfuscation techniques. Experimental results on the MMSafetyBench-tiny dataset show that CIA achieves high toxicity scores of 4.73 and 4.83 against the GPT-4o and Qwen2.5-VL-72B models, respectively, with Attack Success Rates (ASR) reaching 86.31% and 91.07%. Our method significantly outperforms prior work, demonstrating that the visual modality itself is a potent vector for jailbreaking advanced MLLMs.

中文标题/摘要

标题：基于上下文的图像攻击：视觉上下文揭示多模态安全漏洞

尽管多模态大型语言模型（MLLMs）表现出色，但它们的安全对齐容易受到越狱攻击的影响。现有的攻击方法通常侧重于文本-图像的交互，将视觉模态视为次要提示。这种方法未能充分利用图像携带复杂上下文信息的独特潜力。为解决这一问题，我们提出了一种新的以图像为中心的攻击方法——基于上下文的图像攻击（CIA），该方法利用多智能体系统，通过四种不同的可视化策略将有害查询巧妙地嵌入看似无害的视觉上下文中。为了进一步提高攻击的有效性，系统还集成了上下文元素增强和自动毒性混淆技术。在MMSafetyBench-tiny数据集上的实验结果显示，CIA对GPT-4o和Qwen2.5-VL-72B模型的毒性得分分别为4.73和4.83，攻击成功率（ASR）分别达到86.31%和91.07%。我们的方法显著优于先前的工作，表明视觉模态本身就是对先进MLLMs进行越狱的有效途径。

Forecasting in Offline Reinforcement Learning for Non-stationary Environments

Authors: Suzan Ece Ada, Georg Martius, Emre Ugur, Erhan Oztop

Venue: NeurIPS 2025

First: 2025-12-01T18:45:05+00:00 · Latest: 2025-12-02T17:50:43+00:00

Comments: The Thirty-Ninth Annual Conference on Neural Information Processing Systems, NeurIPS 2025

Abs · PDF · Code1 · Code2

Abstract

Offline Reinforcement Learning (RL) provides a promising avenue for training policies from pre-collected datasets when gathering additional interaction data is infeasible. However, existing offline RL methods often assume stationarity or only consider synthetic perturbations at test time, assumptions that often fail in real-world scenarios characterized by abrupt, time-varying offsets. These offsets can lead to partial observability, causing agents to misperceive their true state and degrade performance. To overcome this challenge, we introduce Forecasting in Non-stationary Offline RL (FORL), a framework that unifies (i) conditional diffusion-based candidate state generation, trained without presupposing any specific pattern of future non-stationarity, and (ii) zero-shot time-series foundation models. FORL targets environments prone to unexpected, potentially non-Markovian offsets, requiring robust agent performance from the onset of each episode. Empirical evaluations on offline RL benchmarks, augmented with real-world time-series data to simulate realistic non-stationarity, demonstrate that FORL consistently improves performance compared to competitive baselines. By integrating zero-shot forecasting with the agent's experience, we aim to bridge the gap between offline RL and the complexities of real-world, non-stationary environments.

中文标题/摘要

标题：离线强化学习在非平稳环境中的预测

离线强化学习（RL）为从预先收集的数据集中训练策略提供了有希望的途径，当收集额外的交互数据不可行时。然而，现有的离线RL方法通常假设平稳性或仅在测试时考虑合成扰动，这些假设在由突然的时间变化偏移特征的现实世界场景中往往无法成立。这些偏移可能导致部分可观测性，使智能体错误地感知其真实状态并降低性能。为克服这一挑战，我们引入了非平稳离线RL中的预测（FORL）框架，该框架统一了（i）基于条件扩散的候选状态生成，无需预设未来非平稳性的任何特定模式，以及（ii）零样本时间序列基础模型。FORL针对那些容易出现意外、可能非马尔可夫偏移的环境，要求智能体从每个回合开始时就表现出鲁棒性。通过在离线RL基准上进行实证评估，并通过现实世界的时间序列数据增强以模拟现实的非平稳性，我们证明FORL在与竞争基线相比时始终能提高性能。通过将零样本预测与智能体的经验相结合，我们旨在弥合离线RL与现实世界非平稳环境复杂性之间的差距。

Summary / 总结

The paper introduces FORL, a framework for Offline Reinforcement Learning (RL) in non-stationary environments, addressing the partial observability caused by abrupt, time-varying offsets. It combines conditional diffusion-based candidate state generation with zero-shot time-series foundation models. On offline RL benchmarks augmented with real-world time-series data to simulate realistic non-stationarity, FORL consistently improves performance over competitive baselines.

论文通过引入FORL框架，结合条件扩散候选状态生成和零样本时间序列基础模型，解决了在非平稳环境中使用离线强化学习（RL）训练策略的问题。该框架通过处理意外的、非马尔可夫性的偏移，提高了从每个新回合开始时的代理性能，并在离线RL基准上的实证评估中展示了相对于竞争基线的一致改进。

BEVDilation: LiDAR-Centric Multi-Modal Fusion for 3D Object Detection

Authors: Guowen Zhang, Chenhang He, Liyi Chen, Lei Zhang

First: 2025-12-02T17:50:33+00:00 · Latest: 2025-12-02T17:50:33+00:00

Comments: Accepted by AAAI 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Integrating LiDAR and camera information in the bird's eye view (BEV) representation has demonstrated its effectiveness in 3D object detection. However, because of the fundamental disparity in geometric accuracy between these sensors, indiscriminate fusion in previous methods often leads to degraded performance. In this paper, we propose BEVDilation, a novel LiDAR-centric framework that prioritizes LiDAR information in the fusion. By formulating image BEV features as implicit guidance rather than naive concatenation, our strategy effectively alleviates the spatial misalignment caused by image depth estimation errors. Furthermore, the image guidance can effectively help the LiDAR-centric paradigm to address the sparsity and semantic limitations of point clouds. Specifically, we propose a Sparse Voxel Dilation Block that mitigates the inherent point sparsity by densifying foreground voxels through image priors. Moreover, we introduce a Semantic-Guided BEV Dilation Block to enhance the LiDAR feature diffusion processing with image semantic guidance and long-range context capture. On the challenging nuScenes benchmark, BEVDilation achieves better performance than state-of-the-art methods while maintaining competitive computational efficiency. Importantly, our LiDAR-centric strategy demonstrates greater robustness to depth noise compared to naive fusion. The source code is available at https://github.com/gwenzhang/BEVDilation.

中文标题/摘要

标题：BEVDilation：以LiDAR为中心的多模态融合在3D目标检测中的应用

将LiDAR和相机信息整合到鸟瞰图（BEV）表示中，在3D目标检测中显示出其有效性。然而，由于这些传感器在几何精度上的根本差异，之前的方法中不加区分的融合往往导致性能下降。本文提出了一种新颖的以LiDAR为中心的框架BEVDilation，在融合中优先考虑LiDAR信息。通过将图像BEV特征用作隐式指导而非简单的拼接，我们的策略有效地缓解了由图像深度估计误差引起的空间错位。此外，图像指导可以有效地帮助以LiDAR为中心的范式解决点云的稀疏性和语义限制。具体而言，我们提出了一种稀疏体素膨胀块，通过图像先验稠密化前景体素来缓解固有的点稀疏性。此外，我们引入了一种语义引导的BEV膨胀块，通过图像语义指导和长程上下文捕获增强LiDAR特征扩散处理。在具有挑战性的nuScenes基准测试中，BEVDilation在保持有竞争力的计算效率的同时，实现了优于最先进方法的性能。重要的是，与简单融合相比，我们的以LiDAR为中心的策略对深度噪声表现出更强的鲁棒性。源代码可在https://github.com/gwenzhang/BEVDilation获取。

Summary / 总结

BEVDilation is a LiDAR-centric framework that prioritizes LiDAR information in multi-modal fusion for 3D object detection. It uses image BEV features as implicit guidance rather than naive concatenation, which alleviates spatial misalignment from image depth estimation errors and addresses the sparsity and semantic limitations of point clouds. On the nuScenes benchmark it outperforms state-of-the-art methods while maintaining competitive computational efficiency, and it is more robust to depth noise than naive fusion. The source code is available at https://github.com/gwenzhang/BEVDilation.

BEVDilation 提出了一种 LiDAR 为中心的框架，通过在鸟瞰图中融合 LiDAR 和相机数据来实现 3D 对象检测。它使用图像 BEV 特征作为隐式指导来缓解空间错位并解决点云的限制。关键发现包括在 nuScenes 挑战基准上的性能改进以及对深度噪声具有更强的鲁棒性，相较于先前的方法。引入了稀疏体素膨胀块和语义引导的 BEV 膨胀块来处理点稀疏性和增强特征扩散。
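
The idea of densifying foreground voxels with an image prior can be sketched on a toy 2D BEV grid. This is not the paper's Sparse Voxel Dilation Block: the 4-neighbour rule, the set-based grid, and the function name `dilate_foreground` are assumptions chosen only to show the principle that empty cells are filled next to LiDAR-occupied cells where an image-derived prior marks foreground.

```python
def dilate_foreground(occupied, foreground, height, width):
    """Fill empty 4-neighbours of occupied cells that the prior marks as foreground.

    occupied:   set of (row, col) BEV cells that contain LiDAR points.
    foreground: set of (row, col) cells an image-derived prior marks as foreground.
    """
    added = set()
    for r, c in occupied:
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < height and 0 <= nc < width
            if inside and (nr, nc) not in occupied and (nr, nc) in foreground:
                added.add((nr, nc))
    return occupied | added

# Toy 3x3 grid (hypothetical): one LiDAR-occupied cell, three foreground cells.
occupied = {(1, 1)}
foreground = {(1, 1), (1, 2), (2, 1)}
dense = dilate_foreground(occupied, foreground, height=3, width=3)
```

Background neighbours stay empty, so the dilation adds density only where the image prior suggests an object.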

Pruning AMR: Efficient Visualization of Implicit Neural Representations via Weight Matrix Analysis

Authors: Jennifer Zvonek, Andrew Gillette

First: 2025-12-02T17:49:01+00:00 · Latest: 2025-12-02T17:49:01+00:00

Abs · PDF · Code1 · Code2

Abstract

An implicit neural representation (INR) is a neural network that approximates a spatiotemporal function. Many memory-intensive visualization tasks, including modern 4D CT scanning methods, represent data natively as INRs. While INRs are prized for being more memory-efficient than traditional data stored on a lattice, many visualization tasks still require discretization to a regular grid. We present PruningAMR, an algorithm that builds a mesh with resolution adapted to geometric features encoded by the INR. To identify these geometric features, we use an interpolative decomposition pruning method on the weight matrices of the INR. The resulting pruned network is used to guide adaptive mesh refinement, enabling automatic mesh generation tailored to the underlying resolution of the function. Starting from a pre-trained INR--without access to its training data--we produce a variable resolution visualization with substantial memory savings.

中文标题/摘要

标题：剪枝AMR：通过权重矩阵分析高效可视化隐式神经表示

隐式神经表示（INR）是一种神经网络，用于近似时空函数。许多内存密集型可视化任务，包括现代4D CT扫描方法，将数据原生表示为INR。虽然INR因其比传统格点存储数据更节省内存而受到青睐，但许多可视化任务仍然需要将其离散化到规则网格。我们提出了PruningAMR算法，该算法构建了一个分辨率适应于INR编码的几何特征的网格。为了识别这些几何特征，我们使用插值分解剪枝方法对INR的权重矩阵进行剪枝。剪枝后的网络用于指导自适应网格细化，从而实现针对函数底层分辨率的自动网格生成。从预训练的INR开始（无需访问其训练数据），我们生成了一个具有可变分辨率的可视化，实现了显著的内存节省。

Summary / 总结

The research aims to improve the visualization of implicit neural representations (INRs) by developing an efficient algorithm called PruningAMR. This algorithm uses weight matrix analysis to identify geometric features in INRs and generate a mesh resolution adapted to these features. The key finding is that PruningAMR can produce a variable resolution visualization starting from a pre-trained INR, leading to substantial memory savings without requiring access to the training data.

研究旨在通过开发一种高效的算法PruningAMR来改进隐式神经表示（INRs）的可视化。该算法利用权重矩阵分析来识别INRs中的几何特征，并修剪网络以指导自适应网格细化。关键实验发现是从预训练的INR开始，该方法可以生成具有显著内存节省的可变分辨率可视化，而无需访问训练数据。
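
To show what resolution "adapted to geometric features" means, here is a minimal 1D adaptive mesh refinement sketch. It does not implement the interpolative-decomposition pruning from the paper; it substitutes a generic midpoint interpolation-error indicator, and the `tanh` function is a hypothetical stand-in for a pre-trained INR.

```python
import math

def refine(f, a, b, tol, max_depth=12):
    """Return sorted breakpoints of an adaptive 1D mesh on [a, b].

    A cell is split when linear interpolation of its endpoints misses
    f at the midpoint by more than tol (a generic error indicator).
    """
    mid = 0.5 * (a + b)
    err = abs(f(mid) - 0.5 * (f(a) + f(b)))
    if max_depth == 0 or err <= tol:
        return [a, b]
    left = refine(f, a, mid, tol, max_depth - 1)
    right = refine(f, mid, b, tol, max_depth - 1)
    return left + right[1:]  # drop the duplicated midpoint

# Stand-in for a trained INR: smooth except for a sharp feature near x = 0.3.
def f(x):
    return math.tanh(40.0 * (x - 0.3))

mesh = refine(f, 0.0, 1.0, tol=0.01)
widths = [hi - lo for lo, hi in zip(mesh, mesh[1:])]
```

Cells stay coarse where the function is flat and shrink around the sharp feature, which is the source of the memory savings compared with a uniform grid at the finest resolution.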