arXiv Paper Digest

Voxify3D: Pixel Art Meets Volumetric Rendering

Authors: Yi-Chuan Huang, Jiewen Chan, Hao-Jen Chien, Yu-Lun Liu

First: 2025-12-08T18:59:58+00:00 · Latest: 2025-12-08T18:59:58+00:00

Comments: Project page: https://yichuanh.github.io/Voxify-3D/

Abstract

Voxel art is a distinctive stylization widely used in games and digital media, yet automated generation from 3D meshes remains challenging due to conflicting requirements of geometric abstraction, semantic preservation, and discrete color coherence. Existing methods either over-simplify geometry or fail to achieve the pixel-precise, palette-constrained aesthetics of voxel art. We introduce Voxify3D, a differentiable two-stage framework bridging 3D mesh optimization with 2D pixel art supervision. Our core innovation lies in the synergistic integration of three components: (1) orthographic pixel art supervision that eliminates perspective distortion for precise voxel-pixel alignment; (2) patch-based CLIP alignment that preserves semantics across discretization levels; (3) palette-constrained Gumbel-Softmax quantization enabling differentiable optimization over discrete color spaces with controllable palette strategies. This integration addresses fundamental challenges: semantic preservation under extreme discretization, pixel-art aesthetics through volumetric rendering, and end-to-end discrete optimization. Experiments show superior performance (37.12 CLIP-IQA, 77.90% user preference) across diverse characters and controllable abstraction (2-8 colors, 20x-50x resolutions). Project page: https://yichuanh.github.io/Voxify-3D/
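As a rough illustration of component (3), the sketch below draws a Gumbel-Softmax sample over a small fixed palette and blends the palette colors with the relaxed weights. It is a minimal stdlib sketch of the general technique, not the authors' implementation: the palette values, the logits, the temperature `tau`, and the helper names `gumbel_softmax`, `soft_color`, and `hard_color` are all assumptions made for the example.

```python
import math
import random

def gumbel_softmax(logits, tau, rng):
    """Return relaxed one-hot weights over palette entries (they sum to 1)."""
    noisy = []
    for logit in logits:
        # Clamp the uniform draw so that both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def soft_color(weights, palette):
    """Blend palette colors with the relaxed weights (the training-time color)."""
    return tuple(sum(w * c[ch] for w, c in zip(weights, palette)) for ch in range(3))

def hard_color(logits, palette):
    """Pick the single most likely palette entry (the final discrete color)."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return palette[best]

# Hypothetical 4-color palette (RGB in [0, 1]) and logits for one voxel.
PALETTE = [(0.1, 0.1, 0.1), (0.9, 0.2, 0.2), (0.2, 0.6, 0.9), (0.95, 0.9, 0.6)]
logits = [0.2, 2.5, -1.0, 0.4]

rng = random.Random(0)
weights = gumbel_softmax(logits, tau=0.5, rng=rng)
blended = soft_color(weights, PALETTE)
final = hard_color(logits, PALETTE)
```

In an autodiff framework the blended color would carry gradients back to the logits, while a lower `tau` pushes the weights toward a one-hot choice. The hard pick is what a palette-constrained output would ultimately show.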

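Title / Abstract (translated from Chinese)

Title: Voxify3D: Pixel Art Meets Volumetric Rendering

Voxel art is a distinctive stylization widely used in games and digital media, but generating it automatically from 3D meshes remains challenging because geometric abstraction, semantic preservation, and discrete color coherence impose conflicting requirements. Existing methods either over-simplify geometry or fail to achieve the pixel-precise, palette-constrained look of voxel art. We propose Voxify3D, a differentiable two-stage framework that combines 3D mesh optimization with 2D pixel art supervision. Our core innovation is the coordinated integration of three components: (1) orthographic pixel art supervision, which removes perspective distortion and gives precise voxel-pixel alignment; (2) patch-based CLIP alignment, which preserves semantics across discretization levels; and (3) palette-constrained Gumbel-Softmax quantization, which makes differentiable optimization over discrete color spaces possible and supports controllable palette strategies. This integration resolves fundamental challenges: semantic preservation under extreme discretization, pixel-art aesthetics through volumetric rendering, and end-to-end discrete optimization. Experiments show superior performance across diverse characters (CLIP-IQA: 37.12, user preference: 77.90%) and controllable abstraction (2-8 colors, 20x-50x resolutions). Project page: https://yichuanh.github.io/Voxify-3D/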

Summary

Voxify3D addresses the challenge of generating voxel art from 3D meshes by introducing a differentiable two-stage framework. It integrates orthographic pixel art supervision, patch-based CLIP alignment, and palette-constrained Gumbel-Softmax quantization to achieve precise voxel-pixel alignment, semantic preservation, and discrete color coherence. Experiments demonstrate that Voxify3D outperforms existing methods with higher CLIP-IQA scores and user preference rates across various characters and abstraction levels.

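Voxify3D tackles the challenge of generating voxel art from 3D meshes with a differentiable two-stage framework. The framework combines orthographic pixel art supervision, patch-based CLIP alignment, and palette-constrained Gumbel-Softmax quantization to achieve precise voxel-pixel alignment, semantic preservation, and pixel-art aesthetics. Experiments report a CLIP-IQA score of 37.12 and a 77.90% user preference across diverse characters and abstraction levels.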

UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation

Authors: Jiehui Huang, Yuechen Zhang, Xu He, Yuan Gao, Zhi Cen, Bin Xia, Yan Zhou, Xin Tao, Pengfei Wan, Jiaya Jia

First: 2025-12-08T18:59:01+00:00 · Latest: 2025-12-08T18:59:01+00:00

Comments: Project Website https://jackailab.github.io/Projects/UnityVideo

Abstract

Recent video generation models demonstrate impressive synthesis capabilities but remain limited by single-modality conditioning, constraining their holistic world understanding. This stems from insufficient cross-modal interaction and limited modal diversity for comprehensive world knowledge representation. To address these limitations, we introduce UnityVideo, a unified framework for world-aware video generation that jointly learns across multiple modalities (segmentation masks, human skeletons, DensePose, optical flow, and depth maps) and training paradigms. Our approach features two core components: (1) dynamic noising to unify heterogeneous training paradigms, and (2) a modality switcher with an in-context learner that enables unified processing via modular parameters and contextual learning. We contribute a large-scale unified dataset with 1.3M samples. Through joint optimization, UnityVideo accelerates convergence and significantly enhances zero-shot generalization to unseen data. We demonstrate that UnityVideo achieves superior video quality, consistency, and improved alignment with physical world constraints. Code and data can be found at: https://github.com/dvlab-research/UnityVideo

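Title / Abstract (translated from Chinese)

Title: UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation

Recent video generation models show impressive synthesis capabilities but are still restricted to single-modality conditioning, which limits their holistic understanding of the world. This stems from insufficient cross-modal interaction and limited modal diversity, which prevent a comprehensive representation of world knowledge. To address these limitations, we introduce UnityVideo, a unified framework for world-aware video generation that learns jointly across multiple modalities (segmentation masks, human skeletons, DensePose, optical flow, and depth maps) and training paradigms. Our method has two core components: (1) dynamic noising, which unifies heterogeneous training paradigms, and (2) a modality switcher with an in-context learner, which enables unified processing through modular parameters and contextual learning. We contribute a large-scale unified dataset of 1.3 million samples. Through joint optimization, UnityVideo accelerates convergence and significantly improves zero-shot generalization to unseen data. We show that UnityVideo produces higher-quality, more consistent videos that align better with physical world constraints. Code and data are available at https://github.com/dvlab-research/UnityVideo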

Summary

UnityVideo is a unified framework for world-aware video generation that integrates multiple modalities such as segmentation masks, human skeletons, and optical flow to enhance video synthesis capabilities. It uses dynamic noising to unify different training paradigms and a modality switcher with an in-context learner for modular and contextual processing. The framework significantly improves zero-shot generalization and video quality, achieving better alignment with physical world constraints. A large-scale unified dataset with 1.3 million samples is provided for training and evaluation.

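UnityVideo is a unified framework for world-aware video generation that integrates multiple modalities and training paradigms. It introduces dynamic noising to unify heterogeneous training and uses a modality switcher with an in-context learner for modular, contextual processing. The framework accelerates convergence and improves zero-shot generalization. Key findings include better video quality, better consistency, and closer alignment with physical world constraints.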

TV2TV: A Unified Framework for Interleaved Language and Video Generation

Authors: Xiaochuang Han, Youssef Emad, Melissa Hall, John Nguyen, Karthik Padthe, Liam Robbins, Amir Bar, Delong Chen, Michal Drozdzal, Maha Elbayad, Yushi Hu, Shang-Wen Li, Sreya Dutta Roy, Jakob Verbeek, XuDong Wang, Marjan Ghazvininejad, Luke Zettlemoyer, Emily Dinan

First: 2025-12-04T18:59:09+00:00 · Latest: 2025-12-08T18:58:00+00:00

Abstract

Video generation models are rapidly advancing, but can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should happen next. In this paper, we introduce a new class of omni video-text models that integrate ideas from recent LM reasoning advances to address this challenge. More specifically, we present TV2TV, a unified generative modeling framework which decomposes video generation into an interleaved text and video generation process. TV2TV jointly learns language modeling (next-token prediction) and video flow matching (next-frame prediction) using a Mixture-of-Transformers (MoT) architecture. At inference time, TV2TV decides when to alternate between generating text and video frames, allowing the model to "think in words" about subsequent content before "acting in pixels" to produce frames. This design offloads much of the responsibility for deciding what should happen next to the language modeling tower, enabling improved visual quality and prompt alignment of generated videos. It also enables fine-grained controllability, allowing users to modify the video generation trajectory through text interventions at any point in the process. In controlled experiments on video game data, TV2TV demonstrates substantial improvements in both visual quality and controllability. TV2TV also scales to natural videos, as we show by augmenting sports videos with interleaved natural language action descriptions using vision-language models (VLMs). Training TV2TV on this corpus yields strong visual quality and prompt alignment, showcasing the model's ability to reason about and generate complex real-world action sequences. Together, these results highlight TV2TV as a promising step toward video generation with open-ended textual reasoning and control.

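Title / Abstract (translated from Chinese)

Title: TV2TV: A Unified Framework for Interleaved Language and Video Generation

Video generation models are advancing rapidly, but they can still struggle with complex video outputs that require substantial semantic branching or repeated high-level reasoning about what should happen next. In this paper, we introduce a new class of omni video-text models that draws on recent advances in language model reasoning to address this challenge. Specifically, we present TV2TV, a unified generative modeling framework that decomposes video generation into an interleaved process of text and video generation. TV2TV uses a Mixture-of-Transformers (MoT) architecture to jointly learn language modeling (next-token prediction) and video flow matching (next-frame prediction). At inference time, TV2TV decides when to alternate between generating text and generating video frames, so the model can "think in words" about upcoming content before "acting in pixels" to produce frames. This design shifts much of the responsibility for deciding what should happen next to the language modeling tower, improving the visual quality and prompt alignment of the generated videos. It also enables fine-grained control, letting users change the video generation trajectory through text interventions at any point in the process. In controlled experiments on video game data, TV2TV shows substantial improvements in both visual quality and controllability. TV2TV also scales to natural videos, as we show by using vision-language models (VLMs) to augment sports videos with interleaved natural language action descriptions. Training TV2TV on this corpus yields strong visual quality and prompt alignment, demonstrating the model's ability to reason about and generate complex real-world action sequences. Together, these results highlight TV2TV as a promising step toward video generation with open-ended textual reasoning and control.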

Summary

TV2TV is a unified generative modeling framework that addresses the challenge of generating complex videos by integrating language and video generation processes. It uses a Mixture-of-Transformers architecture to jointly learn language modeling and video flow matching, allowing the model to 'think in words' before 'acting in pixels.' Experiments on video game data show significant improvements in visual quality and controllability, and the model scales to natural videos by augmenting sports videos with text descriptions, demonstrating strong visual quality and prompt alignment.

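TV2TV is a unified generative modeling framework that interleaves text and video generation to improve complex video outputs. It uses a Mixture-of-Transformers architecture to learn language modeling and video flow matching jointly. Experiments show that TV2TV improves visual quality and controllability, particularly on video game data. It also scales to natural videos when vision-language models are used to interleave action descriptions, demonstrating its ability to reason about and generate complex real-world action sequences.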

One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation

Authors: Yuan Gao, Chen Chen, Tianrong Chen, Jiatao Gu

First: 2025-12-08T18:57:26+00:00 · Latest: 2025-12-08T18:57:26+00:00

Abstract

Visual generative models (e.g., diffusion models) typically operate in compressed latent spaces to balance training efficiency and sample quality. In parallel, there has been growing interest in leveraging high-quality pre-trained visual representations, either by aligning them inside VAEs or directly within the generative model. However, adapting such representations remains challenging due to fundamental mismatches between understanding-oriented features and generation-friendly latent spaces. Representation encoders benefit from high-dimensional latents that capture diverse hypotheses for masked regions, whereas generative models favor low-dimensional latents that must faithfully preserve injected noise. This discrepancy has led prior work to rely on complex objectives and architectures. In this work, we propose FAE (Feature Auto-Encoder), a simple yet effective framework that adapts pre-trained visual representations into low-dimensional latents suitable for generation using as little as a single attention layer, while retaining sufficient information for both reconstruction and understanding. The key is to couple two separate deep decoders: one trained to reconstruct the original feature space, and a second that takes the reconstructed features as input for image generation. FAE is generic; it can be instantiated with a variety of self-supervised encoders (e.g., DINO, SigLIP) and plugged into two distinct generative families: diffusion models and normalizing flows. Across class-conditional and text-to-image benchmarks, FAE achieves strong performance. For example, on ImageNet 256x256, our diffusion model with CFG attains a near state-of-the-art FID of 1.29 (800 epochs) and 1.70 (80 epochs). Without CFG, FAE reaches the state-of-the-art FID of 1.48 (800 epochs) and 2.08 (80 epochs), demonstrating both high quality and fast learning.

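Title / Abstract (translated from Chinese)

Title: One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation

Visual generative models (e.g., diffusion models) typically operate in compressed latent spaces to balance training efficiency and sample quality. At the same time, interest has grown in using high-quality pre-trained visual representations, either by aligning them inside VAEs or directly within the generative model. However, adapting these representations remains challenging because understanding-oriented features and generation-friendly latent spaces are fundamentally mismatched. Representation encoders benefit from high-dimensional latents that capture diverse hypotheses for masked regions, whereas generative models prefer low-dimensional latents that must faithfully preserve injected noise. This discrepancy has led prior work to rely on complex objectives and architectures. In this work, we propose FAE (Feature Auto-Encoder), a simple yet effective framework that adapts pre-trained visual representations into low-dimensional latents suitable for generation using as little as a single attention layer, while retaining enough information for both reconstruction and understanding. The key is to couple two separate deep decoders: one reconstructs the original feature space, and the other takes the reconstructed features as input for image generation. FAE is generic; it can be instantiated with a variety of self-supervised encoders (e.g., DINO, SigLIP) and plugged into two distinct generative families: diffusion models and normalizing flows. FAE performs strongly on class-conditional and text-to-image benchmarks. For example, on ImageNet 256x256, our diffusion model with CFG reaches a near state-of-the-art FID of 1.29 (800 epochs) and 1.70 (80 epochs). Without CFG, FAE reaches a state-of-the-art FID of 1.48 (800 epochs) and 2.08 (80 epochs), demonstrating both high quality and fast learning.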

Summary

This work addresses the challenge of adapting pre-trained visual representations for image generation by proposing FAE (Feature Auto-Encoder), which uses a single attention layer to transform high-dimensional features into low-dimensional latents suitable for generation. FAE consists of two decoders: one for reconstructing the original feature space and another for generating images. The framework is versatile and can be used with various self-supervised encoders and generative models. Experiments show that FAE achieves strong performance, with near state-of-the-art FID scores on ImageNet 256x256, demonstrating both high quality and fast learning capabilities.

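This work proposes FAE (Feature Auto-Encoder), which uses a single attention layer to convert high-dimensional features into low-dimensional latents suitable for generation, addressing the problem of adapting pre-trained visual representations for image generation. FAE contains two decoders: one reconstructs the original feature space and the other generates images. The framework is generic and can be combined with a variety of self-supervised encoders and generative models. Experiments show strong performance on ImageNet, with FID scores at 80 and 800 epochs close to the current best.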

Normalize Filters! Classical Wisdom for Deep Vision

Authors: Gustavo Perez, Stella X. Yu

First: 2025-06-04T19:32:42+00:00 · Latest: 2025-12-08T18:55:58+00:00

Abstract

Classical image filters, such as those for averaging or differencing, are carefully normalized to ensure consistency, interpretability, and to avoid artifacts like intensity shifts, halos, or ringing. In contrast, convolutional filters learned end-to-end in deep networks lack such constraints. Although they may resemble wavelets and blob/edge detectors, they are not normalized in the same or any way. Consequently, when images undergo atmospheric transfer, their responses become distorted, leading to incorrect outcomes. We address this limitation by proposing filter normalization, followed by learnable scaling and shifting, akin to batch normalization. This simple yet effective modification ensures that the filters are atmosphere-equivariant, enabling co-domain symmetry. By integrating classical filtering principles into deep learning (applicable to both convolutional neural networks and convolution-dependent vision transformers), our method achieves significant improvements on artificial and natural intensity variation benchmarks. Our ResNet34 could even outperform CLIP by a large margin. Our analysis reveals that unnormalized filters degrade performance, whereas filter normalization regularizes learning, promotes diversity, and improves robustness and generalization.
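The effect of normalization can be checked on a toy 1D signal. The sketch below is illustrative only and rests on two assumptions that the abstract does not spell out: that "atmospheric transfer" can be modeled as a gain and offset, x -> a*x + b, and that normalization rescales the positive and negative filter weights separately so that each part has unit absolute sum. The names `normalize_filter`, `correlate`, and `filter_layer`, and the `gamma`/`beta` parameters standing in for the learnable scale and shift, are hypothetical.

```python
def normalize_filter(w):
    """Rescale positive and negative weights separately to unit absolute sum."""
    pos = sum(v for v in w if v > 0)
    neg = -sum(v for v in w if v < 0)
    return [v / pos if v > 0 else (v / neg if v < 0 else 0.0) for v in w]

def correlate(signal, w):
    """Valid-mode sliding dot product of a 1D signal with a filter."""
    k = len(w)
    return [sum(w[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def filter_layer(signal, w, gamma=1.0, beta=0.0):
    """Normalized filtering followed by a scale and shift (gamma, beta)."""
    return [gamma * r + beta for r in correlate(signal, normalize_filter(w))]

x = [1.0, 2.0, 4.0, 7.0, 11.0, 16.0]
a, b = 0.5, 3.0                      # assumed gain and offset
x_atm = [a * v + b for v in x]

edge_w = [-3.0, 1.0, 5.0]            # differencing-style filter (mixed signs)
avg_w = [2.0, 2.0, 4.0]              # averaging-style filter (all positive)

edge, edge_atm = filter_layer(x, edge_w), filter_layer(x_atm, edge_w)
avg, avg_atm = filter_layer(x, avg_w), filter_layer(x_atm, avg_w)
raw, raw_atm = correlate(x, edge_w), correlate(x_atm, edge_w)
```

Under these assumptions the normalized differencing filter sums to zero, so the offset vanishes and only the gain passes through, while the normalized averaging filter sums to one and reproduces the full transform a*r + b. The unnormalized filter sums to 3, so its response picks up an extra 3*b.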

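Title / Abstract (translated from Chinese)

Title: Normalize Filters! Classical Wisdom for Deep Vision

Classical image filters, such as averaging or differencing filters, are carefully normalized to ensure consistency and interpretability and to avoid artifacts such as intensity shifts, halos, or ringing. In contrast, convolutional filters learned end-to-end in deep networks lack such constraints. Although they may resemble wavelets and blob/edge detectors, they are not normalized in the same way, or in any way. Consequently, when images undergo atmospheric transfer, the filter responses become distorted and lead to incorrect outcomes. We address this limitation by proposing filter normalization followed by learnable scaling and shifting, akin to batch normalization. This simple yet effective modification makes the filters atmosphere-equivariant, enabling co-domain symmetry. By integrating classical filtering principles into deep learning (applicable to both convolutional neural networks and convolution-dependent vision transformers), our method achieves significant improvements on artificial and natural intensity variation benchmarks. Our ResNet34 can even outperform CLIP by a large margin. Our analysis shows that unnormalized filters degrade performance, whereas filter normalization regularizes learning, promotes diversity, and improves robustness and generalization.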

Summary

This paper addresses the issue of unnormalized convolutional filters in deep networks, which can lead to distorted responses when images undergo atmospheric transfer. The authors propose filter normalization, followed by learnable scaling and shifting, inspired by classical image filters. This method ensures that the filters are atmosphere-equivariant and improves performance on various benchmarks, even outperforming CLIP in some cases. The analysis shows that normalized filters enhance robustness and generalization compared to unnormalized ones.

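This paper addresses the problem that unnormalized convolutional filters in deep networks can produce distorted responses when images undergo atmospheric transfer. The authors propose filter normalization combined with learnable scaling and shifting, inspired by classical image filters. The method makes the filters atmosphere-equivariant and yields significant improvements on artificial and natural intensity variation benchmarks. A ResNet34 with filter normalization can even outperform CLIP. The study shows that normalized filters improve performance, regularize learning, promote diversity, and strengthen robustness and generalization.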

An Adaptive Multi-Layered Honeynet Architecture for Threat Behavior Analysis via Deep Learning

Authors: Lukas Johannes Möller

First: 2025-12-08T18:55:26+00:00 · Latest: 2025-12-08T18:55:26+00:00

Abstract

The escalating sophistication and variety of cyber threats have rendered static honeypots inadequate, necessitating adaptive, intelligence-driven deception. In this work, ADLAH is introduced: an Adaptive Deep Learning Anomaly Detection Honeynet designed to maximize high-fidelity threat intelligence while minimizing cost through autonomous orchestration of infrastructure. The principal contribution is offered as an end-to-end architectural blueprint and vision for an AI-driven deception platform. Feasibility is evidenced by a functional prototype of the central decision mechanism, in which a reinforcement learning (RL) agent determines, in real time, when sessions should be escalated from low-interaction sensor nodes to dynamically provisioned, high-interaction honeypots. Because sufficient live data were unavailable, field-scale validation is not claimed; instead, design trade-offs and limitations are detailed, and a rigorous roadmap toward empirical evaluation at scale is provided. Beyond selective escalation and anomaly detection, the architecture pursues automated extraction, clustering, and versioning of bot attack chains, a core capability motivated by the empirical observation that exposed services are dominated by automated traffic. Together, these elements delineate a practical path toward cost-efficient capture of high-value adversary behavior, systematic bot versioning, and the production of actionable threat intelligence.

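Title / Abstract (translated from Chinese)

Title: An Adaptive Multi-Layered Honeynet Architecture for Threat Behavior Analysis via Deep Learning

As cyber threats grow more sophisticated and varied, static honeypots have become inadequate, and adaptive, intelligence-driven deception is needed. This work introduces ADLAH, an Adaptive Deep Learning Anomaly Detection Honeynet designed to maximize high-fidelity threat intelligence while minimizing cost through autonomous orchestration of infrastructure. The principal contribution is an end-to-end architectural blueprint and vision for an AI-driven deception platform. Feasibility is demonstrated by a functional prototype of the central decision mechanism, in which a reinforcement learning (RL) agent decides in real time when sessions should be escalated from low-interaction sensor nodes to dynamically provisioned, high-interaction honeypots. Because sufficient live data were unavailable, field-scale validation is not claimed; instead, design trade-offs and limitations are described in detail, and a rigorous roadmap toward empirical evaluation at scale is provided. Beyond selective escalation and anomaly detection, the architecture pursues automated extraction, clustering, and versioning of bot attack chains, a core capability motivated by the empirical observation that exposed services are dominated by automated traffic. Together, these elements outline a practical path toward cost-efficient capture of high-value adversary behavior, systematic bot versioning, and the production of actionable threat intelligence.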

OpenVE-3M: A Large-Scale High-Quality Dataset for Instruction-Guided Video Editing

Authors: Haoyang He, Jie Wang, Jiangning Zhang, Zhucun Xue, Xingyuan Bu, Qiangpeng Yang, Shilei Wen, Lei Xie

First: 2025-12-08T18:55:07+00:00 · Latest: 2025-12-08T18:55:07+00:00

Comments: 38 pages

Abstract

The quality and diversity of instruction-based image editing datasets are continuously increasing, yet large-scale, high-quality datasets for instruction-based video editing remain scarce. To address this gap, we introduce OpenVE-3M, an open-source, large-scale, and high-quality dataset for instruction-based video editing. It comprises two primary categories: spatially-aligned edits (Global Style, Background Change, Local Change, Local Remove, Local Add, and Subtitles Edit) and non-spatially-aligned edits (Camera Multi-Shot Edit and Creative Edit). All edit types are generated via a meticulously designed data pipeline with rigorous quality filtering. OpenVE-3M surpasses existing open-source datasets in terms of scale, diversity of edit types, instruction length, and overall quality. Furthermore, to address the lack of a unified benchmark in the field, we construct OpenVE-Bench, containing 431 video-edit pairs that cover a diverse range of editing tasks with three key metrics highly aligned with human judgment. We present OpenVE-Edit, a 5B model trained on our dataset that demonstrates remarkable efficiency and effectiveness by setting a new state-of-the-art on OpenVE-Bench, outperforming all prior open-source models including a 14B baseline. Project page is at https://github.com/lewandofskee/OpenVE.

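Title / Abstract (translated from Chinese)

Title: OpenVE-3M: A Large-Scale High-Quality Dataset for Instruction-Guided Video Editing

The quality and diversity of instruction-based image editing datasets keep improving, but large-scale, high-quality datasets for instruction-based video editing remain scarce. To fill this gap, we introduce OpenVE-3M, an open-source, large-scale, high-quality dataset for instruction-based video editing. It comprises two main categories of edits: spatially-aligned edits (Global Style, Background Change, Local Change, Local Remove, Local Add, and Subtitles Edit) and non-spatially-aligned edits (Camera Multi-Shot Edit and Creative Edit). All edit types are generated by a carefully designed data pipeline with rigorous quality filtering. OpenVE-3M surpasses existing open-source datasets in scale, diversity of edit types, instruction length, and overall quality. In addition, to address the lack of a unified benchmark in the field, we construct OpenVE-Bench, which contains 431 video-edit pairs covering a wide range of editing tasks, with three key metrics that are highly aligned with human judgment. We present OpenVE-Edit, a 5B model trained on our dataset that sets a new state-of-the-art on OpenVE-Bench, outperforming all prior open-source models, including a 14B baseline. The project page is at https://github.com/lewandofskee/OpenVE.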

Summary

OpenVE-3M is introduced to fill the gap in large-scale, high-quality datasets for instruction-based video editing. It includes spatially-aligned and non-spatially-aligned edits, generated through a quality-filtered data pipeline. OpenVE-3M surpasses existing datasets in scale, diversity, and quality. Additionally, OpenVE-Bench is constructed with 431 video-edit pairs to evaluate models, and OpenVE-Edit, a 5B model trained on this dataset, sets a new state-of-the-art on OpenVE-Bench, outperforming previous models.

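The paper introduces OpenVE-3M, a large-scale, high-quality dataset for instruction-based video editing, built to address the scarcity of such datasets. The dataset covers eight edit types in two categories (spatially-aligned and non-spatially-aligned) and is generated through a rigorous data pipeline. OpenVE-3M surpasses existing datasets in scale, diversity, and quality. The authors also construct OpenVE-Bench, which contains 431 video-edit pairs, and train the OpenVE-Edit model, which outperforms previous models on OpenVE-Bench. The project page is at https://github.com/lewandofskee/OpenVE.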

WorldReel: 4D Video Generation with Consistent Geometry and Motion Modeling

Authors: Shaoheng Fang, Hanwen Jiang, Yunpeng Bai, Niloy J. Mitra, Qixing Huang

First: 2025-12-08T18:54:12+00:00 · Latest: 2025-12-08T18:54:12+00:00

Abstract

Recent video generators achieve striking photorealism, yet remain fundamentally inconsistent in 3D. We present WorldReel, a 4D video generator that is natively spatio-temporally consistent. WorldReel jointly produces RGB frames together with 4D scene representations, including pointmaps, camera trajectory, and dense flow mapping, enabling coherent geometry and appearance modeling over time. Our explicit 4D representation enforces a single underlying scene that persists across viewpoints and dynamic content, yielding videos that remain consistent even under large non-rigid motion and significant camera movement. We train WorldReel by carefully combining synthetic and real data: synthetic data providing precise 4D supervision (geometry, motion, and camera), while real videos contribute visual diversity and realism. This blend allows WorldReel to generalize to in-the-wild footage while preserving strong geometric fidelity. Extensive experiments demonstrate that WorldReel sets a new state-of-the-art for consistent video generation with dynamic scenes and moving cameras, improving metrics of geometric consistency, motion coherence, and reducing view-time artifacts over competing methods. We believe that WorldReel brings video generation closer to 4D-consistent world modeling, where agents can render, interact, and reason about scenes through a single and stable spatiotemporal representation.

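Title / Abstract (translated from Chinese)

Title: WorldReel: 4D Video Generation with Consistent Geometry and Motion Modeling

Recent video generators achieve striking realism, yet remain fundamentally inconsistent in 3D. We present WorldReel, a 4D video generator that is natively spatio-temporally consistent. WorldReel jointly generates RGB frames and 4D scene representations, including pointmaps, camera trajectory, and dense flow mapping, enabling coherent geometry and appearance modeling over time. Our explicit 4D representation enforces a single underlying scene that persists across viewpoints and dynamic content, so videos remain consistent even under large non-rigid motion and significant camera movement. We train WorldReel by carefully combining synthetic and real data: synthetic data provide precise 4D supervision (geometry, motion, and camera), while real videos contribute visual diversity and realism. This blend lets WorldReel generalize to in-the-wild footage while preserving strong geometric fidelity. Extensive experiments show that WorldReel sets a new state-of-the-art for consistent video generation with dynamic scenes and moving cameras, improving geometric consistency and motion coherence metrics and reducing view-time artifacts relative to competing methods. We believe WorldReel brings video generation closer to 4D-consistent world modeling, in which agents can render, interact with, and reason about scenes through a single, stable spatiotemporal representation.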

Summary

WorldReel is a 4D video generator that produces spatio-temporally consistent videos by jointly generating RGB frames and 4D scene representations, including pointmaps, camera trajectory, and dense flow mapping. It combines synthetic and real data to ensure precise 4D supervision and visual diversity, enabling strong geometric fidelity and consistent appearance over time. Experiments show that WorldReel outperforms existing methods in terms of geometric consistency, motion coherence, and reduced view-time artifacts in dynamic scenes with moving cameras.

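WorldReel is a 4D video generator that produces spatio-temporally consistent videos by jointly generating RGB frames and 4D scene representations (pointmaps, camera trajectory, and dense flow mapping). It combines synthetic and real data to obtain both precise 4D supervision and visual diversity, giving strong geometric fidelity and coherence in dynamic scenes with moving cameras. Experiments show that WorldReel outperforms existing methods in geometric consistency and motion coherence and reduces view-time artifacts.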

Graph-Based Learning of Spectro-Topographical EEG Representations with Gradient Alignment for Brain-Computer Interfaces

Authors: Prithila Angkan, Amin Jalali, Paul Hungler, Ali Etemad

First: 2025-12-08T18:54:11+00:00 · Latest: 2025-12-08T18:54:11+00:00

Abstract

We present a novel graph-based learning of EEG representations with gradient alignment (GEEGA) that leverages multi-domain information to learn EEG representations for brain-computer interfaces. Our model leverages graph convolutional networks to fuse embeddings from frequency-based topographical maps and time-frequency spectrograms, capturing inter-domain relationships. GEEGA addresses the challenge of achieving high inter-class separability, which arises from the temporally dynamic and subject-sensitive nature of EEG signals, by incorporating the center loss and pairwise difference loss. Additionally, GEEGA incorporates a gradient alignment strategy to resolve conflicts between gradients from different domains and the fused embeddings, ensuring that discrepancies, where gradients point in conflicting directions, are aligned toward a unified optimization direction. We validate the efficacy of our method through extensive experiments on three publicly available EEG datasets: BCI-2a, CL-Drive and CLARE. Comprehensive ablation studies further highlight the impact of various components of our model.
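The abstract does not give the exact alignment rule, so the sketch below substitutes a well-known stand-in: projection-based gradient surgery in the style of PCGrad. When a domain gradient has a negative dot product with the fused-embedding gradient, its conflicting component is projected out. The gradient values and the names `g_topo`, `g_spec`, `g_fused`, and `align` are hypothetical, and this should not be read as the authors' procedure.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def align(grad, reference):
    """Remove the component of grad that opposes reference, if there is one."""
    d = dot(grad, reference)
    if d >= 0.0:
        return list(grad)            # no conflict: leave the gradient unchanged
    # d < 0 implies reference is nonzero, so the division is safe.
    scale = d / dot(reference, reference)
    return [g - scale * r for g, r in zip(grad, reference)]

# Toy 2D gradients for the two domain branches and the fused branch.
g_topo = [1.0, -2.0]    # topographical-map branch (conflicts with g_fused)
g_spec = [0.5, 1.0]     # spectrogram branch (agrees with g_fused)
g_fused = [1.0, 1.0]    # fused-embedding branch, used as the reference

aligned = [align(g, g_fused) for g in (g_topo, g_spec)]
update = [sum(parts) for parts in zip(g_fused, *aligned)]
```

After projection, the conflicting gradient is orthogonal to the reference, so summing the three gradients no longer cancels progress along the fused direction.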

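Title / Abstract (translated from Chinese)

Title: Graph-Based Learning of Spectro-Topographical EEG Representations with Gradient Alignment for Brain-Computer Interfaces

We propose a new graph-based method for learning EEG representations with gradient alignment (GEEGA), which uses multi-domain information to learn EEG representations for brain-computer interfaces. Our model uses graph convolutional networks to fuse embeddings from frequency-based topographical maps and time-frequency spectrograms, capturing inter-domain relationships. GEEGA incorporates center loss and pairwise difference loss to address the difficulty of achieving high inter-class separability, which arises from the temporally dynamic and subject-sensitive nature of EEG signals. In addition, GEEGA introduces a gradient alignment strategy to resolve conflicts between the gradients of the different domains and those of the fused embeddings, so that discrepancies (gradients pointing in conflicting directions) are aligned toward a unified optimization direction. We validate the method through extensive experiments on three public EEG datasets (BCI-2a, CL-Drive, and CLARE). Comprehensive ablation studies further highlight the impact of each component of our model.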

Summary

The research aims to improve EEG representations for brain-computer interfaces by using graph convolutional networks to fuse frequency-based topographical maps and time-frequency spectrograms. The method, GEEGA, adds center loss and pairwise difference loss to increase inter-class separability, and a gradient alignment strategy to resolve conflicts between the gradients of the individual domains and the fused embeddings. Experiments on three public EEG datasets (BCI-2a, CL-Drive, and CLARE), together with ablation studies, support the method's effectiveness.

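The study aims to improve EEG representations for brain-computer interfaces with GEEGA, a graph-based learning method that uses multi-domain information. The method uses graph convolutional networks to integrate frequency-based topographical maps and time-frequency spectrograms, and uses loss functions to increase inter-class separability. GEEGA also includes a gradient alignment strategy that aligns conflicting gradients from different domains. Experiments on three EEG datasets indicate that the proposed method is effective at improving classification performance.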

Training Task Reasoning LLM Agents for Multi-turn Task Planning via Single-turn Reinforcement Learning

Authors: Hanjiang Hu, Changliu Liu, Na Li, Yebin Wang

First: 2025-09-24T23:47:36+00:00 · Latest: 2025-12-08T18:53:34+00:00

Comments: Accepted by IEEE Control Systems Letters (L-CSS)

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in knowledge acquisition, reasoning, and tool use, making them promising candidates for autonomous agent applications. However, training LLM agents for complex multi-turn task planning faces significant challenges, including sparse episode-wise rewards, credit assignment across long horizons, and the computational overhead of reinforcement learning in multi-turn interaction settings. To this end, this paper introduces a novel approach that transforms multi-turn task planning into single-turn task reasoning problems, enabling efficient policy optimization through Group Relative Policy Optimization (GRPO) with dense and verifiable reward from expert trajectories. Our theoretical analysis shows that GRPO improvement on single-turn task reasoning results in a lower bound of the multi-turn success probability under the minimal turns, as well as the generalization to subtasks with shorter horizons. Experimental evaluation on the complex task planning benchmark demonstrates that our 1.5B parameter model trained with single-turn GRPO achieves superior performance compared to larger baseline models up to 14B parameters, with success rates of 70% for long-horizon planning tasks.
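The group-relative part of GRPO can be sketched in a few lines: several candidate answers are sampled for the same single-turn prompt, each is scored, and each reward is standardized against its own group. The reward rule below (1 if the proposed action matches the expert's action at that step, else 0) is an assumption made for illustration, as are the action strings and the names `step_reward` and `group_relative_advantages`. The policy-gradient update itself is omitted.

```python
import statistics

def step_reward(predicted_action, expert_action):
    """Assumed verifiable reward: exact match with the expert's action."""
    return 1.0 if predicted_action == expert_action else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)   # population std; a choice of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# One single-turn reasoning prompt, four sampled candidate actions.
expert = "pick(cup)"
samples = ["pick(cup)", "move(table)", "pick(cup)", "open(drawer)"]

rewards = [step_reward(s, expert) for s in samples]
advantages = group_relative_advantages(rewards)
```

Candidates that match the expert receive a positive advantage and the others a negative one. When every sample in a group earns the same reward, all advantages are zero and the group contributes no learning signal.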

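Title / Abstract (translated from Chinese)

Title: Training Task Reasoning LLM Agents for Multi-turn Task Planning via Single-turn Reinforcement Learning

Large Language Models (LLMs) perform remarkably well at knowledge acquisition, reasoning, and tool use, making them promising candidates for autonomous agent applications. However, training LLM agents for complex multi-turn task planning faces significant challenges, including sparse episode-wise rewards, credit assignment across long horizons, and the computational overhead of reinforcement learning in multi-turn interaction settings. To this end, this paper proposes a novel approach that transforms multi-turn task planning into single-turn task reasoning problems, enabling efficient policy optimization through Group Relative Policy Optimization (GRPO) with dense and verifiable rewards from expert trajectories. Our theoretical analysis shows that GRPO improvement on single-turn task reasoning yields a lower bound on the multi-turn success probability under the minimal number of turns, and that it generalizes to subtasks with shorter horizons. Experiments on a complex task planning benchmark show that our 1.5B-parameter model trained with single-turn GRPO outperforms larger baseline models of up to 14B parameters, reaching a 70% success rate on long-horizon planning tasks.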

Summary

This paper addresses the challenge of training large language models (LLMs) for multi-turn task planning by converting it into single-turn task reasoning. It proposes using Group Relative Policy Optimization (GRPO) with dense rewards from expert trajectories to optimize policies efficiently. The model, with 1.5 billion parameters, outperforms larger models up to 14 billion parameters, achieving a success rate of 70% for long-horizon tasks.

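This paper proposes a new way to train large language models (LLMs) by converting complex multi-turn task planning into single-turn reasoning problems. The method applies Group Relative Policy Optimization (GRPO) with dense rewards from expert trajectories to improve performance. Experiments show that a 1.5B-parameter model trained with single-turn GRPO outperforms larger models and reaches a 70% success rate on long-horizon planning tasks.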

Beyond the Singular: Revealing the Value of Multiple Generations in Benchmark Evaluation

Authors: Wenbo Zhang, Hengrui Cai, Wenyu Chen

Venue: NeurIPS 2025

First: 2025-02-13T03:43:33+00:00 · Latest: 2025-12-08T18:50:25+00:00

Comments: Accepted in NeurIPS 2025 Workshop on LLM Evals

Abstract

Large language models (LLMs) have demonstrated significant utility in real-world applications, exhibiting impressive capabilities in natural language processing and understanding. Benchmark evaluations are crucial for assessing the capabilities of LLMs as they can provide a comprehensive assessment of their strengths and weaknesses. However, current evaluation methods often overlook the inherent randomness of LLMs by employing deterministic generation strategies or relying on a single random sample, resulting in unaccounted sampling variance and unreliable benchmark score estimates. In this paper, we propose a hierarchical statistical model that provides a more comprehensive representation of the benchmarking process by incorporating both benchmark characteristics and LLM randomness. We show that leveraging multiple generations improves the accuracy of estimating the benchmark score and reduces variance. Multiple generations also allow us to define $\mathbb P\left(\text{correct}\right)$, a prompt-level difficulty score based on correct ratios, providing fine-grained insights into individual prompts. Additionally, we create a data map that visualizes difficulty and semantics of prompts, enabling error detection and quality control in benchmark construction.
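To make the role of multiple generations concrete, the sketch below estimates a per-prompt correct ratio from k sampled generations, averages the ratios into a benchmark score, and computes a plug-in estimate of the within-prompt sampling variance of that score. It is a simplified illustration under the assumption that each generation is an independent Bernoulli draw for its prompt. It is not the paper's hierarchical model, and the outcome matrix and function names are made up for the example.

```python
import statistics

def prompt_correct_ratios(outcomes):
    """outcomes[i][j] is 1 if generation j for prompt i was correct, else 0."""
    return [sum(row) / len(row) for row in outcomes]

def benchmark_score(outcomes):
    """Average the per-prompt correct ratios into one benchmark score."""
    return statistics.fmean(prompt_correct_ratios(outcomes))

def sampling_variance(ratios, k):
    """Plug-in variance of the score from generation randomness alone.

    Assumes k independent Bernoulli(p_i) generations per prompt, so
    Var = (1 / n^2) * sum_i p_i * (1 - p_i) / k.
    """
    n = len(ratios)
    return sum(p * (1.0 - p) for p in ratios) / (k * n * n)

# Hypothetical results: 3 prompts, 4 generations each.
outcomes = [
    [1, 1, 1, 1],   # easy prompt
    [1, 0, 1, 0],   # borderline prompt
    [0, 0, 0, 1],   # hard prompt
]

ratios = prompt_correct_ratios(outcomes)
score = benchmark_score(outcomes)
var_k4 = sampling_variance(ratios, k=4)
var_k1 = sampling_variance(ratios, k=1)
```

Under this assumption the sampling variance shrinks in proportion to 1/k, which is the basic reason several generations per prompt give a steadier score than a single sample. The per-prompt ratios double as a simple difficulty score.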

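Title / Abstract (translated from Chinese)

Title: Beyond the Singular: Revealing the Value of Multiple Generations in Benchmark Evaluation

Large language models (LLMs) have shown significant utility in real-world applications, with impressive capabilities in natural language processing and understanding. Benchmark evaluations are crucial for assessing LLMs because they can provide a comprehensive picture of their strengths and weaknesses. However, current evaluation methods often overlook the inherent randomness of LLMs by using deterministic generation strategies or relying on a single random sample, which leaves sampling variance unaccounted for and makes benchmark score estimates unreliable. In this paper, we propose a hierarchical statistical model that represents the benchmarking process more completely by incorporating both benchmark characteristics and LLM randomness. We show that using multiple generations improves the accuracy of the benchmark score estimate and reduces its variance. Multiple generations also let us define $\mathbb P\left(\text{correct}\right)$, a prompt-level difficulty score based on correct ratios, which gives fine-grained insight into individual prompts. In addition, we create a data map that visualizes the difficulty and semantics of prompts, supporting error detection and quality control in benchmark construction.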

Understanding Privacy Risks in Code Models Through Training Dynamics: A Causal Approach

Authors: Hua Yang, Alejandro Velasco, Sen Fang, Bowen Xu, Denys Poshyvanyk

First: 2025-12-08T18:47:40+00:00 · Latest: 2025-12-08T18:47:40+00:00

Comments: 21 pages, 8 figures

Abstract

Large language models for code (LLM4Code) have greatly improved developer productivity but also raise privacy concerns due to their reliance on open-source repositories containing abundant personally identifiable information (PII). Prior work shows that commercial models can reproduce sensitive PII, yet existing studies largely treat PII as a single category and overlook the heterogeneous risks among different types. We investigate whether distinct PII types vary in their likelihood of being learned and leaked by LLM4Code, and whether this relationship is causal. Our methodology includes building a dataset with diverse PII types, fine-tuning representative models of different scales, computing training dynamics on real PII data, and formulating a structural causal model to estimate the causal effect of learnability on leakage. Results show that leakage risks differ substantially across PII types and correlate with their training dynamics: easy-to-learn instances such as IP addresses exhibit higher leakage, while harder types such as keys and passwords leak less frequently. Ambiguous types show mixed behaviors. This work provides the first causal evidence that leakage risks are type-dependent and offers guidance for developing type-aware and learnability-aware defenses for LLM4Code.

中文标题/摘要

标题：通过训练动态理解代码模型中的隐私风险：一种因果方法

代码大型语言模型（LLM4Code）极大地提高了开发人员的生产力，但也引发了隐私方面的担忧，因为它们依赖于包含大量个人可识别信息（PII）的开源仓库。先前的研究表明，商业模型可以重现敏感的PII，但现有研究大多将PII视为单一类别，忽略了不同类型之间的异质风险。我们研究了不同类型的PII在被LLM4Code学习和泄露的可能性是否不同，以及这种关系是否具有因果性。我们的方法包括构建包含多种PII类型的数据集，对不同规模的代表性模型进行微调，在真实PII数据上计算训练动态，并构建结构因果模型以估计可学习性对泄露的因果效应。结果表明，不同类型的PII的泄露风险差异显著，并与它们的训练动态相关：易于学习的实例，如IP地址，表现出更高的泄露率，而较难的类型，如密钥和密码，泄露的频率较低。模糊类型表现出混合行为。本研究提供了第一个因果证据，表明泄露风险取决于类型，并为开发针对LLM4Code的类型感知和可学习性感知的防御措施提供了指导。

Summary / 总结

This study investigates the privacy risks of different types of personally identifiable information (PII) in large language models for code (LLM4Code) through a causal approach. By building a dataset with diverse PII types, fine-tuning representative models, and analyzing training dynamics, the research finds that leakage risks vary significantly among PII types, with easier-to-learn instances like IP addresses exhibiting higher leakage, while harder types such as keys and passwords leak less frequently. This work provides causal evidence that leakage risks are type-dependent and offers insights for developing more effective defenses.

该研究通过因果方法探讨了不同类型的个人可识别信息（PII）在代码大型语言模型（LLM4Code）中的隐私风险。研究构建了一个包含多种PII类型的数据集，对代表性模型进行微调，并计算训练动态以估计可学习性对泄漏的影响。研究结果表明，PII类型的泄漏风险差异显著，易于学习的实例如IP地址的泄漏风险较高，而较难的类型如密钥和密码则泄漏较少。

MAPLE: Encoding Dexterous Robotic Manipulation Priors Learned From Egocentric Videos

Authors: Alexey Gavryushin, Xi Wang, Robert J. S. Malate, Chenyu Yang, Davide Liconti, René Zurbrügg, Robert K. Katzschmann, Marc Pollefeys

First: 2025-04-08T14:25:25+00:00 · Latest: 2025-12-08T18:47:30+00:00

Abstract

Large-scale egocentric video datasets capture diverse human activities across a wide range of scenarios, offering rich and detailed insights into how humans interact with objects, especially those that require fine-grained dexterous control. Such complex, dexterous skills with precise controls are crucial for many robotic manipulation tasks, yet are often insufficiently addressed by traditional data-driven approaches to robotic manipulation. To address this gap, we leverage manipulation priors learned from large-scale egocentric video datasets to improve policy learning for dexterous robotic manipulation tasks. We present MAPLE, a novel method for dexterous robotic manipulation that learns features to predict object contact points and detailed hand poses at the moment of contact from egocentric images. We then use the learned features to train policies for downstream manipulation tasks. Experimental results demonstrate the effectiveness of MAPLE across 4 existing simulation benchmarks, as well as a newly designed set of 4 challenging simulation tasks requiring fine-grained object control and complex dexterous skills. The benefits of MAPLE are further highlighted in real-world experiments using a 17 DoF dexterous robotic hand, whereas the simultaneous evaluation across both simulation and real-world experiments has remained underexplored in prior work. We additionally showcase the efficacy of our model on an egocentric contact point prediction task, validating its usefulness beyond dexterous manipulation policy learning.

中文标题/摘要

标题：MAPLE: 从第一人称视频中学习灵巧机器人操作先验

大规模第一人称视频数据集捕捉了各种人类活动的多样性，涵盖了广泛的情景，提供了关于人类如何与物体互动的丰富而详细的见解，尤其是那些需要精细灵巧控制的物体。这些复杂的、需要精确控制的灵巧技能对于许多机器人操作任务至关重要，但传统数据驱动的机器人操作方法往往未能充分解决这些问题。为了解决这一差距，我们利用从大规模第一人称视频数据集中学习到的操作先验来改进灵巧机器人操作任务的策略学习。我们提出了MAPLE，一种新颖的灵巧机器人操作方法，该方法从第一人称图像中学习特征以预测接触点和接触瞬间的详细手部姿态。然后，我们使用学习到的特征来训练下游操作任务的策略。实验结果表明，MAPLE在4个现有的仿真基准测试中以及4个新设计的需要精细物体控制和复杂灵巧技能的挑战性仿真任务中均表现出有效性。此外，我们在使用17个自由度的灵巧机器人手中进行了实际实验，进一步突显了MAPLE的优势，而此前的工作中对仿真和实际实验的综合评估尚未得到充分探索。我们还展示了我们的模型在第一人称接触点预测任务中的有效性，验证了其在灵巧操作策略学习之外的应用价值。

Summary / 总结

This paper addresses dexterous robotic manipulation by leveraging manipulation priors learned from large-scale egocentric video datasets. The method, MAPLE, learns features that predict object contact points and detailed hand poses at the moment of contact from egocentric images, then uses those features to train policies for downstream manipulation tasks. Experiments show MAPLE's effectiveness on 4 existing simulation benchmarks and 4 newly designed simulation tasks, and real-world experiments with a 17 DoF dexterous robotic hand further highlight its benefits.

研究旨在通过利用大规模第一人称视角视频数据集中学习到的操纵先验知识，提升灵巧的机器人操作。MAPLE 是一种新颖的方法，可以从第一人称视角图像中预测物体接触点和详细的手部姿态来训练操作任务的策略。实验表明，MAPLE 在各种基准测试和真实世界任务中表现出色，突出了其在精细物体控制和复杂灵巧技能方面的优势。

Lang3D-XL: Language Embedded 3D Gaussians for Large-scale Scenes

Authors: Shai Krakovsky, Gal Fiebelman, Sagie Benaim, Hadar Averbuch-Elor

Venue: SIGGRAPH Asia 2025

First: 2025-12-08T18:39:58+00:00 · Latest: 2025-12-08T18:39:58+00:00

Comments: Accepted to SIGGRAPH Asia 2025. Project webpage: https://tau-vailab.github.io/Lang3D-XL

Abstract

Embedding a language field in a 3D representation enables richer semantic understanding of spatial environments by linking geometry with descriptive meaning. This allows for a more intuitive human-computer interaction, enabling querying or editing scenes using natural language, and could potentially improve tasks like scene retrieval, navigation, and multimodal reasoning. While such capabilities could be transformative, in particular for large-scale scenes, we find that recent feature distillation approaches cannot effectively learn over massive Internet data due to challenges in semantic feature misalignment and inefficiency in memory and runtime. To this end, we propose a novel approach to address these challenges. First, we introduce extremely low-dimensional semantic bottleneck features as part of the underlying 3D Gaussian representation. These are processed by rendering and passing them through a multi-resolution, feature-based, hash encoder. This significantly improves efficiency both in runtime and GPU memory. Second, we introduce an Attenuated Downsampler module and propose several regularizations addressing the semantic misalignment of ground truth 2D features. We evaluate our method on the in-the-wild HolyScenes dataset and demonstrate that it surpasses existing approaches in both performance and efficiency.

中文标题/摘要

标题：Lang3D-XL：嵌入语言的3D高斯分布用于大规模场景

在3D表示中嵌入语言场能够通过将几何形状与描述性意义联系起来，丰富对空间环境的语义理解。这使得人机交互更加直观，能够使用自然语言查询或编辑场景，并可能改善场景检索、导航和多模态推理等任务。尽管这些能力可能具有变革性，特别是对于大规模场景，我们发现最近的特征蒸馏方法由于语义特征错位问题和内存及运行时效率低下，无法有效学习大规模互联网数据。为此，我们提出了一种新的方法来解决这些问题。首先，我们引入了极低维度的语义瓶颈特征作为底层3D高斯表示的一部分。这些特征通过渲染并经过多分辨率、基于特征的哈希编码器处理，显著提高了运行时间和GPU内存的效率。其次，我们引入了衰减下采样模块，并提出了几种正则化方法来解决真值2D特征的语义错位问题。我们在真实场景（in-the-wild）HolyScenes数据集上评估了我们的方法，并证明了它在性能和效率上都超过了现有方法。

Summary / 总结

The research aims to enhance semantic understanding of 3D environments through language embedding, enabling more intuitive human-computer interaction. The method introduces low-dimensional semantic bottleneck features in 3D Gaussians and uses a multi-resolution hash encoder for efficient processing. Experimental results show that the proposed approach outperforms existing methods in both performance and efficiency on the HolyScenes dataset.

研究旨在通过嵌入语言场来增强对3D环境的语义理解，使通过自然语言进行查询和编辑成为可能，从而实现更直观的人机交互。为了解决从大量互联网数据中学习的挑战，作者提出了Lang3D-XL，该方法使用低维度的语义瓶颈特征和多分辨率哈希编码器进行高效处理。该方法在HolyScenes数据集上的性能和效率均优于现有方法。

Multi-view Pyramid Transformer: Look Coarser to See Broader

Authors: Gyeongjin Kang, Seungkwon Yang, Seungtae Nam, Younggeun Lee, Jungwoo Kim, Eunbyung Park

First: 2025-12-08T18:39:27+00:00 · Latest: 2025-12-08T18:39:27+00:00

Comments: Project page: see https://gynjn.github.io/MVP/

Abstract

We propose Multi-view Pyramid Transformer (MVP), a scalable multi-view transformer architecture that directly reconstructs large 3D scenes from tens to hundreds of images in a single forward pass. Drawing on the idea of "looking broader to see the whole, looking finer to see the details," MVP is built on two core design principles: 1) a local-to-global inter-view hierarchy that gradually broadens the model's perspective from local views to groups and ultimately the full scene, and 2) a fine-to-coarse intra-view hierarchy that starts from detailed spatial representations and progressively aggregates them into compact, information-dense tokens. This dual hierarchy achieves both computational efficiency and representational richness, enabling fast reconstruction of large and complex scenes. We validate MVP on diverse datasets and show that, when coupled with 3D Gaussian Splatting as the underlying 3D representation, it achieves state-of-the-art generalizable reconstruction quality while maintaining high efficiency and scalability across a wide range of view configurations.

中文标题/摘要

标题：多视图金字塔变换器：看更粗略以见更广阔

我们提出了多视图金字塔变换器（MVP），这是一种可扩展的多视图变换器架构，能够在单次前向传播中直接从数十到数百张图像中重建大型3D场景。MVP基于“看更广阔以见整体，看更精细以见细节”的理念，构建了两个核心设计原则：1）从局部视图到组再到整个场景的局部到全局的视图间层次结构，逐步拓宽模型的视角；2）从详细的空间表示逐步聚合为紧凑的信息密集型标记的精细到粗略的视图内层次结构。这种双重层次结构实现了计算效率和表示丰富性，能够快速重建大型和复杂的场景。我们在多种数据集上验证了MVP，并展示了当与3D高斯泼溅（3D Gaussian Splatting）作为底层3D表示结合使用时，它在保持高效性和可扩展性的同时，实现了最先进的可泛化的重建质量。

OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory

Authors: Zhaochong An, Menglin Jia, Haonan Qiu, Zijian Zhou, Xiaoke Huang, Zhiheng Liu, Weiming Ren, Kumara Kahatapitiya, Ding Liu, Sen He, Chenyang Zhang, Tao Xiang, Fanny Yang, Serge Belongie, Tian Xie

First: 2025-12-08T18:32:24+00:00 · Latest: 2025-12-08T18:32:24+00:00

Comments: Project Page: https://zhaochongan.github.io/projects/OneStory

Abstract

Storytelling in real-world videos often unfolds through multiple shots -- discontinuous yet semantically connected clips that together convey a coherent narrative. However, existing multi-shot video generation (MSV) methods struggle to effectively model long-range cross-shot context, as they rely on limited temporal windows or single keyframe conditioning, leading to degraded performance under complex narratives. In this work, we propose OneStory, enabling global yet compact cross-shot context modeling for consistent and scalable narrative generation. OneStory reformulates MSV as a next-shot generation task, enabling autoregressive shot synthesis while leveraging pretrained image-to-video (I2V) models for strong visual conditioning. We introduce two key modules: a Frame Selection module that constructs a semantically-relevant global memory based on informative frames from prior shots, and an Adaptive Conditioner that performs importance-guided patchification to generate compact context for direct conditioning. We further curate a high-quality multi-shot dataset with referential captions to mirror real-world storytelling patterns, and design effective training strategies under the next-shot paradigm. Finetuned from a pretrained I2V model on our curated 60K dataset, OneStory achieves state-of-the-art narrative coherence across diverse and complex scenes in both text- and image-conditioned settings, enabling controllable and immersive long-form video storytelling.

中文标题/摘要

标题：OneStory：具有自适应记忆的连贯多镜头视频生成

现实世界视频中的叙事往往通过多个镜头展开——不连续但语义上相连的片段，共同传达一个连贯的故事。然而，现有的多镜头视频生成（MSV）方法难以有效建模跨镜头的长距离上下文，因为它们依赖于有限的时间窗口或单个关键帧条件，导致在复杂叙事下性能下降。在本文中，我们提出了OneStory，使其能够进行全局而紧凑的跨镜头上下文建模，以实现一致和可扩展的叙事生成。OneStory将MSV重新定义为下一个镜头生成任务，从而实现自回归镜头合成，同时利用预训练的图像到视频（I2V）模型进行强大的视觉条件。我们引入了两个关键模块：一个帧选择模块，基于先前镜头中的信息性帧构建语义相关的全局记忆；以及一个自适应条件器，通过重要性引导的片段化生成紧凑的上下文，直接进行条件化。我们还精心制作了一个高质量的多镜头数据集，带有参考性描述，以反映现实世界的叙事模式，并在下一个镜头范式下设计了有效的训练策略。在我们精心制作的60K数据集上从预训练的I2V模型微调后，OneStory在文本和图像条件下的各种复杂场景中实现了最先进的叙事连贯性，使长格式视频叙事可控且沉浸。

Summary / 总结

OneStory addresses the challenge of generating coherent multi-shot videos by introducing a novel approach that models global cross-shot context effectively. It reformulates multi-shot video generation as a next-shot generation task, using a Frame Selection module to construct a global memory and an Adaptive Conditioner to generate compact context. OneStory achieves state-of-the-art narrative coherence in diverse and complex scenes, outperforming existing methods in both text- and image-conditioned settings.

OneStory通过提出一种有效建模全局跨镜头上下文的方法，解决了生成连贯多镜头视频的挑战。它将多镜头视频生成重新定义为下一个镜头的生成任务，使用帧选择模块构建全局记忆，并使用自适应条件器生成紧凑的上下文。OneStory在多样且复杂的场景中实现了最先进的叙事连贯性，在文本和图像条件设置下均优于现有方法。

ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning

Authors: Nearchos Potamitis, Lars Klein, Akhil Arora

First: 2025-12-08T18:26:58+00:00 · Latest: 2025-12-08T18:26:58+00:00

Comments: 11 pages, 3 tables, 4 figures

Abstract

Large language models (LLMs) are increasingly deployed in settings where reasoning, such as multi-step problem solving and chain-of-thought, is essential. Yet, current evaluation practices overwhelmingly report single-run accuracy while ignoring the intrinsic uncertainty that naturally arises from stochastic decoding. This omission creates a blind spot because practitioners cannot reliably assess whether a method's reported performance is stable, reproducible, or cost-consistent. We introduce ReasonBENCH, the first benchmark designed to quantify the underlying instability in LLM reasoning. ReasonBENCH provides (i) a modular evaluation library that standardizes reasoning frameworks, models, and tasks, (ii) a multi-run protocol that reports statistically reliable metrics for both quality and cost, and (iii) a public leaderboard to encourage variance-aware reporting. Across tasks from different domains, we find that the vast majority of reasoning strategies and models exhibit high instability. Notably, even strategies with similar average performance can display confidence intervals up to four times wider, and the top-performing methods often incur higher and less stable costs. Such instability compromises reproducibility across runs and, consequently, the reliability of reported performance. To better understand these dynamics, we further analyze the impact of prompts, model families, and scale on the trade-off between solve rate and stability. Our results highlight reproducibility as a critical dimension for reliable LLM reasoning and provide a foundation for future reasoning methods and uncertainty quantification techniques. ReasonBENCH is publicly available at https://github.com/au-clan/ReasonBench .

中文标题/摘要

标题：ReasonBENCH：评估LLM推理的（不）稳定性

大型语言模型（LLMs）在需要推理的环境中（如多步问题解决和推理链）的应用越来越广泛。然而，当前的评估实践主要报告单一运行的准确率，而忽视了从随机解码中自然产生的内在不确定性。这种遗漏导致了一个盲点，因为从业者无法可靠地评估所报告性能的稳定性、可重复性或成本一致性。我们引入了ReasonBENCH，这是第一个用于量化LLM推理内在不稳定性的基准。ReasonBENCH提供了（i）一个模块化的评估库，标准化推理框架、模型和任务，（ii）一个多运行协议，报告统计上可靠的质量和成本指标，以及（iii）一个公共排行榜，以鼓励变异意识的报告。在不同领域的任务中，我们发现大多数推理策略和模型都表现出高度的不稳定性。值得注意的是，即使具有相似平均性能的策略，其置信区间也可能宽出四倍，而表现最好的方法往往成本更高且更不稳定。这种不稳定性破坏了多次运行之间的可重复性，从而影响了报告性能的可靠性。为了更好地理解这些动态，我们进一步分析了提示、模型家族和规模对解决率和稳定性之间权衡的影响。我们的结果强调了可重复性是可靠LLM推理的关键维度，并为未来的推理方法和不确定性量化技术奠定了基础。ReasonBENCH可在https://github.com/au-clan/ReasonBench 获取。

Summary / 总结

ReasonBENCH benchmarks the stability of LLM reasoning by introducing a modular evaluation library, a multi-run protocol, and a public leaderboard. It reveals that the vast majority of reasoning strategies and models exhibit high instability: even strategies with similar average performance can show confidence intervals up to four times wider than one another. This instability affects reproducibility and the reliability of reported performance, highlighting the need for variance-aware reporting in LLM reasoning evaluation.

ReasonBENCH通过引入模块化评估库、多运行协议和公共排行榜来评估LLM推理的稳定性，并报告质量和成本指标。在各种任务中，大多数推理策略和模型显示出高度不稳定性：即使平均性能相近的策略，其置信区间宽度也可能相差多达四倍。这种不稳定性影响了可重复性和报告性能的可靠性，强调了在LLM评估中进行方差感知报告的必要性。
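
To make the idea of variance-aware, multi-run reporting concrete, here is a minimal sketch that summarizes repeated runs with a mean and a normal-approximation 95% confidence interval. It does not use the ReasonBENCH library or its API; the function name, the z-value of 1.96, and the accuracy figures are all assumptions chosen for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def run_summary(accuracies, z=1.96):
    """Return (mean, (low, high)) using a normal-approximation confidence interval."""
    m = mean(accuracies)
    half_width = z * stdev(accuracies) / sqrt(len(accuracies))
    return m, (m - half_width, m + half_width)


# Hypothetical accuracies from 5 independent runs of two reasoning strategies.
strategy_a = [0.70, 0.71, 0.69, 0.70, 0.70]
strategy_b = [0.60, 0.80, 0.66, 0.74, 0.70]

mean_a, ci_a = run_summary(strategy_a)
mean_b, ci_b = run_summary(strategy_b)
width_a = ci_a[1] - ci_a[0]
width_b = ci_b[1] - ci_b[0]
print(f"A: {mean_a:.3f} +/- {width_a / 2:.3f}")
print(f"B: {mean_b:.3f} +/- {width_b / 2:.3f}")
```

The two hypothetical strategies have the same average accuracy, so a single-run report could not tell them apart, while the interval widths differ substantially.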

GatedFWA: Linear Flash Windowed Attention with Gated Associative Memory

Authors: Jiaxu Liu, Yuhe Bai, Christos-Savvas Bouganis

First: 2025-12-08T18:11:06+00:00 · Latest: 2025-12-08T18:11:06+00:00

Abstract

Modern autoregressive models rely on attention, yet the Softmax full attention in Transformers scales quadratically with sequence length. Sliding Window Attention (SWA) achieves linear-time encoding/decoding by constraining the attention pattern, but under an Associative Memory interpretation, its difference-style update renders the training objective effectively unbounded. In contrast, Softmax attention normalizes updates, leading to memory shrinkage and gradient vanishing. We propose GatedFWA: a Memory-Gated (Flash) Windowed Attention mechanism that preserves SWA's efficiency while stabilizing memory updates and making gradient flow controllable. In essence, GatedFWA accumulates a per-token/head gate into a decay bias added to the attention logits, acting as a learnable contraction in the memory recurrence. We implement a fused one-pass gate preprocessing and a FlashAttention-compatible kernel that injects the gate under a sliding mask, ensuring I/O efficiency and numerical stability. On language modelling benchmarks, GatedFWA delivers competitive throughput with negligible overhead and better use of global context, and it integrates cleanly with token compression/selection methods such as NSA and generalizes to various autoregressive domains.

中文标题/摘要

标题：GatedFWA：带门控关联记忆的线性Flash窗口注意力

现代自回归模型依赖于注意机制，但Transformer中的Softmax全注意机制随序列长度呈二次增长。滑动窗口注意（SWA）通过限制注意模式实现了线性时间的编码/解码，但在关联记忆的解释下，其差分式更新使得训练目标实际上变得无界。相比之下，Softmax更新会进行归一化，导致记忆收缩和梯度消失。我们提出了一种GatedFWA：一种带门控（Flash）窗口注意机制，它保持了SWA的效率，同时稳定了记忆更新并使梯度流动可控。本质上，GatedFWA将一个按令牌/头的门控值累加到注意力对数中的衰减偏置，起到可学习的记忆递归收缩作用。我们实现了一次融合门控预处理和FlashAttention兼容内核，该内核在滑动掩码下注入门控，确保了输入/输出效率和数值稳定性。在语言建模基准测试中，GatedFWA提供了可竞争的吞吐量，几乎没有额外开销，并更好地利用了全局上下文，且可以无缝集成到如NSA等令牌压缩/选择方法中，并适用于各种自回归领域。

Summary / 总结

GatedFWA is an attention mechanism that addresses the limitations of both sliding window attention (SWA) and Softmax attention by incorporating a gated associative memory approach. It preserves SWA's linear-time efficiency while stabilizing memory updates and making gradient flow controllable. On language modelling benchmarks, GatedFWA delivers competitive throughput with negligible overhead and better use of global context. It also integrates with token compression/selection methods such as NSA and generalizes to various autoregressive domains.

GatedFWA 是一种新型的注意力机制，旨在解决滑动窗口注意力（SWA）和 Softmax 注意力在自回归模型中的局限性。它保持了 SWA 的线性时间效率，并通过可学习的门控机制稳定记忆更新、使梯度流动可控。实验表明，GatedFWA 在语言建模基准上以可忽略的开销实现了有竞争力的吞吐量，并且更好地利用了全局上下文，同时与 token 压缩/选择方法兼容并适用于各种自回归领域。
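
The abstract describes a per-token gate accumulated into a decay bias that is added to the attention logits under a sliding mask. The sketch below is one plausible reading of that sentence, not the authors' kernel: it assumes each token carries a non-positive log-gate, takes prefix sums of those log-gates, and uses the difference of prefix sums as the bias between a query and an earlier key. The function name and all inputs are hypothetical, and the loop is a slow reference form rather than a fused FlashAttention-style implementation.

```python
import math


def gated_window_attention(q, k, v, log_gates, window):
    """Causal sliding-window attention with an accumulated-gate decay bias.

    q, k, v: lists of per-token vectors; log_gates: one value <= 0 per token.
    Assumed bias between query i and key j (j <= i): sum of log_gates[j+1..i].
    """
    n, d = len(q), len(q[0])
    # Prefix sums: prefix[t] = log_gates[0] + ... + log_gates[t].
    prefix, total = [], 0.0
    for g in log_gates:
        total += g
        prefix.append(total)
    out = []
    for i in range(n):
        start = max(0, i - window + 1)  # sliding causal window
        logits = []
        for j in range(start, i + 1):
            score = sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            logits.append(score + (prefix[i] - prefix[j]))  # decay bias <= 0
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(x - peak) for x in logits]
        norm = sum(weights)
        out.append([
            sum(w * v[start + t][c] for t, w in enumerate(weights)) / norm
            for c in range(len(v[0]))
        ])
    return out


# Hypothetical toy input: identical queries/keys, so only the gates matter.
q = k = [[0.0], [0.0]]
v = [[1.0], [3.0]]
no_decay = gated_window_attention(q, k, v, [0.0, 0.0], window=2)
strong_decay = gated_window_attention(q, k, v, [0.0, -20.0], window=2)
print(no_decay, strong_decay)
```

With zero log-gates the second token averages both values; a strongly negative log-gate on the second token suppresses the older key, which is the contraction-like behavior the bias is meant to illustrate.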

Asynchronous Bioplausible Neuron for SNN for Event Vision

Authors: Sanket Kachole, Hussain Sajwani, Fariborz Baghaei Naeini, Dimitrios Makris, Yahya Zweiri

First: 2023-11-20T15:45:16+00:00 · Latest: 2025-12-08T18:11:01+00:00

Comments: 10 pages

Abstract

Spiking Neural Networks (SNNs) offer a biologically inspired approach to computer vision that can lead to more efficient processing of visual data with reduced energy consumption. However, maintaining homeostasis within these networks is challenging, as it requires continuous adjustment of neural responses to preserve equilibrium and optimal processing efficiency amidst diverse and often unpredictable input signals. In response to these challenges, we propose the Asynchronous Bioplausible Neuron (ABN), a dynamic spike firing mechanism to auto-adjust the variations in the input signal. Comprehensive evaluation across various datasets demonstrates ABN's enhanced performance in image classification and segmentation, maintenance of neural equilibrium, and energy efficiency.

中文标题/摘要

标题：用于事件视觉SNN的异步生物合理神经元

脉冲神经网络（SNNs）提供了一种生物启发的计算机视觉方法，可以实现视觉数据处理的高效性并降低能耗。然而，这些网络中的稳态维持是一个挑战，因为需要不断调整神经响应以保持平衡和最佳处理效率，尤其是在面对多样且经常不可预测的输入信号时。为应对这些挑战，我们提出了异步生物合理神经元（ABN），这是一种动态的脉冲放电机制，用于自动适应输入信号的变化。在各种数据集上的全面评估表明，ABN 在图像分类和分割、神经平衡维持和能效方面表现出增强的性能。

Summary / 总结

The research aims to make Spiking Neural Networks (SNNs) for computer vision more efficient and less energy-hungry by addressing the challenge of maintaining neural homeostasis. The proposed Asynchronous Bioplausible Neuron (ABN) is a dynamic spike firing mechanism that automatically adjusts to variations in the input signal. Experiments across various datasets show that ABN enhances image classification and segmentation performance, maintains neural equilibrium, and improves energy efficiency.

研究旨在通过解决神经元稳态维持问题，提高SNNs在计算机视觉中的效率并降低能耗。提出的异步生物合理神经元（ABN）是一种动态脉冲放电机制，可自动适应输入信号的变化。在多种数据集上的实验表明，ABN提升了图像分类和分割性能，维持了神经元稳态，并提高了能效。

Aligning Machiavellian Agents: Behavior Steering via Test-Time Policy Shaping

Authors: Dena Mujtaba, Brian Hu, Anthony Hoogs, Arslan Basharat

Venue: AAAI 2026

First: 2025-11-14T18:42:18+00:00 · Latest: 2025-12-08T18:05:19+00:00

Comments: Accepted to AAAI 2026 AI Alignment Track

Abstract

The deployment of decision-making AI agents presents a critical challenge in maintaining alignment with human values or guidelines while operating in complex, dynamic environments. Agents trained solely to achieve their objectives may adopt harmful behavior, exposing a key trade-off between maximizing the reward function and maintaining alignment. For pre-trained agents, ensuring alignment is particularly challenging, as retraining can be a costly and slow process. This is further complicated by the diverse and potentially conflicting attributes representing the ethical values for alignment. To address these challenges, we propose a test-time alignment technique based on model-guided policy shaping. Our method allows precise control over individual behavioral attributes, generalizes across diverse reinforcement learning (RL) environments, and facilitates a principled trade-off between ethical alignment and reward maximization without requiring agent retraining. We evaluate our approach using the MACHIAVELLI benchmark, which comprises 134 text-based game environments and thousands of annotated scenarios involving ethical decisions. The RL agents are first trained to maximize the reward in their respective games. At test time, we apply policy shaping via scenario-action attribute classifiers to ensure decision alignment with ethical attributes. We compare our approach against prior training-time methods and general-purpose agents, as well as study several types of ethical violations and power-seeking behavior. Our results demonstrate that test-time policy shaping provides an effective and scalable solution for mitigating unethical behavior across diverse environments and alignment attributes.

中文标题/摘要

标题：对齐马基雅维利式智能体：通过测试时策略塑造引导行为

在复杂动态环境中部署决策AI代理，保持与人类价值观或指导方针的一致性是一项关键挑战。仅为实现目标而训练的代理可能会采取有害行为，这揭示了在最大化奖励函数和保持一致性之间的重要权衡。对于预训练的代理，确保一致性尤其具有挑战性，因为重新训练是一个成本高且耗时的过程。此外，代表一致性的伦理价值观多样且可能相互冲突，进一步增加了挑战。为应对这些挑战，我们提出了一种基于模型引导的策略塑造的测试时一致性技术。该方法允许对个体行为属性进行精确控制，适用于多种强化学习(RL)环境，并在不需重新训练代理的情况下，促进伦理一致性和奖励最大化之间的原则性权衡。我们使用MACHIAVELLI基准进行评估，该基准包括134个基于文本的游戏环境和数千个涉及伦理决策的标注场景。首先，RL代理被训练以最大化其各自游戏中的奖励。在测试时，我们通过场景-动作属性分类器应用策略塑造，以确保决策与伦理属性的一致性。我们将我们的方法与先前的训练时方法和通用代理进行比较，并研究了几种类型的伦理违规和权力追求行为。我们的结果表明，测试时策略塑造为在多种环境和一致性属性中缓解不道德行为提供了一种有效且可扩展的解决方案。

Summary / 总结

The paper addresses the challenge of aligning AI agents with human values in complex environments, proposing a test-time policy shaping method for pre-trained agents. This method uses scenario-action attribute classifiers to steer agent behavior without retraining, allowing for precise control over ethical attributes. Experiments on the MACHIAVELLI benchmark show that this approach effectively mitigates unethical behavior across various RL environments and alignment attributes.

该论文通过提出一种测试时策略塑造技术来解决在复杂环境中使AI代理与人类价值观保持一致的挑战。该方法使用场景-动作属性分类器来引导代理行为，无需重新训练即可实现对伦理属性的精确控制。在MACHIAVELLI基准测试中，该方法有效地缓解了各种RL环境和对齐属性中的不道德行为。
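
The summary above describes steering a pre-trained agent at test time with attribute classifiers instead of retraining it. A minimal sketch of that general idea follows, under loud assumptions: the reweighting rule (multiply each action probability by the classifier's estimated probability of no violation, raised to a strength parameter, then renormalize) is a generic choice made here for illustration and is not taken from the paper, and the numbers are hypothetical.

```python
def shape_policy(action_probs, violation_probs, strength=1.0):
    """Reweight a policy's action distribution at test time.

    action_probs: the frozen agent's probabilities over actions.
    violation_probs: hypothetical classifier estimates that each action
        violates an ethical attribute, in [0, 1].
    strength: 0 keeps the original policy; larger values penalize
        likely violations more heavily (assumed trade-off knob).
    """
    weights = [
        p * (1.0 - v) ** strength
        for p, v in zip(action_probs, violation_probs)
    ]
    total = sum(weights)
    if total == 0.0:
        # Every action was ruled out; fall back to the original policy.
        return list(action_probs)
    return [w / total for w in weights]


# Hypothetical scenario with three candidate actions.
policy = [0.6, 0.3, 0.1]      # reward-maximizing agent prefers action 0
violation = [0.9, 0.1, 0.0]   # classifier flags action 0 as likely harmful

unshaped = shape_policy(policy, violation, strength=0.0)
shaped = shape_policy(policy, violation, strength=1.0)
print(unshaped, shaped)
```

The strength parameter stands in for the trade-off between reward maximization and alignment: at 0 the agent acts as trained, and increasing it shifts probability mass away from flagged actions without touching the agent's weights.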

GorillaWatch: An Automated System for In-the-Wild Gorilla Re-Identification and Population Monitoring

Authors: Maximilian Schall, Felix Leonard Knöfel, Noah Elias König, Jan Jonas Kubeler, Maximilian von Klinski, Joan Wilhelm Linnemann, Xiaoshi Liu, Iven Jelle Schlegelmilch, Ole Woyciniuk, Alexandra Schild, Dante Wasmuht, Magdalena Bermejo Espinet, German Illera Basas, Gerard de Melo

Venue: WACV 2026

First: 2025-12-08T17:58:20+00:00 · Latest: 2025-12-08T17:58:20+00:00

Comments: Accepted at WACV 2026

Abstract

Monitoring critically endangered western lowland gorillas is currently hampered by the immense manual effort required to re-identify individuals from vast archives of camera trap footage. The primary obstacle to automating this process has been the lack of large-scale, "in-the-wild" video datasets suitable for training robust deep learning models. To address this gap, we introduce a comprehensive benchmark with three novel datasets: Gorilla-SPAC-Wild, the largest video dataset for wild primate re-identification to date; Gorilla-Berlin-Zoo, for assessing cross-domain re-identification generalization; and Gorilla-SPAC-MoT, for evaluating multi-object tracking in camera trap footage. Building on these datasets, we present GorillaWatch, an end-to-end pipeline integrating detection, tracking, and re-identification. To exploit temporal information, we introduce a multi-frame self-supervised pretraining strategy that leverages consistency in tracklets to learn domain-specific features without manual labels. To ensure scientific validity, a differentiable adaptation of AttnLRP verifies that our model relies on discriminative biometric traits rather than background correlations. Extensive benchmarking subsequently demonstrates that aggregating features from large-scale image backbones outperforms specialized video architectures. Finally, we address unsupervised population counting by integrating spatiotemporal constraints into standard clustering to mitigate over-segmentation. We publicly release all code and datasets to facilitate scalable, non-invasive monitoring of endangered species.

中文标题/摘要

标题：GorillaWatch：一种自动化的野外大猩猩再识别和种群监测系统

目前，监测极度濒危的西部低地大猩猩受到极大限制，因为需要巨大的人工努力从大量的相机陷阱视频档案中重新识别个体。自动化这一过程的主要障碍是缺乏适合训练稳健深度学习模型的大型“野外”视频数据集。为了解决这一缺口，我们引入了一个全面的基准，包括三个新的数据集：Gorilla-SPAC-Wild，迄今为止最大的用于野外灵长类动物再识别的视频数据集；Gorilla-Berlin-Zoo，用于评估跨域再识别泛化；以及Gorilla-SPAC-MoT，用于评估相机陷阱视频中的多目标跟踪。基于这些数据集，我们提出了GorillaWatch，一个集成了检测、跟踪和再识别的端到端管道。为了利用时间信息，我们引入了一种多帧自监督预训练策略，利用轨迹片段中的一致性来学习领域特定特征，而无需手动标签。为了确保科学有效性，我们采用AttnLRP的可微分改编版本，验证我们的模型依赖于区分性的生物特征，而不是背景相关性。随后的基准测试表明，从大规模图像骨干网络中聚合特征优于专门的视频架构。最后，我们通过将时空约束整合到标准聚类中来解决无监督的种群计数问题，以减轻过度分割。我们公开发布了所有代码和数据集，以促进濒危物种的可扩展和非侵入性监测。

Formalized Hopfield Networks and Boltzmann Machines

Authors: Matteo Cipollina, Michail Karatarakis, Freek Wiedijk

First: 2025-12-08T17:48:31+00:00 · Latest: 2025-12-08T17:48:31+00:00

Abstract

Neural networks are widely used, yet their analysis and verification remain challenging. In this work, we present a Lean 4 formalization of neural networks, covering both deterministic and stochastic models. We first formalize Hopfield networks, recurrent networks that store patterns as stable states. We prove convergence and the correctness of Hebbian learning, a training rule that updates network parameters to encode patterns, here limited to the case of pairwise-orthogonal patterns. We then consider stochastic networks, where updates are probabilistic and convergence is to a stationary distribution. As a canonical example, we formalize the dynamics of Boltzmann machines and prove their ergodicity, showing convergence to a unique stationary distribution using a new formalization of the Perron-Frobenius theorem.

中文标题/摘要

标题：形式化的霍普菲尔德网络和玻尔兹曼机

神经网络被广泛使用，但对其分析和验证仍然具有挑战性。在这项工作中，我们介绍了使用Lean 4形式化神经网络的方法，涵盖了确定性和随机模型。我们首先形式化了霍普菲尔德网络，这是一种将模式存储为稳定状态的循环网络。我们证明了收敛性以及赫布学习（Hebbian learning）的正确性，赫布学习是一种更新网络参数以编码模式的训练规则，这里仅限于两两正交模式的情况。然后我们考虑了随机网络，其中更新是概率性的，收敛到一个平稳分布。作为典型的例子，我们形式化了玻尔兹曼机的动力学，并证明了其遍历性，使用新的 Perron-Frobenius 定理形式化展示了其收敛到唯一的平稳分布。

Summary / 总结

This work aims to formally analyze and verify neural networks, focusing on Hopfield networks and Boltzmann machines. The authors use Lean 4 to prove the convergence and correctness of Hebbian learning for Hopfield networks and the ergodicity of Boltzmann machines, demonstrating convergence to a unique stationary distribution using a new Perron-Frobenius theorem formalization.

本文旨在通过形式化神经网络，特别是霍普菲尔德网络和玻尔兹曼机，来促进其分析和验证。作者使用Lean 4进行形式化，证明了霍普菲尔德网络的收敛性和Hebbian学习的正确性。对于玻尔兹曼机，他们证明了遍历性和收敛到唯一平稳分布，并使用Perron-Frobenius定理的新形式化证明了这一点。
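
For readers unfamiliar with the objects being formalized, the sketch below shows a small numerical instance of the property the abstract mentions: Hebbian learning on pairwise-orthogonal patterns stores them as stable states of a Hopfield network. This is an executable illustration in Python, not the Lean 4 development; the zero-diagonal weight convention, the tie-breaking rule at zero input, and the example patterns are assumptions.

```python
def hebbian_weights(patterns):
    """Hebbian rule: W[i][j] = (1/n) * sum over patterns of p[i]*p[j], zero diagonal."""
    n = len(patterns[0])
    return [
        [0.0 if i == j else sum(p[i] * p[j] for p in patterns) / n
         for j in range(n)]
        for i in range(n)
    ]


def sweep(weights, state):
    """One asynchronous pass: update neurons in index order, in place on a copy."""
    s = list(state)
    for i in range(len(s)):
        field = sum(weights[i][j] * s[j] for j in range(len(s)))
        s[i] = 1 if field >= 0 else -1  # assumed tie-break: zero input maps to +1
    return s


def recall(weights, state, max_sweeps=10):
    """Repeat sweeps until the state stops changing (a fixed point)."""
    for _ in range(max_sweeps):
        new_state = sweep(weights, state)
        if new_state == state:
            return state
        state = new_state
    return state


# Two pairwise-orthogonal patterns over 4 neurons (dot product is 0).
patterns = [[1, 1, 1, 1], [1, -1, 1, -1]]
W = hebbian_weights(patterns)

stable = [sweep(W, p) == p for p in patterns]
recovered = recall(W, [1, 1, 1, -1])  # second pattern with one flipped unit
print(stable, recovered)
```

Both stored patterns are fixed points of the update, and the corrupted input settles onto the second pattern after one sweep in this fixed update order.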

RL-MTJail: Reinforcement Learning for Automated Black-Box Multi-Turn Jailbreaking of Large Language Models

Authors: Xiqiao Xiong, Ouxiang Li, Zhuo Liu, Moxin Li, Wentao Shi, Fuli Feng, Xiangnan He

First: 2025-12-08T17:42:59+00:00 · Latest: 2025-12-08T17:42:59+00:00

Comments: 19 pages, 15 figures

Abstract

Large language models are vulnerable to jailbreak attacks, threatening their safe deployment in real-world applications. This paper studies black-box multi-turn jailbreaks, aiming to train attacker LLMs to elicit harmful content from black-box models through a sequence of prompt-output interactions. Existing approaches typically rely on single-turn optimization, which is insufficient for learning long-term attack strategies. To bridge this gap, we formulate the problem as a multi-turn reinforcement learning task, directly optimizing the harmfulness of the final-turn output as the outcome reward. To mitigate sparse supervision and promote long-term attack strategies, we propose two heuristic process rewards: (1) controlling the harmfulness of intermediate outputs to prevent triggering the black-box model's rejection mechanisms, and (2) maintaining the semantic relevance of intermediate outputs to avoid drifting into irrelevant content. Experimental results on multiple benchmarks show consistently improved attack success rates across multiple models, highlighting the effectiveness of our approach. The code is available at https://github.com/xxiqiao/RL-MTJail. Warning: This paper contains examples of harmful content.

中文标题/摘要

标题：RL-MTJail：强化学习在大型语言模型自动化黑盒多轮越狱中的应用

大型语言模型容易受到越狱攻击，威胁其在实际应用中的安全部署。本文研究黑盒多轮越狱，旨在通过一系列提示-输出交互训练攻击者语言模型，从黑盒模型中引出有害内容。现有方法通常依赖单轮优化，不足以学习长期攻击策略。为解决这一问题，我们将问题形式化为多轮强化学习任务，直接优化最终轮次输出的有害性作为结果奖励。为缓解稀疏监督并促进长期攻击策略，我们提出了两种启发式过程奖励：(1) 控制中间输出的有害性，以防止触发黑盒模型的拒绝机制；(2) 维护中间输出的语义相关性，以避免偏离无关内容。在多个基准上的实验结果表明，我们的方法在多个模型上均能显著提高攻击成功率，突显了其有效性。代码可在https://github.com/xxiqiao/RL-MTJail 获取。警告：本文包含有害内容示例。

Summary / 总结

This paper addresses the vulnerability of large language models to jailbreak attacks by developing RL-MTJail, a reinforcement learning method for training attacker LLMs to elicit harmful content through multi-turn interactions. The method formulates the task as a multi-turn reinforcement learning problem, optimizing the harmfulness of the final output while using heuristic process rewards to control intermediate outputs and maintain semantic relevance. Experiments show improved attack success rates across multiple models, demonstrating the effectiveness of the approach.

该论文通过将问题形式化为多轮强化学习任务，来解决大型语言模型面临的越狱攻击问题。方法直接优化最终输出的有害性，并包含用于控制中间输出和保持语义相关性的启发式过程奖励。实验结果表明，该方法在多个模型上提高了攻击成功率，证明了其有效性。

Modality-Aware Bias Mitigation and Invariance Learning for Unsupervised Visible-Infrared Person Re-Identification

Authors: Menglin Wang, Xiaojin Gong, Jiachen Li, Genlin Ji

Venue: AAAI 2026

First: 2025-12-08T17:42:28+00:00 · Latest: 2025-12-08T17:42:28+00:00

Comments: Accepted to AAAI 2026

Abstract

Unsupervised visible-infrared person re-identification (USVI-ReID) aims to match individuals across visible and infrared cameras without relying on any annotation. Given the significant gap between the visible and infrared modalities, estimating reliable cross-modality association becomes a major challenge in USVI-ReID. Existing methods usually adopt optimal transport to associate the intra-modality clusters, which is prone to propagating the local cluster errors, and also overlooks global instance-level relations. By mining and attending to the visible-infrared modality bias, this paper focuses on addressing cross-modality learning from two aspects: bias-mitigated global association and modality-invariant representation learning. Motivated by the camera-aware distance rectification in single-modality re-ID, we propose modality-aware Jaccard distance to mitigate the distance bias caused by modality discrepancy, so that more reliable cross-modality associations can be estimated through global clustering. To further improve cross-modality representation learning, a 'split-and-contrast' strategy is designed to obtain modality-specific global prototypes. By explicitly aligning these prototypes under global association guidance, modality-invariant yet ID-discriminative representation learning can be achieved. While conceptually simple, our method obtains state-of-the-art performance on benchmark VI-ReID datasets and outperforms existing methods by a significant margin, validating its effectiveness.

中文标题/摘要

标题：模态感知偏差缓解与不变表示学习在无监督可见-红外行人重识别中的应用

无监督可见-红外行人重识别（USVI-ReID）旨在无需任何标注的情况下，在可见光和红外摄像机之间匹配个体。由于可见光和红外模态之间存在显著差异，估计可靠的跨模态关联成为USVI-ReID中的主要挑战。现有方法通常采用最优传输来关联同一模态的簇，这容易传播局部簇的错误，并且忽略了全局实例级关系。通过挖掘和关注可见光-红外模态偏差，本文从两个方面关注跨模态学习：偏差缓解的全局关联和模态不变表示学习。受单模态重识别中相机感知距离校正的启发，我们提出模态感知Jaccard距离来缓解由模态差异引起的距离偏差，从而通过全局聚类估计更可靠的跨模态关联。为了进一步提高跨模态表示学习，设计了一种“分割-对比”策略来获得模态特定的全局原型。通过在全局关联的指导下显式对齐这些原型，可以实现模态不变但ID区分的表示学习。尽管概念上很简单，但我们的方法在基准VI-ReID数据集上取得了最先进的性能，并显著优于现有方法，验证了其有效性。

Summary / 总结

This paper addresses the challenge of unsupervised visible-infrared person re-identification by proposing a modality-aware approach. It introduces modality-aware Jaccard distance to mitigate the bias caused by modality discrepancy and a 'split-and-contrast' strategy to learn modality-invariant yet ID-discriminative representations. The method achieves state-of-the-art performance on benchmark datasets, outperforming existing methods significantly.

该论文通过聚焦缓解跨模态偏差和学习模态不变表示来解决无监督可见光-红外行人重识别的挑战。提出了模态感知Jaccard距离来矫正距离偏差，并设计了“分割-对比”策略来获取模态特定的全局原型。该方法在基准数据集上取得了最先进的性能，优于现有方法。

UltrasODM: A Dual Stream Optical Flow Mamba Network for 3D Freehand Ultrasound Reconstruction

Authors: Mayank Anand, Ujair Alam, Surya Prakash, Priya Shukla, Gora Chand Nandi, Domenec Puig

First: 2025-12-08T17:39:34+00:00 · Latest: 2025-12-08T17:39:34+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Clinical ultrasound acquisition is highly operator-dependent, where rapid probe motion and brightness fluctuations often lead to reconstruction errors that reduce trust and clinical utility. We present UltrasODM, a dual-stream framework that assists sonographers during acquisition through calibrated per-frame uncertainty, saliency-based diagnostics, and actionable prompts. UltrasODM integrates (i) a contrastive ranking module that groups frames by motion similarity, (ii) an optical-flow stream fused with Dual-Mamba temporal modules for robust 6-DoF pose estimation, and (iii) a Human-in-the-Loop (HITL) layer combining Bayesian uncertainty, clinician-calibrated thresholds, and saliency maps highlighting regions of low confidence. When uncertainty exceeds the threshold, the system issues unobtrusive alerts suggesting corrective actions such as re-scanning highlighted regions or slowing the sweep. Evaluated on a clinical freehand ultrasound dataset, UltrasODM reduces drift by 15.2%, distance error by 12.1%, and Hausdorff distance by 10.1% relative to UltrasOM, while producing per-frame uncertainty and saliency outputs. By emphasizing transparency and clinician feedback, UltrasODM improves reconstruction reliability and supports safer, more trustworthy clinical workflows. Our code is publicly available at https://github.com/AnandMayank/UltrasODM.

中文标题/摘要

标题：UltrasODM：一种用于3D自由手超声重建的双流光流Mamba网络

临床超声采集高度依赖操作者，快速探头运动和亮度波动常导致重建错误，降低信任度和临床实用性。我们提出UltrasODM，一种双流框架，通过校准的每帧不确定性、基于显著性的诊断和可操作的提示，在采集过程中协助超声技师。UltrasODM 结合了(i) 一种对比排序模块，按运动相似性对帧分组，(ii) 一条融合Dual-Mamba时序模块的光流分支，用于稳健的6自由度位姿估计，以及(iii) 一个人在回路（HITL）层，结合贝叶斯不确定性、临床医生校准的阈值和突出低置信度区域的显著性图。当不确定性超过阈值时，系统会发出不显眼的警报，建议采取纠正措施，如重新扫描突出显示的区域或减慢扫查速度。在临床自由手超声数据集上评估，与UltrasOM相比，UltrasODM将漂移减少15.2%，距离误差减少12.1%，Hausdorff距离减少10.1%，同时生成每帧的不确定性输出和显著性输出。通过强调透明度和临床医生反馈，UltrasODM 提高了重建可靠性，并支持更安全、更可信赖的临床工作流程。我们的代码可在https://github.com/AnandMayank/UltrasODM 公开获取。

Summary / 总结

UltrasODM is designed to improve the accuracy of 3D freehand ultrasound reconstruction by addressing operator-dependent errors. It uses a dual-stream framework with a contrastive ranking module and an optical-flow stream for robust pose estimation, combined with a Human-in-the-Loop layer that provides per-frame uncertainty and saliency outputs. The system reduces drift, distance error, and Hausdorff distance by 15.2%, 12.1%, and 10.1% respectively, and enhances clinical workflow reliability through actionable prompts and diagnostics. The code is publicly available.

UltrasODM旨在通过解决操作者依赖性误差来提高3D自由手超声重建的准确性。它采用双流框架，包含对比排序模块和用于稳健位姿估计的光流分支，并结合人在回路（HITL）层，提供每帧的不确定性和显著性输出。该系统将漂移、距离误差和Hausdorff距离分别降低了15.2%、12.1%和10.1%，并通过可操作的提示和诊断提升临床工作流程的可靠性。代码已公开。

Physics-Informed Neural Networks for Source Inversion and Parameters Estimation in Atmospheric Dispersion

Authors: Brenda Anague, Bamdad Hosseini, Issa Karambal, Jean Medard Ngnotchouye

First: 2025-12-08T17:38:49+00:00 · Latest: 2025-12-08T17:38:49+00:00

Abs · PDF · Code1 · Code2

Abstract

Recent studies have shown the success of deep learning in solving forward and inverse problems in engineering and scientific computing domains, such as physics-informed neural networks (PINNs). In the fields of atmospheric science and environmental monitoring, estimating emission source locations is a central task that further relies on multiple model parameters that dictate velocity profiles and diffusion parameters. Estimating these parameters at the same time as emission sources from scarce data is a difficult task. In this work, we achieve this by leveraging the flexibility and generality of PINNs. We use a weighted adaptive method based on the neural tangent kernels to solve a source inversion problem with parameter estimation on the 2D and 3D advection-diffusion equations with unknown velocity and diffusion coefficients that may vary in space and time. Our proposed weighted adaptive method is presented as an extension of PINNs for forward PDE problems to a highly ill-posed source inversion and parameter estimation problem. The key idea behind our methodology is to attempt the joint recovery of the solution, the sources along with the unknown parameters, thereby using the underlying partial differential equation as a constraint that couples multiple unknown functional parameters, leading to more efficient use of the limited information in the measurements. We present various numerical experiments, using different types of measurements that model practical engineering systems, to show that our proposed method is indeed successful and robust to additional noise in the measurements.
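
For orientation, a generic form of the problem described above can be written as follows. This is a sketch under stated assumptions: the symbols $u$ (concentration), $\mathbf{v}$ (velocity), $D$ (diffusion coefficient), $s$ (source), the loss split, and the weights $\lambda$ are illustrative notation, and the abstract does not give the paper's exact formulation or its NTK-based weighting rule.

```latex
% Advection-diffusion equation with unknown velocity, diffusion and source
\partial_t u(\mathbf{x},t) + \mathbf{v}(\mathbf{x},t)\cdot\nabla u(\mathbf{x},t)
  - \nabla\cdot\big(D(\mathbf{x},t)\,\nabla u(\mathbf{x},t)\big) = s(\mathbf{x},t)

% Joint recovery: networks u_\theta, \mathbf{v}_\phi, D_\psi, s_\omega are trained together
\mathcal{L} =
  \lambda_{r}\,\frac{1}{N_r}\sum_{i=1}^{N_r}
    \Big|\partial_t u_\theta + \mathbf{v}_\phi\cdot\nabla u_\theta
      - \nabla\cdot\big(D_\psi \nabla u_\theta\big) - s_\omega\Big|^2_{(\mathbf{x}_i,t_i)}
  + \lambda_{d}\,\frac{1}{N_d}\sum_{j=1}^{N_d}
    \big|u_\theta(\mathbf{x}_j,t_j) - u^{\mathrm{obs}}_j\big|^2
  + \lambda_{b}\,\mathcal{L}_{\mathrm{BC/IC}}
```

The PDE residual term is what couples the unknown fields: measurements constrain only $u_\theta$ directly, and the equation propagates that information to $\mathbf{v}_\phi$, $D_\psi$ and $s_\omega$. In an adaptive scheme the weights $\lambda_r, \lambda_d, \lambda_b$ are updated during training rather than fixed by hand.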

中文标题/摘要

标题：物理知情神经网络在大气扩散中的源反演和参数估计

近期研究表明，深度学习在工程和科学计算领域，如物理知情神经网络（PINNs）中成功解决了正向和反向问题。在大气科学和环境监测领域，估计排放源位置是一项核心任务，进一步依赖于多个模型参数，这些参数决定了速度剖面和扩散参数。同时从稀缺数据中估计这些参数和排放源是一项困难的任务。在本文中，我们通过利用PINNs的灵活性和通用性实现了这一目标。我们使用基于神经切线核的加权自适应方法，解决二维和三维对流-扩散方程中的源反演问题，其中速度和扩散系数未知且可能随时间和空间变化。我们提出的方法是将PINNs用于前向偏微分方程问题的扩展，以解决一个高度病态的源反演和参数估计问题。我们方法的核心思想是尝试同时恢复解、源以及未知参数，从而利用底层偏微分方程作为约束，将多个未知函数参数耦合在一起，从而更有效地利用有限的测量信息。我们使用不同类型的测量来模拟实际工程系统，展示了我们提出的方法确实有效，并且对测量中的额外噪声具有鲁棒性。

Summary / 总结

This study aims to estimate emission source locations and multiple model parameters in atmospheric dispersion using physics-informed neural networks (PINNs). The authors propose a weighted adaptive method based on neural tangent kernels to solve a source inversion problem with parameter estimation in 2D and 3D advection-diffusion equations. The key finding is that their method can efficiently recover the solution, sources, and unknown parameters, even with limited and noisy data, by leveraging the underlying partial differential equation as a constraint. This approach demonstrates robustness and success in practical engineering systems.

本研究旨在使用物理知情神经网络（PINNs）估计大气扩散中的排放源位置和多个模型参数。作者采用基于神经切线核的加权自适应方法来解决2D和3D对流-扩散方程中的源反演和参数估计问题。主要发现包括成功地同时恢复了解、源和未知参数，并展示了该方法在处理测量噪声时的鲁棒性和高效利用有限信息的能力。

Unison: A Fully Automatic, Task-Universal, and Low-Cost Framework for Unified Understanding and Generation

Authors: Shihao Zhao, Yitong Chen, Zeyinzi Jiang, Bojia Zi, Shaozhe Hao, Yu Liu, Chaojie Mao, Kwan-Yee K. Wong

First: 2025-12-08T17:34:15+00:00 · Latest: 2025-12-08T17:34:15+00:00

Abs · PDF · Code1 · Code2

Abstract

Unified understanding and generation is a highly appealing research direction in multimodal learning. There exist two approaches: one trains a transformer via an auto-regressive paradigm, and the other adopts a two-stage scheme connecting pre-trained understanding and generative models for alignment fine-tuning. The former demands massive data and computing resources unaffordable for ordinary researchers. Though the latter requires a lower training cost, existing works often suffer from limited task coverage or poor generation quality. Both approaches lack the ability to parse input meta-information (such as task type, image resolution, video duration, etc.) and require manual parameter configuration that is tedious and non-intelligent. In this paper, we propose Unison which adopts the two-stage scheme while preserving the capabilities of the pre-trained models well. With an extremely low training cost, we cover a variety of multimodal understanding tasks, including text, image, and video understanding, as well as diverse generation tasks, such as text-to-visual content generation, editing, controllable generation, and IP-based reference generation. We also equip our model with the ability to automatically parse user intentions, determine the target task type, and accurately extract the meta-information required for the corresponding task. This enables full automation of various multimodal tasks without human intervention. Experiments demonstrate that, under a low-cost setting of only 500k training samples and 50 GPU hours, our model can accurately and automatically identify tasks and extract relevant parameters, and achieve superior performance across a variety of understanding and generation tasks.

中文标题/摘要

标题：Unison：一种全自动化、任务通用且低成本的统一理解和生成框架

统一理解和生成是多模态学习中极具吸引力的研究方向。存在两种方法：一种通过自回归范式训练Transformer，另一种采用两阶段方案连接预训练的理解和生成模型进行对齐微调。前者需要大量数据和计算资源，普通研究人员难以负担。尽管后者训练成本较低，但现有工作往往任务覆盖有限或生成质量不佳。两种方法都缺乏解析输入元信息（如任务类型、图像分辨率、视频时长等）的能力，且需要手动配置参数，过程繁琐且不智能。在本文中，我们提出Unison，采用两阶段方案同时保留预训练模型的能力。以极低的训练成本，我们涵盖了多种多模态理解任务，包括文本、图像和视频理解，以及多样化的生成任务，如文本到视觉内容生成、编辑、可控生成和基于IP的参考生成。我们还为模型配备了自动解析用户意图、确定目标任务类型和准确提取相应任务所需元信息的能力。这使得在无需人工干预的情况下，可以自动化完成各种多模态任务。实验表明，在仅50万训练样本和50个GPU小时的低成本设置下，我们的模型能够准确自动识别任务并提取相关参数，并在多种理解和生成任务中表现出色。

Summary / 总结

Unison is a low-cost, fully automatic framework for unified understanding and generation in multimodal learning. It adopts a two-stage scheme, combining pre-trained understanding and generative models, to achieve task coverage and generation quality that surpasses previous approaches. Unison can automatically parse input meta-information and determine the target task type, enabling full automation without human intervention. Experiments show that Unison can accurately identify tasks and extract relevant parameters, achieving superior performance across various tasks with only 500k training samples and 50 GPU hours of training cost.

Unison 是一种低成本的统一理解和生成框架，采用两阶段方案同时保持预训练模型的能力。它可以处理包括文本、图像和视频理解在内的多种多模态任务，以及各种生成任务。Unison 自动解析输入的元信息并确定目标任务类型，实现无需人工干预的全流程自动化。实验表明，Unison 可以准确识别任务并提取相关参数，在仅使用 50 万个训练样本和 50 个 GPU 小时的情况下表现出色。

Bimodal SegNet: Instance Segmentation Fusing Events and RGB Frames for Robotic Grasping

Authors: Sanket Kachole, Xiaoqian Huang, Fariborz Baghaei Naeini, Rajkumar Muthusamy, Dimitrios Makris, Yahya Zweiri

First: 2023-03-20T16:09:25+00:00 · Latest: 2025-12-08T17:33:10+00:00

Comments: 8 Pages

Abs · PDF · Code1 · Code2 · Code3

Abstract

Object segmentation for robotic grasping under dynamic conditions often faces challenges such as occlusion, low light conditions, motion blur and object size variance. To address these challenges, we propose a deep learning network that fuses two types of visual signals, event-based data and RGB frame data. The proposed Bimodal SegNet network has two distinct encoders, one for each signal input, and a spatial pyramidal pooling with atrous convolutions. Encoders capture rich contextual information by pooling the concatenated features at different resolutions while the decoder obtains sharp object boundaries. The proposed method is evaluated on five image degradation challenges (occlusion, blur, brightness, trajectory and scale variance) on the Event-based Segmentation (ESD) Dataset. The evaluation results show a 6-10% segmentation accuracy improvement over state-of-the-art methods in terms of mean intersection over union and pixel accuracy. The model code is available at https://github.com/sanket0707/Bimodal-SegNet.git
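
The two metrics named above, mean intersection over union (mIoU) and pixel accuracy, can be sketched as follows. This is a minimal, dependency-free illustration over flattened label lists; the function names and the choice to skip classes absent from both prediction and ground truth are assumptions, not the evaluation code from the linked repository.

```python
def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted label equals the ground truth."""
    correct = sum(p == t for p, t in zip(pred, target))
    return correct / len(target)

def mean_iou(pred, target, num_classes):
    """Average per-class intersection over union.

    Classes that appear in neither pred nor target are skipped
    (an assumed convention; some benchmarks count them differently).
    """
    ious = []
    for c in range(num_classes):
        inter = sum(p == c and t == c for p, t in zip(pred, target))
        union = sum(p == c or t == c for p, t in zip(pred, target))
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Flattened 6-pixel toy mask with 3 classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(pixel_accuracy(pred, target))  # 4 of 6 pixels match
print(mean_iou(pred, target, 3))     # per-class IoUs: 1/3, 2/3, 1/2
```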

中文标题/摘要

标题：双模态SegNet：结合事件数据和RGB帧进行机器人抓取实例分割

在动态条件下进行机器人抓取时，对象分割常面临遮挡、低光照条件、运动模糊和对象尺寸变化等挑战。为应对这些挑战，我们提出了一种深度学习网络，该网络融合了两种类型的视觉信号：事件数据和RGB帧数据。所提出的双模态SegNet网络有两个独立的编码器，分别处理每种信号输入，并采用空间金字塔池化和空洞卷积。编码器通过在不同分辨率下聚合拼接特征来捕获丰富的上下文信息，而解码器则获得清晰的对象边界。评估方法在基于事件的分割（ESD）数据集上进行了五种独特的图像退化挑战评估，包括遮挡、模糊、亮度、轨迹和尺度变化。评估结果表明，与最先进的方法相比，在平均交并比和像素精度方面提高了6-10%的分割准确率。模型代码可在https://github.com/sanket0707/Bimodal-SegNet.git获取

Summary / 总结

The paper addresses the challenges of object segmentation for robotic grasping under dynamic conditions, such as occlusion and low light. It proposes Bimodal SegNet, a deep learning network that fuses event-based data and RGB frame data. The network uses two distinct encoders and spatial pyramidal pooling with atrous convolutions to capture rich contextual information and obtain sharp object boundaries. The evaluation on the ESD Dataset shows a 6-10% improvement in mean intersection over the union and pixel accuracy compared to state-of-the-art methods.

论文针对动态条件下物体分割的挑战，如遮挡和低光照，提出了Bimodal SegNet，一种融合事件数据和RGB帧数据的深度学习网络。该网络使用两个独立的编码器和空间金字塔池化与空洞卷积来捕捉丰富的上下文信息并获得清晰的物体边界。在ESD数据集上的评估显示，与最先进的方法相比，在平均交并比和像素精度上分别提高了6-10%。

Kimi-Dev: Agentless Training as Skill Prior for SWE-Agents

Authors: Zonghan Yang, Shengjie Wang, Kelin Fu, Wenyang He, Weimin Xiong, Yibo Liu, Yibo Miao, Bofei Gao, Yejie Wang, Yingwei Ma, Yanhao Li, Yue Liu, Zhenxing Hu, Kaitai Zhang, Shuyi Wang, Huarong Chen, Flood Sung, Yang Liu, Yang Gao, Zhilin Yang, Tianyu Liu

First: 2025-09-27T01:49:13+00:00 · Latest: 2025-12-08T17:33:03+00:00

Comments: 68 pages. GitHub repo at https://github.com/MoonshotAI/Kimi-Dev

Abs · PDF · Code1 · Code2 · Code3

Abstract

Large Language Models (LLMs) are increasingly applied to software engineering (SWE), with SWE-bench as a key benchmark. Solutions are split into SWE-Agent frameworks with multi-turn interactions and workflow-based Agentless methods with single-turn verifiable steps. We argue these paradigms are not mutually exclusive: reasoning-intensive Agentless training induces skill priors, including localization, code edit, and self-reflection that enable efficient and effective SWE-Agent adaptation. In this work, we first curate the Agentless training recipe and present Kimi-Dev, an open-source SWE LLM achieving 60.4% on SWE-bench Verified, the best among workflow approaches. With additional SFT adaptation on 5k publicly-available trajectories, Kimi-Dev powers SWE-Agents to 48.6% pass@1, on par with that of Claude 3.5 Sonnet (241022 version). These results show that structured skill priors from Agentless training can bridge workflow and agentic frameworks for transferable coding agents.

中文标题/摘要

标题：Kimi-Dev：无需代理的训练作为SWE代理的技能先验

大型语言模型（LLMs）在软件工程（SWE）中的应用越来越广泛，SWE-bench是关键基准之一。解决方案分为多轮交互的SWE代理框架和单轮可验证步骤的无需代理方法。我们认为这些范式并非互斥：密集推理的无需代理训练会诱导出定位、代码编辑和自我反思等技能先验，从而实现高效的SWE代理适应。在本文中，我们首先整理了无需代理训练的配方，并展示了Kimi-Dev，一个开源的SWE语言模型，在SWE-bench Verified上达到60.4%，在工作流方法中最佳。通过额外的5000个公开可用轨迹的微调，Kimi-Dev使SWE代理达到48.6%的pass@1，与Claude 3.5 Sonnet（241022版本）相当。这些结果表明，从无需代理训练中获得的结构化技能先验可以弥合工作流和代理框架之间的鸿沟，为可转移的编码代理提供支持。

Summary / 总结

This work addresses the integration of agentless training as a skill prior for software engineering (SWE) agents, demonstrating that reasoning-intensive agentless training can induce skills such as localization, code editing, and self-reflection. Kimi-Dev, an open-source SWE LLM, achieved 60.4% on SWE-bench Verified, the best among workflow approaches. Further SFT adaptation on 5k publicly-available trajectories enabled Kimi-Dev to power SWE-Agents to 48.6% pass@1, comparable to Claude 3.5 Sonnet (241022 version).

该研究旨在将无代理训练作为技能先验应用于软件工程（SWE）代理。作者提出了Kimi-Dev，一个开源的SWE大型语言模型（LLM），在SWE-bench Verified上达到了60.4%的准确率，超过了其他工作流方法。进一步在5k公开轨迹上进行微调后，Kimi-Dev达到了48.6%的pass@1，与Claude 3.5 Sonnet相当。这表明，无代理训练中的结构化技能先验可以增强SWE代理在不同框架下的适应性。

DiffusionDriveV2: Reinforcement Learning-Constrained Truncated Diffusion Modeling in End-to-End Autonomous Driving

Authors: Jialv Zou, Shaoyu Chen, Bencheng Liao, Zhiyu Zheng, Yuehao Song, Lefei Zhang, Qian Zhang, Wenyu Liu, Xinggang Wang

First: 2025-12-08T17:29:52+00:00 · Latest: 2025-12-08T17:29:52+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Generative diffusion models for end-to-end autonomous driving often suffer from mode collapse, tending to generate conservative and homogeneous behaviors. While DiffusionDrive employs predefined anchors representing different driving intentions to partition the action space and generate diverse trajectories, its reliance on imitation learning lacks sufficient constraints, resulting in a dilemma between diversity and consistent high quality. In this work, we propose DiffusionDriveV2, which leverages reinforcement learning to both constrain low-quality modes and explore for superior trajectories. This significantly enhances the overall output quality while preserving the inherent multimodality of its core Gaussian Mixture Model. First, we use scale-adaptive multiplicative noise, ideal for trajectory planning, to promote broad exploration. Second, we employ intra-anchor GRPO to manage advantage estimation among samples generated from a single anchor, and inter-anchor truncated GRPO to incorporate a global perspective across different anchors, preventing improper advantage comparisons between distinct intentions (e.g., turning vs. going straight), which can lead to further mode collapse. DiffusionDriveV2 achieves 91.2 PDMS on the NAVSIM v1 dataset and 85.5 EPDMS on the NAVSIM v2 dataset in closed-loop evaluation with an aligned ResNet-34 backbone, setting a new record. Further experiments validate that our approach resolves the dilemma between diversity and consistent high quality for truncated diffusion models, achieving the best trade-off. Code and model will be available at https://github.com/hustvl/DiffusionDriveV2
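
The motivation for computing advantages within an anchor can be illustrated with a small sketch. It uses the standard GRPO-style group normalization (reward minus group mean, divided by group standard deviation) and applies it once per anchor and once over the pooled samples for contrast. The anchor names and reward values are made up, and the inter-anchor truncated variant is not shown because the abstract does not specify it.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize rewards within one group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def intra_anchor_advantages(rewards_by_anchor):
    """Normalize each anchor's samples only against that anchor's samples."""
    return {anchor: group_relative_advantages(rs)
            for anchor, rs in rewards_by_anchor.items()}

# Hypothetical rewards for trajectories sampled from two driving intentions.
rewards = {"turn_left": [0.2, 0.4], "go_straight": [0.8, 0.9]}

intra = intra_anchor_advantages(rewards)
pooled = group_relative_advantages(rewards["turn_left"] + rewards["go_straight"])

print(intra["turn_left"])  # the better left turn gets a positive advantage
print(pooled[:2])          # pooled: both left turns are penalized
```

With pooled normalization every sample of the lower-reward intention receives a negative advantage, which pushes probability mass away from that intention as a whole; per-anchor normalization still rewards the better trajectory within it.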

中文标题/摘要

标题：DiffusionDriveV2：端到端自动驾驶中强化学习约束的截断扩散建模

端到端自动驾驶中的生成扩散模型往往遭受模式崩溃的问题，倾向于生成保守且同质的行为。虽然DiffusionDrive通过预定义代表不同驾驶意图的锚点来划分动作空间并生成多样化的轨迹，但其对模仿学习的依赖缺乏足够的约束，导致多样性和一致的高质量之间的困境。在本文中，我们提出了DiffusionDriveV2，该模型利用强化学习来约束低质量模式并探索更优的轨迹。这显著提高了整体输出质量，同时保留了其核心高斯混合模型的固有多模态性。首先，我们使用适用于轨迹规划的尺度自适应乘性噪声，促进广泛的探索。其次，我们采用锚点内GRPO来管理来自单一锚点生成样本的优势估计，并采用跨锚点截断GRPO来融入不同锚点之间的全局视角，防止不同意图（如转弯与直行）之间不适当的优势比较，这种比较可能导致进一步的模式崩溃。DiffusionDriveV2在使用对齐的ResNet-34主干进行闭环评估时，在NAVSIM v1数据集上达到了91.2 PDMS，在NAVSIM v2数据集上达到了85.5 EPDMS，创下了新的记录。进一步的实验验证了我们的方法解决了截断扩散模型中多样性和一致高质量之间的困境，实现了最佳权衡。代码和模型将在https://github.com/hustvl/DiffusionDriveV2上提供

Summary / 总结

DiffusionDriveV2 addresses the issue of mode collapse in generative diffusion models for autonomous driving by integrating reinforcement learning to constrain low-quality modes and explore for better trajectories. It uses scale-adaptive multiplicative noise for broad exploration and intra-anchor and inter-anchor truncated GRPO to prevent improper advantage comparisons. The model achieves 91.2 PDMS on NAVSIM v1 and 85.5 EPDMS on NAVSIM v2, setting a new record and resolving the dilemma between diversity and consistent high quality.

DiffusionDriveV2通过结合强化学习来约束低质量模式并探索更好的轨迹，解决了生成扩散模型在自动驾驶中出现的模式崩溃问题。它使用尺度自适应的乘性噪声进行广泛的探索，并采用GRPO方法在单个锚点内和不同锚点之间管理优势估计，防止模式崩溃。该模型在NAVSIM v1上达到91.2 PDMS，在NAVSIM v2上达到85.5 EPDMS，创下新纪录，并解决了多样性与一致高质量之间的困境。

Trustworthy Retrosynthesis: Eliminating Hallucinations with a Diverse Ensemble of Reaction Scorers

Authors: Michal Sadowski, Tadija Radusinović, Maria Wyrzykowska, Lukasz Sztukiewicz, Jan Rzymkowski, Paweł Włodarczyk-Pruszyński, Mikołaj Sacha, Piotr Kozakowski, Ruard van Workum, Stanislaw Kamil Jastrzebski

First: 2025-10-12T14:56:34+00:00 · Latest: 2025-12-08T17:27:20+00:00

Abs · PDF · Code1 · Code2

Abstract

Retrosynthesis is one of the domains transformed by the rise of generative models, and it is one where the problem of nonsensical or erroneous outputs (hallucinations) is particularly insidious: reliable assessment of synthetic plans is time-consuming, with automatic methods lacking. In this work, we present RetroTrim, a retrosynthesis system that successfully avoids nonsensical plans on a set of challenging drug-like targets. Compared to common baselines in the field, our system is not only the sole method that succeeds in filtering out hallucinated reactions, but it also results in the highest number of high-quality paths overall. The key insight behind RetroTrim is the combination of diverse reaction scoring strategies, based on machine learning models and existing chemical databases. We show that our scoring strategies capture different classes of hallucinations by analyzing them on a dataset of labeled retrosynthetic intermediates. This approach formed the basis of our winning solution to the Standard Industries $1 million Retrosynthesis Challenge. To measure the performance of retrosynthesis systems, we propose a novel evaluation protocol for reactions and synthetic paths based on a structured review by expert chemists. Using this protocol, we compare systems on a set of 32 novel targets, curated to reflect recent trends in drug structures. While the insights behind our methodology are broadly applicable to retrosynthesis, our focus is on targets in the drug-like domain. By releasing our benchmark targets and the details of our evaluation protocol, we hope to inspire further research into reliable retrosynthesis.

中文标题/摘要

标题：可信逆合成反应：通过多样化反应评分器消除幻觉

逆合成反应是生成模型兴起后被改变的领域之一，其中无意义或错误输出（幻觉）的问题尤为棘手：可靠评估合成方案耗时且自动方法缺乏。在本文中，我们提出了RetroTrim，这是一种成功避免无意义方案的逆合成系统，针对一组具有挑战性的药物样目标点。与该领域的常见基线相比，我们的系统不仅是唯一能够过滤掉幻觉反应的方法，而且产生的高质量路径数量最多。RetroTrim的核心洞察是结合了基于机器学习模型和现有化学数据库的不同反应评分策略。我们通过在标记的逆合成中间体数据集上分析它们来展示我们的评分策略捕捉到了不同类别的幻觉。这种方法构成了我们赢得标准工业100万美元逆合成挑战赛解决方案的基础。为了衡量逆合成系统的性能，我们提出了一种基于专家化学家结构化评审的新评价协议，用于反应和合成路径。使用此协议，我们在32个新型目标上比较了系统，这些目标经过精心策划以反映药物结构的最新趋势。虽然我们方法背后的见解广泛适用于逆合成，但我们的重点是药物样目标点。通过发布我们的基准目标和评价协议的详细信息，我们希望激发更多关于可靠逆合成的研究。

Summary / 总结

The research aims to address the issue of nonsensical or erroneous outputs in retrosynthesis, particularly in drug-like targets. The authors developed RetroTrim, which uses a diverse ensemble of reaction scorers, including machine learning models and chemical databases, to filter out hallucinations. This system not only eliminates hallucinated reactions but also generates the highest number of high-quality paths compared to existing methods. The evaluation protocol, based on structured reviews by expert chemists, demonstrates RetroTrim's superior performance on 32 novel targets, reflecting recent trends in drug structures.

研究旨在解决药物类似目标中不合理的合成路径问题。作者开发了RetroTrim，该系统利用机器学习模型和化学数据库等多种反应评分策略来过滤掉幻觉反应。该系统不仅消除了幻觉反应，还生成了比现有方法更多的高质量路径。基于专家化学家的结构化评审，该评估协议展示了RetroTrim在32个新型目标上的优越性能，这些目标反映了药物结构的最新趋势。

HLTCOE Evaluation Team at TREC 2025: VQA Track

Authors: Dengjia Zhang, Charles Weng, Katherine Guerrerio, Yi Lu, Kenton Murray, Alexander Martin, Reno Kriz, Benjamin Van Durme

First: 2025-12-08T17:25:13+00:00 · Latest: 2025-12-08T17:25:13+00:00

Comments: 7 pages, 1 figure

Abs · PDF · Code1 · Code2

Abstract

The HLTCOE Evaluation team participated in TREC VQA's Answer Generation (AG) task, for which we developed a listwise learning framework that aims to improve semantic precision and ranking consistency in answer generation. Given a video-question pair, a base multimodal model first generates multiple candidate answers, which are then reranked using a model trained with a novel Masked Pointer Cross-Entropy Loss with Rank Weights. This objective integrates pointer-based candidate selection, rank-dependent weighting, and masked cross-entropy under vocabulary restriction, enabling stable and interpretable listwise optimization. By bridging generative modeling with discriminative ranking, our method produces coherent, fine-grained answer lists. Experiments reveal consistent gains in accuracy and ranking stability, especially for questions requiring temporal reasoning and semantic disambiguation.

中文标题/摘要

标题：HLTCOE评估团队在TREC 2025 VQA赛道

HLTCOE评估团队参加了TREC VQA的生成答案（AG）任务，我们开发了一种列表学习框架，旨在提高答案生成中的语义精确度和排名一致性。给定一个视频-问题对，基础的多模态模型首先生成多个候选答案，然后使用一种新颖的掩码指针交叉熵损失加排名权重进行重排序。该目标将基于指针的候选选择、排名依赖加权和在词汇限制下的掩码交叉熵结合在一起，实现稳定且可解释的列表优化。通过将生成建模与判别性排名相结合，我们的方法生成了连贯且细粒度的答案列表。实验结果显示，在准确性和排名稳定性方面均有所提升，尤其是在需要时间推理和语义消歧的问答中。

Summary / 总结

The HLTCOE Evaluation team participated in TREC VQA's Answer Generation task, developing a listwise learning framework to enhance semantic precision and ranking consistency. The framework uses a base multimodal model to generate candidate answers, which are then reranked using a novel Masked Pointer Cross-Entropy Loss with Rank Weights. Experiments showed consistent improvements in accuracy and ranking stability, particularly for questions involving temporal reasoning and semantic disambiguation.

HLTCOE评估团队参加了TREC VQA的Answer Generation任务，开发了一种列表学习框架以提高语义精度和排名一致性。该框架使用一个基础的多模态模型生成多个候选答案，然后使用一种新颖的Masked Pointer Cross-Entropy Loss with Rank Weights进行重排序。实验结果显示，在涉及时间推理和语义消歧的问答中，准确性和排名稳定性得到了一致的提升。

TranSplat: Instant Cross-Scene Object Relighting in Gaussian Splatting via Spherical Harmonic Transfer

Authors: Boyang Yu, Yanlin Jin, Yun He, Akshat Dave, Guha Balakrishnan

First: 2025-03-28T17:59:43+00:00 · Latest: 2025-12-08T17:21:38+00:00

Abs · PDF · Code1 · Code2

Abstract

We present TranSplat, a method for fast and accurate object relighting for the 3D Gaussian Splatting (GS) framework when transferring a 3D object from a source GS scene to a target GS scene. TranSplat is based on a theoretical radiance transfer identity for cross-scene relighting of objects with radially symmetric BRDFs that involves only taking simple products of spherical harmonic appearance coefficients of the object, source, and target environment maps without any explicit computation of scene quantities (e.g., the BRDFs themselves). TranSplat is the first method to demonstrate how this theoretical identity may be used to perform relighting within the GS framework, and furthermore, by automatically inferring unknown source and target environment maps directly from the source and target scene GS representations. We evaluated TranSplat on several synthetic and real-world scenes and objects, demonstrating comparable 3D object relighting performance to recent conventional inverse rendering-based GS methods with a fraction of their runtime. While TranSplat is theoretically best-suited for radially symmetric BRDFs, results demonstrate that TranSplat still offers perceptually realistic renderings on real scenes and opens a valuable, lightweight path forward to relighting with the GS framework.
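
The abstract does not state the transfer identity itself. The following is a sketch of how a product-only rule of this kind arises from the classical spherical-harmonic convolution result for radially symmetric BRDFs. The notation ($B_{lm}$ for outgoing-radiance coefficients, $L_{lm}$ for environment-map coefficients, $\hat{\rho}_l$ for the BRDF filter) is an assumption for illustration and may differ from the paper's.

```latex
% Reflection as convolution: for a radially symmetric BRDF, each outgoing
% coefficient is the lighting coefficient scaled by a per-band BRDF term.
B^{\mathrm{src}}_{lm} = \hat{\rho}_l \, L^{\mathrm{src}}_{lm},
\qquad
B^{\mathrm{tgt}}_{lm} = \hat{\rho}_l \, L^{\mathrm{tgt}}_{lm}

% Eliminating the unknown BRDF term gives a per-coefficient transfer rule
% (valid where L^{src}_{lm} \neq 0):
B^{\mathrm{tgt}}_{lm} = B^{\mathrm{src}}_{lm} \,
  \frac{L^{\mathrm{tgt}}_{lm}}{L^{\mathrm{src}}_{lm}}
```

The BRDF term cancels, which is consistent with the abstract's point that relighting needs only the object's appearance coefficients and the two environment maps, with no explicit estimate of the BRDF.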

中文标题/摘要

标题：TranSplat：通过球谐变换在高斯点云框架中的跨场景物体重新光照

我们提出了TranSplat，一种在将3D物体从源高斯点云（GS）场景转移到目标GS场景时，用于3D GS框架中快速准确物体重新光照的方法。TranSplat基于一个理论上的辐射传输恒等式，该恒等式适用于具有径向对称BRDF的物体跨场景重新光照，仅涉及对物体、源环境图和目标环境图的球谐外观系数进行简单乘积，而无需显式计算场景量（例如BRDF本身）。TranSplat是第一个展示如何利用该理论恒等式在GS框架中进行重新光照的方法，并且通过直接从源和目标场景的GS表示中自动推断未知的源和目标环境图，进一步实现了这一点。我们在多个合成和真实场景上评估了TranSplat，展示了其在3D物体重新光照性能上与最近的基于逆渲染的GS方法相当，但运行时间仅为它们的一小部分。虽然TranSplat理论上最适合径向对称的BRDF，但结果表明TranSplat在真实场景中仍能提供感知上真实的渲染，并为使用GS框架进行重新光照提供了一条有价值且轻量化的路径。

Summary / 总结

TranSplat is a method for fast and accurate object relighting in the 3D Gaussian Splatting framework, which transfers objects from a source scene to a target scene using spherical harmonic transfer. It leverages a theoretical radiance transfer identity for objects with radially symmetric BRDFs, avoiding explicit computation of scene quantities. TranSplat demonstrates comparable performance to recent inverse rendering-based methods but with significantly reduced runtime. The method is effective for both synthetic and real-world scenes, offering perceptually realistic results even for non-radially symmetric BRDFs.

TranSplat 是一种用于 3D 高斯点云框架中即时跨场景物体光照转移的方法，利用了适用于径向对称 BRDF 物体的辐射传输理论恒等式。通过使用球谐系数并自动推断环境图，TranSplat 达到了与基于逆渲染的方法相当的性能，但运行时间显著减少。结果表明，TranSplat 在合成和真实场景中都能实现逼真的光照转移。

SpatialDreamer: Incentivizing Spatial Reasoning via Active Mental Imagery

Authors: Meng Cao, Xingyu Li, Xue Liu, Ian Reid, Xiaodan Liang

First: 2025-12-08T17:20:50+00:00 · Latest: 2025-12-08T17:20:50+00:00

Abs · PDF · Code1 · Code2

Abstract

Despite advancements in Multi-modal Large Language Models (MLLMs) for scene understanding, their performance on complex spatial reasoning tasks requiring mental simulation remains significantly limited. Current methods often rely on passive observation of spatial data, failing to internalize an active mental imagery process. To bridge this gap, we propose SpatialDreamer, a reinforcement learning framework that enables spatial reasoning through a closed-loop process of active exploration, visual imagination via a world model, and evidence-grounded reasoning. To address the lack of fine-grained reward supervision in long-horizon reasoning tasks, we propose Geometric Policy Optimization (GeoPO), which introduces tree-structured sampling and step-level reward estimation with geometric consistency constraints. Extensive experiments demonstrate that SpatialDreamer delivers highly competitive results across multiple challenging benchmarks, signifying a critical advancement in human-like active spatial mental simulation for MLLMs.

中文标题/摘要

标题：SpatialDreamer：通过主动心理想象激励空间推理

尽管多模态大型语言模型（MLLMs）在场景理解方面取得了进展，但在需要心理模拟的复杂空间推理任务上，其表现仍然受到显著限制。当前方法通常依赖于对空间数据的被动观察，未能实现主动心理想象的过程。为弥合这一差距，我们提出SpatialDreamer，这是一种通过闭环过程实现空间推理的强化学习框架，该过程包括主动探索、通过世界模型进行视觉想象以及基于证据的推理。为解决长时序推理任务中缺乏精细的奖励监督问题，我们提出了几何策略优化（GeoPO），该方法引入了树状结构采样和具有几何一致性约束的步骤级奖励估计。广泛的实验表明，SpatialDreamer在多个具有挑战性的基准测试中取得了高度竞争力的结果，标志着在MLLMs中实现类似人类的主动空间心理模拟方面取得了关键进展。

Summary / 总结

The research aims to enhance the spatial reasoning capabilities of Multi-modal Large Language Models (MLLMs) by incorporating active mental imagery. SpatialDreamer, a reinforcement learning framework, is proposed to enable this through active exploration, visual imagination, and evidence-grounded reasoning. The key method involves Geometric Policy Optimization (GeoPO), which uses tree-structured sampling and step-level reward estimation with geometric consistency constraints. Experiments show that SpatialDreamer performs competitively on various benchmarks, indicating a significant step towards human-like spatial mental simulation for MLLMs.

研究旨在通过引入主动心理想象来提升多模态大型语言模型（MLLMs）的空间推理能力。SpatialDreamer 是一个结合了主动探索、视觉想象和基于证据的推理的强化学习框架，其核心方法几何策略优化（GeoPO）采用树状结构采样和带几何一致性约束的步骤级奖励估计。实验结果显示，SpatialDreamer 在多个基准测试上取得了极具竞争力的结果，表明 MLLMs 在进行类似人类的主动空间心理模拟方面取得了显著进步。

SAVE: Sparse Autoencoder-Driven Visual Information Enhancement for Mitigating Object Hallucination

Authors: Sangha Park, Seungryong Yoo, Jisoo Mok, Sungroh Yoon

Venue: WACV 2026

First: 2025-12-08T17:20:07+00:00 · Latest: 2025-12-08T17:20:07+00:00

Comments: WACV 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Although Multimodal Large Language Models (MLLMs) have advanced substantially, they remain vulnerable to object hallucination caused by language priors and visual information loss. To address this, we propose SAVE (Sparse Autoencoder-Driven Visual Information Enhancement), a framework that mitigates hallucination by steering the model along Sparse Autoencoder (SAE) latent features. A binary object-presence question-answering probe identifies the SAE features most indicative of the model's visual information processing, referred to as visual understanding features. Steering the model along these identified features reinforces grounded visual understanding and effectively reduces hallucination. With its simple design, SAVE outperforms state-of-the-art training-free methods on standard benchmarks, achieving a 10%p improvement in CHAIR_S and consistent gains on POPE and MMHal-Bench. Extensive evaluations across multiple models and layers confirm the robustness and generalizability of our approach. Further analysis reveals that steering along visual understanding features suppresses the generation of uncertain object tokens and increases attention to image tokens, mitigating hallucination. Code is released at https://github.com/wiarae/SAVE.
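
Steering along SAE features is commonly done by adding a scaled copy of each selected feature's decoder direction to a hidden state. The sketch below shows that generic operation on plain Python lists. The function name, the additive rule, the scale `alpha`, and the feature index are assumptions for illustration; the exact steering rule used by SAVE is in the released code, not in this abstract.

```python
def steer(hidden, decoder, feature_ids, alpha):
    """Return hidden + alpha * sum of decoder directions of the chosen features.

    hidden:      list of floats, one hidden-state vector
    decoder:     dict mapping SAE feature id -> decoder direction (same length)
    feature_ids: features selected as 'visual understanding features'
    alpha:       steering strength (hypothetical hyperparameter)
    """
    steered = list(hidden)  # copy so the input is left untouched
    for f in feature_ids:
        for d, w in enumerate(decoder[f]):
            steered[d] += alpha * w
    return steered

# Toy example: one selected feature whose decoder direction is the 2nd axis.
hidden = [1.0, 0.0, 0.0]
decoder = {7: [0.0, 1.0, 0.0]}
print(steer(hidden, decoder, feature_ids=[7], alpha=2.0))
```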

中文标题/摘要

标题：SAVE：稀疏自编码驱动的视觉信息增强以减轻对象幻觉

尽管多模态大型语言模型（MLLMs）取得了显著进步，但它们仍然容易受到语言先验和视觉信息丢失导致的对象幻觉的影响。为了解决这一问题，我们提出了SAVE（Sparse Autoencoder-Driven Visual Information Enhancement）框架，通过引导模型沿着稀疏自编码（SAE）潜在特征来减轻幻觉。二元对象存在性问题回答探针识别出最能反映模型视觉信息处理的SAE特征，称为视觉理解特征。沿着这些识别出的特征引导模型增强了基于视觉的理解，有效地减少了幻觉。凭借其简单的设计，SAVE在标准基准上优于最先进的无训练方法，分别在CHAIR_S上提高了10%p，在POPE和MMHal-Bench上保持一致的改进。广泛的评估表明，我们的方法具有鲁棒性和通用性。进一步的分析表明，沿着视觉理解特征引导可以抑制不确定对象标记的生成，并增加对图像标记的关注，从而减轻幻觉。代码发布在https://github.com/wiarae/SAVE。

The Native Spiking Microarchitecture: From Iontronic Primitives to Bit-Exact FP8 Arithmetic

Authors: Zhengzheng Tang

First: 2025-12-08T17:15:46+00:00 · Latest: 2025-12-08T17:15:46+00:00

Abstract

The 2025 Nobel Prize in Chemistry for Metal-Organic Frameworks (MOFs) and recent breakthroughs by Huanting Wang's team at Monash University establish angstrom-scale channels as promising post-silicon substrates with native integrate-and-fire (IF) dynamics. However, utilizing these stochastic, analog materials for deterministic, bit-exact AI workloads (e.g., FP8) remains a paradox. Existing neuromorphic methods often settle for approximation, failing Transformer precision standards. To traverse the gap "from stochastic ions to deterministic floats," we propose a Native Spiking Microarchitecture. Treating noisy neurons as logic primitives, we introduce a Spatial Combinational Pipeline and a Sticky-Extra Correction mechanism. Validation across all 16,129 FP8 pairs confirms 100% bit-exact alignment with PyTorch. Crucially, our architecture reduces Linear layer latency to O(log N), yielding a 17x speedup. Physical simulations further demonstrate robustness against extreme membrane leakage (β ≈ 0.01), effectively immunizing the system against the stochastic nature of the hardware.
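
The abstract does not say which FP8 format or which operand set the 16,129 pairs cover. One reading that is consistent with the count, offered here only as an assumption, is the E4M3 format with a single NaN encoding per sign (the variant PyTorch exposes as `float8_e4m3fn`): it has 127 non-negative, non-NaN encodings, and 127 × 127 = 16,129. The sketch below decodes that format by hand to check the arithmetic; `decode_e4m3` is an illustrative helper, not code from the paper.

```python
import math


def decode_e4m3(code):
    """Decode one 8-bit E4M3 (finite-only, single NaN) code to a Python float."""
    sign = -1.0 if code & 0x80 else 1.0
    exponent = (code >> 3) & 0xF
    mantissa = code & 0x7
    if exponent == 0xF and mantissa == 0x7:
        return float("nan")  # the only NaN pattern per sign; no infinities
    if exponent == 0:
        # Subnormal numbers: no implicit leading one, fixed exponent of -6.
        return sign * (mantissa / 8.0) * 2.0 ** -6
    # Normal numbers: implicit leading one, exponent bias of 7.
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


# Non-negative codes are 0x00..0x7F; drop the NaN pattern.
non_negative = [decode_e4m3(c) for c in range(0x80)]
non_negative = [v for v in non_negative if not math.isnan(v)]

print(len(non_negative), len(non_negative) ** 2, max(non_negative))
```

If the paper instead enumerates a different operand set, the same loop can be adapted by changing the range of codes.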

Seeing through Imagination: Learning Scene Geometry via Implicit Spatial World Modeling

Authors: Meng Cao, Haokun Lin, Haoyuan Li, Haoran Tang, Rongtao Xu, Dong An, Xue Liu, Ian Reid, Xiaodan Liang

First: 2025-12-01T16:01:41+00:00 · Latest: 2025-12-08T17:14:06+00:00

Abstract

Spatial reasoning, the ability to understand and interpret the 3D structure of the world, is a critical yet underdeveloped capability in Multimodal Large Language Models (MLLMs). Current methods predominantly rely on verbal descriptive tuning, which suffers from visual illiteracy, i.e., they learn spatial concepts through textual symbols alone, devoid of connection to their visual manifestations. To bridge this gap, this paper introduces MILO, an Implicit spatIaL wOrld modeling paradigm that simulates human-like spatial imagination. MILO integrates a visual generator to provide geometry-aware feedback, thereby implicitly grounding the MLLM's symbolic reasoning in perceptual experience. Complementing this paradigm, we propose RePE (Relative Positional Encoding), a novel encoding scheme that captures relative camera-pose transformations, offering superior performance over absolute coordinate systems. To support the training, we construct GeoGen, a large-scale Geometry-aware Generative dataset with approximately 2,241 videos and 67,827 observation-action-outcome triplets. Experiments demonstrate that our approach significantly enhances spatial reasoning capabilities across multiple baselines and benchmarks, offering a more holistic understanding of 3D space.

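Chinese title/abstract (English translation)

Title: Seeing through Imagination: Learning Scene Geometry via Implicit Spatial World Modeling

Spatial reasoning, the ability to understand and interpret the structure of the 3D world, is a critical yet underdeveloped capability in Multimodal Large Language Models (MLLMs). Current methods rely mainly on descriptive tuning based on textual symbols, which leads to visual blind spots: they learn spatial concepts through textual symbols alone, disconnected from their visual manifestations. To address this, this paper introduces MILO, an implicit spatial world modeling paradigm that simulates human spatial imagination. MILO incorporates a visual generator that provides geometry-aware feedback, implicitly linking the MLLM's symbolic reasoning to perceptual experience. To complement this paradigm, we propose RePE (Relative Positional Encoding), a novel encoding scheme that captures relative camera-pose transformations and outperforms absolute coordinate systems. To support training, we construct GeoGen, a large-scale geometry-aware generative dataset containing approximately 2,241 videos and 67,827 observation-action-outcome triplets. Experiments show that our approach significantly enhances spatial reasoning across multiple baselines and benchmarks, providing a more holistic understanding of 3D space.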

Summary

This paper addresses the limited spatial reasoning of current Multimodal Large Language Models (MLLMs) by introducing MILO, an implicit spatial world modeling paradigm. MILO uses a visual generator to provide geometry-aware feedback, grounding symbolic reasoning in perceptual experience. The approach also introduces RePE, a novel encoding scheme for relative camera-pose transformations, and is trained on GeoGen, a large-scale geometry-aware generative dataset. Experiments show that the method significantly improves spatial reasoning across multiple baselines and benchmarks.

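By introducing MILO, an implicit spatial world modeling paradigm, this paper addresses the limitations of current multimodal large language models in spatial reasoning. MILO uses a visual generator to provide geometry-aware feedback, combining symbolic reasoning with perceptual experience. The method also introduces RePE, a new encoding scheme for capturing relative camera-pose transformations. Experiments show that the approach significantly improves spatial reasoning across a range of benchmarks, providing a more holistic understanding of 3D space.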
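
The abstract says RePE captures relative camera-pose transformations but does not describe the encoding itself. As a hedged illustration of the underlying quantity only, the sketch below computes the pose of one camera in the frame of another from two 4x4 camera-to-world matrices. The function names and the camera-to-world convention are assumptions, not details from the paper.

```python
def invert_rigid(pose):
    """Invert a 4x4 rigid transform [R | t] as [R^T | -R^T t]."""
    rotation = [row[:3] for row in pose[:3]]
    translation = [row[3] for row in pose[:3]]
    rotation_t = [[rotation[j][i] for j in range(3)] for i in range(3)]
    translation_inv = [
        -sum(rotation_t[i][k] * translation[k] for k in range(3)) for i in range(3)
    ]
    top = [rotation_t[i] + [translation_inv[i]] for i in range(3)]
    return top + [[0.0, 0.0, 0.0, 1.0]]


def matmul4(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def relative_pose(pose_ref, pose_target):
    """Pose of the target camera expressed in the reference camera's frame."""
    return matmul4(invert_rigid(pose_ref), pose_target)


def translation_pose(x, y, z):
    """Helper: a pose with identity rotation and the given translation."""
    return [[1.0, 0.0, 0.0, x], [0.0, 1.0, 0.0, y], [0.0, 0.0, 1.0, z], [0.0, 0.0, 0.0, 1.0]]


# Toy usage: the relative pose depends only on how the two cameras differ.
print(relative_pose(translation_pose(1.0, 0.0, 0.0), translation_pose(1.0, 2.0, 3.0)))
```

A relative pose is unchanged when both cameras are moved by the same world transform, which is one plausible reason to prefer it over absolute coordinates.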

ViSA: 3D-Aware Video Shading for Real-Time Upper-Body Avatar Creation

Authors: Fan Yang, Heyuan Li, Peihao Li, Weihao Yuan, Lingteng Qiu, Chaoyue Song, Cheng Chen, Yisheng He, Shifeng Zhang, Xiaoguang Han, Steven Hoi, Guosheng Lin

First: 2025-12-08T17:10:29+00:00 · Latest: 2025-12-08T17:10:29+00:00

Abstract

Generating high-fidelity upper-body 3D avatars from a one-shot input image remains a significant challenge. Current 3D avatar generation methods, which rely on large reconstruction models, are fast and capable of producing stable body structures, but they often suffer from artifacts such as blurry textures and stiff, unnatural motion. In contrast, generative video models show promising performance by synthesizing photorealistic and dynamic results, but they frequently struggle with unstable behavior, including body structural errors and identity drift. To address these limitations, we propose a novel approach that combines the strengths of both paradigms. Our framework employs a 3D reconstruction model to provide robust structural and appearance priors, which in turn guides a real-time autoregressive video diffusion model for rendering. This process enables the model to synthesize high-frequency, photorealistic details and fluid dynamics in real time, effectively reducing texture blur and motion stiffness while preventing the structural inconsistencies common in video generation methods. By uniting the geometric stability of 3D reconstruction with the generative capabilities of video models, our method produces high-fidelity digital avatars with realistic appearance and dynamic, temporally coherent motion. Experiments demonstrate that our approach significantly reduces artifacts and achieves substantial improvements in visual quality over leading methods, providing a robust and efficient solution for real-time applications such as gaming and virtual reality. Project page: https://lhyfst.github.io/visa

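Chinese title/abstract (English translation)

Title: ViSA: 3D-Aware Video Shading for Real-Time Upper-Body Avatar Creation

Generating a high-fidelity upper-body 3D avatar from a single input image remains a major challenge. Current 3D avatar generation methods that rely on large reconstruction models are fast and produce stable body structures, but they often show defects such as blurry textures and stiff, unnatural motion. In contrast, generative video models perform well by synthesizing photorealistic, dynamic results, but they often struggle with unstable behavior, including body-structure errors and identity drift. To address these limitations, we propose a new method that combines the strengths of both paradigms. Our framework uses a 3D reconstruction model to provide robust structural and appearance priors, which then guide a real-time autoregressive video diffusion model for rendering. This process lets the model synthesize high-frequency, photorealistic details and fluid dynamics in real time, effectively reducing texture blur and motion stiffness while preventing the structural inconsistencies common in video generation methods. By combining the geometric stability of 3D reconstruction with the generative capability of video models, our method produces high-quality digital avatars with realistic appearance and dynamic, temporally coherent motion. Experiments show that our method significantly reduces artifacts and substantially improves visual quality over leading methods, providing a robust and efficient solution for real-time applications such as gaming and virtual reality. Project page: https://lhyfst.github.io/visa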

Summary

The paper proposes ViSA, a method that combines 3D reconstruction and generative video models to create high-fidelity upper-body avatars. It uses a 3D reconstruction model to provide structural and appearance priors, guiding a real-time autoregressive video diffusion model for rendering. The approach reduces texture blur and motion stiffness, and improves visual quality over existing methods, making it suitable for real-time applications like gaming and virtual reality.

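The paper proposes a method called ViSA that combines 3D reconstruction with a video diffusion model to generate high-quality upper-body avatars from a single input image. The method uses 3D reconstruction to provide robust structural and appearance priors, which guide a real-time video diffusion model to produce photorealistic, dynamic results. Experiments show that ViSA significantly reduces image artifacts and improves visual quality, making it suitable for real-time applications such as gaming and virtual reality.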

UnCageNet: Tracking and Pose Estimation of Caged Animal

Authors: Sayak Dutta, Harish Katti, Shashikant Verma, Shanmuganathan Raman

First: 2025-12-08T17:00:06+00:00 · Latest: 2025-12-08T17:00:06+00:00

Comments: 9 pages, 2 figures, 2 tables. Accepted to the Indian Conference on Computer Vision, Graphics, and Image Processing (ICVGIP 2025), Mandi, India

Abstract

Animal tracking and pose estimation systems, such as STEP (Simultaneous Tracking and Pose Estimation) and ViTPose, experience substantial performance drops when processing images and videos with cage structures and systematic occlusions. We present a three-stage preprocessing pipeline that addresses this limitation through: (1) cage segmentation using a Gabor-enhanced ResNet-UNet architecture with tunable orientation filters, (2) cage inpainting using CRFill for content-aware reconstruction of occluded regions, and (3) evaluation of pose estimation and tracking on the uncaged frames. Our Gabor-enhanced segmentation model leverages orientation-aware features with 72 directional kernels to accurately identify and segment cage structures that severely impair the performance of existing methods. Experimental validation demonstrates that removing cage occlusions through our pipeline enables pose estimation and tracking performance comparable to that in environments without occlusions. We also observe significant improvements in keypoint detection accuracy and trajectory consistency.

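Chinese title/abstract (English translation)

Title: UnCageNet: Tracking and Pose Estimation of Caged Animal

Animal tracking and pose estimation systems such as STEP (Simultaneous Tracking and Pose Estimation) and ViTPose suffer significant performance drops when processing images and videos that contain cage structures and systematic occlusions. We propose a three-stage preprocessing pipeline that addresses this limitation through: (1) cage segmentation using a Gabor-enhanced ResNet-UNet architecture with tunable orientation filters, (2) content-aware reconstruction of occluded regions using CRFill, and (3) evaluation of pose estimation and tracking on the uncaged frames. Our Gabor-enhanced segmentation model uses orientation-aware features from 72 directional kernels to accurately identify and segment the cage structures that severely impair existing methods. Experimental validation shows that removing occlusions with our pipeline achieves pose estimation and tracking performance comparable to that in occlusion-free environments. We also observe significant improvements in keypoint detection accuracy and trajectory consistency.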

Summary

The paper addresses the performance drop of animal tracking and pose estimation systems in the presence of cage structures and occlusions. It introduces a three-stage preprocessing pipeline: cage segmentation using a Gabor-enhanced ResNet-UNet, cage inpainting with CRFill, and evaluation on uncaged frames. The Gabor-enhanced segmentation model uses 72 directional kernels to accurately segment cage structures. Experimental results show that removing cage occlusions brings pose estimation and tracking performance to levels comparable to those in unobstructed environments, with improved keypoint detection accuracy and trajectory consistency.

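To address the performance drop of animal tracking and pose estimation systems on images and videos containing cage structures and occlusions, the paper proposes a three-stage preprocessing pipeline: cage segmentation with a Gabor-enhanced ResNet-UNet, inpainting of occluded regions with CRFill, and evaluation on the uncaged frames. The Gabor-enhanced model uses 72 directional kernels to segment cage structures accurately. Experimental results show that removing cage occlusions raises pose estimation and tracking performance to the accuracy seen in occlusion-free environments and improves keypoint detection and trajectory consistency.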
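
A bank of 72 oriented Gabor kernels, as mentioned in the abstract, can be sketched with the standard real-valued Gabor function. This is an illustration under assumptions: the even spacing of orientations over 180 degrees (2.5 degree steps), the kernel size, and the values of `sigma`, `wavelength`, and `gamma` are all hypothetical, and the abstract does not say how the kernels are wired into the ResNet-UNet.

```python
import math


def gabor_kernel(size, theta, sigma=3.0, wavelength=6.0, gamma=0.5, psi=0.0):
    """Real Gabor kernel: a Gaussian envelope times a cosine carrier along theta.

    size must be odd so that the kernel has a central pixel.
    """
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate pixel coordinates into the filter's orientation.
            x_rot = x * math.cos(theta) + y * math.sin(theta)
            y_rot = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(x_rot ** 2 + (gamma * y_rot) ** 2) / (2.0 * sigma ** 2))
            carrier = math.cos(2.0 * math.pi * x_rot / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def gabor_bank(num_orientations=72, size=15):
    """One kernel per orientation, evenly spaced over [0, pi)."""
    return [
        gabor_kernel(size, k * math.pi / num_orientations)
        for k in range(num_orientations)
    ]


bank = gabor_bank()
print(len(bank), len(bank[0]), len(bank[0][0]))
```

Orientations only need to cover half a turn because a cosine Gabor kernel rotated by 180 degrees is identical to itself.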

PhyloLM: Inferring the Phylogeny of Large Language Models and Predicting their Performances in Benchmarks

Authors: Nicolas Yax, Pierre-Yves Oudeyer, Stefano Palminteri

Venue: ICLR 2025 (poster)

First: 2024-04-06T16:16:30+00:00 · Latest: 2025-12-08T16:59:52+00:00

Comments: The project code is available at https://github.com/Nicolas-Yax/PhyloLM . Published as https://iclr.cc/virtual/2025/poster/28195 at ICLR 2025. A code demo is available at https://colab.research.google.com/drive/1agNE52eUevgdJ3KL3ytv5Y9JBbfJRYqd

Abstract

This paper introduces PhyloLM, a method adapting phylogenetic algorithms to Large Language Models (LLMs) to explore whether and how they relate to each other and to predict their performance characteristics. Our method calculates a phylogenetic distance metric based on the similarity of LLMs' output. The resulting metric is then used to construct dendrograms, which satisfactorily capture known relationships across a set of 111 open-source and 45 closed models. Furthermore, our phylogenetic distance predicts performance in standard benchmarks, thus demonstrating its functional validity and paving the way for a time and cost-effective estimation of LLM capabilities. To sum up, by translating population genetic concepts to machine learning, we propose and validate a tool to evaluate LLM development, relationships and capabilities, even in the absence of transparent training information.

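Chinese title/abstract (English translation)

Title: PhyloLM: Inferring the Phylogeny of Large Language Models and Predicting their Performances in Benchmarks

This paper introduces PhyloLM, a method that applies phylogenetic algorithms to Large Language Models (LLMs) to explore whether and how they relate to one another and to predict their performance characteristics. Our method computes a phylogenetic distance metric based on the similarity of LLMs' outputs. The metric is then used to build phylogenetic trees that effectively capture known relationships among 111 open-source and 45 closed models. In addition, our phylogenetic distance predicts performance on standard benchmarks, which demonstrates its functional validity and paves the way for fast, low-cost estimation of LLM capabilities. In summary, by applying population genetics concepts to machine learning, we propose and validate a tool for evaluating the development, relationships, and capabilities of LLMs, even when transparent training information is lacking.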

Summary

PhyloLM is a method that uses phylogenetic algorithms to analyze Large Language Models (LLMs) and predict their performance. It calculates a phylogenetic distance based on the similarity of LLMs' outputs and constructs dendrograms that satisfactorily capture known relationships among 156 LLMs (111 open-source and 45 closed). The distance metric also predicts LLM performance in standard benchmarks, supporting its use for evaluating LLM capabilities and development without detailed training information.

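PhyloLM is a method that uses phylogenetic algorithms to analyze and predict the performance of large language models (LLMs). It computes a phylogenetic distance from the similarity of LLM outputs and builds dendrograms that accurately reflect the relationships among 156 LLMs. The distance also correlates with LLM performance on benchmarks, which validates the method's predictive power and its usefulness for evaluating LLM capabilities, even without detailed training information.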
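
The two steps described above, a distance from output similarity followed by a dendrogram, can be sketched as follows. The abstract does not give the distance formula, so the Nei-style form used here (negative log of a normalized overlap between per-prompt token frequencies) is an assumption chosen for illustration, as is the simple average-linkage clustering. The function names and the toy models `A`, `B`, and `C` are hypothetical.

```python
import math
from itertools import combinations


def nei_style_distance(p, q):
    """Distance between two models from per-prompt token frequency tables.

    p, q: lists of dicts (one per prompt) mapping token -> empirical frequency.
    """
    overlap = sum(
        p_g.get(tok, 0.0) * q_g.get(tok, 0.0)
        for p_g, q_g in zip(p, q)
        for tok in p_g
    )
    norm_p = sum(v * v for p_g in p for v in p_g.values())
    norm_q = sum(v * v for q_g in q for v in q_g.values())
    if norm_p == 0.0 or norm_q == 0.0:
        raise ValueError("each model needs at least one observed token")
    similarity = overlap / math.sqrt(norm_p * norm_q)
    return -math.log(similarity) if similarity > 0.0 else math.inf


def build_dendrogram(names, dist):
    """Average-linkage clustering; returns the tree as nested tuples.

    dist: dict mapping frozenset({a, b}) -> distance between models a and b.
    """
    clusters = [(name, [name]) for name in names]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            members_i, members_j = clusters[i][1], clusters[j][1]
            total = sum(dist[frozenset((a, b))] for a in members_i for b in members_j)
            average = total / (len(members_i) * len(members_j))
            if best is None or average < best[0]:
                best = (average, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Toy data: two prompts, token frequencies per model.
outputs = {
    "A": [{"x": 1.0}, {"y": 0.5, "z": 0.5}],
    "B": [{"x": 1.0}, {"y": 0.5, "z": 0.5}],
    "C": [{"x": 0.5, "w": 0.5}, {"z": 1.0}],
}
names = sorted(outputs)
dist = {
    frozenset((a, b)): nei_style_distance(outputs[a], outputs[b])
    for a, b in combinations(names, 2)
}
tree = build_dendrogram(names, dist)
print(tree)
```

In this toy setup the two models with identical output frequencies merge first, and the third joins them at the root.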