arXiv Paper Digest

DiffusionBrowser: Interactive Diffusion Previews via Multi-Branch Decoders

Authors: Susung Hong, Chongjian Ge, Zhifei Zhang, Jui-Hsien Wang

First: 2025-12-15T18:59:57+00:00 · Latest: 2025-12-15T18:59:57+00:00

Comments: Project page: https://susunghong.github.io/DiffusionBrowser

Abstract

Video diffusion models have revolutionized generative video synthesis, but they are imprecise, slow, and can be opaque during generation -- keeping users in the dark for a prolonged period. In this work, we propose DiffusionBrowser, a model-agnostic, lightweight decoder framework that allows users to interactively generate previews at any point (timestep or transformer block) during the denoising process. Our model can generate multi-modal preview representations that include RGB and scene intrinsics at more than 4$\times$ real-time speed (less than 1 second for a 4-second video) that convey consistent appearance and motion to the final video. With the trained decoder, we show that it is possible to interactively guide the generation at intermediate noise steps via stochasticity reinjection and modal steering, unlocking a new control capability. Moreover, we systematically probe the model using the learned decoders, revealing how scene, object, and other details are composed and assembled during the otherwise black-box denoising process.

中文标题/摘要

标题：DiffusionBrowser：多分支解码器实现交互式扩散预览

视频扩散模型已经革新了生成视频合成，但它们不够精确，速度慢，并且在生成过程中可能不够透明，使用户在较长时间内处于信息不足的状态。在本文中，我们提出了一种模型无关的轻量级解码器框架——DiffusionBrowser，允许用户在去噪过程中任意时间点（时间步或变压器块）生成交互式预览。我们的模型可以在超过4倍实时速度（少于1秒生成4秒视频）下生成多模态预览表示，这些表示能够一致地传达最终视频的外观和运动。通过训练好的解码器，我们展示了可以在中间噪声步骤通过重新注入随机性和模式导向来交互式引导生成的可能性，从而解锁新的控制能力。此外，我们系统地使用学习到的解码器探查模型，揭示了在黑盒去噪过程中场景、物体和其他细节是如何组合和构建的。

Summary / 总结

DiffusionBrowser is a model-agnostic, lightweight decoder framework that lets users interactively generate video previews at any timestep or transformer block during denoising, using multi-branch decoders. The previews are multi-modal (RGB and scene intrinsics) and are produced at more than 4× real-time speed, i.e. in under 1 second for a 4-second video. The trained decoders also make it possible to guide generation at intermediate noise steps through stochasticity reinjection and modal steering, and to probe how scene, object, and other details are assembled during the denoising process.

DiffusionBrowser 是一个模型无关的框架，允许在去噪过程中通过多分支解码器生成视频预览，实现交互式生成。用户可以在任意时间步或变压器块生成多模态预览，达到实时或接近实时的速度。该系统可以在中间噪声步骤引导生成过程，提供新的控制能力，并揭示去噪过程中场景细节的组成。

LitePT: Lighter Yet Stronger Point Transformer

Authors: Yuanwen Yue, Damien Robert, Jianyuan Wang, Sunghwan Hong, Jan Dirk Wegner, Christian Rupprecht, Konrad Schindler

First: 2025-12-15T18:59:57+00:00 · Latest: 2025-12-15T18:59:57+00:00

Comments: Project page: https://litept.github.io/

Abstract

Modern neural architectures for 3D point cloud processing contain both convolutional layers and attention blocks, but the best way to assemble them remains unclear. We analyse the role of different computational blocks in 3D point cloud networks and find an intuitive behaviour: convolution is adequate to extract low-level geometry at high-resolution in early layers, where attention is expensive without bringing any benefits; attention captures high-level semantics and context in low-resolution, deep layers more efficiently. Guided by this design principle, we propose a new, improved 3D point cloud backbone that employs convolutions in early stages and switches to attention for deeper layers. To avoid the loss of spatial layout information when discarding redundant convolution layers, we introduce a novel, training-free 3D positional encoding, PointROPE. The resulting LitePT model has $3.6\times$ fewer parameters, runs $2\times$ faster, and uses $2\times$ less memory than the state-of-the-art Point Transformer V3, but nonetheless matches or even outperforms it on a range of tasks and datasets. Code and models are available at: https://github.com/prs-eth/LitePT.

中文标题/摘要

标题：LitePT：更轻量但更强大的点变换器

现代用于3D点云处理的神经架构包含卷积层和注意力模块，但它们的最佳组合方式尚不明确。我们分析了3D点云网络中不同计算模块的作用，并发现一种直观的行为：卷积在早期高分辨率层中足以提取低级几何信息，而此时注意力模块昂贵且不带来任何好处；注意力模块在低分辨率的深层中更有效地捕捉高级语义和上下文。根据这一设计原则，我们提出了一种新的3D点云骨干网络，早期阶段使用卷积，深层阶段切换到注意力模块。为了避免在丢弃冗余卷积层时丢失空间布局信息，我们引入了一种新的、无需训练的3D位置编码，PointROPE。结果表明，LitePT模型参数量减少3.6倍，运行速度提高2倍，内存使用减少2倍，但在多种任务和数据集上仍能匹配甚至超越最先进的Point Transformer V3。代码和模型可在：https://github.com/prs-eth/LitePT获取。

Summary / 总结

The paper proposes LitePT, a new 3D point cloud backbone that uses convolutions in early stages and attention in deeper layers. The design follows the observation that convolution is adequate for extracting low-level geometry at high resolution, while attention captures high-level semantics and context more efficiently in low-resolution, deep layers. A training-free 3D positional encoding, PointROPE, preserves spatial layout information once redundant convolution layers are discarded. Compared with the state-of-the-art Point Transformer V3, LitePT has 3.6× fewer parameters, runs 2× faster, and uses 2× less memory, while matching or outperforming it on a range of tasks and datasets.

研究旨在通过平衡卷积层和注意力模块的使用来优化3D点云处理。研究发现，卷积在早期层中有效提取低级几何信息，而注意力在深层中更有利于高级语义理解。基于此，提出了一种新的骨干网络LitePT，初期使用卷积，后期切换到注意力。为了保留空间布局信息，引入了一种新的无需训练的3D位置编码PointROPE。LitePT在参数量上减少了3.6倍，运行速度提高了2倍，内存使用减少了2倍，同时在多种任务和数据集上与Point Transformer V3相当或更优。

Recurrent Video Masked Autoencoders

Authors: Daniel Zoran, Nikhil Parthasarathy, Yi Yang, Drew A Hudson, Joao Carreira, Andrew Zisserman

First: 2025-12-15T18:59:48+00:00 · Latest: 2025-12-15T18:59:48+00:00

Abstract

We present Recurrent Video Masked-Autoencoders (RVM): a novel video representation learning approach that uses a transformer-based recurrent neural network to aggregate dense image features over time, effectively capturing the spatio-temporal structure of natural video data. RVM learns via an asymmetric masked prediction task requiring only a standard pixel reconstruction objective. This design yields a highly efficient "generalist" encoder: RVM achieves competitive performance with state-of-the-art video models (e.g. VideoMAE, V-JEPA) on video-level tasks like action recognition and point/object tracking, while also performing favorably against image models (e.g. DINOv2) on tasks that test geometric and dense spatial understanding. Notably, RVM achieves strong performance in the small-model regime without requiring knowledge distillation, exhibiting up to 30x greater parameter efficiency than competing video masked autoencoders. Moreover, we demonstrate that RVM's recurrent nature allows for stable feature propagation over long temporal horizons with linear computational cost, overcoming some of the limitations of standard spatio-temporal attention-based architectures. Finally, we use qualitative visualizations to highlight that RVM learns rich representations of scene semantics, structure, and motion.

中文标题/摘要

标题：循环视频掩蔽自编码器

我们提出了循环视频掩蔽自编码器（RVM）：一种新颖的视频表示学习方法，使用基于变压器的循环神经网络在时间上聚合密集图像特征，有效地捕捉自然视频数据的空间-时间结构。RVM 通过不对称的掩蔽预测任务进行学习，仅需标准像素重构目标。这种设计产生了一个高效的“通才”编码器：RVM 在视频级别任务（如动作识别和点/对象跟踪）上与最先进的视频模型（例如 VideoMAE、V-JEPA）竞争性能，同时在测试几何和密集空间理解的任务上也优于图像模型（例如 DINOv2）。值得注意的是，RVM 在小型模型范围内表现出色，无需知识蒸馏，参数效率比竞争的视频掩蔽自编码器高 30 倍以上。此外，我们证明了 RVM 的循环特性使其能够在长时间序列中稳定传播特征，具有线性计算成本，克服了标准空间-时间注意力基架构的一些局限性。最后，我们使用定性可视化来突出 RVM 学习丰富的场景语义、结构和运动表示。

Summary / 总结

Recurrent Video Masked-Autoencoders (RVM) is a video representation learning approach that uses a transformer-based recurrent neural network to aggregate dense image features over time. RVM is trained with an asymmetric masked prediction task and a standard pixel reconstruction objective. It is competitive with state-of-the-art video models on video-level tasks like action recognition and point/object tracking, and performs favorably against image models on tasks requiring geometric and dense spatial understanding. RVM shows up to 30x greater parameter efficiency than competing video masked autoencoders and propagates features stably over long temporal horizons with linear computational cost, making it a highly efficient "generalist" encoder.

Recurrent Video Masked-Autoencoders (RVM) 是一种使用基于变压器的递归神经网络来捕捉视频中时空结构的新方法。RVM 通过不对称的掩蔽预测任务进行训练，并在动作识别和点/对象跟踪等视频级别任务上表现出竞争力，同时在需要几何和密集空间理解的任务上也表现出色。RVM 相比其他视频掩蔽自编码器具有高达 30 倍的参数效率，并且能够以线性计算成本稳定地在长时间范围内传播特征，使其成为一个高效的‘通用’编码器。

I-Scene: 3D Instance Models are Implicit Generalizable Spatial Learners

Authors: Lu Ling, Yunhao Ge, Yichen Sheng, Aniket Bera

First: 2025-12-15T18:59:13+00:00 · Latest: 2025-12-15T18:59:13+00:00

Abstract

Generalization remains the central challenge for interactive 3D scene generation. Existing learning-based approaches ground spatial understanding in limited scene datasets, restricting generalization to new layouts. We instead reprogram a pre-trained 3D instance generator to act as a scene-level learner, replacing dataset-bounded supervision with model-centric spatial supervision. This reprogramming unlocks the generator's transferable spatial knowledge, enabling generalization to unseen layouts and novel object compositions. Remarkably, spatial reasoning still emerges even when the training scenes are randomly composed objects. This demonstrates that the generator's transferable scene prior provides a rich learning signal for inferring proximity, support, and symmetry from purely geometric cues. Replacing the widely used canonical space, we instantiate this insight with a view-centric formulation of the scene space, yielding a fully feed-forward, generalizable scene generator that learns spatial relations directly from the instance model. Quantitative and qualitative results show that a 3D instance generator is an implicit spatial learner and reasoner, pointing toward foundation models for interactive 3D scene understanding and generation. Project page: https://luling06.github.io/I-Scene-project/

中文标题/摘要

标题：I-场景：3D实例模型是隐式的通用空间学习者

泛化仍然是交互式3D场景生成的核心挑战。现有的基于学习的方法将空间理解局限于有限的场景数据集，限制了对新布局的泛化能力。相反，我们重新编程了一个预训练的3D实例生成器，使其作为场景级别的学习者发挥作用，用模型为中心的空间监督取代数据集限制的监督。这种重新编程解锁了生成器可转移的空间知识，使其能够泛化到未见过的布局和新的物体组合。令人惊讶的是，即使训练场景是由随机组合的对象构成的，空间推理仍然会出现。这表明生成器的可转移场景先验为从纯几何线索中推断邻近性、支撑和对称性提供了丰富的学习信号。我们用基于视点的场景空间公式替代广泛使用的标准空间，从而实现了一个完全前馈的、可泛化的场景生成器，该生成器直接从实例模型中学习空间关系。定量和定性的结果表明，3D实例生成器是一个隐式的空间学习者和推理者，指出了交互式3D场景理解和生成的基础模型。项目页面：https://luling06.github.io/I-Scene-project/

LASER: Layer-wise Scale Alignment for Training-Free Streaming 4D Reconstruction

Authors: Tianye Ding, Yiming Xie, Yiqing Liang, Moitreya Chatterjee, Pedro Miraldo, Huaizu Jiang

First: 2025-12-15T18:59:04+00:00 · Latest: 2025-12-15T18:59:04+00:00

Comments: 16 pages

Abstract

Recent feed-forward reconstruction models like VGGT and $\pi^3$ achieve impressive reconstruction quality but cannot process streaming videos due to quadratic memory complexity, limiting their practical deployment. While existing streaming methods address this through learned memory mechanisms or causal attention, they require extensive retraining and may not fully leverage the strong geometric priors of state-of-the-art offline models. We propose LASER, a training-free framework that converts an offline reconstruction model into a streaming system by aligning predictions across consecutive temporal windows. We observe that simple similarity transformation ($\mathrm{Sim}(3)$) alignment fails due to layer depth misalignment: monocular scale ambiguity causes relative depth scales of different scene layers to vary inconsistently between windows. To address this, we introduce layer-wise scale alignment, which segments depth predictions into discrete layers, computes per-layer scale factors, and propagates them across both adjacent windows and timestamps. Extensive experiments show that LASER achieves state-of-the-art performance on camera pose estimation and point map reconstruction while operating at 14 FPS with 6 GB peak memory on an RTX A6000 GPU, enabling practical deployment for kilometer-scale streaming videos. Project website: https://neu-vi.github.io/LASER/
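The per-layer scale idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the function names, the fixed depth edges used to form layers, and the median-of-ratios estimator are hypothetical, and propagation across timestamps is left out.

```python
from statistics import median


def layer_index(depth, edges):
    """Return the index of the depth layer that `depth` falls into.

    `edges` is an ascending list of layer boundaries (hypothetical choice).
    """
    for i, edge in enumerate(edges):
        if depth < edge:
            return i
    return len(edges)


def per_layer_scales(prev_depths, curr_depths, edges):
    """Estimate one scale factor per depth layer from overlapping pixels.

    prev_depths and curr_depths hold depths of the same pixels as predicted
    by the previous and the current window. Layers are assigned from
    curr_depths; the scale is the median prev/curr ratio (an assumption).
    """
    ratios = {}
    for d_prev, d_curr in zip(prev_depths, curr_depths):
        if d_curr <= 0:
            continue  # skip invalid depths
        ratios.setdefault(layer_index(d_curr, edges), []).append(d_prev / d_curr)
    return {layer: median(values) for layer, values in ratios.items()}


def align(curr_depths, edges, scales, default=1.0):
    """Rescale each depth with the factor of the layer it belongs to."""
    return [d * scales.get(layer_index(d, edges), default) for d in curr_depths]


# Near and far layers drift by different factors between the two windows,
# so no single global scale can reconcile them.
prev = [1.0, 2.0, 20.0, 40.0]
curr = [0.5, 1.0, 5.0, 10.0]
scales = per_layer_scales(prev, curr, edges=[3.0])
aligned = align(curr, [3.0], scales)
```

In this toy case the near layer needs a factor of 2 and the far layer a factor of 4, which is the kind of inconsistency the abstract attributes to monocular scale ambiguity.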

Summary / 总结

The research aims to enable training-free streaming 4D reconstruction by addressing the memory complexity issue of existing feed-forward models. LASER proposes a layer-wise scale alignment method that segments depth predictions into layers, computes per-layer scale factors, and propagates them to align predictions across consecutive temporal windows. Experiments demonstrate that LASER achieves state-of-the-art performance in camera pose estimation and point map reconstruction while operating at 14 FPS with 6 GB peak memory.

论文提出了LASER，一种无需训练的框架，通过在连续时间窗口间对齐预测将离线4D重建模型转换为流媒体系统。它解决了由单目尺度歧义引起的层深度错位问题，并引入了层间尺度对齐来计算并传播每层的尺度因子。LASER在相机姿态估计和点云重建方面达到了最先进的性能，同时在RTX A6000 GPU上以每秒14帧的速度运行，峰值内存使用量为6 GB，适用于大规模流媒体视频的实用部署。

Feedforward 3D Editing via Text-Steerable Image-to-3D

Authors: Ziqi Ma, Hongqiao Chen, Yisong Yue, Georgia Gkioxari

First: 2025-12-15T18:58:55+00:00 · Latest: 2025-12-15T18:58:55+00:00

Comments: https://glab-caltech.github.io/steer3d/

Abstract

Recent progress in image-to-3D has opened up immense possibilities for design, AR/VR, and robotics. However, to use AI-generated 3D assets in real applications, a critical requirement is the capability to edit them easily. We present a feedforward method, Steer3D, to add text steerability to image-to-3D models, which enables editing of generated 3D assets with language. Our approach is inspired by ControlNet, which we adapt to image-to-3D generation to enable text steering directly in a forward pass. We build a scalable data engine for automatic data generation, and develop a two-stage training recipe based on flow-matching training and Direct Preference Optimization (DPO). Compared to competing methods, Steer3D more faithfully follows the language instruction and maintains better consistency with the original 3D asset, while being 2.4x to 28.5x faster. Steer3D demonstrates that it is possible to add a new modality (text) to steer the generation of pretrained image-to-3D generative models with 100k data. Project website: https://glab-caltech.github.io/steer3d/
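ControlNet attaches a trainable branch to a frozen pretrained model through zero-initialized layers, so training starts from the unmodified model's behaviour. The sketch below shows only that general mechanism on plain Python lists. How Steer3D adapts it to image-to-3D generation is not described in the abstract, so the function names, shapes, and the way the text feature is injected are all hypothetical.

```python
def linear(x, weight):
    """Apply a bias-free linear layer; `weight` is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def steered_block(hidden, text_feature, base_weight, control_weight, zero_proj):
    """Frozen base path plus a text-conditioned control path.

    The control path ends in `zero_proj`, which starts as all zeros, so the
    block initially reproduces the pretrained output exactly.
    """
    base = linear(hidden, base_weight)  # frozen pretrained path
    conditioned = [h + t for h, t in zip(hidden, text_feature)]  # hypothetical injection
    control = linear(conditioned, control_weight)  # trainable copy
    residual = linear(control, zero_proj)  # zero-initialized projection
    return [b + r for b, r in zip(base, residual)]


hidden = [1.0, 2.0]
text = [0.5, -0.5]
identity = [[1.0, 0.0], [0.0, 1.0]]
doubled = [[2.0, 0.0], [0.0, 2.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]

at_init = steered_block(hidden, text, identity, doubled, zeros)
after_training = steered_block(hidden, text, identity, doubled, identity)
```

With the zero projection the output equals the base output; once the projection has non-zero weights, the text feature starts to shift the result.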

中文标题/摘要

标题：基于文本可引导的图像到3D的前馈3D编辑

图像到3D领域的最新进展为设计、AR/VR和机器人技术开辟了巨大的可能性。然而，要在实际应用中使用AI生成的3D资产，一个关键要求是能够轻松编辑它们。我们提出了一种前馈方法Steer3D，将文本可引导性添加到图像到3D模型中，使生成的3D资产可以通过语言进行编辑。我们的方法受到ControlNet的启发，我们将其适应图像到3D生成，以在前向传递中直接实现文本引导。我们构建了一个可扩展的数据引擎以实现自动数据生成，并开发了一种基于流匹配训练和直接偏好优化（DPO）的两阶段训练方案。与竞争方法相比，Steer3D更忠实地遵循语言指令，并且与原始3D资产保持更好的一致性，同时速度快2.4到28.5倍。Steer3D证明了可以使用10万数据将一种新的模态（文本）添加到引导预训练图像到3D生成模型的生成中。项目网站：https://glab-caltech.github.io/steer3d/

Summary / 总结

The research aims to enable easy editing of AI-generated 3D assets for applications in design, AR/VR, and robotics. Steer3D, a feedforward method inspired by ControlNet, adds text steerability to image-to-3D models, allowing language-driven editing in a single forward pass. The approach uses a scalable data engine for automatic data generation and a two-stage training recipe based on flow-matching training and Direct Preference Optimization (DPO). Compared to competing methods, Steer3D follows the language instruction more faithfully, stays more consistent with the original 3D asset, and is 2.4x to 28.5x faster.

研究旨在通过文本指令使AI生成的3D资产易于编辑。方法Steer3D通过将ControlNet应用于图像到3D生成，使文本引导能够在前向传递中实现。关键发现表明，Steer3D更准确地遵循文本指令，并且与原始3D资产保持更好的一致性，同时比竞争对手的方法快得多，最多快28.5倍。

Active 6D Pose Estimation for Textureless Objects using Multi-View RGB Frames

Authors: Jun Yang, Wenjie Xue, Sahar Ghavidel, Steven L. Waslander

First: 2025-03-05T18:28:32+00:00 · Latest: 2025-12-15T18:58:46+00:00

Abstract

Estimating the 6D pose of textureless objects from RGB images is an important problem in robotics. Due to appearance ambiguities, rotational symmetries, and severe occlusions, single-view based 6D pose estimators are still unable to handle a wide range of objects, motivating research towards multi-view pose estimation and next-best-view prediction that addresses these limitations. In this work, we propose a comprehensive active perception framework for estimating the 6D poses of textureless objects using only RGB images. Our approach is built upon a key idea: decoupling the 6D pose estimation into a two-step sequential process can greatly improve both accuracy and efficiency. First, we estimate the 3D translation of each object, resolving scale and depth ambiguities inherent to RGB images. These estimates are then used to simplify the subsequent task of determining the 3D orientation, which we achieve through canonical scale template matching. Building on this formulation, we then introduce an active perception strategy that predicts the next best camera viewpoint to capture an RGB image, effectively reducing object pose uncertainty and enhancing pose accuracy. We evaluate our method on the public ROBI and TOD datasets, as well as on our reconstructed transparent object dataset, T-ROBI. Under the same camera viewpoints, our multi-view pose estimation significantly outperforms state-of-the-art approaches. Furthermore, by leveraging our next-best-view strategy, our approach achieves high pose accuracy with fewer viewpoints than heuristic-based policies across all evaluated datasets. The accompanying video and T-ROBI dataset will be released on our project page: https://trailab.github.io/ActiveODPE.

中文标题/摘要

标题：使用多视角RGB帧估计无纹理物体的6D姿态

从RGB图像估计无纹理物体的6D姿态是机器人技术中的一个重要问题。由于外观歧义、旋转对称性和严重遮挡，基于单视角的6D姿态估计器仍然无法处理广泛的物体，这促使研究朝向多视角姿态估计和最佳视角预测方向发展，以解决这些限制。在本文中，我们提出了一种全面的主动感知框架，仅使用RGB图像估计无纹理物体的6D姿态。我们的方法建立在一个关键思想之上：将6D姿态估计分解为两步顺序过程可以大大提高准确性和效率。首先，我们估计每个物体的3D平移，解决RGB图像固有的尺度和深度歧义。然后，利用这些估计简化后续确定3D方向的任务，我们通过标准尺度模板匹配实现这一目标。在此基础上，我们引入了一种主动感知策略，预测最佳视角以捕捉RGB图像，有效减少物体姿态不确定性并提高姿态准确性。我们在公共ROBI和TOD数据集以及我们重建的透明物体数据集T-ROBI上评估了我们的方法。在相同的摄像机视角下，我们的多视角姿态估计显著优于最先进的方法。此外，通过利用我们的最佳视角策略，我们的方法在所有评估数据集上以较少的视角实现高姿态准确性，优于基于启发式的策略。随附的视频和T-ROBI数据集将在我们的项目页面上发布：https://trailab.github.io/ActiveODPE/

Summary / 总结

This paper addresses the challenge of estimating the 6D pose of textureless objects using RGB images. It proposes a two-step active perception framework that first estimates 3D translation to resolve ambiguities, followed by 3D orientation determination through canonical scale template matching. The method also includes an active perception strategy to predict the next best viewpoint, reducing pose uncertainty. Experiments on ROBI, TOD, and T-ROBI datasets show that the multi-view approach outperforms state-of-the-art methods and achieves high accuracy with fewer viewpoints compared to heuristic-based policies.

论文针对使用RGB图像估计无纹理物体的6D姿态时遇到的挑战，特别是单视角方法由于外观模糊和遮挡所面临的局限性。它提出了一种两步主动感知框架，首先估计3D平移，然后通过标准尺度模板匹配确定3D方向。该方法还包括一种预测最佳拍摄视角的主动感知策略，以减少姿态不确定性。实验结果显示，多视角方法在公共数据集上的表现优于最先进的方法，并且在使用较少视角的情况下实现了高精度，优于基于启发式的策略。

JoVA: Unified Multimodal Learning for Joint Video-Audio Generation

Authors: Xiaohu Huang, Hao Zhou, Qiangpeng Yang, Shilei Wen, Kai Han

First: 2025-12-15T18:58:18+00:00 · Latest: 2025-12-15T18:58:18+00:00

Comments: Project page: https://visual-ai.github.io/jova

Abstract

In this paper, we present JoVA, a unified framework for joint video-audio generation. Despite recent encouraging advances, existing methods face two critical limitations. First, most existing approaches can only generate ambient sounds and lack the capability to produce human speech synchronized with lip movements. Second, recent attempts at unified human video-audio generation typically rely on explicit fusion or modality-specific alignment modules, which introduce additional architecture design and weaken the model simplicity of the original transformers. To address these issues, JoVA employs joint self-attention across video and audio tokens within each transformer layer, enabling direct and efficient cross-modal interaction without the need for additional alignment modules. Furthermore, to enable high-quality lip-speech synchronization, we introduce a simple yet effective mouth-area loss based on facial keypoint detection, which enhances supervision on the critical mouth region during training without compromising architectural simplicity. Extensive experiments on benchmarks demonstrate that JoVA outperforms or is competitive with both unified and audio-driven state-of-the-art methods in lip-sync accuracy, speech quality, and overall video-audio generation fidelity. Our results establish JoVA as an elegant framework for high-quality multimodal generation.
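Joint self-attention over video and audio tokens can be pictured as ordinary self-attention applied to one concatenated sequence. The following is a minimal single-head sketch under strong simplifying assumptions: queries, keys, and values are the tokens themselves (no learned projections), and the function names are hypothetical rather than taken from JoVA.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_self_attention(video_tokens, audio_tokens):
    """Attend over video and audio tokens as one shared sequence.

    Every token can attend to every other token of either modality, so no
    separate cross-modal alignment module is involved.
    """
    tokens = video_tokens + audio_tokens  # concatenate along the sequence axis
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * value[j] for w, value in zip(weights, tokens)) for j in range(dim)]
        )
    split = len(video_tokens)
    return outputs[:split], outputs[split:]


video_out, audio_out = joint_self_attention([[1.0, 0.0]], [[0.0, 1.0]])
```

Here the single video token ends up with a non-zero second component, meaning information from the audio token has been mixed in through the shared attention.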

中文标题/摘要

标题：JoVA：联合视频-音频生成的统一多模态学习

在本文中，我们提出了JoVA，一种联合视频-音频生成的统一框架。尽管最近取得了令人鼓舞的进步，但现有方法面临两个关键限制。首先，大多数现有方法只能生成环境声，缺乏生成与唇部动作同步的人声的能力。其次，最近的联合人类视频-音频生成尝试通常依赖于显式融合或模态特定对齐模块，这引入了额外的架构设计，削弱了原始变压器的模型简洁性。为了解决这些问题，JoVA 在每个变压器层中采用视频和音频标记之间的联合自注意力，使跨模态交互直接且高效，无需额外的对齐模块。此外，为了实现高质量的唇部语音同步，我们引入了一种基于面部关键点检测的简单而有效的口部区域损失，这在训练过程中增强了对关键口部区域的监督，而不会牺牲架构的简洁性。在基准测试上的广泛实验表明，JoVA 在唇同步准确性、语音质量和整体视频-音频生成保真度方面优于或与统一和音频驱动的最新方法竞争。我们的结果将JoVA确立为高质量多模态生成的优雅框架。

Summary / 总结

JoVA is a unified framework for joint video-audio generation that addresses the limitations of existing methods by enabling synchronized human speech with lip movements and avoiding the need for additional alignment modules. It uses joint self-attention across video and audio tokens within each transformer layer and introduces a mouth-area loss based on facial keypoint detection to enhance lip-sync accuracy. Experiments show that JoVA outperforms or matches state-of-the-art methods in lip-sync accuracy, speech quality, and overall video-audio generation fidelity.

JoVA 是一种统一的框架，用于联合视频-音频生成，通过实现与唇部运动同步的人类语音，并避免使用额外的对齐模块来解决现有方法的限制。它在每个变压器层中使用视频和音频标记之间的联合自注意力，并引入基于面部关键点检测的口部区域损失来增强训练期间口部区域的监督。实验表明，JoVA 在唇同步准确性、语音质量和整体视频-音频生成保真度方面优于或与最先进的方法相当。

Towards Interactive Intelligence for Digital Humans

Authors: Yiyi Cai, Xuangeng Chu, Xiwei Gao, Sitong Gong, Yifei Huang, Caixin Kang, Kunhang Li, Haiyang Liu, Ruicong Liu, Yun Liu, Dianwen Ng, Zixiong Su, Erwin Wu, Yuhan Wu, Dingkun Yan, Tianyu Yan, Chang Zeng, Bo Zheng, You Zhou

First: 2025-12-15T18:57:35+00:00 · Latest: 2025-12-15T18:57:35+00:00

Abstract

We introduce Interactive Intelligence, a novel paradigm of digital human that is capable of personality-aligned expression, adaptive interaction, and self-evolution. To realize this, we present Mio (Multimodal Interactive Omni-Avatar), an end-to-end framework composed of five specialized modules: Thinker, Talker, Face Animator, Body Animator, and Renderer. This unified architecture integrates cognitive reasoning with real-time multimodal embodiment to enable fluid, consistent interaction. Furthermore, we establish a new benchmark to rigorously evaluate the capabilities of interactive intelligence. Extensive experiments demonstrate that our framework achieves superior performance compared to state-of-the-art methods across all evaluated dimensions. Together, these contributions move digital humans beyond superficial imitation toward intelligent interaction.

中文标题/摘要

标题：迈向数字人类的交互智能

我们介绍了交互智能，这是一种新型的数字人类，能够实现个性化的表达、适应性交互和自我进化。为了实现这一目标，我们提出了Mio（多模态交互全息化身），这是一个由五个专门模块组成的端到端框架：思考者、说话者、面部动画器、身体动画器和渲染器。这种统一的架构将认知推理与实时多模态表现相结合，以实现流畅、一致的交互。此外，我们还建立了一个新的基准来严格评估交互智能的能力。广泛的实验表明，我们的框架在所有评估维度上都优于最先进的方法。这些贡献共同推动了数字人类从表面模仿向智能交互的转变。

Summary / 总结

The research introduces Interactive Intelligence, a new paradigm for digital humans that can express personalities, adaptively interact, and self-evolve. It presents Mio, an end-to-end framework with five specialized modules: Thinker, Talker, Face Animator, Body Animator, and Renderer. These modules integrate cognitive reasoning with real-time multimodal embodiment, enabling fluid and consistent interactions. Experiments show that Mio outperforms existing methods in all evaluated dimensions, advancing digital humans towards intelligent interaction rather than mere imitation.

研究引入了交互智能，这是一种新的数字人类范式，能够进行个性表达、适应互动和自我进化。它提出了Mio，一个包含专门模块的端到端框架，用于认知推理和实时多模态体现。实验表明，Mio在所有评估维度上都优于现有方法，推动了数字人类从简单模仿向智能互动的发展。

AgentIAD: Tool-Augmented Single-Agent for Industrial Anomaly Detection

Authors: Junwen Miao, Penghui Du, Yi Liu, Yu Wang, Yan Wang

First: 2025-12-15T18:57:04+00:00 · Latest: 2025-12-15T18:57:04+00:00

Abstract

Industrial anomaly detection (IAD) is difficult due to the scarcity of normal reference samples and the subtle, localized nature of many defects. Single-pass vision-language models (VLMs) often overlook small abnormalities and lack explicit mechanisms to compare against canonical normal patterns. We propose AgentIAD, a tool-driven agentic framework that enables multi-stage visual inspection. The agent is equipped with a Perceptive Zoomer (PZ) for localized fine-grained analysis and a Comparative Retriever (CR) for querying normal exemplars when evidence is ambiguous. To teach these inspection behaviors, we construct structured perceptive and comparative trajectories from the MMAD dataset and train the model in two stages: supervised fine-tuning followed by reinforcement learning. A two-part reward design drives this process: a perception reward that supervises classification accuracy, spatial alignment, and type correctness, and a behavior reward that encourages efficient tool use. Together, these components enable the model to refine its judgment through step-wise observation, zooming, and verification. AgentIAD achieves a new state-of-the-art 97.62% classification accuracy on MMAD, surpassing prior MLLM-based approaches while producing transparent and interpretable inspection traces.

中文标题/摘要

标题：AgentIAD：工具增强的单智能体工业异常检测

工业异常检测（IAD）由于正常参考样本稀缺和许多缺陷的细微、局部性质而具有挑战性。单次视图语言模型（VLMs）往往忽视小异常，缺乏与标准正常模式进行对比的显式机制。我们提出了一种工具驱动的代理框架AgentIAD，以实现多阶段视觉检查。该代理配备了感知放大器（PZ）进行局部精细分析，以及比较检索器（CR）在证据模糊时查询正常示例。为了教授这些检查行为，我们从MMAD数据集中构建了结构化的感知和比较轨迹，并分两阶段训练模型：监督微调后进行强化学习。两部分奖励设计驱动这一过程：感知奖励监督分类准确性、空间对齐和类型正确性，行为奖励鼓励高效使用工具。这些组件共同使模型能够通过逐步观察、放大和验证来完善其判断。AgentIAD在MMAD上实现了新的最佳分类准确率97.62%，超越了基于MLLM的先前方法，同时生成透明且可解释的检查轨迹。

Summary / 总结

AgentIAD is a tool-driven agentic framework for industrial anomaly detection that performs multi-stage visual inspection. It includes a Perceptive Zoomer for localized fine-grained analysis and a Comparative Retriever that queries normal exemplars when evidence is ambiguous. The model is trained in two stages, supervised fine-tuning followed by reinforcement learning, with a two-part reward: a perception reward for classification accuracy, spatial alignment, and type correctness, and a behavior reward for efficient tool use. AgentIAD achieves 97.62% classification accuracy on MMAD, surpassing prior MLLM-based approaches while producing transparent, interpretable inspection traces.

AgentIAD 是一种工具驱动的工业异常检测框架，使用感知放大器进行局部分析，并使用比较检索器查询正常模式。它通过监督微调和强化学习两个阶段进行训练，奖励系统鼓励准确分类和高效使用工具。AgentIAD 达到了 97.62% 的分类准确率，超过了先前的方法，并提供了透明的检测轨迹。

Template-Guided Reconstruction of Pulmonary Segments with Neural Implicit Functions

Authors: Kangxian Xie, Yufei Zhu, Kaiming Kuang, Li Zhang, Hongwei Bran Li, Mingchen Gao, Jiancheng Yang

First: 2025-05-13T19:31:01+00:00 · Latest: 2025-12-15T18:56:33+00:00

Comments: Manuscript accepted by Medical Image Analysis, 2025

Abstract

High-quality 3D reconstruction of pulmonary segments plays a crucial role in segmentectomy and surgical planning for the treatment of lung cancer. Due to the resolution requirement of the target reconstruction, conventional deep learning-based methods often suffer from computational resource constraints or limited granularity. Conversely, implicit modeling is favored due to its computational efficiency and continuous representation at any resolution. We propose a neural implicit function-based method to learn a 3D surface to achieve anatomy-aware, precise pulmonary segment reconstruction, represented as a shape by deforming a learnable template. Additionally, we introduce two clinically relevant evaluation metrics to comprehensively assess the quality of the reconstruction. Furthermore, to address the lack of publicly available shape datasets for benchmarking reconstruction algorithms, we developed a shape dataset named Lung3D, which includes the 3D models of 800 labeled pulmonary segments and their corresponding airways, arteries, veins, and intersegmental veins. We demonstrate that the proposed approach outperforms existing methods, providing a new perspective for pulmonary segment reconstruction. Code and data will be available at https://github.com/HINTLab/ImPulSe.

中文标题/摘要

标题：基于模板引导的神经隐式函数肺段重建

高质量的3D肺段重建在肺段切除术和肺癌治疗的手术规划中起着关键作用。由于目标重建所需的分辨率要求，传统的基于深度学习的方法往往受到计算资源限制或粒度有限的困扰。相反，隐式建模因其计算效率和任意分辨率下的连续表示而受到青睐。我们提出了一种基于神经隐式函数的方法，通过变形可学习的模板来学习3D表面，以实现解剖学意识的精确肺段重建。此外，我们引入了两个临床相关的评估指标，以全面评估重建质量。为进一步解决用于基准测试重建算法的公开形状数据集缺乏的问题，我们开发了一个名为Lung3D的形状数据集，其中包括800个带有标注的肺段3D模型及其相应的气道、动脉、静脉和亚段静脉。我们证明了所提出的方法优于现有方法，为肺段重建提供了新的视角。代码和数据将在https://github.com/HINTLab/ImPulSe上提供。

Summary / 总结

The research aims to improve the precision and efficiency of 3D pulmonary segment reconstruction for lung cancer treatment. The method uses neural implicit functions to deform a learnable template, achieving anatomy-aware reconstruction with a continuous representation at any resolution. The paper also introduces two clinically relevant evaluation metrics and a shape dataset named Lung3D, which includes 3D models of 800 labeled pulmonary segments and their corresponding airways, arteries, veins, and intersegmental veins. The proposed approach outperforms existing methods.

研究旨在提高肺段重建的精度和效率，以用于肺癌治疗。方法采用神经隐式函数对可学习模板进行变形，实现高分辨率和解剖学意识的重建。所提出的方法在性能上优于现有方法，并包含一个名为Lung3D的新数据集，其中包含800个标注的肺段及其对应的气道、动脉、静脉和亚段静脉。评估指标显示在准确性和细节方面优于以往的技术。

A Scientific Reasoning Model for Organic Synthesis Procedure Generation

Authors: Guoqing Liu, Junren Li, Zihan Zhao, Eray Inanc, Krzysztof Maziarz, Jose Garrido Torres, Victor Garcia Satorras, Shoko Ueda, Christopher M. Bishop, Marwin Segler

First: 2025-12-15T18:55:39+00:00 · Latest: 2025-12-15T18:55:39+00:00

Abstract

Solving computer-aided synthesis planning is essential for enabling fully automated, robot-assisted synthesis workflows and improving the efficiency of drug discovery. A key challenge, however, is bridging the gap between computational route design and practical laboratory execution, particularly the accurate prediction of viable experimental procedures for each synthesis step. In this work, we present QFANG, a scientific reasoning language model capable of generating precise, structured experimental procedures directly from reaction equations, with explicit chain-of-thought reasoning. To develop QFANG, we curated a high-quality dataset comprising 905,990 chemical reactions paired with structured action sequences, extracted and processed from patent literature using large language models. We introduce a Chemistry-Guided Reasoning (CGR) framework that produces chain-of-thought data grounded in chemical knowledge at scale. The model subsequently undergoes supervised fine-tuning to elicit complex chemistry reasoning. Finally, we apply Reinforcement Learning from Verifiable Rewards (RLVR) to further enhance procedural accuracy. Experimental results demonstrate that QFANG outperforms advanced general-purpose reasoning models and nearest-neighbor retrieval baselines, measured by traditional NLP similarity metrics and a chemically aware evaluator using an LLM-as-a-judge. Moreover, QFANG generalizes to certain out-of-domain reaction classes and adapts to variations in laboratory conditions and user-specific constraints. We believe that QFANG's ability to generate high-quality synthesis procedures represents an important step toward bridging the gap between computational synthesis planning and fully automated laboratory synthesis.

中文标题/摘要

标题：有机合成程序生成的科学推理模型

计算机辅助合成规划对于实现完全自动化的、机器人辅助的合成工作流程以及提高药物发现的效率至关重要。然而，一个关键挑战是弥合计算路线设计与实际实验室执行之间的差距，特别是准确预测每个合成步骤的可行实验程序。在本研究中，我们提出了QFANG，这是一种能够直接从反应方程式生成精确且结构化的实验程序的科学推理语言模型，并且具有显式的链式推理。为了开发QFANG，我们整理了一个高质量的数据集，包含905,990个化学反应及其结构化的操作序列，这些数据是从专利文献中提取并使用大型语言模型进行处理的。我们引入了一种化学指导推理（CGR）框架，该框架能够大规模生成基于化学知识的链式推理数据。随后，模型通过监督微调来激发复杂的化学推理。最后，我们应用可验证奖励的强化学习（RLVR）进一步提高程序的准确性。实验结果表明，QFANG在传统NLP相似度指标和使用LLM作为裁判的化学意识评估器中，优于先进的通用推理模型和最近邻检索基线。此外，QFANG能够泛化到某些新的反应类别，并适应实验室条件和用户特定约束的变化。我们认为，QFANG生成高质量合成程序的能力是弥合计算合成规划与完全自动化实验室合成之间差距的重要一步。

Summary / 总结

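The research aims to bridge the gap between computational route design and practical laboratory execution in organic synthesis by generating precise, structured experimental procedures directly from reaction equations. QFANG, a scientific reasoning language model, is developed using a curated dataset of 905,990 chemical reactions paired with structured action sequences extracted from patent literature. A Chemistry-Guided Reasoning (CGR) framework supplies chain-of-thought data grounded in chemical knowledge; the model then undergoes supervised fine-tuning and Reinforcement Learning from Verifiable Rewards (RLVR) to enhance procedural accuracy. QFANG outperforms advanced general-purpose reasoning models and nearest-neighbor retrieval baselines, generalizes to certain out-of-domain reaction classes, and adapts to variations in laboratory conditions and user-specific constraints.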

研究旨在通过开发QFANG科学推理语言模型，弥合计算路线设计与实验室执行之间的差距。QFANG使用化学引导推理框架从反应方程式生成精确的、结构化的实验程序，并通过监督学习和可验证奖励的强化学习进行微调。实验结果表明，QFANG在性能上优于通用推理模型和基线，并且能够应用于新的反应类别，并适应各种实验室条件和用户特定的约束。

Grab-3D: Detecting AI-Generated Videos from 3D Geometric Temporal Consistency

Authors: Wenhan Chen, Sezer Karaoglu, Theo Gevers

First: 2025-12-15T18:54:30+00:00 · Latest: 2025-12-15T18:54:30+00:00

Abstract

Recent advances in diffusion-based generation techniques enable AI models to produce highly realistic videos, heightening the need for reliable detection mechanisms. However, existing detection methods provide only limited exploration of the 3D geometric patterns present in generated videos. In this paper, we use vanishing points as an explicit representation of 3D geometry patterns, revealing fundamental discrepancies in geometric consistency between real and AI-generated videos. We introduce Grab-3D, a geometry-aware transformer framework for detecting AI-generated videos based on 3D geometric temporal consistency. To enable reliable evaluation, we construct an AI-generated video dataset of static scenes, allowing stable 3D geometric feature extraction. We propose a geometry-aware transformer equipped with geometric positional encoding, temporal-geometric attention, and an EMA-based geometric classifier head to explicitly inject 3D geometric awareness into temporal modeling. Experiments demonstrate that Grab-3D significantly outperforms state-of-the-art detectors, achieving robust cross-domain generalization to unseen generators.
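The vanishing-point idea can be made concrete with a minimal sketch. This is not Grab-3D's pipeline: it only intersects two image lines in homogeneous coordinates to obtain a vanishing point, then measures how far that point drifts across frames. The function names `vanishing_point` and `max_drift` are hypothetical, and the drift score is an assumption that is only meaningful for a fixed or purely translating camera; the abstract does not say how consistency is scored.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors (homogeneous points or lines)."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def vanishing_point(seg_a, seg_b):
    """Intersect two image lines, each given as ((x1, y1), (x2, y2)).

    Returns (x, y), or None when the lines are parallel in the image.
    """
    (p1, p2), (q1, q2) = seg_a, seg_b
    # A line through two points is the cross product of the points.
    line_a = cross((p1[0], p1[1], 1.0), (p2[0], p2[1], 1.0))
    line_b = cross((q1[0], q1[1], 1.0), (q2[0], q2[1], 1.0))
    # The intersection of two lines is the cross product of the lines.
    x, y, w = cross(line_a, line_b)
    if abs(w) < 1e-12:
        return None
    return (x / w, y / w)

def max_drift(points):
    """Hypothetical consistency score: largest distance from the mean point."""
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    return max(math.hypot(p[0] - cx, p[1] - cy) for p in points)

# Two edges that converge at (10, 5) in the image plane.
vp = vanishing_point(((0, 0), (2, 1)), ((0, 10), (2, 9)))
parallel = vanishing_point(((0, 0), (1, 0)), ((0, 1), (1, 1)))
drift = max_drift([(0.0, 0.0), (2.0, 0.0)])
```

A learned detector such as the one described would consume far richer geometric features; the sketch only shows why a vanishing point is a compact, explicit handle on 3D structure.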

中文标题/摘要

标题：Grab-3D：从三维几何时间一致性检测AI生成视频

基于扩散生成技术的最新进展使AI模型能够生成高度逼真的视频，从而提高了可靠检测机制的需求。然而，现有的检测方法仅对生成视频中存在的三维几何模式进行了有限的探索。在本文中，我们使用消失点作为三维几何模式的显式表示，揭示了真实视频和AI生成视频在几何一致性方面的根本差异。我们引入了Grab-3D，这是一种基于三维几何时间一致性的几何感知变换器框架，用于检测AI生成的视频。为了实现可靠的评估，我们构建了一个静态场景的AI生成视频数据集，允许稳定地提取3D几何特征。我们提出了一种几何感知变换器，配备了几何位置编码、时间-几何注意力和基于EMA的几何分类器头部，以明确地将三维几何意识注入到时间建模中。实验表明，Grab-3D显著优于最先进的检测器，并实现了对未见过的生成器的稳健跨域泛化。

Summary / 总结

This paper addresses the challenge of detecting AI-generated videos by focusing on 3D geometric temporal consistency. The authors introduce Grab-3D, a geometry-aware transformer framework that uses vanishing points to detect discrepancies in 3D geometry between real and AI-generated videos. The method includes a geometry-aware transformer with geometric positional encoding, temporal-geometric attention, and an EMA-based geometric classifier head. Experiments show that Grab-3D outperforms existing methods and demonstrates robust cross-domain generalization to unseen generators.

研究旨在通过利用3D几何时间一致性来检测AI生成的视频。方法使用消失点和几何感知Transformer框架来识别几何一致性中的差异。实验表明，Grab-3D在检测性能上优于现有方法，并且能够在未见过的生成器上实现稳健的跨域泛化。

RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics

Authors: Enshen Zhou, Cheng Chi, Yibo Li, Jingkun An, Jiayuan Zhang, Shanyu Rong, Yi Han, Yuheng Ji, Mengzhen Liu, Pengwei Wang, Zhongyuan Wang, Lu Sheng, Shanghang Zhang

First: 2025-12-15T18:52:43+00:00 · Latest: 2025-12-15T18:52:43+00:00

Comments: Project page: https://zhoues.github.io/RoboTracer

Abstract

Spatial tracing, as a fundamental embodied interaction ability for robots, is inherently challenging as it requires multi-step metric-grounded reasoning compounded with complex spatial referring and real-world metric measurement. However, existing methods struggle with this compositional task. To this end, we propose RoboTracer, a 3D-aware VLM that first achieves both 3D spatial referring and measuring via a universal spatial encoder and a regression-supervised decoder to enhance scale awareness during supervised fine-tuning (SFT). Moreover, RoboTracer advances multi-step metric-grounded reasoning via reinforcement fine-tuning (RFT) with metric-sensitive process rewards, supervising key intermediate perceptual cues to accurately generate spatial traces. To support SFT and RFT training, we introduce TraceSpatial, a large-scale dataset of 30M QA pairs, spanning outdoor/indoor/tabletop scenes and supporting complex reasoning processes (up to 9 steps). We further present TraceSpatial-Bench, a challenging benchmark filling the gap to evaluate spatial tracing. Experimental results show that RoboTracer surpasses baselines in spatial understanding, measuring, and referring, with an average success rate of 79.1%, and also achieves SOTA performance on TraceSpatial-Bench by a large margin, exceeding Gemini-2.5-Pro by 36% accuracy. Notably, RoboTracer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (UR5, G1 humanoid) in cluttered real-world scenes.

中文标题/摘要

标题：RoboTracer：面向机器人的视觉语言模型通过推理掌握空间跟踪

空间跟踪是机器人基本的具身交互能力之一，由于需要多步度量导向的推理和复杂的空间指代以及现实世界的度量测量，因此本质上具有挑战性。然而，现有方法在处理这种组合任务时存在困难。为此，我们提出RoboTracer，这是一种3D感知的VLM，首先通过通用空间编码器和回归监督解码器实现3D空间指代和测量，增强监督微调（SFT）期间的尺度意识。此外，RoboTracer通过度量敏感的过程奖励进行强化微调（RFT），监督关键中间感知提示，以准确生成空间跟踪。为了支持SFT和RFT训练，我们引入了TraceSpatial，这是一个包含3000万QA对的大规模数据集，涵盖了户外/室内/桌面场景，并支持复杂的推理过程（多达9步）。我们还提出了TraceSpatial-Bench，这是一个具有挑战性的基准，填补了空间跟踪评估的空白。实验结果表明，RoboTracer在空间理解、测量和指代方面超越了基线，平均成功率达到了79.1%，并且在TraceSpatial-Bench上也以显著优势超越了Gemini-2.5-Pro，准确率高出36%。值得注意的是，RoboTracer可以与各种控制策略结合，执行跨不同机器人（UR5，G1人形机器人）的复杂场景中的长期动态任务。

Summary / 总结

RoboTracer is designed to address the challenges of spatial tracing in robotics by integrating 3D spatial reasoning into vision-language models. It uses a universal spatial encoder and regression-supervised decoder for 3D spatial referring and measuring, and reinforcement fine-tuning to enhance multi-step metric-grounded reasoning. The model was trained on a large dataset, TraceSpatial, and outperformed existing methods with an average success rate of 79.1% and achieved state-of-the-art performance on TraceSpatial-Bench, surpassing Gemini-2.5-Pro by 36% accuracy. RoboTracer can be integrated with various control policies for long-horizon tasks in real-world scenarios.

RoboTracer 通过将 3D 空间推理融入视觉语言模型来增强机器人的空间跟踪能力。它使用通用的空间编码器和回归监督解码器进行空间引用和测量，并采用强化微调来提高多步度量导向的推理。该模型在包含 3000 万个 QA 对的 TraceSpatial 数据集上进行训练，以支持复杂的推理任务。RoboTracer 的平均成功率达到了 79.1%，并在 TraceSpatial-Bench 基准测试中超越了 Gemini-2.5-Pro，准确率高出 36%。它可以应用于各种机器人，执行复杂环境中的长期任务。

Embedding-Based Rankings of Educational Resources based on Learning Outcome Alignment: Benchmarking, Expert Validation, and Learner Performance

Authors: Mohammadreza Molavi, Mohammad Moein, Mohammadreza Tavakoli, Abdolali Faraji, Stefan T. Mol, Gábor Kismihók

First: 2025-12-15T18:51:00+00:00 · Latest: 2025-12-15T18:51:00+00:00

Comments: Accepted for publication at the 16th International Conference on Learning Analytics & Knowledge (LAK 2026)

Abstract

As the online learning landscape evolves, the need for personalization is increasingly evident. Although educational resources are burgeoning, educators face challenges selecting materials that both align with intended learning outcomes and address diverse learner needs. Large Language Models (LLMs) are attracting growing interest for their potential to create learning resources that better support personalization, but verifying coverage of intended outcomes still requires human alignment review, which is costly and limits scalability. We propose a framework that supports the cost-effective automation of evaluating alignment between educational resources and intended learning outcomes. Using human-generated materials, we benchmarked LLM-based text-embedding models and found that the most accurate model (Voyage) achieved 79% accuracy in detecting alignment. We then applied the optimal model to LLM-generated resources and, via expert evaluation, confirmed that it reliably assessed correspondence to intended outcomes (83% accuracy). Finally, in a three-group experiment with 360 learners, higher alignment scores were positively related to greater learning performance, chi-squared(2, N = 360) = 15.39, p < 0.001. These findings show that embedding-based alignment scores can facilitate scalable personalization by confirming alignment with learning outcomes, which allows teachers to focus on tailoring content to diverse learner needs.
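As a rough illustration of embedding-based alignment scoring, the sketch below ranks resources by cosine similarity to a learning-outcome vector. Cosine similarity is an assumption (the abstract does not name the scoring function), the toy 3-dimensional vectors stand in for real embedding-model output, and `rank_by_alignment` is a hypothetical helper name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_by_alignment(outcome_vec, resource_vecs):
    """Return (resource_id, score) pairs, best-aligned first."""
    scored = [(rid, cosine(outcome_vec, vec)) for rid, vec in resource_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors; a real pipeline would embed the outcome text and each resource.
outcome = [1.0, 0.0, 0.0]
resources = {
    "video_a": [0.9, 0.1, 0.0],
    "quiz_b": [0.0, 1.0, 0.0],
    "text_c": [0.5, 0.5, 0.0],
}
ranking = rank_by_alignment(outcome, resources)
ranked_ids = [rid for rid, _ in ranking]
```

Thresholding the score would turn the ranking into an aligned/not-aligned decision, which is the kind of judgement the reported accuracy figures evaluate.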

中文标题/摘要

标题：基于嵌入的教育资源排名：基于学习成果对齐的基准测试、专家验证和学习者表现

随着在线学习环境的发展，个性化的需求越来越明显。尽管教育资源日益丰富，教育工作者在选择既符合预期学习成果又能满足多样化学习者需求的材料时仍面临挑战。大型语言模型（LLMs）因其潜在能力而受到越来越多的关注，可以创建更支持个性化的学习资源，但验证预期成果的覆盖范围仍然需要人工对齐审查，这既昂贵又限制了可扩展性。我们提出了一种框架，以支持以较低成本自动化评估教育资源与预期学习成果之间的对齐。使用人类生成的材料，我们对基于LLM的文本嵌入模型进行了基准测试，发现最准确的模型（Voyage）在检测对齐方面的准确率为79%。然后，我们应用了最优模型对LLM生成的资源进行了评估，并通过专家评估确认了其可靠地评估了与预期成果的对应关系（准确率为83%）。最后，在涉及360名学习者的三组实验中，更高的对齐分数与更好的学习表现呈正相关，χ²(2, N = 360) = 15.39, p < 0.001。这些发现表明，基于嵌入的对齐分数可以通过确认与学习成果的对齐来促进可扩展的个性化，从而使教师能够专注于根据多样化学习者的需求定制内容。

Summary / 总结

The study addresses the challenge of selecting educational resources that align with intended learning outcomes and meet diverse learner needs. It benchmarks LLM-based text-embedding models on human-generated materials and finds that the most accurate model (Voyage) achieves 79% accuracy in detecting alignment. When the same model was applied to LLM-generated resources, expert evaluation confirmed that it reliably assessed correspondence to intended outcomes (83% accuracy). In a three-group experiment with 360 learners, higher alignment scores were positively related to learning performance (chi-squared(2, N = 360) = 15.39, p < 0.001), indicating that embedding-based alignment scores can support scalable personalization in education.

研究旨在通过将教育资源与预期学习成果对齐来解决个性化教育的挑战。研究使用人类生成的材料对基于LLM的文本嵌入模型进行了基准测试，发现最准确的模型（Voyage）在检测对齐方面达到了79%的准确性。将此模型应用于LLM生成的资源并通过专家评估确认了83%的准确性。研究还表明，更高的对齐分数与更好的学习表现正相关，表明基于嵌入的对齐分数可以支持教育中的可扩展个性化。

World Models Can Leverage Human Videos for Dexterous Manipulation

Authors: Raktim Gautam Goswami, Amir Bar, David Fan, Tsung-Yen Yang, Gaoyue Zhou, Prashanth Krishnamurthy, Michael Rabbat, Farshad Khorrami, Yann LeCun

First: 2025-12-15T18:37:12+00:00 · Latest: 2025-12-15T18:37:12+00:00

Abstract

Dexterous manipulation is challenging because it requires understanding how subtle hand motion influences the environment through contact with objects. We introduce DexWM, a Dexterous Manipulation World Model that predicts the next latent state of the environment conditioned on past states and dexterous actions. To overcome the scarcity of dexterous manipulation datasets, DexWM is trained on over 900 hours of human and non-dexterous robot videos. To enable fine-grained dexterity, we find that predicting visual features alone is insufficient; therefore, we introduce an auxiliary hand consistency loss that enforces accurate hand configurations. DexWM outperforms prior world models conditioned on text, navigation, and full-body actions, achieving more accurate predictions of future states. DexWM also demonstrates strong zero-shot generalization to unseen manipulation skills when deployed on a Franka Panda arm equipped with an Allegro gripper, outperforming Diffusion Policy by over 50% on average in grasping, placing, and reaching tasks.

中文标题/摘要

标题：世界模型可以利用人类视频进行灵巧操作

灵巧操作具有挑战性，因为它需要理解细微的手部运动如何通过与物体接触影响环境。我们引入了DexWM，这是一种灵巧操作世界模型，能够在给定过去状态和灵巧动作的情况下预测环境的下一个潜在状态。为了克服灵巧操作数据集稀缺的问题，DexWM 是在超过900小时的人类和非灵巧机器人视频上进行训练的。为了实现精细的灵巧性，我们发现仅预测视觉特征是不够的；因此，我们引入了一个辅助的手部一致性损失，以确保准确的手部配置。DexWM 优于以文本、导航和全身动作为条件的先前世界模型，能够更准确地预测未来状态。当部署在配备Allegro夹爪的Franka Panda手臂上时，DexWM 还对未见过的操作技能展示了强大的零样本泛化能力，在抓取、放置和到达任务中平均比Diffusion Policy高出50%以上。

Summary / 总结

The paper introduces DexWM, a world model for dexterous manipulation that predicts the next latent state of the environment from past states and dexterous actions. It is trained on over 900 hours of human and non-dexterous robot videos to overcome the scarcity of dexterous manipulation datasets. An auxiliary hand consistency loss enforces accurate hand configurations. DexWM predicts future states more accurately than prior world models conditioned on text, navigation, and full-body actions, and shows strong zero-shot generalization to unseen manipulation skills, outperforming Diffusion Policy by over 50% on average in grasping, placing, and reaching tasks.

论文介绍了DexWM，这是一种用于灵巧操作的世界模型，它基于过去的状态和灵巧动作来预测环境的下一个潜在状态。该模型利用了超过900小时的人类和非灵巧机器人视频数据，以克服灵巧操作数据集稀缺的问题。引入了辅助的手一致性损失来确保手部配置的准确性。DexWM在提供更准确的预测方面优于之前的模型，并且在抓取、放置和到达任务中表现出强大的零样本泛化能力，平均性能比Diffusion Policy高出50%以上。

From Code to Field: Evaluating the Robustness of Convolutional Neural Networks for Disease Diagnosis in Mango Leaves

Authors: Gabriel Vitorino de Andrade, Saulo Roberto dos Santos, Itallo Patrick Castro Alves da Silva, Emanuel Adler Medeiros Pereira, Erick de Andrade Barboza

First: 2025-12-15T18:36:48+00:00 · Latest: 2025-12-15T18:36:48+00:00

Comments: This work was presented at the BRACIS 2025 conference in Fortaleza

Abstract

The validation and verification of artificial intelligence (AI) models through robustness assessment are essential to guarantee the reliable performance of intelligent systems facing real-world challenges, such as image corruptions including noise, blurring, and weather variations. Despite the global importance of mango (Mangifera indica L.), there is a lack of studies on the robustness of models for the diagnosis of disease in its leaves. This paper proposes a methodology to evaluate convolutional neural networks (CNNs) under adverse conditions. We adapted the MangoLeafDB dataset, generating MangoLeafDB-C with 19 types of artificial corruptions at five severity levels. We conducted a benchmark comparing five architectures: ResNet-50, ResNet-101, VGG-16, Xception, and LCNN (the latter being a lightweight architecture designed specifically for mango leaf diagnosis). The metrics include the F1 score, the corruption error (CE), and the relative mean corruption error (relative mCE). The results show that LCNN outperformed complex models on corruptions that can occur in real-world scenarios, such as Defocus Blur and Motion Blur, while also achieving the lowest mCE. Modern architectures (e.g., ResNet-101) exhibited significant performance degradation in corrupted scenarios, despite their high accuracy under ideal conditions. These findings suggest that lightweight and specialized models may be more suitable for real-world applications on edge devices, where robustness and efficiency are critical. The study highlights the need to incorporate robustness assessments in the development of intelligent systems for agriculture, particularly in regions with technological limitations.
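The corruption metrics named above can be sketched as follows, assuming the paper follows the definitions popularized by the ImageNet-C benchmark: CE sums a model's error over the five severity levels and normalizes by a baseline model's summed error, and the relative variant first subtracts each model's clean error. Whether the paper uses exactly these forms, and which model serves as baseline, is not stated in the abstract. The error rates below are toy numbers, not results from the study.

```python
def corruption_error(model_err, baseline_err):
    """CE for one corruption: summed error over severities, normalized by a baseline."""
    return sum(model_err) / sum(baseline_err)

def relative_corruption_error(model_err, model_clean, baseline_err, baseline_clean):
    """Relative CE: degradation from clean error, normalized by the baseline's degradation."""
    num = sum(e - model_clean for e in model_err)
    den = sum(e - baseline_clean for e in baseline_err)
    return num / den

def mean_over_corruptions(per_corruption):
    """mCE (or relative mCE): average the per-corruption values."""
    return sum(per_corruption.values()) / len(per_corruption)

# Toy error rates at severities 1..5 for two corruption types (illustrative only).
model = {"defocus_blur": [0.1, 0.2, 0.3, 0.4, 0.5],
         "motion_blur": [0.2, 0.2, 0.2, 0.2, 0.2]}
baseline = {"defocus_blur": [0.2, 0.4, 0.6, 0.8, 1.0],
            "motion_blur": [0.4, 0.4, 0.4, 0.4, 0.4]}

ce = {c: corruption_error(model[c], baseline[c]) for c in model}
mce = mean_over_corruptions(ce)
rel = relative_corruption_error(model["defocus_blur"], 0.05,
                                baseline["defocus_blur"], 0.1)
```

Values below 1 mean the model degrades less than the baseline, which is how a lightweight network can win on mCE despite lower clean accuracy.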

中文标题/摘要

标题：从代码到实地：评估卷积神经网络在芒果叶片疾病诊断中的鲁棒性

通过鲁棒性评估验证和验证人工智能（AI）模型对于确保智能系统在面对现实世界挑战时的可靠性能至关重要，例如图像损坏包括噪声、模糊和天气变化。尽管芒果（Mangifera indica L.）具有全球重要性，但关于其叶片疾病诊断模型鲁棒性的研究却很少。本文提出了一种在恶劣条件下评估卷积神经网络（CNNs）的方法。我们改编了MangoLeafDB数据集，生成了MangoLeafDB-C，其中包含五种严重程度级别的19种人工损坏类型。我们进行了基准测试，比较了五种架构：ResNet-50、ResNet-101、VGG-16、Xception和LCNN（后者是专门为芒果叶片诊断设计的轻量级架构）。评估指标包括F1分数、损坏误差（CE）和相对平均损坏误差（相对mCE）。结果表明，LCNN在现实世界场景中可能出现的失焦模糊、运动模糊等损坏情况下表现优于复杂模型，同时实现了最低的mCE。现代架构（例如ResNet-101）在损坏场景中表现出显著的性能下降，尽管在理想条件下其准确率很高。这些发现表明，轻量级和专门化的模型可能更适合在边缘设备中进行实际应用，因为鲁棒性和效率至关重要。该研究强调了在农业领域开发智能系统时需要纳入鲁棒性评估，特别是在技术受限的地区。

Summary / 总结

This paper evaluates the robustness of convolutional neural networks (CNNs) for diagnosing diseases in mango leaves under various image corruptions. The authors developed a new dataset, MangoLeafDB-C, with 19 types of artificial corruptions at five severity levels. They compared five CNN architectures, including ResNet-50, ResNet-101, VGG-16, Xception, and LCNN. The results showed that LCNN outperformed complex models in real-world corruptions like Defocus Blur and Motion Blur, and had the lowest mean corruption error. This suggests that lightweight and specialized models may be more suitable for real-world agricultural applications where robustness and efficiency are crucial.

该研究评估了卷积神经网络（CNNs）在各种图像失真条件下对芒果叶片疾病诊断的鲁棒性。作者将MangoLeafDB数据集改编为MangoLeafDB-C，包含五种严重程度级别的19种人工失真类型。五种CNN架构的比较结果显示，轻量级的LCNN模型在失真场景如Defocus Blur和Motion Blur等现实世界条件下表现优于更复杂的模型如ResNet-101。研究建议，在农业领域，特别是在技术受限的地区，轻量级和专门化的模型可能更适合实际应用。

Charge: A Comprehensive Novel View Synthesis Benchmark and Dataset to Bind Them All

Authors: Michal Nazarczuk, Thomas Tanay, Arthur Moreau, Zhensong Zhang, Eduardo Pérez-Pellitero

First: 2025-12-15T18:33:08+00:00 · Latest: 2025-12-15T18:33:08+00:00

Comments: Project page: https://charge-benchmark.github.io/

Abstract

This paper presents a new dataset for Novel View Synthesis, generated from a high-quality, animated film with stunning realism and intricate detail. Our dataset captures a variety of dynamic scenes, complete with detailed textures, lighting, and motion, making it ideal for training and evaluating cutting-edge 4D scene reconstruction and novel view generation models. In addition to high-fidelity RGB images, we provide multiple complementary modalities, including depth, surface normals, object segmentation and optical flow, enabling a deeper understanding of scene geometry and motion. The dataset is organised into three distinct benchmarking scenarios: a dense multi-view camera setup, a sparse camera arrangement, and monocular video sequences, enabling a wide range of experimentation and comparison across varying levels of data sparsity. With its combination of visual richness, high-quality annotations, and diverse experimental setups, this dataset offers a unique resource for pushing the boundaries of view synthesis and 3D vision.

中文标题/摘要

标题：Charge：一个综合的新型视图合成基准数据集及其应用

本文提出了一种新型视图合成的数据集，该数据集源自高质量的动画电影，具有惊人的真实感和复杂的细节。我们的数据集捕捉了各种动态场景，包括详细的纹理、照明和运动，使其成为训练和评估先进的4D场景重建和新型视图生成模型的理想选择。除了高保真RGB图像外，我们还提供了多种补充模态，包括深度、表面法线、物体分割和光流，这有助于更深入地理解场景几何和运动。数据集组织成三个不同的基准测试场景：密集多视图相机设置、稀疏相机布局和单目视频序列，这使得在不同数据稀疏程度下进行广泛实验和比较成为可能。凭借其丰富的视觉效果、高质量的注释和多样化的实验设置，该数据集为推动视图合成和3D视觉的边界提供了独特的资源。

Summary / 总结

This paper introduces a new dataset for Novel View Synthesis, derived from a high-quality animated film. The dataset includes dynamic scenes with detailed textures, lighting, and motion, and provides multiple modalities such as depth, surface normals, object segmentation, and optical flow. It is organized into three benchmarking scenarios: dense multi-view, sparse camera, and monocular video, so that 4D scene reconstruction and novel view generation models can be compared across varying levels of data sparsity. The authors present it as a resource for advancing view synthesis and 3D vision research.

该论文介绍了一个新的用于新型视图合成的数据集，来源于高质量的动画电影。数据集包含动态场景，具有详细的纹理、光照和运动，并提供了深度、法线、分割和光流等多种模态。数据集被组织成三种基准测试场景：密集多视角、稀疏相机和单目视频，允许进行广泛的实验和评估。主要发现包括该数据集支持广泛实验的能力及其在推进视图合成和三维视觉研究方面的潜力。

MindDrive: A Vision-Language-Action Model for Autonomous Driving via Online Reinforcement Learning

Authors: Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Hongwei Xie, Bing Wang, Guang Chen, Dingkang Liang, Xiang Bai

First: 2025-12-15T18:31:32+00:00 · Latest: 2025-12-15T18:31:32+00:00

Comments: 16 pages, 12 figures, 6 tables; Project Page: https://xiaomi-mlab.github.io/MindDrive/

Abstract

Current Vision-Language-Action (VLA) paradigms in autonomous driving primarily rely on Imitation Learning (IL), which introduces inherent challenges such as distribution shift and causal confusion. Online Reinforcement Learning offers a promising pathway to address these issues through trial-and-error learning. However, applying online reinforcement learning to VLA models in autonomous driving is hindered by inefficient exploration in continuous action spaces. To overcome this limitation, we propose MindDrive, a VLA framework comprising a large language model (LLM) with two distinct sets of LoRA parameters. With one set, the LLM serves as a Decision Expert for scenario reasoning and driving decision-making; with the other, it acts as an Action Expert that dynamically maps linguistic decisions into feasible trajectories. By feeding trajectory-level rewards back into the reasoning space, MindDrive enables trial-and-error learning over a finite set of discrete linguistic driving decisions, instead of operating directly in a continuous action space. This approach effectively balances optimal decision-making in complex scenarios, human-like driving behavior, and efficient exploration in online reinforcement learning. MindDrive achieves strong closed-loop performance on the challenging Bench2Drive benchmark, with a Driving Score (DS) of 78.04 and a Success Rate (SR) of 55.09%. To the best of our knowledge, this is the first work to demonstrate the effectiveness of online reinforcement learning for the VLA model in autonomous driving.

中文标题/摘要

标题：MindDrive：一种基于在线强化学习的视觉-语言-行动自主驾驶模型

当前自主驾驶中的视觉-语言-行动（VLA）范式主要依赖于模仿学习（IL），这引入了诸如分布偏移和因果混淆等固有挑战。在线强化学习通过试错学习提供了一条有希望的途径来解决这些问题。然而，将在线强化学习应用于自主驾驶中的VLA模型受到在连续动作空间中探索效率低下的阻碍。为克服这一限制，我们提出了MindDrive，这是一种包含两个不同LoRA参数集的大语言模型（LLM）的VLA框架。其中一个LLM作为决策专家进行场景推理和驾驶决策，另一个作为行动专家动态将语言决策映射为可行轨迹。通过将轨迹级奖励反馈到推理空间，MindDrive能够在有限的离散语言驾驶决策集上实现试错学习，而不是直接在连续动作空间中操作。这种方法有效地平衡了在复杂场景中的最优决策、类人驾驶行为以及在线强化学习中的高效探索。MindDrive在具有挑战性的Bench2Drive基准测试中实现了强闭环性能，驾驶得分为78.04，成功率为55.09%。据我们所知，这是首次证明在线强化学习在自主驾驶中的VLA模型的有效性。

Summary / 总结

MindDrive is a Vision-Language-Action framework that uses Online Reinforcement Learning to address the challenges of Imitation Learning in autonomous driving, such as distribution shift and causal confusion. By employing two sets of LoRA parameters in a large language model, it separates a Decision Expert for scenario reasoning from an Action Expert that maps linguistic decisions to trajectories. This approach enables efficient exploration and trial-and-error learning in a discrete linguistic space, leading to strong performance on the Bench2Drive benchmark with a Driving Score of 78.04 and a Success Rate of 55.09%. According to the authors, this is the first work to demonstrate the effectiveness of online reinforcement learning for VLA models in autonomous driving.

MindDrive 是一个视觉-语言-行动框架，采用在线强化学习来解决模仿学习在自动驾驶中的挑战。它使用了一个大型语言模型，包含两组LoRA参数，其中一组作为决策专家进行场景推理和驾驶决策，另一组作为行动专家将语言决策映射为可行的轨迹。通过将轨迹级奖励反馈到推理空间，MindDrive 使在线强化学习中的探索更加高效。在Bench2Drive基准测试中，MindDrive 达到了驾驶分数78.04和成功率55.09%。据作者所知，这是首次证明在线强化学习对自动驾驶中VLA模型的有效性。

SCR2-ST: Combine Single Cell with Spatial Transcriptomics for Efficient Active Sampling via Reinforcement Learning

Authors: Junchao Zhu, Ruining Deng, Junlin Guo, Tianyuan Yao, Chongyu Qu, Juming Xiong, Siqi Lu, Zhengyi Lu, Yanfan Zhu, Marilyn Lionts, Yuechen Yang, Yalin Zheng, Yu Wang, Shilin Zhao, Haichun Yang, Yuankai Huo

First: 2025-12-15T18:30:40+00:00 · Latest: 2025-12-15T18:30:40+00:00

Abstract

Spatial transcriptomics (ST) is an emerging technology that enables researchers to investigate the molecular relationships underlying tissue morphology. However, acquiring ST data remains prohibitively expensive, and traditional fixed-grid sampling strategies lead to redundant measurements of morphologically similar or biologically uninformative regions, thus resulting in scarce data that constrain current methods. The well-established single-cell sequencing field, however, could provide rich biological data as an effective auxiliary source to mitigate this limitation. To bridge these gaps, we introduce SCR2-ST, a unified framework that leverages single-cell prior knowledge to guide efficient data acquisition and accurate expression prediction. SCR2-ST integrates a single-cell guided reinforcement learning-based (SCRL) active sampling and a hybrid regression-retrieval prediction network SCR2Net. SCRL combines single-cell foundation model embeddings with spatial density information to construct biologically grounded reward signals, enabling selective acquisition of informative tissue regions under constrained sequencing budgets. SCR2Net then leverages the actively sampled data through a hybrid architecture combining regression-based modeling with retrieval-augmented inference, where a majority cell-type filtering mechanism suppresses noisy matches and retrieved expression profiles serve as soft labels for auxiliary supervision. We evaluated SCR2-ST on three public ST datasets, demonstrating SOTA performance in both sampling efficiency and prediction accuracy, particularly under low-budget scenarios. Code is publicly available at: https://github.com/hrlblab/SCR2ST

中文标题/摘要

标题：SCR2-ST：结合单细胞与空间转录组学以强化学习实现高效主动采样

空间转录组学（ST）是一种新兴技术，使研究人员能够研究组织形态背后的分子关系。然而，获取ST数据仍然非常昂贵，传统的固定网格采样策略会导致对形态相似或生物信息不相关的区域进行冗余测量，从而导致数据稀缺，限制了当前方法的发展。然而，成熟的单细胞测序领域可以提供丰富的生物数据，作为有效的辅助来源来缓解这一限制。为了解决这些问题，我们提出了SCR2-ST，这是一种统一框架，利用单细胞先验知识指导高效的数据采集和准确的表达预测。SCR2-ST 结合了单细胞引导的基于强化学习的（SCRL）主动采样和一种混合回归检索预测网络 SCR2Net。SCRL 将单细胞基础模型嵌入与空间密度信息结合，构建生物学依据的奖励信号，使在受限测序预算下能够选择性地采集信息丰富的组织区域。然后，SCR2Net 通过结合基于回归的建模与检索增强的推理的混合架构利用主动采样数据，其中主要细胞类型过滤机制抑制噪声匹配，检索到的表达谱作为辅助监督的软标签。我们在三个公开的 ST 数据集上评估了 SCR2-ST，展示了在采样效率和预测准确性方面的 SOTA 性能，特别是在低预算场景下。代码已公开：https://github.com/hrlblab/SCR2ST

Summary / 总结

The research aims to improve the efficiency and accuracy of spatial transcriptomics (ST) data acquisition by integrating single-cell sequencing knowledge. The method employs a unified framework, SCR2-ST, which includes a single-cell guided reinforcement learning-based active sampling strategy and a hybrid regression-retrieval prediction network, SCR2Net. The key findings show that SCR2-ST outperforms existing methods in both sampling efficiency and prediction accuracy, especially in low-budget scenarios.

SCR2-ST 是一个结合单细胞测序和空间转录组学的统一框架，旨在提高数据采集效率和预测准确性。它使用基于单细胞引导强化学习的主动采样方法和混合回归-检索预测网络 SCR2Net，来选择信息丰富的组织区域并预测基因表达。在三个公开的 ST 数据集上的实验表明，SCR2-ST 在采样效率和预测准确性方面优于现有方法，在低预算条件下尤为明显。

StutterFuse: Mitigating Modality Collapse in Stuttering Detection with Jaccard-Weighted Metric Learning and Gated Fusion

Authors: Guransh Singh, Md Shah Fahad

First: 2025-12-15T18:28:39+00:00 · Latest: 2025-12-15T18:28:39+00:00

Comments: 13 pages, 10 figures

Abstract

Stuttering detection breaks down when disfluencies overlap. Existing parametric models struggle to distinguish complex, simultaneous disfluencies (e.g., a 'block' with a 'prolongation') due to the scarcity of these specific combinations in training data. While Retrieval-Augmented Generation (RAG) has revolutionized NLP by grounding models in external knowledge, this paradigm remains unexplored in pathological speech processing. To bridge this gap, we introduce StutterFuse, the first Retrieval-Augmented Classifier (RAC) for multi-label stuttering detection. By conditioning a Conformer encoder on a non-parametric memory bank of clinical examples, we allow the model to classify by reference rather than memorization. We further identify and solve "Modality Collapse", an "Echo Chamber" effect where naive retrieval boosts recall but degrades precision. We mitigate this using: (1) SetCon, a Jaccard-Weighted Metric Learning objective that optimizes for multi-label set similarity, and (2) a Gated Mixture-of-Experts fusion strategy that dynamically arbitrates between acoustic evidence and retrieved context. On the SEP-28k dataset, StutterFuse achieves a weighted F1-score of 0.65, outperforming strong baselines and demonstrating remarkable zero-shot cross-lingual generalization.
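The set similarity that a Jaccard-weighted objective builds on can be sketched in a few lines. This is not the SetCon loss itself, whose exact form the abstract does not give: the sketch only computes Jaccard overlap between multi-label sets and uses it as a pair weight. `pair_weights`, the clip identifiers, and the convention for two empty label sets are assumptions made for illustration.

```python
def jaccard(labels_a, labels_b):
    """Jaccard similarity |A & B| / |A | B| between two label sets."""
    a, b = set(labels_a), set(labels_b)
    if not a and not b:
        # Assumption: two clips with no disfluency labels count as identical.
        return 1.0
    return len(a & b) / len(a | b)

def pair_weights(anchor_labels, candidates):
    """Weight each candidate clip by its label-set overlap with the anchor."""
    return {cid: jaccard(anchor_labels, labels) for cid, labels in candidates.items()}

anchor = {"block", "prolongation"}
memory_bank = {
    "clip_1": {"block"},
    "clip_2": {"block", "prolongation"},
    "clip_3": {"interjection"},
}
weights = pair_weights(anchor, memory_bank)
```

A graded weight of this kind lets a partially matching clip (one shared label out of two) count as a weaker positive instead of being treated as either a full match or a full mismatch.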

中文标题/摘要

标题：StutterFuse：通过Jaccard加权度量学习和门控融合减轻口吃检测中的模态崩溃

当不流畅现象重叠时，口吃检测会失效。现有的参数模型难以区分复杂的、同时发生的不流畅现象（例如，“阻塞”伴随“延长”），因为训练数据中这些特定组合很少见。虽然检索增强生成（RAG）已通过将模型与外部知识联系起来而彻底改变了NLP，但这一范式在病理语音处理中尚未被探索。为弥合这一差距，我们引入了StutterFuse，这是第一个用于多标签口吃检测的检索增强分类器（RAC）。通过让Conformer编码器以临床示例的非参数记忆库为条件，我们允许模型通过参考而非记忆来进行分类。我们进一步识别并解决了“模态崩溃”这一问题，即一种“回声室”效应，其中简单的检索提高了召回率但降低了精确率。我们通过以下方式减轻了这一问题：(1) SetCon，一种Jaccard加权度量学习目标，优化多标签集合相似性；(2) 门控混合专家融合策略，在声学证据与检索上下文之间动态仲裁。在SEP-28k数据集上，StutterFuse实现了加权F1分数0.65，超越了强大的基线模型，并展示了显著的零样本跨语言泛化能力。

Summary / 总结

Stuttering detection is challenging when disfluencies overlap, and existing models struggle with complex simultaneous disfluencies due to data scarcity. To address this, StutterFuse introduces a Retrieval-Augmented Classifier (RAC) that uses a Conformer encoder conditioned on a non-parametric memory bank of clinical examples. This approach mitigates 'Modality Collapse' by employing SetCon for optimizing multi-label set similarity and a Gated Mixture-of-Experts fusion strategy. On the SEP-28k dataset, StutterFuse achieves a weighted F1-score of 0.65, surpassing strong baselines and showing zero-shot cross-lingual generalization.

当不流畅现象重叠时，口吃检测变得困难。StutterFuse 引入了一种检索增强分类器（RAC），让Conformer编码器以临床示例的非参数记忆库为条件，从而解决这一问题。它通过Jaccard加权度量学习目标和门控混合专家融合策略来缓解“模态崩溃”。在SEP-28k数据集上，StutterFuse的加权F1分数达到0.65，超越了强大的基线模型，并展示了零样本跨语言泛化能力。

BlurDM: A Blur Diffusion Model for Image Deblurring

Authors: Jin-Ting He, Fu-Jen Tsai, Yan-Tsung Peng, Min-Hung Chen, Chia-Wen Lin, Yen-Yu Lin

Venue: NeurIPS 2025

First: 2025-12-03T17:10:44+00:00 · Latest: 2025-12-15T18:15:43+00:00

Comments: NeurIPS 2025. Project Page: https://jin-ting-he.github.io/Blur-Diffusion-Model/

Abstract

Diffusion models show promise for dynamic scene deblurring; however, existing studies often fail to leverage the intrinsic nature of the blurring process within diffusion models, limiting their full potential. To address this, we present a Blur Diffusion Model (BlurDM), which seamlessly integrates the blur formation process into diffusion for image deblurring. Observing that motion blur stems from continuous exposure, BlurDM implicitly models the blur formation process through a dual-diffusion forward scheme, diffusing both noise and blur onto a sharp image. During the reverse generation process, we derive a dual denoising and deblurring formulation, enabling BlurDM to recover the sharp image by simultaneously denoising and deblurring, given pure Gaussian noise conditioned on the blurred image as input. Additionally, to efficiently integrate BlurDM into deblurring networks, we perform BlurDM in the latent space, forming a flexible prior generation network for deblurring. Extensive experiments demonstrate that BlurDM significantly and consistently enhances existing deblurring methods on four benchmark datasets. The project page is available at https://jin-ting-he.github.io/Blur-Diffusion-Model/.

中文标题/摘要

标题：BlurDM：一种用于图像去模糊的模糊扩散模型

扩散模型在动态场景去模糊方面显示出潜力；然而，现有研究往往未能充分利用扩散模型中的模糊过程本质，限制了其全部潜力。为了解决这一问题，我们提出了一种模糊扩散模型（BlurDM），该模型将模糊形成过程无缝地整合到扩散中以进行图像去模糊。我们观察到运动模糊源自连续曝光，BlurDM 通过双重扩散前向方案隐式地建模模糊形成过程，将噪声和模糊扩散到锐利图像上。在反向生成过程中，我们推导出双重去噪和去模糊公式，使BlurDM能够在给定受模糊图像条件的纯高斯噪声输入的情况下同时去噪和去模糊，恢复锐利图像。此外，为了高效地将BlurDM集成到去模糊网络中，我们在潜在空间中执行BlurDM，形成一个灵活的先验生成网络以进行去模糊。广泛的实验表明，BlurDM在四个基准数据集上显著且一致地增强了现有的去模糊方法。项目页面可在https://jin-ting-he.github.io/Blur-Diffusion-Model/获取。

Summary / 总结

BlurDM is a blur diffusion model designed for image deblurring. It integrates the blur formation process into the diffusion model to better handle motion blur. By diffusing both noise and blur onto a sharp image in the forward process, BlurDM can simultaneously denoise and deblur in the reverse process. It runs in the latent space and acts as a prior generation network that plugs into deblurring networks. Experiments show that BlurDM significantly and consistently improves existing deblurring methods on four benchmark datasets.

BlurDM 是一种用于图像去模糊的新扩散模型，它将模糊形成过程整合到扩散框架中。通过使用双重扩散前向方案，BlurDM 将噪声和模糊扩散到锐利图像上，在反向生成过程中，它同时去噪和去模糊。实验表明，BlurDM 在四个基准数据集上显著且一致地提升了现有去模糊方法的效果。

OUGS: Active View Selection via Object-aware Uncertainty Estimation in 3DGS

Authors: Haiyi Li, Qi Chen, Denis Kalkofen, Hsiang-Ting Chen

First: 2025-11-12T15:08:46+00:00 · Latest: 2025-12-15T18:12:41+00:00

Comments: Conditionally accepted to Eurographics 2026 (five reviewers)

Abstract

Recent advances in 3D Gaussian Splatting (3DGS) have achieved state-of-the-art results for novel view synthesis. However, efficiently capturing high-fidelity reconstructions of specific objects within complex scenes remains a significant challenge. A key limitation of existing active reconstruction methods is their reliance on scene-level uncertainty metrics, which are often biased by irrelevant background clutter and lead to inefficient view selection for object-centric tasks. We present OUGS, a novel framework that addresses this challenge with a more principled, physically-grounded uncertainty formulation for 3DGS. Our core innovation is to derive uncertainty directly from the explicit physical parameters of the 3D Gaussian primitives (e.g., position, scale, rotation). By propagating the covariance of these parameters through the rendering Jacobian, we establish a highly interpretable uncertainty model. This foundation allows us to then seamlessly integrate semantic segmentation masks to produce a targeted, object-aware uncertainty score that effectively disentangles the object from its environment. This allows for a more effective active view selection strategy that prioritizes views critical to improving object fidelity. Experimental evaluations on public datasets demonstrate that our approach significantly improves the efficiency of the 3DGS reconstruction process and achieves higher quality for targeted objects compared to existing state-of-the-art methods, while also serving as a robust uncertainty estimator for the global scene.
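The covariance-propagation step can be illustrated with first-order error propagation, where an output covariance is obtained as J Σ Jᵀ from a parameter covariance Σ and a Jacobian J. The sketch below is a generic instance of that rule, not OUGS's implementation: the tiny 2×2 Jacobian, the diagonal parameter covariance, and the mask-weighted sum used as an object score are all illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def propagate_covariance(jacobian, param_cov):
    """First-order propagation: Sigma_out = J @ Sigma_params @ J^T."""
    return matmul(matmul(jacobian, param_cov), transpose(jacobian))

def object_uncertainty(pixel_var, mask):
    """Hypothetical object score: sum per-pixel variance where mask is 1."""
    return sum(v * m for v, m in zip(pixel_var, mask))

# Two rendered pixels depending on two Gaussian parameters (toy values).
J = [[1.0, 2.0],
     [0.0, 1.0]]
sigma_params = [[0.25, 0.0],
                [0.0, 0.5]]
sigma_pixels = propagate_covariance(J, sigma_params)
pixel_var = [sigma_pixels[i][i] for i in range(len(sigma_pixels))]

# Only the first pixel belongs to the object of interest.
score = object_uncertainty(pixel_var, [1, 0])
```

Masking before aggregation is what keeps background pixels from dominating the score, which is the bias in scene-level metrics that the abstract describes.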

中文标题/摘要

标题：OUGS：基于对象感知不确定性估计的3DGS主动视图选择

近年来，3D高斯点绘制（3DGS）技术在新颖视图合成方面取得了最先进的成果。然而，高效地捕捉复杂场景中特定对象的高保真重建仍然是一个重大挑战。现有主动重建方法的主要局限在于依赖场景级别的不确定性度量，这些度量往往受到无关背景杂乱的影响，导致对象中心任务的视图选择效率低下。我们提出了一种名为OUGS的新框架，通过更符合物理原理的不确定性公式解决了这一挑战。我们的核心创新是从3D高斯原语的显式物理参数（例如，位置、尺度、旋转）中直接推导不确定性。通过将这些参数的协方差通过渲染雅可比传播，我们建立了一个高度可解释的不确定性模型。这一基础使我们能够无缝集成语义分割掩码，生成一个针对对象的不确定性评分，有效地将对象与其环境分离。这使得能够采用更有效的主动视图选择策略，优先选择对提高对象保真度至关重要的视图。在公共数据集上的实验评估表明，我们的方法显著提高了3DGS重建过程的效率，并在目标对象的质量方面优于现有最先进的方法，同时作为全局场景的稳健不确定性估计器。

Summary / 总结

OUGS targets efficient, high-fidelity 3DGS reconstruction of specific objects in complex scenes, where the scene-level uncertainty metrics used by existing active reconstruction methods are biased by background clutter. It derives uncertainty from the physical parameters of the 3D Gaussian primitives (position, scale, rotation), propagates their covariance through the rendering Jacobian, and combines the result with semantic segmentation masks to score candidate views by object-aware uncertainty. Experiments on public datasets show improved reconstruction efficiency and object quality over existing methods, and the same formulation serves as a robust uncertainty estimator for the whole scene.

研究旨在通过解决现有主动重建方法的局限性，提高复杂场景中新颖视图合成的效率和质量。OUGS 使用一种新颖框架，直接从 3D 高斯基元的物理参数中推导不确定性，并结合语义分割掩码生成目标感知的不确定性评分。这导致了更有效的主动视图选择策略，相比现有方法，实现了特定对象的更高质量重建，并在公共数据集上展示了显著的效率和质量改进。
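
The covariance propagation described in the OUGS abstract can be illustrated with a first-order sketch. For one pixel with Jacobian row J (the derivative of the pixel value with respect to the Gaussian parameters) and parameter covariance Σ, the propagated variance is J Σ Jᵀ. The function names, the toy numbers, and the choice to sum variances over masked pixels are assumptions made for illustration; the abstract does not give the paper's exact scoring rule.

```python
def pixel_variance(jacobian_row, param_cov):
    """First-order variance of one pixel: J Σ Jᵀ for a single-row Jacobian."""
    n = len(jacobian_row)
    # tmp = Σ Jᵀ
    tmp = [sum(param_cov[i][k] * jacobian_row[k] for k in range(n)) for i in range(n)]
    return sum(jacobian_row[i] * tmp[i] for i in range(n))


def object_aware_score(jacobian_rows, param_cov, mask):
    """Sum per-pixel variances over the pixels the mask marks as object (1)."""
    return sum(pixel_variance(j, param_cov) for j, m in zip(jacobian_rows, mask) if m)


# Toy example (hypothetical values): two parameters, three pixels.
param_cov = [[0.04, 0.0], [0.0, 0.01]]
jacobians = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
object_mask = [1, 0, 1]  # pixel 1 is background

scene_score = object_aware_score(jacobians, param_cov, [1, 1, 1])
object_score = object_aware_score(jacobians, param_cov, object_mask)
```

In this toy case the background pixel contributes to the scene-level score but not to the object-aware one, which is the bias the abstract attributes to scene-level metrics.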

LightTopoGAT: Enhancing Graph Attention Networks with Topological Features for Efficient Graph Classification

Authors: Ankit Sharma, Sayan Roy Gupta

First: 2025-12-15T18:09:59+00:00 · Latest: 2025-12-15T18:09:59+00:00

Comments: 9 pages

Abstract

Graph Neural Networks have demonstrated significant success in graph classification tasks, yet they often require substantial computational resources and struggle to capture global graph properties effectively. We introduce LightTopoGAT, a lightweight graph attention network that enhances node features through topological augmentation by incorporating node degree and local clustering coefficient to improve graph representation learning. The proposed approach maintains parameter efficiency through streamlined attention mechanisms while integrating structural information that is typically overlooked by local message passing schemes. Through comprehensive experiments on three benchmark datasets, MUTAG, ENZYMES, and PROTEINS, we show that LightTopoGAT achieves superior performance compared to established baselines including GCN, GraphSAGE, and standard GAT, with a 6.6 percent improvement in accuracy on MUTAG and a 2.2 percent improvement on PROTEINS. Ablation studies further confirm that these performance gains arise directly from the inclusion of topological features, demonstrating a simple yet effective strategy for enhancing graph neural network performance without increasing architectural complexity.

中文标题/摘要

标题：LightTopoGAT：通过引入拓扑特征增强图注意力网络以实现高效的图分类

图神经网络在图分类任务中已经取得了显著的成功，但它们通常需要大量的计算资源，并且难以有效地捕捉全局图属性。我们提出了LightTopoGAT，这是一种轻量级的图注意力网络，通过结合节点度和局部聚类系数来增强节点特征，从而改进图表示学习。所提出的方法通过精简的注意力机制保持了参数效率，同时整合了通常被局部消息传递方案忽略的结构信息。通过在三个基准数据集MUTAG、ENZYMES和PROTEINS上的全面实验，我们展示了LightTopoGAT在准确率方面优于包括GCN、GraphSAGE和标准GAT在内的现有基线，MUTAG上的准确率提高了6.6%，PROTEINS上的准确率提高了2.2%。消融研究进一步证实了这些性能提升直接来自于拓扑特征的引入，表明了一种简单而有效的策略，可以在不增加架构复杂性的情况下增强图神经网络的性能。

Summary / 总结

LightTopoGAT is a lightweight graph attention network that enhances node features using topological augmentation, specifically incorporating node degree and local clustering coefficient to improve graph representation learning. It achieves superior performance on three benchmark datasets (MUTAG, ENZYMES, and PROTEINS) compared to established baselines like GCN, GraphSAGE, and standard GAT, with accuracy improvements of 6.6% on MUTAG and 2.2% on PROTEINS. Ablation studies confirm that these gains are due to the inclusion of topological features without increasing architectural complexity.

LightTopoGAT 是一种轻量级的图注意力网络，通过引入拓扑增强来提升节点特征，结合节点度和局部聚类系数以改进图表示学习。该方法在三个基准数据集 MUTAG、ENZYMES 和 PROTEINS 上表现出优越性能，MUTAG 上的准确率提高了 6.6 个百分点，PROTEINS 上提高了 2.2 个百分点，优于 GCN、GraphSAGE 和标准 GAT 等现有基线。消融研究进一步证实这些性能提升直接来源于拓扑特征的引入，而无需增加架构复杂性。
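
The two topological features named in the LightTopoGAT abstract, node degree and local clustering coefficient, can be computed directly from an adjacency structure. The sketch below assumes an undirected graph without self-loops stored as a dict of neighbour sets, and appends the two values to each node's feature vector; the helper names and the absence of any normalisation are assumptions, not details from the paper.

```python
def topo_features(adj):
    """Return {node: (degree, local clustering coefficient)}.

    adj maps each node to the set of its neighbours (undirected, no self-loops).
    """
    feats = {}
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            cc = 0.0  # fewer than two neighbours: no neighbour pair can form a triangle
        else:
            # Count edges that connect two neighbours of v.
            links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
            cc = 2.0 * links / (k * (k - 1))
        feats[v] = (float(k), cc)
    return feats


def augment(node_features, adj):
    """Append degree and clustering coefficient to each node's feature list."""
    feats = topo_features(adj)
    return {v: list(x) + list(feats[v]) for v, x in node_features.items()}


# Toy graph: a triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
x = {0: [1.0], 1: [0.0], 2: [0.5], 3: [0.2]}
x_aug = augment(x, adj)
```

The augmented vectors would then be fed to the attention layers in place of the raw node features.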

Do-Undo: Generating and Reversing Physical Actions in Vision-Language Models

Authors: Shweta Mahajan, Shreya Kadambi, Hoang Le, Munawar Hayat, Fatih Porikli

First: 2025-12-15T18:03:42+00:00 · Latest: 2025-12-15T18:03:42+00:00

Abstract

We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating physically plausible scene transformations driven by real-world actions. Unlike prior work focused on object-level edits, Do-Undo requires models to simulate the outcome of a physical action and then accurately reverse it, reflecting true cause-and-effect in the visual world. We curate a large-scale dataset of reversible actions from real-world videos and design a training strategy enforcing consistency for robust action grounding. Our experiments reveal that current models struggle with physical reversibility, underscoring the importance of this task for embodied AI, robotics, and physics-aware generative modeling. Do-Undo establishes an intuitive testbed for evaluating and advancing physical reasoning in multimodal systems.

中文标题/摘要

标题：执行与撤销：在视觉语言模型中生成和逆转物理动作

我们引入了执行与撤销任务及基准测试，以解决视觉语言模型中的关键问题：理解并生成由真实世界动作驱动的物理上合理的场景变换。与以往专注于对象级编辑的工作不同，执行与撤销要求模型模拟物理动作的结果，然后准确地逆转它，反映视觉世界中的真正因果关系。我们从真实世界的视频中收集了一个大规模的可逆动作数据集，并设计了一种训练策略，以确保动作的稳健定位。我们的实验表明，当前模型在物理可逆性方面存在困难，突显了该任务对于具身人工智能、机器人技术和物理感知生成建模的重要性。执行与撤销为评估和推进多模态系统中的物理推理提供了一个直观的测试平台。

Summary / 总结

The Do-Undo task and benchmark were introduced to address the gap in vision-language models for understanding and generating physically plausible scene transformations. Unlike previous work focusing on object-level edits, this task requires models to simulate and reverse physical actions, reflecting true cause-and-effect. The authors curated a large dataset of reversible actions from real-world videos and developed a training strategy to enforce consistency. Experiments showed that current models have difficulty with physical reversibility, highlighting the importance of this task for embodied AI and physics-aware generative modeling. Do-Undo provides a testbed for evaluating and advancing physical reasoning in multimodal systems.

提出了Do-Undo任务和基准，旨在解决视觉-语言模型在理解和生成物理上合理的场景变换方面的不足。与之前专注于对象级编辑的工作不同，Do-Undo要求模型模拟并逆转物理动作，反映真实的因果关系。作者创建了一个包含真实视频中可逆动作的大规模数据集，并开发了一种一致性训练策略。实验表明，当前模型在物理可逆性方面存在困难，突显了该任务对于具身AI、机器人技术和物理感知生成建模的重要性。

Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models

Authors: Boxin Wang, Chankyu Lee, Nayeon Lee, Sheng-Chieh Lin, Wenliang Dai, Yang Chen, Yangyi Chen, Zhuolin Yang, Zihan Liu, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping

First: 2025-12-15T18:02:35+00:00 · Latest: 2025-12-15T18:02:35+00:00

Comments: We publicly release the Nemotron-Cascade models and the full collection of training data at: https://huggingface.co/collections/nvidia/nemotron-cascade

Abstract

Building general-purpose reasoning models with reinforcement learning (RL) entails substantial cross-domain heterogeneity, including large variation in inference-time response lengths and verification latency. Such variability complicates the RL infrastructure, slows training, and makes training curriculum (e.g., response length extension) and hyperparameter selection challenging. In this work, we propose cascaded domain-wise reinforcement learning (Cascade RL) to develop general-purpose reasoning models, Nemotron-Cascade, capable of operating in both instruct and deep thinking modes. Departing from conventional approaches that blend heterogeneous prompts from different domains, Cascade RL orchestrates sequential, domain-wise RL, reducing engineering complexity and delivering state-of-the-art performance across a wide range of benchmarks. Notably, RLHF for alignment, when used as a pre-step, boosts the model's reasoning ability far beyond mere preference optimization, and subsequent domain-wise RLVR stages rarely degrade the benchmark performance attained in earlier domains and may even improve it (see an illustration in Figure 1). Our 14B model, after RL, outperforms its SFT teacher, DeepSeek-R1-0528, on LiveCodeBench v5/v6/Pro and achieves silver-medal performance in the 2025 International Olympiad in Informatics (IOI). We transparently share our training and data recipes.

中文标题/摘要

标题：Nemotron-Cascade：扩展级联强化学习以适应通用推理模型

使用强化学习（RL）构建通用推理模型涉及跨域异质性，包括推理时响应长度和验证延迟的大幅变化。这种变化性使RL基础设施复杂化，减慢了训练速度，并使训练课程（例如，响应长度扩展）和超参数选择变得具有挑战性。在本工作中，我们提出了一种级联领域特定强化学习（Cascade RL）方法，以开发能够在指令模式和深度思考模式下运行的通用推理模型Nemotron-Cascade。与传统的将不同领域异质提示混合的方法不同，Cascade RL协调顺序的、领域特定的RL，减少了工程复杂性，并在一系列基准测试中实现了最先进的性能。值得注意的是，当使用RLHF进行对齐作为预步骤时，它显著提升了模型的推理能力，远超简单的偏好优化，而后续的领域特定RLVR阶段通常不会损害早期领域中获得的基准性能，甚至可能还会提高它（见图1中的示例）。经过RL训练的14B模型在LiveCodeBench v5/v6/Pro上优于其SFT教师DeepSeek-R1-0528，并在2025年国际信息学奥林匹克竞赛（IOI）中获得银牌。我们透明地分享了我们的训练和数据配方。

IA2: Alignment with ICL Activations Improves Supervised Fine-Tuning

Authors: Aayush Mishra, Daniel Khashabi, Anqi Liu

First: 2025-09-26T17:46:32+00:00 · Latest: 2025-12-15T18:01:10+00:00

Abstract

Supervised Fine-Tuning (SFT) is used to specialize model behavior by training weights to produce intended target responses for queries. In contrast, In-Context Learning (ICL) adapts models during inference with instructions or demonstrations in the prompt. ICL can offer better generalizability and more calibrated responses compared to SFT in data scarce settings, at the cost of more inference compute. In this work, we ask the question: Can ICL's internal computations be used to improve the qualities of SFT? We first show that ICL and SFT produce distinct activation patterns, indicating that the two methods achieve adaptation through different functional mechanisms. Motivated by this observation and to use ICL's rich functionality, we introduce ICL Activation Alignment (IA2), a self-distillation technique which aims to replicate ICL's activation patterns in SFT models and incentivizes ICL-like internal reasoning. Performing IA2 as a priming step before SFT significantly improves the accuracy and calibration of model outputs, as shown by our extensive empirical results on 12 popular benchmarks and two model families. This finding is not only practically useful, but also offers a conceptual window into the inner mechanics of model adaptation.

中文标题/摘要

标题：IA2：ICL激活的一致性改进监督微调

监督微调（SFT）用于通过训练权重以产生查询的预期目标响应来使模型行为专业化。相比之下，上下文学习（ICL）在推理过程中通过提示中的指令或示范来适应模型。在数据稀缺的情况下，ICL相比SFT可以提供更好的泛化能力和更校准的响应，但代价是更多的推理计算。在本研究中，我们提出的问题是：ICL的内部计算能否用于改进SFT的质量？我们首先展示了ICL和SFT产生不同的激活模式，表明这两种方法通过不同的功能机制实现适应。受此观察的启发，为了利用ICL丰富的功能，我们引入了ICL激活一致性（IA2），这是一种自我蒸馏技术，旨在在SFT模型中复制ICL的激活模式，并激励类似ICL的内部推理。在SFT之前作为预热步骤执行IA2，显著提高了模型输出的准确性和校准性，如我们在12个流行基准和两个模型家族上的广泛实验证明。这一发现不仅在实践中具有实用性，还为模型适应的内部机制提供了一个概念性的窗口。

Summary / 总结

This work explores whether In-Context Learning (ICL) can improve Supervised Fine-Tuning (SFT) by aligning SFT models with ICL's activation patterns. The authors introduce ICL Activation Alignment (IA2), a self-distillation technique that replicates ICL's internal computations in SFT models. Extensive experiments on 12 benchmarks show that IA2 significantly enhances the accuracy and calibration of SFT models, demonstrating the potential of using ICL's rich functionality to improve SFT.

这项工作探讨了是否可以通过使监督微调（SFT）模型与In-Context Learning（ICL）的内部激活模式对齐，来增强SFT。作者引入了ICL激活对齐（IA2），这是一种自蒸馏技术，旨在使SFT模型复制ICL的激活模式，并促进类似ICL的内部推理。在12个基准测试上的广泛实验表明，IA2可以提高模型输出的准确性和校准性，表明ICL的内部推理可以对SFT产生积极影响。
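
One way to picture the activation alignment described in the IA2 abstract is as a distance between two sets of activations: those of the model being tuned, run on the query alone, and those of the same base model run with demonstrations in the prompt. The sketch below uses mean squared error; that choice, the function name, and the toy values are assumptions, since the abstract does not state the exact objective.

```python
def activation_alignment_loss(sft_acts, icl_acts):
    """Mean squared error between matching activation vectors.

    sft_acts: activations of the model being tuned, run on the query alone.
    icl_acts: activations of the base model run with in-context demonstrations,
              taken at the same layers and token positions.
    """
    total, count = 0.0, 0
    for a_vec, b_vec in zip(sft_acts, icl_acts):
        for a, b in zip(a_vec, b_vec):
            total += (a - b) ** 2
            count += 1
    return total / count


# Hypothetical activations for two positions with two hidden units each.
icl = [[0.5, -1.0], [2.0, 0.0]]
sft = [[0.0, -1.0], [1.0, 0.0]]
loss = activation_alignment_loss(sft, icl)
```

Minimising such a loss before ordinary SFT corresponds to the "priming step" ordering the abstract describes.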

LongVie 2: Multimodal Controllable Ultra-Long Video World Model

Authors: Jianxiong Gao, Zhaoxi Chen, Xian Liu, Junhao Zhuang, Chengming Xu, Jianfeng Feng, Yu Qiao, Yanwei Fu, Chenyang Si, Ziwei Liu

First: 2025-12-15T17:59:58+00:00 · Latest: 2025-12-15T17:59:58+00:00

Comments: Project Page: https://vchitect.github.io/LongVie2-project/

Abstract

Building video world models upon pretrained video generation systems represents an important yet challenging step toward general spatiotemporal intelligence. A world model should possess three essential properties: controllability, long-term visual quality, and temporal consistency. To this end, we take a progressive approach: first enhancing controllability, then extending toward long-term, high-quality generation. We present LongVie 2, an end-to-end autoregressive framework trained in three stages: (1) Multi-modal guidance, which integrates dense and sparse control signals to provide implicit world-level supervision and improve controllability; (2) Degradation-aware training on the input frame, bridging the gap between training and long-term inference to maintain high visual quality; and (3) History-context guidance, which aligns contextual information across adjacent clips to ensure temporal consistency. We further introduce LongVGenBench, a comprehensive benchmark comprising 100 high-resolution one-minute videos covering diverse real-world and synthetic environments. Extensive experiments demonstrate that LongVie 2 achieves state-of-the-art performance in long-range controllability, temporal coherence, and visual fidelity, and supports continuous video generation lasting up to five minutes, marking a significant step toward unified video world modeling.

中文标题/摘要

标题：LongVie 2：多模态可控超长视频世界模型

基于预训练视频生成系统的视频世界模型构建是实现通用时空智能的重要但具有挑战性的一步。一个世界模型应具备三个基本特性：可控性、长期视觉质量和时间一致性。为此，我们采取渐进的方法——首先增强可控性，然后扩展到长期、高质量的生成。我们提出了LongVie 2，这是一种端到端的自回归框架，经过三个阶段的训练：（1）多模态引导，将密集和稀疏控制信号结合以提供隐式的全局监督并提高可控性；（2）输入帧的降级感知训练，缩小训练与长期推理之间的差距，保持高质量视觉效果；（3）历史上下文引导，将相邻片段中的上下文信息对齐以确保时间一致性。我们还引入了LongVGenBench，这是一个包含100个高分辨率一分钟视频的基准，涵盖了多种真实世界和合成环境。广泛的实验表明，LongVie 2 在长距离可控性、时间连贯性和视觉保真度方面达到了最先进的性能，并支持长达五分钟的连续视频生成，标志着统一视频世界建模的重要一步。

Summary / 总结

The research aims to develop a world model for video generation that is controllable, maintains high visual quality over long periods, and ensures temporal consistency. LongVie 2 is an end-to-end autoregressive framework trained in three stages: multi-modal guidance for controllability, degradation-aware training for visual quality, and history-context guidance for temporal consistency. Experiments show that LongVie 2 outperforms existing methods in long-range controllability, temporal coherence, and visual fidelity, supporting continuous video generation up to five minutes.

LongVie 2 是一个端到端的自回归框架，旨在增强视频生成中的可控性、长期视觉质量和时间一致性。它通过三个阶段进行训练：多模态引导以提供隐式的世界级监督，降解感知训练以保持高视觉质量，以及历史上下文引导以确保时间一致性。LongVie 2 在长距离可控性、时间连贯性和视觉保真度方面优于现有方法，并支持长达五分钟的连续视频生成。该框架使用包含 100 个高分辨率一分钟视频的 LongVGenBench 进行评估。

Lighting in Motion: Spatiotemporal HDR Lighting Estimation

Authors: Christophe Bolduc, Julien Philip, Li Ma, Mingming He, Paul Debevec, Jean-François Lalonde

First: 2025-12-15T17:49:22+00:00 · Latest: 2025-12-15T17:49:22+00:00

Abstract

We present Lighting in Motion (LiMo), a diffusion-based approach to spatiotemporal lighting estimation. LiMo targets both realistic high-frequency detail prediction and accurate illuminance estimation. To account for both, we propose generating a set of mirrored and diffuse spheres at different exposures, based on their 3D positions in the input. Making use of diffusion priors, we fine-tune powerful existing diffusion models on a large-scale customized dataset of indoor and outdoor scenes, paired with spatiotemporal light probes. For accurate spatial conditioning, we demonstrate that depth alone is insufficient and we introduce a new geometric condition to provide the relative position of the scene to the target 3D position. Finally, we combine diffuse and mirror predictions at different exposures into a single HDRI map leveraging differentiable rendering. We thoroughly evaluate our method and design choices to establish LiMo as state-of-the-art for both spatial control and prediction accuracy.

中文标题/摘要

标题：运动中的照明：时空高动态范围照明估计

我们提出了运动中的照明（LiMo），这是一种基于扩散的方法，用于时空照明估计。LiMo旨在预测真实的高频细节并准确估计照度。为了兼顾这两点，我们提出根据输入中的3D位置生成不同曝光的镜面和漫反射球体。利用扩散先验，我们在大规模定制的室内和室外场景数据集上对强大的现有扩散模型进行微调，该数据集与时空光探针配对。为了准确的空间条件，我们证明仅使用深度是不够的，并引入了一种新的几何条件来提供场景相对于目标3D位置的相对位置。最后，我们利用可微渲染将不同曝光下的漫反射和镜面预测合并为一个高动态范围图像（HDRI）图。我们全面评估了我们的方法和设计选择，以确立LiMo在空间控制和预测准确性方面的最新水平。

Summary / 总结

Lighting in Motion (LiMo) is a diffusion-based approach for spatiotemporal lighting estimation, aiming to predict realistic high-frequency details and accurate illuminance. It generates mirrored and diffuse spheres at various exposures based on 3D positions, fine-tunes diffusion models on a large customized dataset, and introduces a geometric condition for spatial conditioning. LiMo combines diffuse and mirror predictions into a single HDRI map, achieving state-of-the-art results in spatial control and prediction accuracy.

Lighting in Motion (LiMo) 是一种基于扩散的方法，用于时空光照估计，旨在预测真实的高频细节和准确的照度。它基于3D位置生成不同曝光的镜面和漫反射球体，并使用扩散先验微调现有模型。LiMo 引入了一种新的几何条件进行空间条件处理，并将漫反射和镜面预测结合成一个高动态范围光照图。该方法在空间控制和预测准确性方面被评估为最先进的技术。

Scalable Formal Verification via Autoencoder Latent Space Abstraction

Authors: Robert Reed, Morteza Lahijanian, Luca Laurenti

First: 2025-12-15T17:48:07+00:00 · Latest: 2025-12-15T17:48:07+00:00

Comments: 14 pages, 7 figures, under review

Abstract

Finite Abstraction methods provide a powerful formal framework for proving that systems satisfy their specifications. However, these techniques face scalability challenges for high-dimensional systems, as they rely on state-space discretization which grows exponentially with dimension. Learning-based approaches to dimensionality reduction, utilizing neural networks and autoencoders, have shown great potential to alleviate this problem. However, ensuring the correctness of the resulting verification results remains an open question. In this work, we provide a formal approach to reduce the dimensionality of systems via convex autoencoders and learn the dynamics in the latent space through a kernel-based method. We then construct a finite abstraction from the learned model in the latent space and guarantee that the abstraction contains the true behaviors of the original system. We show that the verification results in the latent space can be mapped back to the original system. Finally, we demonstrate the effectiveness of our approach on multiple systems, including a 26D system controlled by a neural network, showing significant scalability improvements without loss of rigor.

中文标题/摘要

标题：通过自动编码器潜在空间抽象的可扩展形式验证

有限抽象方法为证明系统满足其规范提供了强大的形式框架。然而，这些技术在处理高维系统时面临可扩展性挑战，因为它们依赖于状态空间离散化，这会随着维度的增加而呈指数增长。利用神经网络和自动编码器进行基于学习的降维方法显示出缓解这一问题的巨大潜力。然而，确保由此产生的验证结果的正确性仍然是一个开放问题。在本文中，我们提供了一种形式化的途径，通过凸自动编码器减少系统的维度，并通过核方法在潜在空间中学习动力学。然后，我们从学习到的模型中构建有限抽象，并保证该抽象包含原始系统的真正行为。我们展示了潜在空间中的验证结果可以映射回原始系统。最后，我们在多个系统上展示了我们方法的有效性，包括一个由神经网络控制的26维系统，证明了在不牺牲严谨性的情况下具有显著的可扩展性改进。

Summary / 总结

This research aims to address the scalability issue in formal verification for high-dimensional systems by leveraging convex autoencoders and kernel-based methods. The method involves reducing the dimensionality of the system and learning the dynamics in the latent space, followed by constructing a finite abstraction that accurately represents the original system's behaviors. The key experimental findings show significant scalability improvements for a 26D system controlled by a neural network without compromising the rigor of the verification process.

本文通过使用凸自编码器进行降维和核方法学习潜空间中的动力学，解决高维系统形式验证的可扩展性问题。该方法从学习模型中构建有限抽象，并确保抽象能够捕获原始系统的真正行为。实验表明，该方法在包括一个26维的神经网络控制系统的多个系统上，实现了显著的可扩展性改进，同时保持了验证的严谨性。
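
The scalability problem the abstract refers to is easy to quantify: a uniform grid with k cells per axis has k^d cells in d dimensions. The sketch below compares the 26D system mentioned in the abstract with a hypothetical 3D latent space; the latent dimension and the 10 cells per axis are illustrative assumptions, not values from the paper.

```python
def grid_cells(cells_per_dim, dim):
    """Number of cells in a uniform grid with `cells_per_dim` cells per axis."""
    return cells_per_dim ** dim


full_space = grid_cells(10, 26)   # the 26D system from the abstract
latent_space = grid_cells(10, 3)  # hypothetical 3D latent space
reduction = full_space // latent_space
```

Under these assumptions the abstraction shrinks from 10^26 cells to 10^3, which is why discretizing in the latent space becomes tractable.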

Image Diffusion Preview with Consistency Solver

Authors: Fu-Yun Wang, Hao Zhou, Liangzhe Yuan, Sanghyun Woo, Boqing Gong, Bohyung Han, Ming-Hsuan Yang, Han Zhang, Yukun Zhu, Ting Liu, Long Zhao

First: 2025-12-15T17:47:49+00:00 · Latest: 2025-12-15T17:47:49+00:00

Abstract

The slow inference process of image diffusion models significantly degrades interactive user experiences. To address this, we introduce Diffusion Preview, a novel paradigm employing rapid, low-step sampling to generate preliminary outputs for user evaluation, deferring full-step refinement until the preview is deemed satisfactory. Existing acceleration methods, including training-free solvers and post-training distillation, struggle to deliver high-quality previews or ensure consistency between previews and final outputs. We propose ConsistencySolver, a lightweight, trainable high-order solver derived from general linear multistep methods and optimized via Reinforcement Learning, which enhances preview quality and consistency. Experimental results demonstrate that ConsistencySolver significantly improves generation quality and consistency in low-step scenarios, making it ideal for efficient preview-and-refine workflows. Notably, it achieves FID scores on par with Multistep DPM-Solver using 47% fewer steps, while outperforming distillation baselines. Furthermore, user studies indicate our approach reduces overall user interaction time by nearly 50% while maintaining generation quality. Code is available at https://github.com/G-U-N/consolver.

中文标题/摘要

标题：图像扩散预览与一致性求解器

图像扩散模型的缓慢推理过程显著降低了交互式用户体验。为解决这一问题，我们引入了扩散预览，这是一种新颖的范式，通过快速、低步数采样生成初步输出供用户评估，直到预览被判定满意后再进行全步数细化。现有的加速方法，包括无训练求解器和后训练蒸馏，难以提供高质量的预览或确保预览与最终输出之间的一致性。我们提出了一致性求解器ConsistencySolver，这是一种源自通用线性多步法的轻量级、可训练的高阶求解器，通过强化学习优化，能够提升预览质量和一致性。实验结果表明，ConsistencySolver在低步数场景下显著提高了生成质量和一致性，使其成为高效的预览和细化工作流的理想选择。值得注意的是，它在使用47%更少的步骤时，FID分数与Multistep DPM-Solver相当，同时优于蒸馏基线。此外，用户研究显示，我们的方法将总体用户交互时间减少了近50%，同时保持了生成质量。代码可在https://github.com/G-U-N/consolver/ 获取。

Summary / 总结

The paper addresses the slow inference process of image diffusion models by introducing Diffusion Preview, which uses rapid, low-step sampling to generate preliminary outputs for user evaluation. To enhance preview quality and consistency, the authors propose ConsistencySolver, a lightweight, trainable solver derived from general linear multistep methods and optimized via Reinforcement Learning. Experimental results show that ConsistencySolver significantly improves generation quality and consistency, achieving FID scores comparable to Multistep DPM-Solver with fewer steps and outperforming distillation baselines. User studies also indicate a reduction of nearly 50% in user interaction time while maintaining generation quality.

研究通过引入Diffusion Preview，利用快速、低步数采样生成初步输出，以解决图像扩散模型的推理过程缓慢问题。为了提高预览质量和一致性，研究提出了一种轻量级、可训练的ConsistencySolver，通过强化学习进行优化。实验结果显示，ConsistencySolver在较少步骤的情况下实现了与Multistep DPM-Solver相当的FID分数，并优于蒸馏基线。用户研究还表明，与保持生成质量的同时减少了近50%的用户交互时间。
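
For readers unfamiliar with linear multistep methods, the family ConsistencySolver is said to derive from, the sketch below shows a generic explicit multistep ODE step: the next state is the current state plus a weighted sum of recent derivative evaluations. The coefficients here are the classical two-step Adams-Bashforth values (3/2, -1/2), chosen for illustration; the paper's solver learns its coefficients via reinforcement learning, and those learned values are not reproduced here. The function name and the test ODE are assumptions.

```python
import math


def multistep_solve(f, x0, t0, t1, steps, coeffs=(1.5, -0.5)):
    """Integrate dx/dt = f(t, x) with an explicit linear multistep method.

    coeffs[j] weights the derivative from j steps ago (coeffs[0] is the newest).
    """
    h = (t1 - t0) / steps
    x, t = x0, t0
    history = []  # most recent derivative first
    for _ in range(steps):
        history.insert(0, f(t, x))
        history = history[:len(coeffs)]
        if len(history) < len(coeffs):
            # Not enough history yet: fall back to a plain Euler step.
            x = x + h * history[0]
        else:
            x = x + h * sum(b * d for b, d in zip(coeffs, history))
        t += h
    return x


# Test problem: dx/dt = -x with x(0) = 1, whose exact solution at t = 1 is exp(-1).
decay = lambda t, x: -x
two_step = multistep_solve(decay, 1.0, 0.0, 1.0, steps=10)
euler = multistep_solve(decay, 1.0, 0.0, 1.0, steps=10, coeffs=(1.0,))
exact = math.exp(-1.0)
```

With the same ten derivative evaluations, the two-step method lands closer to the exact value than Euler does, which is the kind of low-step accuracy gain that motivates higher-order solvers for previews.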

SparsePixels: Efficient Convolution for Sparse Data on FPGAs

Authors: Ho Fung Tsoi, Dylan Rankin, Vladimir Loncar, Philip Harris

First: 2025-12-05T23:04:44+00:00 · Latest: 2025-12-15T17:47:05+00:00

Comments: Under review

Abstract

Inference of standard convolutional neural networks (CNNs) on FPGAs often incurs high latency and a long initiation interval due to the deep nested loops required to densely convolve every input pixel regardless of its feature value. However, input features can be spatially sparse in some image data, where semantic information may occupy only a small fraction of the pixels and most computation would be wasted on empty regions. In this work, we introduce SparsePixels, a framework that implements sparse convolution on FPGAs by selectively retaining and computing on a small subset of active pixels while ignoring the rest. We show that, for identifying neutrino interactions in naturally sparse LArTPC images with 4k pixels, a standard CNN with a compact size of 4k parameters incurs an inference latency of 48.665 $\mu$s on an FPGA, whereas a sparse CNN of the same base architecture, computing on less than 1% of the input pixels, achieves a $73\times$ speedup to 0.665 $\mu$s with resource utilization well within on-chip budgets, trading only a small percent-level performance loss. This work aims to benefit future algorithm development for efficient data readout in modern experiments with latency requirements of microseconds or below.

中文标题/摘要

标题：SparsePixels：FPGA上高效稀疏卷积

在FPGA上进行标准卷积神经网络（CNN）的推理往往由于需要密集卷积每个输入像素（无论其特征值如何）而产生高延迟和长启动间隔，这需要深层嵌套循环。然而，在某些图像数据中，输入特征可能是空间稀疏的，其中语义信息可能只占据一小部分像素，而大多数计算则会浪费在空旷区域上。在本文中，我们引入了SparsePixels框架，该框架通过选择性地保留并仅在一小部分活跃像素上进行计算，同时忽略其余部分，来在FPGA上实现稀疏卷积。我们展示了，在处理天然稀疏的LArTPC图像（具有4k像素）中识别中微子相互作用时，一个具有紧凑大小4k参数的标准CNN在FPGA上的推理延迟为48.665 μs，而具有相同基础架构的稀疏CNN，仅在输入像素的不到1%上进行计算，实现了73倍的加速，达到0.665 μs，资源利用率在芯片预算范围内，仅付出很小的百分比性能损失。本文旨在为具有微秒级延迟要求的现代实验中的高效数据读出算法开发提供帮助。

Summary / 总结

SparsePixels is a framework for sparse convolution on Field-Programmable Gate Arrays (FPGAs) that retains and computes on only a small subset of active pixels in spatially sparse input data, reducing inference latency. For identifying neutrino interactions in LArTPC images, a sparse CNN computing on less than 1% of the input pixels cuts latency from 48.665 μs to 0.665 μs, a 73x speedup, with resource utilization within on-chip budgets and only a small, percent-level performance loss.

该研究引入了SparsePixels框架，旨在提高FPGA上稀疏卷积的效率，解决标准CNN在处理稀疏数据时的低效问题。通过仅选择性地处理活跃像素，SparsePixels将一个4k参数CNN的推理延迟从48.665 μs降低到0.665 μs，实现了73倍的加速，同时保持在片上资源限制内，并仅牺牲了少量的性能。
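
The idea of computing only on active pixels can be sketched in software. The version below stores non-zero pixels in a dictionary and evaluates a 3x3 kernel only at those sites, so the work scales with the number of active pixels rather than the image size. Restricting outputs to active input sites, the zero threshold, and the function names are assumptions for illustration; the paper's FPGA dataflow is not reproduced here.

```python
def dense_conv3x3(img, kernel):
    """Reference dense 3x3 cross-correlation with zero padding."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += kernel[dr + 1][dc + 1] * img[rr][cc]
            out[r][c] = acc
    return out


def sparse_conv3x3(img, kernel, threshold=0.0):
    """Evaluate the same kernel only at active pixels (|value| > threshold)."""
    active = {(r, c): v
              for r, row in enumerate(img)
              for c, v in enumerate(row)
              if abs(v) > threshold}
    out = {}
    for (r, c) in active:
        acc = 0.0
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                v = active.get((r + dr, c + dc))  # inactive neighbours are skipped
                if v is not None:
                    acc += kernel[dr + 1][dc + 1] * v
        out[(r, c)] = acc
    return out


# Toy 5x5 image with three active pixels (hypothetical values).
image = [[0.0] * 5 for _ in range(5)]
image[1][1], image[1][2], image[3][4] = 2.0, -1.0, 0.5
kernel = [[1.0, 0.0, -1.0], [2.0, 1.0, -2.0], [1.0, 0.0, -1.0]]

dense_out = dense_conv3x3(image, kernel)
sparse_out = sparse_conv3x3(image, kernel)
```

At every active site the sparse result matches the dense one, while the sparse loop visits 3 pixels instead of 25.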

ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding

Authors: Jia-Nan Li, Jian Guan, Wei Wu, Chongxuan Li

First: 2025-12-15T17:41:19+00:00 · Latest: 2025-12-15T17:41:19+00:00

Abstract

Autoregressive models (ARMs) are hindered by slow sequential inference. While masked diffusion models (MDMs) offer a parallel alternative, they suffer from critical drawbacks: high computational overhead from precluding Key-Value (KV) caching, and incoherent generation arising from learning dependencies over an intractable space of token combinations. To address these limitations, we introduce ReFusion, a novel masked diffusion model that achieves superior performance and efficiency by elevating parallel decoding from the token level to a higher slot level, where each slot is a fixed-length, contiguous sub-sequence. This is achieved through an iterative ``plan-and-infill'' decoding process: a diffusion-based planning step first identifies a set of weakly dependent slots, and an autoregressive infilling step then decodes these selected slots in parallel. The slot-based design simultaneously unlocks full KV cache reuse with a unified causal framework and reduces the learning complexity from the token combination space to a manageable slot-level permutation space. Extensive experiments on seven diverse benchmarks show that ReFusion not only overwhelmingly surpasses prior MDMs with 34% performance gains and an over 18$\times$ speedup on average, but also bridges the performance gap to strong ARMs while maintaining a 2.33$\times$ average speedup.

中文标题/摘要

标题：ReFusion：一种具有并行自回归解码的扩散大型语言模型

自回归模型（ARMs）受到缓慢的顺序推理限制。虽然掩蔽扩散模型（MDMs）提供了并行的替代方案，但它们也存在关键缺陷：由于禁止使用键值（KV）缓存而导致的高计算开销，以及由于在难以处理的令牌组合空间中学习依赖关系而导致的不连贯生成。为了解决这些限制，我们引入了ReFusion，这是一种新颖的掩蔽扩散模型，通过将并行解码从令牌级别提升到更高的一级插槽级别，实现了更优的性能和效率，其中每个插槽是一个固定长度的连续子序列。这通过迭代的“计划和填充”解码过程实现：基于扩散的计划步骤首先识别一组弱依赖插槽，然后自回归填充步骤并行解码这些选定的插槽。基于插槽的设计同时解锁了完整的KV缓存重用，并通过统一的因果框架减少了从令牌组合空间到可管理的插槽级排列空间的学习复杂度。在七个不同的基准测试上的广泛实验表明，ReFusion不仅在性能上大幅超越了之前的MDMs，平均性能提高了34%，速度提高了18倍以上，而且还缩小了与强大的ARMs之间的性能差距，同时保持了2.33倍的平均加速。

Summary / 总结

ReFusion is a masked diffusion model that lifts parallel decoding from the token level to the slot level, where each slot is a fixed-length, contiguous sub-sequence. A diffusion-based planning step selects weakly dependent slots, and an autoregressive infilling step decodes them in parallel, which enables full KV cache reuse. On seven benchmarks, ReFusion surpasses previous masked diffusion models with 34% performance gains and an over 18x average speedup, and it narrows the performance gap to strong autoregressive models while maintaining a 2.33x average speedup.

ReFusion 是一种新型的掩码扩散模型，通过在槽级别实现并行解码来改进自回归模型，从而提高性能和效率。它通过使全键值缓存重用成为可能并降低学习复杂度来解决掩码扩散模型的限制。在七个基准上的实验表明，ReFusion 在性能上比之前的掩码扩散模型高出 34%，平均加速比超过 18 倍，同时在保持 2.33 倍平均加速比的同时，也缩小了与强大自回归模型之间的性能差距。

DP-CSGP: Differentially Private Stochastic Gradient Push with Compressed Communication

Authors: Zehan Zhu, Heng Zhao, Yan Huang, Joey Tianyi Zhou, Shouling Ji, Jinming Xu

First: 2025-12-15T17:37:02+00:00 · Latest: 2025-12-15T17:37:02+00:00

Comments: 13 pages

Abstract

In this paper, we propose a Differentially Private Stochastic Gradient Push with Compressed communication (termed DP-CSGP) for decentralized learning over directed graphs. Different from existing works, the proposed algorithm is designed to maintain high model utility while ensuring both rigorous differential privacy (DP) guarantees and efficient communication. For general non-convex and smooth objective functions, we show that the proposed algorithm achieves a tight utility bound of $\mathcal{O}\left( \sqrt{d\log \left( \frac{1}{\delta} \right)}/(\sqrt{n}J\epsilon) \right)$ ($J$ and $d$ are the number of local samples and the dimension of decision variables, respectively) with $\left(\epsilon, \delta\right)$-DP guarantee for each node, matching that of decentralized counterparts with exact communication. Extensive experiments on benchmark tasks show that, under the same privacy budget, DP-CSGP achieves comparable model accuracy with significantly lower communication cost than existing decentralized counterparts with exact communication.

中文标题/摘要

标题：DP-CSGP：差分隐私随机梯度推送压缩通信

在本文中，我们提出了一种差分隐私随机梯度推送压缩通信（称为DP-CSGP）方法，用于有向图上的分布式学习。与现有工作不同，所提出的算法旨在在保持高模型实用性的前提下，同时确保严格的差分隐私（DP）保证和高效的通信。对于一般的非凸和平滑目标函数，我们证明所提出的算法在每个节点上具有$(ε, δ)$-DP保证时，实现了紧致的实用界$\mathcal{O}\left( \sqrt{d\log \left( \frac{1}δ \right)}/(\sqrt{n}Jε) \right)$（$J$和$d$分别是局部样本数和决策变量的维数），与具有精确通信的分布式对应物相匹配。在基准任务上的广泛实验表明，在相同的隐私预算下，DP-CSGP在显著降低通信成本的同时，实现了可比的模型精度。

Summary / 总结

The paper introduces DP-CSGP, a differentially private stochastic gradient push algorithm with compressed communication for decentralized learning over directed graphs. It aims to balance high model utility with rigorous differential privacy and efficient communication. The algorithm achieves a tight utility bound matching decentralized counterparts with exact communication, and experimental results demonstrate that DP-CSGP can achieve comparable model accuracy with significantly lower communication cost under the same privacy budget.

该研究提出了一种名为DP-CSGP的差分隐私随机梯度推算法，用于有向图上的分布式学习，该算法旨在保持高模型性能的同时确保严格的差分隐私和高效的通信。实验结果表明，在相同的隐私预算下，该算法可以实现与现有精确通信的分布式方法相当的模型准确性，但通信成本显著较低。
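
The two ingredients the abstract combines, a differentially private gradient and compressed communication, can be sketched per node as clip, perturb, compress. The sketch below uses L2 clipping with Gaussian noise and a top-k sparsifier; these are common choices assumed for illustration, since the abstract does not name the paper's noise mechanism or compressor, and the push-sum mixing over the directed graph is omitted. All function names and values are hypothetical.

```python
import math
import random


def clip(grad, max_norm):
    """Scale the gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def add_gaussian_noise(grad, sigma, rng):
    """Add independent zero-mean Gaussian noise to every coordinate."""
    return [g + rng.gauss(0.0, sigma) for g in grad]


def top_k(vec, k):
    """Keep the k largest-magnitude entries and zero the rest."""
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]


rng = random.Random(0)  # fixed seed so the example is repeatable
raw_grad = [3.0, 4.0, 0.0, 0.0]
clipped = clip(raw_grad, max_norm=1.0)
private = add_gaussian_noise(clipped, sigma=0.1, rng=rng)
message = top_k(private, k=2)  # what a node would transmit to its out-neighbours
```

Only the k retained coordinates need to be sent, which is where the communication saving would come from.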

WakeupUrban: Unsupervised Semantic Segmentation of Mid-20$^{th}$ century Urban Landscapes with Satellite Imagery

Authors: Tianxiang Hao, Lixian Zhang, Yingjia Zhang, Mengxuan Chen, Jinxiao Zhang, Runmin Dong, Haohuan Fu

First: 2025-06-11T07:41:30+00:00 · Latest: 2025-12-15T17:33:00+00:00

Comments: 11 pages, 4 figures, 3 tables

Abstract

Historical satellite imagery archives, such as Keyhole satellite data, offer rare insights into early urban development and long-term transformation. However, severe quality degradation (e.g., distortion, misalignment, and spectral scarcity) and the absence of annotations have long hindered their analysis. To bridge this gap and enhance understanding of urban development, we introduce WakeupUrbanBench, an annotated segmentation dataset based on historical satellite imagery with the earliest observation time among all existing remote sensing (RS) datasets, along with WakeupUSM, a framework for unsupervised segmentation tasks. First, WakeupUrbanBench is a pioneering, expertly annotated dataset built on mid-20th-century RS imagery, involving four key urban classes and spanning 4 cities across 2 continents with nearly 1000 km$^2$ of diverse urban morphologies, and additionally introducing one present-day city. Second, WakeupUSM is a novel unsupervised semantic segmentation framework for historical RS imagery. It employs a confidence-aware alignment mechanism and a focal-confidence loss based on a self-supervised learning architecture, which generates robust pseudo-labels and adaptively prioritizes prediction difficulty and label reliability to improve unsupervised segmentation on noisy historical data without manual supervision. Comprehensive experiments demonstrate that WakeupUSM significantly outperforms existing unsupervised segmentation methods on both WakeupUrbanBench and a public dataset, promising to pave the way for quantitative studies of long-term urban change using modern computer vision. Our benchmark and codes will be released at https://github.com/Tianxiang-Hao/WakeupUrban.

中文标题/摘要

标题：WakeupUrban：基于卫星影像的20世纪中期城市景观无监督语义分割

历史卫星影像档案（如 Keyhole“锁眼”卫星数据）为理解早期城市发展和长期演变提供了难得的视角。然而，严重的质量退化（例如畸变、错位和光谱信息匮乏）以及标注的缺失长期阻碍了对其的分析。为弥合这一差距并加深对城市发展的理解，我们提出了WakeupUrbanBench，这是一个基于历史卫星影像的标注分割数据集，其观测时间在所有现有遥感（RS）数据集中最早；同时提出了用于无监督分割任务的框架WakeupUSM。首先，WakeupUrbanBench是一个开创性的、由专家标注的数据集，基于20世纪中期的遥感影像，涉及四个关键城市类别，覆盖两个大洲的四座城市，总面积近1000平方公里，包含多种城市形态，并额外引入了一座现代城市。其次，WakeupUSM是一个面向历史遥感影像的新型无监督语义分割框架。它基于自监督学习架构，采用置信度感知对齐机制和焦点-置信度损失，生成稳健的伪标签，并自适应地兼顾预测难度和标签可靠性，从而在无需人工监督的情况下提升含噪历史数据上的无监督分割效果。全面的实验表明，WakeupUSM在WakeupUrbanBench和公共数据集上均显著优于现有的无监督分割方法，有望为利用现代计算机视觉定量研究长期城市变化铺平道路。我们的基准和代码将发布于 https://github.com/Tianxiang-Hao/WakeupUrban。

Summary / 总结

The research aims to use historical satellite imagery to understand early urban development and long-term transformation, addressing quality degradation and the lack of annotations. The study introduces WakeupUrbanBench, an annotated segmentation dataset of mid-20th century urban landscapes, and WakeupUSM, a novel unsupervised semantic segmentation framework. WakeupUSM uses a confidence-aware alignment mechanism and a focal-confidence loss to generate robust pseudo-labels, improving unsupervised segmentation on noisy historical data. Experiments show that WakeupUSM outperforms existing unsupervised segmentation methods on both WakeupUrbanBench and public datasets, which the authors present as a step toward quantitative studies of long-term urban change.

研究旨在利用历史卫星图像来理解早期的城市发展和长期变化，解决质量问题和缺乏标注的问题。研究引入了WakeupUrbanBench，这是一个基于20世纪中期遥感图像的标注数据集，以及WakeupUSM，这是一种新颖的无监督语义分割框架。WakeupUSM采用了一种基于自监督学习架构的置信度感知对齐机制和焦点置信损失，生成稳健的伪标签，提高无监督分割在嘈杂历史数据上的性能。实验表明，WakeupUSM在WakeupUrbanBench和公开数据集上均优于现有方法，推动了长期城市变化研究的发展。

From Moderation to Mediation: Can LLMs Serve as Mediators in Online Flame Wars?

Authors: Dawei Li, Abdullah Alnaibari, Muhammad Arslan, Manny Sandoval, Deborah Hall, Yasin Silva, Huan Liu

First: 2025-12-02T18:31:18+00:00 · Latest: 2025-12-15T17:31:26+00:00

Comments: Under review

Abstract

The rapid advancement of large language models (LLMs) has opened new possibilities for AI-for-good applications. As LLMs increasingly mediate online communication, their potential to foster empathy and constructive dialogue becomes an important frontier for responsible AI research. This work explores whether LLMs can serve not only as moderators that detect harmful content, but also as mediators capable of understanding and de-escalating online conflicts. Our framework decomposes mediation into two subtasks: judgment, where an LLM evaluates the fairness and emotional dynamics of a conversation, and steering, where it generates empathetic, de-escalatory messages to guide participants toward resolution. To assess mediation quality, we construct a large Reddit-based dataset and propose a multi-stage evaluation pipeline combining principle-based scoring, user simulation, and human comparison. Experiments show that API-based models outperform open-source counterparts in both reasoning and intervention alignment during mediation. Our findings highlight both the promise and limitations of current LLMs as emerging agents for online social mediation.

中文标题/摘要

标题：从审查到调解：大语言模型能否在在线争吵中担任调解人？

大型语言模型（LLMs）的迅速发展为AI向善的应用开辟了新的可能性。随着LLMs越来越多地调解在线交流，它们促进同理心和建设性对话的潜力成为负责任AI研究的重要前沿。本研究探讨LLMs是否不仅能作为检测有害内容的审查者，还能作为能够理解并缓解在线冲突的调解人。我们的框架将调解分解为两个子任务：判断，即LLM评估对话的公平性和情感动态；引导，即生成同理心、缓解冲突的消息，引导参与者走向解决。为了评估调解质量，我们构建了一个基于Reddit的大规模数据集，并提出了一种结合原则评分、用户模拟和人工对比的多阶段评估管道。实验表明，基于API的模型在调解时在推理和干预一致性方面优于开源版本。我们的研究结果突显了当前LLMs作为在线社会调解新兴代理的潜力和局限性。

Summary / 总结

This study investigates whether large language models (LLMs) can act as mediators in online conflicts, beyond their role as moderators. The research proposes a framework that decomposes mediation into judgment and steering tasks. A multi-stage evaluation pipeline was used to assess the quality of mediation, showing that API-based models outperform open-source models in both reasoning and intervention alignment during mediation. The findings suggest that while current LLMs show promise, they still have limitations in online social mediation.

这项研究探讨了大型语言模型（LLMs）是否可以在在线冲突中扮演调解者的角色，而不仅仅是作为内容审查员。研究提出了一个将调解分解为判断和引导两个子任务的框架，并使用多阶段评估流程来评估调解质量。结果显示，基于API的模型在推理和干预一致性方面优于开源模型。研究结果表明，尽管当前的LLMs显示出一定的潜力，但在在线社会调解方面仍存在局限性。

MMhops-R1: Multimodal Multi-hop Reasoning

Authors: Tao Zhang, Ziqi Zhang, Zongyang Ma, Yuxin Chen, Bing Li, Chunfeng Yuan, Guangting Wang, Fengyun Rao, Ying Shan, Weiming Hu

Venue: AAAI 2026

First: 2025-12-15T17:29:02+00:00 · Latest: 2025-12-15T17:29:02+00:00

Comments: Accepted by AAAI 2026

Abstract

The ability to perform multi-modal multi-hop reasoning by iteratively integrating information across various modalities and external knowledge is critical for addressing complex real-world challenges. However, existing Multi-modal Large Language Models (MLLMs) are predominantly limited to single-step reasoning, as existing benchmarks lack the complexity needed to evaluate and drive multi-hop abilities. To bridge this gap, we introduce MMhops, a novel, large-scale benchmark designed to systematically evaluate and foster multi-modal multi-hop reasoning. The MMhops dataset comprises two challenging task formats, Bridging and Comparison, which require models to dynamically construct complex reasoning chains by integrating external knowledge. To tackle the challenges posed by MMhops, we propose MMhops-R1, a novel multi-modal Retrieval-Augmented Generation (mRAG) framework for dynamic reasoning. Our framework uses reinforcement learning to optimize the model for autonomously planning reasoning paths, formulating targeted queries, and synthesizing multi-level information. Comprehensive experiments demonstrate that MMhops-R1 significantly outperforms strong baselines on MMhops, highlighting that dynamic planning and multi-modal knowledge integration are crucial for complex reasoning. Moreover, MMhops-R1 demonstrates strong generalization to tasks requiring fixed-hop reasoning, underscoring the robustness of our dynamic planning approach. In conclusion, our work contributes a challenging new benchmark and a powerful baseline model, and we will release the associated code, data, and weights to catalyze future research in this critical area.
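To make the "Bridging" format concrete, the toy sketch below chains two lookups over a small external knowledge table, where the answer to the first hop becomes the query entity for the second. Everything here is illustrative: the table, the `retrieve` helper, and the fixed relation list are assumptions, and the hand-written plan stands in for the reasoning path that MMhops-R1 is described as learning with reinforcement learning.

```python
# Toy external knowledge: (entity, relation) -> answer. Purely illustrative.
KNOWLEDGE = {
    ("Eiffel Tower", "located in"): "Paris",
    ("Paris", "capital of"): "France",
}

def retrieve(entity, relation):
    """Look up one fact; returns None when the knowledge table has no entry."""
    return KNOWLEDGE.get((entity, relation))

def bridging_answer(start_entity, relations):
    """Follow `relations` hop by hop, feeding each answer into the next query.

    Returns (final_answer, chain). final_answer is None if any hop fails,
    and chain holds the entities visited up to that point.
    """
    chain = [start_entity]
    current = start_entity
    for relation in relations:
        nxt = retrieve(current, relation)
        if nxt is None:
            return None, chain
        chain.append(nxt)
        current = nxt
    return current, chain

# "Which country is the landmark in the image located in?", assuming the
# image has already been recognized as the Eiffel Tower.
answer, chain = bridging_answer("Eiffel Tower", ["located in", "capital of"])
```

In a multi-modal setting the starting entity would come from the image rather than from text, and the number of hops would not be known in advance, which is the part this fixed-plan sketch leaves out.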

中文标题/摘要

标题：MMhops-R1：多模态多跳推理

在迭代地跨多种模态和外部知识整合信息以进行多模态多跳推理方面的能力对于解决复杂的现实世界挑战至关重要。然而，现有的多模态大型语言模型（MLLMs）主要局限于单步推理，因为现有的基准测试缺乏评估和推动多跳能力所需的复杂性。为弥合这一差距，我们引入了MMhops，这是一种新型的大规模基准测试，旨在系统地评估和促进多模态多跳推理。MMhops数据集包含两种具有挑战性的任务格式，即桥接和比较，这要求模型动态构建复杂的推理链并整合外部知识。为应对MMhops带来的挑战，我们提出了MMhops-R1，这是一种新颖的多模态检索增强生成（mRAG）框架，用于动态推理。我们的框架利用强化学习优化模型，使其能够自主规划推理路径、提出有针对性的查询并综合多层次信息。全面的实验表明，MMhops-R1在MMhops上显著优于强大的基线模型，突显了动态规划和多模态知识整合对于复杂推理的重要性。此外，MMhops-R1在需要固定跳数推理的任务中表现出强大的泛化能力，强调了我们动态规划方法的稳健性。总之，我们的工作贡献了一个具有挑战性的新基准和一个强大的基线模型，并将发布相关的代码、数据和权重，以促进该关键领域的未来研究。

Summary / 总结

The paper introduces MMhops, a new benchmark for evaluating and promoting multi-modal multi-hop reasoning, which is essential for addressing complex real-world challenges. MMhops-R1, a novel Retrieval-Augmented Generation framework, is proposed to handle the challenges posed by MMhops through reinforcement learning for dynamic reasoning. Experiments show that MMhops-R1 outperforms strong baselines and demonstrates strong generalization to fixed-hop reasoning tasks, emphasizing the importance of dynamic planning and multi-modal knowledge integration.

研究旨在开发能够进行多模态多跳推理的模型，这对于解决复杂的现实世界挑战至关重要。为了解决现有基准的局限性，作者引入了MMhops，这是一个新的大规模基准，用于评估和促进多模态多跳推理。他们提出了MMhops-R1，这是一种使用强化学习优化动态推理路径并整合多模态知识的检索增强生成框架。实验表明，MMhops-R1在MMhops上优于强基线，并且在固定跳推理任务上表现出强大的泛化能力，突显了动态规划和多模态知识整合对于复杂推理的重要性。

Superposition as Lossy Compression: Measure with Sparse Autoencoders and Connect to Adversarial Vulnerability

Authors: Leonard Bereska, Zoe Tzifa-Kratira, Reza Samavi, Efstratios Gavves

Venue: Transactions on Machine Learning Research, 2025

First: 2025-12-15T17:25:39+00:00 · Latest: 2025-12-15T17:25:39+00:00

Comments: Accepted to TMLR, view HTML here: https://leonardbereska.github.io/blog/2025/superposition/

Abstract

Neural networks achieve remarkable performance through superposition: encoding multiple features as overlapping directions in activation space rather than dedicating individual neurons to each feature. This challenges interpretability, yet we lack principled methods to measure superposition. We present an information-theoretic framework measuring a neural representation's effective degrees of freedom. We apply Shannon entropy to sparse autoencoder activations to compute the number of effective features as the minimum neurons needed for interference-free encoding. Equivalently, this measures how many "virtual neurons" the network simulates through superposition. When networks encode more effective features than actual neurons, they must accept interference as the price of compression. Our metric strongly correlates with ground truth in toy models, detects minimal superposition in algorithmic tasks, and reveals systematic reduction under dropout. Layer-wise patterns mirror intrinsic dimensionality studies on Pythia-70M. The metric also captures developmental dynamics, detecting sharp feature consolidation during grokking. Surprisingly, adversarial training can increase effective features while improving robustness, contradicting the hypothesis that superposition causes vulnerability. Instead, the effect depends on task complexity and network capacity: simple tasks with ample capacity allow feature expansion (abundance regime), while complex tasks or limited capacity force reduction (scarcity regime). By defining superposition as lossy compression, this work enables principled measurement of how neural networks organize information under computational constraints, connecting superposition to adversarial robustness.
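The abstract says only that Shannon entropy is applied to sparse autoencoder activations to obtain a count of effective features. A common way to turn an entropy into an effective count is to exponentiate it, so the sketch below assumes that form: normalize per-feature activation mass into a distribution, compute its entropy, and take the exponential. The function names and the choice of normalization are assumptions, not the paper's exact definition.

```python
import math

def effective_features(activations):
    """exp(Shannon entropy) of the normalized per-feature activation mass.

    `activations` holds one non-negative total per feature, for example the
    summed absolute sparse-autoencoder activation of that feature over a batch.
    """
    total = sum(activations)
    if total <= 0:
        return 0.0
    probs = [a / total for a in activations if a > 0]   # zero-mass features add no entropy
    entropy = -sum(p * math.log(p) for p in probs)      # measured in nats
    return math.exp(entropy)

def features_per_neuron(activations, n_neurons):
    """Effective features divided by actual neurons; above 1 suggests superposition."""
    return effective_features(activations) / n_neurons

uniform = effective_features([1.0, 1.0, 1.0, 1.0])    # four equally used features
one_hot = effective_features([1.0, 0.0, 0.0, 0.0])    # a single active feature
ratio = features_per_neuron([1.0] * 8, n_neurons=4)   # eight features in four neurons
```

With this definition, evenly used features count fully and rarely used ones count fractionally, so the measure behaves like a soft count rather than a threshold on activation.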

中文标题/摘要

标题：叠加作为有损压缩：使用稀疏自编码器度量并连接到对抗性脆弱性

神经网络通过叠加实现卓越的性能：将多个特征编码为激活空间中的重叠方向，而不是为每个特征分配单独的神经元。这给可解释性带来了挑战，但我们仍缺乏衡量叠加的原理性方法。我们提出了一种信息论框架，用于衡量神经表示的有效自由度。我们将香农熵应用于稀疏自编码器的激活，以计算有效特征数量，即无干扰编码所需的最少神经元数量。等价地，这衡量了网络通过叠加模拟的“虚拟神经元”数量。当网络编码的有效特征多于实际神经元时，它们必须接受干扰作为压缩的代价。我们的度量在玩具模型中与真实值高度相关，在算法任务中检测到极少的叠加，并揭示了dropout下叠加的系统性减少。逐层模式与Pythia-70M上的内在维度研究相呼应。该度量还能捕捉训练过程中的动态变化，检测到顿悟（grokking）期间特征的急剧整合。令人惊讶的是，对抗性训练可以在提高鲁棒性的同时增加有效特征，这与“叠加导致脆弱性”的假设相矛盾。实际上，这一效应取决于任务复杂性和网络容量：容量充足的简单任务允许特征扩展（充裕区间），而复杂任务或有限容量则迫使特征减少（稀缺区间）。通过将叠加定义为有损压缩，本工作使我们能够以原理性的方式度量神经网络在计算约束下如何组织信息，并将叠加与对抗性鲁棒性联系起来。

Summary / 总结

This paper introduces an information-theoretic framework to measure superposition in neural networks by quantifying the effective degrees of freedom using Shannon entropy on sparse autoencoder activations. When a network encodes more effective features than it has actual neurons, it must accept interference as the price of this lossy compression. Key findings include strong correlation with ground truth in toy models, detection of minimal superposition in algorithmic tasks, and systematic reduction under dropout. The study also finds that adversarial training can increase effective features while improving robustness, challenging the hypothesis that superposition causes vulnerability. Instead, the effect depends on task complexity and network capacity, distinguishing between abundance and scarcity regimes.

该研究提出了一种信息论框架，通过在稀疏自编码器激活上应用香农熵来量化神经网络的有效自由度，从而度量叠加。当网络编码的有效特征多于实际神经元时，它必须接受干扰作为这种有损压缩的代价。关键发现包括：在玩具模型中与真实值强相关，在算法任务中检测到极少的叠加，以及在dropout下叠加系统性减少。研究还发现，对抗训练可以在提高鲁棒性的同时增加有效特征，挑战了叠加会导致脆弱性的假设。实际上，这种效应取决于任务复杂性和网络容量，并可区分为充裕和稀缺两种区间。

MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources

Authors: Huu Nguyen, Victor May, Harsh Raj, Marianna Nezhurina, Yishan Wang, Yanqi Luo, Minh Chien Vu, Taishi Nakamura, Ken Tsui, Van Khue Nguyen, David Salinas, Aleksandra Krasnodębska, Christoph Schuhmann, Mats Leon Richter, Xuan-Son Vu, Jenia Jitsev

First: 2025-09-29T21:40:10+00:00 · Latest: 2025-12-15T17:24:36+00:00

Comments: Code: https://github.com/ontocord/mixturevitae

Abstract

We present MixtureVitae, an open-access pretraining corpus built to minimize legal risk while providing strong downstream performance. MixtureVitae follows a permissive-first, risk-mitigated sourcing strategy that combines public-domain and permissively licensed text (e.g., CC-BY/Apache) with carefully justified low-risk additions (e.g., government works and EU TDM-eligible sources). MixtureVitae adopts a simple, single-stage pretraining recipe that integrates a large proportion of permissive synthetic instruction and reasoning data, signals that are typically introduced during post-training and are generally scarce in permissive web corpora. We categorize all sources into a three-tier scheme that reflects varying risk levels and provide shard-level provenance metadata to enable risk-aware usage. In controlled experiments using the open-sci-ref training protocol (fixed architectures and hyperparameters; 50B and 300B token budgets across 130M-1.7B parameters), models trained on MixtureVitae consistently outperform other permissive datasets across a suite of standard benchmarks, and at the 1.7B-parameter/300B-token setting, they surpass FineWeb-Edu and approach DCLM late in training. Performance is particularly strong on MMLU and on math and code benchmarks: a 1.7B model pretrained on 300B MixtureVitae tokens matches or exceeds a strong 1.7B instruction-tuned baseline on GSM8K, HumanEval, and MBPP, despite using over 36 times fewer tokens (300B vs. ~11T). Supported by a thorough decontamination analysis, these results show that permissive-first data with high instruction and reasoning density, tiered by licensing and provenance-related risk, can provide a practical and risk-mitigated foundation for training capable LLMs, reducing reliance on broad web scrapes without sacrificing competitiveness. Code: https://github.com/ontocord/mixturevitae
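The tiering idea can be pictured as a filter over shard metadata: each shard carries a risk tier, and a user keeps only the shards at or below the tier they are comfortable with. The sketch below is a hypothetical illustration. The `Shard` fields, the tier numbering, the tier assigned to each source, and the token counts are all invented placeholders rather than the dataset's actual schema or contents. The last lines check the "over 36 times fewer tokens" figure from the quoted 300B and roughly 11T budgets.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Shard:
    name: str
    license: str
    risk_tier: int   # hypothetical encoding: 1 = lowest risk, 3 = highest
    tokens: int      # placeholder token count, not a real statistic

def select_shards(shards, max_tier):
    """Keep only shards whose risk tier is at or below `max_tier`."""
    return [s for s in shards if s.risk_tier <= max_tier]

def total_tokens(shards):
    """Sum the token counts of the given shards."""
    return sum(s.tokens for s in shards)

# Invented example metadata; tier assignments are for illustration only.
SHARDS = [
    Shard("public_domain_books", "public-domain", 1, 40),
    Shard("permissive_code", "Apache-2.0", 1, 25),
    Shard("government_reports", "government-work", 2, 20),
    Shard("tdm_eligible_web", "EU-TDM", 3, 15),
]

strict = select_shards(SHARDS, max_tier=1)
strict_share = total_tokens(strict) / total_tokens(SHARDS)

# Token-budget ratio quoted in the abstract: ~11T versus 300B tokens.
token_ratio = 11_000_000_000_000 / 300_000_000_000   # about 36.7
```

Keeping provenance at the shard level means this kind of filter can be applied without reprocessing the documents themselves, at the cost of a smaller corpus for stricter settings.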

中文标题/摘要

标题：MixtureVitae：开放的网络规模预训练数据集，基于宽松许可的文本源构建，包含高质量的指令和推理数据

我们介绍了MixtureVitae，一个开放访问的预训练语料库，旨在最小化法律风险的同时提供强大的下游性能。MixtureVitae 采用宽松许可优先、风险缓解的数据来源策略，将公共领域和宽松许可的文本（例如CC-BY/Apache）与经过仔细论证的低风险补充内容（例如政府作品和符合欧盟TDM条件的来源）相结合。MixtureVitae 采用简单的单阶段预训练方案，融入了大比例的宽松许可合成指令和推理数据，这类信号通常在后训练阶段才引入，并且在宽松许可的网络语料库中普遍稀缺。我们将所有来源划分为反映不同风险级别的三级体系，并提供分片级别的来源元数据，以支持具备风险意识的使用。在采用 open-sci-ref 训练协议（固定架构和超参数；130M-1.7B参数，50B和300B令牌预算）的受控实验中，使用MixtureVitae 训练的模型在一系列标准基准测试中始终优于其他宽松许可数据集；在1.7B参数/300B令牌设置下，它们超越了FineWeb-Edu，并在训练后期接近DCLM。模型在MMLU以及数学和代码基准测试上表现尤为突出：一个使用300B MixtureVitae令牌预训练的1.7B模型在GSM8K、HumanEval和MBPP上达到或超过了一个强大的1.7B指令调优基线，而所用令牌数量不到后者的1/36（300B vs. ~11T）。在全面的去污染分析支持下，这些结果表明，具有高指令和推理密度、并按许可与来源相关风险分级的宽松许可优先数据，可以为训练强大的语言模型提供实用且风险可控的基础，在不牺牲竞争力的前提下减少对大范围网络抓取的依赖。代码：https://github.com/ontocord/mixturevitae

Summary / 总结

MixtureVitae is an open-access pretraining dataset designed to minimize legal risks while maintaining strong downstream performance. It combines public-domain and permissively licensed text with carefully selected low-risk additions, and uses a simple, single-stage pretraining recipe with a large proportion of synthetic instruction and reasoning data. Models trained on MixtureVitae outperform other permissive datasets across various benchmarks, especially on MMLU, math, and code benchmarks, with a 1.7B model pretrained on 300B tokens matching or exceeding a strong 1.7B instruction-tuned baseline despite using significantly fewer tokens. This demonstrates the effectiveness of permissive-first data with high instruction and reasoning density for training capable LLMs.

MixtureVitae 是一个开放访问的预训练数据集，旨在最小化法律风险同时保持强大的下游性能。它将公共领域和宽松许可的文本与经过仔细筛选的低风险补充内容相结合，并使用简单的单阶段预训练方案，其中包含大比例的合成指令和推理数据。使用 MixtureVitae 训练的模型在各种基准测试中表现优于其他宽松许可数据集，特别是在 MMLU、数学和代码任务方面；一个在 300B 令牌上预训练的 1.7B 模型达到或超过了一个强大的 1.7B 指令调优基线，尽管使用的令牌数量要少得多。

Memory in the Age of AI Agents

Authors: Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, Senjie Jin, Jiejun Tan, Yanbin Yin, Jiongnan Liu, Zeyu Zhang, Zhongxiang Sun, Yutao Zhu, Hao Sun, Boci Peng, Zhenrong Cheng, Xuanbo Fan, Jiaxin Guo, Xinlei Yu, Zhenhong Zhou, Zewen Hu, Jiahao Huo, Junhao Wang, Yuwei Niu, Yu Wang, Zhenfei Yin, Xiaobin Hu, Yue Liao, Qiankun Li, Kun Wang, Wangchunshu Zhou, Yixin Liu, Dawei Cheng, Qi Zhang, Tao Gui, Shirui Pan, Yan Zhang, Philip Torr, Zhicheng Dou, Ji-Rong Wen, Xuanjing Huang, Yu-Gang Jiang, Shuicheng Yan

First: 2025-12-15T17:22:34+00:00 · Latest: 2025-12-15T17:22:34+00:00

Abstract

Memory has emerged as, and will continue to remain, a core capability of foundation model-based agents. As research on agent memory rapidly expands and attracts unprecedented attention, the field has also become increasingly fragmented. Existing works that fall under the umbrella of agent memory often differ substantially in their motivations, implementations, and evaluation protocols, while the proliferation of loosely defined memory terminologies has further obscured conceptual clarity. Traditional taxonomies such as long/short-term memory have proven insufficient to capture the diversity of contemporary agent memory systems. This work aims to provide an up-to-date landscape of current agent memory research. We begin by clearly delineating the scope of agent memory and distinguishing it from related concepts such as LLM memory, retrieval augmented generation (RAG), and context engineering. We then examine agent memory through the unified lenses of forms, functions, and dynamics. From the perspective of forms, we identify three dominant realizations of agent memory, namely token-level, parametric, and latent memory. From the perspective of functions, we propose a finer-grained taxonomy that distinguishes factual, experiential, and working memory. From the perspective of dynamics, we analyze how memory is formed, evolves, and is retrieved over time. To support practical development, we compile a comprehensive summary of memory benchmarks and open-source frameworks. Beyond consolidation, we articulate a forward-looking perspective on emerging research frontiers, including memory automation, reinforcement learning integration, multimodal memory, multi-agent memory, and trustworthiness issues. We hope this survey serves not only as a reference for existing work, but also as a conceptual foundation for rethinking memory as a first-class primitive in the design of future agentic intelligence.
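The three lenses named in the abstract map naturally onto a small data model: forms and functions become two independent labels on a memory item, and the dynamics become the operations of a store. The sketch below is a hypothetical illustration of that reading for token-level memory only. The class names, method names, and the keyword-overlap retrieval are assumptions chosen for brevity, not an interface defined by the survey.

```python
from dataclasses import dataclass, field
from enum import Enum

class Form(Enum):
    TOKEN_LEVEL = "token-level"
    PARAMETRIC = "parametric"
    LATENT = "latent"

class Function(Enum):
    FACTUAL = "factual"
    EXPERIENTIAL = "experiential"
    WORKING = "working"

@dataclass
class MemoryItem:
    text: str
    function: Function
    form: Form = Form.TOKEN_LEVEL   # this sketch only stores token-level items
    revisions: int = 0

@dataclass
class MemoryStore:
    items: list = field(default_factory=list)

    def form_memory(self, text, function):
        """Formation: write a new item into the store."""
        item = MemoryItem(text, function)
        self.items.append(item)
        return item

    def evolve(self, item, new_text):
        """Evolution: overwrite an item's content and count the revision."""
        item.text = new_text
        item.revisions += 1

    def retrieve(self, query, function=None):
        """Retrieval: items sharing a word with the query, optionally by function."""
        words = set(query.lower().split())
        hits = []
        for item in self.items:
            if function is not None and item.function is not function:
                continue
            if words & set(item.text.lower().split()):
                hits.append(item)
        return hits

store = MemoryStore()
pref = store.form_memory("user prefers dark mode", Function.FACTUAL)
store.form_memory("retrying the build fixed the error", Function.EXPERIENTIAL)
store.evolve(pref, "user prefers light mode")
hits = store.retrieve("light mode")
```

Parametric and latent memory would not fit this text-based store, since they live in model weights or hidden states, which is one reason the survey treats form as a separate axis from function.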

中文标题/摘要

标题：人工智能代理时代的记忆

记忆已成为基于基础模型的代理的核心能力，并将持续成为其核心能力。随着对代理记忆的研究迅速扩展并吸引前所未有的关注，该领域也变得越来越碎片化。现有涵盖代理记忆的研究工作在动机、实现和评估协议方面往往存在显著差异，而模糊定义的记忆术语的泛滥进一步模糊了概念的清晰度。传统的长/短期记忆分类不足以捕捉当今代理记忆系统的多样性。本文旨在提供当前代理记忆研究的最新概览。我们首先明确界定代理记忆的范围，并将其与相关概念如LLM记忆、检索增强生成（RAG）和上下文工程区分开来。然后，我们通过形式、功能和动态的统一视角来审视代理记忆。从形式的角度来看，我们识别出三种主要的代理记忆实现，即标记级、参数级和潜在记忆。从功能的角度来看，我们提出了一种更精细的分类，区分事实记忆、经验记忆和工作记忆。从动态的角度来看，我们分析了记忆如何随时间形成、演变和检索。为了支持实际开发，我们汇总了全面的记忆基准和开源框架的总结。除了整合，我们还阐述了新兴研究前沿的前瞻性视角，包括记忆自动化、强化学习集成、多模态记忆、多代理记忆和可信性问题。我们希望这份综述不仅作为现有工作的参考，还能作为重新思考记忆作为未来代理智能设计中的一级原语的概念基础。

Summary / 总结

This paper aims to provide a comprehensive overview of agent memory research, addressing the fragmentation and lack of clarity in the field. It delineates the scope of agent memory and differentiates it from related concepts like LLM memory, RAG, and context engineering. The study examines agent memory through the lenses of forms, functions, and dynamics, identifying three types of memory (token-level, parametric, and latent) and proposing a taxonomy for memory functions (factual, experiential, and working). It also analyzes memory dynamics and compiles benchmarks and frameworks. The paper highlights emerging research frontiers such as memory automation, reinforcement learning integration, and trustworthiness issues, aiming to serve as a conceptual foundation for future agentic intelligence design.

本文旨在提供对代理记忆研究的最新概述，解决该领域存在的碎片化和概念模糊问题。它界定了代理记忆的范围，并将其与LLM记忆和上下文工程等概念区分开来。研究通过形式、功能和动态三个视角来审视代理记忆，识别出三种类型的记忆（标记级、参数化和潜在记忆），并提出了一种记忆功能的细分分类（事实记忆、经验记忆和工作记忆）。研究还分析了记忆随时间的形成、演变和检索过程。文章汇总了基准测试和开源框架以支持实际开发，并讨论了未来研究方向，如记忆自动化和可信性问题。该研究旨在不仅作为现有工作的参考，也是重新思考代理智能设计中记忆作为一级原语的概念基础。