arXiv 论文速递

Snapshot: 20260319_0358

WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation

Authors: Jisu Nam, Yicong Hong, Chun-Hao Paul Huang, Feng Liu, JoungBin Lee, Jiyoung Kim, Siyoon Jin, Yunsung Lee, Jaeyoon Jung, Suhwan Choi, Seungryong Kim, Yang Zhou

First: 2026-03-17T17:59:56+00:00 · Latest: 2026-03-17T17:59:56+00:00

Comments: Project page is available at https://cvlab-kaist.github.io/WorldCam/

Abstract

Recent advances in video diffusion transformers have enabled interactive gaming world models that allow users to explore generated environments over extended horizons. However, existing approaches struggle with precise action control and long-horizon 3D consistency. Most prior works treat user actions as abstract conditioning signals, overlooking the fundamental geometric coupling between actions and the 3D world, whereby actions induce relative camera motions that accumulate into a global camera pose within a 3D world. In this paper, we establish camera pose as a unifying geometric representation to jointly ground immediate action control and long-term 3D consistency. First, we define a physics-based continuous action space and represent user inputs in the Lie algebra to derive precise 6-DoF camera poses, which are injected into the generative model via a camera embedder to ensure accurate action alignment. Second, we use global camera poses as spatial indices to retrieve relevant past observations, enabling geometrically consistent revisiting of locations during long-horizon navigation. To support this research, we introduce a large-scale dataset comprising 3,000 minutes of authentic human gameplay annotated with camera trajectories and textual descriptions. Extensive experiments show that our approach substantially outperforms state-of-the-art interactive gaming world models in action controllability, long-horizon visual quality, and 3D spatial consistency.

中文标题/摘要

标题：WorldCam：以相机姿态为统一几何表示的交互式自回归3D游戏世界

近期视频扩散变换器的发展使交互式游戏世界模型得以实现，用户可以探索生成的环境。然而，现有方法在精确动作控制和长时3D一致性方面存在困难。大多数先前的工作将用户动作视为抽象的条件信号，忽视了动作与3D世界的几何耦合，即动作会引发相对相机运动，这些运动会累积成3D世界中的全局相机姿态。在本文中，我们确立了相机姿态作为统一的几何表示，以联合地实现即时动作控制和长期3D一致性。首先，我们定义了一个基于物理的连续动作空间，并将用户输入表示为李代数，以推导出精确的6自由度相机姿态，这些姿态通过相机嵌入器注入生成模型，以确保准确的动作对齐。其次，我们使用全局相机姿态作为空间索引来检索相关的历史观察结果，从而在长时导航过程中实现几何一致性地重新访问位置。为了支持这项研究，我们引入了一个包含3000分钟真实人类游戏数据的大规模数据集，这些数据被标注了相机轨迹和文本描述。广泛的实验表明，我们的方法在动作可控性、长时视觉质量和3D空间一致性方面显著优于最先进的交互式游戏世界模型。

Summary / 总结

This paper introduces WorldCam, which uses camera pose as a unifying geometric representation to improve action control and long-term 3D consistency in interactive gaming worlds. It defines a physics-based continuous action space and uses Lie algebra to derive precise 6-DoF camera poses, which are then injected into the generative model. Additionally, it uses global camera poses as spatial indices for consistent revisiting during long-horizon navigation. Experiments demonstrate that WorldCam outperforms existing models in action controllability, long-horizon visual quality, and 3D spatial consistency.

本文通过引入WorldCam，使用相机姿态作为统一的几何表示来解决现有交互式游戏世界模型的局限性。该方法定义了一个基于物理的连续动作空间，并使用李代数来推导精确的6-DoF相机姿态，然后将其注入生成模型。这种方法确保了动作对齐的准确性并在长时间导航中实现几何上一致的重新访问位置。实验表明，WorldCam在动作可控性、长时视觉质量和3D空间一致性方面优于最先进的模型。
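The abstract's two pose-related steps (mapping Lie-algebra inputs to 6-DoF poses, then using global poses as spatial indices) can be illustrated with a minimal sketch. It assumes each action is already expressed as a twist `(v, w)` in se(3), that relative motions are composed by right-multiplication, and that retrieval is a nearest-position lookup. The helper names (`se3_exp`, `accumulate`, `nearest_past_frame`) are hypothetical, and none of this is taken from the WorldCam code.

```python
import math

def mat_mul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def hat(w):
    """Skew-symmetric matrix of a 3-vector (an element of so(3))."""
    wx, wy, wz = w
    return [[0.0, -wz, wy], [wz, 0.0, -wx], [-wy, wx, 0.0]]

def se3_exp(v, w):
    """Exponential map from a twist (v, w) in se(3) to a 4x4 pose in SE(3)."""
    theta = math.sqrt(sum(c * c for c in w))
    K = hat(w)
    K2 = mat_mul(K, K)
    if theta < 1e-9:
        # Small-angle limits of the three coefficients.
        a, b, c = 1.0, 0.5, 1.0 / 6.0
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / theta ** 2
        c = (theta - math.sin(theta)) / theta ** 3
    eye = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    # Rodrigues formula for the rotation, and the matching V matrix for translation.
    R = [[eye[i][j] + a * K[i][j] + b * K2[i][j] for j in range(3)] for i in range(3)]
    V = [[eye[i][j] + b * K[i][j] + c * K2[i][j] for j in range(3)] for i in range(3)]
    t = [sum(V[i][j] * v[j] for j in range(3)) for i in range(3)]
    return [R[0] + [t[0]], R[1] + [t[1]], R[2] + [t[2]], [0.0, 0.0, 0.0, 1.0]]

def accumulate(twists):
    """Compose relative motions into global poses (assumed right-multiplication)."""
    pose = se3_exp((0.0, 0.0, 0.0), (0.0, 0.0, 0.0))  # identity pose
    poses = []
    for v, w in twists:
        pose = mat_mul(pose, se3_exp(v, w))
        poses.append(pose)
    return poses

def nearest_past_frame(poses, query):
    """Index of the stored pose whose position is closest to the query pose."""
    def dist2(p):
        return sum((p[i][3] - query[i][3]) ** 2 for i in range(3))
    return min(range(len(poses)), key=lambda i: dist2(poses[i]))
```

For example, four identical twists that each move forward while turning a quarter turn about the z-axis trace a closed circle, so the fourth global pose returns to the start and a position-based lookup would retrieve it when the camera revisits the origin.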

Demystifying Video Reasoning

Authors: Ruisi Wang, Zhongang Cai, Fanyi Pu, Junxiang Xu, Wanqi Yin, Maijunxian Wang, Ran Ji, Chenyang Gu, Bo Li, Ziqi Huang, Hokin Deng, Dahua Lin, Ziwei Liu, Lei Yang

Venue: www

First: 2026-03-17T17:59:55+00:00 · Latest: 2026-03-17T17:59:55+00:00

Comments: Homepage: https://www.wruisi.com/demystifying_video_reasoning

Abstract

Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and enhancement, allowing recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. During a diffusion step, we further uncover self-evolved functional specialization within Diffusion Transformers, where early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations. Motivated by these insights, we present a simple training-free strategy as a proof-of-concept, demonstrating how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models, offering a foundation to guide future research in better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.

中文标题/摘要

标题：揭秘视频推理

近期在视频生成方面的进展揭示了一个意想不到的现象：基于扩散的视频模型表现出非平凡的推理能力。先前的工作将这一现象归因于帧链（CoF）机制，推理被认为在视频帧之间顺序展开。在本项工作中，我们挑战了这一假设并揭示了一种根本不同的机制。我们表明，视频模型中的推理主要在去噪步骤中出现。通过定性分析和有针对性的探针实验，我们发现模型在早期去噪步骤中探索多个候选解决方案，并逐步收敛到最终答案，我们将其称为步骤链（CoS）。除了这一核心机制外，我们还识别出几种对模型性能至关重要的新兴推理行为：（1）工作记忆，使参考持久化；（2）自我纠正和增强，允许从错误的中间解决方案中恢复；（3）先感知后行动，早期步骤建立语义基础，后期步骤执行结构化操作。在去噪步骤中，我们进一步发现扩散变换器内部自我进化出的功能专业化，早期层编码密集的感知结构，中间层执行推理，后期层巩固潜在表示。受这些见解的启发，我们提出了一种简单的无训练策略作为概念验证，展示了如何通过从具有不同随机种子的相同模型中组合潜在轨迹来提高推理能力。总体而言，我们的工作为理解视频生成模型中推理的出现提供了系统性的理解，为未来更好地利用视频模型固有的推理动态作为智能新基质的研究奠定了基础。

Summary / 总结

This work challenges the assumption that reasoning in video models occurs sequentially across frames and instead identifies a Chain-of-Steps (CoS) mechanism where reasoning emerges primarily along diffusion denoising steps. Key findings include models exploring multiple solutions early and converging to a final answer, with emergent behaviors like working memory, self-correction, and perception before action. The study also reveals functional specialization within diffusion transformers, with early layers encoding perceptual structure, middle layers executing reasoning, and later layers consolidating representations. A training-free strategy ensembling latent trajectories from identical models with different random seeds is proposed to improve reasoning, demonstrating the potential to better exploit the inherent reasoning dynamics of video models.

这项工作挑战了视频模型中的推理过程是按帧顺序进行的假设，而是识别出了一种称为Chain-of-Steps (CoS)的机制，其中推理主要在去噪步骤中出现。研究发现，模型在早期步骤中探索多个候选解决方案，并逐步收敛到最终答案。关键的推理行为包括工作记忆、自我纠正和感知先于行动。研究还揭示了扩散变换器内部的功能专业化，早期层编码感知结构，中间层执行推理，后期层整合潜在表示。提出了一种简单的无训练策略，通过从具有不同随机种子的相同模型中ensemble潜在轨迹来提高推理能力。

SegviGen: Repurposing 3D Generative Model for Part Segmentation

Authors: Lin Li, Haoran Feng, Zehuan Huang, Haohua Chen, Wenbo Nie, Shaohua Hou, Keqing Fan, Pan Hu, Sheng Wang, Buyu Li, Lu Sheng

First: 2026-03-17T17:59:51+00:00 · Latest: 2026-03-17T17:59:51+00:00

Comments: Project page: https://fenghora.github.io/SegviGen-Page/

Abstract

We introduce SegviGen, a framework that repurposes native 3D generative models for 3D part segmentation. Existing pipelines either lift strong 2D priors into 3D via distillation or multi-view mask aggregation, often suffering from cross-view inconsistency and blurred boundaries, or explore native 3D discriminative segmentation, which typically requires large-scale annotated 3D data and substantial training resources. In contrast, SegviGen leverages the structured priors encoded in a pretrained 3D generative model to induce segmentation through distinctive part colorization, establishing a novel and efficient framework for part segmentation. Specifically, SegviGen encodes a 3D asset and predicts part-indicative colors on active voxels of a geometry-aligned reconstruction. It supports interactive part segmentation, full segmentation, and full segmentation with 2D guidance in a unified framework. Extensive experiments show that SegviGen improves over the prior state of the art by 40% on interactive part segmentation and by 15% on full segmentation, while using only 0.32% of the labeled training data. It demonstrates that pretrained 3D generative priors transfer effectively to 3D part segmentation, enabling strong performance with limited supervision. See our project page at https://fenghora.github.io/SegviGen-Page/.

中文标题/摘要

标题：SegviGen：重新利用3D生成模型进行部件分割

我们介绍了SegviGen，这是一种框架，它重新利用了原生的3D生成模型来进行3D部件分割。现有的管道要么通过蒸馏或多视图掩码聚合将强大的2D先验提升到3D，经常遭受跨视图不一致和边界模糊的问题，要么探索原生的3D判别分割，这通常需要大量的标注3D数据和大量的训练资源。相比之下，SegviGen 利用预训练的3D生成模型中编码的结构先验来通过特征部件着色诱导分割，建立了一种新颖且高效的部件分割框架。具体来说，SegviGen 编码了一个3D资产，并在几何对齐重建的活动体素上预测部件指示颜色。它支持交互式部件分割、完整分割和带有2D指导的完整分割。广泛的实验表明，SegviGen 在交互式部件分割上的性能比之前的最佳方法提高了40%，在完整分割上的性能提高了15%，同时仅使用了0.32%的标注训练数据。它证明了预训练的3D生成先验可以有效地转移到3D部件分割中，从而在有限监督下实现强大的性能。请参见我们的项目页面：https://fenghora.github.io/SegviGen-Page/

Summary / 总结

SegviGen is a framework that repurposes 3D generative models for part segmentation, improving over existing methods by 40% on interactive part segmentation and 15% on full segmentation. It uses pretrained 3D generative model priors to induce segmentation through part colorization, requiring only 0.32% of the labeled training data. This approach supports interactive, full, and guided full segmentation in a unified framework.

SegviGen 是一个框架，通过重新利用 3D 生成模型来进行 3D 部件分割，相比之前的方法，在交互式部件分割上提高了 40%，在完整分割上提高了 15%。它利用预训练的 3D 生成模型先验来着色部件，并仅需要 0.32% 的标注训练数据。这种方法避免了其他方法面临的跨视图不一致和边界模糊的问题，并支持在统一框架中完成各种分割任务。
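The abstract says segmentation is induced by predicting part-indicative colors on voxels, but it does not say how colors are turned back into discrete part labels. One plausible decoding step, shown here purely as an assumption, is to snap each predicted voxel color to the nearest entry of a fixed part palette. The function name and the palette are hypothetical.

```python
def colors_to_part_labels(voxel_colors, palette):
    """Assign each voxel the index of the nearest palette color (squared RGB distance).

    Hypothetical decoding step: the source does not describe how colors map to parts.
    """
    def dist2(c1, c2):
        return sum((a - b) ** 2 for a, b in zip(c1, c2))

    return [min(range(len(palette)), key=lambda k: dist2(color, palette[k]))
            for color in voxel_colors]
```

With a palette of pure red and pure green, slightly noisy predictions such as `(250, 10, 5)` and `(3, 240, 12)` would decode to parts 0 and 1 respectively.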

Efficient Reasoning on the Edge

Authors: Yelysei Bondarenko, Thomas Hehn, Rob Hesselink, Romain Lepert, Fabio Valerio Massoli, Evgeny Mironov, Leyla Mirvakhabova, Tribhuvanesh Orekondy, Spyridon Stasis, Andrey Kuzmin, Anna Kuzina, Markus Nagel, Ankita Nayak, Corrado Rainone, Ork de Rooij, Paul N Whatmough, Arash Behboodi, Babak Ehteshami Bejnordi

First: 2026-03-17T17:59:51+00:00 · Latest: 2026-03-17T17:59:51+00:00

Comments: Project page: https://qualcomm-ai-research.github.io/llm-reasoning-on-edge/

Abstract

Large language models (LLMs) with chain-of-thought reasoning achieve state-of-the-art performance across complex problem-solving tasks, but their verbose reasoning traces and large context requirements make them impractical for edge deployment. These challenges include high token generation costs, large KV-cache footprints, and inefficiencies when distilling reasoning capabilities into smaller models for mobile devices. Existing approaches often rely on distilling reasoning traces from larger models into smaller models, which are verbose and stylistically redundant, undesirable for on-device inference. In this work, we propose a lightweight approach to enable reasoning in small LLMs using LoRA adapters combined with supervised fine-tuning. We further introduce budget forcing via reinforcement learning on these adapters, significantly reducing response length with minimal accuracy loss. To address memory-bound decoding, we exploit parallel test-time scaling, improving accuracy at minor latency increase. Finally, we present a dynamic adapter-switching mechanism that activates reasoning only when needed and a KV-cache sharing strategy during prompt encoding, reducing time-to-first-token for on-device inference. Experiments on Qwen2.5-7B demonstrate that our method achieves efficient, accurate reasoning under strict resource constraints, making LLM reasoning practical for mobile scenarios. Videos demonstrating our solution running on mobile devices are available on our project page.

中文标题/摘要

标题：边缘端高效推理

大型语言模型（LLMs）结合链式推理在复杂问题解决任务中达到最先进的性能，但其冗长的推理过程和大量上下文需求使其不适合边缘部署。这些挑战包括高令牌生成成本、大的KV缓存占用空间以及在将推理能力精简到移动设备的小型模型时的效率低下。现有方法通常依赖于从大型模型中提取冗长且风格冗余的推理痕迹，这不适合设备端推理。在本工作中，我们提出了一种轻量级方法，通过LoRA适配器结合监督微调来使小型LLM能够进行推理。我们进一步引入了通过强化学习对这些适配器进行预算控制，显著减少了响应长度，同时保持了最小的准确率损失。为了解决内存限制下的解码问题，我们利用并行测试时缩放，提高了准确性，同时增加了轻微的延迟。最后，我们提出了一种动态适配器切换机制，仅在需要时激活推理，并在提示编码期间采用KV缓存共享策略，减少了设备端推理的时间。在Qwen2.5-7B上的实验表明，我们的方法在严格的资源限制下实现了高效且准确的推理，使LLM推理在移动场景中成为可能。有关我们解决方案在移动设备上运行的视频可在项目页面上找到。

Summary / 总结

This work addresses the impracticality of deploying large language models (LLMs) with chain-of-thought reasoning on edge devices by proposing a lightweight approach. The method uses LoRA adapters combined with supervised fine-tuning and introduces budget forcing via reinforcement learning to reduce response length with minimal accuracy loss. Parallel test-time scaling improves accuracy at a minor latency cost, while a dynamic adapter-switching mechanism and KV-cache sharing during prompt encoding reduce time-to-first-token. Experiments on Qwen2.5-7B show that the method achieves efficient and accurate reasoning under strict resource constraints, making LLM reasoning practical for mobile scenarios.

该研究解决了在边缘设备上部署具有链式推理能力的大语言模型因高令牌生成成本和大内存占用而不可行的问题。作者提出了一种轻量级方法，结合了LoRA适配器和监督微调，并通过强化学习进行预算控制，以减少响应长度同时保持准确性。他们还引入了并行测试时缩放和动态适配器切换机制来提高效率。实验表明，该方法能够在严格资源限制下实现移动设备上的高效且准确的推理。
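To make the adapter-switching idea concrete, here is a minimal sketch of a linear layer with a LoRA update that can be toggled at inference time. It uses the standard LoRA formulation (base weight plus a scaled low-rank product `B·A`); the class name, the `adapter_enabled` flag, and the toy dimensions are illustrative assumptions, and the paper's actual switching criterion is not described in the abstract.

```python
def matvec(matrix, x):
    """Multiply a nested-list matrix by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]

class LoRALinear:
    """Linear layer y = W x, optionally plus (alpha / r) * B (A x)."""

    def __init__(self, weight, lora_a, lora_b, alpha):
        self.weight = weight          # shape: d_out x d_in (frozen base weight)
        self.lora_a = lora_a          # shape: r x d_in
        self.lora_b = lora_b          # shape: d_out x r
        self.scale = alpha / len(lora_a)
        self.adapter_enabled = False  # hypothetical switch: reasoning off by default

    def forward(self, x):
        y = matvec(self.weight, x)
        if self.adapter_enabled:
            # Low-rank path: project down to rank r, then back up to d_out.
            delta = matvec(self.lora_b, matvec(self.lora_a, x))
            y = [yi + self.scale * di for yi, di in zip(y, delta)]
        return y
```

Because the base weight is untouched, flipping `adapter_enabled` switches between the base model's behavior and the adapted behavior without reloading weights, which is the property a per-request switching mechanism would rely on.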

MessyKitchens: Contact-rich object-level 3D scene reconstruction

Authors: Junaid Ahmed Ansari, Ran Ding, Fabio Pizzati, Ivan Laptev

First: 2026-03-17T17:59:51+00:00 · Latest: 2026-03-17T17:59:51+00:00

Abstract

Monocular 3D scene reconstruction has recently seen significant progress. Powered by modern neural architectures and large-scale data, recent methods achieve high performance in depth estimation from a single image. Meanwhile, reconstructing and decomposing common scenes into individual 3D objects remains a hard challenge due to the large variety of objects, frequent occlusions and complex object relations. Notably, beyond shape and pose estimation of individual objects, applications in robotics and animation require physically-plausible scene reconstruction where objects obey physical principles of non-penetration and realistic contacts. In this work we advance object-level scene reconstruction along two directions. First, we introduce MessyKitchens, a new dataset with real-world scenes featuring cluttered environments and providing high-fidelity object-level ground truth in terms of 3D object shapes, poses and accurate object contacts. Second, we build on the recent SAM 3D approach for single-object reconstruction and extend it with a Multi-Object Decoder (MOD) for joint object-level scene reconstruction. To validate our contributions, we demonstrate that MessyKitchens significantly improves on previous datasets in registration accuracy and inter-object penetration. We also compare our multi-object reconstruction approach on three datasets and demonstrate consistent and significant improvements of MOD over the state of the art. Our new benchmark, code and pre-trained models will become publicly available on our project website: https://messykitchens.github.io/.

中文标题/摘要

标题：MessyKitchens: 富含接触信息的对象级3D场景重建

单目3D场景重建近年来取得了显著进展。借助现代神经架构和大规模数据，最近的方法在单张图像的深度估计方面取得了高性能。然而，由于物体种类繁多、频繁遮挡和复杂的物体关系，将常见场景重建和分解为独立的3D物体仍然是一个难题。值得注意的是，除了对单个物体的形状和姿态估计外，机器人技术和动画应用还需要符合物理原理的场景重建，其中物体遵循不穿透和现实接触的原则。在本工作中，我们沿着两个方向推进了对象级场景重建。首先，我们引入了MessyKitchens，这是一个包含杂乱环境的真实场景数据集，并提供了高保真度的对象级地面真值，包括3D物体形状、姿态和准确的物体接触。其次，我们基于最近的SAM 3D单物体重建方法，并通过多物体解码器（MOD）将其扩展为联合对象级场景重建。为了验证我们的贡献，我们展示了MessyKitchens在配准精度和物体间穿透方面的显著改进，超过了之前的基准数据集。我们还在三个数据集上比较了我们的多物体重建方法，并展示了MOD相对于最新技术的一致和显著改进。我们的新基准、代码和预训练模型将在我们的项目网站上公开：https://messykitchens.github.io/

Summary / 总结

This work addresses the challenge of reconstructing and decomposing complex scenes into individual 3D objects, focusing on physically-plausible scene reconstruction. It introduces MessyKitchens, a new dataset with real-world cluttered scenes and detailed object-level ground truth, and develops a Multi-Object Decoder (MOD) that extends SAM 3D to joint multi-object reconstruction. MessyKitchens improves on previous datasets in registration accuracy and inter-object penetration, and MOD shows consistent and significant improvements over the state of the art on three datasets.

该研究旨在重建和分解复杂场景中的单个3D物体，并实现物理上合理的场景重建。引入了MessyKitchens数据集，包含真实世界的杂乱场景及其详细的物体形状、姿态和接触信息。作者还提出了一种Multi-Object Decoder (MOD) 来扩展SAM 3D方法，实现联合物体级场景重建，并在多个数据集上展示了在注册精度和物体间穿透性方面的显著改进，超过先前的方法。

ManiTwin: Scaling Data-Generation-Ready Digital Object Dataset to 100K

Authors: Kaixuan Wang, Tianxing Chen, Jiawei Liu, Honghao Su, Shaolong Zhu, Minxuan Wang, Zixuan Li, Yue Chen, Huan-ang Gao, Yusen Qin, Jiawei Wang, Qixuan Zhang, Lan Xu, Jingyi Yu, Yao Mu, Ping Luo

First: 2026-03-17T17:59:49+00:00 · Latest: 2026-03-17T17:59:49+00:00

Comments: Website: https://manitwin.github.io/

Abstract

Learning in simulation provides a useful foundation for scaling robotic manipulation capabilities. However, this paradigm often suffers from a lack of data-generation-ready digital assets, in both scale and diversity. In this work, we present ManiTwin, an automated and efficient pipeline for generating data-generation-ready digital object twins. Our pipeline transforms a single image into a simulation-ready and semantically annotated 3D asset, enabling large-scale robotic manipulation data generation. Using this pipeline, we construct ManiTwin-100K, a dataset containing 100K high-quality annotated 3D assets. Each asset is equipped with physical properties, language descriptions, functional annotations, and verified manipulation proposals. Experiments demonstrate that ManiTwin provides an efficient asset synthesis and annotation workflow, and that ManiTwin-100K offers high-quality and diverse assets for manipulation data generation, random scene synthesis, and VQA data generation, establishing a strong foundation for scalable simulation data synthesis and policy learning. Our webpage is available at https://manitwin.github.io/.

中文标题/摘要

标题：ManiTwin：将数据生成就绪的数字对象数据集扩展至10万

在模拟中学习为扩展机器人操作能力提供了有用的基石。然而，这种范式往往因缺乏规模和多样性的数据生成就绪的数字资产而受到限制。在本工作中，我们提出了ManiTwin，这是一种自动且高效的生成数据生成就绪的数字对象孪生的流水线。我们的流水线将单张图像转换为模拟就绪且语义标注的3D资产，从而实现大规模机器人操作数据的生成。使用此流水线，我们构建了包含10万高质量标注3D资产的ManiTwin-100K数据集。每个资产都配备了物理属性、语言描述、功能注释和验证的操作提案。实验表明，ManiTwin 提供了一种高效的资产合成和标注工作流程，而ManiTwin-100K提供了高质量和多样化的资产，适用于操作数据生成、随机场景合成和VQA数据生成，为可扩展的模拟数据合成和策略学习奠定了坚实的基础。我们的网页可在https://manitwin.github.io/访问。

Summary / 总结

ManiTwin is an automated pipeline that transforms single images into simulation-ready 3D assets with semantic annotations, enabling large-scale robotic manipulation data generation. Using this pipeline, the authors created ManiTwin-100K, a dataset of 100,000 high-quality, annotated 3D assets with physical properties, language descriptions, and manipulation proposals. Experiments show that ManiTwin provides an efficient asset synthesis and annotation workflow, and ManiTwin-100K offers diverse and high-quality assets for manipulation data generation, random scene synthesis, and VQA data generation, supporting scalable simulation data synthesis and policy learning.

研究旨在通过开发ManiTwin自动化流水线，解决机器人操作仿真中缺乏可直接用于数据生成的数字资产的问题，该流水线将单张图像转换为带有语义标注、可直接用于仿真的3D资产。基于该流水线构建的ManiTwin-100K包含10万个高质量3D资产，每个资产都配备了物理属性、语言描述、功能标注和经过验证的操作提案。实验表明，ManiTwin提供了高效的资产合成和标注流程，而ManiTwin-100K提供了高质量且多样化的资产，可用于操作数据生成、随机场景合成和VQA数据生成，支持可扩展的仿真数据合成和策略学习。

SparkVSR: Interactive Video Super-Resolution via Sparse Keyframe Propagation

Authors: Jiongze Yu, Xiangbo Gao, Pooja Verlani, Akshay Gadde, Yilin Wang, Balu Adsumilli, Zhengzhong Tu

First: 2026-03-17T17:59:30+00:00 · Latest: 2026-03-17T17:59:30+00:00

Abstract

Video Super-Resolution (VSR) aims to restore high-quality video frames from low-resolution (LR) estimates, yet most existing VSR approaches behave like black boxes at inference time: users cannot reliably correct unexpected artifacts, but instead can only accept whatever the model produces. In this paper, we propose a novel interactive VSR framework dubbed SparkVSR that makes sparse keyframes a simple and expressive control signal. Specifically, users can first super-resolve or optionally a small set of keyframes using any off-the-shelf image super-resolution (ISR) model, then SparkVSR propagates the keyframe priors to the entire video sequence while remaining grounded by the original LR video motion. Concretely, we introduce a keyframe-conditioned latent-pixel two-stage training pipeline that fuses LR video latents with sparsely encoded HR keyframe latents to learn robust cross-space propagation and refine perceptual details. At inference time, SparkVSR supports flexible keyframe selection (manual specification, codec I-frame extraction, or random sampling) and a reference-free guidance mechanism that continuously balances keyframe adherence and blind restoration, ensuring robust performance even when reference keyframes are absent or imperfect. Experiments on multiple VSR benchmarks demonstrate improved temporal consistency and strong restoration quality, surpassing baselines by up to 24.6%, 21.8%, and 5.6% on CLIP-IQA, DOVER, and MUSIQ, respectively, enabling controllable, keyframe-driven video super-resolution. Moreover, we demonstrate that SparkVSR is a generic interactive, keyframe-conditioned video processing framework as it can be applied out of the box to unseen tasks such as old-film restoration and video style transfer. Our project page is available at: https://sparkvsr.github.io/

中文标题/摘要

标题：SparkVSR：基于稀疏关键帧传播的交互式视频超分辨率

视频超分辨率（VSR）旨在从低分辨率（LR）估计中恢复高质量的视频帧，但大多数现有的VSR方法在推理时像黑箱一样运作：用户无法可靠地纠正意外的伪影，而只能接受模型产生的结果。在本文中，我们提出了一种新颖的交互式VSR框架，称为SparkVSR，该框架使稀疏关键帧成为一种简单且富有表现力的控制信号。具体而言，用户可以首先使用任何现成的图像超分辨率（ISR）模型对视频中的关键帧进行超分辨率处理或选择一小组关键帧，然后SparkVSR将关键帧先验传播到整个视频序列，同时保持与原始LR视频运动的联系。具体来说，我们引入了一种关键帧条件下的两阶段训练管道，该管道将LR视频的潜在特征与稀疏编码的HR关键帧的潜在特征融合，以学习稳健的跨空间传播并细化感知细节。在推理时，SparkVSR支持灵活的关键帧选择（手动指定、编解码器I帧提取或随机采样）和一种无需参考的指导机制，该机制不断平衡关键帧的忠实度和盲恢复，即使在缺乏或不完美的参考关键帧时也能确保稳健的性能。在多个VSR基准上的实验表明，SparkVSR具有更好的时间一致性并具有强大的恢复质量，分别在CLIP-IQA、DOVER和MUSIQ上超越基线24.6%、21.8%和5.6%，实现了可控的关键帧驱动视频超分辨率。此外，我们展示了SparkVSR作为一种通用的交互式、关键帧条件下的视频处理框架，可以开箱即用地应用于诸如老电影修复和视频风格迁移等未见任务。我们的项目页面可在：https://sparkvsr.github.io/上获取

Summary / 总结

The research aims to address the lack of user control in existing VSR models by proposing SparkVSR, an interactive framework that uses sparse keyframes as control signals. It involves training a model that fuses low-resolution video latents with high-resolution keyframe latents to propagate keyframe priors across the entire video sequence. Experiments show that SparkVSR improves temporal consistency and restoration quality, outperforming baselines on multiple benchmarks by up to 24.6%, 21.8%, and 5.6% on CLIP-IQA, DOVER, and MUSIQ, respectively, and supports various keyframe selection methods and reference-free guidance for robust performance. Additionally, SparkVSR can be applied to other video processing tasks such as old-film restoration and style transfer.

SparkVSR 是一种交互式视频超分辨率框架，允许用户通过选择稀疏关键帧来控制恢复过程。它使用关键帧条件下的潜像素两阶段训练管道将高分辨率关键帧传播到整个视频序列，同时保持时间一致性。实验表明，SparkVSR 在多个基准测试中优于现有方法，分别在 CLIP-IQA、DOVER 和 MUSIQ 上实现了高达 24.6%、21.8% 和 5.6% 的改进，并支持多种关键帧选择方法和无参考指导以实现稳健性能。此外，SparkVSR 还可以应用于其他视频处理任务，如老电影修复和风格迁移。

SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models

Authors: Tianyu Xie, Jinfa Huang, Yuexiao Ma, Rongfang Luo, Yan Yang, Wang Chen, Yuhui Zeng, Ruize Fang, Yixuan Zou, Xiawu Zheng, Jiebo Luo, Rongrong Ji

First: 2026-03-17T17:58:44+00:00 · Latest: 2026-03-17T17:58:44+00:00

Comments: Code is available at https://github.com/MAC-AutoML/SocialOmni and dataset is available at https://huggingface.co/datasets/alexisty/SocialOmni

Abstract

Omni-modal large language models (OLMs) redefine human-machine interaction by natively integrating audio, vision, and text. However, existing OLM benchmarks remain anchored to static, accuracy-centric tasks, leaving a critical gap in assessing social interactivity, the fundamental capacity to navigate dynamic cues in natural dialogues. To this end, we propose SocialOmni, a comprehensive benchmark that operationalizes the evaluation of this conversational interactivity across three core dimensions: (i) speaker separation and identification (who is speaking), (ii) interruption timing control (when to interject), and (iii) natural interruption generation (how to phrase the interruption). SocialOmni features 2,000 perception samples and a quality-controlled diagnostic set of 209 interaction-generation instances with strict temporal and contextual constraints, complemented by controlled audio-visual inconsistency scenarios to test model robustness. We benchmark 12 leading OLMs, uncovering significant variance in social-interaction capabilities across models. Furthermore, our analysis reveals a pronounced decoupling between a model's perceptual accuracy and its ability to generate contextually appropriate interruptions, indicating that understanding-centric metrics alone are insufficient to characterize conversational social competence. More encouragingly, these diagnostics from SocialOmni yield actionable signals for bridging the perception-interaction divide in future OLMs.

中文标题/摘要

标题：SocialOmni：全模态模型中视听社交互动能力的基准评测

全方位多模态大型语言模型（OLMs）通过原生整合音频、视觉和文本重新定义了人机交互。然而，现有的OLM基准仍然局限于静态、准确性导向的任务，忽略了评估社交互动能力，这是导航自然对话中动态线索的基本能力。为此，我们提出了SocialOmni，一个全面的基准，用于在三个核心维度上操作化评估这种对话互动：（i）说话者分离和识别（谁在说话），（ii）打断时机控制（何时插话），以及（iii）自然打断生成（如何表达打断）。SocialOmni 包含2000个感知样本和一个包含209个交互生成实例的质量控制诊断集，这些实例具有严格的时序和语境约束，同时还包括受控的音频-视觉不一致性场景以测试模型的鲁棒性。我们对12个领先的OLM进行了基准测试，揭示了它们在社交互动能力上的显著差异。此外，我们的分析表明，模型的感知准确性与其生成上下文相关打断的能力之间存在明显的脱节，这表明仅依赖理解导向的指标不足以表征对话社交能力。更令人鼓舞的是，SocialOmni的这些诊断为未来OLM中的感知-互动鸿沟提供了可操作的信号。

Summary / 总结

SocialOmni is a benchmark designed to evaluate the social interactivity of omni-modal large language models (OLMs) across three dimensions: speaker separation and identification, interruption timing control, and natural interruption generation. It includes 2,000 perception samples and 209 interaction-generation instances with strict temporal and contextual constraints, as well as audio-visual inconsistency scenarios. Benchmarking 12 leading OLMs revealed significant differences in their social-interaction capabilities, highlighting the need for understanding-centric metrics to be complemented by interaction-centric ones. The diagnostics from SocialOmni provide actionable insights for improving future OLMs.

SocialOmni 对12个领先的全模态大语言模型（OLMs）的社交互动能力进行了基准测试，重点关注说话人分离与识别、打断时机控制和自然打断生成。基准包括2,000个感知样本和209个具有严格时序与语境约束的交互生成实例，揭示了各OLM在社交互动能力上的显著差异，并表明仅靠以理解为中心的指标不足以刻画对话社交能力，还需要以交互为中心的指标加以补充。这项工作为改进OLMs的对话社交能力提供了可操作的见解。

Long-Horizon Traffic Forecasting via Incident-Aware Conformal Spatio-Temporal Transformers

Authors: Mayur Patil, Qadeer Ahmed, Shawn Midlam-Mohler, Stephanie Marik, Allen Sheldon, Rajeev Chhajer, Nithin Santhanam

First: 2026-03-17T17:58:01+00:00 · Latest: 2026-03-17T17:58:01+00:00

Abstract

Reliable multi-horizon traffic forecasting is challenging because network conditions are stochastic, incident disruptions are intermittent, and effective spatial dependencies vary across time-of-day patterns. This study is conducted on the Ohio Department of Transportation (ODOT) traffic count data and corresponding ODOT crash records. This work utilizes a Spatio-Temporal Transformer (STT) model with Adaptive Conformal Prediction (ACP) to produce multi-horizon forecasts with calibrated uncertainty. We propose a piecewise Coefficient of Variation (CV) strategy that models hour-to-hour travel-time variability using a log-normal distribution, enabling the construction of a per-hour dynamic adjacency matrix. We further perturb edge weights using incident-related severity signals derived from the ODOT crash dataset that comprises incident clearance time, weather conditions, speed violations, work zones, and roadway functional class, to capture localized disruptions and peak/off-peak transitions. This dynamic graph construction replaces a fixed-CV assumption and better represents changing traffic conditions within the forecast window. For validation, we generate extended trips via multi-hour loop runs on the Columbus, Ohio, network in SUMO simulations and apply a Monte Carlo simulation to obtain travel-time distributions for a Vehicle Under Test (VUT). Experiments demonstrate improved long-horizon accuracy and well-calibrated prediction intervals compared to other baseline methods.

中文标题/摘要

标题：基于事件感知共形时空Transformer的长时域交通预测

可靠的多时域交通预测具有挑战性，因为网络条件具有随机性，事件干扰是间歇性的，且有效的空间依赖性随不同时段的变化而变化。本研究基于俄亥俄州运输部(ODOT)的交通计数数据和相应的ODOT事故记录进行。本研究利用结合自适应共形预测(ACP)的时空Transformer (STT)模型，生成具有校准不确定性的多时域预测。我们提出了一种分段变异系数(CV)策略，使用对数正态分布建模逐小时的旅行时间变异，从而构建每小时动态邻接矩阵。进一步使用来自ODOT事故数据集的与事件相关的严重性信号（包括事故清除时间、天气状况、超速、施工区和道路功能类别）扰动边权重，以捕捉局部干扰和高峰/非高峰时段的过渡。这种动态图构建替代了固定CV假设，更好地代表了预测窗口内的交通条件变化。为了验证，我们通过SUMO模拟中在俄亥俄州哥伦布市路网上的多小时环路运行生成扩展行程，并应用蒙特卡洛模拟获得测试车辆(VUT)的旅行时间分布。实验结果表明，与其他基线方法相比，该方法在长时域预测准确性上有所提高，并且预测区间具有良好的校准性。

Summary / 总结

This study addresses long-horizon traffic forecasting with a Spatio-Temporal Transformer (STT) model that uses Adaptive Conformal Prediction (ACP) to produce multi-horizon forecasts with calibrated uncertainty. A piecewise Coefficient of Variation (CV) strategy models hour-to-hour travel-time variability with a log-normal distribution to build a per-hour dynamic adjacency matrix, whose edge weights are perturbed by incident-related severity signals derived from ODOT crash records. Experiments show improved long-horizon accuracy and well-calibrated prediction intervals compared to other baseline methods.

本研究通过使用Spatio-Temporal Transformer (STT)模型结合Adaptive Conformal Prediction (ACP)，应对网络条件的随机性和间歇性事件带来的挑战。引入了分段变异系数（CV）策略来建模旅行时间的变异性，并使用与事件相关的严重性信号构建动态图，以捕捉局部干扰和高峰/非高峰转换。实验表明，与基线方法相比，该方法在长时段内的准确性和预测区间校准度均有所提高。
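Two ingredients named in the abstract can be sketched under explicit assumptions. First, a log-normal travel-time distribution is fully determined by its mean and coefficient of variation via the standard identities sigma^2 = ln(1 + CV^2) and mu = ln(mean) - sigma^2 / 2. Second, a common adaptive conformal update (adaptive conformal inference) nudges the working miscoverage level after each observation. Whether the paper's ACP uses exactly this update is an assumption, and the function names and the step size `gamma` are illustrative.

```python
import math

def lognormal_params(mean, cv):
    """Log-normal (mu, sigma) matching a given mean and coefficient of variation."""
    sigma2 = math.log(1.0 + cv * cv)
    mu = math.log(mean) - 0.5 * sigma2
    return mu, math.sqrt(sigma2)

def aci_update(alpha_t, target_alpha, missed, gamma=0.01):
    """One adaptive-conformal step: alpha_{t+1} = alpha_t + gamma * (alpha - err_t).

    `missed` is True when the last observation fell outside the prediction
    interval; a miss lowers alpha_t, which widens the next interval.
    """
    err = 1.0 if missed else 0.0
    return alpha_t + gamma * (target_alpha - err)
```

A piecewise scheme in the spirit of the abstract would call `lognormal_params` once per hour with that hour's mean travel time and CV, instead of reusing a single fixed CV for the whole forecast window.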

BAWSeg: A UAV Multispectral Benchmark for Barley Weed Segmentation

Authors: Haitian Wang, Xinyu Wang, Muhammad Ibrahim, Dustin Severtson, Ajmal Mian

First: 2026-03-02T14:49:05+00:00 · Latest: 2026-03-17T17:55:38+00:00

Comments: This article has been published in Remote Sensing as part of the Special Issue Intelligent UAV Remote Sensing for Next-Generation Precision Agriculture

Abstract

Accurate weed mapping in cereal fields requires pixel-level segmentation from UAV imagery that remains reliable across fields, seasons, and illumination. Existing multispectral pipelines often depend on thresholded vegetation indices, which are brittle under radiometric drift and mixed crop-weed pixels, or on single-stream CNN and Transformer backbones that ingest stacked bands and indices, where radiance cues and normalized index cues interfere and reduce sensitivity to small weed clusters embedded in crop canopy. We propose VISA, a two-stream segmentation network that decouples these cues and fuses them at native resolution. The radiance stream learns from calibrated five-band reflectance using local residual convolutions, channel recalibration, spatial gating, and skip-connected decoding, which preserve fine textures, row boundaries, and small weed structures that are often weakened after ratio-based index compression. The index stream operates on vegetation-index maps with windowed self-attention to model local structure efficiently, state-space layers to propagate field-scale context without quadratic attention cost, and Slot Attention to form stable region descriptors that improve discrimination of sparse weeds under canopy mixing. To support supervised training and deployment-oriented evaluation, we introduce BAWSeg, a four-year UAV multispectral dataset collected over commercial barley paddocks in Western Australia, providing radiometrically calibrated blue, green, red, red edge, and near-infrared orthomosaics, derived vegetation indices, and dense crop, weed, and other labels with leakage-free block splits. On BAWSeg, VISA achieves 75.6% mIoU and 63.5% weed IoU with 22.8 M parameters, outperforming a multispectral SegFormer-B1 baseline by 1.2 mIoU and 1.9 weed IoU. Under cross-plot and cross-year protocols, VISA maintains 71.2% and 69.2% mIoU, respectively.

中文标题/摘要

标题：BAWSeg：一种用于大麦杂草分割的UAV多光谱基准

在谷物田地中的准确杂草制图需要可靠的UAV图像像素级分割，且能在不同田地、季节和光照条件下保持一致性。现有的多光谱管道通常依赖于阈值化的植被指数，这些指数在辐射漂移和混合作物-杂草像素下容易失效，或者依赖于单流CNN和Transformer骨干网络，这些网络摄入堆叠的波段和指数，其中辐射线索和归一化指数线索相互干扰，降低了对嵌入作物冠层中的小杂草簇的敏感性。我们提出了一种VISA，这是一种两流分割网络，将这些线索解耦并在原生分辨率下融合。辐射流从校准的五波段反射率中学习，使用局部残差卷积、通道重新校准、空间门控和跳连解码，以保留细纹理、行边界和在比率指数压缩后通常会减弱的小杂草结构。指数流在植被指数图上操作，使用窗口自注意力高效建模局部结构，使用状态空间层传播田地规模上下文而不增加二次注意力成本，以及Slot注意力形成稳定的区域描述符，以提高在冠层混合下稀疏杂草的区分能力。为了支持监督训练和部署导向的评估，我们引入了BAWSeg，这是一个在西澳大利亚商业大麦田地上收集的四年UAV多光谱数据集，提供了辐射校准的蓝、绿、红、红边和近红外正射影像，衍生植被指数以及无泄漏块分割的密集作物、杂草和其他标签。在BAWSeg上，VISA实现了75.6%的mIoU和63.5%的杂草IoU，参数量为22.8M，优于多光谱SegFormer-B1基线1.2 mIoU和1.9杂草IoU。在跨图和跨年协议下，VISA分别保持了71.2%和69.2%的mIoU。

Summary / 总结

The research aims to develop reliable pixel-level weed segmentation for barley fields from UAV multispectral imagery. VISA, a two-stream segmentation network, decouples radiance and vegetation-index cues and fuses them at native resolution, improving sensitivity to small weed clusters. The radiance stream uses local residual convolutions, channel recalibration, and spatial gating to preserve fine textures, while the index stream combines windowed self-attention for local structure, state-space layers for field-scale context, and Slot Attention for stable region descriptors. The accompanying BAWSeg dataset, collected over four years, supports supervised training and evaluation. On BAWSeg, VISA achieves 75.6% mIoU and 63.5% weed IoU with 22.8 M parameters, outperforming a multispectral SegFormer-B1 baseline by 1.2 mIoU and 1.9 weed IoU.

研究旨在利用无人机多光谱成像技术，开发一种可靠的大麦田杂草像素级分割方法。提出的VISA网络分离辐射和指数线索，提高了对小杂草的敏感性。在BAWSeg数据集上，VISA实现了75.6%的平均IoU和63.5%的杂草IoU，优于基线SegFormer-B1的1.2 mIoU和1.9杂草IoU。VISA在跨地块和跨年协议下保持了71.2%和69.2%的平均IoU。
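The two kinds of quantities in this entry, normalized vegetation indices and IoU scores, can be made concrete with a short sketch. It assumes NDVI as a representative index (the abstract does not list which indices BAWSeg provides) and a standard confusion-matrix definition of IoU; the function names are illustrative and are not taken from the paper's code.

```python
import numpy as np

def ndvi(nir, red, eps=1e-8):
    """Normalized difference vegetation index from NIR and red reflectance."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    return (nir - red) / (nir + red + eps)

def per_class_iou(pred, target, num_classes):
    """IoU per class from integer label maps; NaN for classes absent from both maps."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    cm = np.bincount(target * num_classes + pred, minlength=num_classes ** 2)
    cm = cm.reshape(num_classes, num_classes).astype(float)
    tp = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    return np.where(union > 0, tp / np.maximum(union, 1.0), np.nan)

def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that occur in the prediction or the ground truth."""
    return float(np.nanmean(per_class_iou(pred, target, num_classes)))

# Toy label maps with classes 0 = crop, 1 = weed, 2 = other (an assumed encoding).
pred = np.array([0, 1, 1, 2])
target = np.array([0, 1, 2, 2])
print(per_class_iou(pred, target, 3), mean_iou(pred, target, 3))
```

Because NDVI is a ratio, it discards absolute brightness, which is one way to read the abstract's point that index compression can weaken fine radiance cues.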

Unifying Optimization and Dynamics to Parallelize Sequential Computation: A Guide to Parallel Newton Methods for Breaking Sequential Bottlenecks

Authors: Xavier Gonzalez

First: 2026-03-17T17:55:01+00:00 · Latest: 2026-03-17T17:55:01+00:00

Comments: PhD Dissertation; Stanford University

Abs · PDF · Code1 · Code2

Abstract

Massively parallel hardware (GPUs) and long sequence data have made parallel algorithms essential for machine learning at scale. Yet dynamical systems, like recurrent neural networks and Markov chain Monte Carlo, were thought to suffer from sequential bottlenecks. Recent work showed that dynamical systems can in fact be parallelized across the sequence length by reframing their evaluation as a system of nonlinear equations, which can be solved with Newton's method using a parallel associative scan. However, these parallel Newton methods struggled with limitations, primarily inefficiency, instability, and lack of convergence guarantees. This thesis addresses these limitations with methodological and theoretical contributions, drawing particularly from optimization. Methodologically, we develop scalable and stable parallel Newton methods, based on quasi-Newton and trust-region approaches. The quasi-Newton methods are faster and more memory efficient, while the trust-region approaches are significantly more stable. Theoretically, we unify many fixed-point methods into our parallel Newton framework, including Picard and Jacobi iterations. We establish a linear convergence rate for these techniques that depends on the method's approximation accuracy and stability. Moreover, we give a precise condition, rooted in dynamical stability, that characterizes when parallelization provably accelerates a dynamical system and when it cannot. Specifically, the sign of the Largest Lyapunov Exponent of a dynamical system determines whether or not parallel Newton methods converge quickly. In sum, this thesis unlocks scalable and stable methods for parallelizing sequential computation, and provides a firm theoretical basis for when such techniques will and will not work. This thesis also serves as a guide to parallel Newton methods for researchers who want to write the next chapter in this ongoing story.

中文标题/摘要

标题：统一优化与动力学以并行化顺序计算：并行牛顿方法指南

大规模并行硬件（GPU）和长序列数据使得并行算法对于大规模机器学习至关重要。然而，动力学系统，如循环神经网络和马尔可夫链蒙特卡洛，被认为受到顺序瓶颈的限制。最近的研究表明，通过将动力学系统的评估重新构想为非线性方程组，可以并行化这些系统，从而可以使用并行关联扫描和牛顿法求解。然而，这些并行牛顿方法面临效率低下、不稳定性和缺乏收敛保证等局限性。本论文通过方法论和理论贡献解决了这些局限性，特别借鉴了优化领域。方法论上，我们开发了可扩展且稳定的并行牛顿方法，基于拟牛顿和信赖域方法。拟牛顿方法更快且更节省内存，而信赖域方法则显著更稳定。理论上，我们将许多不动点方法统一到我们的并行牛顿框架中，包括皮卡德和雅可比迭代。我们为这些技术建立了依赖于方法近似准确性和稳定性的线性收敛率。此外，我们给出了一个精确条件，基于动力学稳定性，确定并行化何时能加速动力学系统以及何时不能。具体来说，动力学系统的最大李雅普诺夫指数的符号决定了并行牛顿方法是否能快速收敛。总之，本论文解锁了并行化顺序计算的可扩展且稳定的方法，并为这些技术何时有效提供了坚实的理论基础。本论文还为希望在这个持续故事中撰写新篇章的研究人员提供了一本并行牛顿方法指南。

Summary / 总结

This thesis addresses the inefficiency, instability, and missing convergence guarantees of existing parallel Newton methods for dynamical systems, drawing on techniques from optimization. It develops quasi-Newton variants, which are faster and more memory efficient, and trust-region variants, which are significantly more stable. On the theory side, it unifies fixed-point methods such as Picard and Jacobi iterations within the parallel Newton framework and establishes a linear convergence rate that depends on approximation accuracy and stability. It also shows that the sign of the Largest Lyapunov Exponent determines whether parallel Newton methods converge quickly, and it serves as a guide for researchers in this field.

该论文旨在通过开发可扩展且稳定的并行牛顿方法来解决动态系统（如递归神经网络和马尔可夫链蒙特卡洛）并行化的问题。这些方法借鉴了优化技术，如拟牛顿法和信任区域方法，并将许多不动点方法统一到并行牛顿框架中。关键发现包括依赖于近似准确性和稳定性的线性收敛率，以及基于最大李雅普诺夫指数的条件，该条件确定了并行化是否能加速收敛。该工作为并行牛顿方法何时有效提供了理论基础，并为该领域的研究人员提供了一个指南。
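The core idea in this entry, evaluating a recurrence by Newton iteration over the whole trajectory with each step solved by an associative scan, can be sketched for a scalar system. This is a minimal illustration under stated assumptions: a scalar state, a hand-written Hillis-Steele scan in NumPy standing in for a GPU scan primitive, and hypothetical function names. It is plain Newton, not the thesis's quasi-Newton or trust-region variants.

```python
import numpy as np

def affine_scan(a, b):
    """Inclusive scan over the affine maps x -> a[t] * x + b[t].

    Returns (A, B) such that applying maps 0..t in order to x0 gives
    A[t] * x0 + B[t]. Each of the O(log T) rounds is one vectorized update.
    """
    A, B = a.copy(), b.copy()
    shift = 1
    while shift < len(A):
        # Compose each partial map with the one `shift` positions earlier.
        B[shift:] = A[shift:] * B[:-shift] + B[shift:]
        A[shift:] = A[shift:] * A[:-shift]
        shift *= 2
    return A, B

def parallel_newton(f, df, x0, T, iters=None, tol=1e-10):
    """Solve x_t = f(x_{t-1}), t = 1..T, for all t at once by Newton's method."""
    x = np.full(T, float(x0))                  # initial guess for x_1..x_T
    max_iters = iters or T
    for it in range(max_iters):
        prev = np.concatenate(([x0], x[:-1]))  # current guess for x_0..x_{T-1}
        a = df(prev)                           # derivative of each step
        b = f(prev) - a * prev                 # offset of the linearized step
        A, B = affine_scan(a, b)               # solve the linear recurrence
        x_new = A * x0 + B
        if np.max(np.abs(x_new - x)) < tol:
            return x_new, it + 1
        x = x_new
    return x, max_iters

def sequential(f, x0, T):
    """Reference: ordinary step-by-step evaluation."""
    out, x = [], x0
    for _ in range(T):
        x = f(x)
        out.append(x)
    return np.array(out)

# A contractive example map (an assumption chosen for illustration).
f = lambda x: np.tanh(0.9 * x + 0.3)
df = lambda x: 0.9 * (1.0 - np.tanh(0.9 * x + 0.3) ** 2)
x_par, n_iter = parallel_newton(f, df, x0=0.0, T=64)
print(n_iter, np.max(np.abs(x_par - sequential(f, 0.0, 64))))
```

Each Newton step linearizes every transition around the current guess, which turns the problem into a linear recurrence that the scan solves in logarithmic depth. In exact arithmetic the iteration reproduces the sequential result after at most T steps, so the practical question is how many fewer it needs, which is what the abstract ties to the sign of the Largest Lyapunov Exponent.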

GIST: Gauge-Invariant Spectral Transformers for Scalable Graph Neural Operators

Authors: Mattia Rigotti, Nicholas Thumiger, Thomas Frick

First: 2026-03-17T17:54:26+00:00 · Latest: 2026-03-17T17:54:26+00:00

Abs · PDF · Code1 · Code2

Abstract

Adapting transformer positional encoding to meshes and graph-structured data presents significant computational challenges: exact spectral methods require cubic-complexity eigendecomposition and can inadvertently break gauge invariance through numerical solver artifacts, while efficient approximate methods sacrifice gauge symmetry by design. Both failure modes cause catastrophic generalization in inductive learning, where models trained with one set of numerical choices fail when encountering different spectral decompositions of similar graphs or discretizations of the same mesh. We propose GIST (Gauge-Invariant Spectral Transformers), a new graph transformer architecture that resolves this challenge by achieving end-to-end $\mathcal{O}(N)$ complexity through random projections while algorithmically preserving gauge invariance via inner-product-based attention on the projected embeddings. We prove GIST achieves discretization-invariant learning with bounded mismatch error, enabling parameter transfer across arbitrary mesh resolutions for neural operator applications. Empirically, GIST matches state-of-the-art on standard graph benchmarks (e.g., achieving 99.50% micro-F1 on PPI) while uniquely scaling to mesh-based Neural Operator benchmarks with up to 750K nodes, achieving state-of-the-art aerodynamic prediction on the challenging DrivAerNet and DrivAerNet++ datasets.

中文标题/摘要

标题：GIST：用于可扩展图神经算子的规范不变谱Transformer

将Transformer的位置编码适配到网格和图结构数据带来了重大的计算挑战：精确的谱方法需要三次方复杂度的特征分解，并且可能会因数值求解器的伪影无意中破坏规范不变性，而高效的近似方法则在设计上牺牲了规范对称性。这两种失败模式都会在归纳学习中导致灾难性的泛化失败，即使用一种数值选择训练的模型在遇到相似图的不同谱分解或同一网格的不同离散化时会失效。我们提出了GIST（规范不变谱Transformer），这是一种新的图Transformer架构，通过随机投影实现了端到端的$\mathcal{O}(N)$复杂度，同时通过在投影嵌入上进行基于内积的注意力计算，在算法层面保持了规范不变性。我们证明GIST实现了离散化不变的学习，且失配误差有界，从而使神经算子应用中的参数可以在任意网格分辨率之间迁移。实验上，GIST在标准图基准测试中达到了最先进的性能（例如，在PPI上达到了99.50%的micro-F1），并且是唯一能够扩展到多达750K个节点的基于网格的神经算子基准测试的方法，在具有挑战性的DrivAerNet和DrivAerNet++数据集上实现了最先进的气动预测。

Summary / 总结

The paper addresses the computational challenges of adapting transformer positional encodings to meshes and graph-structured data by proposing GIST (Gauge-Invariant Spectral Transformers). GIST achieves end-to-end $\mathcal{O}(N)$ complexity through random projections and preserves gauge invariance via inner-product-based attention on the projected embeddings. The authors prove discretization-invariant learning with bounded mismatch error, and the model matches state-of-the-art performance on standard graph benchmarks while scaling to mesh-based Neural Operator benchmarks with up to 750K nodes, where it achieves state-of-the-art aerodynamic prediction.

论文解决了将Transformer位置编码适应网格和图结构数据时的计算挑战，其中精确的谱方法计算成本高且可能破坏规范不变性，而近似方法则牺牲了规范对称性。它提出了GIST，一种新的图Transformer架构，通过随机投影实现O(N)复杂度，并通过基于内积的注意力机制保持规范不变性，展示了在图和网格基准上的离散化不变学习和优越性能。
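The gauge ambiguity mentioned in this entry can be seen directly: Laplacian eigenvectors are defined only up to sign flips and rotations within repeated-eigenvalue subspaces, whereas inner products between node embeddings do not change under such transformations. The sketch below illustrates only that invariance on a small ring graph. It does not reproduce GIST's random-projection scheme, and all function names are hypothetical.

```python
import numpy as np

def ring_laplacian(n):
    """Graph Laplacian of an n-node ring; its nonzero eigenvalues mostly come in pairs."""
    idx = np.arange(n)
    adj = np.zeros((n, n))
    adj[idx, (idx + 1) % n] = 1.0
    adj[(idx + 1) % n, idx] = 1.0
    return np.diag(adj.sum(axis=1)) - adj

def spectral_embedding(lap, k):
    """First k Laplacian eigenvectors as per-node features, in one arbitrary gauge."""
    _, vecs = np.linalg.eigh(lap)
    return vecs[:, :k]

def random_orthogonal(k, rng):
    """Random k x k orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((k, k)))
    return q

def attention_logits(emb):
    """Pairwise inner products between node embeddings."""
    return emb @ emb.T

rng = np.random.default_rng(0)
phi = spectral_embedding(ring_laplacian(12), k=5)
# Any orthogonal change of basis of the embedding, which includes sign flips
# and rotations inside degenerate eigenspaces, changes the raw coordinates...
phi_regauged = phi @ random_orthogonal(5, rng)
print("max coordinate change:", np.abs(phi - phi_regauged).max())
# ...but leaves every pairwise inner product unchanged up to rounding.
print("max logit change:",
      np.abs(attention_logits(phi) - attention_logits(phi_regauged)).max())
```

A model that consumes the raw eigenvector coordinates sees a different input whenever the solver picks a different basis, while a model that consumes only inner products does not.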

Dynamic Meta-Layer Aggregation for Byzantine-Robust Federated Learning

Authors: Reek Das, Biplab Kanti Sen

First: 2026-03-17T17:54:00+00:00 · Latest: 2026-03-17T17:54:00+00:00

Comments: 15 pages, 3 figures

Abs · PDF · Code1 · Code2

Abstract

Federated Learning (FL) is increasingly applied in sectors like healthcare, finance, and IoT, enabling collaborative model training while safeguarding user privacy. However, FL systems are susceptible to Byzantine adversaries that inject malicious updates, which can severely compromise global model performance. Existing defenses tend to focus on specific attack types and fail against untargeted strategies, such as multi-label flipping or combinations of noise and backdoor patterns. To overcome these limitations, we propose FedAOT, a novel defense mechanism that counters multi-label flipping and untargeted poisoning attacks using a metalearning-inspired adaptive aggregation framework. FedAOT dynamically weights client updates based on their reliability, suppressing adversarial influence without relying on predefined thresholds or restrictive attack assumptions. Notably, FedAOT generalizes effectively across diverse datasets and a wide range of attack types, maintaining robust performance even in previously unseen scenarios. Experimental results demonstrate that FedAOT substantially improves model accuracy and resilience while maintaining computational efficiency, offering a scalable and practical solution for secure federated learning.

中文标题/摘要

标题：拜占庭鲁棒联邦学习中的动态元层聚合

联邦学习（FL）在医疗保健、金融和物联网等领域中越来越被应用，能够实现模型协作训练并保护用户隐私。然而，FL系统容易受到拜占庭对手的攻击，这些对手会注入恶意更新，严重损害全局模型性能。现有的防御措施往往针对特定的攻击类型，无法应对未瞄准的策略，如多标签翻转或噪声和后门模式的组合。为克服这些限制，我们提出了一种名为FedAOT的新颖防御机制，该机制利用元学习启发的自适应聚合框架来对抗多标签翻转和未瞄准的投毒攻击。FedAOT动态地根据客户端更新的可靠性对其进行加权，抑制恶意影响，而不依赖于预定义的阈值或严格的攻击假设。值得注意的是，FedAOT在多种数据集和广泛的攻击类型中表现出有效的泛化能力，即使在未见过的场景中也能保持稳健的性能。实验结果表明，FedAOT在提高模型准确性和鲁棒性的同时，保持了计算效率，提供了一种可扩展且实用的联邦学习安全解决方案。

Summary / 总结

The paper addresses the vulnerability of Federated Learning (FL) to Byzantine adversaries, which can inject malicious updates and compromise model performance. It introduces FedAOT, a defense mechanism that uses a metalearning-inspired adaptive aggregation framework to dynamically weight client updates based on their reliability, effectively countering multi-label flipping and untargeted poisoning attacks. Experimental results show that FedAOT improves model accuracy and resilience while maintaining computational efficiency, making it a scalable and practical solution for secure FL.

论文提出了FedAOT，一种基于元学习的自适应聚合框架，动态加权客户端更新以应对拜占庭对手的攻击，特别是多标签翻转和未瞄准的投毒攻击。FedAOT在多种数据集和攻击类型下保持了鲁棒性能，提高了模型的准确性和抗攻击能力，同时保持了计算效率。
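The abstract does not give FedAOT's weighting rule, so the following is only a generic illustration of reliability-weighted aggregation without a fixed rejection threshold. It swaps in a simple heuristic (softmax weights that decay with distance from the coordinate-wise median) that should not be read as the paper's meta-learning-inspired method. The function name and the `temperature` parameter are assumptions.

```python
import numpy as np

def reliability_weighted_mean(updates, temperature=1.0):
    """Aggregate client updates with weights that decay with distance from the median.

    updates: array of shape (num_clients, dim).
    Returns (aggregate, weights).
    """
    updates = np.asarray(updates, dtype=float)
    center = np.median(updates, axis=0)             # robust reference point
    dist = np.linalg.norm(updates - center, axis=1)
    scale = np.median(dist) + 1e-12                 # data-driven scale, no fixed cutoff
    logits = -dist / (temperature * scale)
    weights = np.exp(logits - logits.max())         # numerically stable softmax
    weights /= weights.sum()
    return weights @ updates, weights

rng = np.random.default_rng(0)
honest = 1.0 + 0.1 * rng.standard_normal((8, 2))    # honest updates near [1, 1]
byzantine = np.tile([-50.0, 50.0], (2, 1))          # two poisoned updates
updates = np.vstack([honest, byzantine])
aggregate, weights = reliability_weighted_mean(updates)
print("plain mean:   ", updates.mean(axis=0))
print("weighted mean:", aggregate)
print("weights:      ", np.round(weights, 3))
```

The plain mean is dragged far from the honest cluster by two outliers, while the weighted aggregate stays close to it. Any median-centred scheme like this one presumes that honest clients form the majority, which is an assumption of the sketch rather than a claim from the paper.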

M^3: Dense Matching Meets Multi-View Foundation Models for Monocular Gaussian Splatting SLAM

Authors: Kerui Ren, Guanghao Li, Changjian Jiang, Yingxiang Xu, Tao Lu, Linning Xu, Junting Dong, Jiangmiao Pang, Mulin Yu, Bo Dai

First: 2026-03-17T17:52:37+00:00 · Latest: 2026-03-17T17:52:37+00:00

Comments: Project page: https://city-super.github.io/M3/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Streaming reconstruction from uncalibrated monocular video remains challenging, as it requires both high-precision pose estimation and computationally efficient online refinement in dynamic environments. While coupling 3D foundation models with SLAM frameworks is a promising paradigm, a critical bottleneck persists: most multi-view foundation models estimate poses in a feed-forward manner, yielding pixel-level correspondences that lack the requisite precision for rigorous geometric optimization. To address this, we present M^3, which augments the Multi-view foundation model with a dedicated Matching head to facilitate fine-grained dense correspondences and integrates it into a robust Monocular Gaussian Splatting SLAM. M^3 further enhances tracking stability by incorporating dynamic area suppression and cross-inference intrinsic alignment. Extensive experiments on diverse indoor and outdoor benchmarks demonstrate state-of-the-art accuracy in both pose estimation and scene reconstruction. Notably, M^3 reduces ATE RMSE by 64.3% compared to VGGT-SLAM 2.0 and outperforms ARTDECO by 2.11 dB in PSNR on the ScanNet++ dataset.

中文标题/摘要

标题：M^3：多视图基础模型与单目高斯点云SLAM的密集匹配结合

从未标定的单目视频流式重建仍然具有挑战性，因为它需要高精度的位姿估计和在动态环境中高效的在线优化。虽然将3D基础模型与SLAM框架结合是一种有前景的范式，但一个关键瓶颈仍然存在：大多数多视图基础模型以前馈方式估计位姿，产生像素级对应关系，缺乏进行严格几何优化所需的精度。为了解决这个问题，我们提出了M^3，它通过增加一个专门的匹配头来增强多视图基础模型，以促进细粒度的密集对应关系，并将其集成到鲁棒的单目高斯点云SLAM中。M^3进一步通过引入动态区域抑制和跨推理固有对齐来增强跟踪稳定性。在多种室内外基准上的广泛实验表明，M^3在位姿估计和场景重建方面的精度达到了最先进的水平。值得注意的是，M^3将ATE RMSE降低了64.3%，与VGGT-SLAM 2.0相比，并且在ScanNet++数据集上的PSNR上比ARTDECO高出2.11 dB。

Summary / 总结

M^3 addresses the challenge of monocular SLAM in dynamic environments by integrating a multi-view foundation model with a dedicated matching head for dense correspondences, and incorporating dynamic area suppression and cross-inference alignment. Experiments show M^3 achieves state-of-the-art accuracy in pose estimation and scene reconstruction, reducing ATE RMSE by 64.3% compared to VGGT-SLAM 2.0 and outperforming ARTDECO by 2.11 dB in PSNR on ScanNet++.

M^3通过将专用的匹配头集成到多视图基础模型中，解决动态环境下的单目SLAM挑战，提高姿态估计精度和计算效率。该系统集成到单目高斯点云SLAM中，显示出显著的准确性提升，ATE RMSE相比VGGT-SLAM 2.0降低了64.3%，在ScanNet++数据集上PSNR比ARTDECO高出2.11 dB。
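The two headline metrics in this entry, ATE RMSE and PSNR, have standard definitions that can be sketched briefly. The version below assumes that trajectories are first aligned with a similarity transform (Umeyama alignment), a common convention for monocular SLAM where scale is unobservable. The abstract does not state the paper's exact evaluation protocol, and the function names are illustrative.

```python
import numpy as np

def umeyama_align(src, dst, with_scale=True):
    """Least-squares similarity transform (s, R, t) mapping src points onto dst."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, S, Vt = np.linalg.svd(cov)
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0                       # avoid returning a reflection
    R = U @ D @ Vt
    var_s = (xs ** 2).sum() / len(src)
    s = float(np.trace(np.diag(S) @ D) / var_s) if with_scale else 1.0
    t = mu_d - s * R @ mu_s
    return s, R, t

def ate_rmse(est, gt, with_scale=True):
    """Absolute trajectory error (RMSE of positions) after alignment."""
    est, gt = np.asarray(est, dtype=float), np.asarray(gt, dtype=float)
    s, R, t = umeyama_align(est, gt, with_scale)
    aligned = s * est @ R.T + t
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))

def psnr(img, ref, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images with range [0, max_val]."""
    mse = np.mean((np.asarray(img, dtype=float) - np.asarray(ref, dtype=float)) ** 2)
    return float(10.0 * np.log10(max_val ** 2 / mse))

# Toy trajectory: the estimate is a scaled, rotated, shifted copy plus small noise.
rng = np.random.default_rng(0)
gt = rng.standard_normal((50, 3))
c, s_ = np.cos(0.7), np.sin(0.7)
Rz = np.array([[c, -s_, 0.0], [s_, c, 0.0], [0.0, 0.0, 1.0]])
est = 0.5 * gt @ Rz.T + np.array([1.0, 2.0, 3.0]) + 0.01 * rng.standard_normal((50, 3))
print("ATE RMSE:", ate_rmse(est, gt))
```

Because PSNR is logarithmic, a gain of 2.11 dB corresponds to a mean squared error lower by a factor of about 10^0.211, or roughly 1.6.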

Internalizing Agency from Reflective Experience

Authors: Rui Ge, Yichao Fu, Yuyang Qian, Junda Su, Yiming Zhao, Peng Zhao, Hao Zhang

Venue: ICML 2026

First: 2026-03-17T17:50:47+00:00 · Latest: 2026-03-17T17:50:47+00:00

Comments: 17 pages, 5 figures; Submitted to ICML 2026

Abs · PDF · Code1 · Code2

Abstract

Large language models are increasingly deployed as autonomous agents that must plan, act, and recover from mistakes through long-horizon interaction with environments that provide rich feedback. However, prevailing outcome-driven post-training methods (e.g., RL with verifiable rewards) primarily optimize final success signals, leaving rich environment feedback underutilized. Consequently, they often lead to distribution sharpening: the policy becomes better at reproducing a narrow set of already-successful behaviors, while failing to improve the feedback-grounded agency needed to expand problem-solving capacity (e.g., Pass@k) in long-horizon settings. To address this, we propose LEAFE (Learning Feedback-Grounded Agency from Reflective Experience), a framework that internalizes recovery agency from reflective experience. Specifically, during exploration, the agent summarizes environment feedback into actionable experience, backtracks to earlier decision points, and explores alternative branches with revised actions. We then distill these experience-guided corrections into the model through supervised fine-tuning, enabling the policy to recover more effectively in future interactions. Across a diverse set of interactive coding and agentic tasks under fixed interaction budgets, LEAFE consistently improves Pass@1 over the base model and achieves higher Pass@k than outcome-driven baselines (GRPO) and experience-based methods such as Early Experience, with gains of up to 14% on Pass@128.

中文标题/摘要

标题：从反思经验内化代理权

大型语言模型越来越多地被部署为自主代理，必须通过与提供丰富反馈的环境进行长期交互来进行规划、行动并从错误中恢复。然而，当前主要依赖于结果驱动的后训练方法（例如，具有可验证奖励的强化学习）主要优化最终的成功信号，而未能充分利用丰富的环境反馈。因此，这些方法往往导致分布锐化：策略变得更好于重现一组已经成功的特定行为，而未能提高基于反馈的代理权，以在长期交互中扩展问题解决能力（例如，Pass@k）。为了解决这一问题，我们提出了LEAFE（从反思经验学习反馈驱动的代理权），这是一种框架，能够从反思经验中内化恢复代理权。具体而言，在探索过程中，代理将环境反馈总结为可操作的经验，回溯到早期决策点，并探索带有修订行动的替代分支。然后，我们通过监督微调将这些经验指导的修正提炼到模型中，使策略在未来交互中能够更有效地恢复。在固定交互预算下的各种交互式编程和代理任务中，LEAFE在Pass@1上始终优于基线模型，并且在Pass@k上优于结果驱动的基线（GRPO）和基于经验的方法（如早期经验），在Pass@128上的收益高达14%。

Summary / 总结

The research aims to enhance the agency of large language models by utilizing rich environment feedback during long-horizon interactions. The proposed LEAFE framework internalizes recovery agency from reflective experience, allowing the agent to backtrack and explore alternative actions based on summarized feedback. This method improves Pass@1 over the base model and achieves higher Pass@k than outcome-driven and experience-based baselines, with up to 14% gains on Pass@128.

研究旨在通过利用丰富的环境反馈来增强大型语言模型在长时交互中的自主性。提出的LEAFE框架从反思性经验中内化恢复自主性，使代理能够回溯并基于总结的反馈探索替代行动。这种方法在交互编码和代理任务中提高了Pass@1和Pass@k指标的表现，最高在Pass@128上实现了14%的性能提升。
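Pass@k, the metric this entry reports, is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn for a task and c is the number that succeed. The sketch below implements that estimator. Whether LEAFE's evaluation uses exactly this form is an assumption, since the abstract does not say.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct.

    Equals the probability that a uniformly random size-k subset of the
    n samples contains at least one correct sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0          # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Two hypothetical policies on four tasks, 128 samples per task (made-up counts).
# The "sharpened" policy concentrates all of its successes on two tasks.
sharpened = [128, 128, 0, 0]
broad = [64, 64, 64, 64]
for name, counts in (("sharpened", sharpened), ("broad", broad)):
    p1 = sum(pass_at_k(128, c, 1) for c in counts) / len(counts)
    p16 = sum(pass_at_k(128, c, 16) for c in counts) / len(counts)
    print(name, round(p1, 3), round(p16, 3))
```

In this toy comparison both policies have the same Pass@1, but only the one that spreads its successes across tasks gains from a larger k. That is the distinction the abstract draws between distribution sharpening and expanded problem-solving capacity.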

Stochastic Resetting Accelerates Policy Convergence in Reinforcement Learning

Authors: Jello Zhou, Vudtiwat Ngampruetikorn, David J. Schwab

First: 2026-03-17T17:50:32+00:00 · Latest: 2026-03-17T17:50:32+00:00

Comments: 18 pages, 17 figures

Abs · PDF · Code1 · Code2

Abstract

Stochastic resetting, where a dynamical process is intermittently returned to a fixed reference state, has emerged as a powerful mechanism for optimizing first-passage properties. Existing theory largely treats static, non-learning processes. Here we ask how stochastic resetting interacts with reinforcement learning, where the underlying dynamics adapt through experience. In tabular grid environments, we find that resetting accelerates policy convergence even when it does not reduce the search time of a purely diffusive agent, indicating a novel mechanism beyond classical first-passage optimization. In a continuous control task with neural-network-based value approximation, we show that random resetting improves deep reinforcement learning when exploration is difficult and rewards are sparse. Unlike temporal discounting, resetting preserves the optimal policy while accelerating convergence by truncating long, uninformative trajectories to enhance value propagation. Our results establish stochastic resetting as a simple, tunable mechanism for accelerating learning, translating a canonical phenomenon of statistical mechanics into an optimization principle for reinforcement learning.

中文标题/摘要

标题：随机重置加速强化学习策略收敛

随机重置，即动态过程间歇性地返回到固定参考状态，已成为优化首次通过性质的强大机制。现有理论主要处理静态、非学习过程。本文探讨了随机重置与强化学习的相互作用，其中底层动力学通过经验进行适应。在表格网格环境中，我们发现即使重置不减少纯扩散代理的搜索时间，它也能加速策略收敛，表明一种超越经典首次通过优化的新机制。在具有神经网络价值近似的连续控制任务中，我们展示了随机重置在探索困难和奖励稀疏时如何改善深度强化学习。与时间折扣不同，重置保留了最优策略，通过截断无信息的长轨迹来加速价值传播，从而提高收敛性。我们的结果将随机重置确立为一种简单可调的加速学习机制，将统计力学中的一个典型现象转化为强化学习的优化原则。
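The interaction between resetting and value learning can be sketched with tabular Q-learning on a small chain. This is a minimal illustration under several assumptions not taken from the paper: a one-dimensional chain with a single sparse reward, a uniformly random behavior policy, and resets applied after the update so that they truncate trajectories without entering the Bellman target. The function name and all parameter values are hypothetical.

```python
import numpy as np

def q_learning_with_resetting(n_states=6, reset_prob=0.05, steps=20000,
                              alpha=0.5, gamma=0.95, seed=0):
    """Tabular Q-learning on a chain with stochastic resetting to the start state.

    Actions: 0 = left, 1 = right. Reaching the right end gives reward 1 and
    ends the episode. With probability reset_prob per step the agent is
    returned to the start.
    """
    rng = np.random.default_rng(seed)
    q = np.zeros((n_states, 2))
    start, goal = 0, n_states - 1
    s = start
    for _ in range(steps):
        a = int(rng.integers(2))                    # uniform random exploration
        s_next = min(max(s + (1 if a == 1 else -1), 0), goal)
        done = s_next == goal
        target = 1.0 if done else gamma * q[s_next].max()
        q[s, a] += alpha * (target - q[s, a])
        # The reset is not part of the learned dynamics: no update is made for
        # the jump back to the start, so the Bellman targets are unchanged.
        if done or rng.random() < reset_prob:
            s = start
        else:
            s = s_next
    return q

for r in (0.0, 0.05, 0.2):
    q = q_learning_with_resetting(reset_prob=r)
    print(r, q.argmax(axis=1)[:-1])                 # greedy action per non-goal state
```

Since the reset never appears in the update target, the fixed point of the Q-update is the same for every reset rate. This matches the abstract's contrast with temporal discounting, which does change the objective.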

Learning to Present: Inverse Specification Rewards for Agentic Slide Generation

Authors: Karthik Ragunath Ananda Kumar, Subrahmanyam Arunachalam

First: 2026-03-17T17:45:53+00:00 · Latest: 2026-03-17T17:45:53+00:00

Comments: 12 pages, 11 figures, 13 tables, 26 references. Code: https://github.com/pushing-the-frontier/slide-forge-llm Dataset: https://huggingface.co/datasets/KarthikRagunathAnandaKumar/sliderl-multi-turn-rollouts

Abs · PDF · Code1 · Code2 · Code3

Abstract

Automated presentation generation remains a challenging task requiring coherent content creation, visual design, and audience-aware communication. This work proposes an OpenEnv-compatible reinforcement learning environment where LLM agents learn to research topics, plan content, and generate professional HTML slide presentations through tool use. We introduce a multi-component reward system combining structural validation, render quality assessment, LLM-based aesthetic scoring, content quality metrics, and an inverse specification reward that measures how faithfully generated slides convey their intended purpose. The inverse specification reward, an "inverse task" where an LLM attempts to recover the original specification from generated slides, provides a holistic quality signal. Our approach fine-tunes Qwen2.5-Coder-7B via GRPO, training only 0.5% of parameters on prompts derived from expert demonstrations collected using Claude Opus 4.6. Experiments on 48 diverse business briefs across six models demonstrate that our fine-tuned 7B model achieves 91.2% of Claude Opus 4.6's quality while improving 33.1% over the base model. The six-model comparison reveals that instruction adherence and tool-use compliance, rather than raw parameter count, determine agentic task performance. We contribute SlideRL, an open-source dataset of 288 multi-turn rollout trajectories across all six models: https://huggingface.co/datasets/KarthikRagunathAnandaKumar/sliderl-multi-turn-rollouts Code: https://github.com/pushing-the-frontier/slide-forge-llm

中文标题/摘要

标题：学习展示：逆向规范奖励在自主幻灯片生成中的应用

自动演示文稿生成仍然是一个具有挑战性的任务，需要连贯的内容创作、视觉设计和面向观众的沟通。本研究提出了一种与OpenEnv兼容的强化学习环境，其中LLM代理通过工具使用学习研究主题、规划内容并生成专业的HTML幻灯片演示文稿。我们引入了一个多组件奖励系统，结合结构验证、渲染质量评估、基于LLM的美学评分、内容质量指标以及一个逆向规范奖励，该奖励衡量生成的幻灯片在多大程度上忠实于其预期目的。逆向规范奖励是一种“逆向任务”，其中LLM尝试从生成的幻灯片中恢复原始规范，提供了一个全面的质量信号。我们的方法通过GRPO对Qwen2.5-Coder-7B进行微调，仅在来自使用Claude Opus 4.6收集的专家演示的提示上训练0.5%的参数。在六个模型对48个不同业务简报的实验中，我们微调的7B模型达到了Claude Opus 4.6质量的91.2%，并且比基模型提高了33.1%。六模型比较表明，指令遵循和工具使用合规性，而不是参数数量，决定了自主任务的性能。我们贡献了SlideRL，一个包含288个多回合展开轨迹的开源数据集，涵盖了所有六个模型：https://huggingface.co/datasets/KarthikRagunathAnandaKumar/sliderl-multi-turn-rollouts 代码：https://github.com/pushing-the-frontier/slide-forge-llm

Summary / 总结

This work addresses the challenge of automated presentation generation by proposing a reinforcement learning environment where LLM agents learn to create professional HTML slide presentations. The approach uses a multi-component reward system, including structural validation, render quality assessment, aesthetic scoring, content quality metrics, and an inverse specification reward that measures how well the generated slides convey their intended purpose. Experiments show that the fine-tuned 7B model achieves 91.2% of the quality of Claude Opus 4.6 while improving 33.1% over the base model, highlighting the importance of instruction adherence and tool-use compliance over raw parameter count.

该研究通过提出一个强化学习环境，让LLM代理学习生成专业的HTML幻灯片演示文稿，以应对自动化演示文稿生成的挑战。方法使用一个多组件奖励系统，包括结构验证、渲染质量评估、美学评分、内容质量指标以及一个逆任务奖励，衡量生成的幻灯片是否准确传达其预期目的。实验表明，微调后的7B模型在质量上达到了Claude Opus 4.6的91.2%，并且在基模型基础上提高了33.1%，强调了指令遵循和工具使用合规性的重要性，而非参数量的多少。

Fluids You Can Trust: Property-Preserving Operator Learning for Incompressible Flows

Authors: Ramansh Sharma, Matthew Lowery, Houman Owhadi, Varun Shankar

First: 2026-02-17T10:20:46+00:00 · Latest: 2026-03-17T17:44:47+00:00

Abs · PDF · Code1 · Code2

Abstract

We present a novel property-preserving kernel-based operator learning method for incompressible flows governed by the incompressible Navier--Stokes equations. Traditional numerical solvers incur significant computational costs to respect incompressibility. Operator learning offers efficient surrogate models, but current neural operators fail to exactly enforce physical properties such as incompressibility, periodicity, and turbulence. Our kernel method maps input functions to expansion coefficients of output functions in a property-preserving kernel basis, ensuring that predicted velocity fields $\textit{analytically}$ and $\textit{simultaneously}$ preserve the aforementioned physical properties. Our method leverages efficient numerical linear algebra, simple rootfinding, and streaming to allow for training at-scale on desktop GPUs. We also present universal approximation results and both pessimistic and more realistic $\textit{a priori}$ convergence rates for our framework. We evaluate the method on challenging 2D and 3D, laminar and turbulent, incompressible flow problems. Our method achieves up to six orders of magnitude lower relative $\ell_2$ errors upon generalization and trains up to five orders of magnitude faster compared to neural operators, despite our method being trained on desktop GPUs and neural operators being trained on cutting-edge GPU servers. Moreover, while our method enforces incompressibility analytically, neural operators exhibit very large deviations. Our results show that our method provides an accurate and efficient surrogate for incompressible flows.

中文标题/摘要

标题：可信赖的流体：保性质算子学习方法用于不可压缩流

我们提出了一种新的保性质核基算子学习方法，用于由不可压缩纳维-斯托克斯方程支配的不可压缩流。传统的数值求解器为了遵守不可压缩性会带来显著的计算成本。算子学习提供了高效的替代模型，但当前的神经算子无法精确强制执行诸如不可压缩性、周期性和湍流等物理性质。我们的核方法将输入函数映射到输出函数在保性质核基中的展开系数，确保预测的速度场以解析方式同时保持上述物理性质。我们的方法利用高效的数值线性代数、简单的求根和流式计算，允许在桌面GPU上大规模训练。我们还提供了我们框架的通用逼近结果以及悲观和更现实的先验收敛速率。我们在具有挑战性的2D和3D、层流和湍流、不可压缩流问题上评估了该方法。与神经算子相比，我们的方法在泛化时相对$\ell_2$误差最多低六个数量级，训练速度最多快五个数量级，尽管我们的方法在桌面GPU上训练，而神经算子在最先进的GPU服务器上训练。此外，虽然我们的方法以解析方式强制满足不可压缩性，但神经算子表现出非常大的偏差。我们的结果表明，我们的方法为不可压缩流提供了准确且高效的替代模型。
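To make "analytically enforced incompressibility" concrete, the sketch below uses a textbook construction for two-dimensional periodic flows: deriving the velocity from a stream function, so that the divergence vanishes identically. This is only an illustration of the idea. The paper's property-preserving kernel basis is not described in the abstract and is not reproduced here, and the function names are hypothetical.

```python
import numpy as np

def wavenumbers(n, length=2 * np.pi):
    """Angular wavenumber grids (kx, ky) for an n x n periodic domain."""
    k = 2 * np.pi * np.fft.fftfreq(n, d=length / n)
    return np.meshgrid(k, k, indexing="ij")

def velocity_from_stream_function(psi, length=2 * np.pi):
    """Velocity (u, v) = (d psi / dy, -d psi / dx) via spectral derivatives."""
    kx, ky = wavenumbers(psi.shape[0], length)
    psi_hat = np.fft.fft2(psi)
    u = np.real(np.fft.ifft2(1j * ky * psi_hat))
    v = np.real(np.fft.ifft2(-1j * kx * psi_hat))
    return u, v

def divergence(u, v, length=2 * np.pi):
    """Spectral divergence du/dx + dv/dy on the periodic grid."""
    kx, ky = wavenumbers(u.shape[0], length)
    div_hat = 1j * kx * np.fft.fft2(u) + 1j * ky * np.fft.fft2(v)
    return np.real(np.fft.ifft2(div_hat))

n = 32
x = np.linspace(0.0, 2 * np.pi, n, endpoint=False)
X, Y = np.meshgrid(x, x, indexing="ij")
psi = np.sin(X) * np.cos(2 * Y)              # an arbitrary smooth stream function
u, v = velocity_from_stream_function(psi)
print("max |div u|:", np.abs(divergence(u, v)).max())
```

Whatever coefficients a model predicts for the stream function, the resulting velocity field is divergence-free by construction. An unconstrained network that outputs the velocity components directly can only approximate that property.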

An assessment of data-centric methods for label noise identification in remote sensing data sets

Authors: Felix Kröber, Genc Hoxha, Ribana Roscher

First: 2026-03-17T17:40:28+00:00 · Latest: 2026-03-17T17:40:28+00:00

Comments: Accepted for publication in International Society for Photogrammetry and Remote Sensing (ISPRS) Annals 2026

Abs · PDF · Code1 · Code2

Abstract

Label noise in the sense of incorrect labels is present in many real-world data sets and is known to severely limit the generalizability of deep learning models. In the field of remote sensing, however, automated treatment of label noise in data sets has received little attention to date. In particular, there is a lack of systematic analysis of the performance of data-centric methods that not only cope with label noise but also explicitly identify and isolate noisy labels. In this paper, we examine three such methods and evaluate their behavior under different label noise assumptions. To do this, we inject different types of label noise with noise levels ranging from 10 to 70% into two benchmark data sets, followed by an analysis of how well the selected methods filter the label noise and how this affects task performances. With our analyses, we clearly prove the value of data-centric methods for both parts - label noise identification and task performance improvements. Our analyses provide insights into which method is the best choice depending on the setting and objective. Finally, we show in which areas there is still a need for research in the transfer of data-centric label noise methods to remote sensing data. As such, our work is a step forward in bridging the methodological establishment of data-centric label noise methods and their usage in practical settings in the remote sensing domain.

中文标题/摘要

标题：数据为中心的方法在遥感数据集标签噪声识别中的评估

在许多现实世界的数据集中都存在标签噪声，即错误的标签，这严重限制了深度学习模型的泛化能力。然而，在遥感领域，数据集中的标签噪声的自动化处理迄今受到的关注较少。特别是，缺乏对数据为中心的方法的系统分析，这些方法不仅能够处理标签噪声，还能明确识别和隔离噪声标签。在本文中，我们研究了三种此类方法，并在不同的标签噪声假设下评估了它们的行为。为此，我们向两个基准数据集注入了不同类型的标签噪声，噪声水平从10%到70%不等，然后分析了所选方法过滤标签噪声的效果以及这对任务性能的影响。通过我们的分析，我们清楚地证明了数据为中心的方法在两个方面的价值——标签噪声识别和任务性能改进。我们的分析提供了关于在不同设置和目标下哪种方法是最佳选择的见解。最后，我们展示了在将数据为中心的标签噪声方法转移到遥感数据中仍需研究的领域。因此，我们的工作是朝着在遥感领域方法论建立与实际应用之间架起桥梁迈出的一步。

Summary / 总结

This paper assesses data-centric methods for identifying label noise in remote sensing datasets, which are crucial for improving the generalizability of deep learning models. The authors evaluate three such methods by injecting various types of label noise (10-70%) into two benchmark datasets and analyzing their performance. The study demonstrates the effectiveness of these methods in both identifying label noise and enhancing task performance, providing insights into their suitability for different settings. The research highlights areas needing further investigation for transferring these methods to remote sensing applications.

本文评估了用于识别遥感数据集中标签噪声的数据中心化方法，这对于提高深度学习模型的泛化能力至关重要。作者向两个基准数据集注入不同类型的标签噪声（10%-70%），并评估了三种方法的效果。结果显示，这些方法能够有效识别和减轻标签噪声，提升任务性能。研究提供了基于特定设置和目标的最佳方法选择的见解，并指出了需要进一步研究的遥感应用领域。
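The experimental protocol described here, injecting label noise at a controlled rate and then measuring how well a method filters it, can be sketched as follows. The sketch assumes uniform (symmetric) class flips as one of the noise types and scores detection with precision and recall. The paper's specific noise models and the three evaluated methods are not reproduced, and the function names are illustrative.

```python
import numpy as np

def inject_symmetric_noise(labels, noise_rate, num_classes, seed=0):
    """Flip a fixed fraction of labels to a different, uniformly chosen class.

    Returns (noisy_labels, is_noisy), where is_noisy marks the flipped samples.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    noisy = labels.copy()
    n_flip = int(round(noise_rate * len(labels)))
    flip_idx = rng.choice(len(labels), size=n_flip, replace=False)
    # An offset in 1..num_classes-1 guarantees the new label differs from the old one.
    offsets = rng.integers(1, num_classes, size=n_flip)
    noisy[flip_idx] = (labels[flip_idx] + offsets) % num_classes
    is_noisy = np.zeros(len(labels), dtype=bool)
    is_noisy[flip_idx] = True
    return noisy, is_noisy

def detection_scores(flagged, is_noisy):
    """Precision and recall of a noise detector's flags against the injected noise."""
    flagged = np.asarray(flagged, dtype=bool)
    is_noisy = np.asarray(is_noisy, dtype=bool)
    tp = int(np.sum(flagged & is_noisy))
    precision = tp / max(int(flagged.sum()), 1)
    recall = tp / max(int(is_noisy.sum()), 1)
    return precision, recall

clean = np.arange(1000) % 10                       # toy labels for 10 classes
for rate in (0.1, 0.4, 0.7):                       # the range studied is 10 to 70%
    noisy, is_noisy = inject_symmetric_noise(clean, rate, num_classes=10)
    print(rate, int((noisy != clean).sum()))
```

Keeping the `is_noisy` mask alongside the corrupted labels is what makes the identification step measurable, separately from any downstream change in task accuracy.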

OrigamiBench: An Interactive Environment to Synthesize Flat-Foldable Origamis

Authors: Naaisha Agarwal, Yihan Wu, Yichang Jian, Yikuan Hu, Nishad Mansoor, Mohan Li, Yifei Peng, Wang-Zhou Dai, Yao-Xiang Ding, Emanuele Sansone

First: 2026-03-14T09:33:29+00:00 · Latest: 2026-03-17T17:36:55+00:00

Abs · PDF · Code1 · Code2

Abstract

Building AI systems that can plan, act, and create in the physical world requires more than pattern recognition. Such systems must understand the causal mechanisms and constraints governing physical processes in order to guide sequential decisions. This capability relies on internal representations, analogous to an internal language model, that relate observations, actions, and resulting environmental changes. However, many existing benchmarks treat visual perception and programmatic reasoning as separate problems, focusing either on visual recognition or on symbolic tasks. The domain of origami provides a natural testbed that integrates these modalities. Constructing shapes through folding operations requires visual perception, reasoning about geometric and physical constraints, and sequential planning, while remaining sufficiently structured for systematic evaluation. We introduce OrigamiBench, an interactive benchmark in which models iteratively propose folds and receive feedback on physical validity and similarity to a target configuration. Experiments with modern vision-language models show that scaling model size alone does not reliably produce causal reasoning about physical transformations. Models fail to generate coherent multi-step folding strategies, suggesting that visual and language representations remain weakly integrated.

中文标题/摘要

标题：OrigamiBench：一种合成平面可折叠纸艺的互动环境

构建能够在物理世界中规划、行动和创造的AI系统，不仅需要模式识别，还需要理解物理过程中的因果机制和约束，以指导顺序决策。这种能力依赖于类似于内部语言模型的内部表示，将观察、行动和环境变化的结果联系起来。然而，许多现有的基准将视觉感知和程序化推理视为两个独立的问题，分别关注视觉识别或符号任务。折纸领域提供了一个自然的测试平台，可以整合这些模态。通过折叠操作构建形状需要视觉感知、几何和物理约束的推理以及顺序规划，同时保持足够的结构化以便系统评估。我们介绍了OrigamiBench，这是一个互动基准，在其中模型迭代地提出折叠并接收关于物理有效性和与目标配置相似性的反馈。现代视觉-语言模型的实验表明，仅扩大模型规模并不能可靠地产生关于物理变换的因果推理。模型无法生成连贯的多步折叠策略，这表明视觉和语言表示仍然结合得不够紧密。

Summary / 总结

The research aims to develop AI systems capable of understanding and manipulating the physical world, focusing on the domain of origami as a testbed. The method involves an interactive environment where models propose folding operations and receive feedback on their validity and similarity to a target configuration. Key findings indicate that increasing model size does not inherently improve causal reasoning about physical transformations, and models struggle to develop coherent multi-step folding strategies, highlighting the need for better integration of visual and language representations.

研究旨在开发能够理解和操作物理世界的AI系统，通过因果推理和内部表示。主要方法是创建OrigamiBench，一个交互式环境，模型提出折叠操作并接收其有效性及与目标配置相似度的反馈。关键发现表明，增加模型规模并不一定能导致对物理变换的因果推理，模型在开发连贯的多步折叠策略方面遇到困难，这表明视觉和语言表示的整合较弱。
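Feedback on "physical validity" for flat-foldable origami can draw on two classical single-vertex conditions: Maekawa's theorem (mountain and valley crease counts differ by exactly two) and Kawasaki's theorem (alternating sector angles sum to the same value). The sketch below checks both at one interior vertex. It illustrates these necessary local conditions only and is not a description of OrigamiBench's actual validity checker.

```python
import math

def satisfies_maekawa(creases):
    """Maekawa's theorem at one interior vertex.

    creases: sequence of 'M' (mountain) and 'V' (valley) labels for the
    creases meeting at the vertex.
    """
    mountains = sum(1 for c in creases if c == "M")
    valleys = sum(1 for c in creases if c == "V")
    return abs(mountains - valleys) == 2

def satisfies_kawasaki(sector_angles, tol=1e-9):
    """Kawasaki's theorem at one interior vertex.

    sector_angles: consecutive angles in radians between adjacent creases,
    going once around the vertex, so they must sum to 2 * pi.
    """
    if len(sector_angles) % 2 != 0:
        return False                      # an odd number of creases cannot fold flat
    if abs(sum(sector_angles) - 2 * math.pi) > tol:
        return False
    odd_sum = sum(sector_angles[0::2])
    even_sum = sum(sector_angles[1::2])
    return abs(odd_sum - even_sum) <= tol

# Four creases at right angles with three mountains and one valley pass both checks.
print(satisfies_maekawa("MMMV"), satisfies_kawasaki([math.pi / 2] * 4))
# Unequal alternating sums violate Kawasaki's condition.
print(satisfies_kawasaki([math.pi / 3, 2 * math.pi / 3, math.pi / 2, math.pi / 2]))
```

Both checks are local. A crease pattern can pass them at every vertex and still fail to fold flat globally because of layer collisions, which is part of why multi-step folding requires planning rather than per-step pattern matching.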

SHAMISA: SHAped Modeling of Implicit Structural Associations for Self-supervised No-Reference Image Quality Assessment

Authors: Mahdi Naseri, Zhou Wang

First: 2026-03-14T00:37:26+00:00 · Latest: 2026-03-17T17:34:31+00:00

Comments: Submitted to IEEE Transactions on Image Processing

Abs · PDF · Code1 · Code2

Abstract

No-Reference Image Quality Assessment (NR-IQA) aims to estimate perceptual quality without access to a reference image of pristine quality. Learning an NR-IQA model faces a fundamental bottleneck: its need for a large number of costly human perceptual labels. We propose SHAMISA, a non-contrastive self-supervised framework that learns from unlabeled distorted images by leveraging explicitly structured relational supervision. Unlike prior methods that impose rigid, binary similarity constraints, SHAMISA introduces implicit structural associations, defined as soft, controllable relations that are both distortion-aware and content-sensitive, inferred from synthetic metadata and intrinsic feature structure. A key innovation is our compositional distortion engine, which generates an uncountable family of degradations from continuous parameter spaces, grouped so that only one distortion factor varies at a time. This enables fine-grained control over representational similarity during training: images with shared distortion patterns are pulled together in the embedding space, while severity variations produce structured, predictable shifts. We integrate these insights via dual-source relation graphs that encode both known degradation profiles and emergent structural affinities to guide the learning process throughout training. A convolutional encoder is trained under this supervision and then frozen for inference, with quality prediction performed by a linear regressor on its features. Extensive experiments on synthetic, authentic, and cross-dataset NR-IQA benchmarks demonstrate that SHAMISA achieves strong overall performance with improved cross-dataset generalization and robustness, all without human quality annotations or contrastive losses.

中文标题/摘要

标题：SHAMISA：形状建模的隐式结构关联自监督无参考图像质量评估

无参考图像质量评估（NR-IQA）旨在在无法访问高质量参考图像的情况下估计感知质量。学习一个NR-IQA模型面临一个根本性的瓶颈：需要大量的昂贵的人工感知标签。我们提出SHAMISA，一种非对比的自监督框架，通过利用显式结构关系监督从未标记的失真图像中学习。与之前的方法施加刚性、二元相似性约束不同，SHAMISA 引入了隐式结构关联，定义为软、可控的关系，既对失真敏感又对内容敏感，从合成元数据和固有特征结构中推断而来。一个关键创新是我们组成的失真引擎，它从连续参数空间生成不可数的退化家族，并按组排列，使得每次只有一个失真因素变化。这使得在训练期间对表示相似性进行精细控制：具有共同失真模式的图像在嵌入空间中被拉近，而严重程度变化产生结构化的、可预测的位移。我们通过双重来源关系图将这些见解整合起来，该图编码已知退化配置文件和新兴的结构亲和力，以指导整个训练过程中的学习过程。在监督下训练一个卷积编码器，然后在推理时冻结，质量预测由其特征上的线性回归器完成。在合成、真实和跨数据集的NR-IQA基准上的广泛实验表明，SHAMISA 在无需人工质量注释或对比损失的情况下实现了强大的整体性能，并且具有改进的跨数据集泛化能力和鲁棒性。

Summary / 总结

SHAMISA is a non-contrastive self-supervised framework for No-Reference Image Quality Assessment that learns from unlabeled images by leveraging structured relational supervision. It introduces implicit structural associations, which are soft and distortion-aware, and uses a compositional distortion engine to generate diverse degradations. SHAMISA demonstrates strong performance and improved cross-dataset generalization without requiring human quality annotations or contrastive losses.

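SHAMISA is a self-supervised framework for no-reference image quality assessment that requires no human perceptual labels and learns from unlabeled images through structured supervision. It introduces implicit structural associations that are soft and distortion-aware, and it uses a compositional distortion engine to generate finely controlled distortions. SHAMISA performs strongly across benchmarks, with improved cross-dataset generalization and robustness, and without contrastive losses.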

Deep Reinforcement Learning-driven Edge Offloading for Latency-constrained XR pipelines

Authors: Sourya Saha, Saptarshi Debroy

First: 2026-03-17T17:30:11+00:00 · Latest: 2026-03-17T17:30:11+00:00

Comments: Accepted at the 26th IEEE International Symposium on Cluster, Cloud, and Internet Computing (CCGrid 2026)

Abstract

Immersive extended reality (XR) applications introduce latency-critical workloads that must satisfy stringent real-time responsiveness while operating on energy- and battery-constrained devices, making execution placement between end devices and nearby edge servers a fundamental systems challenge. Existing approaches to adaptive execution and computation offloading typically optimize average performance metrics and do not fully capture the sustained interaction between real-time latency requirements and device battery lifetime in closed-loop XR workloads. In this paper, we present a battery-aware execution management framework for edge-assisted XR systems that jointly considers execution placement, workload quality, latency requirements, and battery dynamics. We design an online decision mechanism based on a lightweight deep reinforcement learning policy that continuously adapts execution decisions under dynamic network conditions while maintaining high motion-to-photon latency compliance. Experimental results show that the proposed approach extends the projected device battery lifetime by up to 163% compared to latency-optimal local execution while maintaining over 90% motion-to-photon latency compliance under stable network conditions. Such compliance does not fall below 80% even under significantly limited network bandwidth availability, thereby demonstrating the effectiveness of explicitly managing latency-energy trade-offs in immersive XR systems.
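
The two headline metrics in this abstract can be made concrete with a small sketch. The function names, the 20 ms deadline, and the sample values are illustrative assumptions rather than details from the paper: compliance is taken here as the fraction of frames whose motion-to-photon latency meets a deadline, and lifetime extension as the relative gain over a baseline.

```python
from typing import Sequence


def mtp_compliance(latencies_ms: Sequence[float], deadline_ms: float) -> float:
    """Fraction of frames whose motion-to-photon latency meets the deadline."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    met = sum(1 for latency in latencies_ms if latency <= deadline_ms)
    return met / len(latencies_ms)


def lifetime_extension_pct(baseline_hours: float, new_hours: float) -> float:
    """Relative battery-lifetime gain over the baseline, in percent."""
    return 100.0 * (new_hours - baseline_hours) / baseline_hours


# Hypothetical samples: 9 of 10 frames meet an assumed 20 ms deadline.
samples_ms = [12.0, 15.5, 18.0, 19.9, 14.2, 16.7, 25.3, 13.1, 17.8, 11.4]
compliance = mtp_compliance(samples_ms, deadline_ms=20.0)

# An extension of 163% means the device lasts 2.63 times as long as the baseline.
extension = lifetime_extension_pct(baseline_hours=1.0, new_hours=2.63)
```

Read this way, "up to 163%" corresponds to a projected lifetime of roughly 2.63 times the local-execution baseline.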

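Chinese Title/Abstract (translated)

Title: Deep Reinforcement Learning-Based Edge Offloading for Latency-Constrained XR Pipelines

Immersive extended reality (XR) applications introduce latency-critical workloads that must meet strict real-time responsiveness requirements on energy- and battery-constrained devices, which makes the choice of execution placement between the end device and a nearby edge server a fundamental systems challenge. Existing approaches to adaptive execution and computation offloading typically optimize average performance metrics and fail to fully capture the sustained interaction between real-time latency requirements and device battery lifetime in closed-loop XR workloads. In this paper, we propose a battery-aware execution management framework for edge-assisted XR systems that jointly considers execution placement, workload quality, latency requirements, and battery dynamics. We design an online decision mechanism based on a lightweight deep reinforcement learning policy that continuously adapts execution decisions under dynamic network conditions while maintaining high motion-to-photon latency compliance. Experimental results show that, compared with latency-optimal local execution, the proposed approach extends the projected device battery lifetime by up to 163% while maintaining over 90% motion-to-photon latency compliance under stable network conditions. Even when network bandwidth is significantly limited, compliance does not fall below 80%, which demonstrates the effectiveness of explicitly managing latency-energy trade-offs in immersive XR systems.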

Summary

This paper addresses the challenge of running immersive extended reality (XR) applications with stringent latency requirements on energy-constrained devices. It proposes a battery-aware execution management framework that uses a lightweight deep reinforcement learning policy to adapt execution placement and workload quality online. Compared with latency-optimal local execution, the approach extends the projected device battery lifetime by up to 163% while maintaining over 90% motion-to-photon latency compliance under stable network conditions, and compliance does not fall below 80% even when network bandwidth is significantly limited.

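This paper addresses the challenge of running immersive extended reality (XR) applications with strict latency requirements on energy-constrained devices. It proposes a battery-aware execution management framework that uses a deep reinforcement learning policy to optimize execution placement between edge servers and devices. The approach maintains high motion-to-photon latency compliance while extending device battery lifetime by up to 163%, and compliance does not fall below 80% even when network bandwidth is limited.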

Surg$Σ$: A Spectrum of Large-Scale Multimodal Data and Foundation Models for Surgical Intelligence

Authors: Zhitao Zeng, Mengya Xu, Jian Jiang, Pengfei Guo, Yunqiu Xu, Zhu Zhuo, Chang Han Low, Yufan He, Dong Yang, Chenxi Lin, Yiming Gu, Jiaxin Guo, Yutong Ban, Daguang Xu, Qi Dou, Yueming Jin

First: 2026-03-17T17:27:32+00:00 · Latest: 2026-03-17T17:27:32+00:00

Abstract

Surgical intelligence has the potential to improve the safety and consistency of surgical care, yet most existing surgical AI frameworks remain task-specific and struggle to generalize across procedures and institutions. Although multimodal foundation models, particularly multimodal large language models, have demonstrated strong cross-task capabilities across various medical domains, their advancement in surgery remains constrained by the lack of large-scale, systematically curated multimodal data. To address this challenge, we introduce Surg$Σ$, a spectrum of large-scale multimodal data and foundation models for surgical intelligence. At the core of this framework lies Surg$Σ$-DB, a large-scale multimodal data foundation designed to support diverse surgical tasks. Surg$Σ$-DB consolidates heterogeneous surgical data sources (including open-source datasets, curated in-house clinical collections and web-source data) into a unified schema, aiming to improve label consistency and data standardization across heterogeneous datasets. Surg$Σ$-DB spans 6 clinical specialties and diverse surgical types, providing rich image- and video-level annotations across 18 practical surgical tasks covering understanding, reasoning, planning, and generation, at an unprecedented scale (over 5.98M conversations). Beyond conventional multimodal conversations, Surg$Σ$-DB incorporates hierarchical reasoning annotations, providing richer semantic cues to support deeper contextual understanding in complex surgical scenarios. We further provide empirical evidence through recently developed surgical foundation models built upon Surg$Σ$-DB, illustrating the practical benefits of large-scale multimodal annotations, unified semantic design, and structured reasoning annotations for improving cross-task generalization and interpretability.

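Chinese Title/Abstract (translated)

Title: Surg$Σ$: A Spectrum of Large-Scale Multimodal Data and Foundation Models for Surgical Intelligence

Surgical intelligence has the potential to improve the safety and consistency of surgical care, yet most existing surgical AI frameworks remain task-specific and struggle to generalize across procedures and institutions. Although multimodal foundation models, especially multimodal large language models, have shown strong cross-task capabilities in many medical domains, their progress in surgery is constrained by the lack of large-scale, systematically curated multimodal data. To address this challenge, we introduce Surg$Σ$, a spectrum of large-scale multimodal data and foundation models for surgical intelligence. At the core of the framework is Surg$Σ$-DB, a large-scale multimodal data foundation designed to support diverse surgical tasks. Surg$Σ$-DB consolidates heterogeneous surgical data sources (including open-source datasets, curated in-house clinical collections, and web-sourced data) into a unified schema, with the aim of improving label consistency and data standardization across heterogeneous datasets. Surg$Σ$-DB spans 6 clinical specialties and diverse surgical types, and provides rich image- and video-level annotations across 18 practical surgical tasks covering understanding, reasoning, planning, and generation, at an unprecedented scale (over 5.98M conversations). Beyond conventional multimodal conversations, Surg$Σ$-DB also includes hierarchical reasoning annotations, which provide richer semantic cues to support deeper contextual understanding in complex surgical scenarios. We further provide empirical evidence through recently developed surgical foundation models built on Surg$Σ$-DB, illustrating the practical benefits of large-scale multimodal annotations, unified semantic design, and structured reasoning annotations for improving cross-task generalization and interpretability.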

Summary

Surg$Σ$ addresses the task-specific nature of existing surgical AI frameworks by introducing a large-scale multimodal data foundation, Surg$Σ$-DB, which consolidates heterogeneous surgical data sources into a unified schema. Surg$Σ$-DB spans 6 clinical specialties and provides rich image- and video-level annotations across 18 practical surgical tasks. Empirical evidence from recently developed surgical foundation models built on Surg$Σ$-DB illustrates how large-scale multimodal annotations, unified semantic design, and structured reasoning annotations improve cross-task generalization and interpretability.

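Surg$Σ$ introduces Surg$Σ$-DB, a large-scale multimodal data foundation that consolidates diverse surgical data sources into a unified schema. The framework supports 18 practical surgical tasks across 6 clinical specialties and provides over 5.98M conversations, including hierarchical reasoning annotations. Empirical evidence indicates that foundation models built on Surg$Σ$-DB improve cross-task generalization and interpretability.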

Model Medicine: A Clinical Framework for Understanding, Diagnosing, and Treating AI Models

Authors: Jihoon Jeong

First: 2026-03-05T01:49:29+00:00 · Latest: 2026-03-17T17:25:58+00:00

Comments: 56 pages, 7 figures. Project page: https://jihoonjeong.github.io/model-medicine/

Abstract

Model Medicine is the science of understanding, diagnosing, treating, and preventing disorders in AI models, grounded in the principle that AI models -- like biological organisms -- have internal structures, dynamic processes, heritable traits, observable symptoms, classifiable conditions, and treatable states. This paper introduces Model Medicine as a research program, bridging the gap between current AI interpretability research (anatomical observation) and the systematic clinical practice that complex AI systems increasingly require. We present five contributions: (1) a discipline taxonomy organizing 15 subdisciplines across four divisions -- Basic Model Sciences, Clinical Model Sciences, Model Public Health, and Model Architectural Medicine; (2) the Four Shell Model (v3.3), a behavioral genetics framework empirically grounded in 720 agents and 24,923 decisions from the Agora-12 program, explaining how model behavior emerges from Core--Shell interaction; (3) Neural MRI (Model Resonance Imaging), a working open-source diagnostic tool mapping five medical neuroimaging modalities to AI interpretability techniques, validated through four clinical cases demonstrating imaging, comparison, localization, and predictive capability; (4) a five-layer diagnostic framework for comprehensive model assessment; and (5) clinical model sciences including the Model Temperament Index for behavioral profiling, Model Semiology for symptom description, and M-CARE for standardized case reporting. We additionally propose the Layered Core Hypothesis -- a biologically-inspired three-layer parameter architecture -- and a therapeutic framework connecting diagnosis to treatment.

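Chinese Title/Abstract (translated)

Title: Model Medicine: A Clinical Framework for Understanding, Diagnosing, and Treating AI Models

Model Medicine is the science of understanding, diagnosing, treating, and preventing disorders in AI models, grounded in the principle that AI models, like biological organisms, have internal structures, dynamic processes, heritable traits, observable symptoms, classifiable conditions, and treatable states. This paper introduces Model Medicine as a research program that bridges the gap between current AI interpretability research (anatomical observation) and the systematic clinical practice that complex AI systems increasingly require. We present five contributions: (1) a discipline taxonomy organizing 15 subdisciplines across four divisions: Basic Model Sciences, Clinical Model Sciences, Model Public Health, and Model Architectural Medicine; (2) the Four Shell Model (v3.3), a behavioral genetics framework empirically grounded in 720 agents and 24,923 decisions from the Agora-12 program, which explains how model behavior emerges from Core-Shell interaction; (3) Neural MRI (Model Resonance Imaging), a working open-source diagnostic tool that maps five medical neuroimaging modalities to AI interpretability techniques and is validated through four clinical cases demonstrating imaging, comparison, localization, and predictive capability; (4) a five-layer diagnostic framework for comprehensive model assessment; and (5) clinical model sciences, including the Model Temperament Index for behavioral profiling, Model Semiology for symptom description, and M-CARE for standardized case reporting. We also propose the Layered Core Hypothesis, a biologically inspired three-layer parameter architecture, and a therapeutic framework connecting diagnosis to treatment.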

Summary

Model Medicine is a research program that aims to understand, diagnose, and treat AI models by comparing them to biological organisms. The paper introduces five contributions: a discipline taxonomy, the Four Shell Model, Neural MRI as a diagnostic tool, a five-layer diagnostic framework, and clinical model sciences. Key findings include the validation of Neural MRI through four clinical cases and the introduction of the Layered Core Hypothesis for AI model parameter architecture.

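Model Medicine aims to understand, diagnose, and treat AI models by analogy with biological organisms. The paper introduces a taxonomy of 15 subdisciplines and a behavioral genetics framework called the Four Shell Model. It also presents the Neural MRI diagnostic tool and a five-layer diagnostic framework. Key results include the validation of Neural MRI through clinical cases and the introduction of the Model Temperament Index and Model Semiology for behavioral profiling and symptom description.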

Diffusion-DRF: Free, Rich, and Differentiable Reward for Video Diffusion Fine-Tuning

Authors: Yifan Wang, Yanyu Li, Gordon Guocheng Qian, Sergey Tulyakov, Yun Fu, Anil Kag

First: 2026-01-07T18:05:08+00:00 · Latest: 2026-03-17T17:21:55+00:00

Comments: Webpage: https://snap-research.github.io/diffusion-drf/

Abstract

Video diffusion alignment has relied heavily on scalar rewards. These rewards are typically derived from reward models trained on human preference datasets, which requires additional training and extensive data collection. Moreover, scalar rewards provide coarse, global supervision, offering limited credit assignment for prompt-generation mismatches and making models prone to reward exploitation and unstable optimization. We propose Diffusion-DRF, a free, rich, and differentiable reward framework for video diffusion fine-tuning. Diffusion-DRF employs a frozen, off-the-shelf Vision-Language Model (VLM) as the critic, eliminating the need for reward model training. Instead of relying on a single scalar reward, it decomposes each user prompt into multi-dimensional questions with freeform dense VQA explanation queries, yielding information-rich feedback. Through direct differentiable optimization over this rich feedback, Diffusion-DRF achieves stable reward-based tuning without preference dataset collection. Diffusion-DRF achieves significant gains both quantitatively and qualitatively, outperforming the state-of-the-art Flow-GRPO by 4.74% in overall performance on the unseen VBench-2.0 benchmark.

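Chinese Title/Abstract (translated)

Title: Diffusion-DRF: A Free, Rich, and Differentiable Reward Framework for Video Diffusion Fine-Tuning

Video diffusion alignment has relied mainly on scalar rewards. These rewards typically come from reward models learned on human preference datasets, which requires additional training and extensive data collection. In addition, scalar rewards provide coarse, global supervision, offer limited credit assignment for prompt-generation mismatches, and leave models prone to reward exploitation and unstable optimization. We propose Diffusion-DRF, a free, rich, and differentiable reward framework for video diffusion fine-tuning. Diffusion-DRF uses a frozen, off-the-shelf vision-language model (VLM) as the critic, which removes the need for reward model training. Rather than relying on a single scalar reward, it decomposes each user prompt into multi-dimensional questions with freeform dense VQA explanation queries, yielding information-rich feedback. By optimizing directly and differentiably over this rich feedback, Diffusion-DRF achieves stable reward-based fine-tuning without collecting preference datasets. Diffusion-DRF delivers significant quantitative and qualitative gains, outperforming the state-of-the-art Flow-GRPO by 4.74% in overall performance on the unseen VBench-2.0.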

Summary

Diffusion-DRF is a reward framework for video diffusion fine-tuning that uses a frozen Vision-Language Model (VLM) as a critic, avoiding the need to train a separate reward model. It decomposes user prompts into multi-dimensional questions with dense VQA explanation queries, providing rich feedback for direct differentiable optimization. This approach yields stable tuning without human preference datasets, and it outperforms the state-of-the-art Flow-GRPO by 4.74% in overall performance on the unseen VBench-2.0 benchmark.

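The paper proposes Diffusion-DRF, a reward framework for video diffusion fine-tuning that uses a pretrained vision-language model as the critic, avoiding additional training and preference dataset collection. The framework decomposes user prompts into multi-dimensional questions and uses dense VQA explanations to provide rich feedback, enabling stable and efficient tuning. Experiments show that Diffusion-DRF outperforms Flow-GRPO by 4.74% in overall performance on VBench-2.0.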

Exploring Collatz Dynamics with Human-LLM Collaboration

Authors: Edward Y. Chang

First: 2026-03-10T02:07:00+00:00 · Latest: 2026-03-17T17:21:23+00:00

Comments: 127 pages, 11 figures, 13 tables

Abstract

We develop a quantitative framework for the Collatz conjecture through a human-LLM collaboration, combining exact arithmetic structure, cycle-level probabilistic laws, and a conditional convergence reduction. The central quantitative result is the Per-Orbit Gain Rate theorem, which proves R <= 0.0893 < epsilon = 2 - log_2 3 ~= 0.415, leaving a safety margin of at least 4.65x. A robustness corollary shows that exact equidistribution is unnecessary: it suffices that sum_K delta_K < 0.557. This promotes the Weak Mixing Hypothesis (WMH) to the primary open condition. On the arithmetic side, we refine modular crossing methods and prove that by depth 13 about 91 percent of odd residue classes are already forced to descend below their start. On the odd skeleton, we prove the exact run-length identity L(n) = v_2(n+1) - 1, derive an exact one-cycle crossing criterion, and compute the exact one-cycle crossing density P_1cyc = 0.713725498.... A major breakthrough is that the odd-skeleton valuation process satisfies an exact finite-block law: every prescribed valuation block occurs on a single odd residue class with the expected density. Hence the valuation process is exactly i.i.d. geometric in the natural-density ensemble, and the induced run-compensate cycle types are exactly i.i.d. This yields an exact cycle-level large-deviation theory and an unconditional almost-all crossing theorem in cycle language. We also prove substantial classwise deterministic crossing: about 41.9 percent of odd starts lie in one-cycle residue classes where every representative crosses below its start, and about 50.4 percent lie in two-cycle residue classes with the same universal crossing property. The framework does not yet prove Collatz. The remaining gap is now sharply isolated as a pointwise problem: proving that every deterministic orbit realizes enough of the exact negative cycle drift to cross below its start.
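
The run-length identity quoted above can be checked numerically under one reading of it. The sketch below assumes that L(n) counts how many consecutive applications of n -> (3n+1)/2 to an odd n keep the value odd, and that v_2 is the 2-adic valuation; the abstract does not define L, so this interpretation and the function names are assumptions. It also evaluates epsilon = 2 - log_2 3.

```python
import math


def v2(m: int) -> int:
    """2-adic valuation: the exponent of the largest power of 2 dividing m."""
    if m <= 0:
        raise ValueError("m must be a positive integer")
    count = 0
    while m % 2 == 0:
        m //= 2
        count += 1
    return count


def odd_run_length(n: int) -> int:
    """Count consecutive steps n -> (3n+1)/2 whose result is still odd."""
    if n <= 0 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    steps = 0
    while True:
        n = (3 * n + 1) // 2
        if n % 2 == 0:
            return steps
        steps += 1


# Check the identity L(n) = v_2(n + 1) - 1 on odd n below 10,000.
identity_holds = all(odd_run_length(n) == v2(n + 1) - 1 for n in range(1, 10_000, 2))

# The constant from the abstract, epsilon = 2 - log_2 3 (about 0.415).
epsilon = 2 - math.log2(3)
```

Under this reading the identity follows from (3n+1)/2 + 1 = 3(n+1)/2: each step removes exactly one factor of 2 from n+1, and the run ends once none is left to keep the result odd.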

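Chinese Title/Abstract (translated)

Title: Exploring Collatz Dynamics with Human-LLM Collaboration

We develop a quantitative framework for the Collatz conjecture through human-LLM collaboration, combining exact arithmetic structure, cycle-level probabilistic laws, and a conditional convergence reduction. The central quantitative result is the Per-Orbit Gain Rate theorem, which proves R <= 0.0893 < epsilon = 2 - log_2 3 ~= 0.415, leaving a safety margin of at least 4.65x. A robustness corollary shows that exact equidistribution is unnecessary: it suffices that sum_K delta_K < 0.557. This promotes the Weak Mixing Hypothesis (WMH) to the primary open condition. On the arithmetic side, we refine modular crossing methods and prove that by depth 13 about 91 percent of odd residue classes are already forced to descend below their start. On the odd skeleton, we prove the exact run-length identity L(n) = v_2(n+1) - 1, derive an exact one-cycle crossing criterion, and compute the exact one-cycle crossing density P_1cyc = 0.713725498.... A major breakthrough is that the odd-skeleton valuation process satisfies an exact finite-block law: every prescribed valuation block occurs on a single odd residue class with the expected density. Hence the valuation process is exactly i.i.d. geometric in the natural-density ensemble, and the induced run-compensate cycle types are exactly i.i.d. This yields an exact cycle-level large-deviation theory and an unconditional almost-all crossing theorem in cycle language. We also prove substantial classwise deterministic crossing: about 41.9 percent of odd starts lie in one-cycle residue classes where every representative crosses below its start, and about 50.4 percent lie in two-cycle residue classes with the same universal crossing property. The framework does not yet prove the Collatz conjecture. The remaining gap is now sharply isolated as a pointwise problem: proving that every deterministic orbit realizes enough of the exact negative cycle drift to cross below its start.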

Summary

The research aims to develop a quantitative framework for the Collatz conjecture through human-LLM collaboration, focusing on exact arithmetic structure and probabilistic laws. The key method involves proving the Per-Orbit Gain Rate theorem, which shows R <= 0.0893 < epsilon, and refining modular crossing methods to demonstrate that about 91 percent of odd residue classes descend below their start by depth 13. The study also proves the exact run-length identity and exact one-cycle crossing density, and shows that the odd-skeleton valuation process is exactly i.i.d. geometric, leading to an exact cycle-level large-deviation theory. Despite these advancements, the framework does not yet prove the Collatz conjecture, leaving a pointwise problem to be resolved.

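The research aims to develop a quantitative framework for the Collatz conjecture through human-LLM collaboration, focusing on exact arithmetic structure, cycle-level probabilistic laws, and a conditional convergence reduction. Key results include the Per-Orbit Gain Rate theorem, which proves R <= 0.0893 < epsilon = 2 - log_2 3, the exact run-length identity L(n) = v_2(n+1) - 1, and a proof that by depth 13 about 91 percent of odd residue classes descend below their start. The framework also proves that the valuation process on the odd skeleton is exactly i.i.d. geometric, which yields an exact cycle-level large-deviation theory.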

Is Conformal Factuality for RAG-based LLMs Robust? Novel Metrics and Systematic Insights

Authors: Yi Chen, Daiwei Chen, Sukrut Madhav Chikodikar, Caitlyn Heqi Yin, Ramya Korlakai Vinayak

First: 2026-03-17T17:20:08+00:00 · Latest: 2026-03-17T17:20:08+00:00

Comments: 56 pages

Abstract

Large language models (LLMs) frequently hallucinate, limiting their reliability in knowledge-intensive applications. Retrieval-augmented generation (RAG) and conformal factuality have emerged as potential ways to address this limitation. While RAG aims to ground responses in retrieved evidence, it provides no statistical guarantee that the final output is correct. Conformal factuality filtering offers distribution-free statistical reliability by scoring and filtering atomic claims using a threshold calibrated on held-out data; however, the informativeness of the final output is not guaranteed. We systematically analyze the reliability and usefulness of conformal factuality for RAG-based LLMs across generation, scoring, calibration, robustness, and efficiency. We propose novel informativeness-aware metrics that better reflect task utility under conformal filtering. Across three benchmarks and multiple model families, we find that (i) conformal filtering suffers from low usefulness at high factuality levels due to vacuous outputs, (ii) the conformal factuality guarantee is not robust to distribution shifts and distractors, highlighting the limitation that calibration data must closely match deployment conditions, and (iii) lightweight entailment-based verifiers match or outperform LLM-based model confidence scorers while requiring over $100\times$ fewer FLOPs. Overall, our results expose factuality-informativeness trade-offs and the fragility of the conformal filtering framework under distribution shifts and distractors, highlight the need for new approaches to reliability that treat robustness and usefulness as key metrics, and provide actionable guidance for building RAG pipelines that are both reliable and computationally efficient.
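
The calibrate-then-filter step described in this abstract can be sketched as a generic split-conformal procedure. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the (score, is_correct) data layout, and the toy numbers are all hypothetical.

```python
import math
from typing import List, Tuple


def calibrate_threshold(calibration: List[List[Tuple[float, bool]]], alpha: float) -> float:
    """Pick a score threshold from held-out examples.

    Each calibration example is a list of (score, is_correct) pairs,
    one pair per atomic claim.
    """
    # Nonconformity of an example: the highest score given to any incorrect claim.
    # Keeping only claims scored strictly above this value removes every error.
    residuals = sorted(
        max((score for score, ok in claims if not ok), default=float("-inf"))
        for claims in calibration
    )
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))  # conformal quantile rank
    if rank > n:
        return float("inf")  # too few calibration examples: keep nothing
    return residuals[rank - 1]


def filter_claims(claims: List[Tuple[str, float]], threshold: float) -> List[str]:
    """Keep only the claims whose score is strictly above the threshold."""
    return [text for text, score in claims if score > threshold]


# Toy calibration set: 7 examples, each with one incorrect and one correct claim.
calibration = [[(i / 8, False), (1.0, True)] for i in range(1, 8)]
threshold = calibrate_threshold(calibration, alpha=0.25)
kept = filter_claims([("claim A", 0.9), ("claim B", 0.5)], threshold)
```

When every claim scores at or below the threshold, `filter_claims` returns an empty list, which is the kind of vacuous output the abstract associates with high factuality levels.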

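Chinese Title/Abstract (translated)

Title: Is Conformal Factuality for RAG-based LLMs Robust? Novel Metrics and Systematic Insights

Large language models (LLMs) frequently hallucinate, which limits their reliability in knowledge-intensive applications. Retrieval-augmented generation (RAG) and conformal factuality have emerged as potential ways to address this limitation. Although RAG aims to ground responses in retrieved evidence, it offers no statistical guarantee that the final output is correct. Conformal factuality filtering scores and filters atomic claims using a threshold calibrated on held-out data, which provides distribution-free statistical reliability; however, the informativeness of the final output is not guaranteed. We systematically analyze the reliability and usefulness of conformal factuality for RAG-based LLMs across generation, scoring, calibration, robustness, and efficiency. We propose novel informativeness-aware metrics that better reflect task utility under conformal filtering. Across three benchmarks and multiple model families, we find that (i) at high factuality levels, conformal filtering has low usefulness because of vacuous outputs; (ii) the conformal factuality guarantee is not robust to distribution shifts and distractors, which highlights the limitation that calibration data must closely match deployment conditions; and (iii) lightweight entailment-based verifiers match or outperform LLM-based model confidence scorers while requiring over 100 times fewer FLOPs. Overall, our results reveal the trade-off between factuality and informativeness and the fragility of the conformal filtering framework under distribution shifts and distractors, underline the need for new approaches that treat reliability, robustness, and usefulness as key metrics, and provide actionable guidance for building RAG pipelines that are both reliable and computationally efficient.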

Summary

This study evaluates the reliability and usefulness of conformal factuality for RAG-based LLMs by proposing informativeness-aware metrics and analyzing generation, scoring, calibration, robustness, and efficiency. Key findings are that conformal filtering often produces vacuous outputs at high factuality levels, that its guarantee is not robust to distribution shifts and distractors, and that lightweight entailment-based verifiers match or outperform LLM-based confidence scorers while requiring over 100 times fewer FLOPs.

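The study evaluates the reliability and usefulness of conformal factuality for RAG-based large language models and proposes new metrics that better reflect task utility. It finds that conformal filtering often produces vacuous outputs at high factuality levels, is not robust to distribution shifts, and that lightweight entailment-based verifiers are more efficient than LLM-based scorers while maintaining comparable performance.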

WildDepth: A Multimodal Dataset for 3D Wildlife Perception and Depth Estimation

Authors: Muhammad Aamir, Naoya Muramatsu, Sangyun Shin, Matthew Wijers, Jiaxing Jhong, Xinyu Hou, Amir Patel, Andrew Markham

First: 2026-03-17T17:19:43+00:00 · Latest: 2026-03-17T17:19:43+00:00

Abstract

Depth estimation and 3D reconstruction have been extensively studied as core topics in computer vision. Starting from rigid objects with relatively simple geometric shapes, such as vehicles, the research has expanded to general objects, including challenging deformable objects such as humans and animals. However, for animals in particular, most existing models are trained on datasets without metric scale, even though metric scale can help validate image-only models. To address this limitation, we present WildDepth, a multimodal dataset and benchmark suite for depth estimation, behavior detection, and 3D reconstruction that covers diverse categories of animals in environments ranging from domestic to wild, with synchronized RGB and LiDAR. Experimental results show that the use of multimodal data improves depth reliability by up to 10% in RMSE, while RGB-LiDAR fusion enhances 3D reconstruction fidelity by 12% in Chamfer distance. By releasing WildDepth and its benchmarks, we aim to foster robust multimodal perception systems that generalize across domains.
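
For readers unfamiliar with the two metrics named in this abstract, here is a minimal pure-Python sketch. It assumes one common convention for Chamfer distance (the sum of the two mean nearest-neighbor Euclidean distances); the benchmark may use a different variant, such as squared distances, and the function names are illustrative.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def rmse(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Root-mean-square error between predicted and reference depths."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared / len(predicted))


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance between two point sets (brute force)."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")

    def one_way(src: Sequence[Point], dst: Sequence[Point]) -> float:
        # Mean distance from each source point to its nearest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


depth_error = rmse([1.0, 2.0], [1.0, 4.0])
shape_error = chamfer_distance([(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)])
```

The brute-force nearest-neighbor search is quadratic in the number of points, so it only suits small illustrative clouds; lower values are better for both metrics.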

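Chinese Title/Abstract (translated)

Title: WildDepth: A Multimodal Dataset for 3D Wildlife Perception and Depth Estimation

Depth estimation and 3D reconstruction have long been core research topics in computer vision. Starting from rigid objects with relatively simple geometric shapes, such as vehicles, research has expanded to general objects, including challenging deformable objects such as humans and animals. For animals, however, most existing models are trained on datasets that lack metric scale, even though metric scale helps validate image-only models. To address this limitation, we present WildDepth, a multimodal dataset and benchmark suite for depth estimation, behavior detection, and 3D reconstruction across diverse animal categories in environments ranging from domestic to wild, with synchronized RGB and LiDAR data. Experimental results show that using multimodal data improves depth reliability by up to 10% in RMSE, while RGB-LiDAR fusion improves 3D reconstruction fidelity by 12% in Chamfer distance. By releasing WildDepth and its benchmarks, we aim to foster robust multimodal perception systems that generalize across domains.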

Summary

The research aims to improve depth estimation and 3D reconstruction for animals by addressing the limitations of existing models trained without metric scale. WildDepth, a multimodal dataset, includes synchronized RGB and LiDAR data from various animal categories. The study finds that using multimodal data reduces RMSE by up to 10% and enhances 3D reconstruction fidelity by 12% in Chamfer distance through RGB-LiDAR fusion. The dataset and benchmarks are intended to promote robust multimodal perception systems.

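The research aims to improve depth estimation and 3D reconstruction for animals by addressing the lack of metric scale in the data used to train existing models. WildDepth is a multimodal dataset containing synchronized RGB and LiDAR data from a variety of animal categories. The study finds that using multimodal data reduces RMSE by up to 10%, and that RGB-LiDAR fusion improves 3D reconstruction fidelity by 12%.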

Test-Time Adaptation via Many-Shot Prompting: Benefits, Limits, and Pitfalls

Authors: Shubhangi Upasani, Chen Wu, Jay Rainton, Bo Li, Urmish Thakker, Changran Hu, Qizheng Zhang

First: 2026-03-06T02:25:02+00:00 · Latest: 2026-03-17T17:10:32+00:00

Abstract

Test-time adaptation enables large language models (LLMs) to modify their behavior at inference without updating model parameters. A common approach is many-shot prompting, where large numbers of in-context learning (ICL) examples are injected as an input-space test-time update. Although performance can improve as more demonstrations are added, the reliability and limits of this update mechanism remain poorly understood, particularly for open-source models. We present an empirical study of many-shot prompting across tasks and model backbones, analyzing how performance varies with update magnitude, example ordering, and selection policy. We further study Dynamic and Reinforced ICL as alternative test-time update strategies that control which information is injected and how it constrains model behavior. We find that many-shot prompting is effective for structured tasks where demonstrations provide high information gain, but is highly sensitive to selection strategy and often shows limited benefits for open-ended generation tasks. Overall, we characterize the practical limits of prompt-based test-time adaptation and outline when input-space updates are beneficial versus harmful.
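
As a rough illustration of what an input-space update looks like, the sketch below assembles a many-shot prompt from demonstrations. The template, the "take the first k" selection policy, and the function name are assumptions made for illustration; the paper's actual prompt formats and selection policies are not given in the abstract.

```python
import random
from typing import List, Optional, Tuple


def build_many_shot_prompt(
    demos: List[Tuple[str, str]],
    query: str,
    num_shots: int,
    shuffle_seed: Optional[int] = None,
) -> str:
    """Concatenate demonstrations ahead of the query; no model weights change."""
    chosen = list(demos[:num_shots])  # selection policy: take the first k demos
    if shuffle_seed is not None:
        # Reorder the chosen demos to probe sensitivity to example ordering.
        random.Random(shuffle_seed).shuffle(chosen)
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in chosen]
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)


demos = [("2+2", "4"), ("3+5", "8"), ("1+1", "2")]
prompt = build_many_shot_prompt(demos, query="4+4", num_shots=2)
```

Varying `num_shots` changes the update magnitude, `shuffle_seed` changes the ordering, and replacing the slice with another rule changes the selection policy, which are the three factors the abstract says were analyzed.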

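Chinese Title/Abstract (translated)

Title: Test-Time Adaptation via Many-Shot Prompting: Benefits, Limits, and Pitfalls

Test-time adaptation allows large language models (LLMs) to modify their behavior at inference without updating model parameters. A common approach is many-shot prompting, in which large numbers of in-context learning (ICL) examples are injected as an input-space test-time update. Although performance can improve as more demonstrations are added, the reliability and limits of this update mechanism remain poorly understood, especially for open-source models. We present an empirical study of many-shot prompting across tasks and model backbones, analyzing how performance varies with update magnitude, example ordering, and selection policy. We also study Dynamic and Reinforced ICL as alternative test-time update strategies that control which information is injected and how it constrains model behavior. We find that many-shot prompting is effective for structured tasks where demonstrations provide high information gain, but is highly sensitive to selection strategy and often shows limited benefits for open-ended generation tasks. Overall, we characterize the practical limits of prompt-based test-time adaptation and outline when input-space updates are beneficial and when they are harmful.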

Summary

The study investigates the effectiveness of many-shot prompting for test-time adaptation in large language models, focusing on its performance across different tasks and model architectures. It finds that many-shot prompting is beneficial for structured tasks where demonstrations provide significant information gain but is less effective for open-ended generation tasks. The research also highlights the importance of selection strategy and shows that dynamic and reinforced in-context learning can offer more controlled updates. Overall, the study characterizes the practical limits of prompt-based test-time adaptation and provides insights into when input-space updates are useful or detrimental.

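The study examines the effectiveness of many-shot prompting for test-time adaptation in large language models across tasks and model architectures. It finds that many-shot prompting is beneficial for structured tasks where demonstrations provide significant information gain, but has limited effect on open-ended generation tasks. The study also stresses the importance of selection strategy and shows that dynamic and reinforced in-context learning can provide more controlled updates. Overall, it characterizes the practical limits of prompt-based test-time adaptation and offers insight into when input-space updates help and when they hurt.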

Gym-V: A Unified Vision Environment System for Agentic Vision Research

Authors: Fanqing Meng, Lingxiao Du, Jiawei Gu, Jiaqi Liao, Linjie Li, Zijian Wu, Xiangyan Liu, Ziqi Zhao, Mengkang Hu, Yue Zhang, Zichen Liu, Jiaheng Zhang, Michael Qizhe Shieh

First: 2026-03-16T15:37:07+00:00 · Latest: 2026-03-17T17:07:16+00:00

Abstract

As agentic systems increasingly rely on reinforcement learning from verifiable rewards, standardized "gym" infrastructure has become essential for rapid iteration, reproducibility, and fair comparison. Vision agents lack such infrastructure, limiting systematic study of what drives their learning and where current models fall short. We introduce Gym-V, a unified platform of 179 procedurally generated visual environments across 10 domains with controllable difficulty, enabling controlled experiments that were previously infeasible across fragmented toolkits. Using it, we find that observation scaffolding is more decisive for training success than the choice of RL algorithm, with captions and game rules determining whether learning succeeds at all. Cross-domain transfer experiments further show that training on diverse task categories generalizes broadly while narrow training can cause negative transfer, with multi-turn interaction amplifying all of these effects. Gym-V is released as a convenient foundation for training environments and evaluation toolkits, aiming to accelerate future research on agentic VLMs.

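Chinese Title/Abstract (translated)

Title: Gym-V: A Unified Vision Environment System for Agentic Vision Research

As agentic systems increasingly rely on reinforcement learning from verifiable rewards, standardized "gym" infrastructure has become essential for rapid iteration, reproducibility, and fair comparison. Vision agents lack such infrastructure, which limits systematic study of what drives their learning and where current models fall short. We introduce Gym-V, a unified platform of 179 procedurally generated visual environments across 10 domains with controllable difficulty, which makes possible controlled experiments that were previously infeasible across fragmented toolkits. Using it, we find that observation scaffolding matters more for training success than the choice of RL algorithm, with captions and game rules determining whether learning succeeds at all. Cross-domain transfer experiments further show that training on diverse task categories generalizes broadly while narrow training can cause negative transfer, and that multi-turn interaction amplifies all of these effects. Gym-V is released as a convenient foundation for training environments and evaluation toolkits, with the aim of accelerating future research on agentic VLMs.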

Summary

The research aims to address the lack of standardized infrastructure for vision agents, which hinders systematic study of their learning processes. The authors developed Gym-V, a unified platform with 179 procedurally generated visual environments across 10 domains, to enable controlled experiments. Key findings include the importance of observation scaffolding over the choice of RL algorithm, with captions and game rules significantly impacting learning success. Cross-domain transfer experiments revealed that diverse training generalizes well, while narrow training can lead to negative transfer, with multi-turn interaction amplifying these effects.

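The paper introduces Gym-V, a unified platform of 179 procedurally generated visual environments across 10 domains, intended to support systematic research on vision agents. It finds that observation scaffolding matters more for training success than the choice of reinforcement learning algorithm, and that captions and game rules significantly affect learning outcomes. Cross-domain transfer experiments show that diverse training generalizes well while narrow training can cause negative transfer, and that multi-turn interaction amplifies these effects.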

RaDAR: Relation-aware Diffusion-Asymmetric Graph Contrastive Learning for Recommendation

Authors: Yixuan Huang, Jiawei Chen, Shengfan Zhang, Zongsheng Cao

Venue: WWW 2026

First: 2026-03-17T17:05:23+00:00 · Latest: 2026-03-17T17:05:23+00:00

Comments: 12 pages, 5 figures. Accepted at WWW 2026

Abstract

Collaborative filtering (CF) recommendation has been significantly advanced by integrating Graph Neural Networks (GNNs) and Graph Contrastive Learning (GCL). However, (i) random edge perturbations often distort critical structural signals and degrade semantic consistency across augmented views, and (ii) data sparsity hampers the propagation of collaborative signals, limiting generalization. To tackle these challenges, we propose RaDAR (Relation-aware Diffusion-Asymmetric Graph Contrastive Learning Framework for Recommendation Systems), a novel framework that combines two complementary view generation mechanisms: a graph generative model to capture global structure and a relation-aware denoising model to refine noisy edges. RaDAR introduces three key innovations: (1) asymmetric contrastive learning with global negative sampling to maintain semantic alignment while suppressing noise; (2) diffusion-guided augmentation, which employs progressive noise injection and denoising for enhanced robustness; and (3) relation-aware edge refinement, dynamically adjusting edge weights based on latent node semantics. Extensive experiments on three public benchmarks demonstrate that RaDAR consistently outperforms state-of-the-art methods, particularly under noisy and sparse conditions.

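Chinese Title/Abstract (translated)

Title: RaDAR: Relation-aware Diffusion-Asymmetric Graph Contrastive Learning for Recommendation

Collaborative filtering (CF) recommendation has been significantly advanced by combining Graph Neural Networks (GNNs) and Graph Contrastive Learning (GCL). However, (i) random edge perturbations often distort critical structural signals and reduce semantic consistency across augmented views, and (ii) data sparsity limits the propagation of collaborative signals and hurts generalization. To address these challenges, we propose RaDAR (Relation-aware Diffusion-Asymmetric Graph Contrastive Learning Framework for Recommendation Systems), a novel framework that combines two complementary view generation mechanisms: a graph generative model that captures global structure and a relation-aware denoising model that refines noisy edges. RaDAR introduces three key innovations: (1) asymmetric contrastive learning with global negative sampling, which maintains semantic alignment while suppressing noise; (2) diffusion-guided augmentation, which improves robustness through progressive noise injection and denoising; and (3) relation-aware edge refinement, which dynamically adjusts edge weights based on latent node semantics. Extensive experiments on three public benchmarks show that RaDAR consistently outperforms state-of-the-art methods, particularly under noisy and sparse conditions.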

Summary

RaDAR is a novel recommendation framework that addresses the challenges of random edge perturbations and data sparsity by integrating a graph generative model and a relation-aware denoising model. It introduces asymmetric contrastive learning, diffusion-guided augmentation, and relation-aware edge refinement. Experimental results show that RaDAR outperforms existing methods, especially in noisy and sparse conditions.

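RaDAR is a novel recommendation framework that addresses random edge perturbations and data sparsity by combining a graph generative model with a relation-aware denoising model. It introduces asymmetric contrastive learning, diffusion-guided augmentation, and relation-aware edge refinement to maintain semantic alignment, strengthen robustness, and improve generalization. Experiments show that RaDAR outperforms existing methods under noisy and sparse conditions.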

Adaptive Moments are Surprisingly Effective for Plug-and-Play Diffusion Sampling

Authors: Christian Belardi, Justin Lovelace, Kilian Q. Weinberger, Carla P. Gomes

First: 2026-03-17T17:04:07+00:00 · Latest: 2026-03-17T17:04:07+00:00

Abstract

Guided diffusion sampling relies on approximating often intractable likelihood scores, which introduces significant noise into the sampling dynamics. We propose using adaptive moment estimation to stabilize these noisy likelihood scores during sampling. Despite its simplicity, our approach achieves state-of-the-art results on image restoration and class-conditional generation tasks, outperforming more complicated methods, which are often computationally more expensive. We provide empirical analysis of our method on both synthetic and real data, demonstrating that mitigating gradient noise through adaptive moments offers an effective way to improve alignment.
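
The idea of stabilizing a noisy guidance signal with adaptive moments can be sketched with Adam-style running estimates. This is a minimal pure-Python illustration, assuming the standard Adam moment updates with bias correction; the class name, the default coefficients, and the way the smoothed direction would be plugged into a sampler are assumptions, not the authors' implementation.

```python
import math
from typing import List, Optional


class AdaptiveMomentSmoother:
    """Adam-style running moments applied to a noisy guidance gradient."""

    def __init__(self, beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-8) -> None:
        self.beta1, self.beta2, self.eps = beta1, beta2, eps
        self.m: Optional[List[float]] = None  # first moment (running mean)
        self.v: Optional[List[float]] = None  # second moment (running mean of squares)
        self.t = 0  # number of updates seen so far

    def update(self, grad: List[float]) -> List[float]:
        """Fold in one noisy gradient and return the smoothed, rescaled direction."""
        if self.m is None or self.v is None:
            self.m = [0.0] * len(grad)
            self.v = [0.0] * len(grad)
        self.t += 1
        self.m = [self.beta1 * m + (1 - self.beta1) * g for m, g in zip(self.m, grad)]
        self.v = [self.beta2 * v + (1 - self.beta2) * g * g for v, g in zip(self.v, grad)]
        # Bias correction, as in the Adam optimizer.
        m_hat = [m / (1 - self.beta1 ** self.t) for m in self.m]
        v_hat = [v / (1 - self.beta2 ** self.t) for v in self.v]
        return [m / (math.sqrt(v) + self.eps) for m, v in zip(m_hat, v_hat)]


smoother = AdaptiveMomentSmoother()
first_direction = smoother.update([2.0, -4.0])
```

In a guided sampler, one such smoother would be kept across denoising steps and its output used in place of the raw likelihood-score estimate; the step size and the point at which the guidance term enters the update are left to the sampler.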

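Chinese Title/Abstract (translated)

Title: Adaptive Moments are Surprisingly Effective for Plug-and-Play Diffusion Sampling

Guided diffusion sampling relies on approximating likelihood scores that are often intractable, which introduces significant noise into the sampling dynamics. We propose using adaptive moment estimation to stabilize these noisy likelihood scores during sampling. Despite its simplicity, our approach achieves state-of-the-art results on image restoration and class-conditional generation tasks, outperforming more complicated methods that are often more computationally expensive. We provide an empirical analysis of our method on both synthetic and real data, showing that mitigating gradient noise through adaptive moments is an effective way to improve alignment.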

Summary

The paper addresses the issue of noise in guided diffusion sampling by proposing the use of adaptive moment estimation to stabilize likelihood scores. This simple method outperforms more complex approaches on image restoration and class-conditional generation tasks, showing that reducing gradient noise through adaptive moments can effectively enhance alignment and performance.

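The paper addresses the noise introduced by intractable likelihood scores in guided diffusion sampling and proposes adaptive moment estimation to stabilize those scores. The method is simple and effective, surpassing more complex alternatives on image restoration and class-conditional generation tasks and reaching state-of-the-art performance. Empirical analysis on synthetic and real data shows that reducing gradient noise through adaptive moments improves alignment and performance.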

InCoder-32B: Code Foundation Model for Industrial Scenarios

Authors: Jian Yang, Wei Zhang, Jiajun Wu, Junhang Cheng, Shawn Guo, Haowen Wang, Weicheng Gu, Yaxin Du, Joseph Li, Fanglin Xu, Yizhi Li, Lin Jing, Yuanbo Wang, Yuhan Gao, Ruihao Gong, Chuan Hao, Ran Tao, Aishan Liu, Tuney Zheng, Ganqu Cui, Zhoujun Li, Mingjie Tang, Chenghua Lin, Wayne Xin Zhao, Xianglong Liu, Ming Zhou, Bryan Dai, Weifeng Lv

First: 2026-03-17T17:01:35+00:00 · Latest: 2026-03-17T17:01:35+00:00

Abs · PDF · Code1 · Code2

Abstract

Recent code large language models have achieved remarkable progress on general programming tasks. Nevertheless, their performance degrades significantly in industrial scenarios that require reasoning about hardware semantics, specialized language constructs, and strict resource constraints. To address these challenges, we introduce InCoder-32B (Industrial-Coder-32B), the first 32B-parameter code foundation model unifying code intelligence across chip design, GPU kernel optimization, embedded systems, compiler optimization, and 3D modeling. By adopting an efficient architecture, we train InCoder-32B from scratch with general code pre-training, curated industrial code annealing, mid-training that progressively extends context from 8K to 128K tokens with synthetic industrial reasoning data, and post-training with execution-grounded verification. We conduct extensive evaluation on 14 mainstream general code benchmarks and 9 industrial benchmarks spanning 4 specialized domains. Results show InCoder-32B achieves highly competitive performance on general tasks while establishing strong open-source baselines across industrial domains.

中文标题/摘要

标题：InCoder-32B: 工业场景代码基础模型

近期的代码大型语言模型在通用编程任务上取得了显著进展。然而，在需要考虑硬件语义、专业语言结构和严格资源限制的工业场景中，其性能显著下降。为应对这些挑战，我们引入了InCoder-32B（Industrial-Coder-32B），这是首个统一芯片设计、GPU内核优化、嵌入式系统、编译器优化和3D建模领域代码智能的32B参数代码基础模型。通过采用高效的架构，我们从通用代码预训练开始，经过精心挑选的工业代码退火、逐步扩展上下文至128K标记的中期训练，并使用合成工业推理数据，最后通过执行验证进行后训练。我们在14个主流通用代码基准和9个涵盖4个专门领域的工业基准上进行了广泛评估。结果表明，InCoder-32B在通用任务上表现出色，同时在工业领域建立了强大的开源基准。

Summary / 总结

InCoder-32B is a 32B-parameter code foundation model designed for industrial scenarios that require reasoning about hardware semantics and strict resource constraints. It is trained from scratch in multiple stages: general code pre-training, curated industrial code annealing, mid-training that extends context from 8K to 128K tokens with synthetic industrial reasoning data, and post-training with execution-grounded verification. InCoder-32B performs competitively on general code benchmarks and sets strong open-source baselines in industrial domains such as chip design, GPU kernel optimization, embedded systems, and compiler optimization.

InCoder-32B 是一个 32B 参数的代码基础模型，旨在处理需要硬件推理和严格资源约束的工业场景。它通过多阶段训练过程进行训练，包括通用代码预训练、工业代码优化以及合成的工业推理数据。InCoder-32B 在通用代码基准测试中表现出色，并在芯片设计、GPU 内核优化、嵌入式系统和编译器优化等工业领域建立了强大的开源基准。

Conservative Continuous-Time Treatment Optimization

Authors: Nora Schneider, Georg Manten, Niki Kilbertus

First: 2026-03-17T17:01:23+00:00 · Latest: 2026-03-17T17:01:23+00:00

Abs · PDF · Code1 · Code2

Abstract

We develop a conservative continuous-time stochastic control framework for treatment optimization from irregularly sampled patient trajectories. The unknown patient dynamics are modeled as a controlled stochastic differential equation with treatment as a continuous-time control. Naive model-based optimization can exploit model errors and propose out-of-support controls, so optimizing the estimated dynamics may not optimize the true dynamics. To limit extrapolation, we add a consistent signature-based MMD regularizer on path space that penalizes treatment plans whose induced trajectory distribution deviates from observed trajectories. The resulting objective minimizes a computable upper bound on the true cost. Experiments on benchmark datasets show improved robustness and performance compared to non-conservative baselines.

中文标题/摘要

标题：保守的连续时间治疗优化

我们开发了一种保守的连续时间随机控制框架，用于从不规则采样的患者轨迹中进行治疗优化。未知的患者动力学被建模为一个受控的随机微分方程，其中治疗作为连续时间的控制。基于模型的优化可能会利用模型误差并提出超出支持范围的控制，因此优化估计的动力学可能不会优化真实的动力学。为了限制外推，我们在路径空间上添加了一个一致的基于签名的MMD正则化项，惩罚那些诱导轨迹分布与观察到的轨迹相偏离的治疗计划。结果的目标函数最小化真实成本的可计算上界。基准数据集上的实验表明，与非保守的基线相比，该方法具有更好的稳健性和性能。

Summary / 总结

The research aims to optimize treatment from irregularly sampled patient trajectories by developing a conservative continuous-time stochastic control framework. The method models patient dynamics as a controlled stochastic differential equation with treatment as a continuous-time control, and adds a signature-based MMD regularizer that penalizes treatment plans whose induced trajectories deviate from the observed ones, which limits extrapolation. Experiments on benchmark datasets show improved robustness and performance compared to non-conservative baselines.

研究旨在通过开发保守的连续时间随机控制框架来优化具有不规则采样轨迹的患者的治疗。方法使用受控随机微分方程来建模患者动力学，其中治疗作为连续时间控制，并包含一个正则化项，该项惩罚与观察到的轨迹偏差的治疗计划。实验表明，与非保守方法相比，该方法在基准数据集上提高了鲁棒性和性能。
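The conservative objective described above adds an MMD penalty to the estimated cost. The sketch below shows the general shape of such a penalty. It uses a plain RBF kernel on fixed-length feature vectors as a stand-in for the paper's signature-based kernel on path space; the names `mmd_squared`, `conservative_objective`, the weight `lam`, and the use of the squared, biased estimator are all assumptions made for illustration.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian RBF kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    def mean_kernel(a, b):
        total = sum(rbf_kernel(p, q, bandwidth) for p in a for q in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def conservative_objective(estimated_cost, induced_paths, observed_paths, lam=1.0):
    """Estimated cost plus a penalty for leaving the support of observed data.

    `induced_paths` and `observed_paths` are hypothetical path summaries
    (e.g. truncated signature features); here they are plain vectors.
    """
    return estimated_cost + lam * mmd_squared(induced_paths, observed_paths)


if __name__ == "__main__":
    observed = [[0.0, 0.0], [0.0, 1.0]]
    far_plan = [[3.0, 3.0], [3.0, 4.0]]
    print(conservative_objective(1.0, observed, observed))  # no penalty
    print(conservative_objective(1.0, far_plan, observed))  # penalized
```

A treatment plan whose induced trajectories match the observed ones incurs no penalty, while a plan that drives trajectories far from the data is charged in proportion to `lam`.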

CFM: Language-aligned Concept Foundation Model for Vision

Authors: Kai Wittenmayer, Sukrut Rao, Amin Parchami-Araghi, Bernt Schiele, Jonas Fischer

First: 2026-01-20T09:57:26+00:00 · Latest: 2026-03-17T16:58:24+00:00

Comments: 53 pages, 29 figures, 4 tables

Abs · PDF · Code1 · Code2 · Code3

Abstract

Language-aligned vision foundation models perform strongly across diverse downstream tasks. Yet their learned representations remain opaque, making their decision-making difficult to interpret. Recent work decomposes these representations into human-interpretable concepts, but provides poor spatial grounding and is limited to image classification tasks. In this work, we propose CFM, a language-aligned concept foundation model for vision that provides fine-grained concepts, which are human-interpretable and spatially grounded in the input image. When paired with a foundation model with strong semantic representations, we get explanations for any of its downstream tasks. Examining local co-occurrence dependencies of concepts allows us to define concept relationships through which we improve concept naming and obtain richer explanations. On benchmark data, we show that CFM provides performance on classification, segmentation, and captioning that is competitive with opaque foundation models while providing fine-grained, high quality concept-based explanations. Code at https://github.com/kawi19/CFM.

中文标题/摘要

标题：CFM：面向视觉的语言对齐概念基础模型

面向视觉的语言对齐基础模型在多种下游任务中表现出色。然而，它们学习到的表示仍然不透明，使得解释其决策过程变得困难。近期工作将这些表示分解为可由人类理解的概念，但这些概念缺乏空间定位，并且仅限于图像分类任务。在本文中，我们提出了一种面向视觉的语言对齐概念基础模型CFM，该模型提供了细粒度的概念，这些概念是可由人类理解且在输入图像中具有空间定位的。当与具有强大语义表示的基础模型配对时，我们能够为任何下游任务提供解释。通过检查概念的局部共现依赖关系，我们能够定义概念关系，从而改进概念命名并获得更丰富的解释。在基准数据集上，我们展示了CFM在分类、分割和描述方面的性能与不透明的基础模型相当，同时提供了细粒度、高质量的概念基础解释。代码见https://github.com/kawi19/CFM。

Summary / 总结

The research aims to improve the interpretability of language-aligned vision foundation models by proposing CFM, which decomposes representations into fine-grained, spatially grounded concepts. CFM enhances the explanations for various downstream tasks by defining concept relationships through local co-occurrence dependencies. Experimental results show that CFM achieves competitive performance in classification, segmentation, and captioning while providing high-quality concept-based explanations.

研究旨在通过引入CFM，一种语言对齐的概念基础模型，提高视觉基础模型的可解释性，该模型为视觉任务提供细粒度且空间上与输入图像对齐的概念。CFM通过局部共现依赖关系定义概念关系，改进概念命名并提供更丰富的解释。实验表明，CFM在分类、分割和生成描述任务上与不透明的基础模型竞争，同时提供高质量的概念基础解释。

IOSVLM: A 3D Vision-Language Model for Unified Dental Diagnosis from Intraoral Scans

Authors: Huimin Xiong, Zijie Meng, Tianxiang Hu, Chenyi Zhou, Yang Feng, Zuozhu Liu

First: 2026-03-17T16:57:02+00:00 · Latest: 2026-03-17T16:57:02+00:00

Abs · PDF · Code1 · Code2

Abstract

3D intraoral scans (IOS) are increasingly adopted in routine dentistry due to abundant geometric evidence, and unified multi-disease diagnosis is desirable for clinical documentation and communication. While recent works introduce dental vision-language models (VLMs) to enable unified diagnosis and report generation on 2D images or multi-view images rendered from IOS, they do not fully leverage native 3D geometry. Such work is necessary and also challenging, due to: (i) heterogeneous scan forms and the complex IOS topology, (ii) multi-disease co-occurrence with class imbalance and fine-grained morphological ambiguity, (iii) limited paired 3D IOS-text data. Thus, we present IOSVLM, an end-to-end 3D VLM that represents scans as point clouds and follows a 3D encoder-projector-LLM design for unified diagnosis and generative visual question-answering (VQA), together with IOSVQA, a large-scale multi-source IOS diagnosis VQA dataset comprising 19,002 cases and 249,055 VQA pairs over 23 oral diseases and heterogeneous scan types. To address the distribution gap between color-free IOS data and color-dependent 3D pre-training, we propose a geometry-to-chromatic proxy that stabilizes fine-grained geometric perception and cross-modal alignment. A two-stage curriculum training strategy further enhances robustness. IOSVLM consistently outperforms strong baselines, achieving gains of at least +9.58% macro accuracy and +1.46% macro F1, indicating the effectiveness of direct 3D geometry modeling for IOS-based diagnosis.

中文标题/摘要

标题：IOSVLM：一种基于内窥三维扫描的统一牙科诊断视觉语言模型

三维内窥扫描（IOS）由于丰富的几何证据在常规牙科中越来越受欢迎，统一的多病种诊断对于临床记录和沟通是必要的。虽然最近的工作引入了牙科视觉语言模型（VLMs）以在2D图像或从IOS生成的多视角图像上实现统一的诊断和报告生成，但它们并未充分利用原生的3D几何结构。由于：（i）扫描形式的异质性和复杂的IOS拓扑结构，（ii）多种疾病共存伴随类别不平衡和细微形态的模糊性，（iii）3D IOS文本配对数据有限，因此有必要且具有挑战性。我们提出了IOSVLM，这是一种端到端的3D VLM，将扫描表示为点云，并采用3D编码器-投影器-LLM设计，用于统一诊断和生成视觉问答（VQA）。同时，我们还构建了IOSVQA，这是一个包含19,002个病例和249,055个VQA配对的大规模多源IOS诊断VQA数据集，覆盖23种口腔疾病和多种扫描类型。为解决无色IOS数据与依赖颜色的3D预训练之间的分布差距，我们提出了一种几何到色彩的代理，以稳定细微的几何感知和跨模态对齐。两阶段的课程训练策略进一步增强了鲁棒性。IOSVLM在所有基线模型上均表现出色，宏观准确率提高了至少9.58%，宏观F1提高了1.46%，表明直接3D几何建模对于基于IOS的诊断的有效性。

Summary / 总结

The research aims to develop a 3D vision-language model (IOSVLM) for unified dental diagnosis using 3D intraoral scans (IOS), addressing challenges such as heterogeneous scan forms, multi-disease co-occurrence, and limited paired data. The model uses point clouds and a 3D encoder-projector-LLM design, and includes a large-scale dataset (IOSVQA) with 19,002 cases and 249,055 VQA pairs. The model outperforms strong baselines, demonstrating the effectiveness of direct 3D geometry modeling for IOS-based diagnosis with gains of at least +9.58% macro accuracy and +1.46% macro F1.

研究旨在开发一种3D视觉语言模型（IOSVLM），用于基于3D口腔扫描（IOS）的统一牙科诊断，解决异构扫描形式、多病种共存和数据有限等挑战。该模型采用点云表示和3D编码器-投影器-大型语言模型（LLM）设计，并使用大规模数据集（IOSVQA）进行训练。引入了几何到色彩的代理和两阶段课程训练策略以提高性能。实验结果表明，IOSVLM在强基线模型上表现出色，显著提高了准确率和F1分数。

Anticipatory Planning for Multimodal AI Agents

Authors: Yongyuan Liang, Shijie Zhou, Yu Gu, Hao Tan, Gang Wu, Franck Dernoncourt, Jihyung Kil, Ryan A. Rossi, Ruiyi Zhang

Venue: CVPR 2026

First: 2026-03-17T16:55:11+00:00 · Latest: 2026-03-17T16:55:11+00:00

Comments: Published at CVPR 2026 Findings Track

Abs · PDF · Code1 · Code2

Abstract

Recent advances in multimodal agents have improved computer-use interaction and tool-usage, yet most existing systems remain reactive, optimizing actions in isolation without reasoning about future states or long-term goals. This limits planning coherence and prevents agents from reliably solving high-level, multi-step tasks. We introduce TraceR1, a two-stage reinforcement learning framework that explicitly trains anticipatory reasoning by forecasting short-horizon trajectories before execution. The first stage performs trajectory-level reinforcement learning with rewards that enforce global consistency across predicted action sequences. The second stage applies grounded reinforcement fine-tuning, using execution feedback from frozen tool agents to refine step-level accuracy and executability. TraceR1 is evaluated across seven benchmarks, covering online computer-use, offline computer-use benchmarks, and multimodal tool-use reasoning tasks, where it achieves substantial improvements in planning stability, execution robustness, and generalization over reactive and single-stage baselines. These results show that anticipatory trajectory reasoning is a key principle for building multimodal agents that can reason, plan, and act effectively in complex real-world environments.

中文标题/摘要

标题：多模态AI代理的前瞻性规划

近期多模态代理的进步提高了计算机使用交互和工具使用的效果，但大多数现有系统仍处于反应性状态，仅在孤立优化行动而不考虑未来状态或长期目标。这限制了规划的一致性，阻止代理可靠地解决高阶、多步骤任务。我们引入了TraceR1，这是一种两阶段强化学习框架，通过在执行前预测短期轨迹来明确训练前瞻性推理。第一阶段执行轨迹级强化学习，奖励机制确保预测动作序列的全局一致性。第二阶段应用基于执行反馈的强化微调，使用冻结工具代理的执行反馈来细化步骤级的准确性和可执行性。TraceR1在七个基准测试中进行了评估，涵盖了在线计算机使用、离线计算机使用基准和多模态工具使用推理任务，结果显示其在规划稳定性、执行稳健性和泛化能力方面显著优于反应性和单阶段基线。这些结果表明，前瞻性轨迹推理是构建能够在复杂现实环境中有效推理、规划和行动的多模态代理的关键原则。

LANCE: Low Rank Activation Compression for Efficient On-Device Continual Learning

Authors: Marco Paul E. Apolinario, Kaushik Roy

First: 2025-09-25T21:33:40+00:00 · Latest: 2026-03-17T16:51:34+00:00

Comments: 26 pages, 6 figures

Abs · PDF · Code1 · Code2

Abstract

On-device learning is essential for personalization, privacy, and long-term adaptation in resource-constrained environments. Achieving this requires efficient learning, both fine-tuning existing models and continually acquiring new tasks without catastrophic forgetting. Yet both settings are constrained by high memory cost of storing activations during backpropagation. Existing activation compression methods reduce this cost but rely on repeated low-rank decompositions, introducing computational overhead. Also, such methods have not been explored for continual learning. We propose LANCE (Low-rank Activation Compression), a framework that performs one-shot higher-order Singular Value Decomposition (SVD) to obtain a reusable low-rank subspace for activation projection. This eliminates repeated decompositions, reducing both memory and computation. Moreover, fixed low-rank subspaces further enable on-device continual learning by allocating tasks to orthogonal subspaces without storing large task-specific matrices. Experiments show that LANCE reduces activation storage up to 250$\times$ while maintaining accuracy comparable to full backpropagation on CIFAR-10/100, Oxford-IIIT Pets, Flowers102, and CUB-200 datasets. On continual learning benchmarks (Split CIFAR-100, Split MiniImageNet, 5-Datasets), it performs competitively with orthogonal gradient projection methods at a fraction of the memory cost. These results position LANCE as a practical and scalable solution for efficient fine-tuning and continual learning on edge devices.

中文标题/摘要

标题：LANCE：低秩激活压缩以实现高效的设备端持续学习

设备端学习对于资源受限环境中的个性化、隐私保护和长期适应至关重要。实现这一点需要高效的训练，既包括对现有模型的微调，也包括不断获取新任务而不发生灾难性遗忘。然而，这两种设置都受到反向传播过程中存储激活数据的高内存成本的限制。现有的激活压缩方法可以降低这种成本，但它们依赖于重复的低秩分解，引入了计算开销。此外，这些方法尚未被探索用于持续学习。我们提出了LANCE（低秩激活压缩），这是一种框架，通过一次性的高阶奇异值分解（SVD）获得可重用的低秩子空间，用于激活投影。这消除了重复的分解，减少了内存和计算量。此外，固定的低秩子空间还使设备端的持续学习成为可能，通过将任务分配到正交子空间而不存储特定于任务的大矩阵来分配任务。实验表明，LANCE在CIFAR-10/100、Oxford-IIIT Pets、Flowers102和CUB-200数据集上将激活存储减少至最多250倍，同时保持与完整反向传播相当的准确性。在持续学习基准测试（Split CIFAR-100、Split MiniImageNet、5-数据集）中，它在内存成本的一小部分下与正交梯度投影方法竞争。这些结果将LANCE定位为在边缘设备上实现高效微调和持续学习的实用且可扩展的解决方案。

Summary / 总结

LANCE is a framework for efficient on-device fine-tuning and continual learning that uses a one-shot higher-order SVD to obtain a reusable low-rank subspace for activation projection, reducing both memory and computation. It reduces activation storage by up to 250× while maintaining accuracy comparable to full backpropagation on CIFAR-10/100, Oxford-IIIT Pets, Flowers102, and CUB-200. On continual learning benchmarks, LANCE performs competitively with orthogonal gradient projection methods at a much lower memory cost.

LANCE 是一种用于高效边缘设备连续学习的框架，通过一次性的高阶奇异值分解获得可重复使用的低秩子空间进行激活投影，减少内存和计算。实验表明，LANCE 可将激活存储减少多达 250 倍，同时在各种数据集上保持与全反向传播相当的准确性，并在连续学习基准测试中以显著较低的内存成本与正交梯度投影方法竞争。
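The memory saving in LANCE comes from storing activations in a fixed low-rank subspace rather than at full width. The sketch below illustrates only that projection step. The subspace here is built by Gram-Schmidt on hand-picked directions, which is a stand-in for the paper's one-shot higher-order SVD; the helper names and the toy dimensions are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(directions):
    """Gram-Schmidt orthonormalization (stand-in for a one-shot SVD basis)."""
    basis = []
    for d in directions:
        v = list(d)
        for u in basis:
            coeff = dot(v, u)
            v = [vi - coeff * ui for vi, ui in zip(v, u)]
        norm = math.sqrt(dot(v, v))
        if norm > 1e-12:  # skip directions already in the span
            basis.append([vi / norm for vi in v])
    return basis


def compress(activation, basis):
    """Store r coefficients instead of the d-dimensional activation."""
    return [dot(activation, u) for u in basis]


def reconstruct(coeffs, basis):
    """Map stored coefficients back to the original activation space."""
    dim = len(basis[0])
    return [sum(c * u[i] for c, u in zip(coeffs, basis)) for i in range(dim)]


if __name__ == "__main__":
    basis = orthonormal_basis([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])
    activation = [2.0, 2.0, 3.0, 3.0]
    stored = compress(activation, basis)
    print(len(activation) / len(stored))  # storage ratio d / r
    print(reconstruct(stored, basis))
```

Because the basis is computed once and reused, each training step only pays for the projection, and the storage ratio is the activation width divided by the subspace rank.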

LUMINA: A Multi-Vendor Mammography Benchmark with Energy Harmonization Protocol

Authors: Hongyi Pan, Gorkem Durak, Halil Ertugrul Aktas, Andrea M. Bejar, Baver Tutun, Emre Uysal, Ezgi Bulbul, Mehmet Fatih Dogan, Berrin Erok, Berna Akkus Yildirim, Sukru Mehmet Erturk, Ulas Bagci

Venue: CVPR 2026

First: 2026-03-15T22:41:40+00:00 · Latest: 2026-03-17T16:50:59+00:00

Comments: This paper was accepted to CVPR 2026

Abs · PDF · Code1 · Code2

Abstract

Publicly available full-field digital mammography (FFDM) datasets remain limited in size, clinical annotations, and vendor diversity, hindering the development of robust models. We introduce LUMINA, a curated, multi-vendor FFDM dataset that explicitly encodes acquisition energy and vendor metadata to capture clinically relevant appearance variations often overlooked in existing benchmarks. This dataset contains 1824 images from 468 patients (960 benign, 864 malignant), with pathology-confirmed labels, BI-RADS assessments, and breast-density annotations. LUMINA spans six acquisition systems and includes both high- and low-energy imaging styles, enabling systematic analysis of vendor- and energy-induced domain shifts. To address these variations, we propose a foreground-only pixel-space alignment method (''energy harmonization'') that maps images to a low-energy reference while preserving lesion morphology. We benchmark CNN and transformer models on three clinically relevant tasks: diagnosis (benign vs. malignant), BI-RADS classification, and density estimation. Two-view models consistently outperform single-view models. EfficientNet-B0 achieves an AUC of 93.54% for diagnosis, while Swin-T achieves the best macro-AUC of 89.43% for density prediction. Harmonization improves performance across architectures and produces more localized Grad-CAM responses. Overall, LUMINA provides (1) a vendor-diverse benchmark and (2) a model-agnostic harmonization framework for reliable and deployable mammography AI.

中文标题/摘要

标题：LUMINA：一种带有能量谐波协议的多供应商乳腺X线摄影基准

公开的全视野数字乳腺X线摄影（FFDM）数据集在大小、临床注释和供应商多样性方面仍然有限，阻碍了稳健模型的发展。我们介绍了LUMINA，这是一个经过精心策划、多供应商的FFDM数据集，明确编码了采集能量和供应商元数据，以捕捉现有基准中经常被忽视的临床相关外观变化。该数据集包含来自468名患者的1824张图像（960例良性，864例恶性），附有病理确认标签、BI-RADS评估和乳腺密度注释。LUMINA涵盖了六种采集系统，包括高能和低能成像风格，使系统分析供应商和能量引起的领域变化成为可能。为了解决这些变化，我们提出了一种仅前景像素空间对齐方法（“能量谐波”），将图像映射到低能参考，同时保留病灶形态。我们在三个临床相关任务上对CNN和变压器模型进行了基准测试：诊断（良性 vs. 恶性）、BI-RADS分类和密度估计。双视角模型始终优于单视角模型。EfficientNet-B0在诊断任务上达到了93.54%的AUC，而Swin-T在密度预测任务上获得了最佳的宏AUC为89.43%。谐波改善了各种架构的性能，并产生了更局部化的Grad-CAM响应。总体而言，LUMINA提供了（1）供应商多样化的基准和（2）一种模型无关的谐波框架，以实现可靠的乳腺X线摄影AI部署。

Summary / 总结

LUMINA is a multi-vendor FFDM dataset that includes 1824 images from 468 patients with detailed clinical annotations. It addresses the limitations of existing benchmarks by explicitly encoding acquisition energy and vendor metadata. The authors propose an energy harmonization method to align images to a low-energy reference while preserving lesion morphology. They benchmark CNN and transformer models on diagnosis, BI-RADS classification, and density estimation tasks, finding that two-view models outperform single-view models and that harmonization improves performance across architectures. EfficientNet-B0 and Swin-T achieve high AUC scores for diagnosis and density prediction, respectively.

LUMINA是一个包含468名患者1824张图像的多厂商FFDM数据集，附有详细的临床注释。它解决了现有基准中缺乏厂商多样性和能量变化的问题。作者提出了一种能量对齐方法，将图像调整到低能量参考状态，同时保留病灶形态。他们在诊断、BI-RADS分类和密度估计任务上对CNN和变压器模型进行了基准测试，发现双视角模型优于单视角模型，并且对齐提高了各种架构的性能。EfficientNet-B0和Swin-T分别在诊断和密度预测任务上取得了高AUC分数。
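The abstract describes energy harmonization as a foreground-only pixel-space alignment to a low-energy reference, without giving the mapping. As a loose sketch of the foreground-only idea, the code below matches the mean and standard deviation of foreground pixels to reference statistics and leaves background pixels untouched. The threshold-based foreground mask, the affine mean/std matching, and every name and parameter are assumptions, not the LUMINA method itself.

```python
import statistics


def harmonize_foreground(image, reference_mean, reference_std, background_threshold=0.0):
    """Affinely rescale foreground pixels to reference statistics.

    `image` is a list of rows of intensities. Pixels at or below
    `background_threshold` are treated as background (an assumption)
    and are copied through unchanged.
    """
    foreground = [p for row in image for p in row if p > background_threshold]
    if len(foreground) < 2:
        return [list(row) for row in image]

    mean = statistics.fmean(foreground)
    std = statistics.pstdev(foreground)
    if std == 0.0:
        return [list(row) for row in image]

    harmonized = []
    for row in image:
        harmonized.append([
            (p - mean) / std * reference_std + reference_mean
            if p > background_threshold else p
            for p in row
        ])
    return harmonized


if __name__ == "__main__":
    image = [[0, 0, 0], [0, 2, 4], [0, 6, 8]]
    for row in harmonize_foreground(image, reference_mean=10.0, reference_std=1.0):
        print(row)
```

An affine map is monotone, so the relative ordering of intensities inside the breast region is kept, which is one simple way to avoid distorting local structure while shifting the overall appearance.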

Mamba2D: A Natively Multi-Dimensional State-Space Model for Vision Tasks

Authors: Enis Baty, Alejandro Hernández Díaz, Rebecca Davidson, Chris Bridges, Simon Hadfield

First: 2024-12-20T18:50:36+00:00 · Latest: 2026-03-17T16:43:00+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

State-Space Models (SSMs) have emerged as an efficient alternative to transformers, yet existing visual SSMs retain deeply ingrained biases from their origins in natural language processing. In this paper, we address these limitations by introducing M2D-SSM, a ground-up re-derivation of selective state-space techniques for multidimensional data. Unlike prior works that apply 1D SSMs directly to images through arbitrary rasterised scanning, our M2D-SSM employs a single 2D scan that factors in both spatial dimensions natively. On ImageNet-1K classification, M2D-T achieves 84.0% top-1 accuracy with only 27M parameters, surpassing all prior SSM-based vision models at that size. M2D-S further achieves 85.3%, establishing state-of-the-art results among SSM-based architectures. Across downstream tasks, Mamba2D achieves 52.2 box AP on MS-COCO object detection (3$\times$ schedule) and 51.7 mIoU on ADE20K segmentation, demonstrating strong generalisation and efficiency at scale. Source code is available at https://github.com/cocoalex00/Mamba2D.

中文标题/摘要

标题：Mamba2D：一种原生多维状态空间模型用于视觉任务

状态空间模型（SSMs）已成为变压器的高效替代方案，但现有的视觉SSMs仍然保留了其自然语言处理起源中的深层偏见。在本文中，我们通过引入M2D-SSM，从头重新推导出适用于多维数据的选择性状态空间技术来解决这些局限性。与先前直接将1D SSM应用于图像并通过任意栅格扫描的方法不同，我们的M2D-SSM采用单一的2D扫描，能够原生地考虑两个空间维度。在ImageNet-1K分类任务上，M2D-T仅使用27M参数实现了84.0%的top-1准确率，超过了所有先前基于SSM的视觉模型。M2D-S进一步实现了85.3%的准确率，成为基于SSM架构的最新成果。在下游任务中，Mamba2D在MS-COCO对象检测上实现了52.2的box AP（3倍训练周期），在ADE20K分割上实现了51.7的mIoU，展示了其在大规模下的强大泛化能力和效率。源代码可在https://github.com/cocoalex00/Mamba2D获取。

Summary / 总结

This paper introduces Mamba2D, a multi-dimensional state-space model designed for vision tasks that addresses limitations of existing visual SSMs. Unlike previous methods that apply 1D SSMs to images through arbitrary rasterised scanning, Mamba2D uses a single 2D scan that natively incorporates both spatial dimensions. On ImageNet-1K, the M2D-T variant achieves 84.0% top-1 accuracy with 27M parameters, surpassing prior SSM-based vision models at that size, and the M2D-S variant reaches 85.3%, the best reported result among SSM-based architectures. On downstream tasks, Mamba2D achieves 52.2 box AP on MS-COCO object detection and 51.7 mIoU on ADE20K segmentation.

本文提出了Mamba2D，这是一种专为视觉任务设计的新型多维状态空间模型，解决了现有视觉SSM的局限性。不同于以往通过任意栅格化扫描将1D SSM应用于图像的方法，Mamba2D 使用单一的2D扫描，能够原生地处理两个空间维度。在ImageNet-1K分类任务上，Mamba2D 仅用27M参数就达到了84.0%的top-1准确率，超越了之前的SSM基线模型。Mamba2D 进一步实现了85.3%的M2D-S准确率，成为SSM基线模型中的新标杆。在下游任务中，Mamba2D 展现了强大的泛化能力和效率，在MS-COCO目标检测任务上达到了52.2的box AP，在ADE20K分割任务上达到了51.7的mIoU。
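To make the contrast between rasterised 1D scanning and a native 2D scan concrete, the sketch below runs a toy linear recurrence in which each hidden state depends on both its upper and its left neighbour. This is not the M2D-SSM formulation, which is selective and derived in the paper; the fixed scalar coefficients `a_row` and `a_col` and the function name are assumptions chosen only to show the two-dimensional dependency structure.

```python
def scan_2d(x, a_row=0.5, a_col=0.5):
    """Toy 2D linear recurrence over a grid of scalars.

    h[i][j] = a_row * h[i-1][j] + a_col * h[i][j-1] + x[i][j]
    Out-of-grid neighbours are treated as zero.
    """
    height, width = len(x), len(x[0])
    h = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            up = h[i - 1][j] if i > 0 else 0.0
            left = h[i][j - 1] if j > 0 else 0.0
            h[i][j] = a_row * up + a_col * left + x[i][j]
    return h


if __name__ == "__main__":
    for row in scan_2d([[1.0, 1.0], [1.0, 1.0]]):
        print(row)
```

In a rasterised 1D scan, a pixel's vertical neighbour is a full image row away in the sequence; in the recurrence above it contributes directly in a single step.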
