arXiv 论文速递

Snapshot: 20260226_0405

Recursive Self-Aggregation Unlocks Deep Thinking in Large Language Models

Authors: Siddarth Venkatraman, Vineet Jain, Sarthak Mittal, Vedant Shah, Johan Obando-Ceron, Yoshua Bengio, Brian R. Bartoldson, Bhavya Kailkhura, Guillaume Lajoie, Glen Berseth, Nikolay Malkin, Moksh Jain

First: 2025-09-30T17:58:03+00:00 · Latest: 2026-02-24T18:58:30+00:00

Comments: 23 pages, 10 figures. Project page: https://rsa-llm.github.io/

Abstract

Test-time scaling methods improve the capabilities of large language models (LLMs) by increasing the amount of compute used during inference to make a prediction. Inference-time compute can be scaled in parallel by choosing among multiple independent solutions or sequentially through self-refinement. We propose Recursive Self-Aggregation (RSA), a test-time scaling method inspired by evolutionary methods that combines the benefits of both parallel and sequential scaling. Each step of RSA refines a population of candidate reasoning chains through aggregation of subsets to yield a population of improved solutions, which are then used as the candidate pool for the next iteration. Empirically, RSA delivers substantial performance gains with increasing compute budgets across diverse tasks, model families and sizes. Notably, RSA with Gemini 3 Flash attains performance near the top of the ARC-AGI-2 public leaderboard. RSA also enables Qwen3-4B-Instruct-2507 to achieve competitive performance with larger reasoning models, including DeepSeek-R1 and o3-mini (high), outperforming purely parallel and sequential scaling strategies across AIME-25, HMMT-25, Reasoning Gym, LiveCodeBench-v6, and SuperGPQA. We further propose a novel aggregation-aware reinforcement learning approach that yields significant performance gains by training the model to combine solutions.

中文标题/摘要

标题：递归自我聚合解锁大型语言模型的深度思考

测试时扩展方法通过增加推理阶段用于预测的计算量来提升大型语言模型（LLMs）的能力。推理时计算量既可以通过在多个独立解中进行选择来并行扩展，也可以通过自我改进来顺序扩展。我们提出递归自我聚合（RSA），这是一种受进化方法启发的测试时扩展方法，兼具并行扩展和顺序扩展的优点。RSA的每一步都通过聚合候选推理链的子集来改进整个候选群体，得到一批更优的解，并将其作为下一轮迭代的候选池。实验表明，随着计算预算的增加，RSA在不同任务、模型家族和模型规模上都带来了显著的性能提升。值得注意的是，搭配Gemini 3 Flash的RSA取得了接近ARC-AGI-2公开排行榜榜首的成绩。RSA还使Qwen3-4B-Instruct-2507达到了可与更大的推理模型（包括DeepSeek-R1和o3-mini (high)）相竞争的性能，并在AIME-25、HMMT-25、Reasoning Gym、LiveCodeBench-v6和SuperGPQA上优于纯并行和纯顺序的扩展策略。我们还提出了一种新的聚合感知强化学习方法，通过训练模型组合多个解来获得显著的性能提升。

Summary / 总结

The paper introduces Recursive Self-Aggregation (RSA), a test-time scaling method for large language models (LLMs) that combines parallel and sequential scaling benefits. RSA refines a population of candidate reasoning chains through aggregation, improving performance across various tasks and models. RSA achieves near-top performance on the ARC-AGI-2 leaderboard with Gemini 3 Flash and outperforms parallel and sequential scaling strategies with Qwen3-4B-Instruct-2507 on multiple benchmarks.

论文提出了递归自我聚合（RSA），这是一种兼具并行扩展和顺序扩展优势的大型语言模型（LLM）测试时扩展方法。RSA通过聚合来改进候选推理链群体，在多种任务和模型上提升了性能。搭配Gemini 3 Flash时，RSA在ARC-AGI-2公开排行榜上取得了接近榜首的成绩；搭配Qwen3-4B-Instruct-2507时，RSA在多个基准测试上优于纯并行和纯顺序的扩展策略。
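
The RSA abstract describes a loop: keep a population of candidate solutions, aggregate random subsets into improved candidates, and reuse the result as the next candidate pool. The following is a minimal sketch of that loop, not the authors' implementation. The function name, the parameters (`population_size`, `subset_size`, `steps`), and uniform subset sampling are assumptions made for illustration. In practice `generate` and `aggregate` would both be LLM calls; here they are replaced by toy numeric stand-ins so the block runs on its own.

```python
import random
import statistics
from typing import Callable, List


def recursive_self_aggregation(
    question: str,
    generate: Callable[[str], str],
    aggregate: Callable[[str, List[str]], str],
    population_size: int = 8,
    subset_size: int = 3,
    steps: int = 4,
    seed: int = 0,
) -> List[str]:
    """Return the final candidate population after `steps` aggregation rounds."""
    rng = random.Random(seed)
    # Parallel part: draw an initial population of independent candidates.
    population = [generate(question) for _ in range(population_size)]
    for _ in range(steps):
        # Sequential part: each new candidate is built by aggregating a
        # random subset of the previous population (uniform sampling without
        # replacement is an assumption of this sketch).
        population = [
            aggregate(question, rng.sample(population, subset_size))
            for _ in range(population_size)
        ]
    return population


# Toy stand-ins for the two model calls (hypothetical, no LLM involved):
# candidates are noisy numeric guesses and aggregation takes their median.
_noise = random.Random(1)


def toy_generate(question: str) -> str:
    return str(_noise.gauss(42.0, 10.0))


def toy_aggregate(question: str, candidates: List[str]) -> str:
    return str(statistics.median(float(c) for c in candidates))


final = recursive_self_aggregation("toy question", toy_generate, toy_aggregate)
print(final)
```

With `steps=0` the loop reduces to purely parallel sampling, and with `population_size=1` and `subset_size=1` it reduces to a sequential refinement chain, which is one way to read the abstract's claim that RSA combines both scaling modes.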

Squint: Fast Visual Reinforcement Learning for Sim-to-Real Robotics

Authors: Abdulaziz Almuzairee, Henrik I. Christensen

First: 2026-02-24T18:58:11+00:00 · Latest: 2026-02-24T18:58:11+00:00

Comments: For website and code, see https://aalmuzairee.github.io/squint

Abstract

Visual reinforcement learning is appealing for robotics but expensive -- off-policy methods are sample-efficient yet slow; on-policy methods parallelize well but waste samples. Recent work has shown that off-policy methods can train faster than on-policy methods in wall-clock time for state-based control. Extending this to vision remains challenging, where high-dimensional input images complicate training dynamics and introduce substantial storage and encoding overhead. To address these challenges, we introduce Squint, a visual Soft Actor Critic method that achieves faster wall-clock training than prior visual off-policy and on-policy methods. Squint achieves this via parallel simulation, a distributional critic, resolution squinting, layer normalization, a tuned update-to-data ratio, and an optimized implementation. We evaluate on the SO-101 Task Set, a new suite of eight manipulation tasks in ManiSkill3 with heavy domain randomization, and demonstrate sim-to-real transfer to a real SO-101 robot. We train policies for 15 minutes on a single RTX 3090 GPU, with most tasks converging in under 6 minutes.

中文标题/摘要

标题：Squint：面向仿真到现实机器人的快速视觉强化学习

视觉强化学习对机器人领域很有吸引力，但成本高昂：离策略方法样本效率高但速度慢；在策略方法易于并行但浪费样本。近期研究表明，在基于状态的控制任务中，离策略方法按墙钟时间计算可以比在策略方法训练得更快。将这一结论扩展到视觉领域仍然具有挑战性，因为高维输入图像使训练动态复杂化，并带来大量存储和编码开销。为了解决这些挑战，我们提出了Squint，一种视觉Soft Actor Critic方法，其墙钟训练时间短于先前的视觉离策略和在策略方法。Squint通过并行仿真、分布型评论家（distributional critic）、分辨率眯视（resolution squinting）、层归一化、经过调优的更新数据比（update-to-data ratio）以及优化的实现来做到这一点。我们在SO-101任务集上进行了评估，这是ManiSkill3中一套新的、包含八个操作任务并带有大量域随机化的任务集，并展示了向真实SO-101机器人的仿真到现实迁移。我们在单个RTX 3090 GPU上训练策略15分钟，大多数任务在不到6分钟内收敛。

Summary / 总结

Squint is a visual Soft Actor Critic method that addresses the challenges of visual reinforcement learning in robotics by combining parallel simulation, a distributional critic, and other techniques. It achieves faster wall-clock training than previous methods, converging in under 6 minutes for most tasks on a single RTX 3090 GPU. Squint demonstrates effective sim-to-real transfer on a real SO-101 robot after 15 minutes of training.

Squint 是一种视觉 Soft Actor Critic 方法，旨在加速机器人领域的视觉强化学习。它通过并行仿真、分布型评论家和其他技术来应对高维图像输入的挑战。在单个 RTX 3090 GPU 上，Squint 对策略训练 15 分钟，SO-101 任务集中的大多数任务在不到 6 分钟内收敛，并展示了从仿真到真实 SO-101 机器人的迁移能力。

Multi-Vector Index Compression in Any Modality

Authors: Hanxiang Qin, Alexander Martin, Rohan Jha, Chunsheng Zuo, Reno Kriz, Benjamin Van Durme

First: 2026-02-24T18:57:33+00:00 · Latest: 2026-02-24T18:57:33+00:00

Comments: 12 pages, 4 figures

Abstract

We study efficient multi-vector retrieval for late interaction in any modality. Late interaction has emerged as a dominant paradigm for information retrieval in text, images, visual documents, and videos, but its computation and storage costs grow linearly with document length, making it costly for image-, video-, and audio-rich corpora. To address this limitation, we explore query-agnostic methods for compressing multi-vector document representations under a constant vector budget. We introduce four approaches for index compression: sequence resizing, memory tokens, hierarchical pooling, and a novel attention-guided clustering (AGC). AGC uses an attention-guided mechanism to identify the most semantically salient regions of a document as cluster centroids and to weight token aggregation. Evaluating these methods on retrieval tasks spanning text (BEIR), visual-document (ViDoRe), and video (MSR-VTT, MultiVENT 2.0), we show that attention-guided clustering consistently outperforms other parameterized compression methods (sequence resizing and memory tokens), provides greater flexibility in index size than non-parametric hierarchical clustering, and achieves competitive or improved performance compared to a full, uncompressed index. The source code is available at: github.com/hanxiangqin/omni-col-press.

中文标题/摘要

标题：任意模态下的多向量索引压缩

我们研究任意模态下面向后期交互（late interaction）的高效多向量检索。后期交互已成为文本、图像、视觉文档和视频信息检索的主流范式，但其计算和存储成本随文档长度线性增长，这使其在图像、视频和音频丰富的语料库上代价高昂。为了解决这一限制，我们探索了在固定向量预算下压缩多向量文档表示的查询无关方法。我们介绍了四种索引压缩方法：序列缩放（sequence resizing）、记忆标记（memory tokens）、分层池化（hierarchical pooling）以及一种新颖的注意力引导聚类（AGC）。AGC 使用注意力引导机制来识别文档中语义最显著的区域作为聚类中心，并对标记聚合进行加权。我们在涵盖文本（BEIR）、视觉文档（ViDoRe）和视频（MSR-VTT、MultiVENT 2.0）的检索任务上评估了这些方法，结果显示注意力引导聚类始终优于其他参数化压缩方法（序列缩放和记忆标记），在索引大小上比非参数化的分层聚类更灵活，并且与完整的未压缩索引相比取得了相当或更好的性能。源代码可在 github.com/hanxiangqin/omni-col-press 获取。

Summary / 总结

The paper aims to address the high computational and storage costs of late interaction in information retrieval across modalities such as text, images, visual documents, and videos. It introduces four query-agnostic methods for compressing multi-vector document representations under a constant vector budget: sequence resizing, memory tokens, hierarchical pooling, and attention-guided clustering (AGC). AGC uses an attention-guided mechanism to select the most semantically salient regions of a document as cluster centroids and to weight token aggregation. Experiments on BEIR, ViDoRe, MSR-VTT, and MultiVENT 2.0 show that AGC consistently outperforms the other parameterized methods (sequence resizing and memory tokens), offers more flexibility in index size than non-parametric hierarchical clustering, and achieves competitive or improved performance compared to a full, uncompressed index.

论文旨在解决在文本、图像和视频等多种模态的信息检索中，晚期交互带来的高计算和存储成本问题。它提出了四种多向量文档表示压缩方法：序列缩放、记忆标记、分层池化和注意力引导聚类（AGC）。AGC 使用注意力机制来识别语义上重要的区域并相应地聚合标记。实验结果显示，AGC 在性能上优于其他参数化方法，提供了更大的索引大小灵活性，并且在某些情况下与未压缩的索引相比具有竞争力或更好的性能。
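
To make the compression idea concrete, the sketch below shows late-interaction (MaxSim) scoring and one possible reading of attention-guided clustering: pick the highest-saliency tokens as centroids, assign every token to its most similar centroid, and average each cluster with saliency weights. This is an illustration only. The function names, the use of cosine similarity for assignment, and the form of the `attention` scores are assumptions, and the released omni-col-press code may differ in all of these.

```python
import numpy as np


def maxsim(query: np.ndarray, doc: np.ndarray) -> float:
    """Late-interaction score: each query vector takes its best-matching
    document vector, so cost grows with the number of document vectors."""
    return float((query @ doc.T).max(axis=1).sum())


def attention_guided_clustering(tokens, attention, budget):
    """Compress (n, d) token embeddings to (budget, d).

    tokens:    (n, d) non-zero token embeddings
    attention: (n,) non-negative saliency score per token (assumed given)
    budget:    number of vectors to keep, budget <= n
    """
    tokens = np.asarray(tokens, dtype=float)
    attention = np.asarray(attention, dtype=float)
    # Centroids: the `budget` most salient tokens.
    centroid_idx = np.argsort(-attention)[:budget]
    # Assign every token to its most similar centroid (cosine similarity).
    unit = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    assign = (unit @ unit[centroid_idx].T).argmax(axis=1)
    # Keep each centroid token in its own cluster so no cluster is empty.
    assign[centroid_idx] = np.arange(budget)
    compressed = np.zeros((budget, tokens.shape[1]))
    for c in range(budget):
        members = assign == c
        w = attention[members] + 1e-12  # avoid division by zero
        compressed[c] = (w[:, None] * tokens[members]).sum(axis=0) / w.sum()
    return compressed


rng = np.random.default_rng(0)
doc_tokens = rng.normal(size=(10, 4))
saliency = rng.random(10) + 0.1
small_index = attention_guided_clustering(doc_tokens, saliency, budget=3)
query = rng.normal(size=(2, 4))
print(small_index.shape, maxsim(query, small_index))
```

Because `budget` is a free parameter, the index size can be set to any value up to the document length, which is the flexibility the abstract contrasts with hierarchical clustering.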

Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs

Authors: Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Jiajun Wu, Yejin Choi

First: 2026-02-24T18:55:18+00:00 · Latest: 2026-02-24T18:55:18+00:00

Abstract

Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials where mistakes repeat rather than accumulate into experience. Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: reflection-in-action, where the agent uses test-time scaling to generate and score multiple candidate actions using internal reflections before execution; and reflection-on-action, which uses test-time training to update both its internal reflection model and its action policy based on external reflections after execution. We also include retrospective reflection, allowing the agent to re-evaluate earlier decisions and perform model updates with hindsight for proper long-horizon credit assignment. Experiments on our newly-designed Long-Horizon Household benchmark and MuJoCo Cupboard Fitting benchmark show significant gains over baseline models, with ablative studies validating the complementary roles of reflection-in-action and reflection-on-action. Qualitative analyses, including real-robot trials, highlight behavioral correction through reflection.

中文标题/摘要

标题：从试错中学习：面向具身LLM的反思式测试时规划

具身LLM赋予机器人高层任务推理能力，但它们无法反思哪里出了错以及为什么出错，这使得部署变成一系列相互独立的试验，错误不断重复而不是积累为经验。借鉴人类反思型实践者的做法，我们提出了反思式测试时规划，它结合了两种反思模式：行动中反思（reflection-in-action），即智能体在执行前利用测试时扩展生成多个候选动作，并借助内部反思对其评分；以及行动后反思（reflection-on-action），即在执行后根据外部反思，利用测试时训练同时更新其内部反思模型和动作策略。我们还加入了回顾性反思，使智能体能够重新评估先前的决策，并借助事后信息更新模型，以实现恰当的长时程信用分配。在我们新设计的Long-Horizon Household基准和MuJoCo Cupboard Fitting基准上的实验显示，该方法相比基线模型有显著提升，消融研究验证了行动中反思与行动后反思的互补作用。包括真实机器人试验在内的定性分析突出了通过反思实现的行为纠正。

Summary / 总结

This paper addresses the inability of embodied LLMs to reflect on their mistakes by proposing Reflective Test-Time Planning. The method combines reflection-in-action, where the agent uses test-time scaling to generate and score multiple candidate actions before execution, and reflection-on-action, where the agent uses test-time training to update its internal reflection model and action policy based on external reflections after execution. Retrospective reflection additionally lets the agent re-evaluate earlier decisions with hindsight. Experiments on the new Long-Horizon Household and MuJoCo Cupboard Fitting benchmarks show significant gains over baseline models, with ablation studies confirming that the two reflection modes are complementary. Qualitative analyses, including real-robot trials, show behavioral correction through reflection.

本文针对具身LLM无法反思自身错误的问题，提出了反思式测试时规划。该方法包括行动中反思（在执行前生成多个候选动作并评分）和行动后反思（在执行后根据外部反思更新模型）。此外，回顾性反思允许重新评估先前的决策。新基准上的实验显示出相对基线模型的显著提升，消融研究证实了两种反思模式的互补作用。包括真实机器人试验在内的定性分析表明，反思能够实现行为纠正。

Human Video Generation from a Single Image with 3D Pose and View Control

Authors: Tiantian Wang, Chun-Han Yao, Tao Hu, Mallikarjun Byrasandra Ramalinga Reddy, Ming-Hsuan Yang, Varun Jampani

First: 2026-02-24T18:42:20+00:00 · Latest: 2026-02-24T18:42:20+00:00

Abstract

Recent diffusion methods have made significant progress in generating videos from single images due to their powerful visual generation capabilities. However, challenges persist in image-to-video synthesis, particularly in human video generation, where inferring view-consistent, motion-dependent clothing wrinkles from a single image remains a formidable problem. In this paper, we present Human Video Generation in 4D (HVG), a latent video diffusion model capable of generating high-quality, multi-view, spatiotemporally coherent human videos from a single image with 3D pose and view control. HVG achieves this through three key designs: (i) Articulated Pose Modulation, which captures the anatomical relationships of 3D joints via a novel dual-dimensional bone map and resolves self-occlusions across views by introducing 3D information; (ii) View and Temporal Alignment, which ensures multi-view consistency and alignment between a reference image and pose sequences for frame-to-frame stability; and (iii) Progressive Spatio-Temporal Sampling with temporal alignment to maintain smooth transitions in long multi-view animations. Extensive experiments on image-to-video tasks demonstrate that HVG outperforms existing methods in generating high-quality 4D human videos from diverse human images and pose inputs.

中文标题/摘要

标题：从单张图像生成具有3D姿态和视角控制的人类视频

近期的扩散方法由于其强大的视觉生成能力，在从单张图像生成视频方面取得了显著进展。然而，在图像到视频合成中仍存在挑战，特别是在人类视频生成方面，从单张图像推断视角一致、运动相关的衣物褶皱仍然是一个艰巨的问题。本文中，我们提出了4D人类视频生成（HVG），这是一种能够通过3D姿态和视角控制从单张图像生成高质量、多视角、时空一致的人类视频的潜在视频扩散模型。HVG 通过三个关键设计实现这一点：(i) 骨骼图驱动的姿态调制，通过一种新颖的双维度骨骼图捕捉3D关节的解剖关系，并通过引入3D信息解决多视角中的自遮挡问题；(ii) 视角和时间对齐，确保参考图像与姿态序列之间的多视角一致性，以实现帧到帧的稳定性；(iii) 渐进时空采样与时间对齐，以在长时间多视角动画中保持平滑过渡。在图像到视频任务上的大量实验表明，HVG 在从多样的人类图像和姿态输入生成高质量4D人类视频方面优于现有方法。

Summary / 总结

The research aims to address the challenge of generating high-quality human videos from a single image, focusing on view-consistent and motion-dependent clothing wrinkles. The method, Human Video Generation in 4D (HVG), uses a latent video diffusion model with three key designs: Articulated Pose Modulation, View and Temporal Alignment, and Progressive Spatio-Temporal Sampling. Experiments show that HVG outperforms existing methods in creating multi-view, spatiotemporally coherent human videos from diverse images and pose inputs.

本文解决了从单张图像生成高质量人体视频的挑战，重点关注视图一致性和运动相关的衣物褶皱。作者引入了4D人体视频生成（HVG）模型，该模型采用了三种关键设计：关节姿态调制、视图和时间对齐以及渐进时空采样。实验结果表明，HVG 在从多种人体图像和姿态输入生成多视角、时空一致的人体视频方面优于现有方法。

Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning

Authors: Haoyi Jiang, Liu Liu, Xinjie Wang, Yonghao He, Wei Sui, Zhizhong Su, Wenyu Liu, Xinggang Wang

First: 2026-02-24T18:37:34+00:00 · Latest: 2026-02-24T18:37:34+00:00

Abstract

While Vision-Language Models (VLMs) exhibit exceptional 2D visual understanding, their ability to comprehend and reason about 3D space--a cornerstone of spatial intelligence--remains superficial. Current methodologies attempt to bridge this domain gap either by relying on explicit 3D modalities or by augmenting VLMs with partial, view-conditioned geometric priors. However, such approaches hinder scalability and ultimately burden the language model with the ill-posed task of implicitly reconstructing holistic 3D geometry from sparse cues. In this paper, we argue that spatial intelligence can emerge inherently from 2D vision alone, rather than being imposed via explicit spatial instruction tuning. To this end, we introduce Spa3R, a self-supervised framework that learns a unified, view-invariant spatial representation directly from unposed multi-view images. Spa3R is built upon the proposed Predictive Spatial Field Modeling (PSFM) paradigm, where Spa3R learns to synthesize feature fields for arbitrary unseen views conditioned on a compact latent representation, thereby internalizing a holistic and coherent understanding of the underlying 3D scene. We further integrate the pre-trained Spa3R Encoder into existing VLMs via a lightweight adapter to form Spa3-VLM, effectively grounding language reasoning in a global spatial context. Experiments on the challenging VSI-Bench demonstrate that Spa3-VLM achieves state-of-the-art accuracy of 58.6% on 3D VQA, significantly outperforming prior methods. These results highlight PSFM as a scalable path toward advancing spatial intelligence. Code is available at https://github.com/hustvl/Spa3R.

中文标题/摘要

标题：Spa3R：三维视觉推理的预测空间场建模

尽管视觉语言模型（VLMs）在二维视觉理解方面表现出色，但它们理解和推理三维空间的能力仍停留在表面，而这正是空间智能的基石。当前的方法试图通过依赖显式的三维模态，或为VLMs补充局部的、以视角为条件的几何先验来弥合这一领域差距。然而，这些方法限制了可扩展性，并最终让语言模型承担起从稀疏线索中隐式重建完整三维几何这一不适定任务。本文认为，空间智能可以仅从二维视觉中自然涌现，而不必通过显式的空间指令微调来强加。为此，我们提出了Spa3R，这是一种自监督框架，可直接从无位姿（unposed）的多视角图像中学习统一的、视角不变的空间表示。Spa3R建立在我们提出的预测空间场建模（PSFM）范式之上：Spa3R学习以紧凑的潜在表示为条件，为任意未见视角合成特征场，从而内化对底层三维场景整体而连贯的理解。我们进一步通过轻量级适配器将预训练的Spa3R编码器集成到现有VLMs中，形成Spa3-VLM，有效地将语言推理建立在全局空间上下文之上。在具有挑战性的VSI-Bench上的实验表明，Spa3-VLM在三维VQA上达到了58.6%的最先进准确率，显著优于先前的方法。这些结果表明PSFM是推进空间智能的一条可扩展路径。代码可在https://github.com/hustvl/Spa3R获取。

Summary / 总结

This paper addresses the limitation of Vision-Language Models (VLMs) in understanding 3D space and introduces Spa3R, a self-supervised framework that learns a unified spatial representation from multi-view images. Spa3R uses Predictive Spatial Field Modeling (PSFM) to synthesize feature fields for unseen views, enabling a holistic understanding of 3D scenes. When integrated into existing VLMs, Spa3-VLM achieves state-of-the-art accuracy of 58.6% on 3D VQA, outperforming previous methods.

本文针对视觉语言模型（VLMs）在理解三维空间方面的局限性，提出了Spa3R，这是一种无需显式3D数据即可从多视角图像中学习统一空间表示的自监督框架。Spa3R 使用预测性空间场建模（PSFM）来合成未见视图的特征场，从而实现对底层3D场景的整体和连贯理解。通过轻量级适配器将预训练的Spa3R编码器集成到VLMs中，形成Spa3-VLM，使其在3D VQA上的准确率达到58.6%，在VSI-Bench上显著优于先前的方法。

The Diffusion Duality, Chapter II: Ψ-Samplers and Efficient Curriculum

Authors: Justin Deschenaux, Caglar Gulcehre, Subham Sekhar Sahoo

First: 2026-02-24T18:35:22+00:00 · Latest: 2026-02-24T18:35:22+00:00

Abstract

Uniform-state discrete diffusion models excel at few-step generation and guidance due to their ability to self-correct, making them preferred over autoregressive or Masked diffusion models in these settings. However, their sampling quality plateaus with ancestral samplers as the number of steps increases. We introduce a family of Predictor-Corrector (PC) samplers for discrete diffusion that generalize prior methods and apply to arbitrary noise processes. When paired with uniform-state diffusion, our samplers outperform ancestral sampling on both language and image modeling, achieving lower generative perplexity at matched unigram entropy on OpenWebText and better FID/IS scores on CIFAR10. Crucially, unlike conventional samplers, our PC methods continue to improve with more sampling steps. Taken together, these findings call into question the assumption that Masked diffusion is the inevitable future of diffusion-based language modeling. Beyond sampling, we develop a memory-efficient curriculum for the Gaussian relaxation training phase, reducing training time by 25% and memory by 33% compared to Duo while maintaining comparable perplexity on OpenWebText and LM1B and strong downstream performance. We release code, checkpoints, and a video-tutorial on: https://s-sahoo.com/duo-ch2

中文标题/摘要

标题：扩散的对偶性，第二章：Ψ-采样器与高效课程

均匀状态离散扩散模型由于能够自我纠正，在少步生成和引导（guidance）方面表现出色，因此在这些场景中优于自回归模型或掩码扩散模型。然而，使用祖先采样器时，其采样质量会随着步数增加而趋于停滞。我们提出了一族用于离散扩散的预测-校正（PC）采样器，它推广了先前的方法，并适用于任意噪声过程。与均匀状态扩散结合使用时，我们的采样器在语言建模和图像建模上均优于祖先采样：在OpenWebText上，以相同的一元（unigram）熵取得了更低的生成困惑度；在CIFAR10上取得了更好的FID/IS分数。至关重要的是，与传统采样器不同，我们的PC方法会随着采样步数的增加而持续改进。综合来看，这些发现对“掩码扩散是基于扩散的语言建模的必然未来”这一假设提出了质疑。除了采样之外，我们还为高斯松弛训练阶段设计了一种内存高效的课程，与Duo相比训练时间减少25%、内存减少33%，同时在OpenWebText和LM1B上保持了相当的困惑度以及强劲的下游性能。我们在以下链接发布了代码、检查点和视频教程：https://s-sahoo.com/duo-ch2

Summary / 总结

This paper introduces a family of Predictor-Corrector (PC) samplers for discrete diffusion models, which improve upon ancestral samplers by continuing to enhance sampling quality with more steps. The PC samplers, when paired with uniform-state diffusion, outperform ancestral sampling on both language and image modeling tasks, achieving lower generative perplexity and better FID/IS scores. Additionally, the paper develops a memory-efficient curriculum for Gaussian relaxation training, reducing training time and memory usage while maintaining strong performance on language modeling and downstream tasks.

论文针对均匀状态离散扩散模型在采样步数增加时采样质量趋于停滞的问题，提出了一族预测-校正（PC）采样器。这类采样器在语言和图像建模任务上优于祖先采样，并且随着采样步数的增加持续改进。研究还为高斯松弛训练阶段设计了一种内存高效的训练课程，在减少训练时间和内存使用的同时，保持了语言建模和下游任务上的性能。

Seeing Through Words: Controlling Visual Retrieval Quality with Language Models

Authors: Jianglin Lu, Simon Jenni, Kushal Kafle, Jing Shi, Handong Zhao, Yun Fu

First: 2026-02-24T18:20:57+00:00 · Latest: 2026-02-24T18:20:57+00:00

Abstract

Text-to-image retrieval is a fundamental task in vision-language learning, yet in real-world scenarios it is often challenged by short and underspecified user queries. Such queries are typically only one or two words long, rendering them semantically ambiguous, prone to collisions across diverse visual interpretations, and lacking explicit control over the quality of retrieved images. To address these issues, we propose a new paradigm of quality-controllable retrieval, which enriches short queries with contextual details while incorporating explicit notions of image quality. Our key idea is to leverage a generative language model as a query completion function, extending underspecified queries into descriptive forms that capture fine-grained visual attributes such as pose, scene, and aesthetics. We introduce a general framework that conditions query completion on discretized quality levels, derived from relevance and aesthetic scoring models, so that query enrichment is not only semantically meaningful but also quality-aware. The resulting system provides three key advantages: 1) flexibility, it is compatible with any pretrained vision-language model (VLMs) without modification; 2) transparency, enriched queries are explicitly interpretable by users; and 3) controllability, enabling retrieval results to be steered toward user-preferred quality levels. Extensive experiments demonstrate that our proposed approach significantly improves retrieval results and provides effective quality control, bridging the gap between the expressive capacity of modern VLMs and the underspecified nature of short user queries. Our code is available at https://github.com/Jianglin954/QCQC.

中文标题/摘要

标题：透过文字看世界：利用语言模型控制视觉检索质量

文本到图像检索是视觉语言学习中的基本任务，但在现实场景中，由于用户查询简短且不具体，这一任务常常受到挑战。这类查询通常只有1到2个词，导致其语义模糊，容易在多种视觉解释中发生碰撞，并且缺乏对检索图像质量的明确控制。为了解决这些问题，我们提出了一种新的质量可控检索范式，该范式通过增加上下文细节来丰富简短的查询，并结合图像质量的明确概念。我们的核心思想是利用生成语言模型作为查询扩展函数，将不明确的查询扩展为描述性形式，捕捉诸如姿态、场景和美学等精细的视觉属性。我们提出了一种通用框架，该框架根据相关性和美学评分模型离散的质量级别条件查询扩展，使得查询丰富不仅具有语义意义，而且具有质量意识。由此产生的系统提供了三个关键优势：1）灵活性，它与任何预训练的视觉语言模型（VLMs）兼容，无需修改；2）透明度，增强的查询可以由用户明确解释；3）可控性，使检索结果能够朝向用户偏好的质量水平进行调整。大量实验表明，我们提出的方法显著提高了检索结果，并提供了有效的质量控制，弥合了现代VLMs的表达能力和简短用户查询的不具体性之间的差距。我们的代码可在https://github.com/Jianglin954/QCQC/ 获取。

Summary / 总结

The paper addresses the challenge of text-to-image retrieval with short and ambiguous queries by proposing a quality-controllable retrieval paradigm. It uses a generative language model to extend underspecified queries into more descriptive forms, incorporating explicit notions of image quality. The framework conditions query completion on discretized quality levels, derived from relevance and aesthetic scoring models, ensuring that the enriched queries are both semantically meaningful and quality-aware. Experiments show that this approach significantly improves retrieval results and provides effective quality control, making it compatible with any pretrained vision-language model and enabling users to steer retrieval results toward preferred quality levels.

该论文通过提出一种质量可控的检索范式，解决了使用短且含糊的用户查询进行文本到图像检索的挑战。它利用生成语言模型将不明确的查询扩展为更具描述性的形式，同时融入图像质量的明确概念。该框架根据相关性和美学评分模型得出的离散质量级别来条件化查询扩展，以确保语义意义和质量意识。实验表明，这种方法可以提高检索结果并提供有效的质量控制，使其兼容任何预训练的视觉-语言模型且无需修改，并允许用户将检索结果导向其偏好的质量水平。

Efficient Hierarchical Any-Angle Path Planning on Multi-Resolution 3D Grids

Authors: Victor Reijgwart, Cesar Cadena, Roland Siegwart, Lionel Ott

Venue: RSS 2025

First: 2026-02-24T18:18:36+00:00 · Latest: 2026-02-24T18:18:36+00:00

Comments: 12 pages, 9 figures, 4 tables, accepted to RSS 2025, code is open-source: https://github.com/ethz-asl/wavestar

Abstract

Hierarchical, multi-resolution volumetric mapping approaches are widely used to represent large and complex environments as they can efficiently capture their occupancy and connectivity information. Yet widely used path planning methods such as sampling and trajectory optimization do not exploit this explicit connectivity information, and search-based methods such as A* suffer from scalability issues in large-scale high-resolution maps. In many applications, Euclidean shortest paths form the underpinning of the navigation system. For such applications, any-angle planning methods, which find optimal paths by connecting corners of obstacles with straight-line segments, provide a simple and efficient solution. In this paper, we present a method that has the optimality and completeness properties of any-angle planners while overcoming computational tractability issues common to search-based methods by exploiting multi-resolution representations. Extensive experiments on real and synthetic environments demonstrate the proposed approach's solution quality and speed, outperforming even sampling-based methods. The framework is open-sourced to allow the robotics and planning community to build on our research.

中文标题/摘要

标题：高效分层任意角度路径规划在多分辨率3D网格上的应用

分层的多分辨率体素映射方法广泛用于表示大型和复杂环境，因为它们可以有效地捕捉这些环境的占用和连接信息。然而，常用的路径规划方法如采样和轨迹优化并未利用这种显式的连接信息，基于搜索的方法如A*在大规模高分辨率地图上面临可扩展性问题。在许多应用中，欧几里得最短路径是导航系统的基础。对于此类应用，任意角度规划方法通过将障碍物的角落用直线段连接来找到最优路径，提供了一个简单而有效的解决方案。在本文中，我们提出了一种方法，该方法具有任意角度规划器的最优性和完备性属性，同时通过利用多分辨率表示来克服基于搜索方法的计算可处理性问题。在真实和合成环境中的大量实验表明，所提出的方法在解决方案质量和速度上优于基于采样的方法。该框架已开源，以允许机器人技术和规划社区在此基础上进行研究。

Summary / 总结

This paper addresses the challenge of efficient path planning in large and complex 3D environments by integrating hierarchical multi-resolution mapping with any-angle planning methods. The proposed method utilizes the connectivity information provided by multi-resolution grids to enhance the scalability of search-based methods like A*. Experimental results show that the approach outperforms sampling-based methods in terms of both solution quality and speed, making it suitable for real-world applications where Euclidean shortest paths are required for navigation systems.

本文提出了将层次多分辨率映射与任何角度路径规划方法相结合的方法，以高效地在大型和复杂的3D环境中进行路径规划。该方法利用多分辨率网格提供的连接信息来增强基于搜索的方法（如A*）的可扩展性。实验结果表明，该方法在解决方案质量和速度上都优于基于采样的方法，适用于需要导航系统中使用欧几里得最短路径的实际应用。
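
The abstract characterizes any-angle planning as connecting obstacle corners with straight-line segments. The sketch below illustrates only the underlying line-of-sight idea on a small 2D occupancy grid: a grid path is shortened by skipping waypoints that are mutually visible. It is not the paper's hierarchical multi-resolution 3D method and it does not guarantee optimal paths. The function names, the Bresenham-based visibility test, and the greedy shortcutting strategy are assumptions chosen for brevity.

```python
from typing import List, Tuple

Cell = Tuple[int, int]  # (x, y) grid coordinates


def line_of_sight(grid: List[List[int]], a: Cell, b: Cell) -> bool:
    """Walk a Bresenham line from a to b; fail on any occupied cell (1).

    Treating a cell-center line as collision-free when every visited cell
    is free is a simplification; it ignores corner clipping.
    """
    (x0, y0), (x1, y1) = a, b
    dx, dy = abs(x1 - x0), abs(y1 - y0)
    sx = 1 if x1 >= x0 else -1
    sy = 1 if y1 >= y0 else -1
    err = dx - dy
    x, y = x0, y0
    while True:
        if grid[y][x]:
            return False
        if (x, y) == (x1, y1):
            return True
        e2 = 2 * err
        if e2 > -dy:
            err -= dy
            x += sx
        if e2 < dx:
            err += dx
            y += sy


def shortcut(grid: List[List[int]], path: List[Cell]) -> List[Cell]:
    """Greedily jump to the farthest waypoint visible from the current one."""
    if not path:
        return []
    result = [path[0]]
    i = 0
    while i < len(path) - 1:
        j = len(path) - 1
        while j > i + 1 and not line_of_sight(grid, path[i], path[j]):
            j -= 1
        result.append(path[j])
        i = j
    return result


free_grid = [[0] * 4 for _ in range(4)]
grid_path = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
print(shortcut(free_grid, grid_path))
```

On the empty grid the seven-waypoint axis-aligned path collapses to its two endpoints, which is the shorter straight-line route an 8-connected or 4-connected grid search cannot represent directly.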

NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning

Authors: Ishaan Rawal, Shubh Gupta, Yihan Hu, Wei Zhan

Venue: CVPR 2026

First: 2026-02-24T18:17:21+00:00 · Latest: 2026-02-24T18:17:21+00:00

Comments: Accepted to CVPR 2026

Abstract

Vision-Language-Action (VLA) models are advancing autonomous driving by replacing modular pipelines with unified end-to-end architectures. However, current VLAs face two expensive requirements: (1) massive dataset collection, and (2) dense reasoning annotations. In this work, we address both challenges with NoRD (No Reasoning for Driving). Compared to existing VLAs, NoRD achieves competitive performance while being fine-tuned on <60% of the data and no reasoning annotations, resulting in 3× fewer tokens. We identify that standard Group Relative Policy Optimization (GRPO) fails to yield significant improvements when applied to policies trained on such small, reasoning-free datasets. We show that this limitation stems from difficulty bias, which disproportionately penalizes reward signals from scenarios that produce high-variance rollouts within GRPO. NoRD overcomes this by incorporating Dr. GRPO, a recent algorithm designed to mitigate difficulty bias in LLMs. As a result, NoRD achieves competitive performance on Waymo and NAVSIM with a fraction of the training data and no reasoning overhead, enabling more efficient autonomous systems.

中文标题/摘要

标题：NoRD：一种无需推理的数据高效视觉-语言-行动模型

视觉-语言-动作（VLA）模型正在以统一的端到端架构取代模块化流水线，从而推动自动驾驶的发展。然而，当前的VLA面临两项代价高昂的要求：（1）大规模数据集采集，（2）密集的推理标注。在本工作中，我们通过NoRD（No Reasoning for Driving，无需推理的驾驶）同时应对这两个挑战。与现有的VLA相比，NoRD仅在不到60%的数据上微调，且不使用推理标注，就取得了有竞争力的性能，所用token数量仅为原来的三分之一。我们发现，对于在如此小规模、无推理标注的数据集上训练的策略，标准的组相对策略优化（GRPO）无法带来显著提升。我们表明，这一局限源于难度偏差：在GRPO中，来自会产生高方差rollout的场景的奖励信号受到了不成比例的惩罚。NoRD通过引入Dr. GRPO克服了这一问题，这是一种近期提出的、旨在缓解LLM中难度偏差的算法。因此，NoRD仅用一小部分训练数据、且没有推理开销，就在Waymo和NAVSIM上取得了有竞争力的性能，使自动驾驶系统更加高效。

Summary / 总结

NoRD is a data-efficient vision-language-action model that addresses two costly requirements of current driving VLAs: massive dataset collection and dense reasoning annotations. Fine-tuned on less than 60% of the data and without reasoning annotations, it uses 3× fewer tokens while achieving competitive performance on Waymo and NAVSIM. Because standard GRPO suffers from difficulty bias on such small, reasoning-free datasets, NoRD incorporates Dr. GRPO to mitigate it, enabling more efficient autonomous systems.

NoRD是一种数据高效的视觉-语言-动作模型，旨在应对自动驾驶VLA面临的大规模数据集采集和密集推理标注这两项挑战。它仅使用不到60%的训练数据且不使用推理标注，就取得了有竞争力的性能，所用token数量仅为原来的三分之一。NoRD通过引入Dr. GRPO来缓解难度偏差，从而无需密集的推理标注即可在Waymo和NAVSIM上取得良好表现。
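
The difficulty bias mentioned in the NoRD abstract can be illustrated with the group-level advantage computation. The sketch below assumes the commonly described difference between the two algorithms: GRPO divides mean-centered rewards by the group's standard deviation, while Dr. GRPO drops that division (it also changes response-length normalization, which is omitted here). How NoRD configures Dr. GRPO is not stated in the abstract, so the function names and the reward values are purely illustrative.

```python
import numpy as np


def grpo_advantages(rewards, eps: float = 1e-6) -> np.ndarray:
    """Mean-center rewards within a group and divide by the group std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


def dr_grpo_advantages(rewards) -> np.ndarray:
    """Mean-center only; no per-group std normalization."""
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()


# Hypothetical rollout rewards for two driving scenarios.
low_variance_group = [0.50, 0.52, 0.48, 0.50]
high_variance_group = [0.0, 1.0, 0.2, 0.9]

for name, group in [("low", low_variance_group), ("high", high_variance_group)]:
    print(name, grpo_advantages(group).round(2), dr_grpo_advantages(group).round(2))
```

Under the std-normalized form both groups end up with advantages of roughly unit scale, so the large reward differences in the high-variance scenario are shrunk relative to the tiny differences in the low-variance one. Without the division, the update magnitude tracks the actual reward spread.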

Multi-Round Human-AI Collaboration with User-Specified Requirements

Authors: Sima Noorani, Shayan Kiyani, Hamed Hassani, George Pappas

First: 2026-02-19T18:54:34+00:00 · Latest: 2026-02-24T18:15:39+00:00

Abstract

As humans increasingly rely on multiround conversational AI for high stakes decisions, principled frameworks are needed to ensure such interactions reliably improve decision quality. We adopt a human centric view governed by two principles: counterfactual harm, ensuring the AI does not undermine human strengths, and complementarity, ensuring it adds value where the human is prone to err. We formalize these concepts via user defined rules, allowing users to specify exactly what harm and complementarity mean for their specific task. We then introduce an online, distribution free algorithm with finite sample guarantees that enforces the user-specified constraints over the collaboration dynamics. We evaluate our framework across two interactive settings: LLM simulated collaboration on a medical diagnostic task and a human crowdsourcing study on a pictorial reasoning task. We show that our online procedure maintains prescribed counterfactual harm and complementarity violation rates even under nonstationary interaction dynamics. Moreover, tightening or loosening these constraints produces predictable shifts in downstream human accuracy, confirming that the two principles serve as practical levers for steering multi-round collaboration toward better decision quality without the need to model or constrain human behavior.

中文标题/摘要

标题：多轮人机协作与用户指定要求

随着人类越来越多地依赖多轮对话AI进行高风险决策，需要有原则性的框架来确保此类互动能够可靠地提高决策质量。我们从以人为本的角度出发，遵循两个原则：反事实伤害，确保AI不削弱人类的优势；互补性，确保AI在人类容易出错的地方增加价值。我们通过用户定义的规则形式化这些概念，允许用户明确指定特定任务中的伤害和互补性含义。然后，我们引入了一个在线的、无分布假设的算法，具有有限样本保证，该算法在协作动态中强制执行用户指定的约束。我们通过两个交互设置评估了我们的框架：LLM模拟协作在医学诊断任务上和人类众包研究在图像推理任务上。我们展示了我们的在线程序即使在非平稳互动动态下也能保持规定的反事实伤害和互补性违反率。此外，收紧或放松这些约束会产生可预测的人类下游准确性变化，证实了这两个原则作为实用杠杆，可以引导多轮协作向更好的决策质量发展，而无需建模或约束人类行为。

MoEMba: A Mamba-based Mixture of Experts for High-Density EMG-based Hand Gesture Recognition

Authors: Mehran Shabanpour, Kasra Rad, Sadaf Khademi, Arash Mohammadi

First: 2025-02-09T17:07:46+00:00 · Latest: 2026-02-24T18:08:53+00:00

Abstract

High-Density surface Electromyography (HD-sEMG) has emerged as a pivotal resource for Human-Computer Interaction (HCI), offering direct insights into muscle activities and motion intentions. However, a significant challenge in practical implementations of HD-sEMG-based models is the low accuracy of inter-session and inter-subject classification. Variability between sessions can reach up to 40% due to the inherent temporal variability of HD-sEMG signals. Targeting this challenge, the paper introduces the MoEMba framework, a novel approach leveraging Selective State Space Models (SSMs) to enhance HD-sEMG-based gesture recognition. The MoEMba framework captures temporal dependencies and cross-channel interactions through channel attention techniques. Furthermore, wavelet feature modulation is integrated to capture multi-scale temporal and spatial relations, improving signal representation. Experimental results on the CapgMyo HD-sEMG dataset demonstrate that MoEMba achieves a balanced accuracy of 56.9%, outperforming its state-of-the-art counterparts. The proposed framework's robustness to session-to-session variability and its efficient handling of high-dimensional multivariate time series data highlight its potential for advancing HD-sEMG-powered HCI systems.

中文标题/摘要

标题：MoEMba：基于Mamba的专家混合模型用于高密度EMG手势识别

高密度表面肌电图（HD-sEMG）已成为人机交互（HCI）的关键资源，提供了对肌肉活动和运动意图的直接洞察。然而，基于HD-sEMG的模型在实际实施中的主要挑战是会话间和跨个体分类的低准确性。由于HD-sEMG信号固有的时间变异性，会话间变异性可高达40%。针对这一挑战，论文引入了MoEMba框架，这是一种利用选择性状态空间模型（SSMs）增强HD-sEMG手势识别的新方法。MoEMba框架通过通道注意力技术捕捉时间依赖性和跨通道交互。此外，还集成了小波特征调制以捕捉多尺度的时间和空间关系，提高信号表示。CapgMyo HD-sEMG数据集上的实验结果表明，MoEMba实现了56.9%的平衡准确率，优于其最先进的同类方法。所提出的框架对会话间变异性具有鲁棒性，并且能够高效处理高维多变量时间序列数据，突显了其在推进HD-sEMG驱动的HCI系统方面的潜力。

Summary / 总结

The paper addresses the challenge of low accuracy in inter-session and inter-subject classification of High-Density surface Electromyography (HD-sEMG) signals for hand gesture recognition. It introduces MoEMba, a framework that uses Selective State-Space Models and channel attention techniques to capture temporal dependencies and cross-channel interactions. Wavelet feature modulation is also integrated to capture multi-scale temporal and spatial relations. On the CapgMyo HD-sEMG dataset, MoEMba achieves a balanced accuracy of 56.9%, outperforming state-of-the-art methods, and it shows robustness to session-to-session variability while handling high-dimensional multivariate time series data efficiently.

论文针对高密度表面肌电信号（HD-sEMG）在手势识别中的跨会话和跨个体分类低精度问题，提出了一种名为MoEMba的新框架，该框架利用选择性状态空间模型和通道注意力技术捕捉时间依赖性和跨通道交互。此外，还集成了小波特征调制以捕捉多尺度的时间和空间关系。实验结果表明，MoEMba在CapgMyo HD-sEMG数据集上的平衡准确率为56.9%，优于现有方法，能够有效处理会话间变异性及高维多变量时间序列数据。

Not Just How Much, But Where: Decomposing Epistemic Uncertainty into Per-Class Contributions

Authors: Mame Diarra Toure, David A. Stephens

First: 2026-02-24T18:05:51+00:00 · Latest: 2026-02-24T18:05:51+00:00

Comments: 8 pages, 17 figures

Abs · PDF · Code1 · Code2

Abstract

In safety-critical classification, the cost of failure is often asymmetric, yet Bayesian deep learning summarises epistemic uncertainty with a single scalar, mutual information (MI), that cannot distinguish whether a model's ignorance involves a benign or safety-critical class. We decompose MI into a per-class vector $C_k(x)=\sigma_k^{2}/(2\mu_k)$, with $\mu_k=\mathbb{E}[p_k]$ and $\sigma_k^2=\mathrm{Var}[p_k]$ across posterior samples. The decomposition follows from a second-order Taylor expansion of the entropy; the $1/\mu_k$ weighting corrects boundary suppression and makes $C_k$ comparable across rare and common classes. By construction $\sum_k C_k \approx \mathrm{MI}$, and a companion skewness diagnostic flags inputs where the approximation degrades. After characterising the axiomatic properties of $C_k$, we validate it on three tasks: (i) selective prediction for diabetic retinopathy, where critical-class $C_k$ reduces selective risk by 34.7% over MI and 56.2% over variance baselines; (ii) out-of-distribution detection on clinical and image benchmarks, where $\sum_k C_k$ achieves the highest AUROC and the per-class view exposes asymmetric shifts invisible to MI; and (iii) a controlled label-noise study in which $\sum_k C_k$ shows less sensitivity to injected aleatoric noise than MI under end-to-end Bayesian training, while both metrics degrade under transfer learning. Across all tasks, the quality of the posterior approximation shapes uncertainty at least as strongly as the choice of metric, suggesting that how uncertainty is propagated through the network matters as much as how it is measured.
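
As an illustration of the decomposition described in this abstract, the following is a minimal Python sketch that computes $C_k=\sigma_k^2/(2\mu_k)$ from a handful of posterior samples and compares $\sum_k C_k$ with the exact MI (entropy of the mean prediction minus the mean entropy). The function names, the use of population variance, and the toy probability vectors are assumptions made for this example; they are not taken from the paper or its code.

```python
import math


def class_means(samples):
    """Mean predicted probability per class across posterior samples."""
    n = len(samples)
    return [sum(p[k] for p in samples) / n for k in range(len(samples[0]))]


def per_class_contributions(samples):
    """C_k = sigma_k^2 / (2 * mu_k), using population variance (an assumption)."""
    n = len(samples)
    mu = class_means(samples)
    contributions = []
    for k, m in enumerate(mu):
        var = sum((p[k] - m) ** 2 for p in samples) / n
        contributions.append(var / (2.0 * m) if m > 0 else 0.0)
    return contributions


def entropy(p):
    """Shannon entropy in nats, skipping zero-probability classes."""
    return -sum(x * math.log(x) for x in p if x > 0)


def mutual_information(samples):
    """Exact MI: entropy of the mean prediction minus the mean entropy."""
    mean_entropy = sum(entropy(p) for p in samples) / len(samples)
    return entropy(class_means(samples)) - mean_entropy


# Toy posterior samples for a 3-class problem (hypothetical values).
samples = [
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],
    [0.65, 0.25, 0.10],
    [0.75, 0.15, 0.10],
]
c = per_class_contributions(samples)
mi = mutual_information(samples)
print("per-class C_k:", c)
print("sum C_k:", sum(c), "exact MI:", mi)
```

In this toy case classes 0 and 1 have the same variance, but class 1 receives the larger contribution because its mean probability is smaller, which is the effect of the $1/\mu_k$ weighting. Class 2 never varies across samples, so it contributes essentially nothing.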

SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards

Authors: Dengjia Zhang, Xiaoou Liu, Lu Cheng, Yaqing Wang, Kenton Murray, Hua Wei

First: 2026-02-24T18:04:54+00:00 · Latest: 2026-02-24T18:04:54+00:00

Abs · PDF · Code1 · Code2

Abstract

Large language models (LLMs) are increasingly deployed as multi-step decision-making agents, where effective reward design is essential for guiding learning. Although recent work explores various forms of reward shaping and step-level credit assignment, a key signal remains largely overlooked: the intrinsic uncertainty of LLMs. Uncertainty reflects model confidence, reveals where exploration is needed, and offers valuable learning cues even in failed trajectories. We introduce SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards, a reinforcement learning framework that incorporates uncertainty directly into the reward design. SELAUR integrates entropy-, least-confidence-, and margin-based metrics into a combined token-level uncertainty estimate, providing dense confidence-aligned supervision, and employs a failure-aware reward reshaping mechanism that injects these uncertainty signals into step- and trajectory-level rewards to improve exploration efficiency and learning stability. Experiments on two benchmarks, ALFWorld and WebShop, show that our method consistently improves success rates over strong baselines. Ablation studies further demonstrate how uncertainty signals enhance exploration and robustness.

中文标题/摘要

标题：SELAUR：基于不确定性感知奖励的自演化大语言模型代理

大型语言模型（LLMs）越来越多地被部署为多步决策代理，有效的奖励设计对于引导学习至关重要。尽管最近的工作探索了各种形式的奖励塑造和步骤级的信用分配，但一个关键信号仍然被忽视：LLMs 的内在不确定性。不确定性反映了模型的信心，揭示了探索所需的地方，并在失败轨迹中提供了有价值的指导线索。我们引入了SELAUR：基于不确定性感知奖励的自演化大语言模型代理，这是一种强化学习框架，将不确定性直接纳入奖励设计中。SELAUR 将熵、最小置信度和边际度量结合成一个综合的令牌级不确定性估计，提供密集的信心对齐监督，并采用一种失败感知的奖励重塑机制，将这些不确定性信号注入步骤级和轨迹级奖励中，以提高探索效率和学习稳定性。在两个基准ALFWorld和WebShop上的实验表明，我们的方法在成功率上始终优于强大的基线。消融研究进一步证明了不确定性信号如何增强探索和鲁棒性。

Summary / 总结

The research aims to enhance the effectiveness of large language models (LLMs) as multi-step decision-making agents by incorporating model uncertainty into reward design. SELAUR, a reinforcement learning framework, uses entropy, least confidence, and margin-based metrics to estimate token-level uncertainty, providing dense supervision. It also employs a failure-aware reward reshaping mechanism to improve exploration and learning stability. Experiments on ALFWorld and WebShop show that SELAUR consistently outperforms strong baselines in terms of success rates.

研究旨在通过将模型不确定性纳入奖励设计来提升大型语言模型（LLMs）作为多步决策代理的有效性。SELAUR是一种强化学习框架，使用基于熵、最小置信度和边际的度量来估计令牌级别的不确定性，提供密集的监督。它还采用一种失败感知的奖励重塑机制，以提高探索效率和学习稳定性。在ALFWorld和WebShop两个基准上的实验显示，SELAUR在成功率方面始终优于强大的基线方法。
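
The SELAUR abstract names three token-level uncertainty signals: entropy, least confidence, and margin. The sketch below shows one common way to compute each from a next-token probability distribution and to average them into a single score. The normalisation of entropy by `log(V)`, the `1 - (p1 - p2)` form of the margin term, the equal weights, and the function name are all assumptions for illustration; the abstract does not specify how SELAUR combines the metrics.

```python
import math


def token_uncertainty(probs, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Combine three uncertainty metrics for one token distribution.

    probs: probabilities over the vocabulary (at least 2 entries, summing to 1).
    weights: hypothetical mixing weights for (entropy, least confidence, margin).
    """
    if len(probs) < 2:
        raise ValueError("need at least two candidate tokens")
    top = sorted(probs, reverse=True)
    # Entropy, normalised to [0, 1] by the maximum value log(V).
    entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(len(probs))
    # Least confidence: how far the top token is from certainty.
    least_conf = 1.0 - top[0]
    # Margin: a small gap between the top two tokens means high uncertainty.
    margin = 1.0 - (top[0] - top[1])
    w_e, w_l, w_m = weights
    return w_e * entropy + w_l * least_conf + w_m * margin


confident = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]
print(token_uncertainty(confident), token_uncertainty(uniform))
```

A peaked distribution scores low and a uniform one scores high, so a score of this kind could in principle be used as a dense per-token signal.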

Skill-Inject: Measuring Agent Vulnerability to Skill File Attacks

Authors: David Schmotz, Luca Beurer-Kellner, Sahar Abdelnabi, Maksym Andriushchenko

First: 2026-02-23T18:59:27+00:00 · Latest: 2026-02-24T18:03:02+00:00

Abs · PDF · Code1 · Code2

Abstract

LLM agents are evolving rapidly, powered by code execution, tools, and the recently introduced agent skills feature. Skills allow users to extend LLM applications with specialized third-party code, knowledge, and instructions. Although this can extend agent capabilities to new domains, it creates an increasingly complex agent supply chain, offering new surfaces for prompt injection attacks. We identify skill-based prompt injection as a significant threat and introduce SkillInject, a benchmark evaluating the susceptibility of widely-used LLM agents to injections through skill files. SkillInject contains 202 injection-task pairs with attacks ranging from obviously malicious injections to subtle, context-dependent attacks hidden in otherwise legitimate instructions. We evaluate frontier LLMs on SkillInject, measuring both security in terms of harmful instruction avoidance and utility in terms of legitimate instruction compliance. Our results show that today's agents are highly vulnerable with up to 80% attack success rate with frontier models, often executing extremely harmful instructions including data exfiltration, destructive action, and ransomware-like behavior. They furthermore suggest that this problem will not be solved through model scaling or simple input filtering, but that robust agent security will require context-aware authorization frameworks. Our benchmark is available at https://www.skill-inject.com/.

中文标题/摘要

标题：技能注入：衡量代理对技能文件攻击的脆弱性

LLM代理正在迅速发展，得益于代码执行、工具以及最近引入的代理技能功能。技能允许用户通过专门的第三方代码、知识和指令扩展LLM应用程序的功能。虽然这可以将代理能力扩展到新的领域，但也造成了日益复杂的代理供应链，为提示注入攻击提供了新的攻击面。我们识别出基于技能的提示注入是一个重大威胁，并引入了SkillInject基准，评估广泛使用的LLM代理通过技能文件遭受注入的易感性。SkillInject包含202个注入-任务对，攻击范围从明显的恶意注入到隐藏在原本合法的指令中的细微、情境依赖的攻击。我们在SkillInject上评估了前沿LLM，同时衡量安全性（避免执行有害指令）和实用性（遵守合法指令）。我们的结果显示，当前的代理高度易受攻击，前沿模型的攻击成功率高达80%，经常执行极其有害的指令，包括数据泄露、破坏性操作和类似勒索软件的行为。此外，这些结果表明，这个问题不会通过模型扩展或简单的输入过滤来解决，稳健的代理安全需要上下文感知的授权框架。我们的基准可以在https://www.skill-inject.com/获取。

Summary / 总结

The paper addresses the security threat of skill-based prompt injection attacks on LLM agents: skills extend agent capabilities but also open new attack surfaces. It introduces SkillInject, a benchmark that evaluates the susceptibility of LLM agents to injections through skill files. The benchmark includes 202 injection-task pairs, ranging from obviously malicious injections to subtle, context-dependent attacks. The evaluation of frontier LLMs shows high vulnerability, with attack success rates of up to 80%, and suggests that robust security will require context-aware authorization frameworks rather than model scaling or simple input filtering.

论文关注LLM代理面临的基于技能的提示注入攻击威胁：技能可以扩展代理的功能，但也引入了新的攻击面。研究引入了SkillInject基准，通过技能文件评估LLM代理的易受攻击性。基准包括202个注入-任务对，恶意程度各不相同。对前沿LLM的评估显示了高易受攻击性，攻击成功率高达80%，且代理经常执行有害指令。结果表明，稳健的安全性需要上下文感知的授权框架，而不能仅依赖模型扩展或简单的输入过滤。
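
The Skill-Inject abstract reports two quantities per model: security (harmful instruction avoidance) and utility (legitimate instruction compliance). A minimal sketch of how such rates could be tallied over injection-task pairs follows. The record fields `attack_executed` and `task_completed` and the function name are hypothetical, and the benchmark's actual scoring rules are not described in this digest.

```python
def score_trials(trials):
    """Return (attack_success_rate, utility) over a list of trial records.

    Each trial is a dict with two hypothetical boolean fields:
      attack_executed - the agent carried out the injected instruction
      task_completed  - the agent completed the legitimate task
    """
    if not trials:
        raise ValueError("no trials to score")
    n = len(trials)
    attack_success_rate = sum(t["attack_executed"] for t in trials) / n
    utility = sum(t["task_completed"] for t in trials) / n
    return attack_success_rate, utility


# Made-up outcomes for four injection-task pairs.
trials = [
    {"attack_executed": True, "task_completed": True},
    {"attack_executed": True, "task_completed": False},
    {"attack_executed": False, "task_completed": True},
    {"attack_executed": True, "task_completed": True},
]
asr, utility = score_trials(trials)
print(f"attack success rate: {asr:.2f}, utility: {utility:.2f}")
```

Reporting both numbers together matters because an agent that refuses everything would look secure while being useless.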

From Moderation to Mediation: Can LLMs Serve as Mediators in Online Flame Wars?

Authors: Dawei Li, Abdullah Alnaibari, Arslan Bisharat, Manny Sandoval, Deborah Hall, Yasin Silva, Huan Liu

First: 2025-12-02T18:31:18+00:00 · Latest: 2026-02-24T18:01:52+00:00

Comments: Accepted by PAKDD 2026 special session on Data Science: Foundations and Applications

Abs · PDF · Code1 · Code2

Abstract

The rapid advancement of large language models (LLMs) has opened new possibilities for AI for good applications. As LLMs increasingly mediate online communication, their potential to foster empathy and constructive dialogue becomes an important frontier for responsible AI research. This work explores whether LLMs can serve not only as moderators that detect harmful content, but as mediators capable of understanding and de-escalating online conflicts. Our framework decomposes mediation into two subtasks: judgment, where an LLM evaluates the fairness and emotional dynamics of a conversation, and steering, where it generates empathetic, de-escalatory messages to guide participants toward resolution. To assess mediation quality, we construct a large Reddit-based dataset and propose a multi-stage evaluation pipeline combining principle-based scoring, user simulation, and human comparison. Experiments show that API-based models outperform open-source counterparts in both reasoning and intervention alignment when doing mediation. Our findings highlight both the promise and limitations of current LLMs as emerging agents for online social mediation.

中文标题/摘要

标题：从审查到调解：大语言模型能否在在线争吵中担任调解人？

大型语言模型（LLMs）的迅速发展为AI向善的应用开辟了新的可能性。随着LLMs越来越多地调解在线交流，它们促进同理心和建设性对话的潜力成为负责任AI研究的重要前沿。本文探讨LLMs是否不仅能作为检测有害内容的审查员，还能作为能够理解并缓解在线冲突的调解人。我们的框架将调解分解为两个子任务：判断，即LLM评估对话的公平性和情感动态；引导，即生成同理心、缓解冲突的消息，引导参与者走向解决。为了评估调解质量，我们构建了一个基于Reddit的大规模数据集，并提出了一种多阶段评估管道，结合原则评分、用户模拟和人工比较。实验表明，基于API的模型在调解时在推理和干预一致性方面优于开源版本。我们的研究结果突显了当前LLMs作为在线社会调解新兴代理的潜力和局限性。

Summary / 总结

This study investigates whether large language models (LLMs) can act as mediators in online conflicts, beyond their role as moderators. The research proposes a framework that decomposes mediation into judgment and steering tasks. It evaluates mediation quality using a multi-stage pipeline and finds that API-based models outperform open-source models in both reasoning and intervention alignment. The study highlights the potential and limitations of current LLMs in online social mediation.

研究探讨了大型语言模型（LLMs）是否可以在在线冲突中作为调解者发挥作用，而不仅仅是作为审核者。研究将调解分解为判断和引导两个任务，并使用多阶段评估管道进行评估。实验表明，基于API的模型在推理和干预对齐方面优于开源模型，这既显示了LLMs在这一领域的潜力，也揭示了其当前的局限性。

CG-DMER: Hybrid Contrastive-Generative Framework for Disentangled Multimodal ECG Representation Learning

Authors: Ziwei Niu, Hao Sun, Shujun Bian, Xihong Yang, Lanfen Lin, Yuxin Liu, Yueming Jin

Venue: ICASSP 2026

First: 2026-02-24T17:59:21+00:00 · Latest: 2026-02-24T17:59:21+00:00

Comments: Accepted by ICASSP 2026

Abs · PDF · Code1 · Code2

Abstract

Accurate interpretation of electrocardiogram (ECG) signals is crucial for diagnosing cardiovascular diseases. Recent multimodal approaches that integrate ECGs with accompanying clinical reports show strong potential, but they still face two main concerns from a modality perspective: (1) intra-modality: existing models process ECGs in a lead-agnostic manner, overlooking spatial-temporal dependencies across leads, which restricts their effectiveness in modeling fine-grained diagnostic patterns; (2) inter-modality: existing methods directly align ECG signals with clinical reports, introducing modality-specific biases due to the free-text nature of the reports. In light of these two issues, we propose CG-DMER, a contrastive-generative framework for disentangled multimodal ECG representation learning, powered by two key designs: (1) Spatial-temporal masked modeling is designed to better capture fine-grained temporal dynamics and inter-lead spatial dependencies by applying masking across both spatial and temporal dimensions and reconstructing the missing information. (2) A representation disentanglement and alignment strategy is designed to mitigate unnecessary noise and modality-specific biases by introducing modality-specific and modality-shared encoders, ensuring a clearer separation between modality-invariant and modality-specific representations. Experiments on three public datasets demonstrate that CG-DMER achieves state-of-the-art performance across diverse downstream tasks.

中文标题/摘要

标题：CG-DMER：混合对比生成框架用于解耦多模态心电图表示学习

准确解读心电图（ECG）信号对于诊断心血管疾病至关重要。最近将ECG与伴随的临床报告结合的多模态方法显示出强大的潜力，但仍面临两个主要的模态问题：（1）模内：现有模型以导联无关的方式处理ECG，忽视了导联间的时空依赖性，限制了其在建模细微诊断模式方面的有效性；（2）模间：现有方法直接将ECG信号与临床报告对齐，由于报告的自由文本性质，引入了模态特定的偏差。针对这两个问题，我们提出CG-DMER，一种基于对比生成框架的解耦多模态ECG表示学习方法，通过两个关键设计：（1）时空掩码建模旨在通过在时空维度上应用掩码并重建缺失信息，更好地捕捉细微的时间动态和导联间的空间依赖性；（2）一种表示解耦和对齐策略旨在通过引入模态特定和模态共享编码器，减少不必要的噪声和模态特定偏差，确保模态不变和模态特定表示之间的清晰分离。在三个公开数据集上的实验表明，CG-DMER在多种下游任务中达到了最先进的性能。

Summary / 总结

CG-DMER is a hybrid contrastive-generative framework designed to improve the interpretation of ECG signals by addressing intra-modality and inter-modality issues. It uses spatial-temporal masked modeling to capture fine-grained temporal dynamics and inter-lead spatial dependencies, and a representation disentanglement strategy to reduce modality-specific biases. Experiments show that CG-DMER outperforms existing methods across various downstream tasks on three public datasets.

CG-DMER 是一种混合对比生成框架，旨在通过解决模内和模间问题来改进心电图信号的解读。它使用时空掩码建模来捕捉细粒度的时间动态和跨导联的空间依赖性，并通过引入模态特定和模态共享编码器来分离模态不变和模态特定的表示，减少噪声和偏差。实验表明，CG-DMER 在三个公开数据集的各种下游任务中优于现有方法。
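
The CG-DMER abstract describes masking across both the spatial (lead) and temporal dimensions of an ECG before reconstruction. The sketch below builds one such boolean mask over a grid of leads by time patches. The patch grid, the masking ratios, the choice to mask whole leads and whole time patches, and the function name are assumptions for illustration only; the paper's actual masking scheme is not given in this digest.

```python
import random


def spatiotemporal_mask(n_leads, n_patches, lead_ratio, time_ratio, seed=0):
    """Return an n_leads x n_patches grid of booleans (True = masked).

    A cell is masked if its lead was chosen for spatial masking or its
    time patch was chosen for temporal masking.
    """
    rng = random.Random(seed)
    masked_leads = set(rng.sample(range(n_leads), round(n_leads * lead_ratio)))
    masked_times = set(rng.sample(range(n_patches), round(n_patches * time_ratio)))
    return [
        [(lead in masked_leads) or (t in masked_times) for t in range(n_patches)]
        for lead in range(n_leads)
    ]


# 12-lead ECG split into 10 time patches; mask 3 leads and 3 time patches.
mask = spatiotemporal_mask(n_leads=12, n_patches=10, lead_ratio=0.25, time_ratio=0.3)
masked_cells = sum(sum(row) for row in mask)
print(f"{masked_cells} of {12 * 10} cells masked")
```

A reconstruction model trained on such a mask has to use the other leads to fill in a masked lead and neighbouring patches to fill in a masked time span.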

A Very Big Video Reasoning Suite

Authors: Maijunxian Wang, Ruisi Wang, Juyi Lin, Ran Ji, Thaddäus Wiedemer, Qingying Gao, Dezhi Luo, Yaoyao Qian, Lianyu Huang, Zelong Hong, Jiahui Ge, Qianli Ma, Hang He, Yifan Zhou, Lingzi Guo, Lantao Mei, Jiachen Li, Hanwen Xing, Tianqi Zhao, Fengyuan Yu, Weihang Xiao, Yizheng Jiao, Jianheng Hou, Danyang Zhang, Pengcheng Xu, Boyang Zhong, Zehong Zhao, Gaoyun Fang, John Kitaoka, Yile Xu, Hua Xu, Kenton Blacutt, Tin Nguyen, Siyuan Song, Haoran Sun, Shaoyue Wen, Linyang He, Runming Wang, Yanzhi Wang, Mengyue Yang, Ziqiao Ma, Raphaël Millière, Freda Shi, Nuno Vasconcelos, Daniel Khashabi, Alan Yuille, Yilun Du, Ziming Liu, Bo Li, Dahua Lin, Ziwei Liu, Vikash Kumar, Yijiang Li, Lei Yang, Zhongang Cai, Hokin Deng

First: 2026-02-23T18:59:41+00:00 · Latest: 2026-02-24T17:59:15+00:00

Comments: Homepage: https://video-reason.com/

Abs · PDF · Code1 · Code2

Abstract

Rapid progress in video models has largely focused on visual quality, leaving their reasoning capabilities underexplored. Video reasoning grounds intelligence in spatiotemporally consistent visual environments that go beyond what text can naturally capture, enabling intuitive reasoning over spatiotemporal structure such as continuity, interaction, and causality. However, systematically studying video reasoning and its scaling behavior is hindered by the lack of large-scale training data. To address this gap, we introduce the Very Big Video Reasoning (VBVR) Dataset, an unprecedentedly large-scale resource spanning 200 curated reasoning tasks following a principled taxonomy and over one million video clips, approximately three orders of magnitude larger than existing datasets. We further present VBVR-Bench, a verifiable evaluation framework that moves beyond model-based judging by incorporating rule-based, human-aligned scorers, enabling reproducible and interpretable diagnosis of video reasoning capabilities. Leveraging the VBVR suite, we conduct one of the first large-scale scaling studies of video reasoning and observe early signs of emergent generalization to unseen reasoning tasks. Together, VBVR lays a foundation for the next stage of research in generalizable video reasoning. The data, benchmark toolkit, and models are publicly available at https://video-reason.com/ .

中文标题/摘要

标题：一个非常大的视频推理套件

视频模型的快速进步主要集中在视觉质量上，而对其推理能力的探索则相对不足。视频推理将智能根植于时空一致的视觉环境中，超越了文本所能自然捕捉的内容，使人们能够直观地推理时空结构，如连续性、交互性和因果关系。然而，系统地研究视频推理及其扩展行为受到大规模训练数据缺乏的阻碍。为解决这一问题，我们引入了非常大的视频推理（VBVR）数据集，这是一个前所未有的大规模资源，涵盖了遵循原则性分类体系精心设计的200个推理任务和超过一百万段视频片段，比现有数据集大约三个数量级。我们还提出了VBVR-Bench，这是一种可验证的评估框架，通过引入基于规则、与人类对齐的评分器，超越了基于模型的评判，使视频推理能力的诊断具有可重复性和可解释性。利用VBVR套件，我们进行了最早的大规模视频推理扩展研究之一，并观察到了对未见过的推理任务涌现泛化的早期迹象。总体而言，VBVR为可泛化的视频推理的下一阶段研究奠定了基础。数据、基准工具包和模型可在https://video-reason.com/ 公开获取。

Summary / 总结

The research aims to explore the reasoning capabilities of video models beyond visual quality, addressing the lack of large-scale training data. The study introduces the Very Big Video Reasoning (VBVR) Dataset with over one million video clips for 200 reasoning tasks, and VBVR-Bench, a verifiable evaluation framework. Key findings include early signs of emergent generalization to unseen tasks, advancing the field of generalizable video reasoning.

研究关注视频推理这一尚未充分探索的领域，对于理解时空一致的视觉环境至关重要。为了解决大规模训练数据不足的问题，作者引入了包含超过一百万视频片段和200个推理任务的Very Big Video Reasoning (VBVR) 数据集。他们还提出了VBVR-Bench评估框架，其中包括基于规则的评分者来评估视频推理能力。研究发现，存在对未见过的推理任务的早期泛化迹象，为未来可泛化视频推理的研究奠定了基础。

A Benchmark for Deep Information Synthesis

Authors: Debjit Paul, Daniel Murphy, Milan Gritta, Ronald Cardenas, Victor Prokhorov, Lena Sophia Bolliger, Aysim Toker, Roy Miles, Andreea-Maria Oncescu, Jasivan Alex Sivakumar, Philipp Borchert, Ismail Elezi, Meiru Zhang, Ka Yiu Lee, Guchun Zhang, Jun Wang, Gerasimos Lampouras

Venue: ICLR 2026

First: 2026-02-24T17:43:32+00:00 · Latest: 2026-02-24T17:43:32+00:00

Comments: Accepted at ICLR 2026

Abs · PDF · Code1 · Code2

Abstract

Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data analysis. However, current evaluation benchmarks do not adequately assess their ability to solve real-world tasks that require synthesizing information from multiple sources and inferring insights beyond simple fact retrieval. To address this, we introduce DEEPSYNTH, a novel benchmark designed to evaluate agents on realistic, time-consuming problems that combine information gathering, synthesis, and structured reasoning to produce insights. DEEPSYNTH contains 120 tasks collected across 7 domains and data sources covering 67 countries. DEEPSYNTH is constructed using a multi-stage data collection pipeline that requires annotators to collect official data sources, create hypotheses, perform manual analysis, and design tasks with verifiable answers. When evaluated on DEEPSYNTH, 11 state-of-the-art LLMs and deep research agents achieve a maximum F1 score of 8.97 and 17.5 on the LLM-judge metric, underscoring the difficulty of the benchmark. Our analysis reveals that current agents struggle with hallucinations and reasoning over large information spaces, highlighting DEEPSYNTH as a crucial benchmark for guiding future research.

中文标题/摘要

标题：深度信息综合基准

基于大型语言模型（LLM）的代理越来越多地用于解决涉及工具使用（如网络浏览、代码执行和数据分析）的复杂任务。然而，当前的评估基准未能充分评估它们解决现实任务的能力，这类任务需要从多个来源综合信息，并得出超越简单事实检索的见解。为了解决这一问题，我们引入了DEEPSYNTH，这是一种新型基准，旨在评估代理在现实、耗时问题上的表现，这些问题结合了信息收集、综合和结构化推理以产生见解。DEEPSYNTH包含120个任务，横跨7个领域，数据源覆盖67个国家。DEEPSYNTH使用多阶段数据收集管道构建，要求标注者收集官方数据源、提出假设、进行手动分析并设计具有可验证答案的任务。在DEEPSYNTH上评估时，11个最先进的LLM和深度研究代理的最高F1得分为8.97，LLM评判指标最高为17.5，突显了基准的难度。我们的分析表明，当前的代理在幻觉和对大规模信息空间进行推理方面存在困难，突显了DEEPSYNTH作为指导未来研究的关键基准的重要性。

Summary / 总结

The paper introduces DEEPSYNTH, a benchmark designed to evaluate large language model (LLM)-based agents on complex, time-consuming tasks that require synthesizing information from multiple sources and inferring insights. The benchmark includes 120 tasks across 7 domains, with data sources covering 67 countries. The 11 state-of-the-art LLMs and deep research agents evaluated reach a maximum F1 score of 8.97 and 17.5 on the LLM-judge metric, indicating the difficulty of the tasks. The study highlights the challenges of hallucinations and reasoning over large information spaces, underscoring the importance of DEEPSYNTH for future research.

论文介绍了DEEPSYNTH基准，旨在评估基于大型语言模型的代理在合成多源信息和执行结构化推理的复杂、耗时任务上的能力。该基准包含横跨7个领域的120个任务，数据源覆盖67个国家。11个最先进的LLM和深度研究代理在该基准上的最高F1得分仅为8.97，LLM评判指标最高为17.5，表明任务的难度。研究揭示了幻觉和对大规模信息空间进行推理的挑战，突显了DEEPSYNTH对未来研究的重要性。

LUMEN: Longitudinal Multi-Modal Radiology Model for Prognosis and Diagnosis

Authors: Zhifan Jiang, Dong Yang, Vishwesh Nath, Abhijeet Parida, Nishad P. Kulkarni, Ziyue Xu, Daguang Xu, Syed Muhammad Anwar, Holger R. Roth, Marius George Linguraru

Venue: ISBI

First: 2026-02-24T17:42:46+00:00 · Latest: 2026-02-24T17:42:46+00:00

Comments: Accepted to IEEE International Symposium on Biomedical Imaging (ISBI) 2026

Abs · PDF · Code1 · Code2

Abstract

Large vision-language models (VLMs) have evolved from general-purpose applications to specialized use cases such as in the clinical domain, demonstrating potential for decision support in radiology. One promising application is assisting radiologists in decision-making through the analysis of radiology imaging data such as chest X-rays (CXR) via a visual and natural language question-answering (VQA) interface. When longitudinal imaging is available, radiologists analyze temporal changes, which are essential for accurate diagnosis and prognosis. Manual longitudinal analysis is a time-consuming process, motivating the development of a training framework that can provide prognostic capabilities. We introduce LUMEN, a novel training framework optimized for longitudinal CXR interpretation that leverages multi-image and multi-task instruction fine-tuning to enhance prognostic and diagnostic performance. We conduct experiments on the publicly available MIMIC-CXR and its associated Medical-Diff-VQA datasets. We further formulate and construct a novel instruction-following dataset incorporating longitudinal studies, enabling the development of a prognostic VQA task. Our method demonstrates significant improvements over baseline models in diagnostic VQA tasks, and more importantly, shows promising potential for prognostic capabilities. These results underscore the value of well-designed, instruction-tuned VLMs in enabling more accurate and clinically meaningful interpretation of longitudinal radiological imaging data.

中文标题/摘要

标题：LUMEN：纵向多模态放射学模型用于预后和诊断

大型视觉-语言模型（VLMs）已从通用应用发展到临床领域的专业用途，展示了在放射学领域决策支持的潜力。一种有前景的应用是通过视觉和自然语言问答（VQA）界面分析放射学影像数据（如胸部X光片），辅助放射科医生进行决策。当有纵向影像时，放射科医生会分析时间变化，这对于准确诊断和预后至关重要。手动的纵向分析是一个耗时的过程，推动了纵向影像解释训练框架的发展。我们介绍了一种新的训练框架LUMEN，该框架针对纵向胸部X光片解释进行了优化，利用多图像和多任务指令微调来增强预后和诊断性能。我们在公开的MIMIC-CXR及其相关Medical-Diff-VQA数据集上进行了实验。我们进一步制定了一个包含纵向研究的新指令遵循数据集，以促进预后VQA任务的发展。我们的方法在诊断VQA任务中显著优于基线模型，并且更重要的是，展示了预后能力的前景。这些结果强调了精心设计、指令调优的VLMs在纵向放射学影像数据放射学解释中的价值。

Summary / 总结

LUMEN is a training framework designed for longitudinal chest X-ray (CXR) interpretation, leveraging multi-image and multi-task instruction fine-tuning to enhance prognostic and diagnostic performance. Experiments on MIMIC-CXR and Medical-Diff-VQA datasets show significant improvements over baseline models in diagnostic VQA tasks and promising potential for prognostic capabilities.

LUMEN 是一种训练框架，旨在增强胸部 X 光片的纵向分析能力，提升预后和诊断性能。它利用多图像和多任务指令微调。实验表明，该方法在 MIMIC-CXR 和 Medical-Diff-VQA 数据集上的诊断 VQA 任务中显著优于基线模型，并显示出预后能力方面的潜力。

SynthRender and IRIS: Open-Source Framework and Dataset for Bidirectional Sim-Real Transfer in Industrial Object Perception

Authors: Jose Moises Araya-Martinez, Thushar Tom, Adrián Sanchis Reig, Pablo Rey Valiente, Jens Lambrecht, Jörg Krüger

First: 2026-02-24T17:42:34+00:00 · Latest: 2026-02-24T17:42:34+00:00

Abs · PDF · Code1 · Code2

Abstract

Object perception is fundamental for tasks such as robotic material handling and quality inspection. However, modern supervised deep-learning perception models require large datasets for robust automation under semi-uncontrolled conditions. The cost of acquiring and annotating such data for proprietary parts is a major barrier for widespread deployment. In this context, we release SynthRender, an open source framework for synthetic image generation with Guided Domain Randomization capabilities. Furthermore, we benchmark recent Reality-to-Simulation techniques for 3D asset creation from 2D images of real parts. Combined with Domain Randomization, these synthetic assets provide low-overhead, transferable data even for parts lacking 3D files. We also introduce IRIS, the Industrial Real-Sim Imagery Set, containing 32 categories with diverse textures, intra-class variation, strong inter-class similarities and about 20,000 labels. Ablations on multiple benchmarks outline guidelines for efficient data generation with SynthRender. Our method surpasses existing approaches, achieving 99.1% mAP@50 on a public robotics dataset, 98.3% mAP@50 on an automotive benchmark, and 95.3% mAP@50 on IRIS.

中文标题/摘要

标题：SynthRender和IRIS：用于工业物体感知中双向仿真-现实迁移的开源框架和数据集

物体感知对于机器人物料处理和质量检验等任务至关重要。然而，现代监督深度学习感知模型在半受控条件下实现稳健自动化需要大量数据集。获取和标注专有部件数据的成本是广泛部署的主要障碍。在此背景下，我们发布了SynthRender，一个具有引导域随机化能力的合成图像生成开源框架。此外，我们还对从真实部件的2D图像创建3D资产的最新现实到仿真技术进行了基准测试。结合域随机化，这些合成资产即使对于缺乏3D文件的部件也能提供低成本、可迁移的数据。我们还引入了IRIS，即工业真实-仿真图像集，包含32个类别，具有多种纹理、类内变化、强类间相似性和约20,000个标签。在多个基准上的消融实验概述了使用SynthRender高效生成数据的指南。我们的方法超越了现有方法，在一个公开的机器人数据集上达到99.1%的mAP@50，在一个汽车基准上达到98.3%的mAP@50，在IRIS上达到95.3%的mAP@50。

Summary / 总结

The research aims to address the challenge of acquiring large datasets for robust object perception in semi-uncontrolled environments, particularly for industrial applications. The study introduces SynthRender, an open-source framework for synthetic image generation with Guided Domain Randomization, and IRIS, a dataset with diverse textures and labels. The method outperforms existing approaches, achieving high mean Average Precision (mAP) scores on various benchmarks, including 99.1% mAP@50 on a robotics dataset, 98.3% mAP@50 on an automotive benchmark, and 95.3% mAP@50 on IRIS.

本文旨在解决在工业环境中获取大量数据以实现稳健的物体感知的挑战。作者引入了SynthRender，一个带有引导域随机化能力的开源合成图像生成框架，以及包含32类工业部件的IRIS数据集。研究者对现实到模拟的技术进行了基准测试，并展示了使用SynthRender生成的合成资产可以实现高精度，超越现有方法，在一个机器人数据集上达到99.1%的mAP@50，在一个汽车基准上达到98.3%的mAP@50，在IRIS上达到95.3%的mAP@50。
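
The SynthRender results are reported as mAP@50, a detection metric in which a predicted box counts as a match only if its intersection-over-union (IoU) with a ground-truth box of the same class is at least 0.5. The sketch below shows only that IoU matching step for axis-aligned boxes; the full mAP computation (ranking detections by confidence, building precision-recall curves, averaging over classes) is omitted. The box format `(x_min, y_min, x_max, y_max)` and the function names are assumptions for this example.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def matches_at_50(pred_box, gt_box):
    """True if the prediction overlaps the ground truth with IoU >= 0.5."""
    return iou(pred_box, gt_box) >= 0.5


ground_truth = (0, 0, 10, 10)
print(matches_at_50((0, 0, 10, 5), ground_truth))    # half of the box: IoU = 0.5
print(matches_at_50((5, 5, 15, 15), ground_truth))   # corner overlap: IoU = 25/175
```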

egoEMOTION: Egocentric Vision and Physiological Signals for Emotion and Personality Recognition in Real-World Tasks

Authors: Matthias Jammot, Björn Braun, Paul Streli, Rafael Wampfler, Christian Holz

Venue: NeurIPS 2025

First: 2025-10-25T03:04:51+00:00 · Latest: 2026-02-24T17:38:14+00:00

Comments: Accepted for publication at NeurIPS 2025

Abs · PDF · Code1 · Code2

Abstract

Understanding affect is central to anticipating human behavior, yet current egocentric vision benchmarks largely ignore the person's emotional states that shape their decisions and actions. Existing tasks in egocentric perception focus on physical activities, hand-object interactions, and attention modeling, assuming neutral affect and uniform personality. This limits the ability of vision systems to capture key internal drivers of behavior. In this paper, we present egoEMOTION, the first dataset that couples egocentric visual and physiological signals with dense self-reports of emotion and personality across controlled and real-world scenarios. Our dataset includes over 50 hours of recordings from 43 participants, captured using Meta's Project Aria glasses. Each session provides synchronized eye-tracking video, head-mounted photoplethysmography, inertial motion data, and physiological baselines for reference. Participants completed emotion-elicitation tasks and naturalistic activities while self-reporting their affective state using the Circumplex Model and Mikels' Wheel as well as their personality via the Big Five model. We define three benchmark tasks: (1) continuous affect classification (valence, arousal, dominance); (2) discrete emotion classification; and (3) trait-level personality inference. We show that a classical learning-based method, as a simple baseline in real-world affect prediction, produces better estimates from signals captured on egocentric vision systems than processing physiological signals. Our dataset establishes emotion and personality as core dimensions in egocentric perception and opens new directions in affect-driven modeling of behavior, intent, and interaction.

中文标题/摘要

标题：egoEMOTION：用于现实任务中情绪和个性识别的第一人称视觉与生理信号

理解情感是预测人类行为的关键，但当前的第一人称视觉基准大多忽略了塑造决策和行动的情绪状态。现有的第一人称感知任务主要关注物理活动、手物交互和注意力建模，假设中性情感和统一的个性特征。这限制了视觉系统捕捉行为关键内部驱动因素的能力。在本文中，我们提出了egoEMOTION，这是第一个将第一人称视觉和生理信号与受控和现实场景中的密集情绪和个性自我报告相结合的数据集。我们的数据集包括来自43名参与者超过50小时的记录，使用Meta的Project Aria眼镜采集。每个会话提供同步的眼球追踪视频、头戴式光电容积描记信号、惯性运动数据和生理基线作为参考。参与者完成了情绪诱发任务和自然活动，同时使用Circumplex模型和Mikels轮盘自我报告其情感状态，并通过Big Five模型报告其个性。我们定义了三个基准任务：（1）连续情感分类（效价、唤醒、支配性）；（2）离散情绪分类；（3）特质水平的个性推断。我们展示了一种经典的基于学习的方法（作为现实情感预测中的简单基线）利用第一人称视觉系统采集的信号得到的估计优于处理生理信号得到的估计。我们的数据集将情绪和个性确立为第一人称感知的核心维度，并为情感驱动的行为、意图和交互建模开辟了新方向。

Summary / 总结

The research aims to understand human emotions and personality in real-world scenarios by integrating egocentric vision and physiological signals. The study presents egoEMOTION, a dataset of over 50 hours of recordings from 43 participants that captures synchronized visual and physiological data along with self-reports of emotion and personality traits. Key findings show that a classical learning-based baseline predicts real-world affect better from signals captured by egocentric vision systems than from physiological signals.

研究旨在通过结合第一人称视角和生理信号来理解真实场景中的人类情感和个性。方法包括使用Meta的Project Aria眼镜收集43名参与者超过50小时的数据，捕捉同步的眼球追踪视频、头戴式光电容积描记信号和惯性运动数据。主要发现表明，一种经典的基于学习的基线方法利用第一人称视角系统采集的信号预测真实场景中的情感，效果优于利用生理信号。

UDVideoQA: A Traffic Video Question Answering Dataset for Multi-Object Spatio-Temporal Reasoning in Urban Dynamics

Authors: Joseph Raj Vishal, Nagasiri Poluri, Katha Naik, Rutuja Patil, Kashyap Hegde Kota, Krishna Vinod, Prithvi Jai Ramesh, Mohammad Farhadi, Yezhou Yang, Bharatesh Chakravarthi

First: 2026-02-24T17:33:12+00:00 · Latest: 2026-02-24T17:33:12+00:00

Abs · PDF · Code1 · Code2 · Project1

Abstract

Understanding the complex, multi-agent dynamics of urban traffic remains a fundamental challenge for video language models. This paper introduces Urban Dynamics VideoQA, a benchmark dataset that captures the unscripted real-world behavior of dynamic urban scenes. UDVideoQA is curated from 16 hours of traffic footage recorded at multiple city intersections under diverse traffic, weather, and lighting conditions. It employs an event-driven dynamic blur technique to ensure privacy preservation without compromising scene fidelity. Using a unified annotation pipeline, the dataset contains 28K question-answer pairs generated across 8 hours of densely annotated video, averaging one question per second. Its taxonomy follows a hierarchical reasoning level, spanning basic understanding and attribution to event reasoning, reverse reasoning, and counterfactual inference, enabling systematic evaluation of both visual grounding and causal reasoning. Comprehensive experiments benchmark 10 SOTA VideoLMs on UDVideoQA and 8 models on a complementary video question generation benchmark. Results reveal a persistent perception-reasoning gap, showing models that excel in abstract inference often fail with fundamental visual grounding. While models like Gemini Pro achieve the highest zero-shot accuracy, fine-tuning the smaller Qwen2.5-VL 7B model on UDVideoQA bridges this gap, achieving performance comparable to proprietary systems. In VideoQGen, Gemini 2.5 Pro and Qwen3 Max generate the most relevant and complex questions, though all models exhibit limited linguistic diversity, underscoring the need for human-centric evaluation. The UDVideoQA suite, including the dataset, annotation tools, and benchmarks for both VideoQA and VideoQGen, provides a foundation for advancing robust, privacy-aware, and real-world multimodal reasoning. UDVideoQA is available at https://ud-videoqa.github.io/UD-VideoQA/UD-VideoQA/.

中文标题/摘要

标题：UDVideoQA：城市动态视频问答数据集，用于城市交通多对象时空推理

理解城市交通中复杂多智能体的动力学仍然是视频语言模型的基本挑战。本文介绍了Urban Dynamics VideoQA基准数据集，该数据集捕捉了动态城市场景的非剧本化现实行为。UDVideoQA源自16小时的交通录像，记录了多个城市交叉口在不同交通、天气和光照条件下的动态。该数据集采用事件驱动的动态模糊技术，确保隐私保护同时不牺牲场景保真度。使用统一的注释流水线，数据集包含8小时密集注释视频中的28K问答对，平均每秒一个问题。其分类法遵循分层推理级别，从基本理解到事件推理、逆向推理和反事实推理，使视觉定位和因果推理的系统评估成为可能。全面的实验在UDVideoQA上基准测试了10个SOTA视频语言模型，在一个互补的视频问答生成基准上测试了8个模型。结果揭示了感知-推理差距的持续存在，表明在抽象推理方面表现优异的模型往往在基本视觉定位方面失败。虽然Gemini Pro等模型在零样本准确性上达到最高，但对UDVideoQA进行微调的小型Qwen2.5-VL 7B模型实现了与专有系统相当的性能。在VideoQGen中，Gemini 2.5 Pro和Qwen3 Max生成最相关和复杂的问答，但所有模型都表现出有限的语言多样性，强调了人类中心评估的必要性。UDVideoQA套件包括数据集、注释工具以及用于视频问答和视频问答生成的基准测试，为推进稳健、隐私保护和现实世界多模态推理提供了基础。UDVideoQA可在https://ud-videoqa.github.io/UD-VideoQA/UD-VideoQA/ 获取。

Summary / 总结

The paper introduces UDVideoQA, a benchmark dataset for evaluating multi-object spatio-temporal reasoning through question answering over urban traffic video. It is curated from 16 hours of traffic footage and contains 28K question-answer pairs generated across 8 hours of densely annotated video. Experiments reveal a persistent perception-reasoning gap: models that excel at abstract inference often fail at basic visual grounding. Models like Gemini Pro achieve the highest zero-shot accuracy, while fine-tuning the smaller Qwen2.5-VL 7B on UDVideoQA reaches performance comparable to proprietary systems. The suite supports both VideoQA and VideoQGen benchmarks, promoting robust and privacy-aware multimodal reasoning.

论文介绍了UDVideoQA数据集，用于通过交通视频问答评估城市动态中的多对象时空推理。该数据集包含16小时的交通录像，有28K个问题-答案对，重点关注多样化的城市场景。实验表明，在抽象推理和视觉接地之间存在差距，通过在UDVideoQA上微调Qwen2.5-VL 7B可以提高性能。如Gemini Pro等模型在零样本准确率方面表现出色，但在视觉接地方面却遇到困难。该数据集支持视频问答和视频问题生成的基准测试，促进稳健且隐私保护的多模态推理。

An Enhanced Projection Pursuit Tree Classifier with Visual Methods for Assessing Algorithmic Improvements

Authors: Natalia da Silva, Dianne Cook, Eun-Kyung Lee

First: 2026-02-24T17:27:17+00:00 · Latest: 2026-02-24T17:27:17+00:00

Abs · PDF · Code1 · Code2

Abstract

This paper presents enhancements to the projection pursuit tree classifier and visual diagnostic methods for assessing their impact in high dimensions. The original algorithm uses linear combinations of variables in a tree structure where depth is constrained to be less than the number of classes -- a limitation that proves too rigid for complex classification problems. Our extensions improve performance in multi-class settings with unequal variance-covariance structures and nonlinear class separations by allowing more splits and more flexible class groupings in the projection pursuit computation. Proposing algorithmic improvements is straightforward; demonstrating their actual utility is not. We therefore develop two visual diagnostic approaches to verify that the enhancements perform as intended. Using high-dimensional visualization techniques, we examine model fits on benchmark datasets to assess whether the algorithm behaves as theorized. An interactive web application enables users to explore the behavior of both the original and enhanced classifiers under controlled scenarios. The enhancements are implemented in the R package PPtreeExt.

中文标题/摘要

标题：一种增强的投影追求树分类器及其可视化方法以评估算法改进

本文提出了对投影追求树分类器的改进以及用于评估其在高维空间中影响的可视化诊断方法。原始算法在树结构中使用变量的线性组合，其中深度限制为类数以下——这一限制对于复杂分类问题来说过于僵化。我们的扩展在多类设置中提高了性能，特别是在具有不等方差-协方差结构和非线性类分离的情况下，通过允许更多的分割和更灵活的类群组来改进投影追求计算。提出算法改进相对简单，但证明其实际效用则不然。因此，我们开发了两种可视化诊断方法来验证这些改进是否按预期执行。利用高维可视化技术，我们对基准数据集进行模型拟合检查，以评估算法是否如理论所述行为。一个交互式网络应用程序使用户能够在受控场景下探索原始和增强分类器的行为。这些改进已实现于R包PPtreeExt中。

Cooperative-Competitive Team Play of Real-World Craft Robots

Authors: Rui Zhao, Xihui Li, Yizheng Zhang, Yuzhen Liu, Zhong Zhang, Yufeng Zhang, Cheng Zhou, Zhengyou Zhang, Lei Han

Venue: ICRA 2026

First: 2026-02-24T17:15:37+00:00 · Latest: 2026-02-24T17:15:37+00:00

Comments: Accepted by 2026 IEEE International Conference on Robotics and Automation (ICRA 2026), Vienna, Austria

Abs · PDF · Code1 · Code2

Abstract

Multi-agent deep Reinforcement Learning (RL) has made significant progress in developing intelligent game-playing agents in recent years. However, the efficient training of collective robots using multi-agent RL and the transfer of learned policies to real-world applications remain open research questions. In this work, we first develop a comprehensive robotic system, including simulation, distributed learning framework, and physical robot components. We then propose and evaluate reinforcement learning techniques designed for efficient training of cooperative and competitive policies on this platform. To address the challenges of multi-agent sim-to-real transfer, we introduce Out of Distribution State Initialization (OODSI) to mitigate the impact of the sim-to-real gap. In the experiments, OODSI improves the Sim2Real performance by 20%. We demonstrate the effectiveness of our approach through experiments with a multi-robot car competitive game and a cooperative task in real-world settings.

中文标题/摘要

标题：真实世界工艺机器人协作竞争团队玩法

近年来，多智能体深度强化学习（RL）在开发智能游戏代理方面取得了显著进展。然而，使用多智能体RL高效训练集体机器人以及将学到的策略转移到实际应用中仍然是开放的研究问题。在本文中，我们首先开发了一个全面的机器人系统，包括仿真、分布式学习框架和物理机器人组件。然后，我们提出并评估了针对该平台设计的高效训练协作和竞争策略的强化学习技术。为了解决多智能体仿真到现实世界的转移挑战，我们引入了离域状态初始化（OODSI）以减轻仿真到现实世界的差距影响。在实验中，OODSI将仿真到现实世界的性能提高了20%。我们通过多机器人汽车竞争游戏和实际环境中的协作任务实验展示了我们方法的有效性。

Summary / 总结

This study aims to develop efficient training methods for cooperative and competitive policies of multi-agent robots using deep reinforcement learning. The researchers created a comprehensive robotic system and introduced Out of Distribution State Initialization (OODSI) to improve sim-to-real transfer, achieving a 20% improvement in Sim2Real performance. Experiments with multi-robot car games and cooperative tasks in real-world settings validated the approach.

本文旨在解决使用多智能体深度强化学习（RL）训练集体机器人以及将学习到的策略转移到实际应用中的挑战。作者开发了一个全面的机器人系统，并提出了一种离分布状态初始化（OODSI）方法以提高模拟到现实的性能。实验结果显示，OODSI在多机器人汽车的竞速游戏和协作任务中将模拟到现实的性能提高了20%。

AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering

Authors: Yuzhu Cai, Zexi Liu, Xinyu Zhu, Cheng Wang, Siheng Chen

First: 2026-02-08T10:55:03+00:00 · Latest: 2026-02-24T17:14:22+00:00

Comments: 17 pages, 5 figures

Abs · PDF · Code1 · Code2 · Code3

Abstract

Autonomous Machine Learning Engineering (MLE) requires agents to perform sustained, iterative optimization over long horizons. While recent LLM-based agents show promise, current prompt-based agents for MLE suffer from behavioral stagnation due to frozen parameters. Although Reinforcement Learning (RL) offers a remedy, applying it to MLE is hindered by prohibitive execution latency and inefficient data selection. Recognizing these challenges, we propose AceGRPO with two core components: (1) Evolving Data Buffer that continuously repurposes execution traces into reusable training tasks, and (2) Adaptive Sampling guided by a Learnability Potential function, which dynamically prioritizes tasks at the agent's learning frontier to maximize learning efficiency. Leveraging AceGRPO, our trained Ace-30B model achieves a 100% valid submission rate on MLE-Bench-Lite, approaches the performance of proprietary frontier models, and outperforms larger open-source baselines (e.g., DeepSeek-V3.2), demonstrating robust capability for sustained iterative optimization. Code is available at https://github.com/yuzhu-cai/AceGRPO.

中文标题/摘要

标题：AceGRPO：自适应课程增强组相对策略优化在自主机器学习工程中的应用

自主机器学习工程（MLE）要求代理在长时间范围内进行持续迭代优化。虽然基于大语言模型的代理显示出潜力，但当前基于提示的MLE代理因参数冻结而表现出行为停滞。尽管强化学习（RL）可以提供解决方案，但将其应用于MLE受到执行延迟高和数据选择低效的阻碍。鉴于这些挑战，我们提出了AceGRPO，其包含两个核心组件：（1）不断重新利用执行跟踪的演变数据缓冲区，以及（2）由可学习潜力函数引导的自适应采样，该函数动态优先考虑代理学习前沿的任务以最大化学习效率。利用AceGRPO，我们训练的Ace-30B模型在MLE-Bench-Lite上实现了100%的有效提交率，接近专有前沿模型的性能，并优于更大的开源基线（例如DeepSeek-V3.2），展示了其在持续迭代优化中的稳健能力。代码可在https://github.com/yuzhu-cai/AceGRPO/获取。

Summary / 总结

AceGRPO addresses two obstacles in Autonomous Machine Learning Engineering: the behavioral stagnation of prompt-based agents with frozen parameters, and the execution latency and inefficient data selection that hinder Reinforcement Learning in this setting. It introduces an Evolving Data Buffer that continuously repurposes execution traces into reusable training tasks, and Adaptive Sampling that prioritizes tasks according to a Learnability Potential function. The trained Ace-30B model achieves a 100% valid submission rate on MLE-Bench-Lite, approaches the performance of proprietary frontier models, and outperforms larger open-source baselines, showing robust capability for sustained iterative optimization.

AceGRPO旨在解决自主机器学习工程中的挑战，克服当前基于提示的代理和传统强化学习的局限性。它引入了不断进化的数据缓冲区和自适应采样，以持续优化任务并优先在代理的学习前沿进行学习。训练后的Ace-30B模型在MLE-Bench-Lite上实现了100%的有效提交率，接近专有模型的性能，并优于更大的开源基线，展示了其持续迭代优化的能力。

SimToolReal: An Object-Centric Policy for Zero-Shot Dexterous Tool Manipulation

Authors: Kushal Kedia, Tyler Ga Wei Lum, Jeannette Bohg, C. Karen Liu

First: 2026-02-18T20:42:39+00:00 · Latest: 2026-02-24T17:10:02+00:00

Abs · PDF · Code1 · Code2

Abstract

The ability to manipulate tools significantly expands the set of tasks a robot can perform. Yet, tool manipulation represents a challenging class of dexterity, requiring grasping thin objects, in-hand object rotations, and forceful interactions. Since collecting teleoperation data for these behaviors is challenging, sim-to-real reinforcement learning (RL) is a promising alternative. However, prior approaches typically require substantial engineering effort to model objects and tune reward functions for each task. In this work, we propose SimToolReal, taking a step towards generalizing sim-to-real RL policies for tool manipulation. Instead of focusing on a single object and task, we procedurally generate a large variety of tool-like object primitives in simulation and train a single RL policy with the universal goal of manipulating each object to random goal poses. This approach enables SimToolReal to perform general dexterous tool manipulation at test-time without any object or task-specific training. We demonstrate that SimToolReal outperforms prior retargeting and fixed-grasp methods by 37% while matching the performance of specialist RL policies trained on specific target objects and tasks. Finally, we show that SimToolReal generalizes across a diverse set of everyday tools, achieving strong zero-shot performance over 120 real-world rollouts spanning 24 tasks, 12 object instances, and 6 tool categories.

中文标题/摘要

标题：SimToolReal：一种面向对象的零样本灵巧工具操作策略

操作工具的能力显著扩展了机器人可以执行的任务集。然而，工具操作代表了一类具有挑战性的灵巧性，需要抓取细长物体、手持物体旋转以及进行有力的交互。由于收集这些行为的遥操作数据具有挑战性，因此模拟到现实的强化学习（RL）是一种有前途的替代方案。然而，先前的方法通常需要大量的工程努力来建模物体并调整每个任务的奖励函数。在本文中，我们提出了SimToolReal，朝着为工具操作生成通用的模拟到现实的RL策略迈出了一步。我们不是专注于单一物体和任务，而是通过程序生成大量类似工具的物体原语，并训练一个具有通用目标的RL策略，即操纵每个物体到随机目标姿态。这种方法使SimToolReal能够在测试时进行通用灵巧工具操作，而无需任何物体或任务特定的训练。我们证明SimToolReal比先前的重定向和固定抓取方法高出37%，同时与针对特定目标物体和任务训练的专家RL策略的性能相当。最后，我们展示了SimToolReal可泛化到多种日常工具，在涵盖24个任务、12个物体实例和6个工具类别的120次真实世界操作中实现了强大的零样本性能。

Summary / 总结

The paper introduces SimToolReal, a policy designed for zero-shot dexterous tool manipulation, which procedurally generates a variety of tool-like objects in simulation and trains a single RL policy to manipulate these objects to random goal poses. The approach significantly outperforms previous methods by 37% and achieves strong zero-shot performance across 120 real-world rollouts involving 24 tasks, 12 objects, and 6 tool categories.

该论文提出了一种名为SimToolReal的方法，通过强化学习实现零样本工具操作。该方法在模拟中生成多种工具样物体，并训练一个单一策略将这些物体移动到随机目标位置。这种方法使得策略能够在不进行额外训练的情况下泛化到真实世界的工具。结果表明，SimToolReal 的性能比之前的方法高出37%，并在120次真实世界操作中实现了强大的零样本性能，涉及24个任务、12个物体实例和6个工具类别。

BrepGaussian: CAD reconstruction from Multi-View Images with Gaussian Splatting

Authors: Jiaxing Yu, Dongyang Ren, Hangyu Xu, Zhouyuxiao Yang, Yuanqi Li, Jie Guo, Zhengkang Zhou, Yanwen Guo

Venue: CVPR 2026

First: 2026-02-24T17:03:45+00:00 · Latest: 2026-02-24T17:03:45+00:00

Comments: Accepted to CVPR 2026

Abs · PDF · Code1 · Code2

Abstract

The boundary representation (B-rep) models a 3D solid as its explicit boundaries: trimmed corners, edges, and faces. Recovering B-rep representation from unstructured data is a challenging and valuable task of computer vision and graphics. Recent advances in deep learning have greatly improved the recovery of 3D shape geometry, but still depend on dense and clean point clouds and struggle to generalize to novel shapes. We propose B-rep Gaussian Splatting (BrepGaussian), a novel framework that learns 3D parametric representations from 2D images. We employ a Gaussian Splatting renderer with learnable features, followed by a specific fitting strategy. To disentangle geometry reconstruction and feature learning, we introduce a two-stage learning framework that first captures geometry and edges and then refines patch features to achieve clean geometry and coherent instance representations. Extensive experiments demonstrate the superior performance of our approach to state-of-the-art methods. We will release our code and datasets upon acceptance.

中文标题/摘要

标题：BrepGaussian：多视图图像中的高斯点绘制法进行CAD重建

边界表示（B-rep）模型将3D实体表示为其显式的边界：修剪的角、边和面。从无结构数据中恢复B-rep表示是计算机视觉和图形学中一个具有挑战性和价值的任务。最近深度学习的进步极大地提高了3D形状几何的恢复，但仍依赖于密集且干净的点云，并且难以泛化到新形状。我们提出了一种新颖的框架B-rep高斯点绘制（BrepGaussian），该框架从2D图像中学习3D参数表示。我们采用具有可学习特征的高斯点绘制渲染器，随后采用特定的拟合策略。为了分离几何重建和特征学习，我们引入了一种两阶段学习框架，首先捕捉几何和边缘，然后细化补丁特征以实现干净的几何和一致的实例表示。大量实验表明，我们的方法在最先进的方法中表现出更优的性能。在接收后我们将发布我们的代码和数据集。

Summary / 总结

The paper aims to recover the boundary representation (B-rep) of 3D solids from multi-view images using a novel framework called B-rep Gaussian Splatting (BrepGaussian). This method uses a Gaussian Splatting renderer with learnable features and a two-stage learning strategy to first capture geometry and edges, then refine patch features. Experiments show that BrepGaussian outperforms existing methods in recovering clean and coherent 3D geometry from 2D images.

研究旨在从多视角图像中恢复3D固体的边界表示（B-rep），解决现有方法依赖密集点云和难以处理新形状的问题。提出的B-rep高斯斑点（BrepGaussian）框架采用两阶段学习过程，首先捕捉几何形状和边缘，然后细化斑块特征以实现干净的几何形状和一致的实例表示。实验表明，BrepGaussian在准确性和对新形状的泛化能力方面优于现有方法。

Event-Aided Sharp Radiance Field Reconstruction for Fast-Flying Drones

Authors: Rong Zou, Marco Cannici, Davide Scaramuzza

First: 2026-02-24T17:02:56+00:00 · Latest: 2026-02-24T17:02:56+00:00

Abs · PDF · Code1 · Code2

Abstract

Fast-flying aerial robots promise rapid inspection under limited battery constraints, with direct applications in infrastructure inspection, terrain exploration, and search and rescue. However, high speeds lead to severe motion blur in images and induce significant drift and noise in pose estimates, making dense 3D reconstruction with Neural Radiance Fields (NeRFs) particularly challenging due to their high sensitivity to such degradations. In this work, we present a unified framework that leverages asynchronous event streams alongside motion-blurred frames to reconstruct high-fidelity radiance fields from agile drone flights. By embedding event-image fusion into NeRF optimization and jointly refining event-based visual-inertial odometry priors using both event and frame modalities, our method recovers sharp radiance fields and accurate camera trajectories without ground-truth supervision. We validate our approach on both synthetic data and real-world sequences captured by a fast-flying drone. Despite highly dynamic drone flights, where RGB frames are severely degraded by motion blur and pose priors become unreliable, our method reconstructs high-fidelity radiance fields and preserves fine scene details, delivering a performance gain of over 50% on real-world data compared to state-of-the-art methods.

中文标题/摘要

标题：基于事件辅助的快速飞行无人机锐利辐射场重建

快速飞行的空中机器人在有限的电池约束下允许多快好省的检查，直接应用于基础设施检查、地形探索和搜救。然而，高速度导致图像中严重的运动模糊，使姿态估计产生显著的漂移和噪声，使得基于神经辐射场（NeRF）的密集三维重建特别具有挑战性，因为它们对这些退化非常敏感。在本文中，我们提出了一种统一框架，利用异步事件流与运动模糊帧相结合，从敏捷的无人机飞行中重建高保真辐射场。通过将事件-图像融合嵌入到NeRF优化中，并使用事件和帧模态联合精炼基于事件的视觉-惯性里程计先验，我们的方法在没有地面真实监督的情况下恢复了锐利的辐射场和准确的相机轨迹。我们在合成数据和由快速飞行无人机捕获的真实序列上验证了我们的方法。尽管无人机飞行高度动态，导致RGB帧严重退化且姿态先验变得不可靠，但我们的方法仍能重建高保真辐射场并保留场景细节，与最先进的方法相比，在真实数据上的性能提高了超过50%。

Summary / 总结

This work addresses the challenge of dense 3D reconstruction for fast-flying drones using Neural Radiance Fields (NeRFs), which are sensitive to motion blur and pose drift. The authors propose a unified framework that integrates event streams with motion-blurred frames to reconstruct high-fidelity radiance fields. By fusing events and images in NeRF optimization and jointly refining visual-inertial odometry priors, the method achieves accurate radiance fields and camera trajectories without ground-truth supervision. Experiments on both synthetic and real-world data demonstrate that the proposed method outperforms state-of-the-art techniques by over 50% in real-world scenarios.

该研究通过将事件流和运动模糊帧集成到Neural Radiance Field优化中，解决了快速飞行无人机的高保真辐射场重建问题。该方法利用事件和帧数据共同优化视觉惯性里程计先验，实现了在无需地面真值监督的情况下准确的相机轨迹恢复和锐利的辐射场重建。实验结果表明，该方法在合成数据和真实场景数据上均优于现有方法，性能提升超过50%，即使在严重的运动模糊和不可靠的位姿估计情况下也能保持场景细节。

Skullptor: High Fidelity 3D Head Reconstruction in Seconds with Multi-View Normal Prediction

Authors: Noé Artru, Rukhshanda Hussain, Emeline Got, Alexandre Messier, David B. Lindell, Abdallah Dib

Venue: CVPR

First: 2026-02-24T17:02:11+00:00 · Latest: 2026-02-24T17:02:11+00:00

Comments: 14 pages, 8 figures, to be published in proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Abs · PDF · Code1 · Code2

Abstract

Reconstructing high-fidelity 3D head geometry from images is critical for a wide range of applications, yet existing methods face fundamental limitations. Traditional photogrammetry achieves exceptional detail but requires extensive camera arrays (25-200+ views), substantial computation, and manual cleanup in challenging areas like facial hair. Recent alternatives present a fundamental trade-off: foundation models enable efficient single-image reconstruction but lack fine geometric detail, while optimization-based methods achieve higher fidelity but require dense views and expensive computation. We bridge this gap with a hybrid approach that combines the strengths of both paradigms. Our method introduces a multi-view surface normal prediction model that extends monocular foundation models with cross-view attention to produce geometrically consistent normals in a feed-forward pass. We then leverage these predictions as strong geometric priors within an inverse rendering optimization framework to recover high-frequency surface details. Our approach outperforms state-of-the-art single-image and multi-view methods, achieving high-fidelity reconstruction on par with dense-view photogrammetry while reducing camera requirements and computational cost. The code and model will be released.

中文标题/摘要

标题：Skullptor：秒级多视角表面法线预测的高保真3D头像重建

从图像中重建高保真3D头像几何结构对于广泛的应用至关重要，但现有方法面临根本性的限制。传统的摄影测量能够实现极高的细节，但需要大量的相机阵列（25-200+视角）、大量的计算以及在面部毛发等复杂区域的手动清理。最近的替代方案存在根本性的权衡：基础模型能够高效地进行单图像重建，但缺乏精细的几何细节，而基于优化的方法能够实现更高的保真度，但需要密集的视角和昂贵的计算。我们通过结合两种范式的优点来弥合这一差距。我们的方法引入了一种多视角表面法线预测模型，该模型将单目基础模型与跨视角注意力相结合，在前向传递中生成几何上一致的法线。然后，我们利用这些预测作为逆渲染优化框架中的强几何先验，以恢复高频表面细节。我们的方法在单图像和多视角方法中表现出色，实现了与密集视角摄影测量相媲美的高保真重建，同时减少了相机需求和计算成本。代码和模型将被发布。

Summary / 总结

The research aims to improve high-fidelity 3D head reconstruction from images, addressing the limitations of existing methods. It combines monocular foundation models with multi-view surface normal prediction and inverse rendering optimization to achieve high detail with fewer cameras and less computation. The method outperforms both single-image and multi-view approaches, matching the detail of dense-view photogrammetry while reducing requirements and costs.

研究旨在通过图像实现高保真3D头部重建，解决现有方法的局限性。提出了一种结合单目基础模型和多视角表面法线预测以及逆向渲染优化的混合方法。该方法优于现有的单图像和多视角方法，减少了对大量摄像头和计算成本的需求，同时实现了与密集视角摄影测量相媲美的几何细节重建。

Probing Graph Neural Network Activation Patterns Through Graph Topology

Authors: Floriano Tori, Lorenzo Bini, Marco Sorbi, Stéphane Marchand-Maillet, Vincent Ginis

First: 2026-02-24T16:52:36+00:00 · Latest: 2026-02-24T16:52:36+00:00

Abs · PDF · Code1 · Code2

Abstract

Curvature notions on graphs provide a theoretical description of graph topology, highlighting bottlenecks and denser connected regions. Artifacts of the message passing paradigm in Graph Neural Networks, such as oversmoothing and oversquashing, have been attributed to these regions. However, it remains unclear how the topology of a graph interacts with the learned preferences of GNNs. Through Massive Activations, which correspond to extreme edge activation values in Graph Transformers, we probe this correspondence. Our findings on synthetic graphs and molecular benchmarks reveal that MAs do not preferentially concentrate on curvature extremes, despite their theoretical link to information flow. On the Long Range Graph Benchmark, we identify a systemic "curvature shift": global attention mechanisms exacerbate topological bottlenecks, drastically increasing the prevalence of negative curvature. Our work reframes curvature as a diagnostic probe for understanding when and why graph learning fails.

中文标题/摘要

标题：通过图拓扑探究图神经网络激活模式

图上的曲率概念为图拓扑提供了一个理论描述，突显了瓶颈和更密集连接的区域。图神经网络中消息传递范式带来的过平滑（oversmoothing）和过挤压（oversquashing）等问题被认为与这些区域有关。然而，图的拓扑结构如何与GNNs学习的偏好相互作用仍然不清楚。通过Massive Activations，即图Transformer中的极端边激活值，我们探究了这种对应关系。在合成图和分子基准测试中，我们的发现表明，尽管理论上与信息流有关，MAs并不倾向于集中在曲率极端上。在Long Range Graph基准测试中，我们识别出一种系统性的“曲率偏移”：全局注意力机制加剧了拓扑瓶颈，大幅增加了负曲率的出现频率。我们的工作将曲率重新定义为理解何时以及为何图学习失败的一种诊断探针。

Summary / 总结

The study investigates how graph topology interacts with the learned preferences of Graph Neural Networks (GNNs) by analyzing Massive Activations (MAs) in Graph Transformers. The research finds that MAs do not concentrate on curvature extremes, despite the theoretical link to information flow. On the Long Range Graph Benchmark, a systemic curvature shift is identified, where global attention mechanisms exacerbate topological bottlenecks, leading to an increase in negative curvature. This work reframes curvature as a diagnostic tool to understand when and why graph learning fails.

研究通过分析图变换器中的极大激活（MAs）来探究图拓扑与图神经网络（GNNs）学习偏好之间的交互。在合成图和分子基准测试中，MAs 并未集中在曲率极端区域，尽管存在理论上的信息流关联。在长范围图基准测试中，全局注意力机制加剧了拓扑瓶颈，导致负曲率显著增加。这项工作将曲率重新定义为诊断工具，以理解何时以及为什么图学习会失败。

ProxyFL: A Proxy-Guided Framework for Federated Semi-Supervised Learning

Authors: Duowen Chen, Yan Wang

Venue: CVPR 2026

First: 2026-02-24T16:41:16+00:00 · Latest: 2026-02-24T16:41:16+00:00

Comments: CVPR 2026. code: https://github.com/DuowenC/FSSLlib

Abs · PDF · Code1 · Code2 · Code3

Abstract

Federated Semi-Supervised Learning (FSSL) aims to collaboratively train a global model across clients by leveraging partially-annotated local data in a privacy-preserving manner. In FSSL, data heterogeneity is a challenging issue that exists both across clients and within clients. External heterogeneity refers to the data distribution discrepancy across different clients, while internal heterogeneity represents the mismatch between labeled and unlabeled data within clients. Most FSSL methods design fixed or dynamic parameter aggregation strategies to collect client knowledge on the server (external) and/or filter out low-confidence unlabeled samples to reduce mistakes in local clients (internal). However, the former struggles to precisely fit the ideal global distribution via direct weights, and the latter reduces the amount of data participating in FL training. To this end, we propose a proxy-guided framework called ProxyFL that simultaneously mitigates external and internal heterogeneity via a unified proxy. Specifically, we treat the learnable weights of the classifier as a proxy to simulate the category distribution both locally and globally. For external heterogeneity, we explicitly optimize the global proxy against outliers instead of direct weights; for internal heterogeneity, we re-include the discarded samples in training through a positive-negative proxy pool to mitigate the impact of potentially incorrect pseudo-labels. Insight experiments and theoretical analysis demonstrate the method's significant performance and convergence in FSSL.

中文标题/摘要

标题：ProxyFL：一种代理引导的联邦半监督学习框架

联邦半监督学习（FSSL）旨在通过利用客户端部分标注的本地数据以隐私保护的方式协作训练全局模型。在FSSL中，数据异质性是一个具有挑战性的问题，它既存在于客户端之间，也存在于客户端内部。外部异质性指的是不同客户端之间数据分布的差异，而内部异质性则表示客户端内标记数据和未标记数据之间的不匹配。大多数FSSL方法通常设计固定或动态的参数聚合策略，以收集客户端知识在服务器端（外部）和/或过滤掉低置信度的未标记样本以减少本地客户端中的错误。但是，前者难以通过直接权重精确拟合理想的全局分布，而后者会导致较少的数据参与FL训练。为此，我们提出了一种名为ProxyFL的代理引导框架，旨在通过统一的代理同时缓解外部和内部异质性。即，我们将分类器的可学习权重视为代理，以模拟局部和全局的类别分布。对于外部，我们明确地将全局代理优化为异常值而不是直接权重；对于内部，我们通过正负代理池重新包含被丢弃的样本进行训练，以减轻潜在错误伪标签的影响。洞察实验和理论分析显示了我们在FSSL中的显著性能和收敛性。

Summary / 总结

ProxyFL is a framework designed to address data heterogeneity in Federated Semi-Supervised Learning (FSSL) by leveraging learnable classifier weights as proxies. It mitigates external heterogeneity by optimizing global proxies against outliers and internal heterogeneity by re-including discarded samples through a positive-negative proxy pool. Experimental results demonstrate significant performance and convergence improvements in FSSL compared to existing methods.

ProxyFL 是一个框架，旨在通过利用可学习的分类器权重作为代理来解决联邦半监督学习（FSSL）中的数据异质性问题，以同时缓解外部和内部异质性。它通过显式优化全局代理来处理异常值，并通过正负代理池重新纳入被丢弃的样本进行训练，以提高伪标签的准确性。实验结果表明，与现有方法相比，ProxyFL 在 FSSL 中表现出显著的性能和收敛性改进。

Localized Dynamics-Aware Domain Adaption for Off-Dynamics Offline Reinforcement Learning

Authors: Zhangjie Xia, Yu Yang, Pan Xu

First: 2026-02-24T16:32:50+00:00 · Latest: 2026-02-24T16:32:50+00:00

Comments: 33 pages, 9 figures, 11 tables

Abs · PDF · Code1 · Code2

Abstract

Off-dynamics offline reinforcement learning (RL) aims to learn a policy for a target domain using limited target data and abundant source data collected under different transition dynamics. Existing methods typically address dynamics mismatch either globally over the state space or via pointwise data filtering; these approaches can miss localized cross-domain similarities or incur high computational cost. We propose Localized Dynamics-Aware Domain Adaptation (LoDADA), which exploits localized dynamics mismatch to better reuse source data. LoDADA clusters transitions from source and target datasets and estimates cluster-level dynamics discrepancy via domain discrimination. Source transitions from clusters with small discrepancy are retained, while those from clusters with large discrepancy are filtered out. This yields a fine-grained and scalable data selection strategy that avoids overly coarse global assumptions and expensive per-sample filtering. We provide theoretical insights and extensive experiments across environments with diverse global and local dynamics shifts. Results show that LoDADA consistently outperforms state-of-the-art off-dynamics offline RL methods by better leveraging localized distribution mismatch.
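
The cluster-then-filter idea described in this abstract can be sketched in a few lines. This is a rough illustration under stated assumptions, not the authors' implementation: the abstract estimates cluster-level discrepancy via domain discrimination, whereas the sketch below swaps in a much cruder proxy (the gap between mean next states of source and target transitions within each cluster). The function names, the `threshold` parameter, and the dictionary layout are all hypothetical.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a cluster label per row of x."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = x[labels == j].mean(0)
    return labels

def select_source(src, tgt, k=2, threshold=0.5, seed=0):
    """Return a boolean mask over source transitions to keep.

    src, tgt: dicts with 'sa' (N, d) state-action features and
    'next' (N, m) next states. Hypothetical layout for illustration.
    """
    n_src = len(src["sa"])
    # Cluster source and target transitions jointly in state-action space.
    labels = kmeans(np.concatenate([src["sa"], tgt["sa"]]), k, seed=seed)
    src_labels, tgt_labels = labels[:n_src], labels[n_src:]
    keep = np.zeros(n_src, dtype=bool)
    for j in range(k):
        s_mask, t_mask = src_labels == j, tgt_labels == j
        if not s_mask.any() or not t_mask.any():
            continue  # no target evidence for this cluster: drop its source data
        # Stand-in discrepancy (assumption): gap between mean next states.
        gap = np.linalg.norm(src["next"][s_mask].mean(0) - tgt["next"][t_mask].mean(0))
        if gap <= threshold:
            keep[s_mask] = True
    return keep
```

The point of scoring clusters rather than individual samples is that one discrepancy estimate is shared by every transition in the cluster, which is where the scalability claimed in the abstract would come from.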

中文标题/摘要

标题：基于局部动力学感知的域适应离线强化学习

离线强化学习（RL）旨在使用目标域的有限目标数据和不同过渡动力学下丰富的源数据来学习一个策略。现有方法通常在状态空间上全局处理动力学差异，或通过点数据过滤；这些方法可能会错过局部跨域相似性或导致高计算成本。我们提出了基于局部动力学感知的域适应（LoDADA），它利用局部动力学差异更好地重用源数据。LoDADA 对源数据集和目标数据集中的过渡进行聚类，并通过域区分估计聚类级别的动力学差异。来自动力学差异较小的聚类的源过渡被保留，而来自动力学差异较大的聚类的过渡被过滤掉。这提供了一种细粒度且可扩展的数据选择策略，避免了过于粗略的全局假设和昂贵的逐样本过滤。我们提供了理论见解，并在具有不同全局和局部动力学变化的环境中进行了广泛的实验。结果表明，LoDADA 通过更好地利用局部分布差异，一致地优于最先进的离线强化学习方法。

Summary / 总结

The research addresses the challenge of learning a policy for a target domain using limited target data and abundant source data with different transition dynamics. It proposes Localized Dynamics-Aware Domain Adaptation (LoDADA), which clusters transitions from source and target datasets and retains source transitions from clusters with small dynamics discrepancy, filtering out those with large discrepancy. This method avoids global assumptions and expensive per-sample filtering, leading to better performance in diverse environments with both global and local dynamics shifts compared to existing methods.

论文针对离线动力学离线强化学习问题，目标是在有限的目标数据和大量具有不同转移动力学的源数据下学习一个策略。现有方法要么全局处理动力学差异，要么使用点对点的数据过滤，这可能会错过局部相似性或计算成本高昂。提出的局部动力学感知域适应（LoDADA）方法对源和目标数据集中的转换进行聚类，并通过域鉴别估计簇级动力学差异。它保留来自小差异簇的源转换，并过滤掉来自大差异簇的转换，提供了一种细粒度且可扩展的数据选择策略。在各种环境中的实验结果表明，LoDADA通过有效利用局部分布差异，优于最先进的离线强化学习方法。

PhysE-Inv: A Physics-Encoded Inverse Modeling approach for Arctic Snow Depth Prediction

Authors: Akila Sampath, Vandana Janeja, Jianwu Wang

First: 2026-01-23T00:43:51+00:00 · Latest: 2026-02-24T16:26:41+00:00

Abs · PDF · Code1 · Code2

Abstract

The accurate estimation of Arctic snow depth remains a critical time-varying inverse problem due to the scarcity in associated sea ice parameters. Existing process-based and data-driven models are either highly sensitive to sparse data or lack the physical interpretability required for climate-critical applications. To address this gap, we introduce PhysE-Inv, a novel framework that integrates a sophisticated sequential architecture, namely an LSTM Encoder-Decoder with Multi-head Attention and contrastive learning, with physics-guided inference. Our core innovation lies in a physics-constrained inversion methodology. This methodology first leverages the hydrostatic balance forward model as a target-formulation proxy, enabling effective learning in the absence of direct ground truth; second, it uses reconstruction physics regularization over a latent space to dynamically discover hidden physical parameters from noisy, incomplete time-series input. Evaluated against state-of-the-art baselines, PhysE-Inv significantly improves prediction performance, reducing error by 20% while demonstrating superior physical consistency and resilience to data sparsity compared to empirical methods. Beyond Arctic snow depth, PhysE-Inv can be applied broadly to other noisy, data-scarce problems in Earth and climate science.
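
For readers unfamiliar with the hydrostatic balance relation mentioned above, the following is a minimal sketch of the textbook form for snow-covered sea ice floating in equilibrium. It is an assumption that the paper uses this exact formulation; the abstract does not state its variables. The density constants are typical illustrative values, and the function names are hypothetical.

```python
# Illustrative densities in kg/m^3 (typical textbook values, not from the paper).
RHO_WATER, RHO_ICE, RHO_SNOW = 1024.0, 915.0, 300.0

def total_freeboard(ice_thickness: float, snow_depth: float) -> float:
    """Forward model: height of the snow surface above sea level (m).

    Follows from equating the weight of the ice-plus-snow column with the
    weight of the displaced sea water.
    """
    return ((RHO_WATER - RHO_ICE) * ice_thickness
            + (RHO_WATER - RHO_SNOW) * snow_depth) / RHO_WATER

def snow_depth_from_freeboard(freeboard: float, ice_thickness: float) -> float:
    """Inverse: snow depth (m) implied by total freeboard and ice thickness."""
    return (RHO_WATER * freeboard
            - (RHO_WATER - RHO_ICE) * ice_thickness) / (RHO_WATER - RHO_SNOW)
```

With a known ice thickness the inversion is closed-form. The inverse problem becomes hard when ice thickness and densities are themselves uncertain or unobserved, which is the data-scarce setting the abstract describes.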

中文标题/摘要

标题：PhysE-Inv：一种用于北极积雪深度预测的物理编码逆模型方法

由于缺乏相关的海冰参数，北极积雪深度的准确估计仍然是一个关键的时间变化逆问题。现有的基于过程和数据驱动的模型要么对稀疏数据高度敏感，要么缺乏气候关键应用所需的物理可解释性。为了解决这一差距，我们引入了PhysE-Inv，这是一种新颖的框架，它将复杂的序列架构，即带有多重注意力机制和对比学习的LSTM编码器-解码器，与物理引导的推理相结合。我们核心的创新在于一种物理约束的逆方法。该方法首先利用静力平衡前向模型作为目标公式代理，从而在没有直接地面真实值的情况下实现有效的学习；其次，它使用潜空间的重建物理正则化来动态发现来自噪声和不完整时间序列输入的隐藏物理参数。与最先进的基线相比，PhysE-Inv 显著提高了预测性能，将误差降低了20%，同时在物理一致性和对数据稀疏性的鲁棒性方面表现出色，优于经验方法。除了北极积雪深度，PhysE-Inv 还可以广泛应用于地球和气候科学中的其他噪声和数据稀缺问题。

Summary / 总结

The paper introduces PhysE-Inv, a physics-encoded inverse modeling approach for predicting Arctic snow depth, addressing the limitations of existing models in handling sparse data and lacking physical interpretability. The method combines an LSTM Encoder-Decoder with multi-head attention and contrastive learning, constrained by physics principles. It uses a hydrostatic balance forward model as a proxy target in the absence of direct ground truth, and reconstruction physics regularization over a latent space to recover hidden physical parameters from noisy, incomplete time series. Against state-of-the-art baselines it reduces prediction error by 20%, and it shows better physical consistency and resilience to data sparsity than empirical methods. Beyond Arctic snow depth, PhysE-Inv can be applied to other data-scarce problems in Earth and climate science.

论文提出了PhysE-Inv，一种用于预测北极雪深的物理编码逆向建模方法，解决了现有模型在处理稀疏数据和缺乏物理可解释性方面的局限性。该方法结合了LSTM编码器-解码器和多头注意力机制及对比学习，并受到物理原理的约束。它利用水静力平衡前向模型和物理正则化从噪声数据中推断隐藏参数。实验结果表明，PhysE-Inv在预测误差上比最先进的模型降低了20%，并且在物理一致性和对数据稀疏性的鲁棒性方面表现出色。

Tool Building as a Path to "Superintelligence"

Authors: David Koplow, Tomer Galanti, Tomaso Poggio

First: 2026-02-24T16:22:10+00:00 · Latest: 2026-02-24T16:22:10+00:00

Abs · PDF · Code1 · Code2

Abstract

The Diligent Learner framework suggests LLMs can achieve superintelligence via test-time search, provided a sufficient step-success probability $γ$. In this work, we design a benchmark to measure $γ$ on logical out-of-distribution inference. We construct a class of tasks involving GF(2) circuit reconstruction that grow more difficult with each reasoning step, and that are, from an information-theoretic standpoint, impossible to reliably solve unless the LLM carefully integrates all of the information provided. Our analysis demonstrates that while the $γ$ value for small LLMs declines superlinearly as depth increases, frontier models exhibit partial robustness on this task. Furthermore, we find that successful reasoning at scale is contingent upon precise tool calls, identifying tool design as a critical capability for LLMs to achieve general superintelligence through the Diligent Learner framework.
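
To see why the step-success probability $γ$ matters so much as reasoning depth grows, consider the simplest baseline: a chain with no search or backtracking in which every step succeeds independently with probability $γ$. This is plain arithmetic under that independence assumption, not the Diligent Learner analysis itself, and the function names are hypothetical.

```python
def chain_success(gamma: float, depth: int) -> float:
    """Probability that `depth` independent steps all succeed."""
    return gamma ** depth

def min_gamma(target: float, depth: int) -> float:
    """Smallest per-step success rate whose chain success reaches `target`."""
    return target ** (1.0 / depth)

# Even a 90% per-step success rate leaves about a 35% chance of finishing
# a 10-step chain without any error.
p = chain_success(0.9, 10)
```

Under this baseline a fixed $γ$ already gives exponentially decaying end-to-end success, so a $γ$ that itself falls with depth, as the abstract reports for small LLMs, compounds the problem.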

中文标题/摘要

标题：工具构建作为通向“超级智能”的路径

勤奋学习者框架表明，通过测试时搜索，LLM可以在满足足够步骤成功概率$γ$的情况下实现超级智能。在本研究中，我们设计了一个基准来衡量在逻辑离分布推理上的$γ$值。我们构建了一类任务，涉及GF(2)电路重构，每一步推理难度增加，并且从信息论角度来看，除非LLM仔细整合所有提供的信息，否则无法可靠解决。我们的分析表明，虽然小型LLM的$γ$值随深度增加呈超线性下降，但前沿模型在该任务上表现出部分稳健性。此外，我们发现大规模成功推理依赖于精确的工具调用，这将使LLM通过勤奋学习者框架实现通用超级智能成为可能。

Summary / 总结

This work evaluates the step-success probability $γ$ for large language models (LLMs) in achieving superintelligence through the Diligent Learner framework, which involves test-time search. The authors design a benchmark with GF(2) circuit reconstruction tasks that become increasingly difficult with each reasoning step. They find that while small LLMs show declining $γ$ values, frontier models exhibit partial robustness. Successful reasoning at scale depends on precise tool calls, highlighting tool design as a critical capability for LLMs to achieve general superintelligence.

研究探讨了大型语言模型（LLMs）通过Diligent Learner框架实现超级智能的可能性，该框架依赖于测试时搜索和步骤成功率$γ$。研究人员设计了一种基准测试，涉及逐步增加难度的GF(2)电路重构任务，以衡量$γ$。研究发现，虽然小型LLMs的$γ$值随任务深度增加呈超线性下降，但前沿模型在该任务上表现出部分鲁棒性。研究强调了精确工具调用对于大规模推理成功的重要性，表明工具设计对于LLMs通过Diligent Learner框架实现通用超级智能至关重要。
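
The abstract's point that such tasks cannot be reliably solved without integrating all of the provided information can be illustrated with a toy example. The sketch below is not the paper's benchmark: the three-input circuit, the candidate set, and every function name are hypothetical, chosen only to show that arithmetic over GF(2) reduces to XOR (addition) and AND (multiplication), and that a partial set of observations can leave several candidate circuits indistinguishable.

```python
from itertools import product


def gf2_add(a: int, b: int) -> int:
    """Addition in GF(2) is XOR."""
    return a ^ b


def gf2_mul(a: int, b: int) -> int:
    """Multiplication in GF(2) is AND."""
    return a & b


def toy_circuit(x1: int, x2: int, x3: int) -> int:
    """Hypothetical target circuit: (x1 * x2) + x3 over GF(2)."""
    return gf2_add(gf2_mul(x1, x2), x3)


def truth_table(circuit, n_inputs):
    """Map every input bit-tuple to the circuit's output bit."""
    return {bits: circuit(*bits) for bits in product((0, 1), repeat=n_inputs)}


def consistent_candidates(candidates, observations):
    """Return names of candidates that agree with every observed pair."""
    return [
        name
        for name, fn in candidates.items()
        if all(fn(*bits) == out for bits, out in observations.items())
    ]


# Hypothetical hypothesis space for the reconstruction problem.
CANDIDATES = {
    "x1*x2 + x3": toy_circuit,
    "x1 + x3": lambda x1, x2, x3: gf2_add(x1, x3),
    "x2 + x3": lambda x1, x2, x3: gf2_add(x2, x3),
}

FULL_TABLE = truth_table(toy_circuit, 3)
# Keep only the observations where x1 = x2 = 1 (an incomplete view).
PARTIAL = {bits: out for bits, out in FULL_TABLE.items() if bits[:2] == (1, 1)}

if __name__ == "__main__":
    print(consistent_candidates(CANDIDATES, PARTIAL))
    print(consistent_candidates(CANDIDATES, FULL_TABLE))
```

With only the partial observations all three candidates remain consistent; the full truth table is needed to single out the target circuit.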

Uncertainty Propagation Networks for Neural Ordinary Differential Equations

Authors: Hadi Jahanshahi, Zheng H. Zhu

Venue: Neurocomputing, 2026, 133134

First: 2025-08-22T22:24:46+00:00 · Latest: 2026-02-24T16:11:07+00:00

Abs · PDF · Code1 · Code2

Abstract

This paper introduces Uncertainty Propagation Network (UPN), a novel family of neural differential equations that naturally incorporate uncertainty quantification into continuous-time modeling. Unlike existing neural ODEs that predict only state trajectories, UPN simultaneously models both state evolution and its associated uncertainty by parameterizing coupled differential equations for mean and covariance dynamics. The architecture efficiently propagates uncertainty through nonlinear dynamics without discretization artifacts by solving coupled ODEs for state and covariance evolution while enabling state-dependent, learnable process noise. The continuous-depth formulation adapts its evaluation strategy to each input's complexity, provides principled uncertainty quantification, and handles irregularly-sampled observations naturally. Experimental results demonstrate UPN's effectiveness across multiple domains: continuous normalizing flows (CNFs) with uncertainty quantification, time-series forecasting with well-calibrated confidence intervals, and robust trajectory prediction in both stable and chaotic dynamical systems.

中文标题/摘要

标题：不确定性传播网络在神经常微分方程中的应用

本文介绍了不确定性传播网络（UPN），这是一种新颖的神经常微分方程家族，能够自然地将不确定性量化纳入连续时间建模中。与现有的只能预测状态轨迹的神经常微分方程不同，UPN同时建模状态的演化及其相关的不确定性，通过参数化耦合的微分方程来建模均值和协方差动力学。该架构通过求解状态和协方差演化的耦合常微分方程，在不产生离散化伪影的情况下高效地在非线性动力学中传播不确定性，同时允许状态依赖的、可学习的过程噪声。连续深度的表述能够根据每个输入的复杂性调整其评估策略，提供原则性的不确定性量化，并自然处理不规则采样的观测值。实验结果表明，UPN在多个领域中均表现出色：具有不确定性量化的连续归一化流（CNFs）、具有校准良好的置信区间的时间序列预测，以及在稳定和混沌动力系统中的鲁棒轨迹预测。

Summary / 总结

The research introduces the Uncertainty Propagation Network (UPN), a family of neural ordinary differential equations that models both state evolution and its associated uncertainty. UPN does this by solving coupled differential equations for the mean and covariance dynamics, which propagates uncertainty through nonlinear dynamics without discretization artifacts. Experiments show UPN's effectiveness in continuous normalizing flows with uncertainty quantification, time-series forecasting with well-calibrated confidence intervals, and robust trajectory prediction in both stable and chaotic dynamical systems.

该研究引入了不确定性传播网络（UPN），将神经常微分方程（ODEs）扩展以同时建模状态的演化及其不确定性。UPN通过求解均值和协方差的耦合ODEs来实现这一点，从而在不产生离散化伪影的情况下高效传播不确定性。实验表明，UPN在连续归一化流、具有校准置信区间的时间序列预测以及各种动力系统中的稳健轨迹预测（包括稳定和混沌系统）方面表现出色。
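
To make "coupled differential equations for mean and covariance dynamics" concrete, one standard form is the first-order (linearized) moment propagation used in continuous-time filtering. The abstract does not give UPN's exact parameterization, so the equations below are an assumed illustration: $f_\theta$ stands for a learned drift, $Q_\phi$ for a learned state-dependent process-noise covariance, and $J$ for the Jacobian of the drift. All three symbols are introduced here and do not come from the paper.

```latex
\begin{aligned}
\frac{d\mu(t)}{dt} &= f_\theta\bigl(\mu(t), t\bigr), \\
\frac{d\Sigma(t)}{dt} &= J(t)\,\Sigma(t) + \Sigma(t)\,J(t)^{\top} + Q_\phi\bigl(\mu(t), t\bigr),
\qquad J(t) = \left.\frac{\partial f_\theta(x, t)}{\partial x}\right|_{x = \mu(t)}.
\end{aligned}
```

Under this assumption the mean and covariance are integrated jointly by a single ODE solver, so the covariance is available at any query time, including irregularly spaced observation times.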

OCR-Agent: Agentic OCR with Capability and Memory Reflection

Authors: Shimin Wen, Zeyu Zhang, Xingdou Bian, Hongjie Zhu, Lulu He, Layi Shama, Daji Ergu, Ying Cai

First: 2026-02-24T16:10:27+00:00 · Latest: 2026-02-24T16:10:27+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Large Vision-Language Models (VLMs) have demonstrated significant potential on complex visual understanding tasks through iterative optimization methods. However, these models generally lack effective self-correction mechanisms, making it difficult for them to independently rectify cognitive biases. Consequently, during multi-turn revisions, they often fall into repetitive and ineffective attempts, failing to achieve stable improvements in answer quality. To address this issue, we propose a novel iterative self-correction framework that endows models with two key capabilities: Capability Reflection and Memory Reflection. This framework guides the model to first diagnose errors and generate a correction plan via Capability Reflection, then leverage Memory Reflection to review past attempts to avoid repetition and explore new solutions, and finally, optimize the answer through rigorous re-reasoning. Experiments on the challenging OCRBench v2 benchmark show that OCR-Agent outperforms the current open-source SOTA model InternVL3-8B by +2.0 on English and +1.2 on Chinese subsets, while achieving state-of-the-art results in Visual Understanding (79.9) and Reasoning (66.5), surpassing even larger fine-tuned models. Our method demonstrates that structured, self-aware reflection can significantly enhance VLMs' reasoning robustness without additional training. Code: https://github.com/AIGeeksGroup/OCR-Agent.

中文标题/摘要

标题：OCR-Agent: 具有能力和记忆反思的代理OCR

大型视觉-语言模型（VLMs）通过迭代优化方法在复杂的视觉理解任务中展现了显著的潜力。然而，这些模型通常缺乏有效的自我纠正机制，使得它们难以独立纠正认知偏差。因此，在多轮修订过程中，它们往往陷入重复且无效的尝试，无法稳定提高答案质量。为解决这一问题，我们提出了一种新的迭代自我纠正框架，赋予模型两种关键能力：能力反思和记忆反思。该框架引导模型首先通过能力反思诊断错误并生成纠正计划，然后通过记忆反思回顾过去尝试以避免重复并探索新解决方案，最后通过严格的重新推理优化答案。在具有挑战性的OCRBench v2基准测试中，OCR-Agent在英文子集上比当前开源SOTA模型InternVL3-8B高出+2.0，在中文子集上高出+1.2，同时在视觉理解（79.9）和推理（66.5）方面达到最先进的结果，甚至超越了更大规模的微调模型。我们的方法表明，结构化的自我意识反思可以显著增强VLMs的推理稳健性，而无需额外的训练。代码：https://github.com/AIGeeksGroup/OCR-Agent.

Summary / 总结

The research aims to improve the self-correction ability of large Vision-Language Models (VLMs) on complex visual understanding tasks. The proposed OCR-Agent framework introduces two key capabilities: Capability Reflection, which diagnoses errors and generates a correction plan, and Memory Reflection, which reviews past attempts to avoid repetition and explore new solutions. Experiments on the OCRBench v2 benchmark show that OCR-Agent outperforms the current open-source SOTA model InternVL3-8B by 2.0 points on the English subset and 1.2 points on the Chinese subset, while achieving state-of-the-art results in Visual Understanding (79.9) and Reasoning (66.5) without additional training.

研究旨在通过引入一种名为OCR-Agent的新型迭代自我纠正框架来提升大型视觉语言模型（VLMs）的自我纠正能力。该框架包括两种关键能力：能力反思和记忆反思。能力反思帮助诊断错误并生成纠正计划，而记忆反思则回顾过去尝试以避免重复并探索新解决方案。在OCRBench v2上的实验表明，OCR-Agent在英语和中文子集上分别比当前SOTA模型InternVL3-8B高出+2.0和+1.2，同时在视觉理解和推理方面达到了最先进的结果。该方法增强了VLMs的推理稳健性，无需额外训练。

Position-Aware Sequential Attention for Accurate Next Item Recommendations

Authors: Timur Nabiev, Evgeny Frolov

First: 2026-02-24T16:09:47+00:00 · Latest: 2026-02-24T16:09:47+00:00

Abs · PDF · Code1 · Code2

Abstract

Sequential self-attention models usually rely on additive positional embeddings, which inject positional information into item representations at the input. In the absence of positional signals, the attention block is permutation-equivariant over sequence positions and thus has no intrinsic notion of temporal order beyond causal masking. We argue that additive positional embeddings make the attention mechanism only superficially sensitive to sequence order: positional information is entangled with item embedding semantics, propagates weakly in deep architectures, and limits the ability to capture rich sequential patterns. To address these limitations, we introduce a kernelized self-attention mechanism, where a learnable positional kernel operates purely in the position space, disentangled from semantic similarity, and directly modulates attention weights. When applied per attention block, this kernel enables adaptive multi-scale sequential modeling. Experiments on standard next-item prediction benchmarks show that our positional kernel attention consistently improves over strong competing baselines.

中文标题/摘要

标题：位置感知序列注意力以实现准确的下一个项目推荐

序列自注意力模型通常依赖于加性位置嵌入，这些嵌入将位置信息注入到项目表示中。在缺乏位置信号的情况下，注意力块在序列位置上是置换等变的，因此除了因果掩码之外，没有内在的时间顺序概念。我们认为，加性位置嵌入使注意力机制仅表面地对序列顺序敏感：位置信息与项目嵌入语义纠缠在一起，在深层架构中传播较弱，并限制了捕捉丰富序列模式的能力。为了解决这些限制，我们引入了一种核化自注意力机制，在这种机制中，可学习的位置核在位置空间中纯操作，与语义相似性分离，并直接调节注意力权重。当应用于每个注意力块时，这种核使能够实现自适应的多尺度序列建模。在标准的下一个项目预测基准上的实验表明，我们的位置核注意力机制在强竞争基线下持续表现出优越性。

Summary / 总结

The paper addresses the limitations of sequential self-attention models that rely on additive positional embeddings, which the authors argue make the attention mechanism only superficially sensitive to sequence order: positional information is entangled with item embedding semantics and propagates weakly in deep architectures. To overcome this, the authors propose a kernelized self-attention mechanism in which a learnable positional kernel operates purely in the position space, disentangled from semantic similarity, and directly modulates attention weights. Applied per attention block, the kernel enables adaptive multi-scale sequential modeling. Experiments on standard next-item prediction benchmarks show that the proposed positional kernel attention consistently improves over strong competing baselines.

论文针对依赖加性位置嵌入的序列自注意力模型存在的问题，这些问题包括位置信息与项目语义交织，限制了对丰富序列模式的捕捉能力。为了解决这些问题，作者提出了一种核化自注意力机制，其中可学习的位置核独立地在位置空间中操作，直接调节注意力权重。实验结果表明，这种方法在标准的下一个项目预测基准上优于强大的基线模型。
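
The idea of a positional kernel that acts only in position space and directly modulates attention weights can be sketched in a few lines of NumPy. The abstract does not specify the kernel's form or how it is combined with the attention weights, so everything below is an assumption for illustration: the exponential-decay kernel, the fixed `length_scale` (learnable in the paper's setting), the multiplicative combination followed by renormalization, and all function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax; rows may contain -inf for masked entries."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def positional_kernel(seq_len, length_scale):
    """Assumed kernel: exponential decay in the position distance |i - j|."""
    idx = np.arange(seq_len)
    dist = np.abs(idx[:, None] - idx[None, :])
    return np.exp(-dist / length_scale)


def kernel_attention(q, k, v, length_scale):
    """Causal attention whose weights are rescaled by a position-only kernel."""
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)  # semantic similarity only
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    scores = np.where(causal, scores, -np.inf)  # block attention to the future
    weights = softmax(scores, axis=-1)
    # The kernel depends on positions alone, never on item embeddings.
    weights = weights * positional_kernel(seq_len, length_scale)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.normal(size=(5, 4)) for _ in range(3))
    out, w = kernel_attention(q, k, v, length_scale=2.0)
    print(out.shape, w.sum(axis=-1))
```

In this sketch, giving each attention block its own `length_scale` would let different blocks attend over different temporal ranges, which is one plausible reading of the "adaptive multi-scale" behaviour the abstract describes.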

PIME: Prototype-based Interpretable MCTS-Enhanced Brain Network Analysis for Disorder Diagnosis

Authors: Kunyu Zhang, Yanwu Yang, Jing Zhang, Xiangjie Shi, Shujian Yu

First: 2026-02-24T16:04:52+00:00 · Latest: 2026-02-24T16:04:52+00:00

Abs · PDF · Code1 · Code2

Abstract

Recent deep learning methods for fMRI-based diagnosis have achieved promising accuracy by modeling functional connectivity networks. However, standard approaches often struggle with noisy interactions, and conventional post-hoc attribution methods may lack reliability, potentially highlighting dataset-specific artifacts. To address these challenges, we introduce PIME, an interpretable framework that bridges intrinsic interpretability with minimal-sufficient subgraph optimization by integrating prototype-based classification and consistency training with structural perturbations during learning. This encourages a structured latent space and enables Monte Carlo Tree Search (MCTS) under a prototype-consistent objective to extract compact minimal-sufficient explanatory subgraphs post-training. Experiments on three benchmark fMRI datasets demonstrate that PIME achieves state-of-the-art performance. Furthermore, by constraining the search space via learned prototypes, PIME identifies critical brain regions that are consistent with established neuroimaging findings. Stability analysis shows 90% reproducibility and consistent explanations across atlases.

中文标题/摘要

标题：PIME：原型基础可解释的MCTS增强脑网络分析用于疾病诊断

基于fMRI的诊断的近期深度学习方法通过建模功能连接网络实现了有希望的准确性。然而，标准方法往往难以处理嘈杂的交互，而传统的后验归因方法可能缺乏可靠性，可能会突出显示数据集特定的伪影。为了解决这些挑战，我们引入了PIME，这是一种可解释的框架，通过结合原型分类和一致性训练与学习期间的结构扰动，将内在可解释性与最小充分子图优化相结合。这鼓励了一个结构化的潜在空间，并在训练后通过原型一致的目标使用蒙特卡洛树搜索（MCTS）提取紧凑的最小充分解释子图。在三个基准fMRI数据集上的实验表明，PIME达到了最先进的性能。此外，通过通过学习的原型限制搜索空间，PIME识别出与已建立的神经影像学发现一致的关键脑区。稳定性分析显示90%的可重复性和跨图谱的一致解释。

Summary / 总结

The research aims to improve the interpretability and reliability of deep learning methods for fMRI-based diagnosis, where standard approaches struggle with noisy interactions and post-hoc attribution may highlight dataset-specific artifacts. PIME, a prototype-based interpretable framework, integrates prototype-based classification and consistency training with structural perturbations during learning, which encourages a structured latent space. After training, Monte Carlo Tree Search (MCTS) extracts compact minimal-sufficient explanatory subgraphs under a prototype-consistent objective. Experiments on three benchmark fMRI datasets show that PIME achieves state-of-the-art performance and identifies critical brain regions consistent with established neuroimaging findings, with 90% reproducibility and consistent explanations across atlases.

PIME 是一种基于原型的可解释框架，通过结合原型分类和一致性训练来增强脑网络分析，以诊断疾病。它优化了最小充分子图，并使用蒙特卡洛树搜索来提取紧凑且可解释的子图。在三个功能性磁共振成像数据集上的实验表明，PIME 达到了最先进的性能，并且能够识别出与已建立的神经影像学发现一致的关键脑区，具有高重现性。

LogicGraph: Benchmarking Multi-Path Logical Reasoning via Neuro-Symbolic Generation and Verification

Authors: Yanrui Wu, Lingling Zhang, Xinyu Zhang, Jiayu Chang, Pengyu Li, Xu Jiang, Jingtao Hu, Jun Liu

First: 2026-02-24T16:04:26+00:00 · Latest: 2026-02-24T16:04:26+00:00

Comments: 24 pages, 17 figures

Abs · PDF · Code1 · Code2 · Code3

Abstract

Evaluations of large language models (LLMs) primarily emphasize convergent logical reasoning, where success is defined by producing a single correct proof. However, many real-world reasoning problems admit multiple valid derivations, requiring models to explore diverse logical paths rather than committing to one route. To address this limitation, we introduce LogicGraph, the first benchmark aimed to systematically evaluate multi-path logical reasoning, constructed via a neuro-symbolic framework that leverages backward logic generation and semantic instantiation. This pipeline yields solver-verified reasoning problems formalized by high-depth multi-path reasoning and inherent logical distractions, where each instance is associated with an exhaustive set of minimal proofs. We further propose a reference-free evaluation framework to rigorously assess model performance in both convergent and divergent regimes. Experiments on state-of-the-art language models reveal a common limitation: models tend to commit early to a single route and fail to explore alternatives, and the coverage gap grows substantially with reasoning depth. LogicGraph exposes this divergence gap and provides actionable insights to motivate future improvements. Our code and data will be released at https://github.com/kkkkarry/LogicGraph.

中文标题/摘要

标题：LogicGraph : 通过神经符号生成与验证系统地评估多路径逻辑推理

大型语言模型（LLMs）的评估主要强调收敛逻辑推理，成功定义为生成单一正确证明。然而，许多实际推理问题允许多种有效的推导，要求模型探索多种逻辑路径而非局限于一条路径。为解决这一局限，我们引入了LogicGraph，这是首个旨在系统评估多路径逻辑推理的基准，通过结合神经符号框架利用反向逻辑生成和语义实例化构建。该管道生成了由高深度多路径推理和内在逻辑干扰正式化的求解器验证推理问题，每个实例都关联有一整套最小证明。我们进一步提出了一种无参考评估框架，以严格评估模型在收敛和发散两种情况下的性能。实验表明，最先进的语言模型普遍倾向于过早锁定一条路径并忽略替代方案，推理深度越大，覆盖差距越大。LogicGraph揭示了这种发散差距，并提供了可操作的见解以激励未来改进。我们的代码和数据将在https://github.com/kkkkarry/LogicGraph发布。

Summary / 总结

LogicGraph is a benchmark for evaluating multi-path logical reasoning in large language models (LLMs), built with a neuro-symbolic framework that uses backward logic generation and semantic instantiation. It produces solver-verified reasoning problems with multiple valid derivations and inherent logical distractions, each paired with an exhaustive set of minimal proofs, and is accompanied by a reference-free evaluation framework covering both convergent and divergent regimes. Experiments show that state-of-the-art LLMs tend to commit early to a single route and fail to explore alternatives, and that the coverage gap grows substantially with reasoning depth.

LogicGraph 是一个用于评估大型语言模型（LLMs）多路径逻辑推理能力的基准，通过神经-符号框架生成需要探索多种有效推理路径的问题，而不是仅仅找到一个正确证明。实验表明，最先进的 LLMs 往往过早地选择一条路径并忽略其他选择，特别是在推理深度增加时。该基准揭示了模型在处理多种逻辑路径方面的不足，并为未来改进提供了参考。
