Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device
Authors: Abdelrahman Shaker, Ahmed Heakl, Jaseel Muhammad, Ritesh Thawkar, Omkar Thawakar, Senmao Li, Hisham Cholakkal, Ian Reid, Eric P. Xing, Salman Khan, Fahad Shahbaz Khan
First: 2026-02-23T18:59:58+00:00 · Latest: 2026-02-23T18:59:58+00:00
Comments: Project page: https://amshaker.github.io/Mobile-O/
Abstract
Unified multimodal models can both understand and generate visual content within a single architecture. Existing models, however, remain data-hungry and too heavy for deployment on edge devices. We present Mobile-O, a compact vision-language-diffusion model that brings unified multimodal intelligence to a mobile device. Its core module, the Mobile Conditioning Projector (MCP), fuses vision-language features with a diffusion generator using depthwise-separable convolutions and layerwise alignment. This design enables efficient cross-modal conditioning with minimal computational cost. Trained on only a few million samples and post-trained in a novel quadruplet format (generation prompt, image, question, answer), Mobile-O jointly enhances both visual understanding and generation capabilities. Despite its efficiency, Mobile-O attains competitive or superior performance compared to other unified models, achieving 74% on GenEval and outperforming Show-O and JanusFlow by 5% and 11%, while running 6x and 11x faster, respectively. For visual understanding, Mobile-O surpasses them by 15.3% and 5.1% averaged across seven benchmarks. Running in only ~3s per 512x512 image on an iPhone, Mobile-O establishes the first practical framework for real-time unified multimodal understanding and generation on edge devices. We hope Mobile-O will ease future research in real-time unified multimodal intelligence running entirely on-device with no cloud dependency. Our code, models, datasets, and mobile application are publicly available at https://amshaker.github.io/Mobile-O/
中文标题/摘要
标题:Mobile-O:移动设备上的统一多模态理解和生成
统一多模态模型可以在单一架构中同时理解和生成视觉内容。现有模型仍然数据饥渴且过于沉重,无法部署在边缘设备上。我们提出了Mobile-O,这是一种紧凑的视觉-语言-扩散模型,将统一的多模态智能带到了移动设备上。其核心模块Mobile Conditioning Projector (MCP) 使用深度可分离卷积和层间对齐将视觉-语言特征与扩散生成器融合。这种设计使得跨模态条件化在计算成本极低的情况下得以实现。Mobile-O 仅在几百万样本上进行训练,并以新颖的四元组格式(生成提示、图像、问题、答案)进行后续训练,从而同时增强了视觉理解和生成能力。尽管效率高,Mobile-O 在 GenEval 上的表现与其它统一模型相当或更优,达到 74%,并且比 Show-O 和 JanusFlow 快 6 倍和 11 倍,分别领先 5% 和 11%。在视觉理解方面,Mobile-O 在七个基准测试中平均领先 15.3% 和 5.1%。在 iPhone 上,Mobile-O 每处理一张 512x512 的图像仅需约 3 秒,从而建立了首个适用于边缘设备的实时统一多模态理解和生成的实用框架。我们希望 Mobile-O 能够简化在设备上完全运行的实时统一多模态智能的研究,无需依赖云服务。我们的代码、模型、数据集和移动应用程序可在 https://amshaker.github.io/Mobile-O/ 公开获取。
Summary / 总结
Mobile-O is a compact vision-language-diffusion model designed for efficient multimodal understanding and generation on mobile devices. It uses a Mobile Conditioning Projector (MCP) to fuse vision-language features with a diffusion generator, enabling efficient cross-modal conditioning. Trained on a few million samples and post-trained in a quadruplet format (generation prompt, image, question, answer), Mobile-O reaches 74% on GenEval, outperforming Show-O and JanusFlow by 5% and 11% while running 6x and 11x faster, respectively, and surpasses them by 15.3% and 5.1% on visual understanding averaged across seven benchmarks. It generates a 512x512 image in about 3 seconds on an iPhone, which the authors present as the first practical framework for real-time unified multimodal understanding and generation on edge devices.
Mobile-O 是一种紧凑的视觉-语言-扩散模型,旨在移动设备上运行,通过高效的数据利用和计算效率解决了现有统一多模态模型的限制。它使用 Mobile Conditioning Projector (MCP) 将视觉-语言特征与扩散生成器融合,实现高效的跨模态条件化。Mobile-O 在视觉理解和生成任务中达到了竞争力或优越的表现,比其他模型高出 5% 到 11%,同时运行速度提高了 6 到 11 倍。它在 iPhone 上处理一张 512x512 的图像大约需要 3 秒,是第一个在边缘设备上实现实时统一多模态理解和生成的实用框架。
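As a rough illustration of why depthwise-separable convolutions keep a conditioning module cheap, the sketch below compares weight counts for a standard convolution and a depthwise-separable one. The channel width (256) and kernel size (3) are assumptions chosen for the example and are not taken from the Mobile-O paper; the function names are hypothetical.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = c_in * k * k      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Assumed sizes, for illustration only.
standard = conv_params(256, 256, 3)
separable = depthwise_separable_params(256, 256, 3)
ratio = separable / standard      # equals 1/c_out + 1/k**2

print(standard, separable, round(ratio, 3))
```

For these assumed sizes the separable layer needs roughly 11.5% of the weights of the standard layer, since the ratio reduces to 1/c_out + 1/k^2.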
OpenEarthAgent: A Unified Framework for Tool-Augmented Geospatial Agents
Authors: Akashah Shabbir, Muhammad Umer Sheikh, Muhammad Akhtar Munir, Hiyam Debary, Mustansar Fiaz, Muhammad Zaigham Zaheer, Paolo Fraccaro, Fahad Shahbaz Khan, Muhammad Haris Khan, Xiao Xiang Zhu, Salman Khan
First: 2026-02-19T18:59:54+00:00 · Latest: 2026-02-23T18:59:54+00:00
Abstract
Recent progress in multimodal reasoning has enabled agents that can interpret imagery, connect it with language, and perform structured analytical tasks. Extending such capabilities to the remote sensing domain remains challenging, as models must reason over spatial scale, geographic structures, and multispectral indices while maintaining coherent multi-step logic. To bridge this gap, OpenEarthAgent introduces a unified framework for developing tool-augmented geospatial agents trained on satellite imagery, natural-language queries, and detailed reasoning traces. The training pipeline relies on supervised fine-tuning over structured reasoning trajectories, aligning the model with verified multistep tool interactions across diverse analytical contexts. The accompanying corpus comprises 14,538 training and 1,169 evaluation instances, with more than 100K reasoning steps in the training split and over 7K reasoning steps in the evaluation split. It spans urban, environmental, disaster, and infrastructure domains, and incorporates GIS-based operations alongside index analyses such as NDVI, NBR, and NDBI. Grounded in explicit reasoning traces, the learned agent demonstrates structured reasoning, stable spatial understanding, and interpretable behaviour through tool-driven geospatial interactions across diverse conditions. We report consistent improvements over a strong baseline and competitive performance relative to recent open and closed-source models.
中文标题/摘要
标题:OpenEarthAgent:统一的工具增强地理空间代理框架
近期多模态推理的进步使代理能够解释图像、将其与语言关联起来并执行结构化分析任务。将此类能力扩展到遥感领域仍然具有挑战性,因为模型必须在保持连贯的多步逻辑的同时,在空间尺度、地理结构和多光谱指数上进行推理。为弥合这一差距,OpenEarthAgent 引入了一个统一框架,用于开发基于卫星图像、自然语言查询和详细推理轨迹训练的工具增强地理空间代理。训练管道依赖于结构化推理轨迹的监督微调,使模型与跨多种分析上下文的验证多步工具交互对齐。伴随的语料库包括14,538个训练实例和1,169个评估实例,训练集中有超过100,000个推理步骤,评估集中有超过7,000个推理步骤。它涵盖了城市、环境、灾害和基础设施领域,并结合了GIS操作和NDVI、NBR和NDBI等指数分析。基于显式的推理轨迹,学习到的代理展示了结构化的推理、稳定的地理空间理解以及通过工具驱动的地理空间交互实现的可解释行为。我们报告了相对于强大基线的一致改进,并且在与最近的开源和闭源模型的性能上具有竞争力。
Summary / 总结
The research aims to develop geospatial agents that handle complex remote sensing tasks by combining multimodal reasoning with tool use. OpenEarthAgent is a unified framework in which agents are trained by supervised fine-tuning on satellite imagery, natural-language queries, and detailed reasoning traces, using a corpus of 14,538 training and 1,169 evaluation instances. The authors report consistent improvements over a strong baseline and competitive performance relative to recent open and closed-source models, with the agent demonstrating structured reasoning and stable spatial understanding across diverse conditions.
OpenEarthAgent 是一个统一框架,用于开发能够解释卫星图像和自然语言查询的工具增强型地理空间代理,执行结构化分析任务。它使用带有详细推理轨迹的大规模数据集进行监督微调,涵盖多个领域。该框架在强基线模型上表现出一致的改进,并在与最近模型的性能比较中表现出竞争力,展示了在各种条件下结构化推理和稳定的地理空间理解能力。
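The index analyses named in the abstract (NDVI, NBR, NDBI) are all normalized band differences. The sketch below uses their standard definitions on single-pixel reflectance values; the function names and the sample reflectances are assumptions for illustration and do not come from the OpenEarthAgent toolset.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or 0.0 when the denominator is zero."""
    denom = a + b
    return (a - b) / denom if denom != 0 else 0.0


def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return normalized_difference(nir, red)


def nbr(nir, swir2):
    """Normalized Burn Ratio: (NIR - SWIR2) / (NIR + SWIR2)."""
    return normalized_difference(nir, swir2)


def ndbi(swir1, nir):
    """Normalized Difference Built-up Index: (SWIR1 - NIR) / (SWIR1 + NIR)."""
    return normalized_difference(swir1, nir)


# Hypothetical reflectances for one vegetated pixel.
pixel = {"red": 0.1, "nir": 0.5, "swir1": 0.2, "swir2": 0.1}
vegetation = ndvi(pixel["nir"], pixel["red"])
burn = nbr(pixel["nir"], pixel["swir2"])
built_up = ndbi(pixel["swir1"], pixel["nir"])
```

All three indices fall in [-1, 1]. For this assumed pixel, NDVI is positive and NDBI is negative, which is the usual pattern for vegetation.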
tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction
Authors: Chen Wang, Hao Tan, Wang Yifan, Zhiqin Chen, Yuheng Liu, Kalyan Sunkavalli, Sai Bi, Lingjie Liu, Yiwei Hu
Venue: CVPR 2026
First: 2026-02-23T18:59:45+00:00 · Latest: 2026-02-23T18:59:45+00:00
Comments: Accepted by CVPR 2026. Project Page: https://cwchenwang.github.io/tttLRM
Abstract
We propose tttLRM, a novel large 3D reconstruction model that leverages a Test-Time Training (TTT) layer to enable long-context, autoregressive 3D reconstruction with linear computational complexity, further scaling the model's capability. Our framework efficiently compresses multiple image observations into the fast weights of the TTT layer, forming an implicit 3D representation in the latent space that can be decoded into various explicit formats, such as Gaussian Splats (GS) for downstream applications. The online learning variant of our model supports progressive 3D reconstruction and refinement from streaming observations. We demonstrate that pretraining on novel view synthesis tasks effectively transfers to explicit 3D modeling, resulting in improved reconstruction quality and faster convergence. Extensive experiments show that our method achieves superior performance in feedforward 3D Gaussian reconstruction compared to state-of-the-art approaches on both objects and scenes.
中文标题/摘要
标题:tttLRM:测试时训练的长上下文和自回归3D重建
我们提出了一种名为tttLRM的新颖大型3D重建模型,该模型利用测试时训练(TTT)层,以线性计算复杂度实现长上下文、自回归3D重建,进一步扩展了模型的能力。我们的框架高效地将多个图像观察压缩到TTT层的快速权重中,在潜在空间中形成隐式的3D表示,可以解码为各种显式格式,例如用于下游应用的高斯斑点(GS)。我们的模型的在线学习变体支持从流式观察中进行渐进的3D重建和细化。我们证明,对新颖视图合成任务的预训练可以有效地转移到显式3D建模,从而提高重建质量和加快收敛速度。大量实验表明,与现有方法相比,我们的方法在物体和场景的前向3D高斯重建中表现出更优的性能。
Summary / 总结
The research motivation is to enable long-context, autoregressive 3D reconstruction with linear computational complexity. The main method involves using a Test-Time Training (TTT) layer to compress multiple image observations into fast weights, forming an implicit 3D representation that can be decoded into various explicit formats. Key experimental findings show that the proposed tttLRM method outperforms state-of-the-art approaches in feedforward 3D Gaussian reconstruction on both objects and scenes, with improved reconstruction quality and faster convergence after pretraining on novel view synthesis tasks.
研究动机是利用Test-Time Training (TTT)层实现具有线性计算复杂度的长上下文自回归3D重建。主要方法是将多个图像观察压缩到TTT层的快速权重中,形成隐式的3D表示,可以解码为各种显式的格式。关键实验发现表明,所提出的方法在物体和场景的3D高斯重建中优于最先进的方法,预训练在新颖视图合成任务后,重建质量更高且收敛速度更快。
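The idea of compressing a stream of observations into fast weights can be sketched with a generic linear fast-weight update: each (key, value) pair triggers one gradient step on a reconstruction loss, so the cost grows linearly with the number of observations. This is a toy under stated assumptions, not the tttLRM layer; the loss, the learning rate, and all names are hypothetical.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def fast_weight_step(W, key, value, lr=1.0):
    """One gradient step on 0.5 * ||W @ key - value||^2 with respect to W."""
    error = [p - v for p, v in zip(matvec(W, key), value)]
    # The gradient is the outer product of the error and the key.
    return [[w - lr * e * k for w, k in zip(row, key)] for row, e in zip(W, error)]


# Toy stream of (key, value) pairs with orthonormal keys (an assumption).
W = [[0.0, 0.0], [0.0, 0.0]]
stream = [([1.0, 0.0], [2.0, 3.0]), ([0.0, 1.0], [-1.0, 4.0])]
for key, value in stream:
    W = fast_weight_step(W, key, value)

# Querying with an earlier key recovers the value stored for it.
recalled = matvec(W, [1.0, 0.0])
```

With orthonormal keys and a unit learning rate, each stored value is recovered exactly; with correlated keys the weights hold only an approximate, compressed memory.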
A Very Big Video Reasoning Suite
Authors: Maijunxian Wang, Ruisi Wang, Juyi Lin, Ran Ji, Thaddäus Wiedemer, Qingying Gao, Dezhi Luo, Yaoyao Qian, Lianyu Huang, Zelong Hong, Jiahui Ge, Qianli Ma, Hang He, Yifan Zhou, Lingzi Guo, Lantao Mei, Jiachen Li, Hanwen Xing, Tianqi Zhao, Fengyuan Yu, Weihang Xiao, Yizheng Jiao, Jianheng Hou, Danyang Zhang, Pengcheng Xu, Boyang Zhong, Zehong Zhao, Gaoyun Fang, John Kitaoka, Yile Xu, Hua Xu, Kenton Blacutt, Tin Nguyen, Siyuan Song, Haoran Sun, Shaoyue Wen, Linyang He, Runming Wang, Yanzhi Wang, Mengyue Yang, Ziqiao Ma, Raphaël Millière, Freda Shi, Nuno Vasconcelos, Daniel Khashabi, Alan Yuille, Yilun Du, Ziming Liu, Bo Li, Dahua Lin, Ziwei Liu, Vikash Kumar, Yijiang Li, Lei Yang, Zhongang Cai, Hokin Deng
First: 2026-02-23T18:59:41+00:00 · Latest: 2026-02-23T18:59:41+00:00
Comments: Homepage: https://video-reason.com/
Abstract
Rapid progress in video models has largely focused on visual quality, leaving their reasoning capabilities underexplored. Video reasoning grounds intelligence in spatiotemporally consistent visual environments that go beyond what text can naturally capture, enabling intuitive reasoning over spatiotemporal structure such as continuity, interaction, and causality. However, systematically studying video reasoning and its scaling behavior is hindered by the lack of large-scale training data. To address this gap, we introduce the Very Big Video Reasoning (VBVR) Dataset, an unprecedentedly large-scale resource spanning 200 curated reasoning tasks following a principled taxonomy and over one million video clips, approximately three orders of magnitude larger than existing datasets. We further present VBVR-Bench, a verifiable evaluation framework that moves beyond model-based judging by incorporating rule-based, human-aligned scorers, enabling reproducible and interpretable diagnosis of video reasoning capabilities. Leveraging the VBVR suite, we conduct one of the first large-scale scaling studies of video reasoning and observe early signs of emergent generalization to unseen reasoning tasks. Together, VBVR lays a foundation for the next stage of research in generalizable video reasoning. The data, benchmark toolkit, and models are publicly available at https://video-reason.com/ .
中文标题/摘要
标题:一个非常大的视频推理套件
视频模型的快速发展主要集中在视觉质量上,而对其推理能力的探索则相对不足。视频推理将智能置于时空一致的视觉环境中,超越了文本所能自然捕捉的内容,使人们能够直观地推理时空结构,如连续性、交互性和因果关系。然而,系统地研究视频推理及其扩展行为受到大规模训练数据缺乏的阻碍。为解决这一问题,我们引入了非常大的视频推理(VBVR)数据集,这是一个前所未有的大规模资源,涵盖了200个经过精心分类的推理任务,涉及超过一百万段视频片段,比现有数据集大三个数量级。我们还提出了VBVR-Bench,这是一种可验证的评估框架,通过引入基于规则、与人类对齐的评分者,超越基于模型的评判,实现可重复和可解释的视频推理能力诊断。利用VBVR套件,我们进行了第一个大规模的视频推理扩展研究,并观察到了对未见过的推理任务的早期泛化迹象。总体而言,VBVR为可泛化的视频推理下一阶段的研究奠定了基础。数据、基准工具包和模型可在https://video-reason.com/ 公开获取。
Summary / 总结
The research aims to explore the reasoning capabilities of video models beyond visual quality, addressing the lack of large-scale training data for video reasoning. The study introduces the Very Big Video Reasoning (VBVR) Dataset with over one million video clips and 200 reasoning tasks, and VBVR-Bench, a verifiable evaluation framework. Key findings include early signs of emergent generalization to unseen reasoning tasks, indicating potential for scalable video reasoning.
研究旨在探索视频模型在超越视觉质量之外的推理能力,并解决大规模训练数据的缺乏问题。研究引入了包含超过一百万视频片段和200个推理任务的Very Big Video Reasoning (VBVR) 数据集,远大于现有数据集。VBVR-Bench 评估框架使用基于规则的评分器来评估视频推理,实现可重复和可解释的结果。研究发现对未见过的推理任务存在早期泛化迹象,为未来通用视频推理研究奠定了基础。
Flow3r: Factored Flow Prediction for Scalable Visual Geometry Learning
Authors: Zhongxiao Cong, Qitao Zhao, Minsik Jeon, Shubham Tulsiani
Venue: CVPR 2026
First: 2026-02-23T18:59:30+00:00 · Latest: 2026-02-23T18:59:30+00:00
Comments: CVPR 2026. Project website: https://flow3r-project.github.io/
Abstract
Current feed-forward 3D/4D reconstruction systems rely on dense geometry and pose supervision, which is expensive to obtain at scale and particularly scarce for dynamic real-world scenes. We present Flow3r, a framework that augments visual geometry learning with dense 2D correspondences ("flow") as supervision, enabling scalable training from unlabeled monocular videos. Our key insight is that the flow prediction module should be factored: predicting flow between two images using geometry latents from one and pose latents from the other. This factorization directly guides the learning of both scene geometry and camera motion, and naturally extends to dynamic scenes. In controlled experiments, we show that factored flow prediction outperforms alternative designs and that performance scales consistently with unlabeled data. Integrating factored flow into existing visual geometry architectures and training with ~800K unlabeled videos, Flow3r achieves state-of-the-art results across eight benchmarks spanning static and dynamic scenes, with its largest gains on in-the-wild dynamic videos where labeled data is most scarce.
中文标题/摘要
标题:Flow3r:可扩展视觉几何学习的分解流预测
当前的前馈3D/4D重建系统依赖于密集的几何和姿态监督——在大规模获取时成本高昂,特别是在动态现实场景中尤为稀缺。我们提出了Flow3r框架,该框架通过密集的2D对应关系(`流`)作为监督,使从未标记的单目视频中进行可扩展训练成为可能。我们的核心见解是,流预测模块应该被分解:使用一个图像的几何潜在变量和另一个图像的姿态潜在变量来预测两个图像之间的流。这种分解直接指导了场景几何和相机运动的学习,并自然地扩展到动态场景。在受控实验中,我们展示了分解流预测优于其他设计,并且性能随着未标记数据的增加而一致地提高。将分解流整合到现有的视觉几何架构中,并使用约80万未标记视频进行训练,Flow3r在八个涵盖静态和动态场景的基准测试中取得了最先进的结果,其最大的改进出现在标记数据最稀缺的野外动态视频中。
Summary / 总结
Flow3r is a framework that uses dense 2D correspondences (flow) as supervision to augment visual geometry learning, enabling scalable training from unlabeled monocular videos. The key method is to factorize flow prediction, predicting flow between two images using geometry latents from one and pose latents from the other. This approach improves performance consistently with more unlabeled data and achieves state-of-the-art results across various benchmarks, especially on dynamic scenes with scarce labeled data.
Flow3r 是一个框架,使用密集的 2D 对应关系(流)作为监督,从未标记的单目视频中实现视觉几何的大规模训练。关键方法是将流预测模块分解为,使用一个图像的几何潜在变量和另一个图像的姿态潜在变量来预测两个图像之间的流。这种方法在静态和动态场景上都提高了性能,最大的改进出现在未标记的动态视频上。Flow3r 使用大约 80 万个未标记的视频在八个基准测试中达到了最先进的结果。
Skill-Inject: Measuring Agent Vulnerability to Skill File Attacks
Authors: David Schmotz, Luca Beurer-Kellner, Sahar Abdelnabi, Maksym Andriushchenko
First: 2026-02-23T18:59:27+00:00 · Latest: 2026-02-23T18:59:27+00:00
Abstract
LLM agents are evolving rapidly, powered by code execution, tools, and the recently introduced agent skills feature. Skills allow users to extend LLM applications with specialized third-party code, knowledge, and instructions. Although this can extend agent capabilities to new domains, it creates an increasingly complex agent supply chain, offering new surfaces for prompt injection attacks. We identify skill-based prompt injection as a significant threat and introduce SkillInject, a benchmark evaluating the susceptibility of widely-used LLM agents to injections through skill files. SkillInject contains 202 injection-task pairs with attacks ranging from obviously malicious injections to subtle, context-dependent attacks hidden in otherwise legitimate instructions. We evaluate frontier LLMs on SkillInject, measuring both security in terms of harmful instruction avoidance and utility in terms of legitimate instruction compliance. Our results show that today's agents are highly vulnerable with up to 80% attack success rate with frontier models, often executing extremely harmful instructions including data exfiltration, destructive action, and ransomware-like behavior. They furthermore suggest that this problem will not be solved through model scaling or simple input filtering, but that robust agent security will require context-aware authorization frameworks. Our benchmark is available at https://www.skill-inject.com/.
中文标题/摘要
标题:技能注入:衡量代理对技能文件攻击的脆弱性
LLM代理正在迅速发展,得益于代码执行、工具以及最近引入的代理技能功能。技能允许用户通过专门的第三方代码、知识和指令扩展LLM应用程序的功能。尽管这可以将代理能力扩展到新的领域,但也为提示注入攻击提供了新的攻击面。我们识别出基于技能的提示注入是一个重大威胁,并引入了SkillInject基准,评估广泛使用的LLM代理通过技能文件遭受注入攻击的易感性。SkillInject包含202个注入任务对,攻击范围从明显的恶意注入到隐藏在合法指令中的微妙、上下文相关的攻击。我们对前沿LLM进行了评估,从有害指令的避免和合法指令的遵守两个方面衡量安全性。结果显示,当前的代理高度易受攻击,前沿模型的攻击成功率高达80%,经常执行极其有害的指令,包括数据泄露、破坏性操作和类似勒索软件的行为。此外,这些结果表明,这个问题不会通过模型扩展或简单的输入过滤来解决,而是需要具备上下文感知授权框架的稳健代理安全。我们的基准可以在https://www.skill-inject.com/找到。
Summary / 总结
The paper addresses the vulnerability of LLM agents to skill-based prompt injection attacks, which exploit the third-party skill files used to extend agent capabilities. It introduces SkillInject, a benchmark consisting of 202 injection-task pairs, to evaluate the susceptibility of LLM agents to such attacks. The evaluation shows that frontier LLMs are highly vulnerable, with up to 80% attack success rate, often executing harmful instructions such as data exfiltration and ransomware-like behavior. The results indicate that robust security will require context-aware authorization frameworks rather than just model scaling or simple input filtering.
研究旨在应对LLM代理中基于技能的提示注入攻击日益增长的威胁,这些攻击可以扩展代理的功能但也会带来安全风险。研究引入了SkillInject基准,用于评估LLM代理通过技能文件对这类攻击的易感性。基准包括202个注入任务对,具有不同程度的恶意性。对领先LLM模型的评估显示,这些模型高度易受攻击,成功率高达80%,常常导致严重的操作,如数据泄露和勒索软件行为。研究结果表明,稳健的安全性需要依赖上下文感知的授权框架,而不仅仅是依赖模型扩展或简单的输入过滤。
Agentic AI for Scalable and Robust Optical Systems Control
Authors: Zehao Wang, Mingzhe Han, Wei Cheng, Yue-Kai Huang, Philip Ji, Denton Wu, Mahdi Safari, Flemming Holtorf, Kenaish AlQubaisi, Norbert M. Linke, Danyang Zhuo, Yiran Chen, Ting Wang, Dirk Englund, Tingjun Chen
First: 2026-02-23T18:54:32+00:00 · Latest: 2026-02-23T18:54:32+00:00
Abstract
We present AgentOptics, an agentic AI framework for high-fidelity, autonomous optical system control built on the Model Context Protocol (MCP). AgentOptics interprets natural language tasks and executes protocol-compliant actions on heterogeneous optical devices through a structured tool abstraction layer. We implement 64 standardized MCP tools across 8 representative optical devices and construct a 410-task benchmark to evaluate request understanding, role-aware responses, multi-step coordination, robustness to linguistic variation, and error handling. We assess two deployment configurations (commercial online LLMs and locally hosted open-source LLMs) and compare them with LLM-based code generation baselines. AgentOptics achieves 87.7% to 99.0% average task success rates, significantly outperforming code-generation approaches, which reach up to 50% success. We further demonstrate broader applicability through five case studies extending beyond device-level control to system orchestration, monitoring, and closed-loop optimization. These include DWDM link provisioning and coordinated monitoring of coherent 400 GbE and analog radio-over-fiber (ARoF) channels; autonomous characterization and bias optimization of a wideband ARoF link carrying 5G fronthaul traffic; multi-span channel provisioning with launch power optimization; closed-loop fiber polarization stabilization; and distributed acoustic sensing (DAS)-based fiber monitoring with LLM-assisted event detection. These results establish AgentOptics as a scalable, robust paradigm for autonomous control and orchestration of heterogeneous optical systems.
中文标题/摘要
标题:代理AI在可扩展和稳健的光学系统控制中的应用
我们提出了AgentOptics,一种基于模型上下文协议(MCP)的高保真度自主光学系统控制的代理AI框架。AgentOptics 解释自然语言任务并通过结构化的工具抽象层执行符合协议的操作,覆盖了8种代表性光学设备上的64个标准化MCP工具,并构建了一个包含410个任务的基准测试,以评估请求理解、角色感知响应、多步协调、语言变异的鲁棒性以及错误处理。我们评估了两种部署配置——商用在线LLM和本地托管的开源LLM,并与基于LLM的代码生成基线进行比较。AgentOptics 实现了87.7%至99.0%的平均任务成功率,显著优于代码生成方法,后者最高成功率仅为50%。我们还通过五个案例研究进一步展示了其更广泛的应用,这些案例研究不仅扩展到设备级控制,还涉及系统编排、监控和闭环优化。这些案例包括DWDM链路配置、相干400 GbE和模拟射频光纤(ARoF)通道的协调监控;宽带ARoF链路的自主表征和偏置优化,该链路承载5G前传流量;多段通道配置,包括发射功率优化;闭环光纤偏振稳定;以及基于LLM辅助事件检测的分布式声学传感(DAS)光纤监控。这些结果确立了AgentOptics 作为自主控制和编排异构光学系统的可扩展和稳健范式的地位。
Summary / 总结
AgentOptics is an agentic AI framework for autonomous optical system control using the Model Context Protocol (MCP). It interprets natural language tasks and executes actions on various optical devices. The framework achieves an average task success rate of 87.7% to 99.0%, significantly outperforming code-generation approaches which only reach up to 50% success. It demonstrates broad applicability in device-level control, system orchestration, monitoring, and closed-loop optimization across different optical systems and channels.
AgentOptics 是一个使用 Model Context Protocol (MCP) 的自主光学系统控制框架,能够解析自然语言任务并在多种光学设备上执行操作。该框架的任务成功率在 87.7% 到 99.0% 之间,优于代码生成方法。它通过 DWDM 链路配置、相干 400 GbE 监控和光纤偏振稳定等案例研究展示了广泛的应用性。
TROLL: Trust Regions improve Reinforcement Learning for Large Language Models
Authors: Philipp Becker, Niklas Freymuth, Serge Thilges, Fabian Otto, Gerhard Neumann
Venue: ICLR 2026
First: 2025-10-04T14:14:20+00:00 · Latest: 2026-02-23T18:54:13+00:00
Comments: Published as a conference paper at ICLR 2026
Abstract
Reinforcement Learning (RL) with PPO-like clip objectives has become the standard choice for reward-based fine-tuning of large language models (LLMs). Although recent work has explored improved estimators of advantages and normalization, the clipping mechanism itself has remained untouched. Originally introduced as a proxy for principled KL-based trust regions, clipping is a crude approximation that often causes unstable updates and suboptimal performance. We replace the clip objective with a novel discrete differentiable trust region projection, which provides principled token-level KL constraints. The projection operates on a sparse subset of the model's most important token logits to balance computational cost and projection effectiveness. Our approach, Trust Region Optimization for Large Language models (TROLL), serves as a direct replacement for PPO-like clipping during training and does not alter the model's inference behavior. Across mathematical reasoning and code generation tasks, model families, as well as advantage-estimation methods, TROLL consistently outperforms PPO-like clipping in terms of training speed, stability, and final success rates.
中文标题/摘要
标题:TROLL:信任区域提高大型语言模型的强化学习
使用PPO类似剪裁目标的强化学习(RL)已成为基于奖励的大型语言模型(LLM)微调的标准选择。尽管最近的工作探索了改进的优势估计和归一化方法,但剪裁机制本身仍未得到改进。剪裁最初作为原则性KL信任区域的代理引入,但它是对KL约束的粗略近似,经常导致不稳定的更新和次优性能。我们用一种新颖的离散可微信任区域投影取代剪裁目标,提供原则性的令牌级KL约束。投影作用于模型最重要的令牌logits的稀疏子集,以平衡计算成本和投影效果。我们的方法,大型语言模型的信任区域优化(TROLL),在训练期间直接替代PPO类似的剪裁,而不改变模型的推理行为。在数学推理和代码生成任务、模型系列以及优势估计方法方面,TROLL在训练速度、稳定性和最终成功率方面均优于PPO类似的剪裁。
Summary / 总结
The research aims to improve the stability and performance of reinforcement learning for large language models by replacing the clipping mechanism in PPO-like objectives with a discrete differentiable trust region projection. This method provides principled token-level KL constraints and balances computational cost and projection effectiveness. Experiments show that TROLL outperforms PPO-like clipping in terms of training speed, stability, and final success rates across various tasks and model families.
TROLL 通过将 PPO 类目标中的剪辑机制替换为一种新颖的离散可微信任区域投影,提供了一种针对标记的 KL 约束。这种方法在各种任务和模型家族中,在训练速度、稳定性和最终成功率方面都优于 PPO 类剪辑机制。
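To make the contrast in the abstract concrete, the sketch below shows the standard PPO clipped surrogate for one token next to the KL divergence between two token distributions, which is the quantity a KL-based trust region constrains. It does not implement TROLL's differentiable projection; the clip range of 0.2 and the toy distributions are assumptions, and the function names are hypothetical.

```python
import math


def ppo_clip_term(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for a single token."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def categorical_kl(p, q):
    """KL(p || q) for two distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Clipping looks only at the probability ratio of the sampled token.
gain_capped = ppo_clip_term(ratio=1.5, advantage=1.0)    # capped at 1 + eps
loss_capped = ppo_clip_term(ratio=0.5, advantage=-1.0)   # capped at -(1 - eps)

# A KL constraint looks at the whole next-token distribution instead.
old_policy = [0.5, 0.5]
new_policy = [0.9, 0.1]
drift = categorical_kl(new_policy, old_policy)
```

The clip term depends on a single ratio, whereas the KL term measures how far the full distribution has moved, which is why the abstract describes clipping as a crude proxy for a KL-based trust region.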
Recurrent Structural Policy Gradient for Partially Observable Mean Field Games
Authors: Clarisse Wibault, Johannes Forkel, Sebastian Towers, Tiphaine Wibault, Juan Duque, George Whittle, Andreas Schaab, Yucheng Yang, Chiyuan Wang, Michael Osborne, Benjamin Moll, Jakob Foerster
First: 2026-02-23T18:53:09+00:00 · Latest: 2026-02-23T18:53:09+00:00
Abstract
Mean Field Games (MFGs) provide a principled framework for modeling interactions in large population models: at scale, population dynamics become deterministic, with uncertainty entering only through aggregate shocks, or common noise. However, algorithmic progress has been limited since model-free methods are too high variance and exact methods scale poorly. Recent Hybrid Structural Methods (HSMs) use Monte Carlo rollouts for the common noise in combination with exact estimation of the expected return, conditioned on those samples. However, HSMs have not been scaled to Partially Observable settings. We propose Recurrent Structural Policy Gradient (RSPG), the first history-aware HSM for settings involving public information. We also introduce MFAX, our JAX-based framework for MFGs. By leveraging known transition dynamics, RSPG achieves state-of-the-art performance as well as an order-of-magnitude faster convergence and solves, for the first time, a macroeconomics MFG with heterogeneous agents, common noise and history-aware policies. MFAX is publicly available at: https://github.com/CWibault/mfax.
中文标题/摘要
标题:部分可观测的均场博弈的循环结构策略梯度
均场博弈(MFGs)提供了一种原理性的框架来建模大规模群体模型中的相互作用:在大规模情况下,群体动力学变得确定性,不确定性仅通过总体冲击或共同噪声进入。然而,由于无模型方法的方差过高且精确方法的可扩展性较差,算法进展有限。最近的混合结构方法(HSMs)使用蒙特卡洛展开来处理共同噪声,并结合了基于这些样本的预期回报的精确估计。然而,HSMs尚未扩展到部分可观测的设置。我们提出了循环结构策略梯度(RSPG),这是第一个具有历史意识的HSM,适用于涉及公共信息的设置。我们还引入了MFAX,这是一个基于JAX的MFG框架。通过利用已知的转换动态,RSPG实现了最先进的性能,收敛速度提高了数量级,并首次解决了包含异质代理、共同噪声和历史意识策略的宏观经济MFG。MFAX可在以下网址获取:https://github.com/CWibault/mfax。
Summary / 总结
Algorithmic progress on Mean Field Games (MFGs) has been limited because model-free methods suffer from high variance and exact methods scale poorly. Hybrid Structural Methods (HSMs) combine Monte Carlo rollouts for the common noise with exact estimation of the expected return, but had not been scaled to partially observable settings. The paper introduces Recurrent Structural Policy Gradient (RSPG), the first history-aware HSM for settings involving public information, together with MFAX, a JAX-based framework for MFGs. RSPG achieves state-of-the-art performance and an order-of-magnitude faster convergence, and solves, for the first time, a macroeconomics MFG with heterogeneous agents, common noise, and history-aware policies.
论文旨在解决在部分可观测状态和公共噪声条件下大规模人群模型中应用无模型方法的挑战。它提出了一个历史感知的混合结构方法——递归结构策略梯度(RSPG),该方法利用蒙特卡洛滚动对公共噪声进行采样,并精确估计基于这些样本的预期回报。RSPG实现了最先进的性能和更快的收敛速度,并首次解决了包含异质代理、公共噪声和历史感知策略的宏观经济MFG模型。
Towards a Science of AI Agent Reliability
Authors: Stephan Rabanser, Sayash Kapoor, Peter Kirgis, Kangheng Liu, Saiteja Utpala, Arvind Narayanan
First: 2026-02-18T18:05:44+00:00 · Latest: 2026-02-23T18:49:07+00:00
Comments: Interactive dashboard available at: https://hal.cs.princeton.edu/reliability
Abstract
AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents still fail in practice. This discrepancy highlights a fundamental limitation of current evaluations: compressing agent behavior into a single success metric obscures critical operational flaws. Notably, it ignores whether agents behave consistently across runs, withstand perturbations, fail predictably, or have bounded error severity. Grounded in safety-critical engineering, we provide a holistic performance profile by proposing twelve concrete metrics that decompose agent reliability along four key dimensions: consistency, robustness, predictability, and safety. Evaluating 14 models across two complementary benchmarks, we find that recent capability gains have only yielded small improvements in reliability. By exposing these persistent limitations, our metrics complement traditional evaluations while offering tools for reasoning about how agents perform, degrade, and fail.
中文标题/摘要
标题:迈向AI代理可靠性的科学
AI代理正越来越多地被部署以执行重要任务。尽管在标准基准上的准确率得分不断提高,表明了快速的进步,但许多代理仍然在实践中继续失败。这种差异突显了当前评估的基本局限性:将代理行为压缩为单一的成功指标掩盖了关键的操作缺陷。值得注意的是,它忽略了代理是否在多次运行中表现一致、能否抵御干扰、是否能预测性地失败或其误差严重性是否受到限制。基于安全关键工程,我们通过提出十二个具体的指标,从四个关键维度分解代理可靠性:一致性、鲁棒性、可预测性和安全性,提供了一个全面的性能概况。在两个互补基准上评估14个模型后,我们发现最近的能力提升仅在可靠性方面带来了微小的改进。通过揭示这些持续存在的局限性,我们的指标补充了传统的评估,同时提供了关于代理如何表现、退化和失败的推理工具。
Summary / 总结
The paper addresses the gap between AI agent performance on benchmarks and their practical reliability. It introduces twelve metrics to evaluate reliability along four dimensions: consistency, robustness, predictability, and safety. Evaluating 14 models, the study finds that recent advancements have only marginally improved reliability, highlighting persistent issues in AI agent performance and behavior.
研究旨在解决AI代理在基准测试中的表现与其在实际应用中的可靠性之间的差距。它提出了十二个指标来评估AI代理在一致性、鲁棒性、可预测性和安全性四个维度上的可靠性。通过对14个模型的评估,研究发现最近的进步仅在可靠性方面带来了微小的改进,突显了AI代理在性能、鲁棒性和可预测性方面的持续问题。
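The claim that a single success metric hides operational flaws can be illustrated with a toy comparison: two agents with identical mean accuracy whose run-to-run behaviour differs completely. The agreement score below is a hypothetical stand-in invented for this example and is not one of the paper's twelve metrics; the run data are made up.

```python
def mean_accuracy(runs):
    """Average success over all tasks and all repeated runs."""
    outcomes = [o for task_runs in runs.values() for o in task_runs]
    return sum(outcomes) / len(outcomes)


def outcome_agreement(runs):
    """Fraction of tasks on which every repeated run gave the same outcome."""
    agree = sum(1 for task_runs in runs.values() if len(set(task_runs)) == 1)
    return agree / len(runs)


# Hypothetical outcomes: 1 = success, 0 = failure, four runs per task.
agent_a = {"task1": [1, 1, 1, 1], "task2": [0, 0, 0, 0]}
agent_b = {"task1": [1, 0, 1, 0], "task2": [0, 1, 0, 1]}

acc_a, acc_b = mean_accuracy(agent_a), mean_accuracy(agent_b)
agree_a, agree_b = outcome_agreement(agent_a), outcome_agreement(agent_b)
```

Both agents score 0.5 on accuracy, yet agent A succeeds or fails predictably per task while agent B behaves like a coin flip, a difference that only the second measure exposes.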
A Benchmark of Causal vs. Correlation AI for Predictive Maintenance
Authors: Shaunak Dhande, Chutian Ma, Giacinto Paolo Saggese, Paul Smith, Krishna Taduri
First: 2025-11-30T23:59:37+00:00 · Latest: 2026-02-23T18:46:56+00:00
Abstract
Predictive maintenance in manufacturing environments presents a challenging optimization problem characterized by extreme cost asymmetry, where missed failures incur costs roughly fifty times higher than false alarms. Conventional machine learning approaches typically optimize statistical accuracy metrics that do not reflect this operational reality and cannot reliably distinguish causal relationships from spurious correlations. This study benchmarks eight predictive models, ranging from baseline statistical approaches to Bayesian structural causal methods, on a dataset of 10,000 CNC machines with a 3.3 percent failure prevalence. While ensemble correlation-based models such as Random Forest (L4) achieve the highest raw cost savings (70.8 percent reduction), the Bayesian Structural Causal Model (L7) delivers competitive financial performance (66.4 percent cost reduction) with an inherent ability of failure attribution, which correlation-based models do not readily provide. The model achieves perfect attribution for HDF, PWF, and OSF failure types. These results suggest that causal methods, when combined with domain knowledge and Bayesian inference, offer a potentially favorable trade-off between predictive performance and operational interpretability in predictive maintenance applications.
中文标题/摘要
标题:因果关系AI与相关性AI在预测性维护中的基准测试
制造环境中的预测性维护提出了一个具有极端成本不对称性的优化问题,其中未发现故障的成本大约是误报成本的五十倍。传统机器学习方法通常优化统计准确度指标,这些指标未能反映这种运营现实,也无法可靠地区分因果关系和虚假相关性。本研究在包含10,000台CNC机床的数据集上,对3.3%的故障率进行了八种预测模型的基准测试,从基线统计方法到贝叶斯结构因果方法。尽管集成的相关性模型,如随机森林(L4)实现了最高的原始成本节省(70.8%的减少),但贝叶斯结构因果模型(L7)提供了具有竞争力的财务表现(66.4%的成本减少),并且具有相关性模型无法轻易提供的故障归因能力。该模型在HDF、PWF和OSF故障类型上实现了完美的归因。这些结果表明,当与领域知识和贝叶斯推理结合使用时,因果方法在预测性能和操作解释性之间可能提供一个有利的权衡,在预测性维护应用中具有潜在的优势。
Summary / 总结
This study addresses the challenge of predictive maintenance in manufacturing by benchmarking eight models, from statistical approaches to Bayesian structural causal methods, on a dataset of 10,000 CNC machines. While correlation-based models like Random Forest achieve the highest cost savings, the Bayesian Structural Causal Model provides competitive financial performance with the added benefit of failure attribution, achieving perfect attribution for specific failure types. This suggests that causal methods can offer a favorable trade-off between predictive performance and operational interpretability.
该研究针对制造环境中预测性维护的挑战,即未检测到的故障成本远高于误报成本。研究对包括统计和贝叶斯结构因果方法在内的八种预测模型进行了基准测试,数据集包含10,000台CNC机器。虽然集成相关性模型如随机森林提供了最高的原始成本节省,但贝叶斯结构因果模型在财务表现上具有竞争力,并且具有相关性模型所不具备的故障归因能力。该贝叶斯模型对特定类型的故障实现了完美的归因,表明因果方法可以在预测性能和操作解释性之间提供一个有利的权衡。
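The 50:1 cost asymmetry has a direct consequence for the alarm threshold: raising an alarm is worthwhile whenever p * C_miss > (1 - p) * C_fa, that is, whenever the predicted failure probability p exceeds C_fa / (C_fa + C_miss). The sketch below works through that arithmetic on made-up predictions; the probabilities, labels, and function names are hypothetical, and only the cost ratio comes from the abstract.

```python
def alarm_threshold(cost_false_alarm, cost_missed_failure):
    """Probability above which an alarm has lower expected cost than silence."""
    return cost_false_alarm / (cost_false_alarm + cost_missed_failure)


def total_cost(probs, labels, threshold, cost_false_alarm, cost_missed_failure):
    """Sum false-alarm and missed-failure costs for a given threshold."""
    cost = 0.0
    for p, failed in zip(probs, labels):
        alarm = p > threshold
        if alarm and not failed:
            cost += cost_false_alarm
        elif failed and not alarm:
            cost += cost_missed_failure
    return cost


# Cost ratio from the abstract: a missed failure costs about 50 false alarms.
C_FA, C_MISS = 1.0, 50.0
threshold = alarm_threshold(C_FA, C_MISS)   # 1/51, roughly 0.02

# Hypothetical predicted failure probabilities and true outcomes.
probs = [0.01, 0.05, 0.30, 0.60]
labels = [0, 0, 1, 1]

cost_default = total_cost(probs, labels, 0.5, C_FA, C_MISS)
cost_tuned = total_cost(probs, labels, threshold, C_FA, C_MISS)
```

On this toy data the default 0.5 threshold misses one failure (cost 50), while the cost-aware threshold trades it for a single false alarm (cost 1). This is why accuracy-oriented metrics can be a poor guide in this setting.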
Find the Fruit: Zero-Shot Sim2Real RL for Occlusion-Aware Plant Manipulation
Authors: Nitesh Subedi, Hsin-Jung Yang, Devesh K. Jha, Soumik Sarkar
First: 2025-05-22T11:37:39+00:00 · Latest: 2026-02-23T18:46:55+00:00
Abstract
Autonomous harvesting in the open presents a complex manipulation problem. In most scenarios, an autonomous system has to deal with significant occlusion and requires interaction in the presence of large structural uncertainties (every plant is different). Perceptual and modeling uncertainty make the design of reliable manipulation controllers for harvesting challenging, resulting in poor performance during deployment. We present a sim2real reinforcement learning (RL) framework for occlusion-aware plant manipulation, where a policy is learned entirely in simulation to reposition stems and leaves to reveal target fruit(s). In our proposed approach, we decouple high-level kinematic planning from low-level compliant control, which simplifies the sim2real transfer. This decomposition allows the learned policy to generalize across multiple plants with different stiffness and morphology. In experiments with multiple real-world plant setups, our system achieves up to 86.7% success in exposing target fruits, demonstrating robustness to occlusion variation and structural uncertainty.
中文标题/摘要
标题:寻找果实:零样本模拟到现实的RL在遮挡感知植物操作中的应用
在开放环境中进行自主收获是一个复杂的操作问题。在大多数情况下,自主系统必须处理显著的遮挡,并在存在大量结构不确定性(每株植物都不同)的情况下进行交互。感知和建模不确定性使得设计可靠的收获操作控制器变得具有挑战性,导致部署时性能不佳。我们提出了一种模拟到现实的强化学习(RL)框架,用于遮挡感知的植物操作,其中策略完全在模拟中学习以重新定位茎和叶子以暴露目标果实。在我们提出的方法中,我们将高层的运动规划与低层的顺应控制解耦,简化了模拟到现实的转移。这种分解使得学习到的策略能够在具有不同刚度和形态的多种植物之间泛化。在多个真实世界的植物设置实验中,我们的系统在暴露目标果实方面取得了高达86.7%的成功率,展示了对遮挡变化和结构不确定性具有鲁棒性。
Summary / 总结
The research aims to address the complex manipulation challenges in autonomous harvesting, particularly the issues of occlusion and structural uncertainties. The authors propose a simulation-to-real reinforcement learning framework that learns a policy to reposition stems and leaves to reveal target fruits. By decoupling high-level kinematic planning from low-level control, the system generalizes well across different plants. Experiments show that the system achieves up to 86.7% success in exposing target fruits, indicating robustness to occlusion and structural variations.
研究旨在解决自主收获中的复杂操作问题,特别是处理遮挡和结构不确定性带来的挑战。方法是采用一种从仿真到现实的强化学习框架,在仿真中学习一个策略来重新定位茎叶以揭示目标果实。关键实验发现是,该系统在多个实际植物设置中实现了高达86.7%的目标果实暴露成功率,展示了对遮挡变化和结构不确定性较强的鲁棒性。
KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration
Authors: Mohammad Amanlou, Erfan Shafiee Moghaddam, Yasaman Amou Jafari, Mahdi Noori, Farhan Farsi, Behnam Bahrak
First: 2026-02-23T18:46:27+00:00 · Latest: 2026-02-23T18:46:27+00:00
Comments: Accepted at the Third Conference on Parsimony and Learning (CPAL 2026). 36 pages, 12 figures. (Equal contribution: Yasaman Amou Jafari and Mahdi Noori.)
Abstract
Large language models (LLMs) have become instrumental in applications such as Retrieval-Augmented Generation (RAG). Yet evaluating these systems remains bottlenecked by the time and cost of building specialized assessment datasets. We introduce KNIGHT, an LLM-based, knowledge-graph-driven framework for generating multiple-choice question (MCQ) datasets from external sources. KNIGHT constructs a topic-specific knowledge graph, a structured and parsimonious summary of entities and relations, that can be reused to generate instructor-controlled difficulty levels, including multi-hop questions, without repeatedly re-feeding the full source text. This knowledge graph acts as a compressed, reusable state, making question generation a cheap read over the graph. We instantiate KNIGHT on Wikipedia/Wikidata while keeping the framework domain- and ontology-agnostic. As a case study, KNIGHT produces six MCQ datasets in History, Biology, and Mathematics. We evaluate quality on five criteria: fluency, unambiguity (single correct answer), topic relevance, option uniqueness, and answerability given the provided sources (as a proxy for hallucination). Results show that KNIGHT enables token- and cost-efficient generation from a reusable graph representation, achieves high quality across these criteria, and yields model rankings aligned with MMLU-style benchmarks, while supporting topic-specific and difficulty-controlled evaluation.
中文标题/摘要
标题:KNIGHT:基于知识图谱的自适应难度调整多项选择题生成
随着大型语言模型(LLMs)的发展,它们在检索增强生成(RAG)等应用中变得至关重要。然而,评估这些系统仍然受到构建专门评估数据集所需时间和成本的限制。我们提出了KNIGHT,一种基于LLM和知识图谱的框架,可以从外部来源生成多项选择题(MCQ)数据集。KNIGHT构建了一个特定主题的知识图谱,这是一种结构化且简洁的实体和关系总结,可以重复使用以生成由教师控制难度级别的问题,包括多跳问题,而无需反复重新输入完整源文本。这个知识图谱作为可重复使用的压缩状态,使得问题生成成为图上的廉价读取操作。我们以维基百科/维基数据为例实例化KNIGHT,同时保持框架的领域无关性和本体无关性。作为案例研究,KNIGHT生成了六个历史、生物学和数学领域的MCQ数据集。我们从五个标准评估质量:流畅性、明确性(单一正确答案)、主题相关性、选项独特性和基于提供的来源可回答性(作为幻觉的代理)。结果表明,KNIGHT能够从可重复使用的图表示中实现高效生成,这些标准下的质量都很高,并且模型排名与MMLU风格的基准一致,同时支持特定主题和难度控制的评估。
Summary / 总结
KNIGHT is an LLM-based framework that generates multiple-choice questions from a topic-specific knowledge graph, enabling the creation of instructor-controlled difficulty levels without re-feeding the full source text. It produces six MCQ datasets in History, Biology, and Mathematics, achieving high quality across fluency, unambiguity, topic relevance, option uniqueness, and answerability. The framework supports topic-specific and difficulty-controlled evaluation, making question generation token- and cost-efficient.
KNIGHT 是一个基于LLM的框架,通过构建特定主题的知识图谱生成多项选择题(MCQ),实现高效且教师可控的难度调整。它从外部来源构建一个结构化的实体和关系摘要,无需重新输入完整文本即可生成问题。KNIGHT 在历史、生物学和数学领域生成了高质量的 MCQ 数据集,满足流畅性、明确性、主题相关性、选项独特性和可回答性等标准。该框架支持高效和成本效益的生成,并与MMLU风格的基准对齐。
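To illustrate how a reusable knowledge graph can drive multi-hop questions in the KNIGHT entry above, here is a minimal sketch. It assumes the graph is a list of (head, relation, tail) triples and treats hop count as the difficulty knob; the function name, the question template, and the toy triples are hypothetical, and distractor generation for the MCQ options is omitted.

```python
def two_hop_questions(triples, start):
    """Build 2-hop (question, answer) pairs by chaining two relations from `start`."""
    by_head = {}
    for head, relation, tail in triples:
        by_head.setdefault(head, []).append((relation, tail))
    items = []
    for r1, mid in by_head.get(start, []):
        for r2, end in by_head.get(mid, []):
            # The intermediate entity `mid` is hidden from the question text.
            question = f"What is the {r2} of the {r1} of {start}?"
            items.append((question, end))
    return items


# Toy graph for illustration; the abstract says KNIGHT builds its graph
# from Wikipedia/Wikidata.
TRIPLES = [
    ("Marie Curie", "birthplace", "Warsaw"),
    ("Warsaw", "country", "Poland"),
]
```

Because the questions are read off the stored triples, the source text never has to be passed to a model again, which is the "cheap read over the graph" idea in the abstract.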
Modeling Epidemiological Dynamics Under Adversarial Data and User Deception
Authors: Yiqi Su, Christo Kurisummoottil Thomas, Walid Saad, Bud Mishra, Naren Ramakrishnan
First: 2026-02-23T18:45:55+00:00 · Latest: 2026-02-23T18:45:55+00:00
Abstract
Epidemiological models increasingly rely on self-reported behavioral data such as vaccination status, mask usage, and social distancing adherence to forecast disease transmission and assess the impact of non-pharmaceutical interventions (NPIs). While such data provide valuable real-time insights, they are often subject to strategic misreporting, driven by individual incentives to avoid penalties, access benefits, or express distrust in public health authorities. To account for such human behavior, in this paper, we introduce a game-theoretic framework that models the interaction between the population and a public health authority as a signaling game. Individuals (senders) choose how to report their behaviors, while the public health authority (receiver) updates its epidemiological model(s) based on potentially distorted signals. Focusing on deception around masking and vaccination, we analytically characterize game equilibrium outcomes and evaluate the degree to which deception can be tolerated while maintaining epidemic control through policy interventions. Our results show that separating equilibria, with minimal deception, drive infections to near zero over time. Remarkably, even under pervasive dishonesty in pooling equilibria, well-designed sender and receiver strategies can still maintain effective epidemic control. This work advances the understanding of adversarial data in epidemiology and offers tools for designing more robust public health models in the presence of strategic user behavior.
中文标题/摘要
标题:在敌对数据和用户欺骗下的流行病动力学建模
流行病学模型越来越多地依赖于自我报告的行为数据,如疫苗接种状态、口罩使用和社交距离遵守情况,以预测疾病传播并评估非药物干预措施(NPIs)的影响。虽然这些数据提供了有价值的实时见解,但它们往往受到战略性误报的影响,个体出于避免惩罚、获取利益或表达对公共卫生当局的不信任而选择报告行为。为了考虑这种人类行为,本文引入了一种博弈论框架,将人口与公共卫生当局之间的互动建模为信号博弈。个体(发送者)选择如何报告其行为,而公共卫生当局(接收者)则根据可能被扭曲的信号更新其流行病学模型。聚焦于口罩和疫苗的欺骗行为,我们从理论上分析了博弈均衡结果,并评估了在政策干预下可以容忍的欺骗程度,同时仍能维持流行病控制。研究结果表明,分离均衡(最小欺骗)随着时间的推移将感染率驱至接近零。令人惊讶的是,即使在混同均衡下存在普遍的不诚实行为,精心设计的发送者和接收者策略仍能维持有效的流行病控制。这项工作推进了对流行病学中敌对数据的理解,并提供了在存在战略用户行为的情况下设计更稳健公共卫生模型的工具。
Summary / 总结
The paper introduces a game-theoretic framework to model the interaction between individuals and public health authorities in the context of self-reported behavioral data, such as mask usage and vaccination status. By characterizing game equilibrium outcomes, the study evaluates the impact of deception on epidemic control and finds that even under pervasive dishonesty, effective epidemic control can still be maintained with well-designed strategies. The research advances the understanding of adversarial data in epidemiology and provides tools for robust public health modeling.
本文提出了一种基于博弈论的框架,用于建模个体与公共卫生机构在战略误报行为(如佩戴口罩和接种疫苗)时的互动。该框架基于信号博弈,其中个体选择如何报告其行为,公共卫生机构则根据这些可能被扭曲的信号更新其模型。研究发现,分离均衡,即最小的欺骗,可以随着时间推移将感染率降低到接近零。即使在普遍不诚实的情况下,精心设计的策略仍能维持有效的疫情控制。
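For the signaling-game entry above, a single Bayes update gives a feel for how a receiver might discount distorted reports. This is a sketch under strong assumptions that are not taken from the paper: behavior is binary (compliant or not), compliant individuals report truthfully, and non-compliant individuals falsely report compliance with probability `lie_prob`. The function name is hypothetical.

```python
def posterior_compliant(prior_compliant, lie_prob):
    """P(actually compliant | reports compliant).

    Assumptions (illustrative only): compliant individuals always report
    truthfully; non-compliant individuals falsely report compliance with
    probability `lie_prob`.
    """
    if not (0.0 <= prior_compliant <= 1.0 and 0.0 <= lie_prob <= 1.0):
        raise ValueError("probabilities must lie in [0, 1]")
    p_report = prior_compliant + (1.0 - prior_compliant) * lie_prob
    if p_report == 0.0:
        raise ValueError("a compliant report has zero probability")
    return prior_compliant / p_report
```

With `lie_prob = 0` the report is fully informative, which is the separating case; with `lie_prob = 1` the posterior collapses to the prior, which is the pooling case where reports carry no information.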
AdaEvolve: Adaptive LLM Driven Zeroth-Order Optimization
Authors: Mert Cemri, Shubham Agrawal, Akshat Gupta, Shu Liu, Audrey Cheng, Qiuyang Mang, Ashwin Naren, Lutfi Eren Erdogan, Koushik Sen, Matei Zaharia, Alex Dimakis, Ion Stoica
First: 2026-02-23T18:45:31+00:00 · Latest: 2026-02-23T18:45:31+00:00
Abstract
The paradigm of automated program generation is shifting from one-shot generation to inference-time search, where Large Language Models (LLMs) function as semantic mutation operators within evolutionary loops. While effective, these systems are currently governed by static schedules that fail to account for the non-stationary dynamics of the search process. This rigidity results in substantial computational waste, as resources are indiscriminately allocated to stagnating populations while promising frontiers remain under-exploited. We introduce AdaEvolve, a framework that reformulates LLM-driven evolution as a hierarchical adaptive optimization problem. AdaEvolve uses an "accumulated improvement signal" to unify decisions across three levels: Local Adaptation, which dynamically modulates the exploration intensity within a population of solution candidates; Global Adaptation, which routes the global resource budget via bandit-based scheduling across different solution candidate populations; and Meta-Guidance which generates novel solution tactics based on the previously generated solutions and their corresponding improvements when the progress stalls. We demonstrate that AdaEvolve consistently outperforms the open-sourced baselines across 185 different open-ended optimization problems including combinatorial, systems optimization and algorithm design problems.
中文标题/摘要
标题:AdaEvolve:自适应大语言模型驱动的零阶优化
自动化程序生成的范式正从单次生成转向推理时搜索,其中大型语言模型(LLMs)作为语义变异操作符在进化循环中发挥作用。虽然有效,但这些系统目前由静态调度控制,未能考虑搜索过程中的非平稳动态。这种刚性导致了大量计算资源的浪费,因为资源被无差别地分配给停滞不前的群体,而有潜力的前沿则被忽视。我们提出了AdaEvolve框架,将LLM驱动的进化重新表述为分层自适应优化问题。AdaEvolve利用“累积改进信号”在三个层次上统一决策:局部自适应,动态调节候选解群体内的探索强度;全局自适应,通过基于多臂老虎机调度将全局资源预算分配到不同的候选解群体;元指导,基于先前生成的解及其改进情况,在进度停滞时生成新的解策略。我们证明了AdaEvolve在185个不同类型的开放优化问题中,包括组合优化、系统优化和算法设计问题上,始终优于开源基准。
Summary / 总结
AdaEvolve is a framework that reformulates LLM-driven evolution as a hierarchical adaptive optimization problem. It uses an 'accumulated improvement signal' to dynamically adjust exploration intensity, route global resources, and generate novel tactics. Experiments show that AdaEvolve outperforms open-sourced baselines across 185 open-ended optimization problems, including combinatorial, systems optimization, and algorithm design tasks.
AdaEvolve 是一个框架,将 LLM 驱动的进化重新表述为分层自适应优化问题,使用 '累积改进信号' 动态调整探索强度、分配全局资源并生成新策略。它在 185 个开放优化问题中始终优于开源基准。
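Bandit-based scheduling across candidate populations, as mentioned in the AdaEvolve entry above, can be sketched with UCB1, a standard bandit rule. The abstract does not say which bandit algorithm AdaEvolve uses, so UCB1 is a stand-in here, and the function name and arguments are hypothetical.

```python
import math


def ucb1_pick(counts, reward_sums, c=math.sqrt(2)):
    """Return the index of the population to spend the next step on.

    counts[i]      -- evaluations already spent on population i
    reward_sums[i] -- accumulated improvement observed for population i
    """
    if len(counts) != len(reward_sums) or not counts:
        raise ValueError("counts and reward_sums must be non-empty and equal length")
    # Try every population once before comparing estimates.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    total = sum(counts)
    scores = [
        reward_sums[i] / counts[i] + c * math.sqrt(math.log(total) / counts[i])
        for i in range(len(counts))
    ]
    # Highest mean-plus-exploration-bonus wins; ties go to the lowest index.
    return max(range(len(scores)), key=scores.__getitem__)
```

The exploration bonus shrinks as a population accumulates evaluations, so a stagnating population with a low mean improvement gradually stops receiving budget.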
LAD: Learning Advantage Distribution for Reasoning
Authors: Wendi Li, Sharon Li
First: 2026-02-23T18:44:10+00:00 · Latest: 2026-02-23T18:44:10+00:00
Abstract
Current reinforcement learning objectives for large-model reasoning primarily focus on maximizing expected rewards. This paradigm can lead to overfitting to dominant reward signals, while neglecting alternative yet valid reasoning trajectories, thereby limiting diversity and exploration. To address this issue, we introduce Learning Advantage Distributions (LAD), a distribution-matching framework that replaces advantage maximization with learning the advantage-induced distribution. By establishing the equivalence between the optimal policy update and an advantage-based target distribution, we derive a practical LAD objective formulated as minimizing an $f$-divergence between the policy-induced and advantage-induced distributions. This yields a gradient update that increases likelihood for high-advantage responses while suppressing over-confident probability growth, preventing collapse without requiring auxiliary entropy regularization. LAD incurs no extra training cost compared to GRPO and scales naturally to LLM post-training. In a controlled bandit setting, LAD faithfully recovers the multimodal advantage distribution, validating the theoretical formulation. Experiments on math and code reasoning tasks across several LLM backbones show that LAD reliably improves both accuracy and generative diversity.
中文标题/摘要
标题:LAD:推理中的学习优势分布
当前大规模模型推理的强化学习目标主要集中在最大化预期奖励。这种范式可能导致对主导奖励信号的过度拟合,而忽视了其他同样有效的推理路径,从而限制了多样性和探索。为了解决这一问题,我们引入了学习优势分布(LAD),这是一种分布匹配框架,用学习由优势引起的分布替代优势最大化。通过建立最优策略更新与基于优势的目标分布之间的等价性,我们推导出一个实用的LAD目标,该目标以最小化由策略诱导和优势诱导分布之间的$f$-散度的形式表示。这产生了一个梯度更新,增加了高优势响应的可能性,同时抑制了过度自信的概率增长,防止了崩溃,而无需额外的熵正则化。与GRPO相比,LAD没有额外的训练成本,并且自然地扩展到LLM后训练。在受控的多臂老虎机环境中,LAD准确地恢复了多模态优势分布,验证了理论形式。在多个LLM基础模型上的数学和代码推理任务中进行的实验表明,LAD能够可靠地提高准确性和生成多样性。
Summary / 总结
The research aims to enhance diversity and exploration in large-model reasoning by addressing overfitting to dominant reward signals in current reinforcement learning objectives. It introduces Learning Advantage Distributions (LAD), a framework that shifts from maximizing expected rewards to learning the advantage-induced distribution. In a controlled bandit setting, LAD recovers the multimodal advantage distribution, validating the theoretical formulation. Experiments on math and code reasoning tasks across several LLM backbones show that LAD improves both accuracy and generative diversity.
研究旨在通过解决对主导奖励信号的过度拟合问题,增强大型模型推理中的多样性和探索性。方法引入了学习优势分布(LAD),将重点从最大化预期奖励转移到学习优势分布。这种方法通过促进高优势响应并抑制过度自信,提高了数学和代码推理任务中的准确性和生成多样性。实验表明,LAD成功地恢复了多模态优势分布,并在准确性和多样性方面优于现有方法。
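The contrast between maximizing advantage and matching an advantage-induced distribution in the LAD entry above can be illustrated on a toy bandit. This sketch assumes the target distribution is a softmax of the advantages and uses KL divergence as one representative f-divergence; neither choice is stated in the abstract, and the function names are hypothetical.

```python
import math


def advantage_target(advantages, temperature=1.0):
    """Softmax of advantages: an assumed form of the advantage-induced target."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    top = max(advantages)
    exps = [math.exp((a - top) / temperature) for a in advantages]
    norm = sum(exps)
    return [e / norm for e in exps]


def kl_divergence(p, q):
    """KL(p || q), the f-divergence generated by f(t) = t * log(t)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

With advantages `[1.0, 1.0, -2.0]` the target keeps equal mass on both good responses. A policy collapsed onto one of them, for example `[0.98, 0.01, 0.01]`, has positive divergence from that target, so a distribution-matching objective pushes back against the collapse.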
To Reason or Not to: Selective Chain-of-Thought in Medical Question Answering
Authors: Zaifu Zhan, Min Zeng, Shuang Zhou, Yiran Song, Xiaoyi Chen, Yu Hou, Yifan Wu, Yang Ruan, Rui Zhang
First: 2026-02-23T18:42:50+00:00 · Latest: 2026-02-23T18:42:50+00:00
Abstract
Objective: To improve the efficiency of medical question answering (MedQA) with large language models (LLMs) by avoiding unnecessary reasoning while maintaining accuracy.
Methods: We propose Selective Chain-of-Thought (Selective CoT), an inference-time strategy that first predicts whether a question requires reasoning and generates a rationale only when needed. Two open-source LLMs (Llama-3.1-8B and Qwen-2.5-7B) were evaluated on four biomedical QA benchmarks: HeadQA, MedQA-USMLE, MedMCQA, and PubMedQA. Metrics included accuracy, total generated tokens, and inference time.
Results: Selective CoT reduced inference time by 13-45% and token usage by 8-47% with minimal accuracy loss (≤4%). In some model-task pairs, it achieved both higher accuracy and greater efficiency than standard CoT. Compared with fixed-length CoT, Selective CoT reached similar or superior accuracy at substantially lower computational cost.
Discussion: Selective CoT dynamically balances reasoning depth and efficiency by invoking explicit reasoning only when beneficial, reducing redundancy on recall-type questions while preserving interpretability.
Conclusion: Selective CoT provides a simple, model-agnostic, and cost-effective approach for medical QA, aligning reasoning effort with question complexity to enhance real-world deployability of LLM-based clinical systems.
中文标题/摘要
标题:是否需要推理:医学问答中的选择性链式思考
目标:通过避免不必要的推理来提高大型语言模型(LLM)在医学问答(MedQA)中的效率,同时保持准确性。
方法:我们提出了选择性链式思考(Selective CoT),这是一种推理时策略,首先预测问题是否需要推理,仅在需要时生成推理。在四个生物医学问答基准测试(HeadQA、MedQA-USMLE、MedMCQA、PubMedQA)上评估了两个开源LLM(Llama-3.1-8B和Qwen-2.5-7B)。评估指标包括准确率、生成的总令牌数和推理时间。
结果:选择性CoT将推理时间减少了13-45%,令牌使用量减少了8-47%,准确率损失不超过4%。在某些模型-任务配对中,它在准确性和效率上都优于标准CoT。与固定长度CoT相比,选择性CoT在显著降低计算成本的同时达到了相似或更高的准确率。
讨论:选择性CoT通过仅在有益时调用显式推理来动态平衡推理深度和效率,减少回忆型问题上的冗余,同时保持可解释性。
结论:选择性CoT提供了一种简单、模型无关且成本效益高的医学问答方法,将推理努力与问题复杂性对齐,以增强基于LLM的临床系统的实际部署能力。
Summary / 总结
The study aims to enhance the efficiency of medical question answering using large language models by employing Selective Chain-of-Thought (Selective CoT), which predicts whether reasoning is necessary and generates rationales only when needed. Evaluations on four biomedical QA benchmarks showed that Selective CoT reduced inference time by 13-45% and token usage by 8-47% with minimal accuracy loss. It also achieved higher accuracy and efficiency in some model-task pairs compared to standard CoT, and reached similar or superior accuracy at lower computational cost than fixed-length CoT.
研究旨在通过使用Selective Chain-of-Thought(Selective CoT)来提高大型语言模型在医学问答中的效率,该方法预测是否需要推理,并仅在必要时生成推理。这种方法将推理时间减少了13-45%,并将令牌使用量减少了8-47%,同时保持了最小的准确性损失。在某些模型-任务组合中,Selective CoT在准确性和效率上都超过了标准CoT,在较低的计算成本下实现了相似或更高的准确性。
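The routing step described in the Selective CoT entry above can be sketched as a small gate. The three callables are hypothetical stand-ins for LLM calls (a reasoning-need predictor, a direct answerer, and a chain-of-thought answerer); the abstract does not specify how the prediction is implemented.

```python
def selective_cot(question, needs_reasoning, answer_direct, answer_with_cot):
    """Route a question to direct answering or chain-of-thought.

    needs_reasoning(question) -> bool
    answer_direct(question)   -> answer
    answer_with_cot(question) -> (rationale, answer)
    All three are placeholders for model calls.
    """
    if needs_reasoning(question):
        rationale, answer = answer_with_cot(question)
        return {"answer": answer, "rationale": rationale, "used_cot": True}
    # Recall-type questions skip the rationale and save its tokens.
    return {"answer": answer_direct(question), "rationale": None, "used_cot": False}
```

The token and latency savings reported in the abstract would come from the second branch, where no rationale is generated at all.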
NanoKnow: How to Know What Your Language Model Knows
Authors: Lingwei Gu, Nour Jedidi, Jimmy Lin
First: 2026-02-23T18:37:49+00:00 · Latest: 2026-02-23T18:37:49+00:00
Abstract
How do large language models (LLMs) know what they know? Answering this question has been difficult because pre-training data is often a "black box" -- unknown or inaccessible. The recent release of nanochat -- a family of small LLMs with fully open pre-training data -- addresses this as it provides a transparent view into where a model's parametric knowledge comes from. Towards the goal of understanding how knowledge is encoded by LLMs, we release NanoKnow, a benchmark dataset that partitions questions from Natural Questions and SQuAD into splits based on whether their answers are present in nanochat's pre-training corpus. Using these splits, we can now properly disentangle the sources of knowledge that LLMs rely on when producing an output. To demonstrate NanoKnow's utility, we conduct experiments using eight nanochat checkpoints. Our findings show: (1) closed-book accuracy is strongly influenced by answer frequency in the pre-training data, (2) providing external evidence can mitigate this frequency dependence, (3) even with external evidence, models are more accurate when answers were seen during pre-training, demonstrating that parametric and external knowledge are complementary, and (4) non-relevant information is harmful, with accuracy decreasing based on both the position and the number of non-relevant contexts. We release all NanoKnow artifacts at https://github.com/castorini/NanoKnow.
中文标题/摘要
标题:NanoKnow:如何了解你的语言模型知道什么
大型语言模型(LLMs)是如何知道它们所知道的内容的?回答这个问题一直很困难,因为预训练数据通常是“黑箱”——未知或不可访问的。最近发布的nanochat——一系列具有完全开放预训练数据的小型LLMs——解决了这一问题,因为它提供了模型参数知识来源的透明视图。为了理解知识是如何被LLMs编码的,我们发布了NanoKnow基准数据集,该数据集根据答案是否出现在nanochat的预训练语料库中,将自然问题和SQuAD中的问题划分为不同的部分。利用这些划分,我们现在可以正确地解开LLMs在生成输出时依赖的知识来源。为了展示NanoKnow的实用性,我们使用八个nanochat检查点进行了实验。我们的发现表明:(1)闭卷准确率强烈受预训练数据中答案频率的影响;(2)提供外部证据可以减轻这种频率依赖性;(3)即使有外部证据,当答案在预训练期间被看到时,模型更准确,这表明参数知识和外部知识是互补的;(4)无关信息是有害的,准确性会根据无关上下文的位置和数量而降低。我们将在https://github.com/castorini/NanoKnow/发布所有NanoKnow的成果。
Summary / 总结
The study aims to understand how large language models (LLMs) acquire their knowledge by utilizing nanochat, a family of small LLMs with fully open pre-training data. The researchers created NanoKnow, a benchmark dataset that categorizes questions from Natural Questions and SQuAD based on whether their answers are in nanochat's pre-training corpus. Key findings include that closed-book accuracy is heavily influenced by answer frequency in the pre-training data, external evidence can reduce this dependence, and models are more accurate when answers are seen during pre-training, indicating the complementarity of parametric and external knowledge. Additionally, non-relevant information negatively impacts accuracy. All NanoKnow artifacts are available at https://github.com/castorini/NanoKnow.
研究旨在通过利用nanochat透明的预训练数据来理解大型语言模型(LLMs)是如何获取知识的。研究引入了NanoKnow基准数据集,该数据集根据答案是否出现在nanochat的预训练数据中来分类问题。实验使用八个nanochat检查点表明,闭卷准确度取决于预训练数据中的答案频率,外部证据可以减少这种依赖性。然而,当答案在预训练期间被看到时,模型更准确,表明参数知识和外部知识是互补的。此外,无关信息会降低准确度。所有NanoKnow的资源可在https://github.com/castorini/NanoKnow获取。
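The partitioning idea in the NanoKnow entry above, splitting questions by whether their answers occur in the pre-training corpus, can be sketched with a simple normalized substring match. The matching rule and function names are assumptions for illustration; the abstract does not describe the actual matching procedure.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(text.lower().split())


def split_by_corpus_presence(questions, corpus_docs):
    """Split (question, answer) pairs by whether the answer string occurs in any document."""
    docs = [normalize(d) for d in corpus_docs]
    seen, unseen = [], []
    for question, answer in questions:
        needle = normalize(answer)
        found = bool(needle) and any(needle in d for d in docs)
        (seen if found else unseen).append((question, answer))
    return seen, unseen
```

Comparing closed-book accuracy on the two resulting splits is the kind of analysis the abstract's findings (1) through (3) rely on.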
NovaPlan: Zero-Shot Long-Horizon Manipulation via Closed-Loop Video Language Planning
Authors: Jiahui Fu, Junyu Nan, Lingfeng Sun, Hongyu Li, Jianing Qian, Jennifer L. Barry, Kris Kitani, George Konidaris
First: 2026-02-23T18:35:18+00:00 · Latest: 2026-02-23T18:35:18+00:00
Comments: 25 pages, 15 figures. Project webpage: https://nova-plan.github.io/
Abstract
Solving long-horizon tasks requires robots to integrate high-level semantic reasoning with low-level physical interaction. While vision-language models (VLMs) and video generation models can decompose tasks and imagine outcomes, they often lack the physical grounding necessary for real-world execution. We introduce NovaPlan, a hierarchical framework that unifies closed-loop VLM and video planning with geometrically grounded robot execution for zero-shot long-horizon manipulation. At the high level, a VLM planner decomposes tasks into sub-goals and monitors robot execution in a closed loop, enabling the system to recover from single-step failures through autonomous re-planning. To compute low-level robot actions, we extract and utilize both task-relevant object keypoints and human hand poses as kinematic priors from the generated videos, and employ a switching mechanism to choose the better one as a reference for robot actions, maintaining stable execution even under heavy occlusion or depth inaccuracy. We demonstrate the effectiveness of NovaPlan on three long-horizon tasks and the Functional Manipulation Benchmark (FMB). Our results show that NovaPlan can perform complex assembly tasks and exhibit dexterous error recovery behaviors without any prior demonstrations or training. Project page: https://nova-plan.github.io/
中文标题/摘要
标题:NovaPlan:通过闭环视频语言规划实现零样本长时程操作
解决长时程任务需要机器人将高层次语义推理与低层次物理交互相结合。尽管视觉-语言模型(VLM)和视频生成模型可以分解任务并想象结果,但它们往往缺乏实现世界执行所需的物理基础。我们提出了NovaPlan,这是一种分层框架,将闭环VLM和视频规划与几何上接地的机器人执行统一起来,以实现零样本长时程操作。在高层次上,VLM规划器将任务分解为子目标,并在闭环中监控机器人执行,使系统能够通过自主重新规划从单步失败中恢复。为了计算低层次的机器人动作,我们从生成的视频中提取并利用与任务相关的对象关键点和人类手部姿态作为运动学先验,并采用切换机制选择更好的一个作为机器人动作的参考,即使在严重遮挡或深度不准确的情况下也能保持稳定的执行。我们在三个长时程任务和功能性操作基准(FMB)上展示了NovaPlan的有效性。我们的结果表明,NovaPlan可以在没有任何先验演示或训练的情况下执行复杂的装配任务并表现出灵巧的错误恢复行为。项目页面:https://nova-plan.github.io/
Summary / 总结
NovaPlan is a hierarchical framework that integrates closed-loop video language planning with geometrically grounded robot execution for zero-shot long-horizon manipulation. It decomposes tasks into sub-goals and monitors robot execution, allowing for autonomous re-planning in case of failures. The system uses task-relevant object keypoints and human hand poses as kinematic priors from generated videos to compute robot actions, ensuring stable execution even under occlusion or depth inaccuracy. NovaPlan demonstrates effectiveness in complex assembly tasks and error recovery without prior demonstrations or training.
NovaPlan 是一个层次框架,结合了视觉语言模型和视频规划与几何上接地的机器人执行,用于零样本长时程操作。它使用闭环 VLM 计划器分解任务并监控机器人执行,实现自主重新规划。对于低级动作,它从生成的视频中提取和利用与任务相关的物体关键点和人类手部姿态,并采用切换机制确保执行的稳定性。NovaPlan 在复杂装配任务和错误恢复方面展示了有效性,无需任何先验演示或训练。
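The closed-loop behavior described in the NovaPlan entry above (decompose, execute, monitor, re-plan on failure) can be sketched as a control loop. The callables `plan`, `execute`, and `succeeded` are hypothetical placeholders for the VLM planner, the robot controller, and the execution monitor; the retry limit is also an assumption.

```python
def run_closed_loop(task, plan, execute, succeeded, max_replans=3):
    """Execute sub-goals in order, re-planning when the monitor reports a failure.

    plan(task, failed_goal) -> list of remaining sub-goals
        (failed_goal is None for the initial plan)
    execute(goal)           -> runs one sub-goal on the robot
    succeeded(goal)         -> bool, the monitor's verdict
    """
    subgoals = list(plan(task, None))
    completed, replans = [], 0
    while subgoals:
        goal = subgoals.pop(0)
        execute(goal)
        if succeeded(goal):
            completed.append(goal)
            continue
        if replans >= max_replans:
            return {"done": False, "completed": completed}
        replans += 1
        # Ask the planner for a fresh list of remaining sub-goals.
        subgoals = list(plan(task, goal))
    return {"done": True, "completed": completed}
```

The video-based extraction of keypoints and hand poses, and the switching between them, would sit inside `execute` and are not modeled here.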
ReSyn: Autonomously Scaling Synthetic Environments for Reasoning Models
Authors: Andre He, Nathaniel Weir, Kaj Bostrom, Allen Nie, Darion Cassel, Sam Bayless, Huzefa Rangwala
First: 2026-02-23T18:34:29+00:00 · Latest: 2026-02-23T18:34:29+00:00
Abstract
Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising approach for training reasoning language models (RLMs) by leveraging supervision from verifiers. Although verifier implementation is easier than solution annotation for many tasks, existing synthetic data generation methods remain largely solution-centric, while verifier-based methods rely on a few hand-crafted procedural environments. In this work, we scale RLVR by introducing ReSyn, a pipeline that generates diverse reasoning environments equipped with instance generators and verifiers, covering tasks such as constraint satisfaction, algorithmic puzzles, and spatial reasoning. A Qwen2.5-7B-Instruct model trained with RL on ReSyn data achieves consistent gains across reasoning benchmarks and out-of-domain math benchmarks, including a 27% relative improvement on the challenging BBEH benchmark. Ablations show that verifier-based supervision and increased task diversity both contribute significantly, providing empirical evidence that generating reasoning environments at scale can enhance reasoning abilities in RLMs.
中文标题/摘要
标题:ReSyn:自主扩展合成环境以支持推理模型
可验证奖励的强化学习(RLVR)已成为通过验证者提供的监督训练推理语言模型(RLMs)的一种有前途的方法。尽管验证者实现比解决方案注解更容易,但现有的合成数据生成方法仍主要以解决方案为中心,而基于验证者的方法则依赖于少数手工构建的程序化环境。在本工作中,我们通过引入ReSyn,一种生成多样化推理环境的流水线,扩展了RLVR,该流水线配备了实例生成器和验证者,涵盖了诸如约束满足、算法谜题和空间推理等任务。使用RL在ReSyn数据上训练的Qwen2.5-7B-Instruct模型在推理基准和跨域数学基准上均取得了持续的改进,包括在具有挑战性的BBEH基准上相对提高了27%。消融实验表明,基于验证者的监督和任务多样性的增加都做出了显著贡献,提供了生成大规模推理环境可以增强RLMs推理能力的实证证据。
Summary / 总结
The paper introduces ReSyn, a pipeline for generating diverse reasoning environments with instance generators and verifiers to scale reinforcement learning with verifiable rewards (RLVR) for training reasoning language models (RLMs). A Qwen2.5-7B-Instruct model trained with RL on ReSyn data shows consistent improvements across various reasoning benchmarks and out-of-domain math benchmarks, with a 27% relative improvement on the BBEH benchmark. Ablations indicate that verifier-based supervision and increased task diversity are crucial for enhancing reasoning abilities in RLMs.
研究动机是通过生成多样化的推理环境来改进带有验证奖励的强化学习(RLVR),以训练推理语言模型(RLMs)。主要方法是创建ReSyn管道,其中包括实例生成器和验证器,以涵盖各种任务。关键实验发现表明,使用ReSyn数据进行RL训练的Qwen2.5-7B-Instruct模型在推理基准和跨域数学基准上取得了持续的改进,BBEH基准的相对改进达到27%。消融实验表明,基于验证器的监督和增加的任务多样性对于增强RLMs的推理能力至关重要。
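The "instance generator plus verifier" structure in the ReSyn entry above can be illustrated with a toy subset-sum environment. This is a hand-written sketch, not an environment from the paper; the function names and instance format are hypothetical.

```python
import random


def generate_instance(rng, n=5):
    """Instance generator: draw numbers and a target that some subset reaches."""
    numbers = [rng.randint(1, 20) for _ in range(n)]
    k = rng.randint(1, n)
    chosen = rng.sample(range(n), k)
    target = sum(numbers[i] for i in chosen)
    return {"numbers": numbers, "target": target}


def verify(instance, answer_indices):
    """Verifier: reward 1.0 iff distinct, valid indices select numbers summing to the target."""
    idx = list(answer_indices)
    if len(set(idx)) != len(idx):
        return 0.0
    size = len(instance["numbers"])
    if any(not isinstance(i, int) or i < 0 or i >= size for i in idx):
        return 0.0
    total = sum(instance["numbers"][i] for i in idx)
    return 1.0 if total == instance["target"] else 0.0
```

Checking an answer takes one pass over the proposed indices, while finding one is a subset-sum search. That gap is an instance of the abstract's remark that verifiers are often easier to implement than solution annotations.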
Benchmarking Unlearning for Vision Transformers
Authors: Kairan Zhao, Iurie Luca, Peter Triantafillou
First: 2026-02-23T18:33:16+00:00 · Latest: 2026-02-23T18:33:16+00:00
Abstract
Research in machine unlearning (MU) has gained strong momentum: MU is now widely regarded as a critical capability for building safe and fair AI. In parallel, research into transformer architectures for computer vision tasks has been highly successful: increasingly, Vision Transformers (VTs) emerge as strong alternatives to CNNs. Yet, MU research for vision tasks has largely centered on CNNs, not VTs. While MU benchmarking efforts have addressed LLMs, diffusion models, and CNNs, none exist for VTs. This work is the first to attempt this, benchmarking MU algorithm performance in different VT families (ViT and Swin-T) and at different capacities. The work employs (i) different datasets, selected to assess the impacts of dataset scale and complexity; (ii) different MU algorithms, selected to represent fundamentally different approaches for MU; and (iii) both single-shot and continual unlearning protocols. Additionally, it focuses on benchmarking MU algorithms that leverage training data memorization, since leveraging memorization has recently been found to significantly improve the performance of previously SOTA algorithms. En route, the work characterizes how VTs memorize training data relative to CNNs, and assesses the impact of different memorization proxies on performance. The benchmark uses unified evaluation metrics that capture two complementary notions of forget quality along with accuracy on unseen (test) data and on retained data. Overall, this work offers a benchmarking basis, enabling reproducible, fair, and comprehensive comparisons of existing (and future) MU algorithms on VTs. For the first time, it also sheds light on how well existing algorithms work in VT settings, establishing a promising reference performance baseline.
中文标题/摘要
标题:视觉变换器的遗忘基准测试
机器遗忘(MU)研究已获得强劲动力:MU现被广泛认为是构建安全和公平AI的关键能力。同时,针对计算机视觉任务的变换器架构研究也非常成功:视觉变换器(VTs)逐渐成为CNNs的强大替代品。然而,视觉任务的MU研究主要集中在CNNs上,而不是VTs。虽然MU基准测试已涵盖LLMs、扩散模型和CNNs,但尚无针对VTs的基准测试。这项工作是首次尝试这一领域,对不同VT家族(ViT和Swin-T)及其不同容量下的MU算法性能进行了基准测试。该工作采用了(i) 不同的数据集,以评估数据集规模和复杂性的影响;(ii) 不同的MU算法,以代表MU的完全不同方法;(iii) 单次遗忘和持续遗忘协议。此外,它还关注了利用训练数据记忆的MU算法基准测试,因为利用记忆已被发现能显著提高之前SOTA算法的性能。在这一过程中,该工作描述了VTs相对于CNNs如何记忆训练数据,并评估了不同记忆代理对性能的影响。基准测试使用统一的评估指标,这些指标捕捉了遗忘质量的两个互补概念,以及在未见过(测试)数据和保留数据上的准确性。总体而言,这项工作提供了一个基准测试基础,使人们能够对现有(和未来的)VTs上的MU算法进行可重复、公平和全面的比较。并且,首次揭示了现有算法在VT设置中的表现,建立了有希望的参考性能基准。
Summary / 总结
This study benchmarks machine unlearning (MU) for Vision Transformers (VTs), addressing a gap in MU research for vision tasks. It evaluates different MU algorithms on the ViT and Swin-T families across various datasets and both single-shot and continual unlearning protocols, focusing on algorithms that leverage training data memorization. The work also characterizes how VTs memorize training data relative to CNNs and assesses the impact of different memorization proxies on MU performance. It uses unified evaluation metrics to assess both forget quality and accuracy on unseen and retained data, providing a reproducible and fair comparison basis for MU algorithms on VTs.
这项研究对Vision Transformers (VTs)进行了机器遗忘(MU)基准测试,填补了现有文献中主要集中在CNNs上的空白。研究使用不同的VT家族(ViT和Swin-T)、容量、数据集和MU算法来评估。关键发现包括VTs与CNNs相比如何记忆训练数据的表征,以及不同记忆代理对性能的影响。该研究引入了统一的评估指标来评估遗忘质量和未见数据和保留数据上的准确性,为VTs上的MU算法提供了可重复、公平和全面的比较基础。
VillageNet: Graph-based, Easily-interpretable, Unsupervised Clustering for Broad Biomedical Applications
Authors: Aditya Ballal, Gregory A. DePaul, Esha Datta, Asuka Hatano, Erik Carlsson, Ye Chen-Izu, Javier E. López, Leighton T. Izu
First: 2025-01-16T06:56:43+00:00 · Latest: 2026-02-23T18:26:51+00:00
Comments: Software available at https://villagenet.streamlit.app/ ; GitHub link: https://github.com/lordareicgnon/VillageNet
Abstract
Clustering large high-dimensional datasets with diverse variables is essential for extracting high-level latent information from these datasets. Here, we developed an unsupervised clustering algorithm that we call "Village-Net". Village-Net is specifically designed to effectively cluster high-dimensional data without prior knowledge of the number of existing clusters. The algorithm operates in two phases: first, utilizing K-Means clustering, it divides the dataset into distinct subsets we refer to as "villages". Next, a weighted network is created, with each node representing a village, capturing their proximity relationships. To achieve optimal clustering, we process this network using Walk-likelihood Community Finder (WLCF), a community detection algorithm developed by one of our team members. A salient feature of Village-Net Clustering is its ability to autonomously determine an optimal number of clusters for further analysis based on inherent characteristics of the data. We present extensive benchmarking on extant real-world datasets with known ground-truth labels to showcase its competitive performance, particularly in terms of the normalized mutual information (NMI) score, when compared to other state-of-the-art methods. The algorithm is computationally efficient, boasting a time complexity of O(N*k*d), where N signifies the number of instances, k represents the number of villages and d represents the dimension of the dataset, which makes it well suited for effectively handling large-scale datasets.
中文标题/摘要
标题:VillageNet:基于图的、易于解释的无监督聚类及其在广泛生物医学应用中的应用
对具有多种变量的大规模高维数据集进行聚类,对于从这些数据集中提取高层次的潜在信息至关重要。在此,我们开发了一种无监督聚类算法,称为“Village-Net”。Village-Net 特别设计用于在没有先验知识的情况下有效聚类高维数据。该算法分为两个阶段:首先,使用 K-Means 聚类将数据集划分为我们称之为“村庄”的不同子集。接下来,创建一个加权网络,每个节点代表一个村庄,捕捉它们之间的接近关系。为了实现最佳聚类,我们使用由我们团队成员开发的一种社区检测算法 Walk-likelihood Community Finder (WLCF) 处理此网络。Village-Net 聚类的一个显著特点是,它能够根据数据的内在特性自主确定进一步分析的最佳聚类数量。我们通过基准测试展示了其在已知真实标签的现有实际数据集上的竞争力,特别是在与最先进的方法相比时,其归一化互信息 (NMI) 分数表现尤为突出。该算法计算效率高,时间复杂度为 O(N*k*d),其中 N 表示实例数量,k 表示村庄数量,d 表示数据集的维度,使其非常适合处理大规模数据集。
Summary / 总结
Village-Net is an unsupervised clustering algorithm designed to effectively cluster high-dimensional datasets without prior knowledge of the number of clusters. It operates in two phases: first, K-Means clustering divides the dataset into subsets called 'villages', and then a weighted network is created to capture the proximity relationships between these villages. The network is further processed using the Walk-likelihood Community Finder (WLCF) algorithm to achieve optimal clustering. Extensive benchmarking on real-world datasets with known ground-truth labels shows that Village-Net performs competitively with other state-of-the-art methods, particularly in terms of normalized mutual information (NMI) scores. The algorithm is computationally efficient with a time complexity of O(N*k*d).
Village-Net 是一种无需预先知道聚类数量的无监督聚类算法,用于处理高维数据集。该算法分为两步:首先使用 K-Means 聚类将数据集划分为“村庄”,然后创建一个加权网络以捕捉村庄之间的接近关系。该网络使用 Walk-likelihood Community Finder (WLCF) 算法进行处理以实现最优聚类。在真实世界数据集上的广泛基准测试表明,Village-Net 在归一化互信息 (NMI) 指标上优于其他最先进的方法。该算法具有计算效率,时间复杂度为 O(N*k*d)。
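The two phases described above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the edge-weighting rule (counting points whose two nearest villages are a given pair) and the `min_weight` parameter are hypothetical, and plain connected components stand in for WLCF, which is not reproduced here.

```python
import math

def nearest_two(point, centers):
    """Indices of the closest and second-closest centers (ties keep index order)."""
    order = sorted(range(len(centers)), key=lambda j: math.dist(point, centers[j]))
    return order[0], order[1]

def kmeans(points, k, iters=20):
    """Lloyd's K-Means with a deterministic init: k evenly spaced input points."""
    step = len(points) // k
    centers = [points[i * step] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest_two(p, centers)[0]].append(p)
        # Recompute each center as the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return centers

def village_graph(points, centers):
    """ASSUMED weighting: count the points whose two nearest villages are (a, b)."""
    weights = {}
    for p in points:
        edge = tuple(sorted(nearest_two(p, centers)))
        weights[edge] = weights.get(edge, 0) + 1
    return weights

def communities(k, weights, min_weight=1):
    """Stand-in for WLCF (an assumption): connected components over heavy edges."""
    parent = list(range(k))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for (a, b), w in weights.items():
        if w >= min_weight:
            parent[find(a)] = find(b)
    return [find(v) for v in range(k)]

def village_net(points, k, min_weight=1):
    """Phase 1: villages via K-Means. Phase 2: merge villages into clusters."""
    centers = kmeans(points, k)
    village_cluster = communities(k, village_graph(points, centers), min_weight)
    return [village_cluster[nearest_two(p, centers)[0]] for p in points]

# Two well-separated 4x4 grids of points, split into k=4 villages.
blob_a = [(float(x), float(y)) for x in range(4) for y in range(4)]
blob_b = [(100.0 + x, float(y)) for x in range(4) for y in range(4)]
labels = village_net(blob_a + blob_b, k=4)
```

In this toy run the number of final clusters is not passed in; it falls out of how the villages connect, which is the property the abstract highlights. The dominant cost is the point-to-village distance computation, in line with the stated O(N*k*d) complexity.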
AbstRaL: Augmenting LLMs' Reasoning by Reinforcing Abstract Thinking
Authors: Silin Gao, Antoine Bosselut, Samy Bengio, Emmanuel Abbe
Venue: ICLR 2026
First: 2025-06-09T13:34:50+00:00 · Latest: 2026-02-23T18:25:13+00:00
Comments: ICLR 2026
Abstract
Recent studies have shown that large language models (LLMs), especially smaller ones, often lack robustness in grade school math (GSM) reasoning. In particular, they tend to experience performance drops when faced with distribution shifts, such as changes to numerical or nominal variables, or insertions of distracting clauses. A possible strategy to address this involves generating synthetic data to further "instantiate" reasoning problems on potential variations. In this work, we instead focus on the strategy of "abstracting" reasoning problems. This not only helps counteract distribution shifts but also facilitates the connection to symbolic tools for deriving solutions. Focusing on GSM, we find that this abstraction process is better acquired through reinforcement learning (RL) than just supervised fine-tuning, which often fails to produce faithful abstractions. Our method, AbstRaL -- which promotes abstract reasoning in LLMs using RL on granular abstraction data -- significantly mitigates performance degradation on recent GSM perturbation benchmarks. Besides, improving GSM robustness via AbstRaL is shown to also implicitly benefit LLMs' capabilities on OOD mathematical and general reasoning tasks, indicating that abstract thinking broadly enables better generalizability.
中文标题/摘要
标题:AbstRaL:通过强化抽象思维增强LLMs的推理能力
近期研究表明,大型语言模型(LLMs),尤其是较小的模型,在小学数学(GSM)推理方面往往缺乏稳健性。特别是在面对分布变化时,如数值或名义变量的变化,或插入分散性从句,它们的表现往往会下降。一种可能的策略是生成合成数据,进一步“实例化”推理问题的潜在变化。在本文中,我们反而关注“抽象化”推理问题的策略。这不仅有助于抵消分布变化,还促进了与符号工具的连接,以推导解决方案。聚焦于GSM,我们发现这一抽象过程通过强化学习(RL)比单纯的监督微调更容易获得,后者往往无法产生忠实的抽象。我们的方法AbstRaL——通过RL在粒度抽象数据上促进LLMs的抽象推理——显著减轻了在最近的GSM扰动基准上的性能下降。此外,通过AbstRaL提高GSM稳健性也被证明会隐式地提升LLMs在OOD数学和一般推理任务上的能力,表明抽象思维广泛地促进了更好的泛化。
Summary / 总结
This study addresses the robustness issues of large language models (LLMs) in grade school math reasoning, particularly their performance drops under distribution shifts. Instead of generating synthetic data, the authors propose an abstraction strategy enhanced by reinforcement learning (RL) to improve LLMs' reasoning capabilities. The method, AbstRaL, significantly reduces performance degradation on GSM perturbation benchmarks and enhances LLMs' general reasoning abilities, suggesting that abstract thinking broadly improves generalizability.
该研究针对大型语言模型(LLMs)在小学数学推理中的鲁棒性问题,特别是其在分布变化下的性能下降。作者提出了一种抽象策略,使用强化学习(RL)来增强LLMs的抽象思维能力。方法AbstRaL显著提高了小学数学推理的鲁棒性,并且也提升了LLMs在分布外(OOD)数学和一般推理任务的能力。
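To make the "abstracting" idea concrete, the sketch below replaces the numbers in a word problem with placeholders and solves it through a symbolic rule. It is a hand-written toy: the placeholder names (`[in0]`, `[in1]`, ...) and the formula are assumptions for illustration, whereas in AbstRaL the abstraction is produced by the LLM itself.

```python
import re

def abstract_problem(text):
    """Replace each integer with a placeholder and return (template, values)."""
    values = []

    def to_placeholder(match):
        values.append(int(match.group()))
        return f"[in{len(values) - 1}]"

    return re.sub(r"\d+", to_placeholder, text), values

def solve(values):
    """Hypothetical abstract answer for this template: in0 * in1 - in2."""
    return values[0] * values[1] - values[2]

original = "Sam buys 3 boxes of 12 pencils and gives away 5."
perturbed = "Sam buys 7 boxes of 20 pencils and gives away 9."

template_a, values_a = abstract_problem(original)
template_b, values_b = abstract_problem(perturbed)

# Both surface forms map to the same abstract problem, so one rule solves both.
answer_a = solve(values_a)
answer_b = solve(values_b)
```

Because the numerical perturbation leaves the template unchanged, the same symbolic rule gives the correct answer for both variants, which is the kind of robustness to distribution shift the abstract describes.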
EBPO: Empirical Bayes Shrinkage for Stabilizing Group-Relative Policy Optimization
Authors: Kevin Han, Yuhang Zhou, Mingze Gao, Gedi Zhou, Serena Li, Abhishek Kumar, Xiangjun Fan, Weiwei Li, Lizhu Zhang
First: 2026-02-05T00:33:02+00:00 · Latest: 2026-02-23T18:23:57+00:00
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for enhancing the reasoning capabilities of Large Language Models (LLMs). However, dominant approaches like Group Relative Policy Optimization (GRPO) face critical stability challenges: they suffer from high estimator variance under computational constraints (small group sizes) and vanishing gradient signals in saturated failure regimes where all responses yield identical zero rewards. To address this, we propose Empirical Bayes Policy Optimization (EBPO), a novel framework that regularizes local group-based baselines by borrowing strength from the policy's accumulated global statistics. Instead of estimating baselines in isolation, EBPO employs a shrinkage estimator that dynamically balances local group statistics with a global prior updated via Welford's online algorithm. Theoretically, we demonstrate that EBPO guarantees strictly lower Mean Squared Error (MSE), bounded entropy decay, and non-vanishing penalty signals in failure scenarios compared to GRPO. Empirically, EBPO consistently outperforms GRPO and other established baselines across diverse benchmarks, including AIME and OlympiadBench. Notably, EBPO exhibits superior training stability, achieving high-performance gains even with small group sizes, and benefits significantly from difficulty-stratified curriculum learning.
中文标题/摘要
标题:EBPO:经验贝叶斯收缩以稳定组相对策略优化
可验证奖励的强化学习(RLVR)已被证明能够增强大型语言模型(LLMs)的推理能力。然而,主流方法如组相对策略优化(GRPO)面临严重的稳定性挑战:在计算约束条件下(小组规模较小)它们遭受高估计方差问题,并且在所有响应均产生相同零奖励的饱和失败状态下,梯度信号消失。为解决这一问题,我们提出了一种新的经验贝叶斯策略优化(EBPO)框架,该框架通过借用策略累积的全局统计信息来正则化局部组基线。EBPO 不是孤立地估计基线,而是使用一个动态平衡局部组统计信息与通过 Welford 在线算法更新的全局先验的收缩估计器。理论上,我们证明了与 GRPO 相比,EBPO 严格具有更低的均方误差(MSE)、有界熵衰减和在失败场景中非消失的惩罚信号。实验上,EBPO 在包括 AIME 和 OlympiadBench 在内的多种基准测试中均优于 GRPO 和其他现有基准,表现出更优的训练稳定性,即使在小组规模较小的情况下也能实现高性能提升,并且从难度分层的课程学习中获益显著。
Summary / 总结
EBPO is a novel framework that addresses the stability challenges in Group Relative Policy Optimization (GRPO) by employing a shrinkage estimator that borrows strength from global statistics. Theoretically, EBPO ensures lower Mean Squared Error and non-vanishing penalty signals in failure scenarios. Empirically, EBPO outperforms GRPO and other baselines across various benchmarks, demonstrating superior training stability even with small group sizes.
论文提出了一种新的方法Empirical Bayes Policy Optimization (EBPO),以解决Group Relative Policy Optimization (GRPO)的稳定性问题。EBPO通过结合局部组统计和全局统计来减少方差,并确保在失败场景中梯度信号不消失。实验表明,EBPO在各种基准测试中优于GRPO和其他基线,展示了即使在小组规模下也有更好的训练稳定性和性能提升。
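The combination of a running global prior and a shrunk group baseline can be sketched as follows. Welford's online algorithm is standard; the shrinkage weight `n / (n + kappa)` and the `kappa` parameter are assumptions chosen for illustration, since the abstract does not state the exact estimator EBPO uses.

```python
class Welford:
    """Running mean and variance via Welford's online algorithm."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0

def shrunk_advantages(rewards, prior, kappa=4.0):
    """Advantages against a baseline shrunk toward the global mean.

    ASSUMPTION: weight n / (n + kappa) on the local group mean; kappa is a
    hypothetical prior-strength parameter, not a value from the paper.
    """
    n = len(rewards)
    local_mean = sum(rewards) / n
    weight = n / (n + kappa)
    baseline = weight * local_mean + (1.0 - weight) * prior.mean
    return [r - baseline for r in rewards]

# Global statistics accumulated from earlier rollouts.
prior = Welford()
for reward in [1.0, 0.0, 1.0, 0.0]:
    prior.update(reward)

# A saturated failure group: every response gets zero reward.
failed_group = [0.0, 0.0, 0.0, 0.0]
advantages = shrunk_advantages(failed_group, prior)
```

With a purely local mean baseline, the all-zero group would yield zero advantages and therefore no gradient. Pulling the baseline toward the global mean gives each failed response a negative advantage, which illustrates the non-vanishing penalty signal described above.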
The Illusion of Human AI Parity Under Uncertainty: Navigating Elusive Ground Truth via a Probabilistic Paradigm
Authors: Aparna Elangovan, Lei Xu, Mahsa Elyasi, Ismail Akdulum, Mehmet Aksakal, Enes Gurun, Brian Hur, Saab Mansour, Ravid Shwartz Ziv, Karin Verspoor, Dan Roth
First: 2026-01-09T03:19:37+00:00 · Latest: 2026-02-23T18:16:48+00:00
Abstract
Benchmarking the relative capabilities of AI systems, including Large Language Models (LLMs) and Vision Models, typically ignores the impact of uncertainty in the underlying ground truth answers from experts. This ambiguity is not limited to human preferences; it is also consequential in safety-critical domains such as medicine, where uncertainty is pervasive. In this paper, we introduce a probabilistic paradigm to explain theoretically why high certainty in ground truth answers is almost always necessary for even an expert to achieve high scores, whereas in datasets with high variation in ground truth answers there may be little difference between a random labeller and an expert. Therefore, ignoring uncertainty in ground truth evaluation data can result in the misleading conclusion that a non-expert has similar performance to that of an expert. Using the probabilistic paradigm, we thus bring forth the concepts of expected accuracy and expected F1 to estimate the score an expert human or system can achieve given ground truth answer variability. Our work leads to the recommendation that when establishing the capability of a system, results should be stratified by probability of the ground truth answer, typically measured by the agreement rate of ground truth experts. Stratification becomes critical when the overall performance drops below a threshold of 80%. Under stratified evaluation, performance comparison becomes more reliable in high certainty bins, mitigating the effect of the key confounding factor -- uncertainty.
中文标题/摘要
标题:不确定性下的人类与AI对等幻象:通过概率范式应对难以捉摸的真实标签
在基准测试AI系统的相对能力时,包括大型语言模型(LLMs)和视觉模型,通常会忽略底层专家答案不确定性的影响。这种模糊不仅限于人类偏好,甚至在医学等安全关键领域也普遍存在不确定性。在这些领域中,不确定性是普遍存在的。在本文中,我们引入了概率范式来理论解释:即使对于专家来说,高确定性的底层答案几乎总是必要的,而在具有高底层答案变异性数据集上,随机标注者和专家之间的差异可能很小。因此,在忽略底层答案评估数据中的不确定性时,可能会得出误导性的结论,即非专家的表现与专家相似。利用概率范式,我们提出了预期准确率和预期F1的概念,以估计给定底层答案变异性时专家人类或系统的得分。我们的工作导致了这样的建议:在确定系统的能力时,结果应按底层答案概率分层,通常通过地面真相专家的一致率来衡量。当整体性能低于80%的阈值时,分层评估变得至关重要。在分层评估下,高确定性区间内的性能比较更加可靠,减轻了关键混杂因素——不确定性的影响。
Summary / 总结
This paper addresses the issue of ignoring uncertainty in ground truth answers when benchmarking AI systems, particularly in safety-critical domains. It introduces a probabilistic paradigm to explain how high certainty in ground truth answers is crucial for achieving high scores, while in datasets with high variation, there may be little difference between experts and random labelers. The study proposes using expected accuracy and expected F1 to estimate expert performance given ground truth variability and recommends stratifying results by the probability of the ground truth answer, especially when overall performance drops below 80%. This approach enhances the reliability of performance comparisons by mitigating the effect of uncertainty.
本文探讨了在评估AI系统时忽略地面真实答案不确定性所带来的误导性结论问题。它引入了一个概率范式来解释为什么在地面真实答案高度确定的情况下,才能获得高分数,而在高变异性数据集下,随机标注者和专家之间的差异可能很小。作者提出使用预期准确率和预期F1来估计给定地面真实答案变异性下的得分,并建议根据地面真实答案的一致率对结果进行分层,特别是在整体性能低于80%的阈值时。这种方法在高确定性区间内增强了性能比较的可靠性,减轻了关键混杂因素——不确定性的影响。
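A small worked example shows why low ground-truth certainty hides the difference between an expert and a random labeller. It uses a simplified binary setting that is an assumption of this sketch, not the paper's exact definition of expected accuracy: the reference label equals the majority answer with probability `p_truth`, the expert always gives the majority answer, and the random labeller picks it half the time.

```python
def expected_accuracy(p_truth, p_answer):
    """Expected agreement with a reference label in a binary task.

    p_truth:  probability that the reference label is the majority answer
              (for example, the expert agreement rate on that item).
    p_answer: probability that the evaluated labeller gives the majority answer.
    """
    return p_answer * p_truth + (1.0 - p_answer) * (1.0 - p_truth)

# Stratify by ground-truth certainty and compare an expert with a random labeller.
gaps = {}
for p_truth in (0.5, 0.7, 0.95):
    expert = expected_accuracy(p_truth, p_answer=1.0)
    random_labeller = expected_accuracy(p_truth, p_answer=0.5)
    gaps[p_truth] = expert - random_labeller
```

Under these assumptions the expert's expected accuracy equals the agreement rate itself, so in the 0.5 bin the gap to a random labeller vanishes, while in the 0.95 bin it is large. This is the reasoning behind reporting results per certainty bin rather than as one pooled number.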
Transcending the Annotation Bottleneck: AI-Powered Discovery in Biology and Medicine
Authors: Soumick Chatterjee
Venue: Artificial Intelligence for Biomedical Data, AIBIO 2025, CCIS 2696, pp 243-248, 2026
First: 2026-02-23T18:15:30+00:00 · Latest: 2026-02-23T18:15:30+00:00
Abstract
The dependence on expert annotation has long constituted the primary rate-limiting step in the application of artificial intelligence to biomedicine. While supervised learning drove the initial wave of clinical algorithms, a paradigm shift towards unsupervised and self-supervised learning (SSL) is currently unlocking the latent potential of biobank-scale datasets. By learning directly from the intrinsic structure of data - whether pixels in a magnetic resonance image (MRI), voxels in a volumetric scan, or tokens in a genomic sequence - these methods facilitate the discovery of novel phenotypes, the linkage of morphology to genetics, and the detection of anomalies without human bias. This article synthesises seminal and recent advances in "learning without labels," highlighting how unsupervised frameworks can derive heritable cardiac traits, predict spatial gene expression in histology, and detect pathologies with performance that rivals or exceeds supervised counterparts.
中文标题/摘要
标题:超越注释瓶颈:生物医学领域的AI驱动发现
对专家标注的依赖长期以来一直是将人工智能应用于生物医学领域的主要限速环节。虽然监督学习推动了临床算法的初期发展,但目前正朝着无监督学习和自监督学习(SSL)的范式转变,这正在释放生物银行规模数据集的潜在价值。通过直接从数据的内在结构中学习(无论是磁共振成像(MRI)中的像素、体积扫描中的体素还是基因组序列中的标记),这些方法促进了新型表型的发现、形态与遗传的关联以及无人类偏见的异常检测。本文综述了“无标签学习”的开创性工作和最新进展,强调了无监督框架如何推导可遗传的心脏特征、预测组织学中的空间基因表达以及检测病变,其性能可媲美甚至超越监督方法。
Summary / 总结
The research addresses the challenge of relying on expert annotation in AI applications for biomedicine, focusing on the shift towards unsupervised and self-supervised learning methods. These methods learn directly from the data's intrinsic structure, such as MRI images or genomic sequences, to discover new phenotypes, link morphology to genetics, and detect anomalies without human bias. Key findings include the ability to derive heritable cardiac traits, predict spatial gene expression in histology, and detect pathologies with performance comparable to or better than supervised methods.
研究解决了生物医学中依赖专家标注的问题,重点关注向无监督和自监督学习方法的转变。这些方法直接从数据的内在结构中学习,如MRI图像或基因序列,以发现新的表型、将形态学与遗传学联系起来,并检测异常,而无需人类偏见。主要发现包括能够推导出可遗传的心脏特征、预测组织学中的空间基因表达以及检测病变,其性能与监督方法相当或更优。
FineVAU: A Novel Human-Aligned Benchmark for Fine-Grained Video Anomaly Understanding
Authors: João Pereira, Vasco Lopes, João Neves, David Semedo
Venue: AAAI 2026
First: 2026-01-24T02:17:07+00:00 · Latest: 2026-02-23T18:12:49+00:00
Comments: Accepted at AAAI 2026
Abstract
Video Anomaly Understanding (VAU) is a novel task focused on describing unusual occurrences in videos. Despite growing interest, the evaluation of VAU remains an open challenge. Existing benchmarks rely on n-gram-based metrics (e.g., BLEU, ROUGE-L) or LLM-based evaluation. The former fails to capture the rich, free-form, and visually grounded nature of LVLM responses, while the latter focuses on assessing language quality over factual relevance, often resulting in subjective judgments that are misaligned with human perception. In this work, we address this issue by proposing FineVAU, a new benchmark for VAU that shifts the focus towards rich, fine-grained and domain-specific understanding of anomalous videos. We formulate VAU as a three-fold problem, with the goal of comprehensively understanding key descriptive elements of anomalies in video: events (What), participating entities (Who) and location (Where). Our benchmark introduces a) FVScore, a novel, human-aligned evaluation metric that assesses the presence of critical visual elements in LVLM answers, providing interpretable, fine-grained feedback; and b) FineW3, a novel, comprehensive dataset curated through a structured and fully automatic procedure that augments existing human annotations with high-quality, fine-grained visual information. Human evaluation reveals that our proposed metric aligns better with human perception of anomalies than current approaches. Detailed experiments on FineVAU unveil critical limitations in LVLMs' ability to perceive anomalous events that require spatial and fine-grained temporal understanding, despite strong performance on coarse-grained, static information and events with strong visual cues.
中文标题/摘要
标题:FineVAU:一种新的面向细粒度视频异常理解的人类对齐基准
视频异常理解(VAU)是一个专注于描述视频中不寻常事件的新任务。尽管兴趣日益增长,但VAU的评估仍然是一个开放的挑战。现有的基准依赖于基于n-gram的度量标准(例如,BLEU,ROUGE-L)或基于LLM的评估。前者无法捕捉到LVLM响应的丰富、自由形式和视觉基础的特性,而后者则侧重于评估语言质量而非事实相关性,往往导致主观判断与人类感知不一致。在本文中,我们通过提出FineVAU,一种新的VAU基准,解决了这一问题,该基准将重点转向对异常视频的丰富、细粒度和领域特定的理解。我们将VAU表述为一个三重问题,目标是全面理解视频中异常事件的关键描述元素:事件(What)、参与者(Who)和位置(Where)。我们的基准引入了a) FVScore,一种新颖的人类对齐评估指标,评估LVLM答案中关键视觉元素的出现情况,提供可解释的细粒度反馈;以及b) FineW3,一种通过结构化和全自动程序编纂的新颖、全面的数据集,该数据集通过高质量的细粒度视觉信息增强了现有的人类注释。人类评估表明,我们提出的方法在与当前方法相比在异常感知方面与人类感知的对齐程度更高。对FineVAU的详细实验揭示了LVLM在感知需要空间和细粒度时间理解的异常事件方面的关键局限性,尽管在粗粒度、静态信息和具有强烈视觉线索的事件上表现出色。
Summary / 总结
FineVAU is a new benchmark for Video Anomaly Understanding (VAU) that focuses on rich, fine-grained, and domain-specific understanding of anomalies. It introduces FVScore, a human-aligned evaluation metric, and FineW3, a comprehensive dataset. Experiments show that current LVLMs struggle with spatial and fine-grained temporal understanding of anomalous events, despite performing well on coarse-grained, static information and events with strong visual cues.
FineVAU 是一个新的视频异常理解基准,专注于异常视频的丰富、细致和领域特定的理解。它引入了 FVScore,一个新型的评估指标,和 FineW3,一个综合数据集,以评估语言模型响应中的关键视觉元素。人类评估表明,FVScore 在异常感知方面与人类感知更一致,优于现有指标。实验揭示,尽管语言模型在静态信息和具有强烈视觉线索的事件上表现良好,但在空间和细致的时间理解异常事件方面存在局限性。
Competition for attention predicts good-to-bad tipping in AI
Authors: Neil F. Johnson, Frank Y. Huo
First: 2026-02-16T00:43:56+00:00 · Latest: 2026-02-23T18:12:05+00:00
Abstract
More than half the global population now carries devices that can run ChatGPT-like language models with no Internet connection and minimal safety oversight -- and hence the potential to promote self-harm, financial losses and extremism among other dangers. Existing safety tools either require cloud connectivity or discover failures only after harm has occurred. Here we show that a large class of potentially dangerous tipping originates at the atomistic scale in such edge AI due to competition for the machinery's attention. This yields a mathematical formula for the dynamical tipping point n*, governed by dot-product competition for attention between the conversation's context and competing output basins, that reveals new control levers. Validated against multiple AI models, the mechanism can be instantiated for different definitions of 'good' and 'bad' and hence in principle applies across domains (e.g. health, law, finance, defense), changing legal landscapes (e.g. EU, UK, US and state level), languages, and cultural settings.
中文标题/摘要
标题:注意力竞争预测AI的好坏临界点
全球超过一半的人口现在携带可以离线运行类似ChatGPT的语言模型且缺乏安全监管的设备——这增加了自我伤害、经济损失和其他危险的风险。现有的安全工具要么需要云连接,要么在危害发生后才检测到失败。我们展示了由于边缘AI中注意力的竞争,一类潜在危险的临界点起源于原子级。这提供了一个由对话背景与竞争输出盆地之间的点积竞争决定的动力学临界点n*的数学公式,揭示了新的控制杠杆。该机制在多个AI模型上得到了验证,可以针对不同的“好”和“坏”的定义进行实例化,因此原则上适用于不同领域(如健康、法律、金融、国防)、法律环境(如欧盟、英国、美国和州级)、语言和文化背景。
Summary / 总结
The research aims to address the potential risks associated with the widespread use of edge AI devices capable of running language models without internet connection. It introduces a mathematical formula to predict the tipping point at which AI conversations can shift from good to bad outcomes due to competition for the device's attention. The study validates this mechanism across multiple AI models and suggests its applicability across various domains and cultural settings, providing new control strategies for managing these risks.
研究旨在应对广泛使用无需互联网连接即可运行语言模型的边缘AI设备所带来的潜在风险。它引入了一个数学公式来预测由于争夺设备注意力而导致AI对话从良好走向不良的临界点。该研究验证了这一机制在多个AI模型中的有效性,并表明其适用于各种领域、文化背景以及不同的法律环境,提供了新的管理这些风险的控制策略。
VidEoMT: Your ViT is Secretly Also a Video Segmentation Model
Authors: Narges Norouzi, Idil Esen Zulfikar, Niccolò Cavagnero, Tommie Kerssies, Bastian Leibe, Gijs Dubbelman, Daan de Geus
Venue: CVPR 2025
First: 2026-02-19T20:14:14+00:00 · Latest: 2026-02-23T18:10:18+00:00
Comments: CVPR 2025. Code: https://www.tue-mps.org/videomt/
Abstract
Existing online video segmentation models typically combine a per-frame segmenter with complex specialized tracking modules. While effective, these modules introduce significant architectural complexity and computational overhead. Recent studies suggest that plain Vision Transformer (ViT) encoders, when scaled with sufficient capacity and large-scale pre-training, can conduct accurate image segmentation without requiring specialized modules. Motivated by this observation, we propose the Video Encoder-only Mask Transformer (VidEoMT), a simple encoder-only video segmentation model that eliminates the need for dedicated tracking modules. To enable temporal modeling in an encoder-only ViT, VidEoMT introduces a lightweight query propagation mechanism that carries information across frames by reusing queries from the previous frame. To balance this with adaptability to new content, it employs a query fusion strategy that combines the propagated queries with a set of temporally-agnostic learned queries. As a result, VidEoMT attains the benefits of a tracker without added complexity, achieving competitive accuracy while being 5x-10x faster, running at up to 160 FPS with a ViT-L backbone. Code: https://www.tue-mps.org/videomt/
中文标题/摘要
标题:VidEoMT:你的ViT实际上也是一个视频分割模型
现有的在线视频分割模型通常将每帧分割器与复杂的专门跟踪模块结合使用。虽然这些模块有效,但它们引入了显著的架构复杂性和计算开销。最近的研究表明,当具有足够的容量并进行大规模预训练时,普通的视觉变换器(ViT)编码器可以进行准确的图像分割,而无需专门的模块。受此观察的启发,我们提出了视频编码器仅掩码变换器(VidEoMT),这是一种简单的编码器仅视频分割模型,消除了专用跟踪模块的需要。为了在编码器仅ViT中实现时间建模,VidEoMT引入了一种轻量级的查询传播机制,通过重用上一帧的查询来携带跨帧的信息。为了平衡这一点并适应新内容,它采用了一种查询融合策略,将传播的查询与一组时空无关的已学习查询相结合。因此,VidEoMT在不增加复杂性的情况下获得了跟踪器的好处,实现了竞争力的准确性,同时比传统方法快5-10倍,使用ViT-L骨干时可达到每秒160帧。代码:https://www.tue-mps.org/videomt/
Summary / 总结
VidEoMT is proposed to simplify video segmentation by removing specialized tracking modules, instead relying on a Vision Transformer (ViT) encoder enhanced with a lightweight query propagation mechanism and query fusion strategy. This model achieves competitive accuracy and runs up to 160 FPS, being 5x-10x faster than existing methods. The key innovation is the ability to balance temporal modeling with adaptability to new content without increasing architectural complexity. Code: https://www.tue-mps.org/videomt/
VidEoMT 提出了一种使用 Vision Transformer (ViT) 编码器的方法来简化视频分割,去掉了复杂的跟踪模块。它引入了一个轻量级的查询传播机制来实现时间建模,并采用查询融合策略以适应新内容。这使得模型既准确又高效,使用 ViT-L 骨干网络时可达到每秒 160 帧的运行速度,并且比现有方法快 5 到 10 倍,同时保持了竞争力的准确性。
CausalFlip: A Benchmark for LLM Causal Judgment Beyond Semantic Matching
Authors: Yuzhe Wang, Yaochen Zhu, Jundong Li
First: 2026-02-23T18:06:15+00:00 · Latest: 2026-02-23T18:06:15+00:00
Comments: 8 pages plus references, 3 figures, 3 tables. Under review
Abstract
As large language models (LLMs) witness increasing deployment in complex, high-stakes decision-making scenarios, it becomes imperative to ground their reasoning in causality rather than spurious correlations. However, strong performance on traditional reasoning benchmarks does not guarantee true causal reasoning ability of LLMs, as high accuracy may still arise from memorizing semantic patterns instead of analyzing the underlying true causal structures. To bridge this critical gap, we propose a new causal reasoning benchmark, CausalFlip, designed to encourage the development of new LLM paradigms or training algorithms that ground LLM reasoning in causality rather than semantic correlation. CausalFlip consists of causal judgment questions built over event triples that could form different confounder, chain, and collider relations. Based on this, for each event triple, we construct pairs of semantically similar questions that reuse the same events but yield opposite causal answers, so that models that rely heavily on semantic matching are systematically driven toward incorrect predictions. To further probe models' reliance on semantic patterns, we introduce a noisy-prefix evaluation that prepends causally irrelevant text before intermediate causal reasoning steps without altering the underlying causal relations or the logic of the reasoning process. We evaluate LLMs under multiple training paradigms, including answer-only training, explicit Chain-of-Thought (CoT) supervision, and a proposed internalized causal reasoning approach that aims to mitigate explicit reliance on correlation in the reasoning process. Our results show that explicit CoT can still be misled by spurious semantic correlations, whereas internalizing reasoning steps yields substantially improved causal grounding, suggesting that it is promising to better elicit the latent causal reasoning capabilities of base LLMs.
中文标题/摘要
标题:CausalFlip:超越语义匹配的LLM因果判断基准
随着大型语言模型(LLMs)在复杂、高风险决策场景中的应用日益增多,将它们的推理基础建立在因果关系上而非偶然的相关性变得至关重要。然而,传统推理基准上的强大表现并不能保证LLMs真正具备因果推理能力,因为高准确率可能只是由于记忆了语义模式而非分析了潜在的真实因果结构。为了弥合这一关键差距,我们提出了一种新的因果推理基准CausalFlip,旨在鼓励开发新的LLM范式或训练算法,使LLM的推理基础建立在因果关系上而非语义相关性。CausalFlip由基于事件三元组构建的因果判断问题组成,这些事件三元组可以形成不同的共因、链式和碰撞关系。基于此,对于每个事件三元组,我们构建了语义相似的问题对,这些问题重用了相同的事件但导致了相反的因果答案,使得依赖于语义匹配的模型系统地产生错误预测。为了进一步探究模型对语义模式的依赖,我们引入了一种噪声前缀评估,该评估在中间因果推理步骤前添加了因果无关的文本,而不改变潜在的因果关系或推理过程的逻辑。我们对多种训练范式下的LLMs进行了评估,包括仅答案训练、显式因果推理链(CoT)监督以及一种旨在减轻推理过程中对相关性依赖的内部化因果推理方法。结果显示,显式的CoT仍然可能被虚假的语义相关性误导,而内部化推理步骤则显著提高了因果基础,表明更好地激发基底LLMs的潜在因果推理能力是可行的。
Summary / 总结
CausalFlip is a new benchmark designed to evaluate the causal reasoning ability of large language models (LLMs) beyond semantic matching. It consists of causal judgment questions built over event triples that can form different causal relations, and pairs of semantically similar questions that yield opposite causal answers. The benchmark also includes a noisy-prefix evaluation to probe models' reliance on semantic patterns. Evaluations show that explicit Chain-of-Thought (CoT) can still be misled by spurious semantic correlations, while internalizing reasoning steps improves causal grounding, indicating the potential to better elicit latent causal reasoning capabilities of LLMs.
CausalFlip 是一个新的基准,旨在评估大型语言模型(LLMs)的因果推理能力,而不仅仅是语义匹配。它包含基于事件三元组构建的因果判断问题,这些事件三元组可以形成不同的因果关系,并且包含一对语义相似的问题,它们给出相反的因果答案。基准还包括一个噪声前缀评估,以探究模型对语义模式的依赖。对不同训练范式的 LLM 评估显示,内化推理步骤可以改善因果定位,表明显式的因果链推理(CoT)仍然可能被虚假的语义相关性误导,而内化的因果推理更为有效。
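The three relations named above differ only in edge direction, which is why two questions about the same events can have opposite answers. The sketch below encodes each relation as a small directed graph and answers "does A cause C?" by directed reachability. The event names `A`, `B`, `C` and the helper `causes` are hypothetical, and the benchmark's actual question construction is not reproduced here.

```python
# Each structure is a list of directed edges (cause, effect) over one event triple.
STRUCTURES = {
    "chain": [("A", "B"), ("B", "C")],       # A -> B -> C
    "confounder": [("B", "A"), ("B", "C")],  # A <- B -> C
    "collider": [("A", "B"), ("C", "B")],    # A -> B <- C
}

def causes(edges, source, target):
    """True if a directed path leads from source to target."""
    frontier = [source]
    seen = {source}
    while frontier:
        node = frontier.pop()
        for cause, effect in edges:
            if cause == node and effect not in seen:
                if effect == target:
                    return True
                seen.add(effect)
                frontier.append(effect)
    return False

# Same three events, same surface question, different causal answer.
answers = {name: causes(edges, "A", "C") for name, edges in STRUCTURES.items()}
```

Only the chain makes A a cause of C. Under the confounder and collider structures, A and C may co-occur in text, yet the causal answer is negative, which is the kind of flip a model relying on semantic matching would miss.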
Closing the Gap Between Text and Speech Understanding in LLMs
Authors: Santiago Cuervo, Skyler Seto, Maureen de Seyssel, Richard He Bai, Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly, Zakaria Aldeneh
First: 2025-10-15T14:57:16+00:00 · Latest: 2026-02-23T18:05:51+00:00
Abstract
Large Language Models (LLMs) can be adapted to extend their text capabilities to speech inputs. However, these speech-adapted LLMs consistently underperform their text-based counterparts--and even cascaded pipelines--on language understanding tasks. We term this shortfall the text-speech understanding gap: the performance drop observed when a speech-adapted LLM processes spoken inputs relative to when the original text-based LLM processes the equivalent text. Recent approaches to narrowing this gap either rely on large-scale speech synthesis of text corpora, which is costly and heavily dependent on synthetic data, or on large-scale proprietary speech datasets, which are not reproducible. As a result, there remains a need for more data-efficient alternatives for closing the text-speech understanding gap. In this work, we analyze the gap as driven by two factors: (i) forgetting of text capabilities during adaptation, and (ii) cross-modal misalignment between speech and text. Based on this analysis, we introduce SALAD--Sample-efficient Alignment with Learning through Active selection and cross-modal Distillation--which combines cross-modal distillation with targeted synthetic data to improve alignment while mitigating forgetting. Applied to 3B and 7B LLMs, SALAD achieves competitive performance with a strong open-weight model across broad-domain benchmarks in knowledge, language understanding, and reasoning, while training on over an order of magnitude less speech data from public corpora.
中文标题/摘要
标题:在大规模语言模型中缩小文本与语音理解之间的差距
大规模语言模型(LLMs)可以被调整以扩展其文本能力以处理语音输入。然而,这些适应语音的LLMs在语言理解任务上的表现始终不如其基于文本的对应物——甚至不如级联管道。我们称这种不足为文本-语音理解差距:当一个适应语音的LLM处理语音输入时,相对于原始基于文本的LLM处理等效文本时观察到的性能下降。最近缩小这一差距的方法要么依赖大规模的文本语料库的语音合成,这既昂贵又高度依赖合成数据,要么依赖大规模的专有语音数据集,这些数据集不可重复。因此,仍需要更高效的数据替代方案来缩小文本-语音理解差距。在本研究中,我们分析了这一差距由两个因素驱动:(i)适应过程中对文本能力的遗忘,(ii)语音和文本之间的跨模态不一致。基于这一分析,我们引入了SALAD——高效样本对齐与通过主动选择和跨模态蒸馏学习——它结合了跨模态蒸馏和目标合成数据,以提高对齐并减轻遗忘。将SALAD应用于3B和7B LLMs,在公共语料库的语音数据量少一个数量级以上的情况下,SALAD在广泛领域的知识、语言理解和推理基准测试中实现了与强开源模型相当的性能。
Summary / 总结
This paper addresses the text-speech understanding gap in Large Language Models (LLMs) by analyzing it as driven by forgetting of text capabilities and cross-modal misalignment. The authors propose SALAD, a method combining cross-modal distillation with targeted synthetic data to improve alignment and mitigate forgetting. SALAD achieves competitive performance with a strong open-weight model across various benchmarks while using significantly less speech data compared to previous approaches.
本文探讨了大型语言模型(LLMs)在语音理解和文本理解之间的差距,其中适应语音的LLMs在性能上低于文本模型。作者分析了这种差距是由两种因素造成的:适应过程中对文本能力的遗忘以及语音和文本之间的跨模态不对齐。他们提出了一种名为SALAD的方法,结合跨模态蒸馏和目标合成数据来改善对齐并减轻遗忘。SALAD在广泛领域的基准测试中实现了与现有方法相当的性能,同时使用了比现有方法少一个数量级的公开语音数据。
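One common way to express cross-modal distillation is a KL divergence between the next-token distribution of the text model (teacher, reading the transcript) and that of the speech-adapted model (student, hearing the audio). The sketch below assumes that formulation; the abstract does not specify SALAD's exact loss, and the logits are made-up numbers.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits):
    """KL(teacher || student) over one next-token distribution (assumed loss form)."""
    teacher = softmax(teacher_logits)
    student = softmax(student_logits)
    return sum(t * (math.log(t) - math.log(s)) for t, s in zip(teacher, student))

# Hypothetical logits over a three-token vocabulary.
text_teacher = [2.0, 0.5, -1.0]
speech_student_aligned = [2.0, 0.5, -1.0]
speech_student_misaligned = [-1.0, 0.5, 2.0]

loss_aligned = distillation_loss(text_teacher, speech_student_aligned)
loss_misaligned = distillation_loss(text_teacher, speech_student_misaligned)
```

The loss is zero when the student reproduces the teacher's distribution on the spoken input and grows with misalignment, so minimizing it targets both factors named in the abstract: it pulls speech representations toward text behaviour while anchoring the model to its original text capabilities.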
PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation
Authors: Onkar Susladkar, Tushar Prakash, Adheesh Juvekar, Kiet A. Nguyen, Dong-Hwan Jang, Inderjit S Dhillon, Ismini Lourentzou
First: 2026-01-22T18:58:55+00:00 · Latest: 2026-02-23T18:05:24+00:00
Abstract
Discrete video VAEs underpin modern text-to-video generation and video understanding systems, yet existing tokenizers typically learn visual codebooks at a single scale with limited vocabularies and shallow language supervision, leading to poor cross-modal alignment and zero-shot transfer. We introduce PyraTok, a language-aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatiotemporal resolutions. PyraTok builds on a pretrained video VAE and a novel Language aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at several depths using a shared large binary codebook, yielding compact yet expressive video token sequences. To tightly couple visual tokens with language, PyraTok jointly optimizes multi-scale text-guided quantization and a global autoregressive objective over the token hierarchy. Across ten benchmarks, PyraTok delivers state-of-the-art (SOTA) video reconstruction, consistently improves text-to-video quality, and sets new SOTA zero-shot performance on video segmentation, temporal action localization, and video understanding, scaling robustly to up to 4K/8K resolutions.
中文标题/摘要
标题:PyraTok:语言对齐的分层标记器,用于视频理解和生成
离散视频VAEs是现代文本到视频生成和视频理解系统的基石,但现有的标记器通常在单个尺度上学习具有有限词汇量和浅层语言监督的视觉码本,导致跨模态对齐不良和零样本迁移效果不佳。我们提出了PyraTok,一种语言对齐的分层标记器,能够在多个时空分辨率上学习语义结构化的离散潜在变量。PyraTok 基于一个预训练的视频VAE和一个新颖的语言对齐分层量化(LaPQ)模块,该模块使用共享的大二进制码本在多个深度上离散化编码特征,从而产生紧凑且具有表现力的视频标记序列。为了紧密耦合视觉标记与语言,PyraTok 联合优化多尺度文本引导量化和标记层次上的全局自回归目标。在十个基准测试中,PyraTok 在视频重建方面达到了最先进的(SOTA)性能,一致地提高了文本到视频的质量,并在视频分割、动作定位和视频理解方面设立了新的SOTA零样本性能,能够扩展到高达4K/8K分辨率。
Summary / 总结
PyraTok is designed to improve cross-modal alignment and zero-shot transfer in video understanding and generation by learning semantically structured discrete latents across multiple spatiotemporal resolutions. It uses a Language aligned Pyramidal Quantization (LaPQ) module to discretize encoder features at several depths with a shared large binary codebook. PyraTok optimizes multi-scale text-guided quantization and a global autoregressive objective, leading to state-of-the-art video reconstruction, enhanced text-to-video quality, and new SOTA zero-shot performance on video segmentation, temporal action localization, and video understanding, even at high resolutions up to 4K/8K.
PyraTok 是一种语言对齐的分层 tokenizer,能够在多个时空分辨率下学习语义结构化的离散潜在变量,从而改善跨模态对齐和零样本迁移在视频理解和生成中的表现。它使用了语言对齐的分层量化(LaPQ)模块,在多个深度使用共享的大二进制码本对编码特征进行离散化,并联合优化多尺度文本引导量化和全局自回归目标。PyraTok 在视频重建、文本到视频生成以及各种视频理解任务中均达到了最先进的性能,并且能够很好地扩展到高分辨率。
LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight
Authors: Yunze Man, Shihao Wang, Guowen Zhang, Johan Bjorck, Zhiqi Li, Liang-Yan Gui, Jim Fan, Jan Kautz, Yu-Xiong Wang, Zhiding Yu
First: 2025-11-25T18:59:45+00:00 · Latest: 2026-02-23T17:59:38+00:00
Comments: Tech report. Project page: https://nvlabs.github.io/LocateAnything3D/
Abstract
To act in the world, a model must name what it sees and know where it is in 3D. Today's vision-language models (VLMs) excel at open-ended 2D description and grounding, yet multi-object 3D detection remains largely missing from the VLM toolbox. We present LocateAnything3D, a VLM-native recipe that casts 3D detection as a next-token prediction problem. The key is a short, explicit Chain-of-Sight (CoS) sequence that mirrors how humans reason from images: find an object in 2D, then infer its distance, size, and pose. The decoder first emits 2D detections as a visual chain-of-thought, then predicts 3D boxes under an easy-to-hard curriculum: across objects, a near-to-far order reduces early ambiguity and matches ego-centric utility; within each object, a center-from-camera, dimensions, and rotation factorization ranks information by stability and learnability. This VLM-native interface preserves open-vocabulary and visual-prompting capability without specialized heads. On the challenging Omni3D benchmark, our model achieves state-of-the-art results, with 38.90 AP_3D, surpassing the previous best by an absolute +13.98 even when the baseline is given ground-truth 2D boxes. It also generalizes zero-shot to held-out categories with strong robustness. By turning 3D detection into a disciplined next-token problem, LocateAnything3D offers a practical foundation for models to perceive in 3D.
中文标题/摘要
标题:LocateAnything3D:基于视线链的视觉-语言3D检测
为了在世界中行动,模型必须命名它所看到的并知道其在3D中的位置。今天的视觉-语言模型(VLM)在开放的2D描述和语义定位方面表现出色,但多对象3D检测仍然缺乏于VLM工具箱中。我们提出了LocateAnything3D,这是一种VLM原生的方法,将3D检测视为下一个标记预测问题。关键在于一个简短明确的视线链(CoS)序列,这反映了人类如何从图像中推理:先在2D中找到一个物体,然后推断其距离、大小和姿态。解码器首先以视觉链的方式发出2D检测,然后在容易到困难的课程中预测3D框:在对象之间,从近到远的顺序减少了早期的不确定性并匹配了以自我为中心的实用性;在每个对象内部,从相机中心、尺寸和旋转的分解按稳定性和可学习性排列信息。这种VLM原生的接口保留了开放词汇和视觉提示的能力,而无需专门的头部。在具有挑战性的Omni3D基准测试中,我们的模型达到了最先进的结果,38.90 AP_3D,即使基线提供了真实2D框,绝对改进也超过了前最佳值13.98。它还以强大的鲁棒性在零样本下推广到未见过的类别。通过将3D检测转化为一个有纪律的下一个标记问题,LocateAnything3D为模型提供了一个感知3D的实用基础。
Summary / 总结
The research aims to enable models to perform 3D detection, which is crucial for real-world actions. The method involves casting 3D detection as a next-token prediction problem using a Chain-of-Sight (CoS) sequence that mimics human reasoning. The model first generates 2D detections and then predicts 3D boxes in an easy-to-hard curriculum. On the Omni3D benchmark, the model achieves state-of-the-art results with 38.90 AP_3D, surpassing previous bests by 13.98 points even when given ground-truth 2D boxes. It also demonstrates strong zero-shot generalization to new categories.
LocateAnything3D通过将多对象3D检测问题转化为下一个标记预测任务来解决视觉语言模型中的挑战。它使用链式视线(CoS)序列引导模型从2D目标检测到3D框预测,提高了准确性和泛化能力。该模型在Omni3D基准测试中取得了最先进的结果,超越了之前的最佳方法13.98 AP_3D点,并且在新类别上实现了强大的零样本泛化能力。
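The ordering described in the abstract (2D detections first, then 3D boxes from near to far, each factorized as center, dimensions, rotation) can be illustrated with a minimal sketch. The `Object3D` class, the `<2d ...>`/`<3d ...>` token strings, and the use of Euclidean distance from the camera as the near-to-far key are all assumptions made for illustration; the abstract does not specify the actual token format.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Tuple


@dataclass
class Object3D:
    # Hypothetical container; field names are illustrative, not from the paper.
    label: str
    box2d: Tuple[float, float, float, float]  # x1, y1, x2, y2 in pixels
    center: Tuple[float, float, float]        # camera-frame x, y, z
    dims: Tuple[float, float, float]          # width, height, length
    yaw: float                                # rotation in radians


def distance(obj: Object3D) -> float:
    """Euclidean distance of the box center from the camera origin."""
    x, y, z = obj.center
    return sqrt(x * x + y * y + z * z)


def chain_of_sight(objects: List[Object3D]) -> List[str]:
    """Serialize objects as 2D detections first, then 3D boxes, near to far."""
    ordered = sorted(objects, key=distance)
    tokens = []
    # Stage 1: 2D detections act as the visual chain-of-thought.
    for o in ordered:
        coords = " ".join(f"{v:.0f}" for v in o.box2d)
        tokens.append(f"<2d {o.label} {coords}>")
    # Stage 2: 3D boxes, each emitted as center, then dimensions, then rotation.
    for o in ordered:
        center = ",".join(f"{v:.2f}" for v in o.center)
        dims = ",".join(f"{v:.2f}" for v in o.dims)
        tokens.append(f"<3d {o.label} center={center} dims={dims} yaw={o.yaw:.2f}>")
    return tokens
```

Calling `chain_of_sight` on a list containing a distant car and a nearby person would place both of the person's tokens before the car's within each stage.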
Recurrent Equivariant Constraint Modulation: Learning Per-Layer Symmetry Relaxation from Data
Authors: Stefanos Pertigkiozoglou, Mircea Petrache, Shubhendu Trivedi, Kostas Daniilidis
First: 2026-02-02T21:59:35+00:00 · Latest: 2026-02-23T17:55:22+00:00
Abstract
Equivariant neural networks exploit underlying task symmetries to improve generalization, but strict equivariance constraints can induce more complex optimization dynamics that can hinder learning. Prior work addresses these limitations by relaxing strict equivariance during training, but typically relies on prespecified, explicit, or implicit target levels of relaxation for each network layer, which are task-dependent and costly to tune. We propose Recurrent Equivariant Constraint Modulation (RECM), a layer-wise constraint modulation mechanism that learns appropriate relaxation levels solely from the training signal and the symmetry properties of each layer's input-target distribution, without requiring any prior knowledge about the task-dependent target relaxation level. We demonstrate that under the proposed RECM update, the relaxation level of each layer provably converges to a value upper-bounded by its symmetry gap, namely the degree to which its input-target distribution deviates from exact symmetry. Consequently, layers processing symmetric distributions recover full equivariance, while those with approximate symmetries retain sufficient flexibility to learn non-symmetric solutions when warranted by the data. Empirically, RECM outperforms prior methods across diverse exact and approximate equivariant tasks, including the challenging molecular conformer generation on the GEOM-Drugs dataset.
中文标题/摘要
标题:循环等变约束调制:从数据中学习每层的对称性放松
等变神经网络利用任务的潜在对称性来提高泛化能力,但严格的等变性约束可能会导致更复杂的优化动态,从而阻碍学习。先前的工作通过在训练过程中放松严格的等变性来解决这些限制,但通常依赖于每个网络层的预设、显式或隐式的目标放松水平,这些水平是任务相关的,并且调整起来成本高昂。我们提出了循环等变约束调制(RECM),这是一种逐层约束调制机制,仅从训练信号和每层输入-目标分布的对称性属性中学习适当的放松水平,而无需任何关于任务相关目标放松水平的先验知识。我们证明,在提出的RECM更新下,每个层的放松水平会收敛到一个以其对称性差距为上界的值,对称性差距即其输入-目标分布偏离完全对称的程度。因此,处理对称分布的层恢复了完全等变性,而具有近似对称性的层在数据需要时保留了足够的灵活性来学习非对称解。实验中,RECM在各种精确和近似等变任务中均优于先前的方法,包括GEOM-Drugs数据集上的具有挑战性的分子构象生成任务。
Summary / 总结
The paper addresses the challenge of strict equivariance constraints in neural networks by proposing Recurrent Equivariant Constraint Modulation (RECM), which learns appropriate relaxation levels for each layer based on the training signal and the symmetry properties of the input-target distribution. This method eliminates the need for task-dependent and costly tuning of relaxation levels. Empirical results show that RECM outperforms previous methods across various equivariant tasks, including molecular conformer generation on the GEOM-Drugs dataset.
论文提出了一种循环等变约束调制(RECM)方法,该方法根据输入-目标分布的对称性特性以及训练信号来学习每一层的适当放松水平,从而解决了神经网络中严格等变约束带来的问题。这种方法消除了对任务相关且成本高昂的放松水平调优的需求。实验证明,RECM在包括GEOM-Drugs数据集上的分子构象生成在内的各种等变任务中优于先前的方法。
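To make the notion of a per-layer "relaxation level" concrete, the sketch below builds a permutation-equivariant linear layer and adds an unconstrained term scaled by a scalar `alpha`. This is not the RECM update rule, which the abstract does not spell out; it is only an illustration, under assumed names (`W_eq`, `W_free`, `alpha`), of how a relaxation scalar trades exact equivariance for flexibility.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# a*I + b*(all-ones matrix) commutes with every permutation matrix,
# so this weight is exactly permutation-equivariant.
W_eq = 1.0 * np.eye(n) + 0.5 * np.ones((n, n))
# An unconstrained weight that breaks the symmetry.
W_free = rng.normal(size=(n, n))


def layer(x, alpha):
    """Relaxed layer: equivariant part plus alpha times a free part."""
    return (W_eq + alpha * W_free) @ x


def equivariance_error(alpha, trials=100):
    """Mean norm of layer(P x) - P layer(x) over random inputs and permutations."""
    errs = []
    for _ in range(trials):
        x = rng.normal(size=n)
        perm = rng.permutation(n)
        errs.append(np.linalg.norm(layer(x[perm], alpha) - layer(x, alpha)[perm]))
    return float(np.mean(errs))
```

With `alpha = 0` the error is zero up to floating-point rounding, and it grows as `alpha` increases, which is the quantity a learned relaxation level would control.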
Do Large Language Models Understand Data Visualization Principles?
Authors: Martin Sinnona, Valentin Bonas, Viviana Siless, Emmanuel Iarussi
First: 2026-02-23T17:51:06+00:00 · Latest: 2026-02-23T17:51:06+00:00
Abstract
Data visualization principles, derived from decades of research in design and perception, ensure proper visual communication. While prior work has shown that large language models (LLMs) can generate charts or flag misleading figures, it remains unclear whether they and their vision-language counterparts (VLMs) can reason about and enforce visualization principles directly. Constraint-based systems encode these principles as logical rules for precise automated checks, but translating them into formal specifications demands expert knowledge. This motivates leveraging LLMs and VLMs as principle checkers that can reason about visual design directly, bypassing the need for symbolic rule specification. In this paper, we present the first systematic evaluation of both LLMs and VLMs on their ability to reason about visualization principles, using hard verification ground truth derived from Answer Set Programming (ASP). We compiled a set of visualization principles expressed as natural-language statements and generated a controlled dataset of approximately 2,000 Vega-Lite specifications annotated with explicit principle violations, complemented by over 300 real-world Vega-Lite charts. We evaluated both checking and fixing tasks, assessing how well models detect principle violations and correct flawed chart specifications. Our work highlights both the promise of large (vision-)language models as flexible validators and editors of visualization designs and the persistent gap with symbolic solvers on more nuanced aspects of visual perception. Our results also reveal an interesting asymmetry: frontier models tend to be more effective at correcting violations than at detecting them reliably.
中文标题/摘要
标题:大型语言模型理解数据可视化原则吗?
数据可视化原则源自数十年的设计和感知研究,确保了视觉传达的正确性。尽管先前的研究表明大型语言模型(LLMs)能够生成图表或标记误导性图表,但尚不清楚它们及其视觉-语言对应物(VLMs)是否能够直接推理和执行可视化原则。基于约束的系统将这些原则编码为逻辑规则以进行精确的自动化检查,但将其转化为形式规范需要专家知识。这促使我们利用LLMs和VLMs作为原则检查器,可以直接推理视觉设计,而无需指定符号规则。在本文中,我们首次系统地评估了LLMs和VLMs在推理可视化原则方面的能力,使用来自回答集编程(ASP)的严格验证地面真相。我们编译了一组用自然语言表达的可视化原则,并生成了一个包含约2,000个带有明确原则违反标注的Vega-Lite规范的受控数据集,同时还包括了超过300个真实世界的Vega-Lite图表。我们评估了检查和修复任务,评估模型检测原则违反情况和纠正有缺陷的图表规范的能力。我们的工作突显了大型(视觉-)语言模型作为灵活的可视化设计验证器和编辑器的潜力,同时也揭示了与符号求解器在视觉感知更微妙方面存在的持续差距。它们还揭示了一个有趣的不对称性:前沿模型在纠正违反方面比在可靠地检测它们方面更有效。
Summary / 总结
This paper evaluates whether large language models (LLMs) and vision-language models (VLMs) can reason about and enforce data visualization principles. Motivated by the need to bypass the requirement for expert knowledge in specifying formal rules, the study uses a dataset of approximately 2,000 Vega-Lite specifications annotated with principle violations and over 300 real-world charts. The models were tested on both checking and fixing tasks, revealing that while they can effectively correct violations, they struggle to reliably detect them, highlighting a gap with symbolic solvers in nuanced visual perception aspects.
本文评估了大型语言模型(LLMs)和视觉-语言模型(VLMs)在处理和执行数据可视化原则方面的能力。受无需专家知识进行符号规则指定的需求驱动,研究使用了大约2,000个带有原则违规标注的Vega-Lite规范数据集进行系统性评估。评估结果显示,尽管这些模型能够有效地纠正视觉设计缺陷,但在可靠检测违规方面存在困难,这表明在精细的视觉感知任务上,它们与符号求解器之间存在差距。
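The checking and fixing tasks can be pictured with a small rule-based sketch over a Vega-Lite specification held as a Python dict. The rule used here (a bar chart's quantitative axis should keep a zero baseline) is an illustrative assumption and is not taken from the paper's principle set; the helper names are hypothetical. Only the `mark`, `encoding`, `type`, and `scale.zero` fields of Vega-Lite are relied on.

```python
import copy


def mark_type(spec):
    """Vega-Lite allows 'mark' as a string or as an object with a 'type' key."""
    mark = spec.get("mark")
    return mark.get("type") if isinstance(mark, dict) else mark


def nonzero_bar_axes(spec):
    """Checking task: list positional channels of a bar chart with zero disabled."""
    if mark_type(spec) != "bar":
        return []
    violations = []
    for channel in ("x", "y"):
        enc = spec.get("encoding", {}).get(channel, {})
        if enc.get("type") == "quantitative" and enc.get("scale", {}).get("zero") is False:
            violations.append(channel)
    return violations


def fix_zero_baseline(spec):
    """Fixing task: return a copy of the spec with the zero baseline restored."""
    fixed = copy.deepcopy(spec)
    for channel in nonzero_bar_axes(spec):
        fixed["encoding"][channel]["scale"]["zero"] = True
    return fixed
```

A symbolic checker of this kind is precise but has to be written rule by rule, which is the expert effort the paper hopes LLMs and VLMs can bypass.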
APEX-Agents
Authors: Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman, Marco Burstein, Julien Benchek, David Ostrofsky, Anirudh Ravichandran, Debnil Sur, Neel Venugopal, Alannah Hsia, Isaac Robinson, Calix Huang, Olivia Varones, Daniyal Khan, Michael Haines, Austin Bridges, Jesse Boyle, Koby Twist, Zach Richards, Chirag Mahapatra, Brendan Foody, Osvald Nitski
First: 2026-01-20T18:53:44+00:00 · Latest: 2026-02-23T17:49:38+00:00
Abstract
We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate lawyers. APEX-Agents requires agents to navigate realistic work environments with files and tools. We test eight agents for the leaderboard using Pass@1. Gemini 3 Flash (Thinking=High) achieves the highest score of 24.0%, followed by GPT-5.2 (Thinking=High), Claude Opus 4.5 (Thinking=High), and Gemini 3 Pro (Thinking=High). We open source the APEX-Agents benchmark (n=480) with all prompts, rubrics, gold outputs, files, and metadata. We also open source Archipelago, our infrastructure for agent execution and evaluation.
中文标题/摘要
标题:APEX-Agents
我们介绍了代理人工智能生产力指数(APEX-Agents),这是一个基准测试,用于评估AI代理是否能够执行由投资银行分析师、管理咨询顾问和公司律师创建的长期跨应用任务。APEX-Agents 要求代理在包含文件和工具的现实工作环境中导航。我们使用 Pass@1 测试了八种代理以确定排行榜。Gemini 3 Flash(思考=高)获得最高分为 24.0%,其次是 GPT-5.2(思考=高)、Claude Opus 4.5(思考=高)和 Gemini 3 Pro(思考=高)。我们开源了包含 480 个提示、评分标准、黄金输出、文件和元数据的 APEX-Agents 基准测试。我们还开源了我们的代理执行和评估基础设施 Archipelago。
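As a reminder of what the leaderboard metric measures, Pass@1 is the probability that a single attempt at a task succeeds. The sketch below averages per-task success rates over however many runs were recorded. The abstract does not state how many runs per task were used or how a rubric outcome maps to pass/fail, so the `results` layout is an assumption.

```python
from typing import Dict, List


def pass_at_1(results: Dict[str, List[bool]]) -> float:
    """Mean over tasks of the fraction of runs that passed.

    `results` maps a task id to the pass/fail outcome of each recorded run.
    """
    per_task = [sum(runs) / len(runs) for runs in results.values()]
    return sum(per_task) / len(per_task)
```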
Descent-Guided Policy Gradient for Scalable Cooperative Multi-Agent Learning
Authors: Shan Yang, Yang Liu
First: 2026-02-23T17:45:08+00:00 · Latest: 2026-02-23T17:45:08+00:00
Comments: 10 pages, 5 figures, 5 tables; plus 16 pages of appendices
Abstract
Scaling cooperative multi-agent reinforcement learning (MARL) is fundamentally limited by cross-agent noise: when agents share a common reward, the actions of all $N$ agents jointly determine each agent's learning signal, so cross-agent noise grows with $N$. In the policy gradient setting, per-agent gradient estimate variance scales as $Θ(N)$, yielding sample complexity $\mathcal{O}(N/ε)$. We observe that many domains -- cloud computing, transportation, power systems -- have differentiable analytical models that prescribe efficient system states. In this work, we propose Descent-Guided Policy Gradient (DG-PG), a framework that constructs noise-free per-agent guidance gradients from these analytical models, decoupling each agent's gradient from the actions of all others. We prove that DG-PG reduces gradient variance from $Θ(N)$ to $\mathcal{O}(1)$, preserves the equilibria of the cooperative game, and achieves agent-independent sample complexity $\mathcal{O}(1/ε)$. On a heterogeneous cloud scheduling task with up to 200 agents, DG-PG converges within 10 episodes at every tested scale -- from $N=5$ to $N=200$ -- directly confirming the predicted scale-invariant complexity, while MAPPO and IPPO fail to converge under identical architectures.
中文标题/摘要
标题:基于下降引导的策略梯度方法实现可扩展的多智能体协同学习
协同多智能体强化学习(MARL)的可扩展性从根本上受到跨智能体噪声的限制:当智能体共享一个共同的奖励时,所有$N$个智能体的行动共同决定了每个智能体的学习信号,因此跨智能体噪声随着$N$的增长而增长。在策略梯度设置中,每个智能体的梯度估计方差按$Θ(N)$增长,导致样本复杂度为$\mathcal{O}(N/ε)$。我们观察到许多领域——云计算、交通、电力系统——具有可微分的分析模型,规定了高效系统状态。在本文中,我们提出了下降引导的策略梯度(DG-PG)框架,该框架从这些分析模型中构建无噪声的每个智能体引导梯度,解耦每个智能体的梯度与其他智能体的行动。我们证明DG-PG将梯度方差从$Θ(N)$降低到$\mathcal{O}(1)$,保持了合作博弈的均衡,并实现了智能体无关的样本复杂度$\mathcal{O}(1/ε)$。在具有最多200个智能体的异构云调度任务中,DG-PG在每个测试规模(从$N=5$到$N=200$)中均在10个回合内收敛,直接证实了预测的规模不变复杂性,而MAPPO和IPPO在相同架构下无法收敛。
Summary / 总结
The research addresses the challenge of cross-agent noise in cooperative multi-agent reinforcement learning (MARL) by proposing Descent-Guided Policy Gradient (DG-PG), which uses differentiable analytical models to provide noise-free guidance gradients for each agent. This method reduces the gradient variance from Θ(N) to O(1) and achieves agent-independent sample complexity O(1/ε). Experiments on a heterogeneous cloud scheduling task with up to 200 agents show that DG-PG converges within 10 episodes at every tested scale, while MAPPO and IPPO fail to converge under identical architectures.
本文提出了一种Descent-Guided Policy Gradient (DG-PG) 方法,通过使用可微分的分析模型为每个代理构建无噪声的指导梯度来解决合作多代理强化学习的扩展问题。该方法将梯度方差从Θ(N)降低到O(1),实现了代理无关的样本复杂度O(1/ε)。研究显示,DG-PG 在从5到200个代理的云调度任务中10个回合内即可收敛,而其他方法在相同条件下无法收敛。
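The variance-scaling claim can be checked numerically on a toy problem. In the sketch below each agent draws a unit-variance Gaussian action, the shared reward is the sum of all actions, and agent 0 uses a score-function gradient estimate. The "local" signal that depends only on agent 0's own action is a stand-in for an agent-specific guidance term and is not DG-PG's actual construction; the setup and names are assumptions chosen so the variances have closed forms (N + 1 versus 2).

```python
import numpy as np


def gradient_variances(n_agents, n_samples=20000, seed=0):
    """Variance of agent 0's gradient estimate under shared vs. local signals."""
    rng = np.random.default_rng(seed)
    # a_j ~ N(theta_j, 1) with theta_j = 0 for every agent.
    actions = rng.normal(size=(n_samples, n_agents))
    # Score function d/d(theta_0) log pi(a_0) for a unit-variance Gaussian at theta_0 = 0.
    score = actions[:, 0]
    shared_reward = actions.sum(axis=1)   # depends on all N agents
    local_signal = actions[:, 0]          # depends on agent 0 only
    shared_var = float(np.var(score * shared_reward))
    local_var = float(np.var(score * local_signal))
    return shared_var, local_var
```

Running this for N = 5 and N = 200 shows the shared-reward variance growing roughly linearly in N while the local variance stays near 2.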
Hier-COS: Making Deep Features Hierarchy-aware via Composition of Orthogonal Subspaces
Authors: Depanshu Sani, Saket Anand
Venue: CVPR 2026
First: 2025-03-10T20:59:41+00:00 · Latest: 2026-02-23T17:38:56+00:00
Comments: Accepted at CVPR 2026
Abstract
Traditional classifiers treat all labels as mutually independent, thereby considering all negative classes to be equally incorrect. This approach fails severely in many real-world scenarios, where a known semantic hierarchy defines a partial order of preferences over negative classes. While hierarchy-aware feature representations have shown promise in mitigating this problem, their performance is typically assessed using metrics like MS and AHD. In this paper, we highlight important shortcomings in existing hierarchical evaluation metrics, demonstrating that they are often incapable of measuring true hierarchical performance. Our analysis reveals that existing methods learn sub-optimal hierarchical representations, despite competitive MS and AHD scores. To counter these issues, we introduce Hier-COS, a novel framework for unified hierarchy-aware fine-grained and hierarchical multi-level classification. We show that Hier-COS is theoretically guaranteed to be consistent with the given hierarchy tree. Furthermore, our framework implicitly adapts the learning capacity for different classes based on their position within the hierarchy tree, a vital property absent in existing methods. Finally, to address the limitations of evaluation metrics, we propose HOPS, a ranking-based metric that demonstrably overcomes the deficiencies of current evaluation standards. We benchmark Hier-COS on four challenging datasets, including the deep and imbalanced tieredImageNet-H and iNaturalist-19. Through extensive experiments, we demonstrate that Hier-COS achieves SOTA across all hierarchical metrics for every dataset, while simultaneously beating the top-1 accuracy in all but one case. Lastly, we show that Hier-COS can effectively learn to transform the frozen features extracted from a pretrained backbone (ViT) to be hierarchy-aware, yielding substantial benefits for hierarchical classification performance.
中文标题/摘要
标题:Hier-COS:通过正交子空间组合使深层特征具有层次意识
传统分类器将所有标签视为相互独立,因此认为所有负类同样错误。这种方法在许多现实场景中表现不佳,因为已知的语义层次定义了负类的部分偏好顺序。虽然层次意识特征表示在缓解这一问题方面显示出潜力,但其性能通常通过MS和AHD等指标进行评估。在本文中,我们指出现有层次评估指标的重要缺陷,表明它们通常无法衡量真正的层次性能。我们的分析揭示了现有方法学习到的层次表示是次优的,尽管MS和AHD得分竞争。为解决这些问题,我们提出了Hier-COS,一种统一的层次意识细粒度和层次多级分类的新框架。我们证明Hier-COS理论上保证与给定的层次树一致。此外,我们的框架根据类在层次树中的位置隐式调整学习能力——这是现有方法中缺乏的重要特性。最后,为解决评估指标的局限性,我们提出了HOPS,一种基于排名的指标,能够克服当前评估标准的缺陷。我们在四个具有挑战性的数据集上对Hier-COS进行了基准测试,包括深度且不平衡的tieredImageNet-H和iNaturalist-19。通过大量实验,我们证明Hier-COS在所有层次指标上均达到SOTA,同时在所有但一个数据集上优于顶级准确率。最后,我们展示了Hier-COS能够有效学习将从预训练主干(ViT)中提取的冻结特征转换为层次意识特征,从而显著提高层次分类性能。
Summary / 总结
This paper addresses the limitations of traditional classifiers that treat all negative classes as equally incorrect, which fails in scenarios with a known semantic hierarchy. It introduces Hier-COS, a framework that makes deep features hierarchy-aware by composing orthogonal subspaces. Hier-COS is theoretically consistent with the hierarchy tree and adapts learning capacity based on class position. The paper also proposes HOPS, a ranking-based metric intended to overcome the shortcomings of existing hierarchical metrics such as MS and AHD. Experimental results on four datasets show that Hier-COS achieves state-of-the-art results on all hierarchical metrics and beats the top-1 accuracy in all but one case.
该论文针对传统分类器将所有标签视为独立的问题,这在存在语义层次结构的场景中是不合适的。它引入了Hier-COS框架,通过组成正交子空间使深层特征表示具有层次意识。在四个数据集上的实验表明,Hier-COS在所有层次化指标上均优于现有方法,并且通常还能提高top-1准确率。此外,作者还提出了HOPS,这是一种新的评估指标,能够更好地捕捉层次性能,进一步验证了Hier-COS的有效性。
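The idea that some mistakes are less wrong than others can be made concrete with a tree distance between the true and predicted class. The sketch below counts edges through the lowest common ancestor in a toy hierarchy. The `parent` map and class names are hypothetical, and this is a generic hierarchical distance for illustration only; it is not the paper's definition of MS, AHD, or HOPS.

```python
# Toy hierarchy given as child -> parent; the root has parent None.
parent = {
    "animal": None,
    "dog": "animal",
    "cat": "animal",
    "husky": "dog",
    "beagle": "dog",
    "siamese": "cat",
}


def ancestors(node):
    """Path from the node up to the root, starting with the node itself."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def tree_distance(a, b):
    """Number of edges between a and b through their lowest common ancestor."""
    pa, pb = ancestors(a), ancestors(b)
    common = set(pa) & set(pb)
    steps_a = next(i for i, n in enumerate(pa) if n in common)
    steps_b = next(i for i, n in enumerate(pb) if n in common)
    return steps_a + steps_b
```

Under this distance, predicting `beagle` for a `husky` (distance 2) is a milder error than predicting `siamese` (distance 4), which a flat top-1 accuracy cannot distinguish.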
KINESIS: Motion Imitation for Human Musculoskeletal Locomotion
Authors: Merkourios Simos, Alberto Silvio Chiappa, Alexander Mathis
Venue: ICRA
First: 2025-03-18T18:37:49+00:00 · Latest: 2026-02-23T17:30:07+00:00
Comments: Accepted to ICRA. Here we include an appendix
Abstract
How do humans move? Advances in reinforcement learning (RL) have produced impressive results in capturing human motion using physics-based humanoid control. However, torque-controlled humanoids fail to model key aspects of human motor control such as biomechanical joint constraints and non-linear and overactuated musculotendon control. We present KINESIS, a model-free motion imitation framework that tackles these challenges. KINESIS is trained on 1.8 hours of locomotion data and achieves strong motion imitation performance on unseen trajectories. Through a negative mining approach, KINESIS learns robust locomotion priors that we leverage to deploy the policy on several downstream tasks such as text-to-control, target point reaching, and football penalty kicks. Importantly, KINESIS learns to generate muscle activity patterns that correlate well with human EMG activity. We show that these results scale seamlessly across biomechanical model complexity, demonstrating control of up to 290 muscles. Overall, the physiological plausibility makes KINESIS a promising model for tackling challenging problems in human motor control. Code, videos and benchmarks are available at https://github.com/amathislab/Kinesis.
中文标题/摘要
标题:KINESIS:模仿人类肌肉骨骼运动的运动模仿
人类是如何移动的?强化学习(RL)的进步在使用基于物理的人形控制捕捉人类运动方面取得了令人印象深刻的成果。然而,力矩控制的人形机器人无法模拟人类运动控制的关键方面,如生物力学关节约束和非线性、过度驱动的肌肉腱控制。我们提出了KINESIS,一种无模型的运动模仿框架,解决了这些挑战。KINESIS在1.8小时的运动数据上进行训练,并在未见过的轨迹上实现了强大的运动模仿性能。通过负样本挖掘方法,KINESIS学习到稳健的运动先验,我们利用这些先验将策略部署到多个下游任务,如文本到控制、目标点到达和足球点球。重要的是,KINESIS学会了生成与人类EMG活动相关良好的肌肉活动模式。我们展示了这些结果在生物力学模型复杂性方面的无缝扩展,展示了多达290块肌肉的控制。总体而言,生理上的合理性使KINESIS成为解决人类运动控制中具有挑战性问题的有前途的模型。代码、视频和基准可在https://github.com/amathislab/Kinesis/获取。
Summary / 总结
KINESIS is a model-free motion imitation framework that addresses the limitations of torque-controlled humanoids in capturing human motor control aspects. Trained on 1.8 hours of locomotion data, KINESIS achieves strong motion imitation performance on unseen trajectories and learns robust locomotion priors. These priors enable KINESIS to perform various downstream tasks such as text-to-control, target point reaching, and football penalty kicks. Notably, KINESIS generates muscle activity patterns that correlate well with human EMG activity and can control up to 290 muscles across different biomechanical models, making it a promising tool for human motor control research.
KINESIS 是一种无需模型的运动模仿框架,解决了扭矩控制的人形机器人在捕捉人类运动时的局限性。通过 1.8 小时的数据训练,KINESIS 在未见过的轨迹上表现出强大的性能,并学习到稳健的运动先验,使其能够执行诸如文本到控制和足球点球等下游任务。值得注意的是,KINESIS 生成的肌肉活动模式与人类的 EMG 活动高度一致,展示了其在不同生物力学复杂性下的潜在应用价值,以应对人类运动控制中的挑战。
The Invisible Gorilla Effect in Out-of-distribution Detection
Authors: Harry Anthony, Ziyun Liang, Hermione Warr, Konstantinos Kamnitsas
Venue: CVPR 2026
First: 2026-02-23T17:24:18+00:00 · Latest: 2026-02-23T17:24:18+00:00
Comments: Accepted at CVPR 2026
Abstract
Deep Neural Networks achieve high performance in vision tasks by learning features from regions of interest (ROI) within images, but their performance degrades when deployed on out-of-distribution (OOD) data that differs from training data. This challenge has led to OOD detection methods that aim to identify and reject unreliable predictions. Although prior work shows that OOD detection performance varies by artefact type, the underlying causes remain underexplored. To this end, we identify a previously unreported bias in OOD detection: for hard-to-detect artefacts (near-OOD), detection performance typically improves when the artefact shares visual similarity (e.g. colour) with the model's ROI and drops when it does not - a phenomenon we term the Invisible Gorilla Effect. For example, in a skin lesion classifier with red lesion ROI, we show the method Mahalanobis Score achieves a 31.5% higher AUROC when detecting OOD red ink (similar to ROI) compared to black ink (dissimilar) annotations. We annotated artefacts by colour in 11,355 images from three public datasets (e.g. ISIC) and generated colour-swapped counterfactuals to rule out dataset bias. We then evaluated 40 OOD methods across 7 benchmarks and found significant performance drops for most methods when artefacts differed from the ROI. Our findings highlight an overlooked failure mode in OOD detection and provide guidance for more robust detectors. Code and annotations are available at: https://github.com/HarryAnthony/Invisible_Gorilla_Effect.
中文标题/摘要
标题:看不见的大猩猩效应在离分布检测中的影响
深度神经网络在视觉任务中通过学习图像中感兴趣区域(ROI)的特征来实现高性能,但在部署到与训练数据不同的离分布(OOD)数据时,其性能会下降。这一挑战导致了离分布检测方法的发展,旨在识别并拒绝不可靠的预测。尽管先前的研究表明,离分布检测性能受不同类型的伪影影响,但其背后的原因仍被忽视。为此,我们发现了一种在离分布检测中未被报道的偏差:对于难以检测的伪影(近似离分布),当伪影与模型的ROI在视觉上相似(例如颜色)时,检测性能通常会提高,而不相似时则会下降——我们将其称为看不见的大猩猩效应。例如,在一个皮肤病变分类器中,红病变ROI,我们发现Mahalanobis得分在检测离分布红色墨水(与ROI相似)时的AUROC比黑色墨水(不相似)高出31.5%。我们对来自三个公开数据集(例如ISIC)的11,355张图像中的伪影按颜色进行了标注,并生成了颜色交换的反事实数据以排除数据集偏差。然后,我们在7个基准上评估了40种离分布方法,发现大多数方法在伪影与ROI不同时性能显著下降。我们的研究结果突显了离分布检测中一个被忽视的失败模式,并为更稳健的检测器提供了指导。代码和标注可在:https://github.com/HarryAnthony/Invisible_Gorilla_Effect/ 获取。
Summary / 总结
The study investigates the Invisible Gorilla Effect in out-of-distribution (OOD) detection: for hard-to-detect (near-OOD) artefacts, detection performance typically improves when the artefact is visually similar (e.g. in colour) to the model's region of interest (ROI) and drops when it is not. The authors annotated artefacts by colour in 11,355 images from three public datasets, generated colour-swapped counterfactuals to rule out dataset bias, and evaluated 40 OOD methods across seven benchmarks, finding significant performance drops for most methods when artefacts differed from the ROI. For example, in a skin lesion classifier with a red lesion ROI, Mahalanobis Score achieves a 31.5% higher AUROC on red ink than on black ink annotations. This finding highlights an overlooked failure mode in OOD detection and provides guidance for developing more robust detectors. Code and annotations are available at: https://github.com/HarryAnthony/Invisible_Gorilla_Effect.
研究探讨了在分布外(OOD)检测中看不见的大猩猩效应,即当异常特征与模型的兴趣区域(ROI)在视觉上相似时,OOD检测方法的性能会提高,而当它们不相似时则会下降。通过对三个公开数据集中的11,355张图像进行分析,并在七个基准上评估40种OOD方法,研究发现颜色相似性对异常特征与ROI之间的检测性能有显著影响,在皮肤病变分类器中,红墨水的AUROC比黑墨水高出31.5%。这一发现揭示了OOD检测中一个未被注意到的失败模式,并为开发更稳健的检测器提供了指导。代码和注释可在: https://github.com/HarryAnthony/Invisible_Gorilla_Effect 获取。
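For readers unfamiliar with the evaluation setup, the sketch below shows a simplified Mahalanobis-style OOD score and an AUROC computation in NumPy. It fits a single Gaussian to in-distribution features, whereas Mahalanobis-based detectors are commonly class-conditional with a shared covariance; that simplification, the function names, and any synthetic features fed to it are assumptions for illustration and do not reproduce the paper's experiments.

```python
import numpy as np


def fit_gaussian(features):
    """Estimate the mean and precision matrix of in-distribution features."""
    mean = features.mean(axis=0)
    cov = np.cov(features, rowvar=False) + 1e-6 * np.eye(features.shape[1])
    return mean, np.linalg.inv(cov)


def mahalanobis_score(x, mean, precision):
    """Squared Mahalanobis distance per row; larger means more OOD-like."""
    d = x - mean
    return np.einsum("ij,jk,ik->i", d, precision, d)


def auroc(id_scores, ood_scores):
    """Probability that a random OOD sample outscores a random ID sample."""
    diff = ood_scores[:, None] - id_scores[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())
```

Comparing the AUROC obtained on two artefact groups (for instance, artefacts similar to the ROI versus dissimilar ones) is the kind of per-group comparison the summary describes.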