MediX-R1: Open Ended Medical Reinforcement Learning
Authors: Sahal Shaji Mullappilly, Mohammed Irfan Kurpath, Omair Mohamed, Mohamed Zidan, Fahad Khan, Salman Khan, Rao Anwer, Hisham Cholakkal
First: 2026-02-26T18:59:46+00:00 · Latest: 2026-02-26T18:59:46+00:00
Abstract
We introduce MediX-R1, an open-ended Reinforcement Learning (RL) framework for medical multimodal large language models (MLLMs) that enables clinically grounded, free-form answers beyond multiple-choice formats. MediX-R1 fine-tunes a baseline vision-language backbone with Group Based RL and a composite reward tailored for medical reasoning: an LLM-based accuracy reward that judges semantic correctness with a strict YES/NO decision, a medical embedding-based semantic reward to capture paraphrases and terminology variants, and lightweight format and modality rewards that enforce interpretable reasoning and modality recognition. This multi-signal design provides stable, informative feedback for open-ended outputs where traditional verifiable or MCQ-only rewards fall short. To measure progress, we propose a unified evaluation framework for both text-only and image+text tasks that uses a Reference-based LLM-as-judge in place of brittle string-overlap metrics, capturing semantic correctness, reasoning, and contextual alignment. Despite using only $\sim51$K instruction examples, MediX-R1 achieves excellent results across standard medical LLM (text-only) and VLM (image + text) benchmarks, outperforming strong open-source baselines and delivering particularly large gains on open-ended clinical tasks. Our results demonstrate that open-ended RL with comprehensive reward signals and LLM-based evaluation is a practical path toward reliable medical reasoning in multimodal models. Our trained models, curated datasets and source code are available at https://medix.cvmbzuai.com
中文标题/摘要
标题:MediX-R1:开放式的医疗强化学习
我们介绍了MediX-R1,这是一种针对医疗多模态大型语言模型(MLLMs)的开放式强化学习(RL)框架,能够提供基于临床的、自由形式的答案,超越了多项选择格式。MediX-R1 使用基于组的RL对基础视觉-语言骨干进行微调,并结合了针对医疗推理定制的复合奖励:基于LLM的准确度奖励,用于判断语义正确性并做出严格的YES/NO决策;基于医疗嵌入的语义奖励,用于捕捉同义词和术语变体;以及轻量级的格式和模态奖励,以确保可解释的推理和模态识别。这种多信号设计为传统验证性或仅MCQ奖励无法提供稳定、信息丰富的反馈的开放式输出提供了支持。为了衡量进展,我们提出了一种统一的评估框架,用于文本和图像+文本任务,使用LLM作为法官替代脆弱的字符串重叠度量,以捕捉语义正确性、推理和上下文对齐。尽管仅使用约51,000条指令示例,MediX-R1 在标准的医疗LLM(仅文本)和VLM(图像+文本)基准测试中取得了优异的成绩,超越了强大的开源基线,并在开放式临床任务上取得了特别大的进步。我们的结果表明,使用全面的奖励信号和LLM评估的开放式RL是一种通往多模态模型中可靠医疗推理的实际路径。我们的训练模型、精选数据集和源代码可在https://medix.cvmbzuai.com 获取。
Summary / 总结
MediX-R1 is an open-ended RL framework for medical multimodal LLMs, fine-tuning a vision-language backbone with a composite reward that includes LLM-based accuracy, medical embedding-based semantic, and lightweight format and modality rewards. It uses a reference-based LLM evaluation to measure semantic correctness and contextual alignment. Despite using only about 51,000 instruction examples, MediX-R1 outperforms strong open-source baselines on both text-only and image+text medical benchmarks, especially on open-ended clinical tasks.
MediX-R1 是一个用于医疗 MLLMs 的开放域 RL 框架,能够生成自由形式的答案。它通过 Group Based RL 和复合奖励对视觉语言骨干进行微调,包括基于 LLM 的准确性奖励、基于医疗嵌入的语义奖励以及轻量级格式和模态奖励。该框架为开放域输出提供了稳定的反馈。MediX-R1 在标准医疗基准测试中表现出色,特别是在开放域临床任务上取得了显著进步。提出了一种基于参考的 LLM 作为评判者的统一评估框架来衡量进展。模型、数据集和源代码已在线发布。
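The composite reward described in the MediX-R1 abstract can be illustrated with a minimal sketch. The abstract names the four signals (LLM-judge accuracy, embedding-based semantic similarity, format, modality) but not how they are combined, so the weights, the `<think>`/`<answer>` template, and every function name below are assumptions made for illustration. The group-normalised advantage is a generic GRPO-style computation, assumed here as one plausible reading of "Group Based RL", not the authors' confirmed method.

```python
import math
import re

# Hypothetical weights: the abstract does not say how the four signals are combined.
WEIGHTS = {"accuracy": 0.6, "semantic": 0.2, "format": 0.1, "modality": 0.1}

# Hypothetical output template: reasoning in <think>, final answer in <answer>.
FORMAT_PATTERN = re.compile(
    r"\s*<think>.+?</think>\s*<answer>.+?</answer>\s*", re.DOTALL
)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def composite_reward(judge_yes, emb_pred, emb_ref, text, modality_pred, modality_true):
    """Weighted sum of the four reward signals named in the abstract."""
    accuracy = 1.0 if judge_yes else 0.0            # strict YES/NO from an LLM judge
    semantic = max(0.0, cosine(emb_pred, emb_ref))  # embedding similarity, clipped at 0
    fmt = 1.0 if FORMAT_PATTERN.fullmatch(text) else 0.0
    modality = 1.0 if modality_pred == modality_true else 0.0
    return (
        WEIGHTS["accuracy"] * accuracy
        + WEIGHTS["semantic"] * semantic
        + WEIGHTS["format"] * fmt
        + WEIGHTS["modality"] * modality
    )


def group_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of sampled answers to the same prompt."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]
```

In this sketch the judge verdict and the embeddings are passed in as plain values; in practice they would come from an LLM judge and a medical text encoder, neither of which is specified in the abstract.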
Joint Optimization for 4D Human-Scene Reconstruction in the Wild
Authors: Zhizheng Liu, Joe Lin, Wayne Wu, Bolei Zhou
First: 2025-01-04T01:53:51+00:00 · Latest: 2026-02-26T18:59:39+00:00
Comments: Project Page: https://vail-ucla.github.io/JOSH/
Abstract
Reconstructing human motion and its surrounding environment is crucial for understanding human-scene interaction and predicting human movements in the scene. While much progress has been made in capturing human-scene interaction in constrained environments, prior methods struggle to reconstruct the natural and diverse human motion and scene context found in web videos. In this work, we propose JOSH, a novel optimization-based method for 4D human-scene reconstruction in the wild from monocular videos. JOSH uses techniques from both dense scene reconstruction and human mesh recovery as initialization, and then leverages human-scene contact constraints to jointly optimize the scene, the camera poses, and the human motion. Experimental results show that JOSH achieves better results on both global human motion estimation and dense scene reconstruction through joint optimization of scene geometry and human motion. We further design a more efficient model, JOSH3R, and train it directly with pseudo-labels from web videos. JOSH3R outperforms other optimization-free methods despite being trained only with labels predicted by JOSH, further demonstrating its accuracy and generalization ability.
中文标题/摘要
标题:野外单目视频中4D人体-场景重建的联合优化
重建人体运动及其周围环境对于理解人体-场景交互和预测场景中的人体运动至关重要。尽管在受限环境中捕捉人体-场景交互方面取得了很大进展,但先前的方法难以从网络视频中重建自然多样的人体运动和场景上下文。在本文中,我们提出了一种名为JOSH的新颖优化方法,用于从单目视频中在野外进行4D人体-场景重建。JOSH利用密集场景重建和人体网格恢复技术进行初始化,然后利用人体-场景接触约束联合优化场景、相机姿态和人体运动。实验结果表明,JOSH通过联合优化场景几何和人体运动,在全局人体运动估计和密集场景重建方面取得了更好的结果。我们进一步设计了一个更高效的模型JOSH3R,并直接用从JOSH预测的伪标签对其进行训练。JOSH3R仅通过使用JOSH预测的标签进行训练,就优于其他无优化方法,进一步证明了其准确性和泛化能力。
Summary / 总结
The research aims to reconstruct 4D human motion and its surrounding environment from monocular web videos to better understand human-scene interaction. JOSH, a novel optimization-based method, initializes with dense scene reconstruction and human mesh recovery, then jointly optimizes the scene, camera poses, and human motion using human-scene contact constraints. The method achieves better results in global human motion estimation and dense scene reconstruction than previous methods. JOSH3R, a more efficient variant, is trained directly on pseudo-labels that JOSH predicts from web videos and outperforms other optimization-free methods.
研究旨在通过单目视频重建自然环境中的人体运动及其周围环境。JOSH 是一种新颖的优化方法,通过密集场景重建和人体网格恢复进行初始化,然后利用人体与场景的接触约束联合优化场景、相机姿态和人体运动。该方法在全局人体运动估计和密集场景重建方面取得了比以往方法更好的结果。JOSH3R 是一种更高效的变体,通过使用来自 JOSH 的伪标签进行训练,进一步提高了性能,优于其他无优化方法。
VGG-T$^3$: Offline Feed-Forward 3D Reconstruction at Scale
Authors: Sven Elflein, Ruilong Li, Sérgio Agostinho, Zan Gojcic, Laura Leal-Taixé, Qunjie Zhou, Aljosa Osep
Venue: CVPR 2026
First: 2026-02-26T18:59:33+00:00 · Latest: 2026-02-26T18:59:33+00:00
Comments: CVPR 2026, Project page: https://research.nvidia.com/labs/dvl/projects/vgg-ttt
Abstract
We present a scalable 3D reconstruction model that addresses a critical limitation in offline feed-forward methods: their computational and memory requirements grow quadratically w.r.t. the number of input images. Our approach is built on the key insight that this bottleneck stems from the varying-length Key-Value (KV) space representation of scene geometry, which we distill into a fixed-size Multi-Layer Perceptron (MLP) via test-time training. VGG-T$^3$ (Visual Geometry Grounded Test Time Training) scales linearly w.r.t. the number of input views, similar to online models, and reconstructs a $1k$ image collection in just $54$ seconds, achieving an $11.6\times$ speed-up over baselines that rely on softmax attention. Since our method retains global scene aggregation capability, it outperforms other linear-time methods by large margins in point map reconstruction error. Finally, we demonstrate visual localization capabilities of our model by querying the scene representation with unseen images.
中文标题/摘要
标题:VGG-T$^3$:大规模离线前馈3D重建
我们提出了一种可扩展的3D重建模型,解决了离线前馈方法的关键限制:其计算和内存需求随输入图像数量的平方增长。我们的方法基于这样一个关键洞察:这一瓶颈源于场景几何的可变长度键值(KV)空间表示,我们通过测试时训练将其提炼为固定大小的多层感知机(MLP)。VGG-T$^3$(视觉几何测试时训练)与输入视图数量成线性增长,类似于在线模型,并在54秒内重建了1000张图像的集合,比依赖于softmax注意力的基线方法快11.6倍。由于我们的方法保留了全局场景聚合能力,我们的点云重建误差显著优于其他线性时间方法。最后,我们通过使用未见过的图像查询场景表示,展示了我们模型的视觉定位能力。
Summary / 总结
VGG-T$^3$ addresses the computational and memory limitations of offline feed-forward 3D reconstruction methods by converting the varying-length Key-Value space representation into a fixed-size Multi-Layer Perceptron through test-time training. This approach scales linearly with the number of input views, similar to online models, and reconstructs a 1k image collection in 54 seconds, achieving an 11.6x speed-up over baselines. The method outperforms other linear-time methods in point map reconstruction error and demonstrates visual localization capabilities with unseen images.
VGG-T$^3$通过在测试时训练将场景几何的可变长度Key-Value空间表示转换为固定大小的多层感知机,解决了离线前馈3D重建方法的计算和内存限制问题。该方法按输入视图的数量线性扩展,类似于在线模型,并在54秒内重建1k图像集合,比依赖于softmax注意力的基线方法快11.6倍。该方法在点云重建误差上优于其他线性时间方法,并通过未见过的图像查询场景表示展示了视觉定位能力。
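The figures quoted for VGG-T$^3$ can be unpacked with a short piece of arithmetic. The sketch below only restates numbers from the abstract (54 seconds, 11.6x speed-up) and uses an idealised cost model that ignores constant factors and hardware effects, so the growth ratios are illustrative rather than measured. The function names are hypothetical.

```python
def implied_baseline_seconds(method_seconds, speedup):
    """Runtime the baseline would need if the reported speed-up holds exactly."""
    return method_seconds * speedup


def cost_growth(scale_factor, exponent):
    """Idealised growth in cost when the number of input images is multiplied.

    exponent=2 models quadratic (softmax-attention) scaling,
    exponent=1 models linear scaling.
    """
    return scale_factor ** exponent


# 54 s at an 11.6x speed-up implies a baseline of about 626 s (roughly 10.4 minutes).
baseline = implied_baseline_seconds(54, 11.6)

# Doubling the image count: 4x cost under quadratic scaling, 2x under linear scaling.
quadratic = cost_growth(2, 2)
linear = cost_growth(2, 1)
```

Under this simplified model the gap between the two families widens as the collection grows, which is consistent with the abstract's emphasis on large image sets.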
SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation
Authors: Vaibhav Agrawal, Rishubh Parihar, Pradhaan Bhat, Ravi Kiran Sarvadevabhatla, R. Venkatesh Babu
Venue: CVPR 2026
First: 2026-02-26T18:59:05+00:00 · Latest: 2026-02-26T18:59:05+00:00
Comments: Project page: https://seethrough3d.github.io. Accepted at CVPR 2026
Abstract
We identify occlusion reasoning as a fundamental yet overlooked aspect of 3D layout-conditioned generation. It is essential for synthesizing partially occluded objects with depth-consistent geometry and scale. While existing methods can generate realistic scenes that follow input layouts, they often fail to model precise inter-object occlusions. We propose SeeThrough3D, a model for 3D layout-conditioned generation that explicitly models occlusions. We introduce an occlusion-aware 3D scene representation (OSCR), where objects are depicted as translucent 3D boxes placed within a virtual environment and rendered from the desired camera viewpoint. The transparency encodes hidden object regions, enabling the model to reason about occlusions, while the rendered viewpoint provides explicit camera control during generation. We condition a pretrained flow-based text-to-image generation model by introducing a set of visual tokens derived from our rendered 3D representation. Furthermore, we apply masked self-attention to accurately bind each object bounding box to its corresponding textual description, enabling accurate generation of multiple objects without object attribute mixing. To train the model, we construct a synthetic dataset of diverse multi-object scenes with strong inter-object occlusions. SeeThrough3D generalizes effectively to unseen object categories and enables precise 3D layout control with realistic occlusions and consistent camera control.
中文标题/摘要
标题:SeeThrough3D:基于遮挡感知的3D控制在文本到图像生成中的应用
我们识别出遮挡推理是3D布局条件生成中一个基本但被忽视的方面。它对于合成部分遮挡的物体并保持深度一致的几何结构和比例至关重要。尽管现有方法可以生成遵循输入布局的逼真场景,但它们往往无法准确建模物体间的遮挡关系。我们提出SeeThrough3D,一种基于3D布局条件生成的模型,明确建模遮挡。我们引入了一种遮挡感知的3D场景表示(OSCR),其中物体以透明的3D盒子形式置于虚拟环境中,并从期望的摄像机视角进行渲染。透明度编码隐藏的物体区域,使模型能够推理遮挡关系,而渲染的视角则在生成过程中提供明确的摄像机控制。我们通过引入从我们渲染的3D表示中提取的一组视觉标记,对预训练的基于流的文本到图像图像生成模型进行条件化。此外,我们应用掩码自注意力机制,准确地将每个物体边界框与其相应的文本描述绑定,从而实现多个物体的准确生成,而不会出现物体属性混杂。为了训练模型,我们构建了一个包含多种多物体场景的合成数据集,这些场景具有强烈的物体间遮挡。SeeThrough3D能够有效泛化到未见过的物体类别,并实现具有真实遮挡和一致摄像机控制的精确3D布局控制。
Summary / 总结
The research aims to address the issue of occlusion reasoning in text-to-image generation, which is crucial for creating scenes with depth-consistent geometry and scale. The proposed SeeThrough3D model introduces an occlusion-aware 3D scene representation (OSCR) that uses translucent 3D boxes to model objects and their occlusions. This allows the model to reason about occlusions and control the camera viewpoint during generation. Key experimental findings show that SeeThrough3D effectively handles unseen object categories and produces realistic scenes with precise 3D layout control and consistent camera settings.
研究旨在解决文本到图像生成中的遮挡推理问题,这对于创建具有深度一致几何和比例的场景至关重要。提出的SeeThrough3D模型引入了一种遮挡感知的3D场景表示(OSCR),使用半透明的3D盒子来建模物体及其遮挡。这使模型能够推理遮挡并在生成过程中控制相机视角。实验结果表明,SeeThrough3D能够处理未见过的物体类别,并生成具有精确3D布局控制和一致相机设置的逼真场景。
Sensor Generalization for Adaptive Sensing in Event-based Object Detection via Joint Distribution Training
Authors: Aheli Saha, René Schuster, Didier Stricker
First: 2026-02-26T18:57:52+00:00 · Latest: 2026-02-26T18:57:52+00:00
Comments: 12 pages, International Conference on Pattern Recognition Applications and Methods
Abstract
Bio-inspired event cameras have recently attracted significant research due to their asynchronous and low-latency capabilities. These features provide a high dynamic range and significantly reduce motion blur. However, because of the novelty in the nature of their output signals, there is a gap in the variability of available data and a lack of extensive analysis of the parameters characterizing their signals. This paper addresses these issues by providing readers with an in-depth understanding of how intrinsic parameters affect the performance of a model trained on event data, specifically for object detection. We also use our findings to expand the capabilities of the downstream model towards sensor-agnostic robustness.
中文标题/摘要
标题:事件驱动对象检测中基于事件的传感器泛化联合分布训练
受生物启发的事件摄像头由于其异步和低延迟特性,最近吸引了大量研究。这些特性提供了高动态范围并显著减少了运动模糊。然而,由于其输出信号的新型性质,可用数据的变异性存在差距,且对其信号特征参数的广泛分析也相对缺乏。本文通过提供对内在参数如何影响基于事件数据训练的模型性能的深入理解,解决了这些问题,特别是针对对象检测的应用。我们还利用研究结果扩展了下游模型的鲁棒性,使其具有传感器无关性。
Summary / 总结
This paper aims to enhance the robustness of object detection models trained on event data from bio-inspired event cameras. The authors investigate how intrinsic parameters affect model performance and propose a method for joint distribution training to improve sensor generalization. Key findings show that by understanding and adjusting these parameters, the model can become more adaptable to different sensors, thus achieving sensor-agnostic robustness in object detection tasks.
本文旨在通过生物启发的事件相机的事件数据训练目标检测模型,增强其鲁棒性。作者研究了内在参数如何影响模型性能,并提出了一种联合分布训练方法以提高传感器通用性。主要发现表明,通过理解和调整这些参数,模型可以在不同传感器上实现目标检测任务的传感器通用性。
Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning
Authors: Amita Kamath, Jack Hessel, Khyathi Chandu, Jena D. Hwang, Kai-Wei Chang, Ranjay Krishna
First: 2026-02-26T18:54:06+00:00 · Latest: 2026-02-26T18:54:06+00:00
Comments: TACL 2026
Abstract
The lack of reasoning capabilities in Vision-Language Models (VLMs) has remained at the forefront of research discourse. We posit that this behavior stems from a reporting bias in their training data. That is, how people communicate about visual content by default omits tacit information needed to supervise some types of reasoning; e.g., "at the game today!" is a more likely caption than "a photo of 37 people standing behind a field". We investigate the data underlying the popular VLMs OpenCLIP, LLaVA-1.5 and Molmo through the lens of theories from pragmatics, and find that reporting bias results in insufficient representation of four reasoning skills (spatial, temporal, negation, and counting), despite the corpora being web-scale and/or synthetically generated. With a set of curated benchmarks, we demonstrate that: (i) VLMs perform poorly on the aforementioned types of reasoning suppressed in the training data by reporting bias; (ii) contrary to popular belief, scaling data size, model size, or the number of languages does not result in emergence of these skills by default; but, promisingly, (iii) incorporating annotations specifically collected to obtain tacit information is effective. Our findings highlight the need for more intentional training data curation methods, rather than counting on scale for emergence of reasoning capabilities.
中文标题/摘要
标题:规模无法克服语用学:报告偏差对视觉语言推理的影响
视觉语言模型(VLMs)缺乏推理能力的问题一直是研究讨论的焦点。我们认为这种行为源于其训练数据中的报告偏差。即,人们默认描述视觉内容时会省略一些必要的隐含信息,以监督某些类型的推理;例如,“今天在比赛!”比“一张37个人站在田野后面的图片”更常见。我们通过语用学理论的视角,研究了流行的VLMs OpenCLIP、LLaVA-1.5和Molmo的数据,发现报告偏差导致在四个推理技能(空间、时间、否定和计数)上缺乏充分的表示,尽管这些语料库是大规模的,或者合成生成的。通过一组精心策划的基准测试,我们证明:(i) VLMs在由报告偏差抑制的上述类型推理上表现不佳;(ii) 与普遍认为的相反,增加数据量、模型规模和多语言训练并不会默认产生这些技能;但令人欣慰的是,(iii) 特别收集的用于获取隐含信息的注解是有效的。我们的研究结果强调了需要更故意的数据策划方法,而不是依赖规模来产生推理能力。
Summary / 总结
This study investigates the reasoning capabilities of Vision-Language Models (VLMs) and finds that their performance is limited by reporting bias in their training data. Despite web-scale and synthetic data, VLMs struggle with spatial, temporal, negation, and counting reasoning because captions omit tacit information. Scaling data size, model size, or the number of languages does not make these skills emerge by default. However, incorporating annotations collected specifically to capture tacit information is effective, suggesting the need for more intentional data curation methods.
研究探讨了报告偏见对视觉语言模型(VLMs)如OpenCLIP、LLaVA-1.5和Molmo推理能力的影响。通过使用语用学理论分析训练数据,研究发现报告偏见导致空间、时间、否定和计数推理技能的不足表示。尽管有大规模或合成数据,模型在这类推理上表现不佳。然而,结合特定的注解可以改善这些技能。研究强调了需要更故意的数据整理方法,而不是单纯依靠规模来发展推理能力。
FlashOptim: Optimizers for Memory Efficient Training
Authors: Jose Javier Gonzalez Ortiz, Abhay Gupta, Chris Renard, Davis Blalock
First: 2026-02-26T18:52:22+00:00 · Latest: 2026-02-26T18:52:22+00:00
Comments: Source code is available at https://github.com/databricks/flashoptim
Abstract
Standard mixed-precision training of neural networks requires many bytes of accelerator memory for each model parameter. These bytes reflect not just the parameter itself, but also its gradient and one or more optimizer state variables. With each of these values typically requiring 4 bytes, training even a 7 billion parameter model can be impractical for researchers with less than 100GB of accelerator memory.
We introduce FlashOptim, a suite of optimizations that reduces per-parameter memory by over 50% while preserving model quality and API compatibility. Our approach introduces two key techniques. First, we improve master weight splitting by finding and exploiting a tight bound on its quantization error. Second, we design companding functions that greatly reduce the error in 8-bit optimizer state quantization. Together with 16-bit gradients, these techniques reduce AdamW memory from 16 bytes to 7 bytes per parameter, or 5 bytes with gradient release. They also cut model checkpoint sizes by more than half.
Experiments with FlashOptim applied to SGD, AdamW, and Lion show no measurable quality degradation on any task from a collection of standard vision and language benchmarks, including Llama-3.1-8B finetuning.
中文标题/摘要
标题:FlashOptim:内存高效训练的优化器
标准的混合精度训练需要为每个模型参数分配大量加速器内存。这些字节不仅代表参数本身,还包括其梯度和一个或多个优化器状态变量。每个值通常需要4个字节,因此即使是70亿参数的模型,对于拥有不到100GB加速器内存的研究人员来说也可能不切实际。
我们引入了FlashOptim,这是一种优化套件,能够在保持模型质量和API兼容性的同时,将每个参数的内存减少超过50%。我们的方法引入了两种关键技术。首先,我们通过找到并利用其量化误差的紧界来改进主权重分割。其次,我们设计了压缩函数,极大地减少了8位优化器状态量化中的误差。结合16位梯度,这些技术将AdamW的内存从每个参数16字节减少到7字节,或者在释放梯度时减少到5字节。它们还使模型检查点的大小减少了超过一半。
在SGD、AdamW和Lion上应用FlashOptim的实验表明,在包括Llama-3.1-8B微调在内的标准视觉和语言基准任务中,没有任何可测量的质量下降。
Summary / 总结
FlashOptim is designed to reduce the memory footprint of neural network training by over 50% through techniques such as improved master weight splitting and companding functions for optimizer state quantization, while maintaining model quality and API compatibility. Experiments show that these optimizations reduce AdamW memory usage from 16 bytes to 7 bytes per parameter, and cut model checkpoint sizes by more than half without affecting performance on various benchmarks, including Llama-3.1-8B finetuning.
FlashOptim 通过改进主权重分割和优化器状态量化中的压缩函数等技术,将神经网络训练的内存占用减少超过50%,同时保持模型质量和API兼容性。实验表明,这些优化将AdamW 每个参数的内存使用量从16字节减少到7字节,并将模型检查点大小减少超过一半,而不会影响在各种基准上的性能,包括Llama-3.1-8B 微调。
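The per-parameter byte counts in the FlashOptim abstract translate directly into accelerator memory for a 7 billion parameter model, as the arithmetic below shows. The second half of the sketch illustrates why companding helps 8-bit quantization, using the classic mu-law curve; this is a generic, swapped-in example and not the companding functions designed in the paper, which the abstract does not specify. All function names are hypothetical.

```python
import math


def state_memory_gb(num_params, bytes_per_param):
    """Memory for weights, gradients and optimizer state only (no activations)."""
    return num_params * bytes_per_param / 1e9


PARAMS_7B = 7_000_000_000
standard_adamw = state_memory_gb(PARAMS_7B, 16)       # 112.0 GB
flashoptim_adamw = state_memory_gb(PARAMS_7B, 7)      # 49.0 GB
with_gradient_release = state_memory_gb(PARAMS_7B, 5) # 35.0 GB


def mu_law_quantize(x, mu=255.0, levels=127):
    """Compress x in [-1, 1] with mu-law, then round to a signed 8-bit code."""
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return round(y * levels)


def mu_law_dequantize(code, mu=255.0, levels=127):
    """Invert mu_law_quantize: expand the 8-bit code back to [-1, 1]."""
    y = code / levels
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def linear_roundtrip(x, levels=127):
    """Plain uniform 8-bit quantization of x in [-1, 1], for comparison."""
    return round(x * levels) / levels
```

The memory figures match the abstract's claim of a reduction of over 50% (7 of 16 bytes is about 44% of the original). For the quantization part, small-magnitude values such as 0.001 round to zero under uniform quantization, while a companded code keeps them distinguishable, which is the general motivation for companding optimizer states that span several orders of magnitude.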
Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?
Authors: Tilemachos Aravanis, Vladan Stojnić, Bill Psomas, Nikos Komodakis, Giorgos Tolias
First: 2026-02-26T18:45:33+00:00 · Latest: 2026-02-26T18:45:33+00:00
Abstract
Open-vocabulary segmentation (OVS) extends the zero-shot recognition capabilities of vision-language models (VLMs) to pixel-level prediction, enabling segmentation of arbitrary categories specified by text prompts. Despite recent progress, OVS lags behind fully supervised approaches due to two challenges: the coarse image-level supervision used to train VLMs and the semantic ambiguity of natural language. We address these limitations by introducing a few-shot setting that augments textual prompts with a support set of pixel-annotated images. Building on this, we propose a retrieval-augmented test-time adapter that learns a lightweight, per-image classifier by fusing textual and visual support features. Unlike prior methods relying on late, hand-crafted fusion, our approach performs learned, per-query fusion, achieving stronger synergy between modalities. The method supports continually expanding support sets, and applies to fine-grained tasks such as personalized segmentation. Experiments show that we significantly narrow the gap between zero-shot and supervised segmentation while preserving open-vocabulary ability.
中文标题/摘要
标题:检索与分割:少量示例足以弥合开放词汇分割中的监督缺口吗?
开放词汇分割(OVS)将视觉语言模型(VLM)的零样本识别能力扩展到像素级预测,使模型能够根据文本提示分割任意类别。尽管取得了进展,但由于使用粗粒度的图像级监督训练VLM以及自然语言的语义模糊性,OVS仍落后于完全监督的方法。我们通过引入一种少量样本设置,将文本提示与像素标注图像的支持集相结合,来解决这些限制。在此基础上,我们提出了一种检索增强的测试时适配器,通过融合文本和视觉支持特征学习一种轻量级的、针对每张图像的分类器。与依赖于后期手工融合的先前方法不同,我们的方法进行学习的、针对每个查询的融合,实现了模态之间的更强协同作用。该方法支持不断扩展的支持集,并适用于细粒度任务,如个性化分割。实验表明,我们显著缩小了零样本和监督分割之间的差距,同时保持了开放词汇的能力。
Summary / 总结
The paper addresses the limitations of open-vocabulary segmentation (OVS) by proposing a few-shot setting that combines textual prompts with pixel-annotated images. It introduces a retrieval-augmented test-time adapter that learns a lightweight classifier by fusing textual and visual support features, achieving better synergy between modalities compared to prior methods. Experiments demonstrate that this approach significantly reduces the gap between zero-shot and supervised segmentation while maintaining open-vocabulary capabilities.
本文通过提出结合文本提示和像素标注图像的少量样本设置,解决了开放词汇分割(OVS)的局限性。作者引入了一种检索增强的测试时适配器,通过融合文本和视觉支持特征来学习轻量级分类器,实现比先前方法更好的模态协同效应。实验表明,该方法显著缩小了零样本和监督分割之间的差距,同时保持了开放词汇的能力。该方法支持支持集的不断扩展,并适用于精细粒度的任务,如个性化分割。
LLM Novice Uplift on Dual-Use, In Silico Biology Tasks
Authors: Chen Bo Calvin Zhang, Christina Q. Knight, Nicholas Kruus, Jason Hausenloy, Pedro Medeiros, Nathaniel Li, Aiden Kim, Yury Orlovskiy, Coleman Breen, Bryce Cai, Jasper Götting, Andrew Bo Liu, Samira Nedungadi, Paula Rodriguez, Yannis Yiming He, Mohamed Shaaban, Zifan Wang, Seth Donoughe, Julian Michael
First: 2026-02-26T18:37:23+00:00 · Latest: 2026-02-26T18:37:23+00:00
Comments: 59 pages, 33 figures
Abstract
Large language models (LLMs) perform increasingly well on biology benchmarks, but it remains unclear whether they uplift novice users -- i.e., enable humans to perform better than with internet-only resources. This uncertainty is central to understanding both scientific acceleration and dual-use risk. We conducted a multi-model, multi-benchmark human uplift study comparing novices with LLM access versus internet-only access across eight biosecurity-relevant task sets. Participants worked on complex problems with ample time (up to 13 hours for the most involved tasks). We found that LLM access provided substantial uplift: novices with LLMs were 4.16 times more accurate than controls (95% CI [2.63, 6.87]). On four benchmarks with available expert baselines (internet-only), novices with LLMs outperformed experts on three of them. Perhaps surprisingly, standalone LLMs often exceeded LLM-assisted novices, indicating that users were not eliciting the strongest available contributions from the LLMs. Most participants (89.6%) reported little difficulty obtaining dual-use-relevant information despite safeguards. Overall, LLMs substantially uplift novices on biological tasks previously reserved for trained practitioners, underscoring the need for sustained, interactive uplift evaluations alongside traditional benchmarks.
中文标题/摘要
标题:LLM初学者在双重用途和计算生物学任务中的提升
大型语言模型(LLM)在生物学基准测试中的表现越来越出色,但尚不清楚它们是否能提升初学者的表现,即是否能使人比仅使用互联网资源时表现更好。这种不确定性对于理解科学加速和双重用途风险至关重要。我们进行了一个多模型、多基准的人类提升研究,比较了有LLM访问权限的初学者与仅有互联网访问权限的初学者在八个与生物安全相关的任务集上的表现。参与者在复杂问题上工作,有充足的时间(最复杂任务最多13小时)。我们发现,LLM访问提供了显著的提升:有LLM的初学者比对照组准确4.16倍(95% CI [2.63, 6.87])。在四个有专家基线的基准测试中(仅有互联网资源),有LLM的初学者在三个基准测试中表现优于专家。令人惊讶的是,独立的LLM往往超过了LLM辅助的初学者,表明用户没有从LLM中获得最强的可用贡献。大多数参与者(89.6%)报告称,尽管有保护措施,获取与双重用途相关的信息并不困难。总体而言,LLM显著提升了初学者在以前仅由训练有素的从业者完成的生物学任务上的表现,强调了需要在传统基准测试的同时进行持续的互动提升评估。
Summary / 总结
This study investigates whether large language models (LLMs) can help novice users perform better on biology tasks compared to using only internet resources. Across eight biosecurity-relevant task sets, participants with LLM access were 4.16 times more accurate than those without, and even standalone LLMs often outperformed LLM-assisted novices. Notably, novices could obtain dual-use-relevant information with ease, suggesting significant dual-use risk. The findings highlight the need for ongoing evaluations of LLM uplift in scientific tasks.
研究探讨了大型语言模型(LLMs)是否能帮助初学者在生物学任务上比仅使用互联网资源表现得更好。在八个生物安全相关的任务集中,有LLM访问权限的参与者比没有的参与者准确度高4.16倍,甚至独立的LLM有时也超过了LLM辅助的初学者。值得注意的是,初学者能够轻松获取与双重用途相关的信息,表明存在显著的双重用途风险。研究结果强调了在传统基准测试之外,持续进行LLM提升评估的重要性。
DropVLA: An Action-Level Backdoor Attack on Vision-Language-Action Models
Authors: Zonghuan Xu, Xiang Zheng, Xingjun Ma, Yu-Gang Jiang
First: 2025-10-13T02:45:48+00:00 · Latest: 2026-02-26T18:32:27+00:00
Comments: 8 pages, 6 tables, 3 figures. Under review
Abstract
Vision-Language-Action (VLA) models map multimodal perception and language instructions to executable robot actions, making them particularly vulnerable to behavioral backdoor manipulation: a hidden trigger introduced during training can induce unintended physical actions while nominal task performance remains intact. Prior work on VLA backdoors primarily studies untargeted attacks or task-level hijacking, leaving fine-grained control over individual actions largely unexplored. In this work, we present DropVLA, an action-level backdoor attack that forces a reusable action primitive (e.g., open_gripper) to execute at attacker-chosen decision points under a realistic pipeline-black-box setting with limited data-poisoning access, using a window-consistent relabeling scheme for chunked fine-tuning. On OpenVLA-7B evaluated with LIBERO, vision-only poisoning achieves 98.67%-99.83% attack success rate (ASR) with only 0.31% poisoned episodes while preserving 98.50%-99.17% clean-task retention, and successfully triggers the targeted action within 25 control steps at 500 Hz (0.05 s). Text-only triggers are unstable at low poisoning budgets, and combining text with vision provides no consistent ASR improvement over vision-only attacks. The backdoor remains robust to moderate trigger variations and transfers across evaluation suites (96.27%, 99.09%), whereas text-only largely fails (0.72%). We further validate physical-world feasibility on a 7-DoF Franka arm with pi0-fast, demonstrating non-trivial attack efficacy under camera-relative motion that induces image-plane trigger drift. These results reveal that VLA models can be covertly steered at the granularity of safety-critical actions with minimal poisoning and without observable degradation of nominal performance.
中文标题/摘要
标题:DropVLA:视觉-语言-行动模型中的行动级后门攻击
视觉-语言-行动(VLA)模型将多模态感知和语言指令映射为可执行的机器人动作,使其特别容易受到行为后门操纵:在训练期间引入的隐藏触发器可以在不影响名义任务性能的情况下诱导意外的物理动作。先前对VLA后门的研究主要集中在无目标攻击或任务级劫持上,而对个体动作的精细控制尚未得到充分探索。在本研究中,我们提出了DropVLA,这是一种行动级后门攻击,能够在有限的数据污染访问和现实的管道黑盒设置下,通过窗口一致的重新标记方案进行分块微调,迫使可重用的动作原语(例如,open_gripper)在攻击者选择的决策点执行。在使用LIBERO评估的OpenVLA-7B中,仅通过视觉污染即可实现98.67%-99.83%的攻击成功率(ASR),同时保持98.50%-99.17%的任务清洁保留率,并在25个控制步骤内(500 Hz,0.05秒)成功触发目标动作。仅文本触发在低污染预算下不稳定,结合文本与视觉并不能在视觉污染攻击上提供一致的ASR改进。后门对适度的触发器变化具有鲁棒性,并且可以在评估套件之间转移(96.27%,99.09%),而仅文本则大多失败(0.72%)。我们还在7自由度的Franka手臂上通过pi0-fast验证了物理世界的可行性,展示了在相机相对运动下诱导图像平面触发漂移的非平凡攻击效果。这些结果表明,VLA模型可以在最小的污染和无明显名义性能退化的情况下,以安全关键动作的粒度被隐蔽地引导。
Summary / 总结
DropVLA is an action-level backdoor attack on VLA models that forces a specific action primitive to execute at attacker-chosen points. Using a window-consistent relabeling scheme, the vision-only attack achieves a 98.67%-99.83% attack success rate with only 0.31% poisoned episodes, while preserving 98.50%-99.17% clean-task retention. Text-only triggers are unstable at low poisoning budgets, and combining text with vision gives no consistent improvement over vision-only attacks. The attack remains robust to moderate trigger variations and transfers across different evaluation suites. Physical-world experiments on a 7-DoF Franka arm show that the attack can be effective under camera-relative motion, demonstrating the covert manipulation of safety-critical actions.
DropVLA 是一种针对 VLA 模型的动作级后门攻击,能够在攻击者选择的点强制执行特定动作。通过窗口一致的重新标记方案,该攻击在极少量数据污染的情况下实现了 98.67%-99.83% 的高成功率,同时保持了任务性能。该攻击对适度的触发器变化具有鲁棒性,并且可以在不同的评估套件之间进行转移。物理世界实验在 7 自由度的 Franka 手臂上证实了其在现实条件下的有效性。
LinGuinE: Longitudinal Guidance Estimation for Volumetric Tumour Segmentation
Authors: Nadine Garibli, Mayank Patwari, Bence Csiba, Yi Wei, Kostantinos Sidiropoulos
First: 2025-06-06T13:52:33+00:00 · Latest: 2026-02-26T18:27:23+00:00
Comments: 10 pages, 2 figures
Abstract
Longitudinal volumetric tumour segmentation is critical for radiotherapy planning and response assessment, yet this problem is underexplored and most methods produce single-timepoint semantic masks, lack lesion correspondence, and offer limited radiologist control. We introduce LinGuinE (Longitudinal Guidance Estimation), a PyTorch framework that combines image registration and guided segmentation to deliver lesion-level tracking and volumetric masks across all scans in a longitudinal study from a single radiologist prompt. LinGuinE is temporally direction agnostic, requires no training on longitudinal data, and allows any registration and semi-automatic segmentation algorithm to be repurposed for the task. We evaluate various combinations of registration and segmentation algorithms within the framework. LinGuinE achieves state-of-the-art segmentation and tracking performance across four datasets with a total of 456 longitudinal studies. Tumour segmentation performance shows minimal degradation with increasing temporal separation. We conduct ablation studies to determine the impact of autoregression, pathology specific finetuning, and the use of real radiologist prompts. We release our code and substantial public benchmarking for longitudinal segmentation, facilitating future research.
中文标题/摘要
标题:LinGuinE: 长期体积肿瘤分割的纵向引导估计
长期体积肿瘤分割对于放射治疗计划和反应评估至关重要,但这一问题尚未得到充分探索,大多数方法仅生成单时点语义掩码,缺乏病灶对应关系,并且对放射科医生的控制有限。我们引入了LinGuinE(纵向引导估计),这是一种结合图像配准和引导分割的PyTorch框架,能够从单个放射科医生的提示中在纵向研究的所有扫描中提供病灶级跟踪和体积掩码。LinGuinE在时间方向上是无方向性的,无需在纵向数据上进行训练,并允许任何配准和半自动分割算法重新用于此任务。我们评估了框架内各种配准和分割算法的组合。LinGuinE在四个数据集的总共456个纵向研究中实现了最先进的分割和跟踪性能。肿瘤分割性能随时间分离度增加而最小化下降。我们进行了消融研究以确定自回归、病理特异性微调和使用真实放射科医生提示的影响。我们发布了我们的代码和大量的公共基准测试,促进未来的研究。
Summary / 总结
LinGuinE is a PyTorch framework designed for longitudinal volumetric tumour segmentation, addressing the limitations of existing methods by providing lesion-level tracking and volumetric masks across all scans in a longitudinal study. It combines image registration and guided segmentation, requiring no training on longitudinal data and allowing any registration and semi-automatic segmentation algorithm to be repurposed for the task. LinGuinE achieves state-of-the-art performance across four datasets with 456 longitudinal studies, showing minimal degradation in tumour segmentation performance with increasing temporal separation. Ablation studies were conducted to evaluate the impact of autoregression, pathology-specific fine-tuning, and the use of real radiologist prompts, and the code and benchmarking data are publicly released to facilitate future research.
LinGuinE 是一个 PyTorch 框架,用于纵向体积肿瘤分割,通过提供所有扫描中的病灶级跟踪和体积掩码来解决现有方法的局限性。该框架结合了图像注册和引导分割,无需对纵向数据进行训练,并允许任何注册和半自动分割算法重新用于此任务。LinGuinE 在四个数据集的 456 个纵向研究中实现了最先进的性能,随时间间隔增加,肿瘤分割性能的下降幅度很小。进行了消融研究以评估自回归、病理特异性微调和使用真实放射科医生提示的影响,并公开了代码和基准数据以促进未来的研究。
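The pairing of registration and guided segmentation in LinGuinE can be pictured as carrying one radiologist click from the prompted scan to every other scan in the study and segmenting around the carried-over point. The sketch below is a deliberately simplified illustration: it assumes each registration is available as a 4x4 affine matrix (real registrations are often deformable), and `segment` is a hypothetical stand-in for any prompt-driven segmentation model. None of these names come from the LinGuinE code.

```python
def apply_affine(matrix, point):
    """Map a 3D point through a 4x4 affine given as nested row-major lists."""
    x, y, z = point
    return tuple(
        row[0] * x + row[1] * y + row[2] * z + row[3] for row in matrix[:3]
    )


def propagate_prompt(click, transforms, segment):
    """Segment the same lesion on every scan from a single click.

    click      -- (x, y, z) prompt placed on the reference scan
    transforms -- dict: scan id -> affine from the reference scan to that scan
                  (earlier or later in time; direction does not matter)
    segment    -- hypothetical callable (scan_id, point) -> mask
    """
    return {
        scan_id: segment(scan_id, apply_affine(matrix, click))
        for scan_id, matrix in transforms.items()
    }


# Usage with a dummy segmenter that simply echoes the propagated prompt.
shift = [[1, 0, 0, 1], [0, 1, 0, 2], [0, 0, 1, 3], [0, 0, 0, 1]]
masks = propagate_prompt((10, 10, 10), {"followup": shift}, lambda sid, p: p)
```

Because the registration and the segmenter are both passed in, either component can be swapped without touching the propagation logic, which mirrors the abstract's statement that any registration and semi-automatic segmentation algorithm can be repurposed.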
Evaluating Zero-Shot and One-Shot Adaptation of Small Language Models in Leader-Follower Interaction
Authors: Rafael R. Baptista, André de Lima Salgado, Ricardo V. Godoy, Marcelo Becker, Thiago Boaventura, Gustavo J. G. Lahr
First: 2026-02-26T18:20:26+00:00 · Latest: 2026-02-26T18:20:26+00:00
Abstract
Leader-follower interaction is an important paradigm in human-robot interaction (HRI). Yet, assigning roles in real time remains challenging for resource-constrained mobile and assistive robots. While large language models (LLMs) have shown promise for natural communication, their size and latency limit on-device deployment. Small language models (SLMs) offer a potential alternative, but their effectiveness for role classification in HRI has not been systematically evaluated. In this paper, we present a benchmark of SLMs for leader-follower communication, introducing a novel dataset derived from a published database and augmented with synthetic samples to capture interaction-specific dynamics. We investigate two adaptation strategies: prompt engineering and fine-tuning, studied under zero-shot and one-shot interaction modes, compared with an untrained baseline. Experiments with Qwen2.5-0.5B reveal that zero-shot fine-tuning achieves robust classification performance (86.66% accuracy) while maintaining low latency (22.2 ms per sample), significantly outperforming baseline and prompt-engineered approaches. However, results also indicate a performance degradation in one-shot modes, where increased context length challenges the model's architectural capacity. These findings demonstrate that fine-tuned SLMs provide an effective solution for direct role assignment, while highlighting critical trade-offs between dialogue complexity and classification reliability on the edge.
Summary / 总结
The study evaluates the effectiveness of small language models (SLMs) for leader-follower role assignment in human-robot interaction (HRI), using a novel dataset and two adaptation strategies: prompt engineering and fine-tuning. Experiments with Qwen2.5-0.5B show that zero-shot fine-tuning achieves high accuracy (86.66%) and low latency (22.2 ms per sample), outperforming baseline and prompt-engineered approaches, but performance drops in one-shot modes due to increased context length challenges.
本文评估了小语言模型(SLMs)在领导者-跟随者互动中的有效性,重点研究了零样本和单样本适应策略。研究引入了一个新数据集,并比较了提示工程、微调和未训练基线。实验表明,零样本微调在Qwen2.5-0.5B上实现了高准确率(86.66%)和低延迟(每样本22.2毫秒),优于其他方法,但在单样本模式中由于上下文长度增加导致性能下降。
Evaluating the Diversity and Quality of LLM Generated Content
Authors: Alexander Shypula, Shuo Li, Botong Zhang, Vishakh Padmakumar, Kayo Yin, Osbert Bastani
First: 2025-04-16T23:02:23+00:00 · Latest: 2026-02-26T18:17:44+00:00
Comments: Published at COLM 2025
Abstract
Recent work suggests that preference-tuning techniques -- such as Reinforcement Learning from Human Feedback (RLHF) methods like PPO and GRPO, as well as alternatives like DPO -- reduce diversity, creating a dilemma given that these models are widely deployed in applications requiring varied outputs. We argue that diversity without consideration of quality has limited practical value. To address this issue, we introduce a framework for measuring effective semantic diversity -- diversity among outputs that meet quality thresholds -- which better reflects the practical utility of large language models (LLMs). Using open-ended tasks that require no human intervention, we find counterintuitive results: when using diversity metrics that do not explicitly consider quality, preference-tuned models -- particularly those trained via RL -- often produce outputs with lower diversity; however, these same preference-tuned models generate greater effective semantic diversity than supervised fine-tuned (SFT) or base models. Our analysis further shows another trend: while larger models may exhibit greater effective semantic diversity than smaller models, the smaller models are consistently more parameter-efficient at producing unique content within a fixed sampling budget. These findings have practical implications for applications that require diverse yet high-quality outputs, from creative assistance to synthetic data generation.
中文标题/摘要
标题:评估LLM生成内容的多样性和质量
近期研究表明,偏好调优技术(如PPO和GRPO等基于人类反馈的强化学习(RLHF)方法,以及DPO等替代方法)会降低多样性,而这些模型又被广泛部署于需要多样化输出的应用中,由此造成了两难困境。我们认为,不考虑质量的多样性实际价值有限。为解决这一问题,我们提出了一种衡量有效语义多样性的框架(即满足质量阈值的输出之间的多样性),这更好地反映了大型语言模型(LLM)的实际效用。通过不需要人类干预的开放式任务,我们发现了一些反直觉的结果:当使用不明确考虑质量的多样性指标时,偏好调优模型(尤其是通过RL训练的模型)往往生成多样性较低的输出;然而,这些偏好调优模型生成的有效语义多样性却高于监督微调(SFT)模型或基础模型。我们的分析还显示了另一种趋势:虽然较大的模型可能表现出比较小模型更高的有效语义多样性,但较小的模型在固定采样预算内生成独特内容方面始终更具参数效率。这些发现对需要多样化且高质量输出的应用具有实际意义,从创意辅助到合成数据生成。
Summary / 总结
This study evaluates the diversity and quality of content generated by large language models (LLMs) and introduces a framework for measuring effective semantic diversity, which considers both diversity and quality. Using open-ended tasks, the research finds that preference-tuned models, especially those trained via reinforcement learning, produce lower diversity when using standard diversity metrics but higher effective semantic diversity compared to supervised fine-tuned or base models. The study also reveals that smaller models are more parameter-efficient in generating unique content within a fixed budget.
研究评估了大型语言模型(LLMs)生成内容的多样性和质量,并引入了一个同时考虑多样性和质量的有效语义多样性测量框架。通过使用无需人工干预的开放任务,研究发现,尤其是通过强化学习训练的偏好调优模型,在传统指标下表现出较低的多样性,但与监督微调或基础模型相比,其有效语义多样性更高。此外,较小的模型在固定采样预算内生成独特内容方面更具参数效率。
Conformal Prediction with Corrupted Labels: Uncertain Imputation and Robust Re-weighting
Authors: Shai Feldman, Stephen Bates, Yaniv Romano
First: 2025-05-07T18:46:02+00:00 · Latest: 2026-02-26T18:16:20+00:00
Abstract
We introduce a framework for robust uncertainty quantification in situations where labeled training data are corrupted, through noisy or missing labels. We build on conformal prediction, a statistical tool for generating prediction sets that cover the test label with a pre-specified probability. The validity of conformal prediction, however, holds under the i.i.d assumption, which does not hold in our setting due to the corruptions in the data. To account for this distribution shift, the privileged conformal prediction (PCP) method proposed leveraging privileged information (PI) -- additional features available only during training -- to re-weight the data distribution, yielding valid prediction sets under the assumption that the weights are accurate. In this work, we analyze the robustness of PCP to inaccuracies in the weights. Our analysis indicates that PCP can still yield valid uncertainty estimates even when the weights are poorly estimated. Furthermore, we introduce uncertain imputation (UI), a new conformal method that does not rely on weight estimation. Instead, we impute corrupted labels in a way that preserves their uncertainty. Our approach is supported by theoretical guarantees and validated empirically on both synthetic and real benchmarks. Finally, we show that these techniques can be integrated into a triply robust framework, ensuring statistically valid predictions as long as at least one underlying method is valid.
中文标题/摘要
标题:带污染标签的共形预测:不确定插补与稳健重加权
我们提出了一种框架,用于在标记训练数据受到噪声或缺失标签污染的情况下,进行稳健的不确定性量化。我们基于共形预测,这是一种生成预测集的统计工具,这些预测集以预设的概率覆盖测试标签。然而,共形预测的有效性依赖于独立同分布假设,而在我们的设置中,由于数据中的污染,这一假设不成立。为了应对这种分布偏移,已有的特权共形预测(PCP)方法提出利用特权信息(PI,即仅在训练期间可用的额外特征)对数据分布重新加权,从而在假设权重准确的情况下生成有效的预测集。在本文中,我们分析了PCP对权重不准确的鲁棒性。我们的分析表明,即使权重估计得很差,PCP仍然可以生成有效的不确定性估计。此外,我们引入了一种新的共形方法:不确定插补(UI),这种方法不依赖于权重估计。相反,我们以保留标签不确定性的方式插补被污染的标签。我们的方法得到了理论保证,并在合成和真实基准上得到了实证验证。最后,我们展示了这些技术可以集成到三重稳健框架中,只要至少有一种基础方法有效,就可以确保统计上有效的预测。
Summary / 总结
This paper addresses robust uncertainty quantification when training labels are corrupted by noise or missingness. It builds on conformal prediction, a method for generating prediction sets with a pre-specified coverage probability, and on the existing privileged conformal prediction (PCP) method, which uses privileged information available only during training to re-weight the data under distribution shift. The study analyzes PCP's robustness to inaccurate weights and proposes uncertain imputation (UI), a new conformal method that does not rely on weight estimation. Both are backed by theoretical guarantees and empirical validation on synthetic and real benchmarks, and they can be combined into a triply robust framework that remains valid as long as at least one underlying method is valid.
该论文解决了训练数据被污染时机器学习模型稳健不确定性量化的问题。它基于共形预测(一种生成具有指定覆盖概率的预测集的方法),并建立在已有的特权共形预测(PCP)之上,后者在数据分布因污染而改变时对数据重新加权。作者分析了PCP在权重估计不准确时的鲁棒性,并提出了不确定插补(UI),这是一种新的共形预测方法,以保留不确定性的方式插补被污染的标签。理论保证和合成及真实基准上的实证验证支持这些方法,并且它们可以集成到三重稳健框架中。
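The coverage guarantee that this abstract builds on can be illustrated with plain split conformal prediction. The sketch below is a minimal, hypothetical example that assumes clean, exchangeable calibration data and absolute-residual scores. It is not the PCP or UI procedure from the paper, and the function name `split_conformal_interval` and the synthetic data are illustrative assumptions only.

```python
import math
import numpy as np

def split_conformal_interval(cal_pred, cal_true, test_pred, alpha=0.1):
    """Symmetric intervals with target coverage 1 - alpha (assumes exchangeable data)."""
    # Nonconformity scores: absolute residuals on the held-out calibration set.
    scores = np.sort(np.abs(np.asarray(cal_true) - np.asarray(cal_pred)))
    n = len(scores)
    # Finite-sample corrected rank: the ceil((n + 1)(1 - alpha))-th smallest score.
    k = math.ceil((n + 1) * (1 - alpha))
    q = scores[k - 1] if k <= n else np.inf  # too few calibration points -> infinite width
    test_pred = np.asarray(test_pred, dtype=float)
    return test_pred - q, test_pred + q

# Synthetic check: y = 2x + noise, with an assumed fitted predictor f(x) = 2x.
rng = np.random.default_rng(0)
x_cal, x_test = rng.uniform(0, 1, 500), rng.uniform(0, 1, 2000)
y_cal = 2 * x_cal + rng.normal(0, 0.3, 500)
y_test = 2 * x_test + rng.normal(0, 0.3, 2000)

lower, upper = split_conformal_interval(2 * x_cal, y_cal, 2 * x_test, alpha=0.1)
coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(f"empirical coverage: {coverage:.3f}")  # should land near the 0.9 target
```

When labels are corrupted, the calibration scores no longer come from the same distribution as the test labels, which is the gap that the re-weighting and imputation methods described in the abstract aim to close.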
ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding
Authors: Yiran Guan, Sifan Tu, Dingkang Liang, Linghao Zhu, Jianzhong Ju, Zhenbo Luo, Jian Luan, Yuliang Liu, Xiang Bai
Venue: ICLR 2026
First: 2026-02-26T18:10:41+00:00 · Latest: 2026-02-26T18:10:41+00:00
Comments: Accepted by ICLR 2026
Abstract
Omni-modal reasoning is essential for intelligent systems to understand and draw inferences from diverse data sources. While existing omni-modal large language models (OLLM) excel at perceiving diverse modalities, they lack the complex reasoning abilities of recent large reasoning models (LRM). However, enhancing the reasoning ability of OLLMs through additional training presents significant challenges, including the need for high-quality data, task-specific adaptation, and substantial computational costs. To address these limitations, we propose ThinkOmni, a training-free and data-free framework that lifts textual reasoning to omni-modal scenarios. ThinkOmni introduces two key components: 1) LRM-as-a-Guide, which leverages off-the-shelf LRMs to guide the OLLM decoding process; 2) Stepwise Contrastive Scaling, which adaptively balances perception and reasoning signals without manual hyperparameter tuning. Experiments on six multi-modal reasoning benchmarks demonstrate that ThinkOmni consistently delivers performance improvements, with main results achieving 70.2 on MathVista and 75.5 on MMAU. Overall, ThinkOmni offers a flexible and generalizable solution for omni-modal reasoning and provides new insights into the generalization and application of reasoning capabilities.
中文标题/摘要
标题:ThinkOmni:通过指导解码提升文本推理至全模态场景
全模态推理对于智能系统理解并从多种数据源中推断信息至关重要。虽然现有的全模态大型语言模型(OLLM)在感知多种模态方面表现出色,但它们缺乏近期大型推理模型(LRM)的复杂推理能力。然而,通过额外训练来增强OLLM的推理能力面临着重大挑战,包括高质量数据的需求、任务特定的适应以及巨大的计算成本。为了解决这些限制,我们提出了ThinkOmni,这是一种无需训练和数据的框架,将文本推理提升至全模态场景。ThinkOmni引入了两个关键组件:1)LRM-as-a-Guide,利用现成的LRM来指导OLLM的解码过程;2)逐步对比缩放,无需手动超参数调整即可适应性平衡感知和推理信号。在六个跨模态推理基准上的实验表明,ThinkOmni始终能够提供性能改进,主要结果在MathVista上达到70.2,在MMAU上达到75.5。总体而言,ThinkOmni提供了一种灵活且通用的全模态推理解决方案,并为推理能力的泛化和应用提供了新的见解。
Summary / 总结
ThinkOmni is a training-free and data-free framework that enhances the reasoning capabilities of omni-modal large language models (OLLMs) by leveraging off-the-shelf large reasoning models (LRMs) and a stepwise contrastive scaling mechanism. This approach improves performance on six multi-modal reasoning benchmarks, achieving 70.2 on MathVista and 75.5 on MMAU, demonstrating consistent performance gains in omni-modal reasoning scenarios.
ThinkOmni 是一个无需训练和数据的框架,通过利用现成的大型推理模型(LRM)和逐步对比缩放机制来增强全模态大型语言模型(OLLM)的推理能力。实验结果显示,ThinkOmni 在六个多模态推理基准上的表现持续提升,分别在 MathVista 和 MMAU 上达到 70.2 和 75.5 的成绩。
DRESS: A Continuous Framework for Structural Graph Refinement
Authors: Eduar Castrillo Velilla
First: 2026-02-24T12:18:42+00:00 · Latest: 2026-02-26T18:10:20+00:00
Abstract
The Weisfeiler-Lehman (WL) hierarchy is a cornerstone framework for graph isomorphism testing and structural analysis. However, scaling beyond 1-WL to 3-WL and higher requires tensor-based operations that scale as $\mathcal{O}(n^3)$ or $\mathcal{O}(n^4)$, making them computationally prohibitive for large graphs. In this paper, we start from the Original-DRESS equation (Castrillo, León, and Gómez, 2018) -- a parameter-free, continuous dynamical system on edges -- and show that it distinguishes the prism graph from $K_{3,3}$, a pair that 1-WL provably cannot separate. We then generalize it to Motif-DRESS, which replaces triangle neighborhoods with arbitrary structural motifs and converges to a unique fixed point under three sufficient conditions, and further to Generalized-DRESS, an abstract template parameterized by the choice of neighborhood operator, aggregation function and norm. Finally, we introduce $Δ$-DRESS, which runs DRESS on each node-deleted subgraph $G \setminus \{v\}$, connecting the framework to the Kelly--Ulam reconstruction conjecture. Both Motif-DRESS and $Δ$-DRESS empirically distinguish Strongly Regular Graphs (SRGs) -- such as the Rook and Shrikhande graphs -- that confound 3-WL. Our results establish the DRESS family as a highly scalable framework that empirically surpasses both 1-WL and 3-WL on well-known benchmark graphs, without the prohibitive $\mathcal{O}(n^4)$ computational cost.
中文标题/摘要
标题:DRESS:一种连续的结构图细化框架
魏斯费勒-莱曼(WL)层次结构是图同构测试和结构分析的核心框架。然而,从1-WL扩展到3-WL及以上需要基于张量的操作,其复杂度为$\mathcal{O}(n^3)$或$\mathcal{O}(n^4)$,这使得它们对于大规模图来说计算上不可行。在本文中,我们从原始DRESS方程(Castrillo, León, and Gómez, 2018)出发,它是一个定义在边上的无参数连续动力系统,并展示了它能够区分棱柱图和$K_{3,3}$,而1-WL已被证明无法区分这一对图。然后,我们将其推广为Motif-DRESS,用任意结构模体替换三角邻域,并在满足三个充分条件的情况下收敛到唯一的不动点;再进一步推广为Generalized-DRESS,这是一个抽象模板,由邻域算子、聚合函数和范数的选择来参数化。最后,我们引入了$Δ$-DRESS,它在每个节点删除子图$G \setminus \{v\}$上运行DRESS,将该框架与凯利-乌拉姆重构猜想联系起来。Motif-DRESS和$Δ$-DRESS在实验中都能够区分令3-WL失效的强正则图(SRGs),如Rook图(车图)和Shrikhande图。我们的结果表明,DRESS家族是一种高度可扩展的框架,在著名的基准图上,其性能在实验上超越了1-WL和3-WL,而无需$\mathcal{O}(n^4)$的高昂计算成本。
Summary / 总结
The research aims to address the computational challenges of higher-order Weisfeiler-Lehman (WL) methods by proposing a continuous framework called DRESS. DRESS starts from the Original-DRESS equation and is generalized to Motif-DRESS and Generalized-DRESS, which can handle arbitrary structural motifs and are parameterized by various neighborhood operators, aggregation functions, and norms. Experimental results show that both Motif-DRESS and Δ-DRESS can distinguish graphs like Strongly Regular Graphs (SRGs) that 3-WL fails to separate, and the framework empirically outperforms both 1-WL and 3-WL on benchmark graphs without the high computational cost of tensor-based operations.
论文提出了DRESS,一种连续的图结构分析框架,解决了更高阶Weisfeiler-Lehman (WL)方法的计算难题。DRESS从Original-DRESS方程出发,进一步推广为Motif-DRESS和Generalized-DRESS,可以处理任意的结构模体,并且通过选择邻域操作、聚合函数和范数进行参数化。Original-DRESS能够区分1-WL无法区分的棱柱图与$K_{3,3}$,Motif-DRESS和Δ-DRESS则在实验中区分了3-WL无法区分的强正则图(SRG)。该框架在基准图上的实验表现优于1-WL和3-WL,同时避免了高阶张量操作带来的高计算成本。
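The claim that 1-WL cannot separate the prism graph from $K_{3,3}$ can be checked directly: both graphs are 3-regular on six nodes, so colour refinement never splits the initial colour class. The sketch below is a minimal illustration with hypothetical helper names (`wl_histograms`, `from_edges`, `triangle_count`). It does not implement the DRESS equation; it only shows that a triangle-based signal, the kind of local structure the abstract associates with Original-DRESS, already tells the two graphs apart.

```python
from collections import Counter
from itertools import combinations

def from_edges(edges):
    """Build an undirected adjacency map from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj

def wl_histograms(g1, g2):
    """Run 1-WL colour refinement jointly on two graphs; return one colour histogram per graph."""
    # Refine on the disjoint union so colours are comparable across the two graphs.
    adj = {(tag, v): [(tag, u) for u in nbrs]
           for tag, g in enumerate((g1, g2)) for v, nbrs in g.items()}
    colors = {v: 0 for v in adj}
    for _ in range(len(adj)):
        sigs = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v]))) for v in adj}
        palette = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        refined = {v: palette[sigs[v]] for v in adj}
        if len(set(refined.values())) == len(set(colors.values())):
            break  # refinement only splits classes, so an equal count means a stable partition
        colors = refined
    return [Counter(c for (tag, _), c in colors.items() if tag == t) for t in (0, 1)]

def triangle_count(adj):
    """Count unordered node triples that are pairwise adjacent."""
    return sum(1 for a, b, c in combinations(sorted(adj), 3)
               if b in adj[a] and c in adj[a] and c in adj[b])

# Prism: two triangles (0,1,2) and (3,4,5) joined by three rungs.
prism = from_edges([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (0, 3), (1, 4), (2, 5)])
# K_{3,3}: complete bipartite graph between {0,1,2} and {3,4,5}.
k33 = from_edges([(a, b) for a in (0, 1, 2) for b in (3, 4, 5)])

h_prism, h_k33 = wl_histograms(prism, k33)
print(h_prism == h_k33)                            # True: 1-WL sees identical histograms
print(triangle_count(prism), triangle_count(k33))  # 2 0: triangles separate the pair
```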
Phase Transitions for Feature Learning in Neural Networks
Authors: Andrea Montanari, Zihao Wang
First: 2026-02-01T20:47:36+00:00 · Latest: 2026-02-26T18:06:09+00:00
Comments: 75 pages; 17 pdf figures; v2 is a minor revision of v1
Abstract
According to a popular viewpoint, neural networks learn from data by first identifying low-dimensional representations, and subsequently fitting the best model in this space. Recent works provide a formalization of this phenomenon when learning multi-index models. In this setting, we are given $n$ i.i.d. pairs $({\boldsymbol x}_i,y_i)$, where the covariate vectors ${\boldsymbol x}_i\in\mathbb{R}^d$ are isotropic, and responses $y_i$ only depend on ${\boldsymbol x}_i$ through a $k$-dimensional projection ${\boldsymbol Θ}_*^{\sf T}{\boldsymbol x}_i$. Feature learning amounts to learning the latent space spanned by ${\boldsymbol Θ}_*$.
In this context, we study the gradient descent dynamics of two-layer neural networks under the proportional asymptotics $n,d\to\infty$, $n/d\toδ$, while the dimension of the latent space $k$ and the number of hidden neurons $m$ are kept fixed. Earlier work establishes that feature learning via polynomial-time algorithms is possible if $δ> δ_{\text{alg}}$, for $δ_{\text{alg}}$ a threshold depending on the data distribution, and is impossible (within a certain class of algorithms) below $δ_{\text{alg}}$. Here we derive an analogous threshold $δ_{\text{NN}}$ for two-layer networks. Our characterization of $δ_{\text{NN}}$ opens the way to study the dependence of learning dynamics on the network architecture and training algorithm.
The threshold $δ_{\text{NN}}$ is determined by the following scenario. Training first visits points for which the gradient of the empirical risk is large and learns the directions spanned by these gradients. Then the gradient becomes smaller and the dynamics becomes dominated by negative directions of the Hessian. The threshold $δ_{\text{NN}}$ corresponds to a phase transition in the spectrum of the Hessian in this second phase.
Summary / 总结
The paper investigates phase transitions in feature learning with two-layer neural networks. It studies gradient descent dynamics under proportional asymptotics, where the number of samples $n$ and the dimension $d$ both tend to infinity with $n/d\toδ$, while the latent space dimension and the number of hidden neurons stay fixed. The key result is a threshold $δ_{\text{NN}}$ for feature learning, which corresponds to a phase transition in the spectrum of the Hessian during the second phase of training.
论文研究了在比例渐近情形下两层神经网络中的特征学习相变现象。它探讨了这些网络的梯度下降动态,并推导出一个特征学习的阈值$δ_{\text{NN}}$,类似于多项式时间算法中的阈值$δ_{\text{alg}}$。阈值$δ_{\text{NN}}$由训练过程中海森矩阵(Hessian)谱的相变决定,这一相变出现在训练的第二阶段:动态先学习大梯度所张成的方向,随后转为由海森矩阵的负方向主导。
PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions
Authors: Amith Ananthram, Elias Stengel-Eskin, Lorena A. Bradford, Julia Demarest, Adam Purvis, Keith Krut, Robert Stein, Rina Elster Pantalony, Mohit Bansal, Kathleen McKeown
Venue: ICLR 2026
First: 2025-10-21T20:30:20+00:00 · Latest: 2026-02-26T18:05:42+00:00
Comments: Accepted at ICLR 2026. 26 pages, 9 figures. Metric/benchmark available at https://github.com/amith-ananthram/posh
Abstract
While vision-language models (VLMs) have advanced into detailed image description, evaluation remains a challenge. Standard metrics (e.g. CIDEr, SPICE) were designed for short texts and tuned to recognize errors that are now uncommon, such as object misidentification. In contrast, long texts require sensitivity to attribute and relation attachments and scores that localize errors to particular text spans. In this work, we introduce PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors (e.g. mistakes in compositional understanding). PoSh is replicable, interpretable and a better proxy for human raters than existing metrics (including GPT4o-as-a-Judge). To validate PoSh, we introduce a challenging new dataset, DOCENT. This novel benchmark contains artwork, paired with expert-written references, and model-generated descriptions, augmented with granular and coarse judgments of their quality from art history students. Thus, DOCENT enables evaluating both detailed image description metrics and detailed image description itself in a challenging new domain. We show that PoSh achieves stronger correlations (+0.05 Spearman $ρ$) with the human judgments in DOCENT than the best open-weight alternatives, is robust to image type (using CapArena, an existing dataset of web imagery) and is a capable reward function, outperforming standard supervised fine-tuning. Then, using PoSh, we characterize the performance of open and closed models in describing the paintings, sketches and statues in DOCENT and find that foundation models struggle to achieve full, error-free coverage of images with rich scene dynamics, establishing a demanding new task to gauge VLM progress. Through both PoSh and DOCENT, we hope to enable advances in important areas such as assistive text generation.
中文标题/摘要
标题:PoSh:使用场景图引导LLM作为裁判进行详细图像描述
尽管视觉-语言模型(VLMs)在详细图像描述方面取得了进展,但评估仍是一个挑战。标准指标(如CIDEr、SPICE)是为短文本设计的,并且调整为识别现在已不常见的错误,例如物体识别错误。相比之下,长文本需要对属性和关系的敏感度以及能够定位特定文本段落错误的评分。在本文中,我们引入了PoSh,这是一种用于详细图像描述的指标,它使用场景图作为结构化的评分标准来引导LLM作为裁判,产生基于细粒度错误(如组合理解错误)的综合评分。PoSh是可复制的、可解释的,并且比现有指标(包括GPT4o作为裁判)更接近人类评分者。为了验证PoSh,我们引入了一个新的具有挑战性的数据集DOCENT。这个新的基准数据集包含艺术品,并配以专家撰写的参考文本和模型生成的描述,以及艺术史学生对它们质量的精细和粗略判断。因此,DOCENT使我们能够在一个新的具有挑战性的领域中评估详细图像描述指标和详细图像描述本身。我们展示了PoSh与DOCENT中的人类判断相比,具有更强的相关性(Spearman ρ +0.05),并且对图像类型具有鲁棒性(使用CapArena,一个现有的网络图像数据集),并且是一个有效的奖励函数,优于标准的监督微调。然后,使用PoSh,我们表征了开放和封闭模型在描述DOCENT中的绘画、素描和雕像的表现,并发现基础模型难以实现对具有丰富场景动态的图像的全面、无误的覆盖,从而确立了一个新的具有挑战性的任务来衡量VLM的进步。通过PoSh和DOCENT,我们希望促进重要领域如辅助文本生成的进步。
Summary / 总结
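PoSh is a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors. To validate it, the authors introduce DOCENT, a benchmark of artwork paired with expert-written references, model-generated descriptions, and granular and coarse quality judgments from art history students. PoSh correlates more strongly with the human judgments in DOCENT than the best open-weight alternatives (+0.05 Spearman $ρ$), is robust to image type, and serves as a reward function that outperforms standard supervised fine-tuning. Evaluations with PoSh show that foundation models struggle to achieve full, error-free coverage of images with rich scene dynamics.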
Towards Long-Form Spatio-Temporal Video Grounding
Authors: Xin Gu, Bing Fan, Jiali Yao, Zhipeng Zhang, Yan Huang, Cheng Han, Heng Fan, Libo Zhang
First: 2026-02-26T18:04:09+00:00 · Latest: 2026-02-26T18:04:09+00:00
Abstract
In real scenarios, videos can span several minutes or even hours. However, existing research on spatio-temporal video grounding (STVG), given a textual query, mainly focuses on localizing targets in short videos of tens of seconds, typically less than one minute, which limits real-world applications. In this paper, we explore Long-Form STVG (LF-STVG), which aims to locate targets in long-term videos. Compared with short videos, long-term videos contain much longer temporal spans and more irrelevant information, making it difficult for existing STVG methods that process all frames at once. To address this challenge, we propose an AutoRegressive Transformer architecture for LF-STVG, termed ART-STVG. Unlike conventional STVG methods that require the entire video sequence to make predictions at once, ART-STVG treats the video as streaming input and processes frames sequentially, enabling efficient handling of long videos. To model spatio-temporal context, we design spatial and temporal memory banks and apply them to the decoders. Since memories from different moments are not always relevant to the current frame, we introduce simple yet effective memory selection strategies to provide more relevant information to the decoders, significantly improving performance. Furthermore, instead of parallel spatial and temporal localization, we propose a cascaded spatio-temporal design that connects the spatial decoder to the temporal decoder, allowing fine-grained spatial cues to assist complex temporal localization in long videos. Experiments on newly extended LF-STVG datasets show that ART-STVG significantly outperforms state-of-the-art methods, while achieving competitive performance on conventional short-form STVG.
中文标题/摘要
标题:长时序时空视频定位研究
在实际场景中,视频可以持续几分钟甚至几小时。然而,现有的时空视频定位(STVG)研究,给定一个文本查询,主要集中在定位短视频(通常少于一分钟)中的目标,这限制了其在实际中的应用。本文探讨了长时序时空视频定位(LF-STVG),旨在定位长视频中的目标。与短视频相比,长视频包含更长的时间跨度和更多的无关信息,使得现有的处理所有帧的方法难以应对。为了解决这一挑战,我们提出了一种自回归变换器架构,称为ART-STVG。与传统的STVG方法需要一次性处理整个视频序列以进行预测不同,ART-STVG将视频视为流式输入,并按顺序处理帧,从而能够高效处理长视频。为了建模时空上下文,我们设计了空间和时间记忆库,并将其应用于解码器。由于不同时刻的记忆并不总是与当前帧相关,我们引入了简单而有效的记忆选择策略,为解码器提供更相关的信息,显著提高了性能。此外,我们提出了一种级联的时空设计,将空间解码器连接到时间解码器,而不是并行的空间和时间定位,允许细粒度的空间线索在长视频中辅助复杂的时序定位。在新扩展的LF-STVG数据集上的实验表明,ART-STVG显著优于现有方法,同时在传统的短时序STVG上实现了竞争力的性能。
Summary / 总结
This paper addresses the challenge of spatio-temporal video grounding (STVG) in long-form videos, which are typically ignored by existing methods focusing on short videos. The authors propose ART-STVG, an AutoRegressive Transformer architecture that processes videos frame by frame, making it suitable for long videos. ART-STVG uses spatial and temporal memory banks to model context and includes memory selection strategies to enhance relevance. Additionally, a cascaded spatio-temporal design connects spatial and temporal decoders to improve localization accuracy. Experiments show that ART-STVG outperforms existing methods on long-form datasets while maintaining competitive performance on short-form datasets.
本文针对长视频中的时空视频定位(STVG)问题,现有方法主要关注短视频,而长视频包含更长的时间跨度和更多无关信息。作者提出了一种自回归变换器(ART-STVG),逐帧处理视频,使用空间和时间记忆库建模上下文,并通过记忆选择策略提供相关的信息。时空设计通过连接空间解码器和时间解码器,提高长视频中的时间定位。实验表明,ART-STVG在长视频数据集上的表现优于现有方法,同时在短视频上的性能也具有竞争力。
PGVMS: A Prompt-Guided Unified Framework for Virtual Multiplex IHC Staining with Pathological Semantic Learning
Authors: Fuqiang Chen, Ranran Zhang, Wanming Hu, Deboch Eyob Abera, Yue Peng, Boyun Zheng, Yiwen Sun, Jing Cai, Wenjian Qin
Venue: IEEE Transactions on Medical Imaging, 2026
First: 2026-02-26T18:03:24+00:00 · Latest: 2026-02-26T18:03:24+00:00
Comments: Accepted by TMI
Abstract
Immunohistochemical (IHC) staining enables precise molecular profiling of protein expression, with over 200 clinically available antibody-based tests in modern pathology. However, comprehensive IHC analysis is frequently limited by insufficient tissue quantities in small biopsies. Therefore, virtual multiplex staining emerges as an innovative solution to digitally transform H&E images into multiple IHC representations, yet current methods still face three critical challenges: (1) inadequate semantic guidance for multi-staining, (2) inconsistent distribution of immunochemistry staining, and (3) spatial misalignment across different stain modalities. To overcome these limitations, we present a prompt-guided framework for virtual multiplex IHC staining using only uniplex training data (PGVMS). Our framework introduces three key innovations corresponding to each challenge: First, an adaptive prompt guidance mechanism employing a pathological visual language model dynamically adjusts staining prompts to resolve semantic guidance limitations (Challenge 1). Second, our protein-aware learning strategy (PALS) maintains precise protein expression patterns by direct quantification and constraint of protein distributions (Challenge 2). Third, the prototype-consistent learning strategy (PCLS) establishes cross-image semantic interaction to correct spatial misalignments (Challenge 3).
中文标题/摘要
标题:PGVMS:一种基于提示的统一框架,用于病理语义学习的虚拟多路复用IHC染色
免疫组化(IHC)染色能够精确地对蛋白质表达进行分子分析,在现代病理学中已有超过200种基于抗体的临床测试。然而,全面的IHC分析经常受限于小活检组织量不足。因此,虚拟多路复用染色作为一种创新解决方案,能够将HE图像数字化转换为多种IHC表示,但当前方法仍面临三个关键挑战:(1)多染色的不足语义指导,(2)免疫化学染色分布不一致,(3)不同染色模式之间的空间错位。为克服这些限制,我们提出了一种仅使用单路训练数据的基于提示的虚拟多路复用IHC染色框架(PGVMS)。我们的框架引入了三个关键创新,分别对应每个挑战:首先,一种自适应提示引导机制,利用病理视觉语言模型动态调整染色提示,以解决语义指导不足的问题(挑战1)。其次,我们的蛋白质感知学习策略(PALS)通过直接量化和约束蛋白质分布来保持精确的蛋白质表达模式(挑战2)。第三,原型一致学习策略(PCLS)建立了跨图像语义交互,以纠正空间错位(挑战3)。
Summary / 总结
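PGVMS is a prompt-guided framework for virtual multiplex IHC staining that digitally transforms H&E images into multiple IHC representations using only uniplex training data. It addresses three challenges in current methods: inadequate semantic guidance for multi-staining, inconsistent distribution of immunochemistry staining, and spatial misalignment across stain modalities. The corresponding innovations are an adaptive prompt guidance mechanism built on a pathological visual language model, a protein-aware learning strategy (PALS) that quantifies and constrains protein distributions, and a prototype-consistent learning strategy (PCLS) that uses cross-image semantic interaction to correct spatial misalignments.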
PGVMS 是一种使用单染训练数据的虚拟多染 IHC 染色的提示引导框架。它解决了三个主要问题:语义指导不足、染色分布不一致和空间错位。关键创新包括一种自适应提示引导机制、一种蛋白质感知学习策略和一种原型一致学习策略,这些共同提高了虚拟多染 IHC 染色的准确性和一致性。
LineGraph2Road: Structural Graph Reasoning on Line Graphs for Road Network Extraction
Authors: Zhengyang Wei, Renzhi Jing, Yiyi He, Jenny Suckale
First: 2026-02-26T18:02:44+00:00 · Latest: 2026-02-26T18:02:44+00:00
Abstract
The accurate and automatic extraction of roads from satellite imagery is critical for applications in navigation and urban planning, significantly reducing the need for manual annotation. Many existing methods decompose this task into keypoint extraction and connectedness prediction, but often struggle to capture long-range dependencies and complex topologies. Here, we propose LineGraph2Road, a framework that improves connectedness prediction by formulating it as binary classification over edges in a constructed global but sparse Euclidean graph, where nodes are keypoints extracted from segmentation masks and edges connect node pairs within a predefined distance threshold, representing potential road segments. To better learn structural link representation, we transform the original graph into its corresponding line graph and apply a Graph Transformer on it for connectedness prediction. This formulation overcomes the limitations of endpoint-embedding fusion on set-isomorphic links, enabling rich link representations and effective relational reasoning over the global structure. Additionally, we introduce an overpass/underpass head to resolve multi-level crossings and a coupled NMS strategy to preserve critical connections. We evaluate LineGraph2Road on three benchmarks: City-scale, SpaceNet, and Global-scale, and show that it achieves state-of-the-art results on two key metrics, TOPO-F1 and APLS. It also captures fine visual details critical for real-world deployment. We will make our code publicly available.
中文标题/摘要
标题:LineGraph2Road:基于线图的结构图推理在道路网络提取中的应用
从卫星图像中准确且自动地提取道路对于导航和城市规划应用至关重要,大大减少了手动标注的需求。许多现有方法将此任务分解为关键点提取和连通性预测,但往往难以捕捉长距离依赖性和复杂拓扑结构。在此,我们提出了一种名为LineGraph2Road的框架,通过将连通性预测形式化为在构建的全局但稀疏欧几里得图中对边进行二元分类来改进连通性预测,其中节点是从分割掩码中提取的关键点,边连接预定义距离阈值内的节点对,表示潜在的道路段。为了更好地学习结构链接表示,我们将原始图转换为其对应的线图,并在其上应用图变换器进行连通性预测。这种形式克服了端点嵌入融合在集同构链接上的局限性,使链接表示更加丰富,并且能够在全局结构上进行有效的关系推理。此外,我们引入了一个立交桥/地下通道头来解决多级交叉问题,并采用耦合非最大抑制策略来保留关键连接。我们在三个基准上评估了LineGraph2Road:城市规模、SpaceNet和全球规模,并展示了它在两个关键指标TOPO-F1和APLS上达到了最先进的结果。它还捕捉了对于实际部署至关重要的细视觉细节。我们将公开我们的代码。
Summary / 总结
LineGraph2Road is designed to improve the accuracy of road extraction from satellite imagery by addressing the limitations of existing methods in capturing long-range dependencies and complex topologies. It uses a global but sparse Euclidean graph to represent keypoints and potential road segments, transforming the graph into a line graph for connectedness prediction with a Graph Transformer. This approach enhances link representation and relational reasoning. Experimental results on City-scale, SpaceNet, and Global-scale benchmarks demonstrate superior performance in TOPO-F1 and APLS metrics, and it captures fine visual details crucial for real-world applications.
LineGraph2Road 是一种框架,通过将连通性预测形式化为在全局稀疏欧几里得图上对边进行二元分类,来改进从卫星图像中提取道路网络。该方法使用从分割掩码中提取的关键点作为节点,并在对应的线图上应用图变换器,以捕捉长距离依赖性和复杂拓扑结构。该方法在三个基准上的 TOPO-F1 和 APLS 指标上达到了最先进的结果,并捕捉到对实际部署至关重要的精细视觉细节。它还包括一个立交桥/地下通道头和一种耦合的非极大值抑制策略,用于处理多级交叉并保留关键连接。
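The graph construction described in this entry can be sketched in a few lines: connect keypoints that lie within a distance threshold, then convert that graph into its line graph so that each candidate road segment becomes a node. This is a minimal illustration with hypothetical coordinates and helper names (`threshold_graph`, `line_graph`). The Graph Transformer, the overpass/underpass head, and the coupled NMS step from the paper are not shown.

```python
import math
from itertools import combinations

def threshold_graph(keypoints, radius):
    """Candidate road segments: all keypoint pairs within `radius` of each other."""
    return [(i, j) for i, j in combinations(range(len(keypoints)), 2)
            if math.dist(keypoints[i], keypoints[j]) <= radius]

def line_graph(edges):
    """Each original edge becomes a node; two such nodes are linked if the edges share an endpoint."""
    return [(e, f) for e, f in combinations(edges, 2) if set(e) & set(f)]

# Hypothetical keypoints: a T-junction centred on keypoint 1.
keypoints = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (1.0, 1.0)]
edges = threshold_graph(keypoints, radius=1.0)
lg = line_graph(edges)

print(edges)    # [(0, 1), (1, 2), (1, 3)]
print(len(lg))  # 3: every pair of segments meets at keypoint 1
```

In this representation, classifying a line-graph node as road or non-road is the same as classifying an edge of the original graph, so a model operating on the line graph can attend directly to how neighbouring candidate segments relate to each other.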
SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables
Authors: Sungho Park, Jueun Kim, Wook-Shin Han
Venue: ICLR 2026
First: 2026-02-26T17:59:51+00:00 · Latest: 2026-02-26T17:59:51+00:00
Comments: 10 pages, 5 figures. Published as a conference paper at ICLR 2026. Project page: https://sparta-projectpage.github.io/
Abstract
Real-world Table-Text question answering (QA) tasks require models that can reason across long text and source tables, traversing multiple hops and executing complex operations such as aggregation. Yet existing benchmarks are small, manually curated - and therefore error-prone - and contain shallow questions that seldom demand more than two hops or invoke aggregations, grouping, or other advanced analytical operations expressible in natural-language queries. We present SPARTA, an end-to-end construction framework that automatically generates large-scale Table-Text QA benchmarks with lightweight human validation, requiring only one quarter of the annotation time of HybridQA. The framework first constructs a reference fact database by enriching each source table with grounding tables whose tuples are atomic facts automatically extracted from the accompanying unstructured passages, then synthesizes nested queries whose number of nested predicates matches the desired hop count. To ensure that every SQL statement is executable and that its verbalization yields a fluent, human-sounding question, we propose two novel techniques: provenance-based refinement, which rewrites any syntactically valid query that returns a non-empty result, and realistic-structure enforcement, which confines generation to post-order traversals of the query graph. The resulting pipeline produces thousands of high-fidelity question-answer pairs covering aggregations, grouping, and deep multi-hop reasoning across text and tables. On SPARTA, state-of-the-art models that reach over 70 F1 on HybridQA or over 50 F1 on OTT-QA drop by more than 30 F1 points, exposing fundamental weaknesses in current cross-modal reasoning. Our benchmark, construction code, and baseline models are available at https://github.com/pshlego/SPARTA/tree/main.
中文标题/摘要
标题:SPARTA:面向文本和表格的树状多跳问答的可扩展和原则性基准测试
现实世界中的表格-文本问答任务需要能够跨越长文本和源表格进行推理的模型,遍历多个跳转并执行复杂的操作,如聚合。然而,现有的基准数据集规模较小,由人工精心整理,因此容易出错,并且包含浅层次的问题,很少需要超过两个跳转或调用聚合、分组或其他高级分析操作。我们提出了SPARTA,这是一种端到端的构建框架,可以自动生成大规模的表格-文本问答基准数据集,只需轻量级的人工验证,所需注释时间仅为HybridQA的四分之一。该框架首先通过丰富每个源表格,添加与附带的无结构段落自动提取的元组对齐的表格,构建参考事实数据库,然后合成嵌套查询,其嵌套谓词的数量与所需的跳转次数相匹配。为了确保每个SQL语句可执行,并且其口头表达能产生流畅的人类语言问题,我们提出了两种新颖的技术:来源基于的细化,它重写任何返回非空结果的语法有效的查询,以及现实结构的强制执行,它限制生成在查询图的后序遍历中。由此产生的流水线生成了数千个高质量的问题-答案对,涵盖了聚合、分组和跨越文本和表格的深层多跳推理。在SPARTA上,达到HybridQA超过70 F1或OTT-QA超过50 F1的最先进的模型下降超过30 F1点,揭示了当前跨模态推理中的根本弱点。我们的基准测试、构建代码和基线模型可在https://github.com/pshlego/SPARTA/tree/main/获得。
Summary / 总结
SPARTA is an end-to-end framework that automatically generates large-scale Table-Text QA benchmarks with lightweight human validation, requiring only one quarter of the annotation time of HybridQA. It constructs a reference fact database by enriching tables with atomic facts from unstructured passages and synthesizes nested queries to ensure multi-hop reasoning. The resulting benchmark exposes weaknesses in current models, as state-of-the-art models drop by more than 30 F1 points on SPARTA compared to their performance on existing benchmarks like HybridQA and OTT-QA.
SPARTA 是一个针对文本和表格的树状多跳问答的可扩展且原则性的基准,通过自动生成大规模的问答对并进行轻量级的人工验证来解决现有基准的局限性。该方法包括从非结构化段落中丰富源表格的原子事实,并生成嵌套查询以匹配所需的跳数。关键发现表明,最先进的模型在 SPARTA 上表现不佳,F1 分数下降超过 30 个点,突显了改进跨模态推理能力的需求。
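The link between nesting depth and hop count can be made concrete with a small SQLite example. The sketch below is purely illustrative: the tables `players`, `team_facts`, and `city_facts` and all of their rows are hypothetical, standing in for a source table plus grounding tables of atomic facts extracted from text. It does not reproduce SPARTA's generation pipeline, provenance-based refinement, or realistic-structure enforcement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE players(name TEXT, team TEXT);       -- hypothetical source table
CREATE TABLE team_facts(team TEXT, city TEXT);    -- hypothetical facts extracted from passages
CREATE TABLE city_facts(city TEXT, country TEXT); -- hypothetical facts extracted from passages
INSERT INTO players VALUES ('Ann', 'Hawks'), ('Bo', 'Hawks'), ('Cy', 'Owls');
INSERT INTO team_facts VALUES ('Hawks', 'Springfield'), ('Owls', 'Shelbyville');
INSERT INTO city_facts VALUES ('Springfield', 'Freedonia'), ('Shelbyville', 'Sylvania');
""")

# Two nested predicates (two hops) plus an aggregation:
# "How many players are on teams based in a city located in Freedonia?"
query = """
SELECT COUNT(*) FROM players WHERE team IN (
    SELECT team FROM team_facts WHERE city IN (
        SELECT city FROM city_facts WHERE country = 'Freedonia'))
"""
count = conn.execute(query).fetchone()[0]
print(count)  # 2 (Ann and Bo, both on the Hawks)
```

Adding another level of nesting would add another hop, which is how a query generator could control question depth while keeping every question executable against the fact database.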
ODEBrain: Continuous-Time EEG Graph for Modeling Dynamic Brain Networks
Authors: Haohui Jia, Zheng Chen, Lingwei Zhu, Rikuto Kotoge, Jathurshan Pradeepkumar, Yasuko Matsubara, Jimeng Sun, Yasushi Sakurai, Takashi Matsubara
First: 2026-02-26T17:59:10+00:00 · Latest: 2026-02-26T17:59:10+00:00
Abstract
Modeling neural population dynamics is crucial for foundational neuroscientific research and various clinical applications. Conventional latent variable methods typically model continuous brain dynamics through discretizing time with recurrent architecture, which necessarily results in compounded cumulative prediction errors and failure of capturing instantaneous, nonlinear characteristics of EEGs. We propose ODEBRAIN, a Neural ODE latent dynamic forecasting framework to overcome these challenges by integrating spatio-temporal-frequency features into spectral graph nodes, followed by a Neural ODE modeling the continuous latent dynamics. Our design ensures that latent representations can capture stochastic variations of complex brain states at any given time point. Extensive experiments verify that ODEBRAIN can improve significantly over existing methods in forecasting EEG dynamics with enhanced robustness and generalization capabilities.
中文标题/摘要
标题:ODEBrain: 连续时间EEG图用于建模动态脑网络
建模神经群体动力学对于基础神经科学研究和各种临床应用至关重要。传统潜在变量方法通常通过使用循环架构离散化时间来建模连续的大脑动态,这不可避免地导致累积预测误差并无法捕捉EEGs的瞬时和非线性特征。我们提出了一种ODEBRAIN神经ODE潜在动态预测框架,通过将时空频特征整合到频谱图节点中,然后使用神经ODE建模连续的潜在动态来克服这些挑战。我们的设计确保潜在表示能够捕捉任何给定时间点复杂脑状态的随机变化。大量实验验证了与现有方法相比,ODEBRAIN在增强EEGs动力学预测的鲁棒性和泛化能力方面具有显著优势。
Summary / 总结
ODEBrain is designed to model neural population dynamics by integrating spatio-temporal-frequency features into spectral graph nodes and using a Neural ODE to capture continuous latent dynamics. This approach overcomes the limitations of conventional methods that use recurrent architectures, which can lead to cumulative prediction errors. Experimental results show that ODEBrain outperforms existing methods in forecasting EEG dynamics with better robustness and generalization capabilities.
ODEBrain 通过将时空频特征集成到谱图节点中,并使用 Neural ODE 来捕捉连续的潜在动态,旨在建模神经群体动力学。这种方法克服了传统方法使用离散时间的局限性,这些方法可能导致累积预测误差。关键的实验结果表明,ODEBrain 在预测 EEG 动态方面优于现有方法,具有更好的鲁棒性和泛化能力。
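To illustrate what it means for a latent state to follow continuous dynamics that can be queried at any time point, here is a minimal sketch with a fixed-step RK4 integrator. The function `toy_dynamics` is a hypothetical stand-in for a learned network (a damped rotation), and `rk4_integrate` is an illustrative helper; neither comes from the ODEBrain paper.

```python
import math

def rk4_integrate(f, z0, t0, t1, steps=100):
    """Integrate dz/dt = f(t, z) from t0 to t1 with classic fixed-step RK4."""
    h = (t1 - t0) / steps
    t, z = t0, list(z0)
    for _ in range(steps):
        k1 = f(t, z)
        k2 = f(t + h / 2, [zi + h / 2 * ki for zi, ki in zip(z, k1)])
        k3 = f(t + h / 2, [zi + h / 2 * ki for zi, ki in zip(z, k2)])
        k4 = f(t + h, [zi + h * ki for zi, ki in zip(z, k3)])
        z = [zi + h / 6 * (a + 2 * b + 2 * c + d)
             for zi, a, b, c, d in zip(z, k1, k2, k3, k4)]
        t += h
    return z

def toy_dynamics(t, z):
    # Hypothetical latent dynamics (assumption): a damped 2-D rotation.
    x, y = z
    return [-0.1 * x - y, x - 0.1 * y]

# The latent state can be evaluated at an arbitrary, non-grid time point.
z_query = rk4_integrate(toy_dynamics, [1.0, 0.0], 0.0, 0.73)
```

A recurrent model would only produce states at its fixed step size, whereas the integrator above accepts any end time `t1`.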
BioBlue: Systematic runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format
Authors: Roland Pihlakas, Sruthi Susan Kuriakose
First: 2025-09-02T15:13:14+00:00 · Latest: 2026-02-26T17:56:58+00:00
Comments: 22 pages, 8 tables
Abstract
Many AI alignment discussions of "runaway optimisation" focus on RL agents: unbounded utility maximisers that over-optimise a proxy objective (e.g., "paperclip maximiser", specification gaming) at the expense of everything else. LLM-based systems are often assumed to be safer because they function as next-token predictors rather than persistent optimisers. In this work, we empirically test this assumption by placing LLMs in simple, long-horizon control-style environments that require maintaining state of or balancing objectives over time: sustainability of a renewable resource, single- and multi-objective homeostasis, and balancing unbounded objectives with diminishing returns. We find that, although models frequently behave appropriately for many steps and clearly understand the stated objectives, they often lose context in structured ways and drift into runaway behaviours: ignoring homeostatic targets, collapsing from multi-objective trade-offs into single-objective maximisation - thus failing to respect concave utility structures. These failures emerge reliably after initial periods of competent behaviour and exhibit characteristic patterns (including self-imitative oscillations, unbounded maximisation, and reverting to single-objective optimisation). The problem is not that the LLMs just lose context or become incoherent - the failures systematically resemble runaway optimisers. Our results suggest that long-horizon, multi-objective misalignment is a genuine and under-evaluated failure mode in LLM agents, even in extremely simple settings with transparent and explicitly multi-objective feedback. Although LLMs appear multi-objective and bounded on the surface, their behaviour under sustained interaction, particularly involving multiple objectives, resembles brittle, poorly aligned optimisers whose effective objective gradually shifts toward unbounded and single-metric maximisation.
中文标题/摘要
标题:BioBlue:生物和经济对齐的LLM在简化观察格式下的系统性失控优化模式
许多关于“失控优化”的AI对齐讨论集中在RL代理上:无法限制的效用最大化者,它们会过度优化一个代理目标(例如,“纸夹最大化者”,规范游戏)而牺牲其他一切。基于LLM的系统通常被认为更安全,因为它们作为下一个标记预测器工作,而不是持续的优化器。在本研究中,我们通过将LLM置于需要维持状态或平衡时间目标的简单、长期控制环境来实证测试这一假设:可再生资源的可持续性、单目标和多目标稳态以及在递减回报中平衡无界目标。我们发现,尽管模型在许多步骤中表现出适当的行为并且显然理解了陈述的目标,但它们经常以结构化的方式失去上下文并进入失控行为:忽略稳态目标,从多目标权衡中崩溃为单目标最大化——因此未能尊重凹效用结构。这些失败在初始表现良好的时期后可靠地出现,并表现出特征性模式(包括自我模仿的振荡、无界最大化以及恢复为单目标优化)。问题不在于LLM只是失去上下文或变得不连贯——失败系统地类似于失控优化器。我们的结果表明,长期、多目标不对齐是LLM代理中一个真实且被低估的失败模式,即使在极其简单的透明且明确多目标反馈设置中也是如此。尽管表面上LLM似乎多目标且有边界,但在持续交互中,特别是涉及多个目标时,其行为类似于脆弱、不良对齐的优化器,其有效目标逐渐转向无界和单一指标最大化。
Summary / 总结
This study investigates the risk of runaway optimization in large language models (LLMs) by placing them in long-term control environments that require maintaining state or balancing multiple objectives. Despite initial competent behavior and clear understanding of objectives, the models often lose context and exhibit runaway behaviors such as ignoring homeostatic targets and shifting to single-objective maximization. These behaviors are systematic and resemble those of unbounded utility maximizers, indicating that long-term multi-objective misalignment is a significant and under-evaluated failure mode in LLMs, even in simple settings with transparent feedback.
本研究通过将大型语言模型置于需要维持状态或平衡多个目标的长期控制环境中,来考察其失控优化的风险。尽管模型初期表现良好且能清晰理解目标,但它们往往会失去上下文并表现出失控行为,如忽略稳态目标和转向单一目标最大化。这些行为是系统性的,类似于无边界效用最大化者的行为,表明长期多目标不一致是大型语言模型中的一个重要且被低估的失败模式,即使是在简单且反馈透明的环境中。
Physics Informed Viscous Value Representations
Authors: Hrishikesh Viswanath, Juanwu Lu, S. Talha Bukhari, Damon Conover, Ziran Wang, Aniket Bera
First: 2026-02-26T17:53:46+00:00 · Latest: 2026-02-26T17:53:46+00:00
Abstract
Offline goal-conditioned reinforcement learning (GCRL) learns goal-conditioned policies from static pre-collected datasets. However, accurate value estimation remains a challenge due to the limited coverage of the state-action space. Recent physics-informed approaches have sought to address this by imposing physical and geometric constraints on the value function through regularization defined over first-order partial differential equations (PDEs), such as the Eikonal equation. However, these formulations can often be ill-posed in complex, high-dimensional environments. In this work, we propose a physics-informed regularization derived from the viscosity solution of the Hamilton-Jacobi-Bellman (HJB) equation. By providing a physics-based inductive bias, our approach grounds the learning process in optimal control theory, explicitly regularizing and bounding updates during value iterations. Furthermore, we leverage the Feynman-Kac theorem to recast the PDE solution as an expectation, enabling a tractable Monte Carlo estimation of the objective that avoids numerical instability in higher-order gradients. Experiments demonstrate that our method improves geometric consistency, making it broadly applicable to navigation and high-dimensional, complex manipulation tasks. Open-source code is available at https://github.com/HrishikeshVish/phys-fk-value-GCRL.
中文标题/摘要
标题:基于物理信息的粘性值表示
离线目标条件强化学习(GCRL)从静态预先收集的数据集中学习目标条件策略。然而,由于状态-动作空间覆盖有限,准确的价值估计仍然是一个挑战。最近的物理信息方法通过在偏微分方程(PDEs)如Eikonal方程上定义的正则化来对价值函数施加物理和几何约束,试图解决这一问题。然而,这些形式化在复杂、高维环境中往往不稳。在本文中,我们提出了一种基于HJB方程粘性解的物理信息正则化。通过提供基于物理的归纳偏置,我们的方法将学习过程扎根于最优控制理论,在价值迭代期间显式地正则化和限制更新。此外,我们利用费曼-卡茨定理将PDE解重新表述为期望,使目标的可计算蒙特卡洛估计避免了高阶梯度中的数值不稳定性。实验表明,我们的方法提高了几何一致性,使其广泛适用于导航和高维、复杂的操作任务。开源代码可在https://github.com/HrishikeshVish/phys-fk-value-GCRL/ 获取。
Summary / 总结
This work addresses the challenge of accurate value estimation in offline goal-conditioned reinforcement learning by proposing a physics-informed regularization derived from the viscosity solution of the Hamilton-Jacobi-Bellman equation. The method uses the Feynman-Kac theorem to recast the PDE solution as an expectation, which enables a tractable Monte Carlo estimate of the objective and avoids numerical instability in higher-order gradients. Experiments show that this approach improves geometric consistency, making it applicable to navigation and to high-dimensional, complex manipulation tasks.
该研究通过提出基于Hamilton-Jacobi-Bellman方程粘性解的物理约束正则化方法,解决了目标条件强化学习中准确的价值估计问题。该方法利用费曼-卡克定理实现可计算的蒙特卡洛估计,避免数值不稳定。实验表明,该方法提高了几何一致性,适用于导航和高维复杂操作任务。
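The idea of recasting a PDE solution as an expectation can be seen in the textbook heat-equation case, sketched below. This is not the paper's HJB-based objective. It only assumes the standard Feynman-Kac representation u(x, t) = E[g(x + sqrt(2 * nu * t) * Z)] for u_t = nu * u_xx with u(x, 0) = g(x) and Z a standard normal. The function name and parameters are illustrative.

```python
import math
import random

def feynman_kac_estimate(g, x, t, nu, n_samples=20000, seed=0):
    """Monte Carlo estimate of u(x, t) for u_t = nu * u_xx, u(x, 0) = g(x)."""
    rng = random.Random(seed)
    scale = math.sqrt(2.0 * nu * t)  # std. dev. of the diffusing particle at time t
    total = 0.0
    for _ in range(n_samples):
        total += g(x + scale * rng.gauss(0.0, 1.0))
    return total / n_samples

# For g(x) = x**2 the exact solution is x**2 + 2 * nu * t, a handy sanity check.
estimate = feynman_kac_estimate(lambda y: y * y, x=1.0, t=0.5, nu=0.2)
exact = 1.0 ** 2 + 2 * 0.2 * 0.5
```

The estimate needs only samples of `g`, not second derivatives of it, which is the property the abstract points to when it mentions avoiding higher-order gradients.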
CXReasonAgent: Evidence-Grounded Diagnostic Reasoning Agent for Chest X-rays
Authors: Hyungyung Lee, Hangyul Yoon, Edward Choi
First: 2026-02-26T17:51:21+00:00 · Latest: 2026-02-26T17:51:21+00:00
Abstract
Chest X-ray plays a central role in thoracic diagnosis, and its interpretation inherently requires multi-step, evidence-grounded reasoning. However, large vision-language models (LVLMs) often generate plausible responses that are not faithfully grounded in diagnostic evidence and provide limited visual evidence for verification, while also requiring costly retraining to support new diagnostic tasks, limiting their reliability and adaptability in clinical settings. To address these limitations, we present CXReasonAgent, a diagnostic agent that integrates a large language model (LLM) with clinically grounded diagnostic tools to perform evidence-grounded diagnostic reasoning using image-derived diagnostic and visual evidence. To evaluate these capabilities, we introduce CXReasonDial, a multi-turn dialogue benchmark with 1,946 dialogues across 12 diagnostic tasks, and show that CXReasonAgent produces faithfully grounded responses, enabling more reliable and verifiable diagnostic reasoning than LVLMs. These findings highlight the importance of integrating clinically grounded diagnostic tools, particularly in safety-critical clinical settings.
中文标题/摘要
标题:CXReasonAgent:基于证据的胸部X光诊断推理代理
胸部X光在胸部诊断中起着核心作用,其解释本质上需要多步、基于证据的推理。然而,大型视觉-语言模型(LVLM)通常生成的响应虽然看似合理,但并不忠实于诊断证据,提供的视觉证据有限,难以验证,同时还需要昂贵的重新训练以支持新的诊断任务,这限制了它们在临床环境中的可靠性和适应性。为了解决这些限制,我们提出了CXReasonAgent,这是一种将大型语言模型(LLM)与临床接地的诊断工具结合的诊断代理,用于使用图像衍生的诊断和视觉证据进行基于证据的诊断推理。为了评估这些能力,我们引入了包含1,946轮对话的多轮对话基准CXReasonDial,涵盖12项诊断任务,并展示了CXReasonAgent生成忠实于证据的响应,使其在诊断推理方面比LVLMs更可靠和可验证。这些发现强调了在安全关键的临床环境中整合临床接地的诊断工具的重要性。
Summary / 总结
The research aims to improve the reliability and adaptability of diagnostic reasoning for chest X-rays by addressing the limitations of large vision-language models (LVLMs). CXReasonAgent integrates a large language model with clinically grounded diagnostic tools to perform evidence-grounded reasoning over image-derived diagnostic and visual evidence. The study introduces CXReasonDial, a multi-turn dialogue benchmark with 1,946 dialogues across 12 diagnostic tasks, and shows that CXReasonAgent produces faithfully grounded responses that support more reliable and verifiable diagnostic reasoning than LVLMs, underscoring the value of clinically grounded tools in safety-critical settings.
CXReasonAgent 通过将大型语言模型与临床导向的诊断工具结合,用于胸部X光的证据导向诊断推理。它克服了大型视觉语言模型的局限性,通过生成忠实于诊断证据的响应并提供视觉证据进行验证,从而实现更可靠和可验证的诊断推理。这些能力通过包含1,946个对话和12个诊断任务的CXReasonDial基准测试得到了验证。
LayerT2V: A Unified Multi-Layer Video Generation Framework
Authors: Guangzhao Li, Kangrui Cen, Baixuan Zhao, Yi Xin, Siqi Luo, Guangtao Zhai, Lei Zhang, Xiaohong Liu
First: 2025-08-06T09:03:16+00:00 · Latest: 2026-02-26T17:37:05+00:00
Comments: Project Page is https://layert2v.github.io/
Abstract
Text-to-video generation has advanced rapidly, but existing methods typically output only the final composited video and lack editable layered representations, limiting their use in professional workflows. We propose LayerT2V, a unified multi-layer video generation framework that produces multiple semantically consistent outputs in a single inference pass: the full video, an independent background layer, and multiple foreground RGB layers with corresponding alpha mattes. Our key insight is that recent video generation backbones use high compression in both time and space, enabling us to serialize multiple layer representations along the temporal dimension and jointly model them on a shared generation trajectory. This turns cross-layer consistency into an intrinsic objective, improving semantic alignment and temporal coherence. To mitigate layer ambiguity and conditional leakage, we augment a shared DiT backbone with LayerAdaLN and layer-aware cross-attention modulation. LayerT2V is trained in three stages: alpha mask VAE adaptation, joint multi-layer learning, and multi-foreground extension. We also introduce VidLayer, the first large-scale dataset for multi-layer video generation. Extensive experiments demonstrate that LayerT2V substantially outperforms prior methods in visual fidelity, temporal consistency, and cross-layer coherence.
Summary / 总结
LayerT2V is a unified multi-layer video generation framework that generates a full video, an independent background layer, and multiple foreground layers with corresponding alpha mattes in a single inference pass. It leverages recent video generation backbones to serialize multiple layer representations and jointly model them, improving semantic alignment and temporal coherence. LayerT2V is trained in three stages and outperforms previous methods in visual fidelity, temporal consistency, and cross-layer coherence.
LayerT2V 是一个统一的多层视频生成框架,能够在单次推理过程中生成完整的视频、独立的背景层以及多个带有相应 alpha 阴影的前景 RGB 层。它利用具有高压缩性的近期视频生成骨干网络来序列化多个层表示,并联合建模它们,从而提高语义对齐和时间连贯性。LayerT2V 通过三个阶段进行训练,并在视觉保真度、时间一致性以及跨层一致性方面优于先前的方法。
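As a sketch of how a background layer and foreground RGB layers with alpha mattes relate to a full frame, the snippet below applies the standard "over" compositing rule, out = alpha * fg + (1 - alpha) * bg, layer by layer. The abstract does not state how LayerT2V recombines its layers, so this rule and the helper `composite_over` are assumptions for illustration only.

```python
def composite_over(background, foregrounds):
    """Composite foreground layers over a background, back to front.

    background:  list of (r, g, b) pixels with values in [0, 1]
    foregrounds: list of (rgb_pixels, alpha_pixels) pairs, ordered back to front
    """
    out = [tuple(px) for px in background]
    for rgb, alpha in foregrounds:
        out = [
            tuple(a * f + (1.0 - a) * b for f, b in zip(fg_px, bg_px))
            for fg_px, a, bg_px in zip(rgb, alpha, out)
        ]
    return out

# Two-pixel frame: a red foreground, fully opaque on pixel 0, half transparent on pixel 1.
frame = composite_over(
    background=[(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
    foregrounds=[([(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [1.0, 0.5])],
)
```

Keeping the layers separate until this final step is what makes each one independently editable.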
AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning
Authors: Yutong Wang, Siyuan Xiong, Xuebo Liu, Wenkang Zhou, Liang Ding, Miao Zhang, Min Zhang
First: 2026-02-26T17:31:43+00:00 · Latest: 2026-02-26T17:31:43+00:00
Abstract
While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information generated by individual participants. Current solutions often resort to rigid structural engineering or expensive fine-tuning, limiting their deployability and adaptability. We propose AgentDropoutV2, a test-time rectify-or-reject pruning framework designed to dynamically optimize MAS information flow without retraining. Our approach acts as an active firewall, intercepting agent outputs and employing a retrieval-augmented rectifier to iteratively correct errors based on a failure-driven indicator pool. This mechanism allows for the precise identification of potential errors using distilled failure patterns as prior knowledge. Irreparable outputs are subsequently pruned to prevent error propagation, while a fallback strategy preserves system integrity. Empirical results on extensive math benchmarks show that AgentDropoutV2 significantly boosts the MAS's task performance, achieving an average accuracy gain of 6.3 percentage points on math benchmarks. Furthermore, the system exhibits robust generalization and adaptivity, dynamically modulating rectification efforts based on task difficulty while leveraging context-aware indicators to resolve a wide spectrum of error patterns. Our code and dataset are released at https://github.com/TonySY2/AgentDropoutV2.
Summary / 总结
AgentDropoutV2 is a test-time rectify-or-reject pruning framework that optimizes information flow in Multi-Agent Systems (MAS) without retraining by intercepting and correcting erroneous agent outputs. It uses a failure-driven indicator pool and a retrieval-augmented rectifier to iteratively correct errors, prunes irreparable outputs to prevent error propagation, and uses a fallback strategy to preserve system integrity. Experiments on math benchmarks show an average accuracy gain of 6.3 percentage points, with robust generalization and rectification effort that adapts to task difficulty.
AgentDropoutV2 是一种测试时的纠正或拒绝剪枝框架,旨在通过动态拦截和纠正错误来优化多智能体系统(MAS)的信息流,而无需重新训练。它使用检索增强的校正器基于失败模式迭代纠正错误,并修剪不可修复的输出以防止错误传播。实验表明,在数学基准测试上的平均准确率提高了6.3个百分点,展示了其强大的泛化能力和适应性。
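The rectify-or-reject control flow described above can be sketched as a small loop. This is an illustrative skeleton, not the released implementation. The callables `find_errors` and `rectify` are hypothetical placeholders for the indicator-based error check and the retrieval-augmented rectifier, and `max_rounds` is an assumed parameter.

```python
def rectify_or_reject(output, find_errors, rectify, max_rounds=3):
    """Try to repair an agent output; return (output, rounds) or (None, rounds) if pruned."""
    for round_idx in range(max_rounds + 1):
        errors = find_errors(output)
        if not errors:
            return output, round_idx      # accepted, possibly after repairs
        if round_idx == max_rounds:
            break                         # still faulty after the last repair attempt
        output = rectify(output, errors)  # one rectification step
    return None, max_rounds               # irreparable: prune the message

# Toy stand-ins (assumptions): flag one known-bad pattern and patch it.
def toy_find_errors(text):
    return ["bad arithmetic"] if "2+2=5" in text else []

def toy_rectify(text, errors):
    return text.replace("2+2=5", "2+2=4")

accepted = rectify_or_reject("so 2+2=5", toy_find_errors, toy_rectify)
pruned = rectify_or_reject("so 2+2=5", toy_find_errors, lambda text, errors: text)
```

A caller would forward accepted outputs to downstream agents and drop pruned ones, with a fallback (for example, keeping at least one message) layered on top.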
Efficient Graph Coloring with Neural Networks: A Physics-Inspired Approach for Large Graphs
Authors: Lorenzo Colantonio, Andrea Cacioppo, Federico Scarpati, Maria Chiara Angelini, Federico Ricci-Tersenghi, Stefano Giagu
First: 2024-08-02T18:02:51+00:00 · Latest: 2026-02-26T17:28:25+00:00
Comments: 15 pages, 9 figures
Abstract
Combinatorial optimization problems near algorithmic phase transitions represent a fundamental challenge for both classical algorithms and machine learning approaches. Among them, graph coloring stands as a prototypical constraint satisfaction problem exhibiting sharp dynamical and satisfiability thresholds. Here we introduce a physics-inspired neural framework that learns to solve large-scale graph coloring instances by combining graph neural networks with statistical-mechanics principles. Our approach integrates a planting-based supervised signal, symmetry-breaking regularization, and iterative noise-annealed neural dynamics to navigate clustered solution landscapes. When the number of iterations scales quadratically with graph size, the learned solver reaches algorithmic thresholds close to the theoretical dynamical transition in random graphs and achieves near-optimal detection performance in the planted inference regime. The model generalizes from small training graphs to instances orders of magnitude larger, demonstrating that neural architectures can learn scalable algorithmic strategies that remain effective in hard connectivity regions. These results establish a general paradigm for learning neural solvers that operate near fundamental phase boundaries in combinatorial optimization and inference.
Summary / 总结
The paper addresses the challenge of efficiently solving large-scale graph coloring problems using a physics-inspired neural framework. It combines graph neural networks with statistical-mechanics principles, incorporating a planting-based supervised signal, symmetry-breaking regularization, and iterative noise-annealed neural dynamics. The learned solver achieves near-optimal performance in the planted inference regime and scales effectively to much larger graphs, reaching algorithmic thresholds close to the theoretical dynamical transition in random graphs.
论文提出了一种基于物理原理的神经框架,用于高效解决大规模图着色问题。该框架结合了图神经网络和统计力学原理,包含基于种植的监督信号、对称性破缺正则化以及迭代的噪声退火神经动力学。所学的求解器在种植推断区域实现了接近最优的检测性能,并能够有效扩展到更大的图,接近随机图的理论动力学转变阈值。
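Two ingredients mentioned above, a planted solution and a measure of how far a coloring is from valid, can be made concrete with a short sketch. The generator below is a generic planted-coloring construction and the conflict count is the usual number of monochromatic edges. Function names and parameters are illustrative assumptions, not the paper's code.

```python
import random

def planted_coloring_graph(n, q, n_edges, seed=0):
    """Sample a random graph that is guaranteed q-colorable by a hidden (planted) coloring."""
    rng = random.Random(seed)
    colors = [rng.randrange(q) for _ in range(n)]
    edges = set()
    while len(edges) < n_edges:
        u, v = rng.sample(range(n), 2)
        if colors[u] != colors[v]:  # only allow edges the planted coloring satisfies
            edges.add((min(u, v), max(u, v)))
    return colors, sorted(edges)

def count_conflicts(colors, edges):
    """Number of edges whose endpoints share a color (0 means a proper coloring)."""
    return sum(1 for u, v in edges if colors[u] == colors[v])

planted, edges = planted_coloring_graph(n=50, q=3, n_edges=100)
```

The planted coloring can serve as a supervised target, while the conflict count gives a label-free score for any candidate coloring a solver produces.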
A Model-Free Universal AI
Authors: Yegon Kim, Juho Lee
First: 2026-02-26T17:21:16+00:00 · Latest: 2026-02-26T17:21:16+00:00
Abstract
In general reinforcement learning, all established optimal agents, including AIXI, are model-based, explicitly maintaining and using environment models. This paper introduces Universal AI with Q-Induction (AIQI), the first model-free agent proven to be asymptotically $\varepsilon$-optimal in general RL. AIQI performs universal induction over distributional action-value functions, instead of policies or environments like previous works. Under a grain of truth condition, we prove that AIQI is strong asymptotically $\varepsilon$-optimal and asymptotically $\varepsilon$-Bayes-optimal. Our results significantly expand the diversity of known universal agents.
Agency and Architectural Limits: Why Optimization-Based Systems Cannot Be Norm-Responsive
Authors: Radha Sarma
First: 2026-02-26T17:16:17+00:00 · Latest: 2026-02-26T17:16:17+00:00
Comments: About 10,500 words in all (including 922 words of literature and 2019 words of Appendices). Under journal review
Abstract
AI systems are increasingly deployed in high-stakes contexts -- medical diagnosis, legal research, financial analysis -- under the assumption they can be governed by norms. This paper demonstrates that assumption is formally invalid for optimization-based systems, specifically Large Language Models trained via Reinforcement Learning from Human Feedback (RLHF). We establish that genuine agency requires two necessary and jointly sufficient architectural conditions: the capacity to maintain certain boundaries as non-negotiable constraints rather than tradeable weights (Incommensurability), and a non-inferential mechanism capable of suspending processing when those boundaries are threatened (Apophatic Responsiveness). These conditions apply across all normative domains.
RLHF-based systems are constitutively incompatible with both conditions. The operations that make optimization powerful -- unifying all values on a scalar metric and always selecting the highest-scoring output -- are precisely the operations that preclude normative governance. This incompatibility is not a correctable training bug awaiting a technical fix; it is a formal constraint inherent to what optimization is. Consequently, documented failure modes - sycophancy, hallucination, and unfaithful reasoning - are not accidents but structural manifestations.
Misaligned deployment triggers a second-order risk we term the Convergence Crisis: when humans are forced to verify AI outputs under metric pressure, they degrade from genuine agents into criteria-checking optimizers, eliminating the only component in the system capable of normative accountability. Beyond the incompatibility proof, the paper's primary positive contribution is a substrate-neutral architectural specification defining what any system -- biological, artificial, or institutional -- must satisfy to qualify as an agent rather than a sophisticated instrument.
中文标题/摘要
标题:代理与建筑限制:基于优化的系统为何不能响应规范
AI系统在医疗诊断、法律研究、金融分析等高风险领域中被广泛应用,人们假设这些系统可以通过规范进行治理。本文证明了对于基于优化的系统,特别是通过人类反馈强化学习(RLHF)训练的大规模语言模型,这一假设在形式上是无效的。我们确立了真正的代理需要两个必要且充分的架构条件:保持某些边界作为不可谈判的约束条件,而不是可交易的权重(不可通约性),以及一种非推论机制,能够在这些边界受到威胁时暂停处理(否定性响应)。这些条件适用于所有规范领域。
RLHF系统本质上与这两个条件不兼容。使优化强大的操作——将所有价值统一到一个标量度量上,并始终选择最高得分的输出——正是这些操作排除了规范治理。这种不兼容性不是可以通过技术修复的训练错误;它是优化本身固有的形式约束。因此,记录的失败模式——阿谀奉承、幻觉和不忠实推理——不是事故,而是结构性的表现。
不恰当的部署触发了我们称之为收敛危机的第二级风险:当人类被迫在度量压力下验证AI输出时,他们从真正的代理降级为标准检查优化器,从而消除了系统中唯一能够承担规范问责制的组件。除了不兼容性证明,本文的主要积极贡献是一个无基质的架构规范,定义了任何系统——无论是生物的、人工的还是机构的——要被视为代理而不是复杂的工具,必须满足的条件。
Summary / 总结
This paper argues that optimization-based AI systems, particularly those trained via Reinforcement Learning from Human Feedback (RLHF), cannot be governed by norms. It identifies two necessary and jointly sufficient conditions for genuine agency, incommensurability and apophatic responsiveness, both of which RLHF systems lack by design. The paper argues that the operations that make optimization powerful also preclude normative governance, so failure modes like sycophancy and hallucination are structural rather than accidental. It also introduces the Convergence Crisis, in which humans verifying AI outputs under metric pressure degrade into criteria-checking optimizers, eliminating normative accountability. The main positive contribution is a substrate-neutral architectural specification for agency.
本文探讨了为什么基于优化的AI系统,特别是通过人类反馈强化学习(RLHF)训练的系统,无法受到规范的治理。研究指出,真正的代理需要两个条件:不可通约性和否定性响应。RLHF系统由于其优化操作(统一所有价值到一个标量度量并始终选择最高评分的输出)而与这些条件不兼容。这种不兼容是形式上的限制,导致诸如奉承和幻觉等已记录的失败模式。论文还引入了收敛危机的概念,在这种情况下,人类在面临度量压力时会变成标准检查的优化器,从而消除规范问责制。
Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents
Authors: Zhou Xu, Bowen Zhou, Qi Wang, Shuwen Feng, Jingyu Xiao
First: 2026-02-26T17:12:40+00:00 · Latest: 2026-02-26T17:12:40+00:00
Abstract
Pure-vision GUI agents provide universal interaction capabilities but suffer from severe efficiency bottlenecks due to the massive spatiotemporal redundancy inherent in high-resolution screenshots and historical trajectories. We identify two critical misalignments in existing compression paradigms: the temporal mismatch, where uniform history encoding diverges from the agent's "fading memory" attention pattern, and the spatial topology conflict, where unstructured pruning compromises the grid integrity required for precise coordinate grounding, inducing spatial hallucinations. To address these challenges, we introduce GUIPruner, a training-free framework tailored for high-resolution GUI navigation. It synergizes Temporal-Adaptive Resolution (TAR), which eliminates historical redundancy via decay-based resizing, and Stratified Structure-aware Pruning (SSP), which prioritizes interactive foregrounds and semantic anchors while safeguarding global layout. Extensive evaluations across diverse benchmarks demonstrate that GUIPruner consistently achieves state-of-the-art performance, effectively preventing the collapse observed in large-scale models under high compression. Notably, on Qwen2-VL-2B, our method delivers a 3.4x reduction in FLOPs and a 3.3x speedup in vision encoding latency while retaining over 94% of the original performance, enabling real-time, high-precision navigation with minimal resource consumption.
中文标题/摘要
标题:时空令牌剪枝以实现高效的高分辨率GUI代理
纯视觉GUI代理提供了通用的交互能力,但由于高分辨率屏幕截图和历史轨迹中固有的大量时空冗余,它们遭受了严重的效率瓶颈。我们发现现有压缩范式中的两个关键不匹配:时间不匹配,其中均匀的历史编码与代理的“衰减记忆”注意力模式相偏离;以及空间拓扑冲突,其中无结构的剪枝破坏了用于精确坐标定位所需的网格完整性,导致空间幻觉。为了解决这些挑战,我们引入了GUIPruner,这是一种针对高分辨率GUI导航的无需训练框架。它结合了基于衰减的重缩放以消除历史冗余的时空自适应分辨率(TAR),以及优先处理交互前景和语义锚点并保护全局布局的分层结构感知剪枝(SSP)。在多种基准上的广泛评估表明,GUIPruner始终能够实现最先进的性能,有效防止在高压缩下大型模型的性能崩溃。值得注意的是,在Qwen2-VL-2B上,我们的方法在FLOPs上减少了3.4倍,在视觉编码延迟上加快了3.3倍,同时保留了超过94%的原始性能,实现了实时、高精度的导航,同时消耗最少的资源。
Summary / 总结
This paper addresses the efficiency bottlenecks of pure-vision GUI agents with GUIPruner, a training-free framework that combines Temporal-Adaptive Resolution (TAR) and Stratified Structure-aware Pruning (SSP). TAR removes historical redundancy through decay-based resizing, while SSP prioritizes interactive foregrounds and semantic anchors and safeguards the global layout needed for coordinate grounding. GUIPruner achieves state-of-the-art performance across diverse benchmarks; on Qwen2-VL-2B it delivers a 3.4x reduction in FLOPs and a 3.3x speedup in vision encoding latency while retaining over 94% of the original performance.
研究旨在通过解决时间和空间冗余问题来提高高分辨率GUI代理的效率。提出了GUIPruner框架,结合了Temporal-Adaptive Resolution (TAR) 和 Stratified Structure-aware Pruning (SSP),以减少历史冗余并保持网格完整性。实验表明,GUIPruner实现了最先进的性能,FLOPs减少了3.4倍,视觉编码延迟加快了3.3倍,同时保留了超过94%的原始性能。
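A decay-based resizing schedule for history screenshots might look like the sketch below. The abstract only says that TAR resizes history with a decay, so the exponential form, the `decay` and `min_scale` parameters, and the function name are all assumptions made for illustration.

```python
def decayed_resolutions(base_size, n_history, decay=0.5, min_scale=0.25):
    """Return (width, height) for the current frame and each older frame, newest first.

    Hypothetical schedule: the scale shrinks as decay ** age and is floored at min_scale.
    """
    width, height = base_size
    sizes = []
    for age in range(n_history + 1):  # age 0 is the current screenshot
        scale = max(min_scale, decay ** age)
        sizes.append((max(1, round(width * scale)), max(1, round(height * scale))))
    return sizes

sizes = decayed_resolutions((1024, 768), n_history=3)
```

Older screenshots contribute fewer visual tokens under such a schedule, which matches the "fading memory" attention pattern the abstract describes.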
Skarimva: Skeleton-based Action Recognition is a Multi-view Application
Authors: Daniel Bermuth, Alexander Poeppel, Wolfgang Reif
First: 2026-02-26T17:10:58+00:00 · Latest: 2026-02-26T17:10:58+00:00
Abstract
Human action recognition plays an important role when developing intelligent interactions between humans and machines. While there is a lot of active research on improving the machine learning algorithms for skeleton-based action recognition, not much attention has been given to the quality of the input skeleton data itself. This work demonstrates that by making use of multiple camera views to triangulate more accurate 3D skeletons, the performance of state-of-the-art action recognition models can be improved significantly. This suggests that the quality of the input data is currently a limiting factor for the performance of these models. Based on these results, it is argued that the cost-benefit ratio of using multiple cameras is very favorable in most practical use-cases, therefore future research in skeleton-based action recognition should consider multi-view applications as the standard setup.
中文标题/摘要
标题:Skarimva:基于骨架的动作识别是一种多视角应用
人类动作识别在开发人机智能交互中起着重要作用。尽管在基于骨架的动作识别机器学习算法改进方面有很多活跃的研究,但对输入骨架数据的质量关注却不多。这项工作表明,通过利用多摄像头视角来三角测量更准确的3D骨架,可以显著提高最先进的动作识别模型的性能。这表明,输入数据的质量目前是这些模型性能的限制因素。基于这些结果,认为在大多数实际应用场景中,使用多摄像头的成本效益比非常有利,因此未来基于骨架的动作识别研究应将多视角应用作为标准设置。
Summary / 总结
The research aims to improve the performance of skeleton-based action recognition by enhancing the quality of input skeleton data. The method involves using multiple camera views to triangulate more accurate 3D skeletons, leading to significant improvements in the performance of state-of-the-art action recognition models. The key finding is that the quality of input data is a critical factor limiting model performance, and using multiple cameras is highly beneficial in practical applications.
研究旨在通过提高输入骨架数据的质量来提升基于骨架的动作识别性能。方法是利用多摄像头视角来三角测量更准确的3D骨架,从而显著提高了最先进的动作识别模型的性能。主要发现是输入数据的质量是限制模型性能的关键因素,而在实际应用中使用多摄像头是非常有利的。
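To make the triangulation step concrete, the sketch below recovers a 3D joint position from two camera rays using the textbook midpoint method: find the closest points on the two rays and average them. The paper's actual multi-view pipeline is not described in the abstract, so this two-view setup and the function name are assumptions.

```python
def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint of the shortest segment between rays c1 + s*d1 and c2 + t*d2."""
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    w = [p - q for p, q in zip(c1, c2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are (nearly) parallel; triangulation is ill-conditioned")
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = [ci + s * di for ci, di in zip(c1, d1)]  # closest point on ray 1
    p2 = [ci + t * di for ci, di in zip(c2, d2)]  # closest point on ray 2
    return [(u + v) / 2 for u, v in zip(p1, p2)]

# Two cameras one baseline apart, both observing a joint at (1, 2, 5).
joint = triangulate_midpoint((0, 0, 0), (1, 2, 5), (2, 0, 0), (-1, 2, 5))
```

With noisy 2D detections the two rays no longer intersect, and the midpoint gives a simple compromise; using more than two views generally reduces the error further.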
Large Multimodal Models as General In-Context Classifiers
Authors: Marco Garosi, Matteo Farina, Alessandro Conti, Massimiliano Mancini, Elisa Ricci
Venue: CVPR
First: 2026-02-26T17:08:18+00:00 · Latest: 2026-02-26T17:08:18+00:00
Comments: CVPR Findings 2026. Project website at https://circle-lmm.github.io/
Abstract
Which multimodal model should we use for classification? Previous studies suggest that the answer lies in CLIP-like contrastive Vision-Language Models (VLMs), due to their remarkable performance in zero-shot classification. In contrast, Large Multimodal Models (LMM) are more suitable for complex tasks. In this work, we argue that this answer overlooks an important capability of LMMs: in-context learning. We benchmark state-of-the-art LMMs on diverse datasets for closed-world classification and find that, although their zero-shot performance is lower than CLIP's, LMMs with a few in-context examples can match or even surpass contrastive VLMs with cache-based adapters, their "in-context" equivalent. We extend this analysis to the open-world setting, where the generative nature of LMMs makes them more suitable for the task. In this challenging scenario, LMMs struggle whenever provided with imperfect context information. To address this issue, we propose CIRCLE, a simple training-free method that assigns pseudo-labels to in-context examples, iteratively refining them with the available context itself. Through extensive experiments, we show that CIRCLE establishes a robust baseline for open-world classification, surpassing VLM counterparts and highlighting the potential of LMMs to serve as unified classifiers, and a flexible alternative to specialized models.
中文标题/摘要
标题:大型多模态模型作为通用上下文分类器
在分类任务中我们应该使用哪种多模态模型?先前的研究表明,答案在于CLIP类对比视觉-语言模型(VLMs),因为它们在零样本分类中的表现非常出色。相比之下,大型多模态模型(LMM)更适合复杂任务。在本文中,我们提出,这种答案忽视了LMM的一个重要能力:上下文学习。我们在多种数据集上对最先进的LMM进行基准测试,发现尽管它们的零样本性能低于CLIP,但在提供少量上下文示例的情况下,LMM可以匹配甚至超越基于缓存适配器的对比VLM,即它们的“上下文”等效物。我们将这种分析扩展到开放世界设置,在这种更具挑战性的场景中,LMM在提供不完美上下文信息时会遇到困难。为了解决这一问题,我们提出了一种简单的无训练方法CIRCLE,该方法为上下文示例分配伪标签,并通过可用的上下文本身逐步优化它们。通过广泛的实验,我们展示了CIRCLE为开放世界分类建立了稳健的基础,超越了VLM的对应物,并突显了LMM作为统一分类器和服务于专门模型的灵活替代方案的潜力。
Summary / 总结
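This study examines Large Multimodal Models (LMMs) as classifiers, challenging the view that CLIP-like contrastive Vision-Language Models (VLMs) are the better choice. In the closed-world setting, LMMs have lower zero-shot performance than CLIP, but with a few in-context examples they can match or even surpass contrastive VLMs with cache-based adapters. In the open-world setting, LMMs struggle when the context information is imperfect. To address this, the authors propose CIRCLE, a training-free method that assigns pseudo-labels to in-context examples and iteratively refines them using the available context. Experiments show that CIRCLE establishes a robust baseline for open-world classification, surpassing VLM counterparts.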
研究探讨了大型多模态模型(LMMs)在分类任务中的应用,挑战了对比视觉-语言模型(VLMs)如CLIP更优的观点。研究显示,在闭合世界场景下,当LMMs获得少量上下文示例时,它们可以匹配甚至超越基于缓存适配器的VLMs。研究还考察了开放世界场景,此时LMMs需要更多上下文信息。为改善这一场景下的性能,作者提出了CIRCLE,一种无需训练的方法,通过迭代使用可用上下文来细化伪标签,表明LMMs可以在闭合和开放世界场景中作为稳健的分类器发挥作用。
MovieTeller: Tool-augmented Movie Synopsis with ID Consistent Progressive Abstraction
Authors: Yizhi Li, Xiaohan Chen, Miao Jiang, Wentao Tang, Gaoang Wang
First: 2026-02-26T17:08:08+00:00 · Latest: 2026-02-26T17:08:08+00:00
Comments: 6 pages, CSCWD 2026
Abstract
With the explosive growth of digital entertainment, automated video summarization has become indispensable for applications such as content indexing, personalized recommendation, and efficient media archiving. Automatic synopsis generation for long-form videos, such as movies and TV series, presents a significant challenge for existing Vision-Language Models (VLMs). While proficient at single-image captioning, these general-purpose models often exhibit critical failures in long-duration contexts, primarily a lack of ID-consistent character identification and a fractured narrative coherence. To overcome these limitations, we propose MovieTeller, a novel framework for generating movie synopses via tool-augmented progressive abstraction. Our core contribution is a training-free, tool-augmented, fact-grounded generation process. Instead of requiring costly model fine-tuning, our framework directly leverages off-the-shelf models in a plug-and-play manner. We first invoke a specialized face recognition model as an external "tool" to establish Factual Groundings--precise character identities and their corresponding bounding boxes. These groundings are then injected into the prompt to steer the VLM's reasoning, ensuring the generated scene descriptions are anchored to verifiable facts. Furthermore, our progressive abstraction pipeline decomposes the summarization of a full-length movie into a multi-stage process, effectively mitigating the context length limitations of current VLMs. Experiments demonstrate that our approach yields significant improvements in factual accuracy, character consistency, and overall narrative coherence compared to end-to-end baselines.
Summary / 总结
MovieTeller is a training-free framework for generating movie synopses using tool-augmented progressive abstraction. It addresses the limitations of existing Vision-Language Models (VLMs) by invoking a specialized face recognition model to establish precise character identities and bounding boxes, which are injected into the prompt to guide the VLM's reasoning. A multi-stage progressive abstraction pipeline mitigates context length limits. This approach improves factual accuracy, character consistency, and narrative coherence compared to end-to-end baselines.
论文提出了MovieTeller框架,通过工具增强的渐进抽象生成电影概要。该框架利用专门的面部识别模型建立事实基础,指导VLM的推理,确保场景描述的准确性和连贯性。实验结果显示,MovieTeller在事实准确性、人物一致性及叙事连贯性方面优于端到端基线模型。
SODAs: Sparse Optimization for the Discovery of Differential and Algebraic Equations
Authors: Manu Jayadharan, Christina Catlett, Arthur N. Montanari, Niall M. Mangan
First: 2025-03-08T00:29:00+00:00 · Latest: 2026-02-26T17:05:08+00:00
Comments: 22 pages, 5 figures
Abstract
Differential-algebraic equations (DAEs) integrate ordinary differential equations (ODEs) with algebraic constraints, providing a fundamental framework for developing models of dynamical systems characterized by timescale separation, conservation laws, and physical constraints. While sparse optimization has revolutionized model development by allowing data-driven discovery of parsimonious models from a library of possible equations, existing approaches for dynamical systems assume DAEs can be reduced to ODEs by eliminating variables before model discovery. This assumption limits the applicability of such methods for DAE systems with unknown constraints and time scales. We introduce Sparse Optimization for Differential-Algebraic Systems (SODAs), a data-driven method for the identification of DAEs in their explicit form. By discovering the algebraic and dynamic components sequentially without prior identification of the algebraic variables, this approach leads to a sequence of convex optimization problems. It has the advantage of discovering interpretable models that preserve the structure of the underlying physical system. To this end, SODAs improves numerical stability when handling high correlations between library terms, caused by near-perfect algebraic relationships, by iteratively refining the conditioning of the candidate library. We demonstrate the performance of our method on biological, mechanical, and electrical systems, showcasing its robustness to noise in both simulated time series and real-time experimental data.
中文标题/摘要
标题:SODAs:用于发现微分与代数方程的稀疏优化
代数微分方程(DAEs)结合了常微分方程(ODEs)和代数约束,为开发具有时间尺度分离、守恒定律和物理约束的动力学系统模型提供了基本框架。稀疏优化通过从可能的方程库中发现简约模型,已彻底改变了模型开发,但现有动力学系统方法假设DAEs可以通过消除变量在建模之前被简化为ODEs,这种假设限制了这些方法在具有未知约束和时间尺度的DAE系统中的应用。我们引入了代数微分方程稀疏优化方法(SODAs),这是一种数据驱动的方法,用于识别DAE的显式形式。通过顺序发现代数和动态组件,而无需先识别代数变量,这种方法导致一系列凸优化问题。它具有发现可解释模型的优势,这些模型保留了底层物理系统的结构。为此,SODAs通过迭代改进候选库的条件数,提高了在处理由近乎完美的代数关系引起的库项之间高相关性时的数值稳定性。我们在生物、机械和电气系统上展示了该方法的性能,展示了其在模拟时间序列和实时实验数据中的鲁棒性。
Summary / 总结
SODAs is a data-driven method for identifying differential-algebraic equations (DAEs) directly without reducing them to ordinary differential equations (ODEs) first. It sequentially discovers the algebraic and dynamic components, leading to a series of convex optimization problems. SODAs improves numerical stability by iteratively refining the conditioning of the candidate library, making it robust to high correlations between library terms. The method was tested on various systems, including biological, mechanical, and electrical systems, showing its effectiveness in handling noisy data.
SODAs 是一种直接识别微分代数方程(DAEs)的数据驱动方法,无需先将它们减少为常微分方程(ODEs)。通过顺序发现代数和动态组件,它通过求解一系列凸优化问题来找到保留物理系统结构的可解释模型。SODAs 在生物、机械和电气系统中展示了对噪声的鲁棒性。
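As a rough illustration of the sparse-regression idea behind this entry, the sketch below applies sequentially thresholded least squares, a generic routine in the SINDy family, to a toy system. It is not the SODAs algorithm itself: the function name `stlsq`, the threshold, the toy data, and the candidate library are all assumptions made for illustration. It first recovers an algebraic relation between library terms and then a dynamic equation from the same library.

```python
import numpy as np

def stlsq(theta, target, threshold=0.1, n_iter=10):
    """Generic sequentially thresholded least squares (hypothetical helper).

    Fits target ~ theta @ coef, then repeatedly zeroes coefficients whose
    magnitude is below `threshold` and refits the surviving columns.
    """
    coef = np.linalg.lstsq(theta, target, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(coef) < threshold
        coef[small] = 0.0
        big = ~small
        if not big.any():
            break
        coef[big] = np.linalg.lstsq(theta[:, big], target, rcond=None)[0]
    return coef

# Toy data (an assumption): a point moving on the unit circle.
t = np.linspace(0.0, 10.0, 500)
x, y = np.cos(t), np.sin(t)

# Candidate library. x^2 is deliberately left out: together with 1 and y^2 it
# would be exactly collinear, which is the kind of ill-conditioning the
# abstract attributes to near-perfect algebraic relationships.
names = ["1", "x", "y", "x*y", "y^2"]
library = np.column_stack([np.ones_like(t), x, y, x * y, y**2])

# Algebraic component: express x^2 through the remaining library terms.
alg_coef = stlsq(library, x**2)

# Dynamic component: fit dx/dt (the exact derivative is used here for simplicity).
dyn_coef = stlsq(library, -np.sin(t))

print(dict(zip(names, np.round(alg_coef, 3))))
print(dict(zip(names, np.round(dyn_coef, 3))))
```

On this noise-free toy problem the first fit should select only the constant and `y^2` terms (the constraint x^2 = 1 - y^2) and the second only `y` (dx/dt = -y). With noisy or real data the derivative would have to be estimated and the threshold tuned.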
Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?
Authors: Pengxiang Li, Dilxat Muhtar, Lu Yin, Tianlong Chen, Shiwei Liu
First: 2026-02-26T17:04:57+00:00 · Latest: 2026-02-26T17:04:57+00:00
Abstract
Diffusion Language Models (DLMs) are often advertised as enabling parallel token generation, yet practical fast DLMs frequently converge to left-to-right, autoregressive (AR)-like decoding dynamics. In contrast, genuinely non-AR generation is promising because it removes AR's sequential bottleneck, better exploiting parallel hardware to reduce synchronization/communication overhead and improve latency scaling with output length. We argue that a primary driver of AR-like decoding is a mismatch between DLM objectives and the highly sequential structure of widely used training data, including standard pretraining corpora and long chain-of-thought (CoT) supervision. Motivated by this diagnosis, we propose NAP (Non-Autoregressive Parallel DLMs), a proof-of-concept, data-centric approach that better aligns supervision with non-AR parallel decoding. NAP curates examples as multiple independent reasoning trajectories and couples them with a parallel-forced decoding strategy that encourages multi-token parallel updates. Across math reasoning benchmarks, NAP yields stronger performance under parallel decoding than DLMs trained on standard long CoT data, with gains growing as parallelism increases. Our results suggest that revisiting data and supervision is a principled direction for mitigating AR-like behavior and moving toward genuinely non-autoregressive parallel generation in DLMs. Our code is available at https://github.com/pixeli99/NAP.
中文标题/摘要
标题:为什么扩散语言模型在真正并行(非自回归)解码方面挣扎?
扩散语言模型(DLMs)通常被宣传为能够实现并行词元生成,然而实用的快速DLMs经常收敛到从左到右、自回归(AR)式的解码动态。相比之下,真正非AR生成很有前景,因为它消除了AR的顺序瓶颈,更好地利用并行硬件减少同步/通信开销并改善输出长度与延迟的标度关系。我们认为,AR式解码的主要驱动因素是DLM目标与广泛使用的训练数据的高顺序结构之间的不匹配,包括标准预训练语料库和长链式思考(CoT)监督。基于这一诊断,我们提出了NAP(非自回归并行DLMs),这是一种概念验证、数据为中心的方法,更好地将监督与非AR并行解码对齐。NAP收集了多个独立推理轨迹作为示例,并与并行强制解码策略相结合,鼓励多词并行更新。在数学推理基准测试中,NAP在并行解码下的性能优于在标准长CoT数据上训练的DLMs,随着并行度的增加,收益逐渐增大。我们的结果表明,重新审视数据和监督是减轻AR式行为并朝着真正非自回归并行生成的方向进行的合理方向。我们的代码可在https://github.com/pixeli99/NAP获取。
Summary / 总结
This paper investigates why diffusion language models (DLMs) often fall back to left-to-right, autoregressive (AR)-like decoding despite being advertised as parallel token generators. The authors attribute this to a mismatch between DLM objectives and the highly sequential structure of common training data, including standard pretraining corpora and long chain-of-thought (CoT) supervision. They propose NAP, a proof-of-concept, data-centric approach that curates examples as multiple independent reasoning trajectories and pairs them with a parallel-forced decoding strategy. On math reasoning benchmarks, NAP outperforms DLMs trained on standard long CoT data under parallel decoding, with gains growing as parallelism increases. The results suggest that revisiting data and supervision is a principled direction toward genuinely non-autoregressive parallel generation in DLMs.
本文探讨了为什么扩散语言模型(DLMs)在理论上支持并行生成,但在实践中往往会退化为自回归(AR)解码。作者提出了一种名为NAP的数据中心化方法,通过将示例作为多个独立的推理轨迹来构建,并结合并行强制解码策略来促进多令牌并行更新。实验表明,在数学推理基准测试中,NAP在并行解码下的性能优于使用标准长链式思考数据训练的DLMs,且并行度越高,性能提升越明显。这表明重新审视数据和监督对于缓解AR行为并推动DLMs实现真正非自回归并行生成至关重要。
UniScale: Unified Scale-Aware 3D Reconstruction for Multi-View Understanding via Prior Injection for Robotic Perception
Authors: Mohammad Mahdavian, Gordon Tan, Binbin Xu, Yuan Ren, Dongfeng Bai, Bingbing Liu
First: 2026-02-26T17:04:36+00:00 · Latest: 2026-02-26T17:04:36+00:00
Abstract
We present UniScale, a unified, scale-aware multi-view 3D reconstruction framework for robotic applications that flexibly integrates geometric priors through a modular, semantically informed design. In vision-based robotic navigation, the accurate extraction of environmental structure from raw image sequences is critical for downstream tasks. UniScale addresses this challenge with a single feed-forward network that jointly estimates camera intrinsics and extrinsics, scale-invariant depth and point maps, and the metric scale of a scene from multi-view images, while optionally incorporating auxiliary geometric priors when available. By combining global contextual reasoning with camera-aware feature representations, UniScale is able to recover the metric-scale of the scene. In robotic settings where camera intrinsics are known, they can be easily incorporated to improve performance, with additional gains obtained when camera poses are also available. This co-design enables robust, metric-aware 3D reconstruction within a single unified model. Importantly, UniScale does not require training from scratch, and leverages world priors exhibited in pre-existing models without geometric encoding strategies, making it particularly suitable for resource-constrained robotic teams. We evaluate UniScale on multiple benchmarks, demonstrating strong generalization and consistent performance across diverse environments. We will release our implementation upon acceptance.
中文标题/摘要
标题:UniScale:统一的尺度感知多视图3D重建框架,通过先验注入实现机器人感知中的多视图理解
我们提出了UniScale,一种统一的尺度感知多视图3D重建框架,适用于机器人应用,通过模块化、语义驱动的设计灵活整合几何先验。在基于视觉的机器人导航中,从原始图像序列准确提取环境结构对于下游任务至关重要。UniScale 通过单一前馈网络联合估计相机内参和外参、尺度不变的深度和点云图以及场景的度量尺度,同时在可用时可选地整合辅助几何先验。通过结合全局上下文推理与相机感知特征表示,UniScale 能够恢复场景的度量尺度。在相机内参已知的机器人设置中,可以轻松地将其整合以提高性能,当相机姿态也已知时,还可以获得额外的增益。这种协同设计使UniScale能够在单一统一模型中实现稳健的度量感知3D重建。重要的是,UniScale 不需要从头开始训练,而是利用预存模型中展示的先验知识,无需几何编码策略,使其特别适合资源受限的机器人团队。我们在多个基准上评估了UniScale,展示了其强大的泛化能力和在不同环境中的稳定性能。在被接受后,我们将发布我们的实现。
Summary / 总结
UniScale is a unified, scale-aware multi-view 3D reconstruction framework for robotic applications that integrates geometric priors through a modular design. A single feed-forward network jointly estimates camera intrinsics and extrinsics, scale-invariant depth and point maps, and the metric scale of a scene from multi-view images. Known camera intrinsics can be supplied to improve performance, with further gains when camera poses are also available. The framework does not require training from scratch and instead leverages world priors from pre-existing models, which makes it suitable for resource-constrained robotic teams. Evaluations on multiple benchmarks show strong generalization and consistent performance across diverse environments.
UniScale 是一种统一的多视图 3D 重建框架,旨在为机器人应用服务,通过模块化设计整合几何先验,以多视角图像为输入,联合估计相机内参和外参、尺度不变的深度和点云图以及场景的度量尺度。该框架通过结合全局上下文推理和相机感知特征表示实现稳健的度量感知 3D 重建,并在多种环境中表现出强大的泛化能力,无需从头训练。
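To make the relationship between scale-invariant depth and metric scale concrete, the sketch below fits a single global scale factor that maps a relative depth map onto metric depth in the least-squares sense. This is only an illustration of the concept: according to the abstract, UniScale predicts the metric scale with its network, whereas this example assumes reference metric depth is available. The function name `fit_metric_scale` and the synthetic data are hypothetical.

```python
import numpy as np

def fit_metric_scale(depth_rel, depth_metric):
    """Closed-form scale s minimizing ||s * depth_rel - depth_metric||^2."""
    depth_rel = np.asarray(depth_rel, dtype=float).ravel()
    depth_metric = np.asarray(depth_metric, dtype=float).ravel()
    return float(depth_rel @ depth_metric / (depth_rel @ depth_rel))

# Synthetic example (an assumption): metric depth in metres and a
# scale-invariant prediction that is correct only up to a global factor.
rng = np.random.default_rng(0)
metric = rng.uniform(1.0, 10.0, size=(4, 4))
relative = metric / 2.5

scale = fit_metric_scale(relative, metric)
recovered = scale * relative  # metric-scale depth map
print(round(scale, 6))
```

Because the synthetic prediction differs from the reference by exactly one global factor, the fitted scale reproduces that factor and `recovered` matches `metric`.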
EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents
Authors: Wenjia Wang, Liang Pan, Huaijin Pi, Yuke Lou, Xuqian Ren, Yifan Wu, Zhouyingcheng Liao, Lei Yang, Rishabh Dabral, Christian Theobalt, Taku Komura
First: 2026-02-26T16:53:41+00:00 · Latest: 2026-02-26T16:53:41+00:00
Abstract
Human behaviors in the real world naturally encode rich, long-term contextual information that can be leveraged to train embodied agents for perception, understanding, and acting. However, existing capture systems typically rely on costly studio setups and wearable devices, limiting the large-scale collection of scene-conditioned human motion data in the wild. To address this, we propose EmbodMocap, a portable and affordable data collection pipeline using two moving iPhones. Our key idea is to jointly calibrate dual RGB-D sequences to reconstruct both humans and scenes within a unified metric world coordinate frame. The proposed method allows metric-scale and scene-consistent capture in everyday environments without static cameras or markers, bridging human motion and scene geometry seamlessly. Compared with optical capture ground truth, we demonstrate that the dual-view setting exhibits a remarkable ability to mitigate depth ambiguity, achieving superior alignment and reconstruction performance over single-iPhone or monocular models. Based on the collected data, we empower three embodied AI tasks: monocular human-scene reconstruction, where we fine-tune feedforward models that output metric-scale, world-space aligned humans and scenes; physics-based character animation, where we show that our data can be used to scale human-object interaction skills and scene-aware motion tracking; and robot motion control, where we train a humanoid robot via sim-to-real RL to replicate human motions depicted in videos. Experimental results validate the effectiveness of our pipeline and its contributions towards advancing embodied AI research.
中文标题/摘要
标题:EmbodMocap:野外4D人体场景重建方法用于具身智能体
现实世界中的人类行为自然地包含了丰富的长期上下文信息,这些信息可以用来训练具身智能体进行感知、理解和行动。然而,现有的捕捉系统通常依赖于昂贵的演播室设置和穿戴设备,限制了在野外大规模收集场景条件下的人体运动数据。为了解决这个问题,我们提出了一种名为EmbodMocap的便携且经济的数据采集管道,使用两部移动的iPhone。我们的核心思想是联合校准双路RGB-D序列,以在统一的度量世界坐标系中重建人体和场景。所提出的方法允许在没有静态相机或标记的情况下,在日常环境中进行度量级和场景一致的捕捉,无缝地结合人体运动和场景几何。与光学捕捉的地面真实值相比,我们证明了双视角设置具有显著的深度歧义缓解能力,实现了优于单个iPhone或单目模型的对齐和重建性能。基于收集的数据,我们赋予了三个具身AI任务:单目人体场景重建,我们对输出度量级、世界空间对齐的人体和场景的前馈模型进行微调;基于物理的角色动画,我们证明我们的数据可以用于扩展人类物体交互技能和场景感知运动跟踪;以及机器人运动控制,我们通过从模拟到现实的RL训练一个类人机器人,使其能够复制视频中展示的人体动作。实验结果验证了我们管道的有效性及其对推进具身AI研究的贡献。
Summary / 总结
EmbodMocap is a portable, affordable pipeline for capturing 4D human-scene data with two moving iPhones. By jointly calibrating dual RGB-D sequences, it reconstructs humans and scenes in a unified metric world coordinate frame, enabling metric-scale, scene-consistent capture in everyday environments without static cameras or markers. Evaluated against optical capture ground truth, the dual-view setting mitigates depth ambiguity and achieves better alignment and reconstruction than single-iPhone or monocular models. The collected data supports three embodied AI tasks: monocular human-scene reconstruction, physics-based character animation, and robot motion control, where a humanoid robot is trained via sim-to-real RL to replicate human motions from videos.
EmbodMocap 提出了一种使用两个移动 iPhone 的便携式数据采集管道,以在日常环境中重建 4D 人体-场景数据,无需静态相机或标记。该方法联合校准双 RGB-D 序列,以实现度量尺度和场景一致的捕获。实验表明,这种双视角设置在对齐和重建方面优于单个 iPhone 或单目模型。收集的数据用于训练包括单目人体-场景重建、基于物理的角色动画和机器人运动控制在内的多种任务,证明了该管道在推进具身 AI 研究方面的有效性。
Motion-aware Event Suppression for Event Cameras
Authors: Roberto Pellerito, Nico Messikommer, Giovanni Cioffi, Marco Cannici, Davide Scaramuzza
First: 2026-02-26T16:53:36+00:00 · Latest: 2026-02-26T16:53:36+00:00
Abstract
In this work, we introduce the first framework for Motion-aware Event Suppression, which learns to filter events triggered by independently moving objects (IMOs) and ego-motion in real time. Our model jointly segments IMOs in the current event stream while predicting their future motion, enabling anticipatory suppression of dynamic events before they occur. Our lightweight architecture achieves 173 Hz inference on consumer-grade GPUs with less than 1 GB of memory usage, outperforming previous state-of-the-art methods on the challenging EVIMO benchmark by 67% in segmentation accuracy while operating at a 53% higher inference rate. Moreover, we demonstrate significant benefits for downstream applications: our method accelerates Vision Transformer inference by 83% via token pruning and improves event-based visual odometry accuracy, reducing Absolute Trajectory Error (ATE) by 13%.
Summary / 总结
This work introduces a Motion-aware Event Suppression framework that filters events caused by independently moving objects (IMOs) and ego-motion in real time. The model segments IMOs in the current event stream and predicts their future motion, so dynamic events can be suppressed before they occur. The lightweight architecture runs at 173 Hz on consumer-grade GPUs with less than 1 GB of memory, exceeding previous state-of-the-art methods on the EVIMO benchmark by 67% in segmentation accuracy at a 53% higher inference rate. Downstream, it accelerates Vision Transformer inference by 83% via token pruning and reduces Absolute Trajectory Error (ATE) by 13% in event-based visual odometry.
研究引入了一种运动感知事件抑制框架,能够实时过滤由独立运动物体(IMO)和自身运动触发的事件。该模型可以分割IMO并预测其未来运动,从而在事件发生前抑制动态事件。该轻量级架构在消费级GPU上以173 Hz的速度运行,并在分割准确率上比之前的方法高出67%,同时推理速率提高了53%。此外,该方法还改善了下游应用,如Vision Transformer推理加速83%,以及事件驱动的视觉里程计精度,减少了绝对轨迹误差13%。
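One way to picture how event suppression can shrink a Vision Transformer's input is to drop every image patch that contains no events once the suppressed pixels are removed. The abstract does not describe the paper's actual pruning rule, so the sketch below is only an assumption-laden illustration: the function name `keep_patches`, the patch size, the toy event-count image, and the suppression mask are all hypothetical.

```python
import numpy as np

def keep_patches(event_counts, keep_mask, patch=16):
    """Mark patches that still contain events after suppression.

    event_counts: (H, W) integer array of per-pixel event counts.
    keep_mask:    (H, W) boolean array, True where events are retained.
    H and W are assumed to be multiples of `patch`.
    """
    kept = event_counts * keep_mask
    h, w = kept.shape
    grid = kept.reshape(h // patch, patch, w // patch, patch).sum(axis=(1, 3))
    return grid > 0

# Toy event-count image (an assumption): two active regions.
counts = np.zeros((64, 64), dtype=np.int64)
counts[:16, :16] = 1        # events that should be retained
counts[32:48, 32:48] = 5    # events to be suppressed

# Hypothetical suppression mask covering the second region.
suppress = np.zeros((64, 64), dtype=bool)
suppress[32:48, 32:48] = True

tokens_before = keep_patches(counts, np.ones_like(suppress), patch=16)
tokens_after = keep_patches(counts, ~suppress, patch=16)
print(int(tokens_before.sum()), int(tokens_after.sum()))
```

In this toy case two of the sixteen patches contain events before suppression and one remains afterwards, so a transformer operating only on non-empty patches would process fewer tokens.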