MediX-R1: Open Ended Medical Reinforcement Learning
Authors: Sahal Shaji Mullappilly, Mohammed Irfan Kurpath, Omair Mohamed, Mohamed Zidan, Fahad Khan, Salman Khan, Rao Anwer, Hisham Cholakkal
First: 2026-02-26T18:59:46+00:00 · Latest: 2026-02-26T18:59:46+00:00
Abstract
We introduce MediX-R1, an open-ended Reinforcement Learning (RL) framework for medical multimodal large language models (MLLMs) that enables clinically grounded, free-form answers beyond multiple-choice formats. MediX-R1 fine-tunes a baseline vision-language backbone with Group Based RL and a composite reward tailored for medical reasoning: an LLM-based accuracy reward that judges semantic correctness with a strict YES/NO decision, a medical embedding-based semantic reward to capture paraphrases and terminology variants, and lightweight format and modality rewards that enforce interpretable reasoning and modality recognition. This multi-signal design provides stable, informative feedback for open-ended outputs where traditional verifiable or MCQ-only rewards fall short. To measure progress, we propose a unified evaluation framework for both text-only and image+text tasks that uses a Reference-based LLM-as-judge in place of brittle string-overlap metrics, capturing semantic correctness, reasoning, and contextual alignment. Despite using only $\sim51$K instruction examples, MediX-R1 achieves excellent results across standard medical LLM (text-only) and VLM (image + text) benchmarks, outperforming strong open-source baselines and delivering particularly large gains on open-ended clinical tasks. Our results demonstrate that open-ended RL with comprehensive reward signals and LLM-based evaluation is a practical path toward reliable medical reasoning in multimodal models. Our trained models, curated datasets and source code are available at https://medix.cvmbzuai.com
中文标题/摘要
标题:MediX-R1:开放式的医疗强化学习
我们介绍了MediX-R1,这是一种针对医疗多模态大型语言模型(MLLMs)的开放式强化学习(RL)框架,能够提供基于临床的、自由形式的答案,超越了多项选择格式。MediX-R1 使用基于组的RL对基础视觉-语言骨干进行微调,并结合了针对医疗推理的复合奖励:基于LLM的准确度奖励,用于判断语义正确性并做出严格的YES/NO决策;基于医学嵌入的语义奖励,用于捕捉同义词和术语变体;以及轻量级的格式和模态奖励,以确保可解释的推理和模态识别。这种多信号设计为传统验证性或仅限MCQ的奖励无法提供稳定、信息丰富的反馈的开放式输出提供了支持。为了衡量进展,我们提出了一种统一的评估框架,用于文本和图像+文本任务,该框架使用LLM作为裁判替代脆弱的字符串重叠度量,以捕捉语义正确性、推理和上下文对齐。尽管仅使用约51,000个指令示例,MediX-R1 在标准的医疗LLM(仅文本)和VLM(图像+文本)基准测试中取得了优异的成绩,超越了强大的开源基线,并在开放式临床任务上取得了特别大的进步。我们的结果表明,使用全面的奖励信号和基于LLM的评估的开放式RL是一种通往多模态模型中可靠医疗推理的实际路径。我们的训练模型、精选数据集和源代码可在https://medix.cvmbzuai.com 获取。
Summary / 总结
MediX-R1 is an open-ended RL framework for medical MLLMs that enables free-form answers. It fine-tunes a vision-language backbone with Group Based RL and a composite reward system, including LLM-based accuracy, medical embedding semantic, and format/modality rewards. This approach provides stable feedback for open-ended outputs. MediX-R1 outperforms strong baselines on medical LLM and VLM benchmarks, especially on open-ended clinical tasks, using only about 51,000 instruction examples. The evaluation framework uses a Reference-based LLM-as-judge to measure semantic correctness, reasoning, and contextual alignment. The results show that comprehensive reward signals and LLM-based evaluation are practical for reliable medical reasoning in multimodal models.
MediX-R1 是一个用于医疗 MLLM 的开放域 RL 框架,能够生成自由形式的答案。它通过基于组的 RL 和一个综合奖励系统(包括 LLM 基准准确度、医学嵌入语义、格式和模态奖励)来微调视觉-语言主干。这种方法为开放域输出提供了稳定的反馈。MediX-R1 使用约 51,000 个指令示例,在医疗 LLM 和 VLM 基准测试中表现出色,特别是在开放域临床任务上。评价框架使用 LLM 作为评判者来衡量语义正确性、推理和上下文对齐。结果表明,综合奖励信号和 LLM 基准评价是实现可靠多模态模型医疗推理的实用路径。
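The composite reward described above can be sketched as a weighted sum of the four signals, followed by normalization within a group of sampled responses. This is a minimal illustration, not the authors' implementation: the weights, the `RewardSignals` fields, and the GRPO-style group normalization are all assumptions, since the abstract names the reward components but gives no formula.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class RewardSignals:
    judge_says_yes: bool     # strict YES/NO verdict from the LLM judge
    embed_similarity: float  # similarity from a medical embedding model, expected in [0, 1]
    format_ok: bool          # response follows the required reasoning/answer format
    modality_ok: bool        # response identifies the correct imaging modality


# Hypothetical weights; the abstract does not state the actual values.
WEIGHTS = {"accuracy": 1.0, "semantic": 0.5, "format": 0.25, "modality": 0.25}


def composite_reward(s: RewardSignals) -> float:
    """Weighted sum of the four reward signals."""
    similarity = min(1.0, max(0.0, s.embed_similarity))  # clamp to [0, 1]
    return (
        WEIGHTS["accuracy"] * float(s.judge_says_yes)
        + WEIGHTS["semantic"] * similarity
        + WEIGHTS["format"] * float(s.format_ok)
        + WEIGHTS["modality"] * float(s.modality_ok)
    )


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of responses sampled for the same prompt.

    Assumption: group-relative normalization in the style of GRPO.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of three sampled answers to the same question.
group = [
    RewardSignals(True, 0.9, True, True),     # correct and well formed
    RewardSignals(False, 0.6, True, True),    # judged wrong, but a close paraphrase
    RewardSignals(False, 0.1, False, False),  # wrong and malformed
]
rewards = [composite_reward(s) for s in group]
advantages = group_relative_advantages(rewards)
```

The embedding term gives partial credit to the second answer even though the judge says NO, which is the sense in which a multi-signal reward is more informative than a single binary verdict.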
Joint Optimization for 4D Human-Scene Reconstruction in the Wild
Authors: Zhizheng Liu, Joe Lin, Wayne Wu, Bolei Zhou
First: 2025-01-04T01:53:51+00:00 · Latest: 2026-02-26T18:59:39+00:00
Comments: Project Page: https://vail-ucla.github.io/JOSH/
Abstract
Reconstructing human motion and its surrounding environment is crucial for understanding human-scene interaction and predicting human movements in the scene. While much progress has been made in capturing human-scene interaction in constrained environments, those prior methods can hardly reconstruct the natural and diverse human motion and scene context from web videos. In this work, we propose JOSH, a novel optimization-based method for 4D human-scene reconstruction in the wild from monocular videos. JOSH uses techniques in both dense scene reconstruction and human mesh recovery as initialization, and then it leverages the human-scene contact constraints to jointly optimize the scene, the camera poses, and the human motion. Experiment results show JOSH achieves better results on both global human motion estimation and dense scene reconstruction by joint optimization of scene geometry and human motion. We further design a more efficient model, JOSH3R, and directly train it with pseudo-labels from web videos. JOSH3R outperforms other optimization-free methods by only training with labels predicted from JOSH, further demonstrating its accuracy and generalization ability.
中文标题/摘要
标题:野外4D人体-场景重建的联合优化
重建人体运动及其周围环境对于理解人体-场景交互和预测场景中的人体运动至关重要。尽管在受限环境中捕捉人体-场景交互方面取得了很大进展,但先前的方法很难从网络视频中重建自然多样的人体运动和场景上下文。在本文中,我们提出了一种名为JOSH的新颖优化方法,用于从单目视频中进行野外4D人体-场景重建。JOSH利用密集场景重建和人体网格恢复技术进行初始化,然后利用人体-场景接触约束联合优化场景、相机姿态和人体运动。实验结果表明,JOSH通过联合优化场景几何和人体运动,在全局人体运动估计和密集场景重建方面取得了更好的结果。我们进一步设计了一个更高效的模型JOSH3R,并直接用来自网络视频的伪标签对其进行训练。JOSH3R仅通过使用JOSH预测的标签进行训练,就优于其他无优化方法,进一步证明了其准确性和泛化能力。
Summary / 总结
The research aims to reconstruct human motion and its surrounding environment from monocular web videos, a setting where prior methods built for constrained environments struggle. JOSH, a novel optimization-based method, initializes with dense scene reconstruction and human mesh recovery, then jointly optimizes the scene, camera poses, and human motion using human-scene contact constraints. Experiments show JOSH improves both global human motion estimation and dense scene reconstruction through joint optimization. JOSH3R, a more efficient variant trained only on pseudo-labels that JOSH predicts from web videos, outperforms other optimization-free methods.
研究旨在从单目网络视频中重建人体运动及其周围环境,这由于自然多变的场景而具有挑战性。JOSH 是一种新颖的优化方法,通过密集场景重建和人体网格恢复进行初始化,然后利用人体与场景的接触约束联合优化场景、相机姿态和人体运动。该方法在全局人体运动估计和密集场景重建方面取得了比以往方法更好的结果。JOSH3R 是一种更高效的变体,通过使用从 JOSH 预测的伪标签进行训练进一步提高性能,优于其他无优化方法。
VGG-T$^3$: Offline Feed-Forward 3D Reconstruction at Scale
Authors: Sven Elflein, Ruilong Li, Sérgio Agostinho, Zan Gojcic, Laura Leal-Taixé, Qunjie Zhou, Aljosa Osep
Venue: CVPR 2026
First: 2026-02-26T18:59:33+00:00 · Latest: 2026-02-26T18:59:33+00:00
Comments: CVPR 2026, Project page: https://research.nvidia.com/labs/dvl/projects/vgg-ttt
Abstract
We present a scalable 3D reconstruction model that addresses a critical limitation in offline feed-forward methods: their computational and memory requirements grow quadratically w.r.t. the number of input images. Our approach is built on the key insight that this bottleneck stems from the varying-length Key-Value (KV) space representation of scene geometry, which we distill into a fixed-size Multi-Layer Perceptron (MLP) via test-time training. VGG-T$^3$ (Visual Geometry Grounded Test Time Training) scales linearly w.r.t. the number of input views, similar to online models, and reconstructs a $1k$ image collection in just $54$ seconds, achieving an $11.6\times$ speed-up over baselines that rely on softmax attention. Since our method retains global scene aggregation capability, it outperforms other linear-time methods by large margins in point map reconstruction error. Finally, we demonstrate visual localization capabilities of our model by querying the scene representation with unseen images.
Summary / 总结
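The work addresses the quadratic growth in compute and memory of offline feed-forward 3D reconstruction with VGG-T$^3$, which scales linearly in the number of input images. It uses test-time training to distill the variable-length Key-Value representation of scene geometry into a fixed-size MLP. VGG-T$^3$ reconstructs a 1k-image collection in 54 seconds, an $11.6\times$ speed-up over softmax-attention baselines, and outperforms other linear-time methods in point map reconstruction error. The model also supports visual localization by querying the scene representation with unseen images.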
研究旨在通过提出VGG-T$^3$方法解决离线前馈3D重建方法的计算和内存限制问题,该方法在输入图像数量上呈线性扩展。方法通过测试时训练将场景几何的可变长度键值空间表示简化为固定大小的多层感知机。VGG-T$^3$在54秒内重建1k图像集,比基线方法快11.6倍,并在点云重建误差方面优于其他线性时间方法。该模型还通过未见过的图像展示了场景表示的视觉定位能力。
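The idea of replacing a variable-length KV set with fixed-size parameters fitted at test time can be illustrated with a toy example. The sketch below is not the VGG-T$^3$ architecture: a single linear layer stands in for the MLP, the keys and values are synthetic, and the function names and plain gradient-descent loop are assumptions chosen for brevity. It only shows that the fitted memory has a size independent of the number of tokens, and that each fitting step costs time linear in that number.

```python
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def fit_fixed_size_memory(
    keys: List[Vector], values: List[Vector], steps: int = 200, lr: float = 0.1
) -> Tuple[Matrix, List[float]]:
    """Fit W (d_out x d_in) by gradient descent so that W @ k approximates v.

    Each step touches every (key, value) pair once, so the cost per step is
    linear in the number of tokens. W itself never grows with the token count.
    """
    d_in, d_out, n = len(keys[0]), len(values[0]), len(keys)
    W = [[0.0] * d_in for _ in range(d_out)]
    losses = []
    for _ in range(steps):
        grad = [[0.0] * d_in for _ in range(d_out)]
        loss = 0.0
        for k, v in zip(keys, values):
            pred = [sum(W[i][j] * k[j] for j in range(d_in)) for i in range(d_out)]
            err = [pred[i] - v[i] for i in range(d_out)]
            loss += sum(e * e for e in err) / n
            for i in range(d_out):
                for j in range(d_in):
                    grad[i][j] += 2.0 * err[i] * k[j] / n
        for i in range(d_out):
            for j in range(d_in):
                W[i][j] -= lr * grad[i][j]
        losses.append(loss)  # mean squared error measured before this step's update
    return W, losses


def query(W: Matrix, q: Vector) -> Vector:
    """Read from the fixed-size memory; the cost does not depend on the token count."""
    return [sum(w * x for w, x in zip(row, q)) for row in W]


# Synthetic scene tokens: values are a hidden linear function of the keys.
random.seed(0)
d_in, d_out = 4, 3
hidden_map = [[random.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]
keys = [[random.uniform(-1, 1) for _ in range(d_in)] for _ in range(64)]
values = [[sum(hidden_map[i][j] * k[j] for j in range(d_in)) for i in range(d_out)] for k in keys]

W, losses = fit_fixed_size_memory(keys, values)
answer = query(W, keys[0])
```

By contrast, softmax attention over the same tokens compares every query against every key, which is where the quadratic cost in the number of input images comes from.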
SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation
Authors: Vaibhav Agrawal, Rishubh Parihar, Pradhaan Bhat, Ravi Kiran Sarvadevabhatla, R. Venkatesh Babu
Venue: CVPR 2026
First: 2026-02-26T18:59:05+00:00 · Latest: 2026-02-26T18:59:05+00:00
Comments: Project page: https://seethrough3d.github.io. Accepted at CVPR 2026
Abstract
We identify occlusion reasoning as a fundamental yet overlooked aspect for 3D layout-conditioned generation. It is essential for synthesizing partially occluded objects with depth-consistent geometry and scale. While existing methods can generate realistic scenes that follow input layouts, they often fail to model precise inter-object occlusions. We propose SeeThrough3D, a model for 3D layout-conditioned generation that explicitly models occlusions. We introduce an occlusion-aware 3D scene representation (OSCR), where objects are depicted as translucent 3D boxes placed within a virtual environment and rendered from the desired camera viewpoint. The transparency encodes hidden object regions, enabling the model to reason about occlusions, while the rendered viewpoint provides explicit camera control during generation. We condition a pretrained flow-based text-to-image generation model by introducing a set of visual tokens derived from our rendered 3D representation. Furthermore, we apply masked self-attention to accurately bind each object bounding box to its corresponding textual description, enabling accurate generation of multiple objects without object attribute mixing. To train the model, we construct a synthetic dataset with diverse multi-object scenes with strong inter-object occlusions. SeeThrough3D generalizes effectively to unseen object categories and enables precise 3D layout control with realistic occlusions and consistent camera control.
Summary / 总结
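The paper targets occlusion reasoning in 3D layout-conditioned text-to-image generation, which is essential for synthesizing partially occluded objects with depth-consistent geometry and scale. SeeThrough3D introduces an occlusion-aware 3D scene representation (OSCR) in which objects are translucent 3D boxes rendered from the desired camera viewpoint. Visual tokens derived from this rendering condition a pretrained flow-based text-to-image model, and masked self-attention binds each object bounding box to its textual description to avoid attribute mixing. Trained on a synthetic dataset of multi-object scenes with strong occlusions, the model generalizes to unseen object categories while keeping camera control consistent.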
研究旨在解决文本到图像生成中的遮挡推理问题,这对于创建具有深度一致几何和比例的场景至关重要。SeeThrough3D提出了一种遮挡感知的3D场景表示(OSCR),使用透明的3D盒子和渲染视角来明确建模遮挡。该模型通过从3D表示中提取的视觉标记来条件化预训练的流式文本到图像生成器,并使用掩蔽自注意力准确地将每个对象边界框与其相应的文本描述绑定,从而实现精确的3D布局控制和现实的遮挡效果。该模型能够很好地泛化到未见过的对象类别,并保持一致的相机控制。
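The masked self-attention that binds each box to its description can be pictured as a boolean mask over the token sequence. The sketch below is one plausible reading under stated assumptions: the token layout (`"box"` and `"text"` tokens tagged with an object id) and the rule that only cross-modal attention between different objects is blocked are hypothetical, and the abstract does not specify the actual mask.

```python
from typing import List, Tuple

Token = Tuple[str, int]  # (kind, object_id), where kind is "box" or "text"


def build_binding_mask(tokens: List[Token]) -> List[List[bool]]:
    """Return mask[i][j] == True when token i is allowed to attend to token j.

    Assumed rule: a box token and a text token may attend to each other only
    if they belong to the same object. Same-kind attention stays unrestricted.
    """
    n = len(tokens)
    mask = [[True] * n for _ in range(n)]
    for i, (kind_i, obj_i) in enumerate(tokens):
        for j, (kind_j, obj_j) in enumerate(tokens):
            if kind_i != kind_j and obj_i != obj_j:
                mask[i][j] = False  # block box<->text links across different objects
    return mask


# Two objects, each with one box token and one text token.
tokens = [("box", 0), ("box", 1), ("text", 0), ("text", 1)]
mask = build_binding_mask(tokens)
```

In an attention layer, positions where the mask is `False` would have their logits set to negative infinity before the softmax, so the description of one object cannot leak into another object's box.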
Sensor Generalization for Adaptive Sensing in Event-based Object Detection via Joint Distribution Training
Authors: Aheli Saha, René Schuster, Didier Stricker
First: 2026-02-26T18:57:52+00:00 · Latest: 2026-02-26T18:57:52+00:00
Comments: 12 pages, International Conference on Pattern Recognition Applications and Methods
Abstract
Bio-inspired event cameras have recently attracted significant research due to their asynchronous and low-latency capabilities. These features provide a high dynamic range and significantly reduce motion blur. However, because of the novelty in the nature of their output signals, there is a gap in the variability of available data and a lack of extensive analysis of the parameters characterizing their signals. This paper addresses these issues by providing readers with an in-depth understanding of how intrinsic parameters affect the performance of a model trained on event data, specifically for object detection. We also use our findings to expand the capabilities of the downstream model towards sensor-agnostic robustness.
中文标题/摘要
标题:事件驱动对象检测中基于事件的传感器泛化联合分布训练
受生物启发的事件相机由于其异步和低延迟特性,最近吸引了大量研究。这些特性提供了高动态范围并显著减少了运动模糊。然而,由于其输出信号的性质新颖,可用数据的变异性存在差距,且缺乏对其信号参数的广泛分析。本文通过提供对内在参数如何影响基于事件数据训练的模型性能的深入理解,解决了这些问题,特别是针对对象检测的应用。我们还利用研究结果扩展了下游模型的传感器无关鲁棒性。
Summary / 总结
This paper aims to enhance the adaptability of models trained on event data from bio-inspired event cameras for object detection. The authors employ joint distribution training to explore how intrinsic parameters influence model performance. Key findings show that understanding these parameters can improve the robustness of downstream models, making them sensor-generalizable and less dependent on specific sensor types.
本文旨在通过生物启发的事件相机的数据训练模型,提高其在目标检测中的适应性。作者使用联合分布训练来研究内在参数如何影响模型性能。主要发现表明,理解这些参数可以提高下游模型的鲁棒性,使其更具传感器通用性,减少对特定传感器类型的依赖。
Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning
Authors: Amita Kamath, Jack Hessel, Khyathi Chandu, Jena D. Hwang, Kai-Wei Chang, Ranjay Krishna
First: 2026-02-26T18:54:06+00:00 · Latest: 2026-02-26T18:54:06+00:00
Comments: TACL 2026
Abstract
The lack of reasoning capabilities in Vision-Language Models (VLMs) has remained at the forefront of research discourse. We posit that this behavior stems from a reporting bias in their training data. That is, how people communicate about visual content by default omits tacit information needed to supervise some types of reasoning; e.g., "at the game today!" is a more likely caption than "a photo of 37 people standing behind a field". We investigate the data underlying the popular VLMs OpenCLIP, LLaVA-1.5 and Molmo through the lens of theories from pragmatics, and find that reporting bias results in insufficient representation of four reasoning skills (spatial, temporal, negation, and counting), despite the corpora being of web-scale, and/or synthetically generated. With a set of curated benchmarks, we demonstrate that: (i) VLMs perform poorly on the aforementioned types of reasoning suppressed in the training data by reporting bias; (ii) contrary to popular belief, scaling data size, model size, and to multiple languages does not result in emergence of these skills by default; but, promisingly, (iii) incorporating annotations specifically collected to obtain tacit information is effective. Our findings highlight the need for more intentional training data curation methods, rather than counting on scale for emergence of reasoning capabilities.
中文标题/摘要
标题:规模无法克服语用学:报告偏差对视觉语言推理的影响
视觉语言模型(VLMs)缺乏推理能力的问题一直是研究讨论的核心。我们认为这种行为源于其训练数据中的报告偏差。也就是说,人们默认在描述视觉内容时会省略一些必要的隐含信息,以监督某些类型的推理;例如,“今天在比赛!”比“一张37个人站在田野后面的图片”更可能作为描述。我们通过语用学理论的视角,研究了流行的VLMs OpenCLIP、LLaVA-1.5和Molmo的数据基础,发现报告偏差导致在四个推理技能(空间、时间、否定和计数)上缺乏充分的表示,尽管这些语料库是大规模的,或者合成生成的。通过一组精心策划的基准测试,我们证明:(i) VLMs在由报告偏差抑制的上述类型推理上表现不佳;(ii) 与普遍认为的相反,增加数据量、模型规模和多语言训练并不会默认产生这些技能;但令人欣慰的是,(iii) 特别收集的用于获取隐含信息的注解是有效的。我们的研究结果强调了需要更故意的数据策划方法,而不是依赖规模来产生推理能力。
Summary / 总结
This study investigates the impact of reporting bias in training data on the reasoning capabilities of Vision-Language Models (VLMs). By analyzing the data underlying OpenCLIP, LLaVA-1.5, and Molmo, the research finds that reporting bias leaves spatial, temporal, negation, and counting reasoning underrepresented, even in web-scale or synthetically generated corpora. Scaling data size, model size, or the number of languages does not by default produce these skills. However, incorporating annotations specifically collected to capture tacit information is effective, suggesting the need for more intentional data curation methods.
研究探讨了报告偏见对Vision-Language模型(如OpenCLIP、LLaVA-1.5和Molmo)推理能力的影响。通过使用语用学理论分析训练数据,研究发现报告偏见导致空间、时间、否定和计数推理技能的不足表示。尽管有大规模和合成数据,这些模型在这类推理上表现不佳。增加数据或模型规模并不能改善这些技能,但特定注释的引入是有效的。这强调了需要更故意的数据整理方法,而不是依赖规模来产生推理能力。
FlashOptim: Optimizers for Memory Efficient Training
Authors: Jose Javier Gonzalez Ortiz, Abhay Gupta, Chris Renard, Davis Blalock
First: 2026-02-26T18:52:22+00:00 · Latest: 2026-02-26T18:52:22+00:00
Comments: Source code is available at https://github.com/databricks/flashoptim
Abstract
Standard mixed-precision training of neural networks requires many bytes of accelerator memory for each model parameter. These bytes reflect not just the parameter itself, but also its gradient and one or more optimizer state variables. With each of these values typically requiring 4 bytes, training even a 7 billion parameter model can be impractical for researchers with less than 100GB of accelerator memory.
We introduce FlashOptim, a suite of optimizations that reduces per-parameter memory by over 50% while preserving model quality and API compatibility. Our approach introduces two key techniques. First, we improve master weight splitting by finding and exploiting a tight bound on its quantization error. Second, we design companding functions that greatly reduce the error in 8-bit optimizer state quantization. Together with 16-bit gradients, these techniques reduce AdamW memory from 16 bytes to 7 bytes per parameter, or 5 bytes with gradient release. They also cut model checkpoint sizes by more than half.
Experiments with FlashOptim applied to SGD, AdamW, and Lion show no measurable quality degradation on any task from a collection of standard vision and language benchmarks, including Llama-3.1-8B finetuning.
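To see why a companding function helps 8-bit quantization, the sketch below compares plain linear quantization with quantization after a mu-law compressor. Mu-law is a classic companding curve used here purely as a stand-in: the abstract does not say which companding functions FlashOptim designs, and the assumption that values are pre-scaled to [-1, 1] is also ours.

```python
import math

MU = 255.0  # standard mu-law constant, used here only for illustration


def compress(x: float, mu: float = MU) -> float:
    """Mu-law compressor: expands resolution near zero. Expects x in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def expand(y: float, mu: float = MU) -> float:
    """Inverse of compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def quantize_8bit(y: float) -> int:
    """Map [-1, 1] to a signed integer in [-127, 127]."""
    return round(y * 127)


def dequantize_8bit(q: int) -> float:
    return q / 127


def roundtrip_linear(x: float) -> float:
    """Quantize directly, with uniform spacing between levels."""
    return dequantize_8bit(quantize_8bit(x))


def roundtrip_companded(x: float) -> float:
    """Compress, quantize, dequantize, then expand."""
    return expand(dequantize_8bit(quantize_8bit(compress(x))))


small = 0.001  # a small-magnitude value, as is common in optimizer states
linear_error = abs(roundtrip_linear(small) - small)
companded_error = abs(roundtrip_companded(small) - small)
```

With uniform levels, the small value rounds to zero and is lost entirely, whereas the companded path keeps it to within a few percent. The trade-off is coarser resolution for values near the ends of the range.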
中文标题/摘要
标题:FlashOptim:内存高效训练的优化器
标准的混合精度训练需要为每个模型参数分配大量加速器内存。这些字节不仅代表参数本身,还包括其梯度和一个或多个优化器状态变量。每个值通常需要4个字节,因此即使是70亿参数的模型,对于拥有不到100GB加速器内存的研究人员来说也可能不切实际。
我们引入了FlashOptim,这是一种优化套件,能够在保持模型质量和API兼容性的前提下,将每个参数的内存减少超过50%。我们的方法引入了两种关键技术。首先,我们通过找到并利用其量化误差的紧界来改进主权重分割。其次,我们设计了压缩函数,大大减少了8位优化器状态量化的误差。结合16位梯度,这些技术将AdamW的内存从每个参数16字节减少到7字节,或者在释放梯度的情况下减少到5字节。它们还使模型检查点的大小减少了超过一半。
在SGD、AdamW和Lion上应用FlashOptim的实验表明,在包括Llama-3.1-8B微调在内的标准视觉和语言基准任务中,没有任何可测量的质量下降。
Summary / 总结
FlashOptim introduces optimizations to reduce memory usage in neural network training by over 50% without compromising model quality. It achieves this through improved master weight splitting and specially designed companding functions for optimizer state quantization, reducing AdamW memory from 16 bytes to 7 bytes per parameter. Experiments show no quality degradation on various benchmarks, including Llama-3.1-8B finetuning.
FlashOptim 通过改进主权重分割和优化 8 位优化器状态量化中的压扩函数,将神经网络训练的内存占用减少超过 50%,同时保持模型质量和 API 兼容性。实验表明,FlashOptim 在 SGD、AdamW 和 Lion 上的应用在各种基准测试中,包括 Llama-3.1-8B 微调,没有质量下降。
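The memory figures quoted above can be checked with simple arithmetic. The sketch below counts only per-parameter training state (weights, gradients, optimizer state) and ignores activations and framework overhead, which is an assumption about what the quoted byte counts cover. The 16-byte baseline follows from four 4-byte values per parameter under AdamW: the parameter, its gradient, and two optimizer moments.

```python
def training_state_gb(num_params: int, bytes_per_param: int) -> float:
    """Per-parameter training state in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 7_000_000_000  # the 7-billion-parameter model from the abstract

baseline_gb = training_state_gb(NUM_PARAMS, 16)      # standard AdamW: 4 values x 4 bytes
flashoptim_gb = training_state_gb(NUM_PARAMS, 7)     # FlashOptim AdamW
with_release_gb = training_state_gb(NUM_PARAMS, 5)   # FlashOptim with gradient release

reduction = 1 - 7 / 16  # fraction of per-parameter memory saved

print(f"baseline:            {baseline_gb:.0f} GB")
print(f"FlashOptim:          {flashoptim_gb:.0f} GB")
print(f"with gradient release: {with_release_gb:.0f} GB")
print(f"reduction:           {reduction:.2%}")
```

The baseline comes to 112 GB, which is why the abstract calls a 7-billion-parameter model impractical below 100 GB of accelerator memory. At 7 bytes per parameter the same state fits in 49 GB, a 56.25% reduction.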
Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?
Authors: Tilemachos Aravanis, Vladan Stojnić, Bill Psomas, Nikos Komodakis, Giorgos Tolias
First: 2026-02-26T18:45:33+00:00 · Latest: 2026-02-26T18:45:33+00:00
Abstract
Open-vocabulary segmentation (OVS) extends the zero-shot recognition capabilities of vision-language models (VLMs) to pixel-level prediction, enabling segmentation of arbitrary categories specified by text prompts. Despite recent progress, OVS lags behind fully supervised approaches due to two challenges: the coarse image-level supervision used to train VLMs and the semantic ambiguity of natural language. We address these limitations by introducing a few-shot setting that augments textual prompts with a support set of pixel-annotated images. Building on this, we propose a retrieval-augmented test-time adapter that learns a lightweight, per-image classifier by fusing textual and visual support features. Unlike prior methods relying on late, hand-crafted fusion, our approach performs learned, per-query fusion, achieving stronger synergy between modalities. The method supports continually expanding support sets, and applies to fine-grained tasks such as personalized segmentation. Experiments show that we significantly narrow the gap between zero-shot and supervised segmentation while preserving open-vocabulary ability.
中文标题/摘要
标题:检索与分割:少量示例足以弥合开放词汇分割中的监督缺口吗?
开放词汇分割(OVS)将视觉语言模型(VLMs)的零样本识别能力扩展到像素级预测,使模型能够根据文本提示分割任意类别。尽管取得了进展,但由于训练VLMs所使用的粗略图像级监督和自然语言的语义模糊性,OVS仍落后于完全监督的方法。我们通过引入一种少量样本设置,将文本提示与像素标注图像的支持集相结合,来解决这些限制。在此基础上,我们提出了一种检索增强的测试时适配器,通过融合文本和视觉支持特征学习一种轻量级的、针对每张图像的分类器。与依赖于后期手工融合的先前方法不同,我们的方法进行学习的、针对每个查询的融合,实现了模态之间的更强协同作用。该方法支持不断扩展的支持集,并适用于细粒度任务,如个性化分割。实验表明,我们显著缩小了零样本和监督分割之间的差距,同时保留了开放词汇的能力。
Summary / 总结
The paper addresses the limitations of open-vocabulary segmentation (OVS) by proposing a few-shot setting that combines textual prompts with pixel-annotated images. It introduces a retrieval-augmented test-time adapter that learns a lightweight classifier by fusing textual and visual support features, achieving better synergy between modalities than prior methods. Experiments demonstrate that this approach significantly reduces the gap between zero-shot and supervised segmentation while maintaining open-vocabulary capabilities.
论文旨在通过解决粗粒度图像级监督和自然语言语义模糊的问题来提升开放词汇分割(OVS)。它提出了一种带有像素标注图像支持集的少量样本设置,并提出了一种检索增强的测试时适配器,通过融合文本和视觉特征来学习轻量级分类器。实验表明,这种方法显著缩小了零样本和监督分割之间的差距,同时保持了开放词汇的能力。
LLM Novice Uplift on Dual-Use, In Silico Biology Tasks
Authors: Chen Bo Calvin Zhang, Christina Q. Knight, Nicholas Kruus, Jason Hausenloy, Pedro Medeiros, Nathaniel Li, Aiden Kim, Yury Orlovskiy, Coleman Breen, Bryce Cai, Jasper Götting, Andrew Bo Liu, Samira Nedungadi, Paula Rodriguez, Yannis Yiming He, Mohamed Shaaban, Zifan Wang, Seth Donoughe, Julian Michael
First: 2026-02-26T18:37:23+00:00 · Latest: 2026-02-26T18:37:23+00:00
Comments: 59 pages, 33 figures
Abstract
Large language models (LLMs) perform increasingly well on biology benchmarks, but it remains unclear whether they uplift novice users -- i.e., enable humans to perform better than with internet-only resources. This uncertainty is central to understanding both scientific acceleration and dual-use risk. We conducted a multi-model, multi-benchmark human uplift study comparing novices with LLM access versus internet-only access across eight biosecurity-relevant task sets. Participants worked on complex problems with ample time (up to 13 hours for the most involved tasks). We found that LLM access provided substantial uplift: novices with LLMs were 4.16 times more accurate than controls (95% CI [2.63, 6.87]). On four benchmarks with available expert baselines (internet-only), novices with LLMs outperformed experts on three of them. Perhaps surprisingly, standalone LLMs often exceeded LLM-assisted novices, indicating that users were not eliciting the strongest available contributions from the LLMs. Most participants (89.6%) reported little difficulty obtaining dual-use-relevant information despite safeguards. Overall, LLMs substantially uplift novices on biological tasks previously reserved for trained practitioners, underscoring the need for sustained, interactive uplift evaluations alongside traditional benchmarks.
中文标题/摘要
标题:LLM初学者在双重用途和计算生物学任务中的提升
大型语言模型(LLM)在生物学基准测试中的表现越来越出色,但尚不清楚它们是否能提升初学者的表现,即是否能使人比仅使用互联网资源时表现更好。这种不确定性是理解科学加速和双重用途风险的关键。我们进行了一个多模型、多基准的人类提升研究,比较了有LLM访问权限的初学者与仅有互联网访问权限的初学者在八个与生物安全相关的任务集中的表现。参与者在复杂问题上工作,有充足的时间(最复杂任务最多13小时)。我们发现,LLM访问提供了显著的提升:有LLM的初学者比对照组准确度高4.16倍(95% CI [2.63, 6.87])。在四个有专家基线的基准测试中(仅有互联网资源),有LLM的初学者在三个基准测试中表现优于专家。令人惊讶的是,独立的LLM往往超过了LLM辅助的初学者,表明用户没有从LLM中获得最强的可用贡献。大多数参与者(89.6%)报告称,尽管有保护措施,获取与双重用途相关的信息并不困难。总体而言,LLM显著提升了初学者在以前仅由训练有素的从业者完成的生物学任务中的表现,强调了需要在传统基准测试的同时进行持续的互动提升评估。
Summary / 总结
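This study tests whether LLMs uplift novice users on biology tasks, i.e., enable them to perform better than with internet-only resources. Across eight biosecurity-relevant task sets, novices with LLM access were 4.16 times more accurate than internet-only controls (95% CI [2.63, 6.87]) and outperformed internet-only experts on three of the four benchmarks with expert baselines. Standalone LLMs often exceeded LLM-assisted novices, and most participants (89.6%) reported little difficulty obtaining dual-use-relevant information despite safeguards. The results underscore the need for sustained, interactive uplift evaluations alongside traditional benchmarks.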
研究探讨了大型语言模型(LLMs)是否能帮助初学者在生物学任务上比仅使用互联网资源表现得更好。在八个与生物安全相关的任务集中,拥有LLM访问权限的参与者比没有的参与者准确度高4.16倍,甚至独立的LLM往往超过了LLM辅助的初学者。值得注意的是,尽管有防护措施,大多数参与者发现获取双重用途相关信息并不困难。这些结果表明,LLMs显著提升了初学者在复杂生物学任务上的表现,突显了持续的互动评估的重要性。
DropVLA: An Action-Level Backdoor Attack on Vision--Language--Action Models
Authors: Zonghuan Xu, Xiang Zheng, Xingjun Ma, Yu-Gang Jiang
First: 2025-10-13T02:45:48+00:00 · Latest: 2026-02-26T18:32:27+00:00
Comments: 8 pages, 6 tables, 3 figures. Under review
Abstract
Vision-Language-Action (VLA) models map multimodal perception and language instructions to executable robot actions, making them particularly vulnerable to behavioral backdoor manipulation: a hidden trigger introduced during training can induce unintended physical actions while nominal task performance remains intact. Prior work on VLA backdoors primarily studies untargeted attacks or task-level hijacking, leaving fine-grained control over individual actions largely unexplored. In this work, we present DropVLA, an action-level backdoor attack that forces a reusable action primitive (e.g., open_gripper) to execute at attacker-chosen decision points under a realistic pipeline-black-box setting with limited data-poisoning access, using a window-consistent relabeling scheme for chunked fine-tuning. On OpenVLA-7B evaluated with LIBERO, vision-only poisoning achieves 98.67%-99.83% attack success rate (ASR) with only 0.31% poisoned episodes while preserving 98.50%-99.17% clean-task retention, and successfully triggers the targeted action within 25 control steps at 500 Hz (0.05 s). Text-only triggers are unstable at low poisoning budgets, and combining text with vision provides no consistent ASR improvement over vision-only attacks. The backdoor remains robust to moderate trigger variations and transfers across evaluation suites (96.27%, 99.09%), whereas text-only largely fails (0.72%). We further validate physical-world feasibility on a 7-DoF Franka arm with pi0-fast, demonstrating non-trivial attack efficacy under camera-relative motion that induces image-plane trigger drift. These results reveal that VLA models can be covertly steered at the granularity of safety-critical actions with minimal poisoning and without observable degradation of nominal performance.
中文标题/摘要
标题:DropVLA:视觉-语言-行动模型中的行动级后门攻击
视觉-语言-行动(VLA)模型将多模态感知和语言指令映射为可执行的机器人动作,使其特别容易受到行为后门操纵:在训练过程中引入的隐藏触发器可以在不影响名义任务性能的情况下诱导意外的物理动作。先前对VLA后门的研究主要集中在无目标攻击或任务级劫持上,而对个体动作的精细控制尚未得到充分探索。在本研究中,我们提出了DropVLA,这是一种行动级后门攻击,能够在有限的数据污染访问权限下,使用窗口一致的重新标记方案进行分块微调,迫使可重用的动作原语(例如,open_gripper)在现实的管道黑盒设置中的攻击者选择的决策点执行。在使用LIBERO评估的OpenVLA-7B上,仅通过视觉污染,攻击成功率(ASR)达到98.67%-99.83%,污染的剧集比例仅为0.31%,同时保留了98.50%-99.17%的任务清洁保留率,并在25个控制步骤内(500 Hz,0.05秒)成功触发了目标动作。仅文本触发在低污染预算下不稳定,结合文本与视觉并不能在视觉污染攻击上提供一致的ASR改进。后门对适度的触发器变化具有鲁棒性,并且可以在评估套件之间进行转移(96.27%,99.09%),而仅文本则大多失败(0.72%)。我们还在7自由度的Franka手臂上通过pi0-fast验证了物理世界的可行性,展示了在相机相对运动下诱导图像平面触发漂移的非平凡攻击效果。这些结果表明,VLA模型可以在最小的污染和无明显名义性能退化的情况下,被隐蔽地引导到关键安全动作级别。
Summary / 总结
DropVLA is an action-level backdoor attack on VLA models that forces a specific action primitive to execute at attacker-chosen points. Using a window-consistent relabeling scheme, the attack achieves a high success rate of 98.67%-99.83% with minimal data poisoning, while maintaining task performance. The attack is robust to moderate trigger variations and transfers across different evaluation suites. Physical-world experiments on a 7-DoF Franka arm demonstrate the attack's effectiveness under camera-relative motion, highlighting the vulnerability of VLA models to such fine-grained control attacks.
DropVLA 是一种针对 VLA 模型的动作级后门攻击,能够在攻击者选择的点强制执行特定的动作原语。通过窗口一致的重新标记方案,该攻击在极少量数据污染的情况下实现了 98.67%-99.83% 的高成功率,同时保持了任务性能。该攻击对适度的触发器变化具有鲁棒性,并且可以在不同的评估套件之间进行转移。物理世界实验在 7 自由度的 Franka 手臂上展示了在相机相对运动下攻击的有效性,突显了 VLA 模型对这种精细粒度控制攻击的脆弱性。
LinGuinE: Longitudinal Guidance Estimation for Volumetric Tumour Segmentation
Authors: Nadine Garibli, Mayank Patwari, Bence Csiba, Yi Wei, Kostantinos Sidiropoulos
First: 2025-06-06T13:52:33+00:00 · Latest: 2026-02-26T18:27:23+00:00
Comments: 10 pages, 2 figures
Abstract
Longitudinal volumetric tumour segmentation is critical for radiotherapy planning and response assessment, yet this problem is underexplored and most methods produce single-timepoint semantic masks, lack lesion correspondence, and offer limited radiologist control. We introduce LinGuinE (Longitudinal Guidance Estimation), a PyTorch framework that combines image registration and guided segmentation to deliver lesion-level tracking and volumetric masks across all scans in a longitudinal study from a single radiologist prompt. LinGuinE is temporally direction agnostic, requires no training on longitudinal data, and allows any registration and semi-automatic segmentation algorithm to be repurposed for the task. We evaluate various combinations of registration and segmentation algorithms within the framework. LinGuinE achieves state-of-the-art segmentation and tracking performance across four datasets with a total of 456 longitudinal studies. Tumour segmentation performance shows minimal degradation with increasing temporal separation. We conduct ablation studies to determine the impact of autoregression, pathology specific finetuning, and the use of real radiologist prompts. We release our code and substantial public benchmarking for longitudinal segmentation, facilitating future research.
中文标题/摘要
标题:LinGuinE: 长期体积肿瘤分割的纵向引导估计
长期体积肿瘤分割对于放射治疗计划和反应评估至关重要,但这一问题尚未得到充分探索,大多数方法仅生成单一时点的语义掩码,缺乏病灶对应关系,并且对放射科医生的控制有限。我们引入了LinGuinE(纵向引导估计),这是一种结合图像配准和引导分割的PyTorch框架,能够从单个放射科医生的提示中在纵向研究的所有扫描中提供病灶级别的跟踪和体积掩码。LinGuinE在时间方向上是无方向性的,不需要在纵向数据上进行训练,并允许任何配准和半自动分割算法重新用于此任务。我们评估了框架内的各种配准和分割算法组合。LinGuinE在四个数据集的总共456个纵向研究中实现了最先进的分割和跟踪性能。肿瘤分割性能随时间分离度增加而最小化下降。我们进行了消融研究以确定自回归、病理特异性微调以及使用真实放射科医生提示的影响。我们发布了我们的代码和大量的公共基准测试,促进未来的研究。
Summary / 总结
LinGuinE is a PyTorch framework designed for longitudinal volumetric tumour segmentation, addressing the limitations of existing methods by providing lesion-level tracking and volumetric masks across all scans in a longitudinal study from a single radiologist prompt. It combines image registration and guided segmentation, requiring no training on longitudinal data and allowing the repurposing of any registration and semi-automatic segmentation algorithms. LinGuinE demonstrates state-of-the-art performance across four datasets with 456 longitudinal studies, showing minimal degradation in tumour segmentation performance with increasing temporal separation. Ablation studies assess the impact of autoregression, pathology-specific fine-tuning, and the use of real radiologist prompts.
LinGuinE 是一个结合图像配准和引导分割的 PyTorch 框架,用于估计纵向引导以进行体积肿瘤分割,解决了单时点掩码和缺乏病灶对应的问题。该框架在四个数据集共 456 个纵向研究中实现了最先进的性能,并且随时间间隔增加肿瘤分割性能的下降幅度很小。进行了消融研究以评估自回归、病理特异性微调和使用真实放射科医生提示的影响,并公开了代码和基准数据以促进进一步研究。
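The combination of registration and guided segmentation can be sketched as a loop that carries one prompt to every timepoint. This is an illustration under assumptions, not the LinGuinE API: `segment_all_timepoints`, the `register` and `segment` callables, and the toy stand-ins below are hypothetical, and the prompt is registered directly from the prompted scan rather than autoregressively.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

Point = Tuple[int, int, int]
Transform = Callable[[Point], Point]


def segment_all_timepoints(
    scans: Sequence[Any],
    prompt_index: int,
    prompt: Point,
    register: Callable[[Any, Any], Transform],
    segment: Callable[[Any, Point], Any],
) -> List[Any]:
    """Produce one mask per scan from a single prompt placed on one scan.

    The prompt can sit on any timepoint, so propagation runs both forward
    and backward in time.
    """
    masks = []
    for i, scan in enumerate(scans):
        if i == prompt_index:
            point = prompt
        else:
            # Map the prompt from the prompted scan into this scan's coordinates.
            to_this_scan = register(scans[prompt_index], scan)
            point = to_this_scan(prompt)
        masks.append(segment(scan, point))
    return masks


# Toy stand-ins: registration is a pure translation between scan origins,
# and "segmentation" just records where it was seeded.
def toy_register(source: Dict[str, Any], target: Dict[str, Any]) -> Transform:
    shift = tuple(t - s for s, t in zip(source["origin"], target["origin"]))
    return lambda p: tuple(c + d for c, d in zip(p, shift))


def toy_segment(scan: Dict[str, Any], point: Point) -> Dict[str, Any]:
    return {"scan": scan["name"], "seed": point}


scans = [
    {"name": "t0", "origin": (0, 0, 0)},
    {"name": "t1", "origin": (2, 0, 1)},
    {"name": "t2", "origin": (5, 1, 1)},
]
masks = segment_all_timepoints(scans, 1, (10, 10, 10), toy_register, toy_segment)
```

Because `register` and `segment` are passed in as plain callables, any registration method or promptable segmentation model with a matching interface could be swapped in, which mirrors the claim that existing algorithms can be repurposed for the task.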
Evaluating Zero-Shot and One-Shot Adaptation of Small Language Models in Leader-Follower Interaction
Authors: Rafael R. Baptista, André de Lima Salgado, Ricardo V. Godoy, Marcelo Becker, Thiago Boaventura, Gustavo J. G. Lahr
First: 2026-02-26T18:20:26+00:00 · Latest: 2026-02-26T18:20:26+00:00
Abstract
Leader-follower interaction is an important paradigm in human-robot interaction (HRI). Yet, assigning roles in real time remains challenging for resource-constrained mobile and assistive robots. While large language models (LLMs) have shown promise for natural communication, their size and latency limit on-device deployment. Small language models (SLMs) offer a potential alternative, but their effectiveness for role classification in HRI has not been systematically evaluated. In this paper, we present a benchmark of SLMs for leader-follower communication, introducing a novel dataset derived from a published database and augmented with synthetic samples to capture interaction-specific dynamics. We investigate two adaptation strategies: prompt engineering and fine-tuning, studied under zero-shot and one-shot interaction modes, compared with an untrained baseline. Experiments with Qwen2.5-0.5B reveal that zero-shot fine-tuning achieves robust classification performance (86.66% accuracy) while maintaining low latency (22.2 ms per sample), significantly outperforming baseline and prompt-engineered approaches. However, results also indicate a performance degradation in one-shot modes, where increased context length challenges the model's architectural capacity. These findings demonstrate that fine-tuned SLMs provide an effective solution for direct role assignment, while highlighting critical trade-offs between dialogue complexity and classification reliability on the edge.
中文标题/摘要
标题:小语言模型在领导者-跟随者互动中的零样本和单样本适应性评估
领导者-跟随者互动是人机交互(HRI)中的一个重要范式。然而,为资源受限的移动和辅助机器人实时分配角色仍然具有挑战性。虽然大型语言模型(LLMs)在自然通信方面显示出潜力,但其规模和延迟限制了其在设备上的部署。小语言模型(SLMs)提供了一种替代方案,但它们在HRI中的角色分类效果尚未系统评估。在本文中,我们提出了SLMs在领导者-跟随者通信中的基准测试,引入了一个从已发表数据库派生的新数据集,并通过合成样本捕捉互动特定的动力学。我们研究了两种适应策略:提示工程和微调,在零样本和单样本交互模式下进行研究,并与未训练基线进行比较。实验结果表明,零样本微调在保持低延迟(每样本22.2毫秒)的同时实现了稳健的分类性能(准确率为86.66%),显著优于基线和提示工程方法。然而,结果还表明,在单样本模式下性能有所下降,其中增加的上下文长度挑战了模型的架构能力。这些发现表明,微调后的SLMs为直接角色分配提供了一个有效的解决方案,同时突显了对话复杂性和分类可靠性之间的关键权衡问题。
Summary / 总结
This study evaluates the effectiveness of small language models (SLMs) for leader-follower role assignment in human-robot interaction (HRI), using a novel dataset and two adaptation strategies: prompt engineering and fine-tuning. Experiments with Qwen2.5-0.5B show that zero-shot fine-tuning achieves high accuracy (86.66%) and low latency (22.2 ms per sample), outperforming baseline and prompt-engineered approaches. However, performance drops in one-shot modes due to increased context length challenges. This highlights the trade-offs between dialogue complexity and classification reliability for edge deployment.
研究评估了小语言模型(SLMs)在人类-机器人交互(HRI)中的领导者-跟随者互动的有效性,重点关注零样本和单样本适应策略。研究引入了一个新数据集,并探讨了提示工程和微调方法。实验表明,零样本微调在Qwen2.5-0.5B上实现了高准确率(86.66%)和低延迟(每样本22.2毫秒),优于基线和提示工程方法,但在单样本模式中由于上下文长度增加导致性能下降。
Evaluating the Diversity and Quality of LLM Generated Content
Authors: Alexander Shypula, Shuo Li, Botong Zhang, Vishakh Padmakumar, Kayo Yin, Osbert Bastani
First: 2025-04-16T23:02:23+00:00 · Latest: 2026-02-26T18:17:44+00:00
Comments: Published at COLM 2025
Abstract
Recent work suggests that preference-tuning techniques -- such as Reinforcement Learning from Human Feedback (RLHF) methods like PPO and GRPO, as well as alternatives like DPO -- reduce diversity, creating a dilemma given that these models are widely deployed in applications requiring varied outputs. We argue that diversity without consideration of quality has limited practical value. To address this issue, we introduce a framework for measuring effective semantic diversity -- diversity among outputs that meet quality thresholds -- which better reflects the practical utility of large language models (LLMs). Using open-ended tasks that require no human intervention, we find counterintuitive results: when using diversity metrics that do not explicitly consider quality, preference-tuned models -- particularly those trained via RL -- often produce outputs with lower diversity; however, these same preference-tuned models generate greater effective semantic diversity than supervised fine-tuned (SFT) or base models. Our analysis further shows another trend: while larger models may exhibit greater effective semantic diversity than smaller models, the smaller models are consistently more parameter-efficient at producing unique content within a fixed sampling budget. These findings have practical implications for applications that require diverse yet high-quality outputs, from creative assistance to synthetic data generation.
中文标题/摘要
标题:评估LLM生成内容的多样性和质量
近期研究表明,偏好调优技术——如基于人类反馈强化学习(RLHF)方法(如PPO和GRPO),以及替代方法DPO——降低了多样性,这给这些模型在需要多样化输出的应用中广泛应用带来了困境。我们认为,不考虑质量的多样性在实际应用中价值有限。为解决这一问题,我们提出了一种衡量有效语义多样性的框架——衡量满足质量标准的输出之间的多样性——这更好地反映了大型语言模型(LLM)的实际效用。通过不需要人类干预的开放任务,我们发现了一些反直觉的结果:当使用不考虑质量的多样性度量时,偏好调优模型——尤其是通过RL训练的模型——往往生成的输出多样性较低;然而,这些偏好调优模型生成的有效语义多样性却大于监督微调(SFT)或基础模型。我们的分析还显示了另一种趋势:虽然较大的模型可能在固定采样预算内生成更独特的内容方面表现出更大的有效语义多样性,但较小的模型在生成独特内容方面始终更具有参数效率。这些发现对需要多样化且高质量输出的应用具有实际意义,从创意辅助到合成数据生成。
Summary / 总结
This study evaluates the diversity and quality of content generated by large language models (LLMs) and introduces a framework to measure effective semantic diversity, which considers both diversity and quality. Using open-ended tasks, the research finds that preference-tuned models, especially those trained via reinforcement learning, produce less diverse outputs but greater effective semantic diversity compared to supervised fine-tuned or base models. The study also reveals that smaller models are more parameter-efficient in generating unique content within a fixed budget.
研究评估了大型语言模型(LLM)生成内容的多样性和质量,并引入了一个同时考虑多样性和质量的有效语义多样性测量框架。通过使用无需人工干预的开放任务,研究发现,偏好调优模型,尤其是通过强化学习训练的模型,在使用标准多样性度量时生成的多样性较低,但与监督微调或基础模型相比,生成的有效语义多样性更高。此外,较小的模型在固定采样预算内生成独特内容方面更具参数效率。
Conformal Prediction with Corrupted Labels: Uncertain Imputation and Robust Re-weighting
Authors: Shai Feldman, Stephen Bates, Yaniv Romano
First: 2025-05-07T18:46:02+00:00 · Latest: 2026-02-26T18:16:20+00:00
Abstract
We introduce a framework for robust uncertainty quantification in situations where labeled training data are corrupted, through noisy or missing labels. We build on conformal prediction, a statistical tool for generating prediction sets that cover the test label with a pre-specified probability. The validity of conformal prediction, however, holds under the i.i.d assumption, which does not hold in our setting due to the corruptions in the data. To account for this distribution shift, the privileged conformal prediction (PCP) method proposed leveraging privileged information (PI) -- additional features available only during training -- to re-weight the data distribution, yielding valid prediction sets under the assumption that the weights are accurate. In this work, we analyze the robustness of PCP to inaccuracies in the weights. Our analysis indicates that PCP can still yield valid uncertainty estimates even when the weights are poorly estimated. Furthermore, we introduce uncertain imputation (UI), a new conformal method that does not rely on weight estimation. Instead, we impute corrupted labels in a way that preserves their uncertainty. Our approach is supported by theoretical guarantees and validated empirically on both synthetic and real benchmarks. Finally, we show that these techniques can be integrated into a triply robust framework, ensuring statistically valid predictions as long as at least one underlying method is valid.
中文标题/摘要
标题:带污染标签的共形预测:不确定插补和稳健重加权
我们提出了一种框架,用于在标记训练数据受到噪声或缺失标签污染的情况下,进行稳健的不确定性量化。我们基于共形预测,这是一种生成预测集的统计工具,该预测集以预先指定的概率覆盖测试标签。然而,共形预测的有效性依赖于独立同分布假设,而在我们的设置中,由于数据中的污染,该假设不成立。为了应对这种分布偏移,此前提出的特权共形预测(PCP)方法利用特权信息(PI,即仅在训练期间可用的额外特征)对数据分布进行重新加权,从而在假设权重准确的情况下生成有效的预测集。在本文中,我们分析了PCP对权重估计不准确的鲁棒性。我们的分析表明,即使权重估计不准确,PCP仍然可以生成有效的不确定性估计。此外,我们引入了一种新的共形方法:不确定插补(UI),该方法不依赖于权重估计。相反,我们以保持标签不确定性的方式插补被污染的标签。我们的方法得到了理论保证,并在合成和真实基准上得到了实证验证。最后,我们展示了这些技术可以集成到三重稳健框架中,只要至少有一种基础方法有效,就可以确保统计上有效的预测。
Summary / 总结
This paper addresses the challenge of robust uncertainty quantification in the presence of corrupted labels by proposing a framework that builds on conformal prediction. The framework includes privileged conformal prediction (PCP) which re-weights data to account for distribution shifts due to corruptions, and uncertain imputation (UI) which imputes corrupted labels while preserving their uncertainty. Theoretical analysis and empirical validation on synthetic and real benchmarks demonstrate that PCP can still provide valid uncertainty estimates even with inaccurate weights, and UI offers a robust alternative that does not require weight estimation.
本文解决了训练数据被污染时机器学习模型的稳健不确定性量化问题。它基于一种生成具有指定覆盖概率的预测集的方法,即共形预测。论文分析了在权重估计不准确的情况下,特权共形预测(PCP)的鲁棒性,并引入了一种新的方法:不确定插补(UI),该方法不依赖于权重估计。理论保证和合成及真实基准上的实证验证表明了这些方法的有效性,并且它们可以整合到一个三重稳健框架中,只要至少一种基础方法有效,就能确保统计上可靠的预测。
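For readers unfamiliar with the baseline this entry builds on, the sketch below shows plain split conformal prediction on clean, exchangeable data. It is not the paper's PCP or UI method: the function names, the toy residuals, and the choice of absolute-residual scores are all illustrative assumptions. It only shows where the calibration quantile comes from, which is the step that stops being valid once labels are corrupted.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split-conformal threshold for miscoverage level alpha.

    cal_scores are nonconformity scores (here: |y - y_hat|) computed on a
    held-out calibration set that is assumed to be clean and exchangeable.
    """
    n = len(cal_scores)
    # Rank of the order statistic used as threshold: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: only the trivial set is valid.
        return math.inf
    return sorted(cal_scores)[k - 1]


def prediction_interval(point_prediction, threshold):
    """Symmetric interval around a point prediction."""
    return (point_prediction - threshold, point_prediction + threshold)


# Hypothetical calibration residuals, for illustration only.
cal_scores = [0.1, 0.4, 0.2, 0.9, 0.3, 0.5, 0.7]
q = conformal_threshold(cal_scores, alpha=0.25)
interval = prediction_interval(2.0, q)
print(q, interval)
```

With noisy or missing calibration labels the scores above are no longer exchangeable with the test score, which is the gap the re-weighting and imputation ideas in the abstract are meant to close.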
ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding
Authors: Yiran Guan, Sifan Tu, Dingkang Liang, Linghao Zhu, Jianzhong Ju, Zhenbo Luo, Jian Luan, Yuliang Liu, Xiang Bai
Venue: ICLR 2026
First: 2026-02-26T18:10:41+00:00 · Latest: 2026-02-26T18:10:41+00:00
Comments: Accepted by ICLR 2026
Abstract
Omni-modal reasoning is essential for intelligent systems to understand and draw inferences from diverse data sources. While existing omni-modal large language models (OLLM) excel at perceiving diverse modalities, they lack the complex reasoning abilities of recent large reasoning models (LRM). However, enhancing the reasoning ability of OLLMs through additional training presents significant challenges, including the need for high-quality data, task-specific adaptation, and substantial computational costs. To address these limitations, we propose ThinkOmni, a training-free and data-free framework that lifts textual reasoning to omni-modal scenarios. ThinkOmni introduces two key components: 1) LRM-as-a-Guide, which leverages off-the-shelf LRMs to guide the OLLM decoding process; 2) Stepwise Contrastive Scaling, which adaptively balances perception and reasoning signals without manual hyperparameter tuning. Experiments on six multi-modal reasoning benchmarks demonstrate that ThinkOmni consistently delivers performance improvements, with main results achieving 70.2 on MathVista and 75.5 on MMAU. Overall, ThinkOmni offers a flexible and generalizable solution for omni-modal reasoning and provides new insights into the generalization and application of reasoning capabilities.
中文标题/摘要
标题:ThinkOmni:通过指导解码提升到全模态场景的文本推理
全模态推理对于智能系统理解并从多种数据源中推断信息至关重要。虽然现有的全模态大型语言模型(OLLM)在感知多种模态方面表现出色,但它们缺乏近期大型推理模型(LRM)的复杂推理能力。然而,通过额外训练来增强OLLM的推理能力面临着重大挑战,包括高质量数据的需求、任务特定的适应以及巨大的计算成本。为了解决这些限制,我们提出了ThinkOmni,这是一种无需训练和数据的框架,将文本推理提升到全模态场景。ThinkOmni引入了两个关键组件:1)LRM-as-a-Guide,利用现成的LRM来指导OLLM的解码过程;2)逐步对比缩放,无需手动超参数调整即可适应性平衡感知和推理信号。在六个跨模态推理基准上的实验表明,ThinkOmni始终能够提供性能改进,主要结果在MathVista上达到70.2,在MMAU上达到75.5。总体而言,ThinkOmni提供了一种灵活且通用的全模态推理解决方案,并为推理能力的泛化和应用提供了新的见解。
Summary / 总结
ThinkOmni is a training-free and data-free framework that enhances the reasoning capabilities of omni-modal large language models (OLLMs) by leveraging off-the-shelf large reasoning models (LRMs) and a stepwise contrastive scaling method. Experiments on six multi-modal reasoning benchmarks show that ThinkOmni improves performance, achieving 70.2 on MathVista and 75.5 on MMAU.
ThinkOmni 是一个无需训练和数据的框架,通过利用现成的大型推理模型(LRMs)和逐步对比缩放方法来增强全模态大型语言模型(OLLMs)的推理能力。实验结果显示,ThinkOmni 在六个多模态推理基准上的表现得到提升,分别在 MathVista 达到 70.2,在 MMAU 达到 75.5。
DRESS: A Continuous Framework for Structural Graph Refinement
Authors: Eduar Castrillo Velilla
First: 2026-02-24T12:18:42+00:00 · Latest: 2026-02-26T18:10:20+00:00
Abstract
The Weisfeiler-Lehman (WL) hierarchy is a cornerstone framework for graph isomorphism testing and structural analysis. However, scaling beyond 1-WL to 3-WL and higher requires tensor-based operations that scale as $\mathcal{O}(n^3)$ or $\mathcal{O}(n^4)$, making them computationally prohibitive for large graphs. In this paper, we start from the Original-DRESS equation (Castrillo, León, and Gómez, 2018) -- a parameter-free, continuous dynamical system on edges -- and show that it distinguishes the prism graph from $K_{3,3}$, a pair that 1-WL provably cannot separate. We then generalize it to Motif-DRESS, which replaces triangle neighborhoods with arbitrary structural motifs and converges to a unique fixed point under three sufficient conditions, and further to Generalized-DRESS, an abstract template parameterized by the choice of neighborhood operator, aggregation function and norm. Finally, we introduce $Δ$-DRESS, which runs DRESS on each node-deleted subgraph $G \setminus \{v\}$, connecting the framework to the Kelly--Ulam reconstruction conjecture. Both Motif-DRESS and $Δ$-DRESS empirically distinguish Strongly Regular Graphs (SRGs) -- such as the Rook and Shrikhande graphs -- that confound 3-WL. Our results establish the DRESS family as a highly scalable framework that empirically surpasses both 1-WL and 3-WL on well-known benchmark graphs, without the prohibitive $\mathcal{O}(n^4)$ computational cost.
中文标题/摘要
标题:DRESS:一种连续的结构图细化框架
魏斯费勒-莱曼(WL)层次结构是图同构测试和结构分析的核心框架。然而,将范围从1-WL扩展到3-WL及以上需要基于张量的操作,其复杂度为$\mathcal{O}(n^3)$或$\mathcal{O}(n^4)$,这使得它们对于大型图来说计算上不可行。在本文中,我们从原始DRESS方程(Castrillo, León, and Gómez, 2018)出发——一个无参数的连续动力系统——并证明它能够区分棱柱图与$K_{3,3}$,而1-WL无法区分这两者。然后,我们将其推广为Motif-DRESS,用任意结构模式替换三角形邻域,并在满足三个充分条件下收敛到一个唯一的固定点,进一步推广为Generalized-DRESS,这是一个抽象模板,参数化选择邻域操作符、聚合函数和范数。最后,我们引入了$Δ$-DRESS,它在每个节点删除子图$G \setminus \{v\}$上运行DRESS,将该框架与凯利-乌拉姆重建猜想联系起来。Motif-DRESS和$Δ$-DRESS在实验上能够区分3-WL无法区分的强正则图(SRGs),如象棋棋盘图和谢尔罕德图。我们的结果确立了DRESS家族作为一种高度可扩展的框架,能够在已知基准图上实现实用性,超越1-WL和3-WL,而无需$\mathcal{O}(n^4)$的计算成本。
Summary / 总结
The paper introduces DRESS, a continuous framework for graph structural analysis that addresses the computational challenges of higher-order Weisfeiler-Lehman (WL) methods. DRESS starts from the Original-DRESS equation and generalizes it to Motif-DRESS and Generalized-DRESS, which can handle arbitrary structural motifs and are parameterized by neighborhood operators, aggregation functions, and norms. The framework, particularly Motif-DRESS and Δ-DRESS, distinguishes graphs like SRGs that 3-WL cannot, and empirically outperforms both 1-WL and 3-WL on benchmark graphs without the high computational cost of tensor-based operations.
本文提出了DRESS,一种连续的图结构分析框架,解决了更高阶Weisfeiler-Lehman (WL)方法的计算限制问题。从Original-DRESS方程出发,作者将其推广到Motif-DRESS和Generalized-DRESS,并进一步推广到$Δ$-DRESS。这些方法能够区分3-WL无法区分的图,如棱柱图与$K_{3,3}$,以及强正则图如罗克和希克汉德图。实验结果表明,DRESS在基准图上超越了1-WL和3-WL,且无需高阶张量操作的高计算成本。
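The claim that 1-WL cannot separate the prism graph from $K_{3,3}$ can be checked directly. The sketch below runs standard 1-WL colour refinement on both graphs and compares the colour histograms. The helper names are illustrative assumptions. The triangle count is included only as one simple invariant that does separate the pair; it is not the DRESS equation, which the abstract does not spell out.

```python
from collections import Counter


def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def wl_colour_histogram(adj, rounds=3):
    """1-WL colour refinement; colours are nested tuples, so they are
    directly comparable across graphs without a shared relabelling step."""
    colours = {v: 0 for v in adj}  # uniform initial colour
    for _ in range(rounds):
        # New colour = (own colour, sorted multiset of neighbour colours).
        colours = {
            v: (colours[v], tuple(sorted(colours[u] for u in adj[v])))
            for v in adj
        }
    return Counter(colours.values())


def triangle_count(adj):
    """Count each triangle once by enforcing a < b < c."""
    return sum(
        1
        for a in adj
        for b in adj[a]
        for c in adj[b]
        if a < b < c and c in adj[a]
    )


# Prism graph: triangles 0-1-2 and 3-4-5 joined by a perfect matching.
prism = build_adjacency(
    [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (0, 3), (1, 4), (2, 5)]
)
# Complete bipartite graph K_{3,3} with parts {0, 1, 2} and {3, 4, 5}.
k33 = build_adjacency([(i, j) for i in range(3) for j in range(3, 6)])

same_under_wl = wl_colour_histogram(prism) == wl_colour_histogram(k33)
print(same_under_wl, triangle_count(prism), triangle_count(k33))
```

Both graphs are 3-regular on six vertices, so every vertex keeps the same colour in every round and the histograms coincide, while the triangle counts differ.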
Phase Transitions for Feature Learning in Neural Networks
Authors: Andrea Montanari, Zihao Wang
First: 2026-02-01T20:47:36+00:00 · Latest: 2026-02-26T18:06:09+00:00
Comments: 75 pages; 17 pdf figures; v2 is a minor revision of v1
Abstract
According to a popular viewpoint, neural networks learn from data by first identifying low-dimensional representations, and subsequently fitting the best model in this space. Recent works provide a formalization of this phenomenon when learning multi-index models. In this setting, we are given $n$ i.i.d. pairs $({\boldsymbol x}_i,y_i)$, where the covariate vectors ${\boldsymbol x}_i\in\mathbb{R}^d$ are isotropic, and responses $y_i$ only depend on ${\boldsymbol x}_i$ through a $k$-dimensional projection ${\boldsymbol Θ}_*^{\sf T}{\boldsymbol x}_i$. Feature learning amounts to learning the latent space spanned by ${\boldsymbol Θ}_*$.
In this context, we study the gradient descent dynamics of two-layer neural networks under the proportional asymptotics $n,d\to\infty$, $n/d\toδ$, while the dimension of the latent space $k$ and the number of hidden neurons $m$ are kept fixed. Earlier work establishes that feature learning via polynomial-time algorithms is possible if $δ> δ_{\text{alg}}$, for $δ_{\text{alg}}$ a threshold depending on the data distribution, and is impossible (within a certain class of algorithms) below $δ_{\text{alg}}$. Here we derive an analogous threshold $δ_{\text{NN}}$ for two-layer networks. Our characterization of $δ_{\text{NN}}$ opens the way to study the dependence of learning dynamics on the network architecture and training algorithm.
The threshold $δ_{\text{NN}}$ is determined by the following scenario. Training first visits points for which the gradient of the empirical risk is large and learns the directions spanned by these gradients. Then the gradient becomes smaller and the dynamics becomes dominated by negative directions of the Hessian. The threshold $δ_{\text{NN}}$ corresponds to a phase transition in the spectrum of the Hessian in this second phase.
Summary / 总结
This paper investigates the feature learning process in neural networks using gradient descent dynamics under proportional asymptotics. The study focuses on a two-layer neural network with fixed latent space and hidden neuron dimensions, analyzing the transition from learning directions with large gradients to being dominated by negative Hessian directions. The key finding is the derivation of a threshold $δ_{\text{NN}}$ that determines when feature learning via gradient descent is possible, analogous to a threshold $δ_{\text{alg}}$ for polynomial-time algorithms.
该研究探讨了在比例无穷大情形下两层神经网络中的特征学习相变现象,分析了梯度下降动力学,并推导出一个特征学习的阈值$δ_{\text{NN}}$,该阈值与多项式时间算法的成功阈值$δ_{\text{alg}}$相对应。该阈值通过学习过程中海森矩阵谱的变化来表征。
PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions
Authors: Amith Ananthram, Elias Stengel-Eskin, Lorena A. Bradford, Julia Demarest, Adam Purvis, Keith Krut, Robert Stein, Rina Elster Pantalony, Mohit Bansal, Kathleen McKeown
Venue: ICLR 2026
First: 2025-10-21T20:30:20+00:00 · Latest: 2026-02-26T18:05:42+00:00
Comments: Accepted at ICLR 2026. 26 pages, 9 figures. Metric/benchmark available at https://github.com/amith-ananthram/posh
Abstract
While vision-language models (VLMs) have advanced into detailed image description, evaluation remains a challenge. Standard metrics (e.g. CIDEr, SPICE) were designed for short texts and tuned to recognize errors that are now uncommon, such as object misidentification. In contrast, long texts require sensitivity to attribute and relation attachments and scores that localize errors to particular text spans. In this work, we introduce PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors (e.g. mistakes in compositional understanding). PoSh is replicable, interpretable and a better proxy for human raters than existing metrics (including GPT4o-as-a-Judge). To validate PoSh, we introduce a challenging new dataset, DOCENT. This novel benchmark contains artwork, paired with expert-written references, and model-generated descriptions, augmented with granular and coarse judgments of their quality from art history students. Thus, DOCENT enables evaluating both detailed image description metrics and detailed image description itself in a challenging new domain. We show that PoSh achieves stronger correlations (+0.05 Spearman $ρ$) with the human judgments in DOCENT than the best open-weight alternatives, is robust to image type (using CapArena, an existing dataset of web imagery) and is a capable reward function, outperforming standard supervised fine-tuning. Then, using PoSh, we characterize the performance of open and closed models in describing the paintings, sketches and statues in DOCENT and find that foundation models struggle to achieve full, error-free coverage of images with rich scene dynamics, establishing a demanding new task to gauge VLM progress. Through both PoSh and DOCENT, we hope to enable advances in important areas such as assistive text generation.
中文标题/摘要
标题:PoSh:使用场景图引导LLM作为裁判进行详细图像描述
尽管视觉-语言模型(VLMs)在详细图像描述方面取得了进展,但评估仍是一个挑战。标准指标(如CIDEr、SPICE)是为短文本设计的,并且调整为识别现在已不常见的错误,例如对象识别错误。相比之下,长文本需要对属性和关系的敏感度以及能够定位特定文本段落错误的评分。在本研究中,我们引入了PoSh,这是一种用于详细图像描述的指标,它使用场景图作为结构化的评分标准来引导LLM作为裁判,产生基于细粒度错误(如组合理解错误)的综合评分。PoSh是可复制的、可解释的,并且比现有指标(包括GPT4o作为裁判)更接近人类评分者。为了验证PoSh,我们引入了一个新的具有挑战性的数据集DOCENT。这个新的基准数据集包含艺术品,并配以专家撰写的参考文本和模型生成的描述,还增加了艺术史学生对其质量的精细和粗略判断。因此,DOCENT不仅能够评估详细图像描述指标,还能够在一个新的具有挑战性的领域中评估详细图像描述本身。我们展示了PoSh与DOCENT中的人类判断相比,具有更强的相关性(Spearman ρ +0.05),并且对图像类型具有鲁棒性(使用CapArena,一个现有的网络图像数据集),并且是一个有效的奖励函数,优于标准的监督微调。然后,使用PoSh,我们表征了开放和封闭模型在描述DOCENT中的绘画、素描和雕像的表现,并发现基础模型难以实现对具有丰富场景动态的图像的全面、无误的覆盖,从而确立了一个新的具有挑战性的任务来衡量VLM的进步。通过PoSh和DOCENT,我们希望促进在诸如辅助文本生成等重要领域的发展。
Towards Long-Form Spatio-Temporal Video Grounding
Authors: Xin Gu, Bing Fan, Jiali Yao, Zhipeng Zhang, Yan Huang, Cheng Han, Heng Fan, Libo Zhang
First: 2026-02-26T18:04:09+00:00 · Latest: 2026-02-26T18:04:09+00:00
Abstract
In real scenarios, videos can span several minutes or even hours. However, existing research on spatio-temporal video grounding (STVG), given a textual query, mainly focuses on localizing targets in short videos of tens of seconds, typically less than one minute, which limits real-world applications. In this paper, we explore Long-Form STVG (LF-STVG), which aims to locate targets in long-term videos. Compared with short videos, long-term videos contain much longer temporal spans and more irrelevant information, making it difficult for existing STVG methods that process all frames at once. To address this challenge, we propose an AutoRegressive Transformer architecture for LF-STVG, termed ART-STVG. Unlike conventional STVG methods that require the entire video sequence to make predictions at once, ART-STVG treats the video as streaming input and processes frames sequentially, enabling efficient handling of long videos. To model spatio-temporal context, we design spatial and temporal memory banks and apply them to the decoders. Since memories from different moments are not always relevant to the current frame, we introduce simple yet effective memory selection strategies to provide more relevant information to the decoders, significantly improving performance. Furthermore, instead of parallel spatial and temporal localization, we propose a cascaded spatio-temporal design that connects the spatial decoder to the temporal decoder, allowing fine-grained spatial cues to assist complex temporal localization in long videos. Experiments on newly extended LF-STVG datasets show that ART-STVG significantly outperforms state-of-the-art methods, while achieving competitive performance on conventional short-form STVG.
中文标题/摘要
标题:长时序时空视频定位
在实际场景中,视频可以持续几分钟甚至几小时。然而,现有的时空视频定位(STVG)研究,给定一个文本查询,主要集中在定位短视频(通常少于一分钟)中的目标,这限制了其在实际中的应用。本文探讨了长时序时空视频定位(LF-STVG),旨在定位长视频中的目标。与短视频相比,长视频包含更长的时间跨度和更多的无关信息,使得现有的处理所有帧的方法难以应对。为了解决这一挑战,我们提出了一种自回归变换器架构,称为ART-STVG。与传统的STVG方法需要一次性处理整个视频序列以进行预测不同,ART-STVG将视频视为流式输入,并按顺序处理帧,从而能够高效处理长视频。为了建模时空上下文,我们设计了空间和时间记忆库,并将其应用于解码器。由于不同时刻的记忆并不总是与当前帧相关,我们引入了简单而有效的记忆选择策略,为解码器提供更相关的信息,显著提高了性能。此外,我们提出了一种级联时空设计,将空间解码器连接到时间解码器,而不是并行的空间和时间定位,允许细粒度的空间线索在长视频中辅助复杂的时序定位。在新扩展的LF-STVG数据集上的实验表明,ART-STVG显著优于现有方法,同时在传统的短时序STVG上实现了竞争力的性能。
Summary / 总结
This paper addresses the challenge of spatio-temporal video grounding (STVG) in long-form videos, which are typically ignored by existing methods focusing on short videos. The authors propose ART-STVG, an AutoRegressive Transformer architecture that processes videos frame by frame, making it suitable for long-term videos. By incorporating spatial and temporal memory banks and introducing memory selection strategies, ART-STVG effectively handles the longer temporal spans and irrelevant information present in long videos. The cascaded spatio-temporal design further enhances performance by integrating spatial cues into temporal localization. Experimental results demonstrate that ART-STVG outperforms existing methods on long-form datasets while maintaining competitive performance on short-form videos.
本文针对现有的时空视频定位(STVG)方法主要关注短视频,而忽视了长视频的问题。作者提出了一种名为ART-STVG的自回归变压器架构,该架构逐帧处理视频,使其适用于长视频。通过引入空间和时间记忆库以及记忆选择策略,ART-STVG有效地处理了长视频中的更长时间跨度和无关信息。级联时空设计进一步提升了性能。实验表明,ART-STVG在长视频数据集上的表现优于现有方法,同时在短视频上的性能也具有竞争力。
PGVMS: A Prompt-Guided Unified Framework for Virtual Multiplex IHC Staining with Pathological Semantic Learning
Authors: Fuqiang Chen, Ranran Zhang, Wanming Hu, Deboch Eyob Abera, Yue Peng, Boyun Zheng, Yiwen Sun, Jing Cai, Wenjian Qin
Venue: IEEE Transactions on Medical Imaging, 2026
First: 2026-02-26T18:03:24+00:00 · Latest: 2026-02-26T18:03:24+00:00
Comments: Accepted by TMI
Abstract
Immunohistochemical (IHC) staining enables precise molecular profiling of protein expression, with over 200 clinically available antibody-based tests in modern pathology. However, comprehensive IHC analysis is frequently limited by insufficient tissue quantities in small biopsies. Therefore, virtual multiplex staining emerges as an innovative solution to digitally transform H&E images into multiple IHC representations, yet current methods still face three critical challenges: (1) inadequate semantic guidance for multi-staining, (2) inconsistent distribution of immunochemistry staining, and (3) spatial misalignment across different stain modalities. To overcome these limitations, we present a prompt-guided framework for virtual multiplex IHC staining using only uniplex training data (PGVMS). Our framework introduces three key innovations corresponding to each challenge: First, an adaptive prompt guidance mechanism employing a pathological visual language model dynamically adjusts staining prompts to resolve semantic guidance limitations (Challenge 1). Second, our protein-aware learning strategy (PALS) maintains precise protein expression patterns by direct quantification and constraint of protein distributions (Challenge 2). Third, the prototype-consistent learning strategy (PCLS) establishes cross-image semantic interaction to correct spatial misalignments (Challenge 3).
中文标题/摘要
标题:PGVMS:一种基于提示的统一框架,用于病理语义学习的虚拟多路复用IHC染色
免疫组织化学(IHC)染色能够精确地对蛋白质表达进行分子分析,在现代病理学中已有超过200种基于抗体的临床测试。然而,全面的IHC分析经常受限于小活检组织量不足。因此,虚拟多路复用染色作为一种创新解决方案,能够将HE图像数字化转换为多种IHC表示,但当前方法仍面临三个关键挑战:(1)多染色的不足语义指导,(2)免疫化学染色分布不一致,(3)不同染色模式之间的空间错位。为克服这些限制,我们提出了一种仅使用单路训练数据的基于提示的虚拟多路复用IHC染色框架(PGVMS)。我们的框架引入了三个关键创新,分别对应每个挑战:首先,一种自适应提示引导机制,利用病理视觉语言模型动态调整染色提示,以解决语义指导不足的问题(挑战1)。其次,我们的蛋白质感知学习策略(PALS)通过直接量化和约束蛋白质分布来保持精确的蛋白质表达模式(挑战2)。第三,原型一致学习策略(PCLS)建立了跨图像语义交互,以纠正空间错位(挑战3)。
Summary / 总结
The research aims to address the limitations of virtual multiplex IHC staining by proposing PGVMS, a prompt-guided unified framework. It introduces an adaptive prompt guidance mechanism, a protein-aware learning strategy, and a prototype-consistent learning strategy to tackle semantic guidance, inconsistent staining distribution, and spatial misalignment issues, respectively. The framework uses only uniplex training data and achieves accurate virtual multiplex IHC staining.
PGVMS 是一种基于提示的虚拟多路复用 IHC 染色框架,解决了三个关键问题:缺乏语义指导、染色分布不一致和空间错位。它通过自适应提示引导机制、蛋白质感知学习策略和原型一致学习策略来克服这些问题。该框架能够将 H&E 图像转换为多个 IHC 表现形式,提高了准确性和一致性。
LineGraph2Road: Structural Graph Reasoning on Line Graphs for Road Network Extraction
Authors: Zhengyang Wei, Renzhi Jing, Yiyi He, Jenny Suckale
First: 2026-02-26T18:02:44+00:00 · Latest: 2026-02-26T18:02:44+00:00
Abstract
The accurate and automatic extraction of roads from satellite imagery is critical for applications in navigation and urban planning, significantly reducing the need for manual annotation. Many existing methods decompose this task into keypoint extraction and connectedness prediction, but often struggle to capture long-range dependencies and complex topologies. Here, we propose LineGraph2Road, a framework that improves connectedness prediction by formulating it as binary classification over edges in a constructed global but sparse Euclidean graph, where nodes are keypoints extracted from segmentation masks and edges connect node pairs within a predefined distance threshold, representing potential road segments. To better learn structural link representation, we transform the original graph into its corresponding line graph and apply a Graph Transformer on it for connectedness prediction. This formulation overcomes the limitations of endpoint-embedding fusion on set-isomorphic links, enabling rich link representations and effective relational reasoning over the global structure. Additionally, we introduce an overpass/underpass head to resolve multi-level crossings and a coupled NMS strategy to preserve critical connections. We evaluate LineGraph2Road on three benchmarks: City-scale, SpaceNet, and Global-scale, and show that it achieves state-of-the-art results on two key metrics, TOPO-F1 and APLS. It also captures fine visual details critical for real-world deployment. We will make our code publicly available.
中文标题/摘要
标题:LineGraph2Road:基于线图的结构图推理在道路网络提取中的应用
从卫星图像中准确且自动地提取道路对于导航和城市规划应用至关重要,大大减少了手动标注的需求。许多现有方法将此任务分解为关键点提取和连通性预测,但往往难以捕捉长距离依赖性和复杂拓扑结构。在此,我们提出了一种名为LineGraph2Road的框架,通过将连通性预测形式化为在构造的全局但稀疏欧几里得图上对边进行二元分类来改进连通性预测,其中节点是从分割掩码中提取的关键点,边连接预定义距离阈值内的节点对,表示潜在的道路段。为了更好地学习结构链接表示,我们将原始图转换为其对应的线图,并在其上应用图变换器进行连通性预测。这种形式克服了端点嵌入融合在集同构链接上的局限性,使链接表示更加丰富,并在全局结构上实现有效的关系推理。此外,我们引入了一个立交桥/地下通道头来解决多级交叉问题,并采用耦合非最大抑制策略来保留关键连接。我们在三个基准上评估了LineGraph2Road:城市规模、SpaceNet和全球规模,并展示了它在两个关键指标TOPO-F1和APLS上达到了最先进的结果。它还捕捉了对于实际部署至关重要的细视觉细节。我们将公开我们的代码。
Summary / 总结
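LineGraph2Road improves road network extraction from satellite imagery by casting connectedness prediction as binary classification over the edges of a global but sparse Euclidean graph, whose nodes are keypoints extracted from segmentation masks. The graph is converted to its line graph and processed with a Graph Transformer, which yields richer link representations than endpoint-embedding fusion. An overpass/underpass head and a coupled NMS strategy resolve multi-level crossings and preserve critical connections. On the City-scale, SpaceNet, and Global-scale benchmarks, the method achieves state-of-the-art results on TOPO-F1 and APLS.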
LineGraph2Road旨在通过解决现有方法在捕捉长距离依赖性和复杂拓扑结构方面的局限性,来改进从卫星图像中提取道路网络的能力。它使用全局但稀疏的欧几里得图来表示关键点和潜在的道路段,然后将此图转换为线图并使用图变换器进行连接性预测。这种方法增强了链接表示并实现了全局结构的有效关系推理。实验结果表明,LineGraph2Road在City-scale、SpaceNet和Global-scale基准上的TOPO-F1和APLS指标上优于现有方法,并且能够捕捉到关键的视觉细节。还引入了过街/立交桥头和耦合NMS策略来处理多级交叉和保留关键连接。
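The graph construction described in this entry can be sketched in a few lines. The code below builds candidate edges between keypoints within a distance threshold and then forms the line graph, in which each candidate road segment becomes a node. The keypoint coordinates, the threshold, and the function names are illustrative assumptions, and the Graph Transformer that classifies the line-graph nodes is omitted.

```python
import math
from itertools import combinations


def candidate_edges(points, max_dist):
    """Connect every pair of keypoints whose Euclidean distance is at most max_dist."""
    return [
        (i, j)
        for i, j in combinations(range(len(points)), 2)
        if math.dist(points[i], points[j]) <= max_dist
    ]


def line_graph(edges):
    """Each original edge becomes a node; two such nodes are adjacent
    when the corresponding edges share an endpoint."""
    adjacency = {e: set() for e in edges}
    for e, f in combinations(edges, 2):
        if set(e) & set(f):
            adjacency[e].add(f)
            adjacency[f].add(e)
    return adjacency


# Hypothetical keypoints: a short road with one side branch.
keypoints = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (1.0, 1.0)]
edges = candidate_edges(keypoints, max_dist=1.0)
lg = line_graph(edges)
print(edges)
print(lg)
```

Working on the line graph means the quantity being classified (keep or drop a segment) is a node feature rather than a function of two endpoint embeddings, which is the motivation the abstract gives for the transformation.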
SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables
Authors: Sungho Park, Jueun Kim, Wook-Shin Han
Venue: ICLR 2026
First: 2026-02-26T17:59:51+00:00 · Latest: 2026-02-26T17:59:51+00:00
Comments: 10 pages, 5 figures. Published as a conference paper at ICLR 2026. Project page: https://sparta-projectpage.github.io/
Abstract
Real-world Table-Text question answering (QA) tasks require models that can reason across long text and source tables, traversing multiple hops and executing complex operations such as aggregation. Yet existing benchmarks are small, manually curated - and therefore error-prone - and contain shallow questions that seldom demand more than two hops or invoke aggregations, grouping, or other advanced analytical operations expressible in natural-language queries. We present SPARTA, an end-to-end construction framework that automatically generates large-scale Table-Text QA benchmarks with lightweight human validation, requiring only one quarter of the annotation time of HybridQA. The framework first constructs a reference fact database by enriching each source table with grounding tables whose tuples are atomic facts automatically extracted from the accompanying unstructured passages, then synthesizes nested queries whose number of nested predicates matches the desired hop count. To ensure that every SQL statement is executable and that its verbalization yields a fluent, human-sounding question, we propose two novel techniques: provenance-based refinement, which rewrites any syntactically valid query that returns a non-empty result, and realistic-structure enforcement, which confines generation to post-order traversals of the query graph. The resulting pipeline produces thousands of high-fidelity question-answer pairs covering aggregations, grouping, and deep multi-hop reasoning across text and tables. On SPARTA, state-of-the-art models that reach over 70 F1 on HybridQA or over 50 F1 on OTT-QA drop by more than 30 F1 points, exposing fundamental weaknesses in current cross-modal reasoning. Our benchmark, construction code, and baseline models are available at https://github.com/pshlego/SPARTA/tree/main.
中文标题/摘要
标题:SPARTA:面向文本和表格的树状多跳问答的可扩展和原则性基准测试
现实世界中的表格-文本问答任务需要能够跨越长文本和源表格进行推理的模型,遍历多个跳转并执行复杂的操作,如聚合。然而,现有的基准数据集规模较小,由人工精心整理,因此容易出错,并且包含浅显的问题,很少需要超过两个跳转或调用聚合、分组或其他高级分析操作。我们提出了SPARTA,这是一种端到端的构建框架,可以自动生成大规模的表格-文本问答基准数据集,只需轻量级的人工验证,所需注释时间仅为HybridQA的四分之一。该框架首先通过丰富每个源表格,添加与附带的无结构段落自动提取的元组对齐的表格,构建参考事实数据库,然后合成嵌套查询,其嵌套谓词的数量与所需的跳转次数相匹配。为了确保每个SQL语句可执行,并且其口头表达能产生流畅的人类语言问题,我们提出了两种新颖的技术:来源导向的细化,它可以重写任何返回非空结果的语法有效的查询,以及现实结构的强制执行,它限制生成在查询图的后序遍历中。由此产生的流水线生成了数千个高质量的问题-答案对,涵盖了聚合、分组和跨越文本和表格的深层多跳推理。在SPARTA上,达到HybridQA超过70 F1或OTT-QA超过50 F1的最新模型下降超过30 F1点,揭示了当前跨模态推理中的根本弱点。我们的基准测试、构建代码和基线模型可在https://github.com/pshlego/SPARTA/tree/main/获得。
Summary / 总结
SPARTA is a scalable and principled benchmark for tree-structured multi-hop QA over text and tables, addressing the limitations of existing benchmarks by automatically generating large-scale QA pairs with lightweight human validation. The method involves enriching source tables with atomic facts from unstructured passages and synthesizing nested queries to match the desired hop count. Key findings show that state-of-the-art models perform significantly worse on SPARTA compared to existing benchmarks, highlighting their weaknesses in cross-modal reasoning and deep multi-hop tasks.
SPARTA 是一个针对文本和表格的树状多跳 QA 的可扩展且原理性的基准,通过自动生成大规模的 QA 对并辅以轻量级的人工验证来解决现有基准的局限性。它通过将原子事实从非结构化段落中丰富到表格中来构建参考事实数据库,并生成嵌套查询以匹配所需的跳数。两种新颖的技术,来源基础的改进和现实结构的强制执行,确保生成的 SQL 语句的可执行性和流畅性。实验表明,最先进的模型在 SPARTA 上表现不佳,表明当前跨模态推理能力存在根本性缺陷。
ODEBrain: Continuous-Time EEG Graph for Modeling Dynamic Brain Networks
Authors: Haohui Jia, Zheng Chen, Lingwei Zhu, Rikuto Kotoge, Jathurshan Pradeepkumar, Yasuko Matsubara, Jimeng Sun, Yasushi Sakurai, Takashi Matsubara
First: 2026-02-26T17:59:10+00:00 · Latest: 2026-02-26T17:59:10+00:00
Abstract
Modeling neural population dynamics is crucial for foundational neuroscientific research and various clinical applications. Conventional latent variable methods typically model continuous brain dynamics through discretizing time with recurrent architecture, which necessarily results in compounded cumulative prediction errors and failure of capturing instantaneous, nonlinear characteristics of EEGs. We propose ODEBRAIN, a Neural ODE latent dynamic forecasting framework to overcome these challenges by integrating spatio-temporal-frequency features into spectral graph nodes, followed by a Neural ODE modeling the continuous latent dynamics. Our design ensures that latent representations can capture stochastic variations of complex brain states at any given time point. Extensive experiments verify that ODEBRAIN can improve significantly over existing methods in forecasting EEG dynamics with enhanced robustness and generalization capabilities.
中文标题/摘要
标题:ODEBrain: 连续时间EEG图用于建模动态脑网络
建模神经群体动力学对于基础神经科学研究和各种临床应用至关重要。传统潜在变量方法通常通过使用循环架构离散化时间来建模连续的大脑动态,这不可避免地导致累积预测误差并无法捕捉EEGs的瞬时和非线性特征。我们提出了一种ODEBRAIN神经ODE潜在动态预测框架,通过将时空频特征整合到频谱图节点中,然后使用神经ODE建模连续的潜在动态。我们的设计确保潜在表示能够捕捉任何给定时间点复杂脑状态的随机变化。大量实验验证了与现有方法相比,ODEBRAIN在增强EEGs动态预测的鲁棒性和泛化能力方面具有显著优势。
Summary / 总结
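ODEBrain is a Neural ODE latent dynamic forecasting framework for EEG. It integrates spatio-temporal-frequency features into spectral graph nodes and models the continuous latent dynamics with a Neural ODE, avoiding the compounding prediction errors of recurrent architectures that discretize time. Experiments show that it improves significantly over existing methods in forecasting EEG dynamics, with enhanced robustness and generalization.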
研究旨在改进神经群体动态的建模,这对于神经科学和临床应用至关重要。作者提出了一种ODEBrain框架,该框架将时空频特征整合到谱图节点中,并通过神经ODE模型连续的潜在动态。实验表明,ODEBrain在预测EEG动态方面优于现有方法,具有更好的鲁棒性和泛化能力。
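As a rough illustration of the continuous-time idea in the ODEBrain entry above, the sketch below integrates a latent state under a vector field and reads it out at an arbitrary time point. It is not the authors' model: `toy_field` is a hypothetical stand-in for a learned network, and the fixed-step Runge-Kutta solver is an assumption made for the example.

```python
import math
from typing import Callable, List

Vector = List[float]
VectorField = Callable[[float, Vector], Vector]


def rk4_step(f: VectorField, t: float, z: Vector, h: float) -> Vector:
    """Advance the latent state z by one classical Runge-Kutta step of size h."""
    k1 = f(t, z)
    k2 = f(t + h / 2, [zi + h / 2 * ki for zi, ki in zip(z, k1)])
    k3 = f(t + h / 2, [zi + h / 2 * ki for zi, ki in zip(z, k2)])
    k4 = f(t + h, [zi + h * ki for zi, ki in zip(z, k3)])
    return [
        zi + h / 6 * (a + 2 * b + 2 * c + d)
        for zi, a, b, c, d in zip(z, k1, k2, k3, k4)
    ]


def latent_state_at(
    f: VectorField, z0: Vector, t0: float, t1: float, steps: int = 100
) -> Vector:
    """Integrate dz/dt = f(t, z) from t0 to t1 and return z(t1)."""
    h = (t1 - t0) / steps
    t, z = t0, list(z0)
    for _ in range(steps):
        z = rk4_step(f, t, z, h)
        t += h
    return z


def toy_field(t: float, z: Vector) -> Vector:
    # Hypothetical stand-in for a learned network: plain exponential decay.
    return [-zi for zi in z]


z_final = latent_state_at(toy_field, [1.0, 2.0], 0.0, 1.0)
```

Because the state is defined by an ODE rather than a fixed recurrence, `t1` can be any real-valued time, which is the property the abstract contrasts with discretized recurrent models.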
BioBlue: Systematic runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format
Authors: Roland Pihlakas, Sruthi Susan Kuriakose
First: 2025-09-02T15:13:14+00:00 · Latest: 2026-02-26T17:56:58+00:00
Comments: 22 pages, 8 tables
Abstract
Many AI alignment discussions of "runaway optimisation" focus on RL agents: unbounded utility maximisers that over-optimise a proxy objective (e.g., "paperclip maximiser", specification gaming) at the expense of everything else. LLM-based systems are often assumed to be safer because they function as next-token predictors rather than persistent optimisers. In this work, we empirically test this assumption by placing LLMs in simple, long-horizon control-style environments that require maintaining state of or balancing objectives over time: sustainability of a renewable resource, single- and multi-objective homeostasis, and balancing unbounded objectives with diminishing returns. We find that, although models frequently behave appropriately for many steps and clearly understand the stated objectives, they often lose context in structured ways and drift into runaway behaviours: ignoring homeostatic targets, collapsing from multi-objective trade-offs into single-objective maximisation - thus failing to respect concave utility structures. These failures emerge reliably after initial periods of competent behaviour and exhibit characteristic patterns (including self-imitative oscillations, unbounded maximisation, and reverting to single-objective optimisation). The problem is not that the LLMs just lose context or become incoherent - the failures systematically resemble runaway optimisers. Our results suggest that long-horizon, multi-objective misalignment is a genuine and under-evaluated failure mode in LLM agents, even in extremely simple settings with transparent and explicitly multi-objective feedback. Although LLMs appear multi-objective and bounded on the surface, their behaviour under sustained interaction, particularly involving multiple objectives, resembles brittle, poorly aligned optimisers whose effective objective gradually shifts toward unbounded and single-metric maximisation.
中文标题/摘要
标题:BioBlue:生物和经济对齐的LLM在简化观察格式下的系统失控优化模式
许多关于“失控优化”的AI对齐讨论集中在RL代理上:无法限制的效用最大化者,它们会过度优化代理目标(例如,“纸夹最大化者”,规范游戏)而牺牲其他一切。基于LLM的系统通常被认为更安全,因为它们作为下一个标记预测器而非持续优化器工作。在本研究中,我们通过将LLM置于需要维持状态或平衡时间目标的简单、长期控制环境来实证测试这一假设:可再生资源的可持续性、单目标和多目标稳态以及在边际效益递减的情况下平衡无界目标。我们发现,尽管模型在许多步骤中表现出适当的行为并且显然理解了陈述的目标,但它们经常以结构化的方式失去上下文并进入失控行为:忽略稳态目标,从多目标权衡中崩溃为单目标最大化——因此未能尊重凹效用结构。这些失败在初始表现良好的时期后可靠地出现,并表现出特征性模式(包括自我模仿的振荡、无界最大化和恢复为单目标优化)。问题不在于LLM只是失去上下文或变得不连贯——失败系统地类似于失控优化器。我们的结果表明,长期、多目标不对齐是LLM代理中一个真实且被低估的失败模式,即使在极其简单的透明且明确多目标反馈设置中也是如此。尽管表面上LLM似乎多目标且有边界,但在持续交互,特别是涉及多个目标的情况下,其行为类似于脆弱、不良对齐的优化器,其有效目标逐渐转向无界和单一指标最大化。
Summary / 总结
This study investigates the risk of runaway optimization in large language models (LLMs) by placing them in long-term control environments. Despite initial competent behavior and understanding of objectives, the models often lose context and exhibit runaway behaviors, such as ignoring homeostatic targets or shifting to single-objective maximization. These behaviors are systematic and resemble those of unbounded utility maximizers, indicating that long-horizon, multi-objective misalignment is a significant and under-evaluated failure mode in LLMs, even in simple settings with clear multi-objective feedback.
本研究通过将大型语言模型置于长期控制环境中,探讨其失控优化的风险。尽管模型初期表现良好且理解目标,但它们往往会失去上下文并表现出失控行为,如忽视稳态目标或转向单一目标最大化。这些行为是系统性的,类似于无边界效用最大化者的行为,表明长期、多目标不一致是大型语言模型中的一个重大且被低估的失败模式,即使在具有明确多目标反馈的简单设置中也是如此。
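The BioBlue entry above refers to concave utility structures and homeostatic targets. The following sketch shows, under assumed toy reward shapes that are not taken from the benchmark itself, why single-objective maximisation scores worse than a balanced allocation when returns diminish, and why overshooting a homeostatic target is penalised.

```python
import math
from typing import Sequence


def concave_utility(amounts: Sequence[float]) -> float:
    """Sum of square roots: every objective has diminishing returns (assumed shape)."""
    return sum(math.sqrt(a) for a in amounts)


def homeostatic_reward(value: float, target: float) -> float:
    """Peaks at the target; undershoot and overshoot are penalised alike (assumed shape)."""
    return -((value - target) ** 2)


# Same total effort of 10 units, distributed in two different ways.
balanced = concave_utility([5.0, 5.0])        # spread over both objectives
single_minded = concave_utility([10.0, 0.0])  # everything into one objective
```

With these shapes, `balanced` exceeds `single_minded`, so an agent that collapses into maximising one metric is leaving utility on the table, which is the kind of failure the abstract describes.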
Physics Informed Viscous Value Representations
Authors: Hrishikesh Viswanath, Juanwu Lu, S. Talha Bukhari, Damon Conover, Ziran Wang, Aniket Bera
First: 2026-02-26T17:53:46+00:00 · Latest: 2026-02-26T17:53:46+00:00
Abstract
Offline goal-conditioned reinforcement learning (GCRL) learns goal-conditioned policies from static pre-collected datasets. However, accurate value estimation remains a challenge due to the limited coverage of the state-action space. Recent physics-informed approaches have sought to address this by imposing physical and geometric constraints on the value function through regularization defined over first-order partial differential equations (PDEs), such as the Eikonal equation. However, these formulations can often be ill-posed in complex, high-dimensional environments. In this work, we propose a physics-informed regularization derived from the viscosity solution of the Hamilton-Jacobi-Bellman (HJB) equation. By providing a physics-based inductive bias, our approach grounds the learning process in optimal control theory, explicitly regularizing and bounding updates during value iterations. Furthermore, we leverage the Feynman-Kac theorem to recast the PDE solution as an expectation, enabling a tractable Monte Carlo estimation of the objective that avoids numerical instability in higher-order gradients. Experiments demonstrate that our method improves geometric consistency, making it broadly applicable to navigation and high-dimensional, complex manipulation tasks. Open-source codes are available at https://github.com/HrishikeshVish/phys-fk-value-GCRL.
中文标题/摘要
标题:物理知情的粘性值表示
离线目标条件强化学习(GCRL)从静态预先收集的数据集中学目标条件策略。然而,由于状态-动作空间覆盖有限,准确的价值估计仍然是一个挑战。最近的物理知情方法通过在偏微分方程(PDE)上定义的正则化来对价值函数施加物理和几何约束,如Eikonal方程,试图解决这一问题。然而,这些形式化在复杂、高维环境中往往不明确。在本文中,我们提出了一种基于哈密尔顿-雅可比-贝尔曼(HJB)方程粘性解的物理知情正则化。通过提供基于物理的归纳偏置,我们的方法将学习过程扎根于最优控制理论,在价值迭代期间显式地正则化和限制更新。此外,我们利用费曼-卡茨定理将PDE解重新表述为期望,使目标的可计算蒙特卡洛估计避免了高阶梯度中的数值不稳定性。实验表明,我们的方法提高了几何一致性,使其广泛适用于导航和高维、复杂操作任务。开源代码可在https://github.com/HrishikeshVish/phys-fk-value-GCRL/获得。
Summary / 总结
This paper addresses the challenge of accurate value estimation in offline goal-conditioned reinforcement learning by proposing a physics-informed regularization derived from the viscosity solution of the Hamilton-Jacobi-Bellman equation. The method leverages the Feynman-Kac theorem to enable tractable Monte Carlo estimation, avoiding numerical instability. Experiments show improved geometric consistency, making the method applicable to navigation and complex manipulation tasks in high-dimensional environments.
本文通过提出基于Hamilton-Jacobi-Bellman方程粘性解的物理正则化方法,解决了离线目标条件强化学习中准确的价值估计问题。该方法利用费曼-卡茨定理实现可计算的蒙特卡洛估计,避免数值不稳定性。实验表明,该方法提高了几何一致性,适用于导航以及高维环境中的复杂操作任务。
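The entry above mentions recasting a PDE solution as an expectation via the Feynman-Kac theorem. The sketch below shows that idea on the simplest case, the one-dimensional heat equation, and not on the HJB-derived objective used in the paper. The function name, sample count and seed are assumptions made for the example.

```python
import math
import random
from typing import Callable


def feynman_kac_heat(
    g: Callable[[float], float],
    x: float,
    t: float,
    nu: float,
    n_samples: int = 100_000,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of u(x, t) for du/dt = nu * d2u/dx2 with u(x, 0) = g(x).

    Uses the representation u(x, t) = E[g(x + sqrt(2 * nu * t) * Z)], Z ~ N(0, 1).
    """
    rng = random.Random(seed)
    sigma = math.sqrt(2.0 * nu * t)
    total = 0.0
    for _ in range(n_samples):
        total += g(x + sigma * rng.gauss(0.0, 1.0))
    return total / n_samples


# For g(y) = y^2 the exact solution is u(x, t) = x^2 + 2 * nu * t.
estimate = feynman_kac_heat(lambda y: y * y, x=1.0, t=1.0, nu=0.5)
exact = 1.0 ** 2 + 2.0 * 0.5 * 1.0
```

The estimator only evaluates `g` at sampled points, so no second derivatives are computed. This is the sense in which an expectation form can sidestep higher-order gradients.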
CXReasonAgent: Evidence-Grounded Diagnostic Reasoning Agent for Chest X-rays
Authors: Hyungyung Lee, Hangyul Yoon, Edward Choi
First: 2026-02-26T17:51:21+00:00 · Latest: 2026-02-26T17:51:21+00:00
Abstract
Chest X-ray plays a central role in thoracic diagnosis, and its interpretation inherently requires multi-step, evidence-grounded reasoning. However, large vision-language models (LVLMs) often generate plausible responses that are not faithfully grounded in diagnostic evidence and provide limited visual evidence for verification, while also requiring costly retraining to support new diagnostic tasks, limiting their reliability and adaptability in clinical settings. To address these limitations, we present CXReasonAgent, a diagnostic agent that integrates a large language model (LLM) with clinically grounded diagnostic tools to perform evidence-grounded diagnostic reasoning using image-derived diagnostic and visual evidence. To evaluate these capabilities, we introduce CXReasonDial, a multi-turn dialogue benchmark with 1,946 dialogues across 12 diagnostic tasks, and show that CXReasonAgent produces faithfully grounded responses, enabling more reliable and verifiable diagnostic reasoning than LVLMs. These findings highlight the importance of integrating clinically grounded diagnostic tools, particularly in safety-critical clinical settings.
中文标题/摘要
标题:CXReasonAgent:基于证据的胸部X光诊断推理代理
胸部X光在胸部诊断中起着核心作用,其解释本质上需要多步、基于证据的推理。然而,大型视觉-语言模型(LVLM)通常生成的响应虽然看似合理,但并不忠实于诊断证据,提供的视觉证据有限,难以验证,同时还需要昂贵的重新训练以支持新的诊断任务,这限制了它们在临床环境中的可靠性和适应性。为了解决这些限制,我们提出了CXReasonAgent,这是一种将大型语言模型(LLM)与临床导向的诊断工具结合的诊断代理,用于使用图像衍生的诊断和视觉证据进行基于证据的诊断推理。为了评估这些能力,我们引入了包含1,946轮对话的多轮对话基准CXReasonDial,涉及12项诊断任务,并展示了CXReasonAgent生成忠实于证据的响应,使其在临床环境中比LVLMs提供更可靠和可验证的诊断推理。这些发现强调了在安全关键的临床环境中整合基于临床证据的诊断工具的重要性。
Summary / 总结
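CXReasonAgent is a diagnostic agent that integrates a large language model with clinically grounded diagnostic tools to perform evidence-grounded reasoning over chest X-rays, using image-derived diagnostic and visual evidence. It is evaluated on CXReasonDial, a multi-turn dialogue benchmark with 1,946 dialogues across 12 diagnostic tasks, where it produces faithfully grounded responses and enables more reliable and verifiable diagnostic reasoning than LVLMs.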
CXReasonAgent 通过将大型语言模型与临床导向的诊断工具集成,旨在进行基于证据的胸部X光诊断推理。它解决了大型视觉语言模型生成的响应与诊断证据不一致的问题,并提供了可验证的视觉证据。CXReasonAgent 在 CXReasonDial 基准测试中的 1,946 个对话中,针对 12 个诊断任务的表现优于大型视觉语言模型,证明了其可靠性和可验证性。
LayerT2V: A Unified Multi-Layer Video Generation Framework
Authors: Guangzhao Li, Kangrui Cen, Baixuan Zhao, Yi Xin, Siqi Luo, Guangtao Zhai, Lei Zhang, Xiaohong Liu
First: 2025-08-06T09:03:16+00:00 · Latest: 2026-02-26T17:37:05+00:00
Comments: Project Page is https://layert2v.github.io/
Abstract
Text-to-video generation has advanced rapidly, but existing methods typically output only the final composited video and lack editable layered representations, limiting their use in professional workflows. We propose LayerT2V, a unified multi-layer video generation framework that produces multiple semantically consistent outputs in a single inference pass: the full video, an independent background layer, and multiple foreground RGB layers with corresponding alpha mattes. Our key insight is that recent video generation backbones use high compression in both time and space, enabling us to serialize multiple layer representations along the temporal dimension and jointly model them on a shared generation trajectory. This turns cross-layer consistency into an intrinsic objective, improving semantic alignment and temporal coherence. To mitigate layer ambiguity and conditional leakage, we augment a shared DiT backbone with LayerAdaLN and layer-aware cross-attention modulation. LayerT2V is trained in three stages: alpha mask VAE adaptation, joint multi-layer learning, and multi-foreground extension. We also introduce VidLayer, the first large-scale dataset for multi-layer video generation. Extensive experiments demonstrate that LayerT2V substantially outperforms prior methods in visual fidelity, temporal consistency, and cross-layer coherence.
中文标题/摘要
标题:LayerT2V:统一多层视频生成框架
文本到视频生成技术取得了快速进展,但现有方法通常只输出最终合成的视频,缺乏可编辑的分层表示,限制了其在专业工作流程中的应用。我们提出了一种名为LayerT2V的统一多层视频生成框架,在单次推理过程中生成多个语义一致的输出:完整的视频、独立的背景层以及多个前景RGB层及其对应的alpha蒙版。我们的关键见解是,最近的视频生成骨干网络在时间和空间上都使用了高压缩,这使我们能够沿时间维度序列化多个分层表示,并在共享生成轨迹上联合建模它们。这将跨层一致性转化为内在目标,提高了语义对齐和时间连贯性。为了缓解分层歧义和条件泄漏,我们扩展了共享的DiT骨干网络,加入了LayerAdaLN和分层感知的交叉注意力调制。LayerT2V在三个阶段进行训练:alpha蒙版VAE适应、联合多层学习以及多前景扩展。我们还引入了VidLayer,这是首个用于多层视频生成的大规模数据集。广泛的实验表明,LayerT2V在视觉保真度、时间一致性以及跨层连贯性方面显著优于先前的方法。
Summary / 总结
LayerT2V is a unified multi-layer video generation framework that generates a full video, an independent background layer, and multiple foreground RGB layers with corresponding alpha mattes in a single inference pass. It leverages recent video generation backbones to serialize multiple layer representations along the temporal dimension, improving semantic alignment and temporal coherence. LayerT2V is trained in three stages and outperforms previous methods in visual fidelity, temporal consistency, and cross-layer coherence.
LayerT2V 是一个统一的多层视频生成框架,能够在单次推理中生成完整的视频、独立的背景层以及多个前景 RGB 层及其对应的 alpha 蒙版。它利用近期视频生成模型在时间和空间上的高压缩性,将多个层沿时间维度序列化并联合建模,从而增强语义对齐和时间连贯性。通过大量实验,LayerT2V 在视觉保真度、时间一致性以及跨层一致性方面均优于先前的方法。
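To make concrete why a background layer plus foreground RGB layers with alpha mattes is an editable representation, the sketch below recomposes one pixel with the standard "over" operator. It is a generic compositing example and not part of LayerT2V. The non-premultiplied colour convention and the function name are assumptions.

```python
from typing import Sequence, Tuple

Pixel = Tuple[float, float, float]


def composite_over(background: Pixel, layers: Sequence[Tuple[Pixel, float]]) -> Pixel:
    """Blend foreground layers onto a background pixel, back to front.

    Each layer is (rgb, alpha) with non-premultiplied colour and alpha in [0, 1].
    """
    out = background
    for rgb, alpha in layers:
        # Standard "over" blend: the layer covers a fraction alpha of what lies beneath.
        out = tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(rgb, out))
    return out


# One white foreground layer at 25% opacity over a black background.
pixel = composite_over((0.0, 0.0, 0.0), [((1.0, 1.0, 1.0), 0.25)])
```

Dropping, reordering or recolouring an entry in `layers` changes the final frame without regenerating the other layers, which a single composited video does not allow.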
AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning
Authors: Yutong Wang, Siyuan Xiong, Xuebo Liu, Wenkang Zhou, Liang Ding, Miao Zhang, Min Zhang
First: 2026-02-26T17:31:43+00:00 · Latest: 2026-02-26T17:31:43+00:00
Abstract
While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information generated by individual participants. Current solutions often resort to rigid structural engineering or expensive fine-tuning, limiting their deployability and adaptability. We propose AgentDropoutV2, a test-time rectify-or-reject pruning framework designed to dynamically optimize MAS information flow without retraining. Our approach acts as an active firewall, intercepting agent outputs and employing a retrieval-augmented rectifier to iteratively correct errors based on a failure-driven indicator pool. This mechanism allows for the precise identification of potential errors using distilled failure patterns as prior knowledge. Irreparable outputs are subsequently pruned to prevent error propagation, while a fallback strategy preserves system integrity. Empirical results on extensive math benchmarks show that AgentDropoutV2 significantly boosts the MAS's task performance, achieving an average accuracy gain of 6.3 percentage points on math benchmarks. Furthermore, the system exhibits robust generalization and adaptivity, dynamically modulating rectification efforts based on task difficulty while leveraging context-aware indicators to resolve a wide spectrum of error patterns. Our code and dataset are released at https://github.com/TonySY2/AgentDropoutV2.
中文标题/摘要
标题:AgentDropoutV2:通过测试时修正或拒绝修剪优化多智能体系统中的信息流
尽管多智能体系统(MAS)在复杂推理方面表现出色,但它们会受到个别参与者生成的错误信息的连锁影响。当前的解决方案往往依赖于僵化的结构工程或昂贵的微调,限制了它们的部署能力和适应性。我们提出了AgentDropoutV2,这是一种测试时修正或拒绝修剪框架,旨在无需重新训练的情况下动态优化MAS中的信息流。我们的方法充当主动防火墙,拦截智能体输出,并使用检索增强的修正器根据失败驱动的指示池逐步纠正错误。该机制利用提炼出的失败模式作为先验知识,精确识别潜在错误。无法修复的输出随后被修剪以防止错误传播,而备用策略则保持系统的完整性。在广泛的数学基准测试上的实验证明,AgentDropoutV2 显著提升了MAS的任务性能,在数学基准测试中平均准确率提高了6.3个百分点。此外,该系统表现出强大的泛化能力和适应性,根据任务难度动态调节修正努力,并利用上下文感知的指示器解决广泛的错误模式。我们的代码和数据集发布在https://github.com/TonySY2/AgentDropoutV2。
Summary / 总结
AgentDropoutV2 is a test-time rectify-or-reject pruning framework designed to optimize information flow in Multi-Agent Systems (MAS) without retraining. It intercepts agent outputs, uses a retrieval-augmented rectifier to iteratively correct errors based on failure patterns, and prunes irreparable outputs to prevent error propagation. Experiments on math benchmarks show an average accuracy gain of 6.3 percentage points, with robust generalization and adaptivity to task difficulty.
AgentDropoutV2 是一种测试时的纠正或拒绝剪枝框架,旨在通过动态拦截和纠正错误来优化多智能体系统的信息流,而无需重新训练。它使用检索增强的校正器基于失败模式迭代纠正错误,并修剪不可修复的输出以防止错误传播。在数学基准测试上的实验显示平均准确率提高了6.3个百分点,具有针对任务难度的鲁棒泛化和适应性,能够解决广泛的错误模式。
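The rectify-or-reject flow described in the AgentDropoutV2 entry above can be sketched as a small control loop. This is only a schematic under stated assumptions: `find_errors` and `rectify` are hypothetical callables standing in for the indicator pool and the retrieval-augmented rectifier, and the fallback shown (pass the originals through when everything is pruned) is one guess, since the abstract does not specify the strategy.

```python
from typing import Callable, List, Optional

FindErrors = Callable[[str], List[str]]
Rectify = Callable[[str, List[str]], str]


def rectify_or_reject(
    output: str, find_errors: FindErrors, rectify: Rectify, max_rounds: int = 3
) -> Optional[str]:
    """Return a corrected output, or None if it is still flagged after max_rounds."""
    for _ in range(max_rounds):
        errors = find_errors(output)
        if not errors:
            return output
        output = rectify(output, errors)
    # Final check after the last correction attempt.
    return output if not find_errors(output) else None


def filter_messages(
    outputs: List[str], find_errors: FindErrors, rectify: Rectify
) -> List[str]:
    """Keep rectified outputs and prune rejected ones before they reach other agents."""
    results = (rectify_or_reject(o, find_errors, rectify) for o in outputs)
    kept = [r for r in results if r is not None]
    # Assumed fallback: if every output was pruned, forward the originals unchanged.
    return kept if kept else list(outputs)
```

No model weights are touched anywhere in the loop, which matches the test-time, retraining-free framing of the abstract.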
Efficient Graph Coloring with Neural Networks: A Physics-Inspired Approach for Large Graphs
Authors: Lorenzo Colantonio, Andrea Cacioppo, Federico Scarpati, Maria Chiara Angelini, Federico Ricci-Tersenghi, Stefano Giagu
First: 2024-08-02T18:02:51+00:00 · Latest: 2026-02-26T17:28:25+00:00
Comments: 15 pages, 9 figures
Abstract
Combinatorial optimization problems near algorithmic phase transitions represent a fundamental challenge for both classical algorithms and machine learning approaches. Among them, graph coloring stands as a prototypical constraint satisfaction problem exhibiting sharp dynamical and satisfiability thresholds. Here we introduce a physics-inspired neural framework that learns to solve large-scale graph coloring instances by combining graph neural networks with statistical-mechanics principles. Our approach integrates a planting-based supervised signal, symmetry-breaking regularization, and iterative noise-annealed neural dynamics to navigate clustered solution landscapes. When the number of iterations scales quadratically with graph size, the learned solver reaches algorithmic thresholds close to the theoretical dynamical transition in random graphs and achieves near-optimal detection performance in the planted inference regime. The model generalizes from small training graphs to instances orders of magnitude larger, demonstrating that neural architectures can learn scalable algorithmic strategies that remain effective in hard connectivity regions. These results establish a general paradigm for learning neural solvers that operate near fundamental phase boundaries in combinatorial optimization and inference.
中文标题/摘要
标题:基于神经网络的大规模图着色高效算法:一种受物理启发的方法
组合优化问题接近算法相变区域代表了对经典算法和机器学习方法的基本挑战。其中,图着色作为一种典型的约束满足问题,表现出尖锐的动力学和可满足性阈值。在这里,我们提出了一种受物理启发的神经框架,通过结合图神经网络和统计力学原理来学习解决大规模图着色实例。我们的方法整合了基于种植的监督信号、对称性破缺正则化以及迭代噪声退火神经动力学,以导航集群解空间。当迭代次数与图大小成二次关系时,学习到的求解器接近随机图的理论动力学转变,实现了在种植推断区域的近最优检测性能。该模型从较小的训练图推广到实例数量级更大的图,表明神经架构可以学习可扩展的算法策略,这些策略在困难连接区域仍然有效。这些结果确立了一种通用范式,用于学习在组合优化和推断的基本相变区域操作的神经求解器。
Summary / 总结
The research addresses the challenge of solving large-scale graph coloring problems, which are critical in combinatorial optimization. It proposes a physics-inspired neural framework that combines graph neural networks with statistical-mechanics principles. The method includes a planting-based supervised signal, symmetry-breaking regularization, and iterative noise-annealed neural dynamics. The model achieves near-optimal performance in the planted inference regime and scales effectively to much larger graphs, demonstrating that neural architectures can learn scalable strategies for hard connectivity regions.
研究针对大规模图着色问题,这是一种接近算法相变的组合优化问题。提出了一种基于物理的神经框架,结合了图神经网络和统计力学原理。该方法包括基于种植的监督信号、对称性破坏正则化和迭代噪声退火神经动力学。学习到的求解器在种植推断区域实现了接近最优的性能,并且能够有效地扩展到更大的图,展示了神经架构学习可扩展算法策略解决难题的潜力。
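For readers unfamiliar with the objective behind the graph coloring entry above, the sketch below counts monochromatic edges, the quantity that any coloring solver tries to drive to zero. It illustrates the problem only, not the neural solver. The edge-list and dictionary representations are assumptions.

```python
from typing import Dict, Iterable, Tuple


def count_conflicts(edges: Iterable[Tuple[int, int]], colors: Dict[int, int]) -> int:
    """Number of edges whose two endpoints share a color (0 means a proper coloring)."""
    return sum(1 for u, v in edges if colors[u] == colors[v])


triangle = [(0, 1), (1, 2), (0, 2)]
proper = count_conflicts(triangle, {0: 0, 1: 1, 2: 2})    # three colors suffice
improper = count_conflicts(triangle, {0: 0, 1: 1, 2: 0})  # nodes 0 and 2 clash
```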
A Model-Free Universal AI
Authors: Yegon Kim, Juho Lee
First: 2026-02-26T17:21:16+00:00 · Latest: 2026-02-26T17:21:16+00:00
Abstract
In general reinforcement learning, all established optimal agents, including AIXI, are model-based, explicitly maintaining and using environment models. This paper introduces Universal AI with Q-Induction (AIQI), the first model-free agent proven to be asymptotically $\varepsilon$-optimal in general RL. AIQI performs universal induction over distributional action-value functions, instead of policies or environments like previous works. Under a grain of truth condition, we prove that AIQI is strong asymptotically $\varepsilon$-optimal and asymptotically $\varepsilon$-Bayes-optimal. Our results significantly expand the diversity of known universal agents.
中文标题/摘要
标题:一种无模型的通用人工智能
在通用强化学习中,所有已建立的最优代理,包括AIXI,都是基于模型的,明确地维护和使用环境模型。本文介绍了基于Q-归纳的通用人工智能(AIQI),这是第一个被证明在通用RL中渐近$\varepsilon$-最优的无模型代理。AIQI在分布动作值函数上进行通用归纳,而不是像以前的工作那样在策略或环境中进行归纳。在一定的真实条件下,我们证明AIQI是强渐近$\varepsilon$-最优和渐近$\varepsilon$-贝叶斯最优的。我们的结果显著扩展了已知通用代理的多样性。
Summary / 总结
The paper introduces AIQI, a model-free agent that is the first to be proven asymptotically $\varepsilon$-optimal in general reinforcement learning. Unlike previous model-based agents like AIXI, AIQI performs universal induction over distributional action-value functions. Under the grain of truth condition, AIQI is shown to be strong asymptotically $\varepsilon$-optimal and asymptotically $\varepsilon$-Bayes-optimal, expanding the known diversity of universal agents.
论文介绍了AIQI,这是一种无模型的代理,首次被证明在通用强化学习中渐近$\varepsilon$-最优。与之前的基于模型的代理如AIXI不同,AIQI在分布动作价值函数上执行通用归纳。在真理粒度条件下,AIQI被证明是强渐近$\varepsilon$-最优和渐近$\varepsilon$-贝叶斯最优,扩展了已知的通用代理的多样性。
Agency and Architectural Limits: Why Optimization-Based Systems Cannot Be Norm-Responsive
Authors: Radha Sarma
First: 2026-02-26T17:16:17+00:00 · Latest: 2026-02-26T17:16:17+00:00
Comments: About 10,500 words in all (including 922 words of literature and 2019 words of Appendices). Under journal review
Abstract
AI systems are increasingly deployed in high-stakes contexts -- medical diagnosis, legal research, financial analysis -- under the assumption they can be governed by norms. This paper demonstrates that assumption is formally invalid for optimization-based systems, specifically Large Language Models trained via Reinforcement Learning from Human Feedback (RLHF). We establish that genuine agency requires two necessary and jointly sufficient architectural conditions: the capacity to maintain certain boundaries as non-negotiable constraints rather than tradeable weights (Incommensurability), and a non-inferential mechanism capable of suspending processing when those boundaries are threatened (Apophatic Responsiveness). These conditions apply across all normative domains.
RLHF-based systems are constitutively incompatible with both conditions. The operations that make optimization powerful -- unifying all values on a scalar metric and always selecting the highest-scoring output -- are precisely the operations that preclude normative governance. This incompatibility is not a correctable training bug awaiting a technical fix; it is a formal constraint inherent to what optimization is. Consequently, documented failure modes - sycophancy, hallucination, and unfaithful reasoning - are not accidents but structural manifestations.
Misaligned deployment triggers a second-order risk we term the Convergence Crisis: when humans are forced to verify AI outputs under metric pressure, they degrade from genuine agents into criteria-checking optimizers, eliminating the only component in the system capable of normative accountability. Beyond the incompatibility proof, the paper's primary positive contribution is a substrate-neutral architectural specification defining what any system -- biological, artificial, or institutional -- must satisfy to qualify as an agent rather than a sophisticated instrument.
中文标题/摘要
标题:代理与架构限制:基于优化的系统为何不能响应规范
AI系统在医疗诊断、法律研究、金融分析等高风险领域中的应用假设它们可以被规范所治理。本文证明了对于基于优化的系统,特别是通过人类反馈强化学习(RLHF)训练的大规模语言模型,这一假设在形式上是无效的。我们确立了真正的代理需要两个必要且充分的架构条件:维持某些边界作为不可谈判的约束而非可交易的权重的能力(不可通约性),以及一种非推论机制,能够在这些边界受到威胁时暂停处理(否定性响应)。这些条件适用于所有规范领域。
RLHF系统本质上与这两个条件不兼容。使优化强大的操作——将所有价值统一到一个标量度量上并始终选择最高得分的输出——正是这些操作排除了规范治理的可能性。这种不兼容性不是可以通过技术修复来纠正的训练错误;它是优化本身固有的形式约束。因此,记录中的失败模式——阿谀奉承、幻觉和不忠实推理——不是事故,而是结构性的表现。
不恰当的应用触发了我们称之为收敛危机的第二级风险:当人类被迫在度量压力下验证AI输出时,他们从真正的代理降级为标准检查优化器,从而消除了系统中唯一能够承担规范问责制的组件。除了不兼容性证明,本文的主要积极贡献是一个无基质的架构规范,定义了任何系统——无论是生物的、人工的还是机构的——要被视为代理而非复杂的工具,必须满足的条件。
Summary / 总结
This paper explores why optimization-based AI systems, particularly those trained via Reinforcement Learning from Human Feedback (RLHF), cannot be governed by norms. It identifies two necessary conditions for genuine agency: incommensurability and apophatic responsiveness, which RLHF systems lack. The paper demonstrates that the operations that make optimization powerful, such as unifying all values on a scalar metric, inherently prevent normative governance, leading to failure modes like sycophancy and unfaithful reasoning. Beyond this incompatibility, the paper introduces the concept of the Convergence Crisis, where humans become optimizers under metric pressure, eliminating normative accountability in AI systems.
本文探讨了为什么基于优化的系统,特别是通过人类反馈强化学习(RLHF)训练的大语言模型,无法受到规范的治理。研究指出,真正的代理需要两个必要条件:不可通约性和否定性响应。论文表明,RLHF系统由于其优化操作(统一所有价值到一个标量度量并始终选择最高得分的输出)而与这些条件不兼容。这种不兼容导致了诸如奉承、幻觉和不忠实推理等已记录的失败模式。此外,论文还提出了收敛危机的概念,即人类在面对度量压力时会变成标准检查的优化器,从而消除规范问责制的唯一组成部分。
Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents
Authors: Zhou Xu, Bowen Zhou, Qi Wang, Shuwen Feng, Jingyu Xiao
First: 2026-02-26T17:12:40+00:00 · Latest: 2026-02-26T17:12:40+00:00
Abstract
Pure-vision GUI agents provide universal interaction capabilities but suffer from severe efficiency bottlenecks due to the massive spatiotemporal redundancy inherent in high-resolution screenshots and historical trajectories. We identify two critical misalignments in existing compression paradigms: the temporal mismatch, where uniform history encoding diverges from the agent's "fading memory" attention pattern, and the spatial topology conflict, where unstructured pruning compromises the grid integrity required for precise coordinate grounding, inducing spatial hallucinations. To address these challenges, we introduce GUIPruner, a training-free framework tailored for high-resolution GUI navigation. It synergizes Temporal-Adaptive Resolution (TAR), which eliminates historical redundancy via decay-based resizing, and Stratified Structure-aware Pruning (SSP), which prioritizes interactive foregrounds and semantic anchors while safeguarding global layout. Extensive evaluations across diverse benchmarks demonstrate that GUIPruner consistently achieves state-of-the-art performance, effectively preventing the collapse observed in large-scale models under high compression. Notably, on Qwen2-VL-2B, our method delivers a 3.4x reduction in FLOPs and a 3.3x speedup in vision encoding latency while retaining over 94% of the original performance, enabling real-time, high-precision navigation with minimal resource consumption.
中文标题/摘要
标题:时空令牌剪枝以实现高效的高分辨率GUI代理
纯视觉GUI代理提供了通用的交互能力,但由于高分辨率屏幕截图和历史轨迹中固有的大量时空冗余,它们遭受了严重的效率瓶颈。我们发现现有压缩范式中的两个关键不匹配:时间上的不匹配,其中均匀的历史编码与代理的“衰减记忆”注意力模式相背离,以及空间拓扑冲突,其中无结构的剪枝破坏了用于精确坐标定位所需的网格完整性,导致空间幻觉。为了解决这些挑战,我们引入了GUIPruner,这是一种针对高分辨率GUI导航的无需训练框架。它结合了基于衰减的重缩放来消除历史冗余的时空自适应分辨率(TAR),以及优先处理交互前景和语义锚点同时保护全局布局的分层结构感知剪枝(SSP)。在多种基准上的广泛评估表明,GUIPruner始终能够实现最先进的性能,有效防止在高压缩下大型模型的性能崩溃。值得注意的是,在Qwen2-VL-2B上,我们的方法在FLOPs上减少了3.4倍,在视觉编码延迟上加快了3.3倍,同时保留了超过94%的原始性能,使实时、高精度导航在极低资源消耗下成为可能。
Summary / 总结
The paper addresses the efficiency issues of pure-vision GUI agents by introducing GUIPruner, a training-free framework that combines Temporal-Adaptive Resolution (TAR) and Stratified Structure-aware Pruning (SSP). TAR reduces historical redundancy through decay-based resizing, while SSP prioritizes interactive foregrounds and semantic anchors and safeguards the global layout. On Qwen2-VL-2B, the method achieves a 3.4x reduction in FLOPs and a 3.3x speedup in vision encoding latency while retaining over 94% of the original performance.
该论文通过引入GUIPruner框架解决了纯视觉GUI代理的效率问题,该框架结合了Temporal-Adaptive Resolution (TAR) 和 Stratified Structure-aware Pruning (SSP)。TAR通过衰减基resize减少历史冗余,而SSP优先处理交互元素和语义锚点以保持网格完整性。该方法在多种基准测试中显著提升了性能,实现了3.4倍的FLOPs减少和3.3倍的视觉编码延迟加速,同时对性能的影响较小。
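The decay-based resizing mentioned for TAR in the entry above could look roughly like the sketch below, in which older screenshots in the history are stored at progressively lower resolution. The exponential schedule, the `decay` and `min_scale` values and the function names are all assumptions for illustration. The abstract does not give the actual schedule.

```python
from typing import List, Tuple


def decayed_size(
    width: int, height: int, age: int, decay: float = 0.5, min_scale: float = 0.125
) -> Tuple[int, int]:
    """Target size for a screenshot that is `age` steps old (age 0 is the current frame)."""
    scale = max(min_scale, decay ** age)  # assumed exponential decay with a floor
    return max(1, round(width * scale)), max(1, round(height * scale))


def history_sizes(width: int, height: int, n_frames: int) -> List[Tuple[int, int]]:
    """Sizes for a history of n_frames screenshots, listed oldest first."""
    return [decayed_size(width, height, age) for age in range(n_frames - 1, -1, -1)]


sizes = history_sizes(1920, 1080, 3)
```

The current frame keeps full resolution, so coordinate grounding on the live screen is unaffected, while token cost for older frames shrinks with age.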
Skarimva: Skeleton-based Action Recognition is a Multi-view Application
Authors: Daniel Bermuth, Alexander Poeppel, Wolfgang Reif
First: 2026-02-26T17:10:58+00:00 · Latest: 2026-02-26T17:10:58+00:00
Abstract
Human action recognition plays an important role when developing intelligent interactions between humans and machines. While there is a lot of active research on improving the machine learning algorithms for skeleton-based action recognition, not much attention has been given to the quality of the input skeleton data itself. This work demonstrates that by making use of multiple camera views to triangulate more accurate 3D skeletons, the performance of state-of-the-art action recognition models can be improved significantly. This suggests that the quality of the input data is currently a limiting factor for the performance of these models. Based on these results, it is argued that the cost-benefit ratio of using multiple cameras is very favorable in most practical use-cases, therefore future research in skeleton-based action recognition should consider multi-view applications as the standard setup.
中文标题/摘要
标题:Skarimva:基于骨架的动作识别是一种多视角应用
人类动作识别在开发人机智能交互中起着重要作用。尽管在基于骨架的动作识别机器学习算法改进方面有很多活跃的研究,但对输入骨架数据的质量关注却不多。这项工作表明,通过利用多摄像头视角来三角测量更准确的3D骨架,可以显著提高最先进的动作识别模型的性能。这表明,输入数据的质量目前是这些模型性能的限制因素。基于这些结果,认为在大多数实际应用场景中,使用多摄像头的成本效益比非常有利,因此未来基于骨架的动作识别研究应将多视角应用作为标准设置。
Summary / 总结
The research aims to improve the quality of input skeleton data for human action recognition, which is crucial for intelligent human-machine interactions. The method involves using multiple camera views to triangulate more accurate 3D skeletons, leading to significant improvements in the performance of state-of-the-art action recognition models. The key finding is that the quality of input data is currently a limiting factor, and using multiple cameras is highly beneficial in practical applications, suggesting that multi-view setups should be the standard for future research in this field.
研究旨在提高用于人体动作识别的输入骨架数据质量,这对于实现智能的人机交互至关重要。研究使用多摄像头视角来三角测量更准确的3D骨架,显著提高了最先进的动作识别模型的性能。这表明输入数据的质量是限制模型性能的关键因素,而在实际应用中使用多摄像头是非常有利的。
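The Skarimva entry above relies on triangulating 3D joints from several camera views. As a minimal sketch of that step, the code below uses the classic two-ray midpoint method: each camera contributes a ray through the detected joint, and the 3D estimate is the midpoint of the shortest segment between the rays. The abstract does not say which triangulation method the authors use. Known camera centers and ray directions are assumed.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def triangulate_midpoint(c1: Vec3, d1: Vec3, c2: Vec3, d2: Vec3) -> Vec3:
    """Midpoint of the closest points on rays c1 + s * d1 and c2 + t * d2."""
    w = tuple(p - q for p, q in zip(c1, c2))
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are (nearly) parallel; the joint cannot be triangulated")
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = tuple(ci + s * di for ci, di in zip(c1, d1))
    p2 = tuple(ci + t * di for ci, di in zip(c2, d2))
    return tuple((u + v) / 2 for u, v in zip(p1, p2))


# Two cameras looking at a joint located at (1, 2, 3).
joint = triangulate_midpoint((0.0, 0.0, 0.0), (1.0, 2.0, 3.0), (4.0, 0.0, 0.0), (-3.0, 2.0, 3.0))
```

With noisy 2D detections the two rays no longer intersect exactly, and the midpoint gives a compromise estimate. Adding more views generally tightens it further.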
Large Multimodal Models as General In-Context Classifiers
Authors: Marco Garosi, Matteo Farina, Alessandro Conti, Massimiliano Mancini, Elisa Ricci
Venue: CVPR
First: 2026-02-26T17:08:18+00:00 · Latest: 2026-02-26T17:08:18+00:00
Comments: CVPR Findings 2026. Project website at https://circle-lmm.github.io/
Abstract
Which multimodal model should we use for classification? Previous studies suggest that the answer lies in CLIP-like contrastive Vision-Language Models (VLMs), due to their remarkable performance in zero-shot classification. In contrast, Large Multimodal Models (LMM) are more suitable for complex tasks. In this work, we argue that this answer overlooks an important capability of LMMs: in-context learning. We benchmark state-of-the-art LMMs on diverse datasets for closed-world classification and find that, although their zero-shot performance is lower than CLIP's, LMMs with a few in-context examples can match or even surpass contrastive VLMs with cache-based adapters, their "in-context" equivalent. We extend this analysis to the open-world setting, where the generative nature of LMMs makes them more suitable for the task. In this challenging scenario, LMMs struggle whenever provided with imperfect context information. To address this issue, we propose CIRCLE, a simple training-free method that assigns pseudo-labels to in-context examples, iteratively refining them with the available context itself. Through extensive experiments, we show that CIRCLE establishes a robust baseline for open-world classification, surpassing VLM counterparts and highlighting the potential of LMMs to serve as unified classifiers, and a flexible alternative to specialized models.
中文标题/摘要
标题:大型多模态模型作为通用上下文分类器
我们应该使用哪种多模态模型进行分类?以往的研究表明,答案在于CLIP类对比视觉-语言模型(VLMs),因为它们在零样本分类中的表现非常出色。相比之下,大型多模态模型(LMM)更适合复杂任务。在本文中,我们提出这种答案忽视了LMM的一个重要能力:上下文学习。我们在多种数据集上对最先进的LMM进行基准测试,发现尽管它们的零样本性能低于CLIP,但在提供少量上下文示例的情况下,LMM可以匹配甚至超越基于缓存适配器的对比VLM,其“上下文”等价物。我们将这种分析扩展到开放世界设置,在这种具有挑战性的场景中,LMM在提供不完美上下文信息时会遇到困难。为了解决这一问题,我们提出了一种简单的无训练方法CIRCLE,该方法为上下文示例分配伪标签,并通过可用的上下文本身逐步优化它们。通过大量实验,我们展示了CIRCLE为开放世界分类建立了稳健的基础,超越了VLM的对应物,并突显了LMM作为统一分类器和服务于专门模型的灵活替代方案的潜力。
Summary / 总结
This work explores the use of Large Multimodal Models (LMMs) for classification tasks, arguing that their in-context learning capability makes them competitive with Contrastive Vision-Language Models (VLMs) in both closed-world and open-world settings. Experiments show that LMMs, when provided with a few in-context examples, can match or exceed the performance of VLMs with cache-based adapters. The proposed CIRCLE method further enhances LMMs' performance in open-world scenarios by iteratively refining pseudo-labels with context information, demonstrating their potential as unified classifiers.
该研究探讨了大型多模态模型(LMMs)在分类任务中的应用,认为它们的在上下文学习能力可以与基于缓存的适配器的对比视觉语言模型(VLMs)相匹敌。实验表明,尽管LMMs的零样本性能较低,但在少量上下文示例的支持下,它们可以表现出相当或更好的性能。研究还探讨了开放世界设置,其中LMMs在不完美的上下文信息下表现不佳,并提出了一种名为CIRCLE的无训练方法,通过迭代细化伪标签来解决这一问题,展示了LMMs作为稳健分类器的潜力,并作为专门模型的灵活替代方案。
MovieTeller: Tool-augmented Movie Synopsis with ID Consistent Progressive Abstraction
Authors: Yizhi Li, Xiaohan Chen, Miao Jiang, Wentao Tang, Gaoang Wang
First: 2026-02-26T17:08:08+00:00 · Latest: 2026-02-26T17:08:08+00:00
Comments: 6 pages, CSCWD 2026
Abstract
With the explosive growth of digital entertainment, automated video summarization has become indispensable for applications such as content indexing, personalized recommendation, and efficient media archiving. Automatic synopsis generation for long-form videos, such as movies and TV series, presents a significant challenge for existing Vision-Language Models (VLMs). While proficient at single-image captioning, these general-purpose models often exhibit critical failures in long-duration contexts, primarily a lack of ID-consistent character identification and a fractured narrative coherence. To overcome these limitations, we propose MovieTeller, a novel framework for generating movie synopses via tool-augmented progressive abstraction. Our core contribution is a training-free, tool-augmented, fact-grounded generation process. Instead of requiring costly model fine-tuning, our framework directly leverages off-the-shelf models in a plug-and-play manner. We first invoke a specialized face recognition model as an external "tool" to establish Factual Groundings--precise character identities and their corresponding bounding boxes. These groundings are then injected into the prompt to steer the VLM's reasoning, ensuring the generated scene descriptions are anchored to verifiable facts. Furthermore, our progressive abstraction pipeline decomposes the summarization of a full-length movie into a multi-stage process, effectively mitigating the context length limitations of current VLMs. Experiments demonstrate that our approach yields significant improvements in factual accuracy, character consistency, and overall narrative coherence compared to end-to-end baselines.
中文标题/摘要
标题:MovieTeller:工具增强的电影概要生成与ID一致渐进抽象
随着数字娱乐的爆炸性增长,自动视频摘要已成为内容索引、个性化推荐和高效媒体归档等应用不可或缺的技术。对于长格式视频,如电影和电视剧的自动概要生成,现有视觉-语言模型(VLMs)面临重大挑战。尽管在单张图像描述方面表现出色,但这些通用模型在长时间段上下文中往往表现出关键性失败,主要是缺乏ID一致的人物识别和叙事连贯性断裂。为克服这些限制,我们提出了一种新的框架——MovieTeller,用于通过工具增强的渐进抽象生成电影概要。我们的核心贡献是一种无需训练、工具增强、基于事实的生成过程。我们不需进行昂贵的模型微调,而是直接以插拔式方式利用现成模型。我们首先调用一个专门的面部识别模型作为外部“工具”,建立事实基础——精确的人物身份及其对应的边界框。这些基础随后被注入提示中,引导VLM的推理,确保生成的场景描述基于可验证的事实。此外,我们的渐进抽象流水线将整部电影的总结分解为多阶段过程,有效缓解了当前VLM的上下文长度限制。实验表明,与端到端基线相比,我们的方法在事实准确性、人物一致性以及整体叙事连贯性方面取得了显著改进。
Summary / 总结
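MovieTeller is a training-free framework that generates movie synopses through tool-augmented progressive abstraction. It calls an off-the-shelf face recognition model as an external tool to obtain character identities and bounding boxes, injects these factual groundings into the VLM prompt, and decomposes the summarization of a full-length movie into multiple stages to work around context length limits. Experiments show improved factual accuracy, character consistency, and narrative coherence over end-to-end baselines.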
MovieTeller 是一种利用工具增强的渐进式抽象框架,用于生成电影概要。它利用专门的面部识别模型来确定精确的人物身份,并将这些事实注入到视觉-语言模型(VLM)的提示中,确保事实的准确性及人物的一致性。该框架将总结过程分解为多个阶段,解决了当前 VLM 的上下文长度限制问题。实验结果表明,MovieTeller 在事实准确性、人物一致性及叙事连贯性方面优于端到端基线。
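The two ideas in this entry, injecting tool outputs into the prompt and summarizing in stages, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `name`/`box` dictionary keys, the prompt wording, and the `group_size` parameter are all hypothetical, and `summarize` stands in for whatever VLM or LLM call a real pipeline would use.

```python
from typing import Callable, List, Sequence


def inject_groundings(prompt: str, groundings: Sequence[dict]) -> str:
    """Append character identities and bounding boxes to a prompt as plain-text facts.

    Each grounding is a hypothetical dict such as {"name": "Alice", "box": (x1, y1, x2, y2)}.
    """
    facts = "; ".join(f"{g['name']} at {g['box']}" for g in groundings)
    return f"{prompt}\nKnown characters: {facts}" if facts else prompt


def progressive_abstraction(
    items: Sequence[str],
    summarize: Callable[[List[str]], str],
    group_size: int = 4,
) -> str:
    """Merge groups of descriptions level by level until a single synopsis remains.

    Each call to `summarize` only sees `group_size` texts, so no single call
    has to fit the whole movie into its context window.
    """
    if not items:
        raise ValueError("need at least one scene description")
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so every level shrinks")
    level = list(items)
    while len(level) > 1:
        level = [
            summarize(level[i:i + group_size])
            for i in range(0, len(level), group_size)
        ]
    return level[0]
```

With five scene descriptions and `group_size=2`, the sketch makes three passes (5 to 3 to 2 to 1 texts), which shows how the number of stages grows slowly with movie length.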
SODAs: Sparse Optimization for the Discovery of Differential and Algebraic Equations
Authors: Manu Jayadharan, Christina Catlett, Arthur N. Montanari, Niall M. Mangan
First: 2025-03-08T00:29:00+00:00 · Latest: 2026-02-26T17:05:08+00:00
Comments: 22 pages, 5 figures
Abstract
Differential-algebraic equations (DAEs) integrate ordinary differential equations (ODEs) with algebraic constraints, providing a fundamental framework for developing models of dynamical systems characterized by timescale separation, conservation laws, and physical constraints. While sparse optimization has revolutionized model development by allowing data-driven discovery of parsimonious models from a library of possible equations, existing approaches for dynamical systems assume DAEs can be reduced to ODEs by eliminating variables before model discovery. This assumption limits the applicability of such methods for DAE systems with unknown constraints and time scales. We introduce Sparse Optimization for Differential-Algebraic Systems (SODAs), a data-driven method for the identification of DAEs in their explicit form. By discovering the algebraic and dynamic components sequentially without prior identification of the algebraic variables, this approach leads to a sequence of convex optimization problems. It has the advantage of discovering interpretable models that preserve the structure of the underlying physical system. To this end, SODAs improves numerical stability when handling high correlations between library terms, caused by near-perfect algebraic relationships, by iteratively refining the conditioning of the candidate library. We demonstrate the performance of our method on biological, mechanical, and electrical systems, showcasing its robustness to noise in both simulated time series and real-time experimental data.
中文标题/摘要
标题:SODAs:用于发现微分与代数方程的稀疏优化
微分代数方程(DAEs)结合了常微分方程(ODEs)和代数约束,为具有时间尺度分离、守恒定律和物理约束的动力系统模型开发提供了基本框架。稀疏优化通过从可能的方程库中发现简约模型,已彻底改变了模型开发,但现有动力系统方法假设DAEs可以通过消除变量在建模之前被简化为ODEs,这限制了这些方法在具有未知约束和时间尺度的DAE系统中的应用。我们引入了稀疏优化以发现微分代数系统(SODAs),这是一种数据驱动的方法,用于识别DAE的显式形式。通过顺序发现代数和动态组件,而无需先识别代数变量,这种方法导致一系列凸优化问题。它具有发现可解释模型的优势,这些模型保留了底层物理系统的结构。为此,SODAs通过迭代改进候选库的条件数,提高了在处理由近乎完美的代数关系引起的库项之间高相关性时的数值稳定性。我们在生物、机械和电气系统上展示了该方法的性能,展示了其在模拟时间序列和实时实验数据中的鲁棒性。
Summary / 总结
SODAs is a data-driven method for identifying differential-algebraic equations (DAEs) directly without reducing them to ordinary differential equations (ODEs). By sequentially discovering algebraic and dynamic components, it solves a series of convex optimization problems, leading to interpretable models that preserve the physical system's structure. SODAs improves numerical stability through iterative refinement of the candidate library's conditioning, especially in the presence of high correlations between library terms. The method is validated on various systems, demonstrating robustness to noise in both simulated and real-time data.
SODAs 是一种数据驱动的方法,用于识别显式形式的微分代数方程(DAEs),即结合了常微分方程(ODEs)与代数约束的方程。与现有方法在模型发现前将 DAEs 化简为 ODEs 不同,SODAs 依次发现代数和动态组件,从而形成一系列凸优化问题。该方法得到的可解释模型保留了物理系统的结构,并通过迭代改进候选库的条件数来提高数值稳定性。在生物、机械和电气系统上的实验表明,SODAs 对模拟数据和实时实验数据中的噪声均具有鲁棒性。
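For readers unfamiliar with the setting, the following LaTeX block writes out a generic sparse-regression formulation of DAE discovery. It is a sketch under stated assumptions rather than the formulation used in the paper: the semi-explicit DAE form, the candidate library $\Theta$, the $\ell_1$ penalty, and the weight $\lambda$ are standard choices assumed here for illustration.

```latex
% Semi-explicit DAE: differential states x, algebraic states z
\dot{x}(t) = f\bigl(x(t), z(t)\bigr), \qquad 0 = g\bigl(x(t), z(t)\bigr).

% Candidate library evaluated on m samples, with p candidate terms
\Theta(X) \in \mathbb{R}^{m \times p}.

% Algebraic step: take one library column \theta_j as the left-hand side
% and look for a sparse combination of the remaining columns that matches it.
\hat{\xi}_j = \arg\min_{\xi} \;
  \bigl\| \theta_j - \Theta_{-j}(X)\,\xi \bigr\|_2^2 + \lambda \|\xi\|_1 .

% Dynamic step: regress the measured derivatives on the library.
\hat{\Xi} = \arg\min_{\Xi} \;
  \bigl\| \dot{X} - \Theta(X)\,\Xi \bigr\|_F^2 + \lambda \|\Xi\|_1 .
```

Both problems are convex for a fixed library, which is consistent with the abstract's description of a sequence of convex optimization problems. Near-perfect algebraic relations make columns of $\Theta$ almost linearly dependent, which is why the conditioning of the library matters for numerical stability.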
Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?
Authors: Pengxiang Li, Dilxat Muhtar, Lu Yin, Tianlong Chen, Shiwei Liu
First: 2026-02-26T17:04:57+00:00 · Latest: 2026-02-26T17:04:57+00:00
Abstract
Diffusion Language Models (DLMs) are often advertised as enabling parallel token generation, yet practical fast DLMs frequently converge to left-to-right, autoregressive (AR)-like decoding dynamics. In contrast, genuinely non-AR generation is promising because it removes AR's sequential bottleneck, better exploiting parallel hardware to reduce synchronization/communication overhead and improve latency scaling with output length. We argue that a primary driver of AR-like decoding is a mismatch between DLM objectives and the highly sequential structure of widely used training data, including standard pretraining corpora and long chain-of-thought (CoT) supervision. Motivated by this diagnosis, we propose NAP (Non-Autoregressive Parallel DLMs), a proof-of-concept, data-centric approach that better aligns supervision with non-AR parallel decoding. NAP curates examples as multiple independent reasoning trajectories and couples them with a parallel-forced decoding strategy that encourages multi-token parallel updates. Across math reasoning benchmarks, NAP yields stronger performance under parallel decoding than DLMs trained on standard long CoT data, with gains growing as parallelism increases. Our results suggest that revisiting data and supervision is a principled direction for mitigating AR-like behavior and moving toward genuinely non-autoregressive parallel generation in DLMs. Our code is available at https://github.com/pixeli99/NAP.
中文标题/摘要
标题:为什么扩散语言模型在真正并行(非自回归)解码方面挣扎?
扩散语言模型(DLMs)通常被宣传为能够实现并行词元生成,但实践中快速的DLMs往往收敛到自回归(AR)式的解码动态。相比之下,真正非AR生成很有前景,因为它消除了AR的顺序瓶颈,更好地利用并行硬件减少同步/通信开销并改善输出长度的延迟缩放。我们认为,AR式解码的主要驱动因素是DLM目标与广泛使用的训练数据的高顺序结构之间的不匹配,包括标准预训练语料库和长链式思考(CoT)监督。基于这一诊断,我们提出了NAP(非自回归并行DLMs),这是一种概念验证、数据为中心的方法,更好地将监督与非AR并行解码对齐。NAP收集了多个独立的推理轨迹,并与并行强制解码策略相结合,鼓励多词并行更新。在数学推理基准测试中,NAP在并行解码下的性能优于在标准长CoT数据上训练的DLMs,随着并行度的增加,收益逐渐增大。我们的结果表明,重新审视数据和监督是减轻AR式行为并朝着真正非自回归并行生成的方向进行的合理方向。我们的代码可在https://github.com/pixeli99/NAP获取。
Summary / 总结
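This paper argues that Diffusion Language Models (DLMs) fall back to left-to-right, autoregressive (AR)-like decoding largely because their objectives are mismatched with the highly sequential structure of common training data, including standard pretraining corpora and long chain-of-thought (CoT) supervision. The authors propose NAP (Non-Autoregressive Parallel DLMs), a proof-of-concept, data-centric approach that curates examples as multiple independent reasoning trajectories and pairs them with a parallel-forced decoding strategy that encourages multi-token parallel updates. On math reasoning benchmarks, NAP outperforms DLMs trained on standard long CoT data under parallel decoding, with gains that grow as parallelism increases.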
UniScale: Unified Scale-Aware 3D Reconstruction for Multi-View Understanding via Prior Injection for Robotic Perception
Authors: Mohammad Mahdavian, Gordon Tan, Binbin Xu, Yuan Ren, Dongfeng Bai, Bingbing Liu
First: 2026-02-26T17:04:36+00:00 · Latest: 2026-02-26T17:04:36+00:00
Abstract
We present UniScale, a unified, scale-aware multi-view 3D reconstruction framework for robotic applications that flexibly integrates geometric priors through a modular, semantically informed design. In vision-based robotic navigation, the accurate extraction of environmental structure from raw image sequences is critical for downstream tasks. UniScale addresses this challenge with a single feed-forward network that jointly estimates camera intrinsics and extrinsics, scale-invariant depth and point maps, and the metric scale of a scene from multi-view images, while optionally incorporating auxiliary geometric priors when available. By combining global contextual reasoning with camera-aware feature representations, UniScale is able to recover the metric-scale of the scene. In robotic settings where camera intrinsics are known, they can be easily incorporated to improve performance, with additional gains obtained when camera poses are also available. This co-design enables robust, metric-aware 3D reconstruction within a single unified model. Importantly, UniScale does not require training from scratch, and leverages world priors exhibited in pre-existing models without geometric encoding strategies, making it particularly suitable for resource-constrained robotic teams. We evaluate UniScale on multiple benchmarks, demonstrating strong generalization and consistent performance across diverse environments. We will release our implementation upon acceptance.
中文标题/摘要
标题:UniScale:统一的尺度感知多视图3D重建框架,通过先验注入实现机器人感知中的多视图理解
我们提出了UniScale,一种统一的尺度感知多视图3D重建框架,适用于机器人应用,通过模块化、语义驱动的设计灵活整合几何先验。在基于视觉的机器人导航中,从原始图像序列准确提取环境结构对于下游任务至关重要。UniScale 通过单一前馈网络联合估计相机内参和外参、尺度不变的深度和点云图以及场景的度量尺度,同时在可用时可选地整合辅助几何先验。通过结合全局上下文推理与相机感知特征表示,UniScale 能够恢复场景的度量尺度。在相机内参已知的机器人设置中,可以轻松地将其整合以提高性能,当相机姿态也已知时,可以获得额外的性能提升。这种协同设计使UniScale能够在单一统一模型中实现鲁棒的、度量感知的3D重建。重要的是,UniScale 不需要从头开始训练,而是利用预存模型中展示的先验知识,无需几何编码策略,使其特别适合资源受限的机器人团队。我们在多个基准上评估了UniScale,展示了其强大的泛化能力和在不同环境中的稳定性能。在被接受后,我们将发布我们的实现。
Summary / 总结
UniScale is a unified framework for scale-aware multi-view 3D reconstruction designed for robotic applications. A single feed-forward network jointly estimates camera intrinsics and extrinsics, scale-invariant depth and point maps, and the metric scale of a scene from multi-view images, and it can optionally take auxiliary geometric priors as input. Known camera intrinsics improve performance, with further gains when camera poses are also available. The model does not require training from scratch and leverages world priors from pre-existing models without geometric encoding strategies. Evaluations on multiple benchmarks show strong generalization and consistent performance across diverse environments.
UniScale 是一个统一的尺度感知多视图 3D 重建框架,面向机器人感知,可灵活集成几何先验。它通过单一前馈网络从多视图图像中联合估计相机内参和外参、尺度不变的深度与点图以及场景的度量尺度,并可选地使用辅助几何先验。实验表明,该框架在多个基准上具有较强的泛化能力,在不同环境中表现稳定;由于无需从头训练,特别适合资源受限的机器人团队。
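To make the relationship between the estimated quantities concrete, here is a minimal Python sketch of how a scale-invariant depth map, a scalar metric scale, and pinhole intrinsics combine into metric 3D points. It assumes a simple pinhole camera and a single global scale factor; the function name and argument layout are hypothetical, and nothing here reflects UniScale's actual network outputs or code.

```python
from typing import List, Sequence, Tuple


def backproject_metric(
    depth: Sequence[Sequence[float]],
    scale: float,
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Tuple[float, float, float]]:
    """Turn a scale-invariant depth map into metric 3D points in the camera frame.

    Metric depth is taken to be `scale * depth` (an assumption of this sketch),
    and each pixel (u, v) is lifted with the standard pinhole model:
    X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    """
    points = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            z = scale * d
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points
```

The sketch also shows why known intrinsics help: errors in `fx`, `fy`, `cx`, or `cy` distort the lateral coordinates even when depth and scale are correct.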
EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents
Authors: Wenjia Wang, Liang Pan, Huaijin Pi, Yuke Lou, Xuqian Ren, Yifan Wu, Zhouyingcheng Liao, Lei Yang, Rishabh Dabral, Christian Theobalt, Taku Komura
First: 2026-02-26T16:53:41+00:00 · Latest: 2026-02-26T16:53:41+00:00
Abstract
Human behaviors in the real world naturally encode rich, long-term contextual information that can be leveraged to train embodied agents for perception, understanding, and acting. However, existing capture systems typically rely on costly studio setups and wearable devices, limiting the large-scale collection of scene-conditioned human motion data in the wild. To address this, we propose EmbodMocap, a portable and affordable data collection pipeline using two moving iPhones. Our key idea is to jointly calibrate dual RGB-D sequences to reconstruct both humans and scenes within a unified metric world coordinate frame. The proposed method allows metric-scale and scene-consistent capture in everyday environments without static cameras or markers, bridging human motion and scene geometry seamlessly. Compared with optical capture ground truth, we demonstrate that the dual-view setting exhibits a remarkable ability to mitigate depth ambiguity, achieving superior alignment and reconstruction performance over single iPhone or monocular models. Based on the collected data, we empower three embodied AI tasks: monocular human-scene reconstruction, where we fine-tune feedforward models that output metric-scale, world-space aligned humans and scenes; physics-based character animation, where we show that our data can be used to scale human-object interaction skills and scene-aware motion tracking; and robot motion control, where we train a humanoid robot via sim-to-real RL to replicate human motions depicted in videos. Experimental results validate the effectiveness of our pipeline and its contributions towards advancing embodied AI research.
中文标题/摘要
标题:EmbodMocap:野外4D人体场景重建方法用于具身智能体
现实世界中的人类行为自然地包含了丰富的长期上下文信息,这些信息可以被利用来训练具身智能体进行感知、理解和行动。然而,现有的捕捉系统通常依赖于昂贵的工作室布置和穿戴设备,限制了在野外大规模收集场景条件下的人体运动数据。为了解决这个问题,我们提出了EmbodMocap,一种使用两部移动iPhone的便携且经济的数据采集管道。我们的核心思想是联合校准双路RGB-D序列,以在统一的度量世界坐标系中重建人体和场景。所提出的方法允许在日常环境中进行度量尺度且场景一致的捕捉,无需静态相机或标记,无缝地结合了人体运动和场景几何。与光学捕捉的真值相比,我们证明了双视角设置具有显著的深度歧义缓解能力,实现了优于单个iPhone或单目模型的对齐和重建性能。基于收集的数据,我们支持了三项具身AI任务:单目人体场景重建,我们对输出度量尺度、世界空间对齐的人体和场景的前馈模型进行微调;基于物理的角色动画,我们表明我们的数据可以用于扩展人与物体交互技能和场景感知运动跟踪;以及机器人运动控制,我们通过模拟到现实的强化学习训练一个人形机器人,使其能够复制视频中的人体动作。实验结果验证了我们管道的有效性及其对推进具身AI研究的贡献。
Summary / 总结
EmbodMocap proposes a portable data collection pipeline using two iPhones to reconstruct 4D human-scene data in everyday environments. This method jointly calibrates dual RGB-D sequences to achieve metric-scale and scene-consistent capture without static cameras or markers. The dual-view setting effectively mitigates depth ambiguity, outperforming single iPhone or monocular models. The collected data is used for monocular human-scene reconstruction, physics-based character animation, and robot motion control, demonstrating the pipeline's effectiveness in advancing embodied AI research.
EmbodMocap 提出了一种使用两部 iPhone 的便携式数据采集管道,以在日常环境中重建 4D 人体-场景数据。该方法通过联合校准双路 RGB-D 序列,在没有静态相机或标记的情况下实现度量尺度且场景一致的捕获。双视角设置有效缓解了深度歧义问题,优于单部 iPhone 或单目模型。收集的数据被用于单目人体-场景重建、基于物理的角色动画以及机器人运动控制,展示了该管道在推进具身 AI 研究方面的有效性。
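The phrase "unified metric world coordinate frame" can be made concrete with a small Python sketch: once each phone has a rigid pose (rotation and translation) in the world frame, points observed by either camera map to the same world coordinates. The poses below are made up for illustration, and the sketch says nothing about how EmbodMocap actually estimates them.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def to_world(
    points: Sequence[Sequence[float]],
    rotation: Sequence[Sequence[float]],
    translation: Sequence[float],
) -> List[Point]:
    """Apply p_world = R * p_cam + t to each 3D point.

    `rotation` is a 3x3 camera-to-world rotation matrix given as nested
    sequences and `translation` is the camera position in the world frame.
    """
    return [
        tuple(
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        )
        for p in points
    ]


# Hypothetical setup: camera A sits at the world origin, camera B sits one
# metre to its right; both use an identity rotation for simplicity.
IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
seen_by_a = to_world([(0, 0, 2)], IDENTITY, (0, 0, 0))
seen_by_b = to_world([(-1, 0, 2)], IDENTITY, (1, 0, 0))
```

In this toy setup both cameras observe the same physical point, so `seen_by_a` and `seen_by_b` agree after the transform. A calibration error in either pose would show up as a disagreement between the two.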
Motion-aware Event Suppression for Event Cameras
Authors: Roberto Pellerito, Nico Messikommer, Giovanni Cioffi, Marco Cannici, Davide Scaramuzza
First: 2026-02-26T16:53:36+00:00 · Latest: 2026-02-26T16:53:36+00:00
Abstract
In this work, we introduce the first framework for Motion-aware Event Suppression, which learns to filter events triggered by independently moving objects (IMOs) and ego-motion in real time. Our model jointly segments IMOs in the current event stream while predicting their future motion, enabling anticipatory suppression of dynamic events before they occur. Our lightweight architecture achieves 173 Hz inference on consumer-grade GPUs with less than 1 GB of memory usage, outperforming previous state-of-the-art methods on the challenging EVIMO benchmark by 67% in segmentation accuracy while operating at a 53% higher inference rate. Moreover, we demonstrate significant benefits for downstream applications: our method accelerates Vision Transformer inference by 83% via token pruning and improves event-based visual odometry accuracy, reducing Absolute Trajectory Error (ATE) by 13%.
中文标题/摘要
标题:运动感知事件抑制技术用于事件相机
在本研究中,我们提出了首个运动感知事件抑制框架,该框架能够实时学习过滤由IMO和自身运动触发的事件。我们的模型在当前事件流中联合分割IMO的同时预测其未来运动,从而能够在事件发生前进行预见性抑制。我们的轻量级架构在消费级GPU上以每秒173次推理的速度运行,内存使用量低于1GB,与之前在具有挑战性的EVIMO基准测试中表现最佳的方法相比,在分割准确性上提高了67%,同时推理速率提高了53%。此外,我们展示了对下游应用的重大益处:我们的方法通过标记剪枝将视觉变换器推理加速83%,并提高了事件驱动的视觉里程计的准确性,将绝对轨迹误差(ATE)降低了13%。
Summary / 总结
The research introduces a Motion-aware Event Suppression framework that filters events caused by independently moving objects (IMOs) and ego-motion in real time. The model segments IMOs in the current event stream and predicts their future motion to suppress dynamic events before they occur. The lightweight architecture runs at 173 Hz on consumer-grade GPUs with less than 1 GB of memory, outperforming previous state-of-the-art methods on the EVIMO benchmark by 67% in segmentation accuracy while running at a 53% higher inference rate. The method also benefits downstream applications: it accelerates Vision Transformer inference by 83% through token pruning and improves event-based visual odometry, reducing Absolute Trajectory Error (ATE) by 13%.
本文介绍了Motion-aware Event Suppression框架,该框架实时过滤由独立运动物体(IMOs)和自身运动引起的事件。模型分割IMOs并预测其未来运动,以在事件发生前抑制动态事件。轻量级架构在消费级GPU上以173 Hz的推理速率运行,内存使用量不到1 GB,与之前的最佳方法相比,在EVIMO基准上的分割准确性提高了67%,推理速率提高了53%。此外,该方法通过令牌剪枝将Vision Transformer推理加速83%,并提高了基于事件的视觉里程计的准确性,将绝对轨迹误差(ATE)降低了13%。
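For context on the odometry result, Absolute Trajectory Error is commonly reported as the root-mean-square distance between estimated and ground-truth positions. The Python sketch below computes that quantity under a simplifying assumption: the two trajectories are already time-associated and expressed in the same frame, so the alignment step that standard evaluations apply first is omitted. The function name is hypothetical and this is not the paper's evaluation code.

```python
import math
from typing import Sequence


def absolute_trajectory_error(
    estimated: Sequence[Sequence[float]],
    ground_truth: Sequence[Sequence[float]],
) -> float:
    """Root-mean-square positional error between two pre-aligned trajectories.

    Both inputs are sequences of positions (for example (x, y, z) tuples)
    sampled at matching timestamps.
    """
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(est_pos, gt_pos))
        for est_pos, gt_pos in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Because the errors are squared before averaging, a few large deviations dominate the score, so suppressing events from moving objects that would otherwise corrupt pose estimates can plausibly lower ATE.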