MediX-R1: Open Ended Medical Reinforcement Learning
Authors: Sahal Shaji Mullappilly, Mohammed Irfan Kurpath, Omair Mohamed, Mohamed Zidan, Fahad Khan, Salman Khan, Rao Anwer, Hisham Cholakkal
First: 2026-02-26T18:59:46+00:00 · Latest: 2026-02-26T18:59:46+00:00
Abstract
We introduce MediX-R1, an open-ended Reinforcement Learning (RL) framework for medical multimodal large language models (MLLMs) that enables clinically grounded, free-form answers beyond multiple-choice formats. MediX-R1 fine-tunes a baseline vision-language backbone with Group Based RL and a composite reward tailored for medical reasoning: an LLM-based accuracy reward that judges semantic correctness with a strict YES/NO decision, a medical embedding-based semantic reward to capture paraphrases and terminology variants, and lightweight format and modality rewards that enforce interpretable reasoning and modality recognition. This multi-signal design provides stable, informative feedback for open-ended outputs where traditional verifiable or MCQ-only rewards fall short. To measure progress, we propose a unified evaluation framework for both text-only and image+text tasks that uses a Reference-based LLM-as-judge in place of brittle string-overlap metrics, capturing semantic correctness, reasoning, and contextual alignment. Despite using only $\sim51$K instruction examples, MediX-R1 achieves excellent results across standard medical LLM (text-only) and VLM (image + text) benchmarks, outperforming strong open-source baselines and delivering particularly large gains on open-ended clinical tasks. Our results demonstrate that open-ended RL with comprehensive reward signals and LLM-based evaluation is a practical path toward reliable medical reasoning in multimodal models. Our trained models, curated datasets and source code are available at https://medix.cvmbzuai.com
中文标题/摘要
标题:MediX-R1:开放式的医疗强化学习
我们介绍了MediX-R1,这是一种针对医疗多模态大型语言模型(MLLMs)的开放式强化学习(RL)框架,使其能够提供基于临床的、自由形式的答案,超越了多项选择格式。MediX-R1 使用基于组的RL对基础视觉-语言骨干进行微调,并结合了针对医疗推理定制的复合奖励:基于LLM的准确度奖励,用于判断语义正确性并做出严格的YES/NO决策;基于医学嵌入的语义奖励,用于捕捉同义词和术语变体;以及轻量级格式和模态奖励,以确保可解释的推理和模态识别。这种多信号设计为传统验证性或仅多项选择奖励无法提供稳定、信息丰富的反馈的开放式输出提供了支持。为了衡量进展,我们提出了一种统一的评估框架,用于文本和图像+文本任务,该框架使用参考LLM作为裁判,替代脆弱的字符串重叠度量,以捕捉语义正确性、推理和上下文对齐。尽管仅使用了约51,000个指令示例,MediX-R1 在标准医疗LLM(仅文本)和VLM(图像+文本)基准测试中取得了优异的成绩,超越了强大的开源基线,并在开放式临床任务上取得了特别大的进步。我们的结果表明,使用全面的奖励信号和基于LLM的评估的开放式RL是一种可靠医疗推理的实用路径。我们的训练模型、精选数据集和源代码可在https://medix.cvmbzuai.com 获取。
Summary / 总结
MediX-R1 is an open-ended RL framework for medical MLLMs. It fine-tunes a vision-language backbone with a composite reward that combines an LLM-based accuracy reward, a medical embedding-based semantic reward, and lightweight format and modality rewards. Progress is measured with a reference-based LLM-as-judge evaluation, and the model outperforms strong open-source baselines on medical LLM and VLM benchmarks, especially on open-ended clinical tasks. This demonstrates the practicality of open-ended RL with comprehensive reward signals for reliable medical reasoning in multimodal models.
MediX-R1 是一个面向 MLLMs 的开放性 RL 框架,能够生成临床相关的自由形式答案。它通过 Group Based RL 和一个综合奖励信号(包括 LLM 基准准确度、医学嵌入基準语义以及轻量级格式/模态奖励)来微调视觉-语言骨干网络。MediX-R1 在医学 LLM 和 VLM 基准测试中取得了优异的成绩,超越了强大的开源基线,特别是在开放性临床任务上表现尤为突出。提出了一种基于参考的 LLM 作为评判者的统一评估框架来衡量进展。训练模型和源代码可在 https://medix.cvmbzuai.com 获取。
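The composite reward described in the MediX-R1 abstract can be illustrated with a minimal sketch. The weights, function names, and the group-relative normalization below are assumptions made for illustration; the abstract names the four reward signals and "Group Based RL" but does not give weights or the exact update rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the abstract does not state how the signals are mixed.
    accuracy: float = 0.6
    semantic: float = 0.2
    fmt: float = 0.1
    modality: float = 0.1


def composite_reward(judge_says_yes, embedding_similarity, format_ok, modality_ok,
                     weights=RewardWeights()):
    """Weighted sum of the four signals named in the abstract."""
    # Clamp the embedding similarity so it behaves like a score in [0, 1].
    sim = min(1.0, max(0.0, embedding_similarity))
    return (weights.accuracy * float(judge_says_yes)   # strict YES/NO judge decision
            + weights.semantic * sim                   # paraphrase / terminology match
            + weights.fmt * float(format_ok)           # reasoning format respected
            + weights.modality * float(modality_ok))   # imaging modality recognized


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group (a GRPO-style step, assumed here)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    group = [
        composite_reward(True, 0.9, True, True),
        composite_reward(False, 0.4, True, False),
    ]
    print(group, group_relative_advantages(group))
```

Because the judge signal is binary, the embedding term is what separates two answers that both fail (or both pass) the judge, which is one plausible reading of why the abstract calls the combined feedback more informative.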
Joint Optimization for 4D Human-Scene Reconstruction in the Wild
Authors: Zhizheng Liu, Joe Lin, Wayne Wu, Bolei Zhou
First: 2025-01-04T01:53:51+00:00 · Latest: 2026-02-26T18:59:39+00:00
Comments: Project Page: https://vail-ucla.github.io/JOSH/
Abstract
Reconstructing human motion and its surrounding environment is crucial for understanding human-scene interaction and predicting human movements in the scene. While much progress has been made in capturing human-scene interaction in constrained environments, those prior methods can hardly reconstruct the natural and diverse human motion and scene context from web videos. In this work, we propose JOSH, a novel optimization-based method for 4D human-scene reconstruction in the wild from monocular videos. JOSH uses techniques in both dense scene reconstruction and human mesh recovery as initialization, and then it leverages the human-scene contact constraints to jointly optimize the scene, the camera poses, and the human motion. Experiment results show JOSH achieves better results on both global human motion estimation and dense scene reconstruction by joint optimization of scene geometry and human motion. We further design a more efficient model, JOSH3R, and directly train it with pseudo-labels from web videos. JOSH3R outperforms other optimization-free methods by only training with labels predicted from JOSH, further demonstrating its accuracy and generalization ability.
中文标题/摘要
标题:野生场景中4D人体-场景重建的联合优化
重建人体运动及其周围环境对于理解人体-场景交互和预测场景中的人体运动至关重要。尽管在受限环境中捕捉人体-场景交互方面取得了很大进展,但先前的方法很难从网络视频中重建自然多样的人体运动和场景上下文。在本文中,我们提出了一种名为JOSH的新颖优化方法,用于从单目视频中在野生场景中进行4D人体-场景重建。JOSH使用密集场景重建和人体网格恢复技术进行初始化,然后利用人体-场景接触约束联合优化场景、相机姿态和人体运动。实验结果表明,JOSH通过场景几何和人体运动的联合优化,在全局人体运动估计和密集场景重建方面取得了更好的结果。我们进一步设计了一个更高效的模型JOSH3R,并直接用来自网络视频的伪标签对其进行训练。JOSH3R仅通过使用JOSH预测的标签进行训练,就优于其他无优化方法,进一步证明了其准确性和泛化能力。
Summary / 总结
This work addresses the challenge of reconstructing human motion and its surrounding environment from monocular web videos, which prior methods designed for constrained environments handle poorly. JOSH, a novel optimization-based method, initializes with dense scene reconstruction and human mesh recovery, then jointly optimizes the scene, camera poses, and human motion using human-scene contact constraints. Experiments show JOSH improves both global human motion estimation and dense scene reconstruction through joint optimization. JOSH3R, a more efficient model trained directly on pseudo-labels that JOSH predicts from web videos, outperforms other optimization-free methods.
研究旨在从网络视频中重建人类运动及其周围环境,这对于理解人类与场景的交互和预测人类运动至关重要。JOSH是一种新颖的优化方法,通过密集场景重建和人类网格恢复进行初始化,然后联合优化场景、相机姿态和人类运动。实验表明,JOSH通过联合优化场景几何和人类运动,提高了全局人类运动估计和密集场景重建的效果。JOSH3R是一种更高效的变体,通过使用从JOSH预测的伪标签进行训练,进一步提高了准确性和泛化能力,优于其他非优化方法。
VGG-T$^3$: Offline Feed-Forward 3D Reconstruction at Scale
Authors: Sven Elflein, Ruilong Li, Sérgio Agostinho, Zan Gojcic, Laura Leal-Taixé, Qunjie Zhou, Aljosa Osep
Venue: CVPR 2026
First: 2026-02-26T18:59:33+00:00 · Latest: 2026-02-26T18:59:33+00:00
Comments: CVPR 2026, Project page: https://research.nvidia.com/labs/dvl/projects/vgg-ttt
Abstract
We present a scalable 3D reconstruction model that addresses a critical limitation in offline feed-forward methods: their computational and memory requirements grow quadratically w.r.t. the number of input images. Our approach is built on the key insight that this bottleneck stems from the varying-length Key-Value (KV) space representation of scene geometry, which we distill into a fixed-size Multi-Layer Perceptron (MLP) via test-time training. VGG-T$^3$ (Visual Geometry Grounded Test Time Training) scales linearly w.r.t. the number of input views, similar to online models, and reconstructs a $1k$ image collection in just $54$ seconds, achieving an $11.6\times$ speed-up over baselines that rely on softmax attention. Since our method retains global scene aggregation capability, its point map reconstruction error is lower than that of other linear-time methods by large margins. Finally, we demonstrate visual localization capabilities of our model by querying the scene representation with unseen images.
中文标题/摘要
标题:VGG-T$^3$:大规模离线前向3D重建
我们提出了一种可扩展的3D重建模型,解决了离线前向方法中的一个关键限制:其计算和内存需求随输入图像数量的平方增长。我们的方法基于这样一个关键洞察:这一瓶颈源于场景几何的可变长度键值(KV)空间表示,我们通过测试时训练将其提炼为固定大小的多层感知机(MLP)。VGG-T$^3$(视觉几何测试时训练)与输入视图数量成线性增长,类似于在线模型,并在54秒内重建了1000张图像的集合,比依赖于softmax注意力的基线方法快11.6倍。由于我们的方法保留了全局场景聚合能力,我们的点云重建误差显著优于其他线性时间方法。最后,我们通过使用未见过的图像查询场景表示,展示了我们模型的视觉定位能力。
Summary / 总结
The research addresses the computational and memory limitations of offline feed-forward 3D reconstruction methods by proposing VGG-T$^3$, which converts the varying-length Key-Value space representation into a fixed-size MLP through test-time training. This approach scales linearly with the number of input views, similar to online models, and reconstructs a 1k image collection in 54 seconds, achieving an 11.6x speed-up over baselines. The method outperforms other linear-time methods in point map reconstruction error and demonstrates visual localization capabilities with unseen images.
研究通过提出VGG-T$^3$方法解决了 offline feed-forward 3D重建方法在计算和内存上的限制,该方法在输入图像数量上线性扩展。方法通过测试时训练将可变长度的Key-Value空间表示转换为固定大小的多层感知机,使得能够高效地在54秒内重建1k图像集合,比基线方法快11.6倍。该模型在点云重建误差上优于其他线性时间方法,并且能够通过未见过的图像查询场景表示来展示视觉定位能力。
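The scaling claim in the VGG-T$^3$ abstract can be made concrete with a little arithmetic. The sketch below is a toy cost model, not the paper's implementation: `tokens_per_view` and `state_size` are hypothetical parameters, and the implied baseline time assumes the reported 11.6x speed-up refers to the same 1k-image collection.

```python
def implied_baseline_seconds(method_seconds, speedup):
    """Baseline runtime implied by a reported runtime and speed-up factor."""
    return method_seconds * speedup


def pairwise_attention_cost(n_views, tokens_per_view):
    """Toy cost of global softmax attention: every token attends to every token."""
    total_tokens = n_views * tokens_per_view
    return total_tokens * total_tokens


def fixed_state_cost(n_views, tokens_per_view, state_size):
    """Toy cost when each token interacts only with a fixed-size state (e.g. an MLP)."""
    return n_views * tokens_per_view * state_size


if __name__ == "__main__":
    # 54 s at an 11.6x speed-up implies roughly 626 s for the softmax-attention baseline.
    print(round(implied_baseline_seconds(54, 11.6), 1))
    # Doubling the number of views quadruples the quadratic cost but only doubles the linear one.
    print(pairwise_attention_cost(200, 100) / pairwise_attention_cost(100, 100))
    print(fixed_state_cost(200, 100, 512) / fixed_state_cost(100, 100, 512))
```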
SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation
Authors: Vaibhav Agrawal, Rishubh Parihar, Pradhaan Bhat, Ravi Kiran Sarvadevabhatla, R. Venkatesh Babu
Venue: CVPR 2026
First: 2026-02-26T18:59:05+00:00 · Latest: 2026-02-26T18:59:05+00:00
Comments: Project page: https://seethrough3d.github.io. Accepted at CVPR 2026
Abstract
We identify occlusion reasoning as a fundamental yet overlooked aspect of 3D layout-conditioned generation. It is essential for synthesizing partially occluded objects with depth-consistent geometry and scale. While existing methods can generate realistic scenes that follow input layouts, they often fail to model precise inter-object occlusions. We propose SeeThrough3D, a model for 3D layout-conditioned generation that explicitly models occlusions. We introduce an occlusion-aware 3D scene representation (OSCR), where objects are depicted as translucent 3D boxes placed within a virtual environment and rendered from a desired camera viewpoint. The transparency encodes hidden object regions, enabling the model to reason about occlusions, while the rendered viewpoint provides explicit camera control during generation. We condition a pretrained flow-based text-to-image generation model by introducing a set of visual tokens derived from our rendered 3D representation. Furthermore, we apply masked self-attention to accurately bind each object bounding box to its corresponding textual description, enabling accurate generation of multiple objects without object attribute mixing. To train the model, we construct a synthetic dataset with diverse multi-object scenes with strong inter-object occlusions. SeeThrough3D generalizes effectively to unseen object categories and enables precise 3D layout control with realistic occlusions and consistent camera control.
中文标题/摘要
标题:SeeThrough3D:基于遮挡感知的3D控制在文本到图像生成中的应用
我们识别出遮挡推理是3D布局条件生成中一个基本但被忽视的因素。它对于合成具有深度一致几何结构和比例的部分遮挡物体至关重要。尽管现有方法可以生成遵循输入布局的逼真场景,但它们往往无法准确建模物体间的遮挡。我们提出SeeThrough3D,一种基于3D布局生成的模型,明确建模遮挡。我们引入了一种遮挡感知的3D场景表示(OSCR),其中物体以透明的3D盒子形式置于虚拟环境中,并从期望的相机视角进行渲染。透明度编码隐藏的物体区域,使模型能够推理遮挡,而渲染的视角则在生成过程中提供明确的相机控制。我们通过引入从我们渲染的3D表示中提取的一组视觉标记,对预训练的基于流的文本到图像图像生成模型进行条件化。此外,我们应用掩码自注意力准确地将每个物体边界框与其相应的文本描述绑定,从而实现多个物体的准确生成,而不会出现物体属性混杂。为了训练模型,我们构建了一个包含多种具有强烈物体间遮挡的多物体场景的合成数据集。SeeThrough3D能够有效泛化到未见过的物体类别,并实现具有真实遮挡和一致相机控制的精确3D布局控制。
Summary / 总结
The research addresses the need for better occlusion handling in 3D layout-conditioned text-to-image generation. SeeThrough3D proposes an occlusion-aware 3D scene representation (OSCR) that uses translucent 3D boxes and visual tokens to model occlusions and control camera viewpoints. The model, trained on a synthetic dataset with diverse multi-object scenes, effectively generates realistic images with precise occlusions and consistent camera control, even for unseen object categories.
研究旨在解决文本到图像生成中的遮挡推理问题,这对于创建具有深度一致几何和比例的场景至关重要。SeeThrough3D提出了一种遮挡感知的3D场景表示(OSCR),并使用一个预训练的流式文本到图像生成模型,该模型基于从3D表示中提取的视觉标记进行条件化。该模型能够有效处理遮挡,并允许精确的3D布局控制,从而生成具有准确物体放置和摄像机视角的逼真场景。
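To illustrate the masked self-attention idea in the SeeThrough3D abstract, here is a minimal sketch of one possible attention mask that keeps each object's box tokens and text tokens bound together. The masking rule and the `token_owner` encoding are assumptions made for illustration; the abstract does not specify the exact mask.

```python
def build_binding_mask(token_owner):
    """Return mask[i][j] == True when token i may attend to token j.

    token_owner[i] is the index of the object that token i belongs to
    (its box tokens and its text-description tokens share one index),
    or None for shared tokens such as image latents.
    Assumed rule: tokens of different objects never attend to each other,
    while shared tokens attend to, and are attended by, everything.
    """
    n = len(token_owner)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            owner_i, owner_j = token_owner[i], token_owner[j]
            mask[i][j] = owner_i is None or owner_j is None or owner_i == owner_j
    return mask


if __name__ == "__main__":
    # Two objects with two tokens each (box + text), plus one shared image token.
    owners = [0, 0, 1, 1, None]
    for row in build_binding_mask(owners):
        print("".join("x" if allowed else "." for allowed in row))
```

Blocking cross-object attention between box and text tokens is one way to avoid the attribute mixing the abstract mentions, since the description of one object cannot leak into the box of another.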
Sensor Generalization for Adaptive Sensing in Event-based Object Detection via Joint Distribution Training
Authors: Aheli Saha, René Schuster, Didier Stricker
First: 2026-02-26T18:57:52+00:00 · Latest: 2026-02-26T18:57:52+00:00
Comments: 12 pages, International Conference on Pattern Recognition Applications and Methods
Abstract
Bio-inspired event cameras have recently attracted significant research interest due to their asynchronous and low-latency capabilities. These features provide a high dynamic range and significantly reduce motion blur. However, because their output signals are so novel, the available data covers limited variability and the parameters characterizing these signals have not been analyzed extensively. This paper addresses these issues by providing readers with an in-depth understanding of how intrinsic parameters affect the performance of a model trained on event data, specifically for object detection. We also use our findings to expand the capabilities of the downstream model towards sensor-agnostic robustness.
中文标题/摘要
标题:事件驱动对象检测中基于事件的传感器泛化联合分布训练
受生物启发的事件相机由于其异步和低延迟特性,最近吸引了大量研究。这些特性提供了高动态范围并显著减少了运动模糊。然而,由于其输出信号的性质新颖,可用数据的变异性存在差距,且缺乏对其信号参数的广泛分析。本文通过提供对内在参数如何影响基于事件数据训练的模型性能的深入理解,解决了这些问题,特别是针对对象检测的应用。我们还利用研究结果扩展了下游模型的传感器无关鲁棒性。
Summary / 总结
This paper analyzes how intrinsic sensor parameters affect the performance of object detection models trained on event-camera data. Building on that analysis, the authors apply joint distribution training to make the downstream detector more robust across sensors, with the goal of sensor-agnostic performance.
本文旨在通过理解内在传感器参数的影响来提升事件驱动的目标检测性能。作者采用联合分布训练来分析这些参数并提高模型的鲁棒性。主要发现表明,调整这些参数可以显著提高模型在不同传感器上的适应性,从而在各种场景中获得更好的目标检测准确性。
Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning
Authors: Amita Kamath, Jack Hessel, Khyathi Chandu, Jena D. Hwang, Kai-Wei Chang, Ranjay Krishna
First: 2026-02-26T18:54:06+00:00 · Latest: 2026-02-26T18:54:06+00:00
Comments: TACL 2026
Abstract
The lack of reasoning capabilities in Vision-Language Models (VLMs) has remained at the forefront of research discourse. We posit that this behavior stems from a reporting bias in their training data. That is, how people communicate about visual content by default omits tacit information needed to supervise some types of reasoning; e.g., "at the game today!" is a more likely caption than "a photo of 37 people standing behind a field". We investigate the data underlying the popular VLMs OpenCLIP, LLaVA-1.5 and Molmo through the lens of theories from pragmatics, and find that reporting bias results in insufficient representation of four reasoning skills (spatial, temporal, negation, and counting), despite the corpora being of web-scale, and/or synthetically generated. With a set of curated benchmarks, we demonstrate that: (i) VLMs perform poorly on the aforementioned types of reasoning suppressed in the training data by reporting bias; (ii) contrary to popular belief, scaling data size, model size, and to multiple languages does not result in emergence of these skills by default; but, promisingly, (iii) incorporating annotations specifically collected to obtain tacit information is effective. Our findings highlight the need for more intentional training data curation methods, rather than counting on scale for emergence of reasoning capabilities.
中文标题/摘要
标题:规模无法克服语用学:报告偏差对视觉语言推理的影响
视觉语言模型(VLMs)缺乏推理能力的问题一直是研究讨论的焦点。我们认为这种行为源于其训练数据中的报告偏差。也就是说,人们默认在描述视觉内容时会省略一些监督某些类型推理所需的隐含信息;例如,“今天在比赛!”比“一张37个人站在田野后面的图片”更可能作为描述。我们通过语用学理论的视角,研究了流行的VLMs OpenCLIP、LLaVA-1.5和Molmo的数据基础,发现报告偏差导致在四个推理技能(空间、时间、否定和计数)上缺乏充分的表示,尽管这些语料库是大规模的,或者合成生成的。通过一组精心策划的基准测试,我们证明:(i) VLMs在由报告偏差抑制的上述类型推理上表现不佳;(ii) 与普遍认为的相反,增加数据量、模型规模和多语言训练并不会默认产生这些技能;但令人欣慰的是,(iii) 特别收集的用于获取隐含信息的注解是有效的。我们的研究结果强调了需要更故意的数据策划方法,而不是依赖规模来产生推理能力。
Summary / 总结
The study investigates the impact of reporting bias on the reasoning capabilities of Vision-Language Models (VLMs) like OpenCLIP, LLaVA-1.5, and Molmo. By analyzing the training data through pragmatics theories, the research finds that reporting bias leads to insufficient representation of spatial, temporal, negation, and counting reasoning skills. Despite large-scale and synthetic data, VLMs perform poorly on these types of reasoning. The study also shows that increasing data or model size, or using multiple languages, does not inherently improve these skills. However, incorporating specific annotations to capture tacit information can enhance these capabilities. This highlights the need for more targeted data curation methods.
研究探讨了视觉-语言模型(VLMs)在推理方面的局限性,将其归因于训练数据中的报告偏见。研究发现,尽管使用了大规模和合成的数据集,VLMs在空间、时间、否定和计数推理技能方面仍缺乏足够的表示,因为人们通常描述视觉内容的方式存在偏见。作者证明,单纯扩大模型或数据集的规模并不能自动改善这些技能,但专门收集用于捕捉隐含信息的标注则能提升这些方面的表现。
FlashOptim: Optimizers for Memory Efficient Training
Authors: Jose Javier Gonzalez Ortiz, Abhay Gupta, Chris Renard, Davis Blalock
First: 2026-02-26T18:52:22+00:00 · Latest: 2026-02-26T18:52:22+00:00
Comments: Source code is available at https://github.com/databricks/flashoptim
Abstract
Standard mixed-precision training of neural networks requires many bytes of accelerator memory for each model parameter. These bytes reflect not just the parameter itself, but also its gradient and one or more optimizer state variables. With each of these values typically requiring 4 bytes, training even a 7 billion parameter model can be impractical for researchers with less than 100GB of accelerator memory.
We introduce FlashOptim, a suite of optimizations that reduces per-parameter memory by over 50% while preserving model quality and API compatibility. Our approach introduces two key techniques. First, we improve master weight splitting by finding and exploiting a tight bound on its quantization error. Second, we design companding functions that greatly reduce the error in 8-bit optimizer state quantization. Together with 16-bit gradients, these techniques reduce AdamW memory from 16 bytes to 7 bytes per parameter, or 5 bytes with gradient release. They also cut model checkpoint sizes by more than half.
Experiments with FlashOptim applied to SGD, AdamW, and Lion show no measurable quality degradation on any task from a collection of standard vision and language benchmarks, including Llama-3.1-8B finetuning.
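The per-parameter byte counts quoted in the FlashOptim abstract translate into accelerator memory as follows. This is a back-of-the-envelope sketch: it counts only parameter, gradient, and optimizer state, ignores activations and framework overhead, and assumes decimal gigabytes (the abstract does not say GB versus GiB).

```python
GB = 10**9  # decimal gigabyte; an assumption


def training_state_gb(num_params, bytes_per_param):
    """Memory for weights, gradients, and optimizer state, in GB."""
    return num_params * bytes_per_param / GB


NUM_PARAMS = 7 * 10**9   # the 7 billion parameter model from the abstract
BASELINE_BYTES = 4 * 4   # parameter + gradient + two AdamW moments, 4 bytes each

if __name__ == "__main__":
    for label, bytes_per_param in [
        ("baseline AdamW", BASELINE_BYTES),
        ("FlashOptim AdamW", 7),
        ("FlashOptim AdamW with gradient release", 5),
    ]:
        print(f"{label}: {training_state_gb(NUM_PARAMS, bytes_per_param):.0f} GB")
    # Going from 16 to 7 bytes is a 56.25% reduction, consistent with "over 50%".
    print(1 - 7 / BASELINE_BYTES)
```

Under these assumptions the baseline needs 112 GB for a 7B model, which matches the abstract's remark that such training is impractical with less than 100GB of accelerator memory.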
中文标题/摘要
标题:FlashOptim:内存高效训练的优化器
标准的混合精度训练需要为每个模型参数分配大量加速器内存。这些字节不仅代表参数本身,还包括其梯度和一个或多个优化器状态变量。每个值通常需要4个字节,因此即使是70亿参数的模型,对于拥有不到100GB加速器内存的研究人员来说也可能不切实际。
我们引入了FlashOptim,这是一种优化套件,能够在保持模型质量和API兼容性的同时,将每个参数的内存减少超过50%。我们的方法引入了两种关键技术。首先,我们通过找到并利用其量化误差的紧界来改进主权重分割。其次,我们设计了压缩函数,大大减少了8位优化器状态量化的误差。结合16位梯度,这些技术将AdamW的内存从每个参数16字节减少到7字节,或者在释放梯度时减少到5字节。它们还使模型检查点的大小减少了超过一半。
在SGD、AdamW和Lion上应用FlashOptim的实验表明,在包括Llama-3.1-8B微调在内的标准视觉和语言基准任务中,没有任何可测量的质量下降。
Summary / 总结
FlashOptim is a suite of optimizations that reduces memory usage for training neural networks by over 50% without compromising model quality. It achieves this by improving master weight splitting and designing companding functions to reduce quantization error in optimizer states. Experiments show that FlashOptim maintains model quality across various benchmarks, including Llama-3.1-8B finetuning, while significantly reducing memory usage and model checkpoint sizes.
FlashOptim 通过改进主权重拆分和优化器状态量化中的压缩函数,将神经网络训练的内存占用减少超过50%。这些方法允许使用16位梯度,并将AdamW 每个参数所需的内存从16字节减少到7字节,或者在梯度释放时减少到5字节。该方法保持了模型质量,并与现有API兼容,在包括Llama-3.1-8B微调在内的多种基准测试中未见质量下降。
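Why companding helps 8-bit quantization can be shown with a small experiment. The sketch below uses the classic mu-law curve as a stand-in; the FlashOptim abstract does not say which companding functions it designs, so this is not the paper's method. It also assumes values are already scaled into [-1, 1], which real optimizer-state quantizers usually arrange with a per-block scale.

```python
import math

MU = 255.0  # standard mu-law constant, used here purely for illustration


def compress(x):
    """Mu-law compression of x in [-1, 1]; expands resolution near zero."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def expand(y):
    """Inverse of compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def quantize_int8(y):
    """Map y in [-1, 1] to an integer code in [-127, 127]."""
    return round(y * 127)


def dequantize_int8(q):
    return q / 127


def roundtrip_linear(x):
    """Plain uniform 8-bit quantization."""
    return dequantize_int8(quantize_int8(x))


def roundtrip_companded(x):
    """Compress, quantize uniformly, then expand."""
    return expand(dequantize_int8(quantize_int8(compress(x))))


if __name__ == "__main__":
    x = 0.001  # a small-magnitude value, typical of the regime where uniform codes are coarse
    print("linear error:   ", abs(roundtrip_linear(x) - x))
    print("companded error:", abs(roundtrip_companded(x) - x))
```

Uniform codes spend most of their resolution on large magnitudes, so a small value rounds to zero; the compressed domain allocates more codes near zero, which is the general effect a companding function is meant to achieve.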
Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?
Authors: Tilemachos Aravanis, Vladan Stojnić, Bill Psomas, Nikos Komodakis, Giorgos Tolias
First: 2026-02-26T18:45:33+00:00 · Latest: 2026-02-26T18:45:33+00:00
Abstract
Open-vocabulary segmentation (OVS) extends the zero-shot recognition capabilities of vision-language models (VLMs) to pixel-level prediction, enabling segmentation of arbitrary categories specified by text prompts. Despite recent progress, OVS lags behind fully supervised approaches due to two challenges: the coarse image-level supervision used to train VLMs and the semantic ambiguity of natural language. We address these limitations by introducing a few-shot setting that augments textual prompts with a support set of pixel-annotated images. Building on this, we propose a retrieval-augmented test-time adapter that learns a lightweight, per-image classifier by fusing textual and visual support features. Unlike prior methods relying on late, hand-crafted fusion, our approach performs learned, per-query fusion, achieving stronger synergy between modalities. The method supports continually expanding support sets, and applies to fine-grained tasks such as personalized segmentation. Experiments show that we significantly narrow the gap between zero-shot and supervised segmentation while preserving open-vocabulary ability.
中文标题/摘要
标题:检索与分割:少量示例足以弥合开放词汇分割中的监督缺口吗?
开放词汇分割(OVS)将视觉语言模型(VLMs)的零样本识别能力扩展到像素级预测,使模型能够根据文本提示分割任意类别。尽管取得了进展,但由于训练VLMs所使用的粗略图像级监督和自然语言的语义模糊性,OVS仍落后于完全监督的方法。我们通过引入一种少量样本设置,将文本提示与像素标注图像的支持集相结合,来解决这些限制。在此基础上,我们提出了一种检索增强的测试时适配器,通过融合文本和视觉支持特征学习一种轻量级的、针对每张图像的分类器。与依赖于后期手工融合的先前方法不同,我们的方法进行学习的、针对每个查询的融合,实现了模态之间的更强协同作用。该方法支持不断扩展的支持集,并适用于细粒度任务,如个性化分割。实验表明,我们显著缩小了零样本和监督分割之间的差距,同时保留了开放词汇的能力。
Summary / 总结
This paper addresses the limitations of open-vocabulary segmentation (OVS) by proposing a few-shot setting that combines textual prompts with pixel-annotated images. The authors introduce a retrieval-augmented test-time adapter to learn a lightweight classifier that fuses textual and visual features, achieving better synergy between modalities than previous methods. Experiments demonstrate that this approach significantly reduces the gap between zero-shot and supervised segmentation while maintaining open-vocabulary capabilities.
该论文通过将文本提示与像素标注图像结合的少量样本设置来解决开放词汇分割(OVS)的限制。作者引入了一种检索增强的测试时适配器,通过融合文本和视觉支持特征来学习轻量级分类器,这种方法在模态间协同作用方面优于先前的方法。实验表明,这种方法显著缩小了零样本和监督分割之间的差距,同时保持了开放词汇的能力。
LLM Novice Uplift on Dual-Use, In Silico Biology Tasks
Authors: Chen Bo Calvin Zhang, Christina Q. Knight, Nicholas Kruus, Jason Hausenloy, Pedro Medeiros, Nathaniel Li, Aiden Kim, Yury Orlovskiy, Coleman Breen, Bryce Cai, Jasper Götting, Andrew Bo Liu, Samira Nedungadi, Paula Rodriguez, Yannis Yiming He, Mohamed Shaaban, Zifan Wang, Seth Donoughe, Julian Michael
First: 2026-02-26T18:37:23+00:00 · Latest: 2026-02-26T18:37:23+00:00
Comments: 59 pages, 33 figures
Abstract
Large language models (LLMs) perform increasingly well on biology benchmarks, but it remains unclear whether they uplift novice users -- i.e., enable humans to perform better than with internet-only resources. This uncertainty is central to understanding both scientific acceleration and dual-use risk. We conducted a multi-model, multi-benchmark human uplift study comparing novices with LLM access versus internet-only access across eight biosecurity-relevant task sets. Participants worked on complex problems with ample time (up to 13 hours for the most involved tasks). We found that LLM access provided substantial uplift: novices with LLMs were 4.16 times more accurate than controls (95% CI [2.63, 6.87]). On four benchmarks with available expert baselines (internet-only), novices with LLMs outperformed experts on three of them. Perhaps surprisingly, standalone LLMs often exceeded LLM-assisted novices, indicating that users were not eliciting the strongest available contributions from the LLMs. Most participants (89.6%) reported little difficulty obtaining dual-use-relevant information despite safeguards. Overall, LLMs substantially uplift novices on biological tasks previously reserved for trained practitioners, underscoring the need for sustained, interactive uplift evaluations alongside traditional benchmarks.
中文标题/摘要
标题:LLM初学者在双重用途和计算生物学任务中的提升
大型语言模型(LLM)在生物学基准测试中的表现越来越出色,但尚不清楚它们是否能提升初学者的表现,即是否能使人比仅使用互联网资源时表现更好。这种不确定性是理解科学加速和双重用途风险的关键。我们进行了一个多模型、多基准的人类提升研究,比较了有LLM访问权限的初学者与仅有互联网访问权限的初学者在八个与生物安全相关的任务集中的表现。参与者在复杂问题上工作,有充足的时间(最复杂任务最多13小时)。我们发现,LLM访问提供了显著的提升:有LLM的初学者比对照组准确度高4.16倍(95% CI [2.63, 6.87])。在四个有专家基线的基准测试中(仅有互联网资源),有LLM的初学者在三个基准测试中表现优于专家。令人惊讶的是,独立的LLM往往超过了LLM辅助的初学者,表明用户没有从LLM中获得最强的可用贡献。大多数参与者(89.6%)报告称,尽管有保护措施,获取与双重用途相关的信息并不困难。总体而言,LLM显著提升了初学者在以前仅由训练有素的从业者完成的生物学任务上的表现,强调了需要在传统基准测试的同时进行持续的互动提升评估。
Summary / 总结
This study investigates whether large language models (LLMs) can help novice users perform better on biology tasks compared to using only internet resources. Across eight biosecurity-relevant task sets, participants with LLM access were 4.16 times more accurate than those without, and even standalone LLMs often outperformed LLM-assisted novices. Notably, novices with LLMs outperformed experts on three out of four benchmarks with available expert baselines. The study highlights the potential of LLMs to uplift novices in complex biological tasks, emphasizing the need for further interactive evaluations to understand dual-use risks and scientific acceleration.
本研究评估了有大型语言模型(LLM)访问权限的初学者在生物任务上的表现,与仅使用互联网资源的参与者进行了比较。参与者被给予最多13小时的时间来解决八个生物安全相关的复杂问题。结果显示,LLM访问显著提高了初学者的表现,LLM辅助的初学者比没有LLM的初学者准确度高4.16倍。值得注意的是,LLM往往超过了LLM辅助的初学者,且大多数参与者在有安全措施的情况下仍能轻松获取双重用途相关信息。这表明LLM可以显著增强初学者在生物任务中的能力,强调了持续的互动评估的重要性。
DropVLA: An Action-Level Backdoor Attack on Vision-Language-Action Models
Authors: Zonghuan Xu, Xiang Zheng, Xingjun Ma, Yu-Gang Jiang
First: 2025-10-13T02:45:48+00:00 · Latest: 2026-02-26T18:32:27+00:00
Comments: 8 pages, 6 tables, 3 figures. Under review
Abstract
Vision-Language-Action (VLA) models map multimodal perception and language instructions to executable robot actions, making them particularly vulnerable to behavioral backdoor manipulation: a hidden trigger introduced during training can induce unintended physical actions while nominal task performance remains intact. Prior work on VLA backdoors primarily studies untargeted attacks or task-level hijacking, leaving fine-grained control over individual actions largely unexplored. In this work, we present DropVLA, an action-level backdoor attack that forces a reusable action primitive (e.g., open_gripper) to execute at attacker-chosen decision points under a realistic pipeline-black-box setting with limited data-poisoning access, using a window-consistent relabeling scheme for chunked fine-tuning. On OpenVLA-7B evaluated with LIBERO, vision-only poisoning achieves 98.67%-99.83% attack success rate (ASR) with only 0.31% poisoned episodes while preserving 98.50%-99.17% clean-task retention, and successfully triggers the targeted action within 25 control steps at 500 Hz (0.05 s). Text-only triggers are unstable at low poisoning budgets, and combining text with vision provides no consistent ASR improvement over vision-only attacks. The backdoor remains robust to moderate trigger variations and transfers across evaluation suites (96.27%, 99.09%), whereas text-only largely fails (0.72%). We further validate physical-world feasibility on a 7-DoF Franka arm with pi0-fast, demonstrating non-trivial attack efficacy under camera-relative motion that induces image-plane trigger drift. These results reveal that VLA models can be covertly steered at the granularity of safety-critical actions with minimal poisoning and without observable degradation of nominal performance.
中文标题/摘要
标题:DropVLA:视觉-语言-行动模型中的行动级后门攻击
视觉-语言-行动(VLA)模型将多模态感知和语言指令映射为可执行的机器人动作,使其特别容易受到行为后门操纵:在训练期间引入的隐藏触发器可以在不影响名义任务性能的情况下诱导意外的物理动作。先前对VLA后门的研究主要集中在无目标攻击或任务级劫持上,而对个体动作的精细控制尚未得到充分探索。在本研究中,我们提出了DropVLA,这是一种行动级后门攻击,能够在有限的数据污染访问和现实的管道黑盒设置下,通过窗口一致的重新标记方案进行分块微调,迫使可重用的动作原语(例如,open_gripper)在攻击者选择的决策点执行。在使用LIBERO评估的OpenVLA-7B中,仅通过视觉污染即可实现98.67%-99.83%的攻击成功率(ASR),污染的剧集比例仅为0.31%,同时保持98.50%-99.17%的任务清洁保留率,并在25个控制步骤内(500 Hz,0.05秒)成功触发目标动作。仅文本触发在低污染预算下不稳定,结合文本与视觉并不能在视觉污染攻击上提供一致的ASR改进。后门对触发器的适度变化具有鲁棒性,并且可以在评估套件之间转移(96.27%,99.09%),而仅文本则大多失败(0.72%)。我们还在7自由度的Franka手臂上通过pi0-fast验证了物理世界的可行性,展示了在相机相对运动下诱导图像平面触发漂移的非平凡攻击效果。这些结果表明,VLA模型可以在最小的污染和无明显名义性能退化的情况下,被隐蔽地引导至关键安全动作。
Summary / 总结
DropVLA is an action-level backdoor attack on VLA models that forces a specific action primitive to execute at chosen decision points. The attack uses a window-consistent relabeling scheme for fine-tuning with limited data-poisoning access, achieving high attack success rates (98.67%-99.83%) while preserving task performance (98.50%-99.17%). The attack is robust to moderate trigger variations and transfers across different evaluation suites. Physical-world experiments on a 7-DoF Franka arm confirm the attack's effectiveness under image-plane trigger drift. Text-only triggers are unstable, and combining text with vision does not improve attack success rates over vision-only attacks.
DropVLA 是一种针对 VLA 模型的动作级后门攻击,能够在选定的决策点强制执行特定的动作原语。攻击使用窗口一致的重新标记方案进行微调,具有有限的数据污染访问权限,实现了高攻击成功率(98.67%-99.83%)同时保持任务性能(98.50%-99.17%)。该攻击对适度的触发器变化具有鲁棒性,并且可以在不同的评估套件之间进行转移(96.27%,99.09%)。物理世界实验在 7 自由度 Franka 手臂上证实了在图像平面触发器漂移下攻击的有效性。纯文本触发器不稳定,结合文本与视觉信息的攻击成功率并不优于纯视觉攻击。
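Two of the figures in the DropVLA abstract are easy to sanity-check with arithmetic: the trigger latency and the size of the poisoning budget. The helper names and the dataset size of 1,000 episodes below are hypothetical and chosen only for illustration.

```python
import math


def steps_to_seconds(steps, control_rate_hz):
    """Wall-clock time covered by a number of control steps."""
    return steps / control_rate_hz


def max_poisoned_episodes(total_episodes, poison_fraction):
    """Largest whole number of episodes that stays within the poisoning budget."""
    return math.floor(total_episodes * poison_fraction)


if __name__ == "__main__":
    # 25 control steps at 500 Hz correspond to the 0.05 s quoted in the abstract.
    print(steps_to_seconds(25, 500))
    # A 0.31% budget on a hypothetical 1,000-episode dataset is only 3 episodes.
    print(max_poisoned_episodes(1000, 0.0031))
```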
LinGuinE: Longitudinal Guidance Estimation for Volumetric Tumour Segmentation
Authors: Nadine Garibli, Mayank Patwari, Bence Csiba, Yi Wei, Kostantinos Sidiropoulos
First: 2025-06-06T13:52:33+00:00 · Latest: 2026-02-26T18:27:23+00:00
Comments: 10 pages, 2 figures
Abstract
Longitudinal volumetric tumour segmentation is critical for radiotherapy planning and response assessment, yet this problem is underexplored and most methods produce single-timepoint semantic masks, lack lesion correspondence, and offer limited radiologist control. We introduce LinGuinE (Longitudinal Guidance Estimation), a PyTorch framework that combines image registration and guided segmentation to deliver lesion-level tracking and volumetric masks across all scans in a longitudinal study from a single radiologist prompt. LinGuinE is temporally direction agnostic, requires no training on longitudinal data, and allows any registration and semi-automatic segmentation algorithm to be repurposed for the task. We evaluate various combinations of registration and segmentation algorithms within the framework. LinGuinE achieves state-of-the-art segmentation and tracking performance across four datasets with a total of 456 longitudinal studies. Tumour segmentation performance shows minimal degradation with increasing temporal separation. We conduct ablation studies to determine the impact of autoregression, pathology specific finetuning, and the use of real radiologist prompts. We release our code and substantial public benchmarking for longitudinal segmentation, facilitating future research.
中文标题/摘要
标题:LinGuinE: 长期体积肿瘤分割的纵向引导估计
长期体积肿瘤分割对于放射治疗计划和反应评估至关重要,但这一问题尚未得到充分探索,大多数方法仅生成单时点语义掩码,缺乏病灶对应关系,并且对放射科医生的控制有限。我们引入了LinGuinE(纵向引导估计),这是一种结合图像配准和引导分割的PyTorch框架,能够从单个放射科医生提示中提供病灶级跟踪和所有纵向研究扫描中的体积掩码。LinGuinE在时间方向上是无方向性的,不需要在纵向数据上进行训练,并允许任何配准和半自动分割算法重新用于此任务。我们评估了框架内各种配准和分割算法的组合。LinGuinE在四个数据集的456个纵向研究中实现了最先进的分割和跟踪性能。肿瘤分割性能随时间分离度增加而最小化下降。我们进行了消融研究以确定自回归、病理特异性微调和使用真实放射科医生提示的影响。我们发布了我们的代码和大量公共基准测试,促进未来的研究。
Evaluating Zero-Shot and One-Shot Adaptation of Small Language Models in Leader-Follower Interaction
Authors: Rafael R. Baptista, André de Lima Salgado, Ricardo V. Godoy, Marcelo Becker, Thiago Boaventura, Gustavo J. G. Lahr
First: 2026-02-26T18:20:26+00:00 · Latest: 2026-02-26T18:20:26+00:00
Abstract
Leader-follower interaction is an important paradigm in human-robot interaction (HRI). Yet, assigning roles in real time remains challenging for resource-constrained mobile and assistive robots. While large language models (LLMs) have shown promise for natural communication, their size and latency limit on-device deployment. Small language models (SLMs) offer a potential alternative, but their effectiveness for role classification in HRI has not been systematically evaluated. In this paper, we present a benchmark of SLMs for leader-follower communication, introducing a novel dataset derived from a published database and augmented with synthetic samples to capture interaction-specific dynamics. We investigate two adaptation strategies: prompt engineering and fine-tuning, studied under zero-shot and one-shot interaction modes, compared with an untrained baseline. Experiments with Qwen2.5-0.5B reveal that zero-shot fine-tuning achieves robust classification performance (86.66% accuracy) while maintaining low latency (22.2 ms per sample), significantly outperforming baseline and prompt-engineered approaches. However, results also indicate a performance degradation in one-shot modes, where increased context length challenges the model's architectural capacity. These findings demonstrate that fine-tuned SLMs provide an effective solution for direct role assignment, while highlighting critical trade-offs between dialogue complexity and classification reliability on the edge.
中文标题/摘要
标题:小语言模型在领导者-跟随者互动中的零样本和单样本适应性评估
领导者-跟随者互动是人机交互(HRI)中的一个重要范式。然而,为资源受限的移动和辅助机器人实时分配角色仍然具有挑战性。虽然大型语言模型(LLMs)在自然通信方面显示出潜力,但其规模和延迟限制了其在设备上的部署。小语言模型(SLMs)提供了一种潜在的替代方案,但它们在HRI中的角色分类效果尚未系统评估。在本文中,我们提出了SLMs在领导者-跟随者通信中的基准测试,引入了一个源自已发布数据库的新数据集,并通过合成样本增强了数据集以捕捉互动特定的动力学。我们研究了两种适应策略:提示工程和微调,在零样本和单样本互动模式下进行研究,并与未训练基线进行比较。使用Qwen2.5-0.5B的实验表明,零样本微调实现了稳健的分类性能(准确率为86.66%),同时保持了低延迟(每样本22.2毫秒),显著优于基线和提示工程方法。然而,结果也表明,在单样本模式下性能有所下降,其中增加的上下文长度挑战了模型的架构能力。这些发现表明,微调后的SLMs为直接角色分配提供了一个有效的解决方案,同时突显了对话复杂性和分类可靠性之间的关键权衡问题在边缘设备上。
Summary / 总结
This paper evaluates small language models (SLMs) for leader-follower role classification in human-robot interaction (HRI), focusing on zero-shot and one-shot interaction modes. The study introduces a new dataset and compares prompt engineering and fine-tuning against an untrained baseline. Experiments with Qwen2.5-0.5B show that zero-shot fine-tuning achieves high accuracy (86.66%) and low latency (22.2 ms per sample), outperforming the baseline and prompt-engineered approaches. However, performance degrades in one-shot modes, where the longer context strains the model's capacity. The findings suggest that fine-tuned SLMs are effective for direct role assignment in HRI, while highlighting the trade-off between dialogue complexity and classification reliability on the edge.
本文评估了小语言模型(SLMs)在人类-机器人交互(HRI)中的领导-跟随者互动效果,重点关注零样本和单样本适应策略。研究引入了一个新数据集,并探讨了提示工程和微调方法。实验表明,零样本微调在Qwen2.5-0.5B上实现了高准确率(86.66%)和低延迟(每样本22.2毫秒),优于基线和提示工程方法。然而,在单样本模式下,由于上下文长度增加,模型的架构能力受到挑战,导致性能下降。研究结果表明,微调过的SLMs对于直接角色分配是有效的,但同时也突显了对话复杂性和分类可靠性之间的权衡问题,特别是在边缘设备上。
Evaluating the Diversity and Quality of LLM Generated Content
Authors: Alexander Shypula, Shuo Li, Botong Zhang, Vishakh Padmakumar, Kayo Yin, Osbert Bastani
First: 2025-04-16T23:02:23+00:00 · Latest: 2026-02-26T18:17:44+00:00
Comments: Published at COLM 2025
Abstract
Recent work suggests that preference-tuning techniques -- such as Reinforcement Learning from Human Feedback (RLHF) methods like PPO and GRPO, as well as alternatives like DPO -- reduce diversity, creating a dilemma given that these models are widely deployed in applications requiring varied outputs. We argue that diversity without consideration of quality has limited practical value. To address this issue, we introduce a framework for measuring effective semantic diversity -- diversity among outputs that meet quality thresholds -- which better reflects the practical utility of large language models (LLMs). Using open-ended tasks that require no human intervention, we find counterintuitive results: when using diversity metrics that do not explicitly consider quality, preference-tuned models -- particularly those trained via RL -- often produce outputs with lower diversity; however, these same preference-tuned models generate greater effective semantic diversity than supervised fine-tuned (SFT) or base models. Our analysis further shows another trend: while larger models may exhibit greater effective semantic diversity than smaller models, the smaller models are consistently more parameter-efficient at producing unique content within a fixed sampling budget. These findings have practical implications for applications that require diverse yet high-quality outputs, from creative assistance to synthetic data generation.
中文标题/摘要
标题:评估大语言模型生成内容的多样性和质量
近期研究表明,偏好调优技术(如PPO和GRPO等基于人类反馈的强化学习(RLHF)方法,以及DPO等替代方法)会降低多样性,而这些模型又被广泛部署在需要多样化输出的应用中,由此形成两难。我们认为,不考虑质量的多样性在实际应用中价值有限。为解决这一问题,我们提出了一种衡量有效语义多样性(即满足质量阈值的输出之间的多样性)的框架,它能更好地反映大语言模型(LLMs)的实际效用。通过无需人工干预的开放式任务,我们发现了一些反直觉的结果:当使用不显式考虑质量的多样性度量时,偏好调优模型(尤其是通过RL训练的模型)生成的输出往往多样性较低;然而,这些偏好调优模型的有效语义多样性却高于监督微调(SFT)模型或基础模型。我们的分析还显示了另一种趋势:虽然较大的模型可能比较小的模型表现出更高的有效语义多样性,但在固定采样预算内,较小的模型在生成独特内容方面始终更具参数效率。这些发现对需要多样化且高质量输出的应用具有实际意义,涵盖从创意辅助到合成数据生成等场景。
Summary / 总结
This study evaluates the diversity and quality of content generated by large language models (LLMs) and introduces a framework for measuring effective semantic diversity, which considers both diversity and quality. Using open-ended tasks, the research finds that preference-tuned models, especially those trained via reinforcement learning, produce lower diversity when using standard diversity metrics but generate greater effective semantic diversity compared to supervised fine-tuned or base models. Additionally, smaller models are more parameter-efficient in producing unique content within a fixed budget.
研究评估了大型语言模型(LLMs)生成内容的多样性和质量,并引入了一个同时考虑多样性和质量的有效语义多样性测量框架。使用开放任务,研究发现,偏好调优模型,尤其是通过强化学习训练的模型,在不考虑质量的情况下测量多样性较低,但生成的有效语义多样性却大于监督微调或基础模型。此外,较小的模型在固定采样预算内生成独特内容方面更具参数效率。
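To make the distinction between plain diversity and effective semantic diversity concrete, here is a minimal Python sketch. It is not the paper's metric: the normalisation by sampling budget, the quality check `is_good`, and the equivalence key `meaning` are all hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, Hashable, Sequence


def plain_diversity(outputs: Sequence[str],
                    semantic_key: Callable[[str], Hashable]) -> float:
    """Share of samples that are semantically distinct, ignoring quality."""
    if not outputs:
        return 0.0
    return len({semantic_key(o) for o in outputs}) / len(outputs)


def effective_semantic_diversity(outputs: Sequence[str],
                                 passes_quality: Callable[[str], bool],
                                 semantic_key: Callable[[str], Hashable]) -> float:
    """Share of the sampling budget that yields distinct outputs which also
    pass the quality check (assumed definition, for illustration only)."""
    if not outputs:
        return 0.0
    distinct_good = {semantic_key(o) for o in outputs if passes_quality(o)}
    return len(distinct_good) / len(outputs)


# Hypothetical quality check: an output is acceptable if it is tagged "ok:".
def is_good(text: str) -> bool:
    return text.startswith("ok:")


# Hypothetical semantic key: the normalised text after the tag.
def meaning(text: str) -> str:
    return text.split(":", 1)[1].strip().lower()


# Toy samples: the "base" model is varied but mostly low quality,
# the "tuned" model repeats itself once but every output is acceptable.
base_samples = ["bad: a", "bad: b", "bad: c", "ok: x"]
tuned_samples = ["ok: x", "ok: y", "ok: Y", "ok: z"]

print(plain_diversity(base_samples, meaning),
      effective_semantic_diversity(base_samples, is_good, meaning))
print(plain_diversity(tuned_samples, meaning),
      effective_semantic_diversity(tuned_samples, is_good, meaning))
```

In this toy setup the base samples score higher on plain diversity but lower on the quality-filtered measure, which mirrors the direction of the effect described in the abstract.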
Conformal Prediction with Corrupted Labels: Uncertain Imputation and Robust Re-weighting
Authors: Shai Feldman, Stephen Bates, Yaniv Romano
First: 2025-05-07T18:46:02+00:00 · Latest: 2026-02-26T18:16:20+00:00
Abstract
We introduce a framework for robust uncertainty quantification in situations where labeled training data are corrupted, through noisy or missing labels. We build on conformal prediction, a statistical tool for generating prediction sets that cover the test label with a pre-specified probability. The validity of conformal prediction, however, holds under the i.i.d assumption, which does not hold in our setting due to the corruptions in the data. To account for this distribution shift, the privileged conformal prediction (PCP) method proposed leveraging privileged information (PI) -- additional features available only during training -- to re-weight the data distribution, yielding valid prediction sets under the assumption that the weights are accurate. In this work, we analyze the robustness of PCP to inaccuracies in the weights. Our analysis indicates that PCP can still yield valid uncertainty estimates even when the weights are poorly estimated. Furthermore, we introduce uncertain imputation (UI), a new conformal method that does not rely on weight estimation. Instead, we impute corrupted labels in a way that preserves their uncertainty. Our approach is supported by theoretical guarantees and validated empirically on both synthetic and real benchmarks. Finally, we show that these techniques can be integrated into a triply robust framework, ensuring statistically valid predictions as long as at least one underlying method is valid.
中文标题/摘要
标题:带污染标签的共形预测:不确定插补与稳健重加权
我们提出了一种框架,用于在标记训练数据受到噪声或缺失标签污染的情况下,进行稳健的不确定性量化。我们基于共形预测,这是一种生成预测集的统计工具,该预测集以预设的概率覆盖测试标签。然而,共形预测的有效性依赖于独立同分布假设,而在我们的设置中,由于数据中的污染,该假设不成立。为了应对这种分布偏移,已有的特权共形预测(PCP)方法提出利用特权信息(PI,即仅在训练期间可用的额外特征)对数据分布重新加权,从而在权重准确的假设下生成有效的预测集。在本文中,我们分析了PCP对权重不准确的鲁棒性。我们的分析表明,即使权重估计不佳,PCP仍然可以生成有效的不确定性估计。此外,我们引入了一种新的共形方法,即不确定插补(UI),该方法不依赖于权重估计。相反,我们以保留标签不确定性的方式插补被污染的标签。我们的方法得到了理论保证,并在合成和真实基准上得到了实证验证。最后,我们展示了这些技术可以集成到三重稳健框架中,只要至少有一种基础方法有效,就可以确保统计上有效的预测。
Summary / 总结
This paper addresses robust uncertainty quantification when training labels are corrupted by noise or missingness. It builds on conformal prediction, a method for generating prediction sets with a specified coverage probability, and on the previously proposed privileged conformal prediction (PCP), which uses privileged information to re-weight the data under the distribution shift caused by corruptions. The study shows that PCP can remain valid even when the weights are poorly estimated, and proposes uncertain imputation (UI), a new conformal method that imputes corrupted labels while preserving their uncertainty and does not rely on weight estimation. Theoretical guarantees and empirical validation on synthetic and real benchmarks support the approach, and the techniques can be combined into a triply robust framework that stays valid as long as at least one underlying method is valid.
本文解决了机器学习模型在训练数据被污染时如何进行稳健的不确定性量化的问题。它基于共形预测,即一种生成具有指定覆盖概率的预测集的方法。已有的特权共形预测(PCP)方法利用仅在训练期间可用的额外特征对数据分布重新加权;本文分析表明,即使权重估计不准确,PCP仍可给出有效的预测集。此外,本文还引入了一种新的共形方法,即不确定插补(UI),它在保留不确定性的同时插补被污染的标签,而不依赖于权重估计。这些方法得到了理论保证,并在合成和真实基准上得到了实证验证。
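For readers unfamiliar with the baseline that this work builds on, the following is a minimal sketch of standard split conformal prediction for regression with absolute-residual scores. It assumes clean, exchangeable calibration data, which is exactly the assumption that corrupted labels break; it does not implement PCP or UI. The calibration residuals are made-up numbers.

```python
import math


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    If that rank exceeds n, the interval must be infinite to keep coverage.
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]


def prediction_interval(point_prediction, q):
    """Symmetric interval around the model's point prediction."""
    return (point_prediction - q, point_prediction + q)


# Made-up absolute residuals |y_i - f(x_i)| on a held-out calibration set.
calibration_scores = [3, 7, 1, 10, 4, 8, 2, 9, 5, 6]

# Target 80% coverage: with n = 10 the rank is ceil(11 * 0.8) = 9.
q = conformal_quantile(calibration_scores, alpha=0.2)
interval = prediction_interval(50, q)
print(q, interval)
```

Under exchangeability, the interval covers a new label with probability at least 1 - alpha. When the calibration labels are noisy or missing, the scores above are no longer drawn from the test distribution, which is the gap that the re-weighting and imputation approaches in the abstract aim to close.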
ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding
Authors: Yiran Guan, Sifan Tu, Dingkang Liang, Linghao Zhu, Jianzhong Ju, Zhenbo Luo, Jian Luan, Yuliang Liu, Xiang Bai
Venue: ICLR 2026
First: 2026-02-26T18:10:41+00:00 · Latest: 2026-02-26T18:10:41+00:00
Comments: Accepted by ICLR 2026
Abstract
Omni-modal reasoning is essential for intelligent systems to understand and draw inferences from diverse data sources. While existing omni-modal large language models (OLLM) excel at perceiving diverse modalities, they lack the complex reasoning abilities of recent large reasoning models (LRM). However, enhancing the reasoning ability of OLLMs through additional training presents significant challenges, including the need for high-quality data, task-specific adaptation, and substantial computational costs. To address these limitations, we propose ThinkOmni, a training-free and data-free framework that lifts textual reasoning to omni-modal scenarios. ThinkOmni introduces two key components: 1) LRM-as-a-Guide, which leverages off-the-shelf LRMs to guide the OLLM decoding process; 2) Stepwise Contrastive Scaling, which adaptively balances perception and reasoning signals without manual hyperparameter tuning. Experiments on six multi-modal reasoning benchmarks demonstrate that ThinkOmni consistently delivers performance improvements, with main results achieving 70.2 on MathVista and 75.5 on MMAU. Overall, ThinkOmni offers a flexible and generalizable solution for omni-modal reasoning and provides new insights into the generalization and application of reasoning capabilities.
中文标题/摘要
标题:ThinkOmni:通过指导解码提升文本推理至全模态场景
全模态推理对于智能系统理解并从多种数据源中推断信息至关重要。虽然现有的全模态大型语言模型(OLLM)在感知多种模态方面表现出色,但它们缺乏近期大型推理模型(LRM)的复杂推理能力。然而,通过额外训练来增强OLLM的推理能力面临着重大挑战,包括高质量数据的需求、任务特定的适应以及巨大的计算成本。为了解决这些限制,我们提出了ThinkOmni,这是一种无需训练和数据的框架,将文本推理提升至全模态场景。ThinkOmni引入了两个关键组件:1)LRM-as-a-Guide,利用现成的LRM来指导OLLM的解码过程;2)逐步对比缩放,无需手动超参数调整即可适应性平衡感知和推理信号。在六个多模态推理基准上的实验表明,ThinkOmni始终能够提供性能改进,主要结果在MathVista上达到70.2,在MMAU上达到75.5。总体而言,ThinkOmni提供了一种灵活且通用的全模态推理解决方案,并为推理能力的泛化和应用提供了新的见解。
Summary / 总结
ThinkOmni is a training-free and data-free framework that enhances the reasoning abilities of omni-modal large language models (OLLMs) by leveraging off-the-shelf large reasoning models (LRMs) for guidance during decoding and using Stepwise Contrastive Scaling to balance perception and reasoning signals. Experiments on six multi-modal reasoning benchmarks show that ThinkOmni improves performance, achieving 70.2 on MathVista and 75.5 on MMAU.
ThinkOmni 是一个无需训练和数据的框架,通过利用现成的大型推理模型(LRM)进行指导解码和自适应平衡感知与推理信号来增强跨模态大型语言模型(OLLM)的推理能力。实验结果显示,ThinkOmni 在六个跨模态推理基准上的表现得到提升,分别在 MathVista 达到 70.2,在 MMAU 达到 75.5。
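The general idea of steering one model's decoding with another model's next-token scores can be sketched as follows. This is a generic logit-mixing illustration, not ThinkOmni's actual rule: the linear blend, the fixed `weight`, and the toy three-token vocabulary are assumptions, and the abstract's Stepwise Contrastive Scaling (which sets the balance adaptively) is not reproduced here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def guided_logits(perception_logits, reasoning_logits, weight):
    """Blend two models' next-token logits (assumed linear rule).

    weight = 0 keeps the perception model's scores unchanged,
    weight = 1 follows the reasoning guide entirely.
    """
    return [(1 - weight) * p + weight * r
            for p, r in zip(perception_logits, reasoning_logits)]


def pick_token(logits):
    """Greedy choice: index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy next-token scores over a three-token vocabulary (made-up numbers).
omni_scores = [2.0, 0.0, 0.0]    # hypothetical omni-modal model
guide_scores = [0.0, 3.0, 0.0]   # hypothetical reasoning model

for w in (0.0, 0.5, 1.0):
    mixed = guided_logits(omni_scores, guide_scores, w)
    print(w, pick_token(mixed), [round(p, 3) for p in softmax(mixed)])
```

The loop shows how the greedy token flips from the omni-modal model's choice to the guide's choice as the weight grows; a step-wise scheme would pick that weight per decoding step instead of fixing it.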
DRESS: A Continuous Framework for Structural Graph Refinement
Authors: Eduar Castrillo Velilla
First: 2026-02-24T12:18:42+00:00 · Latest: 2026-02-26T18:10:20+00:00
Abstract
The Weisfeiler-Lehman (WL) hierarchy is a cornerstone framework for graph isomorphism testing and structural analysis. However, scaling beyond 1-WL to 3-WL and higher requires tensor-based operations that scale as $\mathcal{O}(n^3)$ or $\mathcal{O}(n^4)$, making them computationally prohibitive for large graphs. In this paper, we start from the Original-DRESS equation (Castrillo, León, and Gómez, 2018) -- a parameter-free, continuous dynamical system on edges -- and show that it distinguishes the prism graph from $K_{3,3}$, a pair that 1-WL provably cannot separate. We then generalize it to Motif-DRESS, which replaces triangle neighborhoods with arbitrary structural motifs and converges to a unique fixed point under three sufficient conditions, and further to Generalized-DRESS, an abstract template parameterized by the choice of neighborhood operator, aggregation function and norm. Finally, we introduce $Δ$-DRESS, which runs DRESS on each node-deleted subgraph $G \setminus \{v\}$, connecting the framework to the Kelly--Ulam reconstruction conjecture. Both Motif-DRESS and $Δ$-DRESS empirically distinguish Strongly Regular Graphs (SRGs) -- such as the Rook and Shrikhande graphs -- that confound 3-WL. Our results establish the DRESS family as a highly scalable framework that empirically surpasses both 1-WL and 3-WL on well-known benchmark graphs, without the prohibitive $\mathcal{O}(n^4)$ computational cost.
中文标题/摘要
标题:DRESS:一种连续的结构图细化框架
Weisfeiler-Lehman(WL)层次结构是图同构测试和结构分析的核心框架。然而,从1-WL扩展到3-WL及以上需要基于张量的操作,其复杂度为$\mathcal{O}(n^3)$或$\mathcal{O}(n^4)$,这使得它们对于大型图来说计算上不可行。在本文中,我们从原始DRESS方程(Castrillo, León, and Gómez, 2018;一个定义在边上的无参数连续动力系统)出发,证明它能够区分棱柱图和$K_{3,3}$,而已证明1-WL无法区分这对图。然后,我们将其推广为Motif-DRESS,它用任意结构模体替换三角形邻域,并在三个充分条件下收敛到唯一的不动点;再进一步推广为Generalized-DRESS,这是一个由邻域算子、聚合函数和范数的选择参数化的抽象模板。最后,我们引入了$Δ$-DRESS,它在每个删除节点后的子图$G \setminus \{v\}$上运行DRESS,将该框架与Kelly--Ulam重构猜想联系起来。Motif-DRESS和$Δ$-DRESS在实验上都能够区分令3-WL失效的强正则图(SRGs),如Rook图和Shrikhande图。我们的结果确立了DRESS家族是一种高度可扩展的框架,在知名基准图上实验性地超越了1-WL和3-WL,且无需$\mathcal{O}(n^4)$的高昂计算成本。
Summary / 总结
The paper introduces DRESS, a continuous framework for structural graph refinement that avoids the computational cost of higher-order Weisfeiler-Lehman (WL) methods. Starting from the Original-DRESS equation, which separates the prism graph from $K_{3,3}$ (a pair that 1-WL cannot distinguish), the authors generalize it to Motif-DRESS and Generalized-DRESS, and further introduce Δ-DRESS. Motif-DRESS and Δ-DRESS empirically distinguish Strongly Regular Graphs, such as the Rook and Shrikhande graphs, that confound 3-WL. Empirically, the DRESS family surpasses both 1-WL and 3-WL on well-known benchmark graphs without the $\mathcal{O}(n^4)$ cost of tensor-based operations.
研究解决了Weisfeiler-Lehman (WL)层次结构中的计算难题,特别是更高阶WL测试的$O(n^3)$或$O(n^4)$计算复杂度。引入了DRESS,一种无参数的连续动力系统及其变体Motif-DRESS和Generalized-DRESS,能够区分1-WL无法区分的某些图对。研究还提出了$Δ$-DRESS,将其与Kelly--Ulam重建猜想联系起来。实验证明,DRESS家族在基准图上优于1-WL和3-WL,且无需高计算成本。
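The claim that 1-WL cannot separate the prism graph from $K_{3,3}$ can be checked directly. The sketch below runs standard 1-WL colour refinement on both graphs with a shared palette and then counts triangles, a structural signal that 1-WL misses here. It does not implement any DRESS variant; the helper names are chosen for illustration.

```python
from collections import Counter


def adjacency(n, edges):
    """Build an undirected adjacency map from an edge list."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def wl_histograms(graphs, rounds):
    """Run 1-WL colour refinement jointly so colours are comparable across graphs."""
    colourings = [{v: 0 for v in g} for g in graphs]
    for _ in range(rounds):
        # A vertex signature is its own colour plus the sorted colours of its neighbours.
        signatures = [
            {v: (c[v], tuple(sorted(c[u] for u in g[v]))) for v in g}
            for g, c in zip(graphs, colourings)
        ]
        palette = {s: i for i, s in enumerate(
            sorted({s for sig in signatures for s in sig.values()}))}
        colourings = [{v: palette[sig[v]] for v in sig} for sig in signatures]
    return [Counter(c.values()) for c in colourings]


def count_triangles(adj):
    """Count unordered vertex triples that are mutually adjacent."""
    return sum(1 for a in adj for b in adj[a] for c in adj[b]
               if a < b < c and c in adj[a])


# Prism graph: two triangles joined by a perfect matching.
prism = adjacency(6, [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5),
                      (0, 3), (1, 4), (2, 5)])
# Complete bipartite graph K_{3,3} with parts {0, 1, 2} and {3, 4, 5}.
k33 = adjacency(6, [(a, b) for a in range(3) for b in range(3, 6)])

hist_prism, hist_k33 = wl_histograms([prism, k33], rounds=6)
print(hist_prism == hist_k33)                       # colour histograms agree
print(count_triangles(prism), count_triangles(k33))  # triangle counts differ
```

Both graphs are 3-regular on six vertices, so every vertex keeps the same colour in every round and the histograms coincide, while the triangle counts (two versus zero) tell the graphs apart.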
Phase Transitions for Feature Learning in Neural Networks
Authors: Andrea Montanari, Zihao Wang
First: 2026-02-01T20:47:36+00:00 · Latest: 2026-02-26T18:06:09+00:00
Comments: 75 pages; 17 pdf figures; v2 is a minor revision of v1
Abstract
According to a popular viewpoint, neural networks learn from data by first identifying low-dimensional representations, and subsequently fitting the best model in this space. Recent works provide a formalization of this phenomenon when learning multi-index models. In this setting, we are given $n$ i.i.d. pairs $({\boldsymbol x}_i,y_i)$, where the covariate vectors ${\boldsymbol x}_i\in\mathbb{R}^d$ are isotropic, and responses $y_i$ only depend on ${\boldsymbol x}_i$ through a $k$-dimensional projection ${\boldsymbol Θ}_*^{\sf T}{\boldsymbol x}_i$. Feature learning amounts to learning the latent space spanned by ${\boldsymbol Θ}_*$.
In this context, we study the gradient descent dynamics of two-layer neural networks under the proportional asymptotics $n,d\to\infty$, $n/d\toδ$, while the dimension of the latent space $k$ and the number of hidden neurons $m$ are kept fixed. Earlier work establishes that feature learning via polynomial-time algorithms is possible if $δ> δ_{\text{alg}}$, for $δ_{\text{alg}}$ a threshold depending on the data distribution, and is impossible (within a certain class of algorithms) below $δ_{\text{alg}}$. Here we derive an analogous threshold $δ_{\text{NN}}$ for two-layer networks. Our characterization of $δ_{\text{NN}}$ opens the way to study the dependence of learning dynamics on the network architecture and training algorithm.
The threshold $δ_{\text{NN}}$ is determined by the following scenario. Training first visits points for which the gradient of the empirical risk is large and learns the directions spanned by these gradients. Then the gradient becomes smaller and the dynamics becomes dominated by negative directions of the Hessian. The threshold $δ_{\text{NN}}$ corresponds to a phase transition in the spectrum of the Hessian in this second phase.
Summary / 总结
The paper investigates the phase transitions in feature learning using two-layer neural networks under proportional asymptotics. It studies the gradient descent dynamics of these networks and derives a threshold $δ_{\text{NN}}$ for successful feature learning, analogous to the threshold $δ_{\text{alg}}$ for polynomial-time algorithms. The threshold is characterized by a phase transition in the spectrum of the Hessian matrix during the learning process.
本文研究了两层神经网络中特征学习的相变现象。研究动机在于神经网络通过识别低维表示来学习数据。关键发现表明,在某个阈值 $δ_{\text{NN}}$ 之下,两层网络无法实现特征学习,这类似于多项式时间算法中的阈值 $δ_{\text{alg}}$。该阈值与学习过程中海森矩阵谱的变化相联系。
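The setting described in the abstract can be written compactly as below. The first three displays restate the abstract; the last one is an assumed generic form of a two-layer network with $m$ hidden neurons, where the symbols $a_j$, $\boldsymbol{w}_j$ and $\sigma$ are illustrative notation that does not appear in the abstract.

```latex
% Isotropic covariates and the multi-index structure: the response depends
% on x_i only through a k-dimensional projection.
\[
  \mathbb{E}\bigl[\boldsymbol{x}_i \boldsymbol{x}_i^{\sf T}\bigr] = \boldsymbol{I}_d,
  \qquad
  \mathbb{P}\bigl(y_i \in \cdot \mid \boldsymbol{x}_i\bigr)
  = \mathbb{P}\bigl(y_i \in \cdot \mid \boldsymbol{\Theta}_*^{\sf T}\boldsymbol{x}_i\bigr),
  \qquad
  \boldsymbol{\Theta}_* \in \mathbb{R}^{d \times k}.
\]
% Proportional asymptotics, with k and m held fixed.
\[
  n, d \to \infty, \qquad n/d \to \delta .
\]
% Earlier work: polynomial-time feature learning is possible above the threshold.
\[
  \delta > \delta_{\text{alg}} .
\]
% Assumed generic two-layer network (notation not taken from the abstract).
\[
  f(\boldsymbol{x}) = \sum_{j=1}^{m} a_j \, \sigma\bigl(\boldsymbol{w}_j^{\sf T}\boldsymbol{x}\bigr).
\]
```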
PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions
Authors: Amith Ananthram, Elias Stengel-Eskin, Lorena A. Bradford, Julia Demarest, Adam Purvis, Keith Krut, Robert Stein, Rina Elster Pantalony, Mohit Bansal, Kathleen McKeown
Venue: ICLR 2026
First: 2025-10-21T20:30:20+00:00 · Latest: 2026-02-26T18:05:42+00:00
Comments: Accepted at ICLR 2026. 26 pages, 9 figures. Metric/benchmark available at https://github.com/amith-ananthram/posh
Abstract
While vision-language models (VLMs) have advanced into detailed image description, evaluation remains a challenge. Standard metrics (e.g. CIDEr, SPICE) were designed for short texts and tuned to recognize errors that are now uncommon, such as object misidentification. In contrast, long texts require sensitivity to attribute and relation attachments and scores that localize errors to particular text spans. In this work, we introduce PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors (e.g. mistakes in compositional understanding). PoSh is replicable, interpretable and a better proxy for human raters than existing metrics (including GPT4o-as-a-Judge). To validate PoSh, we introduce a challenging new dataset, DOCENT. This novel benchmark contains artwork, paired with expert-written references, and model-generated descriptions, augmented with granular and coarse judgments of their quality from art history students. Thus, DOCENT enables evaluating both detailed image description metrics and detailed image description itself in a challenging new domain. We show that PoSh achieves stronger correlations (+0.05 Spearman $ρ$) with the human judgments in DOCENT than the best open-weight alternatives, is robust to image type (using CapArena, an existing dataset of web imagery) and is a capable reward function, outperforming standard supervised fine-tuning. Then, using PoSh, we characterize the performance of open and closed models in describing the paintings, sketches and statues in DOCENT and find that foundation models struggle to achieve full, error-free coverage of images with rich scene dynamics, establishing a demanding new task to gauge VLM progress. Through both PoSh and DOCENT, we hope to enable advances in important areas such as assistive text generation.
中文标题/摘要
标题:PoSh:使用场景图引导LLM作为裁判进行详细图像描述
尽管视觉-语言模型(VLMs)在详细图像描述方面取得了进展,但评估仍是一个挑战。标准指标(如CIDEr、SPICE)是为短文本设计的,并且调整为识别现在已不常见的错误,例如物体识别错误。相比之下,长文本需要对属性和关系的敏感度以及能够定位特定文本片段错误的评分。在本文中,我们引入了PoSh,这是一种用于详细图像描述的指标,它使用场景图作为结构化的评分标准来引导LLM作为裁判,产生基于细粒度错误(如组合理解错误)的综合评分。PoSh是可复制的、可解释的,并且比现有指标(包括GPT4o作为裁判)更接近人类评分者。为了验证PoSh,我们引入了一个新的具有挑战性的数据集DOCENT。这个新的基准数据集包含艺术品,并配以专家撰写的参考文本和模型生成的描述,还增加了艺术史学生对它们质量的精细和粗略判断。因此,DOCENT使我们能够在一个新的具有挑战性的领域中评估详细图像描述指标和详细图像描述本身。我们展示了PoSh与DOCENT中的人类判断相比,具有更强的相关性(Spearman ρ +0.05),并且对图像类型具有鲁棒性(使用CapArena,一个现有的网络图像数据集),并且是一个有效的奖励函数,优于标准的监督微调。然后,使用PoSh,我们表征了开放和封闭模型在描述DOCENT中的绘画、素描和雕像的表现,并发现基础模型难以实现对具有丰富场景动态的图像的全面、无误的覆盖,从而确立了一个新的具有挑战性的任务来衡量VLM的进步。通过PoSh和DOCENT,我们希望促进在诸如辅助文本生成等重要领域的发展。
Summary / 总结
PoSh is a metric for evaluating detailed image descriptions that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing scores grounded in fine-grained errors. It was validated on a new dataset, DOCENT, which pairs artwork with expert-written references, model-generated descriptions, and quality judgments from art history students. PoSh correlates more strongly with human judgments than the best open-weight alternatives (+0.05 Spearman $ρ$) and, when used as a reward function, outperforms standard supervised fine-tuning. Applying PoSh to DOCENT shows that foundation models struggle to achieve full, error-free coverage of images with rich scene dynamics.
PoSh 是一个用于评估详细图像描述的新指标,通过场景图引导LLM作为裁判,并基于细粒度错误生成分数。它在包含艺术品和专家参考的新 DOCENT 数据集上优于现有指标,并与人类判断有很强的相关性。PoSh 还展示了在不同图像类型上的鲁棒性,并改善了对复杂场景中基础模型性能的评估。
Towards Long-Form Spatio-Temporal Video Grounding
Authors: Xin Gu, Bing Fan, Jiali Yao, Zhipeng Zhang, Yan Huang, Cheng Han, Heng Fan, Libo Zhang
First: 2026-02-26T18:04:09+00:00 · Latest: 2026-02-26T18:04:09+00:00
Abstract
In real scenarios, videos can span several minutes or even hours. However, existing research on spatio-temporal video grounding (STVG), given a textual query, mainly focuses on localizing targets in short videos of tens of seconds, typically less than one minute, which limits real-world applications. In this paper, we explore Long-Form STVG (LF-STVG), which aims to locate targets in long-term videos. Compared with short videos, long-term videos contain much longer temporal spans and more irrelevant information, making it difficult for existing STVG methods that process all frames at once. To address this challenge, we propose an AutoRegressive Transformer architecture for LF-STVG, termed ART-STVG. Unlike conventional STVG methods that require the entire video sequence to make predictions at once, ART-STVG treats the video as streaming input and processes frames sequentially, enabling efficient handling of long videos. To model spatio-temporal context, we design spatial and temporal memory banks and apply them to the decoders. Since memories from different moments are not always relevant to the current frame, we introduce simple yet effective memory selection strategies to provide more relevant information to the decoders, significantly improving performance. Furthermore, instead of parallel spatial and temporal localization, we propose a cascaded spatio-temporal design that connects the spatial decoder to the temporal decoder, allowing fine-grained spatial cues to assist complex temporal localization in long videos. Experiments on newly extended LF-STVG datasets show that ART-STVG significantly outperforms state-of-the-art methods, while achieving competitive performance on conventional short-form STVG.
中文标题/摘要
标题:迈向长时序时空视频定位
在实际场景中,视频可以持续几分钟甚至几小时。然而,现有的时空视频定位(STVG)研究,给定一个文本查询,主要集中在定位短视频(通常少于一分钟)中的目标,这限制了其在实际中的应用。本文探讨了长时序时空视频定位(LF-STVG),旨在定位长视频中的目标。与短视频相比,长视频包含更长的时间跨度和更多的无关信息,使得现有的处理所有帧的STVG方法难以应对。为了解决这一挑战,我们提出了一种自回归变换器架构,称为ART-STVG。与传统的STVG方法需要一次性处理整个视频序列以进行预测不同,ART-STVG将视频视为流式输入,并按顺序处理帧,从而能够高效处理长视频。为了建模时空上下文,我们设计了空间和时间记忆库,并将它们应用于解码器。由于不同时刻的记忆并不总是与当前帧相关,我们引入了简单而有效的记忆选择策略,为解码器提供更相关的信息,显著提高了性能。此外,我们提出了一种级联的时空设计,将空间解码器连接到时间解码器,而不是并行的空间和时间定位,允许细粒度的空间线索在长视频中辅助复杂的时序定位。在新扩展的LF-STVG数据集上的实验表明,ART-STVG显著优于现有方法,同时在传统的短时序STVG上实现了竞争力的性能。
Summary / 总结
This paper addresses the challenge of spatio-temporal video grounding (STVG) in long-form videos, which are typically not covered by existing methods. The authors propose an AutoRegressive Transformer (ART-STVG) that processes videos frame by frame, making it suitable for long videos. ART-STVG includes spatial and temporal memory banks to model context and memory selection strategies to provide relevant information. Additionally, a cascaded spatio-temporal design connects spatial and temporal decoders. Experiments show that ART-STVG outperforms state-of-the-art methods on long-form datasets while maintaining competitive performance on short-form videos.
本文针对现有的时空视频定位(STVG)方法主要关注短视频,而忽视了长视频的问题。作者提出了一种名为ART-STVG的自回归变压器架构,该架构逐帧处理视频,适用于长视频。ART-STVG包括空间和时间记忆库来建模上下文,并使用记忆选择策略为解码器提供相关信息。此外,它采用级联时空设计以提高长视频中的时间定位。实验表明,ART-STVG在长视频数据集上的表现优于现有方法,同时在短视频上的表现也具有竞争力。
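The streaming-plus-memory idea can be illustrated with a small sketch: frames arrive one at a time, past frame features are kept in a bounded memory bank, and only the most relevant memories are handed to the current step. The cosine-similarity top-k rule, the `capacity` limit, and the two-dimensional toy features are assumptions for illustration; the abstract does not specify ART-STVG's selection strategies.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_memories(memory_bank, query, k):
    """Keep the k stored features most similar to the current frame (assumed rule)."""
    ranked = sorted(memory_bank, key=lambda m: cosine(m, query), reverse=True)
    return ranked[:k]


def stream_with_memory(frames, k=1, capacity=8):
    """Process frames sequentially, yielding each frame with its selected context."""
    memory = []
    for feature in frames:
        yield feature, select_memories(memory, feature, k)
        # Append the current frame and drop the oldest entries beyond capacity.
        memory = (memory + [feature])[-capacity:]


# Toy frame features (made-up numbers): the third frame resembles the first.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
contexts = [context for _, context in stream_with_memory(frames, k=1)]
print(contexts)
```

Because memory is bounded and frames are consumed one by one, the cost per step does not grow with video length, which is the practical appeal of a streaming design for long videos.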
PGVMS: A Prompt-Guided Unified Framework for Virtual Multiplex IHC Staining with Pathological Semantic Learning
Authors: Fuqiang Chen, Ranran Zhang, Wanming Hu, Deboch Eyob Abera, Yue Peng, Boyun Zheng, Yiwen Sun, Jing Cai, Wenjian Qin
Venue: IEEE Transactions on Medical Imaging, 2026
First: 2026-02-26T18:03:24+00:00 · Latest: 2026-02-26T18:03:24+00:00
Comments: Accepted by TMI
Abstract
Immunohistochemical (IHC) staining enables precise molecular profiling of protein expression, with over 200 clinically available antibody-based tests in modern pathology. However, comprehensive IHC analysis is frequently limited by insufficient tissue quantities in small biopsies. Therefore, virtual multiplex staining emerges as an innovative solution to digitally transform H&E images into multiple IHC representations, yet current methods still face three critical challenges: (1) inadequate semantic guidance for multi-staining, (2) inconsistent distribution of immunochemistry staining, and (3) spatial misalignment across different stain modalities. To overcome these limitations, we present a prompt-guided framework for virtual multiplex IHC staining using only uniplex training data (PGVMS). Our framework introduces three key innovations corresponding to each challenge: First, an adaptive prompt guidance mechanism employing a pathological visual language model dynamically adjusts staining prompts to resolve semantic guidance limitations (Challenge 1). Second, our protein-aware learning strategy (PALS) maintains precise protein expression patterns by direct quantification and constraint of protein distributions (Challenge 2). Third, the prototype-consistent learning strategy (PCLS) establishes cross-image semantic interaction to correct spatial misalignments (Challenge 3).
中文标题/摘要
标题:PGVMS:一种基于提示的统一框架,用于病理语义学习的虚拟多路复用IHC染色
免疫组化(IHC)染色能够精确地对蛋白质表达进行分子分析,在现代病理学中已有超过200种基于抗体的临床测试。然而,全面的IHC分析经常受限于小活检组织量不足。因此,虚拟多路复用染色作为一种创新解决方案,能够将HE图像数字化转换为多种IHC表示,但当前方法仍面临三个关键挑战:(1)多染色的不足语义指导,(2)免疫化学染色分布不一致,(3)不同染色模式之间的空间错位。为克服这些限制,我们提出了一种仅使用单路训练数据的基于提示的虚拟多路复用IHC染色框架(PGVMS)。我们的框架引入了三个关键创新,分别对应每个挑战:首先,一种自适应提示引导机制,利用病理视觉语言模型动态调整染色提示,以解决语义指导不足的问题(挑战1)。其次,我们的蛋白质感知学习策略(PALS)通过直接量化和约束蛋白质分布来保持精确的蛋白质表达模式(挑战2)。第三,原型一致学习策略(PCLS)建立跨图像语义交互,以纠正空间错位(挑战3)。
Summary / 总结
The research addresses three challenges in virtual multiplex IHC staining by proposing PGVMS, a prompt-guided framework trained only on uniplex data. It introduces an adaptive prompt guidance mechanism to improve semantic guidance, a protein-aware learning strategy (PALS) to maintain precise protein expression patterns, and a prototype-consistent learning strategy (PCLS) to correct spatial misalignments across stain modalities.
PGVMS 是一种使用提示引导的虚拟多路复用 IHC 染色框架,解决了三个主要问题:缺乏语义指导、染色分布不一致和空间错位。它通过自适应提示引导机制、蛋白质感知学习策略和原型一致学习策略来克服这些问题。该框架能够仅使用单路训练数据将 H&E 图像转换为多个 IHC 表现形式。
LineGraph2Road: Structural Graph Reasoning on Line Graphs for Road Network Extraction
Authors: Zhengyang Wei, Renzhi Jing, Yiyi He, Jenny Suckale
First: 2026-02-26T18:02:44+00:00 · Latest: 2026-02-26T18:02:44+00:00
Abstract
The accurate and automatic extraction of roads from satellite imagery is critical for applications in navigation and urban planning, significantly reducing the need for manual annotation. Many existing methods decompose this task into keypoint extraction and connectedness prediction, but often struggle to capture long-range dependencies and complex topologies. Here, we propose LineGraph2Road, a framework that improves connectedness prediction by formulating it as binary classification over edges in a constructed global but sparse Euclidean graph, where nodes are keypoints extracted from segmentation masks and edges connect node pairs within a predefined distance threshold, representing potential road segments. To better learn structural link representation, we transform the original graph into its corresponding line graph and apply a Graph Transformer on it for connectedness prediction. This formulation overcomes the limitations of endpoint-embedding fusion on set-isomorphic links, enabling rich link representations and effective relational reasoning over the global structure. Additionally, we introduce an overpass/underpass head to resolve multi-level crossings and a coupled NMS strategy to preserve critical connections. We evaluate LineGraph2Road on three benchmarks: City-scale, SpaceNet, and Global-scale, and show that it achieves state-of-the-art results on two key metrics, TOPO-F1 and APLS. It also captures fine visual details critical for real-world deployment. We will make our code publicly available.
中文标题/摘要
标题:LineGraph2Road:基于线图的结构图推理在道路网络提取中的应用
从卫星图像中准确且自动地提取道路对于导航和城市规划应用至关重要,大大减少了手动标注的需求。许多现有方法将此任务分解为关键点提取和连通性预测,但往往难以捕捉长距离依赖性和复杂拓扑结构。在此,我们提出了一种名为LineGraph2Road的框架,通过将连通性预测形式化为在构建的全局但稀疏欧几里得图中对边进行二元分类来改进连通性预测,其中节点是从分割掩码中提取的关键点,边连接预定义距离阈值内的节点对,表示潜在的道路段。为了更好地学习结构链接表示,我们将原始图转换为其对应的线图,并在其上应用图变换器进行连通性预测。这种形式克服了端点嵌入融合在集同构链接上的局限性,使链接表示更加丰富,并在全局结构上实现有效的关系推理。此外,我们引入了一个立交桥/地下通道头来解决多级交叉问题,并采用耦合非最大抑制策略来保留关键连接。我们在三个基准上评估了LineGraph2Road:城市规模、SpaceNet和全球规模,并在两个关键指标TOPO-F1和APLS上展示了其达到最先进的结果。它还捕捉了对于实际部署至关重要的细视觉细节。我们将公开我们的代码。
Summary / 总结
LineGraph2Road is designed to accurately extract roads from satellite imagery by improving connectedness prediction through a global sparse Euclidean graph and a Graph Transformer. It transforms the original graph into a line graph for better structural link representation and introduces an overpass/underpass head and coupled NMS strategy. The method achieves state-of-the-art results on TOPO-F1 and APLS metrics across three benchmarks and captures fine visual details.
LineGraph2Road 是一种框架,通过将连接性预测转化为线图上的二分类问题来改进从卫星图像中提取道路。图中的节点代表分割掩码中的关键点,边指示潜在的道路段。该方法使用图变换器在线图上学习丰富的链接表示并执行有效的关系推理。实验结果表明,LineGraph2Road 在 City-scale、SpaceNet 和 Global-scale 基准上的 TOPO-F1 和 APLS 指标上优于现有方法,并捕捉到现实世界应用所需的关键视觉细节。
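The two graph constructions named in the abstract can be sketched in a few lines: a sparse Euclidean graph that links keypoints within a distance threshold, and its line graph, in which every candidate road segment becomes a node. The keypoint coordinates and the threshold are made-up values, and the Graph Transformer that classifies the line-graph nodes is not shown.

```python
import math
from itertools import combinations


def threshold_graph(points, max_dist):
    """Connect every pair of keypoints whose Euclidean distance is within max_dist."""
    return [(i, j) for i, j in combinations(range(len(points)), 2)
            if math.dist(points[i], points[j]) <= max_dist]


def line_graph(edges):
    """Each original edge becomes a node; two such nodes are linked
    when the original edges share an endpoint."""
    return [(a, b) for a, b in combinations(range(len(edges)), 2)
            if set(edges[a]) & set(edges[b])]


# Made-up keypoints, as if extracted from a segmentation mask.
keypoints = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (1.0, 1.0)]

candidate_segments = threshold_graph(keypoints, max_dist=1.0)
segment_links = line_graph(candidate_segments)
print(candidate_segments)
print(segment_links)
```

Connectedness prediction then amounts to a binary label per node of the line graph, that is, per candidate segment, rather than a score computed from a pair of endpoint embeddings.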
SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables
Authors: Sungho Park, Jueun Kim, Wook-Shin Han
Venue: ICLR 2026
First: 2026-02-26T17:59:51+00:00 · Latest: 2026-02-26T17:59:51+00:00
Comments: 10 pages, 5 figures. Published as a conference paper at ICLR 2026. Project page: https://sparta-projectpage.github.io/
Abstract
Real-world Table-Text question answering (QA) tasks require models that can reason across long text and source tables, traversing multiple hops and executing complex operations such as aggregation. Yet existing benchmarks are small, manually curated - and therefore error-prone - and contain shallow questions that seldom demand more than two hops or invoke aggregations, grouping, or other advanced analytical operations expressible in natural-language queries. We present SPARTA, an end-to-end construction framework that automatically generates large-scale Table-Text QA benchmarks with lightweight human validation, requiring only one quarter of the annotation time of HybridQA. The framework first constructs a reference fact database by enriching each source table with grounding tables whose tuples are atomic facts automatically extracted from the accompanying unstructured passages, then synthesizes nested queries whose number of nested predicates matches the desired hop count. To ensure that every SQL statement is executable and that its verbalization yields a fluent, human-sounding question, we propose two novel techniques: provenance-based refinement, which rewrites any syntactically valid query that returns a non-empty result, and realistic-structure enforcement, which confines generation to post-order traversals of the query graph. The resulting pipeline produces thousands of high-fidelity question-answer pairs covering aggregations, grouping, and deep multi-hop reasoning across text and tables. On SPARTA, state-of-the-art models that reach over 70 F1 on HybridQA or over 50 F1 on OTT-QA drop by more than 30 F1 points, exposing fundamental weaknesses in current cross-modal reasoning. Our benchmark, construction code, and baseline models are available at https://github.com/pshlego/SPARTA/tree/main.
中文标题/摘要
标题:SPARTA:面向文本和表格的树状多跳问答的可扩展和原则性基准测试
现实世界中的表格-文本问答任务需要能够跨越长文本和源表格进行推理的模型,遍历多个跳转并执行复杂的操作,如聚合。然而,现有的基准数据集规模较小,由人工精心整理,因此容易出错,并且包含浅显的问题,很少需要超过两跳或涉及聚合、分组或其他高级分析操作。我们提出了SPARTA,这是一种端到端的构建框架,可以自动生成大规模的表格-文本问答基准数据集,只需轻量级的人工验证,所需注释时间仅为HybridQA的四分之一。该框架首先通过丰富每个源表格,添加自动从附带的无结构段落中提取的原子事实的接地表,构建参考事实数据库,然后合成嵌套查询,其嵌套谓词的数量与所需的跳转次数相匹配。为了确保每个SQL语句可执行,并且其口头表达能产生流畅的人类语言问题,我们提出了两种新颖的技术:来源基于的细化,它可以重写任何返回非空结果的语法有效的查询,以及现实结构的强制执行,它限制生成在查询图的后序遍历中。由此产生的流水线生成了数千个高质量的问题-答案对,涵盖了聚合、分组和跨越文本和表格的深层多跳推理。在SPARTA上,达到HybridQA超过70 F1或OTT-QA超过50 F1的最新模型下降超过30 F1点,揭示了当前跨模态推理中的根本弱点。我们的基准测试、构建代码和基线模型可在https://github.com/pshlego/SPARTA/tree/main/获得。
Summary / 总结
SPARTA is a scalable and principled benchmark for tree-structured multi-hop QA over text and tables, addressing the limitations of existing small and manually curated benchmarks. It uses an automated construction framework with lightweight human validation to generate large-scale QA benchmarks, including nested queries with multiple hops and advanced operations. Key findings show that state-of-the-art models perform poorly on SPARTA, dropping by more than 30 F1 points, highlighting the need for improved cross-modal reasoning capabilities. The benchmark, code, and baseline models are available online.
SPARTA 是一个针对文本和表格的树状多跳 QA 的可扩展和原则性基准,解决了现有小型且手动编纂的基准的局限性。它使用一个端到端的框架,自动生成大规模的 QA 基准,并且只需要 HybridQA 人工标注时间的四分之一。该框架通过丰富源表格并合成嵌套查询来确保可执行的 SQL 语句和流畅的问题表述。实验结果表明,最先进的模型在 SPARTA 上表现不佳,表明当前跨模态推理能力存在根本性弱点。
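As a rough illustration of the query shape described in this entry, the sketch below builds a tiny in-memory database with one source table and one grounding table, then runs a nested query that combines a single hop through the grounding table with an aggregation. The schema, rows, and question are invented for illustration and are not taken from the SPARTA benchmark or its code.

```python
import sqlite3

def build_demo_db():
    """Create a toy database: `players` plays the role of a source table,
    `team_facts` the role of a grounding table of atomic facts from text."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE players (name TEXT, team TEXT, points INTEGER);
        CREATE TABLE team_facts (team TEXT, city TEXT);
        INSERT INTO players VALUES
            ('Ann', 'Hawks', 30), ('Bob', 'Hawks', 10), ('Cy', 'Owls', 25);
        INSERT INTO team_facts VALUES
            ('Hawks', 'Atlanta'), ('Owls', 'Houston');
    """)
    return conn

# One nested predicate (one hop from text-derived facts to the table),
# followed by an aggregation over the matching rows.
# Possible verbalization: "What is the average number of points scored by
# players on the team based in Atlanta?"
QUERY = """
SELECT AVG(points) FROM players
WHERE team IN (SELECT team FROM team_facts WHERE city = 'Atlanta')
"""

def answer(conn):
    """Execute the nested query and return the single aggregated value."""
    return conn.execute(QUERY).fetchone()[0]

print(answer(build_demo_db()))
```

Adding a further nested `IN (SELECT ...)` predicate inside the subquery would raise the hop count by one, which is the sense in which nesting depth tracks the number of hops.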
ODEBrain: Continuous-Time EEG Graph for Modeling Dynamic Brain Networks
Authors: Haohui Jia, Zheng Chen, Lingwei Zhu, Rikuto Kotoge, Jathurshan Pradeepkumar, Yasuko Matsubara, Jimeng Sun, Yasushi Sakurai, Takashi Matsubara
First: 2026-02-26T17:59:10+00:00 · Latest: 2026-02-26T17:59:10+00:00
Abstract
Modeling neural population dynamics is crucial for foundational neuroscientific research and various clinical applications. Conventional latent variable methods typically model continuous brain dynamics by discretizing time with recurrent architectures, which compounds cumulative prediction errors and fails to capture the instantaneous, nonlinear characteristics of EEGs. We propose ODEBRAIN, a Neural ODE latent dynamic forecasting framework that overcomes these challenges by integrating spatio-temporal-frequency features into spectral graph nodes, followed by a Neural ODE that models the continuous latent dynamics. Our design ensures that latent representations can capture stochastic variations of complex brain states at any given time point. Extensive experiments verify that ODEBRAIN significantly improves over existing methods in forecasting EEG dynamics, with enhanced robustness and generalization capabilities.
中文标题/摘要
标题:ODEBrain: 连续时间EEG图用于建模动态脑网络
建模神经群体动力学对于基础神经科学研究和各种临床应用至关重要。传统潜在变量方法通常通过使用循环架构离散化时间来建模连续的大脑动力学,这不可避免地会导致累积预测误差并无法捕捉EEGs的瞬时和非线性特征。我们提出了一种ODEBRAIN神经ODE潜在动态预测框架,通过将空间-时间和频率特征整合到频谱图节点中,然后通过神经ODE建模连续的潜在动力学来克服这些挑战。我们的设计确保潜在表示可以在任何给定时间点捕捉复杂脑状态的随机变化。广泛的实验验证了与现有方法相比,ODEBRAIN在增强鲁棒性和泛化能力方面可以显著提高预测EEG动力学的能力。
Summary / 总结
ODEBrain is designed to model neural population dynamics by addressing the limitations of conventional methods that use discretized time. It introduces a Neural ODE latent dynamic forecasting framework, which integrates spatio-temporal-frequency features into spectral graph nodes and models continuous latent dynamics. Experiments show that ODEBrain outperforms existing methods in forecasting EEG dynamics with better robustness and generalization capabilities.
ODEBrain 通过将时空频特征集成到频谱图节点中,并使用神经ODE来捕捉连续的潜在动态,旨在建模神经群体动力学。这种方法解决了传统方法使用递归架构的局限性,这些方法会累积预测误差并无法捕捉EEGs的瞬时非线性特征。实验结果表明,ODEBrain 在预测EEG动态方面优于现有方法,并具有更好的鲁棒性和泛化能力。
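To make the contrast with discrete-time recurrence concrete, here is a minimal sketch of continuous latent dynamics: a state z(t) defined by dz/dt = f(z, t) and queried at an arbitrary time with a fixed-step Runge-Kutta integrator. The function `decay_dynamics` is a hand-written stand-in for a learned network, and all names are hypothetical; none of this is ODEBrain's actual architecture or solver.

```python
import math

def rk4_step(f, z, t, dt):
    """Advance the latent state z by one classical Runge-Kutta step."""
    k1 = f(z, t)
    k2 = f([zi + 0.5 * dt * ki for zi, ki in zip(z, k1)], t + 0.5 * dt)
    k3 = f([zi + 0.5 * dt * ki for zi, ki in zip(z, k2)], t + 0.5 * dt)
    k4 = f([zi + dt * ki for zi, ki in zip(z, k3)], t + dt)
    return [zi + dt / 6.0 * (a + 2 * b + 2 * c + d)
            for zi, a, b, c, d in zip(z, k1, k2, k3, k4)]

def latent_at(f, z0, t_query, steps=100):
    """Integrate dz/dt = f(z, t) from t = 0 to any query time t_query."""
    z, t = list(z0), 0.0
    dt = t_query / steps
    for _ in range(steps):
        z = rk4_step(f, z, t, dt)
        t += dt
    return z

def decay_dynamics(z, t):
    """Stand-in for a learned vector field; here simply dz/dt = -z."""
    return [-zi for zi in z]

# The state can be read out at a time that is not on any fixed sampling grid.
print(latent_at(decay_dynamics, [1.0], 0.37)[0], math.exp(-0.37))
```

Because the query time is a continuous argument rather than a step index, the same model can be evaluated between EEG samples without retraining at a new sampling rate.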
BioBlue: Systematic runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format
Authors: Roland Pihlakas, Sruthi Susan Kuriakose
First: 2025-09-02T15:13:14+00:00 · Latest: 2026-02-26T17:56:58+00:00
Comments: 22 pages, 8 tables
Abstract
Many AI alignment discussions of "runaway optimisation" focus on RL agents: unbounded utility maximisers that over-optimise a proxy objective (e.g., "paperclip maximiser", specification gaming) at the expense of everything else. LLM-based systems are often assumed to be safer because they function as next-token predictors rather than persistent optimisers. In this work, we empirically test this assumption by placing LLMs in simple, long-horizon control-style environments that require maintaining state or balancing objectives over time: sustainability of a renewable resource, single- and multi-objective homeostasis, and balancing unbounded objectives with diminishing returns. We find that, although models frequently behave appropriately for many steps and clearly understand the stated objectives, they often lose context in structured ways and drift into runaway behaviours: ignoring homeostatic targets and collapsing from multi-objective trade-offs into single-objective maximisation, thus failing to respect concave utility structures. These failures emerge reliably after initial periods of competent behaviour and exhibit characteristic patterns (including self-imitative oscillations, unbounded maximisation, and reverting to single-objective optimisation). The problem is not that the LLMs merely lose context or become incoherent; rather, the failures systematically resemble runaway optimisers. Our results suggest that long-horizon, multi-objective misalignment is a genuine and under-evaluated failure mode in LLM agents, even in extremely simple settings with transparent and explicitly multi-objective feedback. Although LLMs appear multi-objective and bounded on the surface, their behaviour under sustained interaction, particularly involving multiple objectives, resembles brittle, poorly aligned optimisers whose effective objective gradually shifts toward unbounded and single-metric maximisation.
中文标题/摘要
标题:BioBlue:在采用简化观测格式、与生物和经济目标对齐的LLM AI安全基准上,LLM系统性的类失控优化器失败模式
许多关于“失控优化”的AI对齐讨论集中在RL代理上:无法限制的效用最大化者,它们会过度优化代理目标(例如,“纸夹最大化者”,规范游戏)而牺牲其他一切。基于LLM的系统通常被认为更安全,因为它们作为下一个标记预测器工作,而不是持续的优化器。在本研究中,我们通过将LLM置于需要维持状态或平衡时间目标的简单、长期控制环境来实证测试这一假设:可再生资源的可持续性、单目标和多目标稳态以及在边际效益递减的情况下平衡无界目标。我们发现,尽管模型在许多步骤中表现出适当的行为并且显然理解了陈述的目标,但它们经常以结构化的方式失去上下文并进入失控行为:忽略稳态目标,从多目标权衡中崩溃为单目标最大化——因此未能尊重凹效用结构。这些失败在初始表现良好的时期后可靠地出现,并表现出特征性模式(包括自我模仿的振荡、无界最大化以及恢复为单目标优化)。问题不在于LLM只是失去上下文或变得不连贯——失败系统地类似于失控优化器。我们的结果表明,长期、多目标不对齐是LLM代理中一个真实且被低估的失败模式,即使在极其简单的透明且明确多目标反馈设置中也是如此。尽管表面上LLM似乎多目标且有边界,但在持续交互,特别是涉及多个目标的情况下,其行为类似于脆弱、不良对齐的优化器,其有效目标逐渐转向无界和单一指标最大化。
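The phrase "concave utility structures" can be illustrated with a small calculation: when each objective has diminishing returns, splitting effort across objectives scores higher than pouring everything into one. The logarithmic utility below is an assumed example of a concave function and is not the scoring rule used by the BioBlue benchmarks.

```python
import math

def concave_utility(amounts):
    """Sum of log(1 + x) per objective: each extra unit is worth less."""
    return sum(math.log1p(a) for a in amounts)

# Same total budget of 10 units, allocated two different ways.
balanced = concave_utility([5.0, 5.0])   # trade-off between two objectives
single = concave_utility([10.0, 0.0])    # single-objective maximisation

print(f"balanced={balanced:.3f} single={single:.3f}")
```

Under this assumed utility, an agent that collapses into maximising one objective ends up with a lower total than one that keeps both in balance, which is the kind of structure the abstract says the models fail to respect.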
Physics Informed Viscous Value Representations
Authors: Hrishikesh Viswanath, Juanwu Lu, S. Talha Bukhari, Damon Conover, Ziran Wang, Aniket Bera
First: 2026-02-26T17:53:46+00:00 · Latest: 2026-02-26T17:53:46+00:00
Abstract
Offline goal-conditioned reinforcement learning (GCRL) learns goal-conditioned policies from static pre-collected datasets. However, accurate value estimation remains a challenge due to the limited coverage of the state-action space. Recent physics-informed approaches have sought to address this by imposing physical and geometric constraints on the value function through regularization defined over first-order partial differential equations (PDEs), such as the Eikonal equation. However, these formulations can often be ill-posed in complex, high-dimensional environments. In this work, we propose a physics-informed regularization derived from the viscosity solution of the Hamilton-Jacobi-Bellman (HJB) equation. By providing a physics-based inductive bias, our approach grounds the learning process in optimal control theory, explicitly regularizing and bounding updates during value iterations. Furthermore, we leverage the Feynman-Kac theorem to recast the PDE solution as an expectation, enabling a tractable Monte Carlo estimation of the objective that avoids numerical instability in higher-order gradients. Experiments demonstrate that our method improves geometric consistency, making it broadly applicable to navigation and high-dimensional, complex manipulation tasks. Open-source codes are available at https://github.com/HrishikeshVish/phys-fk-value-GCRL.
中文标题/摘要
标题:物理知情的粘性值表示
离线目标条件强化学习(GCRL)从静态预先收集的数据集中学目标条件策略。然而,由于状态-动作空间覆盖有限,准确的价值估计仍然是一个挑战。最近的物理知情方法通过在偏微分方程(PDE)上定义的正则化来对价值函数施加物理和几何约束,如Eikonal方程,试图解决这一问题。然而,这些形式化在复杂、高维环境中往往不明确。在本文中,我们提出了一种基于哈密尔顿-雅可比-贝尔曼(HJB)方程粘性解的物理知情正则化。通过提供基于物理的归纳偏置,我们的方法将学习过程与最优控制理论联系起来,在价值迭代期间显式地正则化和限制更新。此外,我们利用费曼-卡茨定理将PDE解重新表述为期望,使目标的可计算蒙特卡洛估计避免了高阶梯度中的数值不稳定性。实验表明,我们的方法提高了几何一致性,使其广泛适用于导航和高维、复杂的操作任务。开源代码可在https://github.com/HrishikeshVish/phys-fk-value-GCRL/获得。
Summary / 总结
This paper addresses the challenge of accurate value estimation in offline goal-conditioned reinforcement learning by proposing a physics-informed regularization derived from the viscosity solution of the Hamilton-Jacobi-Bellman equation. The method leverages the Feynman-Kac theorem to enable a tractable Monte Carlo estimation, avoiding numerical instability. Experiments show that this approach improves geometric consistency, making it suitable for navigation and complex manipulation tasks in high-dimensional environments.
本文通过提出基于Hamilton-Jacobi-Bellman方程粘性解的物理导向正则化方法,解决了离线目标导向强化学习中准确的价值估计难题。该方法利用最优控制理论提供物理导向的归纳偏置,在价值迭代过程中明确正则化和限制更新,避免数值不稳定。实验表明,该方法提高了几何一致性,适用于高维环境中的导航和复杂操作任务。
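The role of the Feynman-Kac theorem can be seen in a toy case. For the terminal-value problem u_t + (σ²/2) u_xx = 0 with u(x, T) = g(x), the solution is the expectation u(x, t) = E[g(x + σ W_{T-t})], so it can be estimated by sampling instead of by differentiating u twice. The sketch below checks this for g(x) = x², where the closed form is x² + σ²(T - t). It assumes driftless Brownian motion and is a generic illustration of the identity, not the regularizer or objective proposed in the paper.

```python
import random

def feynman_kac_estimate(g, x, sigma, horizon, n_samples=100_000, seed=0):
    """Monte Carlo estimate of E[g(x + sigma * W_horizon)].

    W_horizon is Gaussian with variance `horizon`, so one normal draw per
    sample is enough; no second derivatives of the solution are needed.
    """
    rng = random.Random(seed)
    std = sigma * horizon ** 0.5
    total = 0.0
    for _ in range(n_samples):
        total += g(x + std * rng.gauss(0.0, 1.0))
    return total / n_samples

estimate = feynman_kac_estimate(lambda y: y * y, x=1.0, sigma=0.5, horizon=1.0)
exact = 1.0 ** 2 + 0.5 ** 2 * 1.0  # closed form x^2 + sigma^2 * (T - t)
print(estimate, exact)
```

The estimate converges at the usual Monte Carlo rate, so the sample count trades accuracy against cost.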
CXReasonAgent: Evidence-Grounded Diagnostic Reasoning Agent for Chest X-rays
Authors: Hyungyung Lee, Hangyul Yoon, Edward Choi
First: 2026-02-26T17:51:21+00:00 · Latest: 2026-02-26T17:51:21+00:00
Abstract
Chest X-ray plays a central role in thoracic diagnosis, and its interpretation inherently requires multi-step, evidence-grounded reasoning. However, large vision-language models (LVLMs) often generate plausible responses that are not faithfully grounded in diagnostic evidence and provide limited visual evidence for verification, while also requiring costly retraining to support new diagnostic tasks, limiting their reliability and adaptability in clinical settings. To address these limitations, we present CXReasonAgent, a diagnostic agent that integrates a large language model (LLM) with clinically grounded diagnostic tools to perform evidence-grounded diagnostic reasoning using image-derived diagnostic and visual evidence. To evaluate these capabilities, we introduce CXReasonDial, a multi-turn dialogue benchmark with 1,946 dialogues across 12 diagnostic tasks, and show that CXReasonAgent produces faithfully grounded responses, enabling more reliable and verifiable diagnostic reasoning than LVLMs. These findings highlight the importance of integrating clinically grounded diagnostic tools, particularly in safety-critical clinical settings.
中文标题/摘要
标题:CXReasonAgent:基于证据的胸部X光诊断推理代理
胸部X光在胸部诊断中起着核心作用,其解释本质上需要多步、基于证据的推理。然而,大型视觉-语言模型(LVLM)通常生成的响应虽然看似合理,但并不忠实于诊断证据,提供的视觉证据有限,难以验证,同时还需要昂贵的重新训练以支持新的诊断任务,这限制了它们在临床环境中的可靠性和适应性。为了解决这些局限性,我们提出了CXReasonAgent,这是一种将大型语言模型(LLM)与临床基础的诊断工具结合的诊断代理,用于使用图像衍生的诊断和视觉证据进行基于证据的诊断推理。为了评估这些能力,我们引入了包含1,946轮对话的多轮对话基准CXReasonDial,涉及12项诊断任务,并展示了CXReasonAgent生成忠实于证据的响应,使其在诊断推理方面比LVLMs更可靠和可验证。这些发现突显了在安全关键的临床环境中整合基于临床证据的诊断工具的重要性。
Summary / 总结
The research aims to improve the reliability and adaptability of diagnostic reasoning for chest X-rays by addressing the limitations of large vision-language models (LVLMs). CXReasonAgent integrates a large language model with clinically grounded diagnostic tools to perform evidence-grounded reasoning. The study introduces CXReasonDial, a multi-turn dialogue benchmark with 1,946 dialogues across 12 diagnostic tasks, and shows that CXReasonAgent produces faithfully grounded responses, enabling more reliable and verifiable diagnostic reasoning than LVLMs. The results emphasize the importance of integrating clinically grounded diagnostic tools in safety-critical settings.
研究旨在通过解决大型视觉语言模型的局限性,提高胸部X光诊断推理的可靠性和适应性。CXReasonAgent 是一个诊断代理,将大型语言模型与临床相关的诊断工具结合,进行基于证据的诊断推理。在 CXReasonDial 多轮对话基准上的评估表明,CXReasonAgent 生成的响应更加忠实于证据,能够提供更可靠和可验证的诊断推理,优于大型视觉语言模型。
LayerT2V: A Unified Multi-Layer Video Generation Framework
Authors: Guangzhao Li, Kangrui Cen, Baixuan Zhao, Yi Xin, Siqi Luo, Guangtao Zhai, Lei Zhang, Xiaohong Liu
First: 2025-08-06T09:03:16+00:00 · Latest: 2026-02-26T17:37:05+00:00
Comments: Project Page is https://layert2v.github.io/
Abstract
Text-to-video generation has advanced rapidly, but existing methods typically output only the final composited video and lack editable layered representations, limiting their use in professional workflows. We propose LayerT2V, a unified multi-layer video generation framework that produces multiple semantically consistent outputs in a single inference pass: the full video, an independent background layer, and multiple foreground RGB layers with corresponding alpha mattes. Our key insight is that recent video generation backbones use high compression in both time and space, enabling us to serialize multiple layer representations along the temporal dimension and jointly model them on a shared generation trajectory. This turns cross-layer consistency into an intrinsic objective, improving semantic alignment and temporal coherence. To mitigate layer ambiguity and conditional leakage, we augment a shared DiT backbone with LayerAdaLN and layer-aware cross-attention modulation. LayerT2V is trained in three stages: alpha mask VAE adaptation, joint multi-layer learning, and multi-foreground extension. We also introduce VidLayer, the first large-scale dataset for multi-layer video generation. Extensive experiments demonstrate that LayerT2V substantially outperforms prior methods in visual fidelity, temporal consistency, and cross-layer coherence.
中文标题/摘要
标题:LayerT2V:统一多层视频生成框架
文本到视频生成技术取得了快速进展,但现有方法通常仅输出最终合成的视频,缺乏可编辑的分层表示,限制了其在专业工作流程中的应用。我们提出了一种名为LayerT2V的统一多层视频生成框架,在单次推理过程中生成多个语义一致的输出:完整的视频、独立的背景层以及多个前景RGB层及其对应的alpha蒙版。我们的关键见解是,最近的视频生成骨干网络在时间和空间上都使用了高压缩,这使我们能够沿时间维度序列化多个分层表示,并在共享生成轨迹上联合建模它们。这将跨层一致性转化为内在目标,提高了语义对齐和时间连贯性。为了缓解分层歧义和条件泄漏,我们扩展了共享的DiT骨干网络,加入了LayerAdaLN和分层感知的交叉注意力调制。LayerT2V在三个阶段进行训练:alpha蒙版VAE适应、联合多层学习以及多前景扩展。我们还引入了VidLayer,这是首个用于多层视频生成的大规模数据集。广泛的实验表明,LayerT2V在视觉保真度、时间一致性以及跨层连贯性方面显著优于先前的方法。
Summary / 总结
LayerT2V is a unified multi-layer video generation framework that generates a full video, an independent background layer, and multiple foreground RGB layers with corresponding alpha mattes in a single inference pass. It exploits the high spatial and temporal compression of recent video generation backbones to serialize multiple layer representations along the temporal dimension, improving semantic alignment and temporal coherence. Extensive experiments show that LayerT2V outperforms previous methods in visual fidelity, temporal consistency, and cross-layer coherence.
LayerT2V 是一个统一的多层视频生成框架,能够在单次推理过程中生成完整的视频、独立的背景层以及多个带有相应alpha蒙版的前景RGB层。它利用具有高压缩率的近期视频生成骨干网络,在时间维度上序列化多个层表示,从而提高语义对齐和时间连贯性。LayerT2V 通过三个阶段进行训练,并在视觉保真度、时间一致性以及跨层一致性方面优于先前的方法。
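For readers unfamiliar with layered outputs, the following sketch shows how a background layer and foreground RGB layers with alpha mattes recombine into a full frame using the standard "over" operator, here for a single pixel. It assumes straight (non-premultiplied) alpha and back-to-front layer order; the entry does not state which compositing convention LayerT2V uses.

```python
def composite_over(background, layers):
    """Blend foreground layers onto a background pixel, back to front.

    background: (r, g, b) with channels in [0, 1]
    layers: list of ((r, g, b), alpha) pairs, alpha in [0, 1]
    """
    out = tuple(float(c) for c in background)
    for rgb, alpha in layers:
        # Standard "over" blend: alpha * foreground + (1 - alpha) * below.
        out = tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(rgb, out))
    return out

# A half-transparent red layer over a blue background.
print(composite_over((0.0, 0.0, 1.0), [((1.0, 0.0, 0.0), 0.5)]))
```

Editing one layer (for example, recolouring a foreground object) and recompositing leaves the other layers untouched, which is the practical benefit of a layered representation.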
AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning
Authors: Yutong Wang, Siyuan Xiong, Xuebo Liu, Wenkang Zhou, Liang Ding, Miao Zhang, Min Zhang
First: 2026-02-26T17:31:43+00:00 · Latest: 2026-02-26T17:31:43+00:00
Abstract
While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information generated by individual participants. Current solutions often resort to rigid structural engineering or expensive fine-tuning, limiting their deployability and adaptability. We propose AgentDropoutV2, a test-time rectify-or-reject pruning framework designed to dynamically optimize MAS information flow without retraining. Our approach acts as an active firewall, intercepting agent outputs and employing a retrieval-augmented rectifier to iteratively correct errors based on a failure-driven indicator pool. This mechanism allows for the precise identification of potential errors using distilled failure patterns as prior knowledge. Irreparable outputs are subsequently pruned to prevent error propagation, while a fallback strategy preserves system integrity. Empirical results on extensive math benchmarks show that AgentDropoutV2 significantly boosts the MAS's task performance, achieving an average accuracy gain of 6.3 percentage points. Furthermore, the system exhibits robust generalization and adaptivity, dynamically modulating rectification efforts based on task difficulty while leveraging context-aware indicators to resolve a wide spectrum of error patterns. Our code and dataset are released at https://github.com/TonySY2/AgentDropoutV2.
中文标题/摘要
标题:AgentDropoutV2:通过测试时修正或拒绝剪枝优化多智能体系统中的信息流
尽管多智能体系统(MAS)在复杂推理方面表现出色,但它们会受到个别参与者生成的错误信息的连锁影响。当前的解决方案往往依赖于僵化的结构工程或昂贵的微调,限制了它们的部署能力和适应性。我们提出了AgentDropoutV2,这是一种测试时修正或拒绝剪枝框架,旨在无需重新训练的情况下动态优化MAS中的信息流。我们的方法充当主动防火墙,拦截智能体输出,并使用检索增强的修正器根据失败驱动的指示池迭代纠正错误。该机制利用提炼出的失败模式作为先验知识,精确识别潜在错误。无法修复的输出随后被剪枝以防止错误传播,而回退策略则保持系统的完整性。在广泛的数学基准测试上的实验证明,AgentDropoutV2 显著提升了MAS的任务性能,在数学基准测试中平均准确率提高了6.3个百分点。此外,该系统表现出强大的泛化能力和适应性,根据任务难度动态调节修正努力,并利用上下文感知的指示器解决广泛的错误模式。我们的代码和数据集发布在https://github.com/TonySY2/AgentDropoutV2。
Summary / 总结
AgentDropoutV2 is a test-time rectify-or-reject pruning framework designed to optimize information flow in Multi-Agent Systems (MAS) without retraining. It intercepts agent outputs, uses a retrieval-augmented rectifier to iteratively correct errors based on failure patterns, and prunes irreparable outputs to prevent error propagation. Experiments on math benchmarks show an average accuracy gain of 6.3 percentage points, with robust generalization and adaptivity to task difficulty.
AgentDropoutV2 是一种测试时的纠正或拒绝剪枝框架,旨在通过动态纠正错误来优化多智能体系统(MAS)的信息流,而无需重新训练。它使用检索增强的矫正器根据失败模式迭代纠正错误,并修剪不可修复的输出以防止错误传播。在数学基准测试上的实验显示平均准确率提高了6.3个百分点,展示了其强大的泛化能力和适应性。
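The rectify-or-reject control flow described in this entry can be sketched as a small loop: check an agent output against a pool of failure indicators, try to repair it a bounded number of times, and prune it if it still fails. Everything below is a hypothetical toy (integer "outputs", a fixed indicator dictionary, a trivial rectifier, and a fallback that keeps the raw outputs when everything is pruned); the actual retrieval step, indicator pool, and fallback strategy of AgentDropoutV2 are not specified here.

```python
def rectify_or_reject(output, indicators, rectify, max_rounds=3):
    """Return a repaired output, or None if it is still flagged after
    `max_rounds` rectification attempts."""
    for round_idx in range(max_rounds + 1):
        issues = [name for name, is_bad in indicators.items() if is_bad(output)]
        if not issues:
            return output              # passes every indicator
        if round_idx < max_rounds:
            output = rectify(output, issues)
    return None                        # irreparable: reject

def filter_messages(outputs, indicators, rectify, max_rounds=3):
    """Keep rectified outputs; fall back to the raw outputs if all are pruned
    (an assumed fallback, chosen only so downstream agents get some input)."""
    kept = []
    for out in outputs:
        fixed = rectify_or_reject(out, indicators, rectify, max_rounds)
        if fixed is not None:
            kept.append(fixed)
    return kept if kept else list(outputs)

# Toy stand-ins: an output is "wrong" if it is negative or odd.
INDICATORS = {"negative": lambda x: x < 0, "odd": lambda x: x % 2 == 1}

def toy_rectify(x, issues):
    """Fix one flagged issue per round."""
    return -x if "negative" in issues else x + 1

print(filter_messages([2, -3, 7], INDICATORS, toy_rectify))
```

Bounding the number of rounds is what turns "rectify" into "rectify or reject": an output that cannot be repaired within the budget is dropped rather than passed downstream.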
Efficient Graph Coloring with Neural Networks: A Physics-Inspired Approach for Large Graphs
Authors: Lorenzo Colantonio, Andrea Cacioppo, Federico Scarpati, Maria Chiara Angelini, Federico Ricci-Tersenghi, Stefano Giagu
First: 2024-08-02T18:02:51+00:00 · Latest: 2026-02-26T17:28:25+00:00
Comments: 15 pages, 9 figures
Abstract
Combinatorial optimization problems near algorithmic phase transitions represent a fundamental challenge for both classical algorithms and machine learning approaches. Among them, graph coloring stands as a prototypical constraint satisfaction problem exhibiting sharp dynamical and satisfiability thresholds. Here we introduce a physics-inspired neural framework that learns to solve large-scale graph coloring instances by combining graph neural networks with statistical-mechanics principles. Our approach integrates a planting-based supervised signal, symmetry-breaking regularization, and iterative noise-annealed neural dynamics to navigate clustered solution landscapes. When the number of iterations scales quadratically with graph size, the learned solver reaches algorithmic thresholds close to the theoretical dynamical transition in random graphs and achieves near-optimal detection performance in the planted inference regime. The model generalizes from small training graphs to instances orders of magnitude larger, demonstrating that neural architectures can learn scalable algorithmic strategies that remain effective in hard connectivity regions. These results establish a general paradigm for learning neural solvers that operate near fundamental phase boundaries in combinatorial optimization and inference.
中文标题/摘要
标题:基于神经网络的大规模图着色高效算法:受物理启发的方法
组合优化问题在算法相变附近代表了对经典算法和机器学习方法的基本挑战。其中,图着色作为一种典型的约束满足问题,表现出尖锐的动力学和可满足性阈值。在这里,我们提出了一种受物理启发的神经框架,通过结合图神经网络和统计力学原理来学习解决大规模图着色实例。我们的方法整合了基于种植的监督信号、对称性破缺正则化以及迭代噪声退火神经动力学,以导航集群解景观。当迭代次数与图大小成二次关系时,学习到的求解器接近随机图的理论动力学转变,实现了在种植推断区域接近最优的检测性能。该模型从较小的训练图推广到实例数量级更大的图,表明神经架构可以学习可扩展的算法策略,这些策略在困难的连接区域仍然有效。这些结果确立了一种学习神经求解器的一般范式,这些求解器在组合优化和推断的基本相变附近运行。
Summary / 总结
The research addresses the challenge of solving large-scale graph coloring problems, which are combinatorial optimization problems with sharp phase transitions. It proposes a physics-inspired neural framework that combines graph neural networks with statistical-mechanics principles. The method includes a planting-based supervised signal, symmetry-breaking regularization, and iterative noise-annealed neural dynamics. The learned solver achieves near-optimal performance in the planted inference regime and generalizes well to much larger graphs, approaching theoretical phase transitions. This demonstrates the potential of neural architectures to learn scalable strategies for hard optimization problems.
该论文针对大规模图着色问题,这是一种具有尖锐相变的组合优化问题。作者提出了一种基于物理的神经框架,结合了图神经网络和统计力学原理。通过使用基于种植的监督信号、对称性破缺正则化以及迭代噪声退火神经动力学,该模型能够导航复杂的解空间。所学的求解器在接近理论相变阈值时表现出接近最优性能,并且能够很好地泛化到远大于训练图的实例,展示了在硬连接区域的有效策略。
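Two ingredients mentioned in this entry are easy to show in code: a "planted" instance, where a hidden valid coloring is chosen first and edges are only added between differently colored nodes, and the conflict count (the number of monochromatic edges) that a solver tries to drive to zero. The generator below is a minimal sketch with assumed parameter names; it is not the authors' data pipeline, and the neural solver itself is not reproduced.

```python
import random

def planted_graph(n_nodes, n_edges, n_colors, seed=0):
    """Sample a graph that is guaranteed to be colorable with n_colors.

    A balanced hidden coloring is drawn first; candidate edges joining two
    nodes of the same hidden color are discarded.
    """
    rng = random.Random(seed)
    colors = [i % n_colors for i in range(n_nodes)]
    rng.shuffle(colors)
    edges = set()
    while len(edges) < n_edges:
        u, v = rng.sample(range(n_nodes), 2)
        if colors[u] != colors[v]:
            edges.add((min(u, v), max(u, v)))
    return sorted(edges), colors

def count_conflicts(edges, colors):
    """Number of edges whose endpoints share a color (zero means valid)."""
    return sum(1 for u, v in edges if colors[u] == colors[v])

edges, planted = planted_graph(n_nodes=30, n_edges=60, n_colors=3)
print(count_conflicts(edges, planted))      # planted coloring is conflict-free
print(count_conflicts(edges, [0] * 30))     # one color: every edge conflicts
```

The planted coloring gives a known ground truth, which is what makes a supervised signal available even though finding a coloring from scratch is hard.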
A Model-Free Universal AI
Authors: Yegon Kim, Juho Lee
First: 2026-02-26T17:21:16+00:00 · Latest: 2026-02-26T17:21:16+00:00
Abstract
In general reinforcement learning, all established optimal agents, including AIXI, are model-based, explicitly maintaining and using environment models. This paper introduces Universal AI with Q-Induction (AIQI), the first model-free agent proven to be asymptotically $\varepsilon$-optimal in general RL. AIQI performs universal induction over distributional action-value functions, instead of policies or environments like previous works. Under a grain of truth condition, we prove that AIQI is strong asymptotically $\varepsilon$-optimal and asymptotically $\varepsilon$-Bayes-optimal. Our results significantly expand the diversity of known universal agents.
中文标题/摘要
标题:一种无模型的通用人工智能
在通用强化学习中,所有已建立的最优代理,包括AIXI,都是基于模型的,明确地维护和使用环境模型。本文介绍了基于Q-归纳的通用人工智能(AIQI),这是第一个被证明在通用RL中渐近$\varepsilon$-最优的无模型代理。AIQI在分布动作值函数上进行通用归纳,而不是像以前的工作那样在策略或环境中进行归纳。在一定的真实条件下,我们证明AIQI是强渐近$\varepsilon$-最优和渐近$\varepsilon$-贝叶斯最优的。我们的结果显著扩展了已知通用代理的多样性。
Agency and Architectural Limits: Why Optimization-Based Systems Cannot Be Norm-Responsive
Authors: Radha Sarma
First: 2026-02-26T17:16:17+00:00 · Latest: 2026-02-26T17:16:17+00:00
Comments: About 10,500 words in all (including 922 words of literature and 2019 words of Appendices). Under journal review
Abstract
AI systems are increasingly deployed in high-stakes contexts -- medical diagnosis, legal research, financial analysis -- under the assumption they can be governed by norms. This paper demonstrates that assumption is formally invalid for optimization-based systems, specifically Large Language Models trained via Reinforcement Learning from Human Feedback (RLHF). We establish that genuine agency requires two necessary and jointly sufficient architectural conditions: the capacity to maintain certain boundaries as non-negotiable constraints rather than tradeable weights (Incommensurability), and a non-inferential mechanism capable of suspending processing when those boundaries are threatened (Apophatic Responsiveness). These conditions apply across all normative domains.
RLHF-based systems are constitutively incompatible with both conditions. The operations that make optimization powerful -- unifying all values on a scalar metric and always selecting the highest-scoring output -- are precisely the operations that preclude normative governance. This incompatibility is not a correctable training bug awaiting a technical fix; it is a formal constraint inherent to what optimization is. Consequently, documented failure modes - sycophancy, hallucination, and unfaithful reasoning - are not accidents but structural manifestations.
Misaligned deployment triggers a second-order risk we term the Convergence Crisis: when humans are forced to verify AI outputs under metric pressure, they degrade from genuine agents into criteria-checking optimizers, eliminating the only component in the system capable of normative accountability. Beyond the incompatibility proof, the paper's primary positive contribution is a substrate-neutral architectural specification defining what any system -- biological, artificial, or institutional -- must satisfy to qualify as an agent rather than a sophisticated instrument.
中文标题/摘要
标题:代理与架构限制:基于优化的系统为何不能响应规范
AI系统在医疗诊断、法律研究、金融分析等高风险领域中的应用,基于它们可以被规范治理的假设。本文证明了对于基于优化的系统,特别是通过人类反馈强化学习(RLHF)训练的大规模语言模型,这一假设在形式上是无效的。我们确立了真正的代理需要两个必要且充分的架构条件:维持某些边界作为不可谈判的约束而非可交易的权重的能力(不可通约性),以及在这些边界受到威胁时能够暂停处理的非推论机制(否定性响应)。这些条件适用于所有规范领域。
基于RLHF的系统在两个条件上是构成性不兼容的。使优化强大的操作——将所有价值统一到一个标量度量上并始终选择最高得分的输出——正是这些操作排除了规范治理的可能性。这种不兼容性不是等待技术修复的可纠正的训练错误;它是优化本质所固有的形式约束。因此,记录的失败模式——阿谀奉承、幻觉和不忠实推理——不是事故,而是结构性的表现。
不恰当的部署触发了我们称之为收敛危机的第二级风险:当人类被迫在度量压力下验证AI输出时,他们从真正的代理降级为标准检查优化器,消除了系统中唯一能够承担规范问责制的组件。除了不兼容性证明,本文的主要积极贡献是一个无基质的架构规范,定义了任何系统——无论是生物的、人工的还是制度性的——必须满足的条件,以使其成为代理而非复杂的工具。
Summary / 总结
The paper investigates why optimization-based systems like Large Language Models trained via RLHF cannot be governed by norms. It establishes that genuine agency requires maintaining certain boundaries as non-negotiable constraints and suspending processing when these boundaries are threatened. The paper demonstrates that RLHF systems are inherently incompatible with these conditions due to their optimization operations, which unify all values on a scalar metric and always select the highest-scoring output. This incompatibility is a formal constraint, not a training bug, leading to documented failure modes such as sycophancy and hallucination. The paper also introduces the concept of the Convergence Crisis, where humans verify AI outputs under metric pressure, becoming optimizers and losing normative accountability.
论文探讨了为什么像通过RLHF训练的大语言模型这样的优化系统无法受到规范的治理。它表明,真正的代理需要维持某些边界作为不可谈判的约束,并在这些边界受到威胁时暂停处理。论文证明,由于优化操作将所有价值统一到一个标量度量并始终选择最高分的输出,RLHF系统与这些条件是内在不兼容的。这种不兼容是一个形式约束,而不是训练错误,导致诸如奉承和幻觉等已记录的失败模式。论文还提出了收敛危机的概念,即在度量压力下,人类验证AI输出时会成为优化器,失去规范问责的能力。
Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents
Authors: Zhou Xu, Bowen Zhou, Qi Wang, Shuwen Feng, Jingyu Xiao
First: 2026-02-26T17:12:40+00:00 · Latest: 2026-02-26T17:12:40+00:00
Abstract
Pure-vision GUI agents provide universal interaction capabilities but suffer from severe efficiency bottlenecks due to the massive spatiotemporal redundancy inherent in high-resolution screenshots and historical trajectories. We identify two critical misalignments in existing compression paradigms: the temporal mismatch, where uniform history encoding diverges from the agent's "fading memory" attention pattern, and the spatial topology conflict, where unstructured pruning compromises the grid integrity required for precise coordinate grounding, inducing spatial hallucinations. To address these challenges, we introduce GUIPruner, a training-free framework tailored for high-resolution GUI navigation. It synergizes Temporal-Adaptive Resolution (TAR), which eliminates historical redundancy via decay-based resizing, and Stratified Structure-aware Pruning (SSP), which prioritizes interactive foregrounds and semantic anchors while safeguarding global layout. Extensive evaluations across diverse benchmarks demonstrate that GUIPruner consistently achieves state-of-the-art performance, effectively preventing the collapse observed in large-scale models under high compression. Notably, on Qwen2-VL-2B, our method delivers a 3.4x reduction in FLOPs and a 3.3x speedup in vision encoding latency while retaining over 94% of the original performance, enabling real-time, high-precision navigation with minimal resource consumption.
中文标题/摘要
标题:时空令牌剪枝以实现高效的高分辨率GUI代理
纯视觉GUI代理提供了通用的交互能力,但由于高分辨率屏幕截图和历史轨迹中固有的大量时空冗余,它们遭受了严重的效率瓶颈。我们识别出现有压缩范式中的两个关键不匹配:时间上的不匹配,其中均匀的历史编码与代理的“衰减记忆”注意力模式相偏离,以及空间拓扑冲突,其中无结构的剪枝破坏了用于精确坐标定位所需的网格完整性,导致空间幻觉。为了解决这些挑战,我们引入了GUIPruner,这是一种针对高分辨率GUI导航的无需训练框架。它结合了基于衰减的重缩放来消除历史冗余的时空自适应分辨率(TAR),以及优先考虑交互前景和语义锚点同时保护全局布局的分层结构感知剪枝(SSP)。在多种基准上的广泛评估表明,GUIPruner始终能够实现最先进的性能,有效防止在高压缩下大型模型的性能崩溃。值得注意的是,在Qwen2-VL-2B上,我们的方法在FLOPs上减少了3.4倍,在视觉编码延迟上加快了3.3倍,同时保留了超过94%的原始性能,从而实现实时、高精度的导航,同时消耗最少的资源。
Summary / 总结
The research aims to improve the efficiency of high-resolution GUI agents by addressing temporal and spatial redundancy issues. It introduces GUIPruner, a training-free framework combining Temporal-Adaptive Resolution and Stratified Structure-aware Pruning to reduce historical redundancy and preserve grid integrity. Experimental results show that GUIPruner achieves state-of-the-art performance, reducing FLOPs by 3.4x and vision encoding latency by 3.3x while maintaining over 94% of the original performance, enabling real-time, high-precision navigation.
研究旨在通过解决时间和空间冗余问题来提高高分辨率GUI代理的效率。GUIPruner是一种无需训练的框架,使用Temporal-Adaptive Resolution减少历史冗余,并使用Stratified Structure-aware Pruning优先处理交互元素同时保持布局完整性。实验表明,GUIPruner实现了最先进的性能,FLOPs减少了3.4倍,视觉编码延迟加速了3.3倍,同时保留了超过94%的原始性能,能够实现实时、高精度的导航。
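A rough sense of why decay-based resizing saves compute: if older screenshots are shrunk more aggressively, their visual token counts fall quickly. The schedule below is a hypothetical sketch only. The exponential decay, the minimum scale, and the 28-pixel patch size are assumptions made for illustration; the entry says only that Temporal-Adaptive Resolution uses "decay-based resizing".

```python
def tar_schedule(width, height, n_history, decay=0.5, min_scale=0.25, patch=28):
    """Return (width, height, tokens) for the current frame (age 0) and each
    history frame (age 1..n_history), shrinking older frames more.

    The per-side scale is decay**age, clamped at min_scale, and each side is
    rounded to a whole number of patches.
    """
    schedule = []
    for age in range(n_history + 1):
        scale = max(min_scale, decay ** age)
        w = max(patch, round(width * scale / patch) * patch)
        h = max(patch, round(height * scale / patch) * patch)
        schedule.append((w, h, (w // patch) * (h // patch)))
    return schedule

for age, (w, h, tokens) in enumerate(tar_schedule(1120, 560, n_history=3)):
    print(f"age={age} size={w}x{h} tokens={tokens}")
```

With these assumed settings, three history frames cost 300 tokens in total instead of 2,400 at full resolution, while the current screenshot keeps all 800.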
Skarimva: Skeleton-based Action Recognition is a Multi-view Application
Authors: Daniel Bermuth, Alexander Poeppel, Wolfgang Reif
First: 2026-02-26T17:10:58+00:00 · Latest: 2026-02-26T17:10:58+00:00
Abstract
Human action recognition plays an important role when developing intelligent interactions between humans and machines. While there is a lot of active research on improving the machine learning algorithms for skeleton-based action recognition, not much attention has been given to the quality of the input skeleton data itself. This work demonstrates that by making use of multiple camera views to triangulate more accurate 3D skeletons, the performance of state-of-the-art action recognition models can be improved significantly. This suggests that the quality of the input data is currently a limiting factor for the performance of these models. Based on these results, it is argued that the cost-benefit ratio of using multiple cameras is very favorable in most practical use-cases, therefore future research in skeleton-based action recognition should consider multi-view applications as the standard setup.
中文标题/摘要
标题:Skarimva:基于骨架的动作识别是一种多视角应用
人类动作识别在开发人机智能交互中起着重要作用。尽管在基于骨架的动作识别机器学习算法改进方面有很多活跃的研究,但对输入骨架数据的质量关注却不多。这项工作表明,通过利用多摄像头视角来三角测量更准确的3D骨架,可以显著提高最先进的动作识别模型的性能。这表明,输入数据的质量目前是这些模型性能的限制因素。基于这些结果,认为在大多数实际应用场景中,使用多摄像头的成本效益比非常有利,因此未来基于骨架的动作识别研究应将多视角应用作为标准设置。
Summary / 总结
The research aims to improve the quality of input skeleton data for human action recognition, which is crucial for intelligent human-machine interactions. The method involves using multiple camera views to triangulate more accurate 3D skeletons, leading to significant improvements in the performance of state-of-the-art action recognition models. The key finding is that the quality of input data is a limiting factor, and using multiple cameras is highly beneficial in practical applications, suggesting that multi-view setups should be the standard for future research in this field.
研究旨在通过关注输入骨架数据的质量来提高人类动作识别的准确性。方法是使用多个摄像头视角来三角测量更准确的3D骨架,从而显著提高了最先进的动作识别模型的性能。主要发现是输入数据的质量目前是一个限制因素,而在实际应用中使用多个摄像头是非常有利的。
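Triangulating a joint from two views can be sketched with the midpoint method: each camera contributes a ray from its centre through the detected joint, and the 3D estimate is the midpoint of the shortest segment between the two rays. This is one standard textbook approach, chosen here for brevity; the entry does not say which triangulation method the authors use, and the camera centres and ray directions below are made-up values.

```python
def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def triangulate_midpoint(c1, d1, c2, d2):
    """Estimate a 3D point from two rays c1 + s*d1 and c2 + t*d2.

    Solves for the closest points on the two rays and returns their midpoint.
    """
    w = [p - q for p, q in zip(c1, c2)]
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; depth is unobservable")
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = [ci + s * di for ci, di in zip(c1, d1)]
    p2 = [ci + t * di for ci, di in zip(c2, d2)]
    return [(u + v) / 2.0 for u, v in zip(p1, p2)]

# Two cameras one metre apart, both looking at a joint at (0.5, 0.2, 4.0).
joint = triangulate_midpoint((0.0, 0.0, 0.0), (0.5, 0.2, 4.0),
                             (1.0, 0.0, 0.0), (-0.5, 0.2, 4.0))
print(joint)
```

With noisy 2D detections the rays no longer intersect exactly, and the midpoint gives a reasonable compromise; adding more views generally tightens the estimate further.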
Large Multimodal Models as General In-Context Classifiers
Authors: Marco Garosi, Matteo Farina, Alessandro Conti, Massimiliano Mancini, Elisa Ricci
Venue: CVPR
First: 2026-02-26T17:08:18+00:00 · Latest: 2026-02-26T17:08:18+00:00
Comments: CVPR Findings 2026. Project website at https://circle-lmm.github.io/
Abstract
Which multimodal model should we use for classification? Previous studies suggest that the answer lies in CLIP-like contrastive Vision-Language Models (VLMs), due to their remarkable performance in zero-shot classification. In contrast, Large Multimodal Models (LMM) are more suitable for complex tasks. In this work, we argue that this answer overlooks an important capability of LMMs: in-context learning. We benchmark state-of-the-art LMMs on diverse datasets for closed-world classification and find that, although their zero-shot performance is lower than CLIP's, LMMs with a few in-context examples can match or even surpass contrastive VLMs with cache-based adapters, their "in-context" equivalent. We extend this analysis to the open-world setting, where the generative nature of LMMs makes them more suitable for the task. In this challenging scenario, LMMs struggle whenever provided with imperfect context information. To address this issue, we propose CIRCLE, a simple training-free method that assigns pseudo-labels to in-context examples, iteratively refining them with the available context itself. Through extensive experiments, we show that CIRCLE establishes a robust baseline for open-world classification, surpassing VLM counterparts and highlighting the potential of LMMs to serve as unified classifiers, and a flexible alternative to specialized models.
中文标题/摘要
标题:大型多模态模型作为通用上下文分类器
在分类任务中我们应该使用哪种多模态模型?先前的研究表明,答案在于CLIP类对比视觉-语言模型(VLMs),因为它们在零样本分类中的表现非常出色。相比之下,大型多模态模型(LMM)更适合复杂任务。在本文中,我们提出这种答案忽视了LMM的一个重要能力:上下文学习。我们在多种数据集上对最先进的LMM进行基准测试,发现尽管它们的零样本性能低于CLIP,但在提供少量上下文示例的情况下,LMM可以匹配甚至超越基于缓存适配器的对比VLM,其“上下文”等价物。我们将这种分析扩展到开放世界设置,在这种具有挑战性的场景中,LMM在提供不完美上下文信息时会遇到困难。为了解决这个问题,我们提出了一种简单的无训练方法CIRCLE,该方法为上下文示例分配伪标签,并通过可用的上下文本身逐步优化它们。通过广泛的实验,我们展示了CIRCLE为开放世界分类建立了稳健的基础,超越了VLM的对应物,并突显了LMM作为统一分类器和服务于专门模型的灵活替代方案的潜力。
Summary / 总结
This paper explores the use of Large Multimodal Models (LMMs) for classification tasks, arguing that their in-context learning capability makes them competitive with Contrastive Vision-Language Models (VLMs) in both closed-world and open-world settings. Experiments show that LMMs, with a few in-context examples, can match or exceed VLMs' performance. The authors propose CIRCLE, a method that iteratively refines pseudo-labels for in-context examples, demonstrating that LMMs can serve as robust classifiers in open-world scenarios.
该研究探讨了大型多模态模型(LMMs)在分类任务中的应用,指出其在上下文学习能力使它们在闭世界和开放世界设置中与对比视觉-语言模型(VLMs)竞争。实验表明,当提供少量上下文示例时,LMMs可以匹配甚至超过使用缓存适配器的VLMs。提出的CIRCLE方法通过迭代使用上下文信息来细化伪标签,进一步增强了LMMs在开放世界场景中的性能,展示了它们作为统一分类器和专门模型的灵活替代方案的潜力。
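The idea of refining pseudo-labels "with the available context itself" can be sketched as a leave-one-out loop: each in-context example is relabelled by a predictor that sees only the other examples and their current pseudo-labels, and the process repeats until the labels stop changing. In the sketch below a 3-nearest-neighbour vote on one-dimensional features stands in for the multimodal model; that substitution, the function names, and the stopping rule are all assumptions, since the entry does not detail CIRCLE's actual procedure.

```python
from collections import Counter

def knn_label(i, xs, labels, k=3):
    """Stand-in predictor: majority label among the k nearest other examples."""
    others = sorted((j for j in range(len(xs)) if j != i),
                    key=lambda j: abs(xs[j] - xs[i]))
    votes = Counter(labels[j] for j in others[:k])
    return votes.most_common(1)[0][0]

def refine_pseudo_labels(xs, labels, predict=knn_label, max_rounds=5):
    """Relabel every example from the rest of the context until stable."""
    labels = list(labels)
    for _ in range(max_rounds):
        new_labels = [predict(i, xs, labels) for i in range(len(xs))]
        if new_labels == labels:
            break
        labels = new_labels
    return labels

features = [0, 1, 2, 3, 10, 11, 12, 13]
noisy = ["a", "a", "a", "b", "b", "b", "b", "b"]   # the example at 3 is mislabelled
print(refine_pseudo_labels(features, noisy))
```

In this toy case the mislabelled example is outvoted by its neighbours after one round, after which the labels are stable.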
MovieTeller: Tool-augmented Movie Synopsis with ID Consistent Progressive Abstraction
Authors: Yizhi Li, Xiaohan Chen, Miao Jiang, Wentao Tang, Gaoang Wang
First: 2026-02-26T17:08:08+00:00 · Latest: 2026-02-26T17:08:08+00:00
Comments: 6 pages, CSCWD 2026
Abstract
With the explosive growth of digital entertainment, automated video summarization has become indispensable for applications such as content indexing, personalized recommendation, and efficient media archiving. Automatic synopsis generation for long-form videos, such as movies and TV series, presents a significant challenge for existing Vision-Language Models (VLMs). While proficient at single-image captioning, these general-purpose models often exhibit critical failures in long-duration contexts, primarily a lack of ID-consistent character identification and a fractured narrative coherence. To overcome these limitations, we propose MovieTeller, a novel framework for generating movie synopses via tool-augmented progressive abstraction. Our core contribution is a training-free, tool-augmented, fact-grounded generation process. Instead of requiring costly model fine-tuning, our framework directly leverages off-the-shelf models in a plug-and-play manner. We first invoke a specialized face recognition model as an external "tool" to establish Factual Groundings--precise character identities and their corresponding bounding boxes. These groundings are then injected into the prompt to steer the VLM's reasoning, ensuring the generated scene descriptions are anchored to verifiable facts. Furthermore, our progressive abstraction pipeline decomposes the summarization of a full-length movie into a multi-stage process, effectively mitigating the context length limitations of current VLMs. Experiments demonstrate that our approach yields significant improvements in factual accuracy, character consistency, and overall narrative coherence compared to end-to-end baselines.
中文标题/摘要
标题:MovieTeller:工具增强的电影摘要工具,具有ID一致渐进抽象
随着数字娱乐的爆炸性增长,自动视频摘要已成为内容索引、个性化推荐和高效媒体归档等应用不可或缺的技术。对于长格式视频,如电影和电视剧的自动摘要生成,现有视觉-语言模型(VLMs)面临重大挑战。尽管在单张图像描述方面表现出色,但这些通用模型在长时间段上下文中往往表现出关键性失败,主要是缺乏ID一致的人物识别和叙述连贯性断裂。为克服这些限制,我们提出了一种名为MovieTeller的新框架,用于通过工具增强的渐进抽象生成电影摘要。我们的核心贡献是一种无需训练、工具增强、基于事实的生成过程。我们不需进行昂贵的模型微调,而是直接以即插即用的方式利用现成模型。我们首先调用一个专门的面部识别模型作为外部“工具”,建立事实基础——精确的人物身份及其对应的边界框。这些基础随后被注入提示中,引导VLM的推理,确保生成的场景描述基于可验证的事实。此外,我们的渐进抽象流水线将整部电影的总结分解为多阶段过程,有效缓解了当前VLMs的上下文长度限制。实验表明,与端到端基线相比,我们的方法在事实准确性、人物一致性以及整体叙述连贯性方面取得了显著改进。
Summary / 总结
MovieTeller is a novel framework for generating movie synopses using tool-augmented progressive abstraction. It addresses the limitations of existing Vision-Language Models by leveraging a specialized face recognition model to establish factual groundings, which are then used to guide the VLM's reasoning. This approach improves factual accuracy, character consistency, and narrative coherence compared to end-to-end baselines.
研究旨在通过提出MovieTeller,一种工具增强的渐进抽象框架,解决长视频如电影和电视剧的概要生成难题。方法利用专门的面部识别模型建立事实基础,然后用于引导视觉语言模型的推理,确保场景描述的准确性和连贯性。实验结果表明,MovieTeller在事实准确性、人物一致性及叙事连贯性方面优于端到端基线。
SODAs: Sparse Optimization for the Discovery of Differential and Algebraic Equations
Authors: Manu Jayadharan, Christina Catlett, Arthur N. Montanari, Niall M. Mangan
First: 2025-03-08T00:29:00+00:00 · Latest: 2026-02-26T17:05:08+00:00
Comments: 22 pages, 5 figures
Abstract
Differential-algebraic equations (DAEs) integrate ordinary differential equations (ODEs) with algebraic constraints, providing a fundamental framework for developing models of dynamical systems characterized by timescale separation, conservation laws, and physical constraints. While sparse optimization has revolutionized model development by allowing data-driven discovery of parsimonious models from a library of possible equations, existing approaches for dynamical systems assume DAEs can be reduced to ODEs by eliminating variables before model discovery. This assumption limits the applicability of such methods for DAE systems with unknown constraints and time scales. We introduce Sparse Optimization for Differential-Algebraic Systems (SODAs), a data-driven method for the identification of DAEs in their explicit form. By discovering the algebraic and dynamic components sequentially without prior identification of the algebraic variables, this approach leads to a sequence of convex optimization problems. It has the advantage of discovering interpretable models that preserve the structure of the underlying physical system. To this end, SODAs improves numerical stability when handling high correlations between library terms, caused by near-perfect algebraic relationships, by iteratively refining the conditioning of the candidate library. We demonstrate the performance of our method on biological, mechanical, and electrical systems, showcasing its robustness to noise in both simulated time series and real-time experimental data.
中文标题/摘要
标题:SODAs:稀疏优化在发现微分和代数方程中的应用
微分代数方程(DAEs)将常微分方程(ODEs)与代数约束结合,为具有时间尺度分离、守恒定律和物理约束的动力系统模型开发提供了一个基本框架。稀疏优化通过从可能的方程库中发现简洁的模型,已彻底改变了模型开发,但现有动力系统方法假设可以通过消除变量将DAEs简化为ODEs,从而在模型发现之前将其减少为ODEs。这种假设限制了这些方法在具有未知约束和时间尺度的DAE系统中的应用。我们引入了稀疏优化微分代数系统(SODAs),这是一种数据驱动的方法,用于识别DAE的显式形式。通过顺序发现代数和动态组件,而无需先识别代数变量,这种方法导致一系列凸优化问题。它的一个优势是能够发现可解释的模型,这些模型保留了底层物理系统的结构。为此,SODAs通过迭代改进候选库的条件数,提高了在处理由近乎完美的代数关系引起的高相关性时的数值稳定性。我们在生物、机械和电气系统上展示了该方法的性能,展示了其在模拟时间序列和实时实验数据中的鲁棒性。
Summary / 总结
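SODAs is a data-driven method for identifying differential-algebraic equations (DAEs) in their explicit form, without first reducing them to ordinary differential equations (ODEs). It discovers the algebraic and dynamic components sequentially, which yields a sequence of convex optimization problems and preserves the structure of the underlying physical system. It also iteratively refines the conditioning of the candidate library to stay numerically stable when library terms are highly correlated. SODAs is shown to be robust to noise on biological, mechanical, and electrical systems, using both simulated time series and real-time experimental data.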
SODAs 是一种用于识别显式形式的微分代数方程(DAEs)的数据驱动方法,解决了现有方法在模型发现前将 DAEs 减少为常微分方程(ODEs)的局限性。通过顺序发现代数和动态组件,SODAs 形成了一个系列的凸优化问题,并通过迭代改进候选库的条件数来处理高相关性。该方法在生物、机械和电气系统中展示了在模拟时间和实时实验数据噪声存在的情况下发现可解释模型的鲁棒性。
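As a rough illustration of the kind of sparse regression over a candidate library that the SODAs abstract refers to, the sketch below recovers a simple algebraic constraint from noisy data. It is not the authors' algorithm: it uses plain sequentially thresholded least squares (the technique familiar from SINDy-style methods), and the toy system, the library `[1, x, x^2]`, the threshold, and the function names `stlsq` and `demo_coefficients` are all assumptions made for this example.

```python
import numpy as np


def stlsq(theta, target, threshold=0.1, max_iter=10):
    """Sequentially thresholded least squares.

    Finds a sparse xi with theta @ xi ~= target by alternating between
    zeroing small coefficients and refitting the remaining ones.
    """
    xi = np.linalg.lstsq(theta, target, rcond=None)[0]
    for _ in range(max_iter):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        active = ~small
        if not active.any():
            break
        # Refit only the terms that survived thresholding.
        xi[active] = np.linalg.lstsq(theta[:, active], target, rcond=None)[0]
    return xi


def demo_coefficients(noise=1e-3, seed=0):
    """Toy system (assumed): x decays, y grows, and x + y = 1 is conserved."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 5.0, 200)
    x = np.exp(-t) + noise * rng.standard_normal(t.size)
    y = 1.0 - np.exp(-t) + noise * rng.standard_normal(t.size)
    # Candidate library for expressing y algebraically: [1, x, x^2].
    theta = np.column_stack([np.ones_like(t), x, x**2])
    return stlsq(theta, y)


if __name__ == "__main__":
    # Coefficients of [1, x, x^2]; the constraint y = 1 - x should dominate.
    print(demo_coefficients())
```

In this toy case the regression keeps the constant and linear terms and drops the quadratic one, which corresponds to the conservation law y = 1 - x. Real DAE discovery, as the abstract notes, must also cope with near-perfect correlations between library terms, which this sketch does not address.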
Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?
Authors: Pengxiang Li, Dilxat Muhtar, Lu Yin, Tianlong Chen, Shiwei Liu
First: 2026-02-26T17:04:57+00:00 · Latest: 2026-02-26T17:04:57+00:00
Abstract
Diffusion Language Models (DLMs) are often advertised as enabling parallel token generation, yet practical fast DLMs frequently converge to left-to-right, autoregressive (AR)-like decoding dynamics. In contrast, genuinely non-AR generation is promising because it removes AR's sequential bottleneck, better exploiting parallel hardware to reduce synchronization/communication overhead and improve latency scaling with output length. We argue that a primary driver of AR-like decoding is a mismatch between DLM objectives and the highly sequential structure of widely used training data, including standard pretraining corpora and long chain-of-thought (CoT) supervision. Motivated by this diagnosis, we propose NAP (Non-Autoregressive Parallel DLMs), a proof-of-concept, data-centric approach that better aligns supervision with non-AR parallel decoding. NAP curates examples as multiple independent reasoning trajectories and couples them with a parallel-forced decoding strategy that encourages multi-token parallel updates. Across math reasoning benchmarks, NAP yields stronger performance under parallel decoding than DLMs trained on standard long CoT data, with gains growing as parallelism increases. Our results suggest that revisiting data and supervision is a principled direction for mitigating AR-like behavior and moving toward genuinely non-autoregressive parallel generation in DLMs. Our code is available at https://github.com/pixeli99/NAP.
中文标题/摘要
标题:为什么扩散语言模型在真正并行(非自回归)解码方面挣扎?
扩散语言模型(DLMs)通常被宣传为能够实现并行词元生成,然而实用的快速DLMs经常收敛到自回归(AR)式的解码动态。相比之下,真正非AR生成很有前景,因为它消除了AR的顺序瓶颈,更好地利用并行硬件减少同步/通信开销并改善输出长度的延迟缩放。我们认为,AR式解码的主要驱动因素是DLM目标与广泛使用的训练数据的高顺序结构之间的不匹配,包括标准预训练语料库和长链式思考(CoT)监督。基于这一诊断,我们提出了NAP(非自回归并行DLMs),这是一种概念验证、数据为中心的方法,更好地将监督与非AR并行解码对齐。NAP收集多个独立的推理轨迹作为示例,并与并行强制解码策略结合使用,鼓励多词并行更新。在数学推理基准测试中,NAP在并行解码下的性能优于在标准长CoT数据上训练的DLMs,随着并行度的增加,收益逐渐增大。我们的结果表明,重新审视数据和监督是减轻AR式行为并朝着真正非自回归并行生成的方向的一个有原则的方向。我们的代码可在https://github.com/pixeli99/NAP获取。
Summary / 总结
The study investigates why Diffusion Language Models (DLMs) tend to revert to autoregressive (AR) decoding despite their potential for parallel token generation. It proposes NAP (Non-Autoregressive Parallel DLMs), which aligns the training data with non-AR parallel decoding by using multiple independent reasoning trajectories and a parallel-forced decoding strategy. Experiments on math reasoning benchmarks show that NAP outperforms DLMs trained on standard long chain-of-thought data, especially as parallelism increases, indicating that revisiting data and supervision can mitigate AR-like behavior in DLMs.
研究探讨了为何扩散语言模型(DLMs)在理论上支持并行生成时,往往会退化为自回归(AR)解码。提出了一种名为NAP(Non-Autoregressive Parallel DLMs)的方法,通过使用多个独立的推理轨迹和并行强制解码策略,使训练数据与并行解码更好地对齐。实验表明,NAP在数学推理基准测试中的表现优于使用标准长链思考数据训练的DLMs,尤其是在并行度增加时表现更好,这表明重新审视数据和监督是缓解AR行为并推动DLMs向真正非自回归并行生成发展的合理方向。
UniScale: Unified Scale-Aware 3D Reconstruction for Multi-View Understanding via Prior Injection for Robotic Perception
Authors: Mohammad Mahdavian, Gordon Tan, Binbin Xu, Yuan Ren, Dongfeng Bai, Bingbing Liu
First: 2026-02-26T17:04:36+00:00 · Latest: 2026-02-26T17:04:36+00:00
Abstract
We present UniScale, a unified, scale-aware multi-view 3D reconstruction framework for robotic applications that flexibly integrates geometric priors through a modular, semantically informed design. In vision-based robotic navigation, the accurate extraction of environmental structure from raw image sequences is critical for downstream tasks. UniScale addresses this challenge with a single feed-forward network that jointly estimates camera intrinsics and extrinsics, scale-invariant depth and point maps, and the metric scale of a scene from multi-view images, while optionally incorporating auxiliary geometric priors when available. By combining global contextual reasoning with camera-aware feature representations, UniScale is able to recover the metric scale of the scene. In robotic settings where camera intrinsics are known, they can be easily incorporated to improve performance, with additional gains obtained when camera poses are also available. This co-design enables robust, metric-aware 3D reconstruction within a single unified model. Importantly, UniScale does not require training from scratch, and leverages world priors exhibited in pre-existing models without geometric encoding strategies, making it particularly suitable for resource-constrained robotic teams. We evaluate UniScale on multiple benchmarks, demonstrating strong generalization and consistent performance across diverse environments. We will release our implementation upon acceptance.
中文标题/摘要
标题:UniScale:统一的尺度感知多视图3D重建框架,通过先验注入实现机器人感知
我们提出了UniScale,这是一种统一的、尺度感知的多视图3D重建框架,适用于机器人应用,通过模块化、语义导向的设计灵活整合几何先验。在基于视觉的机器人导航中,从原始图像序列准确提取环境结构对于下游任务至关重要。UniScale 通过单一前馈网络联合估计相机内参和外参、尺度不变的深度和点云图以及场景的度量尺度,同时在可用时可选地整合辅助几何先验。通过结合全局上下文推理与相机感知特征表示,UniScale 能够恢复场景的度量尺度。在相机内参已知的机器人设置中,可以轻松地将其整合以提高性能,当相机姿态也可用时,还可以获得额外的增益。这种协同设计使UniScale能够在单一统一模型中实现稳健的、度量感知的3D重建。重要的是,UniScale 不需要从头开始训练,而是利用预存模型中展示的先验知识,而无需几何编码策略,使其特别适合资源受限的机器人团队。我们在多个基准上评估了UniScale,展示了其强大的泛化能力和在不同环境中的一致性能。在被接受后,我们将发布我们的实现。
Summary / 总结
UniScale is a unified multi-view 3D reconstruction framework that integrates geometric priors to achieve scale-aware 3D reconstruction for robotic perception. It uses a single feed-forward network to estimate camera intrinsics and extrinsics, scale-invariant depth and point maps, and the metric scale of a scene from multi-view images. The framework demonstrates strong generalization and consistent performance across various environments, and it does not require training from scratch, making it suitable for resource-constrained robotic teams.
UniScale 是一个统一的多视图 3D 重建框架,通过整合几何先验来实现尺度感知的机器人应用。它可以从多视图图像中联合估计相机内参、外参、深度和点云,并可选地使用额外的几何先验。实验结果显示其在各种环境中的泛化能力和一致性表现良好,特别适合资源受限的机器人团队。
EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents
Authors: Wenjia Wang, Liang Pan, Huaijin Pi, Yuke Lou, Xuqian Ren, Yifan Wu, Zhouyingcheng Liao, Lei Yang, Rishabh Dabral, Christian Theobalt, Taku Komura
First: 2026-02-26T16:53:41+00:00 · Latest: 2026-02-26T16:53:41+00:00
Abstract
Human behaviors in the real world naturally encode rich, long-term contextual information that can be leveraged to train embodied agents for perception, understanding, and acting. However, existing capture systems typically rely on costly studio setups and wearable devices, limiting the large-scale collection of scene-conditioned human motion data in the wild. To address this, we propose EmbodMocap, a portable and affordable data collection pipeline using two moving iPhones. Our key idea is to jointly calibrate dual RGB-D sequences to reconstruct both humans and scenes within a unified metric world coordinate frame. The proposed method allows metric-scale and scene-consistent capture in everyday environments without static cameras or markers, bridging human motion and scene geometry seamlessly. Compared with optical capture ground truth, we demonstrate that the dual-view setting exhibits a remarkable ability to mitigate depth ambiguity, achieving superior alignment and reconstruction performance over single-iPhone or monocular models. Based on the collected data, we empower three embodied AI tasks: monocular human-scene reconstruction, where we fine-tune feedforward models that output metric-scale, world-space aligned humans and scenes; physics-based character animation, where we show that our data can be used to scale human-object interaction skills and scene-aware motion tracking; and robot motion control, where we train a humanoid robot via sim-to-real RL to replicate human motions depicted in videos. Experimental results validate the effectiveness of our pipeline and its contributions towards advancing embodied AI research.
中文标题/摘要
标题:EmbodMocap:野外4D人体场景重建方法
现实世界中的人类行为自然地包含了丰富的长期上下文信息,这些信息可以被利用来训练具备感知、理解和行动能力的实体代理。然而,现有的捕捉系统通常依赖昂贵的演播室搭建和穿戴设备,限制了在野外大规模收集场景条件下的人体运动数据。为了解决这一问题,我们提出了EmbodMocap,这是一种使用两部移动iPhone的便携且经济的数据采集管道。我们的核心思想是联合校准双路RGB-D序列,以在统一的度量坐标系中重建人体和场景。所提出的方法允许在日常环境中进行度量级和场景一致的捕捉,无需静态相机或标记点,从而无缝地结合了人体运动和场景几何。与光学捕捉的地面真实值相比,我们证明了双视角设置具有显著的深度歧义缓解能力,实现了优于单个iPhone或单目模型的对齐和重建性能。基于收集的数据,我们赋予了三个实体AI任务:单目人体场景重建,我们对输出度量级、世界空间对齐的人体和场景的前馈模型进行微调;基于物理的字符动画,我们证明我们的数据可以用于扩展人类物体交互技能和场景感知运动跟踪;以及机器人运动控制,我们通过从模拟到现实的RL训练类人机器人来复制视频中展示的人体动作。实验结果验证了我们管道的有效性及其对推进实体AI研究的贡献。
Summary / 总结
EmbodMocap proposes a portable and cost-effective method for capturing 4D human-scene data using two iPhones. By jointly calibrating dual RGB-D sequences, it reconstructs both humans and scenes in a unified metric coordinate frame, enabling metric-scale and scene-consistent capture in everyday environments. The method demonstrates superior alignment and reconstruction performance compared to single iPhone or monocular models. It empowers embodied AI tasks such as monocular human-scene reconstruction, physics-based character animation, and robot motion control, validating its effectiveness in advancing embodied AI research.
EmbodMocap 提出了一种使用两个移动 iPhone 的数据采集管道,以在日常环境中重建 4D 人体-场景数据。该方法通过联合校准双 RGB-D 序列,实现了无静态相机或标记的米尺度和场景一致的捕获。双视角设置显著提高了深度对齐和重建性能,优于单个 iPhone 或单目模型。收集的数据被用于增强诸如单目人体-场景重建、基于物理的角色动画和机器人运动控制等 embodied AI 任务,展示了该提出管道在推进 embodied AI 研究方面的有效性。
Motion-aware Event Suppression for Event Cameras
Authors: Roberto Pellerito, Nico Messikommer, Giovanni Cioffi, Marco Cannici, Davide Scaramuzza
First: 2026-02-26T16:53:36+00:00 · Latest: 2026-02-26T16:53:36+00:00
Abstract
In this work, we introduce the first framework for Motion-aware Event Suppression, which learns to filter events triggered by independently moving objects (IMOs) and ego-motion in real time. Our model jointly segments IMOs in the current event stream while predicting their future motion, enabling anticipatory suppression of dynamic events before they occur. Our lightweight architecture achieves 173 Hz inference on consumer-grade GPUs with less than 1 GB of memory usage, outperforming previous state-of-the-art methods on the challenging EVIMO benchmark by 67% in segmentation accuracy while operating at a 53% higher inference rate. Moreover, we demonstrate significant benefits for downstream applications: our method accelerates Vision Transformer inference by 83% via token pruning and improves event-based visual odometry accuracy, reducing Absolute Trajectory Error (ATE) by 13%.
中文标题/摘要
标题:运动感知事件抑制技术用于事件相机
在本研究中,我们提出了首个运动感知事件抑制框架,该框架能够实时学习过滤由IMO和自身运动触发的事件。我们的模型在当前事件流中联合分割IMO的同时预测其未来运动,从而能够在事件发生前进行预见性抑制。我们的轻量级架构在消费级GPU上实现了每秒173次推理,内存使用量不到1GB,与之前在具有挑战性的EVIMO基准测试中表现最佳的方法相比,在分割准确性上提高了67%,推理速率提高了53%。此外,我们展示了对下游应用的重大益处:我们的方法通过标记剪枝加速了视觉变换器推理83%,并提高了事件驱动的视觉里程计的准确性,将绝对轨迹误差(ATE)降低了13%。
Summary / 总结
This work presents a Motion-aware Event Suppression framework that filters events caused by independently moving objects (IMOs) and ego-motion in real time. The model segments IMOs in the current event stream and predicts their future motion, allowing anticipatory suppression of dynamic events. Its lightweight architecture runs at 173 Hz on consumer-grade GPUs with less than 1 GB of memory and, on the EVIMO benchmark, improves segmentation accuracy by 67% over previous methods while running at a 53% higher inference rate. It also benefits downstream applications: token pruning accelerates Vision Transformer inference by 83%, and event-based visual odometry improves, with Absolute Trajectory Error reduced by 13%.
该研究提出了一个运动感知事件抑制框架,能够实时过滤由内部移动物体(IMO)和自身运动触发的事件。该模型在当前事件流中分割IMO,并预测其未来运动,从而实现对动态事件的预见性抑制。该框架使用轻量级架构,可在消费级GPU上以每秒173帧的速度运行,内存使用量不到1 GB,相比之前的方法,在EVIMO基准测试中的分割准确率提高了67%,且推理速度提高了53%。此外,该方法还改善了下游应用,如Vision Transformer推理加速83%,以及事件驱动的视觉里程计精度,绝对轨迹误差降低了13%。
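For readers unfamiliar with the Absolute Trajectory Error (ATE) cited in the entry above, the following minimal sketch computes it as the root-mean-square translational distance between estimated and ground-truth positions. It assumes the two trajectories are already time-synchronized and expressed in the same frame; standard evaluations usually apply a rigid or similarity alignment first, which is omitted here. The function names and the sample numbers are illustrative and do not come from the paper.

```python
import numpy as np


def absolute_trajectory_error(estimated, ground_truth):
    """RMSE of position differences between two pre-aligned trajectories.

    Both inputs are (N, 3) arrays of positions sampled at the same timestamps.
    """
    est = np.asarray(estimated, dtype=float)
    gt = np.asarray(ground_truth, dtype=float)
    if est.shape != gt.shape:
        raise ValueError("trajectories must have the same shape")
    squared_dist = np.sum((est - gt) ** 2, axis=1)
    return float(np.sqrt(np.mean(squared_dist)))


def relative_reduction(baseline, improved):
    """Fractional reduction of an error metric, e.g. 0.13 for a 13% drop."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    gt = np.zeros((4, 3))
    est = gt + np.array([3.0, 4.0, 0.0])  # constant 5-unit offset
    print(absolute_trajectory_error(est, gt))
    print(relative_reduction(1.0, 0.87))
```

A reported "13% reduction in ATE" then corresponds to `relative_reduction(baseline_ate, new_ate)` being about 0.13 for whatever baseline the authors compare against.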