Multimodal Large Language Models as Image Classifiers
Authors: Nikita Kisel, Illia Volkov, Klara Janouskova, Jiri Matas
First: 2026-03-06T18:59:58+00:00 · Latest: 2026-03-06T18:59:58+00:00
Abstract
The classification performance of Multimodal Large Language Models (MLLMs) depends critically on evaluation protocol and ground truth quality. Studies comparing MLLMs with supervised and vision-language models report conflicting conclusions, and we show these conflicts stem from protocols that either inflate or underestimate performance. Across the most common evaluation protocols, we identify and fix key issues: model outputs that fall outside the provided class list and are discarded, inflated results from weak multiple-choice distractors, and an open-world setting that underperforms only due to poor output mapping. We additionally quantify the impact of commonly overlooked design choices (batch size, image ordering, and text encoder selection), showing they substantially affect accuracy. Evaluating on ReGT, our multilabel reannotation of 625 ImageNet-1k classes, reveals that MLLMs benefit most from corrected labels (up to +10.8%), substantially narrowing the perceived gap with supervised models. Much of the reported MLLM underperformance on classification is thus an artifact of noisy ground truth and flawed evaluation protocol rather than genuine model deficiency. Models less reliant on supervised training signals prove most sensitive to annotation quality. Finally, we show that MLLMs can assist human annotators: in a controlled case study, annotators confirmed or integrated MLLM predictions in approximately 50% of difficult cases, demonstrating their potential for large-scale dataset curation.
中文标题/摘要
标题:多模态大型语言模型作为图像分类器
多模态大型语言模型(MLLM)的分类性能在很大程度上取决于评估协议和真实标签的质量。比较MLLM、监督模型和视觉语言模型的研究报告结论不一,我们表明这些分歧源于要么夸大要么低估性能的评估协议。在最常见的评估协议中,我们识别并解决了关键问题:模型输出超出提供的类别列表并被丢弃、由于弱的选择题干扰项导致的夸大结果以及在开放世界设置中由于输出映射不佳而表现不佳。我们还量化了通常被忽视的设计选择——批量大小、图像排序和文本编码器选择的影响,表明它们显著影响准确性。在我们的多标签重注释的625个ImageNet-1k类别上进行评估显示,MLLM最受益于修正的标签(最多+10.8%),显著缩小了与监督模型之间的感知差距。因此,报告的MLLM在分类上的表现不佳很大程度上是由于嘈杂的真实标签和有缺陷的评估协议,而不是真正的模型缺陷。对监督训练信号依赖较少的模型对注释质量最为敏感。最后,我们展示了MLLM可以辅助人类注释员:在受控案例研究中,注释员在大约50%的困难案例中确认或整合了MLLM的预测,证明了它们在大规模数据集整理中的潜力。
Summary / 总结
The study investigates the performance of Multimodal Large Language Models (MLLM) as image classifiers, identifying issues in evaluation protocols and ground truth quality that lead to conflicting conclusions. By correcting these issues, the research reveals that MLLMs benefit significantly from accurate labels, narrowing the performance gap with supervised models. The study also finds that models less reliant on supervised training signals are more sensitive to annotation quality and that MLLMs can assist human annotators in dataset curation.
研究探讨了多模态大型语言模型(MLLM)作为图像分类器的性能,发现评估协议中的问题会导致其性能被夸大或低估。通过解决这些问题,研究揭示MLLM在ReGT(ImageNet-1k类的重新注释)上的准确性提高了最多10.8%,这表明其大部分报告的性能不足是由于噪声的标注和不合理的评估协议造成的,而不是模型本身的缺陷。此外,MLLM在数据集整理中可以辅助人类标注员,约50%的困难案例中,标注员确认或整合了MLLM的预测结果。
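The two evaluation ideas in this entry, mapping free-form answers back onto the class list and scoring against a set of valid labels per image, can be illustrated with a minimal sketch. It is not the authors' protocol: the fuzzy string match is an assumed stand-in for whatever output mapping the paper uses, and the function names, the `cutoff` value, and the toy data are hypothetical. Unmapped answers are counted as errors here rather than discarded, which is one illustrative choice.

```python
from difflib import get_close_matches


def map_to_class(raw_output, class_names, cutoff=0.6):
    """Map a free-form model answer to the closest class name, or None."""
    cleaned = raw_output.strip().lower()
    lowered = {name.lower(): name for name in class_names}
    if cleaned in lowered:
        return lowered[cleaned]
    # Assumption: fuzzy string matching stands in for the (unspecified)
    # mapping step; a real pipeline might use text embeddings instead.
    match = get_close_matches(cleaned, list(lowered), n=1, cutoff=cutoff)
    return lowered[match[0]] if match else None


def multilabel_accuracy(predictions, label_sets):
    """A prediction is correct if it belongs to the image's set of valid labels."""
    hits = sum(1 for pred, labels in zip(predictions, label_sets) if pred in labels)
    return hits / len(predictions)


# Hypothetical toy data: three images, four candidate classes.
classes = ["laptop", "notebook", "desk", "monitor"]
raw_answers = ["Laptop", "a notebok", "screen"]
preds = [map_to_class(r, classes) for r in raw_answers]

single_label = [{"notebook"}, {"notebook"}, {"monitor"}]
multi_label = [{"notebook", "laptop"}, {"notebook"}, {"monitor", "desk"}]

acc_single = multilabel_accuracy(preds, single_label)
acc_multi = multilabel_accuracy(preds, multi_label)
```

On this toy data the same predictions score higher against the multilabel sets than against the single labels, which is the kind of effect the abstract attributes to corrected ground truth.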
Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion
Authors: Lijiang Li, Zuwei Long, Yunhang Shen, Heting Gao, Haoyu Cao, Xing Sun, Caifeng Shan, Ran He, Chaoyou Fu
First: 2026-03-06T18:59:57+00:00 · Latest: 2026-03-06T18:59:57+00:00
Comments: Project page: https://omni-diffusion.github.io
Abstract
While recent multimodal large language models (MLLMs) have made impressive strides, they predominantly employ a conventional autoregressive architecture as their backbone, leaving significant room to explore effective and efficient alternatives in architectural design. Concurrently, recent studies have successfully applied discrete diffusion models to various domains, such as visual understanding and image generation, revealing their considerable potential as a promising backbone for multimodal systems. Drawing inspiration from this pioneering research, we introduce Omni-Diffusion, the first any-to-any multimodal language model built entirely on mask-based discrete diffusion models, which unifies understanding and generation across text, speech, and images. Omni-Diffusion employs a unified mask-based discrete diffusion model to directly capture the joint distribution over discrete multimodal tokens. This approach supports not only bimodal tasks but also more complex scenarios involving multiple modalities. On a diverse set of benchmarks, our method outperforms or performs on par with existing multimodal systems that process two or more modalities, highlighting the significant promise of diffusion models in powering the next generation of multimodal foundation models. Project webpage: https://omni-diffusion.github.io.
中文标题/摘要
标题:Omni-Diffusion:基于掩码离散扩散模型的统一多模态理解和生成
尽管近期的多模态大型语言模型(MLLMs)取得了显著进展,但它们主要采用传统的自回归架构作为基础,这在架构设计方面留下了探索有效且高效的替代方案的巨大空间。同时,近期的研究成功将离散扩散模型应用于视觉理解、图像生成等多个领域,揭示了其作为多模态系统潜在有力基础架构的巨大潜力。受到这些开创性研究的启发,我们提出了Omni-Diffusion,这是首个完全基于掩码离散扩散模型的任意到任意的多模态语言模型,它统一了文本、语音和图像之间的理解和生成。Omni-Diffusion采用统一的掩码离散扩散模型直接捕捉离散多模态标记的联合分布。这种方法不仅支持二模态任务,还支持涉及多种模态的更复杂场景。在一系列多样的基准测试中,我们的方法在处理两种或多种模态的现有多模态系统中表现出色或与其持平,突显了扩散模型在推动下一代多模态基础模型方面的巨大潜力。项目网页:https://omni-diffusion.github.io
Summary / 总结
Omni-Diffusion is a unified multimodal language model that uses mask-based discrete diffusion models to understand and generate across text, speech, and images. It outperforms or matches existing multimodal systems on various benchmarks, demonstrating the potential of diffusion models in multimodal tasks. This work fills a gap in architectural design by providing an alternative to autoregressive models and opens new possibilities for multimodal understanding and generation.
Omni-Diffusion 是一种使用基于掩码的离散扩散模型来理解和生成跨文本、语音和图像的统一多模态语言模型。它在多种基准测试中表现出色,超过了或与现有的多模态系统相当,展示了扩散模型在多模态任务中的潜力。这项工作解决了传统自回归架构的局限性,并为多模态基础模型开辟了新的可能性。
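To make "mask-based discrete diffusion" concrete, here is a toy sketch of iterative unmasking: start from a fully masked token sequence and, at each step, commit the positions the model is most confident about. The abstract does not describe Omni-Diffusion's sampler, so the confidence-ordered schedule is an assumption borrowed from common masked-token decoders, and `toy_denoiser` is a hypothetical stand-in for a trained network.

```python
MASK = -1


def toy_denoiser(tokens):
    """Hypothetical stand-in for a trained model.

    Returns one (predicted_token, confidence) pair per position. Confidence
    rises when neighbouring positions are already revealed.
    """
    out = []
    for i in range(len(tokens)):
        neighbours = [tokens[j] for j in (i - 1, i + 1) if 0 <= j < len(tokens)]
        revealed = sum(1 for t in neighbours if t != MASK)
        predicted = (3 * i + 5) % 10  # arbitrary deterministic "prediction"
        out.append((predicted, 0.5 + 0.2 * revealed + 0.01 * i))
    return out


def decode(length, steps):
    """Iteratively unmask a sequence, most confident positions first."""
    tokens = [MASK] * length
    per_step = -(-length // steps)  # ceiling division
    for _ in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = toy_denoiser(tokens)
        # Commit only the most confident masked positions; others stay masked
        # and are re-predicted in the next step with more context.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:per_step]:
            tokens[i] = preds[i][0]
    return tokens


sequence = decode(length=6, steps=3)
```

Unlike autoregressive decoding, several positions are filled per step and in no fixed left-to-right order. In a multimodal setting the same loop would run over a shared vocabulary of text, speech, and image tokens.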
BEVLM: Distilling Semantic Knowledge from LLMs into Bird's-Eye View Representations
Authors: Thomas Monninger, Shaoyuan Xie, Qi Alfred Chen, Sihao Ding
First: 2026-03-06T18:59:55+00:00 · Latest: 2026-03-06T18:59:55+00:00
Comments: 4 figures, 6 tables in the main paper, 32 pages in total
Abstract
The integration of Large Language Models (LLMs) into autonomous driving has attracted growing interest for their strong reasoning and semantic understanding abilities, which are essential for handling complex decision-making and long-tail scenarios. However, existing methods typically feed LLMs with tokens from multi-view and multi-frame images independently, leading to redundant computation and limited spatial consistency. This separation in visual processing hinders accurate 3D spatial reasoning and fails to maintain geometric coherence across views. On the other hand, Bird's-Eye View (BEV) representations learned from geometrically annotated tasks (e.g., object detection) provide spatial structure but lack the semantic richness of foundation vision encoders. To bridge this gap, we propose BEVLM, a framework that connects a spatially consistent and semantically distilled BEV representation with LLMs. Through extensive experiments, we show that BEVLM enables LLMs to reason more effectively in cross-view driving scenes, improving accuracy by 46%, by leveraging BEV features as unified inputs. Furthermore, by distilling semantic knowledge from LLMs into BEV representations, BEVLM significantly improves closed-loop end-to-end driving performance by 29% in safety-critical scenarios.
中文标题/摘要
标题:BEVLM:将大型语言模型的语义知识提炼为鸟瞰图表示
将大型语言模型(LLMs)集成到自动驾驶中引起了广泛关注,因为它们强大的推理和语义理解能力对于处理复杂的决策和长尾场景至关重要。然而,现有方法通常独立地将LLMs输入多视图和多帧图像的标记,导致冗余计算和空间一致性有限。这种视觉处理的分离阻碍了准确的三维空间推理,并且无法在视图之间保持几何一致性。另一方面,从几何标注任务(例如物体检测)中学习的鸟瞰图(BEV)表示提供了空间结构,但缺乏基础视觉编码器的语义丰富性。为了弥合这一差距,我们提出了一种BEVLM框架,该框架将空间一致且语义提炼的BEV表示与LLMs连接起来。通过广泛的实验,我们展示了BEVLM使LLMs在跨视图驾驶场景中推理更加有效,通过利用BEV特征作为统一输入,准确率提高了46%。此外,通过将LLMs中的语义知识提炼到BEV表示中,BEVLM在安全关键场景中将闭环端到端驾驶性能显著提高了29%。
Summary / 总结
The paper proposes BEVLM, a framework that integrates semantic knowledge from LLMs into BEV representations to enhance autonomous driving. It addresses the limitations of existing methods by using a unified BEV input for LLMs, which improves 3D spatial reasoning and maintains geometric coherence. Experimental results show a 46% increase in accuracy and a 29% improvement in closed-loop driving performance in safety-critical scenarios.
BEVLM 将 LLM 的语义知识融入 BEV 表示中以提升自动驾驶性能。它解决了现有方法的空间不一致性和语义贫乏的问题,提供了一致的空间和丰富的语义输入,从而改善了 3D 空间推理并保持了几何一致性。实验表明,BEVLM 在交叉视图驾驶场景中的准确率提高了 46%,并在安全关键场景中将闭环端到端驾驶性能提升了 29%。
Fly360: Omnidirectional Obstacle Avoidance within Drone View
Authors: Xiangkai Zhang, Dizhe Zhang, WenZhuo Cao, Zhaoliang Wan, Yingjie Niu, Lu Qi, Xu Yang, Zhiyong Liu
First: 2026-03-06T18:59:43+00:00 · Latest: 2026-03-06T18:59:43+00:00
Comments: 16 pages, 10 figures
Abstract
Obstacle avoidance in unmanned aerial vehicles (UAVs), as a fundamental capability, has gained increasing attention with the growing focus on spatial intelligence. However, current obstacle-avoidance methods mainly depend on limited field-of-view sensors and are ill-suited for UAV scenarios that require full spatial awareness when the movement direction differs from the UAV's heading. This limitation motivates us to explore omnidirectional obstacle avoidance for panoramic drones with full-view perception. We first study an underexplored problem setting in which a UAV must generate collision-free motion in environments with obstacles from arbitrary directions, and then construct a benchmark that consists of three representative flight tasks. Based on such settings, we propose Fly360, a two-stage perception-decision pipeline with a fixed random-yaw training strategy. At the perception stage, panoramic RGB observations are converted into depth maps, which serve as a robust intermediate representation. A lightweight policy network then outputs body-frame velocity commands from the depth inputs. Extensive simulation and real-world experiments demonstrate that Fly360 achieves stable omnidirectional obstacle avoidance and outperforms forward-view baselines across all tasks. Our model is available at https://zxkai.github.io/fly360/
中文标题/摘要
标题:Fly360:全景无人机全方位避障
无人机(UAV)的避障能力作为一项基本功能,随着对空间智能的关注增加而越来越受到重视。然而,当前的避障方法主要依赖于有限视野的传感器,这并不适用于无人机场景,因为当移动方向与无人机航向不同时,需要全方位的空间意识。这一限制促使我们探索全景无人机的全方位避障方法,以实现全视角感知。我们首先研究了一个未被充分探索的问题设置,即在具有来自任意方向障碍物的环境中,无人机必须生成无碰撞的运动,然后构建了一个由三个代表性飞行任务组成的基准。基于这种设置,我们提出了Fly360,这是一种两阶段感知-决策流水线,采用固定随机偏航训练策略。在感知阶段,全景RGB观察输入并转换为深度图作为稳健的中间表示。对于策略网络,它是一种轻量级网络,用于从深度输入中输出机体坐标系下的速度命令。广泛的仿真实验和实地实验表明,Fly360实现了稳定的全方位避障,并在所有任务中优于前视基准。我们的模型可在https://zxkai.github.io/fly360/ 获取。
Summary / 总结
The research aims to address the limitation of current obstacle-avoidance methods in UAVs by developing omnidirectional obstacle avoidance for panoramic drones. The method involves a two-stage perception-decision pipeline that processes panoramic RGB observations into depth maps and uses a lightweight policy network to generate body-frame velocity commands. Experiments show that Fly360 outperforms forward-view baselines in achieving stable omnidirectional obstacle avoidance across various flight tasks.
研究动机是解决当前无人机避障方法依赖于有限视野传感器的局限性,这些方法不适合需要全方位感知的场景。主要方法是采用两阶段感知决策管道,将全景RGB观察结果转换为深度图,并由轻量级策略网络输出身体坐标系下的速度指令。关键实验发现表明,Fly360实现了稳定的全方位避障,并在所有任务中优于前视基线。
SCOPE: Scene-Contextualized Incremental Few-Shot 3D Segmentation
Authors: Vishal Thengane, Zhaochong An, Tianjin Huang, Son Lam Phung, Abdesselam Bouzerdoum, Lu Yin, Na Zhao, Xiatian Zhu
Venue: CVPR 2026
First: 2026-03-06T18:59:36+00:00 · Latest: 2026-03-06T18:59:36+00:00
Comments: Accepted at CVPR 2026
Abstract
Incremental Few-Shot (IFS) segmentation aims to learn new categories over time from only a few annotations. Although widely studied in 2D, it remains underexplored for 3D point clouds. Existing methods suffer from catastrophic forgetting or fail to learn discriminative prototypes under sparse supervision, and often overlook a key cue: novel categories frequently appear as unlabelled background in base-training scenes. We introduce SCOPE (Scene-COntextualised Prototype Enrichment), a plug-and-play background-guided prototype enrichment framework that integrates with any prototype-based 3D segmentation method. After base training, a class-agnostic segmentation model extracts high-confidence pseudo-instances from background regions to build a prototype pool. When novel classes arrive with few labelled samples, relevant background prototypes are retrieved and fused with few-shot prototypes to form enriched representations without retraining the backbone or adding parameters. Experiments on ScanNet and S3DIS show that SCOPE achieves SOTA performance, improving novel-class IoU by up to 6.98% and 3.61%, and mean IoU by 2.25% and 1.70%, respectively, while maintaining low forgetting. Code is available at https://github.com/Surrey-UP-Lab/SCOPE.
中文标题/摘要
标题:SCOPE:场景上下文增量少量标注3D分割
增量少量标注(IFS)分割旨在仅从少量注释中学习新的类别。虽然在2D中广泛研究,但在3D点云中仍被严重忽视。现有方法遭受灾难性遗忘或在稀疏监督下无法学习区分性原型,且经常忽视一个关键线索:新类别经常在基础训练场景中的未标注背景中出现。我们引入了SCOPE(场景上下文原型丰富),这是一种插件式背景引导的原型丰富框架,可以与任何基于原型的3D分割方法结合使用。在基础训练后,一个类无感知的分割模型从背景区域提取高置信度的伪实例以构建原型池。当新类别到来时,带有少量标注样本,相关背景原型被检索并融合到少量标注原型中,形成丰富表示,无需重新训练骨干或增加参数。在ScanNet和S3DIS上的实验表明,SCOPE实现了SOTA性能,新类别IoU提高最多6.98%和3.61%,平均IoU分别提高2.25%和1.70%,同时保持低遗忘率。代码可在https://github.com/Surrey-UP-Lab/SCOPE获取。
Summary / 总结
SCOPE is a background-guided prototype enrichment framework for Incremental Few-Shot (IFS) 3D segmentation, which addresses the issues of catastrophic forgetting and sparse supervision. After base training, a class-agnostic segmentation model extracts high-confidence pseudo-instances from background regions to build a prototype pool. When new classes are introduced, relevant background prototypes are retrieved and fused with few-shot prototypes to form enriched representations without retraining the backbone or adding parameters. Experiments on ScanNet and S3DIS demonstrate that SCOPE achieves state-of-the-art performance, improving novel-class IoU by up to 6.98% and 3.61% and mean IoU by 2.25% and 1.70%, respectively, while maintaining low forgetting.
SCOPE 是一种背景引导的原型增强框架,用于增量少量监督 3D 分割,解决了灾难性遗忘和稀疏监督的问题。在基训练期间,它从背景区域中提取高置信度的伪实例来构建原型池。当引入新类别时,它会检索并融合相关背景原型与少量监督原型,形成增强表示。实验结果表明,SCOPE 达到了最先进的性能,新类别 IoU 提高了 6.98%,平均 IoU 提高了 2.25%,同时保持了低遗忘率。
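The retrieve-and-fuse step described in this entry can be sketched as follows. This is only an illustration under stated assumptions: cosine similarity for retrieval, a plain mean over the retrieved prototypes, and a convex combination controlled by `alpha` are all hypothetical choices, as are the names `enrich_prototype`, `top_k`, and `alpha`. The abstract only states that relevant background prototypes are retrieved and fused without adding parameters.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def enrich_prototype(few_shot, pool, top_k=2, alpha=0.7):
    """Fuse a few-shot prototype with its most similar background prototypes.

    Assumed fusion rule: average the top_k retrieved prototypes, then blend
    with the few-shot prototype using weight alpha. No learned parameters.
    """
    ranked = sorted(pool, key=lambda p: cosine(few_shot, p), reverse=True)[:top_k]
    context = [sum(values) / len(ranked) for values in zip(*ranked)]
    return [alpha * f + (1 - alpha) * c for f, c in zip(few_shot, context)]


# Hypothetical 2-D prototypes mined from background regions after base training.
background_pool = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
novel_prototype = [1.0, 0.0]
enriched = enrich_prototype(novel_prototype, background_pool, top_k=2, alpha=0.5)
```

Because the fusion is a fixed arithmetic rule over stored vectors, the backbone stays frozen, which matches the abstract's claim of no retraining and no added parameters.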
SUREON: A Benchmark and Vision-Language-Model for Surgical Reasoning
Authors: Alejandra Perez, Anita Rau, Lee White, Busisiwe Mlambo, Chinedu Nwoye, Muhammad Abdullah Jamal, Omid Mohareri
First: 2026-03-06T18:58:36+00:00 · Latest: 2026-03-06T18:58:36+00:00
Abstract
Surgeons don't just see -- they interpret. When an expert observes a surgical scene, they understand not only what instrument is being used, but why it was chosen, what risk it poses, and what comes next. Current surgical AI cannot answer such questions, largely because training data that explicitly encodes surgical reasoning is immensely difficult to annotate at scale. Yet surgical video lectures already contain exactly this -- explanations of intent, rationale, and anticipation, narrated by experts for the purpose of teaching. Though inherently noisy and unstructured, these narrations encode the reasoning that surgical AI currently lacks. We introduce SUREON, a large-scale video QA dataset that systematically harvests this training signal from surgical academic videos. SUREON defines 12 question categories covering safety assessment, decision rationale, and forecasting, and uses a multi-agent pipeline to extract and structure supervision at scale. Across 134.7K clips and 170 procedure types, SUREON yields 206.8k QA pairs and an expert-validated benchmark of 354 examples. To evaluate the extent to which this supervision translates to surgical reasoning ability, we introduce two models: SureonVLM, a vision-language model adapted through supervised fine-tuning, and SureonVLM-R1, a reasoning model trained with Group Relative Policy Optimization. Both models can answer complex questions about surgery and substantially outperform larger general-domain models, exceeding 84% accuracy on the SUREON benchmark while outperforming general-domain models on standard surgical perception tasks. Qualitative analysis of SureonVLM-R1 reveals explicit reasoning behavior, such as inferring operative intent from visual context.
中文标题/摘要
标题:SUREON:一个手术推理基准和视觉语言模型
外科医生不只是看,而是进行解释。当专家观察手术场景时,他们不仅理解正在使用的器械是什么,还理解为什么选择这种器械,它带来的风险是什么,接下来会发生什么。当前的手术AI无法回答这些问题,主要是因为大规模标注包含手术推理的训练数据极其困难。然而,手术视频讲座中已经包含了这些内容:由专家出于教学目的讲述的意图、理由和预判。尽管这些叙述本身带有噪声且缺乏结构,但它们编码了当前手术AI所缺乏的推理。我们引入了SUREON,一个大规模的视频问答数据集,系统地从手术学术视频中收集这种训练信号。SUREON定义了12个问题类别,涵盖安全评估、决策理由和预测,并使用多智能体流水线大规模地提取和结构化监督信号。基于13.47万段(134.7K)剪辑和170种手术类型,SUREON产生了20.68万(206.8K)个问答对和一个包含354个样例的专家验证基准。为了评估这种监督在多大程度上转化为手术推理能力,我们引入了两个模型:SureonVLM,通过监督微调适应的视觉语言模型,以及SureonVLM-R1,使用组相对策略优化训练的推理模型。这两个模型都能回答复杂的手术问题,并显著优于更大的通用领域模型,在SUREON基准测试中超过84%的准确率,同时在标准的手术感知任务中也优于通用领域模型。对SureonVLM-R1的定性分析显示了明确的推理行为,例如从视觉上下文推断手术意图。
Summary / 总结
SUREON is a benchmark and vision-language model for surgical reasoning, addressing the lack of explicit surgical reasoning in current AI. It uses surgical academic videos to create a large-scale video QA dataset with 12 question categories, covering safety assessment, decision rationale, and forecasting. Two models, SureonVLM and SureonVLM-R1, were trained and tested; both substantially outperform larger general-domain models and exceed 84% accuracy on the SUREON benchmark.
SUREON 是一个新的基准和视觉-语言模型,用于外科推理。它通过使用手术视频讲座中的叙述来解决当前AI中缺乏明确外科推理的问题,尽管这些叙述本身是嘈杂的,但富含推理信息。SUREON 包含12个问题类别和206,800个问答对,并通过两种方法评估模型:SureonVLM,一种经过监督微调的视觉-语言模型,和SureonVLM-R1,一种使用组相对策略优化训练的推理模型。这两种模型在SUREON基准测试中显著优于通用领域模型,准确率超过84%,并在标准的外科感知任务中表现出色。
Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
Authors: Boqiang Zhang, Lei Ke, Ruihan Yang, Qi Gao, Tianyuan Qu, Rossell Chen, Dong Yu, Leoweiliang
First: 2026-03-06T18:58:04+00:00 · Latest: 2026-03-06T18:58:04+00:00
Comments: Penguin-VL Technical Report; Code: https://github.com/tencent-ailab/Penguin-VL
Abstract
Vision Language Model (VLM) development has largely relied on scaling model size, which hinders deployment on compute-constrained mobile and edge devices such as smartphones and robots. In this work, we explore the performance limits of compact (e.g., 2B and 8B) VLMs. We challenge the prevailing practice that state-of-the-art VLMs must rely on vision encoders initialized via massive contrastive pretraining (e.g., CLIP/SigLIP). We identify an objective mismatch: contrastive learning, optimized for discrimination, enforces coarse and category-level invariances that suppress fine-grained visual cues needed for dense captioning and complex VLM reasoning. To address this issue, we present Penguin-VL, whose vision encoder is initialized from a text-only LLM. Our experiments reveal that Penguin-Encoder serves as a superior alternative to traditional contrastive pretraining, unlocking a higher degree of visual fidelity and data efficiency for multimodal understanding. Across various image and video benchmarks, Penguin-VL achieves performance comparable to leading VLMs (e.g., Qwen3-VL) in mathematical reasoning and surpasses them in tasks such as document understanding, visual knowledge, and multi-perspective video understanding. Notably, these gains are achieved with a lightweight architecture, demonstrating that improved visual representation rather than model scaling is the primary driver of performance. Our ablations show that Penguin-Encoder consistently outperforms contrastive-pretrained encoders, preserving fine-grained spatial and temporal cues that are critical for dense perception and complex reasoning. This makes it a strong drop-in alternative for compute-efficient VLMs and enables high performance in resource-constrained settings. Code: https://github.com/tencent-ailab/Penguin-VL
中文标题/摘要
标题:Penguin-VL:基于LLM的视觉编码器探索VLM的效率极限
视觉语言模型(VLM)的发展主要依赖于扩大模型规模,这阻碍了在计算受限的移动和边缘设备(如智能手机和机器人)上的部署。在本研究中,我们探索了紧凑型(例如,2B和8B)VLM的性能极限。我们挑战了当前VLM必须依赖通过大规模对比预训练(例如,CLIP/SigLIP)初始化的视觉编码器的主流做法。我们发现了一种目标不匹配:优化用于区分的对比学习,会强制执行粗略的和类别级别的不变性,抑制了密集描述和复杂VLM推理所需的细粒度视觉线索。为了解决这一问题,我们提出了Penguin-VL,其视觉编码器从纯文本的LLM初始化。我们的实验表明,Penguin-Encoder比传统的对比预训练更优越,能够为多模态理解提供更高的视觉保真度和数据效率。在各种图像和视频基准测试中,Penguin-VL在数学推理方面与领先VLM(例如,Qwen3-VL)表现相当,在文档理解、视觉知识和多视角视频理解等任务上则超越了它们。值得注意的是,这些改进是通过轻量级架构实现的,表明改进的视觉表示而非模型规模是性能提升的主要驱动力。我们的消融实验表明,Penguin-Encoder始终优于对比预训练的编码器,保留了对密集感知和复杂推理至关重要的细粒度空间和时间线索。这使其成为计算高效的VLM的强有力替代品,并在资源受限的环境中实现高性能。代码:https://github.com/tencent-ailab/Penguin-VL
Neural Signals Generate Clinical Notes in the Wild
Authors: Jathurshan Pradeepkumar, Zheng Chen, Jimeng Sun
First: 2026-01-29T13:07:30+00:00 · Latest: 2026-03-06T18:57:14+00:00
Abstract
Generating clinical reports that summarize abnormal patterns, diagnostic findings, and clinical interpretations from long-term EEG recordings remains labor-intensive. We curate a large-scale clinical EEG dataset with $9{,}922$ reports paired with approximately $11{,}000$ hours of EEG recordings from $9{,}048$ patients. We therefore develop CELM, the first clinical EEG-to-Language foundation model capable of summarizing long-duration, variable-length EEG recordings and performing end-to-end clinical report generation at multiple scales, including recording description, background activity, epileptiform abnormalities, events/seizures, and impressions. Experimental results show that, with patient history supervision, our method achieves $70\%$-$95\%$ average relative improvements in standard generation metrics (e.g., ROUGE-1 and METEOR) from $0.2$-$0.3$ to $0.4$-$0.6$. In the zero-shot setting without patient history, CELM attains generation scores in the range of $0.43$-$0.52$, compared to baselines of $0.17$-$0.26$. CELM integrates pretrained EEG foundation models with language models to enable scalable multimodal learning. We release our model and benchmark construction pipeline at https://github.com/Jathurshan0330/CELM.
中文标题/摘要
标题:真实场景下由神经信号生成临床笔记
从长时间的EEG记录中生成总结异常模式、诊断发现和临床解释的临床报告仍然是一项劳动密集型工作。我们整理了一个大规模的临床EEG数据集,包含9,922份报告,配对约11,000小时的EEG记录,来自9,048名患者。因此,我们开发了CELM,这是第一个能够总结长时间、变长EEG记录并进行端到端多尺度临床报告生成的临床EEG到语言基础模型,包括记录描述、背景活动、癫痫样异常、事件/癫痫发作和印象。实验结果表明,在患者历史监督下,我们的方法在标准生成指标(如ROUGE-1和METEOR)上实现了70%-95%的平均相对改进,从0.2-0.3提高到0.4-0.6。在没有患者历史的零样本设置中,CELM的生成得分为0.43-0.52,而基线得分为0.17-0.26。CELM将预训练的EEG基础模型与语言模型结合,以实现可扩展的多模态学习。我们在https://github.com/Jathurshan0330/CELM上发布了我们的模型和基准构建管道。
Boosting deep Reinforcement Learning using pretraining with Logical Options
Authors: Zihan Ye, Phil Chau, Raban Emunds, Jannis Blüml, Cedric Derstroff, Quentin Delfosse, Oleg Arenz, Kristian Kersting
First: 2026-03-06T18:55:15+00:00 · Latest: 2026-03-06T18:55:15+00:00
Abstract
Deep reinforcement learning agents are often misaligned, as they over-exploit early reward signals. Recently, several symbolic approaches have addressed these challenges by encoding sparse objectives along with aligned plans. However, purely symbolic architectures are complex to scale and difficult to apply to continuous settings. Hence, we propose a hybrid approach, inspired by humans' ability to acquire new skills. We use a two-stage framework that injects symbolic structure into neural-based reinforcement learning agents without sacrificing the expressivity of deep policies. Our method, called Hybrid Hierarchical RL (H^2RL), introduces a logical option-based pretraining strategy to steer the learning policy away from short-term reward loops and toward goal-directed behavior while allowing the final policy to be refined via standard environment interaction. Empirically, we show that this approach consistently improves long-horizon decision-making and yields agents that outperform strong neural, symbolic, and neuro-symbolic baselines.
中文标题/摘要
标题:使用逻辑选项预训练提升深度强化学习
深度强化学习代理往往与目标不一致,因为它们会过度利用早期的奖励信号。最近,一些符号方法通过编码稀疏目标和一致的计划来解决这些挑战。然而,纯粹的符号架构难以扩展,并且难以应用于连续环境。因此,我们提出了一种混合方法,灵感来源于人类获取新技能的能力。我们使用两阶段框架,在基于神经网络的强化学习代理中注入符号结构,而不牺牲深度策略的表达能力。我们的方法称为混合层次化强化学习(H^2RL),它引入了一种基于逻辑选项的预训练策略,以引导学习策略远离短期奖励循环,转向目标导向行为,同时允许最终策略通过标准环境交互进行细化。实验上,我们展示了这种方法在长期决策制定方面的一致改进,并产生了优于强大神经、符号和神经符号基线的代理。
Summary / 总结
The research aims to address the misalignment issue in deep reinforcement learning agents by proposing a hybrid approach that combines symbolic and neural methods. The method, Hybrid Hierarchical RL (H^2RL), uses a logical option-based pretraining strategy to guide the learning process towards goal-directed behavior while maintaining the flexibility of deep policies. Experiments demonstrate that this approach enhances long-term decision-making and outperforms other neural, symbolic, and neuro-symbolic baselines.
研究旨在通过结合符号和神经方法解决深度强化学习代理的对齐问题,提出了一种名为Hybrid Hierarchical RL (H^2RL)的混合方法。该方法使用基于逻辑选项的预训练策略引导学习过程向目标导向行为发展,同时保持深度策略的灵活性。关键实验发现是,该方法在长时决策制定方面表现出色,并优于其他神经、符号和神经-符号基线。
EgoReasoner: Learning Egocentric 4D Reasoning via Task-Adaptive Structured Thinking
Authors: Fangrui Zhu, Yunfeng Xi, Jianmo Ni, Mu Cai, Boqing Gong, Long Zhao, Chen Qu, Ian Miao, Yi Li, Cheng Zhong, Huaizu Jiang, Shwetak Patel
First: 2026-03-06T18:49:04+00:00 · Latest: 2026-03-06T18:49:04+00:00
Comments: preprint
Abstract
Egocentric video understanding is inherently complex due to the dynamic 4D nature of the environment, where camera motion and object displacements necessitate a continuous re-evaluation of spatial relations. In this work, we target a suite of under-explored egocentric 4D reasoning tasks, including fixture interaction counting, viewpoint-relative fixture location, object movement itinerary tracking, and stationary object localization, that require fundamentally different cognitive operations: spatial anchoring, temporal tracking, and duration reasoning. We observe that these structural differences make task-agnostic approaches insufficient: generic Chain-of-Thought methods lack task-appropriate reasoning primitives, and uniform reinforcement learning actively destabilizes performance on spatial tasks. To address this, we propose EgoReasoner, a two-stage framework that aligns both the reasoning scaffold and the reward signal to each task's cognitive structure. In the first stage, Task-Adaptive Thinking Templates guide the synthesis of structured CoT traces that teach the model to reason adaptively across task types via supervised fine-tuning. In the second stage, task-aware reward functions verify entity grounding, temporal alignment, and task-adaptive logical consistency, selectively strengthening each reasoning pathway via reinforcement fine-tuning with GRPO. Our 3B-parameter model, trained on only 16K samples, achieves 37.5% average accuracy on the challenging HD-EPIC benchmark, surpassing Qwen2.5-VL-7B (25.7%) by over 10 points.
中文标题/摘要
标题:EgoReasoner:通过任务自适应结构化思考学习自中心4D推理
自中心视频理解由于环境的动态4D特性而固有地复杂,其中摄像机运动和物体位移需要不断重新评估空间关系。在本工作中,我们针对一系列尚未充分探索的自中心4D推理任务,包括固定装置交互计数、视角相对固定装置位置、物体运动行程跟踪和静止物体定位,这些任务需要根本不同的认知操作:空间锚定、时间跟踪和持续时间推理。我们观察到,这些结构差异使得任务无关的方法不足:通用的链式思考方法缺乏任务适当的推理原语,而统一的强化学习会主动破坏空间任务的性能。为了解决这个问题,我们提出了EgoReasoner,这是一种两阶段框架,将推理框架和奖励信号与每个任务的认知结构对齐。在第一阶段,任务自适应思考模板指导结构化CoT轨迹的合成,通过监督微调使模型能够适应性地推理不同类型的任务。在第二阶段,任务感知的奖励函数验证实体定位、时间对齐和任务自适应逻辑一致性,通过基于GRPO的强化微调选择性地加强每条推理路径。我们的30亿参数模型,在仅使用16000个样本训练后,在具有挑战性的HD-EPIC基准测试中实现了37.5%的平均准确率,比Qwen2.5-VL-7B(25.7%)高出10个百分点以上。
Summary / 总结
EgoReasoner is a two-stage framework designed to handle egocentric 4D reasoning tasks such as fixture interaction counting and object movement tracking. It uses Task-Adaptive Thinking Templates to guide structured Chain-of-Thought reasoning and task-aware reward functions to verify logical consistency, achieving 37.5% average accuracy on the HD-EPIC benchmark, surpassing Qwen2.5-VL-7B by over 10 points.
EgoReasoner 是一个两阶段框架,旨在解决需要不同认知操作(如空间锚定、时间跟踪和持续时间推理)的自视角4D推理任务。第一阶段使用任务自适应思维模板来引导结构化CoT痕迹的合成,并进行监督微调;第二阶段采用任务自意识奖励函数验证逻辑一致性,并通过强化学习微调选择性地加强推理路径。EgoReasoner 使用30亿参数,仅在16,000个样本上进行训练,平均准确率达到37.5%,超越Qwen2.5-VL-7B 10个百分点。
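The second stage pairs task-aware rewards with GRPO, whose core idea is to score each sampled response relative to the other responses drawn for the same prompt. The sketch below uses the commonly described normalisation (reward minus group mean, divided by group standard deviation). The reward function, with its three boolean checks and weights, is a hypothetical illustration of the verifiers named in the abstract, not the paper's actual reward.

```python
from statistics import mean, pstdev


def task_reward(grounded, aligned, consistent, weights=(0.4, 0.3, 0.3)):
    """Hypothetical reward: weighted sum of passed checks.

    The three checks mirror the abstract's entity grounding, temporal
    alignment, and logical consistency; the weights are made up.
    """
    checks = (grounded, aligned, consistent)
    return sum(w for w, ok in zip(weights, checks) if ok)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one question, each checked by the verifiers.
rewards = [
    task_reward(True, True, True),
    task_reward(True, False, False),
    task_reward(True, True, False),
    task_reward(True, False, False),
]
advantages = group_relative_advantages(rewards)
```

Responses that pass more checks than their group average receive positive advantages and are reinforced; no separate value network is needed for this normalisation.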
CASA: Cross-Attention over Self-Attention for Efficient Vision-Language Fusion
Authors: Moritz Böhle, Amélie Royer, Juliette Marrie, Edouard Grave, Patrick Pérez
First: 2025-12-22T16:21:39+00:00 · Latest: 2026-03-06T18:46:27+00:00
Comments: updated with improved CA results
Abstract
Vision-language models (VLMs) are commonly trained by directly inserting image tokens from a pretrained vision encoder into the text stream of a language model. This allows text and image information to fully attend to one another within the model, but becomes rapidly costly for long multi-image conversations or streaming video applications, both in terms of memory and compute. VLMs leveraging cross-attention (CA) are an efficient alternative to token insertion as image tokens are not added to the KV cache. Despite being introduced early on, multimodal CA models are scarce in the current VLM literature and often underperform their token insertion counterparts. In this work, we reinvestigate the effectiveness of cross-attention for vision-language modeling: (i) We analyze the core differences between the cross-attention and self-attention mechanisms, (ii) we train cross-attention VLMs both from a text-only LLM and by adapting a pretrained insertion-based VLM, showing that simple cross-attention is far more competitive with token insertion than previously reported, and (iii) we demonstrate the practical advantages of cross-attention on real-time video captioning, where it naturally maintains low latency and near-constant memory cost. For samples and code, please see our project page at https://kyutai.org/casa .
中文标题/摘要
标题:CASA:自注意力上的交叉注意力高效视觉-语言融合
视觉-语言模型(VLMs)通常通过将预训练视觉编码器中的图像令牌直接插入语言模型的文字流中进行训练。这使得文本和图像信息能够在模型内部完全相互注意,但在长多图像对话或流式视频应用中,这在内存和计算方面变得非常昂贵。利用交叉注意力(CA)的VLM是令牌插入的高效替代方案,因为图像令牌不会被添加到KV缓存中。尽管早期就引入了多模态CA模型,但在当前的VLM文献中仍然很少见,并且通常不如基于令牌插入的模型表现好。在本文中,我们重新研究了交叉注意力在视觉-语言建模中的有效性:(i) 我们分析了交叉注意力和自注意力机制的核心差异,(ii) 我们从仅文本的大语言模型和通过调整预训练的基于插入的VLM训练交叉注意力VLMs,表明简单的交叉注意力比之前报告的更具有竞争力,(iii) 我们展示了交叉注意力在实时视频字幕中的实际优势,它自然地保持了低延迟和近恒定的内存成本。有关样本和代码,请参见我们的项目页面 https://kyutai.org/casa 。
Summary / 总结
The research aims to improve the efficiency of vision-language models (VLMs) by revisiting cross-attention (CA), which avoids adding image tokens to the KV cache and thus reduces the memory and compute costs of token insertion. The study analyzes the core differences between cross-attention and self-attention and trains CA-based VLMs both from a text-only LLM and by adapting a pretrained insertion-based VLM, showing that simple cross-attention is far more competitive with token insertion than previously reported. It also demonstrates practical benefits for real-time video captioning, where cross-attention maintains low latency and near-constant memory cost.
该论文通过与自注意力机制的比较,探讨了交叉注意力(CA)在视觉语言模型(VLMs)中的有效性。研究表明,基于CA的模型在实时视频字幕生成中比插入方法更具竞争力,尤其是在保持低延迟和恒定内存使用方面。作者从纯文本语言模型和预训练插入模型出发训练CA模型,证明了简单的交叉注意力是VLMs中一种可行且高效的替代方法。
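A back-of-the-envelope calculation shows why keeping image tokens out of the KV cache matters for streaming video. The sketch below is an assumption-laden estimate, not a measurement from the paper: the model dimensions, tokens per frame, and fp16 storage are hypothetical, and it only counts the self-attention KV cache (the image features for the current frame still have to be held somewhere for cross-attention).

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Size of a self-attention KV cache; the leading 2 covers keys and values."""
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_value


def cache_after_frames(frames, text_tokens_per_frame, image_tokens_per_frame,
                       insert_images, **model_dims):
    """Cache size after processing a number of frames.

    With token insertion, image tokens join the text stream and are cached.
    With cross-attention, only text tokens enter the self-attention cache.
    """
    per_frame = text_tokens_per_frame
    if insert_images:
        per_frame += image_tokens_per_frame
    return kv_cache_bytes(frames * per_frame, **model_dims)


# Hypothetical model and stream settings.
dims = dict(num_layers=32, num_kv_heads=8, head_dim=128)
insertion = cache_after_frames(100, 10, 256, insert_images=True, **dims)
cross_attn = cache_after_frames(100, 10, 256, insert_images=False, **dims)
ratio = insertion / cross_attn
```

Under these made-up settings the insertion-based cache is dominated by image tokens, so the gap widens with every frame, whereas the cross-attention cache grows only with the generated text.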
Causal Interpretation of Neural Network Computations with Contribution Decomposition
Authors: Joshua Brendan Melander, Zaki Alaoui, Shenghua Liu, Surya Ganguli, Stephen A. Baccus
Venue: ICLR 2026 poster
First: 2026-03-06T18:46:06+00:00 · Latest: 2026-03-06T18:46:06+00:00
Comments: 32 pages, 19 figures. ICLR 2026 poster
Abstract
Understanding how neural networks transform inputs into outputs is crucial for interpreting and manipulating their behavior. Most existing approaches analyze internal representations by identifying hidden-layer activation patterns correlated with human-interpretable concepts. Here we take a direct approach to examine how hidden neurons act to drive network outputs. We introduce CODEC (Contribution Decomposition), a method that uses sparse autoencoders to decompose network behavior into sparse motifs of hidden-neuron contributions, revealing causal processes that cannot be determined by analyzing activations alone. Applying CODEC to benchmark image-classification networks, we find that contributions grow in sparsity and dimensionality across layers and, unexpectedly, that they progressively decorrelate positive and negative effects on network outputs. We further show that decomposing contributions into sparse modes enables greater control and interpretation of intermediate layers, supporting both causal manipulations of network output and human-interpretable visualizations of distinct image components that combine to drive that output. Finally, by analyzing state-of-the-art models of neural activity in the vertebrate retina, we demonstrate that CODEC uncovers combinatorial actions of model interneurons and identifies the sources of dynamic receptive fields. Overall, CODEC provides a rich and interpretable framework for understanding how nonlinear computations evolve across hierarchical layers, establishing contribution modes as an informative unit of analysis for mechanistic insights into artificial neural networks.
中文标题/摘要
标题:神经网络计算的因果解释与贡献分解
理解神经网络如何将输入转化为输出对于解释和操控其行为至关重要。现有的大多数方法通过识别与人类可解释的概念相关联的隐藏层激活模式来分析内部表示。在这里,我们采取直接的方法来研究隐藏神经元如何驱动网络输出。我们引入了CODEC(贡献分解)方法,该方法使用稀疏自编码器将网络行为分解为隐藏神经元贡献的稀疏模式,揭示了仅通过分析激活模式无法确定的因果过程。将CODEC应用于基准图像分类网络,我们发现贡献在各层中变得越来越稀疏和高维,并且出乎意料地发现它们逐渐去相关了对网络输出的正向和负向影响。我们进一步表明,将贡献分解为稀疏模式能够更好地控制和解释中间层,支持对网络输出的因果操控以及对驱动该输出的不同图像组件的人类可解释可视化。最后,通过对脊椎动物视网膜神经活动的最新模型进行分析,我们证明CODEC揭示了模型中间神经元的组合作用,并确定了动态感受野的来源。总体而言,CODEC提供了一种丰富的可解释框架,用于理解非线性计算如何在分层层中演变,并建立了贡献模式作为机械洞察人工神经网络的分析单位。
Summary / 总结
The work examines how hidden neurons causally drive network outputs, rather than which activations correlate with human-interpretable concepts. Its method, CODEC, uses sparse autoencoders to decompose network behavior into sparse motifs of hidden-neuron contributions, revealing causal processes that activations alone do not show. In benchmark image-classification networks, contributions grow in sparsity and dimensionality across layers and progressively decorrelate positive and negative effects on network outputs. The sparse modes support causal manipulation of outputs and human-interpretable visualizations of image components, and, applied to models of the vertebrate retina, CODEC uncovers combinatorial actions of model interneurons and the sources of dynamic receptive fields.
该研究旨在通过直接分析隐藏神经元的贡献来理解神经网络如何处理输入。作者引入了CODEC方法,使用稀疏自编码器将网络行为分解为隐藏神经元贡献的稀疏模式。关键发现包括贡献在各层中的稀疏性和维度增加,以及正负效应对网络输出的逐步去相关。该方法还使对中间层的控制和解释更加容易,支持对网络输出的因果操纵和对图像组件的可解释可视化。此外,CODEC还分析了脊椎动物视网膜神经活动模型,揭示了中间神经元的组合作用,并确定了动态感受野的来源。
Measuring AI R&D Automation
Authors: Alan Chan, Ranay Padarath, Joe Kwon, Hilary Greaves, Markus Anderljung
First: 2026-03-04T12:36:13+00:00 · Latest: 2026-03-06T18:41:24+00:00
Abstract
The automation of AI R&D (AIRDA) could have significant implications, but its extent and ultimate effects remain uncertain. We need empirical data to resolve these uncertainties, but existing data (primarily capability benchmarks) may not reflect real-world automation or capture its broader consequences, such as whether AIRDA accelerates capabilities more than safety progress or whether our ability to oversee AI R&D can keep pace with its acceleration. To address these gaps, this work proposes metrics to track the extent of AIRDA and its effects on AI progress and oversight. The metrics span dimensions such as capital share of AI R&D spending, researcher time allocation, and AI subversion incidents, and could help decision makers understand the potential consequences of AIRDA, implement appropriate safety measures, and maintain awareness of the pace of AI development. We recommend that companies and third parties (e.g. non-profit research organisations) start to track these metrics, and that governments support these efforts.
中文标题/摘要
标题:测量AI研发自动化
AI研发自动化(AIRDA)可能具有重大影响,但其程度及其最终效果尚不确定。我们需要实证数据来解决这些不确定性,但现有的数据(主要是能力基准)可能无法反映实际的自动化程度或捕捉其更广泛的影响,例如AIRDA是否比安全进展更快地加速能力,或者我们是否能够跟上AI研发的加速步伐。为解决这些差距,本研究提出了用于跟踪AIRDA程度及其对AI进展和监管影响的指标。这些指标涵盖了AI研发支出的资本份额、研究人员时间分配和AI反制事件等维度,有助于决策者了解AIRDA的潜在后果,实施适当的保障措施,并保持对AI发展速度的意识。我们建议公司和第三方(如非营利研究组织)开始跟踪这些指标,并建议政府支持这些努力。
Summary / 总结
This work proposes metrics for tracking the extent of AI R&D automation (AIRDA) and its effects on AI progress and oversight, since existing data (primarily capability benchmarks) may not reflect real-world automation or its broader consequences. The metrics span dimensions such as capital share of AI R&D spending, researcher time allocation, and AI subversion incidents. The authors argue that these metrics could help decision makers understand the potential consequences of AIRDA, implement appropriate safety measures, and stay aware of the pace of AI development. They recommend that companies and third parties start tracking the metrics and that governments support these efforts.
本研究旨在衡量AI R&D自动化程度及其对AI进展和监管的影响,解决现有实证数据中的不确定性。研究提出了涵盖AI R&D支出资本份额、研究人员时间分配和AI反制事件的指标。关键发现表明,这些指标可以帮助决策者了解AI R&D自动化的潜在后果,实施适当的安全部署,并保持对AI发展速度的意识。建议公司和第三方开始跟踪这些指标,政府应提供支持。
Conditionally Site-Independent Neural Evolution of Antibody Sequences
Authors: Stephen Zhewen Lu, Aakarsh Vermani, Kohei Sanno, Jiarui Lu, Frederick A Matsen, Milind Jagota, Yun S. Song
First: 2026-02-21T23:23:30+00:00 · Latest: 2026-03-06T18:40:20+00:00
Comments: 24 pages, 14 figures. Currently under review
Abstract
Common deep learning approaches for antibody engineering focus on modeling the marginal distribution of sequences. By treating sequences as independent samples, however, these methods overlook affinity maturation as a rich and largely untapped source of information about the evolutionary process by which antibodies explore the underlying fitness landscape. In contrast, classical phylogenetic models explicitly represent evolutionary dynamics but lack the expressivity to capture complex epistatic interactions. We bridge this gap with CoSiNE, a continuous-time Markov chain parameterized by a deep neural network. Mathematically, we prove that CoSiNE provides a first-order approximation to the intractable sequential point mutation process, capturing epistatic effects with an error bound that is quadratic in branch length. Empirically, CoSiNE outperforms state-of-the-art language models in zero-shot variant effect prediction by explicitly disentangling selection from context-dependent somatic hypermutation. Finally, we introduce Guided Gillespie, a classifier-guided sampling scheme that steers CoSiNE at inference time, enabling efficient optimization of antibody binding affinity toward specific antigens.
中文标题/摘要
标题:条件独立的抗体序列神经进化方法
抗体工程中的常见深度学习方法侧重于建模序列的边缘分布。然而,通过将序列视为独立样本,这些方法忽略了亲和力成熟作为进化过程中抗体探索潜在适应度景观的重要且未充分利用的信息来源。相比之下,经典的系统发生模型明确表示了进化动力学,但缺乏捕捉复杂表型相互作用的能力。我们通过CoSiNE填补了这一空白,CoSiNE是一种由深度神经网络参数化的连续时间马尔可夫链。从数学上讲,我们证明CoSiNE提供了难以处理的点突变过程的一阶近似,能够捕捉表型效应,误差界为分支长度的平方。从经验上讲,CoSiNE在零样本变体效应预测中优于最先进的语言模型,通过明确分离选择与上下文依赖的体细胞高频突变。最后,我们引入了引导吉尔利西方法,这是一种分类器引导的采样方案,在推理时引导CoSiNE,从而实现抗体结合亲和力的高效优化,以特定抗原为目标。
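A note on the CoSiNE entry above: the "sequential point mutation process" that the abstract calls intractable to evaluate exactly can still be simulated with the standard Gillespie algorithm. The following is a minimal sketch of that textbook procedure along one branch, not the paper's Guided Gillespie scheme. `toy_rates` is a hypothetical stand-in for a learned rate model, and every name here is an assumption made for illustration.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_rates(seq):
    """Hypothetical rate model (assumption, not from the paper).

    Returns {(site, new_residue): rate} for every single-site substitution.
    All rates are equal here; a learned model would make them depend on
    the full current sequence.
    """
    return {(i, a): 0.01 for i, old in enumerate(seq) for a in ALPHABET if a != old}


def gillespie(seq, branch_length, rate_fn=toy_rates, rng=None):
    """Simulate point mutations on `seq` for `branch_length` units of time."""
    if rng is None:
        rng = random.Random(0)
    seq = list(seq)
    t = 0.0
    while True:
        rates = rate_fn(seq)               # rates depend on the current state
        total = sum(rates.values())
        if total <= 0:
            break                          # no mutation is possible
        t += rng.expovariate(total)        # exponential waiting time
        if t > branch_length:
            break                          # next event falls beyond the branch
        events = list(rates)
        weights = [rates[e] for e in events]
        site, residue = rng.choices(events, weights=weights)[0]
        seq[site] = residue                # apply the sampled substitution
    return "".join(seq)
```

Because the rates are recomputed after every substitution, each step can depend on the whole current sequence. This is the exact process that, according to the abstract, CoSiNE approximates to first order.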
ContextBench: Modifying Contexts for Targeted Latent Activation
Authors: Robert Graham, Edward Stevinson, Leo Richter, Alexander Chia, Joseph Miller, Joseph Isaac Bloom
Venue: ICLR 2026
First: 2025-06-15T16:54:09+00:00 · Latest: 2026-03-06T18:37:24+00:00
Comments: Published at ICLR 2026
Abstract
Identifying inputs that trigger specific behaviours or latent features in language models could have a wide range of safety use cases. We investigate a class of methods capable of generating targeted, linguistically fluent inputs that activate specific latent features or elicit model behaviours. We formalise this approach as context modification and present ContextBench -- a benchmark with tasks assessing core method capabilities and potential safety applications. Our evaluation framework measures both elicitation strength (activation of latent features or behaviours) and linguistic fluency, highlighting how current state-of-the-art methods struggle to balance these objectives. We enhance Evolutionary Prompt Optimisation (EPO) with LLM-assistance and diffusion model inpainting, and demonstrate that these variants achieve state-of-the-art performance in balancing elicitation effectiveness and fluency.
中文标题/摘要
标题:ContextBench:修改上下文以针对激活目标潜在特征
识别能够触发语言模型特定行为或潜在特征的输入可能具有广泛的安全应用案例。我们研究了一类能够生成目标明确、语言流畅的输入的方法,这些输入可以激活特定的潜在特征或引发模型行为。我们将这种方法形式化为上下文修改,并介绍了ContextBench——一个基准测试,用于评估核心方法能力和潜在的安全应用。我们的评估框架衡量了引发强度(激活潜在特征或行为)和语言流畅性,突显了当前最先进的方法在平衡这些目标方面面临的挑战。我们通过LLM辅助和扩散模型补全改进了进化提示优化(EPO),并证明这些变体在平衡引发效果和流畅性方面达到了最先进的性能。
Summary / 总结
The research aims to identify inputs that can trigger specific behaviors or latent features in language models for safety applications. The study introduces ContextBench, a benchmark for evaluating context modification methods, which assesses both the elicitation strength and linguistic fluency of generated inputs. The evaluation shows that current state-of-the-art methods struggle to balance these objectives. The authors enhance Evolutionary Prompt Optimisation with LLM-assistance and diffusion model inpainting, achieving state-of-the-art performance in balancing elicitation effectiveness and fluency.
研究旨在识别能够触发语言模型特定行为或潜在特征的输入,以应用于安全性场景。研究引入了ContextBench基准,用于评估上下文修改方法的激发强度和语言流畅性。研究通过LLM辅助和扩散模型补丁增强了进化提示优化方法,展示了这些方法在平衡激发效果和流畅性方面达到了最先进的性能。
PepEDiff: Zero-Shot Peptide Binder Design via Protein Embedding Diffusion
Authors: Po-Yu Liang, Tibo Duran, Jun Bai
First: 2026-01-19T19:07:32+00:00 · Latest: 2026-03-06T18:34:27+00:00
Abstract
We present PepEDiff, a novel peptide binder generator that designs binding sequences given a target receptor protein sequence and its pocket residues. Peptide binder generation is critical in therapeutic and biochemical applications, yet many existing methods rely heavily on intermediate structure prediction, adding complexity and limiting sequence diversity. Our approach departs from this paradigm by generating binder sequences directly in a continuous latent space derived from a pretrained protein embedding model, without relying on predicted structures, thereby improving structural and sequence diversity. To encourage the model to capture binding-relevant features rather than memorizing known sequences, we perform latent-space exploration and diffusion-based sampling, enabling the generation of peptides beyond the limited distribution of known binders. This zero-shot generative strategy leverages the global protein embedding manifold as a semantic prior, allowing the model to propose novel peptide sequences in previously unseen regions of the protein space. We evaluate PepEDiff on TIGIT, a challenging target with a large, flat protein-protein interaction interface that lacks a druggable pocket. Despite its simplicity, our method outperforms state-of-the-art approaches across benchmark tests and in the TIGIT case study, demonstrating its potential as a general, structure-free framework for zero-shot peptide binder design. The code for this research is available at GitHub: https://github.com/LabJunBMI/PepEDiff-An-Peptide-binder-Embedding-Diffusion-Model
中文标题/摘要
标题:PepEDiff:基于蛋白质嵌入扩散的零样本肽结合体设计
我们提出了PepEDiff,一种新颖的肽结合体生成器,给定目标受体蛋白序列及其口袋残基,设计结合序列。肽结合体生成在治疗和生物化学应用中至关重要,但许多现有方法依赖于中间结构预测,增加了复杂性并限制了序列多样性。我们的方法通过从预训练的蛋白质嵌入模型中派生的连续潜在空间直接生成结合序列,而不依赖于预测的结构,从而提高了结构和序列多样性。为了鼓励模型捕获与结合相关的特征而不是记忆已知序列,我们进行了潜在空间探索和基于扩散的采样,使生成的肽超出已知结合体的分布范围。这种零样本生成策略利用全局蛋白质嵌入流形作为语义先验,使模型能够在蛋白质空间中未见过的区域提出新的肽序列。我们在TIGIT上评估了PepEDiff,这是一个具有挑战性的目标,其蛋白-蛋白相互作用界面大而平坦,缺乏可成药口袋。尽管方法简单,但在基准测试和TIGIT案例研究中,我们的方法均优于最先进的方法,证明了其作为零样本肽结合体设计的一般、无结构框架的潜力。此研究的代码可在GitHub上获得:https://github.com/LabJunBMI/PepEDiff-An-Peptide-binder-Embedding-Diffusion-Model
Summary / 总结
PepEDiff is a novel peptide binder generator that designs binding sequences for a target receptor protein without relying on intermediate structure prediction, thereby enhancing structural and sequence diversity. By leveraging a pretrained protein embedding model and performing latent-space exploration and diffusion-based sampling, PepEDiff generates peptides beyond the known distribution. The method outperforms existing approaches in benchmark tests and on the challenging TIGIT target, showcasing its potential as a general, structure-free framework for peptide binder design.
PepEDiff 是一种新颖的肽结合体生成器,可以根据给定的目标受体蛋白质序列及其口袋残基设计结合序列,而不依赖于中间结构预测。该方法通过在从预训练蛋白质嵌入模型派生的连续潜在空间中直接生成结合序列,提高了结构和序列多样性。该方法在基准测试和 TIGIT 案例研究中均优于最先进的方法,展示了其作为通用的、无结构框架的零样本肽结合体设计的潜力。
LiveSense: A Real-Time Wi-Fi Sensing Platform for Range-Doppler on COTS Laptop
Authors: Jessica Sanson, Rahul C. Shah, Maximilian Pinaroc, Cagri Tanriover, Valerio Frascolla
First: 2026-03-06T18:33:14+00:00 · Latest: 2026-03-06T18:33:14+00:00
Abstract
We present LiveSense, a cross-platform system that transforms a commercial off-the-shelf (COTS) Wi-Fi Network Interface Card (NIC) on a laptop into a centimeter-level Range-Doppler sensor while preserving simultaneous communication capability. The laptops are equipped with COTS Intel AX211 (Wi-Fi 6E) or Intel BE201 (Wi-Fi 7) NICs. LiveSense can (i) extract fully synchronized channel state information (CSI) at >= 40 Hz, (ii) perform time-phase alignment and self-interference cancellation on-device, and (iii) provide a real-time stream of range, Doppler, subcarrier magnitude/phase and annotated video frames to a Python/Qt Graphical User Interface (GUI). The demo will showcase the ability to detect (i) distance and radial velocity of attendees within a few meters of the device, (ii) micro-motion (respiration), and (iii) hand-gesture ranging. To the best of our knowledge, this is the first-ever demo to obtain accurate range information of targets from commercial Wi-Fi, despite the limited 160 MHz bandwidth.
中文标题/摘要
标题:LiveSense:一种基于商用笔记本电脑COTS Wi-Fi网卡的实时雷达-Doppler平台
我们介绍了LiveSense - 一种跨平台技术,能够将商用现成(COTS)Wi-Fi网络接口卡(NIC)转换为厘米级雷达-Doppler传感器,同时保持同时通信能力。笔记本电脑配备了COTS Intel AX211(Wi-Fi 6E)或Intel BE201(Wi-Fi 7)NIC。LiveSense可以(i)以>=40 Hz的频率提取完全同步的信道状态信息(CSI),(ii)在设备上执行时间-相位对齐和自干扰消除,以及(iii)向Python/Qt图形用户界面(GUI)提供实时的范围、Doppler、子载波幅度/相位和标注视频帧流。演示将展示LiveSense检测(i)设备几米范围内参会者的距离和径向速度,(ii)微运动(呼吸),以及(iii)手势测距的能力。据我们所知,这是首次从商用Wi-Fi中获得目标准确距离信息的演示,尽管其带宽仅为160 MHz。
Summary / 总结
LiveSense transforms a COTS Wi-Fi NIC on a laptop into a centimeter-level Range-Doppler sensor, extracting fully synchronized channel state information at 40 Hz or higher while preserving communication. It performs time-phase alignment and self-interference cancellation on-device and streams range, Doppler, subcarrier magnitude/phase, and annotated video frames to a GUI in real time. The demo targets three capabilities: distance and radial velocity of people within a few meters, micro-motion (respiration), and hand-gesture ranging. The authors believe it is the first demo to obtain accurate range information from commercial Wi-Fi despite the limited 160 MHz bandwidth.
LiveSense 将商用笔记本电脑中的 Wi-Fi 网络接口卡转换为厘米级的 Range-Doppler 传感器,能够以 40 Hz 或更高的频率实时提取信道状态信息。它在设备上执行时间相位对齐和自干扰消除,并将范围、多普勒、子载波数据实时传输到图形用户界面。关键发现包括准确检测距离、径向速度以及微运动如呼吸和手势,展示了商用 Wi-Fi 在传感应用中的潜力,尽管带宽有限。
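A note on the LiveSense entry above: the remark about the "limited 160 MHz bandwidth" can be made concrete with the textbook monostatic radar relation ΔR = c / (2B), together with the Nyquist limit on Doppler for a given CSI rate. The sketch below is a back-of-the-envelope calculation under those textbook assumptions, not LiveSense's processing chain. The function names are ours.

```python
C = 299_792_458.0  # speed of light in m/s


def range_resolution_m(bandwidth_hz):
    """Textbook monostatic range resolution: c / (2 * B)."""
    return C / (2.0 * bandwidth_hz)


def max_unambiguous_doppler_hz(sample_rate_hz):
    """Nyquist limit: Doppler shifts beyond +/- rate/2 alias."""
    return sample_rate_hz / 2.0


print(round(range_resolution_m(160e6), 3))   # 0.937
print(max_unambiguous_doppler_hz(40.0))      # 20.0
```

The bandwidth-limited range bin is therefore just under a meter. Centimeter-level accuracy presumably relies on processing beyond a plain range FFT, which the abstract does not detail.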
NEGATE: Constrained Semantic Guidance for Linguistic Negation in Text-to-Video Diffusion
Authors: Taewon Kang, Ming C. Lin
First: 2026-03-06T18:21:49+00:00 · Latest: 2026-03-06T18:21:49+00:00
Comments: 50 pages, 32 figures
Abstract
Negation is a fundamental linguistic operator, yet it remains inadequately modeled in diffusion-based generative systems. In this work, we present a formal treatment of linguistic negation in diffusion-based generative models by modeling it as a structured feasibility constraint on semantic guidance within diffusion dynamics. Rather than introducing heuristics or retraining model parameters, we reinterpret classifier-free guidance as defining a semantic update direction and enforce negation by projecting the update onto a convex constraint set derived from linguistic structure. This novel formulation provides a unified framework for handling diverse negation phenomena, including object absence, graded non-inversion semantics, multi-negation composition, and scope-sensitive disambiguation. Our approach is training-free, compatible with pretrained diffusion backbones, and naturally extends from image generation to temporally evolving video trajectories. In addition, we introduce a structured negation-centric benchmark suite that isolates distinct linguistic failure modes in generative systems, to further research in this area. Experiments demonstrate that our method achieves robust negation compliance while preserving visual fidelity and structural coherence, establishing the first unified formulation of linguistic negation in diffusion-based generative models beyond representation-level evaluation.
中文标题/摘要
标题:NEGATE:基于文本到视频扩散的受限语义指导语言否定处理
否定是基本的语义操作符,但在基于扩散的生成系统中仍然没有得到充分建模。在本文中,我们通过将否定建模为扩散动力学中语义指导的结构化可行性约束,为基于扩散的生成模型提供了一种形式化的处理方法。我们没有引入启发式方法或重新训练模型参数,而是重新解释无分类器引导作为定义语义更新方向,并通过从语言结构中导出的凸约束集投影更新来强制执行否定。这种新颖的公式化提供了一个统一框架,用于处理各种否定现象,包括对象缺席、等级非反转语义、多重否定组合和范围敏感的消歧。我们的方法是无需训练的,与预训练的扩散主干兼容,并自然地从图像生成扩展到时间演变的视频轨迹。此外,我们引入了一个结构化否定中心基准套件,以隔离生成系统中的不同语言失败模式,进一步推动该领域的研究。实验表明,我们的方法在保持视觉保真度和结构连贯性的同时实现了稳健的否定一致性,建立了扩散生成模型中语言否定的第一个统一公式,超越了表示级评估。
Summary / 总结
This paper addresses the inadequacy of modeling negation in diffusion-based generative systems by proposing a structured feasibility constraint on semantic guidance. The method reinterprets classifier-free guidance as a semantic update direction and enforces negation through projection onto a convex constraint set. Experiments show that the approach maintains visual fidelity and structural coherence while achieving robust negation compliance, covering diverse negation phenomena such as object absence and multi-negation composition.
该研究通过提出一种结构化的语义指导可行性约束来解决扩散生成模型中对否定的建模不足问题。方法重新解释了无分类器引导,并通过从语言结构中导出的凸约束集来强制执行否定。实验表明,该方法在保持视觉保真度和结构连贯性的同时实现了稳健的否定一致性,提供了一种统一的框架来处理各种否定现象。
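A note on the NEGATE entry above: the idea of treating classifier-free guidance as an update direction and projecting it onto a convex set can be illustrated with the simplest convex set, a half-space. The sketch below is only an illustration under that assumption. The abstract does not specify the paper's constraint set, and `eps_negated` (a noise prediction conditioned on the negated concept) and all function names are hypothetical.

```python
import numpy as np


def project_halfspace(d, n):
    """Euclidean projection of d onto the half-space {x : <x, n> <= 0}."""
    nn = float(np.vdot(n, n))
    if nn == 0.0:
        return d.copy()                    # no constraint direction given
    overlap = max(0.0, float(np.vdot(d, n)))
    return d - (overlap / nn) * n          # remove only the violating component


def guided_noise(eps_uncond, eps_cond, eps_negated, scale=7.5):
    """Classifier-free guidance with the update direction projected first.

    Standard CFG is eps_uncond + scale * (eps_cond - eps_uncond). Here the
    direction is projected so it has no positive component along the
    (assumed) direction toward the negated concept.
    """
    d = eps_cond - eps_uncond              # semantic update direction
    n = eps_negated - eps_uncond           # assumed direction of the negated concept
    return eps_uncond + scale * project_halfspace(d, n)
```

A direction that already satisfies the constraint is returned unchanged, so in this sketch the projection only intervenes when the guidance would move toward the negated concept.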
Spatial Calibration of Diffuse LiDARs
Authors: Nikhil Behari, Ramesh Raskar
First: 2026-03-06T18:18:07+00:00 · Latest: 2026-03-06T18:18:07+00:00
Abstract
Diffuse direct time-of-flight LiDARs report per-pixel depth histograms formed by aggregating photon returns over a wide instantaneous field of view, violating the single-ray assumption behind standard LiDAR-RGB calibration. We present a simple spatial calibration procedure that estimates, for each diffuse LiDAR pixel, its footprint (effective support region) and relative spatial sensitivity in a co-located RGB image plane. Using a scanned retroreflective patch with background subtraction, we recover per-pixel response maps that provide an explicit LiDAR-to-RGB correspondence for cross-modal alignment and fusion. We demonstrate the method on the ams OSRAM TMF8828.
中文标题/摘要
标题:漫反射LiDAR的空间校准
漫反射直接飞行时间LiDAR报告由广泛瞬时视场内的光子返回聚合形成的每个像素的深度直方图,违反了标准LiDAR-RGB校准背后的单束光假设。我们提出了一种简单的空间校准方法,用于估计每个漫反射LiDAR像素的有效支持区域(脚印)及其相对于共定位RGB图像平面的相对空间灵敏度。使用扫描的反光板并进行背景减法,我们恢复了每个像素的响应图,提供了LiDAR到RGB的显式对应关系,用于跨模态对齐和融合。我们在ams OSRAM TMF8828上演示了该方法。
Summary / 总结
The research aims to address the calibration challenge for diffuse LiDARs, which report depth histograms instead of single-ray distances. The method involves estimating the footprint and relative spatial sensitivity of each LiDAR pixel in a co-located RGB image plane. By using a retroreflective patch and background subtraction, the authors recover per-pixel response maps, enabling accurate cross-modal alignment and fusion. The method was validated on the ams OSRAM TMF8828 LiDAR system.
研究旨在解决散射LiDAR的校准问题,这些LiDAR报告的是深度直方图而不是单射线距离。方法包括估计每个LiDAR像素在共定位RGB图像平面的有效支持区域及其相对空间灵敏度。通过使用反射板并进行背景减法,作者恢复了每个像素的响应图,从而实现了跨模态对齐和融合。该方法在ams OSRAM TMF8828 LiDAR系统上进行了验证。
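A note on the diffuse LiDAR entry above: the scan-and-subtract procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. Each LiDAR pixel's histogram is assumed to be already reduced to one scalar return intensity per patch position, the patch is assumed to visit a regular H x W grid in the RGB image plane, and the 10% footprint threshold is an arbitrary choice.

```python
import numpy as np


def response_maps(scan, background, grid_shape, footprint_frac=0.1):
    """Estimate per-pixel relative sensitivity maps and footprints.

    scan:       (H*W, P) return intensity per patch position and LiDAR pixel,
                patch positions in row-major order over the H x W grid.
    background: (P,) intensity per LiDAR pixel with no patch in the scene.
    Returns (rel, footprint), both shaped (P, H, W).
    """
    h, w = grid_shape
    # Background subtraction; negative residuals are treated as noise.
    signal = np.clip(np.asarray(scan, dtype=float) - background, 0.0, None)
    maps = signal.T.reshape(-1, h, w)                 # (P, H, W)
    peak = maps.max(axis=(1, 2), keepdims=True)
    # Normalize each pixel's map to a peak of 1 (relative sensitivity).
    rel = np.divide(maps, peak, out=np.zeros_like(maps), where=peak > 0)
    footprint = rel >= footprint_frac                 # effective support region
    return rel, footprint
```

Each `rel[p]` is then a map over RGB grid positions for LiDAR pixel `p`, which is the kind of explicit LiDAR-to-RGB correspondence the abstract describes.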
AV-Unified: A Unified Framework for Audio-visual Scene Understanding
Authors: Guangyao Li, Xin Wang, Wenwu Zhu
First: 2026-03-06T18:16:30+00:00 · Latest: 2026-03-06T18:16:30+00:00
Comments: Accepted by IEEE Transactions on Multimedia (TMM)
Abstract
When humans perceive the world, they naturally integrate multiple audio-visual tasks within dynamic, real-world scenes. However, current works such as event localization, parsing, segmentation and question answering are mostly explored individually, making it challenging to comprehensively understand complex audio-visual scenes and explore inter-task relationships. Hence, we propose AV-Unified, a unified framework that enables joint learning across a wide range of audio-visual scene understanding tasks. AV-Unified standardizes the diverse input-output formats of each task and incorporates a multi-scale spatiotemporal perception network to effectively capture audio-visual associations. Specifically, we unify the inputs and outputs of all supported tasks by converting them into sequences of discrete tokens, establishing a shared representation that allows a single architecture to be trained jointly across heterogeneous datasets. Considering the varying temporal granularity of audio-visual events, a multi-scale temporal perception module is designed to capture key cues. Meanwhile, to overcome the lack of auditory supervision in the visual domain, we design a cross-modal guidance-based spatial perception module that models spatial audio-visual associations. Furthermore, task-specific text prompts are employed to enhance the model's adaptability and task-awareness. Extensive experiments on benchmark datasets (e.g., AVE, LLP, MUSIC-AVQA, VGG-SS and AVS) demonstrate the effectiveness of AV-Unified across temporal, spatial, and spatiotemporal tasks.
中文标题/摘要
标题:AV-统一:视听场景理解的统一框架
当人类感知世界时,他们自然会将多种视听任务整合到动态的真实世界场景中。然而,当前的工作(如事件定位、解析、分割和问答)大多单独探索,这使得全面理解复杂的视听场景和探索任务间的关系变得具有挑战性。因此,我们提出了**AV-统一**,一种统一框架,能够跨多种视听场景理解任务进行联合学习。AV-统一标准化了每个任务的多样输入输出格式,并结合多尺度时空感知网络以有效捕捉视听关联。具体来说,我们通过将所有支持任务的输入和输出统一为离散令牌序列,建立共享表示,使得单一架构能够在异构数据集上联合训练。考虑到视听事件的时间粒度差异,我们设计了多尺度时间感知模块以捕捉关键线索。同时,为克服视觉领域缺乏听觉监督的问题,我们设计了一种跨模态引导的空间感知模块,以建模空间视听关联。此外,我们使用任务特定的文本提示以增强模型的适应性和任务意识。在基准数据集(如AVE、LLP、MUSIC-AVQA、VGG-SS和AVS)上的广泛实验表明,AV-统一在时间、空间和时空任务上均表现出有效性。
Summary / 总结
AV-Unified is a unified framework designed to jointly learn across various audio-visual scene understanding tasks, such as event localization and question answering. It standardizes input-output formats and uses a multi-scale spatiotemporal perception network to capture audio-visual associations. The framework demonstrates effectiveness across different tasks on benchmark datasets, showing improvements in temporal, spatial, and spatiotemporal understanding.
AV-Unified 是一个统一框架,用于整合事件定位、解析、分割和问答等多种音频-视觉场景理解任务。它使用多尺度时空感知网络来捕捉音频-视觉关联,并将输入和输出转换为离散的标记以建立共享表示。该框架包括一个多尺度时间感知模块和一个跨模态引导的空间感知模块,以处理不同的时间粒度和缺乏听觉监督的问题。在基准数据集上的实验表明,AV-Unified 在时间、空间和时空任务中均表现出色。
The Limits of Long-Context Reasoning in Automated Bug Fixing
Authors: Ravi Raju, Mengmeng Ji, Shubhangi Upasani, Bo Li, Urmish Thakker
Venue: ICLR 2026
First: 2026-02-17T22:51:40+00:00 · Latest: 2026-03-06T18:01:03+00:00
Comments: Accepted to ICLR 2026 ICBINB workshop
Abstract
Rapidly increasing context lengths have led to the assumption that large language models (LLMs) can directly reason over entire codebases. Concurrently, recent advances in LLMs have enabled strong performance on software engineering benchmarks, particularly when paired with agentic workflows. In this work, we systematically evaluate whether current LLMs can reliably perform long-context code debugging and patch generation. Using SWE-bench Verified as a controlled experimental setting, we first evaluate state-of-the-art models within an agentic harness (mini-SWE-agent), where performance improves substantially: GPT-5-nano achieves up to a 31% resolve rate on 100 samples, and open-source models such as Deepseek-R1-0528 obtain competitive results. However, token-level analysis shows that successful agentic trajectories typically remain under 20k-30k tokens, and that longer accumulated contexts correlate with lower success rates, indicating that agentic success primarily arises from task decomposition into short-context steps rather than effective long-context reasoning. To directly test long-context capability, we construct a data pipeline where we artificially inflate the context length of the input by placing the relevant files into the context (ensuring perfect retrieval recall); we then study single-shot patch generation under genuinely long contexts (64k tokens). Despite this setup, performance degrades sharply: Qwen3-Coder-30B-A3B achieves only a 7% resolve rate at 64k context, while GPT-5-nano solves none of the tasks. Qualitative analysis reveals systematic failure modes, including hallucinated diffs, incorrect file targets, and malformed patch headers. Overall, our findings highlight a significant gap between nominal context length and usable context capacity in current LLMs, and suggest that existing agentic coding benchmarks do not meaningfully evaluate long-context reasoning.
中文标题/摘要
标题:自动化错误修复中长上下文推理的局限性
随着上下文长度的迅速增加,人们假设大型语言模型(LLMs)可以直接对整个代码库进行推理。同时,LLMs 的最新进展使其在软件工程基准测试中表现出色,尤其是在与代理型工作流结合使用时。在本研究中,我们系统地评估当前的LLMs是否能够可靠地执行长上下文代码调试和补丁生成。使用SWE-bench Verified作为受控实验环境,我们首先在代理型框架(mini-SWE-agent)中评估最先进的模型,性能显著提高:GPT-5-nano在100个样本中最高解决率为31%,开源模型如Deepseek-R1-0528获得竞争力的结果。然而,基于token的分析显示,成功的代理型轨迹通常保持在20k-30k token以下,而更长的累积上下文与较低的成功率相关,表明代理型成功主要来自于将任务分解为短上下文步骤,而不是有效的长上下文推理。为了直接测试长上下文能力,我们构建了一个数据管道,通过将相关文件放入上下文(确保完美的检索召回率)来人为增加输入的上下文长度;然后在真正长的上下文中(64k token)研究单次补丁生成。尽管如此,性能急剧下降:Qwen3-Coder-30B-A3B在64k上下文中仅实现7%的解决率,而GPT-5-nano没有解决任何任务。定性分析揭示了系统性的失败模式,包括虚假的差异、错误的文件目标和不规范的补丁头。总体而言,我们的研究结果突显了当前LLMs名义上下文长度与可用上下文容量之间的显著差距,并表明现有的代理型编程基准未能实质性地评估长上下文推理。
Summary / 总结
This work evaluates the capability of large language models (LLMs) in performing long-context code debugging and patch generation. Using SWE-bench Verified, the study finds that while performance improves with agentic workflows, successful trajectories typically remain under 20k-30k tokens, indicating that task decomposition rather than long-context reasoning is key. Directly testing long-context capability by inflating context length shows sharp performance degradation: in single-shot patch generation at 64k tokens, Qwen3-Coder-30B-A3B reaches only a 7% resolve rate and GPT-5-nano solves none of the tasks. The study highlights a gap between nominal context length and usable context capacity in LLMs and suggests that existing agentic coding benchmarks do not meaningfully evaluate long-context reasoning.
这项研究评估了大型语言模型(LLMs)在进行长上下文代码调试和补丁生成方面的能力。使用SWE-bench Verified,研究发现虽然通过代理工作流可以提高性能,但成功的轨迹通常不超过20k-30k个标记,表明任务分解而非长上下文推理是关键。通过增加上下文长度直接测试长上下文能力显示了显著的性能下降,在64k标记上下文中仅7%的解决率。研究指出,LLMs在名义上下文长度和可用上下文容量之间存在显著差距,并暗示当前的基准测试未能充分评估长上下文推理能力。
Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts
Authors: Kartik Sharma, Yiqiao Jin, Vineeth Rakesh, Yingtong Dou, Menghai Pan, Mahashweta Das, Srijan Kumar
Venue: ICLR 2026
First: 2025-06-18T05:48:05+00:00 · Latest: 2026-03-06T17:55:20+00:00
Comments: ICLR 2026. Code available at https://github.com/Ksartik/sysformer
Abstract
As large language models (LLMs) are deployed in safety-critical settings, it is essential to ensure that their responses comply with safety standards. Prior research has revealed that LLMs often fail to grasp the notion of safe behaviors, resulting in either unjustified refusals to harmless prompts or the generation of harmful content. While substantial efforts have been made to improve their robustness, existing defenses often rely on costly fine-tuning of model parameters or employ suboptimal heuristic techniques. In this work, we take a novel approach to safeguard LLMs by learning to adapt the system prompts in instruction-tuned LLMs. While LLMs are typically pre-trained to follow a fixed system prompt, we investigate the impact of tailoring the system prompt to each specific user input on the safety of the responses. To this end, we propose Sysformer, a transformer model that updates an initial system prompt to a more robust system prompt in the LLM input embedding space while attending to the user prompt. While keeping the LLM parameters frozen, the Sysformer is trained to refuse to respond to a set of harmful prompts while responding ideally to a set of safe ones. Through extensive experiments on 5 LLMs from different families and 2 recent benchmarks, we demonstrate that Sysformer can significantly enhance the robustness of LLMs, leading to up to 80% gain in the refusal rate on harmful prompts while enhancing the compliance with the safe prompts by up to 90%. Results also generalize well to sophisticated jailbreaking attacks, making LLMs up to 100% more robust against different attack strategies. We hope our findings lead to cheaper safeguarding of LLMs and motivate future investigations into designing variable system prompts.
中文标题/摘要
标题:Sysformer:通过自适应系统提示保护冻结的大语言模型
随着大语言模型(LLMs)在安全关键环境中部署,确保其响应符合安全标准变得至关重要。先前的研究表明,LLMs往往无法理解安全行为的概念,导致对无害提示的不合理拒绝或生成有害内容。尽管已经做出了大量努力来提高其鲁棒性,但现有的防御措施往往依赖于昂贵的模型参数微调或采用次优的启发式技术。在本工作中,我们通过学习在指令调优的LLMs中适应系统提示来采取一种新颖的方法来保护LLMs。虽然LLMs通常预训练为遵循固定系统提示,但我们研究了将系统提示根据每个特定用户输入进行调整对响应安全性的影响。为此,我们提出了Sysformer,这是一种更新初始系统提示以在LLM输入嵌入空间中生成更鲁棒系统提示的转子former模型,同时关注用户提示。在冻结LLM参数的情况下,Sysformer被训练为拒绝一组有害提示,同时对一组安全提示作出理想响应。通过在来自不同家族的5个LLM和2个最新基准上的广泛实验,我们证明Sysformer可以显著增强LLMs的鲁棒性,使其在有害提示上的拒绝率提高多达80%,同时在安全提示上的合规性提高多达90%。结果还很好地推广到复杂的jailbreaking攻击,使LLMs对不同攻击策略的鲁棒性提高多达100%。实验结果表明,Sysformer可以更便宜地保护LLMs,并激发未来关于设计可变系统提示的研究。
Summary / 总结
The research aims to enhance the safety of large language models (LLMs) in critical applications by adapting system prompts. The method involves training a Sysformer model to update the system prompt in the input embedding space of LLMs while keeping the LLM parameters frozen. Experiments on five LLMs and two benchmarks show that Sysformer can significantly improve the refusal rate of harmful prompts by up to 80% and enhance compliance with safe prompts by up to 90%. The approach also effectively resists sophisticated jailbreaking attacks, making LLMs up to 100% more robust against various attack strategies.
本文旨在解决大型语言模型(LLMs)在关键应用中的安全性问题。提出了Sysformer,一种适应系统提示的变压器模型,可以在不微调模型参数的情况下提高LLM响应的安全性。实验结果显示,Sysformer可以显著增强LLM的鲁棒性,对于有害提示的拒绝率最高可提升80%,对于安全提示的合规性最高可提升90%。该方法还能有效抵御复杂的破解攻击,使LLM对各种攻击策略更具鲁棒性。
Culture in Action: Evaluating Text-to-Image Models through Social Activities
Authors: Sina Malakouti, Boqing Gong, Adriana Kovashka
First: 2025-11-07T19:51:11+00:00 · Latest: 2026-03-06T17:45:17+00:00
Abstract
Text-to-image (T2I) diffusion models achieve impressive photorealism by training on large-scale web data, but models inherit cultural biases and fail to depict underrepresented regions faithfully. Existing cultural benchmarks focus mainly on object-centric categories (e.g., food, attire, and architecture), overlooking the social and daily activities that more clearly reflect cultural norms. Few metrics exist for measuring cultural faithfulness. We introduce CULTIVate, a benchmark for evaluating T2I models on cross-cultural activities (e.g., greetings, dining, games, traditional dances, and cultural celebrations). CULTIVate spans 16 countries with 576 prompts and more than 19,000 images, and provides an explainable descriptor-based evaluation framework across multiple cultural dimensions, including background, attire, objects, and interactions. We propose four metrics to measure cultural alignment, hallucination, exaggerated elements, and diversity. Our findings reveal systematic disparities: models perform better for global north countries than for the global south, with distinct failure modes across T2I systems. Human studies confirm that our metrics correlate more strongly with human judgments than existing text-image metrics.
中文标题/摘要
标题:文化在行动:通过社会活动评估文本到图像模型
文本到图像(T2I)扩散模型通过大规模网络数据训练,实现了令人印象深刻的逼真度,但模型继承了文化偏见,未能忠实描绘未被充分代表的地区。现有的文化基准主要集中在以对象为中心的类别(如食物、服饰和建筑)上,忽视了更能反映文化规范的社会和日常活动。很少有度量标准可以衡量文化忠实度。我们引入了CULTIVate,这是一个用于评估T2I模型在跨文化活动(如问候、用餐、游戏、传统舞蹈和文化庆典)上的基准。CULTIVate覆盖了16个国家,有576个提示和超过19,000张图像,并提供了一个基于解释性描述符的多文化维度评估框架,包括背景、服饰、物体和互动。我们提出了四个度量标准来衡量文化一致性、幻觉、夸张元素和多样性。我们的研究发现揭示了系统性差异:模型在表现上更优于全球北方国家,而对全球南方国家的表现则较差,不同T2I系统存在不同的失败模式。人类研究证实,我们的度量标准与现有的文本图像度量标准相比,与人类判断的相关性更强。
Summary / 总结
This study introduces CULTIVate, a benchmark for evaluating text-to-image models on cross-cultural social activities, addressing the cultural biases in existing benchmarks. The benchmark includes 576 prompts and over 19,000 images from 16 countries, focusing on cultural norms through descriptors in background, attire, objects, and interactions. Four metrics are proposed to measure cultural alignment, hallucination, exaggerated elements, and diversity. The study finds that models perform better for global north countries and worse for the global south, with distinct failure modes across different T2I systems. Human studies confirm that these metrics correlate more strongly with human judgments than existing text-image metrics.
该研究引入了CULTIVate,一个用于评估文本到图像模型在跨文化社交活动上的基准,解决了现有基准中的文化偏见问题。该基准包含来自16个国家的576个提示和超过19,000张图像,通过背景、服饰、物品和互动的描述来关注文化规范。提出了四个指标来衡量文化一致性、幻觉、夸张元素和多样性。研究发现,模型在北半球国家的表现优于南半球国家,不同T2I系统存在不同的失败模式。人类研究证实,这些指标与现有文本图像指标相比,更强烈地与人类判断相关。
How Well Does Agent Development Reflect Real-World Work?
Authors: Zora Zhiruo Wang, Sanidhya Vijayvargiya, Aspen Chen, Hanmo Zhang, Venu Arvind Arangarajan, Jett Chen, Valerie Chen, Diyi Yang, Daniel Fried, Graham Neubig
First: 2026-03-01T17:55:49+00:00 · Latest: 2026-03-06T17:43:36+00:00
Abstract
AI agents are increasingly developed and evaluated on benchmarks relevant to human work, yet it remains unclear how representative these benchmarking efforts are of the labor market as a whole. In this work, we systematically study the relationship between agent development efforts and the distribution of real-world human work by mapping benchmark instances to work domains and skills. We first analyze 43 benchmarks and 72,342 tasks, measuring their alignment with human employment and capital allocation across all 1,016 real-world occupations in the U.S. labor market. We reveal substantial mismatches between agent development that tends to be programming-centric, and the categories in which human labor and economic value are concentrated. Within work areas that agents currently target, we further characterize current agent utility by measuring their autonomy levels, providing practical guidance for agent interaction strategies across work scenarios. Building on these findings, we propose three measurable principles for designing benchmarks that better capture socially important and technically challenging forms of work: coverage, realism, and granular evaluation.
中文标题/摘要
标题:智能代理开发如何反映现实世界工作?
人工智能代理越来越多地在与人类工作相关的基准上进行开发和评估,但这些基准的努力是否代表整个劳动力市场仍不清楚。在本研究中,我们系统地研究了智能代理开发努力与现实世界人类工作的分布之间的关系,通过将基准实例映射到工作领域和技能。我们首先分析了43个基准和72,342个任务,测量它们与人类就业和资本分配在所有1,016个美国劳动力市场的现实世界职业中的契合度。我们揭示了智能代理开发倾向于编程中心与人类劳动和经济价值集中领域之间存在显著差异。在智能代理当前目标的工作领域内,我们进一步通过测量其自主水平来表征当前代理的实用性,为不同工作场景下的代理交互策略提供实用指导。基于这些发现,我们提出了三个可衡量的原则,以设计更好地捕捉社会重要和技术挑战形式工作的基准:覆盖面、现实性和细粒度评估。
Summary / 总结
This study investigates the alignment between AI agent development and real-world human work by analyzing 43 benchmarks and 72,342 tasks against the U.S. labor market. The research reveals significant mismatches, with agent development focusing more on programming skills while human labor and economic value are concentrated in other areas. The study proposes three principles—coverage, realism, and granular evaluation—for designing more representative benchmarks. Within the current target areas of agents, the study also measures autonomy levels to guide practical interaction strategies in work scenarios.
研究通过分析43个基准和72,342个任务与美国劳动力市场的匹配情况,考察了AI代理开发与现实世界人类工作之间的对齐。研究揭示了显著的不匹配,代理开发更多集中在编程技能上,而人类劳动和经济价值集中在其他领域。研究提出了三个原则——覆盖、现实性和细粒度评估——以设计更具有代表性的基准。在代理当前目标领域内,研究还测量了自主性水平,以指导工作场景中的代理互动策略。
When One Modality Rules Them All: Backdoor Modality Collapse in Multimodal Diffusion Models
Authors: Qitong Wang, Haoran Dai, Haotian Zhang, Christopher Rasmussen, Binghui Wang
Venue: ICLR 2026
First: 2026-03-06T17:42:08+00:00 · Latest: 2026-03-06T17:42:08+00:00
Comments: Accepted to the ICLR 2026 Workshop on Principled Design for Trustworthy AI. The first two authors contributed equally
Abstract
While diffusion models have revolutionized visual content generation, their rapid adoption has underscored the critical need to investigate vulnerabilities, e.g., to backdoor attacks. In multimodal diffusion models, it is natural to expect that attacking multiple modalities simultaneously (e.g., text and image) would yield complementary effects and strengthen the overall backdoor. In this paper, we challenge this assumption by investigating the phenomenon of Backdoor Modality Collapse, a scenario where the backdoor mechanism degenerates to rely predominantly on a subset of modalities, rendering others redundant. To rigorously quantify this behavior, we introduce two novel metrics: Trigger Modality Attribution (TMA) and Cross-Trigger Interaction (CTI). Through extensive experiments across diverse training configurations in multimodal conditional diffusion, we consistently observe a "winner-takes-all" dynamic in backdoor behavior. Our results reveal that (1) attacks often collapse into subset-modality dominance, and (2) cross-modal interaction is negligible or even negative, contradicting the intuition of synergistic vulnerability. These findings highlight a critical blind spot in current assessments, suggesting that high attack success rates often mask a fundamental reliance on a subset of modalities. This establishes a principled foundation for mechanistic analysis and future defense development.
中文标题/摘要
标题:一模独大:多模态扩散模型中的后门模态坍塌
尽管扩散模型已经彻底改变了视觉内容生成,但它们的快速采用也凸显了研究其脆弱性(例如,后门攻击)的迫切需要。在多模态扩散模型中,同时攻击多个模态(例如,文本和图像)会产生互补效果并增强整体后门攻击的假设是自然的。在本文中,我们通过研究后门模态坍塌现象挑战了这一假设,这是一种后门机制退化为主要依赖于少数模态集,使其他模态变得多余的情况。为了严格量化这种行为,我们引入了两个新的度量标准:触发模态归因(TMA)和跨触发交互(CTI)。通过在多模态条件扩散中进行广泛的实验,我们一致观察到后门行为中的“赢家通吃”动态。我们的结果表明:(1)攻击往往坍塌为少数模态主导,(2)跨模态交互可以忽略不计甚至为负,这与协同脆弱性的直觉相矛盾。这些发现突显了当前评估中的一个关键盲点,表明高攻击成功率往往掩盖了对少数模态的依赖。这为机制分析和未来防御开发奠定了原则性的基础。
Summary / 总结
This paper investigates the phenomenon of Backdoor Modality Collapse in multimodal diffusion models, where the backdoor mechanism relies primarily on a subset of modalities, making others redundant. To quantify this, the authors introduce two metrics, Trigger Modality Attribution (TMA) and Cross-Trigger Interaction (CTI). Experiments across various training setups show a "winner-takes-all" dynamic: attacks often collapse onto a subset of modalities, and cross-modal interaction is negligible or even negative, contradicting the expectation of synergistic vulnerability. This highlights a critical blind spot in current assessments, indicating that high attack success rates may mask a fundamental reliance on a subset of modalities.
本文研究了多模态扩散模型中的Backdoor Modality Collapse现象,即后门机制主要依赖于少数模态,使其他模态变得多余。为了量化这一现象,作者引入了两个指标:触发模态归因(TMA)和跨触发交互(CTI)。在各种训练配置下进行的实验显示,攻击往往会集中到少数模态上,且跨模态交互是微不足道的甚至为负,这挑战了协同脆弱性的假设。这些发现揭示了当前评估中的一个关键盲点,表明高攻击成功率可能掩盖了对少数模态的依赖。
Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis
Authors: Hila Chefer, Patrick Esser, Dominik Lorenz, Dustin Podell, Vikash Raja, Vinh Tong, Antonio Torralba, Robin Rombach
First: 2026-03-06T17:41:49+00:00 · Latest: 2026-03-06T17:41:49+00:00
Comments: project webpage: https://bfl.ai/research/self-flow
Abstract
Strong semantic representations improve the convergence and generation quality of diffusion and flow models. Existing approaches largely rely on external models, which require separate training, operate on misaligned objectives, and exhibit unexpected scaling behavior. We argue that this dependence arises from the model's training objective, which poses a denoising task with little incentive to learn semantic representations. We introduce Self-Flow: a self-supervised flow matching paradigm that integrates representation learning within the generative framework. Our key mechanism, Dual-Timestep Scheduling, applies heterogeneous noise levels across tokens, creating an information asymmetry that forces the model to infer missing information from corrupted inputs. This drives learning strong representations alongside generative capabilities without external supervision. Our method generalizes across modalities and enables multi-modal training while following expected scaling laws, achieving superior image, video, and audio generation.
中文标题/摘要
标题:自我监督的流匹配以实现可扩展的多模态合成
强大的语义表示可以提高扩散和流模型的收敛性和生成质量。现有方法主要依赖外部模型,需要单独训练,操作于对齐不良的目标,并表现出意外的扩展行为。我们认为这种依赖源于模型的训练目标,该目标提出了去噪任务,缺乏学习语义表示的动力。我们提出了Self-Flow:一种自我监督的流匹配范式,将表示学习整合到生成框架中。我们的关键机制,双时间步调度,对不同标记应用异质噪声水平,创建信息不对称,迫使模型从受损输入中推断缺失信息。这促进了在生成能力的同时学习强大的表示,而无需外部监督。我们的方法在不同模态之间具有通用性,并在遵循预期的扩展定律的同时实现多模态训练,实现了卓越的图像、视频和音频生成。
Summary / 总结
The research aims to improve the convergence and generation quality of diffusion and flow models by integrating representation learning within the generative framework. The method, Self-Flow, uses a self-supervised flow matching paradigm and Dual-Timestep Scheduling to apply heterogeneous noise levels, forcing the model to infer missing information. This approach enables multi-modal training and generalizes across modalities, achieving better image, video, and audio generation without external supervision.
研究旨在通过将表示学习集成到生成框架中,提高扩散和流模型的收敛性和生成质量。提出的Self-Flow方法使用自我监督的流匹配范式和双时间步调度机制,应用不同的噪声水平,迫使模型从受损输入中推断缺失信息。这种方法可以在不同模态之间进行多模态训练并泛化,实现图像、视频和音频的优质生成,而无需外部监督。该方法遵循预期的缩放定律,并避免了现有方法依赖外部模型和操作对齐目标的问题。
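The Self-Flow abstract above describes Dual-Timestep Scheduling as applying heterogeneous noise levels across tokens. The following is a minimal sketch of that idea only, not the authors' implementation. Everything specific here is an assumption made for illustration: the function name `dual_timestep_corrupt`, the `mask_fraction` parameter, scalar-valued tokens, and the linear interpolation path `(1 - t) * x + t * noise`, which is a common flow matching convention that the abstract does not confirm.

```python
import random

def dual_timestep_corrupt(tokens, mask_fraction=0.5, seed=0):
    """Corrupt each token at one of two noise levels (hypothetical sketch).

    tokens: list of floats standing in for token embeddings.
    Returns (corrupted_tokens, per_token_timesteps).
    """
    rng = random.Random(seed)
    # Draw two timesteps in [0, 1); t_low is the cleaner level, t_high the noisier.
    t_low, t_high = sorted((rng.random(), rng.random()))
    corrupted, timesteps = [], []
    for x in tokens:
        # Assumed rule: a random subset of tokens receives the noisier timestep.
        t = t_high if rng.random() < mask_fraction else t_low
        noise = rng.gauss(0.0, 1.0)
        # Assumed linear interpolation between data (t = 0) and noise (t = 1).
        corrupted.append((1.0 - t) * x + t * noise)
        timesteps.append(t)
    return corrupted, timesteps

corrupted, timesteps = dual_timestep_corrupt([0.5] * 16)
```

Because some tokens are cleaner than others, a model trained on such inputs would have to use the cleaner tokens to infer the noisier ones, which is the information asymmetry the abstract refers to.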
Spectral/Spatial Tensor Atomic Cluster Expansion with Universal Embeddings in Cartesian Space
Authors: Zemin Xu, Wenbo Xie, P. Hu
First: 2025-09-18T13:51:07+00:00 · Latest: 2026-03-06T17:38:33+00:00
Abstract
Equivariant atomistic machine learning models have largely been built on spherical-tensor representations, where explicit angular-momentum coupling introduces substantial complexity, and systematic extensions beyond energies and forces remain challenging, often requiring problem-specific architectural choices. Here we introduce the Tensor Atomic Cluster Expansion (TACE), which unifies scalar and tensorial modeling in Cartesian space by decomposing local environments into irreducible Cartesian tensors (ICT) and constructing a controlled many-body hierarchy with atomic cluster expansion (ACE). In addition to performing ACE in the frequency domain, we propose an efficient Clebsch-Gordan-free alternative in the spatial domain. TACE provides universal invariant (e.g., fidelity tags and charges) and equivariant (e.g., external electric fields and non-collinear magnetic moments) embeddings. These embeddings and the predicted tensorial observables are handled on equal footing, enabling explicit control at inference. We demonstrate accuracy, stability, and efficiency across finite molecules and extended materials, including in-domain and out-of-domain benchmarks, spectra, Hessian, external-field responses, charged systems, and multi-fidelity/head training. We further show its robustness on nonequilibrium/reactive datasets and controlled scaling when extending to large foundation model datasets.
中文标题/摘要
标题:谱空间张量原子簇展开与通用嵌入在笛卡尔空间中的结合
等变原子机器学习模型大多基于球张量表示,其中显式的角动量耦合引入了大量复杂性,且系统扩展到能量和力之外仍然具有挑战性,通常需要针对特定问题的架构选择。在这里,我们引入了张量原子簇展开(TACE),它通过将局部环境分解为不可约笛卡尔张量(ICT)并构建原子簇展开(ACE)控制的多体层次结构,统一了标量和张量建模在笛卡尔和空间中的建模。除了在频域中执行ACE外,我们还提出了一种在空间域中的高效Clebsch-Gordan自由替代方案。TACE提供了通用不变量(例如,保真度标签和电荷)和等变性嵌入(例如,外部电场和非共线磁矩),预测的张量观测值得到了平等处理,从而在推理时实现了显式控制。我们展示了其在有限分子和扩展材料中的准确性、稳定性和效率,包括域内和域外基准测试、光谱、哈密顿量、外部场响应、带电系统和多保真度/头训练。我们还展示了其在非平衡/反应数据集上的鲁棒性,并在扩展到大型基础模型数据集时实现了可控的扩展性。
Summary / 总结
The research introduces Tensor Atomic Cluster Expansion (TACE), which unifies scalar and tensorial modeling in Cartesian space by decomposing local environments into irreducible Cartesian tensors (ICT) and constructing a controlled many-body hierarchy with atomic cluster expansion (ACE). TACE provides universal invariant and equivariant embeddings and enables explicit control at inference. The method demonstrates high accuracy, stability, and efficiency across various benchmarks, including finite molecules, extended materials, spectra, Hessian, external-field responses, charged systems, and multi-fidelity/head training.
研究旨在解决球对称张量表示在原子机器学习模型中的复杂性和局限性。Tensor Atomic Cluster Expansion (TACE) 方法将局部环境分解为不可约笛卡尔张量,并构建了一个受控的多体层次结构。TACE 提供了通用和对称嵌入,并准确预测了各种基准测试中的张量观测值,展示了其在有限分子和扩展材料中的准确性和稳定性。
CanvasMAR: Improving Masked Autoregressive Video Prediction With Canvas
Authors: Zian Li, Muhan Zhang
First: 2025-10-15T15:29:09+00:00 · Latest: 2026-03-06T17:33:15+00:00
Abstract
Masked autoregressive models (MAR) have emerged as a powerful paradigm for image and video generation, combining the flexibility of masked modeling with the expressiveness of continuous tokenizers. However, when sampling individual frames, video MAR models often produce highly distorted outputs due to the lack of a structured global prior, especially when using only a few sampling steps. To address this, we propose CanvasMAR, a novel autoregressive video prediction model that predicts high-fidelity frames with few sampling steps by introducing a canvas--a blurred, global one-step prediction of the next frame that serves as a non-uniform mask during masked generation. The canvas supplies global structure early in sampling, enabling faster and more coherent frame synthesis. To further stabilize autoregressive sampling, we propose an easy-to-hard curriculum via a motion-aware sampling order that synthesizes relatively stationary regions before attending to highly dynamic ones. We also integrate compositional classifier-free guidance that jointly strengthens the canvas and temporal conditioning to improve generation fidelity. Experiments on the BAIR, UCF-101, and Kinetics-600 benchmarks demonstrate that CanvasMAR produces higher-quality videos with fewer autoregressive steps. On the challenging Kinetics-600 dataset, CanvasMAR achieves remarkable performance among autoregressive models and rivals advanced diffusion-based methods.
中文标题/摘要
标题:CanvasMAR:通过Canvas改进遮罩自回归视频预测
遮罩自回归模型(MAR)已成为图像和视频生成的强大范式,结合了遮罩建模的灵活性和连续分词器的表达能力。然而,在逐帧采样时,视频MAR模型往往会因缺乏结构化的全局先验而产生高度失真的输出,尤其是在仅使用少量采样步骤时。为了解决这一问题,我们提出了一种名为CanvasMAR的新型自回归视频预测模型,通过引入一个模糊的全局一帧预测——作为遮罩生成期间的非均匀遮罩——来预测高保真帧,从而在少量采样步骤内实现高质量的帧预测。Canvas在采样早期提供全局结构,使帧合成更快且更连贯。为了进一步稳定自回归采样,我们提出了一种易于难的课程学习,通过运动感知的采样顺序,先合成相对静止的区域,再关注高度动态的区域。我们还整合了组合式的无条件分类器引导,共同加强Canvas和时间条件,以提高生成保真度。在BAIR、UCF-101和Kinetics-600基准测试中,CanvasMAR在更少的自回归步骤内生成了更高质量的视频。在具有挑战性的Kinetics-600数据集上,CanvasMAR在自回归模型中表现出色,并与先进的扩散方法相媲美。
Summary / 总结
CanvasMAR is designed to improve the quality of video generation using masked autoregressive models by introducing a canvas, a blurred global prediction that acts as a non-uniform mask during sampling. This canvas helps in faster and more coherent frame synthesis. Additionally, CanvasMAR uses a motion-aware sampling order and compositional classifier-free guidance to stabilize the sampling process. Experiments show that CanvasMAR generates higher-quality videos with fewer sampling steps compared to other autoregressive models, especially on the Kinetics-600 dataset.
CanvasMAR 是一种新颖的自回归视频预测模型,通过引入一个模糊的全局一帧预测的画布作为非均匀掩码,在掩码生成过程中提高视频生成的质量。它还使用了运动感知的采样顺序和组合分类器无指导引导来增强生成的稳定性和质量。实验表明,CanvasMAR 生成的视频质量更高,所需采样步骤更少,特别是在 Kinetics-600 数据集上表现出色。
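The CanvasMAR abstract above mentions a motion-aware sampling order that synthesizes relatively stationary regions before highly dynamic ones. A rough sketch of such an ordering is given below. The abstract does not say how motion is estimated, so the absolute difference between the two most recent frames is used here purely as a stand-in, and the function name `motion_aware_order` and the one-scalar-per-patch representation are hypothetical.

```python
def motion_aware_order(prev_frame, curr_frame):
    """Return patch indices sorted from least to most estimated motion.

    prev_frame, curr_frame: lists with one scalar value per patch.
    Motion is approximated by absolute frame difference (an assumption).
    """
    motion = [abs(c - p) for p, c in zip(prev_frame, curr_frame)]
    # sorted() is stable, so patches with equal motion keep their index order.
    return sorted(range(len(motion)), key=lambda i: motion[i])

# Patches 0 and 3 are static, patch 2 moves slightly, patch 1 moves most.
order = motion_aware_order([0.0, 0.5, 1.0, 0.2], [0.0, 0.9, 1.1, 0.2])
print(order)  # [0, 3, 2, 1]
```

In a masked autoregressive sampler, an order like this would decide which masked tokens are filled in first, giving the easy-to-hard curriculum the abstract describes.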
COLD-Steer: Steering Large Language Models via In-Context One-step Learning Dynamics
Authors: Kartik Sharma, Rakshit S. Trivedi
Venue: ICLR 2026
First: 2026-03-06T17:27:27+00:00 · Latest: 2026-03-06T17:27:27+00:00
Comments: ICLR 2026. Code available at https://github.com/Ksartik/cold-steer
Abstract
Activation steering methods enable inference-time control of large language model (LLM) behavior without retraining, but current approaches face a fundamental trade-off: sample-efficient methods suboptimally capture steering signals from labeled examples, while methods that better extract these signals require hundreds to thousands of examples. We introduce COLD-Steer, a training-free framework that steers LLM activations by approximating the representational changes that would result from gradient descent on in-context examples. Our key insight is that the effect of fine-tuning on a small set of examples can be efficiently approximated at inference time without actual parameter updates. We formalize this through two complementary approaches: (i) a unit kernel approximation method that updates the activations directly using gradients with respect to them, normalized across examples, and (ii) a finite-difference approximation requiring only two forward passes regardless of example count. Experiments across a variety of steering tasks and benchmarks demonstrate that COLD-Steer achieves up to 95% steering effectiveness while using 50 times fewer samples compared to the best baseline. COLD-Steer facilitates accommodating diverse perspectives without extensive demonstration data, which we validate through our experiments on pluralistic alignment tasks. Our framework opens new possibilities for adaptive, context-aware model control that can flexibly address varying loss-driven human preferences through principled approximation of learning dynamics rather than specialized training procedures.
中文标题/摘要
标题:COLD-Steer:通过上下文内一步学习动力学引导大型语言模型
激活引导方法允许在推理时控制大型语言模型(LLM)的行为而无需重新训练,但当前方法面临一个基本的权衡:样本高效的方法以次优方式捕捉标记示例中的引导信号,而能够更好地提取这些信号的方法则需要数百到数千个示例。我们提出了COLD-Steer,这是一种无需训练的框架,通过近似上下文内示例进行梯度下降所导致的表示变化来引导LLM的激活。我们的核心见解是,对一小组示例进行微调的效果可以在推理时通过不实际更新参数来高效近似。我们通过两种互补的方法进行了形式化:(i)一种单位核近似方法,直接使用它们的梯度更新激活,梯度在示例之间归一化,以及(ii)一种仅需两次前向传递的有限差分近似,无论示例数量如何。在各种引导任务和基准测试中的实验表明,COLD-Steer 在使用比最佳基线少50倍的样本的情况下,实现了高达95%的引导效果。COLD-Steer 使容纳多样观点无需大量演示数据成为可能,我们通过在多元对齐任务中的实验进行了验证。我们的框架为通过原理上近似学习动力学而不是专门的训练程序来实现适应性和上下文感知的模型控制打开了新的可能性。
Summary / 总结
COLD-Steer is a training-free framework that steers large language model activations by approximating the representational changes from in-context examples. It uses two methods: a unit kernel approximation and a finite-difference approximation. Experiments show COLD-Steer achieves up to 95% steering effectiveness with 50 times fewer samples compared to existing methods, enabling more efficient and diverse perspective accommodation.
COLD-Steer 是一个无需训练的框架,通过近似在上下文示例上的梯度下降引起的表示变化来引导大型语言模型的激活。它使用两种方法:单位核近似和有限差分近似。实验表明,COLD-Steer 在最佳基线基础上少用 50 倍样本即可实现高达 95% 的引导效果,使其在各种引导任务和基准测试中非常有效。
Quantum Diffusion Models: Score Reversal Is Not Free in Gaussian Dynamics
Authors: Ammar Fayad
First: 2026-03-06T17:16:17+00:00 · Latest: 2026-03-06T17:16:17+00:00
Abstract
Diffusion-based generative modeling suggests reversing a noising semigroup by adding a score drift. For continuous-variable Gaussian Markov dynamics, complete positivity couples drift and diffusion at the generator level. For a quantum-limited attenuator with thermal parameter $ν$ and squeezing $r$, the fixed-diffusion Wigner-score (Bayes) reverse drift violates CP iff $\cosh(2r)>ν$. Any Gaussian CP repair must inject extra diffusion, implying $-2\ln F\ge c_{\text{geom}}(ν_{\min})I_{\mathrm{dec}}^{\mathrm{wc}}$.
中文标题/摘要
标题:量子扩散模型:在高斯动力学中逆向扩散不是免费的
基于扩散的生成建模建议通过添加分数漂移来逆转加噪半群。对于连续变量的高斯马尔可夫动力学,完全正性在生成器级别将漂移和扩散耦合在一起。对于具有热参数ν和压缩r的量子限制衰减器,固定扩散的Wigner-分数(贝叶斯)逆向漂移在cosh(2r)>ν时违反CP。任何高斯CP修复都必须注入额外的扩散,这意味着-2lnF≥c_{geom}(ν_{min})I_{dec}^{wc}。
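The violation criterion stated in the abstract above, cosh(2r) > ν, can be evaluated directly. The small sketch below only checks that inequality and solves it for the boundary squeezing value; it does not model the attenuator dynamics or the fidelity bound. The function names are hypothetical, and the restriction ν ≥ 1 in `critical_squeezing` comes from the domain of the inverse hyperbolic cosine, not from the abstract.

```python
import math

def reverse_drift_violates_cp(nu, r):
    """True when the stated criterion cosh(2r) > nu holds."""
    return math.cosh(2.0 * r) > nu

def critical_squeezing(nu):
    """Largest |r| with cosh(2r) <= nu, i.e. r* = acosh(nu) / 2.

    math.acosh raises ValueError for nu < 1.
    """
    return 0.5 * math.acosh(nu)

r_star = critical_squeezing(2.0)
print(reverse_drift_violates_cp(2.0, 0.5))   # False: cosh(1.0) is about 1.54
print(reverse_drift_violates_cp(2.0, 0.7))   # True: cosh(1.4) is about 2.15
```

One consequence of the inequality: at ν = 1 the threshold is r* = 0, so any nonzero squeezing satisfies the violation criterion.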
TIC-GRPO: Provable and Efficient Optimization for Reinforcement Learning from Human Feedback
Authors: Lei Pang, Jun Luo, Ruinan Jin
First: 2025-08-04T19:01:19+00:00 · Latest: 2026-03-06T17:14:15+00:00
Comments: 44 pages
Abstract
Group Relative Policy Optimization (GRPO), recently introduced by DeepSeek, is a critic-free reinforcement learning algorithm for fine-tuning large language models. GRPO replaces the value function in Proximal Policy Optimization (PPO) with group-normalized rewards while retaining PPO-style token-level importance sampling based on an old policy. Our theoretical analysis reveals that the GRPO update rule estimates the policy gradient at the old policy rather than the current one; however, since the old policy is refreshed every few steps, the resulting discrepancy remains small and the induced bias is negligible in practice. To empirically validate this insight, we conduct an ablation study that entirely removes importance sampling and performs multiple optimization steps using gradients estimated at a fixed old policy. Remarkably, this simplified variant attains performance comparable to standard GRPO.
Motivated by this finding, we propose Trajectory-level Importance-Corrected GRPO (TIC-GRPO), a new algorithm that replaces token-level importance ratios with a single trajectory-level probability ratio, thereby yielding an estimate of the current policy gradient while preserving the critic-free structure. Furthermore, we present the first convergence analysis for GRPO-style methods and show that TIC-GRPO converges faster than GRPO. Finally, empirical results across math reasoning and coding tasks demonstrate the superiority of TIC-GRPO.
中文标题/摘要
标题:TIC-GRPO:可证明且高效的强化学习优化方法,基于人类反馈
DeepSeek 最近引入的组相对策略优化 (GRPO) 是一种用于微调大型语言模型的无评论强化学习算法。GRPO 使用组归一化的奖励替换 PPO 中的价值函数,同时保留基于旧策略的 PPO 风格的令牌级重要性采样。我们的理论分析表明,GRPO 更新规则估计的是旧策略而非当前策略的策略梯度;然而,由于旧策略每隔几步就会更新,因此产生的差异很小,实际中的偏差可以忽略不计。为了验证这一见解,我们进行了一项消融研究,完全移除重要性采样,并使用固定旧策略估计的梯度进行多次优化步骤。令人惊讶的是,这种简化版本的性能与标准 GRPO 相当。
受这一发现的启发,我们提出了轨迹级重要性校正 GRPO (TIC-GRPO),这是一种新算法,用单个轨迹级概率比替换令牌级重要性比,从而获得当前策略梯度的估计值,同时保留无评论结构。此外,我们首次对 GRPO 风格的方法进行了收敛性分析,并证明 TIC-GRPO 比 GRPO 收敛得更快。最后,数学推理和编程任务上的实验证明了 TIC-GRPO 的优越性。
Summary / 总结
The research aims to improve reinforcement learning from human feedback by addressing the limitations of Group Relative Policy Optimization (GRPO). The study introduces TIC-GRPO, which simplifies GRPO by using a single trajectory-level probability ratio instead of token-level importance sampling, leading to better performance and faster convergence. Experiments show that TIC-GRPO outperforms GRPO in math reasoning and coding tasks.
研究旨在通过人类反馈提高强化学习算法的效率和可证明性。方法是通过在Proximal Policy Optimization (PPO)算法中用组归一化的奖励替换价值函数,并使用基于旧策略的令牌级重要性采样来实现。关键实验发现表明,移除重要性采样的简化版本仍然能达到与原始算法相当的性能。在此基础上,作者提出了轨迹级重要性校正的GRPO (TIC-GRPO),该算法使用单个轨迹级概率比来估计当前策略梯度,从而实现更快的收敛和更好的数学推理和编程任务性能。
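The TIC-GRPO entry above contrasts token-level importance ratios with a single trajectory-level probability ratio, and describes GRPO as using group-normalized rewards in place of a value function. The sketch below illustrates just those three quantities for one group of sampled responses. It is an illustration under assumptions, not the paper's algorithm: clipping, any KL term, and the exact normalization used by the authors are omitted, and all function names are hypothetical.

```python
import math

def group_normalized_advantages(rewards, eps=1e-8):
    """Center and scale the rewards of one group of sampled responses."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]

def token_level_ratios(logp_new, logp_old):
    """One importance ratio per token: pi_new(token) / pi_old(token)."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def trajectory_level_ratio(logp_new, logp_old):
    """A single ratio per response: pi_new(response) / pi_old(response).

    For an autoregressive policy this equals the product of token ratios.
    """
    return math.exp(sum(logp_new) - sum(logp_old))

# Per-token log-probabilities of one response under the new and old policy.
logp_new = [-0.9, -1.2, -0.4]
logp_old = [-1.0, -1.0, -0.5]
per_token = token_level_ratios(logp_new, logp_old)
whole = trajectory_level_ratio(logp_new, logp_old)
advantages = group_normalized_advantages([1.0, 0.0, 0.0, 1.0])
```

The token-level variant weights each token's gradient by its own ratio, while the trajectory-level variant applies one shared weight to every token of the response.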
Spatial4D-Bench: A Versatile 4D Spatial Intelligence Benchmark
Authors: Pan Wang, Yang Liu, Guile Wu, Eduardo R. Corral-Soto, Chengjie Huang, Binbin Xu, Dongfeng Bai, Xu Yan, Yuan Ren, Xingxin Chen, Yizhe Wu, Tao Huang, Wenjun Wan, Xin Wu, Pei Zhou, Xuyang Dai, Kangbo Lv, Hongbo Zhang, Yosef Fried, Aixue Ye, Bailan Feng, Zhenyu Chen, Zhen Li, Yingcong Chen, Yiyi Liao, Bingbing Liu
First: 2025-12-31T19:56:51+00:00 · Latest: 2026-03-06T17:02:14+00:00
Comments: Technical Report
Abstract
4D spatial intelligence involves perceiving and processing how objects move or change over time. Humans naturally possess 4D spatial intelligence, supporting a broad spectrum of spatial reasoning abilities. To what extent can Multimodal Large Language Models (MLLMs) achieve human-level 4D spatial intelligence? In this work, we present Spatial4D-Bench, a versatile 4D spatial intelligence benchmark designed to comprehensively assess the 4D spatial reasoning abilities of MLLMs. Unlike existing spatial intelligence benchmarks that are often small-scale or limited in diversity, Spatial4D-Bench provides a large-scale, multi-task evaluation benchmark consisting of ~40,000 question-answer pairs covering 18 well-defined tasks. We systematically organize these tasks into six cognitive categories: object understanding, scene understanding, spatial relationship understanding, spatiotemporal relationship understanding, spatial reasoning, and spatiotemporal reasoning. Spatial4D-Bench thereby offers a structured and comprehensive benchmark for evaluating the spatial cognition abilities of MLLMs, covering a broad spectrum of tasks that parallel the versatility of human spatial intelligence. We benchmark various state-of-the-art open-source and proprietary MLLMs on Spatial4D-Bench and reveal their substantial limitations in a wide variety of 4D spatial reasoning aspects, such as route planning, action recognition, and physical plausibility reasoning. We hope that the findings provided in this work offer valuable insights to the community and that our benchmark can facilitate the development of more capable MLLMs toward human-level 4D spatial intelligence. More resources can be found on our project page.
中文标题/摘要
标题:Spatial4D-Bench:一个多功能的4D空间智能基准
4D空间智能涉及感知和处理物体随时间移动或变化的方式。人类自然具备4D空间智能,支持广泛的空间推理能力。多模态大型语言模型(MLLMs)能否达到人类水平的4D空间智能程度如何?在本研究中,我们提出了Spatial4D-Bench,一个多功能的4D空间智能基准,旨在全面评估MLLMs的4D空间推理能力。与现有的空间智能基准相比,后者往往规模较小或在多样性方面有限,Spatial4D-Bench提供了一个大规模、多任务评估基准,包含约4万个问题-答案对,覆盖18个明确的任务。我们系统地将这些任务组织成六个认知类别:物体理解、场景理解、空间关系理解、时空关系理解、空间推理和时空推理。因此,Spatial4D-Bench为评估MLLMs的空间认知能力提供了一个结构化和全面的基准,涵盖了与人类空间智能的广泛任务相对应的多种任务。我们在Spatial4D-Bench上对各种最先进的开源和专有MLLMs进行了基准测试,并揭示了它们在路线规划、动作识别和物理合理性推理等众多4D空间推理方面的显著局限性。我们希望本研究提供的发现为社区提供有价值的见解,并希望通过我们的基准促进更强大的MLLMs向人类水平的4D空间智能的发展。更多资源可以在我们的项目页面上找到。
Summary / 总结
Spatial4D-Bench is a large-scale benchmark designed to evaluate the 4D spatial reasoning abilities of Multimodal Large Language Models (MLLMs). It consists of approximately 40,000 question-answer pairs covering 18 tasks organized into six cognitive categories. The benchmark reveals significant limitations of current MLLMs in various 4D spatial reasoning aspects, such as route planning, action recognition, and physical plausibility reasoning, highlighting the need for further development toward human-level 4D spatial intelligence.
Spatial4D-Bench 是一个大规模基准,旨在评估多模态大型语言模型(MLLMs)的4D空间推理能力。它包含超过40,000个问题-答案对,涵盖18个任务,并按六个认知类别组织。该基准揭示了当前MLLMs在路线规划和物理合理性推理等多种4D空间推理任务中的显著局限性。本研究旨在为向人类水平的4D空间智能发展更强大的MLLMs提供有价值的见解。
Co-Layout: LLM-driven Co-optimization for Interior Layout
Authors: Chucheng Xiang, Ruchao Bao, Biyin Feng, Wenzheng Wu, Zhongyuan Liu, Yirui Guan, Ligang Liu
Venue: AAAI 2026
First: 2025-11-16T06:20:55+00:00 · Latest: 2026-03-06T17:00:55+00:00
Comments: AAAI 2026
Abstract
We present a novel framework for automated interior design that combines large language models (LLMs) with grid-based integer programming to jointly optimize room layout and furniture placement. Given a textual prompt, the LLM-driven agent workflow extracts structured design constraints related to room configurations and furniture arrangements. These constraints are encoded into a unified grid-based representation inspired by "Modulor". Our formulation accounts for key design requirements, including corridor connectivity, room accessibility, spatial exclusivity, and user-specified preferences. To improve computational efficiency, we adopt a coarse-to-fine optimization strategy that begins with a low-resolution grid to solve a simplified problem and guides the solution at the full resolution. Experimental results across diverse scenarios demonstrate that our joint optimization approach significantly outperforms existing two-stage design pipelines in solution quality, and achieves notable computational efficiency through the coarse-to-fine strategy.
中文标题/摘要
标题:Co-布局:基于LLM的室内布局联合优化框架
我们提出了一种结合大型语言模型(LLM)和基于网格的整数规划的新型自动化室内设计框架,以联合优化房间布局和家具摆放。给定一个文本提示,LLM驱动的代理工作流提取与房间配置和家具布局相关的结构化设计约束。这些约束被编码为一种统一的基于网格的表示,灵感来源于“Modulor”。我们的建模考虑了关键的设计要求,包括走廊连通性、房间可达性、空间排他性和用户指定的偏好。为了提高计算效率,我们采用了一种从粗到细的优化策略,从低分辨率网格开始解决简化问题,并在全分辨率下引导解决方案。在各种场景下的实验结果表明,我们的联合优化方法在解决方案质量上显著优于现有的两阶段设计管道,并通过从粗到细的策略实现了显著的计算效率。
Summary / 总结
The research aims to automate interior design by integrating large language models with grid-based integer programming. The framework extracts design constraints from textual prompts and optimizes both room layout and furniture placement simultaneously. The approach uses a coarse-to-fine optimization strategy, starting with a low-resolution grid to enhance computational efficiency. Experiments show that this joint optimization method outperforms existing two-stage design pipelines in terms of solution quality and computational efficiency.
研究旨在通过结合大型语言模型和网格整数规划来自动化室内设计。方法包括一个由LLM驱动的工作流,从文本提示中提取设计约束并将其编码到网格表示中。该方法同时优化房间布局和家具摆放,考虑走廊连通性等关键设计要求。实验结果表明,联合优化方法在解决方案质量上优于现有的两阶段设计管道,并通过粗到细的优化策略实现了更好的计算效率。
Match4Annotate: Propagating Sparse Video Annotations via Implicit Neural Feature Matching
Authors: Zhuorui Zhang, Roger Pallarès-López, Praneeth Namburi, Brian W. Anthony
First: 2026-03-06T16:56:46+00:00 · Latest: 2026-03-06T16:56:46+00:00
Abstract
Acquiring per-frame video annotations remains a primary bottleneck for deploying computer vision in specialized domains such as medical imaging, where expert labeling is slow and costly. Label propagation offers a natural solution, yet existing approaches face fundamental limitations. Video trackers and segmentation models can propagate labels within a single sequence but require per-video initialization and cannot generalize across videos. Classic correspondence pipelines operate on detector-chosen keypoints and struggle in low-texture scenes, while dense feature matching and one-shot segmentation methods enable cross-video propagation but lack spatiotemporal smoothness and unified support for both point and mask annotations. We present Match4Annotate, a lightweight framework for both intra-video and inter-video propagation of point and mask annotations. Our method fits a SIREN-based implicit neural representation to DINOv3 features at test time, producing a continuous, high-resolution spatiotemporal feature field, and learns a smooth implicit deformation field between frame pairs to guide correspondence matching. We evaluate on three challenging clinical ultrasound datasets. Match4Annotate achieves state-of-the-art inter-video propagation, outperforming feature matching and one-shot segmentation baselines, while remaining competitive with specialized trackers for intra-video propagation. Our results show that lightweight, test-time-optimized feature matching pipelines have the potential to offer an efficient and accessible solution for scalable annotation workflows.
中文标题/摘要
标题:Match4Annotate:通过隐式神经特征匹配传播稀疏视频注释
获取每帧视频注释仍然是在医学成像等专门领域部署计算机视觉的主要瓶颈,其中专家标注速度慢且成本高。标签传播提供了一种自然的解决方案,但现有方法面临根本性的限制。视频跟踪器和分割模型可以在单个序列内传播标签,但需要逐视频初始化,并且无法在视频之间泛化。经典对应关系管道在检测器选择的关键点上操作,并且在低纹理场景中挣扎,而密集特征匹配和单次分割方法可以实现跨视频传播,但缺乏时空平滑性和对点和掩码注释的统一支持。我们提出了Match4Annotate,这是一种轻量级框架,用于在视频内和视频间传播点和掩码注释。我们的方法在测试时拟合基于SIREN的隐式神经表示到DINOv3特征,生成连续的高分辨率时空特征场,并学习帧对之间的平滑隐式变形场以指导对应匹配。我们在三个具有挑战性的临床超声波数据集上进行了评估。Match4Annotate实现了最先进的跨视频传播,优于特征匹配和单次分割基线,同时在视频内传播方面仍与专门的跟踪器竞争。我们的结果表明,轻量级、测试时优化的特征匹配管道有可能提供一种高效且易于访问的解决方案,以实现可扩展的注释工作流。
Summary / 总结
Match4Annotate is a lightweight framework designed to propagate sparse video annotations both within and across videos. It uses a SIREN-based implicit neural representation to create a continuous spatiotemporal feature field and learns a smooth deformation field to guide correspondence matching. The method outperforms feature matching and one-shot segmentation baselines for inter-video propagation and remains competitive with specialized trackers for intra-video propagation, demonstrating its potential for scalable annotation workflows in specialized domains like medical imaging.
Match4Annotate 是一个轻量级框架,旨在在视频中传播点和掩码注释,解决了现有方法的局限性。它使用基于 SIREN 的隐式神经表示和隐式变形场来生成连续的时空特征场,从而实现视频内的和视频间的传播。该方法在视频间的传播中优于特征匹配和单次分割基线,并且在视频内的传播中与专门的跟踪器保持竞争力,展示了其在医学成像等专业领域中实现可扩展注释工作流的潜力。
GreenRFM: Toward a resource-efficient radiology foundation model
Authors: Yingtai Li, Shuai Ming, Mingyue Zhao, Haoran Lai, Rongsheng Wang, Rui Zhou, Rundong Wang, Yujia Li, Wei Wei, Shaohua Kevin Zhou
First: 2026-03-06T16:51:42+00:00 · Latest: 2026-03-06T16:51:42+00:00
Abstract
The development of radiology foundation models (RFMs) is hindered by a reliance on brute-force scaling. Existing approaches often directly translate methods for natural images, which prioritize scale over precision and hence lead to brittle and expensive models in clinical practice. To address this, we present a resource-efficient pre-training framework, GreenRFM, that achieves state-of-the-art performance. Our framework ensures robust generalization across diverse patient populations and imaging protocols, reducing computational requirements by orders of magnitude while surpassing complex, parameter-heavy models. These capabilities stem from principled supervision design that aims to maximally utilize supervisory signals via More distilled, Ubiquitous, Semantic-enforcing, and Task-aligning (MUST) supervision, rather than simply piling up the quantity of training data. We offer two GreenRFM configurations: (i) a performant model that establishes a new state-of-the-art using a single 24GB GPU within 24 hours, and (ii) a lightweight model that matches existing benchmarks with 6GB VRAM in 4 hours. We conduct extensive experiments using over 200,000 images from four institutions and two modalities. GreenRFMs achieve superior performance on chest and abdominal CT datasets, on both public and private benchmarks, surpassing a range of baseline models. In addition, the results on internal musculoskeletal MRI images show that the same supervision principles transfer between different modalities. Our performance and efficiency challenge the "scale is all you need" dogma and democratize the equitable development of state-of-the-art RFMs for clinicians even on a laptop.
中文标题/摘要
标题:GreenRFM:朝向资源高效的放射学基础模型
放射学基础模型(RFMs)的发展受到粗暴扩展依赖的阻碍。现有方法通常直接将方法应用于自然图像,这侧重于规模而非精度,因此在临床实践中导致了脆弱且昂贵的模型。为解决这一问题,我们提出了一种资源高效的预训练框架GreenRFM,实现了最先进的性能。该框架确保了在多种患者群体和成像协议下具有稳健的泛化能力,同时将计算需求降低几个数量级,超越了复杂的参数密集型模型。这些能力源自一种基于原则的监督设计,旨在通过More Distilled, Ubiquitous, Semantic-enforcing, and Task-aligning (MUST) 监督最大限度地利用监督信号,而不是简单地增加训练数据的数量。我们提供了两种GreenRFM配置:(i) 一种高性能模型,在24小时内使用单个24GB GPU达到新的最先进的性能;(ii) 一种轻量级模型,在4小时内使用6GB VRAM达到现有基准。我们使用来自四个机构的超过200,000张图像进行了广泛的实验,涵盖两种成像模态。GreenRFMs在胸部和腹部CT数据集上表现出优越的性能,无论是在公共还是私人基准上,都超越了多种基线模型。此外,内部肌肉骨骼MRI图像的结果表明,相同的监督原则在不同模态之间具有可转移性。我们的性能和效率挑战了“规模即一切”的教条,并为临床医生在笔记本电脑上开发最先进的RFMs提供了平等的机会。
Summary / 总结
GreenRFM is a resource-efficient pre-training framework for radiology foundation models (RFMs) that addresses the issue of brute-force scaling in existing approaches. It uses a principled supervision design called MUST to achieve state-of-the-art performance while reducing computational requirements. GreenRFM offers two configurations: a performant model that establishes a new state-of-the-art using a single 24GB GPU within 24 hours, and a lightweight model that matches existing benchmarks with 6GB VRAM in 4 hours. Extensive experiments on over 200,000 images from four institutions and two modalities show that GreenRFMs outperform a range of baseline models and demonstrate transferability between different modalities.
GreenRFM 是一种资源高效的放射学基础模型(RFMs)预训练框架,解决了现有方法中依赖粗暴扩展的问题。它使用一种名为MUST的原理性监督设计来实现最先进的性能,同时减少计算需求。GreenRFM 提供两种配置:一种高性能模型在单个24GB GPU上24小时内达到新的最先进的水平,另一种轻量级模型在4小时内使用6GB VRAM达到现有基准水平。在来自四个机构和两种模态的超过200,000张图像的广泛实验中,GreenRFMs 在胸部和腹部CT数据集上表现出色,无论是在公共还是私有基准上都超越了多种基线模型,并且在内部肌肉骨骼MRI图像上展示了不同模态之间的可转移性。
SSL-SLR: Self-Supervised Representation Learning for Sign Language Recognition
Authors: Ariel Basso Madjoukeng, Jérôme Fink, Pierre Poitier, Edith Belise Kenmogne, Benoit Frenay
First: 2025-09-05T15:38:19+00:00 · Latest: 2026-03-06T16:50:24+00:00
Abstract
Sign language recognition (SLR) is a machine learning task aiming to identify signs in videos. Due to the scarcity of annotated data, unsupervised methods like contrastive learning have become promising in this field. They learn meaningful representations by pulling positive pairs (two augmented versions of the same instance) closer and pushing negative pairs (different from the positive pairs) apart. In SLR, in a sign video, only certain parts provide information that is truly useful for its recognition. Applying contrastive methods to SLR raises two issues: (i) contrastive learning methods treat all parts of a video in the same way, without taking into account the relevance of certain parts over others; (ii) shared movements between different signs make negative pairs highly similar, complicating sign discrimination. These issues lead to learning non-discriminative features for sign recognition and poor results in downstream tasks. In response, this paper proposes a self-supervised learning framework designed to learn meaningful representations for SLR. This framework consists of two key components designed to work together: (i) a new self-supervised approach with free-negative pairs; (ii) a new data augmentation technique. This approach shows a considerable gain in accuracy compared to several contrastive and self-supervised methods, across linear evaluation, semi-supervised learning, and transferability between sign languages.
中文标题/摘要
标题:SSL-SLR: 自监督表示学习在手语识别中的应用
手语识别(SLR)是一项机器学习任务,旨在识别视频中的手势。由于标注数据的稀缺,无监督方法如对比学习在这一领域变得很有前景。它们通过将正样本对(同一实例的两个增强版本)拉近,将负样本对(不同于正样本对)推开,来学习有意义的表示。在SLR中,手势视频中只有某些部分提供了真正有用的信息用于识别。将对比学习方法应用于SLR存在两个问题:(i)对比学习方法对待视频中的所有部分一视同仁,不考虑某些部分的相关性;(ii)不同手势之间的共享动作使负样本对高度相似,增加了手势区分的难度。这些问题导致学习非区分性特征,下游任务结果不佳。为此,本文提出了一种自监督学习框架,旨在为SLR学习有意义的表示。该框架由两个设计用于协同工作的关键组件组成:(i)一种新的自监督方法,带有自由负样本对;(ii)一种新的数据增强技术。该方法在线性评估、半监督学习和手语语言之间的可迁移性方面,相对于多种对比学习和自监督方法,显示出显著的准确率提升。
Summary / 总结
This paper proposes SSL-SLR, a self-supervised learning framework for sign language recognition (SLR). It targets two issues that arise when contrastive methods are applied to SLR: all parts of a video are treated equally regardless of their relevance, and movements shared between different signs make negative pairs highly similar. The framework combines a new self-supervised approach with free-negative pairs and a new data augmentation technique. The method shows a considerable accuracy gain over several contrastive and self-supervised methods in linear evaluation, semi-supervised learning, and transfer between sign languages.
该论文通过提出SSL-SLR框架来解决手语识别(SLR)的挑战。该框架通过引入一种新的自监督方法(带有自由负样本对)和一种新型的数据增强技术来解决对比学习方法中的问题。该方法在SLR任务中的准确率得到了显著提升,优于多种对比学习和自监督方法,在线性评估、半监督学习和跨手语语言迁移性方面表现更佳。
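The contrastive objective that the abstract contrasts itself with (pull positive pairs together, push negative pairs apart) can be illustrated with a generic InfoNCE-style loss. This is a minimal sketch, not the SSL-SLR method: the function name `info_nce_loss`, the temperature value, and the use of plain NumPy arrays as embeddings are all assumptions made for illustration.

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """Generic InfoNCE-style loss (illustrative, not the paper's objective).

    z1, z2: arrays of shape (N, D). Row i of z1 and row i of z2 are two
    augmented views of the same instance (a positive pair); every other
    row of z2 acts as a negative for row i of z1.
    """
    # L2-normalise so that dot products are cosine similarities.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)

    logits = z1 @ z2.T / temperature                     # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Positives sit on the diagonal; average their negative log-probability.
    return float(-np.mean(np.diag(log_prob)))
```

When two signs share most of their movements, the off-diagonal similarities in `logits` approach the diagonal ones, so the loss stays high even for well-aligned views. That is the difficulty with negative pairs that the abstract describes.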
Pinterest Canvas: Large-Scale Image Generation at Pinterest
Authors: Yu Wang, Eric Tzeng, Raymond Shiau, Jie Yang, Dmitry Kislyuk, Charles Rosenberg
First: 2026-03-06T16:43:44+00:00 · Latest: 2026-03-06T16:43:44+00:00
Abstract
While recent image generation models demonstrate a remarkable ability to handle a wide variety of image generation tasks, this flexibility makes them hard to control via prompting or simple inference adaptation alone, rendering them unsuitable for use cases with strict product requirements. In this paper, we introduce Pinterest Canvas, our large-scale image generation system built to support image editing and enhancement use cases at Pinterest. Canvas is first trained on a diverse, multimodal dataset to produce a foundational diffusion model with broad image-editing capabilities. However, rather than relying on one generic model to handle every downstream task, we instead rapidly fine-tune variants of this base model on task-specific datasets, producing specialized models for individual use cases. We describe key components of Canvas and summarize our best practices for dataset curation, training, and inference. We also showcase task-specific variants through case studies on background enhancement and aspect-ratio outpainting, highlighting how we tackle their specific product requirements. Online A/B experiments demonstrate that our enhanced images receive a significant 18.0% and 12.5% engagement lift, respectively, and comparisons with human raters further validate that our models outperform third-party models on these tasks. Finally, we showcase other Canvas variants, including multi-image scene synthesis and image-to-video generation, demonstrating that our approach can generalize to a wide variety of potential downstream tasks.
中文标题/摘要
标题:Pinterest Canvas:Pinterest的大规模图像生成系统
虽然最近的图像生成模型在处理各种图像生成任务方面表现出惊人的能力,但这种灵活性使得仅通过提示或简单的推理适应来控制它们变得困难,从而使它们不适合具有严格产品要求的应用场景。在本文中,我们介绍了Pinterest Canvas,这是我们为支持Pinterest的图像编辑和增强用例而构建的大规模图像生成系统。Canvas首先在多样化的多模态数据集上进行训练,以生成具有广泛图像编辑能力的基础扩散模型。然而,我们并没有依赖一个通用模型来处理每个下游任务,而是快速在任务特定的数据集上微调该基础模型的变体,从而为每个用例生成专门的模型。我们描述了Canvas的关键组件,并总结了我们在数据集编目、训练和推理方面的最佳实践。我们还通过背景增强和宽高比扩展的具体案例研究展示了任务特定的变体,突出了我们如何解决它们的特定产品要求。在线A/B实验表明,我们的增强图像分别获得了显著的18.0%和12.5%的参与度提升,而与人类评估者的比较进一步验证了我们的模型在这些任务上优于第三方模型。最后,我们展示了其他Canvas变体,包括多图像场景合成和图像到视频生成,证明了我们的方法可以应用于各种潜在的下游任务。
Summary / 总结
Pinterest Canvas is a large-scale image generation system built to support image editing and enhancement use cases with strict product requirements at Pinterest. A foundational diffusion model is first trained on a diverse, multimodal dataset and then rapidly fine-tuned on task-specific datasets to produce specialized variants. In online A/B experiments, enhanced images receive engagement lifts of 18.0% for background enhancement and 12.5% for aspect-ratio outpainting, and comparisons with human raters indicate that the models outperform third-party models on these tasks.
Pinterest Canvas 是一个大规模图像生成系统,旨在支持 Pinterest 的图像编辑和增强功能。它首先通过一个多样化的多模态数据集进行训练,以创建一个具有广泛图像编辑能力的基础扩散模型。然后,该模型会针对特定任务的数据集进行微调,以生成专门用于个别用例的模型。实验表明,Canvas 可以将图像的参与度分别提高 18.0% 和 12.5%,并且在背景增强和长宽比扩展任务中优于第三方模型。此外,Canvas 还可以应用于其他任务,如多图像场景合成和图像到视频生成,展示了其广泛的适用性。
How Reliable is Language Model Micro-Benchmarking?
Authors: Gregory Yauney, Shahzaib Saqib Warraich, Swabha Swayamdipta
Venue: ICLR 2026
First: 2025-10-09T18:37:03+00:00 · Latest: 2026-03-06T16:42:35+00:00
Comments: Published at ICLR 2026
Abstract
Micro-benchmarking offers a solution to the often prohibitive time and cost of language model development: evaluate on a very small subset of existing benchmarks. Can these micro-benchmarks, however, rank models as consistently as the full benchmarks they replace? And can they rank models more consistently than selecting a random subset of data points? In many scenarios, we find that the answer is no. We introduce a meta-evaluation measure for micro-benchmarking which investigates how well a micro-benchmark can rank two models as a function of their performance difference on the full benchmark. This approach can determine which model pairs can be ranked correctly by a micro-benchmark, allowing for a finer-grained analysis of the trade-off between micro-benchmark size and reliability. Prior work has suggested selecting as few as 10 examples; we find that no micro-benchmarking method can consistently rank model pairs 3.5 points of accuracy apart on MMLU-Pro or 4 points apart on BIG-bench Hard. In order to consistently rank model pairs with relatively similar performances, we show that often as many as 250 examples must be selected, at which point random sampling is competitive with existing micro-benchmarking methods. When comparing only 8B instruction-tuned models on MMLU-Pro micro-benchmarks with 25 examples, we find that more than half of pairwise comparisons are not likely to be preserved. Our work provides actionable guidance for both micro-benchmark users and developers in navigating the trade-off between evaluation efficiency and reliability.
中文标题/摘要
标题:语言模型微基准测试的可靠性如何?
微基准测试提供了一种解决语言模型开发中通常高昂的时间和成本问题的方法:在现有基准的小子集上进行评估。然而,这些微基准测试能否像它们所替代的完整基准测试那样一致地对模型进行排名?或者它们能否比随机选择数据点的子集更一致地对模型进行排名?在许多情况下,我们发现答案是否定的。我们引入了一种元评估指标,用于研究微基准测试如何根据模型在完整基准测试上的性能差异来正确排名两个模型。这种方法可以确定哪些模型对可以由微基准测试正确排名,从而对微基准测试规模与可靠性之间的权衡进行更细致的分析。先前的研究建议选择最少10个示例;我们发现,没有任何微基准测试方法能够在MMLU-Pro上一致地对3.5个准确度点的模型对进行排名,或在BIG-bench Hard上对4个准确度点的模型对进行排名。为了能够一致地对具有相对相似性能的模型对进行排名,我们表明通常需要选择多达250个示例,此时随机抽样与现有的微基准测试方法具有竞争力。当仅在MMLU-Pro微基准测试中比较8B指令调优模型时,我们发现超过一半的成对比较不太可能被保留。我们的研究为微基准测试用户和开发者提供了关于评估效率与可靠性之间权衡的实际指导。
Summary / 总结
This study evaluates the reliability of language model micro-benchmarks by comparing them with the full benchmarks they replace and with random subsets of data points. It introduces a meta-evaluation measure of how well a micro-benchmark ranks two models as a function of their performance difference on the full benchmark. No micro-benchmarking method consistently ranks model pairs that are 3.5 accuracy points apart on MMLU-Pro or 4 points apart on BIG-bench Hard. Consistently ranking models with similar performance often requires as many as 250 examples, at which point random sampling is competitive with existing micro-benchmarking methods. For 8B instruction-tuned models on MMLU-Pro micro-benchmarks with 25 examples, more than half of pairwise comparisons are unlikely to be preserved.
研究评估了语言模型微基准测试的可靠性,通过将其与完整基准和随机子集进行比较。引入了一种元评估方法来评估模型对基于性能差异的排名一致性。研究发现,没有微基准测试方法可以一致地排名具有显著准确度差异的模型对,至少需要250个示例才能进行可靠的比较。即使在25个示例的情况下,超过一半的模型对比较对于MMLU-Pro微基准测试中的8B指令调优模型也是不可靠的。
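The random-sampling baseline discussed in the abstract can be sketched as a small simulation: draw random subsets of a benchmark and count how often the subset ranks two models the same way as the full benchmark does. This is an illustrative sketch, not the paper's meta-evaluation measure. The function name `pairwise_agreement`, its parameters, and the per-example 0/1 correctness vectors are assumptions.

```python
import numpy as np

def pairwise_agreement(correct_a, correct_b, subset_size, trials=1000, seed=0):
    """Fraction of random micro-benchmarks that preserve the full-benchmark ranking.

    correct_a, correct_b: per-example correctness (1 = correct, 0 = wrong)
    of two models on the same full benchmark. A tie on a subset counts as a
    disagreement whenever the full benchmark separates the two models.
    """
    rng = np.random.default_rng(seed)
    correct_a = np.asarray(correct_a, dtype=float)
    correct_b = np.asarray(correct_b, dtype=float)
    full_sign = np.sign(correct_a.mean() - correct_b.mean())

    agree = 0
    for _ in range(trials):
        # Sample a micro-benchmark uniformly at random, without replacement.
        idx = rng.choice(correct_a.size, size=subset_size, replace=False)
        subset_sign = np.sign(correct_a[idx].mean() - correct_b[idx].mean())
        agree += int(subset_sign == full_sign)
    return agree / trials
```

Sweeping `subset_size` for model pairs with different accuracy gaps gives a rough picture of the size versus reliability trade-off that the abstract analyzes.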
Taxonomy-aware Dynamic Motion Generation on Hyperbolic Manifolds
Authors: Luis Augenstein, Noémie Jaquier, Tamim Asfour, Leonel Rozo
Venue: ICRA
First: 2025-09-25T15:03:03+00:00 · Latest: 2026-03-06T16:42:28+00:00
Comments: Accepted for publication in IEEE Conference on Robotics and Automation (ICRA), 8 pages, 6 figures, 1 table
Abstract
Human-like motion generation for robots often draws inspiration from biomechanical studies, which often categorize complex human motions into hierarchical taxonomies. While these taxonomies provide rich structural information about how movements relate to one another, this information is frequently overlooked in motion generation models, leading to a disconnect between the generated motions and their underlying hierarchical structure. This paper introduces the Gaussian Process Hyperbolic Dynamical Model (GPHDM), a novel approach that learns latent representations preserving both the hierarchical structure of motions and their temporal dynamics to ensure physical consistency. Our model achieves this by extending the dynamics prior of the Gaussian Process Dynamical Model (GPDM) to the hyperbolic manifold and integrating it with taxonomy-aware inductive biases. Building on this geometry- and taxonomy-aware framework, we propose three novel mechanisms for generating motions that are both taxonomically-structured and physically-consistent: two probabilistic recursive approaches and a method based on pullback-metric geodesics. Experiments on generating realistic motion sequences on the hand grasping taxonomy show that the proposed GPHDM faithfully encodes the underlying taxonomy and temporal dynamics, and it generates novel physically-consistent trajectories.
中文标题/摘要
标题:双曲流形上具有分类意识的动态运动生成
类人的机器人运动生成通常受到生物力学研究的启发,这些研究常常将复杂的人类运动归入分层分类体系。虽然这些分类体系提供了丰富的关于运动之间关系的结构信息,但在运动生成模型中,这些信息经常被忽视,导致生成的运动与其潜在的分层结构脱节。本文提出了高斯过程双曲动力学模型(GPHDM),这是一种新颖的方法,它学习同时保留运动分层结构和时间动态的潜在表示,以确保物理一致性。我们的模型通过将高斯过程动力学模型(GPDM)的动力学先验扩展到双曲流形,并将其与分类感知的归纳偏置相结合来实现这一点。在这一几何与分类感知框架的基础上,我们提出了三种新颖机制来生成既具有分类结构又具有物理一致性的运动:两种概率递归方法和一种基于拉回度量测地线的方法。在手部抓握分类体系上生成逼真运动序列的实验表明,所提出的GPHDM忠实地编码了潜在的分类体系和时间动态,并生成了新颖的物理一致轨迹。
Summary / 总结
This paper addresses the gap between hierarchical taxonomies of human motions and their representation in motion generation models. It introduces the GPHDM, which learns latent representations that preserve both the hierarchical structure and temporal dynamics of motions. The model extends the GPDM dynamics to hyperbolic manifolds and integrates taxonomy-aware inductive biases. Three mechanisms are proposed for generating taxonomically-structured and physically-consistent motions. Experiments demonstrate that GPHDM accurately encodes the underlying taxonomy and temporal dynamics, and generates novel, physically-consistent trajectories.
本文解决了人类运动的层次结构与其在机器人运动生成中的物理一致性之间的差距。它引入了GPHDM,该模型学习保留运动的层次结构和时间动态的潜在表示。该模型将GPDM的动力学先验扩展到双曲流形,并结合了层次结构感知的归纳偏置。提出了三种机制:两种概率递归方法和基于拉回度量测地线的方法,以生成具有层次结构和物理一致性的运动。实验表明,GPHDM准确地编码了层次结构和时间动态,并生成了新的、物理上一致的轨迹。
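Hyperbolic geometry suits hierarchical taxonomies because distances grow rapidly toward the boundary of the space, leaving room for tree-like branching. The sketch below computes the standard geodesic distance in the Poincaré ball model. It is a generic illustration only: the abstract does not state which model of hyperbolic space GPHDM uses, and the function name `poincare_distance` is an assumption.

```python
import numpy as np

def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincaré ball.

    d(u, v) = arccosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))
    Both points must have Euclidean norm strictly less than 1.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    sq_diff = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return float(np.arccosh(1.0 + 2.0 * sq_diff / denom))

# The same Euclidean offset of 0.1 is a longer hyperbolic distance
# near the boundary than near the origin.
near_origin = poincare_distance([0.0, 0.0], [0.1, 0.0])
near_boundary = poincare_distance([0.8, 0.0], [0.9, 0.0])
```

In taxonomy embeddings, general classes typically sit near the origin and leaves near the boundary, so siblings deep in the hierarchy remain well separated.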
What if? Emulative Simulation with World Models for Situated Reasoning
Authors: Ruiping Liu, Yufan Chen, Yuheng Zhang, Junwei Zheng, Kunyu Peng, Chengzhi Wu, Chenguang Huang, Di Wen, Jiaming Zhang, Kailun Yang, Rainer Stiefelhagen
First: 2026-03-06T16:37:15+00:00 · Latest: 2026-03-06T16:37:15+00:00
Abstract
Situated reasoning often relies on active exploration, yet in many real-world scenarios such exploration is infeasible due to physical constraints of robots or safety concerns of visually impaired users. Given only a limited observation, can an agent mentally simulate a future trajectory toward a target situation and answer spatial what-if questions? We introduce WanderDream, the first large-scale dataset designed for the emulative simulation of mental exploration, enabling models to reason without active exploration. WanderDream-Gen comprises 15.8K panoramic videos across 1,088 real scenes from HM3D, ScanNet++, and real-world captures, depicting imagined trajectories from current viewpoints to target situations. WanderDream-QA contains 158K question-answer pairs, covering starting states, paths, and end states along each trajectory to comprehensively evaluate exploration-based reasoning. Extensive experiments with world models and MLLMs demonstrate (1) that mental exploration is essential for situated reasoning, (2) that world models achieve compelling performance on WanderDream-Gen, (3) that imagination substantially facilitates reasoning on WanderDream-QA, and (4) that WanderDream data exhibit remarkable transferability to real-world scenarios. The source code and all data will be released.
中文标题/摘要
标题:如果呢?基于世界模型的模仿模拟以实现情境推理
情境推理通常依赖于积极的探索,但在许多现实场景中,由于机器人物理限制或视力受损用户的安全问题,这种探索往往是不可行的。仅凭有限的观察,智能体能否在心中模拟一条通往目标情境的未来轨迹并回答空间的“如果”问题?我们引入了WanderDream,这是首个用于模仿模拟心理探索的大规模数据集,使模型能够在没有积极探索的情况下进行推理。WanderDream-Gen 包含来自HM3D、ScanNet++和真实场景捕获的1,088个真实场景中的15,800个全景视频,描绘了从当前视角到目标情境的想象轨迹。WanderDream-QA 包含158,000个问题-答案对,涵盖了每个轨迹的起始状态、路径和结束状态,全面评估探索性推理。广泛的实验表明:(1) 心理探索对于情境推理至关重要;(2) 世界模型在WanderDream-Gen 上表现出色;(3) 想象显著促进了WanderDream-QA 上的推理;(4) WanderDream 数据在真实场景中表现出显著的迁移性。源代码和所有数据将被发布。
Summary / 总结
The paper addresses situated reasoning in settings where active exploration is infeasible by introducing WanderDream, a dataset for the emulative simulation of mental exploration. WanderDream-Gen comprises 15.8K panoramic videos across 1,088 real scenes, and WanderDream-QA contains 158K question-answer pairs. Experiments with world models and MLLMs show that mental exploration is essential for situated reasoning, that world models achieve compelling performance on WanderDream-Gen, that imagination substantially facilitates reasoning on WanderDream-QA, and that the data transfers to real-world scenarios.
论文旨在解决无法进行主动探索时的情境推理问题,引入了用于模仿模拟心理探索的WanderDream数据集。WanderDream-Gen包含15.8K全景视频,覆盖1,088个真实场景,而WanderDream-QA包含158K问题-答案对。使用世界模型和MLLMs的实验表明:心智探索对情境推理至关重要,世界模型在WanderDream-Gen上表现出色,想象显著促进了WanderDream-QA上的推理,并且数据能够迁移到真实场景。