MME-CoF-Pro: Evaluating Reasoning Coherence in Video Generative Models with Text and Visual Hints
Authors: Yu Qi, Xinyi Xu, Ziyu Guo, Siyuan Ma, Renrui Zhang, Xinyan Chen, Ruichuan An, Ruofan Xing, Jiayi Zhang, Haojie Huang, Pheng-Ann Heng, Jonathan Tremblay, Lawson L. S. Wong
First: 2026-03-20T17:59:56+00:00 · Latest: 2026-03-20T17:59:56+00:00
Abstract
Video generative models show emerging reasoning behaviors. For reliable deployment, it is essential that generated events remain causally consistent across frames, a property we define as reasoning coherence. To address the lack of reasoning coherence evaluation in the literature, we propose MME-CoF-Pro, a comprehensive video reasoning benchmark for assessing reasoning coherence in video models. Specifically, MME-CoF-Pro contains 303 samples across 16 categories, ranging from visual logical to scientific reasoning. It introduces Reasoning Score as an evaluation metric for assessing the necessary process-level intermediate reasoning steps, and includes three evaluation settings, (a) no hint, (b) text hint, and (c) visual hint, enabling a controlled investigation into the underlying mechanisms of reasoning hint guidance. Evaluation results on 7 open and closed-source video models reveal insights including: (1) Video generative models exhibit weak reasoning coherence, decoupled from generation quality. (2) Text hints boost apparent correctness but often cause inconsistency and hallucinated reasoning. (3) Visual hints benefit structured perceptual tasks but struggle with fine-grained perception. Website: https://video-reasoning-coherence.github.io/
中文标题/摘要
标题:MME-CoF-Pro:使用文本和视觉提示评估视频生成模型中的推理连贯性
视频生成模型展示了新兴的推理行为。确保生成的事件在帧间保持因果一致性对于可靠部署至关重要,我们将其定义为推理连贯性。为弥补文献中缺乏推理连贯性评估的空白,我们提出了MME-CoF-Pro,一个全面的视频推理基准,用于评估视频模型中的推理连贯性。具体而言,MME-CoF-Pro 包含16个类别中的303个样本,从视觉逻辑到科学推理不等。它引入了推理评分作为评估指标,用于评估过程级必要的中间推理步骤,并包括三种评估设置:(a) 无提示 (b) 文本提示和 (c) 视觉提示,从而可以控制地调查推理提示指导的内在机制。在7个开源和闭源视频模型中的评估结果揭示了以下见解:(1) 视频生成模型在推理连贯性方面表现较弱,与生成质量无关。(2) 文本提示提高了显性的正确性,但往往导致不一致和虚构的推理。(3) 视觉提示有利于结构化的感知任务,但在细粒度感知方面存在困难。网站:https://video-reasoning-coherence.github.io/
Summary / 总结
The research aims to evaluate reasoning coherence in video generative models, a critical property for their reliable deployment. MME-CoF-Pro, a new benchmark, assesses reasoning coherence through a Reasoning Score and three evaluation settings: no hint, text hint, and visual hint. The study finds that video generative models generally lack reasoning coherence, with text hints improving apparent correctness but causing inconsistencies, while visual hints help with structured tasks but not fine-grained perception.
研究旨在评估视频生成模型中的推理连贯性,这是可靠部署的关键属性。提出了MME-CoF-Pro这一全面基准,通过303个样本和16个类别,使用推理评分指标来评估推理连贯性。研究发现,视频生成模型的推理连贯性较弱,文本提示可以提高表面正确性但常导致不一致,而视觉提示有利于结构化的感知任务但难以处理精细的感知任务。
From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering
Authors: Xinyi Shang, Yi Tang, Jiacheng Cui, Ahmed Elhagry, Salwa K. Al Khatib, Sondos Mahmoud Bsharat, Jiacheng Liu, Xiaohan Zhao, Jing-Hao Xue, Hao Li, Salman Khan, Zhiqiang Shen
Venue: CVPR 2026
First: 2026-03-20T17:59:54+00:00 · Latest: 2026-03-20T17:59:54+00:00
Comments: Code and data at: https://github.com/VILA-Lab/PIXAR (Accepted in CVPR 2026 Findings, but not opted in)
Abstract
Existing tampering detection benchmarks largely rely on object masks, which severely misalign with the true edit signal: many pixels inside a mask are untouched or only trivially modified, while subtle yet consequential edits outside the mask are treated as natural. We reformulate VLM image tampering from coarse region labels to a pixel-grounded, meaning- and language-aware task. First, we introduce a taxonomy spanning edit primitives (replace/remove/splice/inpaint/attribute/colorization, etc.) and the semantic class of the tampered object, linking low-level changes to high-level understanding. Second, we release a new benchmark with per-pixel tamper maps and paired category supervision to evaluate detection and classification within a unified protocol. Third, we propose a training framework and evaluation metrics that quantify pixel-level correctness with localization to assess confidence or prediction on true edit intensity, and further measure tamper meaning understanding via semantics-aware classification and natural language descriptions for the predicted regions. We also re-evaluate the existing strong segmentation/localization baselines on recent strong tamper detectors, reveal substantial over- and under-scoring when using mask-only metrics, and expose failure modes on micro-edits and off-mask changes. Our framework advances the field from masks to pixels, meanings, and language descriptions, establishing a rigorous standard for tamper localization, semantic classification, and description. Code and benchmark data are available at https://github.com/VILA-Lab/PIXAR.
中文标题/摘要
标题:从掩码到像素和意义:VLM 图像篡改的新分类体系、基准和度量
现有的篡改检测基准主要依赖于对象掩码,这严重偏离了真正的编辑信号:掩码内的许多像素未被修改或仅被轻微修改,而掩码外的细微但具有重大影响的编辑则被视为自然的。我们从粗略的区域标签重新定义VLM图像篡改为基于像素、具有意义和语言意识的任务。首先,我们引入了一种分类体系,涵盖编辑原语(替换/删除/拼接/修复/属性/着色等)及其篡改对象的语义类别,将低级变化与高级理解联系起来。其次,我们发布了一个新的基准,包含每个像素的篡改图和配对的类别监督,以在统一协议下评估检测和分类。第三,我们提出了一种训练框架和评估指标,量化像素级别的正确性并进行定位,以评估对真实编辑强度的信心或预测,并进一步通过语义感知分类和自然语言描述预测区域来衡量篡改意义的理解。我们还重新评估了现有的强大分割/定位基线在最近的强篡改检测器上的表现,并揭示了仅使用掩码指标的过度评分和不足评分,以及在微小编辑和掩码外变化上的失败模式。我们的框架将领域从掩码推进到像素、意义和语言描述,建立了篡改定位、语义分类和描述的严格标准。代码和基准数据可在https://github.com/VILA-Lab/PIXAR 获取。
Summary / 总结
The research aims to improve tampering detection in images by addressing the limitations of existing mask-based benchmarks. It introduces a new taxonomy of tampering types and a pixel-grounded benchmark with per-pixel tamper maps and category supervision. Key findings include the development of a training framework and evaluation metrics that assess both pixel-level correctness and semantic understanding of tampered regions, revealing the inadequacies of mask-only metrics in evaluating recent tamper detectors and highlighting the importance of considering subtle and off-mask edits.
本文针对现有依赖对象掩码的篡改检测基准存在的问题,这些问题往往与真实的编辑信号不匹配。它引入了一种篡改原语及其语义类别的新分类体系,并提出了一个新的基准,该基准包含逐像素篡改图和配对的类别监督。作者还开发了一个训练框架和评估指标,以评估像素级的正确性和语义理解。结果表明,仅依赖掩码的指标对现有强大的分割/定位基线的评价存在过高和过低的情况,并且在微小编辑和非掩码区域的变化方面暴露出失败模式。这项工作通过从掩码到像素、意义和语言描述的进步,建立了篡改定位、语义分类和描述的严格标准。
LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation
Authors: Jiazheng Xing, Fei Du, Hangjie Yuan, Pengwei Liu, Hongbin Xu, Hai Ci, Ruigang Niu, Weihua Chen, Fan Wang, Yong Liu
Venue: ICLR 2026
First: 2026-03-20T17:59:46+00:00 · Latest: 2026-03-20T17:59:46+00:00
Comments: ICLR 2026 Camera Ready Version. Code and Models: https://jiazheng-xing.github.io/lumosx-home/
Abstract
Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained control over both foreground and background elements. However, precise face-attribute alignment across subjects remains challenging, as existing methods lack explicit mechanisms to ensure intra-group consistency. Addressing this gap requires both explicit modeling strategies and face-attribute-aware data resources. We therefore propose LumosX, a framework that advances both data and model design. On the data side, a tailored collection pipeline orchestrates captions and visual cues from independent videos, while multimodal large language models (MLLMs) infer and assign subject-specific dependencies. These extracted relational priors impose a finer-grained structure that amplifies the expressive control of personalized video generation and enables the construction of a comprehensive benchmark. On the modeling side, Relational Self-Attention and Relational Cross-Attention intertwine position-aware embeddings with refined attention dynamics to inscribe explicit subject-attribute dependencies, enforcing disciplined intra-group cohesion and amplifying the separation between distinct subject clusters. Comprehensive evaluations on our benchmark demonstrate that LumosX achieves state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation. Code and models are available at https://jiazheng-xing.github.io/lumosx-home/.
中文标题/摘要
标题:LumosX:通过身份与其属性关联实现个性化视频生成
扩散模型的最新进展显著提高了文本到视频生成的效果,使得在精细控制前景和背景元素的同时实现个性化内容创作成为可能。然而,跨主体的精确面部属性对齐仍然具有挑战性,因为现有方法缺乏确保组内一致性的确切机制。解决这一缺口需要明确建模策略和面部属性感知的数据资源。因此,我们提出了LumosX框架,该框架在数据和模型设计方面均有所推进。在数据方面,定制化的采集流程协调来自独立视频的描述和视觉提示,而多模态大型语言模型(MLLMs)推断并分配主题特定的依赖关系。提取的关联先验施加了更精细的结构,增强了个性化视频生成的表达控制,并使构建全面基准成为可能。在建模方面,关系自注意力和关系交叉注意力将位置感知嵌入与精细的注意力动态交织在一起,以明确嵌入主题属性依赖关系,确保组内一致性并增强不同主题簇之间的分离。在我们基准上的全面评估表明,LumosX 在细粒度、身份一致性和语义对齐的个性化多主体视频生成方面达到了最先进的性能。代码和模型可在 https://jiazheng-xing.github.io/lumosx-home/ 获取。
Summary / 总结
LumosX addresses the challenge of precise face-attribute alignment in personalized video generation by proposing a framework that includes a tailored data collection pipeline and model design. The framework uses a collection pipeline to integrate captions and visual cues from independent videos and multimodal large language models to infer subject-specific dependencies. On the modeling side, it employs Relational Self-Attention and Relational Cross-Attention to enhance intra-group consistency and separation between subject clusters. Evaluations show that LumosX outperforms existing methods in generating fine-grained, identity-consistent, and semantically aligned personalized multi-subject videos.
LumosX 提出了一种框架,通过定制化的数据收集管道和模型设计来解决个性化视频生成中精确面部属性对齐的挑战。数据方面使用了收集管道和多模态大型语言模型来推断主体特定的依赖关系,而模型方面则采用了关系自注意力和关系交叉注意力来强化组内的一致性并增强不同主体集群之间的分离。评估结果显示,LumosX 在生成细粒度、身份一致且语义对齐的个性化多主体视频方面优于现有方法。
CoVR-R: Reason-Aware Composed Video Retrieval
Authors: Omkar Thawakar, Dmitry Demidov, Vaishnav Potlapalli, Sai Prasanna Teja Reddy Bogireddy, Viswanatha Reddy Gajjala, Alaa Mostafa Lasheen, Rao Muhammad Anwer, Fahad Khan
Venue: CVPR 2026
First: 2026-03-20T17:59:25+00:00 · Latest: 2026-03-20T17:59:25+00:00
Comments: CVPR 2026 (findings)
Abstract
Composed Video Retrieval (CoVR) aims to find a target video given a reference video and a textual modification. Prior work assumes the modification text fully specifies the visual changes, overlooking after-effects and implicit consequences (e.g., motion, state transitions, viewpoint or duration cues) that emerge from the edit. We argue that successful CoVR requires reasoning about these after-effects. We introduce a reasoning-first, zero-shot approach that leverages large multimodal models to (i) infer causal and temporal consequences implied by the edit, and (ii) align the resulting reasoned queries to candidate videos without task-specific finetuning. To evaluate reasoning in CoVR, we also propose CoVR-Reason, a benchmark that pairs each (reference, edit, target) triplet with structured internal reasoning traces and challenging distractors that require predicting after-effects rather than keyword matching. Experiments show that our zero-shot method outperforms strong retrieval baselines on recall at K and particularly excels on implicit-effect subsets. Our automatic and human analyses confirm higher step consistency and effect factuality in our retrieved results. Our findings show that incorporating reasoning into general-purpose multimodal models enables effective CoVR by explicitly accounting for causal and temporal after-effects. This reduces dependence on task-specific supervision, improves generalization to challenging implicit-effect cases, and enhances interpretability of retrieval outcomes. These results point toward a scalable and principled framework for explainable video search. The model, code, and benchmark are available at https://github.com/mbzuai-oryx/CoVR-R.
中文标题/摘要
标题:CoVR-R:基于推理的组合视频检索
组合视频检索(CoVR)旨在根据参考视频和文本修改找到目标视频。先前的工作假设修改文本完全规定了视觉变化,忽视了编辑带来的后效和隐含后果(例如,运动、状态转换、视角或持续时间线索)。我们认为成功的CoVR需要考虑这些后效。我们提出了一种以推理为主导的零样本方法,利用大规模多模态模型(i)推断编辑所暗示的因果和时间后果,(ii)将推理出的查询与候选视频对齐,无需特定任务的微调。为了评估CoVR中的推理,我们还提出了CoVR-Reason基准,它将每个(参考,编辑,目标)三元组与结构化的内部推理痕迹和需要预测后效而非关键词匹配的具有挑战性的干扰项配对。实验表明,我们的零样本方法在召回率K上优于强大的检索基线,并且特别擅长处理隐含效应子集。我们的自动和人工分析证实了我们检索结果的步骤一致性和后效事实性更高。我们的研究结果表明,将推理纳入通用多模态模型可以有效进行CoVR,因为它明确地考虑了因果和时间后效。这减少了对特定任务监督的依赖,提高了对具有挑战性的隐含效应案例的泛化能力,并增强了检索结果的可解释性。这些结果指出了可扩展且原理上的框架,用于解释视频搜索。该模型、代码和基准可在https://github.com/mbzuai-oryx/CoVR-R上获得。
Summary / 总结
CoVR-R introduces a reasoning-first approach for Composed Video Retrieval (CoVR) that leverages large multimodal models to infer causal and temporal consequences from edits and align reasoned queries to candidate videos without task-specific fine-tuning. The method outperforms strong retrieval baselines, especially on implicit-effect subsets, and improves step consistency and effect factuality in retrieved results. The approach reduces dependence on task-specific supervision and enhances interpretability, pointing towards a scalable framework for explainable video search.
CoVR-R 提出了一种因果推理先行的方法来解决组合视频检索 (CoVR) 问题,能够从文本编辑中推断出因果和时间上的后果,并在无需特定任务微调的情况下与候选视频对齐。该方法在召回率上优于强大的检索基线,特别是在隐含效果案例中表现尤为出色,并且检索结果具有更高的步骤一致性和效果真实性。这项工作表明,将推理纳入通用多模态模型可以有效解决 CoVR 问题,通过考虑因果和时间上的后果,提高泛化能力和检索结果的可解释性。该模型和基准已公开发布。
MeanFlow Meets Control: Scaling Sampled-Data Control for Swarms
Authors: Anqi Dong, Yongxin Chen, Karl H. Johansson, Johan Karlsson
First: 2026-03-20T17:59:18+00:00 · Latest: 2026-03-20T17:59:18+00:00
Abstract
Steering large-scale swarms in only a few control updates is challenging because real systems operate in sampled-data form: control inputs are updated intermittently and applied over finite intervals. In this regime, the natural object is not an instantaneous velocity field, but a finite-window control quantity that captures the system response over each sampling interval. Inspired by MeanFlow, we introduce a control-space learning framework for swarm steering under linear time-invariant dynamics. The learned object is the coefficient that parameterizes the finite-horizon minimum-energy control over each interval. We show that this coefficient admits both an integral representation and a local differential identity along bridge trajectories, which leads to a simple stop-gradient training objective. At implementation time, the learned coefficient is used directly in sampled-data updates, so the prescribed dynamics and actuation map are respected by construction. The resulting framework provides a scalable approach to few-step swarm steering that is consistent with the sampled-data structure of real control systems.
中文标题/摘要
标题:MeanFlow与控制的融合:扩展采样数据控制以适应群集
在仅需少数几次控制更新的情况下引导大规模群集极具挑战性,因为实际系统以采样数据形式运行:控制输入间歇性更新,并在有限时间区间内施加。在这种情况下,自然的研究对象不是瞬时速度场,而是刻画每个采样间隔内系统响应的有限窗口控制量。受MeanFlow启发,我们提出了一种在线性时不变动力学下进行群集引导的控制空间学习框架。所学习的对象是参数化每个间隔内有限时域最小能量控制的系数。我们证明了该系数沿桥轨迹既具有积分表示,又满足局部微分恒等式,由此得到一个简单的停止梯度(stop-gradient)训练目标。在实现时,所学习的系数直接用于采样数据更新,因此给定的动力学和执行映射在构造上即得到遵守。由此产生的框架提供了一种与实际控制系统的采样数据结构一致的、可扩展的少步群集引导方法。
Summary / 总结
The research addresses the challenge of steering large-scale swarms with only a few control updates, considering the sampled-data nature of real systems. It introduces a control-space learning framework inspired by MeanFlow, which learns the coefficient for finite-horizon minimum-energy control over each sampling interval. The key finding is that this coefficient can be used directly in sampled-data updates, ensuring that the prescribed dynamics and actuation map are respected, thus providing a scalable approach to few-step swarm steering.
本文旨在解决在只有少数几次控制更新的情况下引导大规模群集的问题,考虑到实际系统中的采样数据特性。该文引入了一个受MeanFlow启发的控制空间学习框架,该框架学习每个采样间隔内有限时域最小能量控制的系数。该学习到的系数直接用于采样数据更新,确保所规定的动力学和执行映射得以遵守。主要发现是,这种方法提供了一种与实际控制系统的采样数据结构相一致的可扩展的少步群集引导方法。
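As background for the "finite-horizon minimum-energy control over each interval" mentioned in the abstract, the following is a minimal sketch of the classical Gramian-based minimum-energy control for a scalar LTI system dx/dt = a*x + b*u. It is not the authors' learned coefficient or training objective; the scalar setting and the names `min_energy_control` and `simulate` are assumptions made purely for illustration.

```python
import math


def min_energy_control(a, b, x0, x_target, horizon):
    """Return u(t) steering dx/dt = a*x + b*u from x0 to x_target on [0, horizon]
    while minimizing the integral of u(t)^2 (classical scalar formula)."""
    # Controllability Gramian W = integral_0^T exp(2*a*s) * b^2 ds.
    if abs(a) < 1e-12:
        gramian = b * b * horizon
    else:
        gramian = b * b * (math.exp(2.0 * a * horizon) - 1.0) / (2.0 * a)
    # A single constant coefficient parameterizes the control on this interval.
    coeff = (x_target - math.exp(a * horizon) * x0) / gramian

    def u(t):
        return b * math.exp(a * (horizon - t)) * coeff

    return u


def simulate(a, b, x0, u, horizon, steps=2000):
    """Integrate dx/dt = a*x + b*u(t) with classical RK4 and return x(horizon)."""
    dt = horizon / steps
    x, t = x0, 0.0

    def f(time, state):
        return a * state + b * u(time)

    for _ in range(steps):
        k1 = f(t, x)
        k2 = f(t + dt / 2, x + dt / 2 * k1)
        k3 = f(t + dt / 2, x + dt / 2 * k2)
        k4 = f(t + dt, x + dt * k3)
        x += dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += dt
    return x


# One sampling interval: steer an unstable scalar system from 1.0 to 2.0.
u = min_energy_control(a=0.5, b=1.0, x0=1.0, x_target=2.0, horizon=1.0)
final_state = simulate(0.5, 1.0, 1.0, u, 1.0)
```

In this sketch the interval's control is fully determined by one constant (`coeff`), which is the kind of per-interval quantity a sampled-data controller could apply between updates; how the paper learns such a quantity for a whole swarm is not shown here.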
MuSteerNet: Human Reaction Generation from Videos via Observation-Reaction Mutual Steering
Authors: Yuan Zhou, Yongzhi Li, Yanqi Dai, Xingyu Zhu, Yi Tan, Qingshan Xu, Beier Zhu, Richang Hong, Hanwang Zhang
First: 2026-03-20T17:59:06+00:00 · Latest: 2026-03-20T17:59:06+00:00
Abstract
Video-driven human reaction generation aims to synthesize 3D human motions that directly react to observed video sequences, which is crucial for building human-like interactive AI systems. However, existing methods often fail to effectively leverage video inputs to steer human reaction synthesis, resulting in reaction motions that are mismatched with the content of video sequences. We reveal that this limitation arises from a severe relational distortion between visual observations and reaction types. In light of this, we propose MuSteerNet, a simple yet effective framework that generates 3D human reactions from videos via observation-reaction mutual steering. Specifically, we first propose a Prototype Feedback Steering mechanism to mitigate relational distortion by refining visual observations with a gated delta-rectification modulator and a relational margin constraint, guided by prototypical vectors learned from human reactions. We then introduce Dual-Coupled Reaction Refinement that fully leverages rectified visual cues to further steer the refinement of generated reaction motions, thereby effectively improving reaction quality and enabling MuSteerNet to achieve competitive performance. Extensive experiments and ablation studies validate the effectiveness of our method. Code coming soon: https://github.com/zhouyuan888888/MuSteerNet.
中文标题/摘要
标题:MuSteerNet: 观测-反应互导生成视频中的人类反应
基于视频的人类反应生成旨在合成直接对观察到的视频序列作出反应的3D人体动作,这对于构建类人交互式AI系统至关重要。然而,现有方法往往无法有效利用视频输入来引导人类反应的生成,导致反应动作与视频内容不匹配。我们发现这一局限性源于视觉观察与反应类型之间严重的相关性失真。鉴于此,我们提出MuSteerNet,这是一种简单而有效的框架,通过观测-反应互导从视频中生成3D人类反应。具体而言,我们首先提出了一种原型反馈引导机制,通过带有门控δ修正调节器和关系边际约束的原型向量引导的视觉观察精炼,来缓解相关性失真。然后,我们引入了双耦合反应精炼,充分利用修正后的视觉线索进一步引导生成反应动作的精炼,从而有效提高反应质量,并使MuSteerNet达到竞争性性能。广泛的实验和消融研究验证了我们方法的有效性。代码即将发布:https://github.com/zhouyuan888888/MuSteerNet。
Summary / 总结
MuSteerNet is designed to generate 3D human reactions that accurately respond to video sequences by addressing the relational distortion between visual observations and reaction types. It uses a Prototype Feedback Steering mechanism to refine visual observations and a Dual-Coupled Reaction Refinement to further improve reaction quality. Experiments and ablation studies show that MuSteerNet achieves competitive performance.
MuSteerNet旨在通过解决视觉观察与反应类型之间的关系失真问题,生成能够对视频序列做出反应的3D人体动作。它使用Prototype Feedback Steering机制来细化视觉观察,并使用Dual-Coupled Reaction Refinement进一步引导生成的反应动作。实验和消融研究表明,MuSteerNet取得了具有竞争力的性能。
Improving Image-to-Image Translation via a Rectified Flow Reformulation
Authors: Satoshi Iizuka, Shun Okamoto, Kazuhiro Fukui
First: 2026-03-20T17:59:03+00:00 · Latest: 2026-03-20T17:59:03+00:00
Abstract
In this work, we propose Image-to-Image Rectified Flow Reformulation (I2I-RFR), a practical plug-in reformulation that recasts standard I2I regression networks as continuous-time transport models. While pixel-wise I2I regression is simple, stable, and easy to adapt across tasks, it often over-smooths ill-posed and multimodal targets, whereas generative alternatives often require additional components, task-specific tuning, and more complex training and inference pipelines. Our method augments the backbone input by channel-wise concatenation with a noise-corrupted version of the ground-truth target and optimizes a simple t-reweighted pixel loss. This objective admits a rectified-flow interpretation via an induced velocity field, enabling ODE-based progressive refinement at inference time while largely preserving the standard supervised training pipeline. In most cases, adopting I2I-RFR requires only expanding the input channels, and inference can be performed with a few explicit solver steps (e.g., 3 steps) without distillation. Extensive experiments across multiple image-to-image translation and video restoration tasks show that I2I-RFR generally improves performance across a wide range of tasks and backbones, with particularly clear gains in perceptual quality and detail preservation. Overall, I2I-RFR provides a lightweight way to incorporate continuous-time refinement into conventional I2I models without requiring a heavy generative pipeline.
中文标题/摘要
标题:通过修正流重构提高图像到图像翻译
在本文中,我们提出了图像到图像修正流重构(I2I-RFR),这是一种实用的插件重构方法,将标准的I2I回归网络重新表述为连续时间传输模型。虽然像素级的I2I回归简单、稳定且易于跨任务适应,但它往往对病态和多模态目标过度平滑,而生成性替代方案通常需要额外的组件、任务特定的调整以及更复杂的训练和推理管道。我们的方法通过通道级连接噪声污染的目标真实值来增强主干输入,并优化一个简单的t加权像素损失。该目标通过诱导的速度场具有修正流的解释,允许在推理时基于ODE进行逐步细化,同时主要保留标准的监督训练管道。在大多数情况下,采用I2I-RFR只需要扩展输入通道,推理可以通过几步显式求解步骤(例如,3步)完成,而无需蒸馏。在多个图像到图像翻译和视频恢复任务上的广泛实验表明,I2I-RFR通常在各种任务和主干网络上提高了性能,特别是在感知质量和细节保留方面有明显的改进。总体而言,I2I-RFR提供了一种轻量级的方法,可以在传统的I2I模型中引入连续时间的细化,而无需使用复杂的生成管道。
Summary / 总结
The research aims to enhance image-to-image translation by proposing I2I-RFR, which reformulates standard I2I regression networks into continuous-time transport models. The method concatenates the backbone input with a noise-corrupted version of the ground-truth target and optimizes a t-reweighted pixel loss, enabling ODE-based progressive refinement. Experiments across various tasks demonstrate that I2I-RFR improves performance, especially in perceptual quality and detail preservation, with minimal changes to the training pipeline and efficient inference.
研究旨在通过提出I2I-RFR来改进图像到图像的转换,该方法将标准的I2I回归网络重新表述为连续时间传输模型。该方法通过将骨干输入与目标的噪声版本按通道连接,并优化t加权像素损失,使推理时能够进行逐级细化。实验表明,I2I-RFR在各种任务中提高了性能,特别是在感知质量和细节保留方面,且对训练管道的改动较小,推理步骤也较为高效。
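The "few explicit solver steps (e.g., 3 steps)" mentioned in the abstract can be illustrated with a plain Euler loop. This is only a sketch under stated assumptions, not the authors' implementation: it assumes a straight-line path x_t = (1 - t) * noise + t * target, assumes the backbone predicts the clean target from the source image, the current state, and t, and uses a hypothetical `predict_target` callable in place of a trained network.

```python
def euler_refine(predict_target, source, noise, num_steps=3):
    """Integrate from noise (t = 0) toward the target (t = 1) with explicit Euler steps.

    predict_target(source, x_t, t) is a hypothetical stand-in for a trained
    backbone that estimates the clean target (an assumption of this sketch).
    """
    x = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so the division below is safe
        x1_hat = predict_target(source, x, t)
        # Velocity induced by the target estimate on the assumed straight path.
        velocity = [(p - xi) / (1.0 - t) for p, xi in zip(x1_hat, x)]
        x = [xi + dt * v for xi, v in zip(x, velocity)]
    return x


def oracle(source, x_t, t):
    """Toy 'network' that always returns the true target (here: doubled pixels)."""
    return [2.0 * s for s in source]


source_pixels = [0.1, 0.5, 0.9]
start_noise = [0.3, -1.2, 0.7]
refined = euler_refine(oracle, source_pixels, start_noise, num_steps=3)
```

With a perfect target estimate the Euler iterates land on the target after the final step; with a real network each step would instead refine an imperfect estimate, which is where the progressive refinement described in the abstract would come from.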
VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking
Authors: Jingyang Lin, Jialian Wu, Jiang Liu, Ximeng Sun, Ze Wang, Xiaodong Yu, Jiebo Luo, Zicheng Liu, Emad Barsoum
Venue: CVPR 2026
First: 2026-03-20T17:58:47+00:00 · Latest: 2026-03-20T17:58:47+00:00
Comments: Accepted at CVPR 2026
Abstract
Video agentic models have advanced challenging video-language tasks. However, most agentic approaches still heavily rely on greedy parsing over densely sampled video frames, resulting in high computational cost. We present VideoSeek, a long-horizon video agent that leverages video logic flow to actively seek answer-critical evidence instead of exhaustively parsing the full video. This insight allows the model to use far fewer frames while maintaining, or even improving, its video understanding capability. VideoSeek operates in a think-act-observe loop with a well-designed toolkit for collecting multi-granular video observations. This design enables query-aware exploration over accumulated observations and supports practical video understanding and reasoning. Experiments on four challenging video understanding and reasoning benchmarks demonstrate that VideoSeek achieves strong accuracy while using far fewer frames than prior video agents and standalone LMMs. Notably, VideoSeek achieves a 10.2 absolute points improvement on LVBench over its base model, GPT-5, while using 93% fewer frames. Further analysis highlights the significance of leveraging video logic flow, strong reasoning capability, and the complementary roles of toolkit design.
中文标题/摘要
标题:VideoSeek:具有工具引导搜索的长时视频智能体
视频智能体模型在挑战性的视频-语言任务中取得了进展。然而,大多数智能体方法仍然高度依赖于对密集采样视频帧的贪婪解析,导致高计算成本。我们提出了VideoSeek,一种利用视频逻辑流主动寻找答案关键证据的长时视频智能体,而不是对整个视频进行耗时的解析。这一洞察使模型能够在使用更少帧的情况下,甚至保持或提高其视频理解能力。VideoSeek 在一个思考-行动-观察的循环中运行,并配备了一个精心设计的工具包来收集多粒度的视频观察。这种设计使模型能够对累积观察进行查询感知的探索,并支持实际的视频理解和推理。在四个具有挑战性的视频理解和推理基准上的实验表明,VideoSeek 在使用远少于先前视频智能体和独立的LMM的帧数的情况下,仍能实现强大的准确性。值得注意的是,与基线模型GPT-5相比,VideoSeek 在LVBench 上实现了10.2的绝对分值提升,同时使用了93%更少的帧。进一步的分析强调了利用视频逻辑流、强大的推理能力和工具包设计互补作用的重要性。
Summary / 总结
VideoSeek is a long-horizon video agent that uses video logic flow to actively seek answer-critical evidence, reducing the need for exhaustive frame parsing and improving computational efficiency. Experiments on four benchmarks show that VideoSeek achieves strong accuracy while using far fewer frames than prior video agents and standalone LMMs. Specifically, it improves over its base model, GPT-5, by 10.2 absolute points on LVBench while using 93% fewer frames.
VideoSeek 是一种利用视频逻辑流主动寻找答案关键证据的长时视频代理,减少了所需帧数的同时保持或提升了视频理解能力。实验表明,VideoSeek 在四个基准测试中取得了很高的准确率,同时使用的帧数远少于之前的视频代理和独立的LMM。具体来说,VideoSeek 在 LVBench 上比其基础模型 GPT-5 提高了 10.2 个绝对点,同时使用了 93% 更少的帧数。
Kolmogorov-Arnold causal generative models
Authors: Alejandro Almodóvar, Mar Elizo, Patricia A. Apellániz, Santiago Zazo, Juan Parras
First: 2026-03-20T17:58:38+00:00 · Latest: 2026-03-20T17:58:38+00:00
Comments: 14 pages, 8 figures, 3 tables, 5 algorithms, preprint
Abstract
Causal generative models provide a principled framework for answering observational, interventional, and counterfactual queries from observational data. However, many deep causal models rely on highly expressive architectures with opaque mechanisms, limiting auditability in high-stakes domains. We propose KaCGM, a causal generative model for mixed-type tabular data where each structural equation is parameterized by a Kolmogorov-Arnold Network (KAN). This decomposition enables direct inspection of learned causal mechanisms, including symbolic approximations and visualization of parent-child relationships, while preserving query-agnostic generative semantics. We introduce a validation pipeline based on distributional matching and independence diagnostics of inferred exogenous variables, allowing assessment using observational data alone. Experiments on synthetic and semi-synthetic benchmarks show competitive performance against state-of-the-art methods. A real-world cardiovascular case study further demonstrates the extraction of simplified structural equations and interpretable causal effects. These results suggest that expressive causal generative modeling and functional transparency can be achieved jointly, supporting trustworthy deployment in tabular decision-making settings. Code: https://github.com/aalmodovares/kacgm
中文标题/摘要
标题:柯尔莫哥洛夫-阿诺尔德因果生成模型
因果生成模型提供了一种从观测数据中回答观察性、干预性和反事实查询的原理性框架。然而,许多深度因果模型依赖于具有不透明机制的高度表达性架构,限制了在高风险领域中的可审计性。我们提出了KaCGM,一种用于混合型表格数据的因果生成模型,其中每个结构方程由柯尔莫哥洛夫-阿诺尔德网络(KAN)参数化。这种分解使得可以直接检查学习到的因果机制,包括符号近似和父节点-子节点关系的可视化,同时保持查询无关的生成语义。我们基于分布匹配和推断出的外生变量的独立性诊断引入了一个验证管道,允许仅使用观测数据进行评估。在合成和半合成基准上的实验显示,KaCGM 的性能与最先进的方法相当。一个实际的心血管案例研究进一步证明了简化结构方程和可解释因果效应的提取。这些结果表明,高度表达性的因果生成建模和功能透明性可以同时实现,支持在表格决策制定环境中可信部署。代码:https://github.com/aalmodovares/kacgm
Summary / 总结
The research aims to develop a causal generative model that maintains transparency and auditability in high-stakes domains. KaCGM uses Kolmogorov-Arnold Networks to parameterize structural equations, enabling direct inspection of causal mechanisms. Experiments show competitive performance on synthetic and semi-synthetic benchmarks, and a real-world cardiovascular study demonstrates the extraction of interpretable causal effects. This suggests that expressive causal modeling and functional transparency can be achieved jointly.
研究旨在通过使用Kolmogorov-Arnold Network (KAN) 参数化结构方程来开发一种透明的因果生成模型,用于混合型表格数据。这种方法允许直接检查学习到的因果机制,包括符号近似和父-子关系的可视化,同时保持生成语义。合成和半合成基准上的实验表明,提出的KaCGM模型在性能上与最先进的方法相当。心血管领域的实际案例进一步展示了可解释的因果效应的提取,证明了该模型在决策制定环境中的可信部署潜力。
Improving Generalization on Cybersecurity Tasks with Multi-Modal Contrastive Learning
Authors: Jianan Huang, Rodolfo V. Valentim, Luca Vassio, Matteo Boffa, Marco Mellia, Idilio Drago, Dario Rossi
First: 2026-03-20T17:57:00+00:00 · Latest: 2026-03-20T17:57:00+00:00
Comments: Submitted to Euro S&P - 5th International Workshop on Designing and Measuring Security in Systems with AI
Abstract
The use of ML in cybersecurity has long been impaired by generalization issues: Models that work well in controlled scenarios fail to maintain performance in production. The root cause often lies in ML algorithms learning superficial patterns (shortcuts) rather than underlying cybersecurity concepts. We investigate contrastive multi-modal learning as a first step towards improving ML performance in cybersecurity tasks. We aim at transferring knowledge from data-rich modalities, such as text, to data-scarce modalities, such as payloads. We set up a case study on threat classification and propose a two-stage multi-modal contrastive learning framework that uses textual vulnerability descriptions to guide payload classification. First, we construct a semantically meaningful embedding space using contrastive learning on descriptions. Then, we align payloads to this space, transferring knowledge from text to payloads. We evaluate the approach on a large-scale private dataset and a synthetic benchmark built from public CVE descriptions and LLM-generated payloads. The methodology appears to reduce shortcut learning over baselines on both benchmarks. We release our synthetic benchmark and source code as open source.
中文标题/摘要
标题:使用多模态对比学习提高网络安全任务的泛化能力
在网络安全领域,机器学习(ML)长期以来受到泛化问题的困扰:在受控场景中表现良好的模型在生产环境中无法保持性能。问题的根本原因通常是ML算法学习表面模式(捷径)而非网络安全概念。我们研究对比多模态学习作为提高网络安全任务ML性能的第一步。我们旨在将数据丰富的模态(如文本)的知识转移到数据稀缺的模态(如载荷)上。我们针对威胁分类进行了案例研究,并提出了一种两阶段多模态对比学习框架,使用文本漏洞描述来指导载荷分类。首先,我们使用描述上的对比学习构建了一个语义上有意义的嵌入空间。然后,我们将载荷对齐到这个空间,将文本中的知识转移到载荷上。我们在一个大规模的私有数据集和一个基于公共CVE描述和LLM生成的载荷构建的合成基准上评估了该方法。该方法在两个基准上似乎减少了相对于基线的捷径学习。我们发布了我们的合成基准和源代码作为开源。
Summary / 总结
The paper addresses the generalization issues in cybersecurity ML models by proposing a multi-modal contrastive learning framework. It uses textual vulnerability descriptions to guide payload classification, constructing a semantically meaningful embedding space and then aligning payloads to it. The approach is evaluated on a large-scale private dataset and a synthetic benchmark, showing reduced shortcut learning compared to baselines. The synthetic benchmark and source code are released as open source.
论文通过提出多模态对比学习框架来解决网络安全ML模型的泛化问题,利用漏洞描述文本指导恶意负载分类,构建语义相关的嵌入空间并将其对齐给恶意负载。该方法在大规模私有数据集和基于公共CVE描述和LLM生成恶意负载构建的合成基准上进行评估,显示出与基线相比减少了捷径学习。合成基准和源代码已开源发布。
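The abstract does not state which contrastive objective is used, so the following is only a sketch of one common choice, an InfoNCE-style loss, applied to the second stage (aligning payload embeddings to a fixed text embedding space). The function names and the toy embeddings are assumptions for illustration, not the authors' code or data.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: anchors[i] should match positives[i]; every other
    positives[j] in the batch acts as a negative."""
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [sum(x * y for x, y in zip(anchor, pos)) / temperature for pos in p]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


# Toy example: text embeddings are treated as fixed after stage one;
# payload embeddings are what stage two would train to align with them.
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]
aligned_payloads = [[0.9, 0.1], [0.2, 0.8]]
misaligned_payloads = [[0.2, 0.8], [0.9, 0.1]]

loss_aligned = info_nce(aligned_payloads, text_embeddings)
loss_misaligned = info_nce(misaligned_payloads, text_embeddings)
```

The loss is low when each payload embedding sits closest to the description of its own vulnerability and high when it sits closer to a different one, which is the signal a payload encoder would be trained on in such a setup.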
Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter
Authors: Qinghao Hu, Shang Yang, Junxian Guo, Xiaozhe Yao, Yujun Lin, Yuxian Gu, Han Cai, Chuang Gan, Ana Klimovic, Song Han
First: 2025-11-20T18:59:25+00:00 · Latest: 2026-03-20T17:56:05+00:00
Abstract
The emergence of Large Language Models (LLMs) with strong reasoning capabilities marks a significant milestone, unlocking new frontiers in complex problem-solving. However, training these reasoning models, typically using Reinforcement Learning (RL), encounters critical efficiency bottlenecks: response generation during RL training exhibits a persistent long-tail distribution, where a few very long responses dominate execution time, wasting resources and inflating costs. To address this, we propose TLT, a system that accelerates reasoning RL training losslessly by integrating adaptive speculative decoding. Applying speculative decoding in RL is challenging due to the dynamic workloads, evolving target model, and draft model training overhead. TLT overcomes these obstacles with two synergistic components: (1) Adaptive Drafter, a lightweight draft model trained continuously on idle GPUs during long-tail generation to maintain alignment with the target model at no extra cost; and (2) Adaptive Rollout Engine, which maintains a memory-efficient pool of pre-captured CUDAGraphs and adaptively selects suitable SD strategies for each input batch. Evaluations demonstrate that TLT achieves over 1.7x end-to-end RL training speedup over state-of-the-art systems, preserves the model accuracy, and yields a high-quality draft model as a free byproduct suitable for efficient deployment. Code is released at https://github.com/mit-han-lab/fastrl.
中文标题/摘要
标题:驯服长尾:利用自适应草稿模型(Adaptive Drafter)实现高效的推理RL训练
具备强大推理能力的大型语言模型(LLMs)的出现是一个重要的里程碑,开启了复杂问题解决的新领域。然而,使用强化学习(RL)训练这些推理模型时,遇到了关键的效率瓶颈:RL训练中的响应生成呈现出持久的长尾分布,其中少数非常长的响应主导了执行时间,浪费了资源并增加了成本。为了解决这一问题,我们提出了一种名为TLT的系统,通过集成自适应推测性解码来无损地加速推理RL训练。在RL中应用推测性解码具有挑战性,因为工作负载动态变化、目标模型不断演进,且草稿模型的训练存在开销。TLT通过两个协同工作的组件克服了这些障碍:(1)自适应草稿模型(Adaptive Drafter),一种在长尾生成期间利用空闲GPU持续训练的轻量级草稿模型,以零额外成本保持与目标模型的对齐;(2)自适应展开引擎(Adaptive Rollout Engine),维护一个内存高效的预捕获CUDAGraph池,并为每个输入批次自适应地选择合适的推测性解码策略。评估表明,与最先进的系统相比,TLT实现了超过1.7倍的端到端RL训练加速,保持了模型的准确性,并且作为免费的副产品得到了一个适合高效部署的高质量草稿模型。代码发布在https://github.com/mit-han-lab/fastrl。
Summary / 总结
The paper addresses the efficiency bottleneck in training reasoning models using RL, where a few very long responses dominate execution time. It introduces TLT, which integrates adaptive speculative decoding through two components: an Adaptive Drafter that is trained on idle GPUs to stay aligned with the target model, and an Adaptive Rollout Engine that selects a suitable speculative decoding strategy for each input batch. TLT speeds up end-to-end RL training by over 1.7x without compromising accuracy, and the resulting draft model is a byproduct suitable for efficient deployment.
论文提出TLT系统,通过集成自适应推测解码来解决使用强化学习(RL)训练推理模型时的效率瓶颈。TLT包括自适应草稿模型Adaptive Drafter,该模型在空闲GPU上持续训练以保持与目标模型的对齐,以及自适应展开引擎Adaptive Rollout Engine,该引擎使用内存高效的CUDAGraphs来为每个输入批次选择合适的推测解码策略。实验结果显示,TLT在RL训练中实现了超过1.7倍的加速,保持了模型的准确性,并提供了可用于高效部署的高质量草稿模型。
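Why speculative decoding can be lossless is easiest to see in the greedy case, sketched below. This toy is not TLT: the draft and target "models" are plain deterministic functions invented for illustration, the target is queried token by token rather than in one batched verification pass, and sampling-based acceptance rules are not covered.

```python
def greedy_decode(next_token, prompt, max_new):
    """Reference decoding: query the model once per generated token."""
    seq = list(prompt)
    for _ in range(max_new):
        seq.append(next_token(seq))
    return seq


def speculative_decode(target_next, draft_next, prompt, max_new, k=4):
    """Greedy speculative decoding: the draft proposes up to k tokens and the
    target keeps the longest agreeing prefix, replacing the first mismatch
    with its own token. Returns (sequence, number_of_verification_rounds)."""
    seq = list(prompt)
    limit = len(prompt) + max_new
    rounds = 0
    while len(seq) < limit:
        proposal = []
        for _ in range(min(k, limit - len(seq))):
            proposal.append(draft_next(seq + proposal))
        # A real system would score all proposed positions in one target pass;
        # this toy queries the target per token for clarity.
        rounds += 1
        for token in proposal:
            expected = target_next(seq)
            seq.append(expected)  # always the target's own choice
            if expected != token:
                break  # discard the rest of the draft after the first mismatch
    return seq, rounds


def target_next(seq):
    """Hypothetical target model: a deterministic toy rule."""
    return (3 * seq[-1] + len(seq)) % 11


def draft_next(seq):
    """Hypothetical draft model: agrees with the target except at some positions."""
    return target_next(seq) if len(seq) % 5 else 0


spec_output, verification_rounds = speculative_decode(target_next, draft_next, [1], 20)
reference_output = greedy_decode(target_next, [1], 20)
```

Every appended token is the target's own greedy choice, so the output matches plain greedy decoding regardless of draft quality; a better-aligned draft only reduces the number of verification rounds, which is why keeping the drafter aligned with an evolving target matters for speed rather than correctness.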
Adaptive Greedy Frame Selection for Long Video Understanding
Authors: Yuning Huang, Fengqing Zhu
First: 2026-03-20T17:55:32+00:00 · Latest: 2026-03-20T17:55:32+00:00
Abstract
Large vision-language models (VLMs) are increasingly applied to long-video question answering, yet inference is often bottlenecked by the number of input frames and resulting visual tokens. Naive sparse sampling can miss decisive moments, while purely relevance-driven selection frequently collapses onto near-duplicate frames and sacrifices coverage of temporally distant evidence. We propose a question-adaptive greedy frame selection method that jointly optimizes query relevance and semantic representativeness under a fixed frame budget. Our approach constructs a 1 FPS candidate pool (capped at 1000) with exact timestamp alignment, embeds candidates in two complementary spaces (SigLIP for question relevance and DINOv2 for semantic similarity), and selects frames by greedily maximizing a weighted sum of a modular relevance term and a facility-location coverage term. This objective is normalized, monotone, and submodular, yielding a standard (1-1/e) greedy approximation guarantee. To account for question-dependent trade-offs between relevance and coverage, we introduce four preset strategies and a lightweight text-only question-type classifier that routes each query to its best-performing preset. Experiments on MLVU show consistent accuracy gains over uniform sampling and a strong recent baseline across frame budgets, with the largest improvements under tight budgets.
中文标题/摘要
标题:长视频理解中的自适应贪婪帧选择
大型视觉-语言模型(VLMs)在长视频问答中越来越被应用,但推理往往受限于输入帧的数量和由此产生的视觉标记数量。简单的稀疏采样可能会错过关键时刻,而纯粹基于相关性的选择经常陷入近似重复的帧中,并牺牲了时间上相距较远的证据的覆盖范围。我们提出了一种问题自适应的贪婪帧选择方法,该方法在固定帧预算下联合优化查询相关性和语义代表性。我们的方法构建了一个1~FPS候选池(最多1000个),具有精确的时间戳对齐,将候选者嵌入两个互补的空间(SigLIP用于问题相关性,DINOv2用于语义相似性),并通过贪婪地最大化加权和的模块化相关性项和设施位置覆盖项来选择帧。该目标是归一化的、单调的和次模的,提供了标准的(1-1/e)贪婪近似保证。为了考虑问题之间相关性和覆盖之间的依赖性权衡,我们引入了四种预设策略和一个轻量级的仅文本问题类型分类器,将每个查询路由到其表现最佳的预设策略。在MLVU上的实验显示,在各种帧预算下,与均匀采样和一个强大的近期基线相比,具有持续的准确率提升,特别是在预算紧张的情况下,改进最为显著。
Summary / 总结
The paper addresses the challenge of efficient frame selection for long-video question answering using large vision-language models. It proposes an adaptive greedy frame selection method that optimizes both query relevance and semantic representativeness under a fixed frame budget. The method constructs a 1 FPS candidate pool, embeds candidates in two spaces (SigLIP for relevance and DINOv2 for semantic similarity), and selects frames by maximizing a weighted sum of relevance and coverage terms. Experiments show consistent accuracy gains over uniform sampling and a strong baseline, with the largest improvements under tight budgets.
论文针对使用大型视觉-语言模型进行长视频问题回答时的帧选择效率问题,提出了一种自适应贪婪帧选择方法,该方法同时优化查询相关性和语义代表性。该方法构建了一个1 FPS 候选池,将候选帧嵌入两个空间,并通过最大化相关性和覆盖性的加权和来选择帧。实验结果显示,该方法在均匀采样和一个强大的基线方法上的一致性准确率提升,特别是在帧预算紧张的情况下,提升效果最为显著。
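The greedy objective described in the frame selection entry above (a modular relevance term plus a facility-location coverage term) can be sketched as follows. This is an illustrative reading of the abstract, not the authors' code: the function name, the mixing weight `lam`, and the assumption that relevance scores and pairwise similarities arrive precomputed and non-negative (e.g. clipped cosine similarities) are all hypothetical.

```python
import numpy as np


def greedy_select(relevance, similarity, budget, lam=0.5):
    """Greedily maximize lam * sum_i r_i + (1 - lam) * sum_j max_{i in S} sim[j, i].

    relevance:  shape (n,), non-negative question-relevance score per candidate frame.
    similarity: shape (n, n), non-negative frame-to-frame similarity.
    Returns the selected frame indices in temporal (sorted) order.
    """
    relevance = np.asarray(relevance, dtype=float)
    similarity = np.asarray(similarity, dtype=float)
    n = len(relevance)
    chosen = np.zeros(n, dtype=bool)
    best_cover = np.zeros(n)  # how well each candidate is covered by the current set
    selected = []
    for _ in range(min(budget, n)):
        # Coverage after adding frame i, minus current coverage, for every i at once.
        cover_gain = np.maximum(similarity, best_cover[:, None]).sum(axis=0) - best_cover.sum()
        gain = lam * relevance + (1.0 - lam) * cover_gain
        gain[chosen] = -np.inf  # never pick the same frame twice
        i = int(np.argmax(gain))
        chosen[i] = True
        selected.append(i)
        best_cover = np.maximum(best_cover, similarity[:, i])
    return sorted(selected)


# Toy example: frames 0 and 1 are near-duplicates with high relevance,
# frames 2 and 3 are temporally distant evidence.
toy_relevance = np.array([0.9, 0.85, 0.1, 0.3])
toy_similarity = np.array([
    [1.00, 0.95, 0.10, 0.00],
    [0.95, 1.00, 0.10, 0.00],
    [0.10, 0.10, 1.00, 0.80],
    [0.00, 0.00, 0.80, 1.00],
])
```

With `lam=1.0` the selection collapses onto the two near-duplicate frames, which is the failure mode of purely relevance-driven selection; with `lam=0.5` the coverage term steers the second pick toward the distant frame instead.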
LagerNVS: Latent Geometry for Fully Neural Real-time Novel View Synthesis
Authors: Stanislaw Szymanowicz, Minghao Chen, Jianyuan Wang, Christian Rupprecht, Andrea Vedaldi
First: 2026-03-20T17:53:06+00:00 · Latest: 2026-03-20T17:53:06+00:00
Comments: IEEE CVF Conference on Computer Vision and Pattern Recognition 2026. Project page with code, models and examples: szymanowiczs.github.io/lagernvs
Abstract
Recent work has shown that neural networks can perform 3D tasks such as Novel View Synthesis (NVS) without explicit 3D reconstruction. Even so, we argue that strong 3D inductive biases are still helpful in the design of such networks. We show this point by introducing LagerNVS, an encoder-decoder neural network for NVS that builds on '3D-aware' latent features. The encoder is initialized from a 3D reconstruction network pre-trained using explicit 3D supervision. This is paired with a lightweight decoder, and trained end-to-end with photometric losses. LagerNVS achieves state-of-the-art deterministic feed-forward Novel View Synthesis (including 31.4 PSNR on Re10k), with and without known cameras, renders in real time, generalizes to in-the-wild data, and can be paired with a diffusion decoder for generative extrapolation.
中文标题/摘要
标题:LagerNVS:潜在几何用于全神经实时新颖视图合成
近期研究表明,神经网络可以在无需显式3D重建的情况下执行3D任务,如新颖视图合成(NVS)。尽管如此,我们仍认为在这些网络的设计中引入强3D归纳偏置是有帮助的。我们通过引入LagerNVS来证明这一点,LagerNVS是一种基于‘3D感知’潜在特征的编码器-解码器神经网络。编码器是从使用显式3D监督预训练的3D重建网络初始化的。这与一个轻量级解码器配对,并通过光度损失端到端训练。LagerNVS在确定性前向新颖视图合成中达到最先进的性能(包括在Re10k上的31.4 PSNR),无论是否知道相机,实时渲染,泛化到野外数据,并且可以与扩散解码器配对以进行生成性外推。
Summary / 总结
LagerNVS is an encoder-decoder neural network for real-time novel view synthesis that leverages 3D-aware latent features. The encoder is initialized from a pre-trained 3D reconstruction network and paired with a lightweight decoder. Trained end-to-end with photometric losses, LagerNVS achieves state-of-the-art PSNR of 31.4 on Re10k and can render in real time, generalize to in-the-wild data, and be used with a diffusion decoder for generative extrapolation.
LagerNVS 是一个利用 3D 意识潜特征的编码器-解码器神经网络,用于实时新颖视角合成。编码器基于预训练的 3D 重建网络初始化,并配以轻量级解码器。通过光度损失端到端训练,LagerNVS 在 Re10k 上达到 31.4 的 PSNR,并能实时渲染、泛化到野外数据,还可与扩散解码器结合进行生成性外推。
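For readers less familiar with the PSNR figure quoted in the LagerNVS entry above, the standard definition is PSNR = 10 * log10(MAX^2 / MSE). A minimal sketch follows; the function name and the choice of images scaled to [0, 1] are assumptions for illustration, and the paper's exact evaluation protocol (resolution, averaging over views) is not stated in the abstract.

```python
import numpy as np


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    ref = np.asarray(reference, dtype=np.float64)
    out = np.asarray(rendered, dtype=np.float64)
    mse = float(np.mean((ref - out) ** 2))
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

Under this definition, a uniform per-pixel error of 0.1 on a [0, 1] image gives 20 dB, and each additional 10 dB corresponds to a tenfold reduction in mean squared error.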
TinyML Enhances CubeSat Mission Capabilities
Authors: Luigi Capogrosso, Michele Magno
First: 2026-03-20T17:51:07+00:00 · Latest: 2026-03-20T17:51:07+00:00
Comments: Accepted at the 17th ACM/IEEE International Conference on Cyber-Physical Systems (ICCPS) 2026
Abstract
Earth observation (EO) missions traditionally rely on transmitting raw or minimally processed imagery from satellites to ground stations for computationally intensive analysis. This paradigm is infeasible for CubeSat systems due to stringent constraints on the onboard embedded processors, energy availability, and communication bandwidth. To overcome these limitations, the paper presents a TinyML-based Convolutional Neural Network (ConvNet) model optimization and deployment pipeline for onboard image classification, enabling accurate, energy-efficient, and hardware-aware inference under CubeSat-class constraints. Our pipeline integrates structured iterative pruning, post-training INT8 quantization, and hardware-aware operator mapping to compress models and align them with the heterogeneous compute architecture of the STM32N6 microcontroller from STMicroelectronics. This Microcontroller Unit (MCU) integrates a novel Arm Cortex-M55 core and a Neural-ART Neural Processing Unit (NPU), providing a realistic proxy for CubeSat onboard computers. The paper evaluates the proposed approach on three EO benchmark datasets (i.e., EuroSAT, RS_C11, MEDIC) and four models (i.e., SqueezeNet, MobileNetV3, EfficientNet, MCUNetV1). We demonstrate an average reduction in RAM usage of 89.55% and Flash memory of 70.09% for the optimized models, significantly decreasing downlink bandwidth requirements while maintaining task-acceptable accuracy (with a drop ranging from 0.4 to 8.6 percentage points compared to the Float32 baseline). The energy consumption per inference ranges from 0.68 mJ to 6.45 mJ, with latency spanning from 3.22 ms to 30.38 ms. These results fully satisfy the stringent energy budgets and real-time constraints required for efficient onboard EO processing.
中文标题/摘要
标题:TinyML提升立方星任务能力
地球观测(EO)任务传统上依赖于将原始或少量处理过的图像从卫星传输到地面站进行计算密集型分析。由于立方星系统对机载嵌入式处理器、能源供应和通信带宽有严格的限制,这一范式在这些系统中是不可行的。为克服这些限制,论文提出了一种基于TinyML的卷积神经网络(ConvNets)模型优化和部署管道,用于机载图像分类,能够在立方星级别的限制下实现准确、节能和硬件感知的推理。我们的管道集成了结构化迭代剪枝、后训练INT8量化和硬件感知操作映射,以压缩模型并与其来自STMicroelectronics的STM32N6微控制器的异构计算架构对齐。该微控制器集成了新型Arm Cortex-M55内核和Neural-ART神经处理单元(NPU),为立方星机载计算机提供了一个现实的代理。论文在三个EO基准数据集(即EuroSAT、RS_C11、MEDIC)和四种模型(即SqueezeNet、MobileNetV3、EfficientNet、MCUNetV1)上评估了所提出的方法。我们展示了优化模型的平均RAM使用量减少了89.55%,Flash存储减少了70.09%,显著降低了下行链路带宽需求,同时保持了任务可接受的准确性(与Float32基线相比,精度下降范围从0.4到8.6个百分点)。每次推理的能量消耗范围从0.68毫焦到6.45毫焦,延迟范围从3.22毫秒到30.38毫秒。这些结果完全满足了高效机载EO处理所需的严格能量预算和实时约束。
Summary / 总结
The paper addresses the limitations of traditional Earth observation missions by proposing a TinyML-based Convolutional Neural Network (ConvNet) model optimization and deployment pipeline for CubeSat systems. The pipeline combines structured iterative pruning, post-training INT8 quantization, and hardware-aware operator mapping to compress models for the STM32N6 microcontroller. On average, the approach reduces RAM usage by 89.55% and Flash memory by 70.09% while maintaining task-acceptable accuracy. Energy consumption per inference ranges from 0.68 mJ to 6.45 mJ, with latency from 3.22 ms to 30.38 ms, satisfying CubeSat energy and real-time constraints.
论文通过提出基于TinyML的卷积神经网络(ConvNets)模型优化和部署管道来解决传统地球观测任务的限制,该管道包括结构化迭代剪枝、后训练INT8量化和硬件感知操作映射,以优化STM32N6微控制器上的模型。该方法将RAM使用量减少了89.55%,Flash存储减少了70.09%,同时保持了可接受的任务准确性,并实现了每推理的能耗从0.68 mJ到6.45 mJ,延迟从3.22 ms到30.38 ms,从而满足了CubeSat严格的能耗和实时约束。
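The post-training INT8 quantization step named in the TinyML entry above can be illustrated with a generic sketch. This shows symmetric per-tensor weight quantization only; it is an assumption made for illustration, not the paper's pipeline or the STM32N6 toolchain, which may use per-channel scales, zero points, and activation calibration. The function names are hypothetical.

```python
import numpy as np


def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q, q in [-127, 127]."""
    w = np.asarray(weights, dtype=np.float32)
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0  # avoid division by zero
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Recover approximate Float32 weights from INT8 codes."""
    return q.astype(np.float32) * scale
```

Each weight is stored in one byte instead of four, and the round-trip error per weight is bounded by half the scale. Quantization alone therefore caps the storage saving for weights at 4x; the larger RAM reduction reported in the abstract comes from the full pipeline, which also includes pruning.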
A Unified Framework to Quantify Cultural Intelligence of AI
Authors: Sunipa Dev, Vinodkumar Prabhakaran, Rutledge Chin Feman, Aida Davani, Remi Denton, Charu Kalia, Piyawat Lertvittayakumjorn, Madhurima Maji, Rida Qadri, Negar Rostamzadeh, Renee Shelby, Romina Stella, Hayk Stepanyan, Erin van Liemt, Aishwarya Verma, Oscar Wahltinez, Edem Wornyo, Andrew Zaldivar, Saška Mojsilović
First: 2026-03-01T18:14:52+00:00 · Latest: 2026-03-20T17:50:30+00:00
Abstract
As generative AI technologies are increasingly being launched across the globe, assessing their competence to operate in different cultural contexts is urgently becoming a priority. Recent years have seen numerous and much-needed efforts on cultural benchmarking, but these efforts have largely focused on specific aspects of culture and evaluation. Although they contribute to our understanding of cultural competence, a unified and systematic evaluation approach is needed for us as a field to comprehensively assess diverse cultural dimensions at scale. Drawing on measurement theory, we present a principled framework to aggregate multifaceted indicators of cultural capabilities into a unified assessment of cultural intelligence. We start by developing a working definition of culture that includes identifying core domains of culture. We then introduce a broad-purpose, systematic, and extensible framework for assessing cultural intelligence of AI systems. Drawing on theoretical framing from psychometric measurement validity theory, we decouple the background concept (i.e., cultural intelligence) from its operationalization via measurement. We conceptualize cultural intelligence as a suite of core capabilities spanning diverse domains, which we then operationalize through a set of indicators designed for reliable measurement. Finally, we identify the considerations, challenges, and research pathways to meaningfully measure these indicators, specifically focusing on data collection, probing strategies, and evaluation metrics.
中文标题/摘要
标题:一种统一框架以量化AI的文化智能
随着生成型AI技术在全球范围内不断推出,评估其在不同文化背景下的操作能力变得日益紧迫。尽管近年来在文化基准测试方面做出了大量且必要的努力,但这些努力主要集中在文化的具体方面和评估上。虽然这些努力有助于我们理解文化能力,但作为一个领域,我们需要一种统一和系统的评估方法,以便全面评估多元文化维度。基于测量理论,我们提出了一种原则性的框架,将多维度的文化能力指标综合为统一的文化智能评估。我们首先制定了文化的工作定义,包括识别文化的核心领域。然后,我们引入了一个广泛用途、系统且可扩展的框架,用于评估AI系统的文化智能。基于心理测量有效性理论的理论框架,我们将背景概念(即文化智能)与其通过测量的操作化区分开来。我们将文化智能概念化为一系列跨越不同领域的核心能力,然后通过一组设计用于可靠测量的指标来操作化。最后,我们确定了衡量这些指标时需要考虑的因素、挑战和研究路径,特别是数据收集、探查策略和评估指标。
Summary / 总结
The research aims to develop a unified framework to assess the cultural intelligence of AI systems across diverse cultural contexts. The method involves defining core domains of culture and introducing a systematic framework that operationalizes cultural intelligence through a set of indicators. Key findings include the identification of core capabilities spanning various cultural domains and the development of reliable measurement strategies for these indicators.
研究旨在开发一个统一框架,以评估AI系统在不同文化背景下的文化智能。方法包括定义文化的核心领域,并开发一个系统框架,其中包括衡量文化能力的多方面指标。关键发现包括识别核心文化能力,并开发一个可以扩展以可靠地衡量这些能力的框架,从而为全面评估AI系统在多元文化环境中的表现做出贡献。
Learning Dynamic Belief Graphs for Theory-of-mind Reasoning
Authors: Ruxiao Chen, Xilei Zhao, Thomas J. Cova, Frank A. Drews, Susu Xu
First: 2026-03-20T17:46:55+00:00 · Latest: 2026-03-20T17:46:55+00:00
Abstract
Theory of Mind (ToM) reasoning with Large Language Models (LLMs) requires inferring how people's implicit, evolving beliefs shape what they seek and how they act under uncertainty -- especially in high-stakes settings such as disaster response, emergency medicine, and human-in-the-loop autonomy. Prior approaches either prompt LLMs directly or use latent-state models that treat beliefs as static and independent, often producing incoherent mental models over time and weak reasoning in dynamic contexts. We introduce a structured cognitive trajectory model for LLM-based ToM that represents mental state as a dynamic belief graph, jointly inferring latent beliefs, learning their time-varying dependencies, and linking belief evolution to information seeking and decisions. Our model contributes (i) a novel projection from textualized probabilistic statements to consistent probabilistic graphical model updates, (ii) an energy-based factor graph representation of belief interdependencies, and (iii) an ELBO-based objective that captures belief accumulation and delayed decisions. Across multiple real-world disaster evacuation datasets, our model significantly improves action prediction and recovers interpretable belief trajectories consistent with human reasoning, providing a principled module for augmenting LLMs with ToM in high-uncertainty environments. https://anonymous.4open.science/r/ICML_submission-6373/
中文标题/摘要
标题:学习动态信念图进行心智理论推理
心智理论(ToM)推理与大型语言模型(LLMs)需要推断人们隐含且不断演变的信念如何影响他们的需求和行为,尤其是在灾难响应、紧急医学和人类在环自主系统等高风险环境中。先前的方法要么直接提示LLMs,要么使用潜在状态模型将信念视为静态且独立的,这通常会导致时间上的不连贯的心理模型和动态环境中的薄弱推理。我们提出了一种基于LLM的心智理论结构化认知轨迹模型,将心理状态表示为动态信念图,联合推断潜在信念,学习其时间变化的依赖关系,并将信念演变与信息寻求和决策联系起来。我们的模型贡献了(i)一种从文本化概率陈述到一致的概率图形模型更新的新投影,(ii)一种基于能量因子图表示信念相互依赖关系,以及(iii)一种基于ELBO的目标函数,捕捉信念积累和延迟决策。在多个真实世界的灾难疏散数据集中,我们的模型显著提高了行动预测,并恢复了与人类推理一致的可解释信念轨迹,为在高不确定性环境中增强LLMs的心智理论提供了一个原则性的模块。https://anonymous.4open.science/r/ICML_submission-6373/
Summary / 总结
The research aims to improve theory-of-mind reasoning with Large Language Models (LLMs) by representing mental states as dynamic belief graphs, which jointly infer latent beliefs, learn their time-varying dependencies, and link belief evolution to information seeking and decisions. The key method involves a structured cognitive trajectory model with a novel probabilistic graphical model update projection, an energy-based factor graph representation, and an ELBO-based objective. Experimental results show significant improvements in action prediction and interpretable belief trajectory recovery across various disaster evacuation datasets, enhancing LLMs with ToM in high-uncertainty environments.
该研究旨在使用大型语言模型(LLMs)解决动态情境下的理论心智(ToM)推理问题。它提出了一种结构化的认知轨迹模型,将心理状态表示为动态信念图,联合推断潜在信念并学习其时间变化的依赖关系。该模型显著提高了行动预测,并恢复了可解释的心理状态轨迹,为在高不确定性环境中增强LLMs提供了一个原理性的模块。
EgoForge: Goal-Directed Egocentric World Simulator
Authors: Yifan Shen, Jiateng Liu, Xinzhuo Li, Yuanzhe Liu, Bingxuan Li, Houze Yang, Wenqi Jia, Yijiang Li, Tianjiao Yu, James Matthew Rehg, Xu Cao, Ismini Lourentzou
First: 2026-03-20T17:46:55+00:00 · Latest: 2026-03-20T17:46:55+00:00
Abstract
Generative world models have shown promise for simulating dynamic environments, yet egocentric video remains challenging due to rapid viewpoint changes, frequent hand-object interactions, and goal-directed procedures whose evolution depends on latent human intent. Existing approaches either focus on hand-centric instructional synthesis with limited scene evolution, perform static view translation without modeling action dynamics, or rely on dense supervision, such as camera trajectories, long video prefixes, synchronized multicamera capture, etc. In this work, we introduce EgoForge, an egocentric goal-directed world simulator that generates coherent, first-person video rollouts from minimal static inputs: a single egocentric image, a high-level instruction, and an optional auxiliary exocentric view. To improve intent alignment and temporal consistency, we propose VideoDiffusionNFT, a trajectory-level reward-guided refinement that optimizes goal completion, temporal causality, scene consistency, and perceptual fidelity during diffusion sampling. Extensive experiments show EgoForge achieves consistent gains in semantic alignment, geometric stability, and motion fidelity over strong baselines, and robust performance in real-world smart-glasses experiments.
中文标题/摘要
标题:EgoForge:目标导向的主观世界模拟器
生成式世界模型在模拟动态环境方面显示出潜力,但主观视频由于视角快速变化、频繁的手物交互以及依赖于潜在人类意图的目标导向程序,仍然具有挑战性。现有方法要么专注于手为中心的指令合成,场景演化有限;要么进行静态视图转换而不建模动作动力学;要么依赖密集监督,如摄像机轨迹、长视频前缀、同步多摄像机捕捉等。在本文中,我们引入了EgoForge,这是一种目标导向的主观世界模拟器,可以从最少的静态输入生成连贯的第一人称视频快照:一个主观图像、一个高层指令和一个可选的辅助外视角。为了提高意图对齐和时间一致性,我们提出了VideoDiffusionNFT,这是一种轨迹级奖励引导的细化方法,在扩散采样过程中优化目标完成、时间因果性、场景一致性和感知保真度。广泛的实验表明,EgoForge在语义对齐、几何稳定性和运动保真度方面相对于强基线实现了持续改进,并在现实世界的智能眼镜实验中表现出稳健的性能。
Summary / 总结
EgoForge is a goal-directed egocentric world simulator that generates coherent first-person video rollouts from minimal inputs, including a single egocentric image, a high-level instruction, and an optional exocentric view. It uses VideoDiffusionNFT, a trajectory-level reward-guided refinement, to optimize goal completion, temporal causality, scene consistency, and perceptual fidelity. Experiments demonstrate EgoForge outperforms strong baselines in semantic alignment, geometric stability, and motion fidelity, and shows robust performance in real-world smart-glasses experiments.
EgoForge 是一个目标导向的主观世界模拟器,可以从单一的主观图像、高层次指令和可选的外视角输入生成连贯的第一人称视频。它使用 VideoDiffusionNFT,一种轨迹级别的奖励引导细化,来优化目标完成、时间因果性、场景一致性和感知保真度。实验表明,EgoForge 在语义对齐、几何稳定性和运动保真度方面优于强大的基线模型,并在现实世界的智能眼镜实验中表现出稳健的性能。
Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections
Authors: Łukasz Borchmann, Jordy Van Landeghem, Michał Turski, Shreyansh Padarha, Ryan Othniel Kearns, Adam Mahdi, Niels Rogge, Clémentine Fourrier, Siwei Han, Huaxiu Yao, Artemis Llabrés, Yiming Xu, Dimosthenis Karatzas, Hao Zhang, Anupam Datta
First: 2026-03-12T17:11:22+00:00 · Latest: 2026-03-20T17:46:44+00:00
Abstract
Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic abilities. To evaluate agentic behaviour, we introduce a novel evaluation protocol measuring the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance, persisting in unproductive loops. We release the dataset and evaluation harness to help facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.
中文标题/摘要
标题:战略导航还是随机搜索?代理和人类在文档集合中的推理方式
多模态代理为自动化复杂文档密集型工作流提供了有希望的途径。然而,一个关键问题仍然存在:这些代理是否展示了真正的战略推理,还是仅仅进行了随机的试错搜索?为了解决这个问题,我们引入了MADQA基准,包含2250个人撰写的基于800份异构PDF文档的问题。根据经典测验理论,我们设计它以最大化在不同代理能力水平上的区分力。为了评估代理行为,我们引入了一种新的评估协议,衡量准确性和努力之间的权衡。使用这一框架,我们表明,虽然最好的代理在纯准确度上可以与人类搜索者匹敌,但它们回答的问题类型不同,并依赖于暴力搜索来弥补薄弱的战略规划。它们未能缩小与Oracle性能近20%的差距,持续陷入无效循环。我们发布了数据集和评估框架,以帮助促进从暴力检索向校准、高效的推理过渡。
The Robot's Inner Critic: Self-Refinement of Social Behaviors through VLM-based Replanning
Authors: Jiyu Lim, Youngwoo Yoon, Kwanghyun Park
Venue: ICRA 2026
First: 2026-03-20T17:40:21+00:00 · Latest: 2026-03-20T17:40:21+00:00
Comments: Accepted to ICRA 2026. 8 pages, 9 figures, Project page: https://limjiyu99.github.io/inner-critic/
Abstract
Conventional robot social behavior generation has been limited in flexibility and autonomy, relying on predefined motions or human feedback. This study proposes CRISP (Critique-and-Replan for Interactive Social Presence), an autonomous framework where a robot critiques and replans its own actions by leveraging a Vision-Language Model (VLM) as a 'human-like social critic.' CRISP integrates (1) extraction of movable joints and constraints by analyzing the robot's description file (e.g., MJCF), (2) generation of step-by-step behavior plans based on situational context, (3) generation of low-level joint control code by referencing visual information (joint range-of-motion visualizations), (4) VLM-based evaluation of social appropriateness and naturalness, including pinpointing erroneous steps, and (5) iterative refinement of behaviors through reward-based search. This approach is not tied to a specific robot API; it can generate subtly different, human-like motions on various platforms using only the robot's structure file. In a user study involving five different robot types and 20 scenarios, including mobile manipulators and humanoids, our proposed method achieved significantly higher preference and situational appropriateness ratings compared to previous methods. This research presents a general framework that minimizes human intervention while expanding the robot's autonomous interaction capabilities and cross-platform applicability.
Detailed result videos and supplementary information regarding this work are available at: https://limjiyu99.github.io/inner-critic/
中文标题/摘要
标题:机器人的内在批评家:通过基于VLM的重新规划自我完善社会行为
传统的机器人社会行为生成在灵活性和自主性方面受到限制,依赖于预定义的动作或人类反馈。本研究提出了一种名为CRISP(批判与重新规划以实现互动社会存在)的自主框架,其中机器人通过利用视觉-语言模型(VLM)作为“类似人类的社会批评家”来批判和重新规划自己的行为。CRISP整合了以下步骤:(1)通过分析机器人的描述文件(例如,MJCF)提取可移动关节和约束条件;(2)基于情境上下文生成逐步行为计划;(3)通过参考视觉信息(关节活动范围可视化)生成低级关节控制代码;(4)基于VLM评估社会适宜性和自然性,包括指出错误步骤;(5)通过基于奖励的搜索进行行为的迭代完善。该方法不依赖于特定的机器人API,仅使用机器人的结构文件即可在各种平台上生成微妙不同的、类似人类的动作。在涉及五种不同机器人类型和20种场景(包括移动操作臂和类人机器人)的用户研究中,我们提出的方法在偏好和情境适宜性评分方面显著优于以往方法。这项研究提供了一个通用框架,以最小化人类干预,扩展机器人的自主交互能力和跨平台适用性。
详细结果视频和关于此工作的补充信息可在:https://limjiyu99.github.io/inner-critic/
Summary / 总结
The study introduces CRISP (Critique-and-Replan for Interactive Social Presence), an autonomous framework where a robot uses a Vision-Language Model to critique and refine its social behaviors. CRISP extracts robot joint information, generates behavior plans, and refines actions through iterative reward-based search. In a user study, CRISP outperformed previous methods in preference and situational appropriateness across various robot types and scenarios.
研究提出了CRISP框架,该框架使机器人能够利用视觉语言模型自我批评和改进其社交行为。CRISP处理机器人的描述文件以提取关节和约束,根据上下文生成行为计划,并使用VLM评估社交适宜性。在一项用户研究中,CRISP在各种机器人类型和场景中表现出了更高的偏好度和情境适宜性,优于以往的方法。
Community-Informed AI Models for Police Accountability
Authors: Benjamin A. T. Graham, Lauren Brown, Georgios Chochlakis, Morteza Dehghani, Raquel Delerme, Brittany Friedman, Ellie Graeden, Preni Golazizian, Rajat Hebbar, Parsa Hejabi, Aditya Kommineni, Mayagüez Salinas, Michael Sierra-Arévalo, Jackson Trager, Nicholas Weller, Shrikanth Narayanan
First: 2024-01-24T19:56:20+00:00 · Latest: 2026-03-20T17:39:55+00:00
Comments: 33 pages, 4 figures, 2 tables
Abstract
Face-to-face interactions between police officers and the public affect both individual well-being and democratic legitimacy. Many government-public interactions are captured on video, including interactions between police officers and drivers captured on body-worn cameras (BWCs). New advances in AI technology enable these interactions to be analyzed at scale, opening promising avenues for improving government transparency and accountability. However, for AI to serve democratic governance effectively, models must be designed to include the preferences and perspectives of the governed. This article proposes a community-informed approach to developing multi-perspective AI tools for government accountability. We illustrate our approach by describing the research project through which the approach was inductively developed: an effort to build AI tools to analyze BWC footage of traffic stops conducted by the Los Angeles Police Department. We focus on the role of social scientists as members of multidisciplinary teams responsible for integrating the perspectives of diverse stakeholders into the development of AI tools in the domain of police -- and government -- accountability.
中文标题/摘要
标题:社区导向的AI模型以提高警队问责制
警员与公众之间的面对面互动影响个人福祉和民主合法性。许多政府与公众的互动被视频记录下来,包括警员与驾驶员之间的互动,这些互动被执法记录仪(BWC)记录。新的AI技术进步使这些互动可以大规模分析,为提高政府透明度和问责制开辟了新的途径。然而,为了使AI有效服务于民主治理,模型必须设计成包括被治理者的偏好和视角。本文提出了一种社区导向的方法,用于开发多视角的AI工具以提高政府问责制。我们通过描述该方法通过归纳开发的研究项目来说明这一方法:旨在开发分析洛杉矶警局交通拦截行动执法记录仪视频的AI工具的努力。我们重点介绍了社会科学家作为多学科团队成员的角色,他们负责将不同利益相关者的视角整合到警察——以及政府——问责制领域的AI工具开发中。
Summary / 总结
The article proposes a community-informed approach to developing multi-perspective AI tools for police and government accountability, so that models reflect the preferences and perspectives of the governed. The approach places social scientists in multidisciplinary teams responsible for integrating diverse stakeholder perspectives into tool development. It is illustrated through the project in which it was inductively developed: an effort to build AI tools that analyze body-worn camera footage of traffic stops conducted by the Los Angeles Police Department.
研究旨在开发纳入公众视角的AI模型,以提升民主治理。方法是采用社区导向的方法,将社会科学家纳入多学科团队,确保不同利益相关者的视角被纳入警察问责领域的AI工具开发中。主要发现包括成功开发了分析洛杉矶警察部门交通执法录像的AI工具,证明了社区导向的AI在提高透明度和问责制方面的可行性。
Can Large Multimodal Models Inspect Buildings? A Hierarchical Benchmark for Structural Pathology Reasoning
Authors: Hui Zhong, Yichun Gao, Luyan Liu, Hai Yang, Wang Wang, Haowei Zhang, Xinhu Zheng
First: 2026-03-20T17:24:46+00:00 · Latest: 2026-03-20T17:24:46+00:00
Abstract
Automated building facade inspection is a critical component of urban resilience and smart city maintenance. Traditionally, this field has relied on specialized discriminative models (e.g., YOLO, Mask R-CNN) that excel at pixel-level localization but are constrained to passive perception and generalize poorly, lacking the visual understanding needed to interpret structural topology. Large Multimodal Models (LMMs) promise a paradigm shift toward active reasoning, yet their application in such high-stakes engineering domains lacks rigorous evaluation standards. To bridge this gap, we introduce a human-in-the-loop semi-automated annotation framework, leveraging expert-proposal verification to unify 12 fragmented datasets into a standardized, hierarchical ontology. Building on this foundation, we present DefectBench, the first multi-dimensional benchmark designed to interrogate LMMs beyond basic semantic recognition. DefectBench evaluates 18 state-of-the-art (SOTA) LMMs across three escalating cognitive dimensions: Semantic Perception, Spatial Localization, and Generative Geometry Segmentation. Extensive experiments reveal that while current LMMs demonstrate exceptional topological awareness and semantic understanding (effectively diagnosing "what" and "how"), they exhibit significant deficiencies in metric localization precision ("where"). Crucially, however, we validate the viability of zero-shot generative segmentation, showing that general-purpose foundation models can rival specialized supervised networks without domain-specific training. This work provides both a rigorous benchmarking standard and a high-quality open-source database, establishing a new baseline for the advancement of autonomous AI agents in civil engineering.
中文标题/摘要
标题:大型多模态模型能否检查建筑?一种结构病理学推理的分层基准
自动建筑外立面检查是城市韧性和智慧城市维护的关键组成部分。传统上,该领域依赖于专门的判别模型(例如YOLO、Mask R-CNN),这些模型在像素级定位方面表现出色,但缺乏视觉理解能力,无法解释结构拓扑关系。大型多模态模型(LMMs)有望实现主动推理的范式转变,但在如此高风险的工程领域,其应用缺乏严格的评估标准。为弥合这一差距,我们引入了一种带有人工智能辅助的半自动化注释框架,利用专家提案验证统一了12个碎片化的数据集,形成了标准化的分层本体。在此基础上,我们提出了DefectBench,这是第一个多维度基准,旨在超越基本语义识别来质疑LMMs。DefectBench评估了18个最先进的(SOTA)LMMs在三个递进的认知维度上的表现:语义感知、空间定位和生成几何分割。大量实验表明,当前的LMMs在拓扑意识和语义理解方面表现出色(有效诊断“是什么”和“如何”),但在度量定位精度方面存在显著缺陷(“在哪里”)。然而,我们验证了零样本生成分割的有效性,表明通用基础模型可以在无需领域特定训练的情况下与专门的监督网络相媲美。本研究提供了严格的基准测试标准和高质量的开源数据库,为自主AI代理在土木工程中的发展设立了新的基准。
Summary / 总结
The research aims to evaluate the capabilities of large multimodal models (LMMs) in inspecting building facades, addressing the limitations of traditional discriminative models. The study introduces DefectBench, a hierarchical benchmark that assesses LMMs across three cognitive dimensions: semantic perception, spatial localization, and generative geometry segmentation. Key findings show that while LMMs excel in diagnosing structural issues, they struggle with precise localization, indicating a need for improvement in metric precision. Additionally, the study demonstrates the potential of zero-shot generative segmentation, suggesting that general-purpose models can perform comparably to specialized networks without domain-specific training.
研究旨在通过引入层次基准DefectBench来评估大型多模态模型(LMMs)在建筑外墙检测中的能力。方法是将12个分散的数据集统一到一个注释框架中,以评估LMMs在语义感知、空间定位和生成几何分割这三个认知维度上的表现。研究发现,尽管LMMs在拓扑意识和语义理解方面表现出色,但在精确定位方面存在明显不足。值得注意的是,研究展示了通用模型在零样本生成分割中的潜力,表明它们可以在无需领域特定训练的情况下与专门的监督网络相媲美。
LHAW: Controllable Underspecification for Long-Horizon Tasks
Authors: George Pu, Michael S. Lee, Udari Madhushani Sehwag, David J. Lee, Bryan Zhu, Yash Maurya, Mohit Raghavendra, Yuan Xue, Samuel Marc Denton
First: 2026-02-11T04:49:50+00:00 · Latest: 2026-03-20T17:00:33+00:00
Abstract
Long-horizon workflow agents that operate effectively over extended periods are essential for truly autonomous systems. Their reliable execution critically depends on the ability to reason through ambiguous situations in which clarification seeking is necessary to ensure correct task execution. However, progress is limited by the lack of scalable, task-agnostic frameworks for systematically curating and measuring the impact of ambiguity across custom workflows. We address this gap by introducing LHAW (Long-Horizon Augmented Workflows), a modular, dataset-agnostic synthetic pipeline that transforms any well-specified task into controllable underspecified variants by systematically removing information across four dimensions - Goals, Constraints, Inputs, and Context - at configurable severity levels. Unlike approaches that rely on LLM predictions of ambiguity, LHAW validates variants through empirical agent trials, classifying them as outcome-critical, divergent, or benign based on observed terminal state divergence. We release 285 task variants from TheAgentCompany, SWE-Bench Pro and MCP-Atlas according to our taxonomy alongside formal analysis measuring how current agents detect, reason about, and resolve underspecification across ambiguous settings. LHAW provides the first systematic framework for cost-sensitive evaluation of agent clarification behavior in long-horizon settings, enabling development of reliable autonomous systems.
中文标题/摘要
标题:LHAW:长时程任务可控欠指定
能够有效长期运行的工作流代理对于真正自主的系统至关重要。它们可靠执行的关键在于能够通过需要澄清的模糊情况来推理,以确保任务执行的正确性。然而,由于缺乏可扩展且任务通用的框架来系统地收集和衡量模糊性对自定义工作流的影响,进展受到限制。我们通过引入LHAW(长时程增强工作流),一种模块化、数据集通用的合成管道,解决了这一缺口。LHAW将任何明确指定的任务系统地在四个维度——目标、约束、输入和上下文——中按可配置的严重程度级别移除信息,从而转换为可控的欠指定变体。与依赖于大语言模型预测模糊性的方法不同,LHAW通过实证代理试验来验证变体,根据观察到的终端状态差异将其分类为结果关键型、发散型或良性。我们根据我们的分类法从TheAgentCompany、SWE-Bench Pro和MCP-Atlas中发布了285个任务变体,并提供了当前代理检测、推理和解决模糊性在模糊环境中的形式分析。LHAW提供了第一个系统框架,用于在长时程设置中对代理澄清行为进行成本敏感评估,从而促进可靠自主系统的开发。
Summary / 总结
The paper introduces LHAW, a modular synthetic pipeline that transforms well-specified tasks into controllable underspecified variants by systematically removing information across four dimensions: Goals, Constraints, Inputs, and Context. This framework enables the evaluation of agents' ability to handle ambiguity in long-horizon tasks through empirical trials, classifying variants as outcome-critical, divergent, or benign based on terminal state divergence. The study releases 285 task variants and measures how current agents detect, reason about, and resolve underspecification in ambiguous settings, providing a systematic framework for evaluating agent clarification behavior in long-horizon tasks.
研究旨在开发一种框架,以创建可控的不明确任务,提高长期工作流代理处理模糊情况的能力。方法是使用LHAW,一种模块化的合成管道,通过系统地在目标、约束、输入和上下文四个维度上移除信息,将明确的任务转化为可控的不明确变体。关键发现表明,当前的代理在检测、推理和解决模糊设置中的不明确性方面存在困难,突显了在长期任务中需要更好的澄清行为。
Pseudo-Simulation for Autonomous Driving
Authors: Wei Cao, Marcel Hallgarten, Tianyu Li, Daniel Dauner, Xunjiang Gu, Caojun Wang, Yakov Miron, Marco Aiello, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, Andreas Geiger, Kashyap Chitta
Venue: CoRL 2025
First: 2025-06-04T17:57:53+00:00 · Latest: 2026-03-20T17:00:17+00:00
Comments: CoRL 2025, updated with leaderboard snapshot from March 2026
Abstract
Existing evaluation paradigms for Autonomous Vehicles (AVs) face critical limitations. Real-world evaluation is often challenging due to safety concerns and a lack of reproducibility, whereas closed-loop simulation can face insufficient realism or high computational costs. Open-loop evaluation, while being efficient and data-driven, relies on metrics that generally overlook compounding errors. In this paper, we propose pseudo-simulation, a novel paradigm that addresses these limitations. Pseudo-simulation operates on real datasets, similar to open-loop evaluation, but augments them with synthetic observations generated prior to evaluation using 3D Gaussian Splatting. Our key idea is to approximate potential future states the AV might encounter by generating a diverse set of observations that vary in position, heading, and speed. Our method then assigns a higher importance to synthetic observations that best match the AV's likely behavior using a novel proximity-based weighting scheme. This enables evaluating error recovery and the mitigation of causal confusion, as in closed-loop benchmarks, without requiring sequential interactive simulation. We show that pseudo-simulation is better correlated with closed-loop simulations ($R^2=0.8$) than the best existing open-loop approach ($R^2=0.7$). We also establish a public leaderboard for the community to benchmark new methodologies with pseudo-simulation. Our code is available at https://github.com/autonomousvision/navsim.
中文标题/摘要
标题:伪模拟在自动驾驶评估中的应用
现有的自动驾驶车辆(AVs)评估范式面临重大限制。由于安全问题和缺乏可重复性,实际评估往往具有挑战性;而闭环模拟则可能因不现实或计算成本高而受限。开环评估虽然高效且数据驱动,但通常依赖于忽略累积误差的指标。本文提出了一种新的伪模拟范式,以解决这些限制。伪模拟基于真实数据集,类似于开环评估,但通过3D高斯点绘制生成先验合成观察,对它们进行增强。我们的核心思想是通过生成在位置、方向和速度上变化多样的观察,来近似AV可能遇到的潜在未来状态。然后,我们的方法使用一种新颖的基于接近度的加权方案,赋予与AV行为最匹配的合成观察更高的重要性。这使得伪模拟能够在无需顺序交互模拟的情况下评估错误恢复和因果混淆的缓解,类似于闭环基准。我们证明伪模拟与闭环模拟的相关性($R^2=0.8$)优于现有最佳开环方法($R^2=0.7$)。我们还为社区提供了一个公开排行榜,以使用伪模拟评估新方法。我们的代码可在https://github.com/autonomousvision/navsim/获取。
Summary / 总结
This paper addresses the limitations of existing evaluation paradigms for Autonomous Vehicles by proposing pseudo-simulation. This method uses real datasets and adds synthetic observations generated using 3D Gaussian Splatting to evaluate error recovery and causal confusion. The approach assigns higher importance to synthetic observations that closely match the AV's likely behavior, achieving a correlation of $R^2=0.8$ with closed-loop simulations, surpassing the best open-loop approach ($R^2=0.7$).
本文提出了一种伪模拟方法,以解决现有自动驾驶车辆评估方法的局限性。该方法结合了开放环评估的高效性和闭环模拟的真实感,使用真实数据集并添加由3D高斯点绘制生成的合成观察,这些观察基于与车辆可能行为的接近程度进行加权。结果显示,伪模拟与闭环模拟的相关性更高($R^2=0.8$),优于最佳的开放环方法($R^2=0.7$)。
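As a rough illustration of the proximity-based weighting idea in the pseudo-simulation abstract, the sketch below gives higher weight to synthetic observations whose start position lies closer to where the planner was expected to end up. The Gaussian kernel, the `sigma` bandwidth, and the function names are assumptions for this example; the abstract does not specify the actual weighting function.

```python
import math


def proximity_weights(planned_xy, synthetic_xy, sigma=1.0):
    """Normalized weights that decay with squared distance between the
    planned position and each synthetic start position (assumed Gaussian)."""
    d2 = [(x - planned_xy[0]) ** 2 + (y - planned_xy[1]) ** 2
          for x, y in synthetic_xy]
    d2_min = min(d2)  # shift by the minimum for numerical stability
    raw = [math.exp(-(v - d2_min) / (2.0 * sigma ** 2)) for v in d2]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_score(weights, scores):
    """Aggregate per-observation scores using the proximity weights."""
    return sum(w * s for w, s in zip(weights, scores))


if __name__ == "__main__":
    planned = (0.0, 0.0)
    starts = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
    w = proximity_weights(planned, starts, sigma=1.0)
    print(w, weighted_score(w, [0.9, 0.6, 0.2]))
```

Because the synthetic observations are generated before evaluation, only the weights depend on the evaluated planner, which is what lets the scheme avoid sequential interactive simulation.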
TempPerturb-Eval: On the Joint Effects of Internal Temperature and External Perturbations in RAG Robustness
Authors: Yongxin Zhou, Philippe Mulhem, Didier Schwab
First: 2025-12-01T01:46:36+00:00 · Latest: 2026-03-20T17:00:12+00:00
Comments: LREC 2026, Palma, Mallorca (Spain), 11-16 May 2026
Abstract
The evaluation of Retrieval-Augmented Generation (RAG) systems typically examines retrieval quality and generation parameters like temperature in isolation, overlooking their interaction. This work presents a systematic investigation of how text perturbations (simulating noisy retrieval) interact with temperature settings across multiple LLM runs. We propose a comprehensive RAG Perturbation-Temperature Analysis Framework that subjects retrieved documents to three distinct perturbation types across varying temperature settings. Through extensive experiments on HotpotQA with both open-source and proprietary LLMs, we demonstrate that performance degradation follows distinct patterns: high-temperature settings consistently amplify vulnerability to perturbations, while certain perturbation types exhibit non-linear sensitivity across the temperature range. Our work yields three key contributions: (1) a diagnostic benchmark for assessing RAG robustness, (2) an analytical framework for quantifying perturbation-temperature interactions, and (3) practical guidelines for model selection and parameter tuning under noisy retrieval conditions.
中文标题/摘要
标题:TempPerturb-Eval:内部温度和外部干扰在RAG鲁棒性中的联合效应研究
对检索增强生成(RAG)系统的评估通常会单独检查检索质量和生成参数(如温度),而忽略了它们之间的相互作用。本研究系统地探讨了文本扰动(模拟检索噪声)与温度设置在多个LLM运行中的交互作用。我们提出了一种全面的RAG扰动-温度分析框架,该框架对检索到的文档施加了三种不同的扰动类型,并在不同温度设置下进行测试。通过在HotpotQA上的大量实验,使用开源和专有LLM,我们展示了性能下降遵循不同的模式:高温设置始终会放大对扰动的脆弱性,而某些扰动类型在整个温度范围内表现出非线性敏感性。我们的研究提供了三个关键贡献:(1)一种评估RAG鲁棒性的诊断基准,(2)一种量化扰动-温度交互作用的分析框架,以及(3)在检索噪声条件下选择模型和参数调优的实用指南。
Summary / 总结
This study evaluates the joint effects of internal temperature settings and external text perturbations on RAG systems, using a comprehensive framework that assesses performance degradation patterns. Through experiments on HotpotQA, it shows that high-temperature settings increase vulnerability to perturbations, and certain perturbation types have non-linear sensitivity across the temperature range. Key contributions include a diagnostic benchmark, an analytical framework, and practical guidelines for model selection and parameter tuning under noisy retrieval conditions.
该研究评估了内部温度设置和外部文本扰动对RAG系统的联合影响,使用了一个包含三种类型扰动并在不同温度下进行的综合框架。实验在HotpotQA上使用不同LLM表明,高温度设置会加剧对扰动的脆弱性,而某些扰动在温度范围内的敏感性是非线性的。主要贡献包括一个用于评估RAG鲁棒性的诊断基准、一个用于量化扰动-温度交互的分析框架以及在噪声检索条件下选择模型和调整参数的实用指南。
Less is More: Towards Simple Graph Contrastive Learning
Authors: Yanan Zhao, Feng Ji, Jingyang Dai, Jiaze Ma, Wee Peng Tay
First: 2025-09-30T03:56:50+00:00 · Latest: 2026-03-20T16:58:23+00:00
Abstract
Graph Contrastive Learning (GCL) has shown strong promise for unsupervised graph representation learning, yet its effectiveness on heterophilic graphs, where connected nodes often belong to different classes, remains limited. Most existing methods rely on complex augmentation schemes, intricate encoders, or negative sampling, which raises the question of whether such complexity is truly necessary in this challenging setting. In this work, we revisit the foundations of supervised and unsupervised learning on graphs and uncover a simple yet effective principle for GCL: mitigating node feature noise by aggregating it with structural features derived from the graph topology. This observation suggests that the original node features and the graph structure naturally provide two complementary views for contrastive learning. Building on this insight, we propose an embarrassingly simple GCL model that uses a GCN encoder to capture structural features and an MLP encoder to isolate node feature noise. Our design requires neither data augmentation nor negative sampling, yet achieves state-of-the-art results on heterophilic benchmarks with minimal computational and memory overhead, while also offering advantages in homophilic graphs in terms of complexity, scalability, and robustness. We provide theoretical justification for our approach and validate its effectiveness through extensive experiments, including robustness evaluations against both black-box and white-box adversarial attacks.
中文标题/摘要
标题:少即是多:迈向简洁的图对比学习
图对比学习(GCL)在无监督图表示学习中展现了强大的潜力,但在异配图上,其效果仍然有限,异配图中相连的节点往往属于不同的类别。现有的大多数方法依赖于复杂的增强方案、复杂的编码器或负样本,这引发了这样一个问题:在这样一个具有挑战性的环境中,这种复杂性是否真的必要。在本文中,我们重新审视了图上监督和无监督学习的基础,并发现了一个简单而有效的GCL原则:通过将节点特征噪声与从图拓扑结构中导出的结构特征聚合,来减轻节点特征噪声。这一观察表明,原始节点特征和图结构自然提供了两种互补的对比学习视图。基于这一洞察,我们提出了一种极其简单的GCL模型,使用GCN编码器捕获结构特征,并使用MLP编码器隔离节点特征噪声。我们的设计不需要数据增强或负样本,但在异配基准上仍能取得最先进的结果,同时在计算和内存开销上也最小,并且在同配图上也具有复杂性、可扩展性和鲁棒性方面的优势。我们为我们的方法提供了理论依据,并通过广泛的实验验证了其有效性,包括对黑盒和白盒对抗攻击的鲁棒性评估。
Summary / 总结
This paper addresses the challenge of Graph Contrastive Learning (GCL) on heterophilic graphs by proposing a simple model that mitigates node feature noise through structural feature aggregation. The model uses a GCN encoder for structural features and an MLP encoder for isolating node feature noise, without relying on complex augmentation or negative sampling. It achieves state-of-the-art results on heterophilic benchmarks with minimal computational and memory overhead, and offers advantages in homophilic graphs in terms of complexity, scalability, and robustness. Theoretical justification and extensive experiments, including robustness evaluations, support the effectiveness of this approach.
本文针对异配图上的图对比学习(GCL)挑战,提出了一种简单模型,通过结构特征聚合来缓解节点特征噪声。该模型使用GCN编码器捕获结构特征,使用MLP编码器隔离节点特征噪声,避免了复杂的增强和负样本采样。该模型在异配基准上取得了最先进的结果,且具有最小的计算和内存开销,并在同配图上具有复杂性、可扩展性和鲁棒性方面的优势。理论分析和广泛的实验,包括对抗攻击的鲁棒性评估,支持该模型的有效性。
Understanding and Optimizing Multi-Stage AI Inference Pipelines
Authors: Abhimanyu Rajeshkumar Bambhaniya, Hanjiang Wu, Suvinay Subramanian, Sudarshan Srinivasan, Souvik Kundu, Amir Yazdanbakhsh, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna
First: 2025-04-14T00:29:49+00:00 · Latest: 2026-03-20T16:55:01+00:00
Comments: Inference System Design for Multi-Stage AI Inference Pipelines. 13 Pages, 15 Figures, 3 Tables
Abstract
The rapid evolution of Large Language Models (LLMs) has driven the need for increasingly sophisticated inference pipelines and hardware platforms. Modern LLM serving extends beyond traditional prefill-decode workflows, incorporating multi-stage processes such as Retrieval Augmented Generation (RAG), key-value (KV) cache retrieval, dynamic model routing, and multi-step reasoning. These stages exhibit diverse computational demands, requiring distributed systems that integrate GPUs, ASICs, CPUs, and memory-centric architectures. However, existing simulators lack the fidelity to model these heterogeneous, multi-engine workflows, limiting their ability to inform architectural decisions.
To address this gap, we introduce MIST, a Heterogeneous Multi-stage LLM inference Execution Simulator. MIST models diverse request stages, including RAG, KV retrieval, reasoning, prefill, and decode, across complex hardware hierarchies. Unlike prior frameworks, MIST supports heterogeneous clients executing multiple models concurrently, while incorporating advanced batching strategies and multi-level memory hierarchies. By integrating real hardware traces with analytical modeling, MIST captures critical trade-offs such as memory bandwidth contention, inter-cluster communication latency, and batching efficiency in hybrid CPU-accelerator deployments. Through case studies, we explore the impact of reasoning stages on end-to-end latency, optimal batching strategies for hybrid pipelines, and the architectural implications of remote KV cache retrieval. MIST empowers system designers to navigate the evolving landscape of LLM inference, providing actionable insights into optimizing hardware-software co-design for next-generation AI workloads.
中文标题/摘要
标题:理解与优化多阶段AI推理管道
大型语言模型(LLMs)的快速发展推动了日益复杂的推理管道和硬件平台的需求。现代LLM服务超越了传统的预填充-解码工作流,整合了诸如检索增强生成(RAG)、键值(KV)缓存检索、动态模型路由和多步推理等多阶段过程。这些阶段表现出不同的计算需求,需要结合GPU、ASIC、CPU和内存中心架构的分布式系统。然而,现有的模拟器缺乏建模这些异构、多引擎工作流的精度,限制了它们对架构决策的指导能力。
为了解决这一差距,我们引入了MIST,一种异构多阶段LLM推理执行模拟器。MIST 模拟了包括RAG、KV检索、推理、预填充和解码在内的各种请求阶段,跨越复杂的硬件层次结构。MIST 支持异构客户端并发执行多个模型,同时结合高级批处理策略和多级内存层次结构。通过将实际硬件跟踪与分析建模相结合,MIST 捕捉到关键权衡,如内存带宽争用、跨集群通信延迟和混合CPU加速器部署中的批处理效率。通过案例研究,我们探讨了推理阶段对端到端延迟的影响、混合管道中的最优批处理策略以及远程KV缓存检索的架构影响。MIST 使系统设计师能够应对LLM推理不断变化的环境,提供有关优化硬件-软件协同设计以适应下一代AI工作负载的实用见解。
Summary / 总结
The paper addresses the need for sophisticated inference pipelines for Large Language Models (LLMs) that incorporate multi-stage processes such as Retrieval Augmented Generation (RAG), key-value (KV) cache retrieval, and multi-step reasoning. To model these heterogeneous workflows, the authors introduce MIST, a simulator that supports diverse request stages and complex hardware hierarchies. Key findings include the impact of reasoning stages on latency, optimal batching strategies, and the architectural implications of remote KV cache retrieval, enabling system designers to optimize hardware-software co-design for LLM inference.
论文针对大型语言模型(LLM)需要复杂的多阶段推理管道,包括检索增强生成(RAG)、键值(KV)缓存检索和多步推理等过程。作者引入了MIST模拟器,支持多样化的请求阶段和复杂的硬件层次结构。主要发现包括推理阶段对端到端延迟的影响、最优批量策略以及远程KV缓存检索的架构影响,帮助系统设计师优化硬件和软件的协同设计以适应下一代AI工作负载。
ReMoT: Reinforcement Learning with Motion Contrast Triplets
Authors: Cong Wan, Zeyu Guo, Jiangyang Li, SongLin Dong, Yifan Bai, Lin Peng, Zhiheng Ma, Yihong Gong
Venue: CVPR 2026
First: 2026-02-28T04:42:34+00:00 · Latest: 2026-03-20T16:54:46+00:00
Comments: CVPR 2026
Abstract
We present ReMoT, a unified training paradigm to systematically address the fundamental shortcomings of VLMs in spatio-temporal consistency -- a critical failure point in navigation, robotics, and autonomous driving. ReMoT integrates two core components: (1) A rule-based automatic framework that generates ReMoT-16K, a large-scale (16.5K triplets) motion-contrast dataset derived from video meta-annotations, surpassing costly manual or model-based generation. (2) Group Relative Policy Optimization, which we empirically validate yields optimal performance and data efficiency for learning this contrastive reasoning, far exceeding standard Supervised Fine-Tuning. We also construct the first benchmark for fine-grained motion contrast triplets to measure a VLM's discrimination of subtle motion attributes (e.g., opposing directions). The resulting model achieves state-of-the-art performance on our new benchmark and multiple standard VLM benchmarks, culminating in a remarkable 25.1% performance leap on spatio-temporal reasoning tasks.
中文标题/摘要
标题:ReMoT:基于运动对比三元组的强化学习
我们提出了ReMoT,一种统一的训练范式,系统地解决了VLMs在时空一致性方面的根本缺陷——这是导航、机器人技术和自动驾驶中的一个关键失败点。ReMoT集成了两个核心组件:(1) 一个基于规则的自动框架,生成ReMoT-16K,这是一个大规模(16.5K 三元组)的运动对比数据集,源自视频元注释,超越了昂贵的手动或模型生成。 (2) 组相对策略优化,我们实验证明其在学习这种对比推理方面具有最佳性能和数据效率,远超标准的监督微调。我们还构建了第一个细粒度运动对比三元组基准,用于衡量VLM对细微运动属性(如相反方向)的区分能力。最终模型在我们的新基准和多个标准VLM基准上均取得了最先进的性能,时空推理任务上的性能跃升高达25.1%。
Summary / 总结
ReMoT is a training paradigm that addresses the spatio-temporal consistency issues in Vision-Language Models (VLMs) for navigation, robotics, and autonomous driving. It uses a rule-based framework to generate a large-scale motion-contrast dataset, ReMoT-16K, and employs Group Relative Policy Optimization for contrastive reasoning, outperforming standard Supervised Fine-Tuning. The model achieves state-of-the-art performance on a new benchmark for fine-grained motion contrast triplets and multiple standard benchmarks, with a 25.1% improvement on spatio-temporal reasoning tasks.
ReMoT 提出了一种统一的训练范式,以增强 VLM 在时空一致性方面的表现,解决导航和机器人技术中的关键问题。它结合了一个基于规则的数据集生成框架和 Group Relative Policy Optimization 方法,后者在对比推理方面优于监督微调。该模型在新基准和多个标准基准上实现了最先进的性能,时空推理任务上的性能提升了 25.1%。
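The core idea behind Group Relative Policy Optimization, as commonly described, is to score each sampled response relative to the other responses drawn for the same prompt instead of relying on a learned value baseline. The sketch below shows only that normalization step. The binary rewards (1.0 when a hypothetical motion-contrast question is answered correctly) and the use of the population standard deviation are assumptions for illustration, not details taken from the ReMoT paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.
    Responses above the group mean get positive advantages, those below get
    negative ones; `eps` guards against a zero standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled answers to one hypothetical triplet question:
    # 1.0 = picked the correct motion direction, 0.0 = did not.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every response in a group earns the same reward, all advantages are zero and the group contributes no gradient signal, which is one reason contrastive triplets with clearly right and wrong answers are a natural fit for this kind of training.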
Generalizable NGP-SR: Generalizable Neural Radiance Fields Super-Resolution via Neural Graph Primitives
Authors: Wanqi Yuan, Omkar Sharad Mayekar, Connor Pennington, Nianyi Li
First: 2026-03-20T16:50:16+00:00 · Latest: 2026-03-20T16:50:16+00:00
Abstract
Neural Radiance Fields (NeRF) achieve photorealistic novel view synthesis but become costly when high-resolution (HR) rendering is required, as HR outputs demand dense sampling and higher-capacity models. Moreover, naively super-resolving per-view renderings in 2D often breaks multi-view consistency. We propose Generalizable NGP-SR, a 3D-aware super-resolution framework that reconstructs an HR radiance field directly from low-resolution (LR) posed images. Built on Neural Graphics Primitives (NGP), NGP-SR conditions radiance prediction on 3D coordinates and learned local texture tokens, enabling recovery of high-frequency details within the radiance field and producing view-consistent HR novel views without external HR references or post-hoc 2D upsampling. Importantly, our model is generalizable: once trained, it can be applied to unseen scenes and rendered from novel viewpoints without per-scene optimization. Experiments on multiple datasets show that NGP-SR consistently improves both reconstruction quality and runtime efficiency over prior NeRF-based super-resolution methods, offering a practical solution for scalable high-resolution novel view synthesis.
中文标题/摘要
标题:通用NGP-SR:基于神经图形基元的通用神经辐射场超分辨率
神经辐射场(NeRF)实现了逼真的新颖视图合成,但在需要高分辨率(HR)渲染时变得昂贵,因为HR输出需要密集采样和更高容量的模型。此外,简单地在2D中对每个视图的渲染进行超分辨率处理往往会破坏多视图一致性。我们提出了一种基于3D感知的超分辨率框架Generalizable NGP-SR,该框架直接从低分辨率(LR)姿态图像中重建HR辐射场。基于神经图形基元(NGP),NGP-SR在3D坐标和学习到的局部纹理标记上条件化辐射预测,从而在辐射场中恢复高频细节,并在没有外部HR参考或后处理2D上采样的情况下生成视图一致的HR新颖视图。重要的是,我们的模型是通用的:一旦训练完成,它可以在未见过的场景中应用,并从新的视角进行渲染,而无需针对每个场景进行优化。在多个数据集上的实验表明,NGP-SR在重建质量和运行时效率方面都优于基于NeRF的先前超分辨率方法,提供了一种可扩展的高分辨率新颖视图合成的实用解决方案。
Summary / 总结
The research aims to address the computational cost and multi-view consistency issues in high-resolution rendering using Neural Radiance Fields (NeRF). The proposed Generalizable NGP-SR framework directly reconstructs a high-resolution radiance field from low-resolution images, leveraging Neural Graphics Primitives to condition radiance prediction on 3D coordinates and local texture tokens. Experiments demonstrate that NGP-SR outperforms previous NeRF-based super-resolution methods in terms of both reconstruction quality and runtime efficiency, providing a practical solution for scalable high-resolution novel view synthesis.
研究旨在解决使用神经辐射场(NeRF)进行高分辨率渲染时的计算成本和多视角一致性问题。提出的Generalizable NGP-SR框架直接从低分辨率图像中重建高分辨率辐射场,通过神经图形基元将辐射预测条件化在3D坐标和局部纹理令牌上。实验表明,NGP-SR在重建质量和运行时效率上均优于之前的基于NeRF的超分辨率方法,提供了一种可扩展的高分辨率新视角合成的实用解决方案。
A$^2$-Edit: Precise Reference-Guided Image Editing of Arbitrary Objects and Ambiguous Masks
Authors: Huayu Zheng, Guangzhao Li, Baixuan Zhao, Siqi Luo, Hantao Jiang, Guangtao Zhai, Xiaohong Liu
First: 2026-03-11T11:54:13+00:00 · Latest: 2026-03-20T16:46:34+00:00
Abstract
We propose A$^2$-Edit, a unified inpainting framework for arbitrary object categories, which allows users to replace any target region with a reference object using only a coarse mask. To address the issues of severe homogenization and limited category coverage in existing datasets, we construct a large-scale multi-category dataset, UniEdit-500K, which includes 8 major categories, 209 fine-grained subcategories, and a total of 500,104 image pairs. Such rich category diversity poses new challenges for the model, requiring it to automatically learn semantic relationships and distinctions across categories. To this end, we introduce the Mixture of Transformer module, which performs differentiated modeling of various object categories through dynamic expert selection, and further enhances cross-category semantic transfer and generalization through collaboration among experts. In addition, we propose a Mask Annealing Training Strategy (MATS) that progressively relaxes mask precision during training, reducing the model's reliance on accurate masks and improving robustness across diverse editing tasks. Extensive experiments on benchmarks such as VITON-HD and AnyInsertion demonstrate that A$^2$-Edit consistently outperforms existing approaches across all metrics, providing a new and efficient solution for arbitrary object editing.
中文标题/摘要
标题:A$^2$-Edit:任意对象和模糊掩码的精确参考引导图像编辑
我们提出了A$^2$-Edit,这是一种统一的修复框架,适用于任意对象类别,允许用户仅使用粗略掩码将任何目标区域替换为参考对象。为了解决现有数据集中严重的同质化和类别覆盖范围有限的问题,我们构建了一个大规模多类别数据集UniEdit-500K,其中包括8个主要类别、209个细粒度子类别和总共500,104对图像。如此丰富的类别多样性为模型带来了新的挑战,要求它能够自动学习跨类别之间的语义关系和区别。为此,我们引入了混合变换器模块,该模块通过动态专家选择对各种对象类别进行差异化建模,并通过专家之间的协作进一步增强跨类别语义转移和泛化能力。此外,我们提出了掩码退火训练策略(MATS),该策略在训练过程中逐步放松掩码精度,减少模型对准确掩码的依赖,并提高在各种编辑任务中的鲁棒性。在VITON-HD和AnyInsertion等基准上的广泛实验表明,A$^2$-Edit在所有指标上都优于现有方法,提供了一种新的高效解决方案,用于任意对象编辑。
Summary / 总结
A$^2$-Edit is a unified inpainting framework designed for editing arbitrary object categories using a coarse mask and a reference object. It addresses the limitations of existing datasets by creating a large-scale multi-category dataset, UniEdit-500K, which includes 8 major categories and 209 subcategories. The framework introduces a Mixture of Transformer module to handle diverse object categories and a Mask Annealing Training Strategy (MATS) to improve robustness. Experiments show that A$^2$-Edit outperforms existing methods across various metrics on benchmarks like VITON-HD and AnyInsertion.
A$^2$-Edit 是一个统一的修复框架,用于使用粗略的掩码和参考对象编辑任意对象类别。它通过创建包含8个主要类别和209个子类别的大规模多类别数据集 UniEdit-500K 来解决现有数据集的限制。该框架引入了一个混合的 Transformer 模块,能够动态选择专家来建模不同的对象类别,并通过专家间的协作增强跨类别语义转移。此外,还提出了一种掩码退火训练策略(MATS),以减少模型对精确掩码的依赖。在 VITON-HD 和 AnyInsertion 上的实验表明,A$^2$-Edit 在各种指标上均优于现有方法。
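One plausible reading of "progressively relaxes mask precision during training" is a schedule that coarsens the ground-truth mask more and more as training proceeds. The sketch below uses a linear schedule and square dilation on a binary grid. Both choices, and all names in the code, are assumptions for illustration, since the abstract does not say how MATS degrades the mask.

```python
def anneal_radius(step, total_steps, max_radius):
    """Dilation radius that grows linearly from 0 to `max_radius`
    over training (assumed schedule)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return round(frac * max_radius)


def dilate(mask, radius):
    """Dilate a binary mask (list of lists of 0/1) with a square window
    of half-width `radius`, making the mask coarser."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                    for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[yy][xx] = 1
    return out


if __name__ == "__main__":
    precise = [[0] * 5 for _ in range(5)]
    precise[2][2] = 1
    for step in (0, 50, 100):
        r = anneal_radius(step, total_steps=100, max_radius=2)
        coarse = dilate(precise, r)
        print(step, r, sum(map(sum, coarse)))
```

Training against masks that become steadily less exact is one way a model could learn to tolerate the coarse, user-drawn masks it will see at inference time.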
Chain-of-Adaptation: Surgical Vision-Language Adaptation with Reinforcement Learning
Authors: Jiajie Li, Chenhui Xu, Meihuan Liu, Jinjun Xiong
First: 2026-03-20T16:38:37+00:00 · Latest: 2026-03-20T16:38:37+00:00
Abstract
Conventional fine-tuning on domain-specific datasets can inadvertently alter a model's pretrained multimodal priors, leading to reduced generalization. To address this, we propose Chain-of-Adaptation (CoA), an adaptation framework designed to integrate domain knowledge while maintaining the model's inherent reasoning and perceptual capabilities. CoA introduces a structured reasoning format and uses reinforcement learning to enhance domain alignment without sacrificing general multimodal competence. Experiments on standard surgical benchmarks, under both in-distribution and out-of-distribution settings, demonstrate that CoA achieves higher accuracy, stronger generalization, and more stable behavior than supervised fine-tuning. Furthermore, ablation studies confirm that CoA effectively preserves the model's core visual-language abilities, providing a reliable pathway for domain specialization in VLMs.
中文标题/摘要
标题:链式适应:基于强化学习的手术领域视觉-语言适应
在领域特定数据集上的常规微调可能会无意中改变模型的预训练多模态先验,导致泛化能力降低。为了解决这一问题,我们提出了链式适应(CoA)框架,该框架旨在整合领域知识同时保持模型固有的推理和感知能力。CoA 引入了一种结构化的推理格式,通过强化学习增强领域对齐,而不牺牲多模态的通用能力。在标准的手术基准测试中,无论是分布内还是分布外设置,CoA 都实现了更高的准确率、更强的泛化能力和更稳定的行为,进一步的消融研究证实,CoA 有效地保留了模型的核心视觉-语言能力,为 VLM 的领域专业化提供了一条可靠途径。
Summary / 总结
The research aims to improve the generalization of vision-language models by avoiding the alteration of pretrained priors through conventional fine-tuning. The proposed Chain-of-Adaptation (CoA) framework uses reinforcement learning to integrate domain knowledge while preserving the model's core visual-language abilities. Experiments show that CoA outperforms supervised fine-tuning in terms of accuracy, generalization, and stability in both in-distribution and out-of-distribution settings. Ablation studies confirm that CoA effectively maintains the model's visual and language competencies during adaptation.
研究旨在通过避免通过领域特定微调改变预训练先验来提高视觉语言模型的泛化能力。提出的链式适应(CoA)框架使用强化学习整合领域知识,同时保留模型的核心视觉语言能力。实验表明,CoA在准确性、泛化能力和稳定性方面均优于监督微调,适用于分布内和分布外设置。
GO-GenZip: Goal-Oriented Generative Sampling and Hybrid Compression
Authors: Pietro Talli, Qi Liao, Alessandro Lieto, Parijat Bhattacharjee, Federico Chiariotti, Andrea Zanella
First: 2026-03-20T16:33:15+00:00 · Latest: 2026-03-20T16:33:15+00:00
Abstract
Current network data telemetry pipelines consist of massive streams of fine-grained Key Performance Indicators (KPIs) from multiple distributed sources towards central aggregators, making data storage, transmission, and real-time analysis increasingly unsustainable. This work presents a generative AI (GenAI)-driven sampling and hybrid compression framework that redesigns network telemetry from a goal-oriented perspective. Unlike conventional approaches that passively compress fully observed data, our approach jointly optimizes what to observe and how to encode it, guided by the relevance of information to downstream tasks. The framework integrates adaptive sampling policies, using adaptive masking techniques, with generative modeling to identify patterns and preserve critical features across temporal and spatial dimensions. The selectively acquired data are further processed through a hybrid compression scheme that combines traditional lossless coding with GenAI-driven, lossy compression. Experimental results on real network datasets demonstrate over 50% reductions in sampling and data transfer costs, while maintaining comparable reconstruction accuracy and goal-oriented analytical fidelity in downstream tasks.
中文标题/摘要
标题:GO-GenZip: 目标导向的生成采样和混合压缩
当前的网络数据遥测管道由来自多个分布式源的大量细粒度关键性能指标(KPI)组成,这些指标流向中央聚合器,使得数据存储、传输和实时分析变得越来越不可持续。本研究提出了一种生成AI(GenAI)驱动的采样和混合压缩框架,从目标导向的角度重新设计网络遥测。与传统的被动压缩完全观测数据的方法不同,我们的方法联合优化了需要观察什么以及如何编码,由下游任务的相关性进行指导。该框架结合了自适应采样策略,使用自适应掩码技术与生成建模相结合,以识别模式并保留跨时间和空间维度的关键特征。进一步获取的数据通过结合传统无损编码和GenAI驱动的有损压缩的混合压缩方案进行处理。在实际网络数据集上的实验结果表明,采样和数据传输成本降低了超过50%,同时保持了与下游任务目标导向分析精度相当的重建准确性。
Summary / 总结
This work introduces GO-GenZip, a generative AI-driven framework that redesigns network telemetry from a goal-oriented perspective. It jointly optimizes what to observe and how to encode it, using adaptive sampling policies and generative modeling to preserve critical features. The framework further applies a hybrid compression scheme combining lossless and lossy compression. Experiments show over 50% reductions in sampling and data transfer costs while maintaining comparable reconstruction accuracy and analytical fidelity for downstream tasks.
该工作提出了GO-GenZip框架,从目标导向的角度重新设计网络遥测。它联合优化数据观测和编码,使用自适应采样和生成建模来保留关键特征。该框架进一步应用结合了无损和有损压缩的混合压缩方案。实验结果显示,在保持与下游任务相当的重建准确性和分析保真度的同时,数据采样和传输成本降低了超过50%。
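The pipeline shape described for GO-GenZip (decide what to observe, then combine lossy and lossless coding) can be illustrated with a deliberately simple stand-in. In the sketch below, a change-threshold mask replaces the learned sampling policy, uniform quantization replaces the GenAI-driven lossy stage, and forward-fill replaces generative reconstruction. Only `zlib` as the lossless coder is a real, standard component; every other choice is an assumption made for the example.

```python
import struct
import zlib


def adaptive_mask(series, threshold):
    """Keep a sample only when it differs from the last kept sample by more
    than `threshold` (stand-in for a learned sampling policy)."""
    kept = [(0, series[0])]
    for i, v in enumerate(series[1:], start=1):
        if abs(v - kept[-1][1]) > threshold:
            kept.append((i, v))
    return kept


def encode(kept, step):
    """Lossy stage: quantize values to integer multiples of `step`.
    Lossless stage: zlib over the packed (index, level) pairs."""
    payload = b"".join(struct.pack("<Ii", i, round(v / step)) for i, v in kept)
    return zlib.compress(payload)


def decode(blob, step, length):
    """Invert both stages and forward-fill the unobserved positions
    (stand-in for generative reconstruction)."""
    raw = zlib.decompress(blob)
    levels = dict(struct.unpack_from("<Ii", raw, off)
                  for off in range(0, len(raw), 8))
    out, current = [], None
    for i in range(length):
        if i in levels:
            current = levels[i] * step
        out.append(current)
    return out


if __name__ == "__main__":
    kpi = [10.0] * 50 + [20.0] * 50  # hypothetical KPI trace
    kept = adaptive_mask(kpi, threshold=1.0)
    blob = encode(kept, step=0.5)
    print(len(kept), len(blob), decode(blob, 0.5, len(kpi)) == kpi)
```

Savings come from two separate places here: fewer samples are acquired at all, and the samples that are acquired are stored more compactly. That mirrors the abstract's distinction between sampling cost and data transfer cost.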
Trojan horse hunt in deep forecasting models: Insights from the European Space Agency competition
Authors: Krzysztof Kotowski, Ramez Shendy, Jakub Nalepa, Agata Kaczmarek, Dawid Płudowski, Piotr Wilczyński, Artur Janicki, Przemysław Biecek, Ambros Marzetta, Atul Pande, Lalit Chandra Routhu, Swapnil Srivastava, Evridiki Ntagiou
First: 2026-03-20T16:32:47+00:00 · Latest: 2026-03-20T16:32:47+00:00
Comments: 43 pages, 18 figures
Abstract
Forecasting plays a crucial role in modern safety-critical applications, such as space operations. However, the increasing use of deep forecasting models introduces a new security risk of trojan horse attacks, carried out by hiding a backdoor in the training data or directly in the model weights. Once implanted, the backdoor is activated by a specific trigger pattern at test time, causing the model to produce manipulated predictions. We focus on this issue in our Trojan Horse Hunt data science competition, where more than 200 teams faced the task of identifying triggers hidden in deep forecasting models for spacecraft telemetry. We describe the novel task formulation, benchmark set, evaluation protocol, and best solutions from the competition. We further summarize key insights and research directions for effective identification of triggers in time series forecasting models. All materials are publicly available on the official competition webpage https://www.kaggle.com/competitions/trojan-horse-hunt-in-space.
中文标题/摘要
标题:深预测模型中的特洛伊木马搜索:欧洲空间局竞赛的见解
预测在现代关键安全应用中扮演着至关重要的角色,例如空间操作。然而,深度预测模型的广泛应用引入了一种新的安全风险——特洛伊木马攻击,攻击者通过在训练数据中隐藏后门或直接在模型权重中植入后门。一旦植入,后门会在测试时由特定触发模式激活,导致模型产生被操纵的预测。我们在“特洛伊木马搜索”数据科学竞赛中关注这一问题,超过200支队伍面临识别隐藏在深预测模型中的航天器遥测触发器的任务。我们描述了新颖的任务形式、基准集、评估协议以及竞赛中的最佳解决方案。我们进一步总结了有效识别时间序列预测模型中触发器的关键见解和研究方向。所有材料均可在官方竞赛网页https://www.kaggle.com/competitions/trojan-horse-hunt-in-space上公开获取。
Summary / 总结
The research addresses the security risk of trojan horse attacks in deep forecasting models used in space operations. It presents a data science competition called 'Trojan Horse Hunt' where teams were tasked to identify triggers in deep forecasting models for spacecraft telemetry. The competition provided a novel task formulation, benchmark set, and evaluation protocol. Key findings include insights into effective trigger identification methods and research directions for future work in this area.
论文探讨了在航天操作中使用的深度预测模型中存在的一种新型安全风险——特洛伊木马攻击。文中描述了一个名为‘特洛伊木马猎手’的数据科学竞赛,参赛队伍的任务是在航天器遥测的深度预测模型中识别隐藏的触发器。竞赛提供了新的任务形式、基准集和评估协议。主要发现包括有效的触发器识别方法及其未来研究方向的总结。
DETECT: Data-Driven Evaluation of Treatments Enabled by Classification Transformers
Authors: Yuanheng Mao, Lillian Yang, Stephen Yang, Ethan Shao, Zihan Li
Venue: 2025 IEEE International Conference on Data Mining Workshops (ICDMW), Washington, DC, USA, 2025, pp. 3207-3211
First: 2025-11-10T15:38:32+00:00 · Latest: 2026-03-20T16:32:23+00:00
Comments: 5 pages, 4 figures, 2 tables, presented and awarded Best Paper Runner-Up at the IEEE ICDM 2025 UGHS Symposium
Abstract
Chronic pain is a global health challenge affecting millions of individuals, making it essential for physicians to have reliable and objective methods to measure the functional impact of clinical treatments. Traditionally used methods, like the numeric rating scale, while personalized and easy to use, are subjective due to their self-reported nature. Thus, this paper proposes DETECT (Data-Driven Evaluation of Treatments Enabled by Classification Transformers), a data-driven framework that assesses treatment success by comparing patient activities of daily life before and after treatment. We use DETECT on public benchmark datasets and simulated patient data from smartphone sensors. Our results demonstrate that DETECT is objective yet lightweight, making it a significant and novel contribution to clinical decision-making. By using DETECT, independently or together with other self-reported metrics, physicians can improve their understanding of their treatment impacts, ultimately leading to more personalized and responsive patient care.
中文标题/摘要
标题:DETECT:由分类变换器驱动的治疗评估
慢性疼痛是影响数百万个体的全球健康挑战,因此医生需要可靠且客观的方法来衡量临床治疗的功能影响。传统方法,如数字评分量表,虽然个性化且易于使用,但由于其自我报告的性质,因此具有主观性。因此,本文提出了DETECT(由分类变换器驱动的治疗评估),这是一种数据驱动的框架,通过比较治疗前后患者的日常生活活动来评估治疗成功。我们使用DETECT对公共基准数据集和来自智能手机传感器的模拟患者数据进行了测试。我们的结果表明,DETECT既客观又轻量级,是临床决策的重要且新颖的贡献。通过使用DETECT,单独或与其他自我报告的指标结合,医生可以更好地理解其治疗的影响,最终实现更个性化和响应性的患者护理。
Summary / 总结
DETECT is a data-driven framework that evaluates the success of clinical treatments by comparing patient activities of daily life before and after treatment. It uses classification transformers on public benchmark datasets and simulated patient data from smartphone sensors. The results show that DETECT is objective and lightweight, enhancing clinical decision-making and leading to more personalized patient care.
DETECT 是一个数据驱动的框架,通过比较患者在治疗前后日常生活活动来评估治疗效果。利用分类变换器,DETECT 应用于公共基准数据集和来自智能手机传感器的模拟患者数据。结果表明,DETECT 是客观且轻量的,有助于临床决策并促进更个性化的患者护理。
Spectral Alignment in Forward-Backward Representations via Temporal Abstraction
Authors: Seyed Mahdi B. Azad, Jasper Hoffmann, Iman Nematollahi, Hao Zhu, Abhinav Valada, Joschka Boedecker
First: 2026-03-20T16:28:33+00:00 · Latest: 2026-03-20T16:28:33+00:00
Abstract
Forward-backward (FB) representations provide a powerful framework for learning the successor representation (SR) in continuous spaces by enforcing a low-rank factorization. However, a fundamental spectral mismatch often exists between the high-rank transition dynamics of continuous environments and the low-rank bottleneck of the FB architecture, making accurate low-rank representation learning difficult. In this work, we analyze temporal abstraction as a mechanism to mitigate this mismatch. By characterizing the spectral properties of the transition operator, we show that temporal abstraction acts as a low-pass filter that suppresses high-frequency spectral components. This suppression reduces the effective rank of the induced SR while preserving a formal bound on the resulting value function error. Empirically, we show that this alignment is a key factor for stable FB learning, particularly at high discount factors where bootstrapping becomes error-prone. Our results identify temporal abstraction as a principled mechanism for shaping the spectral structure of the underlying MDP and enabling effective long-horizon representations in continuous control.
中文标题/摘要
标题:通过时间抽象在前向-后向表示中的光谱对齐
前向-后向(FB)表示提供了一种强大的框架,用于通过强制低秩因式分解在连续空间中学习后继表示(SR)。然而,连续环境中的高秩过渡动力学与FB架构中的低秩瓶颈之间经常存在基本的光谱不匹配,这使得准确的低秩表示学习变得困难。在本文中,我们分析时间抽象作为缓解这种不匹配的机制。通过表征过渡算子的光谱特性,我们表明时间抽象充当低通滤波器,抑制高频光谱分量。这种抑制降低了诱导SR的有效秩,同时保留了结果值函数误差的形式界。实验上,我们表明这种对齐是FB学习稳定的关键因素,特别是在高折扣因子下,回推变得容易出错。我们的结果将时间抽象识别为塑造潜在MDP的光谱结构并使连续控制中的长期表示有效的一种原理性机制。
Summary / 总结
This work addresses the challenge of spectral mismatch in forward-backward (FB) representations for learning successor representations (SR) in continuous spaces. By analyzing temporal abstraction, the study demonstrates that it acts as a low-pass filter, reducing the effective rank of the SR while maintaining a formal bound on value function error. Empirically, the research shows that this spectral alignment is crucial for stable FB learning, especially at high discount factors where bootstrapping can be error-prone.
该研究通过分析时间抽象来解决前向-后向(FB)表示中的频谱不匹配问题。研究表明,时间抽象作为一种低通滤波器,可以降低后续表示的有效秩,同时保持价值函数误差的正式界。实验结果表明,这种频谱对齐对于高折扣因子下的稳定FB学习至关重要,因为此时的递推可能会产生误差。
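The low-pass filtering claim above can be illustrated with a minimal sketch. It assumes (none of this is taken from the paper) a transition operator with real eigenvalues in [0, 1], a k-step abstracted operator whose eigenvalues are the k-th powers of the originals, and the same discount `gamma` applied per abstract step. The spectrum `eigs`, the threshold `rel_tol`, and all function names are hypothetical.

```python
def abstracted_spectrum(eigs, k):
    """Eigenvalues of the k-step operator P^k, given eigenvalues of P."""
    return [lam ** k for lam in eigs]


def sr_spectrum(eigs, gamma):
    """Eigenvalues of the successor representation (I - gamma * P)^-1."""
    return [1.0 / (1.0 - gamma * lam) for lam in eigs]


def effective_rank(values, rel_tol=0.05):
    """Count spectral components within rel_tol of the largest one.

    This thresholded count is one simple proxy for effective rank,
    chosen here for illustration only.
    """
    top = max(abs(v) for v in values)
    return sum(1 for v in values if abs(v) >= rel_tol * top)


# Hypothetical spectrum: one stationary mode plus faster-decaying modes.
eigs = [1.0, 0.95, 0.9, 0.7, 0.5, 0.3, 0.1]
gamma = 0.99  # assumed discount, applied per (abstract) step

ranks = {}
for k in (1, 2, 4, 8):
    sr = sr_spectrum(abstracted_spectrum(eigs, k), gamma)
    ranks[k] = effective_rank(sr)
    print(f"k={k}: effective rank of SR = {ranks[k]}")
```

Raising each eigenvalue to the k-th power shrinks small eigenvalues much faster than those near 1, so fewer components survive the threshold as k grows. How the discount should be rescaled under abstraction is a modelling choice that this sketch does not settle.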
How Out-of-Equilibrium Phase Transitions can Seed Pattern Formation in Trained Diffusion Models
Authors: Luca Ambrogioni
First: 2026-03-20T16:19:10+00:00 · Latest: 2026-03-20T16:19:10+00:00
Abstract
In this work, we propose a theoretical framework that interprets the generation process in trained diffusion models as an instance of out-of-equilibrium phase transitions. We argue that, rather than evolving smoothly from noise to data, reverse diffusion passes through a critical regime in which small spatial fluctuations are amplified and seed the emergence of large-scale structure. Our central insight is that architectural constraints, such as locality, sparsity, and translation equivariance, transform memorization-driven instabilities into collective spatial modes, enabling the formation of coherent patterns beyond the training data. Using analytically tractable patch score models, we show how classical symmetry-breaking bifurcations generalize into spatially extended critical phenomena described by softening Fourier modes and growing correlation lengths. We further connect these dynamics to effective field theories of the Ginzburg-Landau type and to mechanisms of pattern formation in non-equilibrium physics. Empirical results on trained convolutional diffusion models corroborate the theory, revealing signatures of criticality including mode softening and rapid growth of spatial correlations. Finally, we demonstrate that this critical regime has practical relevance: targeted perturbations, such as classifier-free guidance pulses applied at the estimated critical time, significantly improve generation control. Together, these findings position non-equilibrium critical phenomena as a unifying principle for understanding, and potentially improving, the behavior of modern diffusion models.
中文标题/摘要
标题:非平衡相变如何在训练的扩散模型中引发模式形成
在本工作中,我们提出了一种理论框架,将训练的扩散模型的生成过程解释为非平衡相变的一个实例。我们认为,与从噪声平滑地演化到数据不同,逆向扩散会经过一个临界区域,在该区域中,小的空间波动被放大并引发大规模结构的出现。我们的核心见解是,诸如局部性、稀疏性和平移不变性等架构约束,将记忆驱动的不稳定性转化为集体空间模式,从而在训练数据之外形成一致的模式。通过使用可解析处理的块评分模型,我们展示了经典的对称性破缺分岔如何一般化为由软化傅里叶模式和增长的相关长度描述的空间扩展临界现象。我们进一步将这些动力学与Ginzburg-Landau类型的有效场理论以及非平衡物理中模式形成的机制联系起来。训练的卷积扩散模型的实验证据支持了这一理论,揭示了临界性的特征,包括模式软化和空间相关性的快速增长。最后,我们证明了这一临界区域具有实际意义:在估计的临界时间应用目标扰动,如分类器自由引导脉冲,可以显著提高生成控制。这些发现将非平衡临界现象定位为理解并可能改进现代扩散模型行为的统一原则。
Summary / 总结
This study proposes a theoretical framework that interprets generation in trained diffusion models as an out-of-equilibrium phase transition. The authors argue that reverse diffusion passes through a critical regime in which small spatial fluctuations are amplified and seed large-scale structure, and that architectural constraints such as locality, sparsity, and translation equivariance turn memorization-driven instabilities into collective spatial modes, enabling coherent patterns beyond the training data. Using analytically tractable patch score models, they show how classical symmetry-breaking bifurcations generalize into spatially extended critical phenomena. Experiments on trained convolutional diffusion models reveal signatures of criticality, including mode softening and rapid growth of spatial correlations, and targeted perturbations applied at the estimated critical time, such as classifier-free guidance pulses, significantly improve generation control.
本文提出了一种理论框架,将训练好的扩散模型的生成过程解释为非平衡相变的一个实例。研究表明,逆向扩散会经过一个临界阶段,在这个阶段,小的空间波动被放大,从而导致大尺度结构的出现。架构约束将记忆驱动的不稳定性转化为集体的空间模式,从而在训练数据之外形成有组织的模式。实验证据证实了这一理论,揭示了模式软化和空间相关性的快速增长等临界现象的特征。在估计的临界时间应用扰动可以改善生成控制。
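The symmetry-breaking bifurcation mentioned above can be seen in a standard one-dimensional toy case, sketched here under stated assumptions: data concentrated at the two points +1 and -1, and a variance-preserving forward process with signal level `alpha_bar` in (0, 1). Neither assumption comes from the paper, and its patch score models are not reproduced here. The noisy marginal is then an equal mixture of two Gaussians with means ±sqrt(alpha_bar) and variance 1 - alpha_bar, whose score has a closed form.

```python
import math


def score(x, alpha_bar):
    """Score d/dx log p_t(x) for two-point data at +1 and -1.

    p_t is an equal mixture of N(+m, s2) and N(-m, s2) with
    m = sqrt(alpha_bar) and s2 = 1 - alpha_bar.
    """
    m = math.sqrt(alpha_bar)
    s2 = 1.0 - alpha_bar
    return (-x + m * math.tanh(m * x / s2)) / s2


def score_slope_at_origin(alpha_bar):
    """Derivative of the score at x = 0, equal to (m^2 - s2) / s2^2.

    Negative: x = 0 is a stable point (a single mode).
    Positive: x = 0 is unstable and two symmetric modes appear.
    """
    s2 = 1.0 - alpha_bar
    return (alpha_bar - s2) / (s2 * s2)


for alpha_bar in (0.3, 0.5, 0.7):
    slope = score_slope_at_origin(alpha_bar)
    print(f"alpha_bar={alpha_bar}: slope at origin = {slope:.3f}")
```

In this toy setting the slope changes sign where the squared mean equals the variance, that is at `alpha_bar = 0.5`, which plays the role of the critical time.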
Preference-Guided Debiasing for No-Reference Enhancement Image Quality Assessment
Authors: Shiqi Gao, Kang Fu, Zitong Xu, Huiyu Duan, Xiongkuo Min, Jia Wang, Guangtao Zhai
First: 2026-03-20T16:07:34+00:00 · Latest: 2026-03-20T16:07:34+00:00
Abstract
Current no-reference image quality assessment (NR-IQA) models for enhanced images often struggle to generalize, as they tend to overfit to the distinct patterns of specific enhancement algorithms rather than evaluating genuine perceptual quality. To address this issue, we propose a preference-guided debiasing framework for no-reference enhancement image quality assessment (EIQA). Specifically, we first learn a continuous enhancement-preference embedding space using supervised contrastive learning, where images generated by similar enhancement styles are encouraged to have closer representations. Based on this, we further estimate the enhancement-induced nuisance component contained in the raw quality representation and remove it before quality regression. In this way, the model is guided to focus on algorithm-invariant perceptual quality cues instead of enhancement-specific visual fingerprints. To facilitate stable optimization, we adopt a two-stage training strategy that first learns the enhancement-preference space and then performs debiased quality prediction. Extensive experiments on public EIQA benchmarks demonstrate that the proposed method effectively mitigates algorithm-induced representation bias and achieves superior robustness and cross-algorithm generalization compared with existing approaches.
中文标题/摘要
标题:基于偏好引导的无参考增强图像质量评估去偏方法
当前用于增强图像的无参考图像质量评估(NR-IQA)模型往往难以泛化,因为它们倾向于过度拟合特定增强算法的独特模式,而不是评估真实的感知质量。为了解决这一问题,我们提出了一种基于偏好引导的无参考增强图像质量评估(EIQA)去偏框架。具体而言,我们首先使用监督对比学习学习一个连续的增强偏好嵌入空间,其中通过相似增强风格生成的图像被鼓励具有更接近的表示。在此基础上,我们进一步估计原始质量表示中包含的增强诱导的噪声成分,并在质量回归之前将其去除。这样,模型被引导关注算法不变的感知质量线索,而不是特定增强视觉指纹。为了促进稳定的优化,我们采用两阶段训练策略,首先学习增强偏好空间,然后进行去偏质量预测。在公共EIQA基准上的广泛实验表明,所提出的方法有效地减轻了由算法引起的表示偏差,并在鲁棒性和跨算法泛化方面优于现有方法。
Summary / 总结
The research aims to improve the generalization of no-reference image quality assessment models for enhanced images, which tend to overfit to the patterns of specific enhancement algorithms. The proposed preference-guided debiasing framework first uses supervised contrastive learning to build a continuous enhancement-preference embedding space, then estimates the enhancement-induced nuisance component in the raw quality representation and removes it before quality regression. This two-stage approach steers the model toward algorithm-invariant perceptual quality cues rather than enhancement-specific visual fingerprints. Experimental results on public EIQA benchmarks show that the method reduces algorithm-induced representation bias and outperforms existing approaches in robustness and cross-algorithm generalization.
论文提出了一种偏好导向的去偏差框架,以解决增强图像的无参考图像质量评估(NR-IQA)模型中过度拟合的问题。该框架通过监督对比学习学习连续的增强偏好嵌入空间,从而估计并去除增强引起的噪声成分。模型随后被引导关注算法不变的感知质量线索。实验表明,所提出的方法有效地减少了算法引起的偏差,并在鲁棒性和跨算法泛化方面优于现有方法。
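The "estimate the nuisance component, then remove it" step can be pictured with a minimal sketch. The abstract does not say how the component is estimated or subtracted, so this example swaps in a plain orthogonal projection as a stand-in. The vectors and the function name `remove_component` are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def remove_component(feature, direction):
    """Subtract from `feature` its projection onto `direction`.

    Assumption: the enhancement-induced nuisance lies along a single
    direction; the actual method may estimate it very differently.
    """
    norm_sq = dot(direction, direction)
    if norm_sq == 0.0:
        return list(feature)  # nothing to remove
    scale = dot(feature, direction) / norm_sq
    return [f - scale * d for f, d in zip(feature, direction)]


raw_quality_feature = [3.0, 4.0]    # hypothetical raw representation
enhancement_direction = [1.0, 0.0]  # hypothetical nuisance direction
debiased = remove_component(raw_quality_feature, enhancement_direction)
print(debiased)
```

After the projection is removed, the remaining feature is orthogonal to the assumed nuisance direction, so a downstream regressor cannot read that direction from it.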
Hidden Breakthroughs in Language Model Training
Authors: Sara Kangaslahti, Elan Rosenfeld, Naomi Saphra
Venue: ICLR 2026
First: 2025-06-18T20:40:16+00:00 · Latest: 2026-03-20T16:05:56+00:00
Comments: ICLR 2026 Camera-ready
Abstract
Loss curves are smooth during most of model training, so visible discontinuities stand out as possible conceptual breakthroughs. Studying these breakthroughs enables a deeper understanding of learning dynamics, but only when they are properly identified. This paper argues that similar breakthroughs occur frequently throughout training but they are obscured by a loss metric that collapses all variation into a single scalar. To find these hidden transitions, we introduce POLCA, a method for decomposing changes in loss along arbitrary bases of the low-rank training subspace. We use our method to identify clusters of samples that share similar changes in loss during training, disaggregating the overall loss into that of smaller groups of conceptually similar data. We validate our method on synthetic arithmetic and natural language tasks, showing that POLCA recovers clusters that represent interpretable breakthroughs in the model's capabilities. We demonstrate the promise of these hidden phase transitions as a tool for unsupervised interpretability.
中文标题/摘要
标题:语言模型训练中的隐秘突破
在大部分模型训练过程中,损失曲线是平滑的,因此可见的不连续性可能代表潜在的概念性突破。研究这些突破有助于更深入地理解学习动态,但前提是必须正确识别它们。本文认为,在训练过程中频繁发生类似的突破,但这些突破被损失度量所掩盖,该度量将所有变化压缩为单一标量。为了发现这些隐藏的过渡,我们引入了POLCA方法,用于沿低秩训练子空间的任意基分解损失的变化。我们使用该方法识别在训练过程中具有相似损失变化的样本群,将整体损失分解为更小组的概念性相似数据的损失。我们在合成算术和自然语言任务上验证了该方法,表明POLCA恢复了代表可解释突破的群集。我们展示了这些隐藏的相变作为无监督可解释性工具的潜力。
Summary / 总结
The research argues that breakthroughs in language model training occur frequently but stay hidden because the loss metric collapses all variation into a single scalar, so only the rare visible discontinuities in an otherwise smooth loss curve are noticed. To expose them, the authors introduce POLCA, a method that decomposes changes in loss along arbitrary bases of the low-rank training subspace and groups samples that share similar loss changes, disaggregating the overall loss into that of smaller, conceptually similar groups. Experiments on synthetic arithmetic and natural language tasks show that POLCA recovers clusters representing interpretable breakthroughs in model capabilities, suggesting its potential as a tool for unsupervised interpretability.
研究旨在通过分析损失曲线来发现语言模型训练中的隐藏突破。POLCA方法沿训练子空间的任意基分解损失变化,以识别具有相似损失变化的数据簇。实验表明,POLCA可以恢复代表模型概念进步的可解释簇,暗示其在无监督解释中的潜力。
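Why a scalar loss can hide per-sample transitions is easier to see with a small sketch. It uses a first-order Taylor approximation of each sample's loss change, split along an orthonormal basis. This is an illustration of the general idea, not POLCA's actual estimator, which the abstract does not specify. The gradients, the parameter step, and the basis are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def loss_change_by_direction(grad, delta_theta, basis):
    """First-order loss change of one sample, split per basis vector.

    For an orthonormal basis spanning the update, the entries sum to
    dot(grad, delta_theta), the usual first-order loss change.
    """
    return [dot(grad, b) * dot(delta_theta, b) for b in basis]


basis = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical orthonormal basis
delta_theta = [0.1, 0.1]          # hypothetical parameter update
per_sample_grads = {
    "sample_a": [-2.0, 0.0],      # improves along the first direction
    "sample_b": [0.0, 2.0],       # worsens along the second direction
}

parts = {
    name: loss_change_by_direction(grad, delta_theta, basis)
    for name, grad in per_sample_grads.items()
}
total_change = sum(sum(p) for p in parts.values())

for name, p in parts.items():
    print(name, p)
print("aggregate change:", total_change)
```

Here the aggregate change is zero even though each sample moves substantially along its own direction, which is the kind of structure a per-direction, per-sample view can surface and a single scalar cannot.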
Discovering Intersectional Bias via Directional Alignment in Face Recognition Embeddings
Authors: Ignacio Serna
First: 2025-10-17T10:49:50+00:00 · Latest: 2026-03-20T16:03:29+00:00
Abstract
Modern face recognition models embed identities on a unit hypersphere, where identity variation forms tight clusters. Conversely, shared semantic attributes can often be effectively approximated as linear directions in the latent space. Existing bias evaluation methods rely on predefined attribute labels, synthetic counterfactuals, or proximity-based clustering, all of which fail to capture intersectional subpopulations that emerge along latent directions. We introduce LatentAlign, an attribute-free algorithm that discovers semantically coherent and interpretable subpopulations by iteratively aligning embeddings along dominant latent directions. Unlike distance-based clustering, LatentAlign exploits the geometry of hyperspherical embeddings to isolate directional structures shared across identities, allowing for the interpretable discovery of attributes. Across four state-of-the-art recognition backbones (ArcFace, CosFace, ElasticFace, PartialFC) and two benchmarks (RFW, CelebA), LatentAlign consistently yields more semantically coherent groups than $k$-means, spherical $k$-means, nearest-neighbor search, and DBSCAN. Crucially, the discovered subpopulations expose severe intersectional vulnerabilities, with False Match Rates up to 4x higher than groups defined by explicit annotations. Our results show that by treating semantic attributes as directional features rather than spatial clusters, we can effectively isolate intersectional subpopulations and expose hidden biases that standard audits miss.
中文标题/摘要
标题:通过面部识别嵌入的方向对齐发现交叉偏见
现代面部识别模型将身份嵌入到单位超球面上,其中身份变化形成紧密的簇。相反,共享的语义属性通常可以有效地近似为潜在空间中的线性方向。现有的偏见评估方法依赖于预定义的属性标签、合成的反事实或基于距离的聚类,所有这些都无法捕捉到沿着潜在方向出现的交叉子群体。我们引入了LatentAlign,这是一种无属性算法,通过迭代沿主导潜在方向对齐嵌入来发现语义上一致且可解释的子群体。与基于距离的聚类不同,LatentAlign利用超球面嵌入的几何结构来隔离身份间共享的方向结构,从而实现可解释的属性发现。在四种最先进的识别骨干网络(ArcFace、CosFace、ElasticFace、PartialFC)和两个基准(RFW、CelebA)上,LatentAlign始终比k-means、球形k-means、最近邻搜索和DBSCAN生成更语义上一致的群体。至关重要的是,发现的子群体揭示了严重的交叉脆弱性,错误匹配率最高可达4倍于由显式注释定义的群体。我们的结果表明,通过将语义属性视为方向特征而不是空间聚类,我们可以有效地隔离交叉子群体并揭示标准审计所忽略的隐藏偏见。
Summary / 总结
The research aims to identify intersectional biases in face recognition models with LatentAlign, an attribute-free algorithm that iteratively aligns embeddings along dominant latent directions. Unlike distance-based clustering, LatentAlign exploits the geometry of hyperspherical embeddings to discover semantically coherent, interpretable subpopulations. Across four recognition backbones (ArcFace, CosFace, ElasticFace, PartialFC) and two benchmarks (RFW, CelebA), LatentAlign consistently yields more semantically coherent groups than $k$-means, spherical $k$-means, nearest-neighbor search, and DBSCAN. The discovered subpopulations expose severe intersectional vulnerabilities, with False Match Rates up to 4x higher than those of groups defined by explicit annotations.
研究旨在通过开发LatentAlign方法来发现面部识别模型中的交叉偏见,该方法不依赖属性标签,而是沿主导的潜在方向对嵌入进行对齐。该方法比传统聚类技术更一致地生成语义上更连贯的组,并且与通过显式注释定义的组相比,暴露的交叉偏见的假匹配率高达4倍更高。
Fast 3D Diffusion for Scalable Granular Media Synthesis
Authors: Muhammad Moeeze Hassan, Régis Cottereau, Filippo Gatti, Patryk Dec
First: 2025-08-27T10:27:36+00:00 · Latest: 2026-03-20T16:01:22+00:00
Abstract
Discrete Element Method (DEM) simulations of granular media are computationally intensive, particularly during initialization phases dominated by large displacements and kinetic energy. This paper presents a novel generative pipeline based on 3D diffusion models that directly synthesizes arbitrarily large granular assemblies in mechanically realistic configurations. The approach employs a two-stage pipeline. First, an unconditional diffusion model generates independent 3D voxel grids representing granular media; second, a 3D inpainting model, adapted from 2D techniques using masked inputs and repainting strategies, seamlessly stitches these grids together. The inpainting model uses the outputs of the unconditional diffusion model to learn from the context of adjacent generations and creates new regions that blend smoothly into the context region. Both models are trained on binarized 3D occupancy grids derived from a database of small-scale DEM simulations, scaling linearly with the number of output voxels. Simulations that spanned over days can now run in hours, practically enabling simulations containing more than 200k ballast particles. The pipeline remains fully compatible with existing DEM workflows as it post-processes the diffusion generated voxel grids into DEM compatible particle meshes. Being mechanically consistent on key granulometry metrics with the original DEM simulations, the pipeline is also compatible with many other applications in the field of granular media, with capability of generating both convex and non-convex particles. Showcased on two examples (railway ballast and lunar regolith), the pipeline reimagines the way initialization of granular media simulations is performed, enabling scales of generation previously unattainable with traditional DEM simulations.
中文标题/摘要
标题:快速3D扩散用于可扩展的颗粒介质合成
离散元方法(DEM)模拟颗粒介质在初始化阶段,特别是在由大量位移和动能主导的情况下,计算密集度高。本文提出了一种基于3D扩散模型的新型生成管道,可以直接合成任意大小的颗粒组装体,并处于机械上现实的配置中。该方法采用两阶段管道。首先,无条件扩散模型生成独立的3D体素网格,代表颗粒介质;其次,一种3D修补模型,采用从2D技术改编的带有掩码输入和重涂策略,无缝地将这些网格拼接在一起。修补模型利用无条件扩散模型的输出,从相邻生成的上下文中学习,并创建能够平滑融入上下文区域的新区域。两个模型均在来自小规模DEM模拟数据库的二值化3D占用网格上进行训练,其规模与输出体素的数量成线性关系。原本需要数天才能完成的模拟现在可以在数小时内完成,实际上使包含超过20万块压载颗粒的模拟成为可能。该管道完全兼容现有的DEM工作流程,因为它将扩散生成的体素网格后处理为与DEM兼容的颗粒网格。在关键颗粒度量上保持机械一致性,该管道还与其他颗粒介质领域的许多其他应用兼容,能够生成凸性和非凸性颗粒。在两个示例(铁路压载和月球风化层)上展示,该管道重新定义了颗粒介质模拟的初始化方式,使以前无法通过传统DEM模拟实现的生成规模成为可能。
Summary / 总结
This paper addresses the computational cost of initializing large-scale granular media simulations with the Discrete Element Method (DEM). It introduces a two-stage generative pipeline based on 3D diffusion models that directly synthesizes arbitrarily large granular assemblies. The first stage generates independent 3D voxel grids with an unconditional diffusion model, and the second stage uses a 3D inpainting model to stitch these grids together seamlessly. The pipeline scales linearly with the number of output voxels, reduces simulations that took days to hours, and makes simulations with more than 200k ballast particles practical. It stays compatible with existing DEM workflows by post-processing the generated voxel grids into DEM-compatible particle meshes, remains consistent with the original DEM simulations on key granulometry metrics, and can produce both convex and non-convex particles.
该论文解决了使用离散元方法(DEM)进行大规模颗粒介质初始化时的计算难题。它提出了一种基于3D扩散模型的两阶段生成管道,可以高效地合成大型颗粒组装体。第一阶段生成独立的3D体素网格,第二阶段使用修补技术将这些网格无缝拼接。这两种模型都基于从DEM模拟中提取的二值化3D占用网格进行训练,允许大规模且机械上一致的模拟。这种方法可以生成包含超过20万个颗粒的模拟,在几小时内完成,而传统方法则需要数天时间。
A Unified Platform and Quality Assurance Framework for 3D Ultrasound Reconstruction with Robotic, Optical, and Electromagnetic Tracking
Authors: Lewis Howell, Manisha Waterston, Tze Min Wah, James H. Chandler, James R. McLaughlan
First: 2026-03-20T15:56:50+00:00 · Latest: 2026-03-20T15:56:50+00:00
Comments: This work has been submitted to the IEEE for possible publication
Abstract
Three-dimensional (3D) Ultrasound (US) can facilitate diagnosis, treatment planning, and image-guided therapy. However, current studies rarely provide a comprehensive evaluation of volumetric accuracy and reproducibility, highlighting the need for robust Quality Assurance (QA) frameworks, particularly for tracked 3D US reconstruction using freehand or robotic acquisition. This study presents a QA framework for 3D US reconstruction and a flexible open source platform for tracked US research. A custom phantom containing geometric inclusions with varying symmetry properties enables straightforward evaluation of optical, electromagnetic, and robotic kinematic tracking for 3D US at different scanning speeds and insonation angles. A standardised pipeline performs real-time segmentation and 3D reconstruction of geometric targets (DSC = 0.97, FPS = 46) without GPU acceleration, followed by automated registration and comparison with ground-truth geometries. Applying this framework showed that our robotic 3D US achieves state-of-the-art reconstruction performance (DSC-3D = 0.94 ± 0.01, HD95 = 1.17 ± 0.12), approaching the spatial resolution limit imposed by the transducer. This work establishes a flexible experimental platform and a reproducible validation methodology for 3D US reconstruction. The proposed framework enables robust cross-platform comparisons and improved reporting practices, supporting the safe and effective clinical translation of 3D ultrasound in diagnostic and image-guided therapy applications.
中文标题/摘要
标题:3D超声重建与机器人、光学和电磁跟踪的统一平台及质量保证框架
三维(3D)超声(US)可以促进诊断、治疗计划和图像引导治疗。然而,当前的研究很少提供对体积准确性和可重复性的全面评估,突显了需要强大的质量保证(QA)框架,特别是对于使用自由手或机器人获取的跟踪3D US重建。本研究提出了一种3D US重建的QA框架和一个灵活的开源平台,用于跟踪US研究。一个自定义的模型包含具有不同对称性的几何包涵物,可以方便地评估光学、电磁和机器人运动跟踪在不同扫描速度和入射角下的3D US重建。一个标准化的工作流程在不使用GPU加速的情况下实时分割和重建几何目标(DSC = 0.97, FPS = 46),然后进行自动配准并与真实几何形状进行比较。应用此框架表明,我们的机器人3D US实现了最先进的重建性能(DSC-3D = 0.94 ± 0.01, HD95 = 1.17 ± 0.12),接近由换能器设定的空间分辨率限制。本研究建立了一个灵活的实验平台和可重复的验证方法,用于3D US重建。提出的框架使跨平台比较变得稳健,并提高了报告实践,支持3D超声在诊断和图像引导治疗应用中的安全和有效临床转化。
Summary / 总结
This study introduces a QA framework and an open source platform for evaluating 3D ultrasound reconstruction using robotic, optical, and electromagnetic tracking. A custom phantom with geometric inclusions was used to assess tracking at different scanning speeds and insonation angles, and a standardised pipeline performed real-time segmentation and reconstruction without GPU acceleration. The framework showed that the robotic 3D US system achieved state-of-the-art reconstruction performance, with a high Dice Similarity Coefficient (DSC-3D = 0.94 ± 0.01) and a low 95th-percentile Hausdorff distance (HD95 = 1.17 ± 0.12), approaching the spatial resolution limit of the transducer. This work provides a reproducible method for cross-platform comparisons and improved reporting practices in 3D ultrasound research.
该研究提出了一种质量保证框架和开源平台,用于评估使用机器人、光学和电磁跟踪的3D超声重建性能。通过使用定制的模型评估了不同跟踪方法在不同条件下的表现。该框架显示,机器人3D US系统实现了最先进的重建性能,Dice相似系数(DSC-3D = 0.94 ± 0.01)和Hausdorff距离(HD95 = 1.17 ± 0.12)均表现出色,接近超声波探头的空间分辨率限制。该工作提供了一种可靠的比较和验证3D超声重建系统的办法,支持其临床转化。
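For readers unfamiliar with the DSC figures reported above, the Dice Similarity Coefficient between a reconstructed mask and a ground-truth mask is twice the overlap divided by the sum of the two mask sizes. The sketch below works on flattened binary masks. The example masks are made up, and returning 1.0 when both masks are empty is a convention assumed here, not something stated in the paper.

```python
def dice_coefficient(pred, truth):
    """Dice Similarity Coefficient for two flattened binary masks.

    DSC = 2 * |pred AND truth| / (|pred| + |truth|), in [0, 1].
    """
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * overlap / total


# Hypothetical flattened masks (1 = voxel inside the target).
reconstructed = [1, 1, 1, 0, 0, 0]
ground_truth = [1, 1, 0, 0, 0, 0]
print(dice_coefficient(reconstructed, ground_truth))
```

A score of 1 means the two masks coincide exactly and 0 means they do not overlap at all. HD95 complements this by measuring boundary distance, where lower is better.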