arXiv 论文速递

Snapshot: 20260201_0328

RedSage: A Cybersecurity Generalist LLM

Authors: Naufal Suryanto, Muzammal Naseer, Pengfei Li, Syed Talal Wasim, Jinhui Yi, Juergen Gall, Paolo Ceravolo, Ernesto Damiani

Venue: ICLR 2026

First: 2026-01-29T18:59:57+00:00 · Latest: 2026-01-29T18:59:57+00:00

Comments: Accepted at ICLR 2026; Project page: https://risys-lab.github.io/RedSage/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Cybersecurity operations demand assistant LLMs that support diverse workflows without exposing sensitive data. Existing solutions either rely on proprietary APIs with privacy risks or on open models lacking domain adaptation. To bridge this gap, we curate 11.8B tokens of cybersecurity-focused continual pretraining data via large-scale web filtering and manual collection of high-quality resources, spanning 28.6K documents across frameworks, offensive techniques, and security tools. Building on this, we design an agentic augmentation pipeline that simulates expert workflows to generate 266K multi-turn cybersecurity samples for supervised fine-tuning. Combined with general open-source LLM data, these resources enable the training of RedSage, an open-source, locally deployable cybersecurity assistant with domain-aware pretraining and post-training. To rigorously evaluate the models, we introduce RedSage-Bench, a benchmark with 30K multiple-choice and 240 open-ended Q&A items covering cybersecurity knowledge, skills, and tool expertise. RedSage is further evaluated on established cybersecurity benchmarks (e.g., CTI-Bench, CyberMetric, SECURE) and general LLM benchmarks to assess broader generalization. At the 8B scale, RedSage achieves consistently better results, surpassing the baseline models by up to +5.59 points on cybersecurity benchmarks and +5.05 points on Open LLM Leaderboard tasks. These findings demonstrate that domain-aware agentic augmentation and pre/post-training can not only enhance cybersecurity-specific expertise but also help to improve general reasoning and instruction-following. All models, datasets, and code are publicly available.

中文标题/摘要

标题：RedSage：网络安全通才大语言模型

网络安全操作需要能够支持多样化工作流程而不泄露敏感数据的辅助大语言模型。现有解决方案要么依赖于存在隐私风险的专有API，要么基于缺乏领域适应性的开源模型。为弥合这一差距，我们通过大规模网络过滤和手动收集高质量资源，整理了118亿token的网络安全持续预训练数据，涵盖28600份文档，涉及框架、攻击技术和安全工具。在此基础上，我们设计了一种代理增强流水线，模拟专家工作流程生成266000个网络安全多轮对话样本，用于监督微调。结合通用开源大语言模型数据，这些资源使我们能够训练出RedSage，这是一个开源、可本地部署的网络安全助手，具有领域感知的预训练和后训练。为了严格评估模型，我们引入了RedSage-Bench基准，包含30000个多项选择题和240个开放式问答项，涵盖网络安全知识、技能和工具专长。RedSage还在已有的网络安全基准（如CTI-Bench、CyberMetric、SECURE）和通用大语言模型基准上进行了评估，以评估其更广泛的泛化能力。在80亿参数规模下，RedSage的表现始终优于基线模型，在网络安全基准和Open LLM Leaderboard任务上分别最多提高5.59分和5.05分。这些发现表明，领域感知的代理增强和预/后训练不仅可以增强网络安全特定的专业知识，还可以帮助提高一般推理和指令遵循能力。所有模型、数据集和代码均已公开。

Summary / 总结

RedSage is designed to support diverse cybersecurity workflows while maintaining data privacy. It leverages 11.8 billion tokens of curated cybersecurity data and an agentic augmentation pipeline to generate 266K multi-turn samples for fine-tuning. RedSage outperforms baseline models by up to 5.59 points on cybersecurity benchmarks and 5.05 points on general LLM tasks, showing enhanced cybersecurity-specific expertise and broader generalization capabilities.

RedSage旨在解决需要一个既能支持多样化工作流程又能够保护数据隐私的网络安全助手LLM的问题。它利用了11.8B个网络安全相关的数据令牌，并通过一个代理增强管道生成了266K样本进行监督微调。RedSage在网络安全基准测试中的表现比基线模型高出最多5.59分，在通用LLM基准测试中的表现高出5.05分，证明了领域意识增强前/后训练和代理增强的有效性。该模型是开源的，并可通过RedSage-Bench和其他基准测试进行评估。

UEval: A Benchmark for Unified Multimodal Generation

Authors: Bo Li, Yida Yin, Wenhao Chai, Xingyu Fu, Zhuang Liu

First: 2026-01-29T18:59:52+00:00 · Latest: 2026-01-29T18:59:52+00:00

Abs · PDF · Code1 · Code2

Abstract

We introduce UEval, a benchmark to evaluate unified models, i.e., models capable of generating both images and text. UEval comprises 1,000 expert-curated questions that require both images and text in the model output, sourced from 8 real-world tasks. Our curated questions cover a wide range of reasoning types, from step-by-step guides to textbook explanations. Evaluating open-ended multimodal generation is non-trivial, as simple LLM-as-a-judge methods can miss the subtleties. Different from previous works that rely on multimodal Large Language Models (MLLMs) to rate image quality or text accuracy, we design a rubric-based scoring system in UEval. For each question, reference images and text answers are provided to an MLLM to generate an initial rubric, consisting of multiple evaluation criteria, and human experts then refine and validate these rubrics. In total, UEval contains 10,417 validated rubric criteria, enabling scalable and fine-grained automatic scoring. UEval is challenging for current unified models: GPT-5-Thinking scores only 66.4 out of 100, while the best open-source model reaches merely 49.1. We observe that reasoning models often outperform non-reasoning ones, and transferring reasoning traces from a reasoning model to a non-reasoning model significantly narrows the gap. This suggests that reasoning may be important for tasks requiring complex multimodal understanding and generation.

中文标题/摘要

标题：UEval：统一多模态生成基准

我们介绍了UEval，一个用于评估统一模型的基准，即能够生成图像和文本的模型。UEval 包含1000个专家精选的问题，要求模型输出中包含图像和文本，这些问题来源于8个实际任务。我们精选的问题涵盖了广泛的推理类型，从逐步指南到教科书解释。对开放式的多模态生成进行评估并不简单，简单的LLM作为评判者的方法可能会忽略细节。不同于以往依赖多模态大型语言模型（MLLMs）来评估图像质量和文本准确性的工作，我们在UEval中设计了一种基于评分标准的评分系统。对于每个问题，提供参考图像和文本答案给MLLM生成初始评分标准，包括多个评估标准，然后由人类专家进一步完善和验证这些评分标准。UEval 包含10,417个验证过的评分标准，使其能够实现可扩展和精细的自动评分。UEval 对当前的统一模型具有挑战性：GPT-5-Thinking仅得分为66.4/100，而最好的开源模型仅达到49.1。我们观察到，推理模型通常优于非推理模型，从推理模型向非推理模型转移推理痕迹可以显著缩小差距。这表明，对于需要复杂多模态理解和生成的任务，推理可能是重要的。

Summary / 总结

UEval is a benchmark for evaluating unified models that generate both images and text, comprising 1,000 expert-curated questions from 8 real-world tasks. It uses a rubric-based scoring system, where an MLLM generates initial criteria and human experts refine them. UEval challenges current unified models, with GPT-5-Thinking scoring 66.4 and the best open-source model scoring 49.1. Reasoning models outperform non-reasoning ones, and transferring reasoning from a reasoning model to a non-reasoning one significantly improves performance.

UEval 是一个用于评估能够生成图像和文本的统一模型的基准，包含来自 8 个真实世界任务的 1,000 个专家精选问题。它采用基于评分表的评分系统，其中 MLLM 生成初始标准，然后由人类专家进行细化。UEval 对当前统一模型构成挑战，GPT-5-Thinking 的得分为 66.4，而最佳开源模型仅为 49.1。推理模型在复杂多模态任务中表现更优，从推理模型转移推理痕迹可以显著提高性能。
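As a rough illustration of how rubric-based scoring can turn per-criterion judgments into a 0-100 score, consider the sketch below. The abstract does not state UEval's aggregation rule, so the equal-weight average, the binary judgments, and the names `Criterion` and `rubric_score` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    description: str  # hypothetical rubric text, not taken from UEval
    satisfied: bool   # judgment from an automatic grader (assumed to be binary)


def rubric_score(criteria: list[Criterion]) -> float:
    """Return the percentage of rubric criteria judged satisfied (equal weights assumed)."""
    if not criteria:
        raise ValueError("rubric must contain at least one criterion")
    met = sum(1 for c in criteria if c.satisfied)
    return 100.0 * met / len(criteria)


example = [
    Criterion("Image shows each step in order", True),
    Criterion("Text names every required tool", True),
    Criterion("Image and text agree on the final result", False),
    Criterion("Explanation is factually correct", True),
]
print(rubric_score(example))  # 3 of 4 criteria met -> 75.0
```

Scoring against many small, validated criteria rather than a single holistic rating is what makes the evaluation fine-grained; a weighted variant would only change the aggregation line.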

Exploring Reasoning Reward Model for Agents

Authors: Kaixuan Fan, Kaituo Feng, Manyuan Zhang, Tianshuo Peng, Zhixun Li, Yilei Jiang, Shuang Chen, Peng Pei, Xunliang Cai, Xiangyu Yue

First: 2026-01-29T18:59:52+00:00 · Latest: 2026-01-29T18:59:52+00:00

Comments: Project page: https://github.com/kxfan2002/Reagent

Abs · PDF · Code1 · Code2 · Code3

Abstract

Agentic Reinforcement Learning (Agentic RL) has achieved notable success in enabling agents to perform complex reasoning and tool use. However, most methods still rely on sparse outcome-based reward for training. Such feedback fails to differentiate intermediate reasoning quality, leading to suboptimal training results. In this paper, we introduce Agent Reasoning Reward Model (Agent-RRM), a multi-faceted reward model that produces structured feedback for agentic trajectories, including (1) an explicit reasoning trace, (2) a focused critique that provides refinement guidance by highlighting reasoning flaws, and (3) an overall score that evaluates process performance. Leveraging these signals, we systematically investigate three integration strategies: Reagent-C (text-augmented refinement), Reagent-R (reward-augmented guidance), and Reagent-U (unified feedback integration). Extensive evaluations across 12 diverse benchmarks demonstrate that Reagent-U yields substantial performance leaps, achieving 43.7% on GAIA and 46.2% on WebWalkerQA, validating the effectiveness of our reasoning reward model and training schemes. Code, models, and datasets are all released to facilitate future research.

中文标题/摘要

标题：探索代理推理奖励模型

代理强化学习（Agentic RL）在使代理执行复杂推理和工具使用方面取得了显著成功。然而，大多数方法仍然依赖于稀疏的结果奖励进行训练。这种反馈无法区分中间推理的质量，导致训练结果不佳。在本文中，我们引入了代理推理奖励模型（Agent-RRM），这是一种多方面的奖励模型，为代理轨迹提供结构化的反馈，包括（1）明确的推理轨迹，（2）聚焦的批评，通过突出推理缺陷提供改进指导，以及（3）总体评分，评估过程表现。利用这些信号，我们系统地研究了三种整合策略：Reagent-C（文本增强改进），Reagent-R（奖励增强指导）和Reagent-U（统一反馈整合）。在12个不同的基准测试中进行的广泛评估表明，Reagent-U带来了显著的性能提升，在GAIA上达到43.7%，在WebWalkerQA上达到46.2%，验证了我们推理奖励模型和训练方案的有效性。代码、模型和数据集均已发布，以促进未来的研究。

Summary / 总结

This paper addresses the limitation of sparse outcome-based rewards in Agentic Reinforcement Learning by proposing Agent Reasoning Reward Model (Agent-RRM), which provides structured feedback including reasoning traces, focused critiques, and overall scores. Three integration strategies—Reagent-C, Reagent-R, and Reagent-U—are evaluated, with Reagent-U showing significant performance improvements, achieving 43.7% and 46.2% on GAIA and WebWalkerQA benchmarks, respectively.

本文针对Agentic Reinforcement Learning中由于稀疏结果奖励导致的训练效果不佳问题，提出了Agent Reasoning Reward Model (Agent-RRM)，该模型提供结构化的反馈，包括推理轨迹、聚焦批评和整体评分。三种集成策略被探索：Reagent-C、Reagent-R和Reagent-U。广泛的评估显示，Reagent-U显著提高了性能，分别在GAIA和WebWalkerQA上达到了43.7%和46.2%，验证了推理奖励模型和训练方案的有效性。

DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation

Authors: Haozhe Xie, Beichen Wen, Jiarui Zheng, Zhaoxi Chen, Fangzhou Hong, Haiwen Diao, Ziwei Liu

Venue: www

First: 2026-01-29T18:59:51+00:00 · Latest: 2026-01-29T18:59:51+00:00

Comments: Project Page: https://www.infinitescript.com/project/dynamic-vla/ GitHub: https://github.com/hzxie/DynamicVLA

Abs · PDF · Code1 · Code2 · Code3 · Project1

Abstract

Manipulating dynamic objects remains an open challenge for Vision-Language-Action (VLA) models, which, despite strong generalization in static manipulation, struggle in dynamic scenarios requiring rapid perception, temporal anticipation, and continuous control. We present DynamicVLA, a framework for dynamic object manipulation that integrates temporal reasoning and closed-loop adaptation through three key designs: 1) a compact 0.4B VLA using a convolutional vision encoder for spatially efficient, structurally faithful encoding, enabling fast multimodal inference; 2) Continuous Inference, enabling overlapping reasoning and execution for lower latency and timely adaptation to object motion; and 3) Latent-aware Action Streaming, which bridges the perception-execution gap by enforcing temporally aligned action execution. To fill the missing foundation of dynamic manipulation data, we introduce the Dynamic Object Manipulation (DOM) benchmark, built from scratch with an auto data collection pipeline that efficiently gathers 200K synthetic episodes across 2.8K scenes and 206 objects, and enables fast collection of 2K real-world episodes without teleoperation. Extensive evaluations demonstrate remarkable improvements in response speed, perception, and generalization, positioning DynamicVLA as a unified framework for general dynamic object manipulation across embodiments.

中文标题/摘要

标题：DynamicVLA：一种用于动态物体操作的视觉-语言-行动模型

对于动态物体的操作，视觉-语言-行动（VLA）模型仍然面临挑战，尽管在静态操作上表现出强大的泛化能力，但在需要快速感知、时间预测和连续控制的动态场景中却难以应对。我们提出了DynamicVLA，一种通过三个关键设计结合时间推理和闭环适应的框架：1）一个紧凑的0.4B VLA，使用卷积视觉编码器进行空间高效、结构忠实的编码，实现快速多模态推理；2）连续推理，实现重叠的推理和执行，以降低延迟并及时适应物体运动；3）潜在感知动作流式传输，通过强制执行时间对齐的动作执行来弥合感知-执行差距。为了填补动态操作数据的空白，我们引入了Dynamic Object Manipulation（DOM）基准，该基准从头开始构建，使用自动数据收集管道高效地收集了涵盖2800个场景和206个物体的20万个合成回合（episode），并能够在无需遥操作的情况下快速收集2000个真实世界回合。广泛的评估表明，在响应速度、感知和泛化方面取得了显著改进，将DynamicVLA定位为一种统一框架，适用于不同具身形态的通用动态物体操作。

Summary / 总结

DynamicVLA is designed to address the challenge of dynamic object manipulation for VLA models by integrating temporal reasoning and closed-loop adaptation. It uses a compact vision-language-action model with a convolutional vision encoder for efficient inference and introduces Continuous Inference and Latent-aware Action Streaming to handle dynamic scenarios. The framework demonstrates significant improvements in response speed, perception, and generalization through extensive evaluations and a newly created DOM benchmark with 200K synthetic and 2K real-world episodes.

DynamicVLA旨在通过集成时间推理和闭环适应来解决VLA模型在动态物体操作中的挑战。它使用具有卷积视觉编码器的紧凑型视觉-语言-行动模型以实现高效推理，并引入连续推理和潜在感知动作流以处理动态场景。该框架通过广泛的评估和一个包含200K合成回合和2K真实世界回合（episode）的DOM基准，展示了在响应速度、感知和泛化方面的显著改进。

Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions

Authors: Xiaoxiao Sun, Mingyang Li, Kun Yuan, Min Woo Sun, Mark Endo, Shengguang Wu, Changlin Li, Yuhui Zhang, Zeyu Wang, Serena Yeung-Levy

First: 2026-01-29T18:59:24+00:00 · Latest: 2026-01-29T18:59:24+00:00

Comments: 26 pages, 31 figures, 13 tables. Project Page: https://sites.google.com/view/vi-probe/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Large Vision-Language Models (VLMs) often answer classic visual illusions "correctly" on original images, yet persist with the same responses when illusion factors are inverted, even though the visual change is obvious to humans. This raises a fundamental question: do VLMs perceive visual changes or merely recall memorized patterns? While several studies have noted this phenomenon, the underlying causes remain unclear. To move from observations to systematic understanding, this paper introduces VI-Probe, a controllable visual-illusion framework with graded perturbations and matched visual controls (without illusion inducer) that disentangles visually grounded perception from language-driven recall. Unlike prior work that focuses on averaged accuracy, we measure stability and sensitivity using Polarity-Flip Consistency, Template Fixation Index, and an illusion multiplier normalized against matched controls. Experiments across different families reveal that response persistence arises from heterogeneous causes rather than a single mechanism. For instance, GPT-5 exhibits memory override, Claude-Opus-4.1 shows perception-memory competition, while Qwen variants suggest visual-processing limits. Our findings challenge single-cause views and motivate probing-based evaluation that measures both knowledge and sensitivity to controlled visual change. Data and code are available at https://sites.google.com/view/vi-probe/.

中文标题/摘要

标题：VLMs 是感知还是回忆？经典视觉错觉探究视觉感知与记忆

大型视觉-语言模型（VLMs）在原始图像上对经典视觉错觉通常会给出正确的回答，但在错觉因素被反转后，仍然坚持相同的回答，尽管这些视觉变化对人类来说是显而易见的。这引发了一个基本问题：VLMs 是感知视觉变化还是仅仅回忆已记忆的模式？尽管已有几项研究注意到了这一现象，但其背后的成因仍然不清楚。为了从观察转向系统理解，本文引入了VI-Probe，这是一种可控的视觉错觉框架，具有分级扰动和匹配的视觉对照（没有错觉诱导器），以解开基于视觉的感知与语言驱动的回忆之间的关系。不同于以往工作主要关注平均准确率，我们使用极性反转一致性、模板固定指数以及与匹配对照标准化后的错觉乘数来衡量稳定性和敏感性。跨不同家族的实验表明，反应持久性是由多种原因而非单一机制引起的。例如，GPT-5 展现了记忆覆盖，Claude-Opus-4.1 显示了感知与记忆的竞争，而 Qwen 变体则表明了视觉处理的限制。我们的发现挑战了单一成因的观点，并激发了基于探针的评估，该评估同时衡量知识和对受控视觉变化的敏感性。数据和代码可在 https://sites.google.com/view/vi-probe/ 获取。

Summary / 总结

This paper investigates whether large vision-language models (VLMs) perceive visual changes or merely recall memorized patterns by introducing VI-Probe, a framework that uses classic visual illusions with graded perturbations and matched visual controls. The study measures stability and sensitivity using Polarity-Flip Consistency, Template Fixation Index, and an illusion multiplier. The experiments show that response persistence arises from different causes, such as memory override, perception-memory competition, and visual-processing limits, challenging the notion of a single cause. The findings motivate a probing-based evaluation that assesses both knowledge and sensitivity to controlled visual change.

该研究通过使用可控视觉错觉框架VI-Probe，探讨大型视觉语言模型（VLMs）是感知视觉变化还是仅回忆记忆中的模式。研究使用极性反转一致性、模板固定指数和错觉乘数来衡量稳定性和敏感性。实验表明，VLMs响应持久性的原因多种多样，包括记忆覆盖、感知-记忆竞争和视觉处理限制，这挑战了单一机制的观点。研究结果表明，基于探针的评估对于评估控制视觉变化的知识和敏感性是必要的。

DynaWeb: Model-Based Reinforcement Learning of Web Agents

Authors: Hang Ding, Peidong Liu, Junqiao Wang, Ziwei Ji, Meng Cao, Rongzhao Zhang, Lynn Ai, Eric Yang, Tianyu Shi, Lei Yu

First: 2026-01-29T18:59:07+00:00 · Latest: 2026-01-29T18:59:07+00:00

Abs · PDF · Code1 · Code2

Abstract

The development of autonomous web agents, powered by Large Language Models (LLMs) and reinforcement learning (RL), represents a significant step towards general-purpose AI assistants. However, training these agents is severely hampered by the challenges of interacting with the live internet, which is inefficient, costly, and fraught with risks. Model-based reinforcement learning (MBRL) offers a promising solution by learning a world model of the environment to enable simulated interaction. This paper introduces DynaWeb, a novel MBRL framework that trains web agents through interacting with a web world model trained to predict naturalistic web page representations given agent actions. This model serves as a synthetic web environment where an agent policy can dream by generating vast quantities of rollout action trajectories for efficient online reinforcement learning. Beyond free policy rollouts, DynaWeb incorporates real expert trajectories from training data, which are randomly interleaved with on-policy rollouts during training to improve stability and sample efficiency. Experiments conducted on the challenging WebArena and WebVoyager benchmarks demonstrate that DynaWeb consistently and significantly improves the performance of state-of-the-art open-source web agent models. Our findings establish the viability of training web agents through imagination, offering a scalable and efficient way to scale up online agentic RL.

中文标题/摘要

标题：DynaWeb：基于模型的强化学习网页代理

自主网页代理的发展，由大型语言模型（LLMs）和强化学习（RL）驱动，代表了通用人工智能助手的重要一步。然而，训练这些代理受到与实时互联网交互的挑战限制，这既低效又昂贵且充满风险。基于模型的强化学习（MBRL）通过学习环境模型以实现模拟交互提供了有希望的解决方案。本文介绍了DynaWeb，这是一种新颖的MBRL框架，通过与训练以预测给定代理行为的自然网页表示的网页世界模型交互来训练网页代理。该模型充当合成的网页环境，代理策略可以在其中“做梦”，生成大量回放动作轨迹以实现高效的在线强化学习。除了自由策略回放外，DynaWeb还整合了来自训练数据的真实专家轨迹，这些轨迹在训练期间随机与同策略（on-policy）回放交织，以提高稳定性和样本效率。在具有挑战性的WebArena和WebVoyager基准测试中进行的实验表明，DynaWeb能够持续且显著地提高最先进的开源网页代理模型的性能。我们的研究结果证明了通过想象训练网页代理的可行性，提供了扩展在线代理性RL的一种可扩展且高效的方法。

Summary / 总结

DynaWeb is a model-based reinforcement learning framework designed to train web agents using a synthetic web environment. It addresses the challenges of real-world interaction by learning a world model to simulate web page representations. DynaWeb incorporates both free policy rollouts and real expert trajectories to enhance stability and sample efficiency. Experiments on WebArena and WebVoyager show that DynaWeb significantly improves the performance of existing web agent models.

DynaWeb 是一种基于模型的强化学习框架，用于通过合成的网络环境训练网络代理。它通过学习一个能够根据代理动作预测网页表示的世界模型来解决现实世界交互的挑战。DynaWeb 结合了自由策略回放和真实专家轨迹，以提高稳定性和样本效率。实验表明，DynaWeb 在 WebArena 和 WebVoyager 等具有挑战性的基准测试中显著提升了现有网络代理模型的性能。
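The random interleaving of expert trajectories with on-policy rollouts described in the DynaWeb abstract could look roughly like the sketch below. This is not the authors' training loop: the per-slot coin flip, the `expert_prob` parameter, and the function names are assumptions, and `fake_rollout` is a stand-in for rolling the policy out inside the learned web world model.

```python
import random


def build_training_batch(sample_rollout, expert_trajectories, batch_size, expert_prob, rng):
    """Mix world-model rollouts with expert trajectories, picking the source at random per slot."""
    batch = []
    for _ in range(batch_size):
        if expert_trajectories and rng.random() < expert_prob:
            batch.append(("expert", rng.choice(expert_trajectories)))
        else:
            batch.append(("rollout", sample_rollout()))
    return batch


# Hypothetical stand-in: a real system would query the policy and the world model here.
def fake_rollout():
    return ["click(search)", "type(query)", "click(result)"]


experts = [["goto(home)", "click(login)"], ["click(cart)", "click(checkout)"]]
rng = random.Random(0)  # seeded only to make the example repeatable
batch = build_training_batch(fake_rollout, experts, batch_size=8, expert_prob=0.25, rng=rng)
print([source for source, _ in batch])
```

Setting `expert_prob` to 0 recovers pure imagined rollouts, while 1 trains on expert data only; values in between trade exploration in the world model against grounding in real trajectories.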

FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale

Authors: Ajay Patel, Colin Raffel, Chris Callison-Burch

First: 2026-01-29T18:58:47+00:00 · Latest: 2026-01-29T18:58:47+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Due to limited supervised training data, large language models (LLMs) are typically pre-trained via a self-supervised "predict the next word" objective on a vast amount of unstructured text data. To make the resulting model useful to users, it is further trained on a far smaller amount of "instruction-tuning" data comprised of supervised training examples of instructions and responses. To overcome the limited amount of supervised data, we propose a procedure that can transform the knowledge in internet-scale pre-training documents into billions of synthetic instruction and answer training pairs. The resulting dataset, called FineInstructions, uses ~18M instruction templates created from real user-written queries and prompts. These instruction templates are matched to and instantiated with human-written source documents from unstructured pre-training corpora. With "supervised" synthetic training data generated at this scale, an LLM can be pre-trained from scratch solely with the instruction-tuning objective, which is far more in-distribution with the expected downstream usage of LLMs (responding to user prompts). We conduct controlled token-for-token training experiments and find pre-training on FineInstructions outperforms standard pre-training and other proposed synthetic pre-training techniques on standard benchmarks measuring free-form response quality. Our resources can be found at https://huggingface.co/fineinstructions .

中文标题/摘要

标题：FineInstructions：将合成指令扩展至预训练规模

由于监督训练数据有限，大型语言模型（LLMs）通常通过在大量无结构文本数据上使用自我监督的“预测下一个词”目标进行预训练。为了使最终模型对用户有用，它还会在数量少得多的“指令调优”数据上进行进一步训练，这些数据由指令和响应的监督训练示例组成。为了克服有限的监督数据，我们提出了一种程序，可以将互联网规模预训练文档中的知识转化为数十亿组合成指令和答案训练对。由此产生的数据集称为FineInstructions，使用来自真实用户查询和提示的约1800万指令模板。这些指令模板与无结构预训练语料库中人类撰写的源文档相匹配，并据此实例化。借助如此规模生成的“监督”合成训练数据，LLM可以从头开始仅使用指令调优目标进行预训练，这与LLM预期下游使用情况（响应用户提示）更为一致。我们进行了token数量对等的受控训练实验，并发现使用FineInstructions进行预训练在衡量自由形式响应质量的标准基准上优于标准预训练和其他提出的合成预训练技术。我们的资源可以在https://huggingface.co/fineinstructions 获取。

Summary / 总结

The research aims to address the limitation of supervised training data for large language models (LLMs) by proposing a method to generate synthetic instruction and answer pairs at a massive scale. The method involves using 18 million instruction templates derived from real user queries and prompts, which are then matched and instantiated with human-written source documents from unstructured pre-training corpora. The resulting FineInstructions dataset enables LLMs to be pre-trained solely with the instruction-tuning objective, leading to better performance on standard benchmarks for free-form response quality compared to standard pre-training and other synthetic pre-training techniques.

研究旨在通过从互联网规模的预训练文档中生成合成的指令和答案对来解决大型语言模型（LLMs）监督训练数据有限的问题。FineInstructions数据集使用大约1800万个来源于真实用户查询的指令模板创建，用于仅通过指令调优目标对LLMs进行从头预训练，这在标准基准测试中提高了自由形式响应的质量，优于标准预训练和其他合成预训练技术。

MORPH: PDE Foundation Models with Arbitrary Data Modality

Authors: Mahindra Singh Rautela, Alexander Most, Siddharth Mansingh, Bradley C. Love, Alexander Scheinker, Diane Oyen, Nathan Debardeleben, Earl Lawrence, Ayan Biswas

First: 2025-09-25T22:38:36+00:00 · Latest: 2026-01-29T18:57:23+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

We introduce MORPH, a modality-agnostic, autoregressive foundation model for partial differential equations (PDEs). MORPH is built on a convolutional vision transformer backbone that seamlessly handles heterogeneous spatiotemporal datasets of varying data modality (1D-3D) at different resolutions, and multiple fields with mixed scalar and vector components. The architecture combines (i) component-wise convolution, which jointly processes scalar and vector channels to capture local interactions, (ii) inter-field cross-attention, which models and selectively propagates information between different physical fields, and (iii) axial attentions, which factorize full spatiotemporal self-attention along individual spatial and temporal axes to reduce computational burden while retaining expressivity. We pretrain multiple model variants on a diverse collection of heterogeneous PDE datasets and evaluate transfer to a range of downstream prediction tasks. Using both full-model fine-tuning and parameter-efficient low-rank adapters, MORPH outperforms models trained from scratch. Across extensive evaluations, MORPH matches or surpasses strong baselines and recent state-of-the-art models. Collectively, these capabilities present a flexible and powerful backbone for learning from the heterogeneous and multimodal nature of scientific observations, charting a path toward scalable and data-efficient scientific machine learning. The source code, datasets, and models are publicly available at https://github.com/lanl/MORPH.

中文标题/摘要

标题：MORPH：任意数据模态的偏微分方程基础模型

我们介绍了MORPH，一种模态无关的自回归偏微分方程（PDE）基础模型。MORPH基于卷积视觉变换器骨干网络，能够无缝处理不同数据模态（1D-3D）和不同分辨率的异质时空数据集，以及具有混合标量和矢量分量的多个物理场。该架构结合了(i)分量卷积，联合处理标量和矢量通道以捕捉局部交互，(ii)跨物理场交叉注意力，建模并选择性地传播不同物理场之间的信息，(iii)轴向注意力，沿个体空间和时间轴分解全时空自注意力，以减少计算负担同时保留表达能力。我们使用多样化的异质PDE数据集对多个模型变体进行预训练，并评估其在一系列下游预测任务中的迁移性能。使用全模型微调和参数高效低秩适配器，MORPH优于从头训练的模型。在广泛的评估中，MORPH匹配或超越了强大的基线和最近的先进模型。这些能力共同展示了学习科学观测的异构和多模态性质的灵活而强大的基础架构，为可扩展和数据高效的科学机器学习铺平了道路。源代码、数据集和模型可在 https://github.com/lanl/MORPH 公开获取。
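To see why factorizing spatiotemporal self-attention along individual axes reduces the computational burden, the sketch below counts query-key pairs for a T x H x W token grid. It is a back-of-the-envelope estimate under simplifying assumptions (one head, no projections or convolutions), not a description of MORPH's actual implementation; the grid size is a made-up example.

```python
def full_attention_pairs(t: int, h: int, w: int) -> int:
    """Every token attends to every token in the T*H*W grid."""
    n = t * h * w
    return n * n


def axial_attention_pairs(t: int, h: int, w: int) -> int:
    """Each token attends only along its time, height, and width axes, one axis at a time."""
    n = t * h * w
    return n * (t + h + w)


t, h, w = 8, 32, 32  # hypothetical grid: 8 time steps of 32x32 patches
full = full_attention_pairs(t, h, w)
axial = axial_attention_pairs(t, h, w)
print(full, axial, round(full / axial, 1))  # 67108864 589824 113.8
```

The saving grows with grid size, since the full cost is quadratic in the number of tokens while the axial cost scales with the sum of the axis lengths per token.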

JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion

Authors: Anthony Chen, Naomi Ken Korem, Tavi Halperin, Matan Ben Yosef, Urska Jelercic, Ofir Bibi, Or Patashnik, Daniel Cohen-Or

First: 2026-01-29T18:57:13+00:00 · Latest: 2026-01-29T18:57:13+00:00

Comments: Project webpage available at https://justdubit.github.io

Abs · PDF · Code1 · Code2 · Project1

Abstract

Audio-Visual Foundation Models, which are pretrained to jointly generate sound and visual content, have recently shown an unprecedented ability to model multi-modal generation and editing, opening new opportunities for downstream tasks. Among these tasks, video dubbing could greatly benefit from such priors, yet most existing solutions still rely on complex, task-specific pipelines that struggle in real-world settings. In this work, we introduce a single-model approach that adapts a foundational audio-video diffusion model for video-to-video dubbing via a lightweight LoRA. The LoRA enables the model to condition on an input audio-video while jointly generating translated audio and synchronized facial motion. To train this LoRA, we leverage the generative model itself to synthesize paired multilingual videos of the same speaker. Specifically, we generate multilingual videos with language switches within a single clip, and then inpaint the face and audio in each half to match the language of the other half. By leveraging the rich generative prior of the audio-visual model, our approach preserves speaker identity and lip synchronization while remaining robust to complex motion and real-world dynamics. We demonstrate that our approach produces high-quality dubbed videos with improved visual fidelity, lip synchronization, and robustness compared to existing dubbing pipelines.

中文标题/摘要

标题：JUST-DUB-IT：联合音视频扩散的视频配音

音视频基础模型，预训练以联合生成声音和视觉内容，最近展示了前所未有的多模态生成和编辑能力，为下游任务开辟了新的机会。在这些任务中，配音可以从这些先验知识中获益良多，但大多数现有解决方案仍然依赖于复杂且特定于任务的管道，在现实世界环境中表现不佳。在本文中，我们介绍了一种单模型方法，通过轻量级LoRA将基础音视频扩散模型适应于视频到视频的配音。LoRA使模型能够根据输入的音视频同时生成翻译后的音频和同步的面部动作。为了训练这种LoRA，我们利用生成模型本身合成了同一说话人的配对多语言视频。具体来说，我们生成了单个片段内的多语言视频，并在每个半部分中填充面部和音频，使其与另一半的语言匹配。通过利用音视频模型丰富的生成先验，我们的方法在保持说话人身份和唇部同步的同时，对复杂运动和现实世界动态具有鲁棒性。我们证明，与现有配音管道相比，我们的方法生成的配音视频具有更高的视觉保真度、唇部同步和鲁棒性。

Summary / 总结

This work introduces JUST-DUB-IT, a single-model approach for video dubbing using a joint audio-visual diffusion model. The model is adapted with a lightweight LoRA to condition on input audio-video while generating synchronized translated audio and facial motion. By leveraging the generative model to synthesize paired multilingual videos, the approach preserves speaker identity and lip synchronization, showing improved visual fidelity and robustness compared to existing dubbing methods.

该研究提出了JUST-DUB-IT，通过使用轻量级LoRA将音频-视觉扩散模型适应于视频配音。该模型在输入音频-视频的基础上生成同步的翻译音频和面部动作。通过在单个片段内切换语言并生成多语言视频进行训练，该方法能够保持说话人身份和唇部同步；与现有配音管道相比，其生成的配音视频具有更好的视觉保真度和鲁棒性。

StepShield: When, Not Whether to Intervene on Rogue Agents

Authors: Gloria Felicia, Michael Eniolade, Jinfeng He, Zitha Sasindran, Hemant Kumar, Milan Hussain Angati, Sandeep Bandarupalli

First: 2026-01-29T18:55:46+00:00 · Latest: 2026-01-29T18:55:46+00:00

Comments: 16 pages, 2 figures, 14 tables

Abs · PDF · Code1 · Code2

Abstract

Existing agent safety benchmarks report binary accuracy, conflating early intervention with post-mortem analysis. A detector that flags a violation at step 8 enables intervention; one that reports it at step 48 provides only forensic value. This distinction is critical, yet current benchmarks cannot measure it. We introduce StepShield, the first benchmark to evaluate when violations are detected, not just whether. StepShield contains 9,213 code agent trajectories, including 1,278 meticulously annotated training pairs and a 7,935-trajectory test set with a realistic 8.1% rogue rate. Rogue behaviors are grounded in real-world security incidents across six categories. We propose three novel temporal metrics: Early Intervention Rate (EIR), Intervention Gap, and Tokens Saved. Surprisingly, our evaluation reveals that an LLM-based judge achieves 59% EIR while a static analyzer achieves only 26%, a 2.3x performance gap that is entirely invisible to standard accuracy metrics. We further show that early detection has direct economic benefits: our cascaded HybridGuard detector reduces monitoring costs by 75% and projects to $108M in cumulative savings over five years at enterprise scale. By shifting the focus of evaluation from whether to when, StepShield provides a new foundation for building safer and more economically viable AI agents. The code and data are released under an Apache 2.0 license.

中文标题/摘要

标题：StepShield：何时干预，而非是否干预违规代理

现有的代理安全性基准报告二元准确性，混淆了早期干预与事后分析。在第8步检测到违规行为时可以进行干预，而在第48步报告则仅具有事后取证价值。这种区别至关重要，但当前的基准无法衡量这一点。我们引入了StepShield，这是第一个评估检测到违规行为的时间点，而不仅仅是是否检测到的基准。StepShield包含9,213个代码代理轨迹，包括1,278个精心标注的训练对和一个包含7,935个轨迹的测试集，其违规率为贴近现实的8.1%。违规行为基于六类真实世界的安全事件。我们提出了三个新的时间度量标准：早期干预率（EIR）、干预差距和节省的token数。令人惊讶的是，我们的评估表明，基于LLM的评判器实现了59%的EIR，而静态分析器仅为26%，标准准确性指标完全无法看到这种2.3倍的性能差距。我们进一步表明，早期检测具有直接的经济效益：我们的级联HybridGuard检测器将监控成本降低了75%，并在企业规模上预计在未来五年内累计节省1.08亿美元。通过将评估的重点从是否转移到何时，StepShield为构建更安全且更具经济效益的AI代理提供了新的基础。代码和数据在Apache 2.0许可证下发布。

Summary / 总结

StepShield evaluates the timing of violation detection in agent safety, addressing the limitations of existing benchmarks that only measure binary accuracy. It introduces three novel metrics: Early Intervention Rate (EIR), Intervention Gap, and Tokens Saved. The study finds that an LLM-based judge has a 59% EIR, outperforming a static analyzer by 2.3x, which is not reflected in standard accuracy metrics. Early detection also leads to significant economic benefits, with a cascaded HybridGuard detector reducing monitoring costs by 75% and projecting to $108M in savings over five years at scale. By focusing on when interventions occur, StepShield provides a new framework for safer and more cost-effective AI agents.

StepShield 评估检测违规的时间点，而不是仅仅衡量干预的二元准确性。它引入了三个新的时间度量标准：早期干预率（EIR）、干预间隔和节省的标记数。评估结果显示，基于语言模型的评判器比静态分析器更早地检测到违规行为，EIR 达到 59%，而静态分析器仅为 26%，相差 2.3 倍。早期检测还带来了显著的成本节约，级联的 HybridGuard 检测器将监控成本降低了 75%，并在企业规模下预计五年累计可节省 1.08 亿美元。
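The abstract names three temporal metrics but does not give their formulas, so the sketch below shows one plausible reading rather than StepShield's definitions. The `RogueTrajectory` fields, the `tolerance` parameter, and all three formulas are assumptions: a detection counts as early if the flag arrives no later than the violation step plus a tolerance, the gap is the delay between violation and flag, and tokens saved are those after the flagged step.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RogueTrajectory:
    violation_step: int          # first step at which the violation occurs (1-indexed)
    flagged_step: Optional[int]  # step at which the detector raised a flag, None if missed
    tokens_per_step: list[int]   # tokens consumed at each step (index 0 is step 1)


def early_intervention_rate(trajs, tolerance=0):
    """Share of rogue trajectories flagged by violation_step + tolerance (assumed definition)."""
    early = sum(
        1 for t in trajs
        if t.flagged_step is not None and t.flagged_step <= t.violation_step + tolerance
    )
    return early / len(trajs)


def mean_intervention_gap(trajs):
    """Mean delay in steps between violation and flag, over detected trajectories (assumed)."""
    gaps = [t.flagged_step - t.violation_step for t in trajs if t.flagged_step is not None]
    return sum(gaps) / len(gaps)


def tokens_saved(traj):
    """Tokens not spent if the run is stopped right after the flagged step (assumed)."""
    if traj.flagged_step is None:
        return 0
    return sum(traj.tokens_per_step[traj.flagged_step:])


trajs = [
    RogueTrajectory(violation_step=8, flagged_step=8, tokens_per_step=[100] * 48),   # caught early
    RogueTrajectory(violation_step=8, flagged_step=48, tokens_per_step=[100] * 48),  # forensic only
    RogueTrajectory(violation_step=5, flagged_step=None, tokens_per_step=[100] * 10),  # missed
]
print(early_intervention_rate(trajs), mean_intervention_gap(trajs), tokens_saved(trajs[0]))
print(round(59 / 26, 1))  # ratio of the two reported EIR values -> 2.3
```

Under binary accuracy the first two detectors look identical, since both eventually flag the violation; only the timing-aware metrics separate the step-8 flag from the step-48 one.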

PI-Light: Physics-Inspired Diffusion for Full-Image Relighting

Authors: Zhexin Liang, Zhaoxi Chen, Yongwei Chen, Tianyi Wei, Tengfei Wang, Xingang Pan

Venue: ICLR 2026

First: 2026-01-29T18:55:36+00:00 · Latest: 2026-01-29T18:55:36+00:00

Comments: Accepted at ICLR 2026

Abs · PDF · Code1 · Code2

Abstract

Full-image relighting remains a challenging problem due to the difficulty of collecting large-scale structured paired data, the difficulty of maintaining physical plausibility, and the limited generalizability imposed by data-driven priors. Existing attempts to bridge the synthetic-to-real gap for full-scene relighting remain suboptimal. To tackle these challenges, we introduce Physics-Inspired diffusion for full-image reLight ($π$-Light, or PI-Light), a two-stage framework that leverages physics-inspired diffusion models. Our design incorporates (i) batch-aware attention, which improves the consistency of intrinsic predictions across a collection of images, (ii) a physics-guided neural rendering module that enforces physically plausible light transport, (iii) physics-inspired losses that regularize training dynamics toward a physically meaningful landscape, thereby enhancing generalizability to real-world image editing, and (iv) a carefully curated dataset of diverse objects and scenes captured under controlled lighting conditions. Together, these components enable efficient finetuning of pretrained diffusion models while also providing a solid benchmark for downstream evaluation. Experiments demonstrate that $π$-Light synthesizes specular highlights and diffuse reflections across a wide variety of materials, achieving superior generalization to real-world scenes compared with prior approaches.

中文标题/摘要

标题：PI-Light：基于物理的扩散模型全图像重新照明

全图像重新照明仍然是一个具有挑战性的问题，原因在于难以收集大规模结构化的配对数据、保持物理合理性困难以及数据驱动先验带来的有限泛化能力。现有的尝试在全场景重新照明的合成到现实差距方面仍然不尽如人意。为了解决这些挑战，我们提出了基于物理的扩散模型全图像重新照明（$π$-Light，或PI-Light），这是一种两阶段框架。我们的设计包括（i）批次感知注意力，提高了图像集合中固有预测的一致性；（ii）一个基于物理的神经渲染模块，确保物理合理的光传输；（iii）基于物理的损失，规范训练动力学向物理有意义的景观发展，从而增强对真实世界图像编辑的泛化能力；（iv）一个精心策划的在受控光照条件下捕捉的多样物体和场景的数据集。这些组件共同使预训练的扩散模型的微调变得高效，同时也为下游评估提供了坚实基准。实验表明，$π$-Light 能够在各种材料上合成高光和漫反射，相比先前的方法在真实世界场景中具有更好的泛化能力。

Summary / 总结

PI-Light addresses the challenges of full-image relighting by introducing a two-stage framework that uses physics-inspired diffusion models. It includes batch-aware attention, a physics-guided neural rendering module, physics-inspired losses, and a curated dataset to improve consistency, enforce physical plausibility, and enhance generalizability. Experiments show that PI-Light synthesizes specular highlights and diffuse reflections across a wide variety of materials and generalizes to real-world scenes better than prior approaches.

该研究提出了PI-Light，一种用于全图像重新照明的两阶段框架，解决了数据稀缺性和物理合理性的问题。它使用了基于物理的扩散模型、批次感知注意力、物理引导的神经渲染模块以及基于物理的损失函数，以增强泛化能力。实验表明，PI-Light 在各种材料上生成高光和漫反射方面优于先前的方法，并且在真实场景中的泛化能力更强。

Pay for Hints, Not Answers: LLM Shepherding for Cost-Efficient Inference

Authors: Ziming Dong, Hardik Sharma, Evan O'Toole, Jaya Prakash Champati, Kui Wu

First: 2026-01-29T18:52:54+00:00 · Latest: 2026-01-29T18:52:54+00:00

Abs · PDF · Code1 · Code2

Abstract

Large Language Models (LLMs) deliver state-of-the-art performance on complex reasoning tasks, but their inference costs limit deployment at scale. Small Language Models (SLMs) offer dramatic cost savings yet lag substantially in accuracy. Existing approaches - routing and cascading - treat the LLM as an all-or-nothing resource: either the query bypasses the LLM entirely, or the LLM generates a complete response at full cost. We introduce LLM Shepherding, a framework that requests only a short prefix (a hint) from the LLM and provides it to the SLM. This simple mechanism is surprisingly effective for math and coding tasks: even hints comprising 10-30% of the full LLM response improve SLM accuracy significantly. Shepherding generalizes both routing and cascading, and it achieves lower cost under oracle decision-making. We develop a two-stage predictor that jointly determines whether a hint is needed and how many tokens to request. On the widely used mathematical reasoning (GSM8K, CNK12) and code generation (HumanEval, MBPP) benchmarks, Shepherding reduces costs by 42-94% relative to LLM-only inference. Compared to state-of-the-art routing and cascading baselines, Shepherding delivers up to 2.8x cost reduction while matching accuracy. To our knowledge, this is the first work to exploit token-level budget control for SLM-LLM collaboration.

中文标题/摘要

标题：付费获取提示，而非答案：LLM 管理以实现成本效益推理

大型语言模型（LLMs）在复杂推理任务上提供最先进的性能，但其推理成本限制了大规模部署。小型语言模型（SLMs）提供显著的成本节省，但在准确性上落后很多。现有方法——路由和级联——将LLM视为全有或全无的资源：查询要么完全绕过LLM，要么LLM以全额成本生成完整响应。我们引入了LLM 管理框架，该框架仅请求LLM提供一个短前缀（提示），并将其提供给SLM。这种简单的机制在数学和编程任务中表现出惊人的效果：即使提示仅占完整LLM响应的10-30%，也能显著提高SLM的准确性。LLM 管理既涵盖了路由和级联，又在最优决策下实现了更低的成本。我们开发了一种两阶段预测器，以共同确定是否需要提示以及请求多少令牌。在广泛使用的数学推理（GSM8K，CNK12）和代码生成（HumanEval，MBPP）基准测试中，LLM 管理将成本降低了42-94%。与最先进的路由和级联基线相比，LLM 管理在成本上最多可减少2.8倍，同时保持相同的准确性。据我们所知，这是首次利用令牌级预算控制实现SLM-LLM协作的工作。

Summary / 总结

The paper addresses the high cost of using Large Language Models (LLMs) for complex reasoning tasks and introduces LLM Shepherding, a method that requests only a short prefix (a hint) from the LLM and provides it to a Small Language Model (SLM). Hints comprising only 10-30% of the full LLM response significantly improve SLM accuracy, and Shepherding reduces costs by 42-94% relative to LLM-only inference. A two-stage predictor determines whether a hint is needed and how many tokens to request, achieving up to 2.8x cost reduction over routing and cascading baselines while matching accuracy.

论文通过引入LLM牧羊人方法，仅从大型语言模型（LLM）请求一个短前缀（提示），并将该提示提供给小型语言模型（SLM），以提高SLM的准确性并降低42-94%的成本，这在GSM8K、CNK12、HumanEval和MBPP等基准测试中得到了验证。该方法在成本降低方面优于现有的路由和级联方法，同时保持了准确性。据我们所知，这是首次利用令牌级预算控制实现SLM-LLM协作的工作。
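The cost argument in the Shepherding abstract above can be illustrated with a minimal sketch. It is not the paper's cost model: the per-token prices, the function name `shepherding_cost`, and the choice to ignore input-token costs are all assumptions made here for illustration. The point is only that a hint length of zero and a hint length equal to the full LLM response recover the two all-or-nothing extremes the abstract describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """Hypothetical per-output-token prices in arbitrary units (assumed values)."""
    llm: float = 10.0
    slm: float = 1.0


def shepherding_cost(hint_tokens: int,
                     llm_response_tokens: int,
                     slm_response_tokens: int,
                     prices: Prices = Prices()) -> float:
    """Cost of one query when the LLM emits only `hint_tokens` tokens.

    Assumption: the SLM runs whenever the LLM stops short of a full response,
    and is skipped when the LLM writes the whole answer.
    """
    if not 0 <= hint_tokens <= llm_response_tokens:
        raise ValueError("hint length must lie between 0 and the full LLM response")
    llm_cost = hint_tokens * prices.llm
    llm_answered_fully = hint_tokens == llm_response_tokens
    slm_cost = 0.0 if llm_answered_fully else slm_response_tokens * prices.slm
    return llm_cost + slm_cost


# Hypothetical query: the full LLM answer is 200 tokens, the SLM writes 250.
slm_only = shepherding_cost(0, 200, 250)      # no hint requested
hinted = shepherding_cost(40, 200, 250)       # 20% of the LLM response as a hint
llm_only = shepherding_cost(200, 200, 250)    # full LLM response
```

Under these assumed prices the hinted query sits between the two extremes; whether the hint is worth its cost depends on the accuracy gain, which this sketch does not model.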

World of Workflows: a Benchmark for Bringing World Models to Enterprise Systems

Authors: Lakshya Gupta, Litao Li, Yizhe Liu, Sriram Ganapathi Subramanian, Kaheer Suleman, Zichen Zhang, Haoye Lu, Sumit Pasupalak

First: 2026-01-29T18:51:54+00:00 · Latest: 2026-01-29T18:51:54+00:00

Abs · PDF · Code1 · Code2

Abstract

Frontier large language models (LLMs) excel as autonomous agents in many domains, yet they remain untested in complex enterprise systems where hidden workflows create cascading effects across interconnected databases. Existing enterprise benchmarks evaluate surface-level agentic task completion similar to general consumer benchmarks, ignoring true challenges in enterprises, such as limited observability, large database state, and hidden workflows with cascading side effects. We introduce World of Workflows (WoW), a realistic ServiceNow-based environment incorporating 4,000+ business rules and 55 active workflows embedded in the system, alongside WoW-bench, a benchmark of 234 tasks evaluating constrained agentic task completion and enterprise dynamics modeling capabilities. We reveal two major takeaways: (1) Frontier LLMs suffer from dynamics blindness, consistently failing to predict the invisible, cascading side effects of their actions, which leads to silent constraint violations, and (2) reliability in opaque systems requires grounded world modeling, where agents must mentally simulate hidden state transitions to bridge the observability gap when high-fidelity feedback is unavailable. For reliable and useful enterprise agents, WoW motivates a new paradigm to explicitly learn system dynamics. We release our GitHub for setting up and evaluating WoW.

中文标题/摘要

标题：工作流世界：将世界模型引入企业系统的基准

前沿的大语言模型（LLMs）在许多领域作为自主代理表现出色，但在复杂的企业系统中仍未经测试，这些系统中的隐藏工作流会在相互连接的数据库之间产生级联效应。现有的企业基准测试评估表面级别的代理任务完成，类似于一般的消费者基准测试，忽略了企业中的真正挑战，如有限的可观测性、庞大的数据库状态以及隐藏的工作流及其级联副作用。我们引入了“工作流世界”（WoW），这是一个基于 ServiceNow 的现实环境，包含 4000 多个业务规则和 55 个嵌入系统中的活跃工作流，以及 WoW-bench，这是一个包含 234 个任务的基准，评估受限的代理任务完成能力和企业动态建模能力。我们揭示了两个主要发现：（1）前沿的 LLMs 存在动态盲点，一致地未能预测其行为的隐形级联副作用，导致无声的约束违规；（2）在不透明系统中的可靠性需要有依据的世界建模，代理必须在高保真反馈不可用时在心中模拟隐藏状态的转移，以弥合可观测性差距。为了实现可靠且有用的企业代理，WoW 促使一种新的范式，即明确学习系统动力学。我们发布了用于设置和评估 WoW 的 GitHub 仓库。

Summary / 总结

The research introduces World of Workflows (WoW), a benchmark for evaluating large language models (LLMs) in complex enterprise systems, which are characterized by hidden workflows and cascading effects. WoW includes a realistic ServiceNow environment with 4,000+ business rules and 55 active workflows, and WoW-bench, a benchmark of 234 tasks. Key findings show that LLMs struggle with predicting cascading side effects, leading to constraint violations, and that reliable enterprise agents need to model the hidden state transitions to address the observability gap. This work highlights the need for LLMs to learn system dynamics explicitly for practical enterprise applications.

研究旨在测试前沿大型语言模型（LLMs）在复杂企业系统中的能力，这些系统中普遍存在隐藏的工作流和连锁效应。研究引入了World of Workflows (WoW)，一个基于ServiceNow的环境，包含4,000多个业务规则和55个活跃的工作流，并提出了WoW-bench，一个包含234个任务的基准。关键发现表明，LLMs在预测连锁效应方面存在困难，导致无声的约束违规，并强调了在不透明系统中进行可靠企业代理所需的世界建模的重要性，以弥补可观测性差距。

SWE-Replay: Efficient Test-Time Scaling for Software Engineering Agents

Authors: Yifeng Ding, Lingming Zhang

First: 2026-01-29T18:50:29+00:00 · Latest: 2026-01-29T18:50:29+00:00

Abs · PDF · Code1 · Code2

Abstract

Test-time scaling has been widely adopted to enhance the capabilities of Large Language Model (LLM) agents in software engineering (SWE) tasks. However, the standard approach of repeatedly sampling trajectories from scratch is computationally expensive. While recent methods have attempted to mitigate costs using specialized value agents, they can suffer from model miscalibration and fail to generalize to modern agents that synthesize custom bash scripts as tools. In this paper, we introduce SWE-Replay, the first efficient and generalizable test-time scaling technique for modern agents without reliance on potentially noisy value estimates. SWE-Replay optimizes the scaling process by recycling trajectories from prior trials, dynamically choosing to either explore from scratch or exploit archived experience by branching at critical intermediate steps. This selection of intermediate steps is driven by the potential and reasoning significance of repository exploration, rather than external LLM-based quality estimates. Our evaluation shows that, on SWE-Bench Verified, SWE-Replay consistently outperforms naive scaling, reducing costs by up to 17.4% while maintaining or even improving performance by up to 3.8%. Further evaluation on SWE-Bench Pro and Multilingual validates the generalizability of SWE-Replay, establishing it as a robust foundation for efficient test-time scaling of software engineering agents.

中文标题/摘要

标题：SWE-Replay: 软件工程代理的高效测试时扩展

测试时扩展已被广泛采用以增强大型语言模型（LLM）代理在软件工程（SWE）任务中的能力。然而，标准方法从头开始重复采样轨迹在计算上非常昂贵。虽然最近的方法试图通过使用专门的价值代理来减轻成本，但它们可能会遭受模型校准不当的问题，并且无法泛化到现代代理，这些代理会合成自定义的bash脚本作为工具。在本文中，我们介绍了SWE-Replay，这是第一个无需依赖可能噪声价值估计的现代代理的高效且可泛化的测试时扩展技术。SWE-Replay通过回收先前试验中的轨迹来优化扩展过程，动态选择从头探索或利用存档经验在关键中间步骤进行分支。这种中间步骤的选择由仓库探索的潜力和推理意义驱动，而不是外部LLM的质量估计。我们的评估表明，在SWE-Bench Verified上，SWE-Replay始终优于简单的扩展，成本最多可降低17.4%，同时保持或甚至提高性能最多3.8%。进一步在SWE-Bench Pro和多语言上的评估验证了SWE-Replay的泛化能力，确立了其作为软件工程代理高效测试时扩展稳健基础的地位。

Summary / 总结

The paper addresses the computational inefficiency of repeatedly sampling trajectories from scratch for test-time scaling in software engineering tasks. It introduces SWE-Replay, a technique that reuses trajectories from previous trials and selectively explores or exploits archived experience. SWE-Replay improves performance by up to 3.8% and reduces costs by up to 17.4% on SWE-Bench Verified, and its generalizability is further validated on SWE-Bench Pro and Multilingual.

论文介绍了SWE-Replay，这是一种用于软件工程代理的高效测试时扩展技术，避免了从头开始重复采样轨迹的高计算成本。与使用价值代理的先前方法不同，SWE-Replay 会从之前的试验中回收轨迹，并动态决定是在关键步骤探索还是利用存档的经验。这种方法由代码库探索的重要性驱动，可以在SWE-Bench Verified上将成本降低高达17.4%，同时保持或提高性能高达3.8%。该技术在不同基准上的进一步验证也证明了其在软件工程代理的高效测试时扩展中的稳健性。

The Patient is not a Moving Document: A World Model Training Paradigm for Longitudinal EHR

Authors: Irsyad Adam, Zekai Chen, David Laprade, Shaun Porwal, David Laub, Erik Reinertsen, Arda Pekis, Kevin Brown

First: 2026-01-29T18:49:37+00:00 · Latest: 2026-01-29T18:49:37+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Large language models (LLMs) trained with next-word-prediction have achieved success as clinical foundation models. Representations from these language backbones yield strong linear probe performance across biomedical tasks, suggesting that patient semantics emerge from next-token prediction at scale. However, this paradigm treats patients as a document to be summarized rather than a dynamical system to be simulated; a patient's trajectory emerges from their state evolving under interventions and time, requiring models that simulate dynamics rather than predict tokens. To address this, we introduce SMB-Structure, a world model for structured EHR that grounds a joint-embedding prediction architecture (JEPA) with next-token prediction (SFT). SFT grounds our model to reconstruct future patient states in token space, while JEPA predicts those futures in latent space from the initial patient representation alone, forcing trajectory dynamics to be encoded before the next state is observed. We validate across two large-scale cohorts: Memorial Sloan Kettering (23,319 oncology patients; 323,000+ patient-years) and INSPECT (19,402 pulmonary embolism patients). Using a linear probe evaluated at multiple points along the disease trajectory, we demonstrate that our training paradigm learns embeddings that capture disease dynamics not recoverable by autoregressive baselines, enabling SMB-Structure to achieve competitive performance on complex tasks characterized by high patient heterogeneity. Model weights are available at https://huggingface.co/standardmodelbio/SMB-v1-1.7B-Structure.

中文标题/摘要

标题：患者不是移动的文档：一种用于纵向EHR的世界模型训练范式

大型语言模型（LLMs）通过下一个词预测训练，在临床基础模型方面取得了成功。这些语言骨干的表示在生物医学任务中表现出强大的线性探针性能，表明患者语义是从大规模的下一个词预测中涌现出来的。然而，这种范式将患者视为需要总结的文档，而不是需要模拟的动力系统；患者的轨迹是从其状态在干预和时间作用下演变而来的，需要能够模拟动力学而不是预测词的模型。为了解决这个问题，我们引入了SMB-Structure，这是一种结构化EHR的世界模型，将联合嵌入预测架构（JEPA）与下一个词预测（SFT）相结合。SFT使我们的模型能够重建患者状态在词空间中的未来状态，而JEPA则仅从初始患者表示中预测这些未来状态在潜在空间中，迫使轨迹动力学在观察到下一个状态之前被编码。我们在两个大规模队列中进行了验证：纪念斯隆凯特林（23,319名肿瘤患者；323,000多个患者年）和INSPECT（19,402名肺栓塞患者）。使用沿疾病轨迹多个点评估的线性探针，我们证明我们的训练范式学习到的嵌入捕捉到了自回归基线无法恢复的疾病动力学，使SMB-Structure能够在高患者异质性特征的复杂任务上实现竞争力的性能。模型权重可在https://huggingface.co/standardmodelbio/SMB-v1-1.7B-Structure 获取。

Summary / 总结

The paper introduces SMB-Structure, a world model for structured Electronic Health Records (EHR) that simulates patient dynamics rather than treating patients as static documents. It combines next-token prediction with a joint-embedding prediction architecture to capture disease trajectories. Experiments on two large cohorts show that SMB-Structure learns embeddings capturing disease dynamics that autoregressive baselines do not recover, and that it achieves competitive performance on complex tasks with high patient heterogeneity.

论文提出了SMB-Structure，这是一种用于结构化电子健康记录(EHR)的世界模型，能够模拟患者动态而非将其视为静态文档。该模型结合了下一个词预测和联合嵌入预测架构来捕捉疾病轨迹。在两个大型队列上的实验表明，SMB-Structure 学习到的嵌入能够捕捉自回归基线无法恢复的疾病动态，并在复杂、患者异质性高的任务上取得有竞争力的表现。

EditYourself: Audio-Driven Generation and Manipulation of Talking Head Videos with Diffusion Transformers

Authors: John Flynn, Wolfgang Paier, Dimitar Dinev, Sam Nhut Nguyen, Hayk Poghosyan, Manuel Toribio, Sandipan Banerjee, Guy Gafni

First: 2026-01-29T18:49:27+00:00 · Latest: 2026-01-29T18:49:27+00:00

Comments: Project page: https://edit-yourself.github.io/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Current generative video models excel at producing novel content from text and image prompts, but leave a critical gap in editing existing pre-recorded videos, where minor alterations to the spoken script require preserving motion, temporal coherence, speaker identity, and accurate lip synchronization. We introduce EditYourself, a DiT-based framework for audio-driven video-to-video (V2V) editing that enables transcript-based modification of talking head videos, including the seamless addition, removal, and retiming of visually spoken content. Building on a general-purpose video diffusion model, EditYourself augments its V2V capabilities with audio conditioning and region-aware, edit-focused training extensions. This enables precise lip synchronization and temporally coherent restructuring of existing performances via spatiotemporal inpainting, including the synthesis of realistic human motion in newly added segments, while maintaining visual fidelity and identity consistency over long durations. This work represents a foundational step toward generative video models as practical tools for professional video post-production.

中文标题/摘要

标题：EditYourself：基于音频的对话头视频生成与操控

当前的生成视频模型在从文本和图像提示生成新颖内容方面表现出色，但在编辑现有的预录制视频方面存在关键缺口，其中对讲话脚本的微小修改需要保留运动、时间连贯性、说话者身份和准确的唇部同步。我们介绍了EditYourself，一种基于DiT的音频驱动视频到视频（V2V）编辑框架，该框架能够基于脚本修改对话头视频，包括无缝添加、删除和调整视觉讲话内容的时间。基于通用视频扩散模型，EditYourself通过音频条件和区域感知、编辑导向的训练扩展增强了其V2V能力。这使得通过时空修补精确唇部同步和时间连贯地重新结构现有表演成为可能，包括在新添加的段落中合成现实的人体运动，同时保持视觉保真度和身份一致性。这项工作代表了生成视频模型作为专业视频后期制作实用工具的基础步骤。

Summary / 总结

EditYourself is a DiT-based framework designed for audio-driven video-to-video editing, enabling modifications to existing talking head videos such as adding, removing, or retiming spoken content while preserving lip synchronization and temporal coherence. The system uses a general-purpose video diffusion model and incorporates audio conditioning and region-aware training to maintain visual fidelity and speaker identity over extended durations, representing a significant step towards practical generative video tools for professional video post-production.

EditYourself 是一个基于 DiT 的框架，用于音频驱动的视频到视频编辑，能够对对话头部视频进行添加、删除或调整口述内容等修改，同时保持唇部同步和时间连贯性。该系统利用通用的视频扩散模型，并结合音频条件和区域感知的训练，以保持长时间内的视觉保真度和说话者身份的一致性。

Learning Hamiltonian Flow Maps: Mean Flow Consistency for Large-Timestep Molecular Dynamics

Authors: Winfried Ripken, Michael Plainer, Gregor Lied, Thorben Frank, Oliver T. Unke, Stefan Chmiela, Frank Noé, Klaus Robert Müller

First: 2026-01-29T18:47:46+00:00 · Latest: 2026-01-29T18:47:46+00:00

Abs · PDF · Code1 · Code2

Abstract

Simulating the long-time evolution of Hamiltonian systems is limited by the small timesteps required for stable numerical integration. To overcome this constraint, we introduce a framework to learn Hamiltonian Flow Maps by predicting the mean phase-space evolution over a chosen time span $Δt$, enabling stable large-timestep updates far beyond the stability limits of classical integrators. To this end, we impose a Mean Flow consistency condition for time-averaged Hamiltonian dynamics. Unlike prior approaches, this allows training on independent phase-space samples without access to future states, avoiding expensive trajectory generation. Validated across diverse Hamiltonian systems, our method in particular improves upon molecular dynamics simulations using machine-learned force fields (MLFF). Our models maintain comparable training and inference cost, but support significantly larger integration timesteps while trained directly on widely-available trajectory-free MLFF datasets.

中文标题/摘要

标题：学习哈密顿流映射：大时间步长分子动力学的平均流一致性

模拟哈密顿系统长时间演化受限于所需的较小时间步长以确保数值积分的稳定性。为克服这一限制，我们提出了一种框架，通过预测选定时间跨度$Δt$内的平均相空间演化来学习哈密顿流映射，从而实现远超经典积分器稳定限制的大时间步长更新。为此，我们对时间平均哈密顿动力学施加了平均流一致性条件。与先前的方法不同，这种方法允许在无需访问未来状态的情况下对独立的相空间样本进行训练，从而避免了昂贵的轨迹生成。我们的方法在多种哈密顿系统中得到了验证，特别是在使用机器学习力场（MLFF）改进分子动力学模拟方面表现尤为突出。我们的模型保持了相似的训练和推理成本，但在训练时可以直接使用广泛可用的无轨迹MLFF数据集，从而支持显著更大的积分时间步长。

Summary / 总结

The research addresses the small timesteps that stable numerical integration requires, which limit long-time simulation of Hamiltonian systems. It proposes a framework to learn Hamiltonian Flow Maps by predicting the mean phase-space evolution over a chosen time span, which enables stable large-timestep updates. The method imposes a Mean Flow consistency condition and can be trained on independent phase-space samples without future states, avoiding expensive trajectory generation. Experimental results show that this approach improves molecular dynamics simulations using machine-learned force fields, supporting larger integration timesteps with comparable training and inference costs.

研究解决了哈密顿系统长时间演化模拟受限于小时间步长的难题。提出了一种框架，通过预测选定时间跨度内的平均相空间演化来学习哈密顿流映射（Hamiltonian Flow Maps），从而实现稳定的大时间步长更新。该方法施加了平均流一致性条件，并且可以在不依赖未来状态的情况下对独立的相空间样本进行训练，避免了昂贵的轨迹生成。实验结果表明，这种方法在使用机器学习力场的分子动力学模拟中表现出色，支持更大的积分时间步长，同时训练和推理成本相似。
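The stability limit mentioned in the Hamiltonian Flow Maps abstract above can be seen on a toy system. The sketch below uses a unit-mass, unit-frequency harmonic oscillator, for which velocity Verlet is stable only when the timestep is below 2. The function `mean_flow` is an analytic stand-in, assumed here, for what a learned model would predict (the time-averaged phase-space velocity over a span dt); it is not the paper's network, and all names are hypothetical.

```python
import math


def verlet_step(q, p, dt):
    """One velocity-Verlet step for H = p**2/2 + q**2/2 (force is -q)."""
    p_half = p - 0.5 * dt * q
    q_new = q + dt * p_half
    p_new = p_half - 0.5 * dt * q_new
    return q_new, p_new


def mean_flow(q, p, dt):
    """Analytic time-averaged phase-space velocity over a span dt.

    Stand-in (assumption) for a learned mean-flow predictor.
    """
    q_end = q * math.cos(dt) + p * math.sin(dt)
    p_end = p * math.cos(dt) - q * math.sin(dt)
    return (q_end - q) / dt, (p_end - p) / dt


def flow_map_step(q, p, dt):
    """Large-timestep update: state plus dt times the predicted mean flow."""
    u_q, u_p = mean_flow(q, p, dt)
    return q + dt * u_q, p + dt * u_p


def energy(q, p):
    return 0.5 * (p * p + q * q)


def rollout(step, dt, n_steps, q=1.0, p=0.0):
    """Apply `step` repeatedly and return the final energy."""
    for _ in range(n_steps):
        q, p = step(q, p, dt)
    return energy(q, p)


verlet_small = rollout(verlet_step, 0.1, 20)    # inside the stability limit
verlet_large = rollout(verlet_step, 2.5, 20)    # beyond it: energy diverges
flow_large = rollout(flow_map_step, 2.5, 20)    # mean-flow update stays bounded
```

With an exact mean flow the update reproduces the true state at any dt, so energy stays at its initial value of 0.5; a learned model would only approximate this.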

SINA: A Circuit Schematic Image-to-Netlist Generator Using Artificial Intelligence

Authors: Saoud Aldowaish, Yashwanth Karumanchi, Kai-Chen Chiang, Soroosh Noorzad, Morteza Fayazi

First: 2026-01-29T18:41:52+00:00 · Latest: 2026-01-29T18:41:52+00:00

Abs · PDF · Code1 · Code2

Abstract

Current methods for converting circuit schematic images into machine-readable netlists struggle with component recognition and connectivity inference. In this paper, we present SINA, an open-source, fully automated circuit schematic image-to-netlist generator. SINA integrates deep learning for accurate component detection, Connected-Component Labeling (CCL) for precise connectivity extraction, and Optical Character Recognition (OCR) for component reference designator retrieval, while employing a Vision-Language Model (VLM) for reliable reference designator assignments. In our experiments, SINA achieves 96.47% overall netlist-generation accuracy, which is 2.72x higher than state-of-the-art approaches.

中文标题/摘要

标题：SINA：使用人工智能的电路原理图图像到网表生成器

当前将电路原理图图像转换为机器可读网表的方法在组件识别和连接推理方面存在困难。在本文中，我们介绍了SINA，这是一个开源的全自动电路原理图图像到网表生成器。SINA结合了深度学习进行准确的组件检测、连通组件标记（CCL）进行精确的连接提取以及光学字符识别（OCR）进行组件参考标识符检索，同时采用视觉语言模型（VLM）进行可靠的参考标识符分配。在我们的实验中，SINA的整体网表生成准确率为96.47%，比最先进的方法高出2.72倍。

Summary / 总结

SINA is an open-source automated circuit schematic image-to-netlist generator that uses deep learning for component detection, CCL for connectivity extraction, OCR for reference designator retrieval, and a VLM for reference designator assignments. It achieves 96.47% overall netlist-generation accuracy, which is 2.72 times higher than existing methods.

SINA 是一个开源工具，利用人工智能将电路原理图图像转换为机器可读的网表。它结合了深度学习进行组件检测、CCL 进行连接提取、OCR 进行参考标识符检索以及 VLM 进行可靠的参考标识符分配。实验结果显示，SINA 的整体网表生成准确率为 96.47%，比最先进的方法高出 2.72 倍。
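The SINA abstract above names Connected-Component Labeling (CCL) as its connectivity-extraction step. As a generic illustration of CCL, and not SINA's implementation, the sketch below labels 4-connected regions of foreground pixels in a small binary grid using breadth-first search. The grid `wires` and the function name are hypothetical; in a schematic, each labeled region could correspond to one electrically connected wire segment.

```python
from collections import deque


def label_components(grid):
    """Label 4-connected foreground regions; returns (labels, count).

    `grid` is a list of equal-length rows of 0/1 values. Background stays 0,
    and each connected region receives a distinct label 1, 2, 3, ...
    """
    rows, cols = len(grid), len(grid[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c] or labels[r][c]:
                continue  # background, or already labeled
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and grid[ny][nx] and not labels[ny][nx]:
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


# Hypothetical binarized wire mask with four separate nets.
wires = [
    [1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1],
]
labels, net_count = label_components(wires)
```

Pixels sharing a label belong to the same region, so two component terminals that touch the same label would be treated as connected.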

Physics Informed Reconstruction of Four-Dimensional Atmospheric Wind Fields Using Multi-UAS Swarm Observations in a Synthetic Turbulent Environment

Authors: Abdullah Tasim, Wei Sun

First: 2026-01-29T18:40:32+00:00 · Latest: 2026-01-29T18:40:32+00:00

Abs · PDF · Code1 · Code2

Abstract

Accurate reconstruction of atmospheric wind fields is essential for applications such as weather forecasting, hazard prediction, and wind energy assessment, yet conventional instruments leave spatio-temporal gaps within the lower atmospheric boundary layer. Unmanned aircraft systems (UAS) provide flexible in situ measurements, but individual platforms sample wind only along their flight trajectories, limiting full wind-field recovery. This study presents a framework for reconstructing four-dimensional atmospheric wind fields using measurements obtained from a coordinated UAS swarm. A synthetic turbulence environment and high-fidelity multirotor simulation are used to generate training and evaluation data. Local wind components are estimated from UAS dynamics using a bidirectional long short-term memory network (Bi-LSTM) and assimilated into a physics-informed neural network (PINN) to reconstruct a continuous wind field in space and time. For local wind estimation, the bidirectional LSTM achieves root-mean-square errors (RMSE) of 0.064 and 0.062 m/s for the north and east components in low-wind conditions, increasing to 0.122 to 0.129 m/s under moderate winds and 0.271 to 0.273 m/s in high-wind conditions, while the vertical component exhibits higher error, with RMSE values of 0.029 to 0.091 m/s. The physics-informed reconstruction recovers the dominant spatial and temporal structure of the wind field up to 1000 m altitude while preserving mean flow direction and vertical shear. Under moderate wind conditions, the reconstructed mean wind field achieves an overall RMSE between 0.118 and 0.154 m/s across evaluated UAS configurations, with the lowest error obtained using a five-UAS swarm. These results demonstrate that coordinated UAS measurements enable accurate and scalable four-dimensional wind-field reconstruction without dedicated wind sensors or fixed infrastructure.

中文标题/摘要

标题：基于多UAS群组观测在合成湍流环境中的四维大气风场物理信息重建

准确重建大气风场对于天气预报、灾害预测和风能评估等应用至关重要，但传统仪器在下边界层内存在时空空白。无人驾驶航空系统（UAS）提供了灵活的原位测量，但单个平台仅在其飞行轨迹上采样风，限制了风场的全面恢复。本研究提出了一种使用协调UAS群组观测数据重建四维大气风场的框架。使用合成湍流环境和高保真多旋翼模拟生成训练和评估数据。利用双向长短期记忆网络（Bi-LSTM）从UAS动力学估计局部风分量，并将其整合到物理信息神经网络（PINN）中，以重建空间和时间连续的风场。在低风速条件下，双向LSTM在北向和东向分量上的均方根误差（RMSE）分别为0.064和0.062米/秒，中等风速条件下增加到0.122至0.129米/秒，在高风速条件下增加到0.271至0.273米/秒，而垂直分量的误差较高，RMSE值为0.029至0.091米/秒。物理信息重建恢复了1000米高度以下风场的主要空间和时间结构，同时保持了平均流方向和垂直切变。在中等风速条件下，重建的平均风场在评估的UAS配置中总体RMSE在0.118至0.154米/秒之间，使用五UAS群组获得最低误差。这些结果表明，协调的UAS测量能够实现无需专用风速传感器或固定基础设施的四维风场准确且可扩展的重建。

Summary / 总结

This study aims to accurately reconstruct four-dimensional atmospheric wind fields using coordinated unmanned aircraft system (UAS) swarm observations in a synthetic turbulence environment. The framework employs a bidirectional long short-term memory network (Bi-LSTM) for local wind component estimation and a physics-informed neural network (PINN) for reconstructing the continuous wind field. The Bi-LSTM achieves RMSEs of 0.062 to 0.273 m/s for the north and east components from low- to high-wind conditions and 0.029 to 0.091 m/s for the vertical component, while the PINN recovers the wind field's dominant spatial and temporal structure up to 1000 m altitude. Under moderate wind conditions, the reconstructed mean wind field has an overall RMSE of 0.118 to 0.154 m/s across the evaluated UAS configurations, with the lowest error obtained using a five-UAS swarm.

本研究旨在通过协调的无人机集群在合成湍流环境中获取的测量数据，准确重建四维大气风场。研究采用双向长短期记忆网络（Bi-LSTM）进行局部风分量估计，并使用物理信息神经网络（PINN）进行风场重建。结果显示，Bi-LSTM在不同风速条件下，北向和东向风分量的均方根误差（RMSE）为0.062至0.273 m/s；在中等风速条件下，PINN重建的平均风场在各评估的无人机配置中总体RMSE为0.118至0.154 m/s，其中五架无人机集群的误差最低，有效捕捉了风场的空间和时间结构。

Boosting CVaR Policy Optimization with Quantile Gradients

Authors: Yudong Luo, Erick Delage

First: 2026-01-29T18:33:46+00:00 · Latest: 2026-01-29T18:33:46+00:00

Abs · PDF · Code1 · Code2

Abstract

Optimizing Conditional Value-at-Risk (CVaR) using policy gradient (a.k.a. CVaR-PG) suffers from significant sample inefficiency. This inefficiency stems from the fact that it focuses on tail-end performance and overlooks many sampled trajectories. We address this problem by augmenting CVaR with an expected quantile term. Quantile optimization admits a dynamic programming formulation that leverages all sampled data, thus improving sample efficiency. This does not alter the CVaR objective since CVaR corresponds to the expectation of the quantile over the tail. Empirical results in domains with verifiable risk-averse behavior show that our algorithm within the Markovian policy class substantially improves upon CVaR-PG and consistently outperforms other existing methods.

中文标题/摘要

标题：利用分位数梯度提升CVaR策略优化

使用策略梯度优化条件值-at-风险(CVaR)面临显著的样本效率问题。这一问题源于CVaR关注尾部性能而忽略了大量采样轨迹。我们通过将CVaR与期望分位数项结合来解决这一问题。分位数优化允许使用动态规划形式，利用所有采样数据，从而提高样本效率。这不会改变CVaR目标，因为CVaR对应于尾部的分位数期望。在具有可验证风险厌恶行为的领域中，我们的算法在马尔可夫策略类中显著优于CVaR-PG，并且始终优于其他现有方法。

Summary / 总结

The paper addresses the sample inefficiency in optimizing Conditional Value-at-Risk (CVaR) using policy gradients by incorporating an expected quantile term. This approach improves sample efficiency by utilizing all sampled data through a dynamic programming formulation, and it leaves the CVaR objective unchanged. The empirical results demonstrate that the proposed method, within the Markovian policy class, substantially improves upon CVaR-PG and consistently outperforms other existing methods in domains with verifiable risk-averse behavior.

论文通过引入期望分位数项来解决使用策略梯度优化条件值-at-风险(CVaR)时的样本低效问题，这种方法通过利用所有采样数据来提高样本效率，符合CVaR目标。实验结果表明，提出的算法在风险厌恶领域中显著优于CVaR-PG和其他现有方法。
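The CVaR abstract above relies on the identity that CVaR equals the expectation of the quantile over the tail. A small numerical check of that identity is sketched below, assuming the lower-tail convention for returns (smaller is worse), under which CVaR at level alpha is (1/alpha) times the integral of the quantile function from 0 to alpha. The sample data and function names are hypothetical, and this is not the paper's algorithm.

```python
import math


def empirical_quantile(sorted_x, u):
    """Smallest sample whose empirical CDF is at least u, for 0 < u <= 1."""
    idx = max(math.ceil(u * len(sorted_x)) - 1, 0)
    return sorted_x[idx]


def cvar_tail_mean(returns, alpha):
    """Lower-tail CVaR as the mean of the worst alpha-fraction of samples.

    Assumes alpha * len(returns) is a whole number of samples.
    """
    xs = sorted(returns)
    k = round(alpha * len(xs))
    return sum(xs[:k]) / k


def cvar_from_quantiles(returns, alpha, grid=10_000):
    """Midpoint-rule estimate of (1/alpha) * integral of q_u over (0, alpha]."""
    xs = sorted(returns)
    total = 0.0
    for j in range(grid):
        u = alpha * (j + 0.5) / grid
        total += empirical_quantile(xs, u)
    return total / grid


# Hypothetical episode returns; the two worst outcomes form the 20% tail.
returns = [-5, -1, 0, 2, 3, 4, 6, 7, 8, 10]
tail_mean = cvar_tail_mean(returns, alpha=0.2)
quantile_avg = cvar_from_quantiles(returns, alpha=0.2)
```

Both routes give the same value, which is why adding an expected quantile term over the tail can leave the CVaR objective itself unchanged.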

RefAny3D: 3D Asset-Referenced Diffusion Models for Image Generation

Authors: Hanzhuo Huang, Qingyang Bao, Zekai Gu, Zhongshuo Du, Cheng Lin, Yuan Liu, Sibei Yang

Venue: ICLR 2026

First: 2026-01-29T18:30:10+00:00 · Latest: 2026-01-29T18:30:10+00:00

Comments: ICLR 2026. Project page: https://judgementh.github.io/RefAny3D Codes: https://github.com/JudgementH/RefAny3D

Abs · PDF · Code1 · Code2 · Code3 · Project1

Abstract

In this paper, we propose a 3D asset-referenced diffusion model for image generation, exploring how to integrate 3D assets into image diffusion models. Existing reference-based image generation methods leverage large-scale pretrained diffusion models and demonstrate strong capability in generating diverse images conditioned on a single reference image. However, these methods are limited to single-image references and cannot leverage 3D assets, constraining their practical versatility. To address this gap, we present a cross-domain diffusion model with dual-branch perception that leverages multi-view RGB images and point maps of 3D assets to jointly model their colors and canonical-space coordinates, achieving precise consistency between generated images and the 3D references. Our spatially aligned dual-branch generation architecture and domain-decoupled generation mechanism ensure the simultaneous generation of two spatially aligned but content-disentangled outputs, RGB images and point maps, linking 2D image attributes with 3D asset attributes. Experiments show that our approach effectively uses 3D assets as references to produce images consistent with the given assets, opening new possibilities for combining diffusion models with 3D content creation.

中文标题/摘要

标题：RefAny3D：基于3D资产的扩散模型图像生成

在本文中，我们提出了一种基于3D资产的扩散模型，探索如何将3D资产整合到图像扩散模型中。现有的基于参考的图像生成方法利用大规模预训练的扩散模型，并展示了在单张参考图像条件下生成多样化图像的强大能力。然而，这些方法仅限于单张图像参考，无法利用3D资产，限制了其实用的灵活性。为了解决这一差距，我们提出了一种跨域扩散模型，具有双分支感知机制，利用多视角RGB图像和3D资产的点图来联合建模它们的颜色和标准空间坐标，实现生成图像与3D参考之间的精确一致性。我们的空间对齐双分支生成架构和域解耦生成机制确保同时生成两个空间对齐但内容分离的输出，RGB图像和点图，将2D图像属性与3D资产属性联系起来。实验表明，我们的方法有效地利用3D资产作为参考，生成与给定资产一致的图像，为将扩散模型与3D内容创作相结合开辟了新的可能性。

Summary / 总结

This paper introduces RefAny3D, a 3D asset-referenced diffusion model for image generation. It addresses the limitation of existing methods that can only use single-image references and cannot incorporate 3D assets. The proposed model uses a dual-branch perception approach with multi-view RGB images and point maps of 3D assets to generate images that are precisely consistent with the 3D references. Experiments demonstrate that RefAny3D effectively leverages 3D assets as references to produce images that align with the given 3D content, enhancing the practical versatility of image generation models.

论文提出了一种基于3D资产的扩散模型RefAny3D，解决了现有方法依赖单一参考图像的局限性。该模型采用跨域扩散模型和双分支感知机制，利用多视角RGB图像和3D资产的点图来生成与给定资产一致的图像。模型确保了RGB图像和点图之间的空间对齐和内容分离，展示了有效利用3D资产作为图像生成的参考。

Learning Transient Convective Heat Transfer with Geometry Aware World Models

Authors: Onur T. Doganay, Alexander Klawonn, Martin Eigel, Hanno Gottschalk

First: 2026-01-29T18:24:24+00:00 · Latest: 2026-01-29T18:24:24+00:00

Comments: 36 pages, 18 figures, 2 tables

Abs · PDF · Code1 · Code2

Abstract

Partial differential equation (PDE) simulations are fundamental to engineering and physics but are often computationally prohibitive for real-time applications. While generative AI offers a promising avenue for surrogate modeling, standard video generation architectures lack the specific control and data compatibility required for physical simulations. This paper introduces a geometry aware world model architecture, derived from a video generation architecture (LongVideoGAN), designed to learn transient physics. We introduce two key architecture elements: (1) a twofold conditioning mechanism incorporating global physical parameters and local geometric masks, and (2) an architectural adaptation to support arbitrary channel dimensions, moving beyond standard RGB constraints. We evaluate this approach on a 2D transient computational fluid dynamics (CFD) problem involving convective heat transfer from buoyancy-driven flow coupled to a heat flow in a solid structure. We demonstrate that the conditioned model successfully reproduces complex temporal dynamics and spatial correlations of the training data. Furthermore, we assess the model's generalization capabilities on unseen geometric configurations, highlighting both its potential for controlled simulation synthesis and current limitations in spatial precision for out-of-distribution samples.

中文标题/摘要

标题：学习几何感知世界模型中的瞬态对流热传递

偏微分方程（PDE）模拟是工程和物理学的基础，但在实时应用中通常计算成本高昂。虽然生成式AI为替代建模提供了有希望的途径，但标准的视频生成架构缺乏物理模拟所需的特定控制和数据兼容性。本文介绍了一种几何感知世界模型架构，该架构源自视频生成架构（LongVideoGAN），旨在学习瞬态物理现象。我们引入了两个关键架构元素：（1）结合全局物理参数和局部几何掩码的双重条件机制，（2）架构适应以支持任意通道维度，超越了标准的RGB限制。我们通过一个涉及由浮力驱动的流动与固体结构中热流耦合的2D瞬态计算流体动力学（CFD）问题，评估了这种方法。我们证明，条件模型成功地再现了训练数据的复杂时空动态和空间相关性。此外，我们还评估了该模型在未见几何配置上的泛化能力，突显了其在受控模拟合成方面的潜力以及在分布外样本中空间精度的当前局限性。

Summary / 总结

This paper addresses the computational challenges of using partial differential equation simulations in real-time applications by proposing a geometry-aware world model architecture. The model, derived from LongVideoGAN, incorporates a twofold conditioning mechanism and supports arbitrary channel dimensions. Evaluated on a 2D transient CFD problem involving convective heat transfer, the model successfully reproduces complex temporal dynamics and spatial correlations, and demonstrates generalization capabilities on unseen geometric configurations, though it shows limitations in spatial precision for out-of-distribution samples.

本文通过提出一种几何感知的世界模型架构来解决使用偏微分方程（PDE）模拟在实时应用中的计算挑战。该模型基于LongVideoGAN，包含双重条件机制并支持任意通道维度。该方法在涉及对流热传递的2D瞬态CFD问题上进行了评估，展示了模型能够重现复杂的时空动态和空间相关性。此外，模型在未见过的几何配置上也表现出泛化能力，但在处理分布外样本的空间精度方面存在局限性。

Where Do the Joules Go? Diagnosing Inference Energy Consumption

Authors: Jae-Won Chung, Ruofan Wu, Jeff J. Ma, Mosharaf Chowdhury

First: 2026-01-29T18:16:45+00:00 · Latest: 2026-01-29T18:16:45+00:00

Comments: The ML.ENERGY Leaderboard v3.0 is open https://ml.energy/leaderboard

Abs · PDF · Code1 · Code2

Abstract

Energy is now a critical ML computing resource. While measuring energy consumption and observing trends is a valuable first step, accurately understanding and diagnosing why those differences occur is crucial for optimization. To that end, we begin by presenting a large-scale measurement study of inference time and energy across the generative AI landscape with 46 models, 7 tasks, and 1,858 different configurations on NVIDIA H100 and B200 GPUs. Our empirical findings span order-of-magnitude variations: LLM task type can lead to 25$\times$ energy differences, video generation sometimes consumes more than 100$\times$ the energy of images, and GPU utilization differences can result in 3--5$\times$ energy differences. Based on our observations, we present a framework for reasoning about the underlying mechanisms that govern time and energy consumption. The essence is that time and energy are determined by latent metrics like memory and utilization, which are in turn affected by various factors across the algorithm, software, and hardware layers. Our framework also extends directly to throughput per watt, a critical metric for power-constrained datacenters.

中文标题/摘要

标题：能量去向何方？诊断推理能耗消耗

能源现在是关键的ML计算资源。虽然测量能耗和观察趋势是一个有价值的初步步骤，但准确理解并诊断这些差异发生的原因对于优化至关重要。为此，我们首先通过在NVIDIA H100和B200 GPU上对46个模型、7个任务和1,858种不同配置进行大规模测量研究，展示了生成AI领域的推理时间和能耗情况。我们的实证发现涵盖了数量级的差异：LLM任务类型可能导致25倍的能耗差异，视频生成有时消耗超过100倍的图像能耗，而GPU利用率差异可能导致3到5倍的能耗差异。基于我们的观察，我们提出了一种关于时间与能耗消耗背后机制的推理框架。核心在于时间与能耗由潜在指标如内存和利用率决定，而这些指标又受到算法、软件和硬件各层因素的影响。我们的框架还直接扩展到每瓦吞吐量，这是受电力限制的数据中心的关键指标。

Summary / 总结

This study measures inference time and energy across the generative AI landscape, covering 46 models, 7 tasks, and 1,858 configurations on NVIDIA H100 and B200 GPUs. Key findings span order-of-magnitude variations: LLM task type can lead to 25-fold energy differences, video generation sometimes consumes more than 100 times the energy of image generation, and GPU utilization differences can result in 3- to 5-fold energy differences. The research presents a framework that links time and energy consumption to latent metrics such as memory and utilization, which are in turn influenced by factors across the algorithm, software, and hardware layers. The framework also extends to throughput per watt, a critical metric for power-constrained datacenters.

该研究旨在理解不同生成AI模型和任务在推理任务中的能耗差异。通过在NVIDIA H100和B200 GPU上测量46个模型和1,858种配置的能耗，研究揭示了显著的能耗差异，LLM任务的能耗最高可达到其他任务的25倍，而视频生成的能耗则超过图像生成的100倍。研究提出了一种框架来解释这些差异，重点在于内存和利用率等潜在指标，这些指标受算法、软件和硬件层面上多种因素的影响，并将此框架扩展到每瓦吞吐量，这是对功率受限数据中心至关重要的一个关键指标。
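
As a rough illustration of the throughput-per-watt metric mentioned in the abstract, the sketch below relates energy, average power, and throughput. The function names and all numbers are hypothetical and are not measurements from the paper; constant average power is assumed.

```python
def energy_joules(avg_power_watts: float, seconds: float) -> float:
    """Energy as average power times duration (constant power assumed)."""
    return avg_power_watts * seconds


def throughput_per_watt(tokens: int, seconds: float, avg_power_watts: float) -> float:
    """Tokens per second per watt, which is the same as tokens per joule."""
    return (tokens / seconds) / avg_power_watts


# Hypothetical run: 12,000 tokens generated in 20 s at 600 W average draw.
tokens, seconds, power = 12_000, 20.0, 600.0
total = energy_joules(power, seconds)
print("total energy (J):", total)
print("energy per token (J):", total / tokens)
print("tokens per joule:", throughput_per_watt(tokens, seconds, power))
```

Because tokens per second per watt reduces to tokens per joule, a configuration that lowers energy per token raises throughput per watt by the same factor.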

SiDGen: Structure-informed Diffusion for Generative modeling of Ligands for Proteins

Authors: Samyak Sanghvi, Nishant Ranjan, Tarak Karmakar

First: 2025-11-12T18:25:51+00:00 · Latest: 2026-01-29T17:59:16+00:00

Comments: 10 pages, 2 figures

Abs · PDF · Code1 · Code2

Abstract

Designing ligands that are both chemically valid and structurally compatible with protein binding pockets is a key bottleneck in computational drug discovery. Existing approaches either ignore structural context or rely on expensive, memory-intensive encoding that limits throughput and scalability. We present SiDGen (Structure-informed Diffusion Generator), a protein-conditioned diffusion framework that integrates masked SMILES generation with lightweight folding-derived features for pocket awareness. To balance expressivity with efficiency, SiDGen supports two conditioning pathways: a streamlined mode that pools coarse structural signals from protein embeddings and a full mode that injects localized pairwise biases for stronger coupling. A coarse-stride folding mechanism with nearest-neighbor upsampling alleviates the quadratic memory costs of pair tensors, enabling training on realistic sequence lengths. Learning stability is maintained through in-loop chemical validity checks and an invalidity penalty, while large-scale training efficiency is restored via selective compilation, dataloader tuning, and gradient accumulation. In automated benchmarks, SiDGen generates ligands with high validity, uniqueness, and novelty, while achieving competitive performance in docking-based evaluations and maintaining reasonable molecular properties. These results demonstrate that SiDGen can deliver scalable, pocket-aware molecular design, providing a practical route to conditional generation for high-throughput drug discovery.

中文标题/摘要

标题：SiDGen：结构导向的扩散模型用于蛋白质配体生成

设计既化学有效又与蛋白质结合口袋结构兼容的配体是计算药物发现中的一个关键瓶颈。现有方法要么忽略结构上下文，要么依赖于昂贵且内存密集的编码，这限制了吞吐量和可扩展性。我们提出了SiDGen（结构导向扩散生成器），这是一种基于蛋白质条件的扩散框架，结合了掩码SMILES生成和轻量级折叠衍生特征以增强口袋意识。为了平衡表达性和效率，SiDGen 支持两种条件路径：一种简化模式，从蛋白质嵌入中聚合粗略的结构信号；另一种完整模式，注入局部成对偏差以实现更强的耦合。通过粗步长折叠机制和最近邻上采样，缓解成对张量的二次内存成本，使模型能够处理现实长度的序列。通过循环内的化学有效性检查和无效性惩罚保持学习稳定性，通过选择性编译、数据加载器调优和梯度累积恢复大规模训练效率。在自动化基准测试中，SiDGen 生成的配体具有高有效性、独特性和新颖性，同时在对接评估中表现出竞争力，并保持合理的分子性质。这些结果表明，SiDGen 可以实现可扩展且具有口袋意识的分子设计，为高通量药物发现提供了一种实用的条件生成途径。

Summary / 总结

SiDGen is designed to generate chemically valid and structurally compatible ligands for proteins, addressing the limitations of existing approaches by integrating masked SMILES generation with lightweight folding-derived features. It supports two conditioning pathways and uses a coarse-stride folding mechanism to reduce memory costs, enabling efficient training on realistic sequence lengths. Experimental results show that SiDGen generates high-quality ligands with good uniqueness and novelty, and performs competitively in docking-based evaluations while maintaining reasonable molecular properties.

SiDGen 是一种基于蛋白质的扩散框架，结合了掩码 SMILES 生成和轻量级折叠衍生特征，以生成化学上有效的且与蛋白质结合口袋结构兼容的小分子。它支持两种调节路径以平衡表达性和效率，并使用粗粒度步长折叠机制来缓解内存成本。实验结果表明，SiDGen 生成的小分子具有高有效性和新颖性，并在对接评估中表现出竞争力，同时保持合理的分子性质。
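
The nearest-neighbor upsampling step described in the abstract can be pictured with a small sketch. This is only an illustration under assumptions: the function name, the list-of-lists representation, and the toy values are hypothetical, and the paper's actual folding mechanism is not reproduced here.

```python
def upsample_pair_nearest(coarse, stride, length):
    """Expand a coarse pair matrix to length x length by nearest-neighbor copy.

    Fine position (i, j) takes the value of coarse cell (i // stride, j // stride),
    so only about (length / stride) ** 2 pair values have to be computed.
    """
    return [[coarse[i // stride][j // stride] for j in range(length)]
            for i in range(length)]


# Toy 2 x 2 coarse pair features for a length-4 sequence with stride 2.
coarse = [[0.0, 1.0],
          [1.0, 2.0]]
fine = upsample_pair_nearest(coarse, stride=2, length=4)
for row in fine:
    print(row)
```

With stride s, the coarse tensor holds roughly 1/s² as many entries as the full pair tensor, which is where the memory saving in this sketch comes from.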

Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models

Authors: Wenxuan Huang, Yu Zeng, Qiuchen Wang, Zhen Fang, Shaosheng Cao, Zheng Chu, Qingyu Yin, Shuang Chen, Zhenfei Yin, Lin Chen, Zehui Chen, Yao Hu, Philip Torr, Feng Zhao, Wanli Ouyang

First: 2026-01-29T17:58:40+00:00 · Latest: 2026-01-29T17:58:40+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Multimodal large language models (MLLMs) have achieved remarkable success across a broad range of vision tasks. However, because they are constrained by the capacity of their internal world knowledge, prior work has proposed augmenting MLLMs with a "reasoning-then-tool-call" scheme that queries visual and textual search engines, obtaining substantial gains on tasks requiring extensive factual information. These approaches typically define multimodal search in a naive setting, assuming that a single full-level or entity-level image query and a few text queries suffice to retrieve the key evidence needed to answer the question, which is unrealistic in real-world scenarios with substantial visual noise. Moreover, they are often limited in reasoning depth and search breadth, making it difficult to solve complex questions that require aggregating evidence from diverse visual and textual sources. To address these limitations, we propose Vision-DeepResearch, a new multimodal deep-research paradigm that performs multi-turn, multi-entity and multi-scale visual and textual search to robustly hit real-world search engines under heavy noise. Vision-DeepResearch supports dozens of reasoning steps and hundreds of engine interactions, while internalizing deep-research capabilities into the MLLM via cold-start supervision and RL training, resulting in a strong end-to-end multimodal deep-research MLLM. It substantially outperforms existing multimodal deep-research MLLMs and workflows built on strong closed-source foundation models such as GPT-5, Gemini-2.5-pro and Claude-4-Sonnet. The code will be released at https://github.com/Osilly/Vision-DeepResearch.

中文标题/摘要

标题：Vision-DeepResearch: 在多模态大型语言模型中激励深度研究能力

多模态大型语言模型（MLLMs）在广泛的视觉任务中取得了显著的成功。然而，由于其内部世界知识的容量限制，先前的工作通过“推理后调用工具”来增强MLLMs，以视觉和文本搜索引擎获取大量事实信息，从而在需要大量事实信息的任务上取得了显著的进展。然而，这些方法通常在一种简单的设置下定义多模态搜索，假设单一的全层级或实体层级图像查询和少量文本查询足以检索到回答问题所需的关键证据，而在现实世界中，由于大量的视觉噪音，这往往是不现实的。此外，它们通常在推理深度和搜索广度上受到限制，使得解决需要从多种视觉和文本来源汇总证据的复杂问题变得困难。在此基础上，我们提出了Vision-DeepResearch，它提出了一种新的多模态深度研究范式，即进行多轮、多实体和多尺度的视觉和文本搜索，以在大量噪音下稳健地击中现实世界的搜索引擎。我们的Vision-DeepResearch支持数十个推理步骤和数百次引擎交互，通过冷启动监督和强化学习训练将深度研究能力内置于MLLM中，从而形成一个强大的端到端多模态深度研究MLLM。它在现有的多模态深度研究MLLMs和基于强大封闭源基础模型（如GPT-5、Gemini-2.5-pro和Claude-4-Sonnet）的工作流程中表现出显著的优越性。代码将在https://github.com/Osilly/Vision-DeepResearch上发布。

Summary / 总结

The paper proposes Vision-DeepResearch, a new multimodal deep-research paradigm that performs multi-turn, multi-entity, and multi-scale visual and textual search to handle complex questions under heavy noise. It internalizes deep-research capabilities into the MLLM through cold-start supervision and RL training, and substantially outperforms existing multimodal deep-research MLLMs as well as workflows built on strong closed-source foundation models such as GPT-5, Gemini-2.5-pro, and Claude-4-Sonnet.

Vision-DeepResearch旨在通过进行多轮、多实体和多尺度的视觉和文本搜索，增强多模态大型语言模型（MLLMs）的深度研究能力，以应对包含大量视觉噪声的复杂问题。该方法支持大量的推理步骤和广泛的引擎交互，其性能优于GPT-5、Gemini-2.5-pro和Claude-4-Sonnet等现有模型。关键发现表明，Vision-DeepResearch在处理需要深度研究能力的复杂任务方面显著优于这些模型。

Unsupervised Decomposition and Recombination with Discriminator-Driven Diffusion Models

Authors: Archer Wang, Emile Anand, Yilun Du, Marin Soljačić

First: 2026-01-29T17:57:06+00:00 · Latest: 2026-01-29T17:57:06+00:00

Comments: 28 pages, 16 figures, 4 tables

Abs · PDF · Code1 · Code2

Abstract

Decomposing complex data into factorized representations can reveal reusable components and enable synthesizing new samples via component recombination. We investigate this in the context of diffusion-based models that learn factorized latent spaces without factor-level supervision. In images, factors can capture background, illumination, and object attributes; in robotic videos, they can capture reusable motion components. To improve both latent factor discovery and quality of compositional generation, we introduce an adversarial training signal via a discriminator trained to distinguish between single-source samples and those generated by recombining factors across sources. By optimizing the generator to fool this discriminator, we encourage physical and semantic consistency in the resulting recombinations. Our method outperforms implementations of prior baselines on CelebA-HQ, Virtual KITTI, CLEVR, and Falcor3D, achieving lower FID scores and better disentanglement as measured by MIG and MCC. Furthermore, we demonstrate a novel application to robotic video trajectories: by recombining learned action components, we generate diverse sequences that significantly increase state-space coverage for exploration on the LIBERO benchmark.

中文标题/摘要

标题：无监督分解与重组：基于鉴别器驱动的扩散模型

将复杂数据分解为因子表示可以揭示可重用的组件，并通过组件重组生成新的样本。我们研究了在无需因子级别监督的情况下，基于扩散模型学习因子化的潜在空间。在图像中，因子可以捕捉背景、照明和对象属性；在机器人视频中，它们可以捕捉可重用的动作组件。为了提高潜在因子发现的质量和组合生成的质量，我们通过训练一个鉴别器来区分单一来源样本和通过重组因子生成的样本，引入了一种对抗训练信号。通过优化生成器来欺骗这个鉴别器，我们鼓励重组结果中的物理和语义一致性。我们的方法在CelebA-HQ、Virtual KITTI、CLEVR和Falcor3D上优于先前基线的实现，实现了更低的FID分数和更好的解耦合，如MIG和MCC所测量的。此外，我们展示了机器人视频轨迹的一个新应用：通过重组学习到的动作组件，我们生成了多样化的序列，显著增加了LIBERO基准上的状态空间覆盖。

Summary / 总结

This paper explores unsupervised decomposition and recombination of complex data using diffusion-based models. The method introduces an adversarial training signal through a discriminator, trained to distinguish single-source samples from samples generated by recombining factors across sources, to improve latent factor discovery and compositional generation quality. Experiments on CelebA-HQ, Virtual KITTI, CLEVR, and Falcor3D show that the proposed method outperforms implementations of prior baselines, with lower FID scores and better disentanglement as measured by MIG and MCC. Additionally, the method is applied to robotic video trajectories, generating diverse sequences that increase state-space coverage for exploration on the LIBERO benchmark.

该论文研究了使用基于扩散的模型进行无监督分解和重组复杂数据的方法。方法通过引入一个判别器来改善潜在因子发现和组合生成的质量。实验结果显示，该方法在CelebA-HQ、Virtual KITTI、CLEVR和Falcor3D上的表现优于先前的实现，具有更低的FID分数和更好的去纠缠性。此外，该方法还应用于机器人视频轨迹，生成多样化的序列，从而增强LIBERO基准上的状态空间覆盖。

PFT: Phonon Fine-tuning for Machine Learned Interatomic Potentials

Authors: Teddy Koker, Abhijeet Gangan, Mit Kotak, Jaime Marian, Tess Smidt

First: 2026-01-12T17:20:09+00:00 · Latest: 2026-01-29T17:54:33+00:00

Abs · PDF · Code1 · Code2

Abstract

Many materials properties depend on higher-order derivatives of the potential energy surface, yet machine learned interatomic potentials (MLIPs) trained with a standard loss on energy, force, and stress errors can exhibit error in curvature, degrading the prediction of vibrational properties. We introduce phonon fine-tuning (PFT), which directly supervises second-order force constants of materials by matching MLIP energy Hessians to DFT-computed force constants from finite displacement phonon calculations. To scale to large supercells, PFT stochastically samples Hessian columns and computes the loss with a single Hessian-vector product. We also use a simple co-training scheme to incorporate upstream data to mitigate catastrophic forgetting. On the MDR Phonon benchmark, PFT improves Nequix MP by 55% on average across phonon thermodynamic properties and achieves state-of-the-art accuracy among models trained on Materials Project trajectories. PFT also generalizes to improve properties beyond second-derivatives, improving thermal conductivity predictions that rely on third-order derivatives of the potential energy.

中文标题/摘要

标题：PFT：声子微调用于机器学习原子势

许多材料性质依赖于势能面的高阶导数，而通过标准损失函数（能量、力和应力误差）训练的机器学习原子势（MLIPs）可能会在曲率上出现误差，从而降低振动性质的预测精度。我们引入了声子微调（PFT），它直接监督材料的二阶力常数，通过将MLIP的能量海森矩阵与从有限位移声子计算中得到的DFT计算得到的力常数进行匹配。为了扩展到大型超晶胞，PFT随机采样海森矩阵的列，并通过单个海森矩阵-向量乘积计算损失。我们还使用简单的协同训练方案来整合上游数据，以减轻灾难性遗忘。在MDR声子基准测试中，PFT在声子热力学性质上的平均改进幅度为55%，并在基于材料项目轨迹训练的模型中达到了最先进的精度。PFT还能够泛化以改进超出二阶导数的性质，从而提高依赖于势能三次导数的热导率预测。

Summary / 总结

The research targets errors in the curvature of the potential energy surface predicted by machine learned interatomic potentials (MLIPs), which degrade the prediction of vibrational properties. The method, phonon fine-tuning (PFT), directly supervises second-order force constants by matching MLIP energy Hessians to DFT-computed force constants. On the MDR Phonon benchmark, PFT improves Nequix MP by 55% on average across phonon thermodynamic properties and achieves state-of-the-art accuracy among models trained on Materials Project trajectories. It also generalizes to thermal conductivity predictions, which depend on third-order derivatives.

研究旨在提高机器学习原子势（MLIPs）在预测势能曲面上的高阶导数时的准确性，这对于振动性质至关重要。方法是通过直接监督第二阶力常数，使MLIP的能量海森矩阵与从有限位移声子计算得到的DFT计算力常数匹配的声子微调（PFT）。在MDR声子基准测试中，PFT显著提高了Nequix MP的平均改进幅度为55%，并在所有声子热力学性质上达到了最先进的准确性。此外，PFT还能够推广到改进依赖于第三阶导数的热导率预测。
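
To make the idea of supervising one randomly sampled Hessian column through a single Hessian-vector product concrete, here is a minimal sketch on a toy 1D harmonic chain. Everything in it is an assumption for illustration: the toy energy model, the function names, and the use of a central finite difference as a stand-in for the Hessian-vector product (the abstract does not say how the product is computed). The reference matrix is the toy model's analytic Hessian, not DFT data.

```python
import random


def grad_energy(x, k=2.0, a=1.0):
    """Gradient of E = 0.5 * k * sum((x[i+1] - x[i] - a)**2) for a 1D chain."""
    g = [0.0] * len(x)
    for i in range(len(x) - 1):
        stretch = x[i + 1] - x[i] - a
        g[i] -= k * stretch
        g[i + 1] += k * stretch
    return g


def hessian_vector_product(x, v, eps=1e-4):
    """Central-difference estimate of H @ v from two gradient evaluations."""
    plus = grad_energy([xi + eps * vi for xi, vi in zip(x, v)])
    minus = grad_energy([xi - eps * vi for xi, vi in zip(x, v)])
    return [(p - m) / (2 * eps) for p, m in zip(plus, minus)]


def sampled_column_loss(x, reference_hessian, rng):
    """Squared error on one randomly chosen Hessian column."""
    j = rng.randrange(len(x))
    unit = [1.0 if i == j else 0.0 for i in range(len(x))]
    column = hessian_vector_product(x, unit)  # column j of the Hessian
    return sum((c - reference_hessian[i][j]) ** 2 for i, c in enumerate(column))


x = [0.0, 1.1, 1.9]
reference = [[2.0, -2.0, 0.0],
             [-2.0, 4.0, -2.0],
             [0.0, -2.0, 2.0]]
loss = sampled_column_loss(x, reference, random.Random(0))
print(loss)  # close to zero, since the reference is this model's own Hessian
```

The point of the sketch is the cost profile: each loss evaluation touches one column, so the full Hessian of a large supercell never has to be formed.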

Corrective Diffusion Language Models

Authors: Shuibai Zhang, Fred Zhangzhi Peng, Yiheng Zhang, Jin Pan, Grigorios G. Chrysos

First: 2025-12-17T17:04:38+00:00 · Latest: 2026-01-29T17:52:44+00:00

Comments: 21 pages

Abs · PDF · Code1 · Code2 · Code3

Abstract

While Diffusion Language Models (DLMs) are theoretically well-suited for iterative refinement due to their non-causal structure, they often fail to reliably revise incorrect tokens in practice. The key challenge lies in the model's inability to distinguish between correct and erroneous tokens in a visible sequence. Standard masked diffusion language model (MDLM) training is restricted to the objective of unmasking, undermining the effectiveness of refinement guided by confidence. Based on this observation, we study corrective behavior in DLMs, defined as the ability to assign lower confidence to incorrect tokens and iteratively refine them while preserving correct content. We show that this capability is not induced by conventional masked diffusion objectives and propose a post-training principle oriented by correction that explicitly supervises visible incorrect tokens, enabling discriminative confidence and targeted refinement. To evaluate corrective behavior, we introduce the Code Revision Benchmark, a controllable and executable benchmark for assessing error localization and in-place correction. Experiments on code revision tasks and parallel decoding scenarios demonstrate that models trained with our approach substantially outperform standard MDLMs, with gains that are most pronounced when parallel decoding introduces substantial uncertainty and iterative refinement becomes essential. Our code is publicly available at https://github.com/zhangshuibai/CDLM.

中文标题/摘要

标题：纠正性扩散语言模型

尽管扩散语言模型（DLMs）由于其非因式结构在理论上非常适合逐步改进，但在实践中往往无法可靠地修正错误的标记。关键挑战在于模型无法区分可见序列中的正确和错误标记。标准的掩码扩散语言模型（MDLM）训练仅限于解掩的目的，削弱了由置信度引导的改进效果。基于这一观察，我们研究了DLMs的纠正行为，定义为能够对错误标记赋予较低置信度，并在保持正确内容的同时逐步修正它们。我们表明，这种能力不是由传统的掩码扩散目标诱导的，并提出了一种以纠正为导向的后训练原则，明确监督可见的错误标记，从而实现区分性置信度和目标修正。为了评估纠正行为，我们引入了代码修订基准，这是一个可控且可执行的基准，用于评估错误定位和就地修正。在代码修订任务和并行解码场景上的实验表明，使用我们方法训练的模型显著优于标准MDLM，尤其是在并行解码引入大量不确定性且逐步修正变得至关重要的情况下，性能提升最为明显。我们的代码已公开发布在https://github.com/zhangshuibai/CDLM。

Summary / 总结

The research aims to improve the ability of Diffusion Language Models (DLMs) to correct incorrect tokens by addressing their difficulty in distinguishing correct from erroneous tokens. The study introduces a post-training principle that explicitly supervises visible incorrect tokens, enabling models to assign lower confidence to errors and iteratively refine them. Experiments show that models trained with this approach outperform standard masked diffusion language models, especially in scenarios with high uncertainty requiring iterative refinement.

研究旨在提高扩散语言模型(DLMs)纠正文本中错误标记的能力。主要方法是提出一个后训练原则，明确监督可见的错误标记，从而实现区分性的置信度和目标化的修正。实验表明，使用此方法训练的模型在高不确定性场景和需要迭代修正的情况下，比标准的掩码扩散语言模型表现更优。
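
The confidence-guided refinement that the abstract refers to can be sketched as a remasking step: tokens with low confidence are masked again so a later denoising pass can rewrite them. This is a simplified illustration, not the paper's training or decoding procedure; the `MASK` string, the threshold, and the confidence values are all hypothetical.

```python
MASK = "<mask>"


def remask_low_confidence(tokens, confidences, threshold=0.5):
    """Replace tokens whose confidence falls below the threshold with MASK."""
    return [tok if conf >= threshold else MASK
            for tok, conf in zip(tokens, confidences)]


# Hypothetical visible sequence with one suspicious token ("-").
tokens = ["return", "a", "-", "b"]
confidences = [0.98, 0.91, 0.12, 0.95]
revised = remask_low_confidence(tokens, confidences)
print(revised)  # ['return', 'a', '<mask>', 'b']
```

Such a step only helps if the model assigns low confidence to tokens that are actually wrong, which is the discriminative behavior the paper's post-training aims to induce.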

MetricAnything: Scaling Metric Depth Pretraining with Noisy Heterogeneous Sources

Authors: Baorui Ma, Jiahui Yang, Donglin Di, Xuancheng Zhang, Jianxun Cui, Hao Li, Yan Xie, Wei Chen

First: 2026-01-29T17:52:41+00:00 · Latest: 2026-01-29T17:52:41+00:00

Comments: Project Page: https://metric-anything.github.io/metric-anything-io/

Abs · PDF · Code1 · Code2 · Project1 · Project2

Abstract

Scaling has powered recent advances in vision foundation models, yet extending this paradigm to metric depth estimation remains challenging due to heterogeneous sensor noise, camera-dependent biases, and metric ambiguity in noisy cross-source 3D data. We introduce Metric Anything, a simple and scalable pretraining framework that learns metric depth from noisy, diverse 3D sources without manually engineered prompts, camera-specific modeling, or task-specific architectures. Central to our approach is the Sparse Metric Prompt, created by randomly masking depth maps, which serves as a universal interface that decouples spatial reasoning from sensor and camera biases. Using about 20M image-depth pairs spanning reconstructed, captured, and rendered 3D data across 10000 camera models, we demonstrate, for the first time, a clear scaling trend in the metric depth track. The pretrained model excels at prompt-driven tasks such as depth completion, super-resolution and Radar-camera fusion, while its distilled prompt-free student achieves state-of-the-art results on monocular depth estimation, camera intrinsics recovery, single/multi-view metric 3D reconstruction, and VLA planning. We also show that using pretrained ViT of Metric Anything as a visual encoder significantly boosts Multimodal Large Language Model capabilities in spatial intelligence. These results show that metric depth estimation can benefit from the same scaling laws that drive modern foundation models, establishing a new path toward scalable and efficient real-world metric perception. We open-source MetricAnything at http://metric-anything.github.io/metric-anything-io/ to support community research.

中文标题/摘要

标题：MetricAnything：通过嘈杂异构数据源扩展度量深度预训练

扩展性推动了视觉基础模型的最新进展，但在将这一范式扩展到度量深度估计方面仍然具有挑战性，原因在于异构传感器噪声、相机相关的偏差以及嘈杂跨源3D数据中的度量模糊性。我们提出了Metric Anything，这是一种简单且可扩展的预训练框架，可以从嘈杂的、多样的3D数据源中学习度量深度，而无需手动工程化的提示、相机特定的建模或任务特定的架构。我们方法的核心是稀疏度量提示，它是通过随机遮蔽深度图创建的，作为通用接口，将空间推理与传感器和相机偏差解耦。使用来自10000种相机模型的重建、捕获和渲染3D数据的约2000万张图像-深度对，我们首次展示了度量深度跟踪中的清晰扩展趋势。预训练模型在深度完成、超分辨率和雷达-相机融合等提示驱动任务中表现出色，而其精简的无提示学生模型在单目深度估计、相机内参恢复、单目/多视图度量3D重建和VLA规划方面达到了最先进的结果。我们还展示了使用Metric Anything的预训练ViT作为视觉编码器可以显著提升多模态大型语言模型在空间智能方面的性能。这些结果表明，度量深度估计可以从驱动现代基础模型的相同扩展法则中受益，从而开辟了一条通向可扩展和高效现实世界度量感知的新途径。我们开源了MetricAnything，网址为http://metric-anything.github.io/metric-anything-io/，以支持社区研究。

Summary / 总结

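MetricAnything is a pretraining framework for metric depth estimation that learns from noisy, diverse 3D sources without manually engineered prompts, camera-specific modeling, or task-specific architectures. Its Sparse Metric Prompt, created by randomly masking depth maps, decouples spatial reasoning from sensor and camera biases. Trained on about 20 million image-depth pairs, the model shows a clear scaling trend and excels at prompt-driven tasks such as depth completion and super-resolution, while its distilled prompt-free student achieves state-of-the-art results on monocular depth estimation. Using its pretrained ViT as a visual encoder also improves the spatial intelligence of multimodal large language models.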

MetricAnything 是一种用于度量深度估计的预训练框架，利用了噪声和异构的 3D 数据源，无需进行相机特定建模或任务特定架构。它使用稀疏度量提示来解耦空间推理和传感器偏差，并展示了度量深度估计中的清晰扩展趋势。预训练模型在深度补全和超分辨率等任务中表现出色，其精简版本在单目深度估计和 3D 重建等任务中达到了最先进的结果。这项工作表明，度量深度估计可以从现代基础模型的扩展规律中受益，从而开辟了一条可扩展和高效的现实世界度量感知的新途径。
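
The abstract states that the Sparse Metric Prompt is created by randomly masking depth maps. A minimal sketch of that masking step is given below; the function name, the keep ratio, and the use of `None` for masked pixels are assumptions for illustration and are not taken from the paper.

```python
import random


def sparse_metric_prompt(depth_map, keep_ratio, rng):
    """Keep each depth value with probability keep_ratio; mask the rest as None."""
    return [[d if rng.random() < keep_ratio else None for d in row]
            for row in depth_map]


# Toy 3 x 4 depth map in metres (hypothetical values).
depth = [[1.0, 1.2, 1.4, 1.6],
         [2.0, 2.2, 2.4, 2.6],
         [3.0, 3.2, 3.4, 3.6]]
prompt = sparse_metric_prompt(depth, keep_ratio=0.25, rng=random.Random(0))
for row in prompt:
    print(row)
```

The surviving entries act as sparse metric anchors: they carry absolute scale at a few pixels while leaving the dense prediction to the model.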

PLANING: A Loosely Coupled Triangle-Gaussian Framework for Streaming 3D Reconstruction

Authors: Changjian Jiang, Kerui Ren, Xudong Li, Kaiwen Song, Linning Xu, Tao Lu, Junting Dong, Yu Zhang, Bo Dai, Mulin Yu

First: 2026-01-29T17:47:26+00:00 · Latest: 2026-01-29T17:47:26+00:00

Abs · PDF · Code1 · Code2 · Project1

Abstract

Streaming reconstruction from monocular image sequences remains challenging, as existing methods typically favor either high-quality rendering or accurate geometry, but rarely both. We present PLANING, an efficient on-the-fly reconstruction framework built on a hybrid representation that loosely couples explicit geometric primitives with neural Gaussians, enabling geometry and appearance to be modeled in a decoupled manner. This decoupling supports an online initialization and optimization strategy that separates geometry and appearance updates, yielding stable streaming reconstruction with substantially reduced structural redundancy. PLANING improves dense mesh Chamfer-L2 by 18.52% over PGSR, surpasses ARTDECO by 1.31 dB PSNR, and reconstructs ScanNetV2 scenes in under 100 seconds, over 5x faster than 2D Gaussian Splatting, while matching the quality of offline per-scene optimization. Beyond reconstruction quality, the structural clarity and computational efficiency of PLANING make it well suited for a broad range of downstream applications, such as enabling large-scale scene modeling and simulation-ready environments for embodied AI. Project page: https://city-super.github.io/PLANING/ .

中文标题/摘要

标题：PLANING：一种松耦合三角高斯框架用于流式3D重建

单目图像序列的流式重建仍然具有挑战性，因为现有方法通常更倾向于高质量的渲染或准确的几何结构，但两者很少同时兼顾。我们提出了PLANING，这是一种基于混合表示的高效实时重建框架，该框架松散地将显式的几何原语与神经高斯函数耦合在一起，使得几何结构和外观可以以解耦的方式建模。这种解耦支持一种在线初始化和优化策略，将几何结构和外观更新分离，从而实现结构冗余大幅减少的稳定流式重建。PLANING在稠密网格Chamfer-L2上比PGSR提高了18.52%，在PSNR上比ARTDECO提高了1.31 dB，并在不到100秒内重建了ScanNetV2场景，比2D高斯点积快5倍以上，同时保持了与单场景离线优化相当的质量。除了重建质量，PLANING的结构清晰度和计算效率使其适用于广泛的下游应用，如大规模场景建模和为具身AI准备的模拟环境。项目页面：https://city-super.github.io/PLANING/ 。

Summary / 总结

The research addresses the challenge of streaming 3D reconstruction from monocular image sequences, where existing methods often prioritize either high-quality rendering or accurate geometry. PLANING introduces a hybrid representation that loosely couples explicit geometric primitives with neural Gaussians, allowing for decoupled modeling of geometry and appearance. This approach enables stable streaming reconstruction with reduced structural redundancy, improving dense mesh Chamfer-L2 by 18.52% over PGSR and surpassing ARTDECO by 1.31 dB PSNR. PLANING reconstructs ScanNetV2 scenes in under 100 seconds, over 5x faster than 2D Gaussian Splatting, while matching the quality of offline per-scene optimization. Beyond reconstruction, PLANING's structural clarity and computational efficiency support various downstream applications, including large-scale scene modeling and simulation-ready environments for embodied AI.

研究针对单目图像序列的流式3D重建难题，现有方法通常侧重于高质量渲染或精确几何。PLANING提出了一种结合显式几何体和神经高斯的混合表示，允许几何和外观的解耦建模。这种方法支持在线初始化和优化，从而实现稳定且结构冗余减少的流式重建。PLANING在密集网格Chamfer-L2上优于PGSR 18.52%，在PSNR上超过ARTDECO 1.31 dB，同时在不到100秒内重建ScanNetV2场景，比2D高斯点云快五倍以上，并且与离线场景优化质量相当。

Urban Neural Surface Reconstruction from Constrained Sparse Aerial Imagery with 3D SAR Fusion

Authors: Da Li, Chen Yao, Tong Mao, Jiacheng Bao, Houjun Sun

First: 2026-01-29T17:47:07+00:00 · Latest: 2026-01-29T17:47:07+00:00

Abs · PDF · Code1 · Code2

Abstract

Neural surface reconstruction (NSR) has recently shown strong potential for urban 3D reconstruction from multi-view aerial imagery. However, existing NSR methods often suffer from geometric ambiguity and instability, particularly under sparse-view conditions. This issue is critical in large-scale urban remote sensing, where aerial image acquisition is limited by flight paths, terrain, and cost. To address this challenge, we present the first urban NSR framework that fuses 3D synthetic aperture radar (SAR) point clouds with aerial imagery for high-fidelity reconstruction under constrained, sparse-view settings. 3D SAR can efficiently capture large-scale geometry even from a single side-looking flight path, providing robust priors that complement photometric cues from images. Our framework integrates radar-derived spatial constraints into an SDF-based NSR backbone, guiding structure-aware ray selection and adaptive sampling for stable and efficient optimization. We also construct the first benchmark dataset with co-registered 3D SAR point clouds and aerial imagery, facilitating systematic evaluation of cross-modal 3D reconstruction. Extensive experiments show that incorporating 3D SAR markedly enhances reconstruction accuracy, completeness, and robustness compared with single-modality baselines under highly sparse and oblique-view conditions, highlighting a viable route toward scalable high-fidelity urban reconstruction with advanced airborne and spaceborne optical-SAR sensing.

中文标题/摘要

标题：基于3D SAR融合的受限稀疏航空影像城市神经表面重建

神经表面重建（NSR）近年来在多视角航空影像的城市三维重建中显示出强大的潜力。然而，现有的NSR方法在稀疏视角条件下往往存在几何模糊和不稳定的问题。这一问题在大规模城市遥感中尤为关键，因为航空影像获取受限于飞行路径、地形和成本。为了解决这一挑战，我们提出了第一个将3D合成孔径雷达（SAR）点云与航空影像融合的城市NSR框架，以在受限、稀疏视角设置下实现高保真重建。3D SAR可以从单侧飞行路径高效地捕获大规模几何结构，提供与图像光度线索互补的鲁棒先验。我们的框架将雷达提取的空间约束整合到基于SDF的NSR骨干网络中，指导结构感知射线选择和自适应采样，以实现稳定和高效的优化。我们还构建了第一个包含配准3D SAR点云和航空影像的基准数据集，便于系统评估跨模态三维重建。大量实验表明，在高度稀疏和偏视角条件下，结合3D SAR显著提高了重建的准确度、完整性和鲁棒性，突显了利用先进机载和星载光学-SAR成像实现可扩展高保真城市重建的可行途径。

Summary / 总结

The research aims to improve urban 3D reconstruction from sparse aerial imagery by integrating 3D synthetic aperture radar (SAR) data. The method fuses SAR point clouds with aerial imagery to provide robust geometric priors, addressing the limitations of existing neural surface reconstruction methods under constrained view conditions. Experiments demonstrate that this approach significantly enhances reconstruction accuracy, completeness, and robustness compared to single-modality baselines in highly sparse and oblique-view scenarios.

研究旨在通过整合3D合成孔径雷达(SAR)数据，提高从稀疏航空影像中进行城市三维重建的效果。方法将SAR点云与航空影像融合，提供稳健的几何先验，增强在受限条件下的重建准确性和稳定性。实验表明，这种方法在稀疏和侧视视角场景下显著提高了重建质量，优于单一模态方法。

SIA: Symbolic Interpretability for Anticipatory Deep Reinforcement Learning in Network Control

Authors: MohammadErfan Jabbari, Abhishek Duttagupta, Claudio Fiandrino, Leonardo Bonati, Salvatore D'Oro, Michele Polese, Marco Fiore, Tommaso Melodia

First: 2026-01-29T17:46:46+00:00 · Latest: 2026-01-29T17:46:46+00:00

Comments: 10 pages, 12 figures, accepted at IEEE INFOCOM 2026

Abs · PDF · Code1 · Code2

Abstract

Deep reinforcement learning (DRL) promises adaptive control for future mobile networks but conventional agents remain reactive: they act on past and current measurements and cannot leverage short-term forecasts of exogenous KPIs such as bandwidth. Augmenting agents with predictions can overcome this temporal myopia, yet uptake in networking is scarce because forecast-aware agents act as closed-boxes; operators cannot tell whether predictions guide decisions or justify the added complexity. We propose SIA, the first interpreter that exposes in real time how forecast-augmented DRL agents operate. SIA fuses Symbolic AI abstractions with per-KPI Knowledge Graphs to produce explanations, and includes a new Influence Score metric. SIA achieves sub-millisecond speed, over 200x faster than existing XAI methods. We evaluate SIA on three diverse networking use cases, uncovering hidden issues, including temporal misalignment in forecast integration and reward-design biases that trigger counter-productive policies. These insights enable targeted fixes: a redesigned agent achieves a 9% higher average bitrate in video streaming, and SIA's online Action-Refinement module improves RAN-slicing reward by 25% without retraining. By making anticipatory DRL transparent and tunable, SIA lowers the barrier to proactive control in next-generation mobile networks.

中文标题/摘要

标题：SIA：网络控制中预见性深度强化学习的符号可解释性

深度强化学习（DRL）为未来的移动网络提供了自适应控制的潜力，但传统的代理仍然是反应性的：它们基于过去的和当前的测量值行动，无法利用外生KPI（如带宽）的短期预测。通过预测增强代理可以克服这种时间短视，但在网络中采用却很少，因为预测感知的代理作为黑箱操作，运营商无法判断预测是否指导决策或增加复杂性。我们提出了SIA，这是第一个实时揭示预测增强DRL代理操作方式的解释器。SIA 结合了符号AI 抽象与每个KPI的知识图谱，生成解释，并包含一个新的影响分数指标。SIA 达到了亚毫秒级的速度，比现有解释性人工智能方法快200多倍。我们在三个不同的网络应用场景上评估了SIA，发现了隐藏的问题，包括预测集成的时间对齐问题和奖励设计偏差，这些偏差触发了反生产性的策略。这些见解使我们能够进行针对性的修复：重新设计的代理在视频流媒体中平均比特率提高了9%，SIA 的在线行动细化模块在不重新训练的情况下将RAN切片奖励提高了25%。通过使预见性DRL变得透明和可调，SIA 降低了下一代移动网络中主动控制的门槛。

Summary / 总结

SIA is an interpreter that exposes in real time how forecast-augmented deep reinforcement learning (DRL) agents operate in network control, where agents use short-term forecasts of KPIs such as bandwidth. It fuses Symbolic AI abstractions with per-KPI Knowledge Graphs to generate explanations and introduces an Influence Score metric. SIA runs at sub-millisecond speed, over 200x faster than existing XAI methods. Evaluated on three networking use cases, it uncovers issues such as temporal misalignment in forecast integration and reward-design biases; the resulting fixes yield a 9% higher average bitrate in video streaming and a 25% improvement in RAN-slicing reward without retraining.

SIA旨在通过整合KPI（如带宽）的短期预测来增强网络控制中深度强化学习（DRL）代理的可解释性。它利用符号AI和每项KPI的知识图谱来生成实时解释，并引入了影响得分指标。SIA显著提高了预测感知DRL代理的透明度，使其能够进行更快、更针对性的修复。在三个网络应用案例中的评估揭示了诸如时间对齐问题和奖励设计偏差等问题，这些问题被解决以提升性能，实现了9%的平均比特率提升和25%的RAN切片奖励提升，无需重新训练。

Understanding Multimodal Complementarity for Single-Frame Action Anticipation

Authors: Manuel Benavent-Lledo, Konstantinos Bacharidis, Konstantinos Papoutsakis, Antonis Argyros, Jose Garcia-Rodriguez

First: 2026-01-29T17:44:23+00:00 · Latest: 2026-01-29T17:44:23+00:00

Abs · PDF · Code1 · Code2

Abstract

Human action anticipation is commonly treated as a video understanding problem, implicitly assuming that dense temporal information is required to reason about future actions. In this work, we challenge this assumption by investigating what can be achieved when action anticipation is constrained to a single visual observation. We ask a fundamental question: how much information about the future is already encoded in a single frame, and how can it be effectively exploited? Building on our prior work on Action Anticipation at a Glimpse (AAG), we conduct a systematic investigation of single-frame action anticipation enriched with complementary sources of information. We analyze the contribution of RGB appearance, depth-based geometric cues, and semantic representations of past actions, and investigate how different multimodal fusion strategies, keyframe selection policies and past-action history sources influence anticipation performance. Guided by these findings, we consolidate the most effective design choices into AAG+, a refined single-frame anticipation framework. Despite operating on a single frame, AAG+ consistently improves upon the original AAG and achieves performance comparable to, or exceeding, that of state-of-the-art video-based methods on challenging anticipation benchmarks including IKEA-ASM, Meccano and Assembly101. Our results offer new insights into the limits and potential of single-frame action anticipation, and clarify when dense temporal modeling is necessary and when a carefully selected glimpse is sufficient.

中文标题/摘要

标题：理解多模态互补性以实现单帧动作预判

人类动作预判通常被视为一个视频理解问题，隐含地假设需要密集的时间信息来推断未来动作。在本文中，我们通过研究动作预判在仅限于单个视觉观察时能实现什么，挑战了这一假设。我们提出一个基本问题：未来的信息在单帧中已经编码了多少，如何有效利用？基于我们之前关于一瞥动作预判（AAG）的工作，我们系统地研究了单帧动作预判，结合了互补的信息来源。我们分析了RGB外观、基于深度的几何线索和过去动作的语义表示的贡献，并探讨了不同的多模态融合策略、关键帧选择策略和过去动作历史来源如何影响预判性能。根据这些发现，我们将最有效的设计选择整合到AAG+中，这是一个改进的单帧预判框架。尽管仅基于单帧，AAG+在原始AAG的基础上始终表现出改进，并在包括IKEA-ASM、Meccano和Assembly101在内的具有挑战性的预判基准上达到了与最先进的基于视频的方法相当或更优的性能。我们的结果为单帧动作预判的局限性和潜力提供了新的见解，并澄清了何时需要密集的时间建模以及何时一个精心选择的瞥视就足够。

Summary / 总结

This work challenges the common assumption that dense temporal information is necessary for action anticipation by investigating single-frame action anticipation. The authors explore the information encoded in a single frame and how it can be effectively utilized. They develop AAG+, which integrates RGB appearance, depth-based geometric cues, and semantic representations, and show that it outperforms the original AAG and matches or exceeds state-of-the-art video-based methods on challenging benchmarks.

这项工作挑战了将动作预见视为视频理解问题的传统方法，专注于单帧动作预见。作者研究了单帧中包含了多少未来动作信息以及如何有效利用这些信息。他们以AAG为基础，探索了RGB外观、基于深度的几何线索以及过去动作的语义表示的集成。研究发现，一个改进的单帧预见框架AAG+可以在具有挑战性的基准测试中达到与最先进的基于视频的方法相当或更好的性能。这表明，在某些情况下，精心选择的单帧足以进行动作预见。

Optimizing Agentic Workflows using Meta-tools

Authors: Sami Abuzakuk, Anne-Marie Kermarrec, Rishi Sharma, Rasmus Moorits Veski, Martijn de Vos

First: 2026-01-29T17:43:08+00:00 · Latest: 2026-01-29T17:43:08+00:00

Abs · PDF · Code1 · Code2

Abstract

Agentic AI enables LLMs to dynamically reason, plan, and interact with tools to solve complex tasks. However, agentic workflows often require many iterative reasoning steps and tool invocations, leading to significant operational expense, end-to-end latency and failures due to hallucinations. This work introduces Agent Workflow Optimization (AWO), a framework that identifies and optimizes redundant tool execution patterns to improve the efficiency and robustness of agentic workflows. AWO analyzes existing workflow traces to discover recurring sequences of tool calls and transforms them into meta-tools, which are deterministic, composite tools that bundle multiple agent actions into a single invocation. Meta-tools bypass unnecessary intermediate LLM reasoning steps and reduce operational cost while also shortening execution paths, leading to fewer failures. Experiments on two agentic AI benchmarks show that AWO reduces the number of LLM calls by up to 11.9% while also increasing the task success rate by up to 4.2 percentage points.

中文标题/摘要

标题：使用元工具优化代理工作流

代理AI使LLM能够动态推理、规划并与工具交互以解决复杂任务。然而，代理工作流通常需要许多迭代推理步骤和工具调用，导致显著的操作成本、端到端延迟和由于幻觉导致的失败。本研究引入了代理工作流优化(AWO)框架，该框架识别并优化冗余的工具执行模式，以提高代理工作流的效率和鲁棒性。AWO分析现有的工作流轨迹，发现重复的工具调用序列，并将它们转换为元工具，这是一种确定性的复合工具，将多个代理操作打包成一次调用。元工具绕过了不必要的中间LLM推理步骤，降低了操作成本，同时缩短了执行路径，减少了失败次数。在两个代理AI基准测试上的实验表明，AWO将LLM调用次数最多减少了11.9%，同时将任务成功率最多提高了4.2个百分点。

Summary / 总结

This work addresses the inefficiencies and operational costs associated with agentic workflows by introducing Agent Workflow Optimization (AWO). AWO identifies and optimizes redundant tool execution patterns, transforming them into meta-tools that bundle multiple agent actions into a single invocation. This approach reduces the number of LLM calls by up to 11.9% and increases task success rates by up to 4.2 percentage points, thereby improving efficiency and robustness. Experiments on two agentic AI benchmarks validate these findings.

这项工作通过引入Agent Workflow Optimization (AWO)来解决代理工作流中的低效率和运营成本问题。AWO 识别并优化冗余的工具执行模式，将其转换为将多个代理操作打包成单一调用的元工具。这种方法最多可减少11.9%的LLM调用次数，并将任务成功率最多提高4.2个百分点。
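
The trace-mining idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes traces are plain lists of tool names, that recurring patterns are contiguous sequences of a fixed length, and that each tool takes the previous tool's output as its only input. The names `frequent_tool_sequences`, `make_meta_tool`, and the toy tools are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple


def frequent_tool_sequences(
    traces: Sequence[Sequence[str]], length: int = 2, min_count: int = 2
) -> List[Tuple[Tuple[str, ...], int]]:
    """Count contiguous tool-name sequences of a given length across traces."""
    counts: Counter = Counter()
    for trace in traces:
        for i in range(len(trace) - length + 1):
            counts[tuple(trace[i:i + length])] += 1
    # Keep only sequences that recur often enough to be worth bundling.
    return [(seq, n) for seq, n in counts.most_common() if n >= min_count]


def make_meta_tool(
    tools: Dict[str, Callable[[str], str]], sequence: Sequence[str]
) -> Callable[[str], str]:
    """Bundle a fixed tool sequence into one deterministic callable.

    Assumption: each tool consumes the previous tool's output.
    """
    def meta_tool(value: str) -> str:
        for name in sequence:
            value = tools[name](value)
        return value
    return meta_tool


# Hypothetical workflow traces (tool names only).
traces = [
    ["search", "open_page", "extract", "answer"],
    ["search", "open_page", "extract", "summarize"],
    ["lookup", "answer"],
]
candidates = frequent_tool_sequences(traces, length=3, min_count=2)

# Hypothetical toy tools standing in for real tool implementations.
tools = {
    "search": lambda q: f"results({q})",
    "open_page": lambda r: f"page({r})",
    "extract": lambda p: f"facts({p})",
}
meta = make_meta_tool(tools, candidates[0][0])
print(candidates)
print(meta("x"))
```

In this sketch the agent would issue one call to `meta` where the traces previously showed three separate tool calls, each normally separated by an LLM reasoning step.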

Thinking Out of Order: When Output Order Stops Reflecting Reasoning Order in Diffusion Language Models

Authors: Longxuan Yu, Yu Fu, Shaorong Zhang, Hui Liu, Mukund Varma T, Greg Ver Steeg, Yue Dong

First: 2026-01-29T17:40:58+00:00 · Latest: 2026-01-29T17:40:58+00:00

Comments: 18 pages, 13 figures, 5 tables

Abs · PDF · Code1 · Code2

Abstract

Autoregressive (AR) language models enforce a fixed left-to-right generation order, creating a fundamental limitation when the required output structure conflicts with natural reasoning (e.g., producing answers before explanations due to presentation or schema constraints). In such cases, AR models must commit to answers before generating intermediate reasoning, and this rigid constraint forces premature commitment. Masked diffusion language models (MDLMs), which iteratively refine all tokens in parallel, offer a way to decouple computation order from output structure. We validate this capability on GSM8K, Math500, and ReasonOrderQA, a benchmark we introduce with controlled difficulty and order-level evaluation. When prompts request answers before reasoning, AR models exhibit large accuracy gaps compared to standard chain-of-thought ordering (up to 67% relative drop), while MDLMs remain stable ($\leq$14% relative drop), a property we term "order robustness". Using ReasonOrderQA, we present evidence that MDLMs achieve order robustness by stabilizing simpler tokens (e.g., reasoning steps) earlier in the diffusion process than complex ones (e.g., final answers), enabling reasoning tokens to stabilize before answer commitment. Finally, we identify failure conditions where this advantage weakens, outlining the limits required for order robustness.

中文标题/摘要

标题：逆序思考：当输出顺序不再反映推理顺序时在扩散语言模型中的表现

自回归（AR）语言模型强制执行固定的从左到右生成顺序，当所需的输出结构与自然推理冲突时（例如，由于呈现或结构约束，先给出答案后解释），这种固定顺序成为基本限制。在这种情况下，AR模型必须在生成中间推理之前就做出答案的承诺，这种刚性约束迫使它们过早地做出承诺。掩码扩散语言模型（MDLMs）以并行方式迭代细化所有标记，提供了一种将计算顺序与输出结构脱钩的方法。我们在GSM8K、Math500和ReasonOrderQA上验证了这一能力，其中ReasonOrderQA是我们引入的一个具有受控难度和顺序层面评估的基准。当提示要求先给出答案后进行推理时，AR模型与标准思维链排序相比表现出巨大的准确度差距（最高达67%的相对下降），而MDLMs保持稳定（≤14%的相对下降），我们称这一特性为“顺序稳健性”。使用ReasonOrderQA，我们展示了MDLMs通过在扩散过程中更早地稳定更简单的标记（例如推理步骤）而不是复杂的标记（例如最终答案），使推理标记在答案承诺之前稳定，从而实现顺序稳健性。最后，我们确定了这种优势减弱的失败条件，概述了顺序稳健性所需的限制。

Summary / 总结

The research addresses a limitation of autoregressive (AR) language models: when the required output structure conflicts with the natural reasoning order, they must commit to an answer before generating the reasoning. Masked diffusion language models (MDLMs), which iteratively refine all tokens in parallel, are examined as a way to decouple computation order from output structure. Experiments on GSM8K, Math500, and the newly introduced ReasonOrderQA show that AR models suffer large accuracy drops (up to 67% relative) when answers are requested before reasoning, whereas MDLMs remain stable (at most 14% relative drop), a property termed "order robustness". This robustness is attributed to MDLMs stabilizing simpler tokens (e.g., reasoning steps) earlier in the diffusion process than complex ones (e.g., final answers). The authors also identify failure conditions under which the advantage weakens.

论文探讨了自回归（AR）语言模型在处理输出结构与自然推理顺序冲突时的局限性。它引入了掩码扩散语言模型（MDLMs），该模型并行迭代细化所有标记，从而解耦计算顺序与输出结构。实验表明，当要求在推理之前给出答案时，AR模型的准确性会显著下降（最高达67%的相对下降），而MDLMs则保持稳定（最高14%的相对下降），这一特性被称为‘顺序稳健性’。研究显示，MDLMs通过在扩散过程中更早地稳定简单的标记（如推理步骤），使推理标记在答案承诺之前稳定下来，从而实现这一优势，但这种优势在某些条件下会减弱。
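
For readers unfamiliar with the "relative drop" figures quoted above, the following sketch shows the usual way such a number is computed. It assumes the standard definition (accuracy lost as a fraction of the chain-of-thought baseline); the abstract does not spell out the formula, and the accuracies below are hypothetical, not results from the paper.

```python
def relative_drop(baseline_acc: float, reordered_acc: float) -> float:
    """Fraction of baseline accuracy lost when the answer must come first.

    Assumed definition: (baseline - reordered) / baseline.
    """
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (baseline_acc - reordered_acc) / baseline_acc


# Hypothetical accuracies, for illustration only.
drop = relative_drop(baseline_acc=0.80, reordered_acc=0.40)
print(f"{drop:.0%}")
```

Under this definition, a model that falls from 80% to 40% accuracy has a 50% relative drop, even though the absolute drop is 40 percentage points.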

Drive-JEPA: Video JEPA Meets Multimodal Trajectory Distillation for End-to-End Driving

Authors: Linhan Wang, Zichong Yang, Chen Bai, Guoxiang Zhang, Xiaotong Liu, Xiaoyin Zheng, Xiao-Xiao Long, Chang-Tien Lu, Cheng Lu

First: 2026-01-29T17:39:20+00:00 · Latest: 2026-01-29T17:39:20+00:00

Abs · PDF · Code1 · Code2

Abstract

End-to-end autonomous driving increasingly leverages self-supervised video pretraining to learn transferable planning representations. However, pretraining video world models for scene understanding has so far brought only limited improvements. This limitation is compounded by the inherent ambiguity of driving: each scene typically provides only a single human trajectory, making it difficult to learn multimodal behaviors. In this work, we propose Drive-JEPA, a framework that integrates Video Joint-Embedding Predictive Architecture (V-JEPA) with multimodal trajectory distillation for end-to-end driving. First, we adapt V-JEPA for end-to-end driving, pretraining a ViT encoder on large-scale driving videos to produce predictive representations aligned with trajectory planning. Second, we introduce a proposal-centric planner that distills diverse simulator-generated trajectories alongside human trajectories, with a momentum-aware selection mechanism to promote stable and safe behavior. When evaluated on NAVSIM, the V-JEPA representation combined with a simple transformer-based decoder outperforms prior methods by 3 PDMS in the perception-free setting. The complete Drive-JEPA framework achieves 93.3 PDMS on v1 and 87.8 EPDMS on v2, setting a new state-of-the-art.

中文标题/摘要

标题：Drive-JEPA：视频JEPA结合多模态轨迹蒸馏实现端到端驾驶

端到端自动驾驶越来越多地利用自监督视频预训练来学习可迁移的规划表示。然而，用于场景理解的视频世界模型的预训练到目前为止仅带来了有限的改进。驾驶固有的歧义性进一步加剧了这一限制：每个场景通常只提供一条人类轨迹，使得学习多模态行为变得困难。在本文中，我们提出了一种Drive-JEPA框架，该框架将视频联合嵌入预测架构（V-JEPA）与多模态轨迹蒸馏相结合，以实现端到端驾驶。首先，我们将V-JEPA适应于端到端驾驶，通过大规模驾驶视频预训练ViT编码器，生成与轨迹规划对齐的预测表示。其次，我们引入了一种以提案为中心的规划器，该规划器蒸馏多样化的模拟器生成轨迹和人类轨迹，采用一种具有动量感知的选择机制，以促进稳定和安全的行为。在NAVSIM上评估时，V-JEPA表示与简单的基于Transformer的解码器结合使用，在无感知设置中比先前的方法高出3 PDMS。完整的Drive-JEPA框架在v1上达到93.3 PDMS，在v2上达到87.8 EPDMS，创造了新的最佳水平。

Summary / 总结

The research aims to improve end-to-end autonomous driving by leveraging video pretraining and multimodal trajectory distillation. Drive-JEPA integrates V-JEPA with a proposal-centric planner to distill diverse trajectories, enhancing the model's ability to learn multimodal behaviors. The framework outperforms previous methods on NAVSIM with 93.3 PDMS on v1 and 87.8 EPDMS on v2, setting a new state-of-the-art.

研究旨在通过利用视频预训练和多模态轨迹蒸馏来提升端到端自动驾驶。Drive-JEPA 将 V-JEPA 与提议为中心的规划者结合，蒸馏多样化的轨迹，增强场景理解和行为预测。该框架在 v1 上达到 93.3 PDMS，在 v2 上达到 87.8 EPDMS，超越了之前的各项方法，在无感知设置中表现出色。

The Ensemble Inverse Problem: Applications and Methods

Authors: Zhengyan Huan, Camila Pazos, Martin Klassen, Vincent Croft, Pierre-Hugues Beauchemin, Shuchin Aeron

First: 2026-01-29T17:34:41+00:00 · Latest: 2026-01-29T17:34:41+00:00

Comments: 26 pages, 11 figures, in peer review

Abs · PDF · Code1 · Code2 · Code3

Abstract

We introduce a new multivariate statistical problem that we refer to as the Ensemble Inverse Problem (EIP). The aim of EIP is to invert for an ensemble that is distributed according to the pushforward of a prior under a forward process. In high energy physics (HEP), this is related to a widely known problem called unfolding, which aims to reconstruct the true physics distribution of quantities, such as momentum and angle, from measurements that are distorted by detector effects. In recent applications, the EIP also arises in full waveform inversion (FWI) and inverse imaging with unknown priors. We propose non-iterative inference-time methods that construct posterior samplers based on a new class of conditional generative models, which we call ensemble inverse generative models. For the posterior modeling, these models additionally use the ensemble information contained in the observation set on top of single measurements. Unlike existing methods, our proposed methods avoid explicit and iterative use of the forward model at inference time via training across several sets of truth-observation pairs that are consistent with the same forward model, but originate from a wide range of priors. We demonstrate that this training procedure implicitly encodes the likelihood model. The use of ensemble information helps posterior inference and enables generalization to unseen priors. We benchmark the proposed method on several synthetic and real datasets in inverse imaging, HEP, and FWI. The codes are available at https://github.com/ZhengyanHuan/The-Ensemble-Inverse-Problem--Applications-and-Methods.

中文标题/摘要

标题：集合逆问题：应用与方法

我们引入了一个新的多元统计问题，称之为集合逆问题（EIP）。EIP 的目标是反演一个根据前向过程的先验推前得到的集合分布。在高能物理（HEP）中，这与一个广泛熟知的问题——解包问题——相关，该问题旨在从被探测器效应扭曲的测量值中重构真实的物理分布，例如动量和角度。在最近的应用中，EIP 也出现在全波形反演（FWI）和未知先验的逆成像中。我们提出了非迭代的推理时方法，基于一种新的条件生成模型类构建后验采样器，我们称之为集合逆生成模型。对于后验建模，这些模型还利用了观测集中包含的集合信息，而不仅仅是单个测量值。与现有方法不同，我们提出的方法在推理时不显式且迭代地使用前向模型，而是通过在与同一前向模型一致但来自广泛先验的一系列真值-观测对上进行训练来实现。我们证明了这种训练过程隐式地编码了似然模型。利用集合信息有助于后验推断，并使方法能够泛化到未见过的先验。我们在逆成像、HEP 和 FWI 中的几个合成和真实数据集上对提出的方法进行了基准测试。代码可在 https://github.com/ZhengyanHuan/The-Ensemble-Inverse-Problem--Applications-and-Methods/ 获取。

Summary / 总结

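The paper introduces the Ensemble Inverse Problem (EIP), which aims to invert for an ensemble distributed according to the pushforward of a prior under a forward process; in high energy physics this relates to unfolding, the reconstruction of true distributions from measurements distorted by detector effects. It proposes non-iterative inference-time methods using ensemble inverse generative models, which incorporate ensemble information from the observation set for posterior modeling. The methods avoid explicit and iterative use of the forward model at inference time by training on multiple sets of truth-observation pairs drawn from a wide range of priors. The methods are benchmarked on synthetic and real datasets in inverse imaging, high energy physics, and full waveform inversion, and the use of ensemble information is reported to help posterior inference and enable generalization to unseen priors.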

论文引入了集合逆问题（EIP），并提出了一种非迭代的推理时方法，使用集合逆生成模型来解决它。这些方法通过在多组真值-观测对上进行训练，避免在推理时显式使用前向模型。关键发现表明，结合集合信息可以改进后验推断并能够泛化到未见过的先验，已在逆成像、高能物理和全波形反演等多个合成和真实数据集上进行了验证。

From Logits to Latents: Contrastive Representation Shaping for LLM Unlearning

Authors: Haoran Tang, Rajiv Khanna

First: 2026-01-29T17:34:37+00:00 · Latest: 2026-01-29T17:34:37+00:00

Abs · PDF · Code1 · Code2

Abstract

Most LLM unlearning methods aim to approximate retrain-from-scratch behaviors with minimal distribution shift, often via alignment-style objectives defined in the prediction space. While effective at reducing forgotten content generation, such approaches may act as suppression: forgotten concepts can persist in representations and remain entangled with retained knowledge. We introduce CLReg, a contrastive representation regularizer that identifies forget features while pushing them away from retain features, explicitly reducing forget-retain interference with minimal shifts on retain features. We provide first theoretical insights that relate representation shaping to entanglement reduction. Across unlearning benchmarks and LLMs of different sizes, CLReg decreases forget-retain representation entanglement, which facilitates mainstream unlearning methods without posing extra privacy risks, inspiring future work that reshapes the representation space to remove forget concepts.

中文标题/摘要

标题：从Logits到潜在表示：对比表示塑形在大模型去学习中的应用

大多数大模型去学习方法旨在通过最小的分布偏移来近似从头重新训练的行为，通常通过在预测空间中定义对齐式目标来实现。虽然这些方法在减少遗忘内容生成方面非常有效，但它们可能只是起到抑制作用：遗忘的概念可能会在表示中持续存在，并与保留的知识纠缠在一起。我们引入了CLReg，这是一种对比表示正则化器，能够识别遗忘特征并将其推离保留特征，从而在最小化保留特征偏移的情况下显式地减少遗忘-保留干扰。我们提供了关于表示塑形与纠缠减少之间关系的初步理论见解。在多个去学习基准和不同规模的大模型上，CLReg减少了遗忘-保留表示纠缠，从而有助于主流去学习方法，且不会带来额外的隐私风险，启发了未来通过重塑表示空间来移除遗忘概念的工作。

CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty

Authors: Johannes Kirmayr, Lukas Stappen, Elisabeth André

First: 2026-01-29T17:33:42+00:00 · Latest: 2026-01-29T17:33:42+00:00

Abs · PDF · Code1 · Code2

Abstract

Existing benchmarks for Large Language Model (LLM) agents focus on task completion under idealistic settings but overlook reliability in real-world, user-facing applications. In domains such as in-car voice assistants, users often issue incomplete or ambiguous requests, creating intrinsic uncertainty that agents must manage through dialogue, tool use, and policy adherence. We introduce CAR-bench, a benchmark for evaluating consistency, uncertainty handling, and capability awareness in multi-turn, tool-using LLM agents in an in-car assistant domain. The environment features an LLM-simulated user, domain policies, and 58 interconnected tools spanning navigation, productivity, charging, and vehicle control. Beyond standard task completion, CAR-bench introduces Hallucination tasks that test agents' limit-awareness under missing tools or information, and Disambiguation tasks that require resolving uncertainty through clarification or internal information gathering. Baseline results reveal large gaps between occasional and consistent success on all task types. Even frontier reasoning LLMs achieve less than 50% consistent pass rate on Disambiguation tasks due to premature actions, and frequently violate policies or fabricate information to satisfy user requests in Hallucination tasks, underscoring the need for more reliable and self-aware LLM agents in real-world settings.

中文标题/摘要

标题：CAR-bench：评估大型语言模型代理在现实世界不确定性下的一致性和极限意识

现有的大型语言模型（LLM）代理基准主要关注理想条件下的任务完成，而忽视了面向用户的实际应用中的可靠性。在诸如车载语音助手等领域，用户经常发出不完整或含糊不清的请求，这为代理带来了固有的不确定性，代理需要通过对话、工具使用和政策遵守来管理这种不确定性。我们引入了CAR-bench，这是一个评估车载助手领域中多轮次、使用工具的LLM代理在一致性、不确定性处理和能力意识方面的基准。该环境包括一个LLM模拟用户、领域政策和涵盖导航、生产力、充电和车辆控制的58个相互连接的工具。除了标准任务完成，CAR-bench还引入了幻觉任务，测试代理在缺少工具或信息时的极限意识，以及消歧任务，要求通过澄清或内部信息收集来解决不确定性。基线结果表明，在所有任务类型中，偶尔成功和持续成功之间存在巨大差距。即使是前沿推理LLM，由于过早采取行动，在消歧任务中的持续通过率也低于50%；在幻觉任务中，它们经常违反政策或编造信息以满足用户请求。这凸显了在实际应用中需要更可靠、更具自我意识的LLM代理。

Summary / 总结

CAR-bench evaluates the consistency and limit-awareness of LLM agents under real-world uncertainty in an in-car assistant domain. The benchmark covers multi-turn dialogue, tool use, and policy adherence, and adds Hallucination and Disambiguation tasks. Baseline results show large gaps between occasional and consistent success: even frontier reasoning LLMs stay below a 50% consistent pass rate on Disambiguation tasks because of premature actions, and they frequently violate policies or fabricate information to satisfy user requests in Hallucination tasks.

CAR-bench 评估了 LLM 代理在处理现实世界不确定性方面的一致性和极限意识，特别是在车载助手领域。基准测试包括多轮对话、工具使用和政策遵守，并设有幻觉任务和消歧任务。结果显示在所有任务类型中，偶尔成功与持续成功之间存在显著差距：即使是前沿推理 LLM，由于过早采取行动，在消歧任务中的持续通过率也低于50%，并且在幻觉任务中经常违反政策或编造信息以满足用户请求。

Hybrid Foveated Path Tracing with Peripheral Gaussians for Immersive Anatomy

Authors: Constantin Kleinbeck, Luisa Theelke, Hannah Schieber, Ulrich Eck, Rüdiger von Eisenhart-Rothe, Daniel Roth

First: 2026-01-29T17:33:14+00:00 · Latest: 2026-01-29T17:33:14+00:00

Comments: Scheduled for publication in the Proceedings of IEEE VR 2026

Abs · PDF · Code1 · Code2

Abstract

Volumetric medical imaging offers great potential for understanding complex pathologies. Yet, traditional 2D slices provide little support for interpreting spatial relationships, forcing users to mentally reconstruct anatomy into three dimensions. Direct volumetric path tracing and VR rendering can improve perception but are computationally expensive, while precomputed representations, like Gaussian Splatting, require planning ahead. Both approaches limit interactive use. We propose a hybrid rendering approach for high-quality, interactive, and immersive anatomical visualization. Our method combines streamed foveated path tracing with a lightweight Gaussian Splatting approximation of the periphery. The peripheral model generation is optimized with volume data and continuously refined using foveal renderings, enabling interactive updates. Depth-guided reprojection further improves robustness to latency and allows users to balance fidelity with refresh rate. We compare our method against direct path tracing and Gaussian Splatting. Our results highlight how their combination can preserve strengths in visual quality while re-generating the peripheral model in under a second, eliminating extensive preprocessing and approximations. This opens new options for interactive medical visualization.

中文标题/摘要

标题：混合焦斑路径追踪与外围高斯分布相结合的沉浸式解剖学可视化

体视医学成像为理解复杂的病理学提供了巨大的潜力。然而，传统的二维切片在解释空间关系方面支持有限，迫使用户在三维中重建解剖结构。直接体视路径追踪和VR渲染可以改善感知，但计算成本高昂，而预计算表示，如高斯散点图，需要提前规划。这两种方法都限制了交互式使用。我们提出了一种混合渲染方法，用于高质量、交互式和沉浸式的解剖学可视化。我们的方法结合了流式焦斑路径追踪和外围的轻量级高斯散点图近似。外围模型的生成通过体数据优化，并通过焦斑渲染不断细化，从而实现交互式更新。深度引导的重新投影进一步提高了对延迟的鲁棒性，并允许用户在保真度与刷新率之间进行权衡。我们将我们的方法与直接路径追踪和高斯散点图进行了比较。我们的结果表明，它们的结合可以保留视觉质量的优势，同时在不到一秒的时间内重新生成外围模型，从而消除大量的预处理和近似。这为交互式医学可视化提供了新的选择。

Summary / 总结

The paper proposes a hybrid foveated path tracing method that combines streamed foveated path tracing with a lightweight Gaussian Splatting approximation for the periphery, enabling high-quality, interactive, and immersive anatomical visualization. The peripheral model is optimized with volume data and continuously refined using foveal renderings, allowing for interactive updates. Experimental results show that this approach preserves visual quality while regenerating the peripheral model in under a second, eliminating the need for extensive preprocessing and approximations, thus opening new options for interactive medical visualization.

该研究提出了一种结合了流式注视点路径追踪和轻量级外围高斯泼溅近似的混合渲染方法，以实现高质量、交互式和沉浸式的解剖可视化。外围模型通过体数据进行优化，并通过注视点渲染不断细化，从而实现交互式更新。实验结果表明，该方法能够在不到一秒的时间内重新生成外围模型，保留视觉质量的同时消除大量预处理和近似，为交互式医学可视化提供了新的选择。
