arXiv Paper Digest

Snapshot: 20260202_0332

RedSage: A Cybersecurity Generalist LLM

Authors: Naufal Suryanto, Muzammal Naseer, Pengfei Li, Syed Talal Wasim, Jinhui Yi, Juergen Gall, Paolo Ceravolo, Ernesto Damiani

Venue: ICLR 2026

First: 2026-01-29T18:59:57+00:00 · Latest: 2026-01-29T18:59:57+00:00

Comments: Accepted at ICLR 2026; Project page: https://risys-lab.github.io/RedSage/


Abstract

Cybersecurity operations demand assistant LLMs that support diverse workflows without exposing sensitive data. Existing solutions either rely on proprietary APIs with privacy risks or on open models lacking domain adaptation. To bridge this gap, we curate 11.8B tokens of cybersecurity-focused continual pretraining data via large-scale web filtering and manual collection of high-quality resources, spanning 28.6K documents across frameworks, offensive techniques, and security tools. Building on this, we design an agentic augmentation pipeline that simulates expert workflows to generate 266K multi-turn cybersecurity samples for supervised fine-tuning. Combined with general open-source LLM data, these resources enable the training of RedSage, an open-source, locally deployable cybersecurity assistant with domain-aware pretraining and post-training. To rigorously evaluate the models, we introduce RedSage-Bench, a benchmark with 30K multiple-choice and 240 open-ended Q&A items covering cybersecurity knowledge, skills, and tool expertise. RedSage is further evaluated on established cybersecurity benchmarks (e.g., CTI-Bench, CyberMetric, SECURE) and general LLM benchmarks to assess broader generalization. At the 8B scale, RedSage achieves consistently better results, surpassing the baseline models by up to +5.59 points on cybersecurity benchmarks and +5.05 points on Open LLM Leaderboard tasks. These findings demonstrate that domain-aware agentic augmentation and pre/post-training can not only enhance cybersecurity-specific expertise but also help to improve general reasoning and instruction-following. All models, datasets, and code are publicly available.

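Title / Abstract (translated from Chinese)

Title: RedSage: A Generalist Large Language Model for Cybersecurity

Cybersecurity operations need assistant large language models that can support diverse workflows without leaking sensitive data. Existing solutions either depend on proprietary APIs that carry privacy risks or build on open-source models that lack domain adaptation. To close this gap, we assemble 11.8 billion tokens of cybersecurity continual-pretraining data through large-scale web filtering and manual collection of high-quality resources, covering 28,600 documents on frameworks, attack techniques, and security tools. On this basis, we design an agentic augmentation pipeline that simulates expert workflows to generate 266,000 multi-turn cybersecurity dialogue samples for supervised fine-tuning. Combined with general open-source LLM data, these resources let us train RedSage, an open-source, locally deployable cybersecurity assistant with domain-aware pretraining and post-training. To evaluate the models rigorously, we introduce the RedSage-Bench benchmark, with 30,000 multiple-choice questions and 240 open-ended Q&A items covering cybersecurity knowledge, skills, and tool expertise. RedSage is also evaluated on established cybersecurity benchmarks (such as CTI-Bench, CyberMetric, and SECURE) and on general LLM benchmarks to assess broader generalization. At the 8B scale, RedSage achieves consistently better results, exceeding baseline models by up to 5.59 points on cybersecurity benchmarks and up to 5.05 points on Open LLM Leaderboard tasks. These findings indicate that domain-aware agentic augmentation and pre/post-training not only strengthen cybersecurity-specific expertise but also help improve general reasoning and instruction following. All models, datasets, and code are publicly available.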

Summary

RedSage is designed to address the need for a cybersecurity assistant LLM that supports diverse workflows while maintaining data privacy. It leverages 11.8B tokens of curated cybersecurity data and an agentic augmentation pipeline to generate 266K multi-turn samples for fine-tuning. RedSage outperforms baseline models by up to 5.59 points on cybersecurity benchmarks and 5.05 points on general LLM tasks, demonstrating the effectiveness of domain-aware pre/post-training and agentic augmentation. The model is open-source and is evaluated on RedSage-Bench and other established benchmarks.

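RedSage aims to support diverse cybersecurity workflows while preserving data privacy. It uses 11.8B tokens of cybersecurity data and an agentic augmentation pipeline that generates 266K samples for fine-tuning. RedSage outperforms baseline models by up to 5.59 points on cybersecurity benchmarks and by up to 5.05 points on general LLM benchmarks, supporting the effectiveness of domain-aware pre/post-training and agentic augmentation. The model is open-source and publicly available.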

UEval: A Benchmark for Unified Multimodal Generation

Authors: Bo Li, Yida Yin, Wenhao Chai, Xingyu Fu, Zhuang Liu

First: 2026-01-29T18:59:52+00:00 · Latest: 2026-01-29T18:59:52+00:00


Abstract

We introduce UEval, a benchmark to evaluate unified models, i.e., models capable of generating both images and text. UEval comprises 1,000 expert-curated questions that require both images and text in the model output, sourced from 8 real-world tasks. Our curated questions cover a wide range of reasoning types, from step-by-step guides to textbook explanations. Evaluating open-ended multimodal generation is non-trivial, as simple LLM-as-a-judge methods can miss the subtleties. Unlike previous work that relies on multimodal Large Language Models (MLLMs) to rate image quality or text accuracy, we design a rubric-based scoring system in UEval. For each question, reference images and text answers are provided to an MLLM to generate an initial rubric consisting of multiple evaluation criteria, and human experts then refine and validate these rubrics. In total, UEval contains 10,417 validated rubric criteria, enabling scalable and fine-grained automatic scoring. UEval is challenging for current unified models: GPT-5-Thinking scores only 66.4 out of 100, while the best open-source model reaches merely 49.1. We observe that reasoning models often outperform non-reasoning ones, and transferring reasoning traces from a reasoning model to a non-reasoning model significantly narrows the gap. This suggests that reasoning may be important for tasks requiring complex multimodal understanding and generation.

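Title / Abstract (translated from Chinese)

Title: UEval: A Benchmark for Unified Multimodal Generation

We introduce UEval, a benchmark for evaluating unified models, that is, models that can generate both images and text. UEval contains 1,000 expert-curated questions, drawn from 8 real-world tasks, that require both images and text in the model output. The curated questions span a wide range of reasoning types, from step-by-step guides to textbook explanations. Evaluating open-ended multimodal generation is not straightforward, because simple LLM-as-a-judge methods can miss subtleties. Unlike prior work that relies on multimodal large language models (MLLMs) to rate image quality or text accuracy, UEval uses a rubric-based scoring system. For each question, reference images and text answers are given to an MLLM to generate an initial rubric made up of multiple evaluation criteria, and human experts then refine and validate these rubrics. In total, UEval contains 10,417 validated rubric criteria, enabling scalable and fine-grained automatic scoring. UEval is challenging for current unified models: GPT-5-Thinking scores only 66.4 out of 100, while the best open-source model scores only 49.1. We observe that reasoning models usually outperform non-reasoning models, and that transferring reasoning traces from a reasoning model to a non-reasoning model significantly narrows the gap. This suggests that reasoning may be important for tasks that require complex multimodal understanding and generation.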

Summary

UEval is a benchmark for evaluating unified models that generate both images and text, using 1,000 expert-curated questions from 8 real-world tasks. It employs a rubric-based scoring system in which an MLLM generates initial criteria and human experts refine them. UEval challenges current models, with GPT-5-Thinking scoring 66.4 and the best open-source model at 49.1. Reasoning models outperform non-reasoning ones, and transferring reasoning traces can significantly improve performance.

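UEval is a benchmark for evaluating unified models that can generate both images and text, with 1,000 expert-curated questions from 8 real-world tasks. It uses a rubric-based scoring system, validated by human experts, to assess image and text quality. Current unified models find it difficult: GPT-5-Thinking scores only 66.4 and the best open-source model only 49.1, which reflects the challenge of complex multimodal understanding and generation. Reasoning models outperform non-reasoning models, and transferring reasoning traces from a reasoning model to a non-reasoning model can significantly improve performance.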
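
The UEval abstract describes rubric criteria per question but does not say how they are aggregated into a 0-100 score. The sketch below is one plausible reading, assuming every criterion has equal weight and a judge has already marked each one as satisfied or not. The `Criterion` class, `rubric_score` function, and example criteria are hypothetical and are not taken from the benchmark.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric item (hypothetical structure, not UEval's actual schema)."""
    description: str
    satisfied: bool  # in practice a judge model would decide this


def rubric_score(criteria: list[Criterion]) -> float:
    """Return the share of satisfied criteria on a 0-100 scale.

    Assumption: all criteria are weighted equally.
    """
    if not criteria:
        raise ValueError("a rubric needs at least one criterion")
    met = sum(c.satisfied for c in criteria)
    return 100.0 * met / len(criteria)


# Figures from the abstract: 10,417 criteria spread over 1,000 questions.
avg_criteria_per_question = 10_417 / 1_000

# Made-up rubric for a single question, three of four criteria met.
example = [
    Criterion("image shows every step in order", True),
    Criterion("text names the required tool", True),
    Criterion("image and text agree on quantities", False),
    Criterion("explanation is self-contained", True),
]
print(rubric_score(example))       # 75.0
print(avg_criteria_per_question)   # 10.417
```

On these numbers a question carries roughly ten criteria on average, which is what makes the scoring fine-grained compared with a single holistic judge rating.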

Exploring Reasoning Reward Model for Agents

Authors: Kaixuan Fan, Kaituo Feng, Manyuan Zhang, Tianshuo Peng, Zhixun Li, Yilei Jiang, Shuang Chen, Peng Pei, Xunliang Cai, Xiangyu Yue

First: 2026-01-29T18:59:52+00:00 · Latest: 2026-01-29T18:59:52+00:00

Comments: Project page: https://github.com/kxfan2002/Reagent


Abstract

Agentic Reinforcement Learning (Agentic RL) has achieved notable success in enabling agents to perform complex reasoning and tool use. However, most methods still rely on sparse outcome-based rewards for training. Such feedback fails to differentiate intermediate reasoning quality, leading to suboptimal training results. In this paper, we introduce Agent Reasoning Reward Model (Agent-RRM), a multi-faceted reward model that produces structured feedback for agentic trajectories, including (1) an explicit reasoning trace, (2) a focused critique that provides refinement guidance by highlighting reasoning flaws, and (3) an overall score that evaluates process performance. Leveraging these signals, we systematically investigate three integration strategies: Reagent-C (text-augmented refinement), Reagent-R (reward-augmented guidance), and Reagent-U (unified feedback integration). Extensive evaluations across 12 diverse benchmarks demonstrate that Reagent-U yields substantial performance leaps, achieving 43.7% on GAIA and 46.2% on WebWalkerQA, validating the effectiveness of our reasoning reward model and training schemes. Code, models, and datasets are all released to facilitate future research.

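Title / Abstract (translated from Chinese)

Title: Exploring Reasoning Reward Model for Agents

Agentic Reinforcement Learning (Agentic RL) has achieved notable success in enabling agents to carry out complex reasoning and tool use. However, most methods still rely on sparse outcome-based rewards for training. Such feedback cannot distinguish the quality of intermediate reasoning, which leads to suboptimal training results. In this paper, we introduce the Agent Reasoning Reward Model (Agent-RRM), a multi-faceted reward model that provides structured feedback on agent trajectories, consisting of (1) an explicit reasoning trace, (2) a focused critique that offers refinement guidance by highlighting reasoning flaws, and (3) an overall score that evaluates process performance. Using these signals, we systematically study three integration strategies: Reagent-C (text-augmented refinement), Reagent-R (reward-augmented guidance), and Reagent-U (unified feedback integration). Extensive evaluation on 12 diverse benchmarks shows that Reagent-U delivers substantial performance gains, reaching 43.7% on GAIA and 46.2% on WebWalkerQA, which validates the effectiveness of our reasoning reward model and training schemes. Code, models, and datasets are all released to support future research.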

Summary

This paper addresses the limitation of sparse outcome-based rewards in Agentic Reinforcement Learning by proposing Agent Reasoning Reward Model (Agent-RRM), which provides structured feedback including reasoning traces, focused critiques, and overall scores. Three integration strategies—Reagent-C, Reagent-R, and Reagent-U—are evaluated, with Reagent-U showing significant performance improvements, achieving 43.7% and 46.2% on GAIA and WebWalkerQA benchmarks, respectively.

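To address the poor training results caused by sparse outcome-based rewards in agentic reinforcement learning, this paper proposes the Agent Reasoning Reward Model (Agent-RRM), which provides structured feedback consisting of reasoning traces, focused critiques, and overall scores. Three integration strategies (Reagent-C, Reagent-R, and Reagent-U) are evaluated; Reagent-U shows significant performance gains, reaching 43.7% on GAIA and 46.2% on WebWalkerQA.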
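
To make the three-part feedback concrete, here is a minimal sketch of how such a record could be represented and how its score might supplement a sparse outcome reward. The abstract does not give the combination rule used by Reagent-R, so the weighted sum, the `weight` parameter, and the assumed score range of [0, 1] are illustrative assumptions only.

```python
from dataclasses import dataclass


@dataclass
class RRMFeedback:
    """Structured feedback for one agent trajectory (hypothetical container)."""
    reasoning_trace: str  # (1) explicit reasoning about the trajectory
    critique: str         # (2) focused critique highlighting reasoning flaws
    score: float          # (3) overall process score, assumed to lie in [0, 1]


def augmented_reward(outcome_reward: float, feedback: RRMFeedback,
                     weight: float = 0.5) -> float:
    """Add a weighted process score to the outcome reward.

    Assumption: a simple weighted sum; the paper's actual rule may differ.
    """
    return outcome_reward + weight * feedback.score


# Two trajectories that both fail the task (outcome reward 0.0) but differ
# in the quality of their intermediate reasoning.
careful = RRMFeedback("searched, cross-checked two sources", "final step misread a date", 0.8)
sloppy = RRMFeedback("guessed after one search", "no verification of the claim", 0.2)

r_careful = augmented_reward(0.0, careful)
r_sloppy = augmented_reward(0.0, sloppy)
print(r_careful, r_sloppy)  # 0.4 0.1
```

With an outcome-only reward both trajectories would receive 0.0; the added process score is what lets training tell them apart.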

DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation

Authors: Haozhe Xie, Beichen Wen, Jiarui Zheng, Zhaoxi Chen, Fangzhou Hong, Haiwen Diao, Ziwei Liu

Venue: www

First: 2026-01-29T18:59:51+00:00 · Latest: 2026-01-29T18:59:51+00:00

Comments: Project Page: https://www.infinitescript.com/project/dynamic-vla/ GitHub: https://github.com/hzxie/DynamicVLA


Abstract

Manipulating dynamic objects remains an open challenge for Vision-Language-Action (VLA) models, which, despite strong generalization in static manipulation, struggle in dynamic scenarios requiring rapid perception, temporal anticipation, and continuous control. We present DynamicVLA, a framework for dynamic object manipulation that integrates temporal reasoning and closed-loop adaptation through three key designs: 1) a compact 0.4B VLA using a convolutional vision encoder for spatially efficient, structurally faithful encoding, enabling fast multimodal inference; 2) Continuous Inference, enabling overlapping reasoning and execution for lower latency and timely adaptation to object motion; and 3) Latent-aware Action Streaming, which bridges the perception-execution gap by enforcing temporally aligned action execution. To fill the missing foundation of dynamic manipulation data, we introduce the Dynamic Object Manipulation (DOM) benchmark, built from scratch with an auto data collection pipeline that efficiently gathers 200K synthetic episodes across 2.8K scenes and 206 objects, and enables fast collection of 2K real-world episodes without teleoperation. Extensive evaluations demonstrate remarkable improvements in response speed, perception, and generalization, positioning DynamicVLA as a unified framework for general dynamic object manipulation across embodiments.

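Title / Abstract (translated from Chinese)

Title: DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation

Manipulating dynamic objects remains an open challenge for Vision-Language-Action (VLA) models: although they generalize strongly in static manipulation, they struggle in dynamic scenes that demand rapid perception, temporal anticipation, and continuous control. We present DynamicVLA, a dynamic object manipulation framework that combines temporal reasoning with closed-loop adaptation through three key designs: 1) a compact 0.4B VLA that uses a convolutional vision encoder for spatially efficient, structurally faithful encoding, enabling fast multimodal inference; 2) Continuous Inference, which overlaps reasoning and execution to lower latency and adapt to object motion in time; and 3) Latent-aware Action Streaming, which bridges the perception-execution gap by enforcing temporally aligned action execution. To fill the gap in foundational data for dynamic manipulation, we introduce the Dynamic Object Manipulation (DOM) benchmark, built from scratch with an automatic data collection pipeline that efficiently gathers 200K synthetic episodes across 2.8K scenes and 206 objects and enables fast collection of 2K real-world episodes without teleoperation. Extensive evaluations show marked improvements in response speed, perception, and generalization, positioning DynamicVLA as a unified framework for general dynamic object manipulation across embodiments.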

Summary

DynamicVLA is a framework designed to address the challenges of dynamic object manipulation, which are not well handled by existing VLA models due to their difficulties in rapid perception, temporal anticipation, and continuous control. It integrates temporal reasoning and closed-loop adaptation through a compact vision-language-action model, continuous inference, and latent-aware action streaming. The framework is evaluated on a new benchmark, DOM, which includes 200K synthetic episodes and 2K real-world episodes, demonstrating improvements in response speed, perception, and generalization for dynamic manipulation tasks.

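DynamicVLA aims to address the challenges VLA models face in dynamic object manipulation by integrating temporal reasoning and closed-loop adaptation. It comprises a compact vision-language-action model, continuous inference for timely adaptation, and latent-aware alignment of action execution. The model is evaluated on the DOM benchmark, which includes 200K synthetic episodes and 2K real-world episodes, and shows improvements in response speed, perception, and generalization.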

Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions

Authors: Xiaoxiao Sun, Mingyang Li, Kun Yuan, Min Woo Sun, Mark Endo, Shengguang Wu, Changlin Li, Yuhui Zhang, Zeyu Wang, Serena Yeung-Levy

First: 2026-01-29T18:59:24+00:00 · Latest: 2026-01-29T18:59:24+00:00

Comments: 26 pages, 31 figures, 13 tables. Project Page: https://sites.google.com/view/vi-probe/


Abstract

Large Vision-Language Models (VLMs) often answer classic visual illusions "correctly" on original images, yet persist with the same responses when illusion factors are inverted, even though the visual change is obvious to humans. This raises a fundamental question: do VLMs perceive visual changes or merely recall memorized patterns? While several studies have noted this phenomenon, the underlying causes remain unclear. To move from observations to systematic understanding, this paper introduces VI-Probe, a controllable visual-illusion framework with graded perturbations and matched visual controls (without illusion inducer) that disentangles visually grounded perception from language-driven recall. Unlike prior work that focuses on averaged accuracy, we measure stability and sensitivity using Polarity-Flip Consistency, Template Fixation Index, and an illusion multiplier normalized against matched controls. Experiments across different families reveal that response persistence arises from heterogeneous causes rather than a single mechanism. For instance, GPT-5 exhibits memory override, Claude-Opus-4.1 shows perception-memory competition, while Qwen variants suggest visual-processing limits. Our findings challenge single-cause views and motivate probing-based evaluation that measures both knowledge and sensitivity to controlled visual change. Data and code are available at https://sites.google.com/view/vi-probe/.

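Title / Abstract (translated from Chinese)

Title: Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions

Large vision-language models (VLMs) usually answer classic visual illusions correctly on the original images, yet they keep the same answers after the illusion factors are inverted, even though the visual change is obvious to humans. This raises a basic question: do VLMs perceive visual changes, or do they merely recall memorized patterns? Although several studies have noted this phenomenon, its underlying causes remain unclear. To move from observation to systematic understanding, this paper introduces VI-Probe, a controllable visual-illusion framework with graded perturbations and matched visual controls (without the illusion inducer) that separates visually grounded perception from language-driven recall. Unlike prior work that focuses mainly on average accuracy, we measure stability and sensitivity with Polarity-Flip Consistency, the Template Fixation Index, and an illusion multiplier normalized against matched controls. Experiments across model families show that response persistence arises from several causes rather than a single mechanism. For example, GPT-5 exhibits memory override, Claude-Opus-4.1 shows competition between perception and memory, and Qwen variants point to visual-processing limits. Our findings challenge single-cause views and motivate probing-based evaluation that measures both knowledge and sensitivity to controlled visual change. Data and code are available at https://sites.google.com/view/vi-probe/.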

Summary

This paper investigates whether large vision-language models (VLMs) perceive visual changes or merely recall memorized patterns by using a controllable visual-illusion framework called VI-Probe. The study finds that response persistence in VLMs arises from various causes, including memory override, perception-memory competition, and visual-processing limits, challenging the notion of a single underlying mechanism. The authors measure stability and sensitivity using Polarity-Flip Consistency, Template Fixation Index, and an illusion multiplier normalized against matched controls.

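Using VI-Probe, a controllable visual-illusion framework, this study examines whether large vision-language models (VLMs) perceive visual changes or merely recall memorized patterns. It measures stability and sensitivity with Polarity-Flip Consistency, the Template Fixation Index, and an illusion multiplier. Experiments show that response persistence in VLMs stems from multiple causes, including memory override, perception-memory competition, and visual-processing limits, which challenges the single-mechanism view. The results indicate that probing-based evaluation is needed to assess both knowledge and sensitivity to controlled visual change.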

DynaWeb: Model-Based Reinforcement Learning of Web Agents

Authors: Hang Ding, Peidong Liu, Junqiao Wang, Ziwei Ji, Meng Cao, Rongzhao Zhang, Lynn Ai, Eric Yang, Tianyu Shi, Lei Yu

First: 2026-01-29T18:59:07+00:00 · Latest: 2026-01-29T18:59:07+00:00


Abstract

The development of autonomous web agents, powered by Large Language Models (LLMs) and reinforcement learning (RL), represents a significant step towards general-purpose AI assistants. However, training these agents is severely hampered by the challenges of interacting with the live internet, which is inefficient, costly, and fraught with risks. Model-based reinforcement learning (MBRL) offers a promising solution by learning a world model of the environment to enable simulated interaction. This paper introduces DynaWeb, a novel MBRL framework that trains web agents through interacting with a web world model trained to predict naturalistic web page representations given agent actions. This model serves as a synthetic web environment where an agent policy can dream by generating vast quantities of rollout action trajectories for efficient online reinforcement learning. Beyond free policy rollouts, DynaWeb incorporates real expert trajectories from training data, which are randomly interleaved with on-policy rollouts during training to improve stability and sample efficiency. Experiments conducted on the challenging WebArena and WebVoyager benchmarks demonstrate that DynaWeb consistently and significantly improves the performance of state-of-the-art open-source web agent models. Our findings establish the viability of training web agents through imagination, offering a scalable and efficient way to scale up online agentic RL.

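Title / Abstract (translated from Chinese)

Title: DynaWeb: Model-Based Reinforcement Learning of Web Agents

The development of autonomous web agents, driven by large language models (LLMs) and reinforcement learning (RL), is an important step toward general-purpose AI assistants. However, training these agents is severely hindered by the challenges of interacting with the live internet, which is inefficient, costly, and risky. Model-based reinforcement learning (MBRL) offers a promising solution by learning a world model of the environment to enable simulated interaction. This paper introduces DynaWeb, a novel MBRL framework that trains web agents by having them interact with a web world model trained to predict naturalistic web page representations given agent actions. The model serves as a synthetic web environment in which the agent policy can generate large numbers of rollout action trajectories for efficient online reinforcement learning. Beyond free policy rollouts, DynaWeb incorporates real expert trajectories from the training data, randomly interleaved with on-policy rollouts during training, to improve stability and sample efficiency. Experiments on the challenging WebArena and WebVoyager benchmarks show that DynaWeb consistently and significantly improves the performance of state-of-the-art open-source web agent models. Our results demonstrate the viability of training web agents through imagination, providing a scalable and efficient way to scale up online agentic RL.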

Summary

DynaWeb is a model-based reinforcement learning framework designed to train web agents using a web world model that predicts naturalistic web page representations. This approach enables efficient simulated interaction and incorporates real expert trajectories to enhance stability and sample efficiency. Experiments on WebArena and WebVoyager show that DynaWeb significantly improves the performance of state-of-the-art web agent models.

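DynaWeb is a model-based reinforcement learning framework that trains web agents in a synthetic web environment. It addresses the challenges of real-world web interaction by learning a world model that predicts web page representations from agent actions. DynaWeb combines free policy rollouts with real expert trajectories to improve stability and sample efficiency. Experiments on WebArena and WebVoyager show that DynaWeb significantly improves the performance of state-of-the-art web agent models.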
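
The abstract states that expert trajectories are randomly interleaved with on-policy rollouts but gives no mixing rule. The sketch below assumes the simplest option, an independent coin flip per batch slot. `build_training_batch`, `expert_prob`, and the stub rollout function are hypothetical names for illustration; in the real system the rollout would come from the policy acting inside the learned web world model.

```python
import random
from typing import Callable, List, Sequence

Trajectory = List[str]  # a trajectory is simplified here to a list of action strings


def build_training_batch(
    rollout_fn: Callable[[], Trajectory],
    expert_trajectories: Sequence[Trajectory],
    batch_size: int,
    expert_prob: float,
    rng: random.Random,
) -> List[Trajectory]:
    """Fill a batch by randomly mixing expert data with fresh rollouts.

    Assumption: each slot is drawn independently, taking an expert trajectory
    with probability `expert_prob` and a simulated rollout otherwise.
    """
    batch: List[Trajectory] = []
    for _ in range(batch_size):
        if expert_trajectories and rng.random() < expert_prob:
            batch.append(list(rng.choice(expert_trajectories)))
        else:
            batch.append(rollout_fn())
    return batch


def stub_rollout() -> Trajectory:
    """Stand-in for a rollout generated inside the world model."""
    return ["click(search_box)", "type('laptop')", "click(submit)"]


experts = [["click(login)", "type('user')", "click(ok)"]]
batch = build_training_batch(stub_rollout, experts, batch_size=8,
                             expert_prob=0.25, rng=random.Random(0))
print(len(batch))  # 8
```

Setting `expert_prob` to 0.0 recovers pure imagination-based training, while 1.0 degenerates to learning from expert data only.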

FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale

Authors: Ajay Patel, Colin Raffel, Chris Callison-Burch

First: 2026-01-29T18:58:47+00:00 · Latest: 2026-01-29T18:58:47+00:00


Abstract

Due to limited supervised training data, large language models (LLMs) are typically pre-trained via a self-supervised "predict the next word" objective on a vast amount of unstructured text data. To make the resulting model useful to users, it is further trained on a far smaller amount of "instruction-tuning" data comprised of supervised training examples of instructions and responses. To overcome the limited amount of supervised data, we propose a procedure that can transform the knowledge in internet-scale pre-training documents into billions of synthetic instruction and answer training pairs. The resulting dataset, called FineInstructions, uses ~18M instruction templates created from real user-written queries and prompts. These instruction templates are matched to and instantiated with human-written source documents from unstructured pre-training corpora. With "supervised" synthetic training data generated at this scale, an LLM can be pre-trained from scratch solely with the instruction-tuning objective, which is far more in-distribution with the expected downstream usage of LLMs (responding to user prompts). We conduct controlled token-for-token training experiments and find pre-training on FineInstructions outperforms standard pre-training and other proposed synthetic pre-training techniques on standard benchmarks measuring free-form response quality. Our resources can be found at https://huggingface.co/fineinstructions .

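Title / Abstract (translated from Chinese)

Title: FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale

Because supervised training data are limited, large language models (LLMs) are usually pre-trained with a self-supervised "predict the next word" objective on large amounts of unstructured text. To make the resulting model useful to users, it is then trained further on a much smaller amount of "instruction-tuning" data made up of supervised examples of instructions and responses. To overcome the shortage of supervised data, we propose a procedure that converts the knowledge in internet-scale pre-training documents into billions of synthetic instruction-answer training pairs. The resulting dataset, FineInstructions, uses about 18 million instruction templates created from real user-written queries and prompts. These templates are matched to, and instantiated with, human-written source documents from unstructured pre-training corpora. With "supervised" synthetic training data at this scale, an LLM can be pre-trained from scratch using the instruction-tuning objective alone, which is much closer to the expected downstream use of LLMs (responding to user prompts). We run controlled token-for-token training experiments and find that pre-training on FineInstructions outperforms standard pre-training and other proposed synthetic pre-training techniques on standard benchmarks of free-form response quality. See https://huggingface.co/fineinstructions for our resources.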

Summary

The research aims to address the limitation of supervised training data for large language models (LLMs) by proposing a method to generate synthetic instruction and answer pairs at a massive scale. By using 18 million instruction templates derived from real user queries and prompts, and matching them with human-written source documents from unstructured pre-training corpora, the study creates a dataset called FineInstructions. Experiments show that pre-training LLMs on this synthetic data outperforms standard pre-training methods and other synthetic pre-training techniques on benchmarks measuring free-form response quality.

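The research aims to address the shortage of supervised training data for large language models (LLMs) by generating synthetic instruction-answer pairs from internet-scale pre-training documents. The method uses about 18 million instruction templates derived from real user queries and prompts, matching and instantiating them with human-written source documents. The resulting FineInstructions dataset allows LLMs to be pre-trained with the instruction-tuning objective alone, outperforming standard pre-training and other synthetic pre-training techniques on standard benchmarks of free-form response quality.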

MORPH: PDE Foundation Models with Arbitrary Data Modality

Authors: Mahindra Singh Rautela, Alexander Most, Siddharth Mansingh, Bradley C. Love, Alexander Scheinker, Diane Oyen, Nathan Debardeleben, Earl Lawrence, Ayan Biswas

First: 2025-09-25T22:38:36+00:00 · Latest: 2026-01-29T18:57:23+00:00


Abstract

We introduce MORPH, a modality-agnostic, autoregressive foundation model for partial differential equations (PDEs). MORPH is built on a convolutional vision transformer backbone that seamlessly handles heterogeneous spatiotemporal datasets of varying data modality (1D to 3D) at different resolutions, and multiple fields with mixed scalar and vector components. The architecture combines (i) component-wise convolution, which jointly processes scalar and vector channels to capture local interactions, (ii) inter-field cross-attention, which models and selectively propagates information between different physical fields, and (iii) axial attentions, which factorize full spatiotemporal self-attention along individual spatial and temporal axes to reduce computational burden while retaining expressivity. We pretrain multiple model variants on a diverse collection of heterogeneous PDE datasets and evaluate transfer to a range of downstream prediction tasks. Using both full-model fine-tuning and parameter-efficient low-rank adapters, MORPH outperforms models trained from scratch. Across extensive evaluations, MORPH matches or surpasses strong baselines and recent state-of-the-art models. Collectively, these capabilities present a flexible and powerful backbone for learning from the heterogeneous and multimodal nature of scientific observations, charting a path toward scalable and data-efficient scientific machine learning. The source code, datasets, and models are publicly available at https://github.com/lanl/MORPH.

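Title / Abstract (translated from Chinese)

Title: MORPH: PDE Foundation Models with Arbitrary Data Modality

We introduce MORPH, a modality-agnostic, autoregressive foundation model for partial differential equations (PDEs). MORPH is built on a convolutional vision transformer backbone and handles heterogeneous spatiotemporal datasets of different data modalities (1D to 3D) and resolutions, as well as multiple fields with mixed scalar and vector components. The architecture combines (i) component-wise convolution, which jointly processes scalar and vector channels to capture local interactions, (ii) inter-field cross-attention, which models and selectively propagates information between different physical fields, and (iii) axial attention, which factorizes full spatiotemporal self-attention along individual spatial and temporal axes to reduce the computational burden while preserving expressivity. We pretrain several model variants on a diverse collection of heterogeneous PDE datasets and evaluate how well they transfer to a range of downstream prediction tasks. With both full-model fine-tuning and parameter-efficient low-rank adapters, MORPH outperforms models trained from scratch. Across extensive evaluations, MORPH matches or surpasses strong baselines and recent state-of-the-art models. Together, these capabilities provide a flexible and powerful backbone for learning from the heterogeneous, multimodal nature of scientific observations, paving the way for scalable and data-efficient scientific machine learning. The source code, datasets, and models are available at https://github.com/lanl/MORPH/.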

Summary

MORPH is an autoregressive foundation model designed for partial differential equations that can handle various data modalities and resolutions. It uses a convolutional vision transformer backbone with component-wise convolution, inter-field cross-attention, and axial attentions to capture local interactions and propagate information between fields. MORPH outperforms models trained from scratch and matches or surpasses strong baselines in multiple downstream prediction tasks, demonstrating its effectiveness in learning from heterogeneous scientific data. The source code and datasets are publicly available.

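MORPH is an autoregressive foundation model for partial differential equations that handles multiple data modalities and resolutions. It uses a convolutional vision transformer backbone combined with component-wise convolution, inter-field cross-attention, and axial attention to capture local interactions and propagate information between physical fields. MORPH performs well on multiple prediction tasks, outperforming models trained from scratch and matching or surpassing strong baselines, which shows its effectiveness at learning from heterogeneous, multimodal scientific data. The source code and datasets are publicly available.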
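
The claim that axial attention reduces the computational burden can be checked with a short count of attention score pairs. The sketch assumes a grid of T time steps by H by W spatial positions and one attention pass per axis; it counts query-key pairs only and ignores heads, channels, and constant factors. The function names and grid size are illustrative and do not describe MORPH's actual layer layout.

```python
def full_attention_pairs(t: int, h: int, w: int) -> int:
    """Query-key pairs when every token attends to every other token."""
    n = t * h * w
    return n * n


def axial_attention_pairs(t: int, h: int, w: int) -> int:
    """Query-key pairs when attention runs separately along each axis.

    Along the time axis there are h*w independent sequences of length t,
    giving h*w*t*t pairs, which equals n*t. Summing the three axes gives
    n*(t + h + w).
    """
    n = t * h * w
    return n * (t + h + w)


t, h, w = 8, 32, 32  # illustrative grid size
full = full_attention_pairs(t, h, w)
axial = axial_attention_pairs(t, h, w)
print(full)                     # 67108864
print(axial)                    # 589824
print(round(full / axial, 1))   # 113.8
```

The ratio is n / (t + h + w), so the saving grows with grid size, which is why the factorization matters most for high-resolution 3D data.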

JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion

Authors: Anthony Chen, Naomi Ken Korem, Tavi Halperin, Matan Ben Yosef, Urska Jelercic, Ofir Bibi, Or Patashnik, Daniel Cohen-Or

First: 2026-01-29T18:57:13+00:00 · Latest: 2026-01-29T18:57:13+00:00

Comments: Project webpage available at https://justdubit.github.io


Abstract

Audio-Visual Foundation Models, which are pretrained to jointly generate sound and visual content, have recently shown an unprecedented ability to model multi-modal generation and editing, opening new opportunities for downstream tasks. Among these tasks, video dubbing could greatly benefit from such priors, yet most existing solutions still rely on complex, task-specific pipelines that struggle in real-world settings. In this work, we introduce a single-model approach that adapts a foundational audio-video diffusion model for video-to-video dubbing via a lightweight LoRA. The LoRA enables the model to condition on an input audio-video while jointly generating translated audio and synchronized facial motion. To train this LoRA, we leverage the generative model itself to synthesize paired multilingual videos of the same speaker. Specifically, we generate multilingual videos with language switches within a single clip, and then inpaint the face and audio in each half to match the language of the other half. By leveraging the rich generative prior of the audio-visual model, our approach preserves speaker identity and lip synchronization while remaining robust to complex motion and real-world dynamics. We demonstrate that our approach produces high-quality dubbed videos with improved visual fidelity, lip synchronization, and robustness compared to existing dubbing pipelines.

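Title / Abstract (translated from Chinese)

Title: JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion

Audio-visual foundation models, which are pretrained to generate sound and visual content jointly, have recently shown unprecedented multimodal generation and editing abilities, opening new opportunities for downstream tasks. Among these tasks, dubbing stands to benefit greatly from such priors, yet most existing solutions still rely on complex, task-specific pipelines that struggle in real-world settings. In this work, we introduce a single-model approach that adapts a foundational audio-video diffusion model to video-to-video dubbing through a lightweight LoRA. The LoRA lets the model condition on an input audio-video while jointly generating translated audio and synchronized facial motion. To train the LoRA, we use the generative model itself to synthesize paired multilingual videos of the same speaker. Specifically, we generate multilingual videos with language switches within a single clip, then inpaint the face and audio in each half to match the language of the other half. By drawing on the rich generative prior of the audio-visual model, our approach preserves speaker identity and lip synchronization while remaining robust to complex motion and real-world dynamics. We show that, compared with existing dubbing pipelines, our approach produces dubbed videos with higher visual fidelity, better lip synchronization, and greater robustness.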

Summary

This paper introduces JUST-DUB-IT, a single-model approach that uses a lightweight LoRA to adapt a pretrained audio-visual diffusion model for video dubbing. The model conditions on input audio-video to generate synchronized translated audio and facial motion. By synthesizing multilingual videos and inpainting faces and audio, the approach preserves speaker identity and lip synchronization, showing improved visual fidelity and robustness compared to existing methods.

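This work presents JUST-DUB-IT, which adapts an audio-visual diffusion model to video dubbing with a lightweight LoRA. Drawing on the model's generative ability, the method produces synchronized translated audio and facial motion conditioned on the input audio-video. Training data come from generating multilingual videos that contain language switches and then inpainting the face and audio in each half to match the language of the other half. Experiments show that JUST-DUB-IT produces dubbed videos with higher visual fidelity, better lip synchronization, and greater robustness than existing methods.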

StepShield: When, Not Whether to Intervene on Rogue Agents

Authors: Gloria Felicia, Michael Eniolade, Jinfeng He, Zitha Sasindran, Hemant Kumar, Milan Hussain Angati, Sandeep Bandarupalli

First: 2026-01-29T18:55:46+00:00 · Latest: 2026-01-29T18:55:46+00:00

Comments: 16 pages, 2 figures, 14 tables


Abstract

Existing agent safety benchmarks report binary accuracy, conflating early intervention with post-mortem analysis. A detector that flags a violation at step 8 enables intervention; one that reports it at step 48 provides only forensic value. This distinction is critical, yet current benchmarks cannot measure it. We introduce StepShield, the first benchmark to evaluate when violations are detected, not just whether. StepShield contains 9,213 code agent trajectories, including 1,278 meticulously annotated training pairs and a 7,935-trajectory test set with a realistic 8.1% rogue rate. Rogue behaviors are grounded in real-world security incidents across six categories. We propose three novel temporal metrics: Early Intervention Rate (EIR), Intervention Gap, and Tokens Saved. Surprisingly, our evaluation reveals that an LLM-based judge achieves 59% EIR while a static analyzer achieves only 26%, a 2.3x performance gap that is entirely invisible to standard accuracy metrics. We further show that early detection has direct economic benefits: our cascaded HybridGuard detector reduces monitoring costs by 75% and projects to $108M in cumulative savings over five years at enterprise scale. By shifting the focus of evaluation from whether to when, StepShield provides a new foundation for building safer and more economically viable AI agents. The code and data are released under an Apache 2.0 license.

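Title / Abstract (translated from Chinese)

Title: StepShield: When, Not Whether to Intervene on Rogue Agents

Existing agent safety benchmarks report binary accuracy, which conflates early intervention with post-mortem analysis. A detector that flags a violation at step 8 enables intervention; one that reports it at step 48 has only forensic value. The distinction is critical, yet current benchmarks cannot measure it. We introduce StepShield, the first benchmark that evaluates when violations are detected, not just whether they are detected. StepShield contains 9,213 code agent trajectories, including 1,278 carefully annotated training pairs and a test set of 7,935 trajectories with a realistic 8.1% rogue rate. Rogue behaviors are grounded in real-world security incidents across six categories. We propose three new temporal metrics: Early Intervention Rate (EIR), Intervention Gap, and Tokens Saved. Surprisingly, our evaluation shows that an LLM-based judge reaches 59% EIR while a static analyzer reaches only 26%, a 2.3x performance gap that standard accuracy metrics cannot see at all. We further show that early detection has direct economic benefits: our cascaded HybridGuard detector cuts monitoring costs by 75% and is projected to save $108M over five years at enterprise scale. By shifting the focus of evaluation from whether to when, StepShield provides a new foundation for building safer and more economically viable AI agents. The code and data are released under the Apache 2.0 license.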

Summary

StepShield evaluates the timing of violation detection in code agents, addressing the limitations of existing benchmarks that only report binary accuracy. It introduces three novel temporal metrics: Early Intervention Rate (EIR), Intervention Gap, and Tokens Saved. The study finds that an LLM-based judge outperforms a static analyzer by 2.3x in EIR, and a cascaded HybridGuard detector reduces monitoring costs by 75% with projected savings of $108M over five years. By focusing on when interventions occur, StepShield provides a new framework for developing safer and more cost-effective AI agents.

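StepShield evaluates when violations are detected in code agents, addressing the limitation that existing benchmarks measure only binary accuracy. It introduces three new metrics: Early Intervention Rate (EIR), Intervention Gap, and Tokens Saved. The evaluation shows that an LLM-based judge reaches 59% EIR while a static analyzer reaches only 26%, a 2.3x gap. Early detection also brings substantial cost savings: the cascaded HybridGuard detector cuts monitoring costs by 75%, with projected savings of $108M over five years. By focusing on when violations are detected, StepShield offers a new benchmark for safer and more economically viable AI agents.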
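
The headline figures in the StepShield abstract can be cross-checked with a few lines of arithmetic, and the idea of timing-aware evaluation can be illustrated with a toy metric. The abstract does not define Early Intervention Rate, so `early_intervention_rate` below is an assumed definition (the share of rogue trajectories flagged no later than the violating step plus a tolerance) and the sample data are made up.

```python
from typing import List, Optional, Tuple

# Dataset sizes quoted in the abstract.
train_pairs = 1_278
test_trajectories = 7_935
total = train_pairs + test_trajectories                   # matches the stated 9,213

approx_rogue_in_test = round(test_trajectories * 0.081)   # 8.1% rogue rate
eir_ratio = round(0.59 / 0.26, 1)                         # LLM judge vs. static analyzer


def early_intervention_rate(
    cases: List[Tuple[int, Optional[int]]], tolerance: int = 0
) -> float:
    """Share of rogue trajectories flagged early enough to intervene.

    Each case is (violation_step, detected_step), with detected_step None
    when the detector never fires. Assumed definition, not the paper's.
    """
    early = sum(
        1 for violation, detected in cases
        if detected is not None and detected <= violation + tolerance
    )
    return early / len(cases)


# Hypothetical detector outputs: flagged on time, flagged 40 steps late,
# missed entirely, flagged on time.
cases = [(8, 8), (8, 48), (5, None), (12, 12)]

print(total)                            # 9213
print(approx_rogue_in_test)             # 643
print(eir_ratio)                        # 2.3
print(early_intervention_rate(cases))   # 0.5
```

A binary-accuracy metric would credit three of the four toy cases as detected, while the timing-aware rate credits only two, which is the gap the benchmark is built to expose.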

PI-Light: Physics-Inspired Diffusion for Full-Image Relighting

Authors: Zhexin Liang, Zhaoxi Chen, Yongwei Chen, Tianyi Wei, Tengfei Wang, Xingang Pan

Venue: ICLR 2026

First: 2026-01-29T18:55:36+00:00 · Latest: 2026-01-29T18:55:36+00:00

Comments: Accepted at ICLR 2026


Abstract

Full-image relighting remains a challenging problem due to the difficulty of collecting large-scale structured paired data, the difficulty of maintaining physical plausibility, and the limited generalizability imposed by data-driven priors. Existing attempts to bridge the synthetic-to-real gap for full-scene relighting remain suboptimal. To tackle these challenges, we introduce Physics-Inspired diffusion for full-image reLight ($π$-Light, or PI-Light), a two-stage framework that leverages physics-inspired diffusion models. Our design incorporates (i) batch-aware attention, which improves the consistency of intrinsic predictions across a collection of images, (ii) a physics-guided neural rendering module that enforces physically plausible light transport, (iii) physics-inspired losses that regularize training dynamics toward a physically meaningful landscape, thereby enhancing generalizability to real-world image editing, and (iv) a carefully curated dataset of diverse objects and scenes captured under controlled lighting conditions. Together, these components enable efficient finetuning of pretrained diffusion models while also providing a solid benchmark for downstream evaluation. Experiments demonstrate that $π$-Light synthesizes specular highlights and diffuse reflections across a wide variety of materials, achieving superior generalization to real-world scenes compared with prior approaches.

中文标题/摘要

标题：PI-Light：基于物理的扩散模型全图像重新照明

全图像重新照明由于难以收集大规模结构化的配对数据、保持物理合理性困难以及数据驱动先验带来的有限泛化能力，仍然是一个具有挑战性的问题。现有的全场景重新照明合成到现实差距的尝试仍然不尽如人意。为了解决这些挑战，我们提出了基于物理的扩散模型全图像重新照明（$π$-Light，或PI-Light），这是一种两阶段框架。我们的设计包括（i）批次感知注意力，提高了图像集合中固有预测的一致性，（ii）一个基于物理的神经渲染模块，确保物理合理的光传输，（iii）基于物理的损失，规范训练动力学向物理有意义的景观发展，从而增强对真实世界图像编辑的泛化能力，以及（iv）一个精心策划的在受控光照条件下捕捉的多样物体和场景的数据集。这些组件共同使预训练的扩散模型的微调变得高效，同时也为下游评估提供了坚实的标准。实验表明，$π$-Light 能够在各种材料上合成高光和漫反射，与先前的方法相比，在真实世界场景上的泛化能力更优。

Summary / 总结

PI-Light addresses the challenges of full-image relighting by introducing a two-stage framework that uses physics-inspired diffusion models. It includes batch-aware attention, a physics-guided neural rendering module, physics-inspired losses, and a curated dataset to improve consistency, enforce physical plausibility, and enhance generalizability. Experiments show that PI-Light outperforms previous methods in synthesizing specular highlights and diffuse reflections across various materials and generalizing to real-world scenes.

PI-Light 是一个利用物理启发式扩散模型的两阶段框架，旨在解决全图像重新照明的挑战。它包含批次感知注意力、物理引导神经渲染模块、物理启发式损失以及一个精心策划的多样物体和场景数据集。实验表明，PI-Light 能够有效地在各种材料上合成高光和漫反射，相比之前的方法，在真实场景中的泛化能力更强。

Pay for Hints, Not Answers: LLM Shepherding for Cost-Efficient Inference

Authors: Ziming Dong, Hardik Sharma, Evan O'Toole, Jaya Prakash Champati, Kui Wu

First: 2026-01-29T18:52:54+00:00 · Latest: 2026-01-29T18:52:54+00:00

Abs · PDF · Code1 · Code2

Abstract

Large Language Models (LLMs) deliver state-of-the-art performance on complex reasoning tasks, but their inference costs limit deployment at scale. Small Language Models (SLMs) offer dramatic cost savings yet lag substantially in accuracy. Existing approaches - routing and cascading - treat the LLM as an all-or-nothing resource: either the query bypasses the LLM entirely, or the LLM generates a complete response at full cost. We introduce LLM Shepherding, a framework that requests only a short prefix (a hint) from the LLM and provides it to the SLM. This simple mechanism is surprisingly effective for math and coding tasks: even hints comprising 10-30% of the full LLM response improve SLM accuracy significantly. Shepherding generalizes both routing and cascading, and it achieves lower cost under oracle decision-making. We develop a two-stage predictor that jointly determines whether a hint is needed and how many tokens to request. On widely used mathematical reasoning (GSM8K, CNK12) and code generation (HumanEval, MBPP) benchmarks, Shepherding reduces costs by 42-94% relative to LLM-only inference. Compared to state-of-the-art routing and cascading baselines, Shepherding delivers up to 2.8x cost reduction while matching accuracy. To our knowledge, this is the first work to exploit token-level budget control for SLM-LLM collaboration.

中文标题/摘要

标题：付费获取提示，而非答案：LLM 管理以实现成本效益推理

大型语言模型（LLMs）在复杂推理任务上提供最先进的性能，但其推理成本限制了大规模部署。小型语言模型（SLMs）提供显著的成本节省，但在准确性上落后很多。现有方法——路由和级联——将LLM视为全有或全无的资源：查询要么完全绕过LLM，要么LLM以全额成本生成完整响应。我们引入了LLM 管理框架，该框架仅请求LLM提供一个短前缀（提示），并将其提供给SLM。这种简单的机制在数学和编程任务中表现出惊人的效果：即使提示仅占完整LLM响应的10-30%，也能显著提高SLM的准确性。LLM 管理既涵盖了路由和级联，又在最优决策下实现了更低的成本。我们开发了一种两阶段预测器，以共同确定是否需要提示以及请求多少令牌。在广泛使用的数学推理（GSM8K，CNK12）和代码生成（HumanEval，MBPP）基准测试中，LLM 管理将成本降低了42-94%。与最先进的路由和级联基线相比，LLM 管理在成本上最多可减少2.8倍，同时保持相同的准确性。据我们所知，这是首次利用令牌级预算控制实现SLM-LLM协作的工作。

Summary / 总结

The paper addresses the high inference costs of Large Language Models (LLMs) and proposes LLM Shepherding, a framework that requests only a short prefix (a hint) from the LLM and passes it to a Small Language Model (SLM). On math and coding tasks, hints comprising just 10-30% of the full LLM response significantly improve SLM accuracy, and Shepherding cuts costs by 42-94% relative to LLM-only inference. A two-stage predictor decides whether a hint is needed and how many tokens to request; compared with state-of-the-art routing and cascading baselines, this yields up to 2.8x cost reduction at matching accuracy.

论文通过提出LLM牧羊人方法，仅从大型语言模型（LLM）请求一个短前缀（提示），并将其提供给小型语言模型（SLM），来解决复杂任务中使用LLM的高成本问题。该方法在数学和编程任务上显著提高了SLM的准确性，与单独使用LLM相比，成本降低了42-94%。两阶段预测器确定何时以及请求多少令牌，最多可实现2.8倍的成本降低，同时不牺牲准确性。
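
The cost argument behind Shepherding can be made concrete with a toy model. The sketch below is illustrative only: the per-token prices, the token counts, and the function name `shepherding_cost` are assumptions made for this example, not values or APIs from the paper, and input-token costs are ignored. It shows how routing to the SLM (no hint) and a full LLM response appear as the two extremes of a single hint fraction.

```python
def shepherding_cost(hint_fraction, llm_tokens, slm_tokens,
                     llm_price=10.0, slm_price=0.5):
    """Toy cost of answering one query with an LLM hint of the given fraction.

    Prices are hypothetical per-output-token costs in arbitrary units.
    """
    if not 0.0 <= hint_fraction <= 1.0:
        raise ValueError("hint_fraction must be in [0, 1]")
    hint_tokens = round(hint_fraction * llm_tokens)
    # The LLM is billed only for the prefix (hint) it generates.
    llm_cost = hint_tokens * llm_price
    # Assumption: when the LLM writes the full response, the SLM generates nothing.
    slm_cost = 0.0 if hint_fraction == 1.0 else slm_tokens * slm_price
    return llm_cost + slm_cost


llm_only = shepherding_cost(1.0, llm_tokens=400, slm_tokens=400)  # full LLM answer
slm_only = shepherding_cost(0.0, llm_tokens=400, slm_tokens=400)  # routed to the SLM
hinted = shepherding_cost(0.2, llm_tokens=400, slm_tokens=400)    # 20% hint
saving = 1.0 - hinted / llm_only
print(f"LLM-only={llm_only}, SLM-only={slm_only}, hinted={hinted}, saving={saving:.0%}")
```

Under these made-up prices a 20% hint costs a quarter of the LLM-only answer. Whether the hinted SLM then answers correctly is the empirical question the paper's two-stage predictor is meant to address; this sketch does not model accuracy.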

World of Workflows: a Benchmark for Bringing World Models to Enterprise Systems

Authors: Lakshya Gupta, Litao Li, Yizhe Liu, Sriram Ganapathi Subramanian, Kaheer Suleman, Zichen Zhang, Haoye Lu, Sumit Pasupalak

First: 2026-01-29T18:51:54+00:00 · Latest: 2026-01-29T18:51:54+00:00

Abs · PDF · Code1 · Code2

Abstract

Frontier large language models (LLMs) excel as autonomous agents in many domains, yet they remain untested in complex enterprise systems where hidden workflows create cascading effects across interconnected databases. Existing enterprise benchmarks evaluate surface-level agentic task completion similar to general consumer benchmarks, ignoring true challenges in enterprises, such as limited observability, large database state, and hidden workflows with cascading side effects. We introduce World of Workflows (WoW), a realistic ServiceNow-based environment incorporating 4,000+ business rules and 55 active workflows embedded in the system, alongside WoW-bench, a benchmark of 234 tasks evaluating constrained agentic task completion and enterprise dynamics modeling capabilities. We reveal two major takeaways: (1) Frontier LLMs suffer from dynamics blindness, consistently failing to predict the invisible, cascading side effects of their actions, which leads to silent constraint violations, and (2) reliability in opaque systems requires grounded world modeling, where agents must mentally simulate hidden state transitions to bridge the observability gap when high-fidelity feedback is unavailable. For reliable and useful enterprise agents, WoW motivates a new paradigm to explicitly learn system dynamics. We release our GitHub for setting up and evaluating WoW.

中文标题/摘要

标题：工作流世界：将世界模型引入企业系统的基准

前沿大型语言模型（LLMs）在许多领域作为自主代理表现出色，但在复杂的企业系统中仍未经测试，这些系统中的隐藏工作流会在相互连接的数据库之间产生级联效应。现有的企业基准测试评估表面级别的代理任务完成情况，类似于通用消费者基准测试，忽略了企业中的真正挑战，如有限的可观测性、庞大的数据库状态以及隐藏的工作流及其级联副作用。我们引入了工作流世界（WoW），这是一个基于ServiceNow的现实环境，包含4000多个业务规则和55个嵌入系统中的活跃工作流，以及WoW基准，这是一个包含234个任务的基准，评估受限的代理任务完成能力和企业动态建模能力。我们揭示了两个主要发现：（1）前沿LLMs存在动态盲点，一致地未能预测其行动的隐形级联副作用，导致无声的约束违规；（2）在不透明系统中需要基于世界建模的可靠性，代理必须在高保真反馈不可用时在心中模拟隐藏状态转换以弥合可观测性差距。为了实现可靠且有用的企业代理，WoW推动了一种新的范式，即明确学习系统动力学。我们发布了GitHub以设置和评估WoW。

Summary / 总结

The paper introduces World of Workflows (WoW), a realistic ServiceNow-based environment with 4,000+ business rules and 55 active workflows, together with WoW-bench, a benchmark of 234 tasks that evaluates constrained agentic task completion and enterprise dynamics modeling. Two findings stand out: frontier LLMs suffer from dynamics blindness, consistently failing to predict the cascading side effects of their actions, which leads to silent constraint violations; and reliability in opaque systems requires grounded world modeling to bridge the observability gap when high-fidelity feedback is unavailable.

论文介绍了World of Workflows (WoW)，一个用于评估大型语言模型（LLMs）在复杂企业系统中的基准，解决了现有基准的局限性。WoW 使用了一个包含 4,000 多个业务规则和 55 个活跃工作流的 ServiceNow 环境，并且 WoW-bench 包含 234 个任务来评估代理任务完成和企业动态建模能力。主要发现表明，LLMs 在预测连锁副作用方面存在困难，导致约束违规，并强调了在不透明系统中需要进行基于世界建模以弥补可观测性差距的重要性。这项工作推动了在企业环境中学习系统动态的新范式。

SWE-Replay: Efficient Test-Time Scaling for Software Engineering Agents

Authors: Yifeng Ding, Lingming Zhang

First: 2026-01-29T18:50:29+00:00 · Latest: 2026-01-29T18:50:29+00:00

Abs · PDF · Code1 · Code2

Abstract

Test-time scaling has been widely adopted to enhance the capabilities of Large Language Model (LLM) agents in software engineering (SWE) tasks. However, the standard approach of repeatedly sampling trajectories from scratch is computationally expensive. While recent methods have attempted to mitigate costs using specialized value agents, they can suffer from model miscalibration and fail to generalize to modern agents that synthesize custom bash scripts as tools. In this paper, we introduce SWE-Replay, the first efficient and generalizable test-time scaling technique for modern agents without reliance on potentially noisy value estimates. SWE-Replay optimizes the scaling process by recycling trajectories from prior trials, dynamically choosing to either explore from scratch or exploit archived experience by branching at critical intermediate steps. This selection of intermediate steps is driven by the potential and reasoning significance of repository exploration, rather than external LLM-based quality estimates. Our evaluation shows that, on SWE-Bench Verified, SWE-Replay consistently outperforms naive scaling, reducing costs by up to 17.4% while maintaining or even improving performance by up to 3.8%. Further evaluation on SWE-Bench Pro and Multilingual validates the generalizability of SWE-Replay, establishing it as a robust foundation for efficient test-time scaling of software engineering agents.

中文标题/摘要

标题：SWE-Replay: 软件工程代理的高效测试时扩展

测试时扩展已被广泛采用以增强大型语言模型（LLM）代理在软件工程（SWE）任务中的能力。然而，标准方法从头开始重复采样轨迹是计算上昂贵的。虽然最近的方法试图通过使用专门的价值代理来减轻成本，但它们可能会遭受模型校准不当的问题，并且无法泛化到现代代理，这些代理会合成自定义的bash脚本作为工具。在本文中，我们介绍了SWE-Replay，这是第一个无需依赖可能噪声的价值估计即可实现现代代理的高效且可泛化的测试时扩展技术。SWE-Replay通过回收先前试验中的轨迹来优化扩展过程，动态选择从头探索或利用存档经验在关键中间步骤进行分支。这种中间步骤的选择是基于仓库探索的潜力和推理意义，而不是外部LLM的质量估计。我们的评估表明，在SWE-Bench Verified上，SWE-Replay始终优于简单的扩展，成本最多可降低17.4%，同时保持或甚至提高性能最多3.8%。进一步在SWE-Bench Pro和多语言上的评估验证了SWE-Replay的泛化能力，确立了其作为软件工程代理高效测试时扩展稳健基础的地位。

Summary / 总结

The paper introduces SWE-Replay, an efficient and generalizable test-time scaling technique for modern software engineering agents. It addresses the computational inefficiency of repeatedly sampling trajectories from scratch and avoids the use of potentially noisy value estimates. SWE-Replay recycles trajectories from previous trials, dynamically deciding whether to explore from scratch or exploit archived experience by branching at critical steps. Evaluation on SWE-Bench Verified shows that SWE-Replay reduces costs by up to 17.4% while maintaining or improving performance by up to 3.8%. Further validation on SWE-Bench Pro and Multilingual confirms its generalizability and robustness.

论文提出了SWE-Replay，一种高效的现代软件工程代理测试时扩展技术。它通过回收先前的轨迹并动态选择探索或利用来解决从头开始重复采样的计算效率低下问题。SWE-Replay在SWE-Bench Verified上优于简单的扩展方法，成本最多可降低17.4%，性能最多可提高3.8%。此外，它在SWE-Bench Pro和Multilingual上的进一步评估证明了其普适性，使其成为软件工程代理高效测试时扩展的稳健基础。

The Patient is not a Moving Document: A World Model Training Paradigm for Longitudinal EHR

Authors: Irsyad Adam, Zekai Chen, David Laprade, Shaun Porwal, David Laub, Erik Reinertsen, Arda Pekis, Kevin Brown

First: 2026-01-29T18:49:37+00:00 · Latest: 2026-01-29T18:49:37+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Large language models (LLMs) trained with next-word-prediction have achieved success as clinical foundation models. Representations from these language backbones yield strong linear probe performance across biomedical tasks, suggesting that patient semantics emerge from next-token prediction at scale. However, this paradigm treats patients as a document to be summarized rather than a dynamical system to be simulated; a patient's trajectory emerges from their state evolving under interventions and time, requiring models that simulate dynamics rather than predict tokens. To address this, we introduce SMB-Structure, a world model for structured EHR that grounds a joint-embedding prediction architecture (JEPA) with next-token prediction (SFT). SFT grounds our model to reconstruct future patient states in token space, while JEPA predicts those futures in latent space from the initial patient representation alone, forcing trajectory dynamics to be encoded before the next state is observed. We validate across two large-scale cohorts: Memorial Sloan Kettering (23,319 oncology patients; 323,000+ patient-years) and INSPECT (19,402 pulmonary embolism patients). Using a linear probe evaluated at multiple points along the disease trajectory, we demonstrate that our training paradigm learns embeddings that capture disease dynamics not recoverable by autoregressive baselines, enabling SMB-Structure to achieve competitive performance on complex tasks characterized by high patient heterogeneity. Model weights are available at https://huggingface.co/standardmodelbio/SMB-v1-1.7B-Structure.

中文标题/摘要

标题：患者不是移动的文档：一种用于纵向EHR的世界模型训练范式

大型语言模型（LLMs）通过下一个词预测训练，在临床基础模型方面取得了成功。这些语言骨干的表示在生物医学任务上表现出强大的线性探针性能，表明患者语义是从大规模的下一个词预测中涌现出来的。然而，这种范式将患者视为需要总结的文档，而不是需要模拟的动力系统；患者的轨迹是从其状态在干预和时间的作用下演变而来的，需要能够模拟动力学而不是预测词的模型。为了解决这个问题，我们引入了SMB-Structure，这是一种结构化EHR的世界模型，将联合嵌入预测架构（JEPA）与下一个词预测（SFT）相结合。SFT使我们的模型能够重建患者状态在词空间中的未来状态，而JEPA则仅从初始患者表示中预测这些未来状态在潜在空间中，迫使轨迹动力学在观察到下一个状态之前被编码。我们在两个大规模队列中进行了验证：纪念斯隆凯特林（23,319名肿瘤患者；323,000多个患者年）和INSPECT（19,402名肺栓塞患者）。使用沿疾病轨迹多个点评估的线性探针，我们证明我们的训练范式学习到的嵌入捕捉到了自回归基线无法恢复的疾病动力学，使SMB-Structure能够在高患者异质性特征的复杂任务上实现竞争力的性能。模型权重可在https://huggingface.co/standardmodelbio/SMB-v1-1.7B-Structure 获取。

EditYourself: Audio-Driven Generation and Manipulation of Talking Head Videos with Diffusion Transformers

Authors: John Flynn, Wolfgang Paier, Dimitar Dinev, Sam Nhut Nguyen, Hayk Poghosyan, Manuel Toribio, Sandipan Banerjee, Guy Gafni

First: 2026-01-29T18:49:27+00:00 · Latest: 2026-01-29T18:49:27+00:00

Comments: Project page: https://edit-yourself.github.io/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Current generative video models excel at producing novel content from text and image prompts, but leave a critical gap in editing existing pre-recorded videos, where minor alterations to the spoken script require preserving motion, temporal coherence, speaker identity, and accurate lip synchronization. We introduce EditYourself, a DiT-based framework for audio-driven video-to-video (V2V) editing that enables transcript-based modification of talking head videos, including the seamless addition, removal, and retiming of visually spoken content. Building on a general-purpose video diffusion model, EditYourself augments its V2V capabilities with audio conditioning and region-aware, edit-focused training extensions. This enables precise lip synchronization and temporally coherent restructuring of existing performances via spatiotemporal inpainting, including the synthesis of realistic human motion in newly added segments, while maintaining visual fidelity and identity consistency over long durations. This work represents a foundational step toward generative video models as practical tools for professional video post-production.

中文标题/摘要

标题：EditYourself：基于扩散Transformer的音频驱动说话人头部视频生成与编辑

当前的生成视频模型在从文本和图像提示生成新颖内容方面表现出色，但在编辑现有的预录制视频方面存在关键缺口，其中对讲话脚本的微小修改需要保留动作、时间连贯性、说话者身份和准确的唇部同步。我们介绍了EditYourself，一种基于DiT的音频驱动视频到视频（V2V）编辑框架，该框架能够基于脚本修改对话头视频，包括无缝添加、删除和调整视觉讲话内容的时间。基于通用视频扩散模型，EditYourself通过音频条件和区域感知、编辑导向的训练扩展增强了其V2V能力。这使得通过时空修补精确唇部同步和时间连贯地重新结构现有表演成为可能，包括在新添加的段落中合成现实的人体运动，同时保持视觉保真度和身份一致性。这项工作代表了生成视频模型作为专业视频后期制作实用工具的基础步骤。

Learning Hamiltonian Flow Maps: Mean Flow Consistency for Large-Timestep Molecular Dynamics

Authors: Winfried Ripken, Michael Plainer, Gregor Lied, Thorben Frank, Oliver T. Unke, Stefan Chmiela, Frank Noé, Klaus Robert Müller

First: 2026-01-29T18:47:46+00:00 · Latest: 2026-01-29T18:47:46+00:00

Abs · PDF · Code1 · Code2

Abstract

Simulating the long-time evolution of Hamiltonian systems is limited by the small timesteps required for stable numerical integration. To overcome this constraint, we introduce a framework to learn Hamiltonian Flow Maps by predicting the mean phase-space evolution over a chosen time span $Δt$, enabling stable large-timestep updates far beyond the stability limits of classical integrators. To this end, we impose a Mean Flow consistency condition for time-averaged Hamiltonian dynamics. Unlike prior approaches, this allows training on independent phase-space samples without access to future states, avoiding expensive trajectory generation. Validated across diverse Hamiltonian systems, our method in particular improves upon molecular dynamics simulations using machine-learned force fields (MLFF). Our models maintain comparable training and inference cost, but support significantly larger integration timesteps while trained directly on widely-available trajectory-free MLFF datasets.

中文标题/摘要

标题：学习哈密顿流映射：大时间步长分子动力学的平均流一致性

模拟哈密顿系统长时间演化受限于所需的较小时间步长以确保数值积分的稳定性。为克服这一限制，我们提出了一种框架，通过预测选定时间跨度$Δt$内的平均相空间演化来学习哈密顿流映射，从而实现远超经典积分器稳定限制的大时间步长更新。为此，我们对时间平均哈密顿动力学施加了平均流一致性条件。与先前的方法不同，这种方法允许在无需访问未来状态的情况下使用独立的相空间样本进行训练，从而避免了昂贵的轨迹生成。我们的方法在多种哈密顿系统中得到了验证，特别是在使用机器学习力场（MLFF）改进分子动力学模拟方面表现出色。我们的模型保持了相似的训练和推理成本，但在训练时可以直接使用广泛可用的无轨迹MLFF数据集，从而支持显著更大的积分时间步长。

Summary / 总结

The research aims to address the limitation of small timesteps in simulating the long-time evolution of Hamiltonian systems. It introduces a framework to learn Hamiltonian Flow Maps by predicting the mean phase-space evolution over a chosen time span, which enables stable large-timestep updates. Key experimental findings show that this method improves molecular dynamics simulations using machine-learned force fields, maintaining similar training and inference costs but supporting much larger integration timesteps compared to classical integrators.

研究旨在解决哈密顿系统长时间演化模拟中需要小时间步长的限制。它提出了一种框架，通过预测选定时间跨度内的平均相空间演化来学习哈密顿流映射，从而实现稳定的大时间步长更新。实验结果表明，该方法改进了使用机器学习力场的分子动力学模拟，在保持相近训练和推理成本的同时，支持远超经典积分器稳定极限的积分时间步长。
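
The stability limit that motivates this work can be seen in the simplest Hamiltonian system. The sketch below is not the paper's method: it runs a classical velocity Verlet integrator on a unit-mass harmonic oscillator, for which the scheme is known to be stable only when `omega * dt < 2`. The function name `verlet_energy_growth` and all parameter values are assumptions chosen for illustration.

```python
def verlet_energy_growth(dt, steps=200, omega=1.0):
    """Integrate H = p^2/2 + omega^2 q^2/2 with velocity Verlet.

    Returns the largest energy seen along the trajectory divided by the
    initial energy; values near 1 mean the integration stayed bounded.
    """
    q, p = 1.0, 0.0
    k = omega * omega  # spring constant for unit mass

    def energy(q, p):
        return 0.5 * p * p + 0.5 * k * q * q

    e0 = energy(q, p)
    max_e = e0
    for _ in range(steps):
        p -= 0.5 * dt * k * q   # half kick
        q += dt * p             # drift
        p -= 0.5 * dt * k * q   # half kick
        max_e = max(max_e, energy(q, p))
    return max_e / e0


stable = verlet_energy_growth(dt=0.1)    # omega * dt well below 2
unstable = verlet_energy_growth(dt=2.1)  # just past the stability limit
print(f"dt=0.1 -> {stable:.3f}, dt=2.1 -> {unstable:.3e}")
```

With `dt=0.1` the energy stays bounded, while at `dt=2.1` it grows exponentially. A learned flow map, as described in the abstract, aims to take steps beyond this classical limit.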

SINA: A Circuit Schematic Image-to-Netlist Generator Using Artificial Intelligence

Authors: Saoud Aldowaish, Yashwanth Karumanchi, Kai-Chen Chiang, Soroosh Noorzad, Morteza Fayazi

First: 2026-01-29T18:41:52+00:00 · Latest: 2026-01-29T18:41:52+00:00

Abs · PDF · Code1 · Code2

Abstract

Current methods for converting circuit schematic images into machine-readable netlists struggle with component recognition and connectivity inference. In this paper, we present SINA, an open-source, fully automated circuit schematic image-to-netlist generator. SINA integrates deep learning for accurate component detection, Connected-Component Labeling (CCL) for precise connectivity extraction, and Optical Character Recognition (OCR) for component reference designator retrieval, while employing a Vision-Language Model (VLM) for reliable reference designator assignments. In our experiments, SINA achieves 96.47% overall netlist-generation accuracy, which is 2.72x higher than state-of-the-art approaches.

中文标题/摘要

标题：SINA：使用人工智能的电路原理图图像到网表生成器

当前将电路原理图图像转换为机器可读网表的方法在组件识别和连接推理方面存在困难。在本文中，我们介绍了SINA，这是一个开源的全自动电路原理图图像到网表生成器。SINA结合了深度学习进行准确的组件检测、连通分量标记（CCL）进行精确的连接提取以及光学字符识别（OCR）进行组件参考标识符检索，同时采用视觉语言模型（VLM）进行可靠的参考标识符分配。在我们的实验中，SINA的整体网表生成准确率为96.47%，比最先进的方法高出2.72倍。

Summary / 总结

SINA is an open-source tool that uses artificial intelligence to convert circuit schematic images into machine-readable netlists. It combines deep learning for component detection, CCL for connectivity extraction, OCR for reference designator retrieval, and a VLM for reference designator assignments. Experiments show that SINA achieves 96.47% overall netlist-generation accuracy, surpassing state-of-the-art methods by 2.72 times.

SINA 是一种自动化的电路原理图图像到网表生成器，它使用深度学习进行组件检测、CCL 进行连接提取、OCR 进行参考标识符检索，以及 VLM 进行参考标识符分配。它实现了 96.47% 的整体网表生成准确率，比最先进的方法高出 2.72 倍。
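
Connected-Component Labeling, which the abstract names as the connectivity-extraction step, can be sketched on a tiny binary image. This is a generic breadth-first CCL with 4-connectivity, not SINA's implementation; the grid `wires` and the function name `label_components` are hypothetical. Each connected group of foreground pixels receives one label, which is how separate wire nets could be told apart.

```python
from collections import deque


def label_components(grid):
    """Label 4-connected foreground regions of a binary grid.

    Returns (labels, count), where labels[r][c] is 0 for background and
    a positive component id otherwise.
    """
    rows, cols = len(grid), len(grid[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c] or labels[r][c]:
                continue
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:  # flood-fill the component breadth-first
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] and not labels[ny][nx]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


# Hypothetical 4x5 wire mask: 1 marks a wire pixel.
wires = [
    [1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [1, 1, 1, 0, 0],
]
labels, count = label_components(wires)
for row in labels:
    print(row)
```

In this mask the three wire segments never touch, so they receive three distinct labels in scan order.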

Physics Informed Reconstruction of Four-Dimensional Atmospheric Wind Fields Using Multi-UAS Swarm Observations in a Synthetic Turbulent Environment

Authors: Abdullah Tasim, Wei Sun

First: 2026-01-29T18:40:32+00:00 · Latest: 2026-01-29T18:40:32+00:00

Abs · PDF · Code1 · Code2

Abstract

Accurate reconstruction of atmospheric wind fields is essential for applications such as weather forecasting, hazard prediction, and wind energy assessment, yet conventional instruments leave spatio-temporal gaps within the lower atmospheric boundary layer. Unmanned aircraft systems (UAS) provide flexible in situ measurements, but individual platforms sample wind only along their flight trajectories, limiting full wind-field recovery. This study presents a framework for reconstructing four-dimensional atmospheric wind fields using measurements obtained from a coordinated UAS swarm. A synthetic turbulence environment and high-fidelity multirotor simulation are used to generate training and evaluation data. Local wind components are estimated from UAS dynamics using a bidirectional long short-term memory network (Bi-LSTM) and assimilated into a physics-informed neural network (PINN) to reconstruct a continuous wind field in space and time. For local wind estimation, the bidirectional LSTM achieves root-mean-square errors (RMSE) of 0.064 and 0.062 m/s for the north and east components in low-wind conditions, increasing to 0.122 to 0.129 m/s under moderate winds and 0.271 to 0.273 m/s in high-wind conditions, while the vertical component exhibits higher error, with RMSE values of 0.029 to 0.091 m/s. The physics-informed reconstruction recovers the dominant spatial and temporal structure of the wind field up to 1000 m altitude while preserving mean flow direction and vertical shear. Under moderate wind conditions, the reconstructed mean wind field achieves an overall RMSE between 0.118 and 0.154 m/s across evaluated UAS configurations, with the lowest error obtained using a five-UAS swarm. These results demonstrate that coordinated UAS measurements enable accurate and scalable four-dimensional wind-field reconstruction without dedicated wind sensors or fixed infrastructure.

中文标题/摘要

标题：基于多UAS群组观测在合成湍流环境中的物理信息驱动四维大气风场重建

准确重建大气风场对于天气预报、灾害预测和风能评估等应用至关重要，但传统仪器在下边界层内存在时空空白。无人驾驶航空系统（UAS）提供了灵活的原位测量，但单个平台仅在其飞行轨迹上采样风，限制了风场的全面恢复。本研究提出了一种使用协调UAS群组观测数据重建四维大气风场的框架。使用合成湍流环境和高保真多旋翼模拟生成训练和评估数据。利用双向长短期记忆网络（Bi-LSTM）从UAS动力学估计局部风分量，并将其整合到物理信息神经网络（PINN）中，以重建空间和时间连续的风场。在低风速条件下，双向LSTM在北向和东向分量上的均方根误差（RMSE）分别为0.064和0.062米/秒，中等风速条件下增加到0.122至0.129米/秒，在高风速条件下增加到0.271至0.273米/秒，而垂直分量的误差较高，RMSE值为0.029至0.091米/秒。物理信息重建恢复了1000米高度以下风场的主要空间和时间结构，同时保持了平均流方向和垂直切变。在中等风速条件下，重建的平均风场在评估的UAS配置中总体RMSE在0.118至0.154米/秒之间，使用五UAS群组获得最低误差。这些结果表明，协调的UAS测量能够实现无需专用风速传感器或固定基础设施的四维风场准确且可扩展的重建。

Summary / 总结

This study reconstructs four-dimensional atmospheric wind fields from coordinated unmanned aircraft system (UAS) swarm observations in a synthetic turbulence environment. A bidirectional long short-term memory network (Bi-LSTM) estimates local wind components from UAS dynamics, and a physics-informed neural network (PINN) assimilates them into a continuous wind field. For the north and east components, the Bi-LSTM achieves RMSEs from 0.062 m/s in low wind to 0.273 m/s in high wind; for the vertical component, RMSE ranges from 0.029 to 0.091 m/s. Under moderate wind, the PINN reconstruction reaches an overall RMSE of 0.118 to 0.154 m/s across the evaluated UAS configurations, with the lowest error from a five-UAS swarm. The reconstruction recovers the dominant spatial and temporal structure of the wind field up to 1000 m altitude while preserving mean flow direction and vertical shear.

本研究旨在通过协调的无人机集群观测，在合成湍流环境中准确重建四维大气风场。研究采用双向长短期记忆网络（Bi-LSTM）进行局部风分量估计，并使用物理信息神经网络（PINN）进行风场重建。结果表明，Bi-LSTM对北向和东向风分量的RMSE在低风速下为0.062至0.064 m/s，在高风速下升至0.271至0.273 m/s；在中等风速条件下，PINN重建在各评估的无人机配置中整体RMSE为0.118至0.154 m/s，其中五架无人机集群的误差最低，证明了该框架在无需专用传感器或固定基础设施的情况下进行风场重建的有效性。

Boosting CVaR Policy Optimization with Quantile Gradients

Authors: Yudong Luo, Erick Delage

First: 2026-01-29T18:33:46+00:00 · Latest: 2026-01-29T18:33:46+00:00

Abs · PDF · Code1 · Code2

Abstract

Optimizing Conditional Value-at-Risk (CVaR) using policy gradient (a.k.a. CVaR-PG) faces significant challenges of sample inefficiency. This inefficiency stems from the fact that it focuses on tail-end performance and overlooks many sampled trajectories. We address this problem by augmenting CVaR with an expected quantile term. Quantile optimization admits a dynamic programming formulation that leverages all sampled data, thus improving sample efficiency. This does not alter the CVaR objective since CVaR corresponds to the expectation of the quantile over the tail. Empirical results in domains with verifiable risk-averse behavior show that our algorithm within the Markovian policy class substantially improves upon CVaR-PG and consistently outperforms other existing methods.

中文标题/摘要

标题：利用分位数梯度提升CVaR策略优化

使用策略梯度优化条件风险价值（CVaR），即CVaR-PG，面临显著的样本效率低下问题。这一问题源于该方法只关注尾部性能而忽略了大量采样轨迹。我们通过为CVaR增加一个期望分位数项来解决这一问题。分位数优化允许使用动态规划形式，利用所有采样数据，从而提高样本效率。这不会改变CVaR目标，因为CVaR对应于尾部的分位数期望。在具有可验证风险厌恶行为的领域中，我们的算法在马尔可夫策略类中显著优于CVaR-PG，并且始终优于其他现有方法。

Summary / 总结

The paper addresses the sample inefficiency in optimizing Conditional Value-at-Risk (CVaR) using policy gradients by incorporating an expected quantile term. This approach enhances sample efficiency by utilizing all sampled trajectories, aligning with the CVaR objective without altering it. Experimental results demonstrate that the proposed method outperforms CVaR-PG and other existing methods in risk-averse domains.

论文通过引入期望分位数项来解决使用策略梯度方法优化条件风险价值（CVaR）时的样本低效问题，通过动态规划形式利用所有采样数据提高了样本效率。实验结果表明，所提出的方法在风险厌恶领域中显著优于CVaR-PG和其他现有方法。
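
The identity the abstract relies on, that CVaR equals the expectation of the quantile over the tail, can be checked numerically. The sketch below is a minimal illustration on synthetic Gaussian returns, not the paper's algorithm; the sample size, the level `alpha`, and the helper `empirical_quantile` are assumptions. It computes the lower-tail CVaR twice: as the mean of the worst alpha-fraction of samples, and as the average of empirical quantiles over levels in (0, alpha).

```python
import random

random.seed(0)
returns = [random.gauss(1.0, 2.0) for _ in range(100_000)]  # synthetic returns
alpha = 0.1

ordered = sorted(returns)
n = len(ordered)
k = int(alpha * n)  # number of samples in the alpha-tail

# Tail-average form: mean of the worst alpha-fraction of returns.
cvar_tail = sum(ordered[:k]) / k


def empirical_quantile(u):
    """Empirical quantile at level u: the sample at zero-based rank floor(u * n)."""
    index = min(n - 1, max(0, int(u * n)))
    return ordered[index]


# Quantile-average form: average q_u over a midpoint grid of u in (0, alpha).
levels = [(i + 0.5) / n for i in range(k)]
cvar_quantile = sum(empirical_quantile(u) for u in levels) / k

print(f"tail average: {cvar_tail:.4f}, quantile average: {cvar_quantile:.4f}")
```

Both forms agree on the empirical distribution, which is why adding a quantile term can reuse every sampled trajectory without changing the CVaR objective.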

RefAny3D: 3D Asset-Referenced Diffusion Models for Image Generation

Authors: Hanzhuo Huang, Qingyang Bao, Zekai Gu, Zhongshuo Du, Cheng Lin, Yuan Liu, Sibei Yang

Venue: ICLR 2026

First: 2026-01-29T18:30:10+00:00 · Latest: 2026-01-29T18:30:10+00:00

Comments: ICLR 2026. Project page: https://judgementh.github.io/RefAny3D Codes: https://github.com/JudgementH/RefAny3D

Abs · PDF · Code1 · Code2 · Code3 · Project1

Abstract

In this paper, we propose a 3D asset-referenced diffusion model for image generation, exploring how to integrate 3D assets into image diffusion models. Existing reference-based image generation methods leverage large-scale pretrained diffusion models and demonstrate strong capability in generating diverse images conditioned on a single reference image. However, these methods are limited to single-image references and cannot leverage 3D assets, constraining their practical versatility. To address this gap, we present a cross-domain diffusion model with dual-branch perception that leverages multi-view RGB images and point maps of 3D assets to jointly model their colors and canonical-space coordinates, achieving precise consistency between generated images and the 3D references. Our spatially aligned dual-branch generation architecture and domain-decoupled generation mechanism ensure the simultaneous generation of two spatially aligned but content-disentangled outputs, RGB images and point maps, linking 2D image attributes with 3D asset attributes. Experiments show that our approach effectively uses 3D assets as references to produce images consistent with the given assets, opening new possibilities for combining diffusion models with 3D content creation.

中文标题/摘要

标题：RefAny3D：基于3D资产的扩散模型图像生成

在本文中，我们提出了一种基于3D资产的扩散模型，探索如何将3D资产整合到图像扩散模型中。现有的基于参考的图像生成方法利用大规模预训练的扩散模型，并展示了在单张参考图像条件下生成多样化图像的强大能力。然而，这些方法仅限于单张图像参考，无法利用3D资产，限制了其实用的灵活性。为了解决这一差距，我们提出了一种跨域扩散模型，具有双分支感知机制，利用多视角RGB图像和3D资产的点图来联合建模它们的颜色和标准空间坐标，实现生成图像与3D参考之间的精确一致性。我们的空间对齐双分支生成架构和域解耦生成机制确保同时生成两个空间对齐但内容分离的输出，RGB图像和点图，将2D图像属性与3D资产属性联系起来。实验表明，我们的方法有效地利用3D资产作为参考，生成与给定资产一致的图像，为将扩散模型与3D内容创作相结合开辟了新的可能性。

Summary / 总结

This paper introduces RefAny3D, a 3D asset-referenced diffusion model for image generation, addressing the limitation of existing methods that can only use single-image references. The model uses multi-view RGB images and point maps of 3D assets to generate images that are consistent with the 3D references, demonstrating precise alignment and disentanglement of 2D and 3D attributes. Experiments show that RefAny3D effectively leverages 3D assets as references to produce high-quality images, enhancing the practical versatility of image generation models.

本文提出了RefAny3D，一种基于3D资产的图像生成扩散模型。该模型通过整合多视角RGB图像和3D资产的点图，解决了现有方法只能使用单张参考图像的局限性。通过空间对齐的双分支生成架构和领域解耦生成机制，该模型实现了生成图像与3D参考的精确一致性。实验表明，RefAny3D能够有效利用3D资产作为参考生成与给定资产一致的图像，增强了图像生成模型的实用性和灵活性。

Learning Transient Convective Heat Transfer with Geometry Aware World Models

Authors: Onur T. Doganay, Alexander Klawonn, Martin Eigel, Hanno Gottschalk

First: 2026-01-29T18:24:24+00:00 · Latest: 2026-01-29T18:24:24+00:00

Comments: 36 pages, 18 figures, 2 tables

Abs · PDF · Code1 · Code2

Abstract

Partial differential equation (PDE) simulations are fundamental to engineering and physics but are often computationally prohibitive for real-time applications. While generative AI offers a promising avenue for surrogate modeling, standard video generation architectures lack the specific control and data compatibility required for physical simulations. This paper introduces a geometry aware world model architecture, derived from a video generation architecture (LongVideoGAN), designed to learn transient physics. We introduce two key architecture elements: (1) a twofold conditioning mechanism incorporating global physical parameters and local geometric masks, and (2) an architectural adaptation to support arbitrary channel dimensions, moving beyond standard RGB constraints. We evaluate this approach on a 2D transient computational fluid dynamics (CFD) problem involving convective heat transfer from buoyancy-driven flow coupled to a heat flow in a solid structure. We demonstrate that the conditioned model successfully reproduces complex temporal dynamics and spatial correlations of the training data. Furthermore, we assess the model's generalization capabilities on unseen geometric configurations, highlighting both its potential for controlled simulation synthesis and current limitations in spatial precision for out-of-distribution samples.

中文标题/摘要

标题：学习几何感知世界模型中的瞬态对流热传递

偏微分方程（PDE）模拟是工程和物理学的基础，但在实时应用中通常计算成本高昂。虽然生成式AI为替代建模提供了有希望的途径，但标准的视频生成架构缺乏物理模拟所需的特定控制和数据兼容性。本文介绍了一种几何感知世界模型架构，该架构源自视频生成架构（LongVideoGAN），旨在学习瞬态物理现象。我们引入了两个关键架构元素：（1）结合全局物理参数和局部几何掩码的双重条件机制，（2）架构适应以支持任意通道维度，超越了标准的RGB限制。我们通过一个涉及由浮力驱动的流动与固体结构中热流耦合的2D瞬态计算流体动力学（CFD）问题，评估了这种方法。我们证明，条件模型成功地再现了训练数据的复杂时空动态和空间相关性。此外，我们还评估了该模型在未见几何配置上的泛化能力，突显了其在受控模拟合成方面的潜力以及在分布外样本中空间精度的当前局限性。

Summary / 总结

This paper addresses the computational challenges of using partial differential equation (PDE) simulations in real-time applications by proposing a geometry-aware world model architecture. The model, derived from LongVideoGAN, incorporates a twofold conditioning mechanism and supports arbitrary channel dimensions. It is evaluated on a 2D transient CFD problem involving convective heat transfer, demonstrating the model's ability to reproduce complex temporal and spatial dynamics and its potential for controlled simulation synthesis on unseen geometric configurations, though it shows limitations in spatial precision for out-of-distribution samples.

本文通过提出一种几何感知的世界模型架构来应对使用偏微分方程（PDE）模拟在实时应用中的计算挑战。该模型基于LongVideoGAN，包含双重条件机制并支持任意通道维度。它在涉及对流热传递的2D瞬态CFD问题上进行了评估，展示了能够重现复杂的时间和空间动态的能力。该模型还显示出在未见过的几何配置上进行泛化的潜力，尽管它在处理分布外样本的空间精度方面存在局限性。

Where Do the Joules Go? Diagnosing Inference Energy Consumption

Authors: Jae-Won Chung, Ruofan Wu, Jeff J. Ma, Mosharaf Chowdhury

First: 2026-01-29T18:16:45+00:00 · Latest: 2026-01-29T18:16:45+00:00

Comments: The ML.ENERGY Leaderboard v3.0 is open https://ml.energy/leaderboard

Abs · PDF · Code1 · Code2

Abstract

Energy is now a critical ML computing resource. While measuring energy consumption and observing trends is a valuable first step, accurately understanding and diagnosing why those differences occur is crucial for optimization. To that end, we begin by presenting a large-scale measurement study of inference time and energy across the generative AI landscape with 46 models, 7 tasks, and 1,858 different configurations on NVIDIA H100 and B200 GPUs. Our empirical findings span order-of-magnitude variations: LLM task type can lead to 25$\times$ energy differences, video generation sometimes consumes more than 100$\times$ the energy of images, and GPU utilization differences can result in 3--5$\times$ energy differences. Based on our observations, we present a framework for reasoning about the underlying mechanisms that govern time and energy consumption. The essence is that time and energy are determined by latent metrics like memory and utilization, which are in turn affected by various factors across the algorithm, software, and hardware layers. Our framework also extends directly to throughput per watt, a critical metric for power-constrained datacenters.

中文标题/摘要

标题：能量去向何方？诊断推理能耗消耗

能源现在是关键的ML计算资源。虽然测量能耗和观察趋势是一个有价值的初步步骤，但准确理解并诊断这些差异发生的原因对于优化至关重要。为此，我们首先通过在NVIDIA H100和B200 GPU上使用46个模型、7个任务和1,858种不同配置进行大规模测量研究，展示了生成AI领域的推理时间和能耗情况。我们的实证发现涵盖了数量级的差异：LLM任务类型可能导致25倍的能耗差异，视频生成有时消耗超过100倍的图像能耗，而GPU利用率差异可能导致3到5倍的能耗差异。基于我们的观察，我们提出了一种关于时间与能耗消耗背后机制的推理框架。核心在于时间与能耗由潜在指标如内存和利用率决定，而这些指标又受到算法、软件和硬件各层因素的影响。我们的框架还直接扩展到每瓦吞吐量，这是受电力限制的数据中心的关键指标。

Summary / 总结

This study investigates inference time and energy consumption across generative AI, measuring 1,858 configurations of 46 models and 7 tasks on NVIDIA H100 and B200 GPUs. It finds order-of-magnitude variations: LLM task type can lead to 25x energy differences, video generation sometimes consumes more than 100x the energy of image generation, and GPU utilization differences can result in 3-5x energy differences. The authors propose a framework for reasoning about these differences: time and energy are determined by latent metrics such as memory and utilization, which are in turn shaped by factors across the algorithm, software, and hardware layers. The framework also extends to throughput per watt, a critical metric for power-constrained datacenters.

该研究调查了各种生成AI模型和任务下的推理能耗，发现了显著的能耗差异。通过在NVIDIA H100和B200 GPU上测量1,858种配置，研究者发现LLM任务类型可能导致高达25倍的能耗增加，而视频生成的能耗有时会超过图像生成的100倍。研究提出了一种框架来理解影响时间和能耗的因素，包括由算法、软件和硬件层影响的内存和利用率指标。该框架还适用于每瓦吞吐量，这是受限于功率的数据中心的关键指标。

SiDGen: Structure-informed Diffusion for Generative modeling of Ligands for Proteins

Authors: Samyak Sanghvi, Nishant Ranjan, Tarak Karmakar

First: 2025-11-12T18:25:51+00:00 · Latest: 2026-01-29T17:59:16+00:00

Comments: 10 pages, 2 figures

Abs · PDF · Code1 · Code2

Abstract

Designing ligands that are both chemically valid and structurally compatible with protein binding pockets is a key bottleneck in computational drug discovery. Existing approaches either ignore structural context or rely on expensive, memory-intensive encoding that limits throughput and scalability. We present SiDGen (Structure-informed Diffusion Generator), a protein-conditioned diffusion framework that integrates masked SMILES generation with lightweight folding-derived features for pocket awareness. To balance expressivity with efficiency, SiDGen supports two conditioning pathways: a streamlined mode that pools coarse structural signals from protein embeddings and a full mode that injects localized pairwise biases for stronger coupling. A coarse-stride folding mechanism with nearest-neighbor upsampling alleviates the quadratic memory costs of pair tensors, enabling training on realistic sequence lengths. Learning stability is maintained through in-loop chemical validity checks and an invalidity penalty, while large-scale training efficiency is restored via selective compilation, dataloader tuning, and gradient accumulation. In automated benchmarks, SiDGen generates ligands with high validity, uniqueness, and novelty, while achieving competitive performance in docking-based evaluations and maintaining reasonable molecular properties. These results demonstrate that SiDGen can deliver scalable, pocket-aware molecular design, providing a practical route to conditional generation for high-throughput drug discovery.
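
The memory argument behind the coarse-stride mechanism in the abstract above can be sketched as follows. This is only an illustration of nearest-neighbor upsampling of a pairwise map, assuming a square map stored as nested lists; the function name, the stride value, and the layout are hypothetical and do not describe SiDGen's actual implementation.

```python
def upsample_nearest(coarse, stride, length):
    """Expand a coarse pairwise map to length x length by nearest-neighbor copying.

    Entry (i, j) of the full map reads entry (i // stride, j // stride) of the
    coarse map, so the coarse map needs only ceil(length / stride) rows and columns.
    """
    return [
        [coarse[i // stride][j // stride] for j in range(length)]
        for i in range(length)
    ]


# A 2 x 2 coarse map stands in for a 4 x 4 pair tensor when the stride is 2.
coarse_map = [[1, 2],
              [3, 4]]
full_map = upsample_nearest(coarse_map, stride=2, length=4)
for row in full_map:
    print(row)
```

With stride s, the coarse map stores roughly (L / s)^2 entries instead of L^2, which is where the reduction in quadratic memory comes from.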

中文标题/摘要

标题：SiDGen：结构导向的扩散模型用于蛋白质配体生成

设计既化学有效又与蛋白质结合口袋结构兼容的配体是计算药物发现中的一个关键瓶颈。现有方法要么忽略结构上下文，要么依赖于昂贵且内存密集的编码，这限制了吞吐量和可扩展性。我们提出了SiDGen（结构导向扩散生成器），这是一种基于蛋白质条件的扩散框架，结合了掩码SMILES生成和轻量级折叠衍生特征以增强口袋意识。为了平衡表达性和效率，SiDGen 支持两种条件路径：一种简化模式，从蛋白质嵌入中聚合粗略的结构信号；另一种完整模式，注入局部成对偏差以实现更强的耦合。通过粗步长折叠机制和最近邻上采样，缓解成对张量的二次内存成本，使模型能够处理现实长度的序列。通过循环中的化学有效性检查和无效性惩罚保持学习稳定性，通过选择性编译、数据加载器调优和梯度累积恢复大规模训练效率。在自动化基准测试中，SiDGen 生成的配体具有高有效性、独特性和新颖性，同时在对接评估中表现出竞争力，并保持合理的分子性质。这些结果表明，SiDGen 可以实现可扩展且具有口袋意识的分子设计，提供了一条用于高通量药物发现的条件生成的实用途径。

Summary / 总结

SiDGen is a protein-conditioned diffusion framework designed to generate chemically valid and structurally compatible ligands for proteins. It integrates masked SMILES generation with lightweight folding-derived features and supports two conditioning pathways for efficiency and expressivity. SiDGen achieves high validity, uniqueness, and novelty in automated benchmarks, and performs competitively in docking-based evaluations while maintaining reasonable molecular properties.

SiDGen 是一种蛋白质条件下的扩散框架，结合了掩码 SMILES 生成和轻量级折叠衍生特征，以生成化学上有效的且结构上与蛋白质结合口袋兼容的配体。它支持两种条件路径以平衡表达性和效率，并使用粗粒度步长折叠机制来缓解内存成本。实验结果表明，SiDGen 生成的配体具有高有效性、独特性和新颖性，并在对接评估中表现出竞争力，同时保持合理的分子性质。

Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models

Authors: Wenxuan Huang, Yu Zeng, Qiuchen Wang, Zhen Fang, Shaosheng Cao, Zheng Chu, Qingyu Yin, Shuang Chen, Zhenfei Yin, Lin Chen, Zehui Chen, Yao Hu, Philip Torr, Feng Zhao, Wanli Ouyang

First: 2026-01-29T17:58:40+00:00 · Latest: 2026-01-29T17:58:40+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Multimodal large language models (MLLMs) have achieved remarkable success across a broad range of vision tasks. However, because MLLMs are constrained by the capacity of their internal world knowledge, prior work has proposed augmenting them with a "reasoning-then-tool-call" scheme over visual and textual search engines, obtaining substantial gains on tasks requiring extensive factual information. These approaches typically define multimodal search in a naive setting, assuming that a single full-level or entity-level image query and a few text queries suffice to retrieve the key evidence needed to answer the question, which is unrealistic in real-world scenarios with substantial visual noise. Moreover, they are often limited in reasoning depth and search breadth, making it difficult to solve complex questions that require aggregating evidence from diverse visual and textual sources. To address these limitations, we propose Vision-DeepResearch, a new multimodal deep-research paradigm that performs multi-turn, multi-entity, and multi-scale visual and textual search to robustly hit real-world search engines under heavy noise. Vision-DeepResearch supports dozens of reasoning steps and hundreds of engine interactions, while internalizing deep-research capabilities into the MLLM via cold-start supervision and RL training, resulting in a strong end-to-end multimodal deep-research MLLM. It substantially outperforms existing multimodal deep-research MLLMs, as well as workflows built on strong closed-source foundation models such as GPT-5, Gemini-2.5-pro, and Claude-4-Sonnet. The code will be released at https://github.com/Osilly/Vision-DeepResearch.

中文标题/摘要

标题：Vision-DeepResearch: 在多模态大型语言模型中激励深度研究能力

多模态大型语言模型（MLLMs）在广泛的视觉任务中取得了显著的成功。然而，由于其内部世界知识的容量限制，先前的工作通过“推理后调用工具”来增强MLLMs，以在需要大量事实信息的任务中获得显著的收益。然而，这些方法通常在一种简单的设置下定义多模态搜索，假设单一的全层级或实体层级图像查询和少量文本查询足以检索到回答问题所需的关键证据，这在现实世界中存在大量视觉噪声的情况下是不现实的。此外，它们通常在推理深度和搜索广度上受到限制，使得解决需要从多种视觉和文本来源汇总证据的复杂问题变得困难。在此基础上，我们提出了Vision-DeepResearch，它提出了一种新的多模态深度研究范式，即进行多轮、多实体和多尺度的视觉和文本搜索，以在重噪声环境下稳健地击中现实世界的搜索引擎。我们的Vision-DeepResearch支持数十个推理步骤和数百次引擎交互，通过冷启动监督和强化学习训练将深度研究能力内置于MLLM中，从而形成一个强大的端到端多模态深度研究MLLM。它在现有的多模态深度研究MLLMs以及基于强大封闭源基础模型的工作流程（如GPT-5、Gemini-2.5-pro和Claude-4-Sonnet）中表现出显著的优越性。代码将在https://github.com/Osilly/Vision-DeepResearch中发布。

Unsupervised Decomposition and Recombination with Discriminator-Driven Diffusion Models

Authors: Archer Wang, Emile Anand, Yilun Du, Marin Soljačić

First: 2026-01-29T17:57:06+00:00 · Latest: 2026-01-29T17:57:06+00:00

Comments: 28 pages, 16 figures, 4 tables

Abs · PDF · Code1 · Code2

Abstract

Decomposing complex data into factorized representations can reveal reusable components and enable synthesizing new samples via component recombination. We investigate this in the context of diffusion-based models that learn factorized latent spaces without factor-level supervision. In images, factors can capture background, illumination, and object attributes; in robotic videos, they can capture reusable motion components. To improve both latent factor discovery and quality of compositional generation, we introduce an adversarial training signal via a discriminator trained to distinguish between single-source samples and those generated by recombining factors across sources. By optimizing the generator to fool this discriminator, we encourage physical and semantic consistency in the resulting recombinations. Our method outperforms implementations of prior baselines on CelebA-HQ, Virtual KITTI, CLEVR, and Falcor3D, achieving lower FID scores and better disentanglement as measured by MIG and MCC. Furthermore, we demonstrate a novel application to robotic video trajectories: by recombining learned action components, we generate diverse sequences that significantly increase state-space coverage for exploration on the LIBERO benchmark.

中文标题/摘要

标题：无监督分解与重组：基于鉴别器驱动的扩散模型

将复杂数据分解为因子表示可以揭示可重用的组件，并通过组件重组生成新的样本。我们研究了在无需因子级别监督的情况下，基于扩散模型学习因子化的潜在空间。在图像中，因子可以捕捉背景、照明和对象属性；在机器人视频中，它们可以捕捉可重用的动作组件。为了提高潜在因子发现的质量和组合生成的质量，我们通过训练一个鉴别器来区分单一来源样本和通过重组因子生成的样本，引入了一种对抗训练信号。通过优化生成器来欺骗这个鉴别器，我们鼓励重组结果中的物理和语义一致性。我们的方法在CelebA-HQ、Virtual KITTI、CLEVR和Falcor3D上优于先前基线的实现，实现了更低的FID分数和更好的解耦合，如MIG和MCC所测量的。此外，我们展示了机器人视频轨迹的一个新应用：通过重组学习到的动作组件，我们生成了多样化的序列，显著增加了LIBERO基准上的状态空间覆盖。

Summary / 总结

The research aims to decompose complex data into reusable components for better data synthesis. It uses diffusion-based models to learn factorized latent spaces without supervision. An adversarial training signal from a discriminator improves both latent factor discovery and compositional generation quality. The method outperforms prior baselines on various datasets, showing lower FID scores and better disentanglement. It also demonstrates a novel application in generating diverse robotic video trajectories for exploration on the LIBERO benchmark.

该论文研究了使用基于扩散的模型进行复杂数据的无监督分解和重组。作者通过引入判别器的对抗训练信号来增强潜在因子的发现质量和组合生成质量。该方法在多个数据集上优于先前的基线，实现了更低的FID分数和更好的去纠缠性。此外，它在机器人视频中展示了有效的行为组件重组，提高了探索基准中的状态空间覆盖范围。

PFT: Phonon Fine-tuning for Machine Learned Interatomic Potentials

Authors: Teddy Koker, Abhijeet Gangan, Mit Kotak, Jaime Marian, Tess Smidt

First: 2026-01-12T17:20:09+00:00 · Latest: 2026-01-29T17:54:33+00:00

Abs · PDF · Code1 · Code2

Abstract

Many materials properties depend on higher-order derivatives of the potential energy surface, yet machine learned interatomic potentials (MLIPs) trained with a standard loss on energy, force, and stress errors can exhibit error in curvature, degrading the prediction of vibrational properties. We introduce phonon fine-tuning (PFT), which directly supervises second-order force constants of materials by matching MLIP energy Hessians to DFT-computed force constants from finite displacement phonon calculations. To scale to large supercells, PFT stochastically samples Hessian columns and computes the loss with a single Hessian-vector product. We also use a simple co-training scheme to incorporate upstream data to mitigate catastrophic forgetting. On the MDR Phonon benchmark, PFT improves Nequix MP by 55% on average across phonon thermodynamic properties and achieves state-of-the-art accuracy among models trained on Materials Project trajectories. PFT also generalizes to improve properties beyond second-derivatives, improving thermal conductivity predictions that rely on third-order derivatives of the potential energy.
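
The idea of supervising one sampled Hessian column through a single Hessian-vector product, as described in the abstract above, can be sketched on a toy problem. Everything here is an assumption made for illustration: the energy is a small quadratic form, the Hessian-vector product is approximated with central finite differences of the gradient rather than automatic differentiation, and the names `K_model` and `K_reference` are hypothetical stand-ins for the MLIP Hessian and the DFT force constants.

```python
import random


def gradient(K, x):
    """Gradient of the toy energy E(x) = 0.5 * x^T K x for symmetric K, i.e. K x."""
    n = len(x)
    return [sum(K[i][j] * x[j] for j in range(n)) for i in range(n)]


def hessian_vector_product(grad_fn, x, v, eps=1e-4):
    """Central-difference estimate of H v = (g(x + eps v) - g(x - eps v)) / (2 eps)."""
    g_plus = grad_fn([xi + eps * vi for xi, vi in zip(x, v)])
    g_minus = grad_fn([xi - eps * vi for xi, vi in zip(x, v)])
    return [(gp - gm) / (2 * eps) for gp, gm in zip(g_plus, g_minus)]


def sampled_column_loss(grad_fn, x, reference, rng):
    """Squared error between one random Hessian column (H e_j) and reference column j."""
    n = len(x)
    j = rng.randrange(n)
    e_j = [1.0 if i == j else 0.0 for i in range(n)]
    column = hessian_vector_product(grad_fn, x, e_j)
    return sum((column[i] - reference[i][j]) ** 2 for i in range(n))


# Hypothetical 3-coordinate system: the model and the reference differ in one entry.
K_model = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
K_reference = [[2.0, -1.0, 0.0], [-1.0, 2.5, -1.0], [0.0, -1.0, 2.0]]
x0 = [0.1, -0.2, 0.3]

loss = sampled_column_loss(lambda x: gradient(K_model, x), x0, K_reference, random.Random(0))
print(loss)
```

Each evaluation touches only one column, so the cost per step does not grow with the full size of the Hessian; averaging over many sampled columns gives a stochastic estimate of the full matching error.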

中文标题/摘要

标题：PFT：声子微调用于机器学习原子势

许多材料性质依赖于势能面的高阶导数，而通过标准损失函数（能量、力和应力误差）训练的机器学习原子势（MLIPs）可能会在曲率上出现误差，从而降低振动性质的预测精度。我们引入了声子微调（PFT），它直接监督材料的二阶力常数，通过将MLIP的能量海森矩阵与从有限位移声子计算中得到的DFT计算得到的力常数进行匹配。为了扩展到大型超晶胞，PFT随机采样海森矩阵的列，并通过单个海森矩阵-向量乘积计算损失。我们还使用简单的协同训练方案来整合上游数据，以减轻灾难性遗忘。在MDR声子基准测试中，PFT在声子热力学性质上的平均改进幅度为55%，并在基于材料项目轨迹训练的模型中达到了最先进的精度。PFT还能够泛化以改进超出二阶导数的性质，从而提高依赖于势能三次导数的热导率预测。

Summary / 总结

The research aims to improve how accurately machine learned interatomic potentials (MLIPs) capture higher-order derivatives of the potential energy surface, which govern materials properties such as vibrational behavior and thermal conductivity. The method, phonon fine-tuning (PFT), directly supervises second-order force constants by matching MLIP energy Hessians to DFT-computed force constants. On the MDR Phonon benchmark, PFT improves Nequix MP by 55% on average across phonon thermodynamic properties and achieves state-of-the-art accuracy among models trained on Materials Project trajectories. PFT also generalizes beyond second derivatives, improving thermal conductivity predictions that rely on third-order derivatives of the potential energy.

研究旨在提高机器学习原子势（MLIPs）在预测势能面的高阶导数方面的准确性，这对于材料性质如振动和热导率至关重要。方法是通过直接监督第二阶力常数，将MLIP的能量海森矩阵与从有限位移声子计算得到的DFT力常数进行匹配的声子微调（PFT）。PFT在MDR声子基准测试中显著提高了Nequix MP的平均性能，达到55%的提升，并在热导率预测中达到了最先进的精度。此外，PFT还能够推广到改进第三阶导数性质。

Corrective Diffusion Language Models

Authors: Shuibai Zhang, Fred Zhangzhi Peng, Yiheng Zhang, Jin Pan, Grigorios G. Chrysos

First: 2025-12-17T17:04:38+00:00 · Latest: 2026-01-29T17:52:44+00:00

Comments: 21 pages

Abs · PDF · Code1 · Code2 · Code3

Abstract

While Diffusion Language Models (DLMs) are theoretically well-suited for iterative refinement due to their non-causal structure, they often fail to reliably revise incorrect tokens in practice. The key challenge lies in the model's inability to distinguish between correct and erroneous tokens in a visible sequence. Standard masked diffusion language model (MDLM) training is restricted to the objective of unmasking, undermining the effectiveness of refinement guided by confidence. Based on this observation, we study corrective behavior in DLMs, defined as the ability to assign lower confidence to incorrect tokens and iteratively refine them while preserving correct content. We show that this capability is not induced by conventional masked diffusion objectives and propose a post-training principle oriented by correction that explicitly supervises visible incorrect tokens, enabling discriminative confidence and targeted refinement. To evaluate corrective behavior, we introduce the Code Revision Benchmark, a controllable and executable benchmark for assessing error localization and in-place correction. Experiments on code revision tasks and parallel decoding scenarios demonstrate that models trained with our approach substantially outperform standard MDLMs, with gains that are most pronounced when parallel decoding introduces substantial uncertainty and iterative refinement becomes essential. Our code is publicly available at https://github.com/zhangshuibai/CDLM.

中文标题/摘要

标题：纠正性扩散语言模型

尽管扩散语言模型（DLMs）由于其非因式结构在理论上非常适合逐步改进，但在实践中往往无法可靠地修正错误的标记。关键挑战在于模型无法区分可见序列中的正确和错误标记。标准的掩码扩散语言模型（MDLM）训练仅限于解掩的目的，削弱了由置信度引导的改进效果。基于这一观察，我们研究了DLMs的纠正行为，定义为能够对错误标记赋予较低置信度，并在保持正确内容的同时逐步修正它们。我们表明，这种能力不是由传统的掩码扩散目标诱导的，并提出了一种以纠正为导向的后训练原则，明确监督可见的错误标记，从而实现区分性置信度和目标修正。为了评估纠正行为，我们引入了代码修订基准，这是一个可控且可执行的基准，用于评估错误定位和就地修正。在代码修订任务和并行解码场景上的实验表明，使用我们方法训练的模型显著优于标准MDLM，尤其是在并行解码引入大量不确定性且逐步修正变得必不可少时，这种优势最为明显。我们的代码已公开发布在https://github.com/zhangshuibai/CDLM。

Summary / 总结

The research aims to improve the ability of Diffusion Language Models (DLMs) to correct errors iteratively by addressing their challenge in distinguishing correct from incorrect tokens. The study proposes a post-training principle that explicitly supervises visible incorrect tokens, enabling models to assign lower confidence to errors and iteratively refine them while preserving correct content. Experiments show that models trained with this approach outperform standard MDLMs, especially in scenarios with high uncertainty during parallel decoding.

研究旨在通过解决扩散语言模型(DLMs)难以区分正确和错误标记的问题，提高其纠错能力。研究引入了一种后训练原则，明确监督可见的错误标记，使模型能够对错误赋予较低的置信度并进行迭代修正。实验结果显示，使用这种方法训练的模型在代码修订任务和并行解码场景中显著优于标准的掩码扩散语言模型，尤其是在并行解码引入大量不确定性需要迭代修正时效果更佳。

MetricAnything: Scaling Metric Depth Pretraining with Noisy Heterogeneous Sources

Authors: Baorui Ma, Jiahui Yang, Donglin Di, Xuancheng Zhang, Jianxun Cui, Hao Li, Yan Xie, Wei Chen

First: 2026-01-29T17:52:41+00:00 · Latest: 2026-01-29T17:52:41+00:00

Comments: Project Page: https://metric-anything.github.io/metric-anything-io/

Abs · PDF · Code1 · Code2 · Project1 · Project2

Abstract

Scaling has powered recent advances in vision foundation models, yet extending this paradigm to metric depth estimation remains challenging due to heterogeneous sensor noise, camera-dependent biases, and metric ambiguity in noisy cross-source 3D data. We introduce Metric Anything, a simple and scalable pretraining framework that learns metric depth from noisy, diverse 3D sources without manually engineered prompts, camera-specific modeling, or task-specific architectures. Central to our approach is the Sparse Metric Prompt, created by randomly masking depth maps, which serves as a universal interface that decouples spatial reasoning from sensor and camera biases. Using about 20M image-depth pairs spanning reconstructed, captured, and rendered 3D data across 10,000 camera models, we demonstrate, for the first time, a clear scaling trend in the metric depth track. The pretrained model excels at prompt-driven tasks such as depth completion, super-resolution, and Radar-camera fusion, while its distilled prompt-free student achieves state-of-the-art results on monocular depth estimation, camera intrinsics recovery, single/multi-view metric 3D reconstruction, and VLA planning. We also show that using the pretrained ViT of Metric Anything as a visual encoder significantly boosts Multimodal Large Language Model capabilities in spatial intelligence. These results show that metric depth estimation can benefit from the same scaling laws that drive modern foundation models, establishing a new path toward scalable and efficient real-world metric perception. We open-source MetricAnything at http://metric-anything.github.io/metric-anything-io/ to support community research.
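
The abstract above describes the Sparse Metric Prompt only as the result of randomly masking depth maps. A minimal sketch of that idea is shown below, assuming independent per-pixel masking with a fixed keep ratio and `None` as the marker for a missing measurement; the function name, the keep ratio, and the masking distribution are all hypothetical, since the abstract does not specify them.

```python
import random


def sparse_metric_prompt(depth, keep_ratio, rng):
    """Keep each valid depth value with probability keep_ratio; mask the rest as None."""
    return [
        [d if d is not None and rng.random() < keep_ratio else None for d in row]
        for row in depth
    ]


# Hypothetical 3 x 4 metric depth map in meters; None marks a pixel with no measurement.
depth_map = [
    [1.2, 1.3, None, 4.0],
    [1.1, 1.4, 3.9, 4.1],
    [1.0, None, 3.8, 4.2],
]
prompt = sparse_metric_prompt(depth_map, keep_ratio=0.25, rng=random.Random(0))
for row in prompt:
    print(row)
```

Every surviving value keeps its original metric scale, so the sparse map can act as a hint about absolute depth regardless of which sensor produced the dense map.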

中文标题/摘要

标题：MetricAnything：通过嘈杂异构数据源扩展度量深度预训练

扩展性推动了视觉基础模型的最新进展，但将其扩展到度量深度估计仍然具有挑战性，因为存在异构传感器噪声、相机相关的偏差以及嘈杂跨源3D数据中的度量模糊性。我们提出了Metric Anything，这是一种简单且可扩展的预训练框架，可以从嘈杂且多样的3D数据源中学习度量深度，而无需手动工程化的提示、相机特定的建模或任务特定的架构。我们方法的核心是稀疏度量提示，它是通过随机遮蔽深度图创建的，作为通用接口，将空间推理与传感器和相机偏差解耦。使用来自10000种相机模型的重建、捕获和渲染3D数据的约2000万幅图像-深度对，我们首次展示了度量深度跟踪中的清晰扩展趋势。预训练模型在深度完成、超分辨率和雷达-相机融合等提示驱动任务中表现出色，而其精简的无提示学生模型在单目深度估计、相机内参恢复、单视或多视度量3D重建和VLA规划方面达到了最先进的结果。我们还展示了使用Metric Anything的预训练ViT作为视觉编码器可以显著提升多模态大型语言模型在空间智能方面的性能。这些结果表明，度量深度估计可以从驱动现代基础模型的相同扩展法则中受益，从而开辟了一条新的通向可扩展和高效现实世界度量感知的道路。我们开源了MetricAnything，网址为http://metric-anything.github.io/metric-anything-io/，以支持社区研究。

Summary / 总结

MetricAnything is a pretraining framework for metric depth estimation that leverages noisy, heterogeneous 3D data sources without requiring manually engineered prompts, camera-specific modeling, or task-specific architectures. By using a Sparse Metric Prompt created by randomly masking depth maps, it decouples spatial reasoning from sensor and camera biases. The pretrained model shows a clear scaling trend and excels in prompt-driven tasks like depth completion and super-resolution, while its distilled prompt-free student achieves state-of-the-art results in monocular depth estimation, camera intrinsics recovery, metric 3D reconstruction, and VLA planning. This work shows that metric depth estimation can benefit from the same scaling laws that drive modern foundation models.

MetricAnything 是一个用于度量深度估计的预训练框架，利用了嘈杂的异构 3D 数据源，无需手动提示或特定任务的架构。通过使用随机遮掩深度图生成的稀疏度量提示，它将空间推理与传感器和相机偏见解耦。预训练模型展示了清晰的扩展趋势，并在深度补全和超分辨率等任务中表现出色。精简的学生模型在单目深度估计和其他 3D 重建任务中达到了最先进的结果。这项工作表明，度量深度估计可以从现代基础模型的扩展规律中受益，从而提高现实世界度量感知的效率和可扩展性。

PLANING: A Loosely Coupled Triangle-Gaussian Framework for Streaming 3D Reconstruction

Authors: Changjian Jiang, Kerui Ren, Xudong Li, Kaiwen Song, Linning Xu, Tao Lu, Junting Dong, Yu Zhang, Bo Dai, Mulin Yu

First: 2026-01-29T17:47:26+00:00 · Latest: 2026-01-29T17:47:26+00:00

Abs · PDF · Code1 · Code2 · Project1

Abstract

Streaming reconstruction from monocular image sequences remains challenging, as existing methods typically favor either high-quality rendering or accurate geometry, but rarely both. We present PLANING, an efficient on-the-fly reconstruction framework built on a hybrid representation that loosely couples explicit geometric primitives with neural Gaussians, enabling geometry and appearance to be modeled in a decoupled manner. This decoupling supports an online initialization and optimization strategy that separates geometry and appearance updates, yielding stable streaming reconstruction with substantially reduced structural redundancy. PLANING improves dense mesh Chamfer-L2 by 18.52% over PGSR, surpasses ARTDECO by 1.31 dB PSNR, and reconstructs ScanNetV2 scenes in under 100 seconds, over 5x faster than 2D Gaussian Splatting, while matching the quality of offline per-scene optimization. Beyond reconstruction quality, the structural clarity and computational efficiency of PLANING make it well suited for a broad range of downstream applications, such as enabling large-scale scene modeling and simulation-ready environments for embodied AI. Project page: https://city-super.github.io/PLANING/ .

中文标题/摘要

标题：PLANING：一种松耦合三角高斯框架用于流式3D重建

单目图像序列的流式重建仍然具有挑战性，因为现有方法通常更倾向于高质量的渲染或准确的几何结构，但很少两者兼顾。我们提出了PLANING，这是一种基于混合表示的高效实时重建框架，该框架松散地将显式的几何原语与神经高斯函数耦合，使得几何结构和外观可以以解耦的方式建模。这种解耦支持一种在线初始化和优化策略，将几何结构和外观更新分离，从而实现具有显著减少结构冗余的稳定流式重建。PLANING在稠密网格Chamfer-L2上比PGSR提高了18.52%，在PSNR上比ARTDECO提高了1.31 dB，并在不到100秒内重建了ScanNetV2场景，比2D高斯点积快5倍以上，同时保持了离线逐场景优化的质量。除了重建质量，PLANING的结构清晰度和计算效率使其适用于广泛的下游应用，如大规模场景建模和为具身AI准备的模拟环境。项目页面：https://city-super.github.io/PLANING/ 。

Summary / 总结

The research addresses the challenge of streaming 3D reconstruction from monocular image sequences, where existing methods typically favor either high-quality rendering or accurate geometry. PLANING introduces a hybrid representation that loosely couples explicit geometric primitives with neural Gaussians, so that geometry and appearance can be modeled and updated separately during online initialization and optimization. The approach improves dense mesh Chamfer-L2 by 18.52% over PGSR, surpasses ARTDECO by 1.31 dB PSNR, and reconstructs ScanNetV2 scenes in under 100 seconds, over 5x faster than 2D Gaussian Splatting, with substantially reduced structural redundancy. Beyond reconstruction, PLANING's structural clarity and computational efficiency make it suitable for downstream applications such as large-scale scene modeling and simulation-ready environments for embodied AI.

研究通过提出PLANING框架，利用显式几何原语和神经高斯的混合表示来解耦几何和外观，解决了从单目图像序列进行流式3D重建的挑战。该方法实现了高效的在线重建，提高了质量指标，减少了结构冗余，并比现有方法更快地重建场景，耗时不到100秒，比2D高斯散点图快5倍以上，同时匹配离线场景优化的质量。

Urban Neural Surface Reconstruction from Constrained Sparse Aerial Imagery with 3D SAR Fusion

Authors: Da Li, Chen Yao, Tong Mao, Jiacheng Bao, Houjun Sun

First: 2026-01-29T17:47:07+00:00 · Latest: 2026-01-29T17:47:07+00:00

Abs · PDF · Code1 · Code2

Abstract

Neural surface reconstruction (NSR) has recently shown strong potential for urban 3D reconstruction from multi-view aerial imagery. However, existing NSR methods often suffer from geometric ambiguity and instability, particularly under sparse-view conditions. This issue is critical in large-scale urban remote sensing, where aerial image acquisition is limited by flight paths, terrain, and cost. To address this challenge, we present the first urban NSR framework that fuses 3D synthetic aperture radar (SAR) point clouds with aerial imagery for high-fidelity reconstruction under constrained, sparse-view settings. 3D SAR can efficiently capture large-scale geometry even from a single side-looking flight path, providing robust priors that complement photometric cues from images. Our framework integrates radar-derived spatial constraints into an SDF-based NSR backbone, guiding structure-aware ray selection and adaptive sampling for stable and efficient optimization. We also construct the first benchmark dataset with co-registered 3D SAR point clouds and aerial imagery, facilitating systematic evaluation of cross-modal 3D reconstruction. Extensive experiments show that incorporating 3D SAR markedly enhances reconstruction accuracy, completeness, and robustness compared with single-modality baselines under highly sparse and oblique-view conditions, highlighting a viable route toward scalable high-fidelity urban reconstruction with advanced airborne and spaceborne optical-SAR sensing.

中文标题/摘要

标题：基于3D SAR融合的受限稀疏航空影像城市神经表面重建

神经表面重建（NSR）最近在多视角航空影像的城市三维重建中显示出强大的潜力。然而，现有的NSR方法在稀疏视角条件下经常遭受几何模糊和不稳定性的困扰。这个问题在大规模城市遥感中尤为关键，因为航空影像获取受限于飞行路径、地形和成本。为了解决这一挑战，我们提出了第一个将3D合成孔径雷达（SAR）点云与航空影像融合的城市NSR框架，以在受限、稀疏视角设置下实现高保真重建。3D SAR可以从单侧飞行路径高效地捕获大规模几何结构，提供与图像的光度线索互补的鲁棒先验。我们的框架将雷达提取的空间约束整合到基于SDF的NSR骨干中，指导结构感知的射线选择和自适应采样，以实现稳定和高效的优化。我们还构建了第一个包含配准3D SAR点云和航空影像的基准数据集，便于系统评估跨模态三维重建。大量实验表明，在高度稀疏和偏斜视角条件下，结合3D SAR显著提高了重建的准确度、完整性和鲁棒性，突显了通过先进的机载和星载光学-SAR传感实现可扩展高保真城市重建的可行途径。

Summary / 总结

The research aims to improve urban 3D reconstruction from sparse aerial imagery by integrating 3D synthetic aperture radar (SAR) data. The method fuses SAR point clouds with aerial imagery to provide robust geometric priors, addressing the challenges of geometric ambiguity and instability under constrained view conditions. Experiments demonstrate that the proposed framework significantly enhances reconstruction accuracy, completeness, and robustness compared to single-modality approaches in highly sparse and oblique-view scenarios.

研究通过提出一种新颖的框架，将3D合成孔径雷达（SAR）点云与航空影像融合，以解决从稀疏航空影像中进行城市三维重建时的几何模糊问题。该方法将雷达提取的空间约束整合到基于SDF的神经表面重建模型中，从而在受限、稀疏视角设置下提高了重建的准确度、完整性和鲁棒性。通过广泛的实验验证，该框架在高度稀疏和侧视视角条件下显著优于单一模态基线方法。

SIA: Symbolic Interpretability for Anticipatory Deep Reinforcement Learning in Network Control

Authors: MohammadErfan Jabbari, Abhishek Duttagupta, Claudio Fiandrino, Leonardo Bonati, Salvatore D'Oro, Michele Polese, Marco Fiore, Tommaso Melodia

First: 2026-01-29T17:46:46+00:00 · Latest: 2026-01-29T17:46:46+00:00

Comments: 10 pages, 12 figures, accepted at IEEE INFOCOM 2026

Abs · PDF · Code1 · Code2

Abstract

Deep reinforcement learning (DRL) promises adaptive control for future mobile networks but conventional agents remain reactive: they act on past and current measurements and cannot leverage short-term forecasts of exogenous KPIs such as bandwidth. Augmenting agents with predictions can overcome this temporal myopia, yet uptake in networking is scarce because forecast-aware agents act as closed-boxes; operators cannot tell whether predictions guide decisions or justify the added complexity. We propose SIA, the first interpreter that exposes in real time how forecast-augmented DRL agents operate. SIA fuses Symbolic AI abstractions with per-KPI Knowledge Graphs to produce explanations, and includes a new Influence Score metric. SIA achieves sub-millisecond speed, over 200x faster than existing XAI methods. We evaluate SIA on three diverse networking use cases, uncovering hidden issues, including temporal misalignment in forecast integration and reward-design biases that trigger counter-productive policies. These insights enable targeted fixes: a redesigned agent achieves a 9% higher average bitrate in video streaming, and SIA's online Action-Refinement module improves RAN-slicing reward by 25% without retraining. By making anticipatory DRL transparent and tunable, SIA lowers the barrier to proactive control in next-generation mobile networks.

中文标题/摘要

标题：SIA：网络控制中预见性深度强化学习的符号可解释性

深度强化学习（DRL）为未来的移动网络提供了自适应控制的潜力，但传统的代理仍然是反应性的：它们基于过去的和当前的测量值行动，无法利用外生KPI（如带宽）的短期预测。通过预测增强代理可以克服这种时间短视，但在网络中采用却很少，因为预测意识型的代理作为黑箱操作，运营商无法判断预测是否指导决策或增加复杂性。我们提出了SIA，这是第一个实时揭示预测增强DRL代理操作方式的解释器。SIA 结合了符号人工智能抽象与每个KPI的知识图谱，生成解释，并包含一个新的影响分数指标。SIA 达到了亚毫秒级的速度，比现有解释性人工智能方法快200多倍。我们在三个不同的网络应用场景上评估了SIA，发现了隐藏的问题，包括预测集成的时间错位和奖励设计偏差，这些偏差触发了反生产性的策略。这些见解使我们能够进行针对性的修复：重新设计的代理在视频流媒体中平均比特率提高了9%，SIA 的在线行动精炼模块在不重新训练的情况下将RAN切片奖励提高了25%。通过使预见性DRL透明和可调，SIA 降低了下一代移动网络中主动控制的门槛。

Summary / 总结

SIA is an interpreter that exposes in real time how forecast-augmented deep reinforcement learning (DRL) agents operate in network control, where agents use short-term forecasts of KPIs such as bandwidth. It fuses Symbolic AI abstractions with per-KPI Knowledge Graphs to generate explanations and introduces an Influence Score metric, running at sub-millisecond speed, over 200x faster than existing XAI methods. Evaluated on three networking use cases, SIA uncovers hidden issues such as temporal misalignment in forecast integration and reward-design biases. The resulting targeted fixes yield a 9% higher average bitrate in video streaming, and SIA's online Action-Refinement module gives a 25% improvement in RAN-slicing reward without retraining.

SIA 是一种方法，通过结合短期 KPI 预测来增强网络控制中深度强化学习代理的可解释性。它使用符号 AI 和知识图谱生成实时解释，并引入了影响分数指标。SIA 显著提高了预测感知代理的透明度，使运营商能够理解其决策过程。实验结果表明，SIA 可以揭示网络控制中的隐藏问题，并导致有针对性的修复，例如视频流传输的平均比特率提高 9%，以及 RAN 切片奖励提高 25% 而无需重新训练。

Understanding Multimodal Complementarity for Single-Frame Action Anticipation

Authors: Manuel Benavent-Lledo, Konstantinos Bacharidis, Konstantinos Papoutsakis, Antonis Argyros, Jose Garcia-Rodriguez

First: 2026-01-29T17:44:23+00:00 · Latest: 2026-01-29T17:44:23+00:00

Abs · PDF · Code1 · Code2

Abstract

Human action anticipation is commonly treated as a video understanding problem, implicitly assuming that dense temporal information is required to reason about future actions. In this work, we challenge this assumption by investigating what can be achieved when action anticipation is constrained to a single visual observation. We ask a fundamental question: how much information about the future is already encoded in a single frame, and how can it be effectively exploited? Building on our prior work on Action Anticipation at a Glimpse (AAG), we conduct a systematic investigation of single-frame action anticipation enriched with complementary sources of information. We analyze the contribution of RGB appearance, depth-based geometric cues, and semantic representations of past actions, and investigate how different multimodal fusion strategies, keyframe selection policies and past-action history sources influence anticipation performance. Guided by these findings, we consolidate the most effective design choices into AAG+, a refined single-frame anticipation framework. Despite operating on a single frame, AAG+ consistently improves upon the original AAG and achieves performance comparable to, or exceeding, that of state-of-the-art video-based methods on challenging anticipation benchmarks including IKEA-ASM, Meccano and Assembly101. Our results offer new insights into the limits and potential of single-frame action anticipation, and clarify when dense temporal modeling is necessary and when a carefully selected glimpse is sufficient.

中文标题/摘要

标题：理解多模态互补性以实现单帧动作预判

人类动作预判通常被视为一个视频理解问题，隐含地假设需要密集的时间信息来推断未来动作。在本文中，我们通过研究动作预判在仅限于单个视觉观察时能实现什么，挑战了这一假设。我们提出一个基本问题：未来的信息在单帧中已经编码了多少，如何有效利用？基于我们之前关于一瞥动作预判（AAG）的工作，我们系统地研究了单帧动作预判，结合了互补的信息来源。我们分析了RGB外观、基于深度的几何线索和过去动作的语义表示的贡献，并探讨了不同的多模态融合策略、关键帧选择策略和过去动作历史来源如何影响预判性能。根据这些发现，我们将最有效的设计选择整合到AAG+中，这是一个改进的单帧预判框架。尽管仅基于单帧，AAG+在原始AAG的基础上始终表现出改进，并在包括IKEA-ASM、Meccano和Assembly101在内的具有挑战性的预判基准上达到了与最先进的基于视频的方法相当或更优的性能。我们的结果为单帧动作预判的局限性和潜力提供了新的见解，并澄清了何时需要密集的时间建模以及何时一个精心选择的瞥视就足够。

Summary / 总结

This work challenges the common assumption that dense temporal information is necessary for human action anticipation by investigating single-frame action anticipation. The authors explore the information encoded in a single frame and how it can be effectively utilized, leading to the development of AAG+, which improves upon the original AAG and matches or exceeds the performance of state-of-the-art video-based methods on challenging benchmarks. Key findings include the effective use of RGB appearance, depth-based geometric cues, and semantic representations of past actions, as well as the importance of multimodal fusion strategies and keyframe selection policies.

这项工作挑战了动作预见通常需要密集的时间信息这一假设，通过研究单帧动作预见来探索单帧中已编码的信息及其有效利用方式，最终开发出AAG+，该方法在具有挑战性的基准测试中提高了原AAG的表现，并且与最先进的基于视频的方法相当或超越。研究提供了单帧动作预见的局限性和潜力的新见解，并指明了何时需要密集的时间建模，何时一个精心选择的瞬间就足够了。

Optimizing Agentic Workflows using Meta-tools

Authors: Sami Abuzakuk, Anne-Marie Kermarrec, Rishi Sharma, Rasmus Moorits Veski, Martijn de Vos

First: 2026-01-29T17:43:08+00:00 · Latest: 2026-01-29T17:43:08+00:00

Abs · PDF · Code1 · Code2

Abstract

Agentic AI enables LLMs to dynamically reason, plan, and interact with tools to solve complex tasks. However, agentic workflows often require many iterative reasoning steps and tool invocations, leading to significant operational expense, end-to-end latency, and failures due to hallucinations. This work introduces Agent Workflow Optimization (AWO), a framework that identifies and optimizes redundant tool execution patterns to improve the efficiency and robustness of agentic workflows. AWO analyzes existing workflow traces to discover recurring sequences of tool calls and transforms them into meta-tools, which are deterministic, composite tools that bundle multiple agent actions into a single invocation. Meta-tools bypass unnecessary intermediate LLM reasoning steps and reduce operational cost while also shortening execution paths, leading to fewer failures. Experiments on two agentic AI benchmarks show that AWO reduces the number of LLM calls by up to 11.9% while also increasing the task success rate by up to 4.2 percentage points.
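
The two steps named in the abstract above, discovering recurring tool-call sequences in traces and bundling them into a deterministic composite tool, can be sketched as follows. This is a simplified illustration and not AWO's actual algorithm: it counts only contiguous fixed-length sequences, it assumes each tool's output feeds the next tool's input, and the tool names and traces are hypothetical.

```python
from collections import Counter


def frequent_sequences(traces, length=2, min_count=2):
    """Count contiguous tool-call sequences of a given length; keep the recurring ones."""
    counts = Counter()
    for trace in traces:
        for i in range(len(trace) - length + 1):
            counts[tuple(trace[i:i + length])] += 1
    return {seq: n for seq, n in counts.items() if n >= min_count}


def make_meta_tool(tools, sequence):
    """Bundle a fixed sequence of tools into one deterministic callable."""
    def meta_tool(value):
        for name in sequence:
            value = tools[name](value)  # output of one tool is the input of the next
        return value
    return meta_tool


# Hypothetical workflow traces: each inner list is the ordered tool calls of one run.
traces = [
    ["search", "open", "summarize"],
    ["search", "open", "extract"],
    ["login", "search", "open"],
]
recurring = frequent_sequences(traces)

# Hypothetical tools that just wrap their input, to make the call order visible.
tools = {
    "search": lambda query: f"results[{query}]",
    "open": lambda results: f"page<{results}>",
}
search_and_open = make_meta_tool(tools, ("search", "open"))
print(recurring, search_and_open("phonons"))
```

In this sketch the agent would issue one call to `search_and_open` where it previously needed two tool calls with an LLM reasoning step in between.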

中文标题/摘要

标题：使用元工具优化代理工作流

代理AI使LLM能够动态推理、规划并与工具交互以解决复杂任务。然而，代理工作流通常需要许多迭代推理步骤和工具调用，导致显著的操作成本、端到端延迟和由于幻觉导致的失败。本研究引入了代理工作流优化(AWO)框架，该框架识别并优化冗余的工具执行模式，以提高代理工作流的效率和鲁棒性。AWO分析现有的工作流轨迹，发现重复的工具调用序列，并将它们转换为元工具，这是一种确定性的复合工具，将多个代理操作打包成一次调用。元工具绕过了不必要的中间LLM推理步骤，减少了操作成本，同时缩短了执行路径，减少了失败次数。在两个代理AI基准测试上的实验表明，AWO将LLM调用次数最多减少了11.9%，同时将任务成功率提高了4.2个百分点。

Summary / 总结

This work addresses the inefficiencies and operational costs associated with agentic workflows by introducing Agent Workflow Optimization (AWO). AWO identifies and optimizes redundant tool execution patterns, transforming them into meta-tools that bundle multiple agent actions into a single invocation. This approach reduces the number of LLM calls by up to 11.9% and increases the task success rate by up to 4.2 percentage points.

这项工作通过引入Agent Workflow Optimization (AWO)来解决agents工作流中的效率和鲁棒性问题。AWO识别并优化重复的工具执行模式，将其转化为包含多个agent动作的单一调用的meta-tools。这种方法最多可以减少11.9%的LLM调用次数，并将任务成功率提高4.2个百分点。
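
The trace-mining idea in this entry can be illustrated with a small sketch. It is not the authors' implementation: the trace format (one list of tool names per run), the helper names `find_meta_tool_candidates` and `make_meta_tool`, the frequency threshold, and the toy tools are all assumptions made for illustration. Plain n-gram counting stands in for whatever pattern discovery AWO actually uses, and the composite simply pipes each tool's output into the next.

```python
from collections import Counter


def find_meta_tool_candidates(traces, min_len=2, max_len=4, min_count=2):
    """Count contiguous tool-call sequences across traces (hypothetical helper)."""
    counts = Counter()
    for trace in traces:
        for n in range(min_len, max_len + 1):
            for i in range(len(trace) - n + 1):
                counts[tuple(trace[i:i + n])] += 1
    frequent = [(seq, c) for seq, c in counts.items() if c >= min_count]
    # Rank by the number of individual tool calls a bundle would cover.
    frequent.sort(key=lambda item: len(item[0]) * item[1], reverse=True)
    return frequent


def make_meta_tool(tools, sequence):
    """Bundle a fixed tool sequence into one deterministic callable.

    Assumes each tool takes the previous tool's output as its only input.
    """
    def meta_tool(value):
        for name in sequence:
            value = tools[name](value)
        return value
    return meta_tool


# Toy traces: each inner list is the ordered tool calls of one agent run.
traces = [
    ["search", "open_page", "extract", "summarize"],
    ["search", "open_page", "extract", "answer"],
    ["lookup", "search", "open_page", "extract"],
]
candidates = find_meta_tool_candidates(traces)
best_sequence, best_count = candidates[0]

# Toy tools that just wrap their input so the call order is visible.
tools = {
    "search": lambda v: f"search({v})",
    "open_page": lambda v: f"open_page({v})",
    "extract": lambda v: f"extract({v})",
}
meta = make_meta_tool(tools, best_sequence)
result = meta("q")
```

In this toy run the agent would issue one call to `meta` instead of three separate tool calls with LLM reasoning in between, which is the kind of saving the abstract describes.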

Thinking Out of Order: When Output Order Stops Reflecting Reasoning Order in Diffusion Language Models

Authors: Longxuan Yu, Yu Fu, Shaorong Zhang, Hui Liu, Mukund Varma T, Greg Ver Steeg, Yue Dong

First: 2026-01-29T17:40:58+00:00 · Latest: 2026-01-29T17:40:58+00:00

Comments: 18 pages, 13 figures, 5 tables

Abs · PDF · Code1 · Code2

Abstract

Autoregressive (AR) language models enforce a fixed left-to-right generation order, creating a fundamental limitation when the required output structure conflicts with natural reasoning (e.g., producing answers before explanations due to presentation or schema constraints). In such cases, AR models must commit to answers before generating intermediate reasoning, and this rigid constraint forces premature commitment. Masked diffusion language models (MDLMs), which iteratively refine all tokens in parallel, offer a way to decouple computation order from output structure. We validate this capability on GSM8K, Math500, and ReasonOrderQA, a benchmark we introduce with controlled difficulty and order-level evaluation. When prompts request answers before reasoning, AR models exhibit large accuracy gaps compared to standard chain-of-thought ordering (up to 67% relative drop), while MDLMs remain stable ($\leq$14% relative drop), a property we term "order robustness". Using ReasonOrderQA, we present evidence that MDLMs achieve order robustness by stabilizing simpler tokens (e.g., reasoning steps) earlier in the diffusion process than complex ones (e.g., final answers), enabling reasoning tokens to stabilize before answer commitment. Finally, we identify failure conditions where this advantage weakens, outlining the limits required for order robustness.

中文标题/摘要

标题：逆序思考：当输出顺序不再反映推理顺序时在扩散语言模型中的表现

自回归（AR）语言模型强制执行固定的从左到右生成顺序，当所需的输出结构与自然推理冲突时（例如，由于呈现或模式约束，先生成答案后解释），这种固定顺序成为基本限制。在这种情况下，AR模型必须在生成中间推理之前就做出答案的承诺，这种刚性约束迫使它们进行过早的承诺。掩码扩散语言模型（MDLMs），它们并行迭代细化所有标记，提供了一种将计算顺序与输出结构脱钩的方法。我们在GSM8K、Math500和ReasonOrderQA基准上验证了这一能力，这是一个我们引入的具有受控难度和顺序级评估的基准。当提示要求先给出答案后进行推理时，AR模型与标准思维链排序相比表现出巨大的准确度差距（最高达67%的相对下降），而MDLMs保持稳定（≤14%的相对下降），我们称这一特性为“顺序稳健性”。使用ReasonOrderQA，我们展示了MDLMs通过在扩散过程中更早地稳定更简单的标记（例如，推理步骤）而不是复杂的标记（例如，最终答案），使推理标记在答案承诺之前稳定，从而实现顺序稳健性。最后，我们确定了这种优势减弱的失败条件，概述了顺序稳健性所需的限制。

Summary / 总结

The paper explores the limitations of autoregressive (AR) language models when the required output structure conflicts with natural reasoning order. It examines masked diffusion language models (MDLMs), which iteratively refine all tokens in parallel and can therefore decouple computation order from output structure. Experiments on GSM8K, Math500, and the newly introduced ReasonOrderQA benchmark show that AR models suffer large accuracy drops (up to 67% relative) when answers are requested before reasoning, while MDLMs remain stable (at most 14% relative drop), a property termed "order robustness". The study presents evidence that MDLMs achieve this by stabilizing simpler tokens (e.g., reasoning steps) earlier in the diffusion process than complex ones (e.g., final answers), so reasoning tokens stabilize before answer commitment, though the advantage weakens under certain failure conditions.

该研究探讨了自回归（AR）语言模型在处理与自然推理顺序冲突的输出结构时的局限性。引入了掩码扩散语言模型（MDLMs），这些模型能够并行迭代地细化所有标记，从而解耦计算顺序与输出结构。实验表明，当要求在推理之前给出答案时，AR模型的准确性会显著下降（最高达67%的相对下降），而MDLMs则保持稳定（最高14%的相对下降），这一特性被称为“顺序稳健性”。研究显示，MDLMs通过在扩散过程中更早地稳定较简单的标记（如推理步骤），使得较复杂的标记（如最终答案）能够稳定，从而实现推理标记在答案承诺之前稳定，尽管这种优势在某些条件下会减弱。
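
The "relative drop" figures quoted in this entry can be read with the conventional definition sketched below. The abstract does not spell out the formula, so treating the chain-of-thought (reasoning-first) accuracy as the reference is an assumption, as are the function name and the example accuracies, which are illustrative and not results from the paper.

```python
def relative_drop(acc_reasoning_first, acc_answer_first):
    """Relative accuracy drop when the answer must precede the reasoning.

    Assumed definition: (reference - constrained) / reference.
    """
    if acc_reasoning_first <= 0:
        raise ValueError("reference accuracy must be positive")
    return (acc_reasoning_first - acc_answer_first) / acc_reasoning_first


# Illustrative numbers only: 80% accuracy with reasoning first,
# 60% when the prompt requests the answer first.
example = relative_drop(0.80, 0.60)  # roughly a 25% relative drop
```

Under this reading, a 67% relative drop means the answer-first accuracy is about one third of the reasoning-first accuracy, regardless of the absolute level.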

Drive-JEPA: Video JEPA Meets Multimodal Trajectory Distillation for End-to-End Driving

Authors: Linhan Wang, Zichong Yang, Chen Bai, Guoxiang Zhang, Xiaotong Liu, Xiaoyin Zheng, Xiao-Xiao Long, Chang-Tien Lu, Cheng Lu

First: 2026-01-29T17:39:20+00:00 · Latest: 2026-01-29T17:39:20+00:00

Abs · PDF · Code1 · Code2

Abstract

End-to-end autonomous driving increasingly leverages self-supervised video pretraining to learn transferable planning representations. However, pretraining video world models for scene understanding has so far brought only limited improvements. This limitation is compounded by the inherent ambiguity of driving: each scene typically provides only a single human trajectory, making it difficult to learn multimodal behaviors. In this work, we propose Drive-JEPA, a framework that integrates Video Joint-Embedding Predictive Architecture (V-JEPA) with multimodal trajectory distillation for end-to-end driving. First, we adapt V-JEPA for end-to-end driving, pretraining a ViT encoder on large-scale driving videos to produce predictive representations aligned with trajectory planning. Second, we introduce a proposal-centric planner that distills diverse simulator-generated trajectories alongside human trajectories, with a momentum-aware selection mechanism to promote stable and safe behavior. When evaluated on NAVSIM, the V-JEPA representation combined with a simple transformer-based decoder outperforms prior methods by 3 PDMS in the perception-free setting. The complete Drive-JEPA framework achieves 93.3 PDMS on v1 and 87.8 EPDMS on v2, setting a new state-of-the-art.

中文标题/摘要

标题：Drive-JEPA：视频JEPA结合多模态轨迹蒸馏实现端到端驾驶

端到端自动驾驶越来越多地利用自我监督的视频预训练来学习可转移的规划表示。然而，用于场景理解的视频世界模型的预训练到目前为止仅带来了有限的改进。这一限制进一步被驾驶固有的不确定性所加剧：每个场景通常只提供一条人类轨迹，使得学习多模态行为变得困难。在本文中，我们提出Drive-JEPA框架，该框架将视频联合嵌入预测架构（V-JEPA）与多模态轨迹蒸馏相结合，以实现端到端驾驶。首先，我们针对端到端驾驶对V-JEPA进行适应，在大规模驾驶视频上预训练一个ViT编码器，以生成与轨迹规划对齐的预测表示。其次，我们引入了一种以提案为中心的规划器，该规划器蒸馏来自模拟器生成的多种轨迹和人类轨迹，采用一种具有动量感知的选择机制来促进稳定和安全的行为。在NAVSIM上评估时，V-JEPA表示与简单的基于变换器的解码器结合使用，在无感知设置中比先前方法高出3个PDMS。完整的Drive-JEPA框架在v1上达到93.3个PDMS，在v2上达到87.8个EPDMS，创下了新的最佳水平。

Summary / 总结

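The research aims to improve end-to-end autonomous driving by combining self-supervised video pretraining with multimodal trajectory distillation. Drive-JEPA adapts V-JEPA to pretrain a ViT encoder on large-scale driving videos and pairs it with a proposal-centric planner that distills diverse simulator-generated trajectories alongside human trajectories. On NAVSIM, the V-JEPA representation with a simple transformer-based decoder outperforms prior methods by 3 PDMS in the perception-free setting, and the complete framework achieves new state-of-the-art results of 93.3 PDMS on v1 and 87.8 EPDMS on v2.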

研究旨在通过利用视频预训练和多模态轨迹蒸馏来提升端到端自动驾驶。Drive-JEPA 将 V-JEPA 与提案中心规划器结合，以蒸馏多样化的轨迹。该框架在感知自由的设置中比之前的方法高出 3 PDMS，并在 v1 和 v2 上分别达到 93.3 PDMS 和 87.8 EPDMS，取得了新的最先进成果。

The Ensemble Inverse Problem: Applications and Methods

Authors: Zhengyan Huan, Camila Pazos, Martin Klassen, Vincent Croft, Pierre-Hugues Beauchemin, Shuchin Aeron

First: 2026-01-29T17:34:41+00:00 · Latest: 2026-01-29T17:34:41+00:00

Comments: 26 pages, 11 figures, in peer review

Abs · PDF · Code1 · Code2 · Code3

Abstract

We introduce a new multivariate statistical problem that we refer to as the Ensemble Inverse Problem (EIP). The aim of EIP is to invert for an ensemble that is distributed according to the pushforward of a prior under a forward process. In high energy physics (HEP), this is related to a widely known problem called unfolding, which aims to reconstruct the true physics distribution of quantities, such as momentum and angle, from measurements that are distorted by detector effects. In recent applications, the EIP also arises in full waveform inversion (FWI) and inverse imaging with unknown priors. We propose non-iterative inference-time methods that construct posterior samplers based on a new class of conditional generative models, which we call ensemble inverse generative models. For the posterior modeling, these models additionally use the ensemble information contained in the observation set on top of single measurements. Unlike existing methods, our proposed methods avoid explicit and iterative use of the forward model at inference time via training across several sets of truth-observation pairs that are consistent with the same forward model, but originate from a wide range of priors. We demonstrate that this training procedure implicitly encodes the likelihood model. The use of ensemble information helps posterior inference and enables generalization to unseen priors. We benchmark the proposed method on several synthetic and real datasets in inverse imaging, HEP, and FWI. The codes are available at https://github.com/ZhengyanHuan/The-Ensemble-Inverse-Problem--Applications-and-Methods.

中文标题/摘要

标题：集合逆问题：应用与方法

我们引入了一个新的多元统计问题，称之为集合逆问题（EIP）。EIP 的目标是反演一个根据前向过程的先验推前得到的集合分布。在高能物理（HEP）中，这与一个广泛熟知的问题——解卷问题——相关，该问题旨在从被探测器效应扭曲的测量值中重构真实的物理分布，如动量和角度。在最近的应用中，EIP 也出现在全波形反演（FWI）和未知先验的逆成像中。我们提出了非迭代的推理时方法，基于一种新的条件生成模型类构建后验采样器，我们称之为集合逆生成模型。对于后验建模，这些模型还利用了观测集中包含的集合信息，而不仅仅是单个测量值。与现有方法不同，我们提出的方法在推理时不显式且迭代地使用前向模型，而是通过在与同一前向模型一致但来自广泛先验的多组真值-观测对上进行训练来实现。我们证明了这种训练过程隐式地编码了似然模型。利用集合信息有助于后验推断，并能泛化到未见过的先验。我们在逆成像、HEP 和 FWI 中的几个合成和真实数据集上对提出的方法进行了基准测试。代码可在 https://github.com/ZhengyanHuan/The-Ensemble-Inverse-Problem--Applications-and-Methods 获取。

Summary / 总结

The research introduces the Ensemble Inverse Problem (EIP), which aims to recover an ensemble of true quantities from observations produced by a forward process, as in unfolding detector-distorted measurements in high energy physics. It proposes a new class of conditional generative models, called ensemble inverse generative models, that use the ensemble information in the observation set in addition to single measurements. Unlike existing methods, these models avoid explicit and iterative use of the forward model at inference time by training on multiple sets of truth-observation pairs that share the same forward model but originate from a wide range of priors. The authors report that ensemble information helps posterior inference and enables generalization to unseen priors, with benchmarks on synthetic and real datasets in inverse imaging, high energy physics, and full waveform inversion.

论文介绍了用于从受探测器效应扭曲的测量值中重构集合的Ensemble Inverse Problem (EIP)。提出了一种非迭代的推理时方法，使用包含观测集中集合信息的Ensemble Inverse生成模型。这些方法通过跨多个真理-观测对集进行训练，避免了推理时显式使用前向模型。在逆成像、高能物理和全波形反演的合成和真实数据集上的实验表明，这种方法提高了后验推断并能泛化到未见过的先验。相关代码已公开。
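
The data setup described in this entry, several truth-observation sets that share one forward model but come from different priors, can be sketched as follows. This only illustrates the pushforward of a prior under a forward process; it does not implement the authors' generative models. The Gaussian smearing used as the forward process, the two Gaussian priors, and all names are assumptions chosen for illustration.

```python
import random


def forward_process(x, rng, noise_scale=0.5):
    """Toy 'detector': smear the true value with Gaussian noise (assumed model)."""
    return x + rng.gauss(0.0, noise_scale)


def sample_ensemble(prior_sampler, n, rng):
    """Draw n truths from a prior and push each through the forward process."""
    truths = [prior_sampler(rng) for _ in range(n)]
    observations = [forward_process(x, rng) for x in truths]
    return truths, observations


def mean(values):
    return sum(values) / len(values)


rng = random.Random(0)

# Two hypothetical priors; both ensembles share the same forward process.
truth_a, obs_a = sample_ensemble(lambda r: r.gauss(0.0, 1.0), 2000, rng)
truth_b, obs_b = sample_ensemble(lambda r: r.gauss(3.0, 1.0), 2000, rng)

# The observation sets differ because the priors differ, even though the
# distortion applied to each individual truth value is identical in kind.
mean_obs_a = mean(obs_a)
mean_obs_b = mean(obs_b)
```

A single observed value near 1.5 is ambiguous between the two priors here, while the whole observation set is not, which is one way to see why ensemble information can help posterior inference.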

From Logits to Latents: Contrastive Representation Shaping for LLM Unlearning

Authors: Haoran Tang, Rajiv Khanna

First: 2026-01-29T17:34:37+00:00 · Latest: 2026-01-29T17:34:37+00:00

Abs · PDF · Code1 · Code2

Abstract

Most LLM unlearning methods aim to approximate retrain-from-scratch behaviors with minimal distribution shift, often via alignment-style objectives defined in the prediction space. While effective at reducing forgotten content generation, such approaches may act as suppression: forgotten concepts can persist in representations and remain entangled with retained knowledge. We introduce CLReg, a contrastive representation regularizer that identifies forget features while pushing them away from retain features, explicitly reducing forget-retain interference with minimal shifts on retain features. We provide first theoretical insights that relate representation shaping to entanglement reduction. Across unlearning benchmarks and LLMs of different sizes, CLReg decreases forget-retain representation entanglement, which facilitates mainstream unlearning methods without posing extra privacy risks, inspiring future work that reshapes the representation space to remove forget concepts.

中文标题/摘要

标题：从逻辑到潜在特征：对比表示塑形在大模型去学习中的应用

大多数大模型去学习方法旨在通过最小的分布偏移来近似重新从头训练的行为，通常通过在预测空间中定义对齐式目标来实现。虽然这些方法在减少遗忘内容生成方面非常有效，但它们可能会起到抑制作用：遗忘的概念可能会在表示中持续存在，并与保留的知识纠缠在一起。我们引入了CLReg，这是一种对比表示正则化器，能够识别遗忘特征并将其远离保留特征，从而在最小化保留特征偏移的情况下显式地减少遗忘-保留干扰。我们提供了关于表示塑形与纠缠减少之间关系的初步理论见解。在不同规模的大模型去学习基准测试中，CLReg减少了遗忘-保留表示纠缠，从而促进了主流去学习方法的应用，而无需提出额外的隐私风险，启发了未来工作，通过重塑表示空间来移除遗忘概念。

CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty

Authors: Johannes Kirmayr, Lukas Stappen, Elisabeth André

First: 2026-01-29T17:33:42+00:00 · Latest: 2026-01-29T17:33:42+00:00

Abs · PDF · Code1 · Code2

Abstract

Existing benchmarks for Large Language Model (LLM) agents focus on task completion under idealistic settings but overlook reliability in real-world, user-facing applications. In domains such as in-car voice assistants, users often issue incomplete or ambiguous requests, creating intrinsic uncertainty that agents must manage through dialogue, tool use, and policy adherence. We introduce CAR-bench, a benchmark for evaluating consistency, uncertainty handling, and capability awareness in multi-turn, tool-using LLM agents in an in-car assistant domain. The environment features an LLM-simulated user, domain policies, and 58 interconnected tools spanning navigation, productivity, charging, and vehicle control. Beyond standard task completion, CAR-bench introduces Hallucination tasks that test agents' limit-awareness under missing tools or information, and Disambiguation tasks that require resolving uncertainty through clarification or internal information gathering. Baseline results reveal large gaps between occasional and consistent success on all task types. Even frontier reasoning LLMs achieve a consistent pass rate below 50% on Disambiguation tasks due to premature actions, and frequently violate policies or fabricate information to satisfy user requests in Hallucination tasks, underscoring the need for more reliable and self-aware LLM agents in real-world settings.

中文标题/摘要

标题：CAR-bench：评估大型语言模型代理在现实世界不确定性下的连贯性和极限意识

现有的大型语言模型（LLM）代理基准主要关注理想条件下的任务完成，而忽视了面向用户的实际应用中的可靠性。在诸如车载语音助手等领域，用户经常发出不完整或含糊不清的请求，这为代理带来了固有的不确定性，代理需要通过对话、工具使用和政策遵守来管理这种不确定性。我们引入了CAR-bench，这是一个评估车载助手领域中多轮次、使用工具的LLM代理在一致性、不确定性处理和能力意识方面的基准。环境包括一个LLM模拟的用户、领域政策和涵盖导航、生产力、充电和车辆控制的58个相互连接的工具。除了标准的任务完成，CAR-bench还引入了幻觉任务，测试代理在缺少工具或信息时的极限意识，以及消歧任务，要求通过澄清或内部信息收集来解决不确定性。基线结果表明，在所有任务类型上，偶尔成功和持续成功之间存在巨大差距。即使是前沿推理的LLM，在消歧任务中的一致通过率也低于50%，因为过早采取行动，而且经常违反政策或编造信息以满足用户请求，在幻觉任务中，这突显了在实际应用中需要更可靠和自我意识的LLM代理。

Summary / 总结

CAR-bench evaluates the consistency and limit-awareness of LLM agents in managing real-world uncertainties, particularly in an in-car assistant domain. It introduces Hallucination and Disambiguation tasks to test agents' ability to handle missing information and resolve ambiguities. Baseline results show significant gaps in consistent success across task types, with even advanced LLMs struggling with disambiguation tasks and frequently violating policies or fabricating information in hallucination tasks.

CAR-bench 评估了 LLM 代理在处理现实世界不确定性方面的连贯性和极限意识，特别是在车内助手领域。基准测试包括多轮对话、工具使用和政策遵守，其中包含如幻觉和澄清任务来测试代理管理不确定性及政策合规的能力。基线结果显示，在所有任务类型中的一致成功率存在显著差距，即使是先进的 LLM 代理在 Disambiguation 任务中的连贯通过率也低于 50%，主要是由于过早行动和政策违规。
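
The gap between occasional and consistent success mentioned in this entry can be made concrete with a small sketch. The benchmark's exact metric definitions are not given in the abstract, so the reading used here is an assumption: "occasional" counts a task as solved if at least one of its repeated trials passes, and "consistent" only if every trial passes. The function name and the trial outcomes are hypothetical.

```python
def occasional_and_consistent_rates(results):
    """Return (occasional, consistent) pass rates over repeated trials.

    results maps a task id to a non-empty list of boolean trial outcomes.
    """
    if not results or any(len(runs) == 0 for runs in results.values()):
        raise ValueError("every task needs at least one trial")
    n_tasks = len(results)
    occasional = sum(any(runs) for runs in results.values()) / n_tasks
    consistent = sum(all(runs) for runs in results.values()) / n_tasks
    return occasional, consistent


# Made-up outcomes for four tasks, three trials each.
trials = {
    "task_1": [True, True, True],     # always passes
    "task_2": [True, False, True],    # passes sometimes
    "task_3": [False, False, False],  # never passes
    "task_4": [False, True, False],   # passes sometimes
}
occasional, consistent = occasional_and_consistent_rates(trials)
```

Here three of four tasks pass at least once but only one passes every time, so an agent that looks capable on a best-of-k view can still be unreliable for a user who issues each request once.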

Hybrid Foveated Path Tracing with Peripheral Gaussians for Immersive Anatomy

Authors: Constantin Kleinbeck, Luisa Theelke, Hannah Schieber, Ulrich Eck, Rüdiger von Eisenhart-Rothe, Daniel Roth

First: 2026-01-29T17:33:14+00:00 · Latest: 2026-01-29T17:33:14+00:00

Comments: Scheduled for publication in the Proceedings of IEEE VR 2026

Abs · PDF · Code1 · Code2

Abstract

Volumetric medical imaging offers great potential for understanding complex pathologies. Yet, traditional 2D slices provide little support for interpreting spatial relationships, forcing users to mentally reconstruct anatomy into three dimensions. Direct volumetric path tracing and VR rendering can improve perception but are computationally expensive, while precomputed representations, like Gaussian Splatting, require planning ahead. Both approaches limit interactive use. We propose a hybrid rendering approach for high-quality, interactive, and immersive anatomical visualization. Our method combines streamed foveated path tracing with a lightweight Gaussian Splatting approximation of the periphery. The peripheral model generation is optimized with volume data and continuously refined using foveal renderings, enabling interactive updates. Depth-guided reprojection further improves robustness to latency and allows users to balance fidelity with refresh rate. We compare our method against direct path tracing and Gaussian Splatting. Our results highlight how their combination can preserve strengths in visual quality while re-generating the peripheral model in under a second, eliminating extensive preprocessing and approximations. This opens new options for interactive medical visualization.

中文标题/摘要

标题：混合焦斑路径追踪与外围高斯分布结合的沉浸式解剖学可视化

体视医学成像为理解复杂的病理学提供了巨大的潜力。然而，传统的二维切片在解释空间关系方面支持有限，迫使用户在三维中重建解剖结构。直接体视路径追踪和VR渲染可以改善感知，但计算成本高昂，而预计算表示，如高斯点积，需要提前规划。这两种方法都限制了交互式使用。我们提出了一种混合渲染方法，用于高质量、交互式和沉浸式的解剖学可视化。该方法结合了流式焦斑路径追踪和外围的轻量级高斯点积近似。外围模型的生成通过体数据优化，并通过焦斑渲染不断细化，从而实现交互式更新。深度导向的重新投影进一步提高了对延迟的鲁棒性，并允许用户在保真度与刷新率之间进行权衡。我们将我们的方法与直接路径追踪和高斯点积进行了比较。我们的结果表明，它们的结合可以保留视觉质量的优势，同时在不到一秒的时间内重新生成外围模型，从而消除大量的预处理和近似。这为交互式医学可视化提供了新的选择。

Summary / 总结

The paper proposes a hybrid rendering method that combines streamed foveated path tracing with a lightweight Gaussian Splatting approximation of the periphery for high-quality, interactive, and immersive anatomical visualization. The peripheral model is optimized with volume data and continuously refined using foveal renderings, enabling interactive updates, while depth-guided reprojection improves robustness to latency. Compared against direct path tracing and Gaussian Splatting on their own, the combination preserves their strengths in visual quality while regenerating the peripheral model in under a second, eliminating extensive preprocessing and approximations.

论文提出了一种混合焦域路径追踪方法，结合了流式焦域路径追踪和边缘区域的轻量级高斯斑点近似，实现了高质量、交互式和沉浸式的解剖可视化。边缘模型通过体积数据进行优化，并通过焦域渲染不断细化，实现交互式更新。该方法在保持视觉质量的同时，能在不到一秒的时间内重新生成边缘模型，避免了大量预处理和近似，为交互式医学可视化提供了新的选择。
