UFO-4D: Unposed Feedforward 4D Reconstruction from Two Images
Authors: Junhwa Hur, Charles Herrmann, Songyou Peng, Philipp Henzler, Zeyu Ma, Todd Zickler, Deqing Sun
Venue: ICLR 2026
First: 2026-02-27T18:59:54+00:00 · Latest: 2026-02-27T18:59:54+00:00
Comments: ICLR 2026, Project page: https://ufo-4d.github.io/
Abstract
Dense 4D reconstruction from unposed images remains a critical challenge, with current methods relying on slow test-time optimization or fragmented, task-specific feedforward models. We introduce UFO-4D, a unified feedforward framework to reconstruct a dense, explicit 4D representation from just a pair of unposed images. UFO-4D directly estimates dynamic 3D Gaussian Splats, enabling the joint and consistent estimation of 3D geometry, 3D motion, and camera pose in a feedforward manner. Our core insight is that differentiably rendering multiple signals from a single Dynamic 3D Gaussian representation offers major training advantages. This approach enables a self-supervised image synthesis loss while tightly coupling appearance, depth, and motion. Since all modalities share the same geometric primitives, supervising one inherently regularizes and improves the others. This synergy overcomes data scarcity, allowing UFO-4D to outperform prior work by up to 3 times in joint geometry, motion, and camera pose estimation. Our representation also enables high-fidelity 4D interpolation across novel views and time. Please visit our project page for visual results: https://ufo-4d.github.io/
中文标题/摘要
标题:UFO-4D:仅从两张未摆姿势的照片进行无约束的四维重建
从未摆姿势的照片进行密集四维重建仍然是一个关键挑战,当前方法依赖于缓慢的测试时优化或碎片化的、特定任务的前馈模型。我们引入了UFO-4D,这是一种统一的前馈框架,可以从仅一对未摆姿势的照片中重建密集的、显式的四维表示。UFO-4D直接估计动态3D高斯斑点,使其能够以前馈方式联合和一致地估计3D几何、3D运动和相机姿态。我们的核心见解是,从单一动态3D高斯表示中可微渲染多个信号提供了主要的训练优势。这种方法使我们能够实现自我监督的图像合成损失,同时紧密耦合外观、深度和运动。由于所有模态共享相同的几何原语,监督一个模态会自然地正则化和提高其他模态。这种协同作用克服了数据稀缺性,使UFO-4D在联合几何、运动和相机姿态估计方面比先前的工作高出3倍。我们的表示还使我们能够对新视角和时间进行高保真四维插值。请访问我们的项目页面以获取可视化结果:https://ufo-4d.github.io/
Summary / 总结
UFO-4D tackles dense 4D reconstruction from a pair of unposed images with a unified feedforward framework. It directly estimates dynamic 3D Gaussian Splats, enabling joint and consistent estimation of 3D geometry, 3D motion, and camera pose. Differentiably rendering multiple signals from a single dynamic 3D Gaussian representation enables a self-supervised image synthesis loss and tightly couples appearance, depth, and motion, so supervising one modality regularizes the others. As a result, UFO-4D outperforms prior work by up to 3 times in joint geometry, motion, and camera pose estimation and supports high-fidelity 4D interpolation across novel views and time.
UFO-4D通过引入统一的前馈框架解决了从未摆姿势的图像进行密集4D重建的挑战。它直接估计动态3D高斯点,能够同时一致地估计3D几何、运动和相机姿态。该方法通过从单一的动态3D高斯表示中不同地渲染多个信号,允许自我监督的图像合成损失,并紧密耦合外观、深度和运动。因此,UFO-4D在几何、运动和相机姿态联合估计方面比之前的方法高出三倍,并支持高保真的4D插值。
Mode Seeking meets Mean Seeking for Fast Long Video Generation
Authors: Shengqu Cai, Weili Nie, Chao Liu, Julius Berner, Lvmin Zhang, Nanye Ma, Hansheng Chen, Maneesh Agrawala, Leonidas Guibas, Gordon Wetzstein, Arash Vahdat
First: 2026-02-27T18:59:02+00:00 · Latest: 2026-02-27T18:59:02+00:00
Comments: Project website: https://primecai.github.io/mmm/
Abstract
Scaling video generation from seconds to minutes faces a critical bottleneck: while short-video data is abundant and high-fidelity, coherent long-form data is scarce and limited to narrow domains. To address this, we propose a training paradigm where Mode Seeking meets Mean Seeking, decoupling local fidelity from long-term coherence based on a unified representation via a Decoupled Diffusion Transformer. Our approach utilizes a global Flow Matching head trained via supervised learning on long videos to capture narrative structure, while simultaneously employing a local Distribution Matching head that aligns sliding windows to a frozen short-video teacher via a mode-seeking reverse-KL divergence. This strategy enables the synthesis of minute-scale videos that learns long-range coherence and motions from limited long videos via supervised flow matching, while inheriting local realism by aligning every sliding-window segment of the student to a frozen short-video teacher, resulting in a few-step fast long video generator. Evaluations show that our method effectively closes the fidelity-horizon gap by jointly improving local sharpness, motion and long-range consistency. Project website: https://primecai.github.io/mmm/.
中文标题/摘要
标题:模式搜索与均值搜索结合实现快速长视频生成
将视频生成从秒级扩展到分钟级面临一个关键瓶颈:虽然短视频数据丰富且高保真,但连贯的长视频数据稀缺且局限于狭窄领域。为解决这一问题,我们提出了一种训练范式,即模式搜索与均值搜索相结合,基于统一表示通过解耦扩散变换器来解耦局部保真度与长期一致性。我们的方法利用一个通过监督学习在长视频上训练的全局流匹配头部来捕捉叙事结构,同时采用一个局部分布匹配头部,通过模式搜索逆KL散度将滑动窗口与冻结的短视频教师对齐。该策略通过监督流匹配学习长距离一致性和运动,同时通过将学生每个滑动窗口段与冻结的短视频教师对齐继承局部现实性,从而实现快速长视频生成器。评估表明,我们的方法通过同时提高局部清晰度、运动和长距离一致性有效地缩小了保真度-时间差距。
Summary / 总结
The paper addresses minute-scale video generation when short-video data is abundant but coherent long-form data is scarce. It proposes a training paradigm that combines mode seeking and mean seeking in a Decoupled Diffusion Transformer. A global Flow Matching head, trained with supervised learning on long videos, captures narrative structure, while a local Distribution Matching head aligns sliding windows to a frozen short-video teacher via a mode-seeking reverse-KL divergence. The result is a few-step generator that synthesizes minute-scale videos with jointly improved local sharpness, motion, and long-range consistency.
研究旨在通过解决长视频稀缺问题,生成从秒到分钟的长视频。提出的Mode Seeking meets Mean Seeking方法使用Decoupled Diffusion Transformer,其中全局Flow Matching头用于捕捉叙事结构,局部Distribution Matching头用于将滑动窗口与冻结的短视频教师对齐。这种方法能够合成具有改进的局部清晰度、运动和长程一致性的分钟级视频,有效缩小了保真度-时间差距。
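The "mode seeking" versus "mean seeking" distinction in the abstract above can be illustrated with a toy 1D example. The sketch below is not the paper's training code; it assumes a bimodal target density `p` and two hand-picked Gaussian candidates (all names and parameters are illustrative), and compares the forward KL divergence KL(p || q) with the reverse KL divergence KL(q || p) on a grid.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a 1D Gaussian evaluated on the grid x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def kl(a, b, dx):
    """Riemann-sum estimate of KL(a || b) for densities sampled on a grid."""
    mask = a > 0
    return float(np.sum(a[mask] * (np.log(a[mask]) - np.log(b[mask])) * dx))

x = np.linspace(-12.0, 12.0, 4801)
dx = x[1] - x[0]

# Illustrative bimodal target with modes at -3 and +3 (an assumption).
p = 0.5 * gaussian_pdf(x, -3.0, 0.5) + 0.5 * gaussian_pdf(x, 3.0, 0.5)

# Candidate 1 sits on a single mode; candidate 2 matches the target's
# overall mean and variance and therefore spreads mass between the modes.
q_narrow = gaussian_pdf(x, 3.0, 0.5)
q_broad = gaussian_pdf(x, 0.0, np.sqrt(9.25))

fwd_narrow, fwd_broad = kl(p, q_narrow, dx), kl(p, q_broad, dx)
rev_narrow, rev_broad = kl(q_narrow, p, dx), kl(q_broad, p, dx)

print(f"forward KL: narrow={fwd_narrow:.3f}  broad={fwd_broad:.3f}")
print(f"reverse KL: narrow={rev_narrow:.3f}  broad={rev_broad:.3f}")
```

In this toy setup the forward KL favors the broad candidate that covers both modes (mean seeking), while the reverse KL favors the narrow candidate that commits to one mode (mode seeking). The broad candidate places mass between the modes, where the target has almost none.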
DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science
Authors: Fan Shu, Yite Wang, Ruofan Wu, Boyi Liu, Zhewei Yao, Yuxiong He, Feng Yan
Venue: ICLR 2026
First: 2026-02-27T18:58:57+00:00 · Latest: 2026-02-27T18:58:57+00:00
Comments: Published as a conference paper at ICLR 2026. 10 pages plus appendix
Abstract
The fast-growing demands in using Large Language Models (LLMs) to tackle complex multi-step data science tasks create an emergent need for accurate benchmarking. There are two major gaps in existing benchmarks: (i) the lack of standardized, process-aware evaluation that captures instruction adherence and process fidelity, and (ii) the scarcity of accurately labeled training data. To bridge these gaps, we introduce DARE-bench, a benchmark designed for machine learning modeling and data science instruction following. Unlike many existing benchmarks that rely on human- or model-based judges, all tasks in DARE-bench have verifiable ground truth, ensuring objective and reproducible evaluation. To cover a broad range of tasks and support agentic tools, DARE-bench consists of 6,300 Kaggle-derived tasks and provides both large-scale training data and evaluation sets. Extensive evaluations show that even highly capable models such as gpt-o4-mini struggle to achieve good performance, especially in machine learning modeling tasks. Using DARE-bench training tasks for fine-tuning can substantially improve model performance. For example, supervised fine-tuning boosts Qwen3-32B's accuracy by 1.83x and reinforcement learning boosts Qwen3-4B's accuracy by more than 8x. These significant improvements verify the importance of DARE-bench both as an accurate evaluation benchmark and critical training data.
中文标题/摘要
标题:DARE-bench:评估大语言模型在数据科学中的建模和指令忠实度
随着对使用大型语言模型(LLMs)解决复杂多步骤数据科学任务需求的快速增长,准确的基准测试变得迫切需要。现有基准测试存在两个主要差距:(i)缺乏标准化、过程意识的评估,无法捕捉指令遵守和过程忠实度,以及(ii)缺乏准确标注的训练数据。为弥补这些差距,我们提出了DARE-bench,这是一个旨在评估机器学习建模和数据科学指令的基准测试。与依赖于人类或模型评判的许多现有基准测试不同,DARE-bench 中的所有任务都有可验证的地面真相,确保了客观和可重复的评估。为了涵盖广泛的任务并支持自主工具,DARE-bench 包含了 6,300 个 Kaggle 派生任务,并提供了大规模的训练数据和评估集。广泛的评估表明,即使是能力很强的模型如 gpt-o4-mini 也难以取得良好的性能,尤其是在机器学习建模任务中。使用 DARE-bench 的训练任务进行微调可以显著提高模型性能。例如,监督微调将 Qwen3-32B 的准确率提高了 1.83 倍,强化学习将 Qwen3-4B 的准确率提高了超过 8 倍。这些显著的改进验证了 DARE-bench 作为准确评估基准和关键训练数据的重要性。
Summary / 总结
DARE-bench evaluates the modeling and instruction fidelity of LLMs in data science, addressing two gaps: the lack of standardized, process-aware evaluation and the scarcity of accurately labeled training data. It consists of 6,300 Kaggle-derived tasks with verifiable ground truth, which makes evaluation objective and reproducible. Evaluations show that even highly capable models struggle, especially on machine learning modeling tasks, and that fine-tuning on DARE-bench training tasks substantially improves performance: supervised fine-tuning boosts Qwen3-32B's accuracy by 1.83x, and reinforcement learning boosts Qwen3-4B's accuracy by more than 8x.
DARE-bench 是一个用于评估大型语言模型在数据科学中的建模和指令准确性的基准,解决了标准化和过程感知评估的缺乏。它包含6,300个来自Kaggle的任务,具有可验证的地面真相,确保了客观评估。广泛的评估显示,即使是高度有能力的模型在机器学习建模任务中也难以取得好成绩,但使用DARE-bench的训练任务进行微调可以显著提高模型性能,例如监督微调将Qwen3-32B的准确性提高了1.83倍,强化学习将Qwen3-4B的准确性提高了超过8倍。
CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation
Authors: Weinan Dai, Hanlin Wu, Qiying Yu, Huan-ang Gao, Jiahao Li, Chengquan Jiang, Weiqiang Lou, Yufan Song, Hongli Yu, Jiaze Chen, Wei-Ying Ma, Ya-Qin Zhang, Jingjing Liu, Mingxuan Wang, Xin Liu, Hao Zhou
First: 2026-02-27T18:58:05+00:00 · Latest: 2026-02-27T18:58:05+00:00
Abstract
GPU kernel optimization is fundamental to modern deep learning but remains a highly specialized task requiring deep hardware expertise. Despite strong performance in general programming, large language models (LLMs) remain uncompetitive with compiler-based systems such as torch.compile for CUDA kernel generation. Existing CUDA code generation approaches either rely on training-free refinement or fine-tune models within fixed multi-turn execution-feedback loops, but both paradigms fail to fundamentally improve the model's intrinsic CUDA optimization ability, resulting in limited performance gains. We present CUDA Agent, a large-scale agentic reinforcement learning system that develops CUDA kernel expertise through three components: a scalable data synthesis pipeline, a skill-augmented CUDA development environment with automated verification and profiling to provide reliable reward signals, and reinforcement learning algorithmic techniques enabling stable training. CUDA Agent achieves state-of-the-art results on KernelBench, delivering 100%, 100%, and 92% faster rate over torch.compile on KernelBench Level-1, Level-2, and Level-3 splits, outperforming the strongest proprietary models such as Claude Opus 4.5 and Gemini 3 Pro by about 40% on the hardest Level-3 setting.
中文标题/摘要
标题:CUDA代理:大规模代理型强化学习在高性能CUDA内核生成中的应用
GPU内核优化是现代深度学习的基础,但仍然是一个高度专业化且需要深厚硬件知识的任务。尽管在通用编程方面表现出色,大型语言模型(LLMs)在CUDA内核生成方面仍无法与基于编译器的系统(如torch.compile)竞争。现有的CUDA代码生成方法要么依赖于无训练的细化,要么在固定多轮执行反馈循环中微调模型,但这两种范式都无法从根本上提高模型的CUDA优化能力,导致性能提升有限。我们提出了CUDA代理,这是一种通过三个组件开发CUDA内核专业知识的大规模代理型强化学习系统:可扩展的数据合成管道、具有自动化验证和分析的技能增强CUDA开发环境,以提供可靠的奖励信号,以及强化学习算法技术以实现稳定的训练。CUDA代理在KernelBench上取得了最先进的结果,分别在KernelBench Level-1、Level-2和Level-3分割上比torch.compile快100%、100%和92%,在最难的Level-3设置上比最强的专有模型Claude Opus 4.5和Gemini 3 Pro高出约40%。
Summary / 总结
The research aims to improve LLM-based CUDA kernel generation through large-scale agentic reinforcement learning. The system combines a scalable data synthesis pipeline, a skill-augmented CUDA development environment with automated verification and profiling, and reinforcement learning techniques for stable training. On KernelBench, CUDA Agent reports faster rates over torch.compile of 100%, 100%, and 92% on the Level-1, Level-2, and Level-3 splits, and outperforms proprietary models such as Claude Opus 4.5 and Gemini 3 Pro by about 40% on the hardest Level-3 setting.
研究旨在通过强化学习优化GPU内核以提升深度学习性能。方法包括可扩展的数据合成管道、具有自动化验证和分析的技能增强CUDA开发环境,以及用于稳定训练的强化学习技术。主要发现表明,CUDA Agent 在KernelBench Level-3 上比torch.compile 和Claude Opus 4.5、Gemini 3 Pro等专有模型快高达92%。
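The "faster rate over torch.compile" figures above are percentages of tasks, not speedup factors. The sketch below shows one plausible way to compute such a rate. It assumes that "faster rate" means the share of benchmark tasks on which the generated kernel is correct and runs faster than the baseline. The function name, the `None` convention for failed kernels, and the timings are all hypothetical and not taken from the paper or from KernelBench.

```python
def faster_rate(candidate_ms, baseline_ms):
    """Fraction of tasks where the candidate kernel beats the baseline.

    candidate_ms: per-task runtimes in milliseconds, or None when the
                  generated kernel failed verification (assumed convention).
    baseline_ms:  per-task runtimes of the baseline (e.g. a compiled model).
    """
    if len(candidate_ms) != len(baseline_ms) or not baseline_ms:
        raise ValueError("need two non-empty lists of equal length")
    wins = sum(
        1
        for cand, base in zip(candidate_ms, baseline_ms)
        if cand is not None and cand < base
    )
    return wins / len(baseline_ms)

# Made-up timings for four tasks; the third candidate kernel failed.
rate = faster_rate([1.0, 2.0, None, 0.5], [1.5, 1.0, 1.0, 1.0])
print(f"faster rate: {rate:.0%}")
```

Under this reading, a 92% faster rate on Level-3 means the generated kernel wins on 92% of Level-3 tasks. It says nothing about how much faster each winning kernel is.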
Who Guards the Guardians? The Challenges of Evaluating Identifiability of Learned Representations
Authors: Shruti Joshi, Théo Saulus, Wieland Brendel, Philippe Brouillard, Dhanya Sridhar, Patrik Reizinger
First: 2026-02-27T18:50:13+00:00 · Latest: 2026-02-27T18:50:13+00:00
Abstract
Identifiability in representation learning is commonly evaluated using standard metrics (e.g., MCC, DCI, R^2) on synthetic benchmarks with known ground-truth factors. These metrics are assumed to reflect recovery up to the equivalence class guaranteed by identifiability theory. We show that this assumption holds only under specific structural conditions: each metric implicitly encodes assumptions about both the data-generating process (DGP) and the encoder. When these assumptions are violated, metrics become misspecified and can produce systematic false positives and false negatives. Such failures occur both within classical identifiability regimes and in post-hoc settings where identifiability is most needed. We introduce a taxonomy separating DGP assumptions from encoder geometry, use it to characterise the validity domains of existing metrics, and release an evaluation suite for reproducible stress testing and comparison.
中文标题/摘要
标题:谁来监督守护者?学习表示可识别性评估的挑战
在表示学习中,可识别性通常使用标准指标(例如MCC、DCI、R²)在具有已知真实因素的合成基准上进行评估。这些指标被认为反映了由可识别性理论保证的等价类的恢复情况。我们表明,在特定结构条件下,这种假设才成立:每个指标隐含地包含了关于数据生成过程(DGP)和编码器的假设。当这些假设被违反时,指标会变得不适当,并可能导致系统性的假阳性与假阴性。这种失败不仅发生在经典的可识别性范围内,而且发生在最需要可识别性评估的后验设置中。我们引入了一种分类法,将DGP假设与编码器几何形状区分开来,使用它来描述现有指标的有效性范围,并发布了一套评估套件,用于可重复的压力测试和比较。
Summary / 总结
The paper examines how the identifiability of learned representations is evaluated with standard metrics such as MCC, DCI, and R^2 on synthetic benchmarks with known ground-truth factors. The authors show that these metrics reflect identifiability only under specific structural conditions, because each one encodes assumptions about both the data-generating process (DGP) and the encoder. When these assumptions are violated, the metrics can produce systematic false positives and false negatives, both within classical identifiability regimes and in post-hoc settings. The study introduces a taxonomy that separates DGP assumptions from encoder geometry, characterizes the validity domains of existing metrics, and releases an evaluation suite for reproducible stress testing and comparison.
该论文探讨了使用MCC、DCI和R^2等标准指标评估学习表示的可识别性所面临的挑战。这些指标通常应用于具有已知真实因素的合成基准,假设它们反映了识别理论保证的等价类的恢复。然而,论文表明,在特定结构条件下,这些指标可能会变得不适用,导致系统性的假阳性与假阴性。作者引入了一种分类法,将数据生成过程假设与编码器几何结构分开,描述了现有指标的有效性范围,并提供了一套可重复的压力测试和比较评估套件。
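To make the point about implicit metric assumptions concrete, here is a small sketch of the mean correlation coefficient (MCC) as it is commonly computed: match each ground-truth factor to one learned dimension and average the absolute correlations. This is an illustration under stated assumptions, not code from the paper's evaluation suite. The brute-force matching over permutations is chosen for clarity and only suits small latent dimensions, and the data is synthetic.

```python
import itertools
import numpy as np

def mcc(z_true, z_hat):
    """Mean absolute Pearson correlation under the best one-to-one matching.

    Both inputs have shape (n_samples, d). Brute-force matching is O(d!).
    """
    d = z_true.shape[1]
    # Cross-correlation block between true factors (rows) and learned dims (cols).
    corr = np.abs(np.corrcoef(z_true.T, z_hat.T)[:d, d:])
    best = max(
        sum(corr[i, perm[i]] for i in range(d))
        for perm in itertools.permutations(range(d))
    )
    return float(best / d)

rng = np.random.default_rng(0)
z = rng.standard_normal((5000, 2))

# Recovery up to permutation and per-dimension scaling: MCC stays near 1.
z_perm_scaled = z[:, ::-1] * np.array([2.0, -0.5])

# Recovery up to a 45-degree rotation: linearly equivalent, but each learned
# dimension now mixes both factors.
theta = np.pi / 4
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta), np.cos(theta)]])
z_rotated = z @ rotation.T

mcc_perm = mcc(z, z_perm_scaled)
mcc_rot = mcc(z, z_rotated)
print(f"permutation + scaling: {mcc_perm:.3f}")
print(f"rotation:              {mcc_rot:.3f}")
```

The rotated representation keeps all linear information about the factors, yet MCC drops well below 1. If the theory only guarantees recovery up to a linear map, a low MCC here would be a false negative. That is the kind of mismatch between metric and equivalence class the abstract describes.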
Resources for Automated Evaluation of Assistive RAG Systems that Help Readers with News Trustworthiness Assessment
Authors: Dake Zhang, Mark D. Smucker, Charles L. A. Clarke
First: 2026-02-27T18:49:31+00:00 · Latest: 2026-02-27T18:49:31+00:00
Abstract
Many readers today struggle to assess the trustworthiness of online news because reliable reporting coexists with misinformation. The TREC 2025 DRAGUN (Detection, Retrieval, and Augmented Generation for Understanding News) Track provided a venue for researchers to develop and evaluate assistive RAG systems that support readers' news trustworthiness assessment by producing reader-oriented, well-attributed reports. As the organizers of the DRAGUN track, we describe the resources that we have newly developed to allow for the reuse of the track's tasks. The track had two tasks: (Task 1) Question Generation, producing 10 ranked investigative questions; and (Task 2, the main task) Report Generation, producing a 250-word report grounded in the MS MARCO V2.1 Segmented Corpus. As part of the track's evaluation, we had TREC assessors create importance-weighted rubrics of questions with expected short answers for 30 different news articles. These rubrics represent the information that assessors believe is important for readers to assess an article's trustworthiness. The assessors then used their rubrics to manually judge the participating teams' submitted runs. To make these tasks and their rubrics reusable, we have created an automated process to judge runs not part of the original assessing. We show that our AutoJudge ranks existing runs well compared to the TREC human-assessed evaluation (Kendall's τ = 0.678 for Task 1 and τ = 0.872 for Task 2). These resources enable both the evaluation of RAG systems for assistive news trustworthiness assessment and, with the human evaluation as a benchmark, research on improving automated RAG evaluation.
中文标题/摘要
标题:辅助评估助读RAG系统资源,以帮助读者评估新闻可信度
如今,许多读者难以评估在线新闻的可信度,因为可靠报道与虚假信息共存。TREC 2025 DRAGUN(检测、检索和增强生成以理解新闻)赛道为研究人员提供了一个平台,以开发和评估支持读者新闻可信度评估的辅助RAG系统,通过生成面向读者、有良好归属的报告。作为DRAGUN赛道的组织者,我们描述了新开发的资源,以便重新使用赛道的任务。赛道有两个任务:(任务1)问题生成,生成10个排名的调查性问题;(任务2,主要任务)报告生成,生成基于MS MARCO V2.1分段语料库的250字报告。作为赛道评估的一部分,我们让TREC评估员为30篇不同新闻文章创建了加权问题清单,其中包含预期的简短答案。这些清单代表了评估员认为读者评估文章可信度时需要的重要信息。评估员随后使用他们的清单手动评估参赛团队提交的运行结果。为了使这些任务及其清单可重用,我们创建了一个自动评估过程,用于评估不属于原始评估的运行结果。我们展示了我们的AutoJudge在任务1(Kendall's τ= 0.678)和任务2(Kendall's τ= 0.872)中对现有运行结果的评估与TREC的人工评估相比表现良好。这些资源不仅使RAG系统评估成为可能,还为改进自动RAG评估提供了基准。
Summary / 总结
This paper describes resources developed for evaluating assistive RAG systems that help readers assess news trustworthiness. The TREC 2025 DRAGUN track included two tasks: question generation and report generation. Assessors created rubrics for 30 news articles to guide evaluation, and an automated system (AutoJudge) was developed to rank runs. AutoJudge showed strong correlation with human assessments (Kendall's τ=0.678 for Task 1 and τ=0.872 for Task 2).
本文描述了为评估辅助RAG系统而开发的资源,这些系统帮助读者评估新闻可信度。TREC 2025 DRAGUN赛道包括两个任务:生成调查性问题和基于特定语料库生成报告。评估者创建了评估新闻文章可信度的评分表,并开发了一个自动系统(AutoJudge)来根据这些评分表对提交的运行进行排名。AutoJudge与人工评估结果的相关性很强(Task 1的Kendall's τ=0.678,Task 2的τ=0.872),这使得这些任务可以被重新用于进一步研究RAG系统和自动评估方法。
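The agreement figures above are Kendall's τ values between the system ranking induced by AutoJudge and the one induced by human assessment. As a minimal sketch, the function below computes the tau-a variant from per-run scores. The abstract does not say which variant the authors use, and the scores here are invented for illustration.

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a: (concordant - discordant) / total pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    n = len(scores_a)
    if n != len(scores_b) or n < 2:
        raise ValueError("need two score lists of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical scores for five runs: the two judges disagree only on the
# relative order of the second and third runs.
human_scores = [0.90, 0.80, 0.70, 0.60, 0.50]
auto_scores = [0.85, 0.60, 0.75, 0.55, 0.40]

tau = kendall_tau(human_scores, auto_scores)
print(f"Kendall's tau = {tau:.3f}")
```

A value of 1 means the two judges order every pair of runs identically, and 0 means they agree no more often than they disagree. A τ of 0.872 therefore means most, but not all, pairs of runs are ordered the same way.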
Hierarchical Action Learning for Weakly-Supervised Action Segmentation
Authors: Junxian Huang, Ruichu Cai, Hao Zhu, Juntao Fang, Boyan Xu, Weilin Chen, Zijian Li, Shenghua Gao
Venue: CVPR 2026
First: 2026-02-27T18:48:22+00:00 · Latest: 2026-02-27T18:48:22+00:00
Abstract
Humans perceive actions through key transitions that structure actions across multiple abstraction levels, whereas machines, relying on visual features, tend to over-segment. This highlights the difficulty of enabling hierarchical reasoning in video understanding. Interestingly, we observe that lower-level visual and high-level action latent variables evolve at different rates, with low-level visual variables changing rapidly, while high-level action variables evolve more slowly, making them easier to identify. Building on this insight, we propose the Hierarchical Action Learning (HAL) model for weakly-supervised action segmentation. Our approach introduces a hierarchical causal data generation process, where high-level latent action governs the dynamics of low-level visual features. To model these varying timescales effectively, we introduce deterministic processes to align these latent variables over time. The HAL model employs a hierarchical pyramid transformer to capture both visual features and latent variables, and a sparse transition constraint is applied to enforce the slower dynamics of high-level action variables. This mechanism enhances the identification of these latent variables over time. Under mild assumptions, we prove that these latent action variables are strictly identifiable. Experimental results on several benchmarks show that the HAL model significantly outperforms existing methods for weakly-supervised action segmentation, confirming its practical effectiveness in real-world applications.
中文标题/摘要
标题:层次化行动学习在弱监督行动分割中的应用
人类通过关键转换来感知行动,这些转换在多个抽象层次上结构化行动,而机器依赖于视觉特征,往往会过度分割。这突显了在视频理解中实现层次化推理的难度。有趣的是,我们观察到低级视觉和高级行动潜在变量以不同的速率演变,低级视觉变量变化迅速,而高级行动变量演变较慢,使其更容易识别。基于这一洞察,我们提出了层次化行动学习(HAL)模型,用于弱监督行动分割。我们的方法引入了一个层次化的因果数据生成过程,其中高级潜在行动控制低级视觉特征的动力学。为了有效建模这些不同的时间尺度,我们引入了确定性过程来对齐这些潜在变量。HAL模型使用层次化金字塔变换器来捕获视觉特征和潜在变量,并应用稀疏转换约束来强制执行高级行动变量的较慢动力学。这种机制增强了这些潜在变量随时间的识别。在温和的假设下,我们证明了这些潜在行动变量是严格可识别的。在几个基准上的实验结果表明,HAL模型在弱监督行动分割中显著优于现有方法,证实了其在实际应用中的有效性。
Summary / 总结
The research aims to address the challenge of hierarchical reasoning in video understanding by leveraging the different evolution rates of low-level visual and high-level action variables. The Hierarchical Action Learning (HAL) model is proposed, which introduces a hierarchical causal data generation process and uses a hierarchical pyramid transformer to capture both visual features and latent variables. The model also applies a sparse transition constraint to enforce the slower dynamics of high-level action variables. Experiments on benchmarks demonstrate that HAL significantly outperforms existing methods for weakly-supervised action segmentation.
研究旨在通过利用低级视觉和高级动作变量变化率的不同来解决视频理解中的层次推理问题。提出的Hierarchical Action Learning (HAL)模型引入了层次因果数据生成过程,并使用层次金字塔变换器来捕捉视觉特征和潜在变量。该模型还应用稀疏转换约束来强制执行高级动作变量的较慢动态,从而在弱监督动作分割方面显著优于现有方法。
A Minimal Agent for Automated Theorem Proving
Authors: Borja Requena Pozo, Austin Letson, Krystian Nowakowski, Izan Beltran Ferreiro, Leopoldo Sarra
First: 2026-02-27T18:43:47+00:00 · Latest: 2026-02-27T18:43:47+00:00
Abstract
We propose a minimal agentic baseline that enables systematic comparison across different AI-based theorem prover architectures. This design implements the core features shared among state-of-the-art systems: iterative proof refinement, library search and context management. We evaluate our baseline using qualitatively different benchmarks and compare various popular models and design choices, and demonstrate competitive performance compared to state-of-the-art approaches, while using a significantly simpler architecture. Our results demonstrate consistent advantages of an iterative approach over multiple single-shot generations, especially in terms of sample efficiency and cost effectiveness. The implementation is released open-source as a candidate reference for future research and as an accessible prover for the community.
中文标题/摘要
标题:一种用于自动定理证明的最小代理
我们提出了一种最小代理基线,以系统地比较不同基于AI的定理证明器架构。该设计实现了当前最先进的系统中共享的核心功能:迭代证明细化、库搜索和上下文管理。我们使用不同类型的基准评估了我们的基线,并比较了各种流行的模型和设计选择,展示了与最先进的方法相比具有竞争力的性能,同时使用了显著更简单的架构。我们的结果表明,迭代方法在样本效率和成本效益方面相对于单次生成具有一致的优势。该实现已开源,作为未来研究的候选参考以及社区的可访问证明器。
Summary / 总结
The paper introduces a minimal agentic baseline for automated theorem proving that enables systematic comparison across AI-based theorem prover architectures. It implements the core features shared among state-of-the-art systems: iterative proof refinement, library search, and context management. Evaluated on qualitatively different benchmarks with a range of models and design choices, the baseline is competitive with state-of-the-art approaches while using a significantly simpler architecture. The results show consistent advantages of iterative refinement over multiple single-shot generations, especially in sample efficiency and cost effectiveness.
论文提出了一种最小化代理的自动定理证明方法,以促进不同基于AI的定理证明器架构之间的系统比较。该方法采用了迭代证明精炼、库搜索和上下文管理等核心功能。通过不同基准的评估显示,该最小化代理在更简单的架构下达到了与先进方法相当的性能,特别是在样本效率和成本效益方面展示了迭代方法的优势。
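The iterative proof refinement mentioned above can be pictured as a propose-check-retry loop in which checker feedback flows into the next attempt. The sketch below is a generic illustration, not the released implementation. `refine_proof`, `propose`, and `check` are hypothetical names, the toy stubs stand in for a language model and a proof checker, and library search and context management are left out.

```python
from typing import Callable, Optional, Tuple

Proposer = Callable[[str, Optional[str]], str]      # (statement, feedback) -> proof
Checker = Callable[[str, str], Tuple[bool, str]]    # (statement, proof) -> (ok, feedback)

def refine_proof(
    statement: str,
    propose: Proposer,
    check: Checker,
    max_iters: int = 4,
) -> Tuple[Optional[str], int]:
    """Return (proof, attempts_used); proof is None if no attempt succeeded."""
    feedback: Optional[str] = None
    for attempt in range(1, max_iters + 1):
        proof = propose(statement, feedback)      # condition on the last error
        ok, feedback = check(statement, proof)    # verifier returns an error message
        if ok:
            return proof, attempt
    return None, max_iters

# Toy stand-ins: the "model" only finds the right tactic after seeing feedback.
def toy_propose(statement: str, feedback: Optional[str]) -> str:
    return "simp" if feedback is None else "rfl"

def toy_check(statement: str, proof: str) -> Tuple[bool, str]:
    if proof == "rfl":
        return True, ""
    return False, "simp made no progress"

proof, attempts = refine_proof("1 + 1 = 2", toy_propose, toy_check)
print(proof, attempts)
```

In a single-shot setup, every sample would instead call `propose(statement, None)` independently and no error message would carry over. That is the comparison the abstract draws when it reports better sample efficiency for the iterative approach.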
Unsupervised Representation Learning for 3D Mesh Parameterization with Semantic and Visibility Objectives
Authors: AmirHossein Zamani, Bruno Roy, Arianna Rampini
First: 2025-09-29T17:28:58+00:00 · Latest: 2026-02-27T18:42:50+00:00
Abstract
Recent 3D generative models produce high-quality textures for 3D mesh objects. However, they commonly rely on the heavy assumption that input 3D meshes are accompanied by manual mesh parameterization (UV mapping), a manual task that requires both technical precision and artistic judgment. Industry surveys show that this process often accounts for a significant share of asset creation, creating a major bottleneck for 3D content creators. Moreover, existing automatic methods often ignore two perceptually important criteria: (1) semantic awareness (UV charts should align semantically similar 3D parts across shapes) and (2) visibility awareness (cutting seams should lie in regions unlikely to be seen). To overcome these shortcomings and to automate the mesh parameterization process, we present an unsupervised differentiable framework that augments standard geometry-preserving UV learning with semantic- and visibility-aware objectives. For semantic-awareness, our pipeline (i) segments the mesh into semantic 3D parts, (ii) applies an unsupervised learned per-part UV-parameterization backbone, and (iii) aggregates per-part charts into a unified UV atlas. For visibility-awareness, we use ambient occlusion (AO) as an exposure proxy and back-propagate a soft differentiable AO-weighted seam objective to steer cutting seams toward occluded regions. By conducting qualitative and quantitative evaluations against state-of-the-art methods, we show that the proposed method produces UV atlases that better support texture generation and reduce perceptible seam artifacts compared to recent baselines. Our implementation code is publicly available at: https://github.com/AHHHZ975/Semantic-Visibility-UV-Param.
中文标题/摘要
标题:基于语义和视见性的无监督3D网格参数化表示学习
近年来,3D生成模型能够生成高质量的3D网格对象纹理。然而,它们通常依赖于一个沉重的假设,即输入的3D网格必须伴随手动的网格参数化(UV映射),这是一个需要技术精度和艺术判断的繁琐手动任务。行业调查显示,这一过程往往占用了大量资产创建的时间,成为3D内容创作者的主要瓶颈。此外,现有的自动方法往往忽略了两个重要的感知标准:(1)语义意识(UV图应使形状中的语义相似3D部分对齐)和(2)视见意识(切割缝应位于不太可能被看到的区域)。为克服这些不足并自动化网格参数化过程,我们提出了一种无监督可微分框架,该框架在标准的几何保持UV学习中增加了语义和视见意识的目标。对于语义意识,我们的流水线(i)将网格分割为语义3D部分,(ii)应用一个无监督学习的每部分UV参数化骨干网络,(iii)将每部分图整合为一个统一的UV图集。对于视见意识,我们使用环境遮挡(AO)作为曝光代理,并反向传播一个软可微分的AO加权切割缝目标,以引导切割缝朝向被遮挡的区域。通过与最新方法的定性和定量评估,我们展示了所提出的方法生成的UV图集在支持纹理生成和减少可感知的切割缝伪影方面优于最近的基线方法。我们的实现代码可在以下链接公开获取:https://github.com/AHHHZ975/Semantic-Visibility-UV-Param.
Summary / 总结
The paper addresses manual mesh parameterization (UV mapping) in 3D content creation, a task that requires both technical precision and artistic judgment. It proposes an unsupervised differentiable framework that adds semantic- and visibility-aware objectives to geometry-preserving UV learning. The method segments the mesh into semantic parts, applies a learned per-part UV parameterization, aggregates the per-part charts into a unified UV atlas, and uses an ambient-occlusion-weighted seam objective to steer cutting seams toward occluded regions. Experiments show that the resulting UV atlases better support texture generation and reduce perceptible seam artifacts compared to recent baselines.
研究通过提出一个结合语义和视见感知的无监督框架来解决3D内容创建中的手动网格参数化瓶颈问题。该方法将网格分割成语义部分,应用一个学习到的UV参数化骨干网络,并将这些部分聚合为一个统一的UV图集。此外,它使用环境遮挡来引导切口缝合远离可见区域。实验表明,这种方法生成的UV图集更适合纹理生成,并且减少了可见缝合伪影,优于现有方法。
LiteReality: Graphics-Ready 3D Scene Reconstruction from RGB-D Scans
Authors: Zhening Huang, Xiaoyang Wu, Fangcheng Zhong, Hengshuang Zhao, Matthias Nießner, Joan Lasenby
Venue: www
First: 2025-07-03T17:59:55+00:00 · Latest: 2026-02-27T18:27:47+00:00
Comments: Project Page: https://litereality.github.io; Video: https://www.youtube.com/watch?v=ecK9m3LXg2c&feature=youtu.be; Camera-Ready Version
Abstract
We propose LiteReality, a novel pipeline that converts RGB-D scans of indoor environments into compact, realistic, and interactive 3D virtual replicas. LiteReality not only reconstructs scenes that visually resemble reality but also supports key features essential for graphics pipelines -- such as object individuality, articulation, high-quality physically based rendering materials, and physically based interaction. At its core, LiteReality first performs scene understanding and parses the results into a coherent 3D layout and objects with the help of a structured scene graph. It then reconstructs the scene by retrieving the most visually similar 3D artist-crafted models from a curated asset database. Next, the Material Painting module enhances realism by recovering high-quality, spatially varying materials. Finally, the reconstructed scene is integrated into a simulation engine with basic physical properties to enable interactive behavior. The resulting scenes are compact, editable, and fully compatible with standard graphics pipelines, making them suitable for applications in AR/VR, gaming, robotics, and digital twins. In addition, LiteReality introduces a training-free object retrieval module that achieves state-of-the-art similarity performance on the Scan2CAD benchmark, along with a robust material painting module capable of transferring appearances from images of any style to 3D assets -- even under severe misalignment, occlusion, and poor lighting. We demonstrate the effectiveness of LiteReality on both real-life scans and public datasets. Project page: https://litereality.github.io; Video: https://www.youtube.com/watch?v=ecK9m3LXg2c
中文标题/摘要
标题:LiteReality:从RGB-D扫描重建室内环境的紧凑现实3D场景
我们提出LiteReality,一种将室内环境的RGB-D扫描转换为紧凑、逼真且可交互的3D虚拟复制品的新管道。LiteReality不仅重建了视觉上类似于现实的场景,还支持图形管道中必不可少的功能,如物体的独特性、关节运动、高质量的基于物理的渲染材料和基于物理的交互。其核心在于首先进行场景理解并将结果解析为一个连贯的3D布局和物体,借助结构化的场景图。然后通过检索精心策划的资产数据库中最相似的3D艺术家设计模型来重建场景。接下来,材质绘画模块通过恢复高质量的空间变化材质来增强现实感。最后,重建的场景被整合到具有基本物理属性的模拟引擎中,以实现交互行为。生成的场景紧凑、可编辑且完全兼容标准图形管道,适用于AR/VR、游戏、机器人技术和数字孪生等应用。此外,LiteReality引入了一个无需训练的对象检索模块,在Scan2CAD基准测试中实现了最先进的相似性性能,以及一个稳健的材质绘画模块,能够将任何风格的图像外观转移到3D资产上——即使在严重对齐不良、遮挡和照明不佳的情况下。我们在现实扫描和公共数据集上展示了LiteReality的有效性。项目页面:https://litereality.github.io;视频:https://www.youtube.com/watch?v=ecK9m3LXg2c&feature=youtu.be
Summary / 总结
LiteReality is a pipeline that converts RGB-D scans into realistic 3D virtual replicas with key graphics features like object individuality and high-quality rendering. It uses a structured scene graph for scene understanding, retrieves 3D models from a curated database, enhances materials with the Material Painting module, and integrates the scene into a simulation engine. The results are compact, editable, and compatible with standard graphics pipelines, suitable for AR/VR, gaming, robotics, and digital twins. The system achieves state-of-the-art object retrieval performance and robust material transfer capabilities.
LiteReality 是一种将 RGB-D 扫描转换为具有关键图形功能(如物体个体性和高质量渲染)的逼真 3D 虚拟复制品的管道。它使用结构化的场景图进行场景理解,检索艺术家创作的模型,通过材料绘画增强逼真度,并将场景集成到仿真引擎中。结果表明,这些场景是紧凑、交互式的,并且与标准图形管道兼容,适用于 AR/VR、游戏、机器人技术和数字孪生。该系统在对象检索和材料转移方面达到了最先进的性能,即使在挑战性条件下也能实现鲁棒性。
Histopathology Image Normalization via Latent Manifold Compaction
Authors: Xiaolong Zhang, Jianwei Zhang, Selim Sevim, Emek Demir, Ece Eksi, Xubo Song
First: 2026-02-27T18:26:59+00:00 · Latest: 2026-02-27T18:26:59+00:00
Comments: 11 pages
Abstract
Batch effects arising from technical variations in histopathology staining protocols, scanners, and acquisition pipelines pose a persistent challenge for computational pathology, hindering cross-batch generalization and limiting reliable deployment of models across clinical sites. In this work, we introduce Latent Manifold Compaction (LMC), an unsupervised representation learning framework that performs image harmonization by learning batch-invariant embeddings from a single source dataset through explicit compaction of stain-induced latent manifolds. This allows LMC to generalize to target domain data unseen during training. Evaluated on three challenging public and in-house benchmarks, LMC substantially reduces batch-induced separations across multiple datasets and consistently outperforms state-of-the-art normalization methods in downstream cross-batch classification and detection tasks, enabling superior generalization.
中文标题/摘要
标题:通过潜在流形压缩实现组织病理学图像归一化
由组织病理学染色协议、扫描仪和技术获取管道中的技术差异引起的批次效应,一直是计算病理学中的持续挑战,阻碍了跨批次泛化,并限制了模型在临床站点中的可靠部署。在本文中,我们引入了潜在流形压缩(LMC),这是一种无监督的表示学习框架,通过从单一源数据集中学习批次不变嵌入来实现图像协调,从而显式压缩染色引起的潜在流形。这使得LMC能够在训练期间未见过的目标域数据上泛化。在三个具有挑战性的公共和内部基准测试上进行评估,LMC显著减少了多个数据集中的批次引起的分离,并且在下游跨批次分类和检测任务中始终优于最先进的归一化方法,从而实现了更好的泛化。
Summary / 总结
The research addresses the challenge of batch effects in histopathology images, which can hinder the generalization of computational pathology models across different clinical sites. The method, Latent Manifold Compaction (LMC), is an unsupervised framework that learns batch-invariant image embeddings by compacting stain-induced latent manifolds from a single source dataset. LMC significantly reduces batch-induced separations and outperforms existing normalization methods in cross-batch classification and detection tasks, demonstrating superior generalization capabilities.
研究解决了组间效应在组织病理学图像中的问题,这会妨碍计算病理学模型在不同批次间的泛化能力。方法是无监督的Latent Manifold Compaction (LMC),通过从单一数据集学习压缩染色引起的潜在流形来生成组间不变的图像嵌入。LMC通过减少组间分离,提高了跨组分类和检测任务的表现,并在多个基准测试中优于现有方法。
Reinforcement Learning from Human Feedback
Authors: Nathan Lambert
First: 2025-04-16T21:36:46+00:00 · Latest: 2026-02-27T18:22:58+00:00
Comments: 204 pages. Web-native version at https://rlhfbook.com/ (continually improving; latest version at the website)
Abstract
Reinforcement learning from human feedback (RLHF) has become an important technical and storytelling tool to deploy the latest machine learning systems. In this book, we hope to give a gentle introduction to the core methods for people with some level of quantitative background. The book starts with the origins of RLHF -- both in recent literature and in a convergence of disparate fields of science in economics, philosophy, and optimal control. We then set the stage with definitions, problem formulation, data collection, and other common math used in the literature. The core of the book details every optimization stage in using RLHF, from starting with instruction tuning to training a reward model and finally all of rejection sampling, reinforcement learning, and direct alignment algorithms. The book concludes with advanced topics -- understudied research questions in synthetic data and evaluation -- and open questions for the field.
中文标题/摘要
标题:基于人类反馈的强化学习
基于人类反馈的强化学习(RLHF)已成为部署最新机器学习系统的重要技术和叙事工具。本书旨在为具有一定定量背景的读者对核心方法作一个温和的入门介绍。本书从RLHF的起源开始,既包括最近的文献,也包括经济学、哲学和最优控制等不同科学领域的交汇。然后,我们通过定义、问题表述、数据收集和其他文献中常用的数学方法来奠定基础。本书的核心部分详细介绍了使用RLHF的每一个优化阶段,从指令调优开始,到训练奖励模型,再到拒绝采样、强化学习和直接对齐算法。最后,本书涵盖了高级主题(合成数据和评估中研究不足的问题)以及该领域的开放问题。
Summary / 总结
This book gives a gentle introduction to the core methods of reinforcement learning from human feedback (RLHF) for readers with some quantitative background. It covers the origins of RLHF, then definitions, problem formulation, and data collection. Its core details every optimization stage, from instruction tuning to reward model training and on to rejection sampling, reinforcement learning, and direct alignment algorithms. It concludes with advanced topics (understudied research questions in synthetic data and evaluation) and open questions for the field.
本书为那些具有定量背景的人介绍了人类反馈强化学习(RLHF)的核心方法,涵盖了RLHF的起源、定义、问题表述、数据收集以及诸如指令调优、奖励模型训练和强化学习等各种优化阶段。书中还探讨了合成数据和评估等高级主题,并指出了该领域存在的开放性问题。
Time Series Foundation Models as Strong Baselines in Transportation Forecasting: A Large-Scale Benchmark Analysis
Authors: Javier Pulido, Filipe Rodrigues
First: 2026-02-27T18:10:54+00:00 · Latest: 2026-02-27T18:10:54+00:00
Comments: 6 pages
Abstract
Accurate forecasting of transportation dynamics is essential for urban mobility and infrastructure planning. Although recent work has achieved strong performance with deep learning models, these methods typically require dataset-specific training, architecture design and hyper-parameter tuning. This paper evaluates whether general-purpose time-series foundation models can serve as forecasters for transportation tasks by benchmarking the zero-shot performance of the state-of-the-art model, Chronos-2, across ten real-world datasets covering highway traffic volume and flow, urban traffic speed, bike-sharing demand, and electric vehicle charging station data. Under a consistent evaluation protocol, we find that, even without any task-specific fine-tuning, Chronos-2 delivers state-of-the-art or competitive accuracy across most datasets, frequently outperforming classical statistical baselines and specialized deep learning architectures, particularly at longer horizons. Beyond point forecasting, we evaluate its native probabilistic outputs using prediction-interval coverage and sharpness, demonstrating that Chronos-2 also provides useful uncertainty quantification without dataset-specific training. In general, this study supports the adoption of time-series foundation models as a key baseline for transportation forecasting research.
中文标题/摘要
标题:时间序列基础模型在交通预测中的强基准作用:大规模基准分析
准确预测交通动态对于城市交通和基础设施规划至关重要。尽管最近的工作已经使用深度学习模型取得了强大的性能,但这些方法通常需要针对特定数据集进行训练、架构设计和超参数调整。本文通过在涵盖高速公路交通流量和流速、城市交通速度、共享单车需求和电动汽车充电站数据的十个真实世界数据集上对最先进的模型Chronos-2进行零样本基准测试,评估通用时间序列基础模型是否可以作为交通任务的预测器。在一致的评估协议下,我们发现,即使没有任何任务特定的微调,Chronos-2在大多数数据集上都能达到最先进的或具有竞争力的准确性,经常优于经典统计基准和专门的深度学习架构,尤其是在更长的时间范围内。除了点预测之外,我们还使用预测区间覆盖和锐度评估其原生的概率输出,证明Chronos-2在无需特定数据集训练的情况下也能提供有用的不确定性量化。总体而言,这项研究支持将时间序列基础模型作为交通预测研究中的关键基准的采用。
Summary / 总结
This paper benchmarks the zero-shot performance of a general-purpose time-series foundation model, Chronos-2, on transportation forecasting across ten real-world datasets. Without any task-specific fine-tuning, Chronos-2 achieves state-of-the-art or competitive accuracy on most datasets, frequently outperforming classical statistical baselines and specialized deep learning architectures, particularly at longer horizons. Its native probabilistic outputs, evaluated with prediction-interval coverage and sharpness, also provide useful uncertainty quantification.
该研究评估了一种通用时间序列基础模型Chronos-2在涵盖十个多领域真实世界数据集中的交通预测表现。无需特定任务微调,Chronos-2在长预测期尤其表现出色,并通过其概率输出提供了有用的不确定性量化。
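The two interval metrics named in the abstract, prediction-interval coverage and sharpness, can be sketched as follows. This is a minimal illustration using their common definitions (fraction of observations inside the interval, and mean interval width); the function names and the toy numbers are assumptions for illustration, not the paper's evaluation code.

```python
def interval_coverage(y_true, lower, upper):
    """Fraction of observations that fall inside [lower, upper]."""
    inside = [lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper)]
    return sum(inside) / len(inside)


def interval_sharpness(lower, upper):
    """Mean interval width; a narrower interval is a sharper one."""
    widths = [hi - lo for lo, hi in zip(lower, upper)]
    return sum(widths) / len(widths)


# Toy data (hypothetical): four observations with their predicted intervals.
y_true = [10.0, 12.0, 9.0, 15.0]
lower = [8.0, 11.0, 9.5, 13.0]
upper = [12.0, 14.0, 11.0, 16.0]

coverage = interval_coverage(y_true, lower, upper)   # 3 of 4 points covered
sharpness = interval_sharpness(lower, upper)         # average width
```

Coverage is usually compared against the nominal level of the interval, while sharpness is only meaningful between forecasters with similar coverage.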
SafeGen-LLM: Enhancing Safety Generalization in Task Planning for Robotic Systems
Authors: Jialiang Fan, Weizhe Xu, Mengyu Liu, Oleg Sokolsky, Insup Lee, Fangxin Kong
First: 2026-02-27T18:06:10+00:00 · Latest: 2026-02-27T18:06:10+00:00
Comments: 12 pages, 6 figures
Abstract
Safety-critical task planning in robotic systems remains challenging: classical planners suffer from poor scalability, Reinforcement Learning (RL)-based methods generalize poorly, and base Large Language Models (LLMs) cannot guarantee safety. To address this gap, we propose safety-generalizable large language models, named SafeGen-LLM. SafeGen-LLM can not only enhance the safety satisfaction of task plans but also generalize well to novel safety properties in various domains. We first construct a multi-domain Planning Domain Definition Language 3 (PDDL3) benchmark with explicit safety constraints. Then, we introduce a two-stage post-training framework: Supervised Fine-Tuning (SFT) on a constraint-compliant planning dataset to learn planning syntax and semantics, and Group Relative Policy Optimization (GRPO) guided by fine-grained reward machines derived from formal verification to enforce safety alignment and by curriculum learning to better handle complex tasks. Extensive experiments show that SafeGen-LLM achieves strong safety generalization and outperforms frontier proprietary baselines across multi-domain planning tasks and multiple input formats (e.g., PDDLs and natural language).
中文标题/摘要
标题:SafeGen-LLM:增强机器人系统任务规划中的安全性泛化
机器人系统中的安全性关键任务规划仍然具有挑战性:经典规划器在可扩展性方面表现不佳,基于强化学习(RL)的方法泛化能力差,基础大型语言模型无法保证安全性。为了解决这一差距,我们提出了一种安全泛化的大语言模型,命名为SafeGen-LLM。SafeGen-LLM不仅可以提高任务计划的安全满意度,还可以在各种领域中很好地泛化到新的安全属性。我们首先构建了一个具有显式安全约束的多领域Planning Domain Definition Language 3 (PDDL3)基准。然后,我们引入了一种两阶段后训练框架:在约束合规的规划数据集上进行监督微调(SFT)以学习规划语法和语义,以及由形式验证中导出的细粒度奖励机器引导的组相对策略优化(GRPO),并由课程学习来更好地处理复杂任务以确保安全对齐。广泛的实验表明,SafeGen-LLM在多领域规划任务和多种输入格式(例如PDDL和自然语言)中实现了强大的安全性泛化,并优于前沿的专有基线。
Summary / 总结
The paper addresses safety-critical task planning in robotic systems by proposing SafeGen-LLM, a safety-generalizable large language model. The authors first construct a multi-domain PDDL3 benchmark with explicit safety constraints, then apply a two-stage post-training framework: Supervised Fine-Tuning on a constraint-compliant planning dataset, followed by Group Relative Policy Optimization guided by fine-grained reward machines derived from formal verification and by curriculum learning. Experiments show that SafeGen-LLM improves safety satisfaction, generalizes to novel safety properties across domains and input formats, and outperforms frontier proprietary baselines.
研究旨在通过克服经典规划器和基于强化学习的方法的局限性,提高机器人系统任务规划中的安全性。提出了一种名为SafeGen-LLM的安全可泛化大型语言模型,采用两阶段后训练框架:监督微调学习规划语法和语义,以及由奖励机器和课程学习引导的组相对策略优化来确保安全对齐。实验表明,SafeGen-LLM在多个领域和输入格式下表现出色,安全泛化能力优于现有专有基线。
Enhancing Spatial Understanding in Image Generation via Reward Modeling
Authors: Zhenyu Tang, Chaoran Feng, Yufan Deng, Jie Wu, Xiaojie Li, Rui Wang, Yunpeng Chen, Daquan Zhou
Venue: CVPR 2026
First: 2026-02-27T17:59:57+00:00 · Latest: 2026-02-27T17:59:57+00:00
Comments: Accepted at CVPR 2026. Github: https://github.com/DAGroup-PKU/SpatialT2I Project website: https://dagroup-pku.github.io/SpatialT2I/
Abstract
Recent progress in text-to-image generation has greatly advanced visual fidelity and creativity, but it has also imposed higher demands on prompt complexity, particularly in encoding intricate spatial relationships. In such cases, achieving satisfactory results often requires multiple sampling attempts. To address this challenge, we introduce a novel method that strengthens the spatial understanding of current image generation models. We first construct the SpatialReward-Dataset with over 80k preference pairs. Building on this dataset, we build SpatialScore, a reward model designed to evaluate the accuracy of spatial relationships in text-to-image generation, achieving performance that even surpasses leading proprietary models on spatial evaluation. We further demonstrate that this reward model effectively enables online reinforcement learning for complex spatial generation. Extensive experiments across multiple benchmarks show that our specialized reward model yields significant and consistent gains in spatial understanding for image generation.
中文标题/摘要
标题:通过奖励建模增强图像生成中的空间理解
文本到图像生成的近期进展极大地提高了视觉保真度和创造力,但也对提示的复杂性提出了更高要求,特别是在编码复杂的空间关系方面。在这种情况下,获得满意的结果通常需要多次采样尝试。为了解决这一挑战,我们提出了一种新方法,以增强当前图像生成模型的空间理解能力。我们首先构建了包含超过8万个偏好对的SpatialReward-数据集。在此数据集的基础上,我们构建了SpatialScore,这是一种奖励模型,旨在评估文本到图像生成中的空间关系准确性,其性能甚至超过了领先的专有模型在空间评估中的表现。我们进一步证明,该奖励模型能够有效地实现复杂空间生成的在线强化学习。在多个基准上的广泛实验表明,我们的专门奖励模型在图像生成中的空间理解方面取得了显著且一致的提升。
Summary / 总结
The research aims to improve the spatial understanding of image generation models by addressing the challenge of encoding intricate spatial relationships. The authors developed a novel method involving the SpatialReward-Dataset and SpatialScore, a reward model that evaluates spatial accuracy. Experiments across multiple benchmarks demonstrated that this reward model significantly enhances spatial understanding in image generation.
研究旨在通过解决在文本到图像生成中编码复杂空间关系的挑战,提高图像生成模型的空间理解能力。作者开发了一个SpatialReward-Dataset和一个名为SpatialScore的奖励模型来评估空间准确性。该模型在空间评估上超越了现有模型,并能够有效实现复杂空间生成的在线强化学习。跨多个基准的实验显示了在空间理解方面的持续改进。
BLISSNet: Deep Operator Learning for Fast and Accurate Flow Reconstruction from Sparse Sensor Measurements
Authors: Maksym Veremchuk, K. Andrea Scott, Zhao Pan
First: 2026-02-27T17:55:43+00:00 · Latest: 2026-02-27T17:55:43+00:00
Abstract
Reconstructing fluid flows from sparse sensor measurements is a fundamental challenge in science and engineering. Widely separated measurements and complex, multiscale dynamics make accurate recovery of fine-scale structures difficult. In addition, existing methods face a persistent tradeoff: high-accuracy models are often computationally expensive, whereas faster approaches typically compromise fidelity. In this work, we introduce BLISSNet, a model that strikes a strong balance between reconstruction accuracy and computational efficiency for both flow reconstruction and nudging-based data assimilation. The model follows a DeepONet-like architecture, enabling zero-shot inference on domains of arbitrary size. After the first model call on a given domain, certain network components can be precomputed, leading to low inference cost for subsequent evaluations on large domains. Consequently, the model can achieve faster inference than classical interpolation methods such as radial basis function or bicubic interpolation. This combination of high accuracy, low cost, and zero-shot generalization makes BLISSNet well-suited for large-scale real-time flow reconstruction and data assimilation tasks.
中文标题/摘要
标题:BLISSNet:用于从稀疏传感器测量快速准确地重构流场的深度算子学习
从稀疏传感器测量重构流场是科学和工程中的一个基本挑战。广泛分布的测量和复杂的多尺度动态使得精细结构的准确恢复变得困难。此外,现有方法面临一个持续的权衡:高精度模型通常计算成本高昂,而更快的方法通常会牺牲保真度。在本文中,我们引入了BLISSNet模型,该模型在流场重构和基于推移的数据同化方面在重构精度和计算效率之间取得了良好的平衡。该模型遵循类似于DeepONet的架构,能够对任意大小的域进行零样本推理。在给定域上的第一个模型调用之后,某些网络组件可以预先计算,从而在后续对大型域的评估中具有较低的推理成本。因此,该模型可以比经典的插值方法(如径向基函数或双立方插值)更快地进行推理。这种高精度、低成本和零样本泛化的结合使BLISSNet非常适合大规模实时流场重构和数据同化任务。
Summary / 总结
The paper addresses the challenge of reconstructing fluid flows from sparse sensor measurements, a fundamental problem in science and engineering. It introduces BLISSNet, a DeepONet-like model that balances reconstruction accuracy and computational efficiency for both flow reconstruction and nudging-based data assimilation, with zero-shot inference on domains of arbitrary size. After the first model call on a given domain, certain network components can be precomputed, which lowers the cost of subsequent evaluations on large domains and lets the model run faster than classical interpolation methods such as radial basis function or bicubic interpolation. These properties make it suitable for large-scale real-time flow reconstruction and data assimilation.
BLISSNet旨在从稀疏传感器测量中高精度地重建流体流动,并具有低计算成本。它采用类似于DeepONet的架构,可以在任意域上实现零样本推理。在初始模型调用后,某些网络组件可以被预计算,从而减少后续在大域上的推理成本。这种方法在准确性和速度上都优于经典插值方法,使BLISSNet适用于实时流场重建和数据同化任务。
MuViT: Multi-Resolution Vision Transformers for Learning Across Scales in Microscopy
Authors: Albert Dominguez Mantes, Gioele La Manno, Martin Weigert
Venue: CVPR 2026
First: 2026-02-27T17:48:54+00:00 · Latest: 2026-02-27T17:48:54+00:00
Comments: Accepted at CVPR 2026
Abstract
Modern microscopy routinely produces gigapixel images that contain structures across multiple spatial scales, from fine cellular morphology to broader tissue organization. Many analysis tasks require combining these scales, yet most vision models operate at a single resolution or derive multi-scale features from one view, limiting their ability to exploit the inherently multi-resolution nature of microscopy data. We introduce MuViT, a transformer architecture built to fuse true multi-resolution observations from the same underlying image. MuViT embeds all patches into a shared world-coordinate system and extends rotary positional embeddings to these coordinates, enabling attention to integrate wide-field context with high-resolution detail within a single encoder. Across synthetic benchmarks, kidney histopathology, and high-resolution mouse-brain microscopy, MuViT delivers consistent improvements over strong ViT and CNN baselines. Multi-resolution MAE pretraining further produces scale-consistent representations that enhance downstream tasks. These results demonstrate that explicit world-coordinate modelling provides a simple yet powerful mechanism for leveraging multi-resolution information in large-scale microscopy analysis.
中文标题/摘要
标题:MuViT:多分辨率视觉变换器在显微镜跨尺度学习中的应用
现代显微镜通常会产生包含从细胞形态到组织组织多个空间尺度结构的吉像素级图像。许多分析任务需要结合这些尺度,但大多数视觉模型仅在单一分辨率下运行,或者从单一视角推导多尺度特征,限制了它们利用显微镜数据固有的多分辨率性质的能力。我们引入了MuViT,一种构建用于融合同一图像的真正多分辨率观察的变换器架构。MuViT 将所有补丁嵌入到共享的世界坐标系统中,并将旋转位置嵌入扩展到这些坐标,使注意力能够在单个编码器中整合宽场上下文与高分辨率细节。在合成基准、肾组织病理学和高分辨率小鼠大脑显微镜数据上,MuViT 在强大的ViT和CNN基线之上提供了持续的改进。多分辨率MAE预训练进一步生成了尺度一致的表示,增强了下游任务。这些结果表明,显式的世界坐标建模提供了一种简单而强大的机制,用于在大规模显微镜分析中利用多分辨率信息。
Summary / 总结
MuViT is designed to handle multi-resolution data in microscopy by fusing observations from different scales within a shared coordinate system. It uses rotary positional embeddings to enable attention that integrates wide-field context with high-resolution detail. Experiments show MuViT outperforms ViT and CNN baselines on various tasks, including synthetic benchmarks, kidney histopathology, and high-resolution mouse-brain microscopy. Pretraining with multi-resolution masked autoencoders further improves scale-consistent representations for downstream tasks.
MuViT 是一种变压器架构,旨在处理显微镜图像中的多分辨率观察,这些图像包含不同尺度的结构。它将所有补丁嵌入到共享的世界坐标系统中,并使用扩展的旋转位置嵌入来使注意力能够整合宽场上下文与高分辨率细节。MuViT 在合成基准、肾组织病理学和高分辨率小鼠大脑显微镜数据上均优于强大的 ViT 和 CNN 基线,并且使用多分辨率掩蔽自编码器进行预训练进一步增强了尺度一致的表示,以增强下游任务。
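The idea of applying rotary positional embeddings to shared world coordinates can be sketched in one dimension. This is a hedged illustration only: the abstract does not specify how MuViT extends the embeddings, so the helper names (`world_center`, `rotate`), the frequency schedule, and the patch geometry below are assumptions. The sketch shows the standard rotary property that the attention score depends only on the difference between the two world coordinates, even when the patches come from different resolution levels.

```python
import math


def world_center(index, patch_size, pixel_size):
    """World coordinate of a patch centre (hypothetical 1-D geometry)."""
    return (index + 0.5) * patch_size * pixel_size


def rotate(vec, coord, base=10000.0):
    """Rotate consecutive feature pairs by angles proportional to coord."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        angle = coord * base ** (-i / dim)
        c, s = math.cos(angle), math.sin(angle)
        a, b = vec[i], vec[i + 1]
        out.extend([a * c - b * s, a * s + b * c])
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# A fine patch (1 unit per pixel) and a coarse patch (4 units per pixel).
fine_pos = world_center(index=5, patch_size=16, pixel_size=1.0)    # 88.0
coarse_pos = world_center(index=1, patch_size=16, pixel_size=4.0)  # 96.0

q = [1.0, 0.0, 0.5, -0.5]
k = [0.3, 0.7, -0.2, 0.1]

score = dot(rotate(q, fine_pos), rotate(k, coarse_pos))
# Shifting both patches by the same world offset leaves the score unchanged.
shifted = dot(rotate(q, fine_pos + 100.0), rotate(k, coarse_pos + 100.0))
```

Because positions are expressed in world units rather than patch indices, a coarse patch and a fine patch that sit next to each other in the image get nearby coordinates regardless of their resolution level.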
Comparing Classical and Quantum Variational Classifiers on the XOR Problem
Authors: Miras Seilkhan, Adilbek Taizhanov
First: 2026-02-27T17:46:52+00:00 · Latest: 2026-02-27T17:46:52+00:00
Comments: 32 pages, 17 figures. Code and experiment scripts available at https://github.com/mseilkhan/XOR-research-Quantum-ML-vs-Classic
Abstract
Quantum machine learning applies principles such as superposition and entanglement to data processing and optimization. Variational quantum models operate on qubits in high-dimensional Hilbert spaces and provide an alternative approach to model expressivity. We compare classical models and a variational quantum classifier on the XOR problem. Logistic regression, a one-hidden-layer multilayer perceptron, and a two-qubit variational quantum classifier with circuit depths 1 and 2 are evaluated on synthetic XOR datasets with varying Gaussian noise and sample sizes using accuracy and binary cross-entropy.
Performance is determined primarily by model expressivity. Logistic regression and the depth-1 quantum circuit fail to represent XOR reliably, whereas the multilayer perceptron and the depth-2 quantum circuit achieve perfect test accuracy under representative conditions. Robustness analyses across noise levels, dataset sizes, and random seeds confirm that circuit depth is decisive for quantum performance on this task. Despite matching accuracy, the multilayer perceptron achieves lower binary cross-entropy and substantially shorter training time. Hardware execution preserves the global XOR structure but introduces structured deviations in the decision function. Overall, deeper variational quantum classifiers can match classical neural networks in accuracy on low-dimensional XOR benchmarks, but no clear empirical advantage in robustness or efficiency is observed in the examined settings.
中文标题/摘要
标题:经典和量子变分分类器在XOR问题上的比较
量子机器学习利用叠加和纠缠等原理进行数据处理和优化。变分量子模型在高维希尔伯特空间中的量子比特上运行,提供了一种模型表达性的替代方法。我们比较了经典模型和变分量子分类器在XOR问题上的表现。使用准确率和二元交叉熵评估逻辑回归、单隐藏层多层感知机以及深度分别为1和2的两量子比特变分量子分类器在具有不同高斯噪声和样本大小的合成XOR数据集上的表现。
表现主要由模型表达性决定。逻辑回归和深度为1的量子电路无法可靠地表示XOR,而多层感知机和深度为2的量子电路在代表性条件下实现了完美的测试准确率。噪声水平、数据集大小和随机种子的稳健性分析证实,电路深度对于此任务上的量子性能至关重要。尽管准确率相同,多层感知机的二元交叉熵较低,训练时间也显著缩短。硬件执行保留了全局XOR结构,但在决策函数中引入了结构化的偏差。总体而言,更深的变分量子分类器可以在低维XOR基准上与经典神经网络匹配准确率,但在稳健性和效率方面未观察到明显的经验优势。
Summary / 总结
This study compares classical and quantum models on the XOR problem, evaluating logistic regression, a one-hidden-layer multilayer perceptron, and a two-qubit variational quantum classifier with circuit depths 1 and 2. Logistic regression and the depth-1 quantum circuit cannot reliably represent XOR, while the multilayer perceptron and the depth-2 quantum circuit achieve perfect test accuracy under representative conditions. Robustness analyses across noise levels, dataset sizes, and random seeds indicate that circuit depth is decisive for quantum performance on this task. Although the two models match in accuracy, the multilayer perceptron achieves lower binary cross-entropy and substantially shorter training time. Hardware execution preserves the global XOR structure but introduces structured deviations in the decision function.
研究比较了经典和量子模型在XOR问题上的表现,评估了逻辑回归、单隐藏层多层感知机以及深度分别为1和2的两量子比特变量子化分类器。结果显示,逻辑回归和深度为1的量子电路无法可靠地表示XOR,而多层感知机和深度为2的量子电路在代表性条件下实现了完美的测试准确率。电路深度对于量子性能至关重要,尽管在准确率上匹配,但多层感知机在二元交叉熵和训练时间上优于量子模型。硬件执行保留了全局XOR结构,但在决策函数中引入了结构化的偏差。
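The expressivity gap on the classical side can be checked by hand on the four noise-free XOR points. The sketch below is an illustration under stated assumptions, not the paper's experiment: it uses hard step activations and hand-set weights instead of trained logistic or sigmoid units, and a coarse grid search over linear classifiers. It covers only the classical models, not the quantum circuits.

```python
from itertools import product

# The four noise-free XOR points and their labels.
XOR_POINTS = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def step(z):
    return 1 if z > 0 else 0


def linear_predict(w1, w2, b, x):
    """A single linear decision boundary, as in logistic regression."""
    return step(w1 * x[0] + w2 * x[1] + b)


def mlp_predict(x):
    """One hidden layer with two hand-set units: OR and AND."""
    h_or = step(x[0] + x[1] - 0.5)
    h_and = step(x[0] + x[1] - 1.5)
    return step(h_or - h_and - 0.5)  # OR and not AND, which is XOR


def accuracy(predict):
    return sum(predict(x) == y for x, y in XOR_POINTS) / len(XOR_POINTS)


# Search a coarse grid of linear classifiers; none classifies all four points.
grid = [i * 0.5 for i in range(-4, 5)]
best_linear = max(
    accuracy(lambda x, w=w: linear_predict(*w, x))
    for w in product(grid, repeat=3)
)
mlp_accuracy = accuracy(mlp_predict)
```

The best linear boundary gets three of the four points right, because XOR is not linearly separable, while the two hidden units are enough to represent it exactly.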
Unified Biomolecular Trajectory Generation via Pretrained Variational Bridge
Authors: Ziyang Yu, Wenbing Huang, Yang Liu
Venue: ICLR 2026
First: 2026-02-07T15:32:37+00:00 · Latest: 2026-02-27T17:43:29+00:00
Comments: The Fourteenth International Conference on Learning Representations (ICLR 2026)
Abstract
Molecular Dynamics (MD) simulations provide a fundamental tool for characterizing molecular behavior at full atomic resolution, but their applicability is severely constrained by the computational cost. To address this, a surge of deep generative models has recently emerged to learn dynamics at coarsened timesteps for efficient trajectory generation, yet they either generalize poorly across systems or, due to limited molecular diversity of trajectory data, fail to fully exploit structural information to improve generative fidelity. Here, we present the Pretrained Variational Bridge (PVB) in an encoder-decoder fashion, which maps the initial structure into a noised latent space and transports it toward stage-specific targets through augmented bridge matching. This unifies training on both single-structure and paired trajectory data, enabling consistent use of cross-domain structural knowledge across training stages. Moreover, for protein-ligand complexes, we further introduce a reinforcement learning-based optimization via adjoint matching that speeds progression toward the holo state, which supports efficient post-optimization of docking poses. Experiments on proteins and protein-ligand complexes demonstrate that PVB faithfully reproduces thermodynamic and kinetic observables from MD while delivering stable and efficient generative dynamics.
中文标题/摘要
标题:通过预训练变分桥梁统一生物分子轨迹生成
分子动力学(MD)模拟为在全原子分辨率下表征分子行为提供了基本工具,但其应用受到计算成本的严重限制。为解决这一问题,最近涌现了一种深度生成模型,用于学习粗化时间步长的动力学以提高轨迹生成效率,但这些模型要么在系统间泛化能力差,要么由于轨迹数据的分子多样性有限,无法充分利用结构信息以提高生成保真度。在此,我们以编码器-解码器的方式提出了预训练变分桥梁(PVB),将初始结构映射到噪声潜空间,并通过增强桥梁匹配将其运向阶段特定目标。这统一了单结构和配对轨迹数据的训练,使跨训练阶段的一致性结构知识使用成为可能。此外,对于蛋白质-配体复合物,我们进一步引入了基于强化学习的优化方法,通过反向匹配加速向完整状态的过渡,从而支持对接姿势的高效后优化。在蛋白质和蛋白质-配体复合物上的实验表明,PVB能够忠实再现MD的热力学和动力学观测值,同时提供稳定和高效的生成动力学。
Summary / 总结
The research aims to address the computational limitations of Molecular Dynamics (MD) simulations by developing a unified generative model called Pretrained Variational Bridge (PVB). PVB uses an encoder-decoder architecture to map initial structures into a latent space and then transport them towards stage-specific targets through augmented bridge matching. This approach allows for consistent use of structural knowledge across training stages and improves generative fidelity. Experiments show that PVB accurately reproduces thermodynamic and kinetic observables from MD while providing stable and efficient generative dynamics for proteins and protein-ligand complexes.
研究旨在通过开发统一生成模型Pretrained Variational Bridge (PVB)来解决分子动力学(MD)模拟的计算限制问题。PVB采用编码器-解码器架构,将初始结构映射到潜在空间,然后通过增强的桥梁匹配将其运送到阶段特定的目标。这种方法允许在训练阶段之间一致地使用结构知识并提高生成保真度。实验表明,PVB能够准确地从MD中再现热力学和动力学观测值,同时提供稳定和高效的生成动力学,适用于蛋白质和蛋白质-配体复合物。
Controllable Reasoning Models Are Private Thinkers
Authors: Haritz Puerto, Haonan Li, Xudong Han, Timothy Baldwin, Iryna Gurevych
First: 2026-02-27T17:39:10+00:00 · Latest: 2026-02-27T17:39:10+00:00
Abstract
AI agents powered by reasoning models require access to sensitive user data. However, their reasoning traces are difficult to control, which can result in the unintended leakage of private information to external parties. We propose training models to follow instructions not only in the final answer, but also in reasoning traces, potentially under different constraints. We hypothesize that improving their instruction following abilities in the reasoning traces can improve their privacy-preservation skills. To demonstrate this, we fine-tune models on a new instruction-following dataset with explicit restrictions on reasoning traces. We further introduce a generation strategy that decouples reasoning and answer generation using separate LoRA adapters. We evaluate our approach on six models from two model families, ranging from 1.7B to 14B parameters, across two instruction-following benchmarks and two privacy benchmarks. Our method yields substantial improvements, achieving gains of up to 20.9 points in instruction-following performance and up to 51.9 percentage points on privacy benchmarks. These improvements, however, can come at the cost of task utility, due to the trade-off between reasoning performance and instruction-following abilities. Overall, our results show that improving instruction-following behavior in reasoning models can significantly enhance privacy, suggesting a promising direction for the development of future privacy-aware agents. Our code and data are available at https://github.com/UKPLab/arxiv2026-controllable-reasoning-models
中文标题/摘要
标题:可控推理模型是私密思考者
由推理模型驱动的AI代理需要访问敏感用户数据。然而,它们的推理痕迹难以控制,可能会无意中将私人信息泄露给外部方。我们提出训练模型不仅在最终答案中遵循指令,还在推理痕迹中遵循指令,可能在不同约束条件下。我们假设在推理痕迹中提高其遵循指令的能力可以提高其隐私保护能力。为了证明这一点,我们在一个具有明确推理痕迹限制的新指令遵循数据集上对模型进行微调。我们还引入了一种生成策略,使用单独的LoRA适配器解耦推理和答案生成。我们在两个模型家族的六个模型上进行了评估,参数范围从17亿到140亿,跨越两个指令遵循基准和两个隐私基准。我们的方法取得了显著的改进,指令遵循性能提高了高达20.9分,隐私基准提高了高达51.9个百分点。然而,这些改进可能会因推理性能与遵循指令能力之间的权衡而牺牲任务实用性。总体而言,我们的结果表明,在推理模型中提高遵循指令的行为可以显著增强隐私,这表明未来隐私感知代理开发的一个有希望的方向。我们的代码和数据可在https://github.com/UKPLab/arxiv2026-controllable-reasoning-models 获取。
Summary / 总结
The research addresses the issue of private information leakage in reasoning models by training them to follow instructions in both final answers and reasoning traces. The authors introduce a new dataset with explicit restrictions on reasoning traces and a generation strategy that separates reasoning and answer generation. Evaluations on six models from two model families show significant improvements in instruction-following performance and privacy benchmarks, though at the cost of task utility. This suggests that enhancing instruction-following in reasoning models can improve privacy.
研究旨在通过控制AI代理的推理过程来提高其隐私性。方法包括在具有明确限制的新数据集上微调模型,并使用一种分离推理和答案生成的生成策略。该方法显著提高了指令遵循性能,最高可达20.9分,并且在隐私基准测试中提高了高达51.9个百分点,尽管可能会降低任务实用性。总体而言,研究结果表明,在推理模型中提高指令遵循行为可以显著提高隐私性,这为未来隐私意识代理的发展指明了方向。
SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching
Authors: Yasaman Haghighi, Alexandre Alahi
First: 2026-02-27T17:36:09+00:00 · Latest: 2026-02-27T17:36:09+00:00
Abstract
Diffusion models achieve state-of-the-art video generation quality, but their inference remains expensive due to the large number of sequential denoising steps. This has motivated a growing line of research on accelerating diffusion inference. Among training-free acceleration methods, caching reduces computation by reusing previously computed model outputs across timesteps. Existing caching methods rely on heuristic criteria to choose cache/reuse timesteps and require extensive tuning. We address this limitation with a principled sensitivity-aware caching framework. Specifically, we formalize the caching error through an analysis of the model output sensitivity to perturbations in the denoising inputs, i.e., the noisy latent and the timestep, and show that this sensitivity is a key predictor of caching error. Based on this analysis, we propose Sensitivity-Aware Caching (SenCache), a dynamic caching policy that adaptively selects caching timesteps on a per-sample basis. Our framework provides a theoretical basis for adaptive caching, explains why prior empirical heuristics can be partially effective, and extends them to a dynamic, sample-specific approach. Experiments on Wan 2.1, CogVideoX, and LTX-Video show that SenCache achieves better visual quality than existing caching methods under similar computational budgets.
中文标题/摘要
标题:SenCache:通过敏感性感知缓存加速扩散模型推理
扩散模型在视频生成质量上达到了最先进的水平,但由于需要大量的顺序去噪步骤,其推理仍然很昂贵。这激发了对加速扩散推理的研究。在无需训练的方法中,缓存通过在时间步之间重用之前计算的模型输出来减少计算量。现有的缓存方法依赖于启发式标准来选择缓存/重用的时间步,并需要大量的调优。我们通过一个基于模型输出对去噪输入(即噪声潜在变量和时间步)扰动的敏感性分析,提出了一种原理性的敏感性感知缓存框架来解决这一限制。具体来说,我们通过分析模型输出对去噪输入的敏感性来形式化缓存误差,并表明这种敏感性是预测缓存误差的关键指标。基于这一分析,我们提出了敏感性感知缓存(SenCache),这是一种动态缓存策略,能够根据每个样本自适应地选择缓存时间步。我们的框架为自适应缓存提供了理论基础,解释了为什么先前的经验启发式方法部分有效,并将它们扩展为一种动态的、样本特定的方法。在Wan 2.1、CogVideoX和LTX-Video上的实验表明,在相似的计算预算下,SenCache在视觉质量上优于现有的缓存方法。
Summary / 总结
The paper introduces SenCache, a sensitivity-aware caching framework for accelerating diffusion model inference. It formalizes the caching error through the sensitivity of the model output to perturbations in the denoising inputs (the noisy latent and the timestep) and derives a dynamic caching policy that adaptively selects caching timesteps on a per-sample basis. Experiments on Wan 2.1, CogVideoX, and LTX-Video show that SenCache achieves better visual quality than existing caching methods under similar computational budgets.
SenCache 通过使用敏感性感知缓存框架来加速扩散模型推理。它基于去噪输入的扰动对模型输出敏感性来形式化缓存误差,并提出了一种动态缓存策略 SenCache,该策略能够根据每个样本自适应地选择缓存时间步。实验表明,在相似的计算预算下,SenCache 在视觉质量方面优于现有缓存方法。
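A per-sample caching decision driven by input sensitivity could look roughly like the following. This is a toy sketch and not the authors' criterion: the first-order error estimate, the scalar latents, and the names `run_with_cache`, `sens_x`, `sens_t`, and `tol` are all assumptions made for illustration.

```python
def run_with_cache(model, latents, timesteps, sens_x, sens_t, tol):
    """Reuse the cached output while the predicted change stays within tol.

    The predicted change is a first-order estimate:
    sens_x * |latent drift| + sens_t * |timestep drift|,
    measured from the inputs at which the cache was last refreshed.
    """
    outputs, calls = [], 0
    cached_out = cached_x = cached_t = None
    for x, t in zip(latents, timesteps):
        if cached_out is None:
            predicted_err = float("inf")  # nothing cached yet
        else:
            predicted_err = sens_x * abs(x - cached_x) + sens_t * abs(t - cached_t)
        if predicted_err > tol:
            # Too sensitive to skip: call the model and refresh the cache.
            cached_out, cached_x, cached_t = model(x, t), x, t
            calls += 1
        outputs.append(cached_out)
    return outputs, calls


# Toy scalar "denoiser" whose true sensitivities are 0.5 and 0.1.
def toy_model(x, t):
    return 0.5 * x + 0.1 * t


latents = [1.0, 0.98, 0.97, 0.6, 0.59]
timesteps = [1.0, 0.8, 0.6, 0.4, 0.2]
outputs, calls = run_with_cache(toy_model, latents, timesteps,
                                sens_x=0.5, sens_t=0.1, tol=0.05)
```

In this toy run the model is called at three of the five steps. Which steps are skipped depends on how far each sample's latent has drifted, which is what makes the policy sample-specific rather than a fixed schedule.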
The Stability of Online Algorithms in Performative Prediction
Authors: Gabriele Farina, Juan Carlos Perdomo
First: 2026-02-27T17:35:03+00:00 · Latest: 2026-02-27T17:35:03+00:00
Abstract
The use of algorithmic predictions in decision-making leads to a feedback loop where the models we deploy actively influence the data distributions we see, and later use to retrain on. This dynamic was formalized by Perdomo et al. 2020 in their work on performative prediction. Our main result is an unconditional reduction showing that any no-regret algorithm deployed in performative settings converges to a (mixed) performatively stable equilibrium: a solution in which models actively shape data distributions in ways that their own predictions look optimal in hindsight. Prior to our work, all positive results in this area made strong restrictions on how models influenced distributions. By using a martingale argument and allowing randomization, we avoid any such assumption and sidestep recent hardness results for finding stable models. Lastly, on a more conceptual note, our connection sheds light on why common algorithms, like gradient descent, are naturally stabilizing and prevent runaway feedback loops. We hope our work enables future technical transfer of ideas between online optimization and performativity.
中文标题/摘要
标题:在线算法在表现性预测中的稳定性
算法预测在决策中的使用会导致反馈循环,其中我们部署的模型会积极影响我们所看到的数据分布,进而用于重新训练。Perdomo等人在2020年的表现性预测工作中对此动态进行了形式化。我们的主要结果是一个无条件的减少,表明任何在表现性环境中部署的无悔算法都会收敛到一个(混合)表现性稳定均衡:一种解决方案,在这种解决方案中,模型积极塑造数据分布的方式使得它们自己的预测在事后看来是最优的。在我们之前的工作中,该领域所有积极的结果都对模型如何影响分布做出了强烈的假设。通过使用鞅论证并允许随机化,我们避免了任何这样的假设,并绕过了最近关于寻找稳定模型的困难结果。最后,从概念上讲,我们的联系揭示了为什么常见的算法,如梯度下降,自然具有稳定性和防止失控反馈循环的原因。我们希望我们的工作能够促进在线优化与表现性之间的技术转移。
Summary / 总结
The paper studies the stability of online algorithms in performative prediction, where deployed models influence the data distributions they are later retrained on. Its main result is an unconditional reduction showing that any no-regret algorithm deployed in a performative setting converges to a (mixed) performatively stable equilibrium. By using a martingale argument and allowing randomization, the authors avoid the strong restrictions that prior positive results placed on how models influence distributions, and sidestep recent hardness results for finding stable models. The connection also sheds light on why common algorithms such as gradient descent are naturally stabilizing and prevent runaway feedback loops.
研究探讨了预测性决策中的反馈循环问题,即模型如何影响数据分布。研究展示了在预测性设置中,无遗憾算法会收敛到预测性稳定均衡。通过使用鞅论证和允许随机化,研究避免了之前的限制性假设,并提供了关于为什么梯度下降等常见算法具有稳定性的概念性见解。研究结果表明,这些算法能够自然地防止反馈循环失控,而不需要额外的约束。
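Performative stability can be illustrated with a one-dimensional toy problem. The sketch below is not the paper's reduction, which concerns general no-regret algorithms and mixed equilibria. It assumes a hypothetical setting in which the data mean shifts linearly with the deployed parameter (`mu + eps * theta`), uses squared loss, and works with the expected gradient so the run is deterministic.

```python
def performative_mean(theta, mu=1.0, eps=0.5):
    """Mean of the data distribution induced by deploying theta (toy model)."""
    return mu + eps * theta


def repeated_gradient_descent(theta0, steps, lr=0.5):
    """Deploy theta, observe the induced data, take one gradient step, repeat."""
    theta = theta0
    for _ in range(steps):
        z_mean = performative_mean(theta)
        # Expected gradient of 0.5 * (theta - z)^2, with the distribution
        # held fixed at the one induced by the current theta.
        grad = theta - z_mean
        theta -= lr * grad
    return theta


theta_final = repeated_gradient_descent(theta0=0.0, steps=200)

# A stable point satisfies theta = mu + eps * theta, so theta = mu / (1 - eps).
theta_stable = 1.0 / (1.0 - 0.5)
```

At the stable point the deployed parameter is optimal for the distribution it induces, so retraining no longer moves it. In this toy case each update is `theta <- 0.75 * theta + 0.5`, which contracts toward that point.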
Flow-Based Density Ratio Estimation for Intractable Distributions with Applications in Genomics
Authors: Egor Antipov, Alessandro Palma, Lorenzo Consoli, Stephan Günnemann, Andrea Dittadi, Fabian J. Theis
First: 2026-02-27T17:27:55+00:00 · Latest: 2026-02-27T17:27:55+00:00
Abstract
Estimating density ratios between pairs of intractable data distributions is a core problem in probabilistic modeling, enabling principled comparisons of sample likelihoods under different data-generating processes across conditions and covariates. While exact-likelihood models such as normalizing flows offer a promising approach to density ratio estimation, naive flow-based evaluations are computationally expensive, as they require simulating costly likelihood integrals for each distribution separately. In this work, we leverage condition-aware flow matching to derive a single dynamical formulation for tracking density ratios along generative trajectories. We demonstrate competitive performance on simulated benchmarks for closed-form ratio estimation, and show that our method supports versatile tasks in single-cell genomics data analysis, where likelihood-based comparisons of cellular states across experimental conditions enable treatment effect estimation and batch correction evaluation.
中文标题/摘要
标题:基于流的密度比率估计在不可处理分布中的应用,特别是在基因组学中的应用
在概率建模中,估计不可处理数据分布之间的密度比率是一个核心问题,它允许在不同数据生成过程、条件和协变量之间进行样本似然性的原则性比较。虽然归一化流等精确似然模型为密度比率估计提供了有希望的方法,但直接基于流的评估计算成本高昂,因为它们需要分别对每个分布模拟昂贵的似然积分。在本文中,我们利用条件感知流匹配来推导出一种沿生成轨迹跟踪密度比率的单一动力学公式。我们在模拟基准上展示了闭式比率估计的竞争性能,并展示了我们的方法在单细胞基因组学数据分析中的多功能性,其中基于似然性的细胞状态比较在不同实验条件之间使治疗效果估计和批次校正评估成为可能。
Summary / 总结
This study addresses the challenge of estimating density ratios between intractable distributions, which is crucial for comparing sample likelihoods under different data-generating processes. The authors propose a condition-aware flow matching method to derive a single dynamical formulation for tracking density ratios. They demonstrate competitive performance on simulated benchmarks and show that their method can be applied to single-cell genomics data analysis, enabling treatment effect estimation and batch correction evaluation.
该研究解决了估计不可处理分布之间密度比的问题,这对于在不同数据生成过程下比较似然性至关重要。作者提出了一种条件感知流匹配方法,以推导出跟踪密度比的单一动力学公式。他们在模拟基准上实现了竞争力的表现,并展示了该方法在单细胞基因组学中的应用,支持诸如治疗效果估计和批次校正评估等任务。
Carré du champ flow matching: better quality-generalisation tradeoff in generative models
Authors: Jacob Bamberger, Iolo Jones, Dennis Duncan, Michael M. Bronstein, Pierre Vandergheynst, Adam Gosztolai
First: 2025-10-07T13:41:33+00:00 · Latest: 2026-02-27T17:23:59+00:00
Abstract
Deep generative models often face a fundamental tradeoff: high sample quality can come at the cost of memorisation, where the model reproduces training data rather than generalising across the underlying data geometry. We introduce Carré du champ flow matching (CDC-FM), a generalisation of flow matching (FM), that improves the quality-generalisation tradeoff by regularising the probability path with a geometry-aware noise. Our method replaces the homogeneous, isotropic noise in FM with a spatially varying, anisotropic Gaussian noise whose covariance captures the local geometry of the latent data manifold. We prove that this geometric noise can be optimally estimated from the data and is scalable to large data. Further, we provide an extensive experimental evaluation on diverse datasets (synthetic manifolds, point clouds, single-cell genomics, animal motion capture, and images) as well as various neural network architectures (MLPs, CNNs, and transformers). We demonstrate that CDC-FM consistently offers a better quality-generalisation tradeoff. We observe significant improvements over standard FM in data-scarce regimes and in highly non-uniformly sampled datasets, which are often encountered in AI for science applications. Our work provides a mathematical framework for studying the interplay between data geometry, generalisation and memorisation in generative models, as well as a robust and scalable algorithm that can be readily integrated into existing flow matching pipelines.
中文标题/摘要
标题:Carré du champ 流匹配:生成模型中更优的质量-泛化权衡
深度生成模型经常面临一个基本的权衡:高样本质量可能以记忆化为代价,即模型复制训练数据,而不是在底层数据几何结构上泛化。我们引入了 Carré du champ 流匹配(CDC-FM),这是流匹配(FM)的一种推广,通过使用几何感知噪声来正则化概率路径,从而改善质量-泛化权衡。我们的方法用空间变化的各向异性高斯噪声替换FM中均匀的各向同性噪声,该噪声的协方差刻画了潜在数据流形的局部几何结构。我们证明这种几何噪声可以从数据中最优估计,并且可以扩展到大规模数据。此外,我们在多种数据集(合成流形、点云、单细胞基因组学、动物动作捕捉和图像)以及各种神经网络架构(MLP、CNN和Transformer)上进行了广泛的实验评估。我们证明CDC-FM始终提供更好的质量-泛化权衡。我们观察到,在数据稀缺的情形和高度非均匀采样的数据集上,CDC-FM相比标准FM有显著改进,而这些情形在科学人工智能(AI for science)应用中经常遇到。我们的工作为研究生成模型中数据几何、泛化和记忆化之间的相互作用提供了数学框架,并提供了一种稳健且可扩展的算法,可以轻松集成到现有的流匹配流程中。
Summary / 总结
The paper addresses the challenge of balancing sample quality and generalisation in deep generative models. It introduces Carré du champ flow matching (CDC-FM), which enhances this tradeoff by using geometry-aware noise instead of isotropic noise. Experiments on various datasets and architectures show that CDC-FM consistently improves the quality-generalisation tradeoff, especially in data-scarce and non-uniformly sampled scenarios.
论文旨在解决深度生成模型中样本质量和泛化之间的平衡问题。它引入了Carré du champ流匹配(CDC-FM)方法,通过使用几何感知噪声而非各向同性噪声来提升这种平衡。实验表明,CDC-FM在各种数据集和架构上能够一致地改善质量和泛化之间的权衡,特别是在数据稀缺和非均匀采样的情况下。
SelvaBox: A high-resolution dataset for tropical tree crown detection
Authors: Hugo Baudchon, Arthur Ouaknine, Martin Weiss, Mélisande Teng, Thomas R. Walla, Antoine Caron-Guay, Christopher Pal, Etienne Laliberté
First: 2025-06-30T18:23:30+00:00 · Latest: 2026-02-27T17:22:21+00:00
Abstract
Detecting individual tree crowns in tropical forests is essential to study these complex and crucial ecosystems impacted by human interventions and climate change. However, tropical crowns vary widely in size, structure, and pattern and are largely overlapping and intertwined, requiring advanced remote sensing methods applied to high-resolution imagery. Despite growing interest in tropical tree crown detection, annotated datasets remain scarce, hindering robust model development. We introduce SelvaBox, the largest open-access dataset for tropical tree crown detection in high-resolution drone imagery. It spans three countries and contains more than 83,000 manually labeled crowns - an order of magnitude larger than all previous tropical forest datasets combined. Extensive benchmarks on SelvaBox reveal two key findings: (1) higher-resolution inputs consistently boost detection accuracy; and (2) models trained exclusively on SelvaBox achieve competitive zero-shot detection performance on unseen tropical tree crown datasets, matching or exceeding competing methods. Furthermore, jointly training on SelvaBox and three other datasets at resolutions from 3 to 10 cm per pixel within a unified multi-resolution pipeline yields a detector ranking first or second across all evaluated datasets. Our dataset, code, and pre-trained weights are made public.
中文标题/摘要
标题:SelvaBox:高分辨率的热带树木树冠检测数据集
在热带森林中检测单个树冠对于研究这些受人类干预和气候变化影响的复杂而重要的生态系统至关重要。然而,热带树冠在大小、结构和形态上差异很大,且大多重叠交织,需要将先进的遥感方法应用于高分辨率影像。尽管对热带树冠检测的兴趣日益增长,但标注数据集仍然稀缺,阻碍了稳健模型的开发。我们介绍了SelvaBox,这是用于高分辨率无人机影像中热带树冠检测的最大开放获取数据集。它覆盖三个国家,包含超过83,000个手动标注的树冠,比之前所有热带森林数据集的总和大一个数量级。SelvaBox上的广泛基准测试揭示了两个关键发现:(1) 更高分辨率的输入始终能提升检测准确性;(2) 仅在SelvaBox上训练的模型在未见过的热带树冠数据集上实现了具有竞争力的零样本检测性能,与竞争方法相当或更优。此外,在统一的多分辨率流程中,将SelvaBox与另外三个数据集在每像素3至10厘米的分辨率下联合训练,得到的检测器在所有评估数据集上均排名第一或第二。我们的数据集、代码和预训练权重均已公开。
Summary / 总结
The research aims to improve the detection of individual tree crowns in tropical forests, which are crucial for studying ecosystems affected by human activities and climate change. The study introduces SelvaBox, a large-scale dataset for tropical tree crown detection using high-resolution drone imagery. Key findings include higher-resolution inputs enhancing detection accuracy and models trained on SelvaBox achieving competitive performance on unseen datasets. Joint training on SelvaBox and other datasets within a multi-resolution pipeline further improves detection performance across various resolutions.
研究旨在提高热带森林中单个树冠的检测,这对于研究受人类活动和气候变化影响的生态系统至关重要。研究引入了SelvaBox,这是一个包含来自三个国家超过83,000个手动标注树冠的高分辨率数据集,规模远超以往所有数据集。实验表明,高分辨率图像能提高检测准确性,而仅在SelvaBox上训练的模型在未见过的数据集上表现良好。将SelvaBox与其他不同分辨率的数据集联合训练,进一步提升了在多个数据集上的检测性能。
Uncertainty Quantification for Multimodal Large Language Models with Incoherence-adjusted Semantic Volume
Authors: Gregory Kang Ruey Lau, Hieu Dao, Nicole Kan Hui Lin, Bryan Kian Hsiang Low
Venue: ICLR 2025
First: 2026-02-27T17:18:42+00:00 · Latest: 2026-02-27T17:18:42+00:00
Comments: Earlier versions presented at ICLR 2025 QUESTION workshop and ICML 2025 R2-FM workshop
Abstract
Despite their capabilities, Multimodal Large Language Models (MLLMs) may produce plausible but erroneous outputs, hindering reliable deployment. Accurate uncertainty metrics could enable escalation of unreliable queries to human experts or larger models for improved performance. However, existing uncertainty metrics have practical constraints, such as being designed only for specific modalities, reliant on external tools, or computationally expensive. We introduce UMPIRE, a training-free uncertainty quantification framework for MLLMs that works efficiently across various input and output modalities without external tools, relying only on the models' own internal modality features. UMPIRE computes the incoherence-adjusted semantic volume of sampled MLLM responses for a given task instance, effectively capturing both the global semantic diversity of samples and the local incoherence of responses based on internal model confidence. We propose uncertainty desiderata for MLLMs and provide theoretical analysis motivating UMPIRE's design. Extensive experiments show that UMPIRE consistently outperforms baseline metrics in error detection and uncertainty calibration across image, audio, and video-text benchmarks, including adversarial and out-of-distribution settings. We also demonstrate UMPIRE's generalization to non-text output tasks, including image and audio generation.
中文标题/摘要
标题:基于不一致性调整语义体积的多模态大型语言模型不确定性量化
尽管具有强大的能力,多模态大型语言模型(MLLMs)可能会生成看似合理但实际上错误的输出,阻碍了可靠部署。准确的不确定性度量可以将不可靠的查询升级给人类专家或更大规模的模型以提高性能。然而,现有的不确定性度量存在实际限制,如仅针对特定模态设计、依赖外部工具或计算成本高昂。我们提出了UMPIRE,这是一种无需训练的MLLM不确定性量化框架,可以在各种输入和输出模态下高效工作,无需外部工具,仅依赖模型自身的内部模态特征。UMPIRE 针对给定任务实例计算采样MLLM响应的不一致性调整语义体积,有效地捕捉样本的全局语义多样性,以及基于模型内部置信度的响应局部不一致性。我们为MLLMs提出了不确定性度量应满足的性质(desiderata),并提供了支持UMPIRE设计的理论分析。广泛实验表明,在图像、音频和视频文本基准测试中(包括对抗性和分布外设置),UMPIRE 在错误检测和不确定性校准方面始终优于基线度量。我们还展示了UMPIRE在非文本输出任务中的泛化能力,包括图像和音频生成。
Summary / 总结
The research aims to address the issue of unreliable outputs from Multimodal Large Language Models (MLLMs) by developing UMPIRE, a training-free uncertainty quantification framework. UMPIRE computes the incoherence-adjusted semantic volume of MLLM responses, capturing both global semantic diversity and local incoherence. Experiments show that UMPIRE outperforms baseline metrics in error detection and uncertainty calibration across various modalities, including image, audio, and video-text benchmarks.
研究旨在通过开发无需训练的不确定性量化框架UMPIRE来解决多模态大型语言模型(MLLMs)输出不可靠的问题。UMPIRE计算MLLM响应的不一致性调整语义体积,同时捕捉全局语义多样性和局部不一致性。实验表明,UMPIRE在图像、音频和视频文本等各种模态的错误检测和不确定性校准中优于基线指标。
Multi-Objective Reinforcement Learning for Large-Scale Tote Allocation in Human-Robot Collaborative Fulfillment Centers
Authors: Sikata Sengupta, Guangyi Liu, Omer Gottesman, Joseph W Durham, Michael Kearns, Aaron Roth, Michael Caldara
First: 2026-02-27T17:04:57+00:00 · Latest: 2026-02-27T17:04:57+00:00
Abstract
Optimizing the consolidation process in container-based fulfillment centers requires trading off competing objectives such as processing speed, resource usage, and space utilization while adhering to a range of real-world operational constraints. This process involves moving items between containers via a combination of human and robotic workstations to free up space for inbound inventory and increase container utilization. We formulate this problem as a large-scale Multi-Objective Reinforcement Learning (MORL) task with high-dimensional state spaces and dynamic system behavior. Our method builds on recent theoretical advances in solving constrained RL problems via best-response and no-regret dynamics in zero-sum games, enabling principled minimax policy learning. Policy evaluation on realistic warehouse simulations shows that our approach effectively trades off objectives, and we empirically observe that it learns a single policy that simultaneously satisfies all constraints, even if this is not theoretically guaranteed. We further introduce a theoretical framework to handle the problem of error cancellation, where time-averaged solutions display oscillatory behavior. This method returns a single iterate whose Lagrangian value is close to the minimax value of the game. These results demonstrate the promise of MORL in solving complex, high-impact decision-making problems in large-scale industrial systems.
中文标题/摘要
标题:面向人机协作履约中心大规模周转箱分配的多目标强化学习
在基于容器的履约中心(fulfillment center)中优化合并流程,需要在处理速度、资源使用和空间利用等相互竞争的目标之间进行权衡,同时遵守一系列实际操作约束。该流程通过人工和机器人工作站的组合在容器之间移动物品,以便为入库库存腾出空间并提高容器利用率。我们将此问题形式化为具有高维状态空间和动态系统行为的大规模多目标强化学习(MORL)任务。我们的方法建立在通过零和博弈中的最佳响应和无悔动态求解约束强化学习问题的最新理论进展之上,从而实现有原则的极小极大策略学习。在现实仓库模拟中的策略评估表明,我们的方法能有效权衡各个目标;我们还在实验中观察到,它学到了一个同时满足所有约束的单一策略,尽管这在理论上并无保证。我们进一步引入了一个理论框架来处理误差抵消问题,即时间平均解表现出振荡行为的情形。该方法返回一个拉格朗日值接近博弈极小极大值的单个迭代。这些结果表明了MORL在解决大规模工业系统中复杂、高影响决策问题方面的潜力。
Summary / 总结
The paper formulates the consolidation process in container-based fulfillment centers as a large-scale multi-objective reinforcement learning (MORL) problem. The method builds on recent theoretical advances that solve constrained RL problems via best-response and no-regret dynamics in zero-sum games, learning a policy that trades off competing objectives such as processing speed, resource usage, and space utilization. Policy evaluation on realistic warehouse simulations shows that the approach empirically learns a single policy satisfying all constraints, even though this is not theoretically guaranteed. The paper also introduces a theoretical framework for error cancellation, where time-averaged solutions oscillate; it returns a single iterate whose Lagrangian value is close to the minimax value of the game.
论文将基于容器的履约中心中的合并流程形式化为大规模多目标强化学习(MORL)问题。该方法基于通过零和博弈中的最佳响应和无悔动态求解约束强化学习问题的最新理论进展,学习能够有效权衡处理速度、资源使用和空间利用率等竞争目标的策略。在现实仓库模拟中的评估表明,该方法在实践中学到了同时满足所有约束的单一策略,尽管这在理论上并无保证。论文还提出了一个处理误差抵消问题的理论框架,可返回拉格朗日值接近博弈极小极大值的单个迭代。
Manifold of Failure: Behavioral Attraction Basins in Language Models
Authors: Sarthak Munshi, Manish Bhatt, Vineeth Sai Narajala, Idan Habler, Ammar Al-Kahfah, Ken Huang, Blake Gatto
First: 2026-02-25T15:08:20+00:00 · Latest: 2026-02-27T17:04:02+00:00
Abstract
While prior work has focused on projecting adversarial examples back onto the manifold of natural data to restore safety, we argue that a comprehensive understanding of AI safety requires characterizing the unsafe regions themselves. This paper introduces a framework for systematically mapping the Manifold of Failure in Large Language Models (LLMs). We reframe the search for vulnerabilities as a quality diversity problem, using MAP-Elites to illuminate the continuous topology of these failure regions, which we term behavioral attraction basins. Our quality metric, Alignment Deviation, guides the search towards areas where the model's behavior diverges most from its intended alignment. Across three LLMs: Llama-3-8B, GPT-OSS-20B, and GPT-5-Mini, we show that MAP-Elites achieves up to 63% behavioral coverage, discovers up to 370 distinct vulnerability niches, and reveals dramatically different model-specific topological signatures: Llama-3-8B exhibits a near-universal vulnerability plateau (mean Alignment Deviation 0.93), GPT-OSS-20B shows a fragmented landscape with spatially concentrated basins (mean 0.73), and GPT-5-Mini demonstrates strong robustness with a ceiling at 0.50. Our approach produces interpretable, global maps of each model's safety landscape that no existing attack method (GCG, PAIR, or TAP) can provide, shifting the paradigm from finding discrete failures to understanding their underlying structure.
中文标题/摘要
标题:失败流形:语言模型的行为吸引盆地
尽管先前的工作集中在将对抗性示例投影回自然数据流形以恢复安全性,但我们认为,全面理解人工智能安全性需要表征不安全区域本身。本文介绍了一种系统映射大型语言模型(LLMs)失败流形的框架。我们将寻找漏洞的问题重新定义为质量多样性问题,使用MAP-Elites照亮这些失败区域的连续拓扑结构,我们称其为行为吸引盆地。我们的质量度量,对齐偏差,引导搜索向模型行为与预期对齐最偏离的区域。在三个LLM:Llama-3-8B、GPT-OSS-20B和GPT-5-Mini中,我们展示了MAP-Elites实现了高达63%的行为覆盖率,发现了多达370个独特的漏洞生态位,并揭示了不同模型特有的拓扑特征:Llama-3-8B表现出几乎普遍的漏洞平台(平均对齐偏差0.93),GPT-OSS-20B显示了一个碎片化的景观,空间上集中的盆地(平均0.73),而GPT-5-Mini则表现出强大的鲁棒性,天花板为0.50。我们的方法生成了每个模型安全景观的可解释全局图,这是目前任何现有攻击方法(GCG、PAIR或TAP)都无法提供的,将范式从寻找离散的失败转移到理解其潜在结构。
Summary / 总结
This paper introduces a framework to map the Manifold of Failure in Large Language Models (LLMs) by using MAP-Elites to identify behavioral attraction basins where model behavior deviates from intended alignment. Across three LLMs, the approach uncovered up to 370 distinct vulnerability niches, with different topological signatures: Llama-3-8B showed a near-universal vulnerability plateau, GPT-OSS-20B had a fragmented landscape, and GPT-5-Mini demonstrated strong robustness. The method provides interpretable, global safety maps that existing attack methods cannot offer, shifting the focus from discrete failures to their underlying structure.
本文通过表征语言模型中的不安全区域,旨在全面理解AI安全性。它引入了使用MAP-Elites绘制Manifold of Failure的方法,其中包括行为吸引盆地。在三种LLM中,该方法实现了高达63%的行为覆盖,发现了多达370个不同的漏洞领域,并揭示了不同模型特有的拓扑特征,突显了这些不安全区域的模型特定性质。
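The MAP-Elites search named in the abstract above can be illustrated with a minimal, generic sketch. This is not the authors' pipeline: the 2-number "solution", the 1D behavior descriptor, and `toy_evaluate` (a stand-in for a quality score such as Alignment Deviation) are all hypothetical, chosen only to show how an archive keeps the best solution per behavior cell and how coverage is measured.

```python
import random

def map_elites(evaluate, n_bins=10, n_init=20, n_iters=500, sigma=0.1, seed=0):
    """Generic MAP-Elites loop over a 1D behavior grid with n_bins cells."""
    rng = random.Random(seed)
    archive = {}  # cell index -> (quality, solution)

    def try_insert(solution):
        descriptor, quality = evaluate(solution)
        cell = min(int(descriptor * n_bins), n_bins - 1)
        # Keep the candidate only if its cell is empty or it beats the current elite.
        if cell not in archive or quality > archive[cell][0]:
            archive[cell] = (quality, solution)

    # Random initialization.
    for _ in range(n_init):
        try_insert((rng.random(), rng.random()))
    # Repeatedly pick a stored elite, mutate it, and try to place the child.
    for _ in range(n_iters):
        _, parent = rng.choice(list(archive.values()))
        child = tuple(min(1.0, max(0.0, g + rng.gauss(0.0, sigma))) for g in parent)
        try_insert(child)
    return archive

def toy_evaluate(solution):
    """Hypothetical evaluator: descriptor is x, quality peaks near x = 0.5."""
    x, y = solution
    return x, y * (1.0 - abs(x - 0.5))

archive = map_elites(toy_evaluate)
coverage = len(archive) / 10  # fraction of behavior cells holding an elite
```

Coverage here plays the role of the "behavioral coverage" figure reported in the abstract, while the per-cell quality values form the map of the landscape.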
Learning Flexible Job Shop Scheduling under Limited Buffers and Material Kitting Constraints
Authors: Shishun Zhang, Juzhan Xu, Yidan Fan, Chenyang Zhu, Ruizhen Hu, Yongjun Wang, Kai Xu
First: 2026-02-27T17:03:34+00:00 · Latest: 2026-02-27T17:03:34+00:00
Comments: 8 pages, 8 figures, conference
Abstract
The Flexible Job Shop Scheduling Problem (FJSP) originates from real production lines, while some practical constraints are often ignored or idealized in current FJSP studies, among which the limited buffer problem has a particular impact on production efficiency. To this end, we study an extended problem that is closer to practical scenarios--the Flexible Job Shop Scheduling Problem with Limited Buffers and Material Kitting. In recent years, deep reinforcement learning (DRL) has demonstrated considerable potential in scheduling tasks. However, its capacity for state modeling remains limited when handling complex dependencies and long-term constraints. To address this, we leverage a heterogeneous graph network within the DRL framework to model the global state. By constructing efficient message passing among machines, operations, and buffers, the network focuses on avoiding decisions that may cause frequent pallet changes during long-sequence scheduling, thereby helping improve buffer utilization and overall decision quality. Experimental results on both synthetic and real production line datasets show that the proposed method outperforms traditional heuristics and advanced DRL methods in terms of makespan and pallet changes, and also achieves a good balance between solution quality and computational cost. Furthermore, a supplementary video is provided to showcase a simulation system that effectively visualizes the progression of the production line.
中文标题/摘要
标题:有限缓冲和物料齐套约束下的柔性作业车间调度学习
柔性作业车间调度问题(FJSP)源自实际生产线,但当前的FJSP研究往往忽略或理想化了一些实际约束,其中有限缓冲问题对生产效率的影响尤为突出。为此,我们研究了一个更接近实际场景的扩展问题:带有限缓冲和物料齐套的柔性作业车间调度问题。近年来,深度强化学习(DRL)在调度任务中显示出相当大的潜力。然而,它在处理复杂依赖和长期约束时的状态建模能力仍然有限。为了解决这一问题,我们在DRL框架中利用异构图网络来建模全局状态。通过在机器、工序和缓冲区之间构建高效的消息传递,该网络着重避免在长序列调度中可能导致频繁托盘更换的决策,从而有助于提高缓冲利用率和整体决策质量。在合成和实际生产线数据集上的实验结果表明,所提出的方法在完工时间(makespan)和托盘更换次数方面优于传统启发式方法和先进的DRL方法,并且在解的质量和计算成本之间取得了良好的平衡。此外,还提供了一个补充视频,展示了能够有效可视化生产线运行过程的仿真系统。
Summary / 总结
The study addresses the Flexible Job Shop Scheduling Problem with limited buffers and material kitting constraints, which is more practical than previous idealized models. It proposes a method using a heterogeneous graph network within the deep reinforcement learning framework to model the global state, focusing on avoiding frequent pallet changes. The method outperforms traditional heuristics and advanced DRL methods in terms of makespan and pallet changes, and balances solution quality and computational cost effectively. Experimental results on synthetic and real datasets demonstrate its effectiveness.
研究关注带有限缓冲和物料齐套约束的柔性作业车间调度问题,旨在处理当前研究中常被忽略的实际约束。提出了一种在深度强化学习框架中使用异构图网络来建模全局状态的方法,着重避免频繁的托盘更换。实验结果表明,该方法在合成和实际生产线数据集上都优于传统启发式方法和先进的DRL方法,在完工时间(makespan)和托盘更换次数方面表现更好,同时平衡了解的质量和计算成本。
What Makes a Reward Model a Good Teacher? An Optimization Perspective
Authors: Noam Razin, Zixuan Wang, Hubert Strauss, Stanley Wei, Jason D. Lee, Sanjeev Arora
Venue: NeurIPS 2025
First: 2025-03-19T17:54:41+00:00 · Latest: 2026-02-27T17:00:32+00:00
Comments: Accepted to NeurIPS 2025; Code available at https://github.com/princeton-pli/what-makes-good-rm
Abstract
The success of Reinforcement Learning from Human Feedback (RLHF) critically depends on the quality of the reward model. However, while this quality is primarily evaluated through accuracy, it remains unclear whether accuracy fully captures what makes a reward model an effective teacher. We address this question from an optimization perspective. First, we prove that regardless of how accurate a reward model is, if it induces low reward variance, then the RLHF objective suffers from a flat landscape. Consequently, even a perfectly accurate reward model can lead to extremely slow optimization, underperforming less accurate models that induce higher reward variance. We additionally show that a reward model that works well for one language model can induce low reward variance, and thus a flat objective landscape, for another. These results establish a fundamental limitation of evaluating reward models solely based on accuracy or independently of the language model they guide. Experiments using models of up to 8B parameters corroborate our theory, demonstrating the interplay between reward variance, accuracy, and reward maximization rate. Overall, our findings highlight that beyond accuracy, a reward model needs to induce sufficient variance for efficient optimization.
中文标题/摘要
标题:什么让奖励模型成为好的教师?从优化的角度来看
人类反馈强化学习(RLHF)的成功在很大程度上取决于奖励模型的质量。然而,尽管质量主要通过准确性来评估,但尚不清楚准确性是否完全捕捉到一个奖励模型成为有效教师的全部因素。我们从优化的角度来回答这个问题。首先,我们证明,无论奖励模型的准确性如何,如果它导致低奖励方差,那么RLHF目标将遭受平坦的景观。因此,即使一个完全准确的奖励模型也可能导致极其缓慢的优化,而不如那些诱导更高奖励方差的不那么准确的模型表现好。我们还表明,一个对一种语言模型有效的奖励模型可能会导致另一种语言模型的低奖励方差,从而导致目标景观的平坦化。这些结果确立了仅基于准确性和独立于引导的语言模型来评估奖励模型的基本局限性。使用多达8B参数的模型进行的实验证实了我们的理论,展示了奖励方差、准确性和奖励最大化率之间的相互作用。总体而言,我们的研究结果强调,除了准确性之外,奖励模型还需要诱导足够的方差以实现高效的优化。
Summary / 总结
The paper explores the factors that make a reward model effective in Reinforcement Learning from Human Feedback (RLHF). It shows that a reward model's accuracy alone does not guarantee efficient optimization, as low reward variance can lead to a flat landscape, causing slow optimization even with a perfectly accurate model. The study also demonstrates that a reward model effective for one language model might not be optimal for another, highlighting the interplay between reward variance, accuracy, and optimization speed. Experiments with models up to 8B parameters support these findings.
论文探讨了什么使得奖励模型在人类反馈强化学习(RLHF)中成为一个有效的教师。研究表明,奖励模型的准确性并不能保证其有效性,还需要能够产生足够的奖励方差以避免优化过程中的平坦景观。研究证明,即使一个模型非常准确,但如果它产生的奖励方差低,也可能无效,而准确性较低但方差高的模型可能表现更好。实验结果证实,奖励模型的适用性取决于其能否产生适当的方差,强调在评估奖励模型时需要同时考虑准确性和方差。
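The link between low reward variance and a flat objective can be seen in a toy softmax policy, sketched below. This is a simplified illustration rather than the paper's setting: a policy over four fixed outputs with logits `logits`, and two hypothetical reward models (`sharp_rm`, `flat_rm`) that rank the outputs identically but differ in spread. For such a policy the gradient of the expected reward is p_i (r_i - E[r]), so its squared norm is at most the reward variance.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reward_stats(logits, rewards):
    """Return (reward variance, gradient norm of expected reward w.r.t. logits)."""
    p = softmax(logits)
    mean = sum(pi * ri for pi, ri in zip(p, rewards))
    var = sum(pi * (ri - mean) ** 2 for pi, ri in zip(p, rewards))
    # d/d logit_i of sum_j p_j r_j equals p_i * (r_i - mean) for a softmax policy.
    grad = [pi * (ri - mean) for pi, ri in zip(p, rewards)]
    grad_norm = math.sqrt(sum(g * g for g in grad))
    return var, grad_norm

logits = [0.0, 0.0, 0.0, 0.0]            # uniform policy over four outputs
sharp_rm = [1.0, 0.6, 0.3, 0.0]          # hypothetical reward model, wide spread
flat_rm = [0.503, 0.502, 0.501, 0.500]   # same ranking, tiny spread

same_ranking = (sorted(range(4), key=sharp_rm.__getitem__)
                == sorted(range(4), key=flat_rm.__getitem__))
var_sharp, g_sharp = reward_stats(logits, sharp_rm)
var_flat, g_flat = reward_stats(logits, flat_rm)
```

Both reward models agree on every pairwise comparison, so an accuracy metric cannot tell them apart, yet `flat_rm` yields a far smaller gradient and therefore slower progress under gradient ascent.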
LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics
Authors: Antoine Peyronnet, Fabian Gloeckle, Amaury Hayat
First: 2026-02-27T16:52:52+00:00 · Latest: 2026-02-27T16:52:52+00:00
Comments: 15 pages, 3 figures, 5 Tables
Abstract
We present a new approach for benchmarking Large Language Model (LLM) capabilities on research-level mathematics. Existing benchmarks largely rely on static, hand-curated sets of contest or textbook-style problems as proxies for mathematical research. Instead, we establish an updatable benchmark evaluating models directly on the latest research results in mathematics. This consists of an automatic pipeline that extracts lemmas from arXiv and rewrites them into self-contained statements by making all assumptions and required definitions explicit. It results in a benchmark that can be updated regularly with new problems taken directly from human mathematical research, while previous instances can be used for training without compromising future evaluations. We benchmark current state-of-the-art LLMs, which obtain around 10-15% accuracy in theorem proving (pass@1) depending on the model, showing that there is currently a large margin of progression for LLMs to reach human-level proving capabilities in a research context.
中文标题/摘要
标题:LemmaBench:一种评估大型语言模型在数学领域研究能力的实时基准
我们提出了一种新的方法,用于评估大型语言模型(LLM)在研究级数学领域的能力。现有的基准主要依赖于静态的手工整理的竞赛或教科书风格的问题集作为数学研究的代理。相反,我们建立了一个可更新的基准,直接评估模型在数学最新研究成果上的表现。这包括一个自动流水线,从arXiv中提取引理,并将其重写为自包含的陈述,使所有假设和所需定义都明确。这导致了一个基准,可以定期更新新的问题,这些问题是直接从人类数学研究中提取的,而之前的实例可以用于训练,而不会影响未来的评估。我们对当前最先进的LLM进行了基准测试,这些模型在定理证明(pass@1)方面的准确率约为10-15%,具体取决于模型,这表明LLM在研究环境中达到人类级证明能力方面还有很大的进步空间。
Summary / 总结
LemmaBench is a new benchmark for evaluating LLMs in research-level mathematics by using lemmas from arXiv, rewritten into self-contained statements. This approach allows for regular updates with the latest research problems, enabling training without compromising future evaluations. Current LLMs achieve around 10-15% accuracy in theorem proving, indicating significant room for improvement to reach human-level capabilities.
LemmaBench 是一个新的基准,用于通过使用来自 arXiv 的最新引理来评估 LLM 在研究级数学中的能力。它不同于现有的使用静态、手工编纂的问题的基准。该基准会自动提取并重写引理为自包含陈述,从而可以定期更新。当前的 LLM 在定理证明中的准确率约为 10-15%,表明在这一领域与人类水平之间存在显著差距。
ArgLLM-App: An Interactive System for Argumentative Reasoning with Large Language Models
Authors: Adam Dejl, Deniz Gorur, Francesca Toni
First: 2026-02-27T16:52:27+00:00 · Latest: 2026-02-27T16:52:27+00:00
Comments: AAMAS 2026 Demonstration Track
Abstract
Argumentative LLMs (ArgLLMs) are an existing approach leveraging Large Language Models (LLMs) and computational argumentation for decision-making, with the aim of making the resulting decisions faithfully explainable to and contestable by humans. Here we propose a web-based system implementing ArgLLM-empowered agents for binary tasks. ArgLLM-App supports visualisation of the produced explanations and interaction with human users, allowing them to identify and contest any mistakes in the system's reasoning. It is highly modular and enables drawing information from trusted external sources. ArgLLM-App is publicly available at https://argllm.app, with a video demonstration at https://youtu.be/vzwlGOr0sPM.
中文标题/摘要
标题:ArgLLM-App:一种基于大型语言模型的论证推理交互系统
论证型大语言模型(ArgLLMs)是一种利用大型语言模型(LLMs)和计算论证来辅助决策的方法,旨在使决策结果能够忠实解释并可由人类质疑。在此,我们提出了一种基于网络的系统,该系统实现了具有ArgLLM能力的代理,用于二元任务。ArgLLM-App 支持生成解释的可视化展示,并允许与人类用户互动,使用户能够识别并质疑系统推理中的任何错误。该系统高度模块化,能够从可信赖的外部来源获取信息。ArgLLM-App 公开可用,网址为 https://argllm.app,视频演示可在 https://youtu.be/vzwlGOr0sPM 查看。
Summary / 总结
The paper presents ArgLLM-App, an interactive web-based system that uses Argumentative Large Language Models (ArgLLMs) to support decision-making on binary tasks, making the reasoning explainable to and contestable by humans. The system visualizes the produced explanations and lets users interact with them to identify and contest mistakes in the system's reasoning. It is modular and can draw information from trusted external sources.
该论文介绍了ArgLLM-App,这是一个基于网页的交互系统,利用论辩型大型语言模型(ArgLLMs)支持二元任务的决策,使推理过程能够向人类解释并可被人类质疑。该系统将生成的解释可视化,并允许用户与之交互,以识别并质疑系统推理中的错误。它具有模块化特性,并能从可信的外部来源获取信息。
TokenCLIP: Token-wise Prompt Learning for Zero-shot Anomaly Detection
Authors: Qihang Zhou, Binbin Gao, Guansong Pang, Xin Wang, Jiming Chen, Shibo He
First: 2025-10-24T05:51:31+00:00 · Latest: 2026-02-27T16:43:57+00:00
Abstract
Adapting CLIP for anomaly detection on unseen objects has shown strong potential in a zero-shot manner. However, existing methods typically rely on a single textual space to align with visual semantics across diverse objects and domains. The indiscriminate alignment hinders the model from accurately capturing varied anomaly semantics. We propose TokenCLIP, a token-wise adaptation framework that enables dynamic alignment between visual and learnable textual spaces for fine-grained anomaly learning. Rather than mapping all visual tokens to a single, token-agnostic textual space, TokenCLIP aligns each token with a customized textual subspace that represents its visual characteristics. Explicitly assigning a unique learnable textual space to each token is computationally intractable and prone to insufficient optimization. We instead expand the token-agnostic textual space into a set of orthogonal subspaces, and then dynamically assign each token to a subspace combination guided by semantic affinity, which jointly supports customized and efficient token-wise adaptation. To this end, we formulate dynamic alignment as an optimal transport problem, where all visual tokens in an image are transported to textual subspaces based on semantic similarity. The transport constraints of OT ensure sufficient optimization across subspaces and encourage them to focus on different semantics. Solving the problem yields a transport plan that adaptively assigns each token to semantically relevant subspaces. A top-k masking is then applied to sparsify the plan and specialize subspaces for distinct visual regions. Extensive experiments demonstrate the superiority of TokenCLIP.
中文标题/摘要
标题:TokenCLIP:基于Token的提示学习在零样本异常检测中的应用
以零样本方式将CLIP适配到未见过对象的异常检测已显示出强大的潜力。然而,现有方法通常依赖单一的文本空间来与不同对象和领域的视觉语义对齐。这种无差别的对齐阻碍了模型准确捕捉多样的异常语义。我们提出TokenCLIP,一种逐Token的适配框架,使视觉空间与可学习的文本空间之间能够动态对齐,以实现细粒度的异常学习。TokenCLIP 不是将所有视觉Token映射到单一的、与Token无关的文本空间,而是将每个Token与一个表征其视觉特征的定制文本子空间对齐。为每个Token显式分配一个独有的可学习文本空间在计算上不可行,并且容易导致优化不足。因此,我们将与Token无关的文本空间扩展为一组正交子空间,然后在语义亲和度的引导下将每个Token动态分配给一个子空间组合,从而同时支持定制化且高效的逐Token适配。为此,我们将动态对齐形式化为一个最优传输(OT)问题,其中图像中的所有视觉Token根据语义相似性被传输到文本子空间。OT的传输约束确保各子空间得到充分优化,并鼓励它们关注不同的语义。求解该问题得到一个传输计划,自适应地将每个Token分配到语义相关的子空间。然后应用top-k掩码来稀疏化该计划,并使各子空间专门对应不同的视觉区域。大量实验证明了TokenCLIP的优越性。
Summary / 总结
TokenCLIP is a token-wise adaptation framework for zero-shot anomaly detection, which dynamically aligns visual and learnable textual spaces by assigning each visual token to a customized textual subspace based on semantic affinity. This method overcomes the limitations of existing single-textual-space approaches and improves the accuracy of anomaly detection. Experiments show that TokenCLIP outperforms existing methods in various scenarios.
TokenCLIP 是一种用于零样本异常检测的逐 token 适配框架,通过基于语义亲和度将每个视觉 token 分配到定制化的文本子空间,动态对齐视觉空间和可学习的文本空间。这种方法克服了现有方法使用单一文本空间的局限性,从而实现更准确的异常检测。实验表明,TokenCLIP 在各种场景中优于现有方法。
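The optimal-transport assignment followed by top-k masking described in the TokenCLIP abstract can be sketched with standard entropic Sinkhorn iterations. This is an illustrative sketch under stated assumptions, not the paper's implementation: the similarity matrix is made up, the marginals are assumed uniform, the cost is taken as 1 minus similarity, and `eps`, `n_iters`, and both function names are hypothetical.

```python
import math

def sinkhorn_plan(similarity, eps=0.5, n_iters=100):
    """Entropic OT between n tokens (rows) and m subspaces (columns), uniform marginals."""
    n, m = len(similarity), len(similarity[0])
    # Gibbs kernel from the assumed cost 1 - similarity.
    kernel = [[math.exp(-(1.0 - s) / eps) for s in row] for row in similarity]
    row_mass, col_mass = 1.0 / n, 1.0 / m
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [row_mass / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def top_k_mask(plan, k):
    """Keep the k largest entries in each token's row and zero out the rest."""
    masked = []
    for row in plan:
        keep = set(sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k])
        masked.append([p if j in keep else 0.0 for j, p in enumerate(row)])
    return masked

# Made-up similarities: 4 visual tokens x 2 textual subspaces.
similarity = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
plan = sinkhorn_plan(similarity)
sparse_plan = top_k_mask(plan, k=1)
assignment = [max(range(len(row)), key=lambda j: row[j]) for row in sparse_plan]
```

The column marginal is what forces every subspace to receive transport mass, which corresponds to the abstract's point that the transport constraints ensure sufficient optimization across subspaces.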
ConstraintBench: Benchmarking LLM Constraint Reasoning on Direct Optimization
Authors: Joseph Tso, Preston Schmittou, Quan Huynh, Jibran Hutchins
First: 2026-02-25T22:54:26+00:00 · Latest: 2026-02-27T16:43:01+00:00
Comments: Preprint. 10 pages, 1 figure, 6 tables. Benchmark and evaluation code will be publicly released
Abstract
Large language models are increasingly applied to operational decision-making where the underlying structure is constrained optimization. Existing benchmarks evaluate whether LLMs can formulate optimization problems as solver code, but leave open a complementary question. Can LLMs directly produce correct solutions to fully specified constrained optimization problems without access to a solver? We introduce ConstraintBench, a benchmark for evaluating LLMs on direct constrained optimization across 10 operations research domains, with all ground-truth solutions verified by the Gurobi solver. Each task presents a natural-language scenario with entities, constraints, and an optimization objective; the model must return a structured solution that a deterministic verifier checks against every constraint and the solver-proven optimum. We evaluate six frontier models on 200 tasks and find that feasibility, not optimality, is the primary bottleneck. The best model achieves only 65.0% feasibility, yet feasible solutions average 89 to 96% of the Gurobi-optimal objective. No model exceeds 30.5% on joint feasibility and optimality within 0.1% of the solver reference. Per-domain analysis shows large variation in difficulty, with average feasibility spanning from 85.0% in the facility location domain to 0.8% in the crew assignment domain. Further, systematic failure modes include duration constraint misunderstanding, entity hallucination, and a feasibility-optimality decoupling in facility location and vehicle routing where models achieve high feasibility but 0% optimality. ConstraintBench and all evaluation infrastructure will be publicly released.
中文标题/摘要
标题:ConstraintBench:评估大型语言模型直接优化约束推理基准
大型语言模型越来越多地应用于以约束优化为底层结构的运营决策。现有基准评估大型语言模型能否将优化问题表述为求解器代码,但留下了一个互补问题:大型语言模型能否在不使用求解器的情况下,直接为完全指定的约束优化问题给出正确解?我们引入了ConstraintBench,这是一个在10个运筹学领域评估大型语言模型直接约束优化能力的基准,所有真值解均由Gurobi求解器验证。每个任务给出一个包含实体、约束和优化目标的自然语言场景;模型必须返回一个结构化解,由确定性验证器对照每个约束和求解器证明的最优值进行检查。我们在200个任务上评估了六个前沿模型,发现主要瓶颈是可行性而非最优性。最佳模型的可行率仅为65.0%,但可行解的目标值平均达到Gurobi最优值的89%到96%。在联合可行性与最优性(目标值在求解器参考值的0.1%以内)指标上,没有模型超过30.5%。按领域分析显示难度差异很大,平均可行率从设施选址领域的85.0%到机组分配领域的0.8%不等。此外,系统性失败模式包括对持续时间约束的误解、实体幻觉,以及设施选址和车辆路径问题中可行性与最优性的脱钩:模型在这些领域可行率很高,但最优率为0%。ConstraintBench及全部评估基础设施将公开发布。
Summary / 总结
ConstraintBench evaluates large language models (LLMs) in directly solving constrained optimization problems without solver access. It consists of 200 tasks across 10 operations research domains, with solutions verified by Gurobi. The primary challenge is feasibility rather than optimality, with the best model achieving only 65.0% feasibility. Feasible solutions are close to optimal, averaging 89 to 96% of the Gurobi-optimal objective. The facility location domain is the easiest, with 85.0% feasibility, while the crew assignment domain is the hardest, at 0.8%. Common failure modes include misunderstanding duration constraints and entity hallucination.
研究旨在评估大型语言模型(LLMs)在没有求解器访问的情况下直接解决约束优化问题的能力。研究引入了ConstraintBench,这是一个覆盖10个运筹学领域的基准,要求LLMs返回结构化解,并由验证器对照每个约束和求解器证明的最优值进行检查。六个前沿模型在200个任务上的评估结果显示,可行性是主要挑战,最佳模型的可行率仅为65.0%。然而,可行解的目标值平均达到Gurobi最优值的89%到96%。在联合可行性与最优性(目标值在求解器参考值的0.1%以内)指标上,没有模型超过30.5%。设施选址领域最容易,平均可行率为85.0%,而机组分配领域最难,平均可行率仅为0.8%。常见的失败模式包括误解持续时间约束和实体幻觉。
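The kind of deterministic check described in the ConstraintBench abstract (every constraint, then the solver-proven optimum within 0.1%) can be sketched on a toy knapsack task. The task schema, the `verify` function, and the brute-force reference that stands in for Gurobi are all assumptions made for illustration; the benchmark's actual format has not been released.

```python
from itertools import combinations

def verify(task, solution, reference_optimum, rel_tol=1e-3):
    """Check a proposed item selection for feasibility, then for near-optimality."""
    items = task["items"]
    unknown = [name for name in solution if name not in items]
    if unknown:  # entity hallucination: the answer names items that do not exist
        return {"feasible": False, "optimal": False, "reason": "unknown entity"}
    if len(set(solution)) != len(solution):
        return {"feasible": False, "optimal": False, "reason": "duplicate entity"}
    if sum(items[n]["weight"] for n in solution) > task["capacity"]:
        return {"feasible": False, "optimal": False, "reason": "capacity exceeded"}
    value = sum(items[n]["value"] for n in solution)
    # Optimal means within rel_tol (0.1%) of the reference objective.
    optimal = value >= (1.0 - rel_tol) * reference_optimum
    return {"feasible": True, "optimal": optimal, "reason": "ok"}

def brute_force_optimum(task):
    """Stand-in for a solver: best value over all feasible subsets."""
    names = list(task["items"])
    best = 0
    for r in range(len(names) + 1):
        for subset in combinations(names, r):
            weight = sum(task["items"][n]["weight"] for n in subset)
            if weight <= task["capacity"]:
                best = max(best, sum(task["items"][n]["value"] for n in subset))
    return best

task = {
    "capacity": 5,
    "items": {
        "a": {"weight": 3, "value": 4},
        "b": {"weight": 4, "value": 5},
        "c": {"weight": 2, "value": 3},
    },
}
reference = brute_force_optimum(task)
report_best = verify(task, ["a", "c"], reference)
report_suboptimal = verify(task, ["b"], reference)
report_infeasible = verify(task, ["b", "c"], reference)
report_hallucinated = verify(task, ["a", "z"], reference)
```

Separating the feasibility check from the optimality check mirrors how the abstract reports the two rates independently, which is what exposes feasibility as the main bottleneck.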
Inference-time Physics Alignment of Video Generative Models with Latent World Models
Authors: Jianhao Yuan, Xiaofeng Zhang, Felix Friedrich, Nicolas Beltran-Velez, Melissa Hall, Reyhane Askari-Hemmat, Xiaochuang Han, Nicolas Ballas, Michal Drozdzal, Adriana Romero-Soriano
First: 2026-01-15T16:18:00+00:00 · Latest: 2026-02-27T16:41:42+00:00
Comments: 22 pages, 10 figures
Abstract
State-of-the-art video generative models produce promising visual content yet often violate basic physics principles, limiting their utility. While some attribute this deficiency to insufficient physics understanding from pre-training, we find that the shortfall in physics plausibility also stems from suboptimal inference strategies. We therefore introduce WMReward and treat improving physics plausibility of video generation as an inference-time alignment problem. In particular, we leverage the strong physics prior of a latent world model (here, VJEPA-2) as a reward to search and steer multiple candidate denoising trajectories, enabling scaling test-time compute for better generation performance. Empirically, our approach substantially improves physics plausibility across image-conditioned, multiframe-conditioned, and text-conditioned generation settings, with validation from human preference study. Notably, in the ICCV 2025 Perception Test PhysicsIQ Challenge, we achieve a final score of 62.64%, winning first place and outperforming the previous state of the art by 7.42%. Our work demonstrates the viability of using latent world models to improve physics plausibility of video generation, beyond this specific instantiation or parameterization.
中文标题/摘要
标题:视频生成模型与潜在世界模型的推理时物理对齐
最先进的视频生成模型能够生成令人印象深刻的视觉内容,但往往违反基本的物理原理,限制了它们的应用。虽然有人认为这种缺陷源于预训练时对物理理解不足,但我们发现,物理合理性不足也源于不理想的推理策略。因此,我们引入了WMReward,并将提高视频生成的物理合理性视为一种推理时的对齐问题。具体来说,我们利用潜在世界模型(这里为VJEPA-2)的强物理先验作为奖励,搜索和引导多个候选去噪轨迹,从而实现测试时计算量的扩展,以提高生成性能。实验证明,我们的方法在图像条件、多帧条件和文本条件生成设置中显著提高了物理合理性,得到了人类偏好研究的验证。值得注意的是,在ICCV 2025 Perception Test PhysicsIQ挑战中,我们获得了62.64%的最终得分,获得第一名,并且超越了之前的最先进的技术水平7.42%。我们的工作证明了使用潜在世界模型提高视频生成的物理合理性的可行性,超越了这一特定实例或参数化。
Summary / 总结
This paper addresses video generative models' violations of basic physics by proposing WMReward, which treats physics plausibility as an inference-time alignment problem. Using the physics prior of a latent world model (VJEPA-2) as a reward, the method searches over and steers multiple candidate denoising trajectories, spending additional test-time compute to obtain better generations. The approach substantially improves physics plausibility across image-conditioned, multiframe-conditioned, and text-conditioned generation settings, and won first place in the ICCV 2025 Perception Test PhysicsIQ Challenge with a score of 62.64%, surpassing the previous state of the art by 7.42%.
本文通过提出WMReward,将物理合理性视为推理时的对齐问题,解决了视频生成模型违反物理原理的问题。通过利用潜在世界模型的强大物理先验,该方法搜索并引导多个候选去噪轨迹,从而提升生成性能。该方法在不同生成设置中显著提高了物理合理性,并在ICCV 2025 Perception Test PhysicsIQ挑战赛中以62.64%的得分获得第一名,超越了之前的最先进水平7.42%。
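The search step described in the abstract (score several candidates with a reward, keep the best) can be illustrated with a minimal best-of-N sketch. Everything here is an assumption made for illustration: the "video" is a toy list of object positions per frame, and `toy_physics_reward` is a hypothetical constant-velocity penalty, not the VJEPA-2 based reward used in the paper.

```python
from typing import Callable, List, Sequence, Tuple

# Toy stand-in for a generated video: one object position per frame.
Trajectory = List[float]


def toy_physics_reward(traj: Trajectory) -> float:
    """Hypothetical reward: penalize frame-to-frame acceleration.

    This is only a placeholder for a learned world-model reward.
    """
    accel = [traj[i + 1] - 2 * traj[i] + traj[i - 1] for i in range(1, len(traj) - 1)]
    return -sum(a * a for a in accel)


def best_of_n(
    candidates: Sequence[Trajectory],
    reward_fn: Callable[[Trajectory], float],
) -> Tuple[int, float]:
    """Score every candidate and return (index, reward) of the highest-scoring one."""
    rewards = [reward_fn(c) for c in candidates]
    best_idx = max(range(len(candidates)), key=lambda i: rewards[i])
    return best_idx, rewards[best_idx]


if __name__ == "__main__":
    smooth = [0.0, 1.0, 2.0, 3.0, 4.0]  # constant velocity
    jerky = [0.0, 3.0, 1.0, 4.0, 2.0]   # abrupt direction changes
    idx, reward = best_of_n([jerky, smooth], toy_physics_reward)
    print(f"selected candidate {idx} with reward {reward}")
```

Drawing more candidates raises the chance that at least one scores well, which is the sense in which extra test-time compute can buy better generations. Steering trajectories during denoising, as the abstract also mentions, is not covered by this sketch.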
GeoDiff4D: Geometry-Aware Diffusion for 4D Head Avatar Reconstruction
Authors: Chao Xu, Xiaochen Zhao, Xiang Deng, Jingxiang Sun, Zhuo Su, Donglin Di, Yebin Liu
First: 2026-02-27T16:41:21+00:00 · Latest: 2026-02-27T16:41:21+00:00
Comments: 17 pages
Abstract
Reconstructing photorealistic and animatable 4D head avatars from a single portrait image remains a fundamental challenge in computer vision. While diffusion models have enabled remarkable progress in image and video generation for avatar reconstruction, existing methods primarily rely on 2D priors and struggle to achieve consistent 3D geometry. We propose a novel framework that leverages geometry-aware diffusion to learn strong geometry priors for high-fidelity head avatar reconstruction. Our approach jointly synthesizes portrait images and corresponding surface normals, while a pose-free expression encoder captures implicit expression representations. Both synthesized images and expression latents are incorporated into 3D Gaussian-based avatars, enabling photorealistic rendering with accurate geometry. Extensive experiments demonstrate that our method substantially outperforms state-of-the-art approaches in visual quality, expression fidelity, and cross-identity generalization, while supporting real-time rendering.
中文标题/摘要
标题:GeoDiff4D:几何感知扩散在4D头像重建中的应用
从单张肖像图像重建逼真且可动画化的4D头像仍然是计算机视觉中的一个基本挑战。尽管扩散模型在头像重建中的图像和视频生成方面取得了显著进展,但现有方法主要依赖于2D先验,难以实现一致的3D几何结构。我们提出了一种新的框架,利用几何感知扩散来学习高保真头像重建的强几何先验。我们的方法联合合成肖像图像及其对应的表面法线,而无姿态的表情编码器捕获隐式表情表示。合成的图像和表情潜在变量都被纳入基于3D高斯的头像中,从而实现逼真的几何渲染。大量实验表明,我们的方法在视觉质量、表情保真度和跨身份泛化方面显著优于现有方法,同时支持实时渲染。
Summary / 总结
The research aims to reconstruct photorealistic 4D head avatars from a single portrait image, addressing the challenge of consistent 3D geometry. The method uses a geometry-aware diffusion model that synthesizes both portrait images and surface normals, along with a pose-free expression encoder to capture implicit expression representations. The approach generates 3D Gaussian-based avatars that support photorealistic rendering with accurate geometry. Experiments show that the proposed method outperforms existing techniques in visual quality, expression fidelity, and cross-identity generalization, while enabling real-time rendering.
研究旨在从单张肖像图像中重建逼真且可动画的4D头像,解决3D几何一致性的问题。提出的GeoDiff4D框架采用几何感知扩散来学习强大的几何先验,同时合成肖像图像和表面法线。它还使用无姿态的表情编码器捕获隐式表情表示,这些表示随后被整合到基于3D高斯的头像中,以实现逼真的渲染。实验表明,GeoDiff4D在视觉质量、表情保真度和跨身份泛化方面优于现有方法,同时支持实时渲染。
Manifold-Preserving Superpixel Hierarchies and Embeddings for the Exploration of High-Dimensional Images
Authors: Alexander Vieth, Boudewijn Lelieveldt, Elmar Eisemann, Anna Vilanova, Thomas Höllt
First: 2026-02-27T16:40:54+00:00 · Latest: 2026-02-27T16:40:54+00:00
Comments: 12 pages main paper, 8 pages supplemental material
Abstract
High-dimensional images, or images with a high-dimensional attribute vector per pixel, are commonly explored with coordinated views of a low-dimensional embedding of the attribute space and a conventional image representation. Nowadays, such images can easily contain several million pixels. For such large datasets, hierarchical embedding techniques are better suited to represent the high-dimensional attribute space than flat dimensionality reduction methods. However, available hierarchical dimensionality reduction methods construct the hierarchy purely based on the attribute information and ignore the spatial layout of pixels in the images. This impedes the exploration of regions of interest in the image space, since there is no congruence between a region of interest in image space and the associated attribute abstractions in the hierarchy. In this paper, we present a superpixel hierarchy for high-dimensional images that takes the high-dimensional attribute manifold into account during construction. Through this, our method enables consistent exploration of high-dimensional images in both image and attribute space. We show the effectiveness of this new image-guided hierarchy in the context of embedding exploration by comparing it with classical hierarchical embedding-based image exploration in two use cases.
中文标题/摘要
标题:保持流形的超像素层次结构和嵌入用于高维图像的探索
高维图像,或每个像素具有高维属性向量的图像,通常通过低维属性空间嵌入的协调视图和传统的图像表示进行探索。如今,这样的图像可以包含数百万像素。对于如此大的数据集,分层嵌入技术比平面降维方法更适合表示高维属性空间。然而,现有的分层降维方法仅基于属性信息构建层次结构,而忽略了图像中像素的空间布局。这阻碍了对图像空间中感兴趣区域的探索,因为在图像空间中的感兴趣区域与层次结构中的相关属性抽象之间没有一致性。在本文中,我们提出了一种高维图像的超像素层次结构,该层次结构在构建过程中考虑了高维属性流形。通过这种方式,我们的方法能够在图像空间和属性空间中一致地探索高维图像。我们通过将其与基于嵌入的图像探索的经典分层方法在两个应用场景中进行比较,展示了这种新图像导向层次结构的有效性。
Summary / 总结
This paper addresses the challenge of exploring high-dimensional images by proposing a superpixel hierarchy that considers the high-dimensional attribute manifold. The method constructs a hierarchy that aligns with both the spatial layout of pixels and the attribute space, enabling consistent exploration in both domains. Experimental results demonstrate the effectiveness of this approach in two use cases compared to classical hierarchical embedding methods.
本文提出了一种超像素层次结构,该结构在构建过程中考虑了高维属性流形,以解决高维图像的探索问题。该方法构建的层次结构同时考虑了属性信息和像素的空间布局,使得在图像和属性空间中的一致探索成为可能。实验结果表明,该方法在两个应用场景中比传统的基于层次嵌入的图像探索方法更有效。
HumanOrbit: 3D Human Reconstruction as 360° Orbit Generation
Authors: Keito Suzuki, Kunyao Chen, Lei Wang, Bang Du, Runfa Blark Li, Peng Liu, Ning Bi, Truong Nguyen
Venue: CVPR 2026
First: 2026-02-27T16:27:52+00:00 · Latest: 2026-02-27T16:27:52+00:00
Comments: CVPR 2026 Findings
Abstract
We present a method for generating a full 360° orbit video around a person from a single input image. Existing methods typically adapt image-based diffusion models for multi-view synthesis, but yield inconsistent results across views and with the original identity. In contrast, recent video diffusion models have demonstrated their ability in generating photorealistic results that align well with the given prompts. Inspired by these results, we propose HumanOrbit, a video diffusion model for multi-view human image generation. Our approach enables the model to synthesize continuous camera rotations around the subject, producing geometrically consistent novel views while preserving the appearance and identity of the person. Using the generated multi-view frames, we further propose a reconstruction pipeline that recovers a textured mesh of the subject. Experimental results validate the effectiveness of HumanOrbit for multi-view image generation and that the reconstructed 3D models exhibit superior completeness and fidelity compared to those from state-of-the-art baselines.
中文标题/摘要
标题:HumanOrbit:从单张输入图像生成360°环视视频的人体三维重建
我们提出了一种方法,可以从单张输入图像生成围绕一个人的完整360°环视视频。现有方法通常通过适应基于图像的扩散模型来进行多视图合成,但结果在不同视图之间不一致,并且与原始身份不一致。相比之下,最近的视频扩散模型已经展示了其生成与给定提示高度一致的逼真结果的能力。受这些结果的启发,我们提出了一种用于多视图人体图像生成的视频扩散模型——HumanOrbit。我们的方法使模型能够合成围绕主题的连续相机旋转,产生几何上一致的新视角,同时保留人的外观和身份。利用生成的多视图帧,我们进一步提出了一种重建管道,以恢复主题的纹理网格。实验结果验证了HumanOrbit在多视图图像生成中的有效性,并且重建的3D模型在完整性和保真度方面优于最先进的基线方法。
Summary / 总结
The research aims to generate a full 360° orbit video around a person from a single input image. It proposes HumanOrbit, a video diffusion model that synthesizes continuous camera rotations around the subject, producing geometrically consistent novel views while preserving the person's appearance and identity. A reconstruction pipeline then recovers a textured mesh of the subject from the generated multi-view frames. Experiments show that the reconstructed 3D models are more complete and higher in fidelity than those from state-of-the-art baselines.
HumanOrbit 是一种从单张输入图像生成围绕人物的360°全景视频的方法。它使用视频扩散模型进行连续相机旋转的合成,生成几何上一致的新视角,同时保留人物的外观和身份。从生成的多视角帧重建的3D模型在完整性和保真度方面优于最先进的基线方法。
REA-RL: Reflection-Aware Online Reinforcement Learning for Efficient Reasoning
Authors: Hexuan Deng, Wenxiang Jiao, Xuebo Liu, Jun Rao, Min Zhang
Venue: ICLR 2026
First: 2025-05-26T11:47:16+00:00 · Latest: 2026-02-27T16:24:21+00:00
Comments: Accepted by ICLR 2026
Abstract
Large Reasoning Models (LRMs) demonstrate strong performance in complex tasks but often face the challenge of overthinking, leading to substantially high inference costs. Existing approaches synthesize shorter reasoning responses for LRMs to learn, but are inefficient for online usage due to the time-consuming data generation and filtering processes. Meanwhile, online reinforcement learning mainly adopts a length reward to encourage short reasoning responses, but it tends to lose reflection ability and harm performance. To address these issues, we propose REA-RL, which introduces a small reflection model for efficient scaling in online training, offering both parallel sampling and sequential revision. Besides, a reflection reward is designed to further prevent LRMs from favoring short yet non-reflective responses. Experiments show that both methods maintain or enhance performance while significantly improving inference efficiency. Their combination achieves a good balance between performance and efficiency, reducing inference costs by 36% without compromising performance. Further analysis demonstrates that our methods are effective by maintaining reflection frequency for hard problems while appropriately reducing it for easier ones without losing reflection ability. Code is available at https://github.com/hexuandeng/REA-RL.
中文标题/摘要
标题:REA-RL:面向高效推理的反思感知在线强化学习
大型推理模型(LRMs)在复杂任务中表现出色,但往往面临过度思考的挑战,导致推理成本显著增加。现有方法通过合成较短的推理响应来训练LRMs,但在在线使用中效率低下,因为数据生成和过滤过程耗时。同时,在线强化学习主要采用长度奖励来鼓励较短的推理响应,但容易失去反思能力并损害性能。为解决这些问题,我们提出了REA-RL,引入了一个小型反思模型以在线训练中高效扩展,提供并行采样和顺序修订。此外,设计了反思奖励以进一步防止LRMs偏好短而缺乏反思的响应。实验表明,这两种方法在保持或提升性能的同时,显著提高了推理效率。它们的结合在性能和效率之间取得了良好的平衡,减少了36%的推理成本而不牺牲性能。进一步分析表明,我们的方法通过在难题中保持反思频率并在较易问题中适当降低反思频率而有效,而不失去反思能力。代码可在https://github.com/hexuandeng/REA-RL获取。
Summary / 总结
The paper addresses the challenge of overthinking in Large Reasoning Models (LRMs) by proposing REA-RL, which uses a small reflection model for efficient online training. It combines parallel sampling and sequential revision, and introduces a reflection reward to prevent short, non-reflective responses. Experiments show that REA-RL maintains or enhances performance while reducing inference costs by 36%. The method balances performance and efficiency, maintaining reflection frequency for hard problems while reducing it for easier ones without losing reflection ability.
论文提出REA-RL方法,通过引入反思模型解决大型推理模型(LRMs)的过度推理问题,实现在线训练的高效性。该方法采用并行采样和序列修订,并设计了反思奖励以防止产生短而缺乏反思的回答。实验表明,REA-RL在减少36%推理成本的同时保持或提升了性能,且未牺牲反思能力。
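To make the interaction between a length reward and a reflection reward concrete, here is a minimal sketch. It is not the REA-RL reward: the marker phrases, weights, and the functions `count_reflections` and `shaped_reward` are all hypothetical, chosen only to show how a reflection term can stop a length bonus from favoring short, non-reflective answers.

```python
# Hypothetical phrases treated as signs of reflection; the paper's detector is not specified here.
REFLECTION_MARKERS = ("wait", "let me check", "double-check")


def count_reflections(response: str) -> int:
    """Count occurrences of the (assumed) reflection markers, case-insensitively."""
    text = response.lower()
    return sum(text.count(marker) for marker in REFLECTION_MARKERS)


def shaped_reward(
    response: str,
    correct: bool,
    max_len: int = 200,
    len_weight: float = 0.5,
    reflect_penalty: float = 0.5,
) -> float:
    """Correctness reward, plus a bonus for brevity, minus a penalty if no reflection occurs."""
    n_tokens = len(response.split())  # whitespace tokens as a crude length proxy
    length_bonus = len_weight * max(0.0, 1.0 - n_tokens / max_len)
    penalty = reflect_penalty if count_reflections(response) == 0 else 0.0
    return (1.0 if correct else 0.0) + length_bonus - penalty


if __name__ == "__main__":
    terse = "The answer is 4."
    reflective = "2+2 is 4. Wait, let me check: 2+2=4. The answer is 4."
    print(shaped_reward(terse, correct=True))
    print(shaped_reward(reflective, correct=True))
```

With the length bonus alone, the terse answer would score higher. With the assumed penalty, the slightly longer answer that includes a check is preferred.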
CMT-Benchmark: A Benchmark for Condensed Matter Theory Built by Expert Researchers
Authors: Haining Pan, James V. Roggeveen, Erez Berg, Juan Carrasquilla, Debanjan Chowdhury, Surya Ganguli, Federico Ghimenti, Juraj Hasik, Henry Hunt, Hong-Chen Jiang, Mason Kamb, Ying-Jer Kao, Ehsan Khatami, Michael J. Lawler, Di Luo, Titus Neupert, Xiaoliang Qi, Michael P. Brenner, Eun-Ah Kim
Venue: ICLR
First: 2025-10-06T18:00:55+00:00 · Latest: 2026-02-27T16:22:09+00:00
Comments: CMT-Benchmark dataset is available at https://huggingface.co/datasets/JVRoggeveen/cmt_benchmark. CMT-Benchmark was referenced in the Gemini 3 Deep Think (February 2026) release at https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-deep-think/
Abstract
Large language models (LLMs) have shown remarkable progress in coding and math problem-solving, but evaluation on advanced research-level problems in hard sciences remains scarce. To fill this gap, we present CMT-Benchmark, a dataset of 50 problems covering condensed matter theory (CMT) at the level of an expert researcher. Topics span analytical and computational approaches in quantum many-body, and classical statistical mechanics. The dataset was designed and verified by a panel of expert researchers from around the world. We built the dataset through a collaborative environment that challenges the panel to write and refine problems they would want a research assistant to solve, including Hartree-Fock, exact diagonalization, quantum/variational Monte Carlo, density matrix renormalization group (DMRG), quantum/classical statistical mechanics, and model building. We evaluate LLMs by programmatically checking solutions against expert-supplied ground truth. We developed machine-grading, including symbolic handling of non-commuting operators via normal ordering. They generalize across tasks too. Our evaluations show that frontier models struggle with all of the problems in the dataset, highlighting a gap in the physical reasoning skills of current LLMs. Notably, experts identified strategies for creating increasingly difficult problems by interacting with the LLMs and exploiting common failure modes. The best model, GPT5, solves 30% of the problems; average across 17 models (GPT, Gemini, Claude, DeepSeek, Llama) is 11.4 ± 2.1%. Moreover, 18 problems are solved by none of the 17 models, and 26 by at most one. These unsolved problems span Quantum Monte Carlo, Variational Monte Carlo, and DMRG. Answers sometimes violate fundamental symmetries or have unphysical scaling dimensions. We believe this benchmark will guide development toward capable AI research assistants and tutors.
中文标题/摘要
标题:CMT-Benchmark:由专家研究人员构建的凝聚态理论基准
大型语言模型(LLMs)在编程和数学问题解决方面取得了显著进展,但在硬科学领域的高级研究级问题评估方面仍然稀缺。为填补这一空白,我们提出了CMT-Benchmark,一个包含50个问题的数据集,涵盖了专家级研究人员水平的凝聚态理论(CMT)。主题涵盖了量子多体和经典统计力学的分析和计算方法。该数据集由来自世界各地的专家研究人员设计并验证。我们通过协作环境构建了数据集,挑战专家小组编写并完善他们希望研究助理解决的问题,包括哈特里-福克、精确对角化、量子/变分蒙特卡洛、密度矩阵重正化群(DMRG)、量子/经典统计力学和模型构建。我们通过程序化检查解决方案与专家提供的正确答案进行比较来评估LLMs。我们开发了机器评分,包括通过正规排序处理非交换算子的符号处理。它们在任务之间具有泛化能力。我们的评估表明,前沿模型在数据集中的所有问题上都遇到了困难,突显了当前LLMs在物理推理技能方面的差距。值得注意的是,专家通过与LLMs交互并利用常见失败模式来识别创建越来越难的问题的策略。最佳模型GPT5解决了30%的问题;17个模型(GPT、Gemini、Claude、DeepSeek、Llama)的平均值为11.4±2.1%。此外,18个问题没有被17个模型中的任何一个解决,26个问题最多只有一个模型解决。这些未解决的问题涵盖了量子蒙特卡洛、变分蒙特卡洛和DMRG。答案有时违反了基本对称性或具有不物理的标度维度。我们相信,这个基准将指导开发能够胜任的AI研究助理和导师。
Summary / 总结
CMT-Benchmark is a dataset of 50 expert-level problems in condensed matter theory, designed to evaluate large language models (LLMs) on advanced research tasks. The dataset covers analytical and computational approaches in quantum many-body physics and classical statistical mechanics, and was designed and verified by a panel of expert researchers. Evaluations show that current LLMs struggle with these problems: the best model, GPT5, solves 30% of them, the average across 17 models is 11.4 ± 2.1%, 18 problems are solved by none of the 17 models, and 26 by at most one. The unsolved problems span Quantum Monte Carlo, Variational Monte Carlo, and DMRG, and model answers sometimes violate fundamental symmetries or have unphysical scaling dimensions.
CMT-Benchmark 是由专家研究人员创建的数据集,包含 50 个高级凝聚态理论问题,用于评估大型语言模型(LLMs)在研究级别任务上的表现。该数据集涵盖了量子多体和经典统计力学等主题,并由专家小组验证。评估结果显示,当前的 LLM 在所有问题上都表现不佳,最好的模型 GPT5 只解决了 30% 的问题,而 17 模型中有许多问题未被任何模型解决。该基准测试揭示了 LLM 在物理推理能力上的不足,并将指导未来 AI 研究助理和导师的发展。
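The percentages and counts reported in the abstract can be turned into absolute numbers with a little arithmetic. The sketch below uses only figures stated in the abstract (50 problems, 30% solved by the best model, 18 solved by no model, 26 solved by at most one); the function name `benchmark_counts` and the dictionary keys are illustrative choices.

```python
from typing import Dict


def benchmark_counts(
    total: int,
    best_pct: int,
    solved_by_none: int,
    solved_by_at_most_one: int,
) -> Dict[str, float]:
    """Derive absolute counts from the headline benchmark statistics."""
    return {
        # 30% of 50 problems
        "best_model_solved": total * best_pct // 100,
        # "at most one" includes the problems solved by no model
        "solved_by_exactly_one": solved_by_at_most_one - solved_by_none,
        # everything not in the "at most one" group
        "solved_by_two_or_more": total - solved_by_at_most_one,
        # share of problems no model solved
        "unsolved_pct": 100 * solved_by_none / total,
    }


if __name__ == "__main__":
    counts = benchmark_counts(total=50, best_pct=30, solved_by_none=18, solved_by_at_most_one=26)
    for name, value in counts.items():
        print(f"{name}: {value}")
```

Under these figures the best model solves 15 problems, 8 problems are solved by exactly one model, 24 are solved by two or more, and 36% are solved by none.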