arXiv Paper Digest

2026-03-04 03:47
Snapshot: 20260304_0347
Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition
Authors: Minxue Tang, Yangyang Yu, Aolin Ding, Maziyar Baran Pouyan, Taha Belkhouja, Yujia Bao
First: 2026-02-22T23:39:21+00:00 · Latest: 2026-03-02T18:58:07+00:00
Abstract
Recognizing implicit visual and textual patterns is essential in many real-world applications of modern AI. However, tackling long-tail pattern recognition tasks remains challenging for current pre-trained foundation models such as LLMs and VLMs. While finetuning pre-trained models can improve accuracy in recognizing implicit patterns, it is usually infeasible due to a lack of training data and high computational overhead. In this paper, we propose ADAMAB, an efficient embedding calibration framework for few-shot pattern recognition. To minimize computational cost, ADAMAB trains embedder-agnostic lightweight calibrators on top of fixed embedding models without accessing their parameters. To mitigate the need for large-scale training data, we introduce an adaptive data augmentation strategy based on the Multi-Armed Bandit (MAB) mechanism. With a modified upper confidence bound algorithm, ADAMAB diminishes the gradient shifting and offers theoretically guaranteed convergence in few-shot training. Our multi-modal experiments demonstrate the superior performance of ADAMAB, with up to 40% accuracy improvement when training with fewer than 5 initial data samples of each class.
Summary
ADAMAB is an embedding calibration framework for few-shot recognition of implicit patterns. It cuts computational cost by training lightweight, embedder-agnostic calibrators on top of fixed embedding models without accessing their parameters. It reduces the need for large-scale training data through an adaptive data augmentation strategy driven by a Multi-Armed Bandit mechanism with a modified upper confidence bound algorithm. In multi-modal experiments, ADAMAB improves accuracy by up to 40% when training with fewer than 5 initial samples per class.
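The bandit-driven augmentation choice described above can be illustrated with the textbook UCB1 rule. This is a minimal sketch, not the paper's modified upper confidence bound algorithm, which the abstract does not specify. The arm names, the reward definition, and the helpers `ucb1_select` and `run_bandit` are hypothetical.

```python
import math
import random


def ucb1_select(counts, reward_sums, t, c=2.0):
    """Return the index of the arm with the highest UCB1 score."""
    # Try every arm once before trusting any estimate.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    # Empirical mean plus an exploration bonus that shrinks with more pulls.
    scores = [
        reward_sums[a] / counts[a] + math.sqrt(c * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def run_bandit(reward_fns, steps, seed=0):
    """Pull arms for `steps` rounds; return how often each arm was chosen."""
    rng = random.Random(seed)
    counts = [0] * len(reward_fns)
    reward_sums = [0.0] * len(reward_fns)
    for t in range(1, steps + 1):
        arm = ucb1_select(counts, reward_sums, t)
        counts[arm] += 1
        reward_sums[arm] += reward_fns[arm](rng)
    return counts


# Hypothetical arms: each "augmentation" returns a reward in [0, 1],
# e.g. a validation gain after training on the augmented batch.
AUGMENTATIONS = {
    "crop": lambda rng: float(rng.random() < 0.3),
    "color_jitter": lambda rng: float(rng.random() < 0.5),
    "paraphrase": lambda rng: float(rng.random() < 0.8),
}

if __name__ == "__main__":
    pulls = run_bandit(list(AUGMENTATIONS.values()), steps=500)
    print(dict(zip(AUGMENTATIONS, pulls)))
```

The exploration bonus is what lets a rarely tried augmentation get another chance, which matters when each class starts with only a handful of samples.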
Tool Verification for Test-Time Reinforcement Learning
Authors: Ruotong Liao, Nikolai Röhrich, Xiaohan Wang, Yuhui Zhang, Yasaman Samadzadeh, Volker Tresp, Serena Yeung-Levy
First: 2026-03-02T18:57:52+00:00 · Latest: 2026-03-02T18:57:52+00:00
Comments: 12 pages, 11 figures
Abstract
Test-time reinforcement learning (TTRL) has emerged as a promising paradigm for self-evolving large reasoning models (LRMs), enabling online adaptation on unlabeled test inputs via self-induced rewards through majority voting. However, a spurious yet high-frequency unverified consensus can become a biased and reinforced reward signal, leading to incorrect mode collapse. We address this failure mode with T^3RL (Tool-Verification for Test-Time Reinforcement Learning), which introduces test-time tool verification into reward estimation. Concretely, a verifier uses an external tool as evidence (e.g., from code execution) to upweight verified rollouts in a verification-aware voting, producing more reliable pseudo-labels for training. Across various math difficulties (MATH-500, AMC, and AIME 2024) and diverse backbone types, T^3RL significantly improves over TTRL, with larger gains on harder problems. More broadly, T^3RL can be viewed as verified online data synthesis, highlighting test-time tool verification as a key mechanism for stabilizing self-evolution.
Summary
T^3RL (Tool-Verification for Test-Time Reinforcement Learning) targets a failure mode of test-time reinforcement learning (TTRL) in large reasoning models: a spurious but frequent unverified consensus can become a biased reward signal and cause incorrect mode collapse. A verifier uses evidence from an external tool, such as code execution, to upweight verified rollouts in verification-aware voting, which yields more reliable pseudo-labels for training. T^3RL significantly improves over TTRL on MATH-500, AMC, and AIME 2024 across diverse backbone types, with larger gains on harder problems.
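A rough sketch of verification-aware voting follows. The abstract says only that verified rollouts are upweighted, so the weighting scheme is an assumption. The function name `verification_aware_vote` and the `verified_weight` parameter are hypothetical.

```python
from collections import defaultdict


def verification_aware_vote(rollouts, verified_weight=2.0):
    """Pick a pseudo-label from (answer, verified) pairs.

    Unverified rollouts count as 1 vote; rollouts confirmed by an
    external tool count as `verified_weight` votes (assumed scheme).
    """
    scores = defaultdict(float)
    for answer, verified in rollouts:
        scores[answer] += verified_weight if verified else 1.0
    # Ties resolve to the answer that appeared first.
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Three unverified rollouts agree on a wrong answer,
    # two tool-verified rollouts agree on another one.
    rollouts = [("41", False)] * 3 + [("42", True)] * 2
    print(verification_aware_vote(rollouts, verified_weight=1.0))  # plain majority
    print(verification_aware_vote(rollouts, verified_weight=2.0))  # verified upweighted
```

With a weight of 1.0 the function reduces to plain majority voting, which makes the effect of verification easy to compare.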
tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction
Authors: Chen Wang, Hao Tan, Wang Yifan, Zhiqin Chen, Yuheng Liu, Kalyan Sunkavalli, Sai Bi, Lingjie Liu, Yiwei Hu
Venue: CVPR 2026
First: 2026-02-23T18:59:45+00:00 · Latest: 2026-03-02T18:56:43+00:00
Comments: Accepted by CVPR 2026. Project Page: https://cwchenwang.github.io/tttLRM
Abstract
We propose tttLRM, a novel large 3D reconstruction model that leverages a Test-Time Training (TTT) layer to enable long-context, autoregressive 3D reconstruction with linear computational complexity, further scaling the model's capability. Our framework efficiently compresses multiple image observations into the fast weights of the TTT layer, forming an implicit 3D representation in the latent space that can be decoded into various explicit formats, such as Gaussian Splats (GS) for downstream applications. The online learning variant of our model supports progressive 3D reconstruction and refinement from streaming observations. We demonstrate that pretraining on novel view synthesis tasks effectively transfers to explicit 3D modeling, resulting in improved reconstruction quality and faster convergence. Extensive experiments show that our method achieves superior performance in feedforward 3D Gaussian reconstruction compared to state-of-the-art approaches on both objects and scenes.
Summary
tttLRM is a large 3D reconstruction model that uses a Test-Time Training (TTT) layer for long-context, autoregressive 3D reconstruction with linear computational complexity. It compresses multiple image observations into the fast weights of the TTT layer, forming an implicit 3D representation that can be decoded into explicit formats such as Gaussian Splats. Pretraining on novel view synthesis transfers to explicit 3D modeling, improving reconstruction quality and speeding up convergence. In experiments, tttLRM outperforms state-of-the-art approaches in feedforward 3D Gaussian reconstruction on both objects and scenes.
From Leaderboard to Deployment: Code Quality Challenges in AV Perception Repositories
Authors: Mateus Karvat, Bram Adams, Sidney Givigi
First: 2026-03-02T18:54:28+00:00 · Latest: 2026-03-02T18:54:28+00:00
Abstract
Autonomous vehicle (AV) perception models are typically evaluated solely on benchmark performance metrics, with limited attention to code quality, production readiness and long-term maintainability. This creates a significant gap between research excellence and real-world deployment in safety-critical systems subject to international safety standards. To address this gap, we present the first large-scale empirical study of software quality in AV perception repositories, systematically analyzing 178 unique models from the KITTI and NuScenes 3D Object Detection leaderboards. Using static analysis tools (Pylint, Bandit, and Radon), we evaluated code errors, security vulnerabilities, maintainability, and development practices. Our findings revealed that only 7.3% of the studied repositories meet basic production-readiness criteria, defined as having zero critical errors and no high-severity security vulnerabilities. Security issues are highly concentrated, with the top five issues responsible for almost 80% of occurrences, which prompted us to develop a set of actionable guidelines to prevent them. Additionally, the adoption of Continuous Integration/Continuous Deployment pipelines was correlated with better code maintainability. Our findings highlight that leaderboard performance does not reflect production readiness and that targeted interventions could substantially improve the quality and safety of AV perception code.
Summary
This study examines the gap between research excellence and real-world deployment in autonomous vehicle perception by analyzing code quality in 178 unique models from the KITTI and NuScenes 3D Object Detection leaderboards with static analysis tools (Pylint, Bandit, and Radon). Only 7.3% of the repositories meet basic production-readiness criteria, defined as zero critical errors and no high-severity security vulnerabilities. Security issues are highly concentrated: the top five account for almost 80% of occurrences, which led the authors to develop actionable guidelines for preventing them. Adoption of Continuous Integration/Continuous Deployment pipelines correlated with better code maintainability. The findings indicate that leaderboard performance does not reflect production readiness and that targeted interventions could substantially improve the quality and safety of AV perception code.
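The production-readiness criterion stated above (zero critical errors and no high-severity security vulnerabilities) can be written as a small check. This is a sketch under assumptions: the `RepoReport` record and its field names are hypothetical, and the counts would have to come from parsing the static analysis output, which is not shown here.

```python
from dataclasses import dataclass


@dataclass
class RepoReport:
    """Hypothetical per-repository tally of static analysis findings."""
    name: str
    critical_errors: int      # e.g. error-level findings from a linter
    high_severity_vulns: int  # e.g. high-severity findings from a security scanner


def is_production_ready(report: RepoReport) -> bool:
    """Apply the criterion from the abstract: no critical errors, no high-severity issues."""
    return report.critical_errors == 0 and report.high_severity_vulns == 0


def readiness_rate(reports) -> float:
    """Fraction of repositories that pass the criterion (0.0 for an empty list)."""
    if not reports:
        return 0.0
    return sum(is_production_ready(r) for r in reports) / len(reports)


if __name__ == "__main__":
    reports = [
        RepoReport("repo_a", critical_errors=0, high_severity_vulns=0),
        RepoReport("repo_b", critical_errors=4, high_severity_vulns=0),
        RepoReport("repo_c", critical_errors=0, high_severity_vulns=2),
    ]
    print(f"{readiness_rate(reports):.1%} of repositories pass")
```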
Sketch2Colab: Sketch-Conditioned Multi-Human Animation via Controllable Flow Distillation
Authors: Divyanshu Daiya, Aniket Bera
Venue: CVPR 2026
First: 2026-03-02T18:52:51+00:00 · Latest: 2026-03-02T18:52:51+00:00
Comments: Accepted to CVPR 2026 Main Conference (11 pages, 5 figures)
Abstract
We present Sketch2Colab, which turns storyboard-style 2D sketches into coherent, object-aware 3D multi-human motion with fine-grained control over agents, joints, timing, and contacts. Conventional diffusion-based motion generators have advanced realism; however, achieving precise adherence to rich interaction constraints typically demands extensive training and/or costly posterior guidance, and performance can degrade under strong multi-entity conditioning. Sketch2Colab instead first learns a sketch-driven diffusion prior and then distills it into an efficient rectified-flow student operating in latent space for fast, stable sampling. Differentiable energies over keyframes, trajectories, and physics-based constraints directly shape the student's transport field, steering samples toward motions that faithfully satisfy the storyboard while remaining physically plausible. To capture coordinated interaction, we augment the continuous flow with a continuous-time Markov chain (CTMC) planner that schedules discrete events such as touches, grasps, and handoffs, modulating the dynamics to produce crisp, well-phased human-object-human collaborations. Experiments on CORE4D and InterHuman show that Sketch2Colab achieves state-of-the-art constraint adherence and perceptual quality while offering significantly faster inference than diffusion-only baselines.
Summary
Sketch2Colab turns storyboard-style 2D sketches into coherent, object-aware 3D multi-human motion with fine-grained control over agents, joints, timing, and contacts. It learns a sketch-driven diffusion prior and distills it into an efficient rectified-flow student in latent space for fast, stable sampling. Differentiable energies over keyframes, trajectories, and physics-based constraints steer samples toward motions that satisfy the storyboard while remaining physically plausible, and a CTMC planner schedules discrete events such as touches, grasps, and handoffs. On CORE4D and InterHuman, Sketch2Colab achieves state-of-the-art constraint adherence and perceptual quality with significantly faster inference than diffusion-only baselines.
MAC: A Conversion Rate Prediction Benchmark Featuring Labels Under Multiple Attribution Mechanisms
Authors: Jinqi Wu, Sishuo Chen, Zhangming Chan, Yong Bai, Lei Zhang, Sheng Chen, Chenghuan Hou, Xiang-Rong Sheng, Han Zhu, Jian Xu, Bo Zheng, Chaoyou Fu
First: 2026-03-02T18:51:01+00:00 · Latest: 2026-03-02T18:51:01+00:00
Comments: Code and data available at https://github.com/alimama-tech/PyMAL
Abstract
Multi-attribution learning (MAL), which enhances model performance by learning from conversion labels yielded by multiple attribution mechanisms, has emerged as a promising learning paradigm for conversion rate (CVR) prediction. However, the conversion labels in public CVR datasets are generated by a single attribution mechanism, hindering the development of MAL approaches. To address this data gap, we establish the Multi-Attribution Benchmark (MAC), the first public CVR dataset featuring labels from multiple attribution mechanisms. Besides, to promote reproducible research on MAL, we develop PyMAL, an open-source library covering a wide array of baseline methods. We conduct comprehensive experimental analyses on MAC and reveal three key insights: (1) MAL brings consistent performance gains across different attribution settings, especially for users featuring long conversion paths. (2) The performance growth scales up with objective complexity in most settings; however, when predicting first-click conversion targets, simply adding auxiliary objectives is counterproductive, underscoring the necessity of careful selection of auxiliary objectives. (3) Two architectural design principles are paramount: first, to fully learn the multi-attribution knowledge, and second, to fully leverage this knowledge to serve the main task. Motivated by these findings, we propose Mixture of Asymmetric Experts (MoAE), an effective MAL approach incorporating multi-attribution knowledge learning and main task-centric knowledge utilization. Experiments on MAC show that MoAE substantially surpasses the existing state-of-the-art MAL method. We believe that our benchmark and insights will foster future research in the MAL field. Our MAC benchmark and the PyMAL algorithm library are publicly available at https://github.com/alimama-tech/PyMAL.
Summary
The paper introduces the Multi-Attribution Benchmark (MAC), the first public conversion rate (CVR) prediction dataset with labels from multiple attribution mechanisms, together with PyMAL, an open-source library of baseline methods for multi-attribution learning (MAL). Experiments on MAC show that MAL brings consistent performance gains across attribution settings, especially for users with long conversion paths. Gains grow with objective complexity in most settings, but simply adding auxiliary objectives is counterproductive when predicting first-click conversion targets, so auxiliary objectives must be chosen carefully. Based on these insights, the authors propose Mixture of Asymmetric Experts (MoAE), which substantially surpasses the existing state-of-the-art MAL method on MAC.
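To make "labels from multiple attribution mechanisms" concrete, the sketch below assigns conversion credit along one click path under three common rules. Only first-click is named in the abstract. Last-click and linear attribution are standard industry mechanisms used here as assumptions, and the `attribute` function is hypothetical, not part of PyMAL.

```python
def attribute(path, mechanism):
    """Split one conversion's credit across the touchpoints in `path`.

    path: ordered list of touchpoint ids (e.g. ad ids) clicked before converting.
    mechanism: "first_click", "last_click", or "linear".
    Returns a dict mapping touchpoint id to its share of credit (shares sum to 1).
    """
    if not path:
        return {}
    if mechanism == "first_click":
        return {path[0]: 1.0}
    if mechanism == "last_click":
        return {path[-1]: 1.0}
    if mechanism == "linear":
        share = 1.0 / len(path)
        credit = {}
        for touchpoint in path:
            # A touchpoint clicked twice accumulates two shares.
            credit[touchpoint] = credit.get(touchpoint, 0.0) + share
        return credit
    raise ValueError(f"unknown mechanism: {mechanism}")


if __name__ == "__main__":
    path = ["ad_a", "ad_b", "ad_a", "ad_c"]
    for mechanism in ("first_click", "last_click", "linear"):
        print(mechanism, attribute(path, mechanism))
```

The same path produces a different label for each touchpoint under each rule, which is the kind of multi-label supervision a MAL model can learn from.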
Reservoir Subspace Injection for Online ICA under Top-n Whitening
Authors: Wenjun Xiao, Yuda Bi, Vince D Calhoun
First: 2026-03-02T18:49:02+00:00 · Latest: 2026-03-02T18:49:02+00:00
Abstract
Reservoir expansion can improve online independent component analysis (ICA) under nonlinear mixing, yet top-$n$ whitening may discard injected features. We formalize this bottleneck as \emph{reservoir subspace injection} (RSI): injected features help only if they enter the retained eigenspace without displacing passthrough directions. RSI diagnostics (IER, SSO, $ρ_x$) identify a failure mode in our top-$n$ setting: stronger injection increases IER but crowds out passthrough energy ($ρ_x: 1.00\!\rightarrow\!0.77$), degrading SI-SDR by up to $2.2$\,dB. A guarded RSI controller preserves passthrough retention and recovers mean performance to within $0.1$\,dB of baseline $1/N$ scaling. With passthrough preserved, RE-OICA improves over vanilla online ICA by $+1.7$\,dB under nonlinear mixing and achieves positive SI-SDR$_{\mathrm{sc}}$ on the tested super-Gaussian benchmark ($+0.6$\,dB).
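The results above are reported in SI-SDR (scale-invariant signal-to-distortion ratio). The sketch below computes the standard SI-SDR for one estimated source against its reference. It does not cover the SI-SDR$_{\mathrm{sc}}$ variant or the RSI diagnostics (IER, SSO, $ρ_x$), which the abstract does not define. The function name `si_sdr` is an illustrative choice.

```python
import math


def si_sdr(estimate, reference):
    """Standard scale-invariant SDR in dB for two equal-length signals."""
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    # Optimal scaling of the reference toward the estimate.
    alpha = dot / ref_energy
    target = [alpha * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    if residual_energy == 0.0:
        return math.inf  # perfect reconstruction up to scale
    return 10.0 * math.log10(target_energy / residual_energy)


if __name__ == "__main__":
    reference = [1.0, 0.0, -1.0, 0.0]
    # Scaled reference plus a small component orthogonal to it.
    estimate = [2.0, 0.2, -2.0, 0.2]
    print(f"SI-SDR: {si_sdr(estimate, reference):.2f} dB")
```

Because the reference is rescaled before the residual is measured, multiplying the estimate by any nonzero constant leaves the score unchanged.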
Wikipedia in the Era of LLMs: Evolution and Risks
Authors: Siming Huang, Yuliang Xu, Mingmeng Geng, Yao Wan, Dongping Chen
First: 2025-03-04T18:58:13+00:00 · Latest: 2026-03-02T18:48:33+00:00
Comments: Accepted by TMLR: https://openreview.net/forum?id=ahVmnYkVLt
Abstract
In this paper, we present a comprehensive analysis and monitoring framework for the impact of Large Language Models (LLMs) on Wikipedia, examining the evolution of Wikipedia through existing data and using simulations to explore potential risks. We begin by analyzing article content and page views to study the recent changes in Wikipedia and assess the impact of LLMs. Subsequently, we evaluate how LLMs affect various Natural Language Processing (NLP) tasks related to Wikipedia, including machine translation and retrieval-augmented generation (RAG). Our findings and simulation results reveal that Wikipedia articles have been affected by LLMs, with an impact of approximately 1% in certain categories. If the machine translation benchmark based on Wikipedia is influenced by LLMs, the scores of the models may become inflated, and the comparative results among models could shift. Moreover, the effectiveness of RAG might decrease if the knowledge has been contaminated by LLMs. While LLMs have not yet fully changed Wikipedia's language and knowledge structures, we believe that our empirical findings signal the need for careful consideration of potential future risks in NLP research. We release all experimental datasets and source code at: https://github.com/HSM316/LLM_Wikipedia
Summary
This paper analyzes the impact of Large Language Models (LLMs) on Wikipedia through article content, page views, and simulations, and evaluates the effect on NLP tasks such as machine translation and retrieval-augmented generation (RAG). It finds that LLMs have affected Wikipedia articles, with an impact of approximately 1% in certain categories. Machine translation benchmarks based on Wikipedia may show inflated model scores and shifted comparative results if influenced by LLMs, and RAG may become less effective if the knowledge has been contaminated by LLMs. The authors call for careful consideration of potential future risks in NLP research and release their experimental datasets and source code.
Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance
Authors: Yiqi Lin, Guoqiang Liang, Ziyun Zeng, Zechen Bai, Yanzhe Chen, Mike Zheng Shou
First: 2026-03-02T18:46:28+00:00 · Latest: 2026-03-02T18:46:28+00:00
Abstract
Instruction-based video editing has witnessed rapid progress, yet current methods often struggle with precise visual control, as natural language is inherently limited in describing complex visual nuances. Although reference-guided editing offers a robust solution, its potential is currently bottlenecked by the scarcity of high-quality paired training data. To bridge this gap, we introduce a scalable data generation pipeline that transforms existing video editing pairs into high-fidelity training quadruplets, leveraging image generative models to create synthesized reference scaffolds. Using this pipeline, we construct RefVIE, a large-scale dataset tailored for instruction-reference-following tasks, and establish RefVIE-Bench for comprehensive evaluation. Furthermore, we propose a unified editing architecture, Kiwi-Edit, that synergizes learnable queries and latent visual features for reference semantic guidance. Our model achieves significant gains in instruction following and reference fidelity via a progressive multi-stage training curriculum. Extensive experiments demonstrate that our data and architecture establish a new state-of-the-art in controllable video editing. All datasets, models, and code are released at https://github.com/showlab/Kiwi-Edit.
Summary
The work aims to improve precise visual control in instruction-based video editing, where natural language struggles to describe complex visual details and reference-guided editing is held back by scarce high-quality paired training data. The authors introduce a scalable data generation pipeline that turns existing video editing pairs into high-fidelity training quadruplets, use it to build the RefVIE dataset and the RefVIE-Bench benchmark, and propose Kiwi-Edit, a unified editing architecture that combines learnable queries and latent visual features for reference semantic guidance. Trained with a progressive multi-stage curriculum, the model shows significant gains in instruction following and reference fidelity, setting a new state of the art in controllable video editing.
GeoDiT: Point-Conditioned Diffusion Transformer for Satellite Image Synthesis
Authors: Srikumar Sastry, Dan Cher, Brian Wei, Aayush Dhakal, Subash Khanal, Dev Gupta, Nathan Jacobs
First: 2026-03-02T18:42:15+00:00 · Latest: 2026-03-02T18:42:15+00:00
Comments: 26 pages, 17 figures
Abstract
We introduce GeoDiT, a diffusion transformer designed for text-to-satellite image generation with point-based control. Existing controlled satellite image generative models often require pixel-level maps that are time-consuming to acquire, yet semantically limited. To address this limitation, we introduce a novel point-based conditioning framework that controls the generation process through the spatial location of the points and the textual description associated with each point, providing semantically rich control signals. This approach enables flexible, annotation-friendly, and computationally simple inference for satellite image generation. To this end, we introduce an adaptive local attention mechanism that effectively regularizes the attention scores based on the input point queries. We systematically evaluate various domain-specific design choices for training GeoDiT, including the selection of satellite image representation for alignment and geolocation representation for conditioning. Our experiments demonstrate that GeoDiT achieves impressive generation performance, surpassing the state-of-the-art remote sensing generative models.
Summary
GeoDiT is a diffusion transformer for text-to-satellite image generation with point-based control. Instead of pixel-level maps, which are time-consuming to acquire and semantically limited, it conditions generation on the spatial locations of points and the textual description attached to each point, providing semantically rich control signals. An adaptive local attention mechanism regularizes the attention scores based on the input point queries. In experiments, GeoDiT surpasses state-of-the-art remote sensing generative models in generation performance.
Concept-TRAK: Understanding how diffusion models learn concepts through concept-level attribution
Authors: Yonghyun Park, Chieh-Hsin Lai, Satoshi Hayakawa, Yuhta Takida, Naoki Murata, Wei-Hsiang Liao, Woosung Choi, Kin Wai Cheuk, Junghyun Koo, Yuki Mitsufuji
Venue: ICLR 2026
First: 2025-07-09T05:03:57+00:00 · Latest: 2026-03-02T18:40:44+00:00
Comments: This paper has been accepted at ICLR 2026
Abstract
While diffusion models excel at image generation, their growing adoption raises critical concerns about copyright issues and model transparency. Existing attribution methods identify training examples influencing an entire image, but fall short in isolating contributions to specific elements, such as styles or objects, that are of primary concern to stakeholders. To address this gap, we introduce concept-level attribution through a novel method called Concept-TRAK, which extends influence functions with a key innovation: specialized training and utility loss functions designed to isolate concept-specific influences rather than overall reconstruction quality. We evaluate Concept-TRAK on novel concept attribution benchmarks using Synthetic and CelebA-HQ datasets, as well as the established AbC benchmark, showing substantial improvements over prior methods in concept-level attribution scenarios. We further demonstrate its versatility on real-world text-to-image generation with compositional and multi-concept prompts.
Summary
Concept-TRAK addresses a limitation of existing attribution methods for diffusion models: they identify training examples that influence an entire image but cannot isolate contributions to specific elements such as styles or objects. It extends influence functions with specialized training and utility loss functions designed to isolate concept-specific influences rather than overall reconstruction quality. The method shows substantial improvements over prior methods on concept-level attribution benchmarks (Synthetic, CelebA-HQ, and AbC) and also applies to real-world text-to-image generation with compositional and multi-concept prompts.
Astral: training physics-informed neural networks with error majorants
Authors: Vladimir Fanaskov, Tianchi Yu, Alexander Rudikov, Ivan Oseledets
Venue: ICLR 2026
First: 2024-06-04T13:11:49+00:00 · Latest: 2026-03-02T18:39:47+00:00
Comments: Accepted to ICLR 2026 workshop AI&PDE, reviewed at https://openreview.net/forum?id=TcFpJK2FcN
Abstract
The primal approach to physics-informed learning is residual minimization. We argue that the residual is, at best, an indirect measure of the error of the approximate solution and propose to train with an error majorant instead. Since the error majorant provides a direct upper bound on the error, one can reliably estimate how close the PiNN is to the exact solution and stop the optimization process when the desired accuracy is reached. We call the loss function associated with the error majorant \textbf{Astral}: neur\textbf{A}l a po\textbf{ST}erio\textbf{R}i function\textbf{A}l \textbf{L}oss. To compare Astral and residual loss functions, we illustrate how error majorants can be derived for various PDEs and conduct experiments with diffusion equations (including anisotropic and in the L-shaped domain), convection-diffusion equation, temporal discretization of Maxwell's equation, magnetostatics and nonlinear elastoplasticity problems. The results indicate that Astral loss is competitive with the residual loss, typically leading to faster convergence and lower error. The main benefit of using Astral loss comes from its ability to estimate error, which is impossible with other loss functions. Our experiments indicate that the error estimate obtained with Astral loss is usually tight enough, e.g., for a highly anisotropic equation, on average, Astral overestimates error by a factor of $1.5$, and for convection-diffusion by a factor of $1.7$. We further demonstrate that Astral loss is better correlated with error than the residual and is a more reliable predictor of the error value. Moreover, unlike the residual, the error indicator obtained from Astral loss has a superb spatial correlation with error. Backed by the empirical and theoretical results, we argue that one can productively use Astral loss to perform reliable error analysis and approximate PDE solutions with accuracy similar to standard residual-based techniques.
Summary / 总结
The paper proposes training physics-informed neural networks (PINNs) using error majorants instead of residuals to directly estimate the error of approximate solutions. Experiments with various PDEs show that the Astral loss, which is based on error majorants, is competitive with residual loss, often leading to faster convergence and lower error. The Astral loss provides a reliable error estimate, which is not possible with residual loss, and has a strong spatial correlation with the actual error.
论文提出使用误差上界而非残差来训练物理知情神经网络(PINNs),以直接估计近似解的误差。实验表明,基于误差上界的Astral损失与残差损失相当,通常能更快收敛并降低误差。Astral损失能够可靠地估计误差,这是残差损失无法做到的,并且其误差估计与实际误差的空间相关性很强。
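As an illustration of what an error majorant looks like, here is the textbook functional a posteriori estimate for the Poisson problem. It is a standard example, not necessarily one of the majorants derived in the paper; the symbols $v$ (approximate solution), $y$ (auxiliary flux field), $\beta$ (free positive parameter) and $C_F$ (Friedrichs constant of the domain) are notation assumed here.

```latex
% Model problem: -\Delta u = f in \Omega, u = 0 on \partial\Omega.
% For any v \in H_0^1(\Omega), any y \in H(\mathrm{div}, \Omega), and any \beta > 0:
\|\nabla (u - v)\|^2
  \;\le\;
  (1 + \beta)\,\|y - \nabla v\|^2
  \;+\;
  \Bigl(1 + \tfrac{1}{\beta}\Bigr)\, C_F^2\, \|f + \operatorname{div} y\|^2 ,
% where \|\cdot\| is the L^2(\Omega) norm and C_F satisfies
% \|w\| \le C_F \|\nabla w\| for all w \in H_0^1(\Omega).
```

The right-hand side involves only known quantities, so minimizing it over $v$, $y$ and $\beta$ both trains the approximation and yields a computable upper bound on its error. The bound is attained when $y = \nabla u$ and $v = u$, where both terms vanish.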
Return Augmented Decision Transformer for Off-Dynamics Reinforcement Learning
Authors: Ruhan Wang, Yu Yang, Zhishuai Liu, Dongruo Zhou, Pan Xu
Venue: Transactions on Machine Learning Research, 2026
First: 2024-10-30T20:46:26+00:00 · Latest: 2026-03-02T18:38:36+00:00
Comments: 26 pages, 11 tables, 8 figures. Published in Transactions on Machine Learning Research (TMLR)
Abstract
We study offline off-dynamics reinforcement learning (RL) to utilize data from an easily accessible source domain to enhance policy learning in a target domain with limited data. Our approach centers on return-conditioned supervised learning (RCSL), particularly focusing on Decision Transformer (DT) type frameworks, which can predict actions conditioned on desired return guidance and complete trajectory history. Previous works address the dynamics shift problem by augmenting the reward in the trajectory from the source domain to match the optimal trajectory in the target domain. However, this strategy is not directly applicable in RCSL owing to (1) the unique form of the RCSL policy class, which explicitly depends on the return, and (2) the absence of a straightforward representation of the optimal trajectory distribution. We propose the Return Augmented (REAG) method for DT type frameworks, where we augment the return in the source domain by aligning its distribution with that in the target domain. We provide a theoretical analysis demonstrating that the RCSL policy learned from REAG achieves the same level of suboptimality as would be obtained without a dynamics shift. We introduce two practical implementations, REAG$_\text{Dara}^{*}$ and REAG$_\text{MV}^{*}$. Thorough experiments on D4RL datasets and various DT-type baselines demonstrate that our methods consistently enhance the performance of DT type frameworks in off-dynamics RL.
中文标题/摘要
标题:基于增强回报的决策变换器用于离线离域强化学习
我们研究了离线离域强化学习(RL),利用易于访问的数据源领域的数据来增强目标领域有限数据的策略学习。我们的方法集中在回报条件监督学习(RCSL),特别是集中在决策变换器(DT)类型框架上,这些框架可以预测在期望回报指导和完整轨迹历史条件下采取的动作。先前的工作通过调整数据源领域轨迹中的奖励,使其与目标领域中的最优轨迹匹配来解决动态变化问题。然而,由于(1)RCSL策略类的独特形式,它显式地依赖于回报,以及(2)最优轨迹分布的直接表示不存在,这一策略在RCSL中不能直接适用。我们为DT类型框架提出了增强回报(REAG)方法,其中通过将数据源领域的回报分布与目标领域的分布对齐来增强回报。我们提供了理论分析,证明从REAG学习到的RCSL策略的次优性与没有动态变化时相同。我们介绍了两种实用实现REAG$_\text{Dara}^{*}$和REAG$_\text{MV}^{*}$。在D4RL数据集和各种DT类型基线上的全面实验表明,我们的方法在离域RL中始终增强了DT类型框架的性能。
Summary / 总结
This paper addresses offline off-dynamics reinforcement learning by utilizing data from an easily accessible source domain to improve policy learning in a target domain with limited data. It proposes the Return Augmented (REAG) method for Decision Transformer (DT) frameworks, which aligns the return distribution in the source domain with that in the target domain. Theoretical analysis shows that the RCSL policy learned from REAG achieves the same level of suboptimality as without dynamics shift. Experiments on D4RL datasets and various DT baselines show consistent performance enhancement for DT frameworks in off-dynamics RL.
该论文研究了利用易于获取的数据源领域数据来改进目标领域中数据有限的强化学习策略。提出了一种名为Return Augmented (REAG)的方法,用于Decision Transformer (DT)框架,该方法通过使源域和目标域的回报分布对齐来提高性能。理论分析表明,从REAG学习到的RCSL策略在动态偏差下达到与无动态偏差时相同的次优性。实验结果表明,该方法在D4RL数据集和各种DT基线上的表现均优于现有方法。
How Small Can 6G Reason? Scaling Tiny Language Models for AI-Native Networks
Authors: Mohamed Amine Ferrag, Abderrahmane Lakas, Merouane Debbah
First: 2026-03-02T18:19:49+00:00 · Latest: 2026-03-02T18:19:49+00:00
Abstract
Emerging 6G visions, reflected in ongoing standardization efforts within 3GPP, IETF, ETSI, ITU-T, and the O-RAN Alliance, increasingly characterize networks as AI-native systems in which high-level semantic reasoning layers operate above standardized control and data-plane functions. Although frontier-scale large language models (LLMs) such as Qwen2.5-7B and Olmo-3-7B demonstrate strong reasoning capability, their computational footprint limits deployment in latency-sensitive, edge-native infrastructures. This paper presents a systematic empirical study of the scaling behavior and deployment efficiency of compact language models for network-level semantic reasoning in AI-native 6G systems. Using 6G-Bench, a standardization-aligned benchmark comprising 30 decision-making tasks across five capability domains, we evaluate models ranging from 135M (SmolLM2-135M) to 7B parameters (Qwen2.5-7B), including mid-scale architectures such as Llama-3.2-1B, Granite-1B, and Qwen2.5-3B. Deterministic accuracy (pass@1) increases from 0.224 at 135M to 0.707 at 7B, but scaling gains are highly non-uniform. A pronounced stability transition occurs in the 1 to 1.5B range, where accuracy rises from 0.373 (Llama-3.2-1B) to 0.531 (Qwen2.5-1.5B) and the instability gap Delta_5 contracts from 0.356 to 0.138. Beyond 3B parameters, improvements diminish (+0.064 from 3B to 7B). Through single-query inference profiling and an Edge Score metric that normalizes accuracy by latency and memory footprint, we show that semantic reliability per unit edge resource does not scale monotonically with parameter count. Instead, mid-scale models (approximately 1.5 to 3B) achieve the most favorable balance between deterministic stability and computational efficiency, providing deployment-relevant guidance for AI-native 6G architectures. All scripts and results are publicly available at https://github.com/maferrag/6G-Bench
中文标题/摘要
标题:6G 能推理到多小?面向 AI 原生网络的超小型语言模型缩放研究
6G 的新兴愿景,反映在 3GPP、IETF、ETSI、ITU-T 和 O-RAN 联盟正在进行的标准制定工作中,越来越多地将网络视为 AI 原生系统,在这些系统中,高级语义推理层位于标准化的控制和数据平面功能之上。尽管前沿规模的大语言模型(LLMs)如 Qwen2.5-7B 和 Olmo-3-7B 展示了强大的推理能力,但其计算足迹限制了它们在低延迟、边缘原生基础设施中的部署。本文对超小型语言模型在网络级语义推理中的缩放行为和部署效率进行了系统性的实证研究。使用 6G-Bench,一个包含 30 项决策任务的标准对齐基准,涵盖了五个能力域,我们评估了从 135M(SmolLM2-135M)到 7B 参数(Qwen2.5-7B)的模型,包括中等规模的架构如 Llama-3.2-1B、Granite-1B 和 Qwen2.5-3B。确定性准确性(pass@1)从 135M 的 0.224 增加到 7B 的 0.707,但缩放收益非常不均匀。在 1 到 1.5B 范围内发生明显的稳定性转变,准确性从 0.373(Llama-3.2-1B)上升到 0.531(Qwen2.5-1.5B),不稳定性差距 Delta_5 从 0.356 收缩到 0.138。超过 3B 参数后,改进幅度减小(从 3B 到 7B 增加 0.064)。通过单查询推理分析和一个将准确性归一化为延迟和内存足迹的 Edge Score 指标,我们表明每单位边缘资源的语义可靠性并不随参数数量单调增加。相反,中等规模的模型(大约 1.5 到 3B)在确定性稳定性和计算效率之间实现了最有利的平衡,为 AI 原生 6G 架构的部署提供了相关指导。所有脚本和结果均可在 https://github.com/maferrag/6G-Bench 公开获取
Summary / 总结
This paper investigates the scaling behavior of compact language models for network-level semantic reasoning in AI-native 6G systems. Using 6G-Bench, a standardization-aligned benchmark, the study evaluates models from 135M to 7B parameters, showing that pass@1 accuracy increases from 0.224 at 135M to 0.707 at 7B but with highly non-uniform gains. A stability transition occurs in the 1 to 1.5B range, and gains diminish beyond 3B. Mid-scale models (approximately 1.5 to 3B) achieve the best balance between deterministic stability and computational efficiency, providing deployment-relevant guidance for AI-native 6G architectures.
本文研究了紧凑型语言模型在网络级语义推理中的扩展行为,特别是在AI原生6G系统中的应用。通过标准化基准测试,研究评估了从135M到7B参数的模型,显示了从135M的0.224到7B的0.707的准确性提升,但增长并不均匀。在1到1.5B参数之间出现稳定性转变,大约1.5到3B参数的中型模型在确定性稳定性和计算效率之间取得了最佳平衡,为部署AI原生6G架构提供了有价值的指导。
Near-Optimal Regret for KL-Regularized Multi-Armed Bandits
Authors: Kaixuan Ji, Qingyue Zhao, Heyang Zhao, Qiwei Di, Quanquan Gu
First: 2026-03-02T18:17:33+00:00 · Latest: 2026-03-02T18:17:33+00:00
Abstract
Recent studies have shown that reinforcement learning with KL-regularized objectives can enjoy faster rates of convergence or logarithmic regret, in contrast to the classical $\sqrt{T}$-type regret in the unregularized setting. However, the statistical efficiency of online learning with respect to KL-regularized objectives remains far from completely characterized, even when specialized to multi-armed bandits (MABs). We address this problem for MABs via a sharp analysis of KL-UCB using a novel peeling argument, which yields a $\tilde{O}(\eta K\log^2 T)$ upper bound: the first high-probability regret bound with linear dependence on $K$. Here, $T$ is the time horizon, $K$ is the number of arms, $\eta^{-1}$ is the regularization intensity, and $\tilde{O}$ hides all logarithmic factors except those involving $\log T$. The near-tightness of our analysis is certified by the first non-constant lower bound $\Omega(\eta K \log T)$, which follows from subtle hard-instance constructions and a tailored decomposition of the Bayes prior. Moreover, in the low-regularization regime (i.e., large $\eta$), we show that the KL-regularized regret for MABs is $\eta$-independent and scales as $\tilde{\Theta}(\sqrt{KT})$. Overall, our results provide a thorough understanding of KL-regularized MABs across all regimes of $\eta$ and yield nearly optimal bounds in terms of $K$, $\eta$, and $T$.
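For orientation, the usual form of a KL-regularized bandit objective and its closed-form maximizer can be written as below. This is the standard formulation with regularization intensity $\eta^{-1}$; the reference policy $\pi_{\mathrm{ref}}$ and mean-reward vector $r$ are notation assumed here, and the paper's exact definitions may differ.

```latex
% KL-regularized objective over policies \pi on K arms:
J_\eta(\pi) \;=\; \sum_{a=1}^{K} \pi(a)\, r(a)
  \;-\; \eta^{-1}\, \mathrm{KL}\!\left(\pi \,\|\, \pi_{\mathrm{ref}}\right),
\qquad
\mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}})
  = \sum_{a=1}^{K} \pi(a) \log \frac{\pi(a)}{\pi_{\mathrm{ref}}(a)} .

% Its unique maximizer is a Gibbs distribution tilted by the rewards:
\pi^{*}_\eta(a) \;=\;
  \frac{\pi_{\mathrm{ref}}(a)\, e^{\eta\, r(a)}}
       {\sum_{b=1}^{K} \pi_{\mathrm{ref}}(b)\, e^{\eta\, r(b)}},
\qquad
J_\eta(\pi^{*}_\eta) \;=\; \eta^{-1} \log \sum_{b=1}^{K} \pi_{\mathrm{ref}}(b)\, e^{\eta\, r(b)} .
```

As $\eta \to \infty$ the regularizer vanishes and $\pi^{*}_\eta$ concentrates on the best arm, which matches the abstract's remark that the large-$\eta$ regime behaves like the unregularized $\sqrt{KT}$ setting.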
Boltzmann-based Exploration for Robust Decentralized Multi-Agent Planning
Authors: Nhat Nguyen, Duong Nguyen, Gianluca Rizzo, Hung Nguyen
First: 2026-03-02T18:15:39+00:00 · Latest: 2026-03-02T18:15:39+00:00
Comments: To appear in ICAPS 2026
Abstract
Decentralized Monte Carlo Tree Search (Dec-MCTS) is widely used for cooperative multi-agent planning but struggles in sparse or skewed reward environments. We introduce Coordinated Boltzmann MCTS (CB-MCTS), which replaces deterministic UCT with a stochastic Boltzmann policy and a decaying entropy bonus for sustained yet focused exploration. While Boltzmann exploration has been studied in single-agent MCTS, applying it in multi-agent systems poses unique challenges. CB-MCTS is the first to address this. We analyze CB-MCTS in the simple-regret setting and show in simulations that it outperforms Dec-MCTS in deceptive scenarios and remains competitive on standard benchmarks, providing a robust solution for multi-agent planning.
中文标题/摘要
标题:基于玻尔兹曼探索的鲁棒分布式多智能体规划
分布式蒙特卡洛树搜索(Dec-MCTS)广泛用于协同多智能体规划,但在稀疏或偏斜奖励环境中表现不佳。我们引入了协调玻尔兹曼MCTS(CB-MCTS),用随机玻尔兹曼策略和衰减的熵奖励替代确定性的UCT,以实现持续而集中的探索。虽然单智能体MCTS中已经研究了玻尔兹曼探索,但在多智能体系统中应用它则面临独特的挑战。CB-MCTS 是第一个解决这一问题的方法。我们在简单后悔设置中分析了CB-MCTS,并在模拟中展示了它在欺骗性场景中优于Dec-MCTS,同时在标准基准上保持竞争力,为多智能体规划提供了鲁棒的解决方案。
Summary / 总结
The research aims to improve decentralized multi-agent planning in sparse or skewed reward environments by introducing Coordinated Boltzmann MCTS (CB-MCTS), which uses a stochastic Boltzmann policy and a decaying entropy bonus to enhance exploration. The method outperforms Dec-MCTS in deceptive scenarios while maintaining competitiveness on standard benchmarks, providing a robust solution for multi-agent planning.
研究旨在通过引入协调的Boltzmann MCTS (CB-MCTS) 来改善在稀疏或偏斜奖励环境中的多智能体协同规划。该方法使用随机的Boltzmann策略和衰减的熵奖励来增强探索。实验结果表明,CB-MCTS 在欺骗性场景中优于 Dec-MCTS,并在标准基准测试中保持竞争力,提供了一个多智能体规划的稳健解决方案。
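A minimal sketch of the Boltzmann (softmax) action selection that the abstract contrasts with deterministic UCT. It covers only the generic stochastic policy over child-node value estimates; the decaying entropy bonus and the multi-agent coordination of CB-MCTS are not reproduced, and the function names and temperature parameter are assumptions made for illustration.

```python
import math
import random


def boltzmann_probs(values, temperature):
    """Softmax over action values; a higher temperature flattens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(values)
    weights = [math.exp((v - top) / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def sample_action(values, temperature, rng=random):
    """Draw an action index with probability proportional to exp(value / temperature)."""
    probs = boltzmann_probs(values, temperature)
    return rng.choices(range(len(values)), weights=probs, k=1)[0]
```

Unlike an argmax rule, every action keeps nonzero probability, so low-valued branches are still visited occasionally. Lowering the temperature over time shifts the policy from exploration toward exploitation.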
Scaling Retrieval Augmented Generation with RAG Fusion: Lessons from an Industry Deployment
Authors: Luigi Medrano, Arush Verma, Mukul Chhabra
First: 2026-03-02T18:15:09+00:00 · Latest: 2026-03-02T18:15:09+00:00
Abstract
Retrieval-Augmented Generation (RAG) systems commonly adopt retrieval fusion techniques such as multi-query retrieval and reciprocal rank fusion (RRF) to increase document recall, under the assumption that higher recall leads to better answer quality. While these methods show consistent gains in isolated retrieval benchmarks, their effectiveness under realistic production constraints remains underexplored. In this work, we evaluate retrieval fusion in a production-style RAG pipeline operating over an enterprise knowledge base, with fixed retrieval depth, re-ranking budgets, and latency constraints. Across multiple fusion configurations, we find that retrieval fusion does increase raw recall, but these gains are largely neutralized after re-ranking and truncation. In our setting, fusion variants fail to outperform single-query baselines on KB-level Top-$k$ accuracy, with Hit@10 decreasing from $0.51$ to $0.48$ in several configurations. Moreover, fusion introduces additional latency overhead due to query rewriting and larger candidate sets, without corresponding improvements in downstream effectiveness. Our analysis suggests that recall-oriented fusion techniques exhibit diminishing returns once realistic re-ranking limits and context budgets are applied. We conclude that retrieval-level improvements do not reliably translate into end-to-end gains in production RAG systems, and argue for evaluation frameworks that jointly consider retrieval quality, system efficiency, and downstream impact.
中文标题/摘要
标题:使用RAG融合扩展检索增强生成:来自工业部署的经验教训
检索增强生成(RAG)系统通常采用多查询检索和互惠排名融合(RRF)等检索融合技术来提高文档召回率,假设更高的召回率会导致更好的答案质量。虽然这些方法在孤立的检索基准测试中表现出一致的改进,但在现实生产约束下的有效性仍然未被充分探索。在本文中,我们在一个针对企业知识库的生产风格RAG流水线中评估了检索融合,该流水线具有固定的检索深度、重排序预算和延迟限制。 在多种融合配置中,我们发现检索融合确实增加了原始召回率,但这些增益在重排序和截断后被大部分抵消。在我们的设置中,融合变体在KB级别的Top-$k$准确性上未能超越单查询基线,多个配置中Hit@10从$0.51$下降到$0.48$。此外,融合由于查询重写和更大的候选集引入了额外的延迟开销,而没有相应的下游效果改进。 我们的分析表明,一旦应用了现实的重排序限制和上下文预算,面向召回的融合技术将表现出递减的回报。我们得出结论,检索级别的改进在生产RAG系统中并不可靠地转化为端到端的增益,并且建议采用同时考虑检索质量、系统效率和下游影响的评估框架。
Summary / 总结
This study evaluates retrieval fusion techniques in a production RAG pipeline, finding that while these methods increase raw recall, the gains are largely negated by re-ranking and truncation. Fusion variants did not outperform single-query baselines on KB-level Top-$k$ accuracy, with Hit@10 decreasing in several configurations. Additionally, fusion introduces latency overhead without corresponding improvements in effectiveness.
该研究在具有固定约束的生产RAG管道中评估了检索融合技术。虽然检索融合增加了原始召回率,但在重新排序和截断后这些增益被大大抵消。在多个配置中,融合变体在KB级别的Top-$k$准确性上不如单查询基线,Hit@10在几个配置中有所下降。此外,融合引入了额外的延迟开销,而没有相应的下游效果改进。
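The two quantities at the center of this abstract, reciprocal rank fusion and Hit@k, can be sketched in a few lines. This is the generic RRF formula, not the authors' pipeline; the smoothing constant `k=60` is a commonly used default rather than a value from the paper, and the document IDs in the usage note are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    A document scores sum(1 / (k + rank)) over the lists containing it,
    with ranks starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def hit_at_k(ranked, relevant, k=10):
    """Return 1 if any relevant document appears in the top k, else 0."""
    return int(any(doc in relevant for doc in ranked[:k]))
```

For example, fusing `["a", "b", "c"]`, `["b", "c", "d"]` and `["c", "a", "e"]` ranks `"c"` first because it appears in all three lists. Fusion enlarges the candidate pool, but a fixed top-k cutoff after re-ranking still decides what reaches the generator, which is where the abstract reports the recall gains being lost.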
3D Field of Junctions: A Noise-Robust, Training-Free Structural Prior for Volumetric Inverse Problems
Authors: Namhoon Kim, Narges Moeini, Justin Romberg, Sara Fridovich-Keil
First: 2026-03-02T18:11:59+00:00 · Latest: 2026-03-02T18:11:59+00:00
Comments: Code will be released soon
Abstract
Volume denoising is a foundational problem in computational imaging, as many 3D imaging inverse problems face high levels of measurement noise. Inspired by the strong 2D image denoising properties of Field of Junctions (ICCV 2021), we propose a novel, fully volumetric 3D Field of Junctions (3D FoJ) representation that optimizes a junction of 3D wedges that best explain each 3D patch of a full volume, while encouraging consistency between overlapping patches. In addition to direct volume denoising, we leverage our 3D FoJ representation as a structural prior that: (i) requires no training data, and thus precludes the risk of hallucination, (ii) preserves and enhances sharp edge and corner structures in 3D, even under low signal to noise ratio (SNR), and (iii) can be used as a drop-in denoising representation via projected or proximal gradient descent for any volumetric inverse problem with low SNR. We demonstrate successful volume reconstruction and denoising with 3D FoJ across three diverse 3D imaging tasks with low-SNR measurements: low-dose X-ray computed tomography (CT), cryogenic electron tomography (cryo-ET), and denoising point clouds such as those from lidar in adverse weather. Across these challenging low-SNR volumetric imaging problems, 3D FoJ outperforms a mixture of classical and neural methods.
中文标题/摘要
标题:3D节点场:一种鲁棒的、无需训练的体数据结构先验
体数据去噪是计算成像中的基础问题,因为许多3D成像逆问题面临高测量噪声水平。受2D图像去噪中Field of Junctions(ICCV 2021)强大特性的启发,我们提出了一种新颖的全3D Field of Junctions(3D FoJ)表示法,该表示法优化了一个3D楔形的交点,以最好地解释体数据中的每个3D块,同时鼓励重叠块之间的一致性。除了直接的体数据去噪外,我们还利用3D FoJ表示法作为结构先验:(i) 不需要训练数据,从而避免了幻觉的风险;(ii) 即使在低信噪比(SNR)下也能保留和增强3D中的锐利边缘和角结构;(iii) 可以通过投影或近端梯度下降作为任何低SNR体数据逆问题的即插即用去噪表示。我们展示了在低SNR测量的三个不同3D成像任务中使用3D FoJ进行成功的体数据重建和去噪:低剂量X射线计算机断层扫描(CT)、冷冻电子断层扫描(cryo-ET)以及激光雷达点云去噪等。在这些具有挑战性的低SNR体数据成像问题中,3D FoJ优于经典和神经方法的混合。
Summary / 总结
The paper introduces a 3D Field of Junctions (3D FoJ) for volume denoising in computational imaging, which optimizes a junction of 3D wedges to explain each volume patch while encouraging consistency between overlapping patches. Because the method requires no training data, it avoids the risk of hallucination, and it preserves sharp edges and corners in 3D even at low SNR. It can also serve as a drop-in structural prior for volumetric inverse problems via projected or proximal gradient descent. The 3D FoJ outperforms classical and neural methods in low-SNR scenarios across three 3D imaging tasks: low-dose X-ray CT, cryo-ET, and lidar point cloud denoising.
论文针对低信噪比(SNR)场景下的体积去噪问题,提出了一种3D Field of Junctions(3D FoJ)表示方法,通过优化每个体积块中的3D楔形交点,并确保重叠块之间的一致性。3D FoJ被用作体积逆问题的结构先验,无需训练数据,并能增强尖锐的边缘和角落结构。实验结果表明,在低剂量X射线CT、冷冻电子断层扫描(cryo-ET)和激光雷达点云去噪等多种3D成像任务中,3D FoJ在低SNR场景下优于经典和神经网络方法。
Data-to-Energy Stochastic Dynamics
Authors: Kirill Tamogashev, Nikolay Malkin
First: 2025-09-30T15:03:55+00:00 · Latest: 2026-03-02T18:11:04+00:00
Abstract
The Schrödinger bridge problem is concerned with finding a stochastic dynamical system bridging two marginal distributions that minimises a certain transportation cost. This problem, which represents a generalisation of optimal transport to the stochastic case, has received attention due to its connections to diffusion models and flow matching, as well as its applications in the natural sciences. However, all existing algorithms can infer such dynamics only when samples from both distributions are available. In this paper, we propose the first general method for modelling Schrödinger bridges when one (or both) distributions are given by their unnormalised densities, with no access to data samples. Our algorithm relies on a generalisation of the iterative proportional fitting (IPF) procedure to the data-free case, inspired by recent developments in off-policy reinforcement learning for training of diffusion samplers. We demonstrate the efficacy of the proposed data-to-energy IPF on synthetic problems, finding that it can successfully learn transports between multimodal distributions. As a secondary consequence of our reinforcement learning formulation, which assumes a fixed time discretisation scheme for the dynamics, we find that existing data-to-data Schrödinger bridge algorithms can be substantially improved by learning the diffusion coefficient of the dynamics. Finally, we apply the newly developed algorithm to the problem of sampling posterior distributions in latent spaces of generative models, thus creating a data-free image-to-image translation method. Code: https://github.com/mmacosha/d2e-stochastic-dynamics
中文标题/摘要
标题:数据到能量随机动力学
薛定谔桥问题关注的是寻找一个随机动力学系统,该系统连接两个边缘分布并最小化某种运输成本。这一问题由于其与扩散模型和流匹配的联系以及在自然科学中的应用而受到关注。然而,现有的所有算法只能在可以从两个分布中获取样本的情况下推断出这样的动力学。在本文中,我们提出了第一个在给定一个(或两个)分布由其未归一化密度表示的情况下,无需访问数据样本即可建模薛定谔桥的通用方法。我们的算法依赖于迭代比例拟合(IPF)程序在无数据情况下的推广,受到最近在离策略强化学习中训练扩散采样器方面的进展的启发。我们展示了所提出的数据到能量IPF在合成问题上的有效性,发现它可以成功地学习多模态分布之间的传输。作为我们强化学习表述的次要结果,我们发现现有的数据到数据薛定谔桥算法可以通过学习动力学的扩散系数来显著改进。最后,我们将新开发的算法应用于生成模型潜在空间中后验分布的采样问题,从而创建了一种无数据的图像到图像转换方法。代码:https://github.com/mmacosha/d2e-stochastic-dynamics
Summary / 总结
This paper addresses the Schrödinger bridge problem, which seeks to find a stochastic dynamical system connecting two marginal distributions while minimizing transportation cost. The authors propose a novel method called data-to-energy IPF that can model such dynamics even when only unnormalised densities are available, without access to data samples. The method generalizes the iterative proportional fitting procedure and improves upon existing data-to-data Schrödinger bridge algorithms by learning the diffusion coefficient. Experiments show that the proposed method can effectively learn transports between multimodal distributions and can be applied to sample from posterior distributions in generative models, enabling data-free image-to-image translation.
本文解决了Schrödinger桥问题,目标是在两个边缘分布之间找到一个最小化运输成本的随机动力系统。作者提出了一种新的方法,在仅提供非归一化密度且无数据样本访问的情况下推断此类动力系统。该方法将迭代比例拟合过程进行了扩展,并通过学习动力系统的扩散系数改进了现有的数据到数据Schrödinger桥算法。所提出的算法成功地在多模态分布之间学习了运输,并应用于生成模型的后验分布采样,从而实现了无数据的图像到图像的转换。
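To make the iterative proportional fitting (IPF) idea concrete, here is a sketch of its classical discrete form, where a positive kernel matrix is alternately rescaled to match two known marginals. This is the textbook finite-state procedure with both marginals given explicitly, not the paper's data-free, continuous-time algorithm; the function name and the example numbers are illustrative assumptions.

```python
def ipf(kernel, row_marginal, col_marginal, iterations=200):
    """Alternately rescale rows and columns of a positive matrix to match marginals.

    kernel: list of lists with strictly positive entries (the reference coupling).
    row_marginal, col_marginal: target distributions, each summing to the same total.
    """
    n, m = len(kernel), len(kernel[0])
    plan = [row[:] for row in kernel]
    for _ in range(iterations):
        # Half-step 1: match the row marginal exactly.
        for i in range(n):
            scale = row_marginal[i] / sum(plan[i])
            plan[i] = [x * scale for x in plan[i]]
        # Half-step 2: match the column marginal exactly.
        for j in range(m):
            scale = col_marginal[j] / sum(plan[i][j] for i in range(n))
            for i in range(n):
                plan[i][j] *= scale
    return plan
```

Each half-step fixes one marginal and slightly disturbs the other, and the alternation converges to the coupling closest to the kernel in KL divergence. Schrödinger bridge solvers apply the same alternation to forward and backward path measures instead of matrix rows and columns.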
Multi-Marginal Flow Matching with Adversarially Learnt Interpolants
Authors: Oskar Kviman, Kirill Tamogashev, Nicola Branchini, Víctor Elvira, Jens Lagergren, Nikolay Malkin
First: 2025-10-01T17:47:27+00:00 · Latest: 2026-03-02T18:10:31+00:00
Abstract
Learning the dynamics of a process given sampled observations at several time points is an important but difficult task in many scientific applications. When no ground-truth trajectories are available, but one has only snapshots of data taken at discrete time steps, the problem of modelling the dynamics, and thus inferring the underlying trajectories, can be solved by multi-marginal generalisations of flow matching algorithms. This paper proposes a novel flow matching method that overcomes the limitations of existing multi-marginal trajectory inference algorithms. Our proposed method, ALI-CFM, uses a GAN-inspired adversarial loss to fit neurally parametrised interpolant curves between source and target points such that the marginal distributions at intermediate time points are close to the observed distributions. The resulting interpolants are smooth trajectories that, as we show, are unique under mild assumptions. These interpolants are subsequently marginalised by a flow matching algorithm, yielding a trained vector field for the underlying dynamics. We showcase the versatility and scalability of our method by outperforming the existing baselines on spatial transcriptomics and cell tracking datasets, while performing on par with them on single-cell trajectory prediction. Code: https://github.com/mmacosha/adversarially-learned-interpolants.
中文标题/摘要
标题:多边际流匹配与对抗学习插值
在多个时间点采样观测值的情况下学习过程的动力学是一个在许多科学应用中既重要又困难的任务。当没有真实路径可供参考,只有在离散时间步长上采集的数据快照时,通过多边际流匹配算法的泛化来建模动力学并推断潜在路径的问题可以得到解决。本文提出了一种新颖的流匹配方法,克服了现有多边际轨迹推断算法的局限性。我们提出的ALI-CFM方法使用GAN启发式的对抗损失来拟合神经参数化的插值曲线,使得中间时间点的边际分布接近观察到的分布。由此产生的插值是平滑的轨迹,如我们所展示的,在轻微假设下是唯一的。这些插值随后通过流匹配算法进行边际化,产生一个训练后的向量场,用于潜在的动力学。我们通过在空间转录组学和细胞追踪数据集上优于现有基线,在单细胞轨迹预测上与它们表现相当,展示了我们方法的灵活性和可扩展性。代码:https://github.com/mmacosha/adversarially-learned-interpolants.
Summary / 总结
This paper addresses the challenge of inferring the dynamics of a process from discrete snapshots without ground-truth trajectories. It introduces ALI-CFM, a novel flow matching method that uses an adversarial loss to fit smooth interpolant curves between source and target points, ensuring that the marginal distributions at intermediate time points match observed data. The method outperforms existing baselines on spatial transcriptomics and cell tracking datasets and performs comparably on single-cell trajectory prediction tasks.
该论文解决了从离散快照中推断动态而没有地面真实轨迹的问题,提出了一种名为ALI-CFM的新颖流匹配方法,该方法使用对抗学习来拟合平滑的插值曲线。该方法确保中间时间点的边际分布与观测数据匹配,在温和假设下生成唯一的轨迹。ALI-CFM在空间转录组学和细胞追踪数据集上优于现有方法,并在单细胞轨迹预测任务上表现相当。
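The flow matching step that "marginalises" the interpolants can be summarized by the generic conditional flow matching regression below. The interpolant notation $\varphi$ and vector field $v_\theta$ are assumptions made here; in ALI-CFM the interpolant is a learned neural curve, whereas the second line shows only the common linear special case.

```latex
% Conditional flow matching with a differentiable interpolant \varphi:
\mathcal{L}(\theta) \;=\;
  \mathbb{E}_{t \sim \mathcal{U}[0,1],\; (x_0, x_1)}
  \bigl\| v_\theta\bigl(t, \varphi_t(x_0, x_1)\bigr)
        - \partial_t \varphi_t(x_0, x_1) \bigr\|^2 .

% Linear special case: straight-line paths with constant target velocity.
\varphi_t(x_0, x_1) = (1 - t)\, x_0 + t\, x_1 ,
\qquad
\partial_t \varphi_t(x_0, x_1) = x_1 - x_0 .
```

With more than two observed time points, straight lines between endpoints generally miss the intermediate marginals, which is the gap that adversarially fitted interpolants are meant to close.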
Rethinking Camera Choice: An Empirical Study on Fisheye Camera Properties in Robotic Manipulation
Authors: Han Xue, Nan Min, Xiaotong Liu, Wendi Chen, Yuan Fang, Jun Lv, Cewu Lu, Chuan Wen
Venue: CVPR 2026
First: 2026-03-02T18:00:37+00:00 · Latest: 2026-03-02T18:00:37+00:00
Comments: 22 pages, 15 figures, Accepted by CVPR 2026
Abstract
The adoption of fisheye cameras in robotic manipulation, driven by their exceptionally wide Field of View (FoV), is rapidly outpacing a systematic understanding of their downstream effects on policy learning. This paper presents the first comprehensive empirical study to bridge this gap, rigorously analyzing the properties of wrist-mounted fisheye cameras for imitation learning. Through extensive experiments in both simulation and the real world, we investigate three critical research questions: spatial localization, scene generalization, and hardware generalization. Our investigation reveals that: (1) The wide FoV significantly enhances spatial localization, but this benefit is critically contingent on the visual complexity of the environment. (2) Fisheye-trained policies, while prone to overfitting in simple scenes, unlock superior scene generalization when trained with sufficient environmental diversity. (3) While naive cross-camera transfer leads to failures, we identify the root cause as scale overfitting and demonstrate that hardware generalization performance can be improved with a simple Random Scale Augmentation (RSA) strategy. Collectively, our findings provide concrete, actionable guidance for the large-scale collection and effective use of fisheye datasets in robotic learning. More results and videos are available on https://robo-fisheye.github.io/
中文标题/摘要
标题:重新思考相机选择:机器人操作中鱼眼相机属性的实证研究
鱼眼相机在机器人操作中的应用,得益于其极宽的视野(FoV),正迅速超越对其下游效果的系统理解。本文首次进行全面的实证研究,严格分析了腕部安装的鱼眼相机在模仿学习中的属性。通过在仿真和现实世界中的大量实验,我们探讨了三个关键研究问题:空间定位、场景泛化和硬件泛化。我们的研究发现:(1)宽广的FoV显著增强了空间定位,但这种优势取决于环境的视觉复杂度。(2)虽然鱼眼训练的策略在简单场景中容易过拟合,但在具有足够环境多样性的场景中训练时,可以解锁更好的场景泛化。(3)虽然简单的跨相机转移会导致失败,但我们确定其根本原因是尺度过拟合,并证明通过简单的随机尺度增强(RSA)策略可以提高硬件泛化性能。综上所述,我们的研究结果为大规模收集和有效使用鱼眼数据集提供了具体的、可操作的指导。更多结果和视频可在https://robo-fisheye.github.io/获取。
Summary / 总结
This paper addresses the rapid adoption of fisheye cameras in robotic manipulation by conducting a comprehensive empirical study. Through simulation and real-world experiments, the authors investigate spatial localization, scene generalization, and hardware generalization. Key findings include that the wide FoV enhances spatial localization but is environment-dependent, fisheye-trained policies generalize better with diverse scenes, and hardware generalization can be improved with Random Scale Augmentation. These results offer practical guidance for using fisheye datasets in robotic learning.
本文通过全面的实证研究,探讨了鱼眼相机在机器人操作中的应用。通过仿真和实际实验,作者研究了鱼眼相机对空间定位、场景泛化和硬件泛化的影响。主要发现包括:宽视野显著提升了空间定位,但依赖于环境复杂度;鱼眼训练的策略在多样化的场景中表现出更好的泛化能力;通过简单的随机尺度增强(RSA)策略,可以改善硬件泛化性能。这些结果为鱼眼相机在机器人学习中的有效使用提供了实用指导。
NextAds: Towards Next-generation Personalized Video Advertising
Authors: Yiyan Xu, Ruoxuan Xia, Wuqiang Zheng, Fengbin Zhu, Wenjie Wang, Fuli Feng
First: 2026-03-02T17:58:07+00:00 · Latest: 2026-03-02T17:58:07+00:00
Abstract
With the rapid growth of online video consumption, video advertising has become increasingly dominant in the digital advertising landscape. Yet diverse users and viewing contexts make one-size-fits-all ad creatives insufficient for consistent effectiveness, underlining the importance of personalization. In practice, most personalized video advertising systems follow a retrieval-based paradigm, selecting the optimal one from a small set of professionally pre-produced creatives for each user. Such static and finite inventories limit both the granularity and the timeliness of personalization, and prevent the creatives from being continuously refined based on online user feedback. Recent advances in generative AI make it possible to move beyond retrieval toward optimizing video creatives in a continuous space at serving time. In this light, we propose NextAds, a generation-based paradigm for next-generation personalized video advertising, and conceptualize NextAds with four core components. To enable comparable research progress, we formulate two representative tasks: personalized creative generation and personalized creative integration, and introduce corresponding lightweight benchmarks. To assess feasibility, we instantiate end-to-end pipelines for both tasks and conduct initial exploratory experiments, demonstrating that GenAI can generate and integrate personalized creatives with encouraging performance. Moreover, we discuss the key challenges and opportunities under this paradigm, aiming to provide actionable insights for both researchers and practitioners and to catalyze progress in personalized video advertising.
中文标题/摘要
标题:NextAds:迈向下一代个性化视频广告
随着在线视频消费的快速增长,视频广告已成为数字广告领域中越来越重要的组成部分。然而,多样化的用户和观看环境使得一刀切的广告创意无法保证持续的效果,突显了个性化的重要性。实践中,大多数个性化视频广告系统遵循一种检索式范式,为每位用户从少量的专业预制作创意中选择最合适的创意。这种静态和有限的库存限制了个性化的时间性和精细度,并阻止了创意根据在线用户反馈进行持续优化。最近生成式AI的进步使得在服务时优化视频创意成为可能,从而超越了检索式范式。 在此背景下,我们提出了NextAds,一种基于生成的下一代个性化视频广告范式,并构思了NextAds的四个核心组件。为了促进可比的研究进展,我们制定了两个代表性任务:个性化创意生成和个性化创意整合,并引入了相应的轻量级基准。为了评估可行性,我们为这两个任务实例化了端到端的管道并进行了初步探索性实验,证明了生成式AI可以生成和整合具有令人鼓舞性能的个性化创意。此外,我们讨论了该范式下的关键挑战和机遇,旨在为研究人员和实践者提供可操作的见解,并促进个性化视频广告的发展。
Summary / 总结
NextAds aims to enhance personalized video advertising by leveraging generative AI to optimize video creatives in real-time. It introduces a generation-based paradigm with four core components and two benchmark tasks. Initial experiments show promising results in generating and integrating personalized creatives, highlighting the potential of GenAI in overcoming the limitations of traditional retrieval-based systems.
NextAds 旨在通过利用生成式 AI 实时优化视频创意来提升个性化视频广告。该系统包含四个核心组件和两个基准任务以促进研究。初步实验显示,在生成和整合个性化创意方面表现出色,解决了传统检索式系统中静态库存的局限性。
MuFlex: A Scalable, Physics-based Platform for Multi-Building Flexibility Analysis and Coordination
Authors: Ziyan Wu, Ivan Korolija, Rui Tang
First: 2025-08-19T05:44:06+00:00 · Latest: 2026-03-02T17:55:58+00:00
Comments: The platform is released open-source on GitHub: https://github.com/BuildNexusX/MuFlex
Abstract
With the increasing penetration of renewable generation on the power grid, maintaining system balance requires coordinated demand flexibility from aggregations of buildings. Reinforcement learning has been widely explored for building controls because of its model-free nature. Open-source simulation testbeds are essential not only for training RL agents but also for fairly benchmarking control strategies. However, most building-sector testbeds target single buildings; multi-building platforms are relatively limited and typically rely on simplified models (e.g., Resistance-Capacitance) or data-driven approaches, which cannot fully capture the physical intricacies and intermediate variables necessary for interpreting control performance. Moreover, these platforms often impose fixed inputs, outputs, and model formats, restricting their applicability as benchmarking tools across diverse control scenarios. To address these gaps, MuFlex, a scalable, open-source platform for multi-building flexibility coordination, was developed. MuFlex enables synchronous information exchange and co-simulation across multiple detailed building models programmed in EnergyPlus and Modelica, and adheres to the latest OpenAI Gym interface, providing a modular, standardized RL implementation. The platform's physics-based capabilities and workflow were demonstrated in a case study coordinating demand flexibility across four office buildings using the Soft Actor-Critic (SAC) algorithm. The results show that, when coordinating the four buildings, SAC reduced the aggregated peak demand by nearly 12% and kept power demand below the threshold while maintaining indoor comfort. Additionally, the platform's scalability was investigated through computational benchmarking on building clusters with varying sizes, model types, and simulation programs.
中文标题/摘要
标题:MuFlex:一种可扩展的基于物理的多建筑灵活性分析与协调平台
随着可再生能源在电网中的渗透率不断提高,维持系统平衡需要通过建筑聚合体协调需求灵活性。由于其无模型特性,强化学习在建筑控制中得到了广泛应用。开源仿真测试床不仅对于训练RL代理至关重要,而且对于公平地评估控制策略也必不可少。然而,大多数建筑领域的测试床仅针对单个建筑;多建筑平台相对较少,通常依赖于简化模型(如电阻-电容模型)或数据驱动方法,缺乏完全捕捉物理复杂性和中间变量的能力,以解释控制性能。此外,这些平台往往固定输入、输出和模型格式,限制了它们作为跨不同控制场景的基准工具的应用。为了解决这些差距,开发了MuFlex,这是一种可扩展的多建筑灵活性协调开源平台。MuFlex允许跨多个使用EnergyPlus和Modelica编写的详细建筑模型进行同步信息交换和联合仿真,并遵循最新的OpenAI Gym接口,提供模块化、标准化的RL实现。通过使用Soft Actor-Critic算法协调四栋办公楼的需求灵活性,案例研究展示了平台的物理基础能力和工作流程。结果显示,在四栋建筑的协调下,SAC有效降低了约12%的峰值需求,同时保持室内舒适度,确保电力需求低于阈值。此外,通过在不同规模、不同类型模型和仿真程序的建筑集群上进行计算基准测试,研究了平台的可扩展性。
Summary / 总结
MuFlex is a scalable, physics-based platform designed for analyzing and coordinating multi-building flexibility in the power grid. It uses reinforcement learning to manage demand flexibility across multiple buildings, leveraging detailed models from EnergyPlus and Modelica. MuFlex demonstrates its effectiveness by reducing the aggregated peak demand by nearly 12% across four office buildings using the Soft Actor-Critic algorithm, while maintaining indoor comfort. The platform’s scalability is also validated through computational benchmarks on various building clusters.
MuFlex 是一个可扩展的基于物理的平台,用于分析和协调多栋建筑的需求响应灵活性。它使用强化学习来管理建筑控制,并遵循 OpenAI Gym 接口进行标准化实现。MuFlex 通过减少四栋办公楼的总峰值需求近 12% 来证明其有效性,同时保持室内舒适度。该平台的可扩展性还通过各种建筑集群的计算基准测试得到了验证。
OnlineX: Unified Online 3D Reconstruction and Understanding with Active-to-Stable State Evolution
Authors: Chong Xia, Fangfu Liu, Yule Wang, Yize Pang, Yueqi Duan
First: 2026-03-02T17:52:02+00:00 · Latest: 2026-03-02T17:52:02+00:00
Abstract
Recent advances in generalizable 3D Gaussian Splatting (3DGS) have enabled rapid 3D scene reconstruction within seconds, eliminating the need for per-scene optimization. However, existing methods primarily follow an offline reconstruction paradigm, lacking the capacity for continuous reconstruction, which limits their applicability to online scenarios such as robotics and VR/AR. In this paper, we introduce OnlineX, a feed-forward framework that reconstructs both 3D visual appearance and language fields in an online manner using only streaming images. A key challenge in online formulation is the cumulative drift issue, which is rooted in the fundamental conflict between two opposing roles of the memory state: an active role that constantly refreshes to capture high-frequency local geometry, and a stable role that conservatively accumulates and preserves the long-term global structure. To address this, we introduce a decoupled active-to-stable state evolution paradigm. Our framework decouples the memory state into a dedicated active state and a persistent stable state, and then cohesively fuses the information from the former into the latter to achieve both fidelity and stability. Moreover, we jointly model visual appearance and language fields and incorporate an implicit Gaussian fusion module to enhance reconstruction quality. Experiments on mainstream datasets demonstrate that our method consistently outperforms prior work in novel view synthesis and semantic understanding, showcasing robust performance across input sequences of varying lengths with real-time inference speed.
中文标题/摘要
标题:OnlineX:统一的在线3D重建与理解及动态到稳定状态演变
通用3D高斯点云(3DGS)的最新进展使得在几秒钟内快速重建3D场景成为可能,消除了对每场景优化的需要。然而,现有方法主要遵循离线重建范式,缺乏连续重建的能力,限制了其在机器人技术和VR/AR等在线场景中的应用。本文中,我们提出了OnlineX,这是一种前馈框架,仅使用流式图像在线重建3D视觉外观和语言领域。在线建模中的一个关键挑战是累积漂移问题,其根源在于记忆状态的两种对立角色之间的基本冲突:一种活跃角色不断刷新以捕捉高频局部几何结构,另一种稳定角色保守地累积并保存长期全局结构。为了解决这一问题,我们引入了一种分离的活跃到稳定状态演变范式。我们的框架将记忆状态分离为专用的活跃状态和持久的稳定状态,然后将前者的相关信息融合到后者中,以实现准确性和稳定性。此外,我们联合建模视觉外观和语言领域,并引入了一个隐式高斯融合模块以提高重建质量。在主流数据集上的实验表明,我们的方法在新颖视图合成和语义理解方面始终优于先前的工作,展示了在不同长度的输入序列中具有鲁棒性能的实时推理速度。
Summary / 总结
The research aims to address the limitations of offline 3D reconstruction methods by developing OnlineX, a feed-forward framework for online 3D scene reconstruction and understanding. It tackles the cumulative drift issue through a decoupled active-to-stable state evolution paradigm, which separates the memory state into an active state for refreshing local geometry and a stable state for preserving global structure. The method also jointly models visual appearance and language fields, and incorporates an implicit Gaussian fusion module to improve reconstruction quality. Experimental results show that OnlineX outperforms previous methods in novel view synthesis and semantic understanding, with real-time inference capabilities.
研究动机是解决现有离线3D重建方法在处理机器人和VR/AR等在线场景时的局限性。主要方法是提出一种前馈框架OnlineX,该框架使用流媒体图像在线重建3D视觉外观和语言字段。实验结果表明,OnlineX在新颖视图合成和语义理解方面优于先前的方法,具有在不同输入序列长度下表现出的鲁棒性能和实时推理速度。
SimRecon: SimReady Compositional Scene Reconstruction from Real Videos
Authors: Chong Xia, Kai Zhu, Zizhuo Wang, Fangfu Liu, Zhizheng Zhang, Yueqi Duan
First: 2026-03-02T17:51:45+00:00 · Latest: 2026-03-02T17:51:45+00:00
Abstract
Compositional scene reconstruction seeks to create object-centric representations rather than holistic scenes from real-world videos, which makes it natively applicable to simulation and interaction. Conventional compositional reconstruction approaches primarily emphasize visual appearance and show limited generalization ability to real-world scenarios. In this paper, we propose SimRecon, a framework that realizes a "Perception-Generation-Simulation" pipeline towards cluttered scene reconstruction, which first conducts scene-level semantic reconstruction from video input, then performs single-object generation, and finally assembles these assets in the simulator. However, naively combining these three stages leads to visual infidelity of generated assets and physical implausibility of the final scene, a problem particularly severe for complex scenes. Thus, we further propose two bridging modules between the three stages to address this problem. Specifically, for the transition from Perception to Generation, critical for visual fidelity, we introduce Active Viewpoint Optimization, which actively searches in 3D space to acquire optimal projected images as conditions for single-object completion. Moreover, for the transition from Generation to Simulation, essential for physical plausibility, we propose a Scene Graph Synthesizer, which guides the construction from scratch in 3D simulators, mirroring the native, constructive principle of the real world. Extensive experiments on the ScanNet dataset validate our method's superior performance over previous state-of-the-art approaches.
中文标题/摘要
标题:SimRecon:基于真实视频的SimReady组合式场景重建
组合场景重建旨在从真实世界的视频中创建以对象为中心的表示,而不是整体场景,这原生适用于模拟和交互。传统的组合重建方法主要强调视觉外观,并在现实世界场景中的泛化能力有限。在本文中,我们提出了一种名为SimRecon的框架,该框架实现了“感知-生成-模拟”流水线,首先从视频输入中进行场景级语义重建,然后进行单个对象生成,最后在模拟器中组装这些资产。然而,简单地将这三个阶段结合起来会导致生成资产的视觉不真实以及最终场景的物理不可信,特别是在复杂场景中问题尤为严重。因此,我们进一步提出了三个阶段之间的两个桥梁模块来解决这个问题。具体而言,对于从感知到生成的过渡,对于视觉真实性的关键,我们引入了主动视角优化,该方法在三维空间中主动搜索以获取单个对象完成的最佳投影图像作为条件。此外,对于从生成到模拟的过渡,对于物理可信度至关重要,我们提出了一种场景图合成器,该合成器在三维模拟器中从零开始引导构建,反映了现实世界原生的构造原则。在ScanNet数据集上的广泛实验验证了我们方法在以前最先进的方法上的优越性能。
Summary / 总结
SimRecon is a framework that addresses the limitations of conventional compositional scene reconstruction by introducing a 'Perception-Generation-Simulation' pipeline. It first reconstructs scene-level semantics, then generates single objects, and finally assembles them in a simulator. To improve visual fidelity and physical plausibility, SimRecon includes Active Viewpoint Optimization and a Scene Graph Synthesizer. Experiments show that SimRecon outperforms previous methods on the ScanNet dataset.
SimRecon 是一个框架,通过遵循‘感知-生成-模拟’管道从真实世界视频中重建杂乱场景。它首先进行场景级别的语义重建,然后生成单个对象,最后在模拟器中组装它们。为了解决视觉不真实性和物理不现实性问题,SimRecon 引入了 Active Viewpoint Optimization 用于感知-生成过渡,并提出了 Scene Graph Synthesizer 用于生成-模拟过渡。实验结果表明,SimRecon 在视觉真实性和物理现实性方面均优于先前的方法。
Stereo-Inertial Poser: Towards Metric-Accurate Shape-Aware Motion Capture Using Sparse IMUs and a Single Stereo Camera
Authors: Tutian Tang, Xingyu Ji, Yutong Li, MingHao Liu, Wenqiang Xu, Cewu Lu
Venue: ICRA 2026
First: 2026-03-02T17:46:38+00:00 · Latest: 2026-03-02T17:46:38+00:00
Comments: Accepted to ICRA 2026. The code, data, and supplementary materials are available at https://sites.google.com/view/stereo-inertial-poser
Abstract
Recent advancements in visual-inertial motion capture systems have demonstrated the potential of combining monocular cameras with sparse inertial measurement units (IMUs) as cost-effective solutions, which effectively mitigate occlusion and drift issues inherent in single-modality systems. However, they are still limited by metric inaccuracies in global translations stemming from monocular depth ambiguity, and shape-agnostic local motion estimations that ignore anthropometric variations. We present Stereo-Inertial Poser, a real-time motion capture system that leverages a single stereo camera and six IMUs to estimate metric-accurate and shape-aware 3D human motion. By replacing the monocular RGB with stereo vision, our system resolves depth ambiguity through calibrated baseline geometry, enabling direct 3D keypoint extraction and body shape parameter estimation. IMU data and visual cues are fused for predicting drift-compensated joint positions and root movements, while a novel shape-aware fusion module dynamically harmonizes anthropometry variations with global translations. Our end-to-end pipeline achieves over 200 FPS without optimization-based post-processing, enabling real-time deployment. Quantitative evaluations across various datasets demonstrate state-of-the-art performance. Qualitative results show our method produces drift-free global translation under a long recording time and reduces foot-skating effects.
中文标题/摘要
标题:立体惯性姿态器:利用稀疏IMU和单个立体相机实现度量准确的形状感知运动捕捉
视觉惯性运动捕捉系统的最新进展表明,将单目相机与稀疏惯性测量单元(IMU)结合使用作为低成本解决方案的潜力,这有效地缓解了单一模态系统固有的遮挡和漂移问题。然而,它们仍然受到源自单目深度歧义的全局平移度量不准确性的限制,以及忽略人体测量变异性的形状无关的局部运动估计。我们提出了立体惯性姿态器,这是一种实时运动捕捉系统,利用单个立体相机和六个IMU来估计度量准确且形状感知的3D人体运动。通过用立体视觉替换单目RGB,我们的系统通过校准基线几何结构解决深度歧义,从而直接提取3D关键点并估计身体形状参数。IMU数据和视觉线索被融合以预测漂移补偿的关节位置和根部运动,同时一个新颖的形状感知融合模块动态地将人体测量变异与全局平移进行协调。我们的端到端管道在无需优化后处理的情况下实现了超过200 FPS,使实时部署成为可能。在各种数据集上的定量评估表明了最先进的性能。定性结果表明,我们的方法在长时间录制下产生无漂移的全局平移,并减少了脚滑效果。
Summary / 总结
Stereo-Inertial Poser is a real-time motion capture system that uses a single stereo camera and six IMUs to estimate metric-accurate and shape-aware 3D human motion. By leveraging stereo vision, the system resolves monocular depth ambiguity and directly extracts 3D keypoints and body shape parameters. IMU and visual cues are fused to predict drift-compensated joint positions and root movements, with a novel shape-aware fusion module that harmonizes anthropometric variations with global translations. The system runs at over 200 FPS and achieves state-of-the-art performance in quantitative evaluations; qualitative results show drift-free global translation over long recordings and reduced foot-skating.
Stereo-Inertial Poser 是一种结合单个立体相机和六个 IMU 的实时动作捕捉系统,能够估计度量准确且体型感知的 3D 人体运动。通过使用立体视觉,系统解决了单目深度歧义问题,并直接提取 3D 关键点和体型参数。IMU 数据和视觉线索被融合以预测补偿漂移的关节位置和根部运动,同时引入了一个新的体型感知融合模块,以动态协调体型差异与全局平移。该系统的运行速度超过 200 FPS,在定量评估中达到最先进的性能;定性结果显示,它在长时间录制下实现无漂移的全局平移,并减少了脚滑现象。
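The claim that a calibrated baseline resolves depth ambiguity follows from standard rectified-stereo geometry: for focal length f (in pixels), baseline B (in meters), and disparity d (in pixels), depth is Z = f * B / d. The sketch below illustrates that textbook relation only; the function names and the pinhole parameters are hypothetical, and the paper's actual keypoint pipeline is not described in this entry.

```python
from typing import Tuple


def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Metric depth of a point seen by a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the rig")
    return focal_px * baseline_m / disparity_px


def triangulate_keypoint(u_left: float, v: float, u_right: float,
                         focal_px: float, baseline_m: float,
                         cx: float, cy: float) -> Tuple[float, float, float]:
    """Back-project a matched 2D keypoint into the left camera frame (meters)."""
    z = depth_from_disparity(focal_px, baseline_m, u_left - u_right)
    # Pinhole back-projection using the principal point (cx, cy).
    x = (u_left - cx) * z / focal_px
    y = (v - cy) * z / focal_px
    return x, y, z
```

Because B is known from calibration, the recovered depth is metric, which is what a monocular camera cannot provide without additional priors.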
LiftAvatar: Kinematic-Space Completion for Expression-Controlled 3D Gaussian Avatar Animation
Authors: Hualiang Wei, Shunran Jia, Jialun Liu, Wenhui Li
First: 2026-03-02T17:46:32+00:00 · Latest: 2026-03-02T17:46:32+00:00
Comments: 19 pages, 11 figures
Abstract
We present LiftAvatar, a new paradigm that completes sparse monocular observations in kinematic space (e.g., facial expressions and head pose) and uses the completed signals to drive high-fidelity avatar animation. LiftAvatar is a fine-grained, expression-controllable large-scale video diffusion Transformer that synthesizes high-quality, temporally coherent expression sequences conditioned on single or multiple reference images. The key idea is to lift incomplete input data into a richer kinematic representation, thereby strengthening both reconstruction and animation in downstream 3D avatar pipelines. To this end, we introduce (i) a multi-granularity expression control scheme that combines shading maps with expression coefficients for precise and stable driving, and (ii) a multi-reference conditioning mechanism that aggregates complementary cues from multiple frames, enabling strong 3D consistency and controllability. As a plug-and-play enhancer, LiftAvatar directly addresses the limited expressiveness and reconstruction artifacts of 3D Gaussian Splatting-based avatars caused by sparse kinematic cues in everyday monocular videos. By expanding incomplete observations into diverse pose-expression variations, LiftAvatar also enables effective prior distillation from large-scale video generative models into 3D pipelines, leading to substantial gains. Extensive experiments show that LiftAvatar consistently boosts animation quality and quantitative metrics of state-of-the-art 3D avatar methods, especially under extreme, unseen expressions.
中文标题/摘要
标题:LiftAvatar:基于运动空间完成的表情可控3D高斯 avatar 动画
我们提出了LiftAvatar,一种新的范式,用于在运动空间中完成稀疏的单目观测(例如面部表情和头部姿态),并使用完成的信号来驱动高质量的avatar动画。LiftAvatar是一个细粒度的表情可控大规模视频扩散Transformer,能够根据单张或多张参考图像合成高质量、时间上一致的表情序列。关键思想是将不完整的输入数据提升到更丰富的运动表示,从而在下游3D avatar管道中增强重建和动画。为此,我们引入了(i)一种多粒度的表情控制方案,结合光照图与表情系数,实现精确和稳定的驱动,以及(ii)一种多参考条件机制,从多帧中聚合互补线索,实现强大的3D一致性和可控性。作为即插即用增强器,LiftAvatar直接解决了基于稀疏运动线索的3D高斯点绘制avatar的有限表现力和重建伪影。通过将不完整的观测扩展为多样的姿态-表情变化,LiftAvatar还能够将大规模视频生成模型中的先验知识有效地提炼到3D管道中,从而带来显著的提升。大量实验表明,LiftAvatar在提升3D avatar方法的动画质量和定量指标方面表现出色,尤其是在极端、未见过的表情下。
Summary / 总结
LiftAvatar is a method that completes sparse monocular observations in kinematic space to drive high-fidelity 3D avatar animation. It uses a video diffusion Transformer to synthesize high-quality, temporally coherent expression sequences. Key findings include improved animation quality and quantitative metrics, especially for extreme expressions, by expanding incomplete observations into diverse pose-expression variations. This approach enhances both reconstruction and animation in 3D avatar pipelines, addressing limitations of 3D Gaussian Splatting-based avatars due to sparse kinematic cues.
LiftAvatar 是一种方法,它通过完成稀疏的单目观测数据在运动学空间中的表示,并使用这些数据来驱动高质量的虚拟角色动画。它利用视频扩散变换器从单个或多个参考图像中合成高质量、时间上连贯的表情序列。主要发现包括提升最先进的 3D 虚拟角色方法的动画质量和定量指标,特别是在极端表情方面。该方法通过将不完整的观测数据扩展为多样化的姿态-表情变化,增强了表达性和减少了基于 3D 高斯散射的虚拟角色的重建伪影。
A 3D mesh convolution-based autoencoder for geometry compression
Authors: Germain Bregeon, Marius Preda, Radu Ispas, Titus Zaharia
First: 2026-03-02T17:42:58+00:00 · Latest: 2026-03-02T17:42:58+00:00
Abstract
In this paper, we introduce a novel 3D mesh convolution-based autoencoder for geometry compression, able to deal with irregular mesh data without requiring either preprocessing or manifold/watertightness conditions. The proposed approach extracts meaningful latent representations by learning features directly from the mesh faces, while preserving connectivity through dedicated pooling and unpooling operations. The encoder compresses the input mesh into a compact base mesh space, which ensures that the latent space remains comparable. The decoder reconstructs the original connectivity and restores the compressed geometry to its full resolution. Extensive experiments on multi-class datasets demonstrate that our method outperforms state-of-the-art approaches in both 3D mesh geometry reconstruction and latent space classification tasks. Code available at: github.com/germainGB/MeshConv3D
中文标题/摘要
标题:基于3D网格卷积的自编码器用于几何压缩
在本文中,我们提出了一种基于3D网格卷积的自编码器,用于处理不规则网格数据,无需预处理或满足流形/水密条件。所提出的方法通过直接从网格面学习特征来提取有意义的潜在表示,并通过专用的池化和反池化操作保持连接性。编码器将输入网格压缩到紧凑的基本网格空间,这确保了潜在空间的一致性。解码器重建原始连接性并恢复压缩的几何形状到其全分辨率。在多类数据集上的广泛实验表明,我们的方法在3D网格几何重建和潜在空间分类任务中均优于现有最先进的方法。代码可在github.com/germainGB/MeshConv3D获取
Summary / 总结
The research aims to develop a 3D mesh convolution-based autoencoder for geometry compression that can handle irregular mesh data without preprocessing or specific conditions. The method learns features directly from mesh faces and uses pooling and unpooling operations to preserve connectivity. Experiments show that the proposed approach outperforms existing methods in both 3D mesh geometry reconstruction and latent space classification tasks.
本文提出了一种基于3D网格卷积的自编码器,用于几何压缩,能够处理不规则网格数据,无需预处理或特定流形条件。该模型直接从网格面学习特征,并使用池化和反池化操作来保持连接性。编码器将输入网格压缩到一个紧凑的空间中,解码器则重建原始几何形状。实验表明,该方法在3D网格几何重建和潜在空间分类任务中均优于现有方法。
Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy
Authors: Jiahao Huang, Fengyan Lin, Xuechao Yang, Chen Feng, Kexin Zhu, Xu Yang, Zhide Chen
First: 2026-03-02T17:42:33+00:00 · Latest: 2026-03-02T17:42:33+00:00
Comments: 17 pages, 8 figures, The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026
Abstract
The development of affective multimodal language models (MLMs) has long been constrained by a gap between low-level perception and high-level interaction, leading to fragmented affective capabilities and limited generalization. To bridge this gap, we propose a cognitively inspired three-level hierarchy that organizes affective tasks according to their cognitive depth (perception, understanding, and interaction) and provides a unified conceptual foundation for advancing affective modeling. Guided by this hierarchy, we introduce Nano-EmoX, a small-scale multitask MLM, and P2E (Perception-to-Empathy), a curriculum-based training framework. Nano-EmoX integrates a suite of omni-modal encoders, including an enhanced facial encoder and a fusion encoder, to capture key multimodal affective cues and improve cross-task transferability. The outputs are projected into a unified language space via heterogeneous adapters, empowering a lightweight language model to tackle diverse affective tasks. Concurrently, P2E progressively cultivates emotional intelligence by aligning rapid perception with chain-of-thought-driven empathy. To the best of our knowledge, Nano-EmoX is the first compact MLM (2.2B) to unify six core affective tasks across all three hierarchy levels, achieving state-of-the-art or highly competitive performance across multiple benchmarks, demonstrating excellent efficiency and generalization.
中文标题/摘要
标题:Nano-EmoX:从感知到共情的统一多模态情感智能
情感多模态语言模型(MLM)的发展长期以来受到低级感知与高级交互之间差距的限制,导致情感能力碎片化且泛化能力有限。为弥合这一差距,我们提出了一种认知启发式的三层层次结构,根据认知深度组织情感任务为感知、理解与交互,并提供统一的概念基础以推进情感建模。受此层次结构的指导,我们引入了Nano-EmoX,一种小型多任务MLM,以及P2E(感知到共情)基于课程的学习框架。Nano-EmoX 结合了一系列全模态编码器,包括增强的面部编码器和融合编码器,以捕捉关键的多模态情感线索并提高跨任务的迁移性。输出通过异构适配器投影到统一的语言空间中,赋予轻量级语言模型处理各种情感任务的能力。同时,P2E 通过将快速感知与基于链式思维的共情对齐,逐步培养情感智能。据我们所知,Nano-EmoX 是第一个统一三个层次结构中所有六个核心情感任务的小型MLM(2.2B),在多个基准测试中实现了最先进的或高度竞争力的性能,展示了出色的效率和泛化能力。
Summary / 总结
The paper addresses the gap between low-level perception and high-level interaction in affective multimodal language models (MLMs) by proposing a three-level hierarchy and introducing Nano-EmoX, a small-scale multitask MLM, together with P2E (Perception-to-Empathy), a curriculum-based training framework. Nano-EmoX integrates omni-modal encoders and heterogeneous adapters to capture multimodal affective cues and improve cross-task transferability, while P2E progressively cultivates emotional intelligence by aligning perception with empathy. The model achieves state-of-the-art or highly competitive performance across multiple benchmarks, demonstrating excellent efficiency and generalization.
论文提出了一种三级认知启发式层次结构,以弥合情感多模态语言模型中低级感知与高级交互之间的差距,并引入了Nano-EmoX,这是一种小型多任务MLM,以及基于课程的训练框架P2E。Nano-EmoX集成了多模态编码器和异构适配器,以捕捉多模态情感线索并提高跨任务的迁移性,而P2E通过将感知与同理心对齐来逐步培养情感智能。该模型在多个基准测试中实现了最先进的或具有竞争力的性能,展示了出色的效率和泛化能力。
Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning
Authors: Justin Waugh
First: 2026-03-02T17:40:54+00:00 · Latest: 2026-03-02T17:40:54+00:00
Abstract
We introduce Pencil Puzzle Bench, a framework for evaluating large language model reasoning through pencil puzzles, a family of constraint-satisfaction problems closely related to NP-complete problems, with deterministic, step-level verification. From a database of 62,231 puzzles across 94 varieties with verified unique solutions, we select a benchmark of 300 puzzles spanning 20 varieties and evaluate 51 models from 11 providers in two modes: direct ask (single-shot) and agentic (multi-turn with iterative verification). A key differentiator of our benchmark is that every intermediate board state can be checked against variety-specific constraints, localizing errors to the exact rule violated, providing the infrastructure for dense, per-move reward signals for process supervision and reinforcement learning. Our evaluation reveals two distinct axes of capability: (1) reasoning effort scaling, where GPT-5.2 improves 81x from no reasoning to maximum effort; and (2) agentic iteration, where Claude Opus 4.6 rises from 0.3% to 30.0% through iterative checking, while GPT-5.2@xhigh improves from 20.2% to 56.0%. Agentic attempts span a median of 29 turns over 17 minutes, with the longest exceeding 1,221 turns and 14.3 hours - a demanding test of long-context utilization, not just reasoning.
中文标题/摘要
标题:铅笔谜题台:多步可验证推理的基准
我们引入了铅笔谜题台,一种通过铅笔谜题(一类与NP完全问题紧密相关的约束满足问题)评估大型语言模型推理能力的框架,这些谜题具有确定性的、逐步的验证。从包含62,231个谜题(涵盖94种类型,每个类型都有唯一解)的数据库中,我们选择了涵盖20种类型的300个谜题作为基准,并对来自11个提供商的51个模型进行了两种模式的评估:直接询问(单次)和代理模式(多轮次迭代验证)。我们基准的一个关键区别在于,每个中间棋盘状态都可以与特定类型的约束进行比对,从而将错误定位到具体的规则违反之处,为过程监督和强化学习提供了密集的每步奖励信号。 我们的评估揭示了两种能力轴:(1)推理努力扩展,GPT-5.2从无推理到最大努力提高了81倍;(2)代理迭代,Claude Opus 4.6通过迭代检查从0.3%提高到30.0%,而GPT-5.2@xhigh从20.2%提高到56.0%。代理尝试的中位数为29轮,持续17分钟,最长超过1,221轮和14.3小时,这是一项对长上下文利用能力的严苛测试,而不仅仅是推理能力的测试。
Summary / 总结
Pencil Puzzle Bench is a benchmark for evaluating large language models' reasoning abilities through pencil puzzles, which are constraint-satisfaction problems. The benchmark includes 300 puzzles from 20 varieties, evaluated across 51 models from 11 providers in single-shot and multi-turn modes. Key findings show that models improve significantly with reasoning effort and iterative verification, with GPT-5.2 showing an 81x improvement and Claude Opus 4.6 increasing from 0.3% to 30.0% through iterative checking. The benchmark also tests long-context utilization, with the longest attempt spanning over 1,221 turns and 14.3 hours.
Pencil Puzzle Bench 通过 300 个具有唯一解的铅笔谜题评估大型语言模型的推理能力,每个中间棋盘状态都可以进行步骤级验证。模型在单次直接询问和多轮代理两种模式下接受评估。主要发现包括:GPT-5.2 从无推理到最大推理努力提升了 81 倍;Claude Opus 4.6 通过迭代检查从 0.3% 提升到 30.0%。
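To make the idea of step-level verification concrete, here is a minimal sketch of a checker that tests a partially filled board against puzzle rules and names the rule that was violated. It assumes a simple Latin-square-style puzzle (no repeated value in any row or column); the function names and the 0/1 per-move reward are hypothetical and are not taken from the benchmark's code.

```python
from typing import List, Optional

Board = List[List[Optional[int]]]  # None marks an empty cell


def find_violations(board: Board) -> List[str]:
    """Return a description of every rule the current board state breaks."""
    n = len(board)
    violations: List[str] = []
    for i in range(n):
        row = [v for v in board[i] if v is not None]
        col = [board[r][i] for r in range(n) if board[r][i] is not None]
        if len(row) != len(set(row)):
            violations.append(f"row {i}: repeated value")
        if len(col) != len(set(col)):
            violations.append(f"column {i}: repeated value")
        if any(not 1 <= v <= n for v in row):
            violations.append(f"row {i}: value out of range")
    return violations


def step_reward(board: Board) -> float:
    """Hypothetical dense signal: 1.0 if the state is still consistent, else 0.0."""
    return 0.0 if find_violations(board) else 1.0
```

Calling `find_violations` after every move localizes an error to the move that introduced it, which is the property that makes per-move reward signals possible.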
Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons
Authors: Anthony Liang, Yigit Korkmaz, Jiahui Zhang, Minyoung Hwang, Abrar Anwar, Sidhant Kaushik, Aditya Shah, Alex S. Huang, Luke Zettlemoyer, Dieter Fox, Yu Xiang, Anqi Li, Andreea Bobu, Abhishek Gupta, Stephen Tu, Erdem Biyik, Jesse Zhang
First: 2026-03-02T17:38:58+00:00 · Latest: 2026-03-02T17:38:58+00:00
Comments: 33 pages, 17 figures
Abstract
General-purpose robot reward models are typically trained to predict absolute task progress from expert demonstrations, providing only local, frame-level supervision. While effective for expert demonstrations, this paradigm scales poorly to large-scale robotics datasets where failed and suboptimal trajectories are abundant and assigning dense progress labels is ambiguous. We introduce Robometer, a scalable reward modeling framework that combines intra-trajectory progress supervision with inter-trajectory preference supervision. Robometer is trained with a dual objective: a frame-level progress loss that anchors reward magnitude on expert data, and a trajectory-comparison preference loss that imposes global ordering constraints across trajectories of the same task, enabling effective learning from both real and augmented failed trajectories. To support this formulation at scale, we curate RBM-1M, a reward-learning dataset comprising over one million trajectories spanning diverse robot embodiments and tasks, including substantial suboptimal and failure data. Across benchmarks and real-world evaluations, Robometer learns more generalizable reward functions than prior methods and improves robot learning performance across a diverse set of downstream applications. Code, model weights, and videos at https://robometer.github.io/.
中文标题/摘要
标题:Robometer:通过轨迹比较扩展通用机器人奖励模型
通用机器人奖励模型通常被训练以从专家演示中预测绝对任务进度,仅提供局部、帧级监督。虽然对于专家演示有效,但这种范式在包含大量失败和次优轨迹的大规模机器人数据集中扩展性差,且为这些轨迹分配密集进度标签是模糊的。我们引入了Robometer,一种可扩展的奖励建模框架,结合了轨迹内进度监督与轨迹间偏好监督。Robometer通过双重目标进行训练:帧级进度损失,将奖励幅度锚定在专家数据上;轨迹比较偏好损失,施加跨相同任务的轨迹全局排序约束,从而有效从真实和增强的失败轨迹中学习。为了支持这种大规模的表述,我们整理了包含超过一百万条轨迹的RBM-1M奖励学习数据集,这些轨迹涵盖了多种机器人实体和任务,包括大量次优和失败数据。在基准测试和实际应用评估中,Robometer学习到的奖励函数比先前方法更具泛化性,并提高了机器人学习性能。代码、模型权重和视频见https://robometer.github.io/。
Summary / 总结
Robometer is a scalable reward modeling framework that combines frame-level progress supervision with trajectory-comparison preference supervision to learn from both expert and suboptimal trajectories. It uses a dual objective: a frame-level loss to anchor reward magnitude on expert data and a trajectory-comparison loss to enforce global ordering constraints. Robometer is trained on RBM-1M, a dataset of over one million trajectories. Experimental results show that Robometer learns more generalizable reward functions and improves robot learning performance across various applications compared to previous methods.
Robometer 是一种结合了轨迹内进度监督和轨迹间偏好监督的可扩展奖励建模框架,旨在学习更具泛化能力的奖励函数。它通过双重目标进行训练:帧级进度损失和轨迹比较偏好损失。Robometer 在各种基准测试和实际任务中的表现优于先前的方法,能够从专家和次优轨迹中学习。该框架依托于包含超过一百万条轨迹的 RBM-1M 数据集,涵盖了多种机器人任务,包括失败数据。
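The dual objective described in this entry can be sketched as the sum of a frame-level regression term and a pairwise preference term. The specific choices below are assumptions for illustration: mean squared error for progress, a Bradley-Terry style logistic loss for trajectory comparison, scalar trajectory scores, and the `pref_weight` parameter are all hypothetical, and the abstract does not state the exact loss forms.

```python
import math
from typing import Sequence


def progress_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Frame-level term: mean squared error against progress labels in [0, 1]."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Trajectory-level term: -log sigmoid(s_preferred - s_rejected)."""
    margin = score_preferred - score_rejected
    # Numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def dual_objective(pred_progress: Sequence[float], target_progress: Sequence[float],
                   score_preferred: float, score_rejected: float,
                   pref_weight: float = 1.0) -> float:
    """Combine both terms; pref_weight is a hypothetical balancing factor."""
    return (progress_loss(pred_progress, target_progress)
            + pref_weight * preference_loss(score_preferred, score_rejected))
```

The progress term only needs labeled expert frames, while the preference term only needs to know which of two trajectories for the same task is better, so failed trajectories can contribute without dense progress labels.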
SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling
Authors: Md Imbesat Hassan Rizvi, Xiaodan Zhu, Iryna Gurevych
Venue: AAAI 2026 Oral
First: 2025-06-18T14:37:59+00:00 · Latest: 2026-03-02T17:34:48+00:00
Comments: Accepted to AAAI 2026 (Oral)
Abstract
Process or step-wise supervision has played a crucial role in advancing complex multi-step reasoning capabilities of Large Language Models (LLMs). However, efficient, high-quality automated process annotation remains a significant challenge. To address this, we introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables efficient per-step annotation by jointly aligning solution steps to reference solutions and determining their accuracy with explicit reasoning in a single generation. We demonstrate SPARE's effectiveness across four diverse datasets spanning mathematical reasoning (GSM8K, MATH), multi-hop question answering (MuSiQue-Ans), and spatial reasoning (SpaRP), showing consistent improvements in two applications: (1) training Process Reward Models (PRMs) for ranking and aggregating multiple generations, and (2) fine-tuning models via offline reinforcement learning for greedy decoding. On ProcessBench, SPARE demonstrates data-efficient out-of-distribution generalization, using only ~16% of training samples compared to human-labeled and other synthetically trained baselines. Additionally, it achieves competitive performance with MCTS-based methods while offering a 2.3x speedup in terms of total token count. Manual analysis reveals complementary precision-recall characteristics with MCTS approaches, suggesting potential for ensemble methods. These results establish SPARE as a practical and scalable solution for automatic process supervision in LLM reasoning.
中文标题/摘要
标题:SPARE: 基于参考引导评估的一次通过标注方法以自动监督过程和奖励建模
过程或步骤监督在推动大型语言模型(LLMs)复杂多步骤推理能力方面发挥了关键作用。然而,高效、高质量的自动化过程标注仍然是一个重大挑战。为了解决这一问题,我们提出了基于参考引导评估的一次通过标注(SPARE),这是一种新颖的结构化框架,通过联合对齐解决方案步骤与参考解决方案并以明确推理的方式在单次生成中确定其准确性,从而实现高效的逐步骤标注。我们展示了SPARE在四个不同数据集上的有效性,涵盖数学推理(GSM8K、MATH)、多跳问答(MuSiQue-Ans)和空间推理(SpaRP),在两个应用中显示出一致的改进:(1)训练过程奖励模型(PRMs)以对多个生成结果进行排名和聚合,(2)通过离线强化学习微调模型以进行贪婪解码。在ProcessBench上,SPARE展示了高效的数据外推泛化能力,仅使用约16%的训练样本量,相比人工标注和其他合成训练基线。此外,它在总令牌计数方面实现了2.3倍的速度提升,同时与MCTS方法竞争性地表现。手动分析显示了与MCTS方法互补的精确度-召回率特性,表明有潜力使用集成方法。这些结果确立了SPARE作为LLM推理中自动过程监督的实用且可扩展的解决方案。
Summary / 总结
SPARE is a novel framework for efficient single-pass annotation of process steps: it aligns each solution step to a reference solution and judges its accuracy with explicit reasoning in a single generation. It improves performance both in training Process Reward Models and in fine-tuning via offline reinforcement learning. On ProcessBench, it generalizes out of distribution using only about 16% of the training samples used by human-labeled and other synthetically trained baselines, and it is competitive with MCTS-based methods while offering a 2.3x speedup in total token count.
SPARE 是一种新颖的框架,通过将过程步骤与参考解决方案对齐并利用显式推理评估其准确性来进行高效的一次性标注。它在训练过程奖励模型和通过离线强化学习微调模型方面表现出一致的改进。SPARE 在使用约 16% 的训练样本进行数据高效泛化方面优于人工标注和其他合成训练基线,并在总标记数量上比 MCTS 方法快 2.3 倍。
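The first application mentioned in this entry, using a Process Reward Model to rank and aggregate multiple generations, can be sketched as follows. The aggregation rule (the minimum per-step score as the solution score), the candidate format, and both function names are assumptions chosen for illustration; the abstract does not specify how step scores are combined.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Candidate = Tuple[str, Sequence[float]]  # (final answer, per-step PRM scores)


def solution_score(step_scores: Sequence[float]) -> float:
    """Score a solution by its weakest step (assumed aggregation rule)."""
    return min(step_scores)


def best_of_n(candidates: List[Candidate]) -> str:
    """Ranking: return the answer of the single highest-scoring solution."""
    return max(candidates, key=lambda c: solution_score(c[1]))[0]


def weighted_vote(candidates: List[Candidate]) -> str:
    """Aggregation: sum solution scores per distinct answer, return the top one."""
    totals: Dict[str, float] = defaultdict(float)
    for answer, step_scores in candidates:
        totals[answer] += solution_score(step_scores)
    return max(totals, key=lambda a: totals[a])
```

The two strategies can disagree: ranking trusts the single best-scored chain, while weighted voting rewards answers that several reasonably scored chains agree on.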
Distributions as Actions: A Unified Framework for Diverse Action Spaces
Authors: Jiamin He, A. Rupam Mahmood, Martha White
Venue: ICLR 2026
First: 2025-06-19T21:19:19+00:00 · Latest: 2026-03-02T17:30:05+00:00
Comments: Accepted to ICLR 2026
Abstract
We introduce a novel reinforcement learning (RL) framework that treats parameterized action distributions as actions, redefining the boundary between agent and environment. This reparameterization makes the new action space continuous, regardless of the original action type (discrete, continuous, hybrid, etc.). Under this new parameterization, we develop a generalized deterministic policy gradient estimator, Distributions-as-Actions Policy Gradient (DA-PG), which has lower variance than the gradient in the original action space. Although learning the critic over distribution parameters poses new challenges, we introduce Interpolated Critic Learning (ICL), a simple yet effective strategy to enhance learning, supported by insights from bandit settings. Building on TD3, a strong baseline for continuous control, we propose a practical actor-critic algorithm, Distributions-as-Actions Actor-Critic (DA-AC). Empirically, DA-AC achieves competitive performance in various settings across discrete, continuous, and hybrid control.
中文标题/摘要
标题:分布即动作:统一的多样化动作空间框架
我们提出了一种新颖的强化学习(RL)框架,将参数化的动作分布视为动作,重新定义了代理和环境之间的边界。这种重新参数化使得新的动作空间连续,无论原始动作类型(离散、连续、混合等)如何。在新的参数化下,我们开发了一种广义确定性策略梯度估计器,即分布即动作策略梯度(DA-PG),其方差低于原始动作空间中的梯度。尽管学习分布参数的评论者提出了新的挑战,我们引入了一种简单的有效策略插值评论者学习(ICL),并得到了来自多臂老虎机设置的见解支持。基于TD3,一个强大的连续控制基线,我们提出了一种实用的演员-评论者算法,即分布即动作演员-评论者(DA-AC)。实验上,DA-AC在离散、连续和混合控制的各种设置中实现了竞争力的表现。
Summary / 总结
The research introduces a reinforcement learning framework where parameterized action distributions are treated as actions, creating a continuous action space. This framework leads to the development of a generalized policy gradient estimator, DA-PG, with lower variance. The study also proposes Interpolated Critic Learning (ICL) to address challenges in learning the critic over distribution parameters. Building on TD3, DA-AC, a practical actor-critic algorithm, is proposed and demonstrates competitive performance in various control settings, including discrete, continuous, and hybrid actions.
研究提出了一种强化学习框架,其中参数化的动作分布被视为动作,将动作空间转换为连续空间。该框架包含一个新的策略梯度估计器DA-PG,具有较低的方差。研究还提出ICL来解决在分布参数上学习批评家的挑战。基于TD3的DA-AC动作-批评家算法被开发出来,并在离散、连续和混合控制任务中表现出竞争力。
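The reparameterization in this entry, where the agent emits distribution parameters and sampling moves to the environment side, can be illustrated for the discrete case with a small wrapper. This is a sketch under stated assumptions: `DistributionActionWrapper`, the `step_fn` callable, and the two-armed bandit in the usage note are hypothetical and do not reproduce DA-PG, ICL, or DA-AC.

```python
import random
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class DistributionActionWrapper:
    """Expose a discrete-action step function as one with continuous actions.

    The agent's action is a probability vector over the original discrete
    actions; the wrapper samples the discrete action on the agent's behalf.
    """

    def __init__(self, step_fn: Callable[[int], T], n_actions: int, seed: int = 0) -> None:
        self.step_fn = step_fn
        self.n_actions = n_actions
        self.rng = random.Random(seed)

    def step(self, probs: Sequence[float]) -> T:
        """Validate the probability vector, sample an action, and step."""
        if len(probs) != self.n_actions:
            raise ValueError("need one probability per discrete action")
        if any(p < 0 for p in probs) or abs(sum(probs) - 1.0) > 1e-6:
            raise ValueError("probs must be non-negative and sum to 1")
        action = self.rng.choices(range(self.n_actions), weights=probs, k=1)[0]
        return self.step_fn(action)
```

For example, wrapping a two-armed bandit whose `step_fn` returns the reward of the chosen arm gives the agent a continuous action space (the probability simplex), so a deterministic policy over distribution parameters can be trained even though the underlying actions are discrete.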
Generative Enzyme Design Guided by Functionally Important Sites and Small-Molecule Substrates
Authors: Zhenqiao Song, Yunlong Zhao, Wenxian Shi, Wengong Jin, Yang Yang, Lei Li
First: 2024-05-13T21:48:48+00:00 · Latest: 2026-03-02T17:25:30+00:00
Abstract
Enzymes are genetically encoded biocatalysts capable of accelerating chemical reactions. How can we automatically design functional enzymes? In this paper, we propose EnzyGen, an approach to learn a unified model to design enzymes across all functional families. Our key idea is to generate an enzyme's amino acid sequence and its three-dimensional (3D) coordinates based on functionally important sites and substrates corresponding to a desired catalytic function. These sites are automatically mined from enzyme databases. EnzyGen consists of a novel interleaving network of attention and neighborhood equivariant layers, which captures both long-range correlation in an entire protein sequence and local influence from nearest amino acids in 3D space. To learn the generative model, we devise a joint training objective, including a sequence generation loss, a position prediction loss and an enzyme-substrate interaction loss. We further construct EnzyBench, a dataset with 3157 enzyme families, covering all available enzymes within the Protein Data Bank (PDB). Experimental results show that our EnzyGen consistently achieves the best performance across all 323 testing families, surpassing the best baseline by 10.79% in terms of substrate binding affinity. These findings demonstrate EnzyGen's superior capability in designing well-folded and effective enzymes binding to specific substrates with high affinities.
中文标题/摘要
标题:由功能重要位点和小分子底物引导的生成酶设计
酶是通过基因编码的生物催化剂,能够加速化学反应。我们如何自动设计具有功能的酶?在本文中,我们提出了EnzyGen,一种学习统一模型来设计所有功能家族酶的方法。我们的核心思想是根据所需催化功能对应的功能重要位点和底物生成酶的氨基酸序列及其三维(3D)坐标。这些位点是从酶数据库中自动挖掘出来的。EnzyGen 包含一种新颖的交替网络,结合了注意力层和邻域等变层,能够捕捉整个蛋白质序列中的长程相关性和三维空间中最近氨基酸的局部影响。为了学习生成模型,我们设计了一种联合训练目标,包括序列生成损失、位置预测损失和酶-底物相互作用损失。我们进一步构建了EnzyBench数据集,包含3157个酶家族,涵盖了蛋白质数据银行(PDB)中所有可用的酶。实验结果表明,我们的EnzyGen在所有323个测试家族中始终表现出最佳性能,在底物结合亲和力方面比最佳基线高出10.79%。这些发现表明EnzyGen在设计与特定底物具有高亲和力的折叠良好且有效的酶方面具有优越的能力。
Summary / 总结
This paper aims to automatically design functional enzymes by leveraging functionally important sites and substrates. EnzyGen, the proposed method, uses a novel interleaving network to generate enzyme sequences and 3D coordinates based on these sites and substrates. The approach is evaluated on EnzyBench, a dataset of 3157 enzyme families, and shows superior performance, achieving a 10.79% improvement in substrate binding affinity compared to the best baseline method.
本文旨在通过利用功能重要位点和底物来自动设计功能性酶。EnzyGen 方法使用一种新颖的交替网络来根据这些位点和底物生成酶序列和三维坐标。该方法在包含 3157 个酶家族的 EnzyBench 数据集上进行评估,并显示出优越的性能,与最佳基线方法相比,在底物结合亲和力方面提高了 10.79%。
Goal Reaching with Eikonal-Constrained Hierarchical Quasimetric Reinforcement Learning
Authors: Vittorio Giammarino, Ahmed H. Qureshi
First: 2025-12-12T21:37:11+00:00 · Latest: 2026-03-02T17:21:38+00:00
Abstract
Goal-Conditioned Reinforcement Learning (GCRL) mitigates the difficulty of reward design by framing tasks as goal reaching rather than maximizing hand-crafted reward signals. In this setting, the optimal goal-conditioned value function naturally forms a quasimetric, motivating Quasimetric RL (QRL), which constrains value learning to quasimetric mappings and enforces local consistency through discrete, trajectory-based constraints. We propose Eikonal-Constrained Quasimetric RL (Eik-QRL), a continuous-time reformulation of QRL based on the Eikonal Partial Differential Equation (PDE). This PDE-based structure makes Eik-QRL trajectory-free, requiring only sampled states and goals, while improving out-of-distribution generalization. We provide theoretical guarantees for Eik-QRL and identify limitations that arise under complex dynamics. To address these challenges, we introduce Eik-Hierarchical QRL (Eik-HiQRL), which integrates Eik-QRL into a hierarchical decomposition. Empirically, Eik-HiQRL achieves state-of-the-art performance in offline goal-conditioned navigation and yields consistent gains over QRL in manipulation tasks, matching temporal-difference methods.
中文标题/摘要
标题:基于Eikonal约束层次拟度量强化学习的目标达成
目标条件化强化学习(GCRL)通过将任务视为目标达成而非最大化手工设计的奖励信号来缓解奖励设计的困难。在这种设置中,最优的目标条件价值函数自然形成拟度量,这促使了拟度量强化学习(QRL)的发展,QRL将价值学习约束为拟度量映射,并通过离散的轨迹基约束来确保局部一致性。我们提出了基于Eikonal偏微分方程(PDE)的Eikonal约束拟度量强化学习(Eik-QRL)作为QRL的连续时间形式。基于PDE的结构使Eik-QRL成为无轨迹方法,仅需采样的状态和目标,同时提高了离分布泛化能力。我们为Eik-QRL提供了理论保证,并在复杂动力学下识别了其局限性。为应对这些挑战,我们引入了Eik-Hierarchical QRL(Eik-HiQRL),将Eik-QRL整合到层次分解中。实验上,Eik-HiQRL在离线目标条件导航中达到了最先进的性能,并在操作任务中相对于QRL获得了持续的改进,与时间差分方法相当。
Summary / 总结
The research aims to improve goal-conditioned reinforcement learning (GCRL) by exploiting the quasimetric structure of the optimal goal-conditioned value function. The proposed Eikonal-Constrained Quasimetric RL (Eik-QRL) reformulates QRL in continuous time using the Eikonal Partial Differential Equation, which makes learning trajectory-free (only sampled states and goals are required) and improves out-of-distribution generalization. Eik-Hierarchical QRL (Eik-HiQRL) integrates Eik-QRL into a hierarchical decomposition to handle complex dynamics, achieving state-of-the-art performance in offline goal-conditioned navigation and consistent gains over QRL in manipulation tasks, where it matches temporal-difference methods.
论文提出了Eikonal-Constrained Hierarchical Quasimetric Reinforcement Learning (Eik-HiQRL),通过利用Eikonal偏微分方程进行无轨迹的价值函数学习,以改进目标条件下的强化学习。该方法增强了对分布外样本的泛化能力,并在离线目标导向导航任务中达到了最先进的性能,同时在操作任务中也超过了之前的算法。
OmniRet: Efficient and High-Fidelity Omni Modality Retrieval
Authors: Chuong Huynh, Manh Luong, Abhinav Shrivastava
Venue: CVPR 2026
First: 2026-03-02T17:19:55+00:00 · Latest: 2026-03-02T17:19:55+00:00
Comments: CVPR 2026. Project link: https://github.com/hmchuong/omniret
Abstract
Multimodal retrieval is the task of aggregating information from queries across heterogeneous modalities to retrieve desired targets. State-of-the-art multimodal retrieval models can understand complex queries, yet they are typically limited to two modalities: text and vision. This limitation impedes the development of universal retrieval systems capable of comprehending queries that combine more than two modalities. To advance toward this goal, we present OmniRet, the first retrieval model capable of handling complex, composed queries spanning three key modalities: text, vision, and audio. Our OmniRet model addresses two critical challenges for universal retrieval: computational efficiency and representation fidelity. First, feeding massive token sequences from modality-specific encoders to Large Language Models (LLMs) is computationally inefficient. We therefore introduce an attention-based resampling mechanism to generate compact, fixed-size representations from these sequences. Second, compressing rich omni-modal data into a single embedding vector inevitably causes information loss and discards fine-grained details. We propose Attention Sliced Wasserstein Pooling to preserve these fine-grained details, leading to improved omni-modal representations. OmniRet is trained on an aggregation of approximately 6 million query-target pairs spanning 30 datasets. We benchmark our model on 13 retrieval tasks and an MMEBv2 subset. Our model demonstrates significant improvements on composed query, audio and video retrieval tasks, while achieving on-par performance with state-of-the-art models on others. Furthermore, we curate a new Audio-Centric Multimodal Benchmark (ACM). This new benchmark introduces two critical, previously missing tasks, composed audio retrieval and audio-visual retrieval, to more comprehensively evaluate a model's omni-modal embedding capacity.
中文标题/摘要
标题:OmniRet:高效且高保真度的全模态检索
全模态检索是将查询跨异构模态的信息聚合起来以检索所需目标的任务。最先进的全模态检索模型能够理解复杂的查询,但通常仅限于两种模态:文本和视觉。这一限制阻碍了能够理解结合了多种模态的查询的通用检索系统的开发。为实现这一目标,我们提出了OmniRet,这是首个能够处理跨越三个关键模态(文本、视觉和音频)的复杂组合查询的检索模型。我们的OmniRet模型解决了通用检索中的两个关键挑战:计算效率和表示保真度。首先,将来自特定模态编码器的大量标记序列输入大型语言模型(LLM)是计算上低效的。因此,我们引入了一种基于注意力的重采样机制,从这些序列中生成紧凑的固定大小表示。其次,将丰富的全模态数据压缩到单个嵌入向量中不可避免地会导致信息丢失并丢弃细粒度细节。我们提出了注意力切片Wasserstein池化,以保留这些细粒度细节,从而提高全模态表示。OmniRet在大约600万个查询-目标对的聚合数据集上进行了训练,这些数据集跨越了30个数据集。我们在13个检索任务和MMEBv2子集上对我们的模型进行了基准测试。我们的模型在组合查询、音频和视频检索任务上表现出显著的改进,而在其他任务上则与最先进的模型持平。此外,我们整理了一个新的以音频为中心的多模态基准(ACM)。这个新基准引入了两个关键的、之前缺失的任务:组合音频检索和音频-视觉检索,以更全面地评估模型的全模态嵌入能力。
Summary / 总结
OmniRet is a retrieval model designed to handle complex queries combining text, vision, and audio, addressing the limitation of existing models that typically cover only two modalities. It introduces an attention-based resampling mechanism for computational efficiency and Attention Sliced Wasserstein Pooling to preserve fine-grained details in omni-modal representations. Trained on approximately 6 million query-target pairs from 30 datasets, it shows significant improvements on composed query, audio, and video retrieval tasks while remaining on par with state-of-the-art models on other tasks. Additionally, the authors curate a new Audio-Centric Multimodal Benchmark (ACM), which adds composed audio retrieval and audio-visual retrieval tasks to evaluate omni-modal embedding capacity more comprehensively.
OmniRet旨在处理结合了文本、视觉和音频的复杂查询,解决了现有模型通常仅关注两种模态的局限性。它引入了一种基于注意力的重采样机制来生成紧凑的表示,并提出了一种注意力切片 Wasserstein 池化方法以保留细粒度的细节。实验结果表明,在组合查询、音频和视频检索任务上取得了显著改进,而在其他任务上保持了与最新模型相当的性能。此外,OmniRet在新的音频中心多模态基准(ACM)上进行了测试,该基准更全面地评估了模型的多模态嵌入能力。
HalluGuard: Demystifying Data-Driven and Reasoning-Driven Hallucinations in LLMs
Authors: Xinyue Zeng, Junhong Lin, Yujun Yan, Feng Guo, Liang Shi, Jun Wu, Dawei Zhou
Venue: ICLR 2026
First: 2026-01-26T18:23:09+00:00 · Latest: 2026-03-02T17:18:04+00:00
Comments: Accepted by The Fourteenth International Conference on Learning Representations (ICLR'26)
Abstract
The reliability of Large Language Models (LLMs) in high-stakes domains such as healthcare, law, and scientific discovery is often compromised by hallucinations. These failures typically stem from two sources: data-driven hallucinations and reasoning-driven hallucinations. However, existing detection methods usually address only one source and rely on task-specific heuristics, limiting their generalization to complex scenarios. To overcome these limitations, we introduce the Hallucination Risk Bound, a unified theoretical framework that formally decomposes hallucination risk into data-driven and reasoning-driven components, linked respectively to training-time mismatches and inference-time instabilities. This provides a principled foundation for analyzing how hallucinations emerge and evolve. Building on this foundation, we introduce HalluGuard, an NTK-based score that leverages the induced geometry and captured representations of the NTK to jointly identify data-driven and reasoning-driven hallucinations. We evaluate HalluGuard on 10 diverse benchmarks, 11 competitive baselines, and 9 popular LLM backbones, consistently achieving state-of-the-art performance in detecting diverse forms of LLM hallucinations. We open-source our proposed HalluGuard model at https://github.com/Susan571/HalluGuard-ICLR2026.
中文标题/摘要
标题:HalluGuard:揭开大型语言模型数据驱动和推理驱动幻觉的面纱
大型语言模型(LLMs)在医疗、法律和科学发现等高风险领域中的可靠性常常因幻觉而受损。这些失败通常源自两个来源:数据驱动的幻觉和推理驱动的幻觉。然而,现有的检测方法通常只解决其中一个来源,并依赖于特定任务的启发式方法,限制了其在复杂场景中的泛化能力。为克服这些限制,我们引入了幻觉风险界,这是一种统一的理论框架,正式将幻觉风险分解为数据驱动和推理驱动的组件,分别与训练时的不匹配和推理时的不稳定性相关。这为分析幻觉的出现和发展提供了原则性的基础。在此基础上,我们引入了HalluGuard,这是一种基于NTK的评分方法,利用NTK诱导的几何结构和捕获的表示来联合识别数据驱动和推理驱动的幻觉。我们在10个不同的基准上评估了HalluGuard,与11个竞争性基线和9个流行的LLM基础模型进行了比较,始终在检测LLM幻觉的各种形式方面取得了最先进的性能。我们将在https://github.com/Susan571/HalluGuard-ICLR2026/开源我们提出的模型。
Summary / 总结
The paper addresses the issue of hallucinations in Large Language Models (LLMs) in high-stakes domains by introducing a unified theoretical framework called the Hallucination Risk Bound, which decomposes hallucination risk into data-driven and reasoning-driven components. Based on this framework, the authors developed HalluGuard, an NTK-based score that can jointly identify both types of hallucinations. HalluGuard outperforms 11 competitive baselines across 10 diverse benchmarks and 9 popular LLM backbones, demonstrating its effectiveness in detecting various forms of LLM hallucinations.
论文通过引入一个统一的理论框架——幻觉风险界,将幻觉风险分解为数据驱动和推理驱动两个部分,解决了大型语言模型(LLMs)在高风险领域中的问题。基于这一框架,作者开发了HalluGuard,这是一种基于NTK的得分方法,能够同时识别这两种类型的幻觉。HalluGuard在10个不同的基准测试上进行了评估,并优于11个竞争性基线,展示了其在检测LLM各种形式幻觉方面的有效性。该模型已在https://github.com/Susan571/HalluGuard-ICLR2026开源。
FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding
Authors: Yiweng Xie, Bo He, Junke Wang, Xiangyu Zheng, Ziyi Ye, Zuxuan Wu
Venue: CVPR 2026
First: 2026-03-02T17:16:47+00:00 · Latest: 2026-03-02T17:16:47+00:00
Comments: Accepted at CVPR 2026. Project page: https://yiwengxie.com/FluxMem/
Abstract
This paper presents FluxMem, a training-free framework for efficient streaming video understanding. FluxMem adaptively compresses redundant visual memory through a hierarchical, two-stage design: (1) a Temporal Adjacency Selection (TAS) module removes redundant visual tokens across adjacent frames, and (2) a Spatial Domain Consolidation (SDC) module further merges spatially repetitive regions within each frame into compact representations. To adapt effectively to dynamic scenes, we introduce a self-adaptive token compression mechanism in both TAS and SDC, which automatically determines the compression rate based on intrinsic scene statistics rather than manual tuning. Extensive experiments demonstrate that FluxMem achieves new state-of-the-art results on existing online video benchmarks, reaching 76.4 on StreamingBench and 67.2 on OVO-Bench under real-time settings, while reducing latency by 69.9% and peak GPU memory by 34.5% on OVO-Bench. Furthermore, it maintains strong offline performance, achieving 73.1 on MLVU while using 65% fewer visual tokens.
中文标题/摘要
标题:FluxMem:流式视频理解的自适应分层内存
本文提出了一种无需训练的框架FluxMem,用于高效的流式视频理解。FluxMem通过层次化、两阶段设计自适应地压缩冗余视觉记忆:(1) 时间邻近选择(TAS) 模块在相邻帧之间移除冗余视觉令牌,(2) 空间域合并(SDC) 模块进一步在每个帧内合并空间重复区域为紧凑表示。为了有效适应动态场景,我们在TAS和SDC中引入了自适应令牌压缩机制,该机制根据内在场景统计自动确定压缩率,而不是手动调整。大量实验表明,FluxMem在现有在线视频基准上达到了新的最佳结果,在StreamingBench上达到76.4,在OVO-Bench上达到67.2,同时在实时设置下将延迟降低69.9%,并减少OVO-Bench上的峰值GPU内存34.5%。此外,它在离线性能上也表现出色,在MLVU上达到73.1,同时使用了65%更少的视觉令牌。
Summary / 总结
FluxMem is a training-free framework for efficient streaming video understanding, which uses a hierarchical, two-stage design to compress redundant visual memory. It includes a Temporal Adjacency Selection module that removes redundant visual tokens across adjacent frames and a Spatial Domain Consolidation module that merges spatially repetitive regions into compact representations. The framework introduces a self-adaptive token compression mechanism that automatically adjusts the compression rate based on scene statistics. FluxMem achieves state-of-the-art results on StreamingBench and OVO-Bench, reducing latency and peak GPU memory while maintaining strong offline performance.
FluxMem 是一个无需训练的框架,用于高效处理流式视频理解。它采用分层的两阶段方法:Temporal Adjacency Selection (TAS) 模块在相邻帧之间去除冗余的视觉令牌,而 Spatial Domain Consolidation (SDC) 模块将重复的空间区域合并为紧凑的表示。TAS 和 SDC 模块中的自适应令牌压缩机制会根据场景统计自动调整压缩率。FluxMem 在 StreamingBench 和 OVO-Bench 上达到了最先进的结果,减少了 69.9% 的延迟和 34.5% 的峰值 GPU 内存使用量,同时在离线性能方面使用更少的视觉令牌。
On the Rate of Convergence of GD in Non-linear Neural Networks: An Adversarial Robustness Perspective
Authors: Guy Smorodinsky, Sveta Gimpleson, Itay Safran
First: 2026-03-02T17:13:33+00:00 · Latest: 2026-03-02T17:13:33+00:00
Abstract
We study the convergence dynamics of Gradient Descent (GD) in a minimal binary classification setting, consisting of a two-neuron ReLU network and two training instances. We prove that even under these strong simplifying assumptions, while GD successfully converges to an optimal robustness margin, effectively maximizing the distance between the decision boundary and the training points, this convergence occurs at a prohibitively slow rate, scaling strictly as $\Theta(1/\ln(t))$. To the best of our knowledge, this establishes the first explicit lower bound on the convergence rate of the robustness margin in a non-linear model. Through empirical simulations, we further demonstrate that this inherent failure mode is pervasive, exhibiting the exact same tight convergence rate across multiple natural network initializations. Our theoretical guarantees are derived via a rigorous analysis of the GD trajectories across the distinct activation patterns of the model. Specifically, we develop tight control over the system's dynamics to bound the trajectory of the decision boundary, overcoming the primary technical challenge introduced by the non-linear nature of the architecture.
中文标题/摘要
标题:非线性神经网络中梯度下降收敛速率的研究:从对抗鲁棒性视角
我们研究了梯度下降(GD)在最小二分类设置中的收敛动态,该设置由一个两神经元的ReLU网络和两个训练实例组成。我们证明,在这些强简化假设下,尽管GD成功地收敛到最优鲁棒性边界,有效地最大化决策边界与训练点之间的距离,但这种收敛速率极其缓慢,严格地按Θ(1/ln(t))的速率进行。据我们所知,这首次明确给出了非线性模型中鲁棒性边界的收敛速率下界。通过实证模拟,我们进一步证明了这种固有的失败模式是普遍存在的,在多个自然网络初始化中表现出相同的紧致收敛速率。我们的理论保证是通过严谨分析GD轨迹在模型不同激活模式下的动态而得出的。具体来说,我们开发了对系统动态的严格控制,以限制决策边界的轨迹,克服了由架构非线性性质引入的主要技术挑战。
Summary / 总结
This study investigates the convergence dynamics of Gradient Descent (GD) in a minimal binary classification setting with a two-neuron ReLU network and two training instances. The research demonstrates that while GD can successfully maximize the robustness margin, this convergence happens at a very slow rate, scaling as $\Theta(1/\ln(t))$. This is the first explicit lower bound on the convergence rate of the robustness margin in a non-linear model. Empirical simulations show that this slow convergence is a pervasive issue, consistent across various network initializations.
研究探讨了在仅包含两个神经元的ReLU网络和两个训练实例的最小二分类设置中,梯度下降(GD)的收敛动态。研究显示,尽管GD能够最大化鲁棒性边界,但其收敛速度非常缓慢,按Θ(1/ln(t))速率进行。这是首次在非线性模型中明确给出鲁棒性边界收敛速率的下界。通过多种网络初始化的实证模拟,发现这种缓慢的收敛问题是普遍存在的。
Learning from Synthetic Data Improves Multi-hop Reasoning
Authors: Anmol Kabra, Yilun Yin, Albert Gong, Kamilė Stankevičiūtė, Dongyoung Go, Johann Lee, Katie Z. Luo, Carla P. Gomes, Kilian Q. Weinberger
Venue: ICLR 2026
First: 2026-03-02T17:08:43+00:00 · Latest: 2026-03-02T17:08:43+00:00
Comments: Accepted to ICLR 2026
Abstract
Reinforcement Learning (RL) has been shown to significantly boost reasoning capabilities of large language models (LLMs) in math, coding, and multi-hop reasoning tasks. However, RL fine-tuning requires abundant high-quality verifiable data, often sourced from human annotations, generated from frontier LLMs, or scored by LLM-based verifiers. All three have considerable limitations: human-annotated datasets are small and expensive to curate, LLM-generated data is hallucination-prone and costly, and LLM-based verifiers are inaccurate and slow. In this work, we investigate a cheaper alternative: RL fine-tuning on rule-generated synthetic data for multi-hop reasoning tasks. We discover that LLMs fine-tuned on synthetic data perform significantly better on popular real-world question-answering benchmarks, despite the synthetic data containing only fictional knowledge. On stratifying performance by question difficulty, we find that synthetic data teaches LLMs to compose knowledge -- a fundamental and generalizable reasoning skill. Our work highlights rule-generated synthetic reasoning data as a free and scalable resource to improve LLM reasoning capabilities.
中文标题/摘要
标题:从合成数据学习提高多跳推理能力
强化学习(RL)已被证明能显著提升大型语言模型(LLM)在数学、编程和多跳推理任务中的推理能力。然而,RL 微调需要大量高质量可验证数据,通常来自人类注释、前沿 LLM 生成或由 LLM 基础验证器评分。所有三种方法都有显著限制:人类标注的数据集规模小且维护成本高,LLM 生成的数据容易产生幻觉且成本高,而基于 LLM 的验证器则不准确且速度慢。在本研究中,我们探讨了一种更便宜的替代方案:使用规则生成的合成数据对多跳推理任务进行 RL 微调。我们发现,使用合成数据微调的 LLM 在流行的现实世界问答基准测试中表现显著更好,尽管合成数据中仅包含虚构知识。通过对问题难度进行分层,我们发现合成数据教会 LLM 组合知识——这是一种基本且可泛化的推理技能。我们的研究突出了规则生成的合成推理数据作为一种免费且可扩展的资源,以提高 LLM 的推理能力。