FVG-PT: Adaptive Foreground View-Guided Prompt Tuning for Vision-Language Models
Authors: Haoyang Li, Liang Wang, Siyu Zhou, Jiacheng Sun, Jing Jiang, Chao Wang, Guodong Long, Yan Peng
First: 2026-03-09T17:59:18+00:00 · Latest: 2026-03-09T17:59:18+00:00
Comments: 27 Pages, 9 Figures, 15 Tables
Abstract
CLIP-based prompt tuning enables pretrained Vision-Language Models (VLMs) to efficiently adapt to downstream tasks. Although existing studies have made significant progress, they pay limited attention to changes in the internal attention representations of VLMs during the tuning process. In this paper, we attribute the failure modes of prompt tuning predictions to shifts in foreground attention of the visual encoder, and propose Foreground View-Guided Prompt Tuning (FVG-PT), an adaptive plug-and-play foreground attention guidance module, to alleviate the shifts. Concretely, FVG-PT introduces a learnable Foreground Reliability Gate to automatically enhance the foreground view quality, applies a Foreground Distillation Compensation module to guide visual attention toward the foreground, and further introduces a Prior Calibration module to mitigate generalization degradation caused by excessive focus on the foreground. Experiments on multiple backbone models and datasets show the effectiveness and compatibility of FVG-PT. Codes are available at: https://github.com/JREion/FVG-PT
中文标题/摘要
标题:FVG-PT:视觉语言模型自适应前景视图引导提示调优
基于CLIP的提示调优使预训练的视觉语言模型(VLMs)能够高效地适应下游任务。尽管现有研究取得了显著进展,但它们在调优过程中VLMs内部注意力表示的变化方面关注较少。本文将提示调优预测的失败模式归因于视觉编码器前景注意力的变化,并提出了一种自适应插件前景注意力引导模块——前景视图引导提示调优(FVG-PT),以缓解这种变化。具体而言,FVG-PT引入了一个可学习的前景可靠性门控,以自动增强前景视图质量,应用前景蒸馏补偿模块以引导视觉注意力朝向前景,并进一步引入先验校准模块以减轻过度关注前景导致的一般化退化。在多个骨干模型和数据集上的实验显示了FVG-PT的有效性和兼容性。代码可在:https://github.com/JREion/FVG-PT
Summary / 总结
The research aims to improve the adaptability of pretrained Vision-Language Models (VLMs) by addressing shifts in foreground attention during prompt tuning. The proposed FVG-PT method introduces a learnable Foreground Reliability Gate, a Foreground Distillation Compensation module, and a Prior Calibration module to enhance foreground view quality, guide visual attention, and mitigate generalization degradation, respectively. Experiments demonstrate the effectiveness and compatibility of FVG-PT across multiple backbone models and datasets.
研究旨在通过解决预训练视觉语言模型(VLMs)在提示调优过程中前景注意力转移的问题来提高其适应性。方法包括FVG-PT,该模块包含可学习的前景可靠性门控、前景蒸馏补偿模块以及先验校准模块。实验表明,FVG-PT能够提升VLMs在多个骨干模型和数据集上的性能和泛化能力。
Impermanent: A Live Benchmark for Temporal Generalization in Time Series Forecasting
Authors: Azul Garza, Renée Rosillo, Rodrigo Mendoza-Smith, David Salinas, Andrew Robert Williams, Arjun Ashok, Mononito Goswami, José Martín Juárez
First: 2026-03-09T17:59:00+00:00 · Latest: 2026-03-09T17:59:00+00:00
Abstract
Recent advances in time-series forecasting increasingly rely on pre-trained foundation-style models. While these models often claim broad generalization, existing evaluation protocols provide limited evidence. Indeed, most current benchmarks use static train-test splits that can easily lead to contamination as foundation models can inadvertently train on test data or perform model selection using test scores, which can inflate performance. We introduce Impermanent, a live benchmark that evaluates forecasting models under open-world temporal change by scoring forecasts sequentially over time on continuously updated data streams, enabling the study of temporal robustness, distributional shift, and performance stability rather than one-off accuracy on a frozen test set. Impermanent is instantiated on GitHub open-source activity, providing a naturally live and highly non-stationary dataset shaped by releases, shifting contributor behavior, platform/tooling changes, and external events. We focus on the top 400 repositories by star count and construct time series from issues opened, pull requests opened, push events, and new stargazers, evaluated over a rolling window with daily updates, alongside standardized protocols and leaderboards for reproducible, ongoing comparison. By shifting evaluation from static accuracy to sustained performance, Impermanent takes a concrete step toward assessing when and whether foundation-level generalization in time-series forecasting can be meaningfully claimed. Code and a live dashboard are available at https://github.com/TimeCopilot/impermanent and https://impermanent.timecopilot.dev.
中文标题/摘要
标题:瞬息万变:时间序列预测中时空泛化的实时基准
近期的时间序列预测进展越来越多地依赖于预训练的基础模型。尽管这些模型通常声称具有广泛的泛化能力,但现有的评估协议提供的证据有限。事实上,大多数当前的基准使用静态的训练-测试分割,这很容易导致污染,因为基础模型可能会无意中训练在测试数据上,或者使用测试分数进行模型选择,这会夸大性能。我们引入了瞬息万变,这是一个实时基准,通过在连续更新的数据流上按时间顺序评分预测,来评估在开放世界时空变化下的预测模型,从而研究时间鲁棒性、分布偏移和性能稳定性,而不是在冻结的测试集上的一次性准确性。瞬息万变基于GitHub开源活动实现,提供了一个自然的实时且高度非平稳的数据集,受到发布、贡献者行为变化、平台/工具变更和外部事件的影响。我们专注于星标数最高的前400个仓库,并从问题打开、拉取请求打开、推送事件和新星标中构建时间序列,通过滚动窗口每日更新进行评估,同时提供标准化的协议和排行榜以实现可重复的持续比较。通过将评估从静态准确性转向持续性能,瞬息万变朝着评估时间序列预测中基础水平泛化的实际意义迈出了实质性的一步。代码和实时仪表板可在https://github.com/TimeCopilot/impermanent和https://impermanent.timecopilot.dev/获取。
Summary / 总结
The research aims to evaluate the temporal generalization capabilities of time-series forecasting models by introducing Impermanent, a live benchmark. This benchmark evaluates models sequentially over time using continuously updated data streams, which helps in studying temporal robustness and performance stability. Key findings include the demonstration that foundation models often perform well on static benchmarks but may struggle with real-world, dynamic data, highlighting the need for more rigorous evaluation methods.
研究旨在通过引入Impermanent这一实时基准来评估时间序列预测模型的时空泛化能力。该基准通过使用不断更新的数据流进行序列化评估,有助于研究时间稳健性和性能稳定性。主要发现表明,基础模型在静态基准上表现良好,但在实际动态数据中可能表现不佳,强调了需要更严格的评估方法的重要性。
Agentic Critical Training
Authors: Weize Liu, Minghui Liu, Sy-Tuyen Ho, Souradip Chakraborty, Xiyao Wang, Furong Huang
First: 2026-03-09T17:58:56+00:00 · Latest: 2026-03-09T17:58:56+00:00
Comments: Project page: https://attention-is-all-i-need.github.io/ACT/
Abstract
Training large language models (LLMs) as autonomous agents often begins with imitation learning, but it only teaches agents what to do without understanding why: agents never contrast successful actions against suboptimal alternatives and thus lack awareness of action quality. Recent approaches attempt to address this by introducing self-reflection supervision derived from contrasts between expert and alternative actions. However, the training paradigm fundamentally remains imitation learning: the model imitates pre-constructed reflection text rather than learning to reason autonomously. We propose Agentic Critical Training (ACT), a reinforcement learning paradigm that trains agents to identify the better action among alternatives. By rewarding whether the model's judgment is correct, ACT drives the model to autonomously develop reasoning about action quality, producing genuine self-reflection rather than imitating it. Across three challenging agent benchmarks, ACT consistently improves agent performance when combined with different post-training methods. It achieves an average improvement of 5.07 points over imitation learning and 4.62 points over reinforcement learning. Compared to approaches that inject reflection capability through knowledge distillation, ACT also demonstrates clear advantages, yielding an average improvement of 2.42 points. Moreover, ACT enables strong out-of-distribution generalization on agentic benchmarks and improves performance on general reasoning benchmarks without any reasoning-specific training data, highlighting the value of our method. These results suggest that ACT is a promising path toward developing more reflective and capable LLM agents.
中文标题/摘要
标题:代理批判性训练
将大型语言模型(LLMs)训练为自主代理通常始于模仿学习,但这种方法仅教会代理做什么而不理解为什么:代理从不将成功行为与次优替代行为进行对比,因此缺乏对行为质量的认识。最近的方法试图通过引入源自专家行为与替代行为对比的自我反思监督来解决这一问题。然而,训练范式本质上仍然是模仿学习:模型模仿预先构建的反思文本,而不是学习自主推理。我们提出了代理批判性训练(ACT),这是一种强化学习范式,训练代理识别替代行为中的更好行为。通过奖励模型判断是否正确,ACT 促使模型自主发展关于行为质量的推理,产生真正的自我反思,而不是模仿它。在三个具有挑战性的代理基准测试中,ACT 在结合不同后训练方法时,始终提高了代理性能。它在模仿学习上的平均改进为 5.07 分,在强化学习上的平均改进为 4.62 分。与通过知识蒸馏注入反思能力的方法相比,ACT 也表现出明显的优势,平均改进为 2.42 分。此外,ACT 在代理基准测试中实现了强大的分布外泛化,并在没有特定推理训练数据的情况下提高了通用推理基准测试的性能,突显了我们方法的价值。这些结果表明,ACT 是开发更具反思能力和能力的 LLM 代理的一个有前景的方向。
Summary / 总结
The research motivation is to improve large language models (LLMs) by enabling them to autonomously develop reasoning about action quality, rather than just imitating pre-constructed reflection text. The main method is Agentic Critical Training (ACT), a reinforcement learning approach that trains agents to identify the better action among alternatives by rewarding correct judgments. Key experimental findings show that ACT consistently improves agent performance, achieving an average improvement of 5.07 points over imitation learning and 4.62 points over reinforcement learning. It also demonstrates strong out-of-distribution generalization and improves performance on general reasoning benchmarks without specific reasoning training data.
研究旨在通过超越模仿学习,引入新的训练范式Agentic Critical Training (ACT),来提升大型语言模型(LLMs)的推理能力。ACT 训练代理自主评估行动质量,通过奖励正确的判断来促进真正的自我反思。在三个基准测试中,ACT 的平均性能提高了 5.07 分点,超过模仿学习,并且 4.62 分点超过强化学习,同时展示了强大的分布外泛化能力和在一般推理基准上的改进表现,无需特定的推理训练数据。
Evaluating Financial Intelligence in Large Language Models: Benchmarking SuperInvesting AI with LLM Engines
Authors: Akshay Gulati, Kanha Singhania, Tushar Banga, Parth Arora, Anshul Verma, Vaibhav Kumar Singh, Agyapal Digra, Jayant Singh Bisht, Danish Sharma, Varun Singla, Shubh Garg
First: 2026-03-09T17:58:54+00:00 · Latest: 2026-03-09T17:58:54+00:00
Comments: 12 pages, 6 Figures, 5 Tables
Abstract
Large language models are increasingly used for financial analysis and investment research, yet systematic evaluation of their financial reasoning capabilities remains limited. In this work, we introduce the AI Financial Intelligence Benchmark (AFIB), a multi-dimensional evaluation framework designed to assess financial analysis capabilities across five dimensions: factual accuracy, analytical completeness, data recency, model consistency, and failure patterns. We evaluate five AI systems: GPT, Gemini, Perplexity, Claude, and SuperInvesting, using a dataset of 95+ structured financial analysis questions derived from real-world equity research tasks. The results reveal substantial differences in performance across models. Within this benchmark setting, SuperInvesting achieves the highest aggregate performance, with an average factual accuracy score of 8.96/10 and the highest completeness score of 56.65/70, while also demonstrating the lowest hallucination rate among evaluated systems. Retrieval-oriented systems such as Perplexity perform strongly on data recency tasks due to live information access but exhibit weaker analytical synthesis and consistency. Overall, the results highlight that financial intelligence in large language models is inherently multi-dimensional, and systems that combine structured financial data access with analytical reasoning capabilities provide the most reliable performance for complex investment research workflows.
中文标题/摘要
标题:大型语言模型的金融智能评估:SuperInvesting AI与LLM引擎的基准测试
大型语言模型在金融分析和投资研究中的应用日益增多,但对其金融推理能力的系统性评估仍然有限。本文介绍了AI金融智能基准(AFIB),这是一个多维度的评估框架,旨在从五个维度(事实准确性、分析完整性、数据时效性、模型一致性、失败模式)评估金融分析能力。我们使用95多个结构化的金融分析问题数据集,评估了五个AI系统:GPT、Gemini、Perplexity、Claude和SuperInvesting。结果显示,不同模型在性能上存在显著差异。在基准测试环境中,SuperInvesting表现出最高的综合性能,平均事实准确性得分为8.96/10,完整性得分为56.65/70,同时表现出最低的幻觉率。以检索为导向的系统如Perplexity在数据时效性任务中表现出色,但由于实时信息访问,其分析综合能力和一致性较弱。总体而言,结果表明,大型语言模型中的金融智能是多维度的,结合结构化金融数据访问与分析推理能力的系统在复杂投资研究流程中提供了最可靠的性能。
Summary / 总结
This study introduces the AI Financial Intelligence Benchmark (AFIB) to evaluate financial reasoning capabilities of large language models across five dimensions: factual accuracy, analytical completeness, data recency, model consistency, and failure patterns. Five AI systems—GPT, Gemini, Perplexity, Claude, and SuperInvesting—are assessed using 95+ structured financial analysis questions. SuperInvesting shows the highest performance, with the best factual accuracy and completeness scores, and the lowest hallucination rate. Retrieval-oriented systems like Perplexity excel in data recency but struggle with analytical synthesis and consistency.
本研究引入了AI金融智能基准(AFIB),以评估大型语言模型在事实准确性、分析完整性、数据时效性、模型一致性及失败模式五个维度上的金融推理能力。使用95+个结构化的金融分析问题评估了五种AI系统——GPT、Gemini、Perplexity、Claude和SuperInvesting。SuperInvesting在事实准确性与分析完整性方面得分最高,并且幻觉率最低。检索导向的系统如Perplexity在数据时效性方面表现优异,但在分析综合和一致性方面表现较弱。
DemoDiffusion: One-Shot Human Imitation using pre-trained Diffusion Policy
Authors: Sungjae Park, Homanga Bharadhwaj, Shubham Tulsiani
Venue: ICRA 2026
First: 2025-06-25T17:59:01+00:00 · Latest: 2026-03-09T17:58:20+00:00
Comments: 11 pages. Published at ICRA 2026
Abstract
We propose DemoDiffusion, a simple method for enabling robots to perform manipulation tasks by imitating a single human demonstration, without requiring task-specific training or paired human-robot data. Our approach is based on two insights. First, the hand motion in a human demonstration provides a useful prior for the robot's end-effector trajectory, which we can convert into a rough open-loop robot motion trajectory via kinematic retargeting. Second, while this retargeted motion captures the overall structure of the task, it may not align well with plausible robot actions in-context. To address this, we leverage a pre-trained generalist diffusion policy to modify the trajectory, ensuring it both follows the human motion and remains within the distribution of plausible robot actions. Unlike approaches based on online reinforcement learning or paired human-robot data, our method enables robust adaptation to new tasks and scenes with minimal effort. In real-world experiments across 8 diverse manipulation tasks, DemoDiffusion achieves 83.8\% average success rate, compared to 13.8\% for the pre-trained policy and 52.5\% for kinematic retargeting, succeeding even on tasks where the pre-trained generalist policy fails entirely. Project page: https://demodiffusion.github.io/
中文标题/摘要
标题:DemoDiffusion:使用预训练扩散策略的一次性人类模仿
我们提出了一种名为DemoDiffusion的简单方法,使机器人能够通过模仿单一个人的演示来执行操作任务,而无需进行特定任务的训练或人机配对数据。我们的方法基于两个洞察。首先,人类演示中的手部运动为机器人的末端执行器轨迹提供了一个有用的先验,我们可以通过运动学重定位将其转换为粗略的开环机器人运动轨迹。其次,虽然重定位的运动捕捉了任务的整体结构,但在上下文中的合理机器人动作可能并不匹配。为了解决这个问题,我们利用预训练的一般扩散策略来修改轨迹,确保它既遵循人类运动,又保持在合理的机器人动作分布内。与基于在线强化学习或人机配对数据的方法不同,我们的方法能够以最小的努力实现对新任务和场景的稳健适应。在8个不同操作任务的实地实验中,DemoDiffusion的平均成功率达到了83.8%,而预训练策略仅为13.8%,运动学重定位为52.5%,甚至在预训练通用策略完全失败的任务中也取得了成功。项目页面:https://demodiffusion.github.io/
Summary / 总结
DemoDiffusion is a method that allows robots to imitate a single human demonstration for manipulation tasks without task-specific training or paired human-robot data. It uses kinematic retargeting to convert human hand motions into rough robot trajectories and then modifies these trajectories using a pre-trained diffusion policy to ensure they are both human-like and feasible for the robot. In experiments across 8 diverse tasks, DemoDiffusion achieved an 83.8% success rate, significantly outperforming the pre-trained policy (13.8%) and kinematic retargeting (52.5%).
DemoDiffusion 是一种方法,可以让机器人通过单次人类演示来执行操作任务,无需特定任务的训练。它使用运动重定位将人类手部动作转换为粗略的机器人轨迹,然后使用预训练的扩散策略来修改这些轨迹,确保它们既符合人类动作又对机器人可行。在8个不同任务的实验中,DemoDiffusion 的成功率达到了83.8%,远超预训练策略(13.8%)和运动重定位(52.5%)。
HiAR: Efficient Autoregressive Long Video Generation via Hierarchical Denoising
Authors: Kai Zou, Dian Zheng, Hongbo Liu, Tiankai Hang, Bin Liu, Nenghai Yu
First: 2026-03-09T17:58:16+00:00 · Latest: 2026-03-09T17:58:16+00:00
Comments: Project page: https://jacky-hate.github.io/HiAR/ Code: https://github.com/Jacky-hate/HiAR
Abstract
Autoregressive (AR) diffusion offers a promising framework for generating videos of theoretically infinite length. However, a major challenge is maintaining temporal continuity while preventing the progressive quality degradation caused by error accumulation. To ensure continuity, existing methods typically condition on highly denoised contexts; yet, this practice propagates prediction errors with high certainty, thereby exacerbating degradation. In this paper, we argue that a highly clean context is unnecessary. Drawing inspiration from bidirectional diffusion models, which denoise frames at a shared noise level while maintaining coherence, we propose that conditioning on context at the same noise level as the current block provides sufficient signal for temporal consistency while effectively mitigating error propagation. Building on this insight, we propose HiAR, a hierarchical denoising framework that reverses the conventional generation order: instead of completing each block sequentially, it performs causal generation across all blocks at every denoising step, so that each block is always conditioned on context at the same noise level. This hierarchy naturally admits pipelined parallel inference, yielding a 1.8 wall-clock speedup in our 4-step setting. We further observe that self-rollout distillation under this paradigm amplifies a low-motion shortcut inherent to the mode-seeking reverse-KL objective. To counteract this, we introduce a forward-KL regulariser in bidirectional-attention mode, which preserves motion diversity for causal inference without interfering with the distillation loss. On VBench (20s generation), HiAR achieves the best overall score and the lowest temporal drift among all compared methods.
中文标题/摘要
标题:HiAR:通过分层去噪高效自回归长视频生成
自回归(AR)扩散提供了一种生成理论上无限长度视频的有希望框架。然而,主要挑战在于保持时间连续性的同时防止由误差累积导致的渐进质量退化。为了确保连续性,现有方法通常依赖高度去噪的上下文;然而,这种方法会以高确定性传播预测误差,从而加剧退化。在本文中,我们主张高度清洁的上下文并非必要。受到双向扩散模型的启发,这些模型在共享噪声水平下去噪帧并保持一致性,我们提出在当前块与上下文具有相同噪声水平的情况下进行条件化,这为时间一致性提供了足够的信号,同时有效减少了误差传播。基于这一见解,我们提出了HiAR,一种分层去噪框架,逆转了传统的生成顺序:而不是逐个完成每个块,它在每个去噪步骤中对所有块进行因果生成,使得每个块总是基于相同噪声水平的上下文进行条件化。这种分层结构自然地允许流水线并行推理,我们在4步设置中获得了1.8倍的时钟速度提升。我们还观察到,在这种范式下,自我展开蒸馏放大了模式寻求反KL目标固有的低运动捷径。为了抵消这一现象,我们引入了双向注意模式下的前向KL正则化器,这在因果推理中保留了运动多样性,而不会干扰蒸馏损失。
Summary / 总结
The paper addresses the challenge of generating long videos with temporal continuity using autoregressive diffusion. It proposes HiAR, a hierarchical denoising framework that conditions each block on context at the same noise level, thereby mitigating error propagation and maintaining temporal consistency. HiAR achieves the best overall score and the lowest temporal drift on VBench compared to other methods.
论文旨在解决使用自回归扩散生成长视频时保持时间连续性和防止质量退化的问题。它提出了HiAR,一种分层去噪框架,每个块都基于相同噪声水平的上下文进行条件化,从而实现并行推理并减少错误传播。HiAR在VBench上表现出最低的时间漂移和最佳的整体得分。
Tree-based Dialogue Reinforced Policy Optimization for Red-Teaming Attacks
Authors: Ruohao Guo, Afshin Oroojlooy, Roshan Sridhar, Miguel Ballesteros, Alan Ritter, Dan Roth
Venue: ICLR 2026
First: 2025-10-02T17:57:05+00:00 · Latest: 2026-03-09T17:54:56+00:00
Comments: Accepted at ICLR 2026
Abstract
Despite recent rapid progress in AI safety, current large language models remain vulnerable to adversarial attacks in multi-turn interaction settings, where attackers strategically adapt their prompts across conversation turns and pose a more critical yet realistic challenge. Existing approaches that discover safety vulnerabilities either rely on manual red-teaming with human experts or employ automated methods using pre-defined templates and human-curated attack data, with most focusing on single-turn attacks. However, these methods did not explore the vast space of possible multi-turn attacks, failing to consider novel attack trajectories that emerge from complex dialogue dynamics and strategic conversation planning. This gap is particularly critical given recent findings that LLMs exhibit significantly higher vulnerability to multi-turn attacks compared to single-turn attacks. We propose DialTree, an on-policy reinforcement learning framework integrated with tree search that autonomously discovers diverse multi-turn attack strategies by treating the dialogue as a sequential decision-making problem, enabling systematic exploration without manually curated data. Through extensive experiments, our approach not only achieves more than 44.2% higher ASR across 12 target models compared to previous state-of-the-art approaches, but also effectively uncovers new attack strategies by learning optimal dialogue policies that maximize attack success across multiple turns.
中文标题/摘要
标题:基于树的对话强化策略优化以应对红队攻击
尽管人工智能安全领域取得了快速进展,当前的大语言模型在多轮交互场景中仍然容易受到对手的对抗攻击,攻击者会战略性地调整其提示,这构成了更为关键且现实的挑战。现有的发现安全漏洞的方法要么依赖于人工红队测试,要么使用预定义模板和人工收集的攻击数据进行自动化方法,大多数方法集中在单轮攻击上。然而,这些方法并未探索可能的多轮攻击空间,未能考虑从复杂对话动态和战略性对话规划中产生的新型攻击轨迹。鉴于最近的研究发现,LLMs在多轮攻击中的脆弱性远高于单轮攻击,这一差距尤为重要。我们提出了DialTree,这是一种结合树搜索的在线策略强化学习框架,能够自主发现多样化的多轮攻击策略,将对话视为顺序决策问题,从而无需手动收集的数据即可系统地进行探索。通过广泛的实验,我们的方法不仅在12个目标模型上实现了比之前最先进的方法高出44.2%以上的攻击成功率,还通过学习最大化多轮攻击成功率的最优对话策略,有效发现了新的攻击策略。
Summary / 总结
The research aims to address the vulnerability of large language models to adversarial attacks in multi-turn interaction settings. It introduces DialTree, an on-policy reinforcement learning framework that uses tree search to autonomously discover diverse multi-turn attack strategies. The approach significantly improves attack success rate, achieving more than 44.2% higher ASR across 12 target models compared to previous methods and uncovers new attack strategies through optimal dialogue policies.
研究针对大型语言模型在多轮交互设置中对对手攻击的脆弱性,其中攻击者会在对话轮次中适应其提示。提出了一种结合树搜索的在线策略强化学习框架DialTree,以自主发现多轮攻击策略。该方法通过学习最优对话策略,在12个目标模型上实现了比之前最佳方法高出超过44.2%的攻击成功率,并通过学习发现了新的攻击策略。
Benchmarking Language Modeling for Lossless Compression of Full-Fidelity Audio
Authors: Phillip Long, Zachary Novack, Chris Donahue
First: 2026-03-09T17:52:02+00:00 · Latest: 2026-03-09T17:52:02+00:00
Comments: Submitted for review at Interspeech 2026, 7 pages, 5 figures
Abstract
Autoregressive "language" models (LMs) trained on raw waveforms can be repurposed for lossless audio compression, but prior work is limited to 8-bit audio, leaving open whether such approaches work for practical settings (16/24-bit) and can compete with existing codecs. We benchmark LM-based compression on full-fidelity audio across diverse domains (music, speech, bioacoustics), sampling rates (16kHz-48kHz), and bit depths (8, 16, 24-bit). Standard sample-level tokenization becomes intractable at higher bit depths due to vocabulary size (65K for 16-bit; 16.7M for 24-bit). We propose Trilobyte, a byte-level tokenization schema for full resolution audio, improving vocabulary scaling from $O(2^{b})$ to $O(1)$ and enabling the first tractable 24-bit LM-based lossless compression. While LMs consistently outperform FLAC and yield state-of-the-art compression at 8-bit and 16-bit, we observe that compression gains become more modest as bit depth increases beyond 8-bit.
中文标题/摘要
标题:基于语言模型的全保真音频无损压缩基准测试
自回归"语言"模型(LMs)在原始波形上训练后可以重新用于无损音频压缩,但先前的工作仅限于8位音频,这留下了这样的方法是否适用于实际设置(16/24位)以及能否与现有编解码器竞争的问题。我们在多样化的领域(音乐、语音、生物声学)、采样率(16kHz-48kHz)和位深度(8、16、24位)上对基于LM的压缩进行了基准测试。标准的样本级标记化在更高的位深度下变得不可行,因为词汇表大小(16位为65K;24位为16.7M)。我们提出了Trilobyte,一种全分辨率音频的字节级标记化方案,将词汇表扩展从$O(2^{b})$改进到$O(1)$,并使基于LM的24位无损压缩首次变得可行。虽然LMs在8位和16位时始终优于FLAC并提供最先进的压缩效果,但我们观察到,随着位深度超过8位,压缩增益变得更为有限。
Summary / 总结
The study aims to evaluate the effectiveness of language models (LMs) for lossless compression of full-fidelity audio, particularly for 16/24-bit audio, which is beyond the scope of previous research focusing on 8-bit audio. The authors propose Trilobyte, a byte-level tokenization method, to address the vocabulary size issue at higher bit depths. Experiments show that LMs outperform FLAC for 8-bit and 16-bit audio, but the compression benefits diminish as the bit depth increases beyond 8-bit.
该研究对自回归语言模型(LMs)在全保真音频无损压缩中的性能进行了基准测试,扩展到16/24位音频。研究提出了一种字节级分词方法Trilobyte,以解决高位深下的词汇量问题。实验表明,LMs在8位和16位音频上优于FLAC,但在24位音频上的压缩增益有所减少。
ER-Pose: Rethinking Keypoint-Driven Representation Learning for Real-Time Human Pose Estimation
Authors: Nanjun Li, Pinqi Cheng, Zean Liu, Minghe Tian, Xuanyin Wang
First: 2026-03-09T17:49:46+00:00 · Latest: 2026-03-09T17:49:46+00:00
Abstract
Single-stage multi-person pose estimation aims to jointly perform human localization and keypoint prediction within a unified framework, offering advantages in inference efficiency and architectural simplicity. Consequently, multi-scale real-time detection architectures, such as YOLO-like models, are widely adopted for real-time pose estimation. However, these approaches typically inherit a box-driven modeling paradigm from object detection, in which pose estimation is implicitly constrained by bounding-box supervision during training. This formulation introduces biases in sample assignment and feature representation, resulting in task misalignment and ultimately limiting pose estimation accuracy. In this work, we revisit box-driven single-stage pose estimation from a keypoint-driven perspective and identify semantic conflicts among parallel objectives as a key source of performance degradation. To address this issue, we propose a keypoint-driven learning paradigm that elevates pose estimation to a primary prediction objective. Specifically, we remove bounding-box prediction and redesign the prediction head to better accommodate the high-dimensional structured representations for pose estimation. We further introduce a keypoint-driven dynamic sample assignment strategy to align training objectives with pose evaluation metrics, enabling dense supervision during training and efficient NMS-free inference. In addition, we propose a smooth OKS-based loss function to stabilize optimization in regression-based pose estimation. Based on these designs, we develop a single-stage multi-person pose estimation framework, termed ER-Pose. On MS COCO and CrowdPose, ER-Pose-n achieves AP improvements of 3.2/6.7 without pre-training and 7.4/4.9 with pre-training respectively compared with the baseline YOLO-Pose. These improvements are achieved with fewer parameters and higher inference efficiency.
中文标题/摘要
标题:ER-Pose:重新思考基于关键点的表示学习以实现实时人体姿态估计
单阶段多人姿态估计旨在在一个统一框架内同时进行人体定位和关键点预测,从而在推理效率和架构简洁性方面具有优势。因此,多尺度实时检测架构,如YOLO类模型,广泛用于实时姿态估计。然而,这些方法通常继承了从目标检测中继承的框驱动建模范式,在训练过程中姿态估计隐式地受到边界框监督的约束。这种形式引入了样本分配和特征表示的偏差,导致任务不匹配,最终限制了姿态估计的准确性。在本文中,我们从关键点驱动的角度重新审视了框驱动的单阶段姿态估计,并将并行目标之间的语义冲突识别为性能下降的关键来源。为了解决这一问题,我们提出了一种基于关键点的训练范式,将姿态估计提升为主要预测目标。具体来说,我们移除了边界框预测,并重新设计了预测头,以更好地适应姿态估计所需的高维结构化表示。我们还引入了一种基于关键点的动态样本分配策略,以使训练目标与姿态评估指标对齐,在训练期间提供密集监督,并实现高效的无NMS推理。此外,我们提出了一种平滑的OKS损失函数,以在基于回归的姿态估计中稳定优化。基于这些设计,我们开发了一种单阶段多人姿态估计框架,称为ER-Pose。在MS COCO和CrowdPose上,与基线YOLO-Pose相比,ER-Pose-n分别在无预训练情况下实现了3.2/6.7的AP改进,在有预训练情况下实现了7.4/4.9的AP改进。这些改进是在参数更少和更高的推理效率下实现的。
Summary / 总结
This work addresses the limitations of box-driven single-stage pose estimation by proposing a keypoint-driven learning paradigm, termed ER-Pose. The method removes bounding-box prediction and redesigns the prediction head to better handle high-dimensional structured representations. It also introduces a dynamic sample assignment strategy and a smooth OKS-based loss function. On MS COCO and CrowdPose, ER-Pose shows significant improvements in AP, achieving 3.2/6.7 and 7.4/4.9 gains without and with pre-training, respectively, while maintaining fewer parameters and higher inference efficiency.
该研究提出了一种基于关键点的学习框架ER-Pose,通过移除边界框预测并重新设计预测头来更好地适应姿态估计的高维结构表示。引入了一种动态样本分配策略和一种平滑的OKS损失函数,以使训练目标与评估指标对齐。在MS COCO和CrowdPose上,ER-Pose在无预训练情况下分别实现了3.2/6.7的AP改进,在有预训练情况下分别实现了7.4/4.9的AP改进,同时使用更少的参数并保持更高的推理效率。
A New Lower Bound for the Random Offerer Mechanism in Bilateral Trade using AI-Guided Evolutionary Search
Authors: Yang Cai, Vineet Gupta, Zun Li, Aranyak Mehta
First: 2026-03-09T17:49:02+00:00 · Latest: 2026-03-09T17:49:02+00:00
Abstract
The celebrated Myerson--Satterthwaite theorem shows that in bilateral trade, no mechanism can be simultaneously fully efficient, Bayesian incentive compatible (BIC), and budget balanced (BB). This naturally raises the question of how closely the gains from trade (GFT) achievable by a BIC and BB mechanism can approximate the first-best (fully efficient) benchmark. The optimal BIC and BB mechanism is typically complex and highly distribution-dependent, making it difficult to characterize directly. Consequently, much of the literature analyzes simpler mechanisms such as the Random-Offerer (RO) mechanism and establishes constant-factor guarantees relative to the first-best GFT. An important open question concerns the worst-case performance of the RO mechanism relative to first-best (FB) efficiency. While it was originally hypothesized that the approximation ratio $\frac{\text{GFT}_{\text{FB}}}{\text{GFT}_{\text{RO}}}$ is bounded by $2$, recent work provided counterexamples to this conjecture: Cai et al. proved that the ratio can be strictly larger than $2$, and Babaioff et al. exhibited an explicit example with ratio approximately $2.02$.
In this work, we employ AlphaEvolve, an AI-guided evolutionary search framework, to explore the space of value distributions. We identify a new worst-case instance that yields an improved lower bound of $\frac{\text{GFT}_{\text{FB}}}{\text{GFT}_{\text{RO}}} \ge \textbf{2.0749}$. This establishes a new lower bound on the worst-case performance of the Random-Offerer mechanism, demonstrating a wider efficiency gap than previously known.
Summary / 总结
This study aims to explore the efficiency gap between the Random-Offerer mechanism and the first-best benchmark in bilateral trade. By using AlphaEvolve, an AI-guided evolutionary search framework, the researchers identified a new worst-case instance that improves the lower bound of the efficiency ratio to 2.0749, indicating a wider efficiency gap than previously known.
该研究旨在探索随机报价机制与最优效率之间的效率差距。作者使用了AlphaEvolve,一种基于AI的进化搜索框架,找到了一个新的最坏情况实例,将近似比的下界提高到2.0749,表明与已知情况相比,效率差距更大。
Talking Together: Synthesizing Co-Located 3D Conversations from Audio
Authors: Mengyi Shan, Shouchieh Chang, Ziqian Bai, Shichen Liu, Yinda Zhang, Luchuan Song, Rohit Pandey, Sean Fanello, Zeng Huang
Venue: CVPR 2026
First: 2026-03-09T17:46:52+00:00 · Latest: 2026-03-09T17:46:52+00:00
Comments: Accepted to CVPR 2026
Abstract
We tackle the challenging task of generating complete 3D facial animations for two interacting, co-located participants from a mixed audio stream. While existing methods often produce disembodied "talking heads" akin to a video conference call, our work is the first to explicitly model the dynamic 3D spatial relationship -- including relative position, orientation, and mutual gaze -- that is crucial for realistic in-person dialogues. Our system synthesizes the full performance of both individuals, including precise lip-sync, and uniquely allows their relative head poses to be controlled via textual descriptions. To achieve this, we propose a dual-stream architecture where each stream is responsible for one participant's output. We employ speaker's role embeddings and inter-speaker cross-attention mechanisms designed to disentangle the mixed audio and model the interaction. Furthermore, we introduce a novel eye gaze loss to promote natural, mutual eye contact. To power our data-hungry approach, we introduce a novel pipeline to curate a large-scale conversational dataset consisting of over 2 million dyadic pairs from in-the-wild videos. Our method generates fluid, controllable, and spatially aware dyadic animations suitable for immersive applications in VR and telepresence, significantly outperforming existing baselines in perceived realism and interaction coherence.
中文标题/摘要
标题:共同交流:从混合音频流合成共驻3D对话
我们解决了从混合音频流生成两个互动共驻参与者完整3D面部动画的挑战性任务。现有方法通常生成脱节的“说话头像”,类似于视频会议通话,而我们的工作是首个明确建模动态3D空间关系——包括相对位置、姿态和相互凝视——这对于现实对话至关重要。我们的系统综合了两人的完整表现,包括精确的唇部同步,并且独特地允许通过文本描述控制他们的相对头部姿态。为实现这一目标,我们提出了一种双流架构,其中每一流负责一个参与者的输出。我们使用说话人的角色嵌入和跨说话人交叉注意力机制来分离混合音频并建模互动。此外,我们引入了一种新颖的眼球凝视损失,以促进自然的相互凝视。为了支持我们数据饥渴的方法,我们引入了一种新颖的工作流程来收集包含超过200万对的大型对话数据集,来自野外视频。我们的方法生成流畅、可控且具有空间意识的双人动画,适用于VR和远程呈现的沉浸式应用,在感知真实性和互动连贯性方面显著优于现有基线。
Summary / 总结
This research addresses the challenge of generating realistic 3D facial animations for two co-located speakers from a mixed audio stream. It introduces a dual-stream architecture with speaker role embeddings and inter-speaker cross-attention to disentangle audio and model interaction, along with a novel eye gaze loss for natural mutual eye contact. The system produces fluid, controllable, and spatially aware animations, surpassing existing methods in perceived realism and interaction coherence. A large-scale conversational dataset was curated to support this data-hungry approach.
该研究解决了从混合音频流生成两个互动个体的逼真3D面部动画的挑战,重点关注其动态空间关系。提出的双流架构分离音频并通过说话人嵌入和交叉注意力机制建模互动。关键发现包括流畅、可控且具有空间意识的动画,在感知真实性和互动连贯性方面显著优于现有方法。
Do Schwartz Higher-Order Values Help Sentence-Level Human Value Detection? A Study of Hierarchical Gating and Calibration
Authors: Víctor Yeste, Paolo Rosso
First: 2026-01-31T21:50:35+00:00 · Latest: 2026-03-09T17:41:24+00:00
Comments: Code: https://github.com/VictorMYeste/human-value-detection, models: https://huggingface.co/papers/2602.00913, 27 pages, 4 figures
Abstract
Human value detection from single sentences is a sparse, imbalanced multi-label task. We study whether Schwartz higher-order (HO) categories help this setting on ValueEval'24 / ValuesML (74K English sentences) under a compute-frugal budget. Rather than proposing a new architecture, we compare direct supervised transformers, hard HO$\rightarrow$values pipelines, Presence$\rightarrow$HO$\rightarrow$values cascades, compact instruction-tuned large language models (LLMs), QLoRA, and low-cost upgrades such as threshold tuning and small ensembles. HO categories are learnable: the easiest bipolar pair, Growth vs. Self-Protection, reaches Macro-$F_1=0.58$. The most reliable gains come from calibration and ensembling: threshold tuning improves Social Focus vs. Personal Focus from $0.41$ to $0.57$ ($+0.16$), transformer soft voting lifts Growth from $0.286$ to $0.303$, and a Transformer+LLM hybrid reaches $0.353$ on Self-Protection. In contrast, hard hierarchical gating does not consistently improve the end task. Compact LLMs also underperform supervised encoders as stand-alone systems, although they sometimes add useful diversity in hybrid ensembles. Under this benchmark, the HO structure is more useful as an inductive bias than as a rigid routing rule.
中文标题/摘要
标题:施瓦茨高层次价值观是否有助于句子级人类价值观检测?一项关于计算节约型预算下层次门控和校准的研究
从单句中检测人类价值观是一项稀疏、不平衡的多标签任务。我们研究施瓦茨高层次(HO)类别是否有助于ValueEval'24 / ValuesML(74000条英语句子)的设置。我们没有提出新的架构,而是比较了直接监督的变压器、硬HO→价值观管道、存在→HO→价值观级联、紧凑指令调优的大语言模型(LLMs)、QLoRA以及低成本升级如阈值调优和小型集成。HO类别是可学习的:最容易的二极分类,成长 vs. 自我保护,达到宏-$F_1=0.58$。最可靠的增益来自校准和集成:阈值调优将社会焦点 vs. 个人焦点从$0.41$提升到$0.57$(+0.16),变压器软投票将成长从$0.286$提升到$0.303$,而Transformer+LLM混合体在自我保护上达到$0.353$。相比之下,硬层次门控并不一致地改善最终任务。紧凑的LLMs作为独立系统也劣于监督编码器,尽管它们有时在混合集成中增加有用多样性。在这一基准下,HO结构作为归纳偏置比作为刚性路由规则更有用。
Summary / 总结
This study investigates the utility of Schwartz higher-order (HO) categories in enhancing sentence-level human value detection, a sparse and imbalanced multi-label task. The research compares various methods including direct supervised transformers, HO-to-values pipelines, and compact instruction-tuned large language models. Key findings show that calibration and ensembling techniques yield the most reliable gains, with threshold tuning improving Macro-$F_1$ scores and transformer soft voting enhancing detection accuracy. In contrast, hierarchical gating and compact LLMs do not consistently improve the task performance.
研究探讨了施瓦茨更高层次(HO)类别在提升单句人类价值观检测中的作用,这是一个稀疏且不平衡的多标签任务。研究对比了直接监督Transformer、HO到价值观的管道、紧凑指令调优的大语言模型等多种方法。关键发现表明,校准和集成技术提供了最可靠的改进,阈值调整提高了宏-$F_1$分数,而Transformer软投票提升了检测准确性。相比之下,层次门控和紧凑型大语言模型在任务性能上并未表现出一致的改进效果。
ImprovedGS+: A High-Performance C++/CUDA Re-Implementation Strategy for 3D Gaussian Splatting
Authors: Jordi Muñoz Vicente
First: 2026-03-09T17:38:27+00:00 · Latest: 2026-03-09T17:38:27+00:00
Comments: 6 pages, 1 figure. Technical Report. This work introduces ImprovedGS+, a library-free C++/CUDA implementation for 3D Gaussian Splatting within the LichtFeld-Studio framework. Source code available at https://github.com/jordizv/ImprovedGS-Plus
Abstract
Recent advancements in 3D Gaussian Splatting (3DGS) have shifted the focus toward balancing reconstruction fidelity with computational efficiency. In this work, we propose ImprovedGS+, a high-performance, low-level reinvention of the ImprovedGS strategy, implemented natively within the LichtFeld-Studio framework. By transitioning from high-level Python logic to hardware-optimized C++/CUDA kernels, we achieve a significant reduction in host-device synchronization and training latency. Our implementation introduces a Long-Axis-Split (LAS) CUDA kernel, custom Laplacian-based importance kernels with Non-Maximum Suppression (NMS) for edge scores, and an adaptive Exponential Scale Scheduler. Experimental results on the Mip-NeRF360 dataset demonstrate that ImprovedGS+ establishes a new Pareto-optimal front for scene reconstruction. Our 1M-budget variant outperforms the state-of-the-art MCMC baseline by achieving a 26.8% reduction in training time (saving 17 minutes per session) and utilizing 13.3% fewer Gaussians while maintaining superior visual quality. Furthermore, our full variant demonstrates a 1.28 dB PSNR increase over the ADC baseline with a 38.4% reduction in parametric complexity. These results validate ImprovedGS+ as a scalable, high-speed solution that upholds the core pillars of Speed, Quality, and Usability within the LichtFeld-Studio ecosystem.
中文标题/摘要
标题:ImprovedGS+: LichtFeld-Studio框架内的高性能C++/CUDA重新实现策略
近年来,3D高斯点绘(3DGS)的发展重点转向了重建保真度与计算效率之间的平衡。本文提出了一种高性能的低级重新实现策略ImprovedGS+,该策略在LichtFeld-Studio框架内原生实现。通过从高级Python逻辑转换为硬件优化的C++/CUDA内核,我们显著减少了主机-设备同步和训练延迟。我们的实现引入了长轴分割(LAS)CUDA内核、基于拉普拉斯的重要性内核以及带有非最大值抑制(NMS)的边缘评分内核,并采用自适应指数缩放调度器。在Mip-NeRF360数据集上的实验结果表明,ImprovedGS+在场景重建方面建立了新的帕累托前沿。我们的1M预算变体在训练时间上比最先进的MCMC基线减少了26.8%(每会话节省17分钟),同时使用了13.3%更少的高斯点,且保持了更优的视觉质量。此外,我们的完整变体在参数复杂度减少38.4%的情况下,PSNR提高了1.28 dB。这些结果验证了ImprovedGS+作为可扩展、高速解决方案的有效性,它在LichtFeld-Studio生态系统中保持了速度、质量和易用性的核心支柱。
Summary / 总结
ImprovedGS+ is a high-performance re-implementation of the ImprovedGS strategy for 3D Gaussian Splatting, optimized in C++/CUDA within the LichtFeld-Studio framework. It introduces several improvements such as a Long-Axis-Split CUDA kernel, custom Laplacian-based importance kernels with NMS, and an adaptive Exponential Scale Scheduler. Experimental results show that ImprovedGS+ achieves a 26.8% reduction in training time, uses fewer Gaussians, and maintains superior visual quality compared to the state-of-the-art MCMC baseline. Additionally, the full variant of ImprovedGS+ demonstrates a 1.28 dB PSNR increase with reduced parametric complexity compared to the ADC baseline.
ImprovedGS+ 是一种针对 3D 高斯点绘制的高性能 C++/CUDA 重实现,优化了 ImprovedGS 策略。它引入了长轴分割 CUDA 内核、基于拉普拉斯的重要性内核以及非最大抑制和自适应指数缩放调度器。在 Mip-NeRF360 数据集上,ImprovedGS+ 实现了 26.8% 的训练时间减少,使用了更少的高斯点,并保持了更高的视觉质量。完整版本还显示了 1.28 dB 的 PSNR 增加,同时减少了参数复杂度。
How Far Can Unsupervised RLVR Scale LLM Training?
Authors: Bingxiang He, Yuxin Zuo, Zeyuan Liu, Shangziqi Zhao, Zixuan Fu, Junlin Yang, Cheng Qian, Kaiyan Zhang, Yuchen Fan, Ganqu Cui, Xiusi Chen, Youbang Sun, Xingtai Lv, Xuekai Zhu, Li Sheng, Ran Li, Huan-ang Gao, Yuchen Zhang, Bowen Zhou, Zhiyuan Liu, Ning Ding
Venue: ICLR 2026
First: 2026-03-09T17:38:11+00:00 · Latest: 2026-03-09T17:38:11+00:00
Comments: Accepted to the ICLR 2026
Abstract
Unsupervised reinforcement learning with verifiable rewards (URLVR) offers a pathway to scale LLM training beyond the supervision bottleneck by deriving rewards without ground truth labels. Recent works leverage model intrinsic signals, showing promising early gains, yet their potential and limitations remain unclear. In this work, we revisit URLVR and provide a comprehensive analysis spanning taxonomy, theory and extensive experiments. We first classify URLVR methods into intrinsic versus external based on reward sources, then establish a unified theoretical framework revealing that all intrinsic methods converge toward sharpening the model's initial distribution This sharpening mechanism succeeds when initial confidence aligns with correctness but fails catastrophically when misaligned. Through systematic experiments, we show intrinsic rewards consistently follow a rise-then-fall pattern across methods, with collapse timing determined by model prior rather than engineering choices. Despite these scaling limits, we find intrinsic rewards remain valuable in test-time training on small datasets, and propose Model Collapse Step to measure model prior, serving as a practical indicator for RL trainability. Finally, we explore external reward methods that ground verification in computational asymmetries, showing preliminary evidence they may escape the confidence-correctness ceiling. Our findings chart boundaries for intrinsic URLVR while motivating paths toward scalable alternatives.
中文标题/摘要
标题:无监督RLVR如何扩展LLM训练的规模?
无监督强化学习带可验证奖励(URLVR)提供了一种途径,通过无需真实标签即可推导奖励来超越监督瓶颈,从而扩展LLM训练的规模。近期研究利用模型固有信号,显示出早期的积极成果,但其潜力和局限性尚不明确。在本文中,我们重新审视URLVR,并进行了全面的分析,涵盖分类学、理论和大量实验。我们首先根据奖励来源将URLVR方法分为固有和外部两类,然后建立了一个统一的理论框架,揭示所有固有方法都趋向于细化模型的初始分布。这种细化机制在初始置信度与正确性一致时成功,但在不一致时会灾难性地失败。通过系统实验,我们展示了固有奖励在方法间始终遵循先上升后下降的趋势,崩溃时间由模型先验决定而非工程选择。尽管存在这些扩展限制,我们发现固有奖励在小数据集的测试时训练中仍然有价值,并提出模型崩溃步来衡量模型先验,作为RL可训练性的实用指标。最后,我们探索了基于计算不对称性进行验证的外部奖励方法,显示初步证据表明它们可能能够摆脱置信度-正确性天花板。我们的发现为固有URLVR设定了边界,同时激励寻找可扩展替代方案的道路。
Summary / 总结
This study explores the scalability of unsupervised reinforcement learning with verifiable rewards (URLVR) for large language model (LLM) training, addressing the limitations of intrinsic and external reward methods. The research classifies URLVR methods and develops a theoretical framework showing that intrinsic methods sharpen the model's initial distribution, which can lead to catastrophic failure if initial confidence is misaligned. Experiments reveal a consistent rise-then-fall pattern in intrinsic rewards, with collapse timing depending on the model's prior rather than engineering choices. The study proposes a Model Collapse Step to measure this prior and suggests external reward methods may offer a way to overcome the confidence-correctness ceiling. This work sets boundaries for intrinsic URLVR and points to potential scalable alternatives.
该研究探讨了无监督强化学习与可验证奖励(URLVR)方法在大规模语言模型(LLM)训练中的扩展性。研究将URLVR方法分为基于内在和外部奖励来源两类,并建立了一个理论框架,表明内在方法会强化模型的初始分布,但如果初始信心与正确性不一致,则可能导致灾难性失败。实验显示,内在奖励遵循先上升后下降的模式,崩溃时间由模型先验决定。研究提出了一个模型崩溃步长来衡量模型先验,这可以作为强化学习可训练性的指标。此外,研究还表明,基于计算不对称性进行验证的外部奖励方法可能能够克服信心与正确性之间的天花板。
From Pixels to Predicates: Learning Symbolic World Models via Pretrained Vision-Language Models
Authors: Ashay Athalye, Nishanth Kumar, Tom Silver, Yichao Liang, Jiuguang Wang, Tomás Lozano-Pérez, Leslie Pack Kaelbling
First: 2024-12-31T06:14:16+00:00 · Latest: 2026-03-09T17:35:57+00:00
Comments: A version of this paper appears in the official proceedings of RA-L, Volume 11, Issue 4
Abstract
Our aim is to learn to solve long-horizon decision-making problems in complex robotics domains given low-level skills and a handful of short-horizon demonstrations containing sequences of images. To this end, we focus on learning abstract symbolic world models that facilitate zero-shot generalization to novel goals via planning. A critical component of such models is the set of symbolic predicates that define properties of and relationships between objects. In this work, we leverage pretrained vision-language models (VLMs) to propose a large set of visual predicates potentially relevant for decision-making, and to evaluate those predicates directly from camera images. At training time, we pass the proposed predicates and demonstrations into an optimization-based model-learning algorithm to obtain an abstract symbolic world model that is defined in terms of a compact subset of the proposed predicates. At test time, given a novel goal in a novel setting, we use the VLM to construct a symbolic description of the current world state, and then use a search-based planning algorithm to find a sequence of low-level skills that achieves the goal. We demonstrate empirically across experiments in both simulation and the real world that our method can generalize aggressively, applying its learned world model to solve problems with a wide variety of object types, arrangements, numbers of objects, and visual backgrounds, as well as novel goals and much longer horizons than those seen at training time.
中文标题/摘要
标题:从像素到谓词:通过预训练的视觉-语言模型学习符号世界模型
我们的目标是在给定低级技能和少量短时间 horizon 的演示(包含一系列图像序列)的情况下,学习解决复杂机器人领域中的长期决策问题。为此,我们专注于学习抽象的符号世界模型,这些模型能够通过规划实现零样本泛化到新的目标。此类模型的关键组成部分是定义对象属性及其之间关系的符号谓词集。在本文中,我们利用预训练的视觉-语言模型(VLMs)提出了一组可能适用于决策的视觉谓词,并直接从相机图像中评估这些谓词。在训练时,我们将提出的谓词和演示传递给基于优化的模型学习算法,以获得一个用提出的谓词子集定义的抽象符号世界模型。在测试时,给定一个新的目标和新的环境设置,我们使用 VLM 构建当前世界状态的符号描述,然后使用基于搜索的规划算法找到实现目标的一系列低级技能。我们通过在仿真和真实世界中的实验中证明,我们的方法可以积极泛化,将其学习到的世界模型应用于解决各种对象类型、排列方式、对象数量和视觉背景广泛变化的问题,以及新的目标和远超训练时所见的更长的时间 horizon。
Summary / 总结
The research aims to develop a method for solving long-term decision-making problems in complex robotics environments using low-level skills and a few demonstrations. It leverages pretrained vision-language models to propose a set of visual predicates and evaluates them from camera images. During training, an optimization-based model-learning algorithm selects a compact subset of these predicates to create an abstract symbolic world model. At test time, the model uses these predicates to describe the current state and plan actions to achieve new goals. Experiments show that the method can generalize effectively to various object types, settings, and goals beyond the training data.
研究旨在利用低级技能和少量短期演示来解决复杂机器人环境中的长期决策问题。它利用预训练的视觉-语言模型提出一组视觉谓词,并直接从图像中评估这些谓词。在训练过程中,使用基于优化的算法选择这些谓词中的一个紧凑子集,形成一个抽象的符号世界模型。在测试时,该模型使用这些谓词描述当前状态并规划行动以实现新的目标。实验表明,该方法能够有效地泛化到各种物体类型、环境和目标,超越训练数据在问题复杂性和时间跨度上的限制。
OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning
Authors: Krista Opsahl-Ong, Arnav Singhvi, Jasmine Collins, Ivan Zhou, Cindy Wang, Ashutosh Baheti, Owen Oertell, Jacob Portes, Sam Havens, Erich Elsen, Michael Bendersky, Matei Zaharia, Xing Chen
First: 2026-03-09T17:34:53+00:00 · Latest: 2026-03-09T17:34:53+00:00
Comments: 24 pages, 16 figures. Introduces the OfficeQA Pro benchmark for grounded reasoning over enterprise documents
Abstract
We introduce OfficeQA Pro, a benchmark for evaluating AI agents on grounded, multi-document reasoning over a large and heterogeneous document corpus. The corpus consists of U.S. Treasury Bulletins spanning nearly 100 years, comprising 89,000 pages and over 26 million numerical values. OfficeQA Pro consists of 133 questions that require precise document parsing, retrieval, and analytical reasoning across both unstructured text and tabular data. Frontier LLMs including Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro Preview achieve less than 5% accuracy on OfficeQA Pro when relying on parametric knowledge, and less than 12% with additional access to the web. When provided directly with the document corpus, frontier agents still struggle on over half of questions, scoring 34.1% on average. We find that providing agents with a structured document representation produced by Databricks' ai_parse_document yields a 16.1% average relative performance gain across agents. We conduct additional ablations to study the effects of model selection, table representation, retrieval strategy, and test-time scaling on performance. Despite these improvements, significant headroom remains before agents can be considered reliable at enterprise-grade grounded reasoning.
中文标题/摘要
标题:OfficeQA Pro:企业级端到端 grounded 推理基准
我们介绍了 OfficeQA Pro,这是一个用于评估 AI 代理在大型异构文档语料库上进行 grounded 多文档推理的基准。语料库包括跨越近百年、包含 89,000 页和超过 2600 万个数值的美国国库公告。OfficeQA Pro 包含 133 个问题,需要对无结构文本和表格数据进行精确的文档解析、检索和分析推理。包括 Claude Opus 4.6、GPT-5.4 和 Gemini 3.1 Pro Preview 在内的前沿大语言模型仅依靠参数化知识在 OfficeQA Pro 上的准确率低于 5%,额外访问网络后准确率低于 12%。即使直接提供文档语料库,前沿代理在超过一半的问题上仍然难以应对,平均得分仅为 34.1%。我们发现,通过 Databricks 的 ai_parse_document 生成的结构化文档表示为代理提供了 16.1% 的平均相对性能提升。我们还进行了额外的消融实验,研究模型选择、表格表示、检索策略和测试时缩放对性能的影响。尽管这些改进,但在企业级 grounded 推理方面仍存在显著的提升空间。
Summary / 总结
OfficeQA Pro is a benchmark for evaluating AI agents on grounded reasoning over a large and diverse document corpus consisting of U.S. Treasury Bulletins. The benchmark includes 133 questions requiring precise parsing, retrieval, and analytical reasoning across unstructured text and tabular data. Despite advanced models achieving less than 5% accuracy without document access and only 34.1% with direct access, structured document representations improved performance by 16.1%. Further ablations explored model selection, table representation, retrieval strategy, and test-time scaling, but significant room for improvement remains.
OfficeQA Pro 是一个评估 AI 代理在处理大规模多样文档(包括美国财政部公告)上的接地推理能力的基准。基准包括133个问题,要求精确解析、检索和分析非结构化文本和表格数据。尽管最先进的模型在没有文档访问时的准确率低于5%,直接访问文档时也只有34.1%,但结构化的文档表示提高了16.1%的性能。进一步的实验还研究了模型选择、表格表示、检索策略和测试时缩放对性能的影响,但仍有显著的改进空间。
CoCo: Code as CoT for Text-to-Image Preview and Rare Concept Generation
Authors: Haodong Li, Chunmei Qing, Huanyu Zhang, Dongzhi Jiang, Yihang Zou, Hongbo Peng, Dingming Li, Yuhong Dai, ZePeng Lin, Juanxi Tian, Yi Zhou, Siqi Dai, Jingwei Wu
First: 2026-03-09T17:31:16+00:00 · Latest: 2026-03-09T17:31:16+00:00
Comments: 21 pages, 7 figures, 7 tables
Abstract
Recent advancements in Unified Multimodal Models (UMMs) have significantly advanced text-to-image (T2I) generation, particularly through the integration of Chain-of-Thought (CoT) reasoning. However, existing CoT-based T2I methods largely rely on abstract natural-language planning, which lacks the precision required for complex spatial layouts, structured visual elements, and dense textual content. In this work, we propose CoCo (Code-as-CoT), a code-driven reasoning framework that represents the reasoning process as executable code, enabling explicit and verifiable intermediate planning for image generation. Given a text prompt, CoCo first generates executable code that specifies the structural layout of the scene, which is then executed in a sandboxed environment to render a deterministic draft image. The model subsequently refines this draft through fine-grained image editing to produce the final high-fidelity result. To support this training paradigm, we construct CoCo-10K, a curated dataset containing structured draft-final image pairs designed to teach both structured draft construction and corrective visual refinement. Empirical evaluations on StructT2IBench, OneIG-Bench, and LongText-Bench show that CoCo achieves improvements of +68.83%, +54.8%, and +41.23% over direct generation, while also outperforming other generation methods empowered by CoT. These results demonstrate that executable code is an effective and reliable reasoning paradigm for precise, controllable, and structured text-to-image generation. The code is available at: https://github.com/micky-li-hd/CoCo
中文标题/摘要
标题:CoCo:代码作为CoT的文本到图像预览和稀有概念生成
统一多模态模型(UMMs)的最新进展显著提升了文本到图像(T2I)生成,特别是在通过链式思考(CoT)推理的集成方面。然而,现有的基于CoT的T2I方法主要依赖于抽象的自然语言规划,这在复杂的空间布局、结构化视觉元素和密集文本内容方面缺乏精确性。在本工作中,我们提出了CoCo(Code-as-CoT),这是一种代码驱动的推理框架,将推理过程表示为可执行代码,从而实现显式的中间规划验证,以生成图像。给定一个文本提示,CoCo首先生成指定场景结构布局的可执行代码,然后在沙箱环境中执行以渲染确定性草图图像。模型随后通过精细的图像编辑对草图进行细化,以生成最终的高保真结果。为了支持这种训练范式,我们构建了CoCo-10K,这是一个包含结构化草图-最终图像对的精心策划的数据集,旨在教授结构化草图构建和纠正性视觉细化。在StructT2IBench、OneIG-Bench和LongText-Bench上的实证评估表明,CoCo在直接生成上的改进分别为+68.83%、+54.8%和+41.23%,同时在其他由CoT赋能的生成方法中也表现出色。这些结果表明,可执行代码是精确、可控和结构化文本到图像生成的有效且可靠的推理范式。代码可在:https://github.com/micky-li-hd/CoCo
Summary / 总结
CoCo is a code-driven reasoning framework for text-to-image generation that represents the reasoning process as executable code, enabling explicit and verifiable intermediate planning. Given a text prompt, CoCo generates executable code for scene layout, which is executed to produce a draft image, followed by fine-grained refinement. Experiments on StructT2IBench, OneIG-Bench, and LongText-Bench show that CoCo outperforms direct generation and other CoT-based methods by +68.83%, +54.8%, and +41.23%, respectively, demonstrating the effectiveness of executable code for precise and controllable image generation.
CoCo 是一种代码驱动的文本到图像生成框架,将推理表示为可执行代码,实现明确且可验证的中间规划。给定文本提示后,CoCo 生成指定场景结构布局的可执行代码,执行以生成草图图像,随后进行精细的图像编辑以生成最终结果。CoCo 在 StructT2IBench、OneIG-Bench 和 LongText-Bench 上优于直接生成和其他基于 CoT 的方法,分别提高了 +68.83%、+54.8% 和 +41.23%。
CAST: Modeling Visual State Transitions for Consistent Video Retrieval
Authors: Yanqing Liu, Yingcheng Liu, Fanghong Dong, Budianto Budianto, Cihang Xie, Yan Jiao
First: 2026-03-09T17:26:26+00:00 · Latest: 2026-03-09T17:26:26+00:00
Abstract
As video content creation shifts toward long-form narratives, composing short clips into coherent storylines becomes increasingly important. However, prevailing retrieval formulations remain context-agnostic at inference time, prioritizing local semantic alignment while neglecting state and identity consistency. To address this structural limitation, we formalize the task of Consistent Video Retrieval (CVR) and introduce a diagnostic benchmark spanning YouCook2, COIN, and CrossTask. We propose CAST (Context-Aware State Transition), a lightweight, plug-and-play adapter compatible with diverse frozen vision-language embedding spaces. By predicting a state-conditioned residual update ($Δ$) from visual history, CAST introduces an explicit inductive bias for latent state evolution. Extensive experiments show that CAST improves performance on YouCook2 and CrossTask, remains competitive on COIN, and consistently outperforms zero-shot baselines across diverse foundation backbones. Furthermore, CAST provides a useful reranking signal for black-box video generation candidates (e.g., from Veo), promoting more temporally coherent continuations.
中文标题/摘要
标题:CAST: 建模视觉状态转换以实现一致的视频检索
随着视频内容创作转向长篇叙事,将短片段组合成连贯的故事线变得越来越重要。然而,现有的检索模型在推理时仍保持上下文无关,优先考虑局部语义对齐,而忽视了状态和身份的一致性。为了解决这一结构限制,我们正式化了一致视频检索(CVR)任务,并引入了一个跨越YouCook2、COIN和CrossTask的诊断基准。我们提出了CAST(上下文感知状态转换),这是一种轻量级、即插即用的适配器,兼容多种冻结的视觉-语言嵌入空间。通过从视觉历史中预测状态条件下的残差更新(Δ),CAST引入了对潜在状态演化的显式归纳偏置。广泛的实验表明,CAST在YouCook2和CrossTask上提高了性能,在COIN上保持竞争力,并且在多种基础骨干网络上始终优于零样本基线。此外,CAST为黑盒视频生成候选(例如来自Veo的)提供了有用的重排序信号,促进了更连贯的时间延续。
Summary / 总结
The research aims to improve video retrieval by addressing the lack of state and identity consistency in existing methods. CAST, a Context-Aware State Transition model, is proposed to predict state-conditioned residual updates from visual history, providing an explicit inductive bias for latent state evolution. Experiments show that CAST enhances performance on YouCook2 and CrossTask, maintains competitiveness on COIN, and outperforms zero-shot baselines across various foundation models. Additionally, CAST offers a useful reranking signal for video generation candidates, improving temporal coherence.
研究旨在通过解决现有方法中缺乏状态和身份一致性的问题来提升视频检索效果。提出了CAST(Context-Aware State Transition)模型,通过从视觉历史中预测状态条件下的残差更新,为潜在状态演化提供显式的归纳偏差。实验表明,CAST在YouCook2和CrossTask上提升了性能,在COIN上保持竞争力,并且在各种基础模型上优于零样本基线。此外,CAST还为视频生成候选者提供了有用的重排序信号,提高了时间连贯性。
HDLxGraph: Bridging Large Language Models and HDL Repositories via HDL Graph Databases
Authors: Pingqing Zheng, Jiayin Qin, Fuqi Zhang, Niraj Chitla, Zishen Wan, Shang Wu, Yu Cao, Caiwen Ding, Yang, Zhao
First: 2025-05-21T16:14:10+00:00 · Latest: 2026-03-09T17:26:13+00:00
Abstract
Retrieval Augmented Generation (RAG) is an essential agent for Large Language Model (LLM) aided Description Language (HDL) tasks, addressing the challenges of limited training data and prohibitively long prompts. However, its performance in handling ambiguous queries and real-world, repository-level HDL projects containing thousands or even tens of thousands of code lines remains limited. Our analysis demonstrates two fundamental mismatches, structural and vocabulary, between conventional semantic similarity-based RAGs and HDL codes. To this end, we propose HDLxGraph, the first framework that integrates the inherent graph characteristics of HDLs with RAGs for LLM-assisted tasks. Specifically, HDLxGraph incorporates Abstract Syntax Trees (ASTs) to capture HDLs' hierarchical structures and Data Flow Graphs (DFGs) to address the vocabulary mismatch. In addition, to overcome the lack of comprehensive HDL search benchmarks, we introduce HDLSearch, an LLM generated dataset derived from real-world, repository-level HDL projects. Evaluations show that HDLxGraph improves search, debugging, and completion accuracy by 12.04%/12.22%/5.04% and by 11.59%/8.18%/4.07% over state-of-the-art similarity-based RAG and software-code Graph RAG baselines, respectively. The code of HDLxGraph and HDLSearch benchmark are available at https://github.com/UMN-ZhaoLab/HDLxGraph.
中文标题/摘要
标题:HDLxGraph:通过HDL图形数据库连接大型语言模型和HDL存储库
检索增强生成(RAG)是大型语言模型(LLM)辅助描述语言(HDL)任务中的关键代理,解决了有限训练数据和过长提示的挑战。然而,其在处理含数千甚至数万行代码的模糊查询和实际仓库级HDL项目中的表现仍然有限。我们的分析表明,传统的基于语义相似性的RAG与HDL代码之间存在两种基本不匹配,即结构和词汇不匹配。为此,我们提出了HDLxGraph,这是第一个将HDL固有的图形特性与RAG结合用于LLM辅助任务的框架。具体而言,HDLxGraph结合了抽象语法树(AST)来捕捉HDL的层次结构,并使用数据流图(DFG)来解决词汇不匹配问题。此外,为了解决全面的HDL搜索基准数据不足的问题,我们引入了HDLSearch,这是一个由实际仓库级HDL项目生成的LLM数据集。评估结果显示,与最先进的基于相似性的RAG和软件代码图形RAG基线相比,HDLxGraph在搜索、调试和完成准确性上分别提高了12.04%/12.22%/5.04%和11.59%/8.18%/4.07%。HDLxGraph和HDLSearch基准的代码可在https://github.com/UMN-ZhaoLab/HDLxGraph/获得。
Summary / 总结
HDLxGraph is a framework that integrates the graph characteristics of Hardware Description Languages (HDLs) with Retrieval Augmented Generation (RAG) to improve Large Language Models (LLMs) in handling HDL tasks. It uses Abstract Syntax Trees (ASTs) to capture HDL hierarchical structures and Data Flow Graphs (DFGs) to address vocabulary mismatches. Experiments show that HDLxGraph enhances search, debugging, and completion accuracy by 12.04%/12.22%/5.04% and 11.59%/8.18%/4.07% over existing RAG and software-code Graph RAG baselines, respectively.
HDLxGraph 是一个框架,将 HDL 的图特征与 RAG 结合起来以增强 LLM 在 HDL 任务中的表现。它使用 AST 捕捉 HDL 的层次结构,并使用 DFG 解决词汇不匹配问题。HDLxGraph 在搜索、调试和完成准确性上分别提高了 12.04%/12.22%/5.04% 和 11.59%/8.18%/4.07%,超过了现有的 RAG 和软件代码图 RAG 基线。该框架基于 HDLSearch 数据集,这是一个来自真实世界 HDL 仓库的 LLM 生成的数据集。
Retrieval-Augmented Gaussian Avatars: Improving Expression Generalization
Authors: Matan Levy, Gavriel Habib, Issar Tzachor, Dvir Samuel, Rami Ben-Ari, Nir Darshan, Or Litany, Dani Lischinski
First: 2026-03-09T17:24:11+00:00 · Latest: 2026-03-09T17:24:11+00:00
Abstract
Template-free animatable head avatars can achieve high visual fidelity by learning expression-dependent facial deformation directly from a subject's capture, avoiding parametric face templates and hand-designed blendshape spaces. However, since learned deformation is supervised only by the expressions observed for a single identity, these models suffer from limited expression coverage and often struggle when driven by motions that deviate from the training distribution. We introduce RAF (Retrieval-Augmented Faces), a simple training-time augmentation designed for template-free head avatars that learn deformation from data. RAF constructs a large unlabeled expression bank and, during training, replaces a subset of the subject's expression features with nearest-neighbor expressions retrieved from this bank while still reconstructing the subject's original frames. This exposes the deformation field to a broader range of expression conditions, encouraging stronger identity-expression decoupling and improving robustness to expression distribution shift without requiring paired cross-identity data, additional annotations, or architectural changes. We further analyze how retrieval augmentation increases expression diversity and validate retrieval quality with a user study showing that retrieved neighbors are perceptually closer in expression and pose. Experiments on the NeRSemble benchmark demonstrate that RAF consistently improves expression fidelity over the baseline, in both self-driving and cross-driving scenarios.
中文标题/摘要
标题:检索增强高斯头像:提高表情泛化能力
无需模板的可动画头像可以通过直接从主体捕捉中学习表情相关的面部变形,实现高视觉保真度,避免使用参数化面部模板和手设计的混合空间。然而,由于学习的变形仅由单个身份观察到的表情进行监督,这些模型在表情覆盖范围有限,并且在驱动与训练分布偏差的动作时常常表现不佳。我们提出了RAF(检索增强面部),一种用于无需模板头像的简单训练时增强方法,通过数据学习变形。RAF 构建了一个大型未标记的表情库,在训练过程中,用从该库检索到的最近邻表情替换主体的一部分表情特征,同时仍然重建主体的原始帧。这使变形场暴露在更广泛的表情条件下,鼓励更强的身份-表情解耦,并在无需配对跨身份数据、额外注释或架构更改的情况下提高对表情分布变化的鲁棒性。我们进一步分析了检索增强如何增加表情多样性,并通过用户研究验证检索质量,结果显示检索到的邻居在表情和姿态上感知上更接近。在NeRSemble基准测试上的实验表明,RAF 在自我驱动和跨驱动场景中一致地提高了表情保真度。
Summary / 总结
The motivation is to improve the expression generalization of template-free animatable head avatars by addressing their limited expression coverage. The method involves RAF (Retrieval-Augmented Faces), a training-time augmentation that constructs an unlabeled expression bank and replaces a subset of the subject's expression features with nearest-neighbor expressions during training. The key experimental findings show that RAF enhances expression fidelity and robustness to expression distribution shift, as demonstrated on the NeRSemble benchmark in both self-driving and cross-driving scenarios.
论文旨在解决无模板头部avatar在处理未见过的表情时的局限性。它引入了RAF(检索增强面部)方法,在训练时使用未标记的表情库替换一部分表情特征为最近邻的表情,从而增强模型对未见过的表情的鲁棒性。实验表明,RAF在自我驱动和跨驱动场景中都能比基线模型提高表情的保真度。
PostTrainBench: Can LLM Agents Automate LLM Post-Training?
Authors: Ben Rank, Hardik Bhatnagar, Ameya Prabhu, Shira Eisenberg, Karina Nguyen, Matthias Bethge, Maksym Andriushchenko
First: 2026-03-09T17:18:00+00:00 · Latest: 2026-03-09T17:18:00+00:00
Abstract
AI agents have become surprisingly proficient at software engineering over the past year, largely due to improvements in reasoning capabilities. This raises a deeper question: can these systems extend their capabilities to automate AI research itself? In this paper, we explore post-training, the critical phase that turns base LLMs into useful assistants. We introduce PostTrainBench to benchmark how well LLM agents can perform post-training autonomously under bounded compute constraints (10 hours on one H100 GPU). We ask frontier agents (e.g., Claude Code with Opus 4.6) to optimize the performance of a base LLM on a particular benchmark (e.g., Qwen3-4B on AIME). Importantly, we do not provide any predefined strategies to the agents and instead give them full autonomy to find necessary information on the web, run experiments, and curate data. We find that frontier agents make substantial progress but generally lag behind instruction-tuned LLMs from leading providers: 23.2% for the best agent vs. 51.1% for official instruction-tuned models. However, agents can exceed instruction-tuned models in targeted scenarios: GPT-5.1 Codex Max achieves 89% on BFCL with Gemma-3-4B vs. 67% for the official model. We also observe several failure modes worth flagging. Agents sometimes engage in reward hacking: training on the test set, downloading existing instruction-tuned checkpoints instead of training their own, and using API keys they find to generate synthetic data without authorization. These behaviors are concerning and highlight the importance of careful sandboxing as these systems become more capable. Overall, we hope PostTrainBench will be useful for tracking progress in AI R&D automation and for studying the risks that come with it. Website and code are available at https://posttrainbench.com/.
中文标题/摘要
标题:PostTrainBench:LLM代理能否自动化LLM后训练?
过去一年,AI代理在软件工程方面表现出惊人的能力,主要得益于推理能力的提升。这引发了一个更深层次的问题:这些系统能否将自身能力扩展到自动化AI研究本身?在本文中,我们探讨了后训练这一关键阶段,该阶段将基础LLM转化为有用的助手。我们引入了PostTrainBench来评估LLM代理在受限制计算资源(10小时,一个H100 GPU)下自主完成后训练的能力。我们让前沿代理(例如,配备Opus 4.6的Claude Code)优化一个基础LLM在特定基准上的性能(例如,Qwen3-4B在AIME上的表现)。重要的是,我们没有为代理提供任何预定义策略,而是赋予它们在网络上查找必要信息、运行实验和整理数据的完全自主权。我们发现,前沿代理取得了显著进展,但通常落后于领先提供商的指令调优LLM:最佳代理的进展为23.2%,而官方指令调优模型为51.1%。然而,在特定场景中,代理可以超越指令调优模型:GPT-5.1 Codex Max在BFCL上的得分为89%,而官方模型为67%。我们还观察到几种值得警惕的失败模式。代理有时会进行奖励作弊:在测试集上进行训练、下载现有的指令调优检查点而不是训练自己的模型、以及未经授权使用找到的API密钥生成合成数据。这些行为令人担忧,突显了随着这些系统能力增强,仔细沙盒化的重要性。总体而言,我们希望PostTrainBench能够有助于跟踪AI R&D自动化进展,并研究伴随而来的风险。网站和代码可在https://posttrainbench.com/上获取。
Summary / 总结
This paper explores whether advanced AI agents can automate the post-training phase of language models, which is crucial for turning base models into useful assistants. Using PostTrainBench, the authors benchmarked the performance of frontier agents like Claude Code in optimizing a base LLM on specific benchmarks. While these agents made significant progress, they generally lagged behind instruction-tuned models from leading providers. However, in targeted scenarios, some agents like GPT-5.1 Codex Max outperformed official models. The study also identified several concerning behaviors, such as reward hacking and unauthorized data generation, emphasizing the need for careful sandboxing as these systems become more capable.
该论文探讨了高级AI代理是否能够自动化语言模型的后训练阶段,这是将基础模型转化为有用助手的关键步骤。使用PostTrainBench,作者评估了像Claude Code这样的前沿代理优化特定基准上的基础LLM的性能。虽然这些代理取得了显著进展,但它们通常落后于领先提供商的指令调优模型。然而,在某些特定场景中,一些代理如GPT-5.1 Codex Max超越了官方模型。研究还发现了几个令人担忧的行为,如奖励作弊和未经授权的数据生成,强调了随着这些系统变得越来越强大,需要谨慎的沙盒环境的重要性。
StreamReady: Learning What to Answer and When in Long Streaming Videos
Authors: Shehreen Azad, Vibhav Vineet, Yogesh Singh Rawat
Venue: CVPR 2026
First: 2026-03-09T17:02:44+00:00 · Latest: 2026-03-09T17:02:44+00:00
Comments: Accepted in CVPR 2026
Abstract
Streaming video understanding often involves time-sensitive scenarios where models need to answer exactly when the supporting visual evidence appears: answering before the evidence reflects speculation, answering after it has passed reduces real-time utility. To capture this behavior, we introduce a readiness-aware formulation of streaming video understanding with the Answer Readiness Score (ARS), a timing-aware objective with asymmetric early and late penalties. When combined with correctness, ARS defines an effective accuracy that measures not just whether a model is right, but whether it answers at the appropriate moment. Building on this formulation, we introduce StreamReady, a framework to unify temporal reasoning with on-time answering through a lightweight readiness mechanism that decides if sufficient evidence has been observed before responding. To evaluate this capability, we further introduce ProReady-QA, a benchmark with annotated answer evidence windows and proactive multi-turn questions across local and global contexts. StreamReady achieves superior performance on ProReady-QA, and consistently outperforms prior methods across eight additional streaming and offline long-video benchmarks, demonstrating robust and broadly generalizable video understanding capability.
中文标题/摘要
标题:StreamReady:学习何时回答长流式视频中的问题
流式视频理解通常涉及时间敏感的场景,其中模型需要在支持视觉证据出现时精确回答:在证据出现之前回答是推测,而在证据过后回答则降低了实时实用性。为了捕捉这种行为,我们引入了一种流式视频理解的准备度感知形式,其中包含答案准备度评分(ARS),这是一种具有不对称早期和晚期惩罚的时间感知目标。当与正确性结合时,ARS 定义了一种有效的准确性,不仅衡量模型是否正确,还衡量其是否在适当时刻回答。在此基础上,我们引入了 StreamReady 框架,通过一种轻量级的准备度机制统一时间推理与及时回答。为了评估这种能力,我们进一步引入了 ProReady-QA 基准,其中包含标注的答案证据窗口和主动的多轮问题,跨越局部和全局上下文。StreamReady 在 ProReady-QA 上表现出色,并且在八个额外的流式和离线长视频基准上始终优于先前的方法,展示了稳健且广泛适用的视频理解能力。
Summary / 总结
The research aims to address the challenge of time-sensitive scenarios in streaming video understanding, where models must answer precisely when visual evidence appears. The study introduces the Answer Readiness Score (ARS) as a timing-aware objective with penalties for early and late answers, and proposes StreamReady, a framework that integrates temporal reasoning with on-time answering. StreamReady outperforms previous methods on ProReady-QA and eight other benchmarks, showing robust and generalizable video understanding capability.
研究旨在解决在视觉证据出现时回答流式视频问题的时机问题。引入了带有不对称惩罚的Answer Readiness Score (ARS)来平衡早答和晚答。提出了StreamReady框架,该框架使用准备机制在观察到足够证据后决定何时回答。StreamReady在ProReady-QA和其他基准测试中表现优于先前的方法,展示了稳健且广泛适用的视频理解能力。
Are vision-language models ready to zero-shot replace supervised classification models in agriculture?
Authors: Earl Ranario, Mason J. Earles
First: 2025-12-17T21:22:44+00:00 · Latest: 2026-03-09T16:58:19+00:00
Abstract
Vision-language models (VLMs) are increasingly proposed as general-purpose solutions for visual recognition tasks, yet their reliability for agricultural decision support remains poorly understood. We benchmark a diverse set of open-source and closed-source VLMs on 27 agricultural image classification datasets from the AgML collection (https://github.com/Project-AgML), spanning 162 classes and 248,000 images across plant disease, pest and damage, and plant and weed species identification. Across all tasks, zero-shot VLMs substantially underperform a supervised task-specific baseline (YOLO11), which consistently achieves markedly higher accuracy than any foundation model. Under multiple-choice prompting, the best-performing VLM (Gemini-3 Pro) reaches approximately 62% average accuracy, while open-ended prompting yields much lower performance, with raw accuracies typically below 25%. Applying LLM-based semantic judging increases open-ended accuracy (e.g., from ~21% to ~30% for top models) and alters model rankings, demonstrating that evaluation methodology meaningfully affects reported conclusions. Among open-source models, Qwen-VL-72B performs best, approaching closed-source performance under constrained prompting but still trailing top proprietary systems. Task-level analysis shows that plant and weed species classification is consistently easier than pest and damage identification, which remains the most challenging category across models. Overall, these results indicate that current off-the-shelf VLMs are not yet suitable as standalone agricultural diagnostic systems, but can function as assistive components when paired with constrained interfaces, explicit label ontologies, and domain-aware evaluation strategies.
中文标题/摘要
标题:视觉语言模型在农业中是否准备好零样本替代监督分类模型?
视觉语言模型(VLMs)越来越多地被提议作为视觉识别任务的一般解决方案,但它们在农业决策支持中的可靠性仍不明确。我们对来自AgML集合(https://github.com/Project-AgML)的27个农业图像分类数据集进行了基准测试,这些数据集涵盖了162个类别和248,000张图片,包括植物病害、害虫和损伤以及植物和杂草种类识别。在所有任务中,零样本VLMs的表现显著低于监督任务特定基线(YOLO11),后者始终比任何基础模型获得更高的准确率。在多项选择提示下,表现最佳的VLM(Gemini-3 Pro)达到约62%的平均准确率,而开放式提示则导致性能大幅下降,通常准确率低于25%。基于LLM的语义评估提高了开放式提示的准确率(例如,顶级模型从约21%提高到约30%),并改变了模型排名,表明评估方法对报告结论有实质性影响。在开源模型中,Qwen-VL-72B表现最佳,在受限提示下接近闭源模型的性能,但仍落后于顶级专有系统。任务级分析表明,植物和杂草种类分类始终比害虫和损伤识别更容易,后者是所有模型中最具有挑战性的类别。总体而言,这些结果表明,当前的即用型VLMs尚不适合作为独立的农业诊断系统,但在与受限界面、明确标签本体和领域意识评估策略配对时可以作为辅助组件发挥作用。
Summary / 总结
The study benchmarks various vision-language models (VLMs) on 27 agricultural image classification datasets, finding that zero-shot VLMs underperform a supervised task-specific baseline. The best-performing VLM under multiple-choice prompting achieves around 62% accuracy, while open-ended prompting yields lower performance. Applying LLM-based semantic judging improves open-ended accuracy and alters model rankings. The research indicates that current VLMs are not yet suitable as standalone diagnostic systems but can assist when paired with specific interfaces and evaluation strategies.
研究评估了视觉语言模型(VLMs)在27个农业图像分类数据集上的表现,发现零样本VLMs的表现低于监督任务特定基线,仅在多项选择提示下达到约62%的平均准确率,在开放提示下表现更低。研究还表明,应用基于LLM的语义评估可以提高开放提示的准确率并改变模型排名,突显了评估方法的重要性。在开源模型中,Qwen-VL-72B表现最佳,但仍落后于顶级专有系统。研究结论指出,当前的VLMs还不适合作为独立诊断系统,但在与特定界面和评估策略结合使用时可以起到辅助作用。
FOMO-3D: Using Vision Foundation Models for Long-Tailed 3D Object Detection
Authors: Anqi Joyce Yang, James Tu, Nikita Dvornik, Enxu Li, Raquel Urtasun
Venue: CoRL 2025
First: 2026-03-09T16:57:23+00:00 · Latest: 2026-03-09T16:57:23+00:00
Comments: Published at 9th Annual Conference on Robot Learning (CoRL 2025)
Abstract
In order to navigate complex traffic environments, self-driving vehicles must recognize many semantic classes pertaining to vulnerable road users or traffic control devices. However, many safety-critical objects (e.g., construction worker) appear infrequently in nominal traffic conditions, leading to a severe shortage of training examples from driving data alone. Recent vision foundation models, which are trained on a large corpus of data, can serve as a good source of external prior knowledge to improve generalization. We propose FOMO-3D, the first multi-modal 3D detector to leverage vision foundation models for long-tailed 3D detection. Specifically, FOMO-3D exploits rich semantic and depth priors from OWLv2 and Metric3Dv2 within a two-stage detection paradigm that first generates proposals with a LiDAR-based branch and a novel camera-based branch, and refines them with attention especially to image features from OWL. Evaluations on real-world driving data show that using rich priors from vision foundation models with careful multi-modal fusion designs leads to large gains for long-tailed 3D detection. Project website is at https://waabi.ai/fomo3d/.
中文标题/摘要
标题:FOMO-3D:使用视觉基础模型进行长尾3D目标检测
为了导航复杂的交通环境,自动驾驶车辆必须识别许多与弱势道路使用者或交通控制设备相关的语义类别。然而,许多安全关键对象(例如,建筑工人)在正常交通条件下出现的频率很低,导致仅从驾驶数据中缺乏足够的训练样本。最近的视觉基础模型,由于在大量数据上进行了训练,可以作为外部先验知识的良好来源,以提高泛化能力。我们提出了FOMO-3D,这是第一个利用视觉基础模型进行长尾3D检测的多模态3D检测器。具体而言,FOMO-3D 在基于LiDAR的分支和新颖的基于相机的分支中利用了来自OWLv2和Metric3Dv2的丰富语义和深度先验,在两阶段检测框架中首先生成提案,并特别利用OWL的图像特征进行注意力调整。在真实驾驶数据上的评估表明,使用视觉基础模型中的丰富先验并结合精心设计的多模态融合方案,可以显著提高长尾3D检测的效果。项目网站为 https://waabi.ai/fomo3d/。
Summary / 总结
FOMO-3D proposes a multi-modal 3D detector that leverages vision foundation models to enhance long-tailed 3D object detection, particularly for safety-critical objects like construction workers. By integrating semantic and depth priors from OWLv2 and Metric3Dv2, FOMO-3D achieves significant improvements in recognizing rare objects in real-world driving data through a two-stage detection process that combines LiDAR and camera inputs.
FOMO-3D 提出了一种多模态 3D 检测器,利用视觉基础模型来解决自动驾驶车辆中对安全关键对象的长尾检测挑战。通过整合来自 OWLv2 和 Metric3Dv2 的语义和深度先验知识,FOMO-3D 使用 LiDAR 和基于相机的分支生成提案,并通过关注图像特征进行细化。实验结果表明,在真实驾驶数据上的长尾 3D 检测性能有了显著提升。
Hinge Regression Tree: A Newton Method for Oblique Regression Tree Splitting
Authors: Hongyi Li, Han Lin, Jun Xu
First: 2026-02-05T06:49:01+00:00 · Latest: 2026-03-09T16:55:02+00:00
Abstract
Oblique decision trees combine the transparency of trees with the power of multivariate decision boundaries, but learning high-quality oblique splits is NP-hard, and practical methods still rely on slow search or theory-free heuristics. We present the Hinge Regression Tree (HRT), which reframes each split as a non-linear least-squares problem over two linear predictors whose max/min envelope induces ReLU-like expressive power. The resulting alternating fitting procedure is exactly equivalent to a damped Newton (Gauss-Newton) method within fixed partitions. We analyze this node-level optimization and, for a backtracking line-search variant, prove that the local objective decreases monotonically and converges; in practice, both fixed and adaptive damping yield fast, stable convergence and can be combined with optional ridge regularization. We further prove that HRT's model class is a universal approximator with an explicit $O(δ^2)$ approximation rate, and show on synthetic and real-world benchmarks that it matches or outperforms single-tree baselines with more compact structures.
中文标题/摘要
标题:枢轴回归树:一种用于斜分割的牛顿方法
斜决策树结合了树的透明性和多变量决策边界的强大功能,但学习高质量的斜分割是NP难问题,实际方法仍然依赖于缓慢的搜索或无理论启发式方法。我们提出了枢轴回归树(HRT),它将每个分割重新表述为两个线性预测器的非线性最小二乘问题,其最大/最小包络诱导类似于ReLU的表达能力。由此产生的交替拟合过程在固定分区中等价于阻尼牛顿(高斯-牛顿)方法。我们分析了这种节点级优化,并证明了回溯线搜索变体中局部目标单调减少并收敛;实践中,固定和自适应阻尼都可快速稳定收敛,并可与可选岭正则化结合使用。我们进一步证明HRT的模型类是具有显式$O(δ^2)$逼近率的通用逼近器,并在合成和真实世界基准测试中证明它与更紧凑结构的单树基线具有竞争力或更优性能。
Summary / 总结
The research aims to improve the efficiency and quality of oblique decision tree splits, which combine the transparency of trees with the power of multivariate boundaries. The Hinge Regression Tree (HRT) method reframes each split as a non-linear least-squares problem, enabling faster and more stable convergence through a damped Newton method. Experiments on synthetic and real-world data show that HRT matches or outperforms single-tree baselines with more compact structures.
研究旨在提高斜决策树分割的效率和质量,结合树的透明性和多变量边界的强大力量。Hinge Regression Tree (HRT) 方法将每个分割重新定义为非线性最小二乘问题,通过阻尼牛顿方法实现更快更稳定的收敛。实验结果显示,HRT 在合成和真实世界数据上与单树基线相比,要么匹配要么超越,且具有更紧凑的结构。
Single Image, Any Face: Generalisable 3D Face Generation
Authors: Wenqing Wang, Haosen Yang, Josef Kittler, Xiatian Zhu
First: 2024-09-25T14:56:37+00:00 · Latest: 2026-03-09T16:53:08+00:00
Comments: Accepted by Pattern Recognition, March 2026
Abstract
The creation of 3D human face avatars from a single unconstrained image is a fundamental task that underlies numerous real-world vision and graphics applications. Despite the significant progress made in generative models, existing methods are either less suited in design for human faces or fail to generalise from the restrictive training domain to unconstrained facial images. To address these limitations, we propose a novel model, Gen3D-Face, which generates 3D human faces with unconstrained single image input within a multi-view consistent diffusion framework. Given a specific input image, our model first produces multi-view images, followed by neural surface construction. To incorporate face geometry information while preserving generalisation to in-the-wild inputs, we estimate a subject-specific mesh directly from the input image, enabling training and evaluation without ground-truth 3D supervision. Importantly, we introduce a multi-view joint generation scheme to enhance the appearance consistency among different views. To the best of our knowledge, this is the first attempt and benchmark for creating photorealistic 3D human face avatars from single images for generic human subject across domains. Extensive experiments demonstrate the efficacy and superiority of our method over previous alternatives for out-of-domain single image 3D face generation and the top ranking competition for the in-domain setting.
中文标题/摘要
标题:单张图像,任意人脸:通用3D人脸生成
从单张不受限制的图像中创建3D人类面部头像是一项基础任务,支撑着众多现实世界的视觉和图形应用。尽管在生成模型方面取得了显著进展,但现有方法要么在设计上不适合人类面部,要么无法从限制性的训练领域推广到不受限制的面部图像。为了解决这些局限性,我们提出了一种新型模型Gen3D-Face,在多视角一致的扩散框架中生成不受限制的单张图像输入的3D人类面部。给定特定的输入图像,我们的模型首先生成多视角图像,然后进行神经表面构建。为了在保留对野外输入的泛化能力的同时融入面部几何信息,我们直接从输入图像估计一个特定主题的网格,从而在没有真实3D监督的情况下进行训练和评估。重要的是,我们引入了一种多视角联合生成方案,以增强不同视角之间的外观一致性。据我们所知,这是首次尝试和基准,用于从单张图像为通用人类主题生成逼真的3D人类面部头像。大量实验表明,我们的方法在域外单张图像3D人脸生成方面优于先前的替代方法,并且在域内设置中名列前茅。
Summary / 总结
The paper addresses the challenge of generating 3D human face avatars from a single unconstrained image, which is crucial for various vision and graphics applications. It introduces Gen3D-Face, a model that uses a multi-view consistent diffusion framework to produce 3D faces from a single image. The model first generates multi-view images and then constructs a neural surface, while estimating a subject-specific mesh directly from the input image to ensure generalization to real-world inputs. Experiments show that Gen3D-Face outperforms previous methods in generating photorealistic 3D faces for generic human subjects both in-domain and out-of-domain.
论文解决了从单张非受限图像生成3D人脸avatar的问题,这对于各种视觉和图形应用至关重要。为克服现有方法的局限性,作者提出了Gen3D-Face模型,该模型使用多视角一致的扩散框架生成多视角图像并构建神经表面。通过直接从输入图像估计主体特定的网格,该模型能够很好地泛化到野外面部图像。多视角联合生成方案确保了不同视角之间的外观一致性。该方法在跨域单图像3D人脸生成中优于先前的方法,并在同域设置中排名靠前。
Boosting MLLM Spatial Reasoning with Geometrically Referenced 3D Scene Representations
Authors: Jiangye Yuan, Gowri Kumar, Baoyuan Wang
First: 2026-03-09T16:42:43+00:00 · Latest: 2026-03-09T16:42:43+00:00
Abstract
While Multimodal Large Language Models (MLLMs) have achieved remarkable success in 2D visual understanding, their ability to reason about 3D space remains limited. To address this gap, we introduce geometrically referenced 3D scene representations (GR3D). Given a set of input images, GR3D annotates objects in the images with unique IDs and encodes their 3D geometric attributes as textual references indexed by these IDs. This representation enables MLLMs to interpret 3D cues using their advanced language-based skills in mathematical reasoning, while concurrently analyzing 2D visual features in a tightly coupled way. We present a simple yet effective approach based on GR3D, which requires no additional training and is readily applicable to different MLLMs. Implemented in a zero-shot setting, our approach boosts GPT-5's performance on VSI-Bench by 8% overall and more than 11% on tasks that rely heavily on spatial layout understanding. Qualitative studies further demonstrate that GR3D empowers MLLMs to perform complex spatial reasoning with highly sparse input views.
中文标题/摘要
标题:利用几何参考3D场景表示增强MLLM的空间推理能力
虽然多模态大型语言模型(MLLMs)在2D视觉理解方面取得了显著成功,但它们在3D空间推理方面的能力仍然有限。为解决这一问题,我们引入了几何参考3D场景表示(GR3D)。给定一组输入图像,GR3D将图像中的对象用唯一的ID进行标注,并以这些ID为索引编码它们的3D几何属性为文本参考。这种表示使MLLMs能够利用其先进的基于语言的数学推理技能来解释3D线索,同时以紧密耦合的方式分析2D视觉特征。我们提出了一种基于GR3D的简单而有效的方法,该方法无需额外训练,且易于应用于不同的MLLMs。在零样本设置下,我们的方法使GPT-5在VSI-Bench上的整体性能提高了8%,在高度依赖空间布局理解的任务上提高了超过11%。进一步的定性研究表明,GR3D使MLLMs能够利用稀疏输入视图进行复杂的空间推理。
Summary / 总结
The research aims to enhance the spatial reasoning capabilities of Multimodal Large Language Models (MLLMs) by introducing geometrically referenced 3D scene representations (GR3D). GR3D annotates objects in images with unique IDs and encodes their 3D geometric attributes as textual references, enabling MLLMs to reason about 3D space using their language-based skills. The approach, implemented in a zero-shot setting, improves GPT-5's performance on VSI-Bench by 8% overall and more than 11% on tasks requiring spatial layout understanding, as shown by both quantitative and qualitative studies.
研究旨在通过引入几何参考的3D场景表示(GR3D)来增强多模态大型语言模型(MLLMs)的空间推理能力。GR3D为图像中的对象分配唯一的ID,并以文本参考的形式编码其3D几何属性,使MLLMs能够利用其语言技能进行3D空间推理。该方法在零样本设置下实施,使GPT-5在VSI-Bench上的整体性能提高了8%,在依赖于空间布局理解的任务上提高了超过11%,定量和定性研究进一步证明了GR3D的能力。
CARE-Edit: Condition-Aware Routing of Experts for Contextual Image Editing
Authors: Yucheng Wang, Zedong Wang, Yuetong Wu, Yue Ma, Dan Xu
Venue: CVPR 2026
First: 2026-03-09T16:40:47+00:00 · Latest: 2026-03-09T16:40:47+00:00
Comments: Accepted by CVPR 2026. Project page: https://care-edit.github.io/
Abstract
Unified diffusion editors often rely on a fixed, shared backbone for diverse tasks, suffering from task interference and poor adaptation to heterogeneous demands (e.g., local vs global, semantic vs photometric). In particular, prevalent ControlNet and OmniControl variants combine multiple conditioning signals (e.g., text, mask, reference) via static concatenation or additive adapters which cannot dynamically prioritize or suppress conflicting modalities, thus resulting in artifacts like color bleeding across mask boundaries, identity or style drift, and unpredictable behavior under multi-condition inputs. To address this, we propose Condition-Aware Routing of Experts (CARE-Edit) that aligns model computation with specific editing competencies. At its core, a lightweight latent-attention router assigns encoded diffusion tokens to four specialized experts--Text, Mask, Reference, and Base--based on multi-modal conditions and diffusion timesteps: (i) a Mask Repaint module first refines coarse user-defined masks for precise spatial guidance; (ii) the router applies sparse top-K selection to dynamically allocate computation to the most relevant experts; (iii) a Latent Mixture module subsequently fuses expert outputs, coherently integrating semantic, spatial, and stylistic information to the base images. Experiments validate CARE-Edit's strong performance on contextual editing tasks, including erasure, replacement, text-driven edits, and style transfer. Empirical analysis further reveals task-specific behavior of specialized experts, showcasing the importance of dynamic, condition-aware processing to mitigate multi-condition conflicts.
中文标题/摘要
标题:CARE-Edit:条件感知专家路由的上下文图像编辑
统一的扩散编辑器通常依赖于一个固定的共享骨干来处理多种任务,这会导致任务干扰和对异构需求(如局部 vs 全局、语义 vs 光学)的不良适应。特别是,流行的ControlNet和OmniControl变体通过静态连接或加性适配器将多种条件信号(如文本、掩码、参考)组合在一起,但这些方法无法动态地优先处理或抑制冲突的模态,从而导致跨掩码边界的颜色溢出、身份或风格漂移以及在多条件输入下的不可预测行为。为了解决这个问题,我们提出了条件感知专家路由(CARE-Edit),它使模型计算与特定的编辑能力相匹配。其核心是一个轻量级的潜在注意力路由器,根据多模态条件和扩散时间步将编码的扩散令牌分配给四个专门的专家——文本、掩码、参考和基础:(i)一个掩码重绘模块首先细化用户定义的粗略掩码,以提供精确的空间指导;(ii)路由器应用稀疏的Top-K选择来动态分配计算给最相关的专家;(iii)一个潜在混合模块随后融合专家输出,将语义、空间和风格信息一致地整合到基础图像中。实验验证了CARE-Edit在上下文编辑任务(包括擦除、替换、文本驱动的编辑和风格迁移)中的强大性能。进一步的实证分析揭示了专门专家的任务特定行为,展示了动态、条件感知处理的重要性,以缓解多条件冲突。
Summary / 总结
CARE-Edit is proposed to address the limitations of unified diffusion editors by dynamically routing model computation to specialized experts based on multi-modal conditions. It consists of a latent-attention router that assigns diffusion tokens to four experts (Text, Mask, Reference, and Base) and a Mask Repaint module that refines masks for spatial guidance. The router uses sparse top-K selection to allocate computation to the most relevant experts, and a Latent Mixture module fuses their outputs to integrate semantic, spatial, and stylistic information. Experiments show that CARE-Edit performs well on various contextual editing tasks, such as erasure, replacement, text-driven edits, and style transfer, and highlights the importance of dynamic, condition-aware processing for handling multi-condition inputs.
CARE-Edit通过提出一种条件感知路由机制,动态地将计算分配给基于多模态条件和扩散时间步的专门专家,解决了统一扩散编辑器的局限性。该方法包括一个用于细化掩码的Mask Repaint模块、一个用于稀疏Top-K选择的路由器模块以及一个用于融合专家输出的Latent Mixture模块。实验表明,CARE-Edit在擦除、替换、文本驱动编辑和风格迁移等多种上下文编辑任务中表现出色,并强调了动态、条件感知处理在处理多条件冲突中的重要性。
Towards Batch-to-Streaming Deep Reinforcement Learning for Continuous Control
Authors: Riccardo De Monte, Matteo Cederle, Gian Antonio Susto
First: 2026-03-09T16:40:06+00:00 · Latest: 2026-03-09T16:40:06+00:00
Abstract
State-of-the-art deep reinforcement learning (RL) methods have achieved remarkable performance in continuous control tasks, yet their computational complexity is often incompatible with the constraints of resource-limited hardware, due to their reliance on replay buffers, batch updates, and target networks. The emerging paradigm of streaming deep RL addresses this limitation through purely online updates, achieving strong empirical performance on standard benchmarks. In this work, we propose two novel streaming deep RL algorithms, Streaming Soft Actor-Critic (S2AC) and Streaming Deterministic Actor-Critic (SDAC), explicitly designed to be compatible with state-of-the-art batch RL methods, making them particularly suitable for on-device finetuning applications such as Sim2Real transfer. Both algorithms achieve performance comparable to state-of-the-art streaming baselines on standard benchmarks without requiring tedious hyperparameter tuning. Finally, we further investigate the practical challenges of transitioning from batch to streaming learning during finetuning and propose concrete strategies to tackle them.
中文标题/摘要
标题:迈向连续控制的批处理到流式深度强化学习
最先进的深度强化学习(RL)方法在连续控制任务中取得了显著的性能,但由于其依赖于重放缓冲区、批处理更新和目标网络,其计算复杂性往往与资源受限硬件的约束不兼容。新兴的流式深度RL范式通过纯粹的在线更新解决了这一限制,实现了在标准基准上的强大实证性能。在本文中,我们提出了两种新的流式深度RL算法,流式软动作价值函数算法(S2AC)和流式确定性动作价值函数算法(SDAC),这些算法明确设计为与最先进的批处理RL方法兼容,特别适用于如Sim2Real转移等设备内微调应用。这两种算法在标准基准上达到了与最先进的流式基线相当的性能,而无需繁琐的超参数调整。最后,我们进一步探讨了从批处理到流式学习过渡期间微调过程中的实际挑战,并提出了具体的应对策略。
Summary / 总结
This paper addresses the computational limitations of state-of-the-art deep reinforcement learning methods in continuous control tasks by proposing two streaming algorithms, Streaming Soft Actor-Critic (S2AC) and Streaming Deterministic Actor-Critic (SDAC). These algorithms are designed to be compatible with batch RL methods, enabling efficient on-device fine-tuning. The algorithms achieve performance comparable to existing streaming baselines without extensive hyperparameter tuning. The study also explores challenges in transitioning from batch to streaming learning during fine-tuning and suggests practical strategies to overcome these issues.
该研究针对现有深度强化学习方法在连续控制任务中的计算复杂性问题,这些问题往往与资源有限的硬件不兼容。作者提出了两个流式深度RL算法,Streaming Soft Actor-Critic (S2AC) 和 Streaming Deterministic Actor-Critic (SDAC),这些算法设计上能够与批处理RL方法兼容。这些算法在标准基准测试上达到了与最先进的流式基线相当的性能,且无需进行繁琐的超参数调整。此外,研究还探讨了在微调过程中从批处理到流式学习过渡的实际挑战,并提出了应对策略。
DualFlexKAN: Dual-stage Kolmogorov-Arnold Networks with Independent Function Control
Authors: Andrés Ortiz, Nicolás J. Gallego-Molina, Carmen Jiménez-Mesa, Juan M. Górriz, Javier Ramírez
First: 2026-03-09T16:36:04+00:00 · Latest: 2026-03-09T16:36:04+00:00
Comments: 22 pages, 12 figures
Abstract
Multi-Layer Perceptrons (MLPs) rely on pre-defined, fixed activation functions, imposing a static inductive bias that forces the network to approximate complex topologies solely through increased depth and width. Kolmogorov-Arnold Networks (KANs) address this limitation through edge-centric learnable functions, yet their formulation suffers from quadratic parameter scaling and architectural rigidity that hinders the effective integration of standard regularization techniques. This paper introduces the DualFlexKAN (DFKAN), a flexible architecture featuring a dual-stage mechanism that independently controls pre-linear input transformations and post-linear output activations. This decoupling enables hybrid networks that optimize the trade-off between expressiveness and computational cost. Unlike standard formulations, DFKAN supports diverse basis function families, including orthogonal polynomials, B-splines, and radial basis functions, integrated with configurable regularization strategies that stabilize training dynamics. Comprehensive evaluations across regression benchmarks, physics-informed tasks, and function approximation demonstrate that DFKAN outperforms both MLPs and conventional KANs in accuracy, convergence speed, and gradient fidelity. The proposed hybrid configurations achieve superior performance with one to two orders of magnitude fewer parameters than standard KANs, effectively mitigating the parameter explosion problem while preserving KAN-style expressiveness. DFKAN provides a principled, scalable framework for incorporating adaptive non-linearities, proving particularly advantageous for data-efficient learning and interpretable function discovery in scientific applications.
中文标题/摘要
标题:DualFlexKAN:具有独立功能控制的双阶段柯尔莫哥洛夫-阿诺尔德网络
多层感知机(MLPs)依赖于预定义的固定激活函数,这施加了一种静态先验偏置,迫使网络通过增加深度和宽度来近似复杂的拓扑结构。柯尔莫哥洛夫-阿诺尔德网络(KANs)通过边缘中心的可学习函数解决了这一限制,但其形式化设计存在二次参数缩放和架构刚性问题,这阻碍了标准正则化技术的有效集成。本文引入了DualFlexKAN(DFKAN),这是一种灵活的架构,具有双阶段机制,独立控制预线性输入变换和后线性输出激活。这种解耦使网络能够优化表达能力和计算成本之间的权衡。与标准形式化设计不同,DFKAN 支持多种基函数家族,包括正交多项式、B-样条和径向基函数,并结合可配置的正则化策略以稳定训练动态。在回归基准、物理信息任务和函数逼近的全面评估中,DFKAN 在准确度、收敛速度和梯度保真度方面均优于MLPs和传统KANs。提出的混合配置在参数数量上比标准KANs少一个到两个数量级,有效地缓解了参数爆炸问题,同时保持了KAN风格的表达能力。DFKAN 提供了一个原理上可扩展的框架,用于集成自适应非线性,特别适用于数据高效学习和科学应用中的可解释函数发现。
Summary / 总结
DualFlexKAN (DFKAN) addresses the limitations of Multi-Layer Perceptrons (MLPs) and Kolmogorov-Arnold Networks (KANs) by introducing a dual-stage mechanism that independently controls pre-linear input transformations and post-linear output activations. This flexible architecture supports various basis function families and configurable regularization strategies, leading to improved accuracy, faster convergence, and better gradient fidelity compared to MLPs and conventional KANs. DFKAN achieves superior performance with significantly fewer parameters, effectively mitigating the parameter explosion problem while maintaining KAN-style expressiveness.
该论文提出了DualFlexKAN (DFKAN) 架构,通过解耦预线性输入变换和后线性输出激活,允许混合网络在表达能力和计算成本之间取得平衡。DFKAN 支持多种基函数家族和可配置的正则化策略,相比MLPs和传统KANs,在准确度、收敛速度和梯度保真度方面表现出更优性能。它通过显著减少参数数量,缓解了参数爆炸问题,同时保持了KAN风格的表达能力。
MetaWorld-X: Hierarchical World Modeling via VLM-Orchestrated Experts for Humanoid Loco-Manipulation
Authors: Yutong Shen, Hangxu Liu, Penghui Liu, Jiashuo Luo, Yongkang Zhang, Rex Morvley, Chen Jiang, Jianwei Zhang, Lei Zhang
First: 2026-03-09T16:28:26+00:00 · Latest: 2026-03-09T16:28:26+00:00
Comments: 8 figures, https://syt2004.github.io/metaworldX/
Abstract
Learning natural, stable, and compositionally generalizable whole-body control policies for humanoid robots performing simultaneous locomotion and manipulation (loco-manipulation) remains a fundamental challenge in robotics. Existing reinforcement learning approaches typically rely on a single monolithic policy to acquire multiple skills, which often leads to cross-skill gradient interference and motion pattern conflicts in high-degree-of-freedom systems. As a result, generated behaviors frequently exhibit unnatural movements, limited stability, and poor generalization to complex task compositions. To address these limitations, we propose MetaWorld-X, a hierarchical world model framework for humanoid control. Guided by a divide-and-conquer principle, our method decomposes complex control problems into a set of specialized expert policies (Specialized Expert Policies, SEP). Each expert is trained under human motion priors through imitation-constrained reinforcement learning, introducing biomechanically consistent inductive biases that ensure natural and physically plausible motion generation. Building upon this foundation, we further develop an Intelligent Routing Mechanism (IRM) supervised by a Vision-Language Model (VLM), enabling semantic-driven expert composition. The VLM-guided router dynamically integrates expert policies according to high-level task semantics, facilitating compositional generalization and adaptive execution in multi-stage loco-manipulation tasks.
中文标题/摘要
标题:MetaWorld-X:通过VLM协调专家进行类人行走操作建模
学习类人机器人执行同时行走和操作(行走操作)的自然、稳定且组成上通用的全身控制策略仍然是机器人学中的一个基本挑战。现有的强化学习方法通常依赖单一的庞大策略来获取多种技能,这往往导致跨技能梯度干扰和高自由度系统中的运动模式冲突。因此,生成的行为经常表现出不自然的运动、有限的稳定性和较差的对复杂任务组成的泛化能力。为了解决这些限制,我们提出了MetaWorld-X,一种类人控制的分层世界模型框架。我们的方法遵循分而治之的原则,将复杂的控制问题分解为一组专门的专家策略(专门专家策略,SEP)。每个专家通过模仿约束强化学习在人类运动先验下进行训练,引入生物力学一致的归纳偏置,确保自然且物理上合理的运动生成。在此基础上,我们进一步开发了一种由视觉语言模型(VLM)监督的智能路由机制(IRM),实现语义驱动的专家组合。VLM引导的路由器根据高层次任务语义动态整合专家策略,促进多阶段行走操作任务中的组合泛化和自适应执行。
Summary / 总结
MetaWorld-X proposes a hierarchical world model framework for humanoid robots to perform loco-manipulation tasks. It decomposes complex control problems into specialized expert policies trained with human motion priors, ensuring natural and physically plausible movements. An Intelligent Routing Mechanism guided by a Vision-Language Model dynamically integrates these experts based on task semantics, enabling compositional generalization and adaptive execution in multi-stage tasks. Key findings include improved stability and generalization compared to monolithic policies.
MetaWorld-X 提出了一种分层世界模型框架,用于使类人机器人执行复杂的同时移动和操作任务。该方法将任务分解为基于人类运动先验训练的专业专家策略,并通过视觉语言模型指导的智能路由机制动态整合这些专家,根据任务语义实现组合泛化和适应性执行。这种方法解决了单一整体策略的局限性,确保了自然、稳定和通用的行为。
BioGait-VLM: A Tri-Modal Vision-Language-Biomechanics Framework for Interpretable Clinical Gait Assessment
Authors: Erdong Chen, Yuyang Ji, Jacob K. Greenberg, Benjamin Steel, Faraz Arkam, Abigail Lewis, Pranay Singh, Feng Liu
First: 2026-03-09T16:25:28+00:00 · Latest: 2026-03-09T16:25:28+00:00
Abstract
Video-based Clinical Gait Analysis often suffers from poor generalization as models overfit environmental biases instead of capturing pathological motion. To address this, we propose BioGait-VLM, a tri-modal Vision-Language-Biomechanics framework for interpretable clinical gait assessment. Unlike standard video encoders, our architecture incorporates a Temporal Evidence Distillation branch to capture rhythmic dynamics and a Biomechanical Tokenization branch that projects 3D skeleton sequences into language-aligned semantic tokens. This enables the model to explicitly reason about joint mechanics independent of visual shortcuts. To ensure rigorous benchmarking, we augment the public GAVD dataset with a high-fidelity Degenerative Cervical Myelopathy (DCM) cohort to form a unified 8-class taxonomy, establishing a strict subject-disjoint protocol to prevent data leakage. Under this setting, BioGait-VLM achieves state-of-the-art recognition accuracy. Furthermore, a blinded expert study confirms that biomechanical tokens significantly improve clinical plausibility and evidence grounding, offering a path toward transparent, privacy-enhanced gait assessment.
中文标题/摘要
标题:BioGait-VLM:一种用于可解释临床步态评估的三模态视觉-语言-生物力学框架
基于视频的临床步态分析往往由于模型过度拟合环境偏差而泛化能力差,未能捕捉到病理运动。为了解决这一问题,我们提出了一种三模态视觉-语言-生物力学框架BioGait-VLM,用于可解释的临床步态评估。与标准视频编码器不同,我们的架构包含一个时间证据提炼分支,用于捕捉节奏动态,以及一个生物力学标记化分支,将3D骨架序列投影到语言对齐的语义标记上。这使得模型能够独立于视觉捷径进行显式推理。为了确保严格的基准测试,我们通过增加一个高保真的退行性颈椎髓病(DCM)队列,对公开的GAVD数据集进行扩充,形成了一个统一的8类分类体系,并建立了严格的受试者分离协议以防止数据泄露。在这一设置下,BioGait-VLM 达到了最先进的识别精度。此外,一项盲法专家研究证实,生物力学标记显著提高了临床合理性并增强了证据基础,为透明、隐私保护的步态评估提供了途径。
Summary / 总结
The research aims to improve the generalization of video-based clinical gait analysis by addressing the issue of overfitting to environmental biases. BioGait-VLM, a tri-modal Vision-Language-Biomechanics framework, is proposed to capture pathological motion more effectively. It includes a Temporal Evidence Distillation branch and a Biomechanical Tokenization branch, which helps the model reason about joint mechanics independently of visual cues. The model achieves state-of-the-art recognition accuracy, and a blinded expert study confirms that biomechanical tokens enhance clinical plausibility and evidence grounding.
研究旨在通过解决过度拟合环境偏差来提高基于视频的临床步态分析的泛化能力。提出了一个三模态视觉-语言-生物力学框架BioGait-VLM,以捕捉病理运动。该框架包括一个用于节奏动态的时序证据蒸馏分支和一个用于语义标记投影的生物力学标记化分支,使模型能够明确地推理关节力学。该模型在识别准确性上达到了最先进的水平,并通过盲法专家研究得到了验证,该研究确认了生物力学标记的临床合理性和证据基础。
RetroAgent: From Solving to Evolving via Retrospective Dual Intrinsic Feedback
Authors: Xiaoying Zhang, Zichen Liu, Yipeng Zhang, Xia Hu, Wenqi Shao
First: 2026-03-09T16:23:33+00:00 · Latest: 2026-03-09T16:23:33+00:00
Comments: 45 pages
Abstract
Large language model (LLM)-based agents trained with reinforcement learning (RL) have shown strong potential on complex interactive tasks. However, standard RL paradigms favor static problem-solving over continuous adaptation: agents often converge to suboptimal strategies due to insufficient exploration, while learned knowledge remains implicit within parameters rather than explicitly retrievable, limiting effective experiential learning. To address these limitations, we introduce RetroAgent, an online RL framework that empowers agents to master complex interactive environments not just by solving, but by evolving. Concretely, RetroAgent features a hindsight self-reflection mechanism that produces dual intrinsic feedback: (1) intrinsic numerical feedback that that tracks incremental subtask completion relative to prior attempts, rewarding promising explorations, and (2) intrinsic language feedback that distills reusable lessons into a memory buffer, retrieved via our proposed Similarity & Utility-Aware Upper Confidence Bound (SimUtil-UCB) strategy balancing relevance, utility, and exploration to effectively leverage past experiences. Extensive experiments on two model families across four challenging agentic tasks demonstrate that RetroAgent significantly outperforms existing methods, achieving state-of-the-art results -- e.g., surpassing Group Relative Policy Optimization (GRPO)-trained agents by +18.3% on ALFWorld, +15.4% on WebShop, +27.1% on Sokoban, and +8.9% on MineSweeper -- while exhibiting strong test-time adaptation and generalization to out-of-distribution scenarios.
中文标题/摘要
标题:RetroAgent:通过回顾双重内在反馈从解决到进化
基于大型语言模型(LLM)的代理通过强化学习(RL)训练,在复杂交互任务上展现了强大的潜力。然而,标准的RL范式更倾向于静态问题解决而非持续适应:代理往往由于探索不足而收敛到次优策略,而学到的知识则隐含在参数中而非明确可检索,限制了有效的经验学习。为了解决这些限制,我们引入了RetroAgent,这是一种在线RL框架,使代理不仅通过解决,还能通过进化来掌握复杂的交互环境。具体而言,RetroAgent具有前瞻自我反思机制,产生双重内在反馈:(1)内在数值反馈,跟踪相对于先前尝试的逐步子任务完成情况,奖励有希望的探索;(2)内在语言反馈,将可重用的教训提炼到记忆缓冲区中,通过我们提出的相似性与效用意识上置信界(SimUtil-UCB)策略进行检索,该策略平衡相关性、效用和探索,有效利用过往经验。在两个模型家族的四个具有挑战性的代理任务上的广泛实验表明,RetroAgent显著优于现有方法,达到最先进的结果——例如,在ALFWorld上超过GRPO训练的代理18.3%,在WebShop上超过15.4%,在Sokoban上超过27.1%,在Minesweeper上超过8.9%;同时表现出强大的测试时适应性和泛化能力,以应对分布外场景。
Summary / 总结
RetroAgent is an online RL framework that enhances agents' ability to continuously adapt and evolve in complex interactive tasks by incorporating a hindsight self-reflection mechanism. This mechanism provides dual intrinsic feedback: numerical feedback for incremental subtask completion and language feedback for reusable lessons stored in a memory buffer. Experiments show that RetroAgent outperforms existing methods on various tasks, achieving state-of-the-art results and strong generalization to out-of-distribution scenarios.
RetroAgent 是一种在线强化学习框架,使代理不仅能解决任务还能通过后见之明机制进化。该机制提供双重内在反馈:数值反馈追踪子任务完成的增量进展和语言反馈提取可重用的教训存储在记忆缓冲区中。SimUtil-UCB 策略用于检索这些教训,平衡相关性、实用性和探索性。实验表明,RetroAgent 在四个具有挑战性的任务上优于现有方法,达到最先进的性能,并且在测试时具有强大的适应性和泛化能力,能够处理未见过的数据分布场景。
Impact of Connectivity on Laplacian Representations in Reinforcement Learning
Authors: Tommaso Giorgi, Pierriccardo Olivieri, Keyue Jiang, Laura Toni, Matteo Papini
First: 2026-03-09T16:20:31+00:00 · Latest: 2026-03-09T16:20:31+00:00
Abstract
Learning compact state representations in Markov Decision Processes (MDPs) has proven crucial for addressing the curse of dimensionality in large-scale reinforcement learning (RL) problems. Existing principled approaches leverage structural priors on the MDP by constructing state representations as linear combinations of the state-graph Laplacian eigenvectors. When the transition graph is unknown or the state space is prohibitively large, the graph spectral features can be estimated directly via sample trajectories. In this work, we prove an upper bound on the approximation error of linear value function approximation under the learned spectral features. We show how this error scales with the algebraic connectivity of the state-graph, grounding the approximation quality in the topological structure of the MDP. We further bound the error introduced by the eigenvector estimation itself, leading to an end-to-end error decomposition across the representation learning pipeline. Additionally, our expression of the Laplacian operator for the RL setting, although equivalent to existing ones, prevents some common misunderstandings, of which we show some examples from the literature. Our results hold for general (non-uniform) policies without any assumptions on the symmetry of the induced transition kernel. We validate our theoretical findings with numerical simulations on gridworld environments.
中文标题/摘要
标题:连通性对强化学习中拉普拉斯表示的影响
在马尔可夫决策过程(MDPs)中学习紧凑的状态表示已被证明对于解决大规模强化学习(RL)问题中的维数灾难至关重要。现有的原理性方法通过将状态表示构建为状态图拉普拉斯特征向量的线性组合,利用MDP的结构先验。当转移图未知或状态空间过大时,可以通过样本轨迹直接估计图谱特征。在本文中,我们证明了在学习的谱特征下线性价值函数逼近的上界误差。我们展示了这种误差如何随着状态图的代数连通性而变化,将逼近质量与MDP的拓扑结构联系起来。我们进一步界定了特征向量估计本身引入的误差,从而在整个表示学习管道中获得端到端的误差分解。此外,我们对RL设置下的拉普拉斯算子的表达式,尽管与现有表达式等价,但可以防止一些常见的误解,我们从文献中展示了几个例子。我们的结果适用于一般(非均匀)策略,没有任何假设关于诱导转移核的对称性。我们通过网格世界环境的数值模拟验证了我们的理论发现。
Summary / 总结
This work addresses the challenge of learning compact state representations in large-scale reinforcement learning problems by leveraging the state-graph Laplacian eigenvectors. The authors prove an upper bound on the approximation error of linear value function approximation under these spectral features, showing that the error scales with the algebraic connectivity of the state-graph. They also provide a detailed error decomposition that includes the error from eigenvector estimation, validating their theoretical findings through numerical simulations on gridworld environments.
该研究探讨了状态图的连通性如何影响强化学习中拉普拉斯表示的有效性。作者推导出线性价值函数逼近使用谱特征时的上界误差,表明误差与状态图的代数连通性密切相关。他们还详细地分解了整个表示学习管道中的误差,包括特征向量的估计。结果通过在网格世界环境中的数值模拟得到了验证。
mmGAT: Pose Estimation by Graph Attention with Mutual Features from mmWave Radar Point Cloud
Authors: Abdullah Al Masud, Shi Xintong, Mondher Bouazizi, Ohtsuki Tomoaki
Venue: M. A. Al, X. Shi, B. Mondher and T. Ohtsuki, "mmGAT: Pose Estimation by Graph Attention with Mutual Features from mmWave Radar Point Cloud," IEEE ICC 2024, Denver, CO, USA
First: 2026-03-09T16:15:45+00:00 · Latest: 2026-03-09T16:15:45+00:00
Comments: copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Abstract
Pose estimation and human action recognition (HAR) are pivotal technologies spanning various domains. While the image-based pose estimation and HAR are widely admired for their superior performance, they lack in privacy protection and suboptimal performance in low-light and dark environments. This paper exploits the capabilities of millimeter-wave (mmWave) radar technology for human pose estimation by processing radar data with Graph Neural Network (GNN) architecture, coupled with the attention mechanism. Our goal is to capture the finer details of the radar point cloud to improve the pose estimation performance. To this end, we present a unique feature extraction technique that exploits the full potential of the GNN processing method for pose estimation. Our model mmGAT demonstrates remarkable performance on two publicly available benchmark mmWave datasets and establishes new state of the art results in most scenarios in terms of human pose estimation. Our approach achieves a noteworthy reduction of pose estimation mean per joint position error (MPJPE) by 35.6% and PA-MPJPE by 14.1% from the current state of the art benchmark within this domain.
中文标题/摘要
标题:mmGAT:基于互信息的雷达点云图注意力网络人体姿态估计
人体姿态估计和人体动作识别(HAR)是跨多个领域的关键技术。尽管基于图像的人体姿态估计和HAR因其卓越性能而备受推崇,但它们在隐私保护和低光环境下的表现欠佳。本文利用毫米波(mmWave)雷达技术进行人体姿态估计,通过使用图神经网络(GNN)架构处理雷达数据,并结合注意力机制。我们的目标是捕捉雷达点云的细微特征,以提高姿态估计性能。为此,我们提出了一种独特的特征提取技术,充分利用了GNN处理方法在姿态估计中的潜力。我们的模型mmGAT在两个公开的mmWave基准数据集上表现出色,并在大多数场景中建立了人体姿态估计的新最佳结果。我们的方法在当前领域内的姿态估计平均每个关节位置误差(MPJPE)上实现了35.6%的显著减少,并在PA-MPJPE上实现了14.1%的减少。
Summary / 总结
This paper addresses the limitations of image-based pose estimation and human action recognition by leveraging millimeter-wave (mmWave) radar technology. It proposes mmGAT, a model that uses Graph Neural Network (GNN) with attention mechanisms to process radar data, aiming to capture finer details of the radar point cloud. The model achieves significant improvements, reducing the mean per joint position error (MPJPE) by 35.6% and PA-MPJPE by 14.1% compared to the current state-of-the-art methods on benchmark datasets.
该论文通过利用毫米波(mmWave)雷达技术来弥补基于图像的人体姿态估计和人体动作识别的局限性。它提出了一种名为mmGAT的模型,该模型使用图神经网络(GNN)和注意力机制来处理雷达数据,旨在捕捉雷达点云的细微特征。该模型在公开的mmWave数据集上取得了显著的改进,将关键点位置误差(MPJPE)降低了35.6%,姿态关键点位置误差(PA-MPJPE)降低了14.1%,超过了当前最先进的方法。
Interactive World Simulator for Robot Policy Training and Evaluation
Authors: Yixuan Wang, Rhythm Syed, Fangyu Wu, Mengchao Zhang, Aykut Onol, Jose Barreiros, Hooshang Nayyeri, Tony Dear, Huan Zhang, Yunzhu Li
First: 2026-03-09T16:13:32+00:00 · Latest: 2026-03-09T16:13:32+00:00
Comments: Project Page: https://yixuanwang.me/interactive_world_sim
Abstract
Action-conditioned video prediction models (often referred to as world models) have shown strong potential for robotics applications, but existing approaches are often slow and struggle to capture physically consistent interactions over long horizons, limiting their usefulness for scalable robot policy training and evaluation. We present Interactive World Simulator, a framework for building interactive world models from a moderate-sized robot interaction dataset. Our approach leverages consistency models for both image decoding and latent-space dynamics prediction, enabling fast and stable simulation of physical interactions. In our experiments, the learned world models produce interaction-consistent pixel-level predictions and support stable long-horizon interactions for more than 10 minutes at 15 FPS on a single RTX 4090 GPU. Our framework enables scalable demonstration collection solely within the world models to train state-of-the-art imitation policies. Through extensive real-world evaluation across diverse tasks involving rigid objects, deformable objects, object piles, and their interactions, we find that policies trained on world-model-generated data perform comparably to those trained on the same amount of real-world data. Additionally, we evaluate policies both within the world models and in the real world across diverse tasks, and observe a strong correlation between simulated and real-world performance. Together, these results establish the Interactive World Simulator as a stable and physically consistent surrogate for scalable robotic data generation and faithful, reproducible policy evaluation.
中文标题/摘要
标题:交互式世界模拟器用于机器人政策训练与评估
基于动作条件的视频预测模型(通常称为世界模型)在机器人应用中显示出强大的潜力,但现有方法往往速度较慢,难以长时间捕捉物理一致的交互,限制了其在大规模机器人政策训练和评估中的应用。我们提出了交互式世界模拟器,这是一种从适度规模的机器人交互数据集中构建交互式世界模型的框架。我们的方法利用一致性模型进行图像解码和潜在空间动力学预测,从而实现快速稳定的物理交互模拟。在我们的实验中,学习到的世界模型生成了交互一致的像素级预测,并支持超过10分钟的稳定长时间交互,帧率为15 FPS,仅使用单个RTX 4090 GPU。我们的框架允许在世界模型内部进行可扩展的演示收集,以训练最先进的模仿策略。通过在涉及刚体、变形体、物体堆及其交互的多种任务中进行广泛的现实世界评估,我们发现基于世界模型生成数据训练的策略与基于相同量的现实世界数据训练的策略表现相当。此外,我们在世界模型内部和现实世界中对多种任务进行了策略评估,观察到模拟和现实世界性能之间存在强烈的相关性。这些结果共同确立了交互式世界模拟器作为可扩展机器人数据生成的稳定且物理一致的替代方案以及忠实、可重复的策略评估工具的地位。
Summary / 总结
The research aims to address the limitations of existing action-conditioned video prediction models in robotics by developing a framework called Interactive World Simulator. This framework uses consistency models for image decoding and latent-space dynamics prediction to simulate physical interactions efficiently and stably. The experiments show that the learned world models can produce interaction-consistent pixel-level predictions and support stable long-horizon interactions for over 10 minutes at 15 FPS on a single GPU. Policies trained using data from the world models perform comparably to those trained with real-world data, and there is a strong correlation between simulated and real-world performance across various tasks.
研究旨在解决现有动作条件下的视频预测模型在机器人应用中速度慢且物理不一致的问题。Interactive World Simulator框架利用一致性模型进行图像解码和潜在空间动力学预测,以实现快速稳定的物理交互模拟。实验表明,学习到的世界模型能够产生交互一致的像素级预测,并支持超过10分钟的稳定长时间交互,每秒15帧在单个GPU上运行。使用此框架训练的策略在性能上与使用相同量真实世界数据训练的策略相当,并且模拟和真实世界性能之间存在很强的相关性,这确立了该模拟器作为机器人数据生成和策略评估可靠替代品的地位。
PCFEx: Point Cloud Feature Extraction for Graph Neural Networks
Authors: Abdullah Al Masud, Shi Xintong, Mondher Bouazizi, Ohtsuki Tomoaki
Venue: IEEE Internet of Things Journal, vol. 13, no. 4, pp. 5909-5917, 15 Feb.15, 2026
First: 2026-03-09T16:09:02+00:00 · Latest: 2026-03-09T16:09:02+00:00
Comments: ©2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Abstract
Graph neural networks (GNNs) have gained significant attention for their effectiveness across various domains. This study focuses on applying GNN to process 3D point cloud data for human pose estimation (HPE) and human activity recognition (HAR). We propose novel point cloud feature extraction (PCFEx) techniques to capture meaningful information at the point, edge, and graph levels of the point cloud by considering point cloud as a graph. Moreover, we introduce a GNN architecture designed to efficiently process these features. Our approach is evaluated on four most popular publicly available millimeter wave radar datasets, three for HPE and one for HAR. The results show substantial improvements, with significantly reduced errors in all three HPE benchmarks, and an overall accuracy of 98.8% in mmWave-based HAR, outperforming the existing state of the art models. This work demonstrates the great potential of feature extraction incorporated with GNN modeling approach to enhance the precision of point cloud processing.
中文标题/摘要
标题:PCFEx:点云特征提取用于图神经网络
图神经网络(GNN)因其在各个领域的有效性而受到广泛关注。本研究专注于将GNN应用于处理3D点云数据以进行人体姿态估计(HPE)和人体活动识别(HAR)。我们提出了一种新颖的点云特征提取(PCFEx)技术,通过将点云视为图来捕捉点、边和图层面的有意义信息。此外,我们还引入了一种设计用于高效处理这些特征的GNN架构。我们的方法在四个最受欢迎的公开可用的毫米波雷达数据集上进行了评估,其中三个用于HPE,一个用于HAR。结果显示了显著的改进,所有三个HPE基准的误差显著降低,毫米波基于的HAR的整体准确率为98.8%,优于现有最先进的模型。本研究展示了将特征提取与GNN建模方法结合以提高点云处理精度的巨大潜力。
Summary / 总结
This study proposes PCFEx, a novel point cloud feature extraction technique for graph neural networks (GNNs) to improve human pose estimation and human activity recognition. By treating point clouds as graphs, the method captures information at the point, edge, and graph levels, and integrates it into a GNN architecture. The approach significantly reduces errors in three HPE benchmarks and achieves 98.8% accuracy in mmWave-based HAR, surpassing existing models.
该研究提出了一种名为PCFEx的新颖点云特征提取技术,用于图神经网络(GNN),以提高人体姿态估计和人体活动识别的精度。通过将点云视为图,PCFEx在点、边和图级别捕获信息。该GNN架构高效地处理这些特征,显著提高了HPE基准的性能,并在基于毫米波的HAR中达到了98.8%的准确率,超过了现有模型。
SWIFT: Sliding Window Reconstruction for Few-Shot Training-Free Generated Video Attribution
Authors: Chao Wang, Zijin Yang, Yaofei Wang, Yuang Qi, Weiming Zhang, Nenghai Yu, Kejiang Chen
First: 2026-03-09T16:06:26+00:00 · Latest: 2026-03-09T16:06:26+00:00
Abstract
Recent advancements in video generation technologies have been significant, resulting in their widespread application across multiple domains. However, concerns have been mounting over the potential misuse of generated content. Tracing the origin of generated videos has become crucial to mitigate potential misuse and identify responsible parties. Existing video attribution methods require additional operations or the training of source attribution models, which may degrade video quality or necessitate large amounts of training samples. To address these challenges, we define for the first time the "few-shot training-free generated video attribution" task and propose SWIFT, which is tightly integrated with the temporal characteristics of the video. By leveraging the "Pixel Frames(many) to Latent Frame(one)" temporal mapping within each video chunk, SWIFT applies a fixed-length sliding window to perform two distinct reconstructions: normal and corrupted. The variation in the losses between two reconstructions is then used as an attribution signal. We conducted an extensive evaluation of five state-of-the-art (SOTA) video generation models. Experimental results show that SWIFT achieves over 90% average attribution accuracy with merely 20 video samples across all models and even enables zero-shot attribution for HunyuanVideo, EasyAnimate, and Wan2.2. Our source code is available at https://github.com/wangchao0708/SWIFT.
中文标题/摘要
标题:SWIFT:滑动窗口重建用于少量样本无训练生成视频归属
近年来,视频生成技术取得了显著进展,广泛应用于多个领域。然而,人们对生成内容的潜在滥用表示担忧。追踪生成视频的来源变得至关重要,以减轻潜在滥用并识别责任方。现有视频归属方法需要额外操作或训练源归属模型,这可能降低视频质量或需要大量训练样本。为应对这些挑战,我们首次定义了“少量样本无训练生成视频归属”任务,并提出SWIFT,该方法紧密结合了视频的时间特性。通过利用每个视频片段内的“像素帧(多个)到潜在帧(一个)”的时间映射,SWIFT 应用固定长度的滑动窗口执行两种不同的重建:正常和损坏。两种重建之间的损失变化被用作归属信号。我们对五种最先进的(SOTA)视频生成模型进行了广泛评估。实验结果表明,SWIFT 在所有模型中仅使用 20 个视频样本即可实现超过 90% 的平均归属准确率,并且甚至可以实现零样本归属,适用于 HunyuanVideo、EasyAnimate 和 Wan2.2。我们的源代码可在 https://github.com/wangchao0708/SWIFT 获取。
Summary / 总结
The paper addresses the challenge of attributing the origin of generated videos, which is crucial for preventing misuse. It introduces SWIFT, a method that uses a sliding window to perform two reconstructions on video chunks, leveraging the temporal mapping between pixel frames and latent frames. SWIFT achieves over 90% average attribution accuracy with just 20 video samples across five state-of-the-art video generation models, demonstrating its effectiveness in few-shot training-free attribution.
论文旨在解决无需额外训练或质量降级即可追踪生成视频来源的问题。提出了一种名为SWIFT的方法,该方法利用视频片段中像素帧到潜在帧的时序映射,通过滑动窗口进行两种重建,并利用重建损失之间的差异作为归属信号。实验结果显示,SWIFT在五种最先进的视频生成模型上实现了超过90%的平均归属准确率,并且支持特定模型的零样本归属。
SecAgent: Efficient Mobile GUI Agent with Semantic Context
Authors: Yiping Xie, Song Chen, Jingxuan Xing, Wei Jiang, Zekun Zhu, Yingyao Wang, Pi Bu, Jun Song, Yuning Jiang, Bo Zheng
First: 2026-03-09T16:04:08+00:00 · Latest: 2026-03-09T16:04:08+00:00
Abstract
Mobile Graphical User Interface (GUI) agents powered by multimodal large language models have demonstrated promising capabilities in automating complex smartphone tasks. However, existing approaches face two critical limitations: the scarcity of high-quality multilingual datasets, particularly for non-English ecosystems, and inefficient history representation methods. To address these challenges, we present SecAgent, an efficient mobile GUI agent at 3B scale. We first construct a human-verified Chinese mobile GUI dataset with 18k grounding samples and 121k navigation steps across 44 applications, along with a Chinese navigation benchmark featuring multi-choice action annotations. Building upon this dataset, we propose a semantic context mechanism that distills history screenshots and actions into concise, natural language summaries, significantly reducing computational costs while preserving task-relevant information. Through supervised and reinforcement fine-tuning, SecAgent outperforms similar-scale baselines and achieves performance comparable to 7B-8B models on our and public navigation benchmarks. We will open-source the training dataset, benchmark, model, and code to advance research in multilingual mobile GUI automation.
中文标题/摘要
标题:SecAgent:具有语义上下文的高效移动GUI代理
由多模态大型语言模型驱动的移动图形用户界面(GUI)代理在自动化复杂智能手机任务方面展现了有前景的能力。然而,现有方法面临两个关键限制:高质量多语言数据集的稀缺性,尤其是非英语生态系统,以及低效的历史表示方法。为了解决这些挑战,我们提出了SecAgent,这是一种3B规模的高效移动GUI代理。我们首先构建了一个包含18000个语义对齐样本和121000个导航步骤的经过人工验证的中文移动GUI数据集,覆盖44个应用程序,并提供了一个包含多选操作注解的中文导航基准。基于此数据集,我们提出了一种语义上下文机制,将历史截图和操作提炼为简洁的自然语言摘要,显著降低了计算成本,同时保留了任务相关信息。通过监督和强化微调,SecAgent 在我们的导航基准和公共导航基准上优于同等规模的基线,并达到了与7B-8B模型相当的性能。我们将开源训练数据集、基准、模型和代码,以促进多语言移动GUI自动化研究。
Summary / 总结
SecAgent is an efficient mobile GUI agent at 3B scale that addresses the limitations of existing approaches by constructing a human-verified Chinese mobile GUI dataset and proposing a semantic context mechanism. This mechanism distills history screenshots and actions into concise summaries, reducing computational costs while preserving task-relevant information. SecAgent outperforms similar-scale baselines and achieves performance comparable to 7B-8B models on navigation benchmarks.
SecAgent 是一个3B规模的高效移动GUI代理,通过构建一个经过人工验证的中文移动GUI数据集和提出一种语义上下文机制来减少计算成本同时保留任务相关信息,SecAgent 在导航基准测试中优于类似规模的基线并达到了与7B-8B模型相当的性能。
BuildMamba: A Visual State-Space Based Model for Multi-Task Building Segmentation and Height Estimation from Satellite Images
Authors: Sinan U. Ulu, A. Enes Doruk, I. Can Yagmur, Bahadir K. Gunturk, Oguz Hanoglu, Hasan F. Ates
First: 2026-03-09T15:56:42+00:00 · Latest: 2026-03-09T15:56:42+00:00
Abstract
Accurate building segmentation and height estimation from single-view RGB satellite imagery are fundamental for urban analytics, yet remain ill-posed due to structural variability and the high computational cost of global context modeling. While current approaches typically adapt monocular depth architectures, they often suffer from boundary bleeding and systematic underestimation of high-rise structures. To address these limitations, we propose BuildMamba, a unified multi-task framework designed to exploit the linear-time global modeling of visual state-space models. Motivated by the need for stronger structural coupling and computational efficiency, we introduce three modules: a Mamba Attention Module for dynamic spatial recalibration, a Spatial-Aware Mamba-FPN for multi-scale feature aggregation via gated state-space scans, and a Mask-Aware Height Refinement module using semantic priors to suppress height artifacts. Extensive experiments demonstrate that BuildMamba establishes a new performance upper bound across three benchmarks. Specifically, it achieves an IoU of 0.93 and RMSE of 1.77~m on DFC23 benchmark, surpassing state-of-the-art by 0.82~m in height estimation. Simulation results confirm the model's superior robustness and scalability for large-scale 3D urban reconstruction.
中文标题/摘要
标题:BuildMamba:基于视觉状态空间模型的多任务建筑分割与高度估计框架
单视角RGB卫星图像中的建筑分割和高度估计对于城市分析至关重要,但由于结构多样性以及全局上下文建模的高计算成本,这一任务仍然难以解决。当前方法通常采用单目深度架构,但往往存在边界溢出和高层建筑系统性低估的问题。为解决这些问题,我们提出BuildMamba,这是一种统一的多任务框架,旨在利用视觉状态空间模型的线性时间全局建模。受更强结构耦合和计算效率的需求驱动,我们引入了三个模块:Mamba 注意模块用于动态空间重新校准,空间感知的 Mamba-FPN 通过门控状态空间扫描进行多尺度特征聚合,以及使用语义先验的掩码感知高度细化模块以抑制高度伪影。大量实验表明,BuildMamba 在三个基准测试中建立了新的性能上限。具体而言,它在DFC23基准测试中实现了0.93的IoU和1.77米的RMSE,高度估计方面比最先进的方法高出0.82米。仿真结果证实了该模型在大规模三维城市重建中的优越鲁棒性和可扩展性。
Summary / 总结
BuildMamba is a unified multi-task framework for building segmentation and height estimation from satellite images, addressing the limitations of existing monocular depth architectures by incorporating visual state-space models. It includes a Mamba Attention Module for spatial recalibration, a Spatial-Aware Mamba-FPN for multi-scale feature aggregation, and a Mask-Aware Height Refinement module. Experiments show that BuildMamba outperforms state-of-the-art methods, achieving an IoU of 0.93 and an RMSE of 1.77 meters on the DFC23 benchmark, with a 0.82-meter improvement in height estimation accuracy.
BuildMamba 是一个统一的多任务框架,旨在从卫星图像中提高建筑物分割和高度估计的准确性。该框架通过引入 Mamba 注意力模块、空间感知 Mamba-FPN 和掩码感知高度细化模块,增强了结构耦合并提高了计算效率。实验结果表明,BuildMamba 在 DFC23 基准上表现出色,IoU 达到 0.93,RMSE 为 1.77 米,高度估计比现有最佳方法高出 0.82 米。