arXiv Paper Digest

2026-02-19 03:58
Snapshot: 20260219_0358
Dex4D: Task-Agnostic Point Track Policy for Sim-to-Real Dexterous Manipulation
Authors: Yuxuan Kuang, Sungjae Park, Katerina Fragkiadaki, Shubham Tulsiani
First: 2026-02-17T18:59:31+00:00 · Latest: 2026-02-17T18:59:31+00:00
Comments: Project page: https://dex4d.github.io/
Abstract
Learning generalist policies capable of accomplishing a plethora of everyday tasks remains an open challenge in dexterous manipulation. In particular, collecting large-scale manipulation data via real-world teleoperation is expensive and difficult to scale. While learning in simulation provides a feasible alternative, designing multiple task-specific environments and rewards for training is similarly challenging. We propose Dex4D, a framework that instead leverages simulation for learning task-agnostic dexterous skills that can be flexibly recomposed to perform diverse real-world manipulation tasks. Specifically, Dex4D learns a domain-agnostic 3D point track conditioned policy capable of manipulating any object to any desired pose. We train this 'Anypose-to-Anypose' policy in simulation across thousands of objects with diverse pose configurations, covering a broad space of robot-object interactions that can be composed at test time. At deployment, this policy can be zero-shot transferred to real-world tasks without finetuning, simply by prompting it with desired object-centric point tracks extracted from generated videos. During execution, Dex4D uses online point tracking for closed-loop perception and control. Extensive experiments in simulation and on real robots show that our method enables zero-shot deployment for diverse dexterous manipulation tasks and yields consistent improvements over prior baselines. Furthermore, we demonstrate strong generalization to novel objects, scene layouts, backgrounds, and trajectories, highlighting the robustness and scalability of the proposed framework.
中文标题/摘要
标题:Dex4D:通用点轨迹策略框架实现从仿真到现实的灵巧操作
在灵巧操作中,学习能够完成多种日常任务的一般性策略仍然是一个开放的挑战。特别是,通过现实世界的远程操作收集大规模操作数据既昂贵又难以扩展。虽然在仿真中学习提供了一种可行的替代方案,但设计多个特定任务的环境和奖励进行训练同样具有挑战性。我们提出了Dex4D框架,该框架利用仿真来学习任务无关的灵巧技能,这些技能可以在测试时灵活重组以执行各种现实世界的操作任务。具体而言,Dex4D学习了一种领域无关的3D点轨迹条件策略,该策略能够操作任何物体到任何期望的姿态。我们在数千种具有不同姿态配置的物体上对这种“任意姿态到任意姿态”的策略进行了仿真训练,涵盖了可以在测试时组合的广泛机器人-物体交互空间。在部署时,该策略可以通过仅提示其期望的物体中心点轨迹(从生成的视频中提取)而无需微调,即可实现零样本转移。在执行过程中,Dex4D使用在线点跟踪进行闭环感知和控制。在仿真和真实机器人上的大量实验表明,我们的方法能够实现多种灵巧操作任务的零样本部署,并且在先前基线方法上取得了持续改进。此外,我们展示了其对新型物体、场景布局、背景和轨迹的强大泛化能力,突显了所提出框架的鲁棒性和可扩展性。
Summary / 总结
Dex4D is a framework designed to learn task-agnostic dexterous manipulation skills in simulation, which can be flexibly applied to various real-world tasks. It trains a 3D point track policy to manipulate any object to any desired pose across thousands of objects with diverse configurations. During deployment, the policy can be zero-shot transferred to real-world tasks by prompting it with desired object-centric point tracks. Experiments show that Dex4D enables zero-shot deployment for diverse manipulation tasks and demonstrates strong generalization to novel objects and backgrounds.
Dex4D 是一个框架,旨在通过模拟学习通用的灵巧操作技能,这些技能可以灵活应用于各种实际任务。它训练了一个‘任意姿态到任意姿态’的策略,能够在数千个具有不同配置的对象上操作任意物体到任意姿态。该策略可以通过提示其所需的目标物体中心点轨迹,在实际任务中实现零样本迁移。实验表明,Dex4D 在性能上优于先前的方法,并且在新物体和场景中表现出强大的泛化能力。
Perceptive Humanoid Parkour: Chaining Dynamic Human Skills via Motion Matching
Authors: Zhen Wu, Xiaoyu Huang, Lujie Yang, Yuanhang Zhang, Koushil Sreenath, Xi Chen, Pieter Abbeel, Rocky Duan, Angjoo Kanazawa, Carmelo Sferrazza, Guanya Shi, C. Karen Liu
First: 2026-02-17T18:59:11+00:00 · Latest: 2026-02-17T18:59:11+00:00
Abstract
While recent advances in humanoid locomotion have achieved stable walking on varied terrains, capturing the agility and adaptivity of highly dynamic human motions remains an open challenge. In particular, agile parkour in complex environments demands not only low-level robustness, but also human-like motion expressiveness, long-horizon skill composition, and perception-driven decision-making. In this paper, we present Perceptive Humanoid Parkour (PHP), a modular framework that enables humanoid robots to autonomously perform long-horizon, vision-based parkour across challenging obstacle courses. Our approach first leverages motion matching, formulated as nearest-neighbor search in a feature space, to compose retargeted atomic human skills into long-horizon kinematic trajectories. This framework enables the flexible composition and smooth transition of complex skill chains while preserving the elegance and fluidity of dynamic human motions. Next, we train motion-tracking reinforcement learning (RL) expert policies for these composed motions, and distill them into a single depth-based, multi-skill student policy, using a combination of DAgger and RL. Crucially, the combination of perception and skill composition enables autonomous, context-aware decision-making: using only onboard depth sensing and a discrete 2D velocity command, the robot selects and executes whether to step over, climb onto, vault or roll off obstacles of varying geometries and heights. We validate our framework with extensive real-world experiments on a Unitree G1 humanoid robot, demonstrating highly dynamic parkour skills such as climbing tall obstacles up to 1.25m (96% robot height), as well as long-horizon multi-obstacle traversal with closed-loop adaptation to real-time obstacle perturbations.
中文标题/摘要
标题:感知人形机器人跑酷:通过运动匹配串联动态人类技能
尽管近年来类人机器人在多变地形上的稳定行走取得了进展,但捕捉人类高度动态运动的敏捷性和适应性仍然是一个开放的挑战。特别是,在复杂环境中进行敏捷的公园跑酷不仅需要低级别的鲁棒性,还需要类似人类的运动表达性、长期技能组合以及感知驱动的决策制定。在本文中,我们提出了感知类人公园跑酷(PHP),这是一种模块化框架,使类人机器人能够自主地在具有挑战性的障碍物课程上进行长期视角导向的公园跑酷。我们的方法首先利用运动匹配,将其形式化为特征空间中的最近邻搜索,将重新定位的原子人类技能组合成长期的运动轨迹。该框架使复杂的技能链的灵活组合和平滑过渡成为可能,同时保持动态人类运动的优雅和流畅性。接下来,我们为这些组合的运动训练基于运动跟踪的强化学习(RL)专家策略,并使用DAgger和RL的组合将其提炼为一个基于深度的、多技能学生策略。关键的是,感知与技能组合的结合使自主、上下文感知的决策成为可能:仅使用机载深度传感和离散的二维速度命令,机器人可以选择并执行跨越、攀爬、越障或滚落不同几何形状和高度的障碍物。我们通过在Unitree G1类人机器人上进行广泛的实地实验验证了该框架,展示了诸如攀爬高达1.25米(96%机器人高度)的障碍物等高度动态的公园跑酷技能,以及在实时障碍扰动下进行长期多障碍物穿越的闭环适应。
Summary / 总结
This paper presents Perceptive Humanoid Parkour (PHP), a modular framework for agile, adaptive humanoid parkour. It uses motion matching to compose retargeted human skills into long-horizon kinematic trajectories, trains motion-tracking reinforcement learning expert policies on those trajectories, and distills them into a single depth-based, multi-skill student policy. On a Unitree G1, the robot climbs obstacles up to 1.25m and traverses multiple obstacles while adapting in closed loop to real-time obstacle perturbations.
本文提出了Perceptive Humanoid Parkour (PHP) 模块化框架,使类人机器人能够自主执行复杂的公园技巧。该方法使用运动匹配将人类技能组合成长时间轨迹,并训练强化学习策略来执行这些技能。机器人仅使用机载深度传感和离散速度命令即可自主决定是否跨过、攀爬、翻越或滚过不同几何形状和高度的障碍物,展示了动态的公园技巧,如攀爬高达1.25m的障碍物,并在实时障碍变化时进行闭环适应,实现多障碍物穿越。
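The PHP abstract formulates motion matching as nearest-neighbor search in a feature space. A minimal sketch of that idea follows, assuming hypothetical two-dimensional features (forward speed, obstacle height ahead) and made-up clip names; the paper's actual feature set and motion database are not described in this digest.

```python
import math

def motion_match(query, database):
    """Return the index of the database feature closest to `query` (Euclidean)."""
    best_idx, best_dist = -1, math.inf
    for i, feat in enumerate(database):
        d = math.dist(query, feat)
        if d < best_dist:
            best_idx, best_dist = i, d
    return best_idx

# Hypothetical clip library: names and (forward speed m/s, obstacle height m)
# features are illustrative assumptions, not values from the paper.
clips = ["walk", "step_over", "climb", "vault"]
features = [(1.0, 0.0), (1.0, 0.3), (0.5, 1.2), (2.0, 0.8)]

query = (1.8, 0.7)  # current state: moving fast toward a mid-height obstacle
chosen = clips[motion_match(query, features)]
print(chosen)
```

In a real system the query would be rebuilt every few frames from the robot state and the command, so the selected clip can change as the scene changes.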
Hunt Globally: Wide Search AI Agents for Drug Asset Scouting in Investing, Business Development, and Competitive Intelligence
Authors: Alisa Vinogradova, Vlad Vinogradov, Luba Greenwood, Ilya Yasny, Dmitry Kobyzev, Shoman Kasbekar, Kong Nguyen, Dmitrii Radkevich, Roman Doronin, Andrey Doronichev
First: 2026-02-16T18:57:49+00:00 · Latest: 2026-02-17T18:58:56+00:00
Abstract
Bio-pharmaceutical innovation has shifted: many new drug assets now originate outside the United States and are disclosed primarily via regional, non-English channels. Recent data suggests that over 85% of patent filings originate outside the U.S., with China accounting for nearly half of the global total. A growing share of scholarly output is also non-U.S. Industry estimates put China at 30% of global drug development, spanning 1,200+ novel candidates. In this high-stakes environment, failing to surface "under-the-radar" assets creates multi-billion-dollar risk for investors and business development teams, making asset scouting a coverage-critical competition where speed and completeness drive value. Yet today's Deep Research AI agents still lag human experts in achieving high recall discovery across heterogeneous, multilingual sources without hallucination. We propose a benchmarking methodology for drug asset scouting and a tuned, tree-based self-learning Bioptic Agent aimed at complete, non-hallucinated scouting. We construct a challenging completeness benchmark using a multilingual multi-agent pipeline: complex user queries paired with ground-truth assets that are largely outside U.S.-centric radar. To reflect real-deal complexity, we collected screening queries from expert investors, BD, and VC professionals and used them as priors to conditionally generate benchmark queries. For grading, we use LLM-as-judge evaluation calibrated to expert opinions. On this benchmark, our Bioptic Agent achieves 79.7% F1 score, outperforming Claude Opus 4.6 (56.2%), Gemini 3 Pro + Deep Research (50.6%), OpenAI GPT-5.2 Pro (46.6%), Perplexity Deep Research (44.2%), and Exa Websets (26.9%). Performance improves steeply with additional compute, supporting the view that more compute yields better results.
中文标题/摘要
标题:全球搜寻:广泛搜索AI代理在投资、商务拓展和竞争情报中的药物资产勘探
生物医药创新已转变:许多新药资产现在起源于美国之外,并主要通过区域性的非英语渠道披露。最新数据显示,超过85%的专利申请来自美国之外,其中中国占全球总量的近一半。非美国的学术产出份额也在增加。行业估计显示,中国在全球药物研发中占30%,涵盖1,200多种新型候选药物。在这种高风险环境中,未能发现“不为人知”的资产会给投资者和商务拓展团队带来数十亿美元的风险,使资产勘探成为一项以覆盖率为关键的竞争,速度和完整性决定价值。然而,当前的深度研究AI代理在跨异构、多语言来源实现无幻觉的高召回率发现方面仍落后于人类专家。我们提出了一种药物资产勘探的基准测试方法,并开发了一种调优的树状自学习Bioptic代理,旨在实现完整的无幻觉勘探。我们构建了一个具有挑战性的完整性基准,使用多语言多代理管道:复杂用户查询配以大多处于以美国为中心的视野之外的真实标注资产。为了反映实际交易的复杂性,我们收集了专家投资者、商务拓展和风险投资专业人士的筛查查询,并将其作为先验来条件生成基准查询。在评分方面,我们使用根据专家意见校准的LLM-as-judge评估。在该基准测试中,我们的Bioptic代理取得了79.7%的F1分数,优于Claude Opus 4.6(56.2%)、Gemini 3 Pro + 深度研究(50.6%)、OpenAI GPT-5.2 Pro(46.6%)、Perplexity深度研究(44.2%)和Exa Websets(26.9%)。随着计算资源的增加,性能显著提升,支持了更多计算资源会带来更好结果的观点。
Summary / 总结
The paper addresses the challenge of identifying new drug assets that originate outside the U.S., particularly from non-English sources. It introduces a benchmarking methodology and a self-learning Bioptic Agent designed to comprehensively scout these assets without hallucination. The Bioptic Agent outperforms several other AI agents, achieving a 79.7% F1 score on a challenging multilingual benchmark, significantly surpassing competitors like Claude Opus 4.6 and Gemini 3 Pro + Deep Research.
论文针对识别起源于美国以外地区的新药物资产的挑战,特别是非英语来源的资产。它提出了一种基准测试方法,并开发了一个自学习的Bioptic代理,旨在全面地侦察这些资产而不产生幻觉。Bioptic代理在一项具有挑战性的多语言基准测试中取得了79.7%的F1分数,显著超过了Claude Opus 4.6和Gemini 3 Pro + Deep Research等竞争对手。
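The benchmark above reports F1, which balances precision (how many returned assets are correct) against recall (how many ground-truth assets were found). The sketch below uses the standard set-based definition with made-up asset identifiers; the paper grades matches with an LLM-as-judge, so exact string matching here is a simplifying assumption.

```python
def f1_score(predicted, truth):
    """Set-based F1: harmonic mean of precision and recall."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)

# Hypothetical asset identifiers, for illustration only.
found = {"asset-A", "asset-B", "asset-C", "asset-X"}             # 3 of 4 correct
ground_truth = {"asset-A", "asset-B", "asset-C", "asset-D", "asset-E"}

score = f1_score(found, ground_truth)
print(f"precision=0.75 recall=0.60 F1={score:.3f}")
```

Because F1 is a harmonic mean, an agent cannot reach a high score by returning very few safe answers (low recall) or by returning everything it sees (low precision).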
stable-worldmodel-v1: Reproducible World Modeling Research and Evaluation
Authors: Lucas Maes, Quentin Le Lidec, Dan Haramati, Nassim Massaudi, Damien Scieur, Yann LeCun, Randall Balestriero
First: 2026-02-09T18:04:22+00:00 · Latest: 2026-02-17T18:58:08+00:00
Abstract
World Models have emerged as a powerful paradigm for learning compact, predictive representations of environment dynamics, enabling agents to reason, plan, and generalize beyond direct experience. Despite recent interest in World Models, most available implementations remain publication-specific, severely limiting their reusability, increasing the risk of bugs, and reducing evaluation standardization. To mitigate these issues, we introduce stable-worldmodel (SWM), a modular, tested, and documented world-model research ecosystem that provides efficient data-collection tools, standardized environments, planning algorithms, and baseline implementations. In addition, each environment in SWM enables controllable factors of variation, including visual and physical properties, to support robustness and continual learning research. Finally, we demonstrate the utility of SWM by using it to study zero-shot robustness in DINO-WM.
中文标题/摘要
标题:stable-worldmodel-v1:可复现的世界模型研究与评估
世界模型已经发展成为一种强大的范式,用于学习环境动力学的紧凑、预测性表示,使智能体能够推理、规划并超越直接经验进行泛化。尽管最近对世界模型的兴趣增加,但大多数可用的实现仍然针对特定的出版物,严重限制了它们的可重用性,增加了错误的风险,并降低了评估标准化。为缓解这些问题,我们引入了稳定的世界模型(SWM),这是一个模块化、经过测试和文档化的世界建模研究生态系统,提供了高效的数据收集工具、标准化环境、规划算法和基线实现。此外,SWM中的每个环境都支持鲁棒性和持续学习研究,允许控制变化因素,包括视觉和物理属性。最后,我们通过使用SWM研究DINO-WM的零样本鲁棒性来展示SWM的实用性。
CrispEdit: Low-Curvature Projections for Scalable Non-Destructive LLM Editing
Authors: Zarif Ikram, Arad Firouzkouhi, Stephen Tu, Mahdi Soltanolkotabi, Paria Rashidinejad
First: 2026-02-17T18:58:04+00:00 · Latest: 2026-02-17T18:58:04+00:00
Abstract
A central challenge in large language model (LLM) editing is capability preservation: methods that successfully change targeted behavior can quietly game the editing proxy and corrupt general capabilities, producing degenerate behaviors reminiscent of proxy/reward hacking. We present CrispEdit, a scalable and principled second-order editing algorithm that treats capability preservation as an explicit constraint, unifying and generalizing several existing editing approaches. CrispEdit formulates editing as constrained optimization and enforces the constraint by projecting edit updates onto the low-curvature subspace of the capability-loss landscape. At the crux of CrispEdit is expressing capability constraint via Bregman divergence, whose quadratic form yields the Gauss-Newton Hessian exactly and even when the base model is not trained to convergence. We make this second-order procedure efficient at the LLM scale using Kronecker-factored approximate curvature (K-FAC) and a novel matrix-free projector that exploits Kronecker structure to avoid constructing massive projection matrices. Across standard model-editing benchmarks, CrispEdit achieves high edit success while keeping capability degradation below 1% on average across datasets, significantly improving over prior editors.
中文标题/摘要
标题:CrispEdit:低曲率投影的可扩展非破坏性大语言模型编辑
大规模语言模型(LLM)编辑中的一个主要挑战是能力保留:能够成功改变目标行为的方法可能会悄悄地钻编辑代理指标的空子并破坏一般能力,产生类似于代理/奖励黑客行为的退化行为。我们提出了CrispEdit,这是一种可扩展且基于原理的二阶编辑算法,将能力保留视为显式约束,统一并泛化了多种现有的编辑方法。CrispEdit将编辑表述为约束优化,并通过将编辑更新投影到能力损失景观的低曲率子空间中来强制执行约束。CrispEdit的核心在于通过Bregman散度表达能力约束,即使基模型未训练至收敛,其二次形式也能精确给出Gauss-Newton海森矩阵。我们通过Kronecker因子近似曲率(K-FAC)和一种新颖的无矩阵(matrix-free)投影器,利用Kronecker结构避免构建大规模投影矩阵,使这种二阶过程在LLM规模下高效运行。在标准模型编辑基准测试中,CrispEdit在各数据集上平均能力退化低于1%的同时实现了高编辑成功率,显著优于先前的编辑器。
Summary / 总结
CrispEdit addresses the challenge of preserving capabilities in large language model (LLM) editing by formulating editing as constrained optimization and projecting edit updates onto the low-curvature subspace of the capability-loss landscape. It uses Bregman divergence to express capability constraints and employs Kronecker-factored approximate curvature and a matrix-free projector to maintain efficiency at the LLM scale. Experiments show that CrispEdit successfully edits LLMs with minimal capability degradation, averaging less than 1% degradation across datasets, outperforming previous methods.
CrispEdit通过将编辑问题表述为约束优化问题,并将编辑更新投影到能力损失景观的低曲率子空间中来解决大规模语言模型(LLM)编辑中的能力保留问题。它使用Bregman散度来表达能力约束,并利用Kronecker因子近似曲率和一种基于Kronecker结构的矩阵自由投影器来保持效率。实验表明,CrispEdit能够在几乎不降低能力的情况下成功编辑模型,优于先前的方法。
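The core idea in the CrispEdit abstract, projecting an edit update onto the low-curvature subspace of a capability loss, can be illustrated on a two-parameter toy problem. The sketch assumes the curvature eigenvectors and eigenvalues are given explicitly and uses a hand-picked threshold; CrispEdit itself relies on K-FAC and a matrix-free projector, which this sketch does not attempt to reproduce.

```python
import math

def project_low_curvature(update, eigvecs, eigvals, threshold):
    """Keep only the components of `update` along eigenvectors whose
    curvature (eigenvalue) is below `threshold`."""
    projected = [0.0] * len(update)
    for vec, lam in zip(eigvecs, eigvals):
        if lam < threshold:
            coeff = sum(u * v for u, v in zip(update, vec))
            projected = [p + coeff * v for p, v in zip(projected, vec)]
    return projected

def quadratic_cost(step, eigvecs, eigvals):
    """Second-order capability-loss increase: 0.5 * step^T H step,
    with H = sum(lam * v v^T) over the given eigenpairs."""
    return 0.5 * sum(
        lam * sum(s * v for s, v in zip(step, vec)) ** 2
        for vec, lam in zip(eigvecs, eigvals)
    )

# Toy curvature: one sharp direction (10.0) and one flat direction (0.01).
s = 1 / math.sqrt(2)
eigvecs = [(s, s), (s, -s)]
eigvals = [10.0, 0.01]

raw_update = (1.0, 0.0)
safe_update = project_low_curvature(raw_update, eigvecs, eigvals, threshold=1.0)

print(quadratic_cost(raw_update, eigvecs, eigvals))   # cost of the unprojected edit
print(quadratic_cost(safe_update, eigvecs, eigvals))  # cost after projection
```

In this toy case the projected update keeps only the flat direction, so the quadratic estimate of capability damage falls from about 2.5 to about 0.0025, at the price of discarding part of the requested edit.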
Stabilizing Test-Time Adaptation of High-Dimensional Simulation Surrogates via D-Optimal Statistics
Authors: Anna Zimmel, Paul Setinek, Gianluca Galletti, Johannes Brandstetter, Werner Zellinger
First: 2026-02-17T18:55:18+00:00 · Latest: 2026-02-17T18:55:18+00:00
Abstract
Machine learning surrogates are increasingly used in engineering to accelerate costly simulations, yet distribution shifts between training and deployment often cause severe performance degradation (e.g., unseen geometries or configurations). Test-Time Adaptation (TTA) can mitigate such shifts, but existing methods are largely developed for lower-dimensional classification with structured outputs and visually aligned input-output relationships, making them unstable for the high-dimensional, unstructured and regression problems common in simulation. We address this challenge by proposing a TTA framework based on storing maximally informative (D-optimal) statistics, which jointly enables stable adaptation and principled parameter selection at test time. When applied to pretrained simulation surrogates, our method yields up to 7% out-of-distribution improvements at negligible computational cost. To the best of our knowledge, this is the first systematic demonstration of effective TTA for high-dimensional simulation regression and generative design optimization, validated on the SIMSHIFT and EngiBench benchmarks.
中文标题/摘要
标题:通过D-最优统计稳定高维模拟代理的测试时适应
机器学习代理在工程中越来越多地用于加速昂贵的模拟,但在训练和部署之间出现的分布变化往往会导致严重的性能下降(例如,未见过的几何形状或配置)。测试时适应(TTA)可以缓解这种变化,但现有方法主要针对低维分类且具有结构化输出和视觉对齐输入输出关系的问题进行开发,使得它们在模拟中常见的高维、无结构和回归问题中不稳定。我们通过提出基于存储最大信息量(D-最优)统计的TTA框架来应对这一挑战,该框架在测试时同时实现了稳定适应和原理参数选择。当应用于预训练的模拟代理时,我们的方法在几乎不增加计算成本的情况下,可获得高达7%的分布外性能提升。据我们所知,这是首次系统地证明有效的高维模拟回归和生成设计优化的TTA,并在SIMSHIFT和EngiBench基准上进行了验证。
Summary / 总结
The research aims to improve the performance of machine learning surrogates in engineering simulations by addressing the issue of distribution shifts between training and deployment. The proposed method, based on storing D-optimal statistics, enables stable test-time adaptation for high-dimensional regression problems, which are common in simulation. The method achieves up to 7% out-of-distribution performance improvements with negligible computational cost, demonstrating its effectiveness on the SIMSHIFT and EngiBench benchmarks.
研究旨在通过解决训练和测试之间分布变化的问题,提高工程模拟中机器学习代理的表现。提出的方法使用D最优统计来稳定测试时的自适应,特别是针对高维度、非结构化的回归问题。实验结果显示,在几乎不增加计算成本的情况下,可以提高7%的出域性能。
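For readers unfamiliar with the term, a D-optimal selection picks the samples that maximize the determinant of the information matrix X^T X, so the chosen points are as informative as possible about every direction of the feature space. The sketch below is a generic greedy selection on two-dimensional features with a small ridge term; it is an assumption-laden illustration of D-optimality, not the statistics or selection rule used in the paper.

```python
def info_det(points, ridge=1e-6):
    """Determinant of the 2x2 information matrix X^T X + ridge * I."""
    a = ridge + sum(x * x for x, _ in points)
    b = sum(x * y for x, y in points)
    d = ridge + sum(y * y for _, y in points)
    return a * d - b * b

def greedy_d_optimal(candidates, k):
    """Greedily add the candidate that most increases the determinant."""
    chosen, remaining = [], list(candidates)
    for _ in range(k):
        best = max(remaining, key=lambda p: info_det(chosen + [p]))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Hypothetical feature vectors: three nearly collinear points and one
# point along the other axis.
candidates = [(1.0, 0.0), (0.9, 0.1), (1.1, 0.0), (0.0, 1.0)]
selected = greedy_d_optimal(candidates, k=2)
print(selected)
```

The greedy rule skips the redundant, nearly collinear points and pairs the largest point on one axis with the point on the other axis, which is the behavior that makes D-optimal subsets compact yet informative.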
VideoSketcher: Video Models Prior Enable Versatile Sequential Sketch Generation
Authors: Hui Ren, Yuval Alaluf, Omer Bar Tal, Alexander Schwing, Antonio Torralba, Yael Vinker
First: 2026-02-17T18:55:03+00:00 · Latest: 2026-02-17T18:55:03+00:00
Abstract
Sketching is inherently a sequential process, in which strokes are drawn in a meaningful order to explore and refine ideas. However, most generative models treat sketches as static images, overlooking the temporal structure that underlies creative drawing. We present a data-efficient approach for sequential sketch generation that adapts pretrained text-to-video diffusion models to generate sketching processes. Our key insight is that large language models and video diffusion models offer complementary strengths for this task: LLMs provide semantic planning and stroke ordering, while video diffusion models serve as strong renderers that produce high-quality, temporally coherent visuals. We leverage this by representing sketches as short videos in which strokes are progressively drawn on a blank canvas, guided by text-specified ordering instructions. We introduce a two-stage fine-tuning strategy that decouples the learning of stroke ordering from the learning of sketch appearance. Stroke ordering is learned using synthetic shape compositions with controlled temporal structure, while visual appearance is distilled from as few as seven manually authored sketching processes that capture both global drawing order and the continuous formation of individual strokes. Despite the extremely limited amount of human-drawn sketch data, our method generates high-quality sequential sketches that closely follow text-specified orderings while exhibiting rich visual detail. We further demonstrate the flexibility of our approach through extensions such as brush style conditioning and autoregressive sketch generation, enabling additional controllability and interactive, collaborative drawing.
中文标题/摘要
标题:VideoSketcher:视频模型先验使顺序素描生成多样化
素描本质上是一个顺序过程,在这个过程中,按照有意义的顺序绘制线条以探索和细化想法。然而,大多数生成模型将素描视为静态图像,忽视了支撑创造性绘画的时间结构。我们提出了一种数据高效的方法,将预训练的文本到视频扩散模型适应以生成素描过程。我们的关键见解是,大型语言模型和视频扩散模型在这一任务中提供了互补的优势:LLMs 提供语义规划和线条顺序,而视频扩散模型作为强大的渲染器,产生高质量、时间上连贯的视觉效果。我们通过将素描表示为短视频来利用这一点,在这些视频中,线条在空白画布上逐步绘制,受文本指定的顺序指令引导。我们引入了一种两阶段微调策略,将线条顺序的学习与素描外观的学习分离。线条顺序是通过合成具有受控时间结构的形状来学习的,而视觉外观则从七个手动撰写的素描过程中提取,这些过程捕捉了全局绘画顺序和单个线条的连续形成。尽管人类绘制的素描数据极其有限,但我们的方法生成了高质量的顺序素描,这些素描紧密遵循文本指定的顺序,同时表现出丰富的视觉细节。我们还通过扩展如笔触风格条件和自回归素描生成,进一步展示了我们方法的灵活性,从而实现额外的可控性和交互式、协作式绘画。
Summary / 总结
The research aims to address the limitations of static image-based generative models in capturing the sequential nature of sketching. It proposes a method that uses pretrained text-to-video diffusion models to generate sequential sketches, leveraging the strengths of large language models for semantic planning and stroke ordering, and video diffusion models for high-quality rendering. The method achieves high-quality, temporally coherent sequential sketches with limited human-drawn sketch data, demonstrating its effectiveness and flexibility in generating detailed and ordered sketches.
研究旨在解决基于静态图像的生成模型无法捕捉绘图顺序性的局限性。提出了一种方法,利用预训练的文本到视频扩散模型生成顺序性素描,结合大型语言模型进行语义规划和笔触顺序,以及视频扩散模型进行高质量渲染。该方法使用有限的人工绘制素描数据生成高质量、时间连贯的顺序性素描,展示了其在生成详细且有序素描方面的有效性与灵活性。
Solving Parameter-Robust Avoid Problems with Unknown Feasibility using Reinforcement Learning
Authors: Oswin So, Eric Yang Yu, Songyuan Zhang, Matthew Cleaveland, Mitchell Black, Chuchu Fan
Venue: ICLR 2026
First: 2026-02-17T18:53:31+00:00 · Latest: 2026-02-17T18:53:31+00:00
Comments: ICLR 2026. The project page can be found at https://oswinso.xyz/fge
Abstract
Recent advances in deep reinforcement learning (RL) have achieved strong results on high-dimensional control tasks, but applying RL to reachability problems raises a fundamental mismatch: reachability seeks to maximize the set of states from which a system remains safe indefinitely, while RL optimizes expected returns over a user-specified distribution. This mismatch can result in policies that perform poorly on low-probability states that are still within the safe set. A natural alternative is to frame the problem as a robust optimization over a set of initial conditions that specify the initial state, dynamics and safe set, but whether this problem has a solution depends on the feasibility of the specified set, which is unknown a priori. We propose Feasibility-Guided Exploration (FGE), a method that simultaneously identifies a subset of feasible initial conditions under which a safe policy exists, and learns a policy to solve the reachability problem over this set of initial conditions. Empirical results demonstrate that FGE learns policies with over 50% more coverage than the best existing method for challenging initial conditions across tasks in the MuJoCo simulator and the Kinetix simulator with pixel observations.
中文标题/摘要
标题:利用强化学习求解可行性未知的参数鲁棒规避问题
深度强化学习(RL)的最新进展在高维控制任务上取得了显著成果,但将其应用于可达性问题时会引发根本性的不匹配:可达性旨在最大化系统保持无限安全状态的初始状态集,而RL则优化用户指定分布的预期回报。这种不匹配可能导致在低概率但仍在安全集内的状态下表现不佳的策略。一种自然的替代方案是将问题重新表述为在初始条件集上进行鲁棒优化,该集指定了初始状态、动力学和安全集,但该问题是否有解取决于指定集的可行性,而这是事先未知的。我们提出了一种可行性引导探索(FGE)方法,该方法同时识别出在其中存在安全策略的可行初始条件子集,并学习在该初始条件集上解决可达性问题的策略。实验证明,FGE在MuJoCo模拟器和Kinetix模拟器(带有像素观察)的复杂初始条件下学习的策略覆盖范围比现有最佳方法高出50%以上。
Summary / 总结
The paper addresses the challenge of applying reinforcement learning (RL) to reachability problems by proposing Feasibility-Guided Exploration (FGE), which simultaneously identifies feasible initial conditions and learns a policy to solve the reachability problem. FGE outperforms existing methods by learning policies with over 50% more coverage on challenging initial conditions across various simulation environments.
研究解决了将强化学习(RL)应用于可达性问题时遇到的挑战,其中RL的目标是最大化预期回报与可达性目标确保广泛状态范围内的安全性相冲突。提出的可行性引导探索(FGE)方法识别可行的初始条件,并学习在这些条件下的策略来解决可达性问题。实验表明,FGE在各种模拟环境中的挑战初始条件下,比现有方法的覆盖率高出50%以上。
Token-Based Audio Inpainting via Discrete Diffusion
Authors: Tali Dror, Iftach Shoham, Moshe Buchris, Oren Gal, Haim Permuter, Gilad Katz, Eliya Nachmani
First: 2025-07-11T06:25:49+00:00 · Latest: 2026-02-17T18:53:15+00:00
Abstract
Audio inpainting seeks to restore missing segments in degraded recordings. Previous diffusion-based methods exhibit impaired performance when the missing region is large. We introduce the first approach that applies discrete diffusion over tokenized music representations from a pre-trained audio tokenizer, enabling stable and semantically coherent restoration of long gaps. Our method further incorporates two training approaches: a derivative-based regularization loss that enforces smooth temporal dynamics, and a span-based absorbing transition that provides structured corruption during diffusion. Experiments on the MusicNet and MAESTRO datasets with gaps up to 750 ms show that our approach consistently outperforms strong baselines across a range of gap lengths, for gaps of 150 ms and above. This work advances musical audio restoration and introduces new directions for discrete diffusion model training. Visit our project page for examples and code.
中文标题/摘要
标题:通过离散扩散实现基于令牌的音频修复
音频修复旨在恢复降级录音中缺失的段落。之前的基于扩散的方法在缺失区域较大时表现出较差的性能。我们提出了第一个在预训练音频分词器的令牌化音乐表示上应用离散扩散的方法,从而实现对长段缺失的稳定且语义一致的修复。我们的方法还进一步结合了两种训练方法:基于导数的正则化损失,以确保平滑的时间动态;以及基于跨度的吸收转换,以在扩散过程中提供结构化的破坏。在MusicNet和MAESTRO数据集上的实验表明,对于150毫秒及以上的缺失段落,我们的方法在不同长度的缺失段落范围内始终优于强大的基线方法。这项工作推进了音乐音频修复,并引入了离散扩散模型训练的新方向。访问我们的项目页面以获取示例和代码。
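The span-based absorbing transition mentioned in the abstract corrupts a contiguous run of tokens rather than independent positions, which mirrors the shape of a real audio gap. The sketch below shows one such corruption step, assuming a hypothetical mask id of -1 and a uniformly random span start; the paper's actual transition schedule is not specified in this digest.

```python
import random

MASK = -1  # hypothetical id for the absorbing (mask) token

def absorb_span(tokens, span_len, rng):
    """Replace one random contiguous span of `tokens` with MASK."""
    if not 0 < span_len <= len(tokens):
        raise ValueError("span_len must be between 1 and len(tokens)")
    start = rng.randrange(len(tokens) - span_len + 1)
    corrupted = list(tokens)
    corrupted[start:start + span_len] = [MASK] * span_len
    return corrupted, start

rng = random.Random(0)                      # seeded for repeatability
tokens = [11, 42, 7, 93, 58, 26, 64, 5]     # made-up audio token ids
corrupted, start = absorb_span(tokens, span_len=3, rng=rng)
print(corrupted, start)
```

A denoising model trained on such inputs learns to fill a whole masked span from the surrounding context, which is the situation it faces when inpainting a long gap.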
Avey-B
Authors: Devang Acharya, Mohammad Hammoud
First: 2026-02-17T18:50:40+00:00 · Latest: 2026-02-17T18:50:40+00:00
Abstract
Compact pretrained bidirectional encoders remain the backbone of industrial NLP under tight compute and memory budgets. Their effectiveness stems from self-attention's ability to deliver high-quality bidirectional contextualization with sequence-level parallelism, as popularized by BERT-style architectures. Recently, Avey was introduced as an autoregressive, attention-free alternative that naturally admits an encoder-only adaptation. In this paper, we reformulate Avey for the encoder-only paradigm and propose several innovations to its architecture, including decoupled static and dynamic parameterizations, stability-oriented normalization, and neural compression. Results show that this reformulated architecture compares favorably to four widely used Transformer-based encoders, consistently outperforming them on standard token-classification and information-retrieval benchmarks while scaling more efficiently to long contexts.
中文标题/摘要
标题:Avey-B
紧凑的预训练双向编码器在计算和内存预算紧张的工业NLP中仍然是骨干。它们的有效性源于自注意力能够通过序列级并行性提供高质量的双向上下文化。最近,Avey 作为一种自回归、无注意力的替代方案被引入,自然地适用于仅编码器的适应。在本文中,我们重新构想了Avey的仅编码器范式,并对其架构提出了多项创新,包括解耦静态和动态参数化、稳定性导向的规范化和神经压缩。结果显示,这种重新构想的架构在标准的标记分类和信息检索基准测试中优于四种广泛使用的基于Transformer的编码器,且在长上下文场景下更具扩展性。
Summary / 总结
This paper reformulates Avey, an autoregressive, attention-free architecture, for the encoder-only paradigm. The authors introduce several architectural innovations: decoupled static and dynamic parameterizations, stability-oriented normalization, and neural compression. According to the abstract, the reformulated architecture consistently outperforms four widely used Transformer-based encoders on standard token-classification and information-retrieval benchmarks, while scaling more efficiently to long contexts.
本文旨在通过增强Avey模型,一种自回归、无注意力机制的编码器,使其在受限资源下更好地应用于NLP任务。作者重新构建了Avey,并引入了诸如静态和动态参数解耦、稳定性导向的规范化和神经压缩等架构创新。实验结果表明,重构后的Avey模型在标准的标记分类和信息检索基准测试中优于四种常用的Transformer编码器,特别是在处理更长的序列时表现出更好的可扩展性。
Horizon Imagination: Efficient On-Policy Rollout in Diffusion World Models
Authors: Lior Cohen, Ofir Nabati, Kaixin Wang, Navdeep Kumar, Shie Mannor
Venue: ICLR 2026
First: 2026-02-08T16:07:04+00:00 · Latest: 2026-02-17T18:50:34+00:00
Comments: This paper will be published in the ICLR 2026 proceedings
Abstract
We study diffusion-based world models for reinforcement learning, which offer high generative fidelity but face critical efficiency challenges in control. Current methods either require heavyweight models at inference or rely on highly sequential imagination, both of which impose prohibitive computational costs. We propose Horizon Imagination (HI), an on-policy imagination process for discrete stochastic policies that denoises multiple future observations in parallel. HI incorporates a stabilization mechanism and a novel sampling schedule that decouples the denoising budget from the effective horizon over which denoising is applied while also supporting sub-frame budgets. Experiments on Atari 100K and Craftium show that our approach maintains control performance with a sub-frame budget of half the denoising steps and achieves superior generation quality under varied schedules. Code is available at https://github.com/leor-c/horizon-imagination.
中文标题/摘要
标题:地平线想象:扩散世界模型中的高效策略回放
我们研究基于扩散的世界模型在强化学习中的应用,这些模型在生成性保真度方面表现出色,但在控制方面面临严重的效率挑战。当前方法要么在推理时需要重型模型,要么依赖高度顺序的想象,这两种方法都会带来巨大的计算成本。我们提出了地平线想象(HI),这是一种针对离散随机策略的在线策略想象过程,可以并行去噪多个未来观察。HI 包含一个稳定机制和一种新颖的采样计划,该计划解耦了去噪预算与实际应用的去噪有效时长,并支持子帧预算。在 Atari 100K 和 Craftium 上的实验表明,我们的方法在子帧预算仅为去噪步骤一半的情况下仍能保持控制性能,并且在各种计划下生成质量更优。代码可在 https://github.com/leor-c/horizon-imagination 获取。
Summary / 总结
The paper addresses the efficiency challenges in using diffusion-based world models for reinforcement learning, where current methods either require heavy models or sequential imagination, both of which are computationally expensive. The authors propose Horizon Imagination (HI), an on-policy imagination process that denoises multiple future observations in parallel, incorporating a stabilization mechanism and a novel sampling schedule. Experiments on Atari 100K and Craftium demonstrate that HI maintains control performance with a sub-frame budget of half the denoising steps and achieves better generation quality under varied schedules.
研究旨在解决使用基于扩散的世界模型进行强化学习时的效率问题,尽管这些模型具有高生成保真度,但需要重型模型或顺序想象,导致高计算成本。提出的Horizon Imagination (HI)方法引入了一种在线想象过程,可以并行去除多个未来观察的噪声,并包含稳定机制和新颖的采样计划。实验结果表明,HI在子帧预算仅为去噪步骤一半的情况下保持了控制性能,并在不同计划下实现了更好的生成质量。
Decision Quality Evaluation Framework at Pinterest
Authors: Yuqi Tian, Robert Paine, Attila Dobi, Kevin O'Sullivan, Aravindh Manickavasagam, Faisal Farooq
First: 2026-02-17T18:45:55+00:00 · Latest: 2026-02-17T18:45:55+00:00
Abstract
Online platforms require robust systems to enforce content safety policies at scale. A critical component of these systems is the ability to evaluate the quality of moderation decisions made by both human agents and Large Language Models (LLMs). However, this evaluation is challenging due to the inherent trade-offs between cost, scale, and trustworthiness, along with the complexity of evolving policies. To address this, we present a comprehensive Decision Quality Evaluation Framework developed and deployed at Pinterest. The framework is centered on a high-trust Golden Set (GDS) curated by subject matter experts (SMEs), which serves as a ground truth benchmark. We introduce an automated intelligent sampling pipeline that uses propensity scores to efficiently expand dataset coverage. We demonstrate the framework's practical application in several key areas: benchmarking the cost-performance trade-offs of various LLM agents, establishing a rigorous methodology for data-driven prompt optimization, managing complex policy evolution, and ensuring the integrity of policy content prevalence metrics via continuous validation. The framework enables a shift from subjective assessments to a data-driven and quantitative practice for managing content safety systems.
中文标题/摘要
标题:Pinterest的决策质量评估框架
在线平台需要强大的系统来大规模执行内容安全政策。这些系统的关键组成部分是评估人类代理和大型语言模型(LLMs)所做审核决策质量的能力。然而,由于成本、规模和可信度之间的固有权衡,以及政策不断演变带来的复杂性,这种评估具有挑战性。为了解决这个问题,我们提出了一个在 Pinterest 开发并部署的全面的决策质量评估框架。该框架以主题专家(SMEs)精心策划的高可信度金集(GDS)为中心,作为真实标注基准。我们引入了一种自动智能抽样管道,使用倾向得分来高效地扩展数据集覆盖范围。我们展示了该框架在几个关键领域的实际应用:基准测试各种 LLM 代理的成本-性能权衡,建立数据驱动的提示优化的严格方法,管理复杂的政策演变,并通过持续验证确保政策内容流行度指标的完整性。该框架使内容安全系统的管理从主观评估转变为数据驱动和定量实践。
Summary / 总结
The research develops a framework for evaluating the quality of moderation decisions made by both human agents and LLMs on online platforms. The method combines a high-trust Golden Set curated by SMEs with an automated intelligent sampling pipeline that uses propensity scores to expand dataset coverage. The abstract reports applications to benchmarking the cost-performance trade-offs of LLM agents, data-driven prompt optimization, managing policy evolution, and validating the integrity of policy content prevalence metrics.
研究旨在开发一个框架来评估在线平台上由人类代理和LLM做出的审核决策的质量。方法包括由专家创建一个高可信度的金集,并使用倾向得分的自动化智能采样管道。关键发现包括该框架在评估LLM代理、数据驱动地优化提示、管理政策演变以及确保内容存在度指标完整性方面的有效性。
The Geometry of Alignment Collapse: When Fine-Tuning Breaks Safety
Authors: Max Springer, Chung Peng Lee, Blossom Metevier, Jane Castleman, Bohdan Turbal, Hayoung Jung, Zeyu Shen, Aleksandra Korolova
First: 2026-02-17T18:39:15+00:00 · Latest: 2026-02-17T18:39:15+00:00
Comments: 27 pages, 4 figures
Abstract
Fine-tuning aligned language models on benign tasks unpredictably degrades safety guardrails, even when training data contains no harmful content and developers have no adversarial intent. We show that the prevailing explanation, that fine-tuning updates should be orthogonal to safety-critical directions in high-dimensional parameter space, offers false reassurance: we show this orthogonality is structurally unstable and collapses under the dynamics of gradient descent. We then resolve this through a novel geometric analysis, proving that alignment concentrates in low-dimensional subspaces with sharp curvature, creating a brittle structure that first-order methods cannot detect or defend. While initial fine-tuning updates may indeed avoid these subspaces, the curvature of the fine-tuning loss generates second-order acceleration that systematically steers trajectories into alignment-sensitive regions. We formalize this mechanism through the Alignment Instability Condition, three geometric properties that, when jointly satisfied, lead to safety degradation. Our main result establishes a quartic scaling law: alignment loss grows with the fourth power of training time, governed by the sharpness of alignment geometry and the strength of curvature coupling between the fine-tuning task and safety-critical parameters. These results expose a structural blind spot in the current safety paradigm. The dominant approaches to safe fine-tuning address only the initial snapshot of a fundamentally dynamic problem. Alignment fragility is not a bug to be patched; it is an intrinsic geometric property of gradient descent on curved manifolds. Our results motivate the development of curvature-aware methods, and we hope will further enable a shift in alignment safety analysis from reactive red-teaming to predictive diagnostics for open-weight model deployment.
中文标题/摘要
标题:对齐崩塌的几何学:微调如何破坏安全性
在无害任务上对齐的语言模型进行微调,即使训练数据中不含有害内容且开发人员无敌意意图,也会不可预测地降低安全性保障。我们展示了当前解释,即微调更新应在高维参数空间中与关键的安全方向正交,提供的是一种虚假的安全感:我们证明这种正交性在动态梯度下降下是结构上不稳定的,并会崩溃。然后我们通过一种新颖的几何分析解决了这一问题,证明了对齐集中在具有尖锐曲率的低维子空间中,形成了微分方法无法检测或防御的脆弱结构。虽然初始微调更新确实可以避开这些子空间,但微调损失的曲率会系统地加速轨迹进入对齐敏感区域。我们通过对齐不稳定性条件这一机制进行了形式化,即当三个几何属性同时满足时,会导致安全性下降。我们的主要结果建立了四次方律:对齐损失随训练时间的四次方增长,由对齐几何的尖锐度和微调任务与关键安全参数之间曲率耦合的强度控制。这些结果揭示了当前安全性范式中的结构性盲点。主流的对齐微调方法仅关注根本上动态问题的初始快照。对齐脆弱性不是需要修补的漏洞,而是梯度下降在曲率流形上的固有几何属性。我们的结果促使开发曲率感知方法,并希望进一步推动对齐安全性分析从反应性红队测试转向预测性诊断,以适应开放权重模型部署。
Summary / 总结
The study investigates how fine-tuning aligned language models on benign tasks can unpredictably degrade safety, even without harmful training data. It challenges the prevailing orthogonality explanation by demonstrating that this orthogonality is unstable and collapses under gradient descent dynamics. The research introduces a geometric analysis showing that alignment concentrates in low-dimensional subspaces with sharp curvature, leading to a brittle structure that first-order methods cannot detect or defend. The main result establishes a quartic scaling law for alignment loss growth, highlighting the need for curvature-aware methods in safe fine-tuning and predictive diagnostics for model deployment.
研究探讨了为什么在良性任务上微调对齐的语言模型会不可预测地降低安全性,即使没有有害的训练数据。它挑战了现有的正交性假设,并引入了一种几何分析,表明对齐集中在具有尖锐曲率的低维子空间中,导致对齐损失随训练时间的四次方增长。研究识别了对齐不稳定性条件,并提出使用曲率感知方法来解决这一问题,建议从被动的红队测试转向预测性诊断,以应对开放权重模型部署中的对齐安全性问题。
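One way to see how a fourth-power law could arise is the toy model below. It is an illustrative assumption, not the paper's derivation: a constant curvature-induced acceleration `accel` pushes the parameters toward an alignment-sensitive direction, so displacement grows like t^2, and a loss that is quadratic in that displacement (with a hypothetical `sharpness`) then grows like t^4. All names and constants are invented for the sketch.

```python
import numpy as np

def toy_alignment_loss(t, accel=0.01, sharpness=5.0):
    """Toy model (assumption): constant acceleration plus a quadratic loss."""
    displacement = 0.5 * accel * t ** 2          # second-order drift, grows like t^2
    return 0.5 * sharpness * displacement ** 2   # quadratic loss, grows like t^4

steps = np.array([10.0, 20.0, 40.0, 80.0])
losses = toy_alignment_loss(steps)

# The slope of log(loss) against log(t) recovers the scaling exponent.
slope = np.polyfit(np.log(steps), np.log(losses), 1)[0]
print(f"fitted exponent: {slope:.3f}")
```

In this toy setting, doubling the training time multiplies the loss by 16, which is the practical meaning of a quartic law.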
GenDA: Generative Data Assimilation on Complex Urban Areas via Classifier-Free Diffusion Guidance
Authors: Francisco Giral, Álvaro Manzano, Ignacio Gómez, Ricardo Vinuesa, Soledad Le Clainche
First: 2026-01-16T17:02:00+00:00 · Latest: 2026-02-17T18:27:19+00:00
Abstract
Urban wind flow reconstruction is essential for assessing air quality, heat dispersion, and pedestrian comfort, yet remains challenging when only sparse sensor data are available. We propose GenDA, a generative data assimilation framework that reconstructs high-resolution wind fields on unstructured meshes from limited observations. The model employs a multiscale graph-based diffusion architecture trained on computational fluid dynamics (CFD) simulations and interprets classifier-free guidance as a learned posterior reconstruction mechanism: the unconditional branch learns a geometry-aware flow prior, while the sensor-conditioned branch injects observational constraints during sampling. This formulation enables obstacle-aware reconstruction and generalization across unseen geometries, wind directions, and mesh resolutions without retraining. We consider both sparse fixed sensors and trajectory-based observations using the same reconstruction procedure. When evaluated against supervised graph neural network (GNN) baselines and classical reduced-order data assimilation methods, GenDA reduces the relative root-mean-square error (RRMSE) by 25-57% and increases the structural similarity index (SSIM) by 23-33% across the tested meshes. Experiments are conducted on Reynolds-averaged Navier-Stokes (RANS) simulations of a real urban neighbourhood in Bristol, United Kingdom, at a characteristic Reynolds number of $\mathrm{Re}\approx2\times10^{7}$, featuring complex building geometry and irregular terrain. The proposed framework provides a scalable path toward generative, geometry-aware data assimilation for environmental monitoring in complex domains.
Summary / 总结
GenDA is a generative data assimilation framework designed to reconstruct high-resolution wind fields in urban areas using limited sensor data. It employs a multiscale graph-based diffusion architecture trained on CFD simulations and uses classifier-free guidance to incorporate observational constraints during sampling. Experiments show that GenDA outperforms supervised GNN baselines and classical data assimilation methods, reducing RRMSE by 25-57% and increasing SSIM by 23-33%. The framework successfully reconstructs wind fields in a complex urban neighborhood in Bristol, UK, with improved accuracy and generalization across different geometries and mesh resolutions.
GenDA 是一种生成式数据同化框架,旨在使用有限的传感器数据重建城市区域的高分辨率风场。该框架采用基于图的多尺度扩散架构,并通过计算流体动力学(CFD)模拟进行训练,利用分类器无指导的方法来学习几何感知的流动先验,并在采样期间注入观测约束。实验表明,GenDA 在 RRMSE 上优于监督图神经网络基线和经典数据同化方法,降低了 25-57%,在 SSIM 上提高了 23-33%。该框架在英国布里斯托尔市一个具有复杂建筑几何形状和不规则地形的真实城市街区的 RANS 模拟中得到了验证,展示了其在复杂几何形状和不规则地形中的有效性。
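The guidance step described above can be sketched with the standard classifier-free guidance rule, which mixes an unconditional prediction with a conditioned one. The abstract does not give GenDA's exact parameterization, so the function name, array shapes, and `guidance_scale` below are assumptions for illustration only.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Standard classifier-free guidance mix of two denoiser outputs.

    eps_uncond: prediction from the unconditional branch (the flow prior).
    eps_cond:   prediction from the sensor-conditioned branch.
    guidance_scale: 0 returns the prior, 1 returns the conditional
    prediction, and values above 1 push further toward the observations.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Hypothetical per-node predictions on a tiny two-node mesh.
prior_pred = np.array([0.0, 1.0])
sensor_pred = np.array([1.0, 3.0])
guided = cfg_combine(prior_pred, sensor_pred, guidance_scale=2.0)
```

Read this way, the guidance scale controls how strongly the sparse sensor readings override the learned geometry-aware prior during sampling.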
NeRFscopy: Neural Radiance Fields for in-vivo Time-Varying Tissues from Endoscopy
Authors: Laura Salort-Benejam, Antonio Agudo
Venue: ISBI 2026
First: 2026-02-17T18:05:23+00:00 · Latest: 2026-02-17T18:05:23+00:00
Comments: ISBI 2026
Abstract
Endoscopy is essential in medical imaging, used for diagnosis, prognosis and treatment. Developing a robust dynamic 3D reconstruction pipeline for endoscopic videos could enhance visualization, improve diagnostic accuracy, aid in treatment planning, and guide surgery procedures. However, challenges arise due to the deformable nature of the tissues, the use of monocular cameras, illumination changes, occlusions and unknown camera trajectories. Inspired by neural rendering, we introduce NeRFscopy, a self-supervised pipeline for novel view synthesis and 3D reconstruction of deformable endoscopic tissues from a monocular video. NeRFscopy includes a deformable model with a canonical radiance field and a time-dependent deformation field parameterized by SE(3) transformations. In addition, the color images are efficiently exploited by introducing sophisticated terms to learn a 3D implicit model without assuming any template or pre-trained model, solely from data. NeRFscopy achieves accurate results in terms of novel view synthesis, outperforming competing methods across various challenging endoscopy scenes.
中文标题/摘要
标题:NeRFscopy:内窥镜下变形组织的神经辐射场实时三维重建
内窥镜在医学成像中至关重要,用于诊断、预后和治疗。开发用于内窥镜视频的稳健动态三维重建管道可以增强可视化,提高诊断准确性,辅助治疗计划,并指导手术程序。然而,由于组织的可变形性、单目相机的使用、照明变化、遮挡和未知的相机轨迹,带来了挑战。受神经渲染的启发,我们提出了NeRFscopy,这是一种自监督的管道,用于从单目视频中合成新颖视图和重建变形内窥镜组织的三维模型。NeRFscopy 包括一个可变形模型,具有一个标准辐射场和一个由 SE(3) 变换参数化的时变变形场。此外,通过引入复杂的术语,NeRFscopy 有效地利用了彩色图像,无需假设任何模板或预训练模型,仅从数据中学习三维隐式模型。NeRFscopy 在各种具有挑战性的内窥镜场景中实现了准确的结果,优于竞争对手的方法。
Summary / 总结
NeRFscopy is a self-supervised pipeline for 3D reconstruction and novel view synthesis of deformable endoscopic tissues from monocular videos. It uses a deformable model with a canonical radiance field and a time-dependent deformation field parameterized by SE(3) transformations. NeRFscopy outperforms competing methods in various challenging endoscopy scenes, demonstrating accurate results in novel view synthesis.
NeRFscopy 是一种自监督的管道,用于从单目视频中重建和生成变形内窥镜组织的 3D 模型。它使用一个变形模型,包含一个标准辐射场和一个时间依赖的变形场,并利用彩色图像来学习一个 3D 隐式模型,无需假设任何模板。该方法在各种具有挑战性的内窥镜场景中表现出色,提高了可视化和诊断准确性。
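A rigid SE(3) transformation, the building block of the deformation field mentioned above, can be sketched as a rotation followed by a translation. In the paper the transformation is predicted per point and per time step by a learned field. Here it is a fixed, hand-picked transform, and the function names are hypothetical.

```python
import numpy as np

def rotation_from_axis_angle(axis_angle):
    """Rodrigues' formula: axis-angle vector (3,) to rotation matrix (3, 3)."""
    axis_angle = np.asarray(axis_angle, dtype=float)
    theta = np.linalg.norm(axis_angle)
    if theta < 1e-12:
        return np.eye(3)
    k = axis_angle / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def apply_se3(points, axis_angle, translation):
    """Apply x -> R x + t to an (N, 3) array of points."""
    R = rotation_from_axis_angle(axis_angle)
    return points @ R.T + np.asarray(translation, dtype=float)

# Rotate a point 90 degrees about the z-axis, then shift it along z.
observed = np.array([[1.0, 0.0, 0.0]])
warped = apply_se3(observed, [0.0, 0.0, np.pi / 2], [0.0, 0.0, 1.0])
```

A deformable radiance field would typically use such a warp to map each observed sample into the canonical frame before querying color and density there.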
Understanding vs. Generation: Navigating Optimization Dilemma in Multimodal Models
Authors: Sen Ye, Mengde Xu, Shuyang Gu, Di He, Liwei Wang, Han Hu
First: 2026-02-17T18:04:13+00:00 · Latest: 2026-02-17T18:04:13+00:00
Comments: Accepted to ICLR2026
Abstract
Current research in multimodal models faces a key challenge where enhancing generative capabilities often comes at the expense of understanding, and vice versa. We analyze this trade-off and identify that the primary cause may be a conflict between generation and understanding, which creates a competitive dynamic within the model. To address this, we propose the Reason-Reflect-Refine (R3) framework. This innovative algorithm re-frames the single-step generation task into a multi-step process of "generate-understand-regenerate". By explicitly leveraging the model's understanding capability during generation, we successfully mitigate the optimization dilemma, achieving stronger generation results and improved understanding ability related to the generation process. This offers valuable insights for designing next-generation unified multimodal models. Code is available at https://github.com/sen-ye/R3.
中文标题/摘要
标题:理解与生成:多模态模型优化困境的导航
当前的多模态模型研究面临一个关键挑战,即增强生成能力往往以牺牲理解能力为代价,反之亦然。我们分析了这种权衡,并认为主要原因是生成与理解之间的潜在冲突,这在模型内部创造了一种竞争动态。为了解决这一问题,我们提出了Reason-Reflect-Refine (R3)框架。这一创新算法将单步生成任务重新构建成“生成-理解-再生”的多步过程。通过明确利用模型的理解能力进行生成,我们成功地缓解了优化困境,实现了更强的生成结果并提高了与生成过程相关的理解能力。这为设计下一代统一多模态模型提供了宝贵的见解。代码可在https://github.com/sen-ye/R3获取。
Summary / 总结
The paper addresses the challenge in multimodal models where improving generative capabilities often diminishes understanding, and vice versa. It proposes the Reason-Reflect-Refine (R3) framework, which restructures the generation process into a multi-step 'generate-understand-regenerate' approach. This method leverages the model's understanding capability during generation, leading to enhanced generation results and improved understanding related to the generation process.
论文探讨了多模态模型中生成能力提升往往会导致理解能力下降,反之亦然的问题。提出了一种Reason-Reflect-Refine (R3)框架,将生成过程重构为‘生成-理解-再生’的多步流程。该框架在生成过程中利用模型的理解能力,从而提高了生成结果的质量并增强了与生成过程相关的理解能力。
Robot-Assisted Social Dining as a White Glove Service
Authors: Atharva S Kashyap, Ugne Aleksandra Morkute, Patricia Alves-Oliveira
First: 2026-02-17T17:58:25+00:00 · Latest: 2026-02-17T17:58:25+00:00
Comments: 20 pages, 9 figures. Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26)
Abstract
Robot-assisted feeding enables people with disabilities who require assistance eating to enjoy a meal independently and with dignity. However, existing systems have only been tested in-lab or in-home, leaving in-the-wild social dining contexts (e.g., restaurants) largely unexplored. Designing a robot for such contexts presents unique challenges, such as dynamic and unsupervised dining environments that a robot needs to account for and respond to. Through speculative participatory design with people with disabilities, supported by semi-structured interviews and a custom AI-based visual storyboarding tool, we uncovered ideal scenarios for in-the-wild social dining. Our key insight suggests that such systems should embody the principles of a white glove service, where the robot (1) supports multimodal inputs and unobtrusive outputs; (2) has contextually sensitive social behavior and prioritizes the user; (3) has expanded roles beyond feeding; and (4) adapts to other relationships at the dining table. Our work has implications for in-the-wild and group contexts of robot-assisted feeding.
中文标题/摘要
标题:机器人辅助社交餐饮作为白手套服务
机器人辅助喂食使需要他人协助进食的残疾人能够独立且有尊严地享受餐饮。然而,现有的系统仅在实验室或家庭中进行测试,社交餐饮环境(如餐馆)等野外场景尚未被充分探索。为这些环境设计机器人带来了独特的挑战,例如机器人需要适应和应对动态且未监督的餐饮环境。通过与残疾人的推测参与式设计,结合半结构化访谈和自定义基于AI的视觉故事板工具,我们发现了野外社交餐饮的理想场景。我们的主要见解表明,此类系统应体现白手套服务的原则,其中机器人(1)支持多模态输入和不显眼的输出;(2)具有上下文敏感的社会行为并优先考虑用户;(3)扩展其角色,不仅限于喂食;(4)适应餐桌上的其他关系。我们的研究对机器人辅助喂食在野外和群体环境中的应用具有重要意义。
Summary / 总结
The research explores how a feeding-assistance robot should be designed for people with disabilities in in-the-wild social dining settings such as restaurants, which remain largely unexplored. Using speculative participatory design, semi-structured interviews, and a custom AI-based visual storyboarding tool, the authors identify four principles of a "white glove service": multimodal inputs with unobtrusive outputs, contextually sensitive social behavior that prioritizes the user, expanded roles beyond feeding, and adaptation to other relationships at the dining table. The findings have implications for in-the-wild and group contexts of robot-assisted feeding.
该研究探讨了在餐厅等公共场合设计机器人辅助进食系统的方案。通过推测性参与设计和访谈,研究人员确定了此类系统的关键原则,包括多模态支持、情境敏感的社会行为、超越进食的扩展角色以及对餐桌其他关系的适应性。研究旨在使残疾人能够在公共场合独立且有尊严地用餐。
GLM-5: from Vibe Coding to Agentic Engineering
Authors: GLM-5 Team, Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chengxing Xie, Cunxiang Wang, Gengzheng Pan, Hao Zeng, Haoke Zhang, Haoran Wang, Huilong Chen, Jiajie Zhang, Jian Jiao, Jiaqi Guo, Jingsen Wang, Jingzhao Du, Jinzhu Wu, Kedong Wang, Lei Li, Lin Fan, Lucen Zhong, Mingdao Liu, Mingming Zhao, Pengfan Du, Qian Dong, Rui Lu, Shuang-Li, Shulin Cao, Song Liu, Ting Jiang, Xiaodong Chen, Xiaohan Zhang, Xuancheng Huang, Xuezhen Dong, Yabo Xu, Yao Wei, Yifan An, Yilin Niu, Yitong Zhu, Yuanhao Wen, Yukuo Cen, Yushi Bai, Zhongpei Qiao, Zihan Wang, Zikang Wang, Zilin Zhu, Ziqiang Liu, Zixuan Li, Bojie Wang, Bosi Wen, Can Huang, Changpeng Cai, Chao Yu, Chen Li, Chen Li, Chenghua Huang, Chengwei Hu, Chenhui Zhang, Chenzheng Zhu, Congfeng Yin, Daoyan Lin, Dayong Yang, Di Wang, Ding Ai, Erle Zhu, Fangzhou Yi, Feiyu Chen, Guohong Wen, Hailong Sun, Haisha Zhao, Haiyi Hu, Hanchen Zhang, Hanrui Liu, Hanyu Zhang, Hao Peng, Hao Tai, Haobo Zhang, He Liu, Hongwei Wang, Hongxi Yan, Hongyu Ge, Huan Liu, Huan Liu, Huanpeng Chu, Jia'ni Zhao, Jiachen Wang, Jiajing Zhao, Jiamin Ren, Jiapeng Wang, Jiaxin Zhang, Jiayi Gui, Jiayue Zhao, Jijie Li, Jing An, Jing Li, Jingwei Yuan, Jinhua Du, Jinxin Liu, Junkai Zhi, Junwen Duan, Kaiyue Zhou, Kangjian Wei, Ke Wang, Keyun Luo, Laiqiang Zhang, Leigang Sha, Liang Xu, Lindong Wu, Lintao Ding, Lu Chen, Minghao Li, Nianyi Lin, Pan Ta, Qiang Zou, Rongjun Song, Ruiqi Yang, Shangqing Tu, Shangtong Yang, Shaoxiang Wu, Shengyan Zhang, Shijie Li, Shuang Li, Shuyi Fan, Wei Qin, Wei Tian, Weining Zhang, Wenbo Yu, Wenjie Liang, Xiang Kuang, Xiangmeng Cheng, Xiangyang Li, Xiaoquan Yan, Xiaowei Hu, Xiaoying Ling, Xing Fan, Xingye Xia, Xinyuan Zhang, Xinze Zhang, Xirui Pan, Xunkai Zhang, Yandong Wu, Yanfu Li, Yidong Wang, Yifan Zhu, Yijun Tan, Yilin Zhou, Yiming Pan, Ying Zhang, Yinpei Su, Yipeng Geng, Yipeng Geng, Yong Yan, Yonglin Tan, Yuean Bi, Yuhan Shen, Yuhao Yang, Yujiang Li, Yunan Liu, Yunqing Wang, Yuntao Li, Yurong Wu, Yutao Zhang, Yuxi Duan, Yuxuan Zhang, Zezhen Liu, Zhengtao Jiang, Zhenhe Yan, Zheyu Zhang, Zhixiang Wei, Zhuo Chen, Zhuoer Feng, Zijun Yao, Ziwei Chai, Ziyuan Wang, Zuzhou Zhang, Bin Xu, Minlie Huang, Hongning Wang, Juanzi Li, Yuxiao Dong, Jie Tang
First: 2026-02-17T17:50:56+00:00 · Latest: 2026-02-17T17:50:56+00:00
Abstract
We present GLM-5, a next-generation foundation model designed to transition the paradigm of vibe coding to agentic engineering. Building upon the agentic, reasoning, and coding (ARC) capabilities of its predecessor, GLM-5 adopts DSA to significantly reduce training and inference costs while maintaining long-context fidelity. To advance model alignment and autonomy, we implement a new asynchronous reinforcement learning infrastructure that drastically improves post-training efficiency by decoupling generation from training. Furthermore, we propose novel asynchronous agent RL algorithms that further improve RL quality, enabling the model to learn from complex, long-horizon interactions more effectively. Through these innovations, GLM-5 achieves state-of-the-art performance on major open benchmarks. Most critically, GLM-5 demonstrates unprecedented capability in real-world coding tasks, surpassing previous baselines in handling end-to-end software engineering challenges. Code, models, and more information are available at https://github.com/zai-org/GLM-5.
中文标题/摘要
标题:GLM-5:从 vibe 编码到能动工程
我们介绍了 GLM-5,这是一种下一代基础模型,旨在将 vibe 编码的范式转变为能动工程。GLM-5 基于其前代的能动、推理和编码(ARC)能力,采用 DSA 显著降低训练和推理成本,同时保持长上下文保真度。为了提高模型对齐和自主性,我们实现了一种新的异步强化学习基础设施,通过将生成与训练解耦来大幅提高后训练效率。此外,我们提出了新的异步代理 RL 算法,进一步提高 RL 质量,使模型能够更有效地从复杂的、长期的交互中学习。通过这些创新,GLM-5 在主要的开放基准测试中达到了最先进的性能。最关键的是,GLM-5 在实际编码任务中展示了前所未有的能力,超越了之前的基线,在处理端到端的软件工程挑战方面表现更佳。代码、模型和更多信息可在 https://github.com/zai-org/GLM-5 获取。
Summary / 总结
GLM-5 is a next-generation foundation model aimed at transitioning from vibe coding to agentic engineering. It leverages DSA to reduce costs while maintaining long-context fidelity and introduces an asynchronous reinforcement learning infrastructure to improve post-training efficiency. GLM-5 achieves state-of-the-art performance on major benchmarks and excels in real-world coding tasks, surpassing previous baselines in handling end-to-end software engineering challenges.
GLM-5 是一个下一代基础模型,从 vibe 编码过渡到 agentic 工程。它利用 DSA 减少成本同时保持长上下文保真度,并引入异步强化学习基础设施以提高训练后效率。GLM-5 在主要基准测试中达到最先进的性能,并在实际编码任务中表现出色,超越了之前的基线模型在端到端软件工程挑战中的表现。
Online GPU Energy Optimization with Switching-Aware Bandits
Authors: Xiongxiao Xu, Solomon Abera Bekele, Brice Videau, Kai Shu
Venue: WWW
First: 2024-10-03T17:05:34+00:00 · Latest: 2026-02-17T17:45:38+00:00
Comments: ACM Web Conference 2026 (WWW'26)
Abstract
Energy consumption has become a bottleneck for future computing architectures, from wearable devices to leadership-class supercomputers. Existing energy management techniques largely target CPUs, even though GPUs now dominate power draw in heterogeneous high performance computing (HPC) systems. Moreover, many prior methods rely on either purely offline or hybrid offline and online training, which is impractical and results in energy inefficiencies during data collection. In this paper, we introduce a practical online GPU energy optimization problem in HPC scenarios. The problem is challenging because (1) GPU frequency scaling exhibits performance-energy trade-offs, (2) online control must balance exploration and exploitation, and (3) frequent frequency switching incurs non-trivial overhead and degrades quality of service (QoS). To address the challenges, we formulate online GPU energy optimization as a multi-armed bandit problem and propose EnergyUCB, a lightweight UCB-based controller that dynamically adjusts GPU core frequency in real time to save energy. Specifically, EnergyUCB (1) defines a reward that jointly captures energy and performance using a core-to-uncore utilization ratio as a proxy for GPU throughput, (2) employs optimistic initialization and UCB-style confidence bonuses to accelerate learning from scratch, and (3) incorporates a switching-aware UCB index and a QoS-constrained variant that enforce explicit slowdown budgets while discouraging unnecessary frequency oscillations. Extensive experiments on real-world workloads from the world's third fastest supercomputer Aurora show that EnergyUCB achieves substantial energy savings with modest slowdown and that the QoS-constrained variant reliably respects user-specified performance budgets.
中文标题/摘要
标题:基于切换感知的在线GPU能源优化
能源消耗已成为未来计算架构的瓶颈,从可穿戴设备到领导级超级计算机都是如此。现有的能源管理技术主要针对CPU,尽管在异构高性能计算(HPC)系统中,GPU现在主导了能耗。此外,许多先前的方法依赖于纯离线或混合离线和在线训练,这在数据收集过程中是不切实际的,导致能源效率低下。在本文中,我们介绍了在HPC场景下的实用在线GPU能源优化问题。该问题具有挑战性,因为(1)GPU频率缩放表现出性能与能源之间的权衡,(2)在线控制必须在探索和利用之间取得平衡,(3)频繁的频率切换会产生非平凡的开销并降低服务质量(QoS)。为了解决这些挑战,我们将在线GPU能源优化形式化为多臂老虎机问题,并提出了一种轻量级的基于UCB的控制器EnergyUCB,该控制器能够实时动态调整GPU核心频率以节省能源。具体而言,EnergyUCB(1)定义了一个奖励,该奖励通过核心到非核心利用率比作为GPU吞吐量的代理来联合捕捉能源和性能,(2)采用乐观初始化和UCB风格的信心奖金来加速从零开始的学习,(3)引入了切换感知的UCB指数和一个服务质量约束变体,该变体强制执行明确的减速预算,同时抑制不必要的频率振荡。在世界上第三快的超级计算机Aurora的真实工作负载上进行的广泛实验表明,EnergyUCB在适度降低性能的情况下实现了显著的能源节省,并且服务质量约束变体可靠地遵守了用户指定的性能预算。
Summary / 总结
This paper addresses online GPU energy optimization in high-performance computing scenarios. It formulates the problem as a multi-armed bandit and proposes EnergyUCB, a lightweight UCB-based controller that adjusts GPU core frequency in real time to trade energy savings against performance. The method combines a reward based on the core-to-uncore utilization ratio, optimistic initialization, a switching-aware UCB index that discourages unnecessary frequency oscillations, and a QoS-constrained variant that enforces explicit slowdown budgets. Experiments on real-world workloads from the Aurora supercomputer show substantial energy savings with modest slowdown, and the QoS-constrained variant reliably respects user-specified performance budgets.
本文针对HPC场景下的在线GPU能量优化问题,其中GPU频率调整和频繁切换导致能量效率低下。作者提出了一种基于UCB的控制器EnergyUCB,能够动态调整GPU核心频率以平衡能量节省和性能。EnergyUCB使用基于核心到非核心利用率的奖励指标、乐观初始化和切换感知的UCB指数来最小化能量消耗并遵守性能预算。实验结果表明,EnergyUCB在性能影响较小的情况下实现了显著的能量节省,并且能够可靠地遵守用户指定的性能约束。
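The idea of a switching-aware UCB index can be sketched as ordinary UCB1 with a fixed penalty subtracted from every arm other than the one currently in use. This is a simplified assumption, not EnergyUCB's actual index: the reward function, the penalty value, and the round-robin warm-up (used here in place of the optimistic initialization the abstract mentions) are all hypothetical.

```python
import math
import random

def switching_aware_ucb(reward_fn, n_arms, horizon, switch_penalty=0.05, seed=0):
    """UCB1 with a penalty on leaving the current arm (illustrative sketch)."""
    rng = random.Random(seed)
    counts = [0] * n_arms      # number of pulls per arm (frequency level)
    means = [0.0] * n_arms     # running mean reward per arm
    current = None             # arm used in the previous round
    switches = 0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1        # warm-up: try every arm once
        else:
            def index(a):
                bonus = math.sqrt(2.0 * math.log(t) / counts[a])
                penalty = switch_penalty if a != current else 0.0
                return means[a] + bonus - penalty
            arm = max(range(n_arms), key=index)
        if current is not None and arm != current:
            switches += 1
        current = arm
        reward = reward_fn(arm, rng)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, switches

def toy_reward(arm, rng):
    """Hypothetical noisy reward; arm 2 stands in for the best frequency."""
    return [0.2, 0.5, 0.8][arm] + rng.uniform(-0.05, 0.05)

counts, switches = switching_aware_ucb(toy_reward, n_arms=3, horizon=2000)
print(counts, switches)
```

Raising `switch_penalty` trades slower adaptation for fewer frequency changes, which is the tension the switching-aware design is meant to manage.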
ChartEditBench: Evaluating Grounded Multi-Turn Chart Editing in Multimodal Language Models
Authors: Manav Nitin Kapadnis, Lawanya Baghel, Atharva Naik, Carolyn Rosé
First: 2026-02-17T17:45:34+00:00 · Latest: 2026-02-17T17:45:34+00:00
Comments: 16 pages, 13 figures including Supplementary Material
Abstract
While Multimodal Large Language Models (MLLMs) perform strongly on single-turn chart generation, their ability to support real-world exploratory data analysis remains underexplored. In practice, users iteratively refine visualizations through multi-turn interactions that require maintaining common ground, tracking prior edits, and adapting to evolving preferences. We introduce ChartEditBench, a benchmark for incremental, visually grounded chart editing via code, comprising 5,000 difficulty-controlled modification chains and a rigorously human-verified subset. Unlike prior one-shot benchmarks, ChartEditBench evaluates sustained, context-aware editing. We further propose a robust evaluation framework that mitigates limitations of LLM-as-a-Judge metrics by integrating execution-based fidelity checks, pixel-level visual similarity, and logical code verification. Experiments with state-of-the-art MLLMs reveal substantial degradation in multi-turn settings due to error accumulation and breakdowns in shared context, with strong performance on stylistic edits but frequent execution failures on data-centric transformations. ChartEditBench establishes a challenging testbed for grounded, intent-aware multimodal programming.
中文标题/摘要
标题:ChartEditBench:评估多轮图表编辑能力的多模态语言模型
虽然多模态大型语言模型(MLLMs)在单轮图表生成方面表现出色,但它们支持实际探索性数据分析的能力仍被忽视。实际上,用户通过多轮交互迭代细化可视化,这需要保持共同理解,跟踪先前的编辑,并适应不断变化的偏好。我们引入了ChartEditBench,这是一个通过代码实现增量、视觉接地图表编辑的基准,包含5000个难度控制的修改链和严格的人工验证子集。与之前的单次编辑基准不同,ChartEditBench评估持续的、上下文相关的编辑。我们还提出了一种稳健的评估框架,通过结合执行基础的准确性检查、像素级视觉相似性和逻辑代码验证,缓解了LLM作为评判者的度量标准的局限性。实验表明,最先进的MLLMs在多轮设置中由于错误累积和共享上下文的失败而表现大幅下降,但在风格编辑方面表现出色,但在数据为中心的转换方面频繁出现执行失败。ChartEditBench为接地的、意图感知的多模态编程建立了具有挑战性的测试平台。
Beyond Binary Classification: Detecting Fine-Grained Sexism in Social Media Videos
Authors: Laura De Grazia, Danae Sánchez Villegas, Desmond Elliott, Mireia Farrús, Mariona Taulé
First: 2026-02-17T17:45:28+00:00 · Latest: 2026-02-17T17:45:28+00:00
Abstract
Online sexism appears in various forms, which makes its detection challenging. Although automated tools can enhance the identification of sexist content, they are often restricted to binary classification. Consequently, more subtle manifestations of sexism may remain undetected due to the lack of fine-grained, context-sensitive labels. To address this issue, we make the following contributions: (1) we present FineMuSe, a new multimodal sexism detection dataset in Spanish that includes both binary and fine-grained annotations; (2) we introduce a comprehensive hierarchical taxonomy that encompasses forms of sexism, non-sexism, and rhetorical devices of irony and humor; and (3) we evaluate a wide range of LLMs for both binary and fine-grained sexism detection. Our findings indicate that multimodal LLMs perform competitively with human annotators in identifying nuanced forms of sexism; however, they struggle to capture co-occurring sexist types when these are conveyed through visual cues.
中文标题/摘要
标题:超越二元分类:在社交媒体视频中检测细微性别歧视
在线性别歧视以多种形式出现,这使其检测变得具有挑战性。尽管自动化工具可以增强对性别歧视内容的识别,但它们通常仅限于二元分类。因此,由于缺乏细粒度和上下文敏感的标签,更微妙的性别歧视形式可能会被遗漏。为了解决这一问题,我们做出了以下贡献:(1) 我们提出了一个新的西班牙语多模态性别歧视检测数据集 FineMuSe,其中包括二元和细粒度注释;(2) 我们引入了一个全面的层次分类体系,涵盖了性别歧视、非性别歧视以及讽刺和幽默的修辞手法;(3) 我们评估了多种语言模型在二元和细粒度性别歧视检测中的表现。我们的研究结果表明,多模态语言模型在识别微妙的性别歧视形式方面与人类注释者具有竞争力;然而,它们在通过视觉线索传达多种性别歧视类型时难以捕捉到这些类型。
Summary / 总结
This study addresses the detection of varied forms of online sexism by introducing FineMuSe, a new multimodal dataset in Spanish with both binary and fine-grained annotations. The researchers also develop a hierarchical taxonomy covering forms of sexism, non-sexism, and the rhetorical devices of irony and humor. Evaluating a wide range of LLMs, they find that multimodal LLMs perform competitively with human annotators in identifying nuanced forms of sexism but struggle to capture co-occurring sexist types when these are conveyed through visual cues.
研究旨在通过超越二元分类来检测社交视频中的性别歧视,识别更微妙的形式。它引入了FineMuSe,一个包含细粒度注释的多模态数据集,用于西班牙语,并评估了各种LLM在二元和细粒度性别歧视检测中的表现。研究发现,多模态LLM在识别微妙的性别歧视方面表现良好,但在通过视觉线索捕捉多种同时出现的性别歧视类型时存在困难。
Scale-Invariant Regret Matching and Online Learning with Optimal Convergence: Bridging Theory and Practice in Zero-Sum Games
Authors: Brian Hu Zhang, Ioannis Anagnostides, Tuomas Sandholm
First: 2025-10-06T00:33:20+00:00 · Latest: 2026-02-17T17:41:09+00:00
Comments: Compared to the previous version, this version includes new results on harmonic games and extensive-form games. Abstract abridged due to arXiv length constraints
Abstract
A considerable chasm has been looming for decades between theory and practice in zero-sum game solving through first-order methods. Although a convergence rate of $T^{-1}$ has long been established, the most effective paradigm in practice is counterfactual regret minimization (CFR), which is based on regret matching and its modern variants. In particular, the state of the art across most benchmarks is predictive regret matching$^+$ (PRM$^+$). Yet, such algorithms can exhibit slower $T^{-1/2}$ convergence even in self-play. In this paper, we close the gap between theory and practice. We propose a new scale-invariant and parameter-free variant of PRM$^+$, which we call IREG-PRM$^+$. We show that it achieves $T^{-1/2}$ best-iterate and $T^{-1}$ (i.e., optimal) average-iterate convergence guarantees, while also being on par or even better relative to PRM$^+$ on benchmark games. From a technical standpoint, we draw an analogy between (IREG-)PRM$^+$ and optimistic gradient descent with adaptive learning rate. Reflecting this theoretical bridge, we find that the adaptive version of optimistic gradient descent we consider performs on par with IREG-PRM$^+$. This demystifies the effectiveness of the regret matching family vis-a-vis more standard optimization techniques. Moreover, we extend our analysis beyond zero-sum games to a family of variational inequality problems that includes harmonic games, as well as extensive-form games with fully-mixed equilibria, via a new and intriguing connection between CFR and harmonic games. Unlike prior work in harmonic games, our algorithms do not require knowing the underlying weights by virtue of scale invariance. Under the weighted Minty condition, we show that any algorithm satisfying a scale-invariant RVU property (such as IREG-PRM$^+$) has constant regret (in self-play) and $T^{-1/2}$ iterate convergence.
中文标题/摘要
标题:尺度不变的遗憾匹配与在线学习:零和博弈中理论与实践的桥梁
多年来,零和博弈通过一阶方法求解在理论与实践之间存在显著的鸿沟。尽管已经证明了$T^{-1}$的收敛率,但在实践中最有效的范式是反事实遗憾最小化(CFR),它基于遗憾匹配及其现代变体。特别是在大多数基准测试中,最先进的算法是预测遗憾匹配$^+$(PRM$^+$)。然而,即使在自博弈中,这些算法也可能表现出较慢的$T^{-1/2}$收敛率。 在本文中,我们弥合了理论与实践之间的差距。我们提出了一种新的尺度不变且无需参数的PRM$^+$变体,称为IREG-PRM$^+$。我们证明了它实现了$T^{-1/2}$最优迭代和$T^{-1}$(即最优)平均迭代收敛保证,同时在基准游戏中与PRM$^+$相比表现相当甚至更好。从技术角度来看,我们将(IREG-)PRM$^+$与具有自适应学习率的乐观梯度下降进行了类比。反映这种理论桥梁,我们发现我们考虑的自适应版本的乐观梯度下降与IREG-PRM$^+$表现相当。这揭示了遗憾匹配家族相对于更标准的优化技术的有效性。 此外,我们通过CFR与谐波博弈之间的一种新奇联系,将分析扩展到包括谐波博弈以及具有完全混合均衡的扩展形式博弈的变分不等式问题家族。不同于谐波博弈的先前工作,我们的算法由于尺度不变性而无需知道底层权重。在加权Minty条件下,我们证明了任何满足尺度不变RVU性质(如IREG-PRM$^+$)的算法在自博弈中具有恒定的遗憾($T^{-1/2}$迭代收敛)。
Summary / 总结
This paper addresses the gap between theory and practice in zero-sum game solving with first-order methods. It introduces IREG-PRM$^+$, a new scale-invariant and parameter-free variant of predictive regret matching$^+$ (PRM$^+$), which achieves $T^{-1/2}$ best-iterate and optimal $T^{-1}$ average-iterate convergence while performing on par with or better than PRM$^+$ on benchmark games. The analysis draws an analogy to optimistic gradient descent with an adaptive learning rate, which helps explain the effectiveness of the regret matching family. The paper also extends the analysis to a family of variational inequality problems that includes harmonic games and extensive-form games with fully-mixed equilibria, with algorithms that do not require knowing the underlying weights.
本文解决了零和博弈中基于一阶方法的理论性能与实际性能之间的差距。提出了一种新的尺度不变变体IREG-PRM+,该方法实现了最优的$T^{-1}$平均迭代收敛和$T^{-1/2}$最佳迭代收敛,匹配或优于PRM+在基准游戏上的表现。理论分析表明该方法与具有自适应学习率的乐观梯度下降方法有相似之处,并将方法扩展到包括谐波游戏和扩展形式博弈的变分不等式问题,无需知道底层权重。
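For readers unfamiliar with the regret matching family, the sketch below runs plain regret matching$^+$ (RM$^+$) in self-play on a small zero-sum matrix game. It is not PRM$^+$ or IREG-PRM$^+$: there is no prediction step and no scale-invariant modification. The payoff matrix is a hypothetical example whose mixed equilibrium places probability 0.4 on the first action for both players.

```python
import numpy as np

def rm_plus_strategy(q):
    """Play proportionally to the clipped regrets; uniform if all are zero."""
    total = q.sum()
    return q / total if total > 0 else np.full(len(q), 1.0 / len(q))

def rm_plus_self_play(A, iterations):
    """Row player maximizes x^T A y, column player minimizes it."""
    n, m = A.shape
    qx, qy = np.zeros(n), np.zeros(m)        # clipped cumulative regrets
    avg_x, avg_y = np.zeros(n), np.zeros(m)  # running sums of strategies
    for _ in range(iterations):
        x, y = rm_plus_strategy(qx), rm_plus_strategy(qy)
        ux = A @ y            # row player's utility for each action
        uy = -(A.T @ x)       # column player's utility for each action
        # RM+ update: add instantaneous regrets, then clip at zero.
        qx = np.maximum(qx + ux - x @ ux, 0.0)
        qy = np.maximum(qy + uy - y @ uy, 0.0)
        avg_x += x
        avg_y += y
    return avg_x / iterations, avg_y / iterations

def duality_gap(A, x, y):
    """Sum of both players' best-response gains; zero at equilibrium."""
    return (A @ y).max() - (A.T @ x).min()

A = np.array([[2.0, -1.0], [-1.0, 1.0]])
x_bar, y_bar = rm_plus_self_play(A, 10000)
gap = duality_gap(A, x_bar, y_bar)
print(x_bar, y_bar, gap)
```

The duality gap of the averaged strategies is the quantity whose decay rate ($T^{-1/2}$ versus $T^{-1}$) the convergence results above refer to.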
RaCo: Ranking and Covariance for Practical Learned Keypoints
Authors: Abhiram Shenoi, Philipp Lindenberger, Paul-Edouard Sarlin, Marc Pollefeys
First: 2026-02-17T17:39:52+00:00 · Latest: 2026-02-17T17:39:52+00:00
Abstract
This paper introduces RaCo, a lightweight neural network designed to learn robust and versatile keypoints suitable for a variety of 3D computer vision tasks. The model integrates three key components: the repeatable keypoint detector, a differentiable ranker to maximize matches with a limited number of keypoints, and a covariance estimator to quantify spatial uncertainty in metric scale. Trained on perspective image crops only, RaCo operates without the need for covisible image pairs. It achieves strong rotational robustness through extensive data augmentation, even without the use of computationally expensive equivariant network architectures. The method is evaluated on several challenging datasets, where it demonstrates state-of-the-art performance in keypoint repeatability and two-view matching, particularly under large in-plane rotations. Ultimately, RaCo provides an effective and simple strategy to independently estimate keypoint ranking and metric covariance without additional labels, detecting interpretable and repeatable interest points. The code is available at https://github.com/cvg/RaCo.
中文标题/摘要
标题:RaCo:排序与协方差用于实用的所学关键点
本文介绍了RaCo,这是一种轻量级神经网络,旨在学习适用于多种3D计算机视觉任务的稳健且多功能的关键点。该模型集成了三个关键组件:可重复的关键点检测器、可微分排序器以在有限的关键点数量下最大化匹配,以及协方差估计器以在度量尺度上量化空间不确定性。仅在透视图像剪辑上进行训练,RaCo无需使用共视图像对即可运行。通过广泛的图像增强,它实现了强大的旋转鲁棒性,即使不使用计算成本高昂的等变网络架构也是如此。该方法在几个具有挑战性的数据集上进行了评估,其中在关键点重复性和两视图匹配方面均表现出最先进的性能,尤其是在大平面旋转下。最终,RaCo提供了一种有效且简单的策略,独立估计关键点排序和度量协方差,无需额外标签即可检测可解释且可重复的兴趣点。代码可在https://github.com/cvg/RaCo 获取。
Summary / 总结
The paper presents RaCo, a lightweight neural network for learning robust and versatile keypoints for 3D computer vision tasks. It combines a repeatable keypoint detector, a differentiable ranker, and a covariance estimator. Trained on perspective image crops, RaCo achieves strong rotational robustness through data augmentation and demonstrates state-of-the-art performance in keypoint repeatability and two-view matching, especially under large rotations. The method independently estimates keypoint ranking and metric covariance without additional labels, providing interpretable and repeatable interest points.
RaCo 是一种轻量级神经网络,用于学习适用于多种 3D 计算机视觉任务的稳健且多功能的关键点。它结合了重复关键点检测器、可微排序器和协方差估计器。通过数据增强训练,RaCo 实现了强大的旋转鲁棒性,并在关键点重复性和两视图匹配方面达到了最先进的性能,尤其是在大平面旋转下。该方法无需额外标签即可独立估计关键点排序和度量协方差,使其简单且有效。代码可在 https://github.com/cvg/RaCo 获取。
PolySHAP: Extending KernelSHAP with Interaction-Informed Polynomial Regression
Authors: Fabian Fumagalli, R. Teal Witter, Christopher Musco
Venue: ICLR 2026
First: 2026-01-26T15:47:45+00:00 · Latest: 2026-02-17T17:39:03+00:00
Comments: Published at ICLR 2026: https://openreview.net/forum?id=M19J8UGguq
Abstract
Shapley values have emerged as a central game-theoretic tool in explainable AI (XAI). However, computing Shapley values exactly requires $2^d$ game evaluations for a model with $d$ features. Lundberg and Lee's KernelSHAP algorithm has emerged as a leading method for avoiding this exponential cost. KernelSHAP approximates Shapley values by approximating the game as a linear function, which is fit using a small number of game evaluations for random feature subsets. In this work, we extend KernelSHAP by approximating the game via higher degree polynomials, which capture non-linear interactions between features. Our resulting PolySHAP method yields empirically better Shapley value estimates for various benchmark datasets, and we prove that these estimates are consistent. Moreover, we connect our approach to paired sampling (antithetic sampling), a ubiquitous modification to KernelSHAP that improves empirical accuracy. We prove that paired sampling outputs exactly the same Shapley value approximations as second-order PolySHAP, without ever fitting a degree 2 polynomial. To the best of our knowledge, this finding provides the first strong theoretical justification for the excellent practical performance of the paired sampling heuristic.
中文标题/摘要
标题:PolySHAP:扩展KernelSHAP的特征交互启发式多项式回归
Shapley值已成为可解释人工智能(XAI)中的一种核心博弈论工具。然而,精确计算Shapley值需要对具有d个特征的模型进行$2^d$次博弈评估。Lundberg和Lee的KernelSHAP算法已成为避免这种指数级成本的主要方法。KernelSHAP通过将博弈近似为线性函数来近似Shapley值,该线性函数通过少量随机特征子集的博弈评估进行拟合。 在这项工作中,我们通过使用更高阶的多项式来近似博弈,从而扩展了KernelSHAP,这些多项式能够捕捉特征之间的非线性交互。我们的PolySHAP方法在各种基准数据集上提供了经验上更好的Shapley值估计,并且我们证明了这些估计是一致的。 此外,我们将我们的方法与配对采样(反向采样)联系起来,这是一种广泛应用于KernelSHAP以提高经验准确性的修改。我们证明了配对采样输出与二次PolySHAP相同的Shapley值近似,而从未拟合过二次多项式。据我们所知,这一发现为配对采样启发式的出色实际性能提供了第一个强有力的理论依据。
Summary / 总结
This paper introduces PolySHAP, an extension of KernelSHAP that uses higher-degree polynomial regression to approximate Shapley values, thereby capturing non-linear feature interactions. This method provides better Shapley value estimates on benchmark datasets and is proven to be consistent. The authors also show that paired sampling, a common heuristic in KernelSHAP, is equivalent to second-order PolySHAP without fitting a degree 2 polynomial, offering a theoretical basis for its practical success.
本文提出了PolySHAP,这是一种扩展的KernelSHAP方法,使用高次多项式回归来近似Shapley值,捕捉特征之间的非线性交互。该方法在各种数据集上提供了更好的Shapley值估计,并被证明是一致的。作者还表明,KernelSHAP中常用的配对采样等价于二次PolySHAP,而无需拟合二次多项式,这为其实用性能提供了理论解释。
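As a small illustration of the $2^d$ cost mentioned in the abstract, the following sketch computes exact Shapley values by enumerating every feature subset. It is not the PolySHAP or KernelSHAP estimator; `exact_shapley` and `toy_game` are hypothetical names, and the toy game (three additive features plus one pairwise interaction) is an assumption chosen only so the result can be checked by hand.

```python
from itertools import combinations
from math import factorial


def exact_shapley(value, d):
    """Exact Shapley values of a d-player game via full subset enumeration.

    `value` maps a frozenset of player indices to a number. Across all
    players the loop touches every one of the 2^d subsets, which is the
    exponential cost that sampling-based estimators try to avoid.
    """
    phi = [0.0] * d
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for size in range(d):
            # Shapley weight for coalitions of this size that exclude player i.
            weight = factorial(size) * factorial(d - size - 1) / factorial(d)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_game(s):
    # Hypothetical game: additive effects plus an interaction between 0 and 1.
    total = 1.0 * (0 in s) + 2.0 * (1 in s) + 3.0 * (2 in s)
    if 0 in s and 1 in s:
        total += 4.0
    return total


phi = exact_shapley(toy_game, 3)
```

In this toy game the interaction term of 4 is split equally between features 0 and 1, and the values sum to the payoff of the full feature set.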
Beyond Match Maximization and Fairness: Retention-Optimized Two-Sided Matching
Authors: Ren Kishimoto, Rikiya Takehi, Koichi Tanaka, Masahiro Nomura, Riku Togashi, Yoji Tomita, Yuta Saito
Venue: ICLR 2026
First: 2026-02-17T17:30:53+00:00 · Latest: 2026-02-17T17:30:53+00:00
Comments: Published as a conference paper at ICLR 2026
Abstract
On two-sided matching platforms such as online dating and recruiting, recommendation algorithms often aim to maximize the total number of matches. However, this objective creates an imbalance, where some users receive far too many matches while many others receive very few and eventually abandon the platform. Retaining users is crucial for many platforms, such as those that depend heavily on subscriptions. Some may use fairness objectives to solve the problem of match maximization. However, fairness in itself is not the ultimate objective for many platforms, as users do not suddenly reward the platform simply because exposure is equalized. In practice, where user retention is often the ultimate goal, casually relying on fairness will leave the optimization of retention up to luck. In this work, instead of maximizing matches or axiomatically defining fairness, we formally define the new problem setting of maximizing user retention in two-sided matching platforms. To this end, we introduce a dynamic learning-to-rank (LTR) algorithm called Matching for Retention (MRet). Unlike conventional algorithms for two-sided matching, our approach models user retention by learning personalized retention curves from each user's profile and interaction history. Based on these curves, MRet dynamically adapts recommendations by jointly considering the retention gains of both the user receiving recommendations and those who are being recommended, so that limited matching opportunities can be allocated where they most improve overall retention. Naturally but importantly, empirical evaluations on synthetic and real-world datasets from a major online dating platform show that MRet achieves higher user retention, since conventional methods optimize matches or fairness rather than retention.
中文标题/摘要
标题:超越匹配最大化与公平性:优化用户留存的双边匹配
在如在线约会和招聘等双边匹配平台上,推荐算法通常旨在最大化总的匹配数量。然而,这一目标会导致不平衡,一些用户收到过多匹配,而许多其他用户则收到很少甚至没有匹配,最终放弃平台。用户留存对于许多平台至关重要,尤其是那些依赖订阅的平台。一些人可能会使用公平性目标来解决匹配最大化的问题。然而,对于许多平台来说,公平性本身并不是最终目标,因为用户并不会因为曝光平等而突然奖励平台。在实践中,当用户留存往往是最终目标时,随意依赖公平性会使优化留存依赖于运气。 在这项工作中,我们不再最大化匹配或公理化定义公平性,而是正式定义了在双边匹配平台上最大化用户留存的新问题设置。为此,我们引入了一种动态学习排序(LTR)算法,称为用户留存匹配(MRet)。与传统的双边匹配算法不同,我们的方法通过从每个用户的个人资料和互动历史中学习个性化留存曲线来建模用户留存。基于这些曲线,MRet动态调整推荐,同时考虑接收推荐的用户和被推荐用户的留存收益,以便将有限的匹配机会分配到最能提高整体留存的地方。自然地但重要的是,对来自一家主要在线约会平台的合成和真实数据集的实证评估表明,MRet实现了更高的用户留存,因为传统方法优化匹配或公平性而非留存。
Summary / 总结
This work addresses the issue of user retention in two-sided matching platforms by proposing a new objective of maximizing user retention, rather than just maximizing matches or enforcing fairness. The authors introduce MRet, a dynamic learning-to-rank algorithm that learns personalized retention curves from user profiles and interaction histories to adaptively optimize recommendations. Empirical evaluations on both synthetic and real-world datasets demonstrate that MRet outperforms conventional methods in terms of user retention.
该研究针对两方匹配平台中的用户留存问题,提出了最大化用户留存的新问题。引入了MRet算法,该算法通过学习用户的个人留存曲线和互动历史来动态调整推荐。实证评估表明,MRet在用户留存方面优于传统方法,后者侧重于最大化匹配或公平性而非留存。
SSL4EO-S12 v1.1: A Multimodal, Multiseasonal Dataset for Pretraining, Updated
Authors: Benedikt Blumenstiel, Nassim Ait Ali Braham, Conrad M Albrecht, Stefano Maurogiovanni, Paolo Fraccaro
First: 2025-02-28T20:30:56+00:00 · Latest: 2026-02-17T17:17:14+00:00
Abstract
This work presents SSL4EO-S12 v1.1, a multimodal, multitemporal Earth Observation dataset designed for pretraining large-scale foundation models. Building on the success of SSL4EO-S12, this extension updates the previous version to fix geospatial alignment inaccuracies and the inefficient data structure. The dataset allows low-barrier, analysis-ready data loading while maintaining the predecessor's spatial coverage of the world's 10,000 largest cities and surrounding geographies, resulting in 246k time series with nearly one million image patches. We package each time series in Zarr file format stored in WebDataset tar shards for efficient data loading and representation of meta-information such as cloud masks. We add new modalities for elevation, land-cover, and vegetation to support multimodal pre-training. Released under the CC-BY-4.0 license, SSL4EO-S12 v1.1 facilitates open research and provides a robust foundation for future advancements in self-supervised learning and geospatial analysis. The dataset is available online through https://huggingface.co/datasets/embed2scale/SSL4EO-S12-v1.1.
中文标题/摘要
标题:SSL4EO-S12 v1.1:一种用于预训练的多模态多季节地球观测数据集,更新版
本研究介绍了SSL4EO-S12 v1.1,这是一种多模态、多时相的地球观测数据集,旨在用于大规模基础模型的预训练。基于SSL4EO-S12的成功,此扩展更新了之前的版本,以解决地理空间对齐不准确和数据结构效率低的问题。该数据集允许低门槛、分析就绪的数据加载,同时保持了前身对世界前10,000大城市及其周边地理区域的空间覆盖,产生了246,000个时间序列和近一百万个图像块。我们将每个时间序列打包为Zarr文件格式,并存储在WebDataset tar分片中,以实现高效的数据加载并表示诸如云掩码等元信息。我们添加了新的模态,包括高程、土地覆盖和植被,以支持多模态预训练。在CC-BY-4.0许可下发布,SSL4EO-S12 v1.1促进了开放研究,并为未来的自监督学习和地理空间分析提供了坚实的基础。该数据集可通过https://huggingface.co/datasets/embed2scale/SSL4EO-S12-v1.1在线获取。
Summary / 总结
This work presents SSL4EO-S12 v1.1, an updated multimodal, multiseasonal Earth Observation dataset for pretraining large-scale foundation models. It addresses previous issues with geospatial alignment and data structure, maintaining spatial coverage of 10,000 largest cities and surrounding areas, and providing 246k time series with nearly one million image patches. The dataset is packaged in Zarr file format with WebDataset tar shards for efficient data loading and includes new modalities for elevation, land-cover, and vegetation. Released under CC-BY-4.0, it supports open research in self-supervised learning and geospatial analysis.
该研究介绍了SSL4EO-S12 v1.1,这是一个更新后的多模态、多季节地球观测数据集,用于预训练大规模基础模型。该数据集解决了先前版本的地理空间对齐和数据结构问题,保持了对世界上10,000个最大城市及其周边地理区域的空间覆盖。数据集包含246k时间序列和近一百万个图像片段,以Zarr文件格式和WebDataset tar碎片进行高效数据加载。新增了海拔、土地覆盖和植被等新模态,以支持多模态预训练。该数据集在CC-BY-4.0许可下发布,促进了开放研究,并支持自监督学习和地理空间分析的进步。
Policy Gradients for Cumulative Prospect Theory in Reinforcement Learning
Authors: Olivier Lepel, Anas Barakat
First: 2024-10-03T15:45:39+00:00 · Latest: 2026-02-17T17:15:23+00:00
Abstract
We derive a policy gradient theorem for Cumulative Prospect Theory (CPT) objectives in finite-horizon Reinforcement Learning (RL), generalizing the standard policy gradient theorem and encompassing distortion-based risk objectives as special cases. Motivated by behavioral economics, CPT combines an asymmetric utility transformation around a reference point with probability distortion. Building on our theorem, we design a first-order policy gradient algorithm for CPT-RL using a Monte Carlo gradient estimator based on order statistics. We establish statistical guarantees for the estimator and prove asymptotic convergence of the resulting algorithm to first-order stationary points of the (generally non-convex) CPT objective. Simulations illustrate qualitative behaviors induced by CPT and compare our first-order approach to existing zeroth-order methods.
中文标题/摘要
标题:累积前景理论在强化学习中的策略梯度
我们推导了有限时域强化学习(RL)中累积前景理论(CPT)目标的策略梯度定理,推广了标准的策略梯度定理,并将基于扭曲的风险目标作为特殊情况包括在内。受行为经济学的启发,CPT 结合了围绕参考点的不对称效用转换和概率扭曲。基于我们的定理,我们设计了一种使用基于顺序统计量的蒙特卡洛梯度估计器的一阶策略梯度算法进行 CPT-RL。我们为估计器建立了统计保证,并证明了该算法渐近收敛到(通常是非凸的)CPT 目标的一阶驻点。仿真展示了由 CPT 引起的定性行为,并将我们的一阶方法与现有的零阶方法进行了比较。
Summary / 总结
This paper derives a policy gradient theorem for Cumulative Prospect Theory (CPT) objectives in finite-horizon Reinforcement Learning (RL), extending the standard policy gradient theorem. Motivated by behavioral economics, the authors design a first-order policy gradient algorithm for CPT-RL using a Monte Carlo gradient estimator based on order statistics. They establish statistical guarantees for the estimator and prove asymptotic convergence to first-order stationary points of the CPT objective. Simulations show qualitative behaviors of CPT and compare the first-order approach to existing zeroth-order methods.
本文推导了 Cumulative Prospect Theory (CPT) 目标在有限时域 Reinforcement Learning (RL) 中的策略梯度定理,推广了标准的策略梯度定理。受行为经济学的启发,作者设计了一种基于顺序统计量的 Monte Carlo 梯度估计器的一阶策略梯度算法。他们为该估计器建立了统计保证,并证明了该算法渐近收敛到 CPT 目标的一阶驻点。模拟结果显示了 CPT 引起的定性行为,并将一阶方法与现有的零阶方法进行了比较。
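To make the CPT objective above concrete, here is a minimal sketch of a sample-based CPT value estimate built from order statistics. It estimates the value only, not the policy gradient derived in the paper. The function names are hypothetical, and the power utilities, the Tversky-Kahneman weighting function, and the default parameters (0.88, 2.25, 0.61, 0.69, commonly cited from the behavioral economics literature) are assumptions rather than the authors' configuration.

```python
def tk_weight(p, gamma):
    """Tversky-Kahneman probability weighting; the identity when gamma == 1."""
    num = p ** gamma
    return num / (num + (1.0 - p) ** gamma) ** (1.0 / gamma)


def cpt_estimate(samples, ref=0.0, alpha=0.88, beta=0.88, lam=2.25,
                 gamma_gain=0.61, gamma_loss=0.69):
    """Order-statistics estimate of the CPT value of a return distribution."""
    xs = sorted(samples)  # X_[1] <= ... <= X_[n]
    n = len(xs)
    gains = 0.0
    losses = 0.0
    for i, x in enumerate(xs, start=1):
        u_plus = max(x - ref, 0.0) ** alpha        # utility of gains
        u_minus = lam * max(ref - x, 0.0) ** beta  # loss-averse utility of losses
        # Gains are weighted by distorted upper-tail probabilities.
        gains += u_plus * (tk_weight((n + 1 - i) / n, gamma_gain)
                           - tk_weight((n - i) / n, gamma_gain))
        # Losses are weighted by distorted lower-tail probabilities.
        losses += u_minus * (tk_weight(i / n, gamma_loss)
                             - tk_weight((i - 1) / n, gamma_loss))
    return gains - losses


# With identity utilities and no distortion the estimate is the sample mean.
neutral = cpt_estimate([-2.0, 1.0, 4.0], alpha=1.0, beta=1.0, lam=1.0,
                       gamma_gain=1.0, gamma_loss=1.0)
# With the default parameters a symmetric gamble is valued negatively.
averse = cpt_estimate([-1.0, 1.0])
```

The two calls show the special case that recovers the risk-neutral expectation and the effect of loss aversion on a fair coin-flip gamble.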
FRSICL: LLM-Enabled In-Context Learning Flight Resource Allocation for Fresh Data Collection in UAV-Assisted Wildfire Monitoring
Authors: Yousef Emami, Hao Zhou, Miguel Gutierrez Gaitan, Kai Li, Luis Almeida
First: 2025-07-14T10:24:43+00:00 · Latest: 2026-02-17T17:12:48+00:00
Abstract
Uncrewed Aerial Vehicles (UAVs) play a vital role in public safety, especially in monitoring wildfires, where early detection reduces environmental impact. In UAV-Assisted Wildfire Monitoring (UAWM) systems, jointly optimizing the data collection schedule and UAV velocity is essential to minimize the average Age of Information (AoI) for sensory data. Deep Reinforcement Learning (DRL) has been used for this optimization, but its limitations, including low sampling efficiency, discrepancies between simulation and real-world conditions, and complex training, make it unsuitable for time-critical applications such as wildfire monitoring. Recent advances in Large Language Models (LLMs) provide a promising alternative. With strong reasoning and generalization capabilities, LLMs can adapt to new tasks through In-Context Learning (ICL), which enables task adaptation using natural language prompts and example-based guidance without retraining. This paper proposes a novel online Flight Resource Allocation scheme based on LLM-Enabled In-Context Learning (FRSICL) to jointly optimize the data collection schedule and UAV velocity along the trajectory in real time, thereby asymptotically minimizing the average AoI across all ground sensors. Unlike DRL, FRSICL generates data collection schedules and velocities using natural language task descriptions and feedback from the environment, enabling dynamic decision-making without extensive retraining. Simulation results confirm the effectiveness of FRSICL compared to state-of-the-art baselines, namely Proximal Policy Optimization, Block Coordinate Descent, and Nearest Neighbor.
中文标题/摘要
标题:FRSICL:基于LLM的上下文学习飞行资源分配以优化UAV辅助野火监测的新鲜数据收集
无人驾驶航空器(UAV)在公共安全中发挥着重要作用,特别是在野火监测中,早期检测可以减少环境影响。在UAV辅助野火监测(UAWM)系统中,联合优化数据收集计划和UAV速度对于最小化传感数据的平均信息年龄(AoI)至关重要。深度强化学习(DRL)已被用于这种优化,但其局限性包括低采样效率、仿真与现实世界条件之间的差异以及复杂的训练,使其不适合如野火监测等时间关键的应用。最近在大型语言模型(LLMs)方面的进展提供了另一种有希望的替代方案。凭借强大的推理和泛化能力,LLMs可以通过上下文学习(ICL)适应新任务,这使得通过自然语言提示和基于示例的指导来完成任务适应,无需重新训练。本文提出了一种基于LLM启用的上下文学习(FRSICL)的新型在线飞行资源分配方案,以实时联合优化数据收集计划和UAV轨迹上的速度,从而渐近地最小化所有地面传感器的平均AoI。与DRL不同,FRSICL使用自然语言任务描述和环境反馈生成数据收集计划和速度,从而实现动态决策而无需大量重新训练。仿真结果证实了FRSICL与最先进的基线(即近端策略优化、块坐标下降和最近邻)相比的有效性。
Summary / 总结
This paper introduces FRSICL, a novel approach for optimizing flight resource allocation in UAV-assisted wildfire monitoring using Large Language Models (LLMs) and In-Context Learning (ICL). Unlike Deep Reinforcement Learning (DRL), FRSICL generates data collection schedules and UAV velocities through natural language prompts and feedback, enabling dynamic decision-making. Simulation results show that FRSICL outperforms state-of-the-art methods in minimizing the average Age of Information (AoI) for sensory data in real-time applications.
论文提出了一种基于大型语言模型(LLMs)的In-Context Learning(ICL)的实时飞行资源分配方案FRSICL,用于优化野火监测中的数据采集时间和无人机速度。与深度强化学习(DRL)不同,FRSICL基于自然语言的任务描述和环境反馈生成计划,无需大量重新训练。仿真结果表明,FRSICL在减少平均信息年龄(AoI)方面优于Proximal Policy Optimization、Block Coordinate Descent和Nearest Neighbor等最先进的方法。
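The average Age of Information (AoI) that the scheme above minimizes can be illustrated with a deliberately simplified discrete-time model. The sketch below assumes that every sensor's age grows by one per time step and resets to zero in the step its data is collected; `average_aoi` and the schedules are hypothetical, and the paper's actual system model (UAV velocity, trajectory, channel conditions) is not represented.

```python
def average_aoi(schedule, num_sensors):
    """Time-averaged AoI over all sensors for a given collection schedule.

    schedule[t] is the index of the sensor served at step t, or None if no
    sensor is served. Simplifying assumption: collected data is perfectly
    fresh, so the served sensor's age resets to zero.
    """
    age = [0] * num_sensors
    total = 0
    for served in schedule:
        age = [a + 1 for a in age]  # every sensor's data gets one step older
        if served is not None:
            age[served] = 0         # fresh data collected from this sensor
        total += sum(age)
    return total / (len(schedule) * num_sensors)


# Alternating between two sensors keeps both ages low ...
round_robin = average_aoi([0, 1, 0, 1], num_sensors=2)
# ... while always serving sensor 0 lets sensor 1's data grow stale.
greedy_one = average_aoi([0, 0, 0, 0], num_sensors=2)
```

Comparing the two schedules shows why the order of data collection, not only its frequency, drives the average AoI.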
Language and Geometry Grounded Sparse Voxel Representations for Holistic Scene Understanding
Authors: Guile Wu, David Huang, Bingbing Liu, Dongfeng Bai
First: 2026-02-17T17:10:13+00:00 · Latest: 2026-02-17T17:10:13+00:00
Comments: Technical Report
Abstract
Existing 3D open-vocabulary scene understanding methods mostly emphasize distilling language features from 2D foundation models into 3D feature fields, but largely overlook the synergy among scene appearance, semantics, and geometry. As a result, scene understanding often deviates from the underlying geometric structure of scenes and becomes decoupled from the reconstruction process. In this work, we propose a novel approach that leverages language and geometry grounded sparse voxel representations to comprehensively model appearance, semantics, and geometry within a unified framework. Specifically, we use 3D sparse voxels as primitives and employ an appearance field, a density field, a feature field, and a confidence field to holistically represent a 3D scene. To promote synergy among the appearance, density, and feature fields, we construct a feature modulation module and distill language features from a 2D foundation model into our 3D scene model. In addition, we integrate geometric distillation into feature field distillation to transfer geometric knowledge from a geometry foundation model to our 3D scene representations via depth correlation regularization and pattern consistency regularization. These components work together to synergistically model the appearance, semantics, and geometry of the 3D scene within a unified framework. Extensive experiments demonstrate that our approach achieves superior overall performance compared with state-of-the-art methods in holistic scene understanding and reconstruction.
中文标题/摘要
标题:语言和几何驱动的稀疏体素表示法用于整体场景理解
现有的3D开放词汇场景理解方法主要强调将2D基础模型的语言特征提炼为3D特征场,但很大程度上忽视了场景外观、语义和几何之间的协同作用。因此,场景理解往往偏离场景的几何结构,与重建过程脱节。在本文中,我们提出了一种新的方法,利用语言和几何驱动的稀疏体素表示法,在统一框架中全面建模外观、语义和几何。具体而言,我们使用3D稀疏体素作为基本单元,并采用外观场、密度场、特征场和置信场来全面表示3D场景。为了促进外观、密度和特征场之间的协同作用,我们构建了一个特征调制模块,并将2D基础模型的语言特征提炼到我们的3D场景模型中。此外,我们还将几何提炼整合到特征场提炼中,通过深度相关正则化和模式一致性正则化,将几何知识从几何基础模型转移到我们的3D场景表示中。这些组件共同作用,在统一框架中协同建模3D场景的外观、语义和几何。大量实验表明,我们的方法在整体场景理解和重建方面优于最先进的方法。
Summary / 总结
This work addresses the limitations of existing 3D scene understanding methods by proposing a novel approach that integrates language and geometry into sparse voxel representations. The method uses 3D sparse voxels to represent a scene with appearance, density, feature, and confidence fields. It promotes synergy among these fields through a feature modulation module and integrates geometric knowledge via depth correlation and pattern consistency regularization. Experiments show that this approach outperforms state-of-the-art methods in both scene understanding and reconstruction.
该研究针对现有3D场景理解方法的局限性,提出了一种将语言和几何学整合到稀疏体素表示中的新方法。该方法使用3D稀疏体素来表示外观、密度、特征和置信度字段,并包含一个特征调制模块来提取语言特征,以及通过正则化技术进行几何提取,以增强模型性能。实验表明,该方法在场景理解和重建任务中均优于现有最先进的方法。
MeshMimic: Geometry-Aware Humanoid Motion Learning through 3D Scene Reconstruction
Authors: Qiang Zhang, Jiahao Ma, Peiran Liu, Shuai Shi, Zeran Su, Zifan Wang, Jingkai Sun, Wei Cui, Jialin Yu, Gang Han, Wen Zhao, Pihai Sun, Kangning Yin, Jiaxu Wang, Jiahang Cao, Lingfeng Zhang, Hao Cheng, Xiaoshuai Hao, Yiding Ji, Junwei Liang, Jian Tang, Renjing Xu, Yijie Guo
First: 2026-02-17T17:09:45+00:00 · Latest: 2026-02-17T17:09:45+00:00
Comments: 17 pages, 6 figures
Abstract
Humanoid motion control has witnessed significant breakthroughs in recent years, with deep reinforcement learning (RL) emerging as a primary catalyst for achieving complex, human-like behaviors. However, the high dimensionality and intricate dynamics of humanoid robots make manual motion design impractical, leading to a heavy reliance on expensive motion capture (MoCap) data. These datasets are not only costly to acquire but also frequently lack the necessary geometric context of the surrounding physical environment. Consequently, existing motion synthesis frameworks often suffer from a decoupling of motion and scene, resulting in physical inconsistencies such as contact slippage or mesh penetration during terrain-aware tasks. In this work, we present MeshMimic, an innovative framework that bridges 3D scene reconstruction and embodied intelligence to enable humanoid robots to learn coupled "motion-terrain" interactions directly from video. By leveraging state-of-the-art 3D vision models, our framework precisely segments and reconstructs both human trajectories and the underlying 3D geometry of terrains and objects. We introduce an optimization algorithm based on kinematic consistency to extract high-quality motion data from noisy visual reconstructions, alongside a contact-invariant retargeting method that transfers human-environment interaction features to the humanoid agent. Experimental results demonstrate that MeshMimic achieves robust, highly dynamic performance across diverse and challenging terrains. Our approach proves that a low-cost pipeline utilizing only consumer-grade monocular sensors can facilitate the training of complex physical interactions, offering a scalable path toward the autonomous evolution of humanoid robots in unstructured environments.
中文标题/摘要
标题:MeshMimic:通过三维场景重建实现几何感知的人形运动学习
近年来,人形运动控制取得了显著突破,深度强化学习(RL)成为实现复杂、类人行为的主要催化剂。然而,人形机器人的高维度和复杂动力学使得手动设计运动变得不切实际,导致对昂贵的运动捕捉(MoCap)数据的高度依赖。这些数据不仅成本高昂,而且经常缺乏周围物理环境的必要几何上下文。因此,现有的运动合成框架往往在运动和场景之间存在脱节,导致诸如地形感知任务中的接触滑移或网格穿透等物理不一致现象。在本文中,我们提出了一种名为MeshMimic的创新框架,该框架将三维场景重建与具身智能相结合,使人形机器人能够直接从视频中学习耦合的“运动-地形”交互。通过利用先进的三维视觉模型,我们的框架能够精确地分割和重建人类轨迹以及地形和物体的底层三维几何结构。我们提出了一种基于运动学一致性的优化算法,从嘈杂的视觉重建中提取高质量的运动数据,以及一种接触不变的重定向方法,将人类与环境的交互特征转移到人形代理上。实验结果表明,MeshMimic在各种复杂地形上实现了稳健且高度动态的性能。我们的方法证明,仅使用消费级单目传感器的低成本管道可以促进复杂物理交互的训练,为在非结构化环境中人形机器人的自主进化提供了可扩展的路径。
Summary / 总结
MeshMimic is a framework that combines 3D scene reconstruction with deep reinforcement learning to enable humanoid robots to learn complex, terrain-aware motions directly from video. It uses 3D vision models to reconstruct both human trajectories and the 3D geometry of terrains, and an optimization algorithm to extract high-quality motion data. The approach demonstrates robust performance across various terrains and shows that a low-cost, consumer-grade sensor setup can be used to train humanoid robots for complex physical interactions in unstructured environments.
MeshMimic 是一个结合 3D 场景重建与深度强化学习的框架,使类人机器人能够直接从视频中学习复杂的地形相关动作。该方法使用 3D 视觉模型重建人类轨迹和地形的 3D 几何结构,并通过优化算法提取高质量的动作数据。实验结果表明,该方法在多种地形上表现出色,并证明了使用低成本的消费级单目传感器可以训练类人机器人进行复杂的物理交互,从而在未结构化的环境中实现自主进化。
Can Multimodal LLMs Perform Time Series Anomaly Detection?
Authors: Xiongxiao Xu, Haoran Wang, Yueqing Liang, Philip S. Yu, Yue Zhao, Kai Shu
Venue: WWW
First: 2025-02-25T03:37:43+00:00 · Latest: 2026-02-17T17:04:01+00:00
Comments: ACM Web Conference 2026 (WWW'26)
Abstract
Time series anomaly detection (TSAD) has been a long-standing pillar problem in Web-scale systems and online infrastructures, such as service reliability monitoring, system fault diagnosis, and performance optimization. Although large language models (LLMs) have demonstrated unprecedented capabilities in time series analysis, the potential of multimodal LLMs (MLLMs), particularly vision-language models, in TSAD remains largely under-explored. One natural way for humans to detect time series anomalies is through visualization and textual description. This motivates our research question: Can multimodal LLMs perform time series anomaly detection? Existing studies often oversimplify the problem by treating point-wise anomalies as special cases of range-wise ones or by aggregating point anomalies to approximate range-wise scenarios. This limits our understanding of realistic scenarios such as multi-granular anomalies and irregular time series. To address the gap, we build a VisualTimeAnomaly benchmark to comprehensively investigate zero-shot capabilities of MLLMs for TSAD, progressing from point-wise to range-wise to variate-wise anomalies, and extending to irregular sampling conditions. Our study reveals several key insights into MLLMs for TSAD. Building on these findings, we propose an MLLM-based multi-agent framework, TSAD-Agents, to achieve automatic TSAD. Our framework comprises scanning, planning, detection, and checking agents that synergistically collaborate to reason, plan, and self-reflect to enable automatic TSAD. These agents adaptively invoke tools such as traditional methods and MLLMs and dynamically switch between text and image modalities to optimize detection performance.
中文标题/摘要
标题:多模态LLM能否进行时间序列异常检测?
时间序列异常检测(TSAD)一直是大规模系统和在线基础设施中的长期核心问题,如服务可靠性监控、系统故障诊断和性能优化。大型语言模型(LLMs)在时间序列分析方面展现了前所未有的能力,而多模态LLM(MLLMs),尤其是视觉-语言模型,在TSAD方面的潜力尚未得到充分探索。人类自然地通过可视化和文本描述来检测时间序列异常。这激发了我们的研究问题:多模态LLM能否进行时间序列异常检测?现有研究往往通过将点异常视为区间异常的特殊情况或通过聚合点异常来近似区间场景来简化问题,这限制了我们对多粒度异常和不规则时间序列等现实场景的理解。为解决这一差距,我们构建了一个VisualTimeAnomaly基准,全面调查MLLMs在TSAD中的零样本能力,从点、区间到变量层面的异常,进一步扩展到不规则采样条件。我们的研究揭示了多模态MLLMs在TSAD中的几个关键见解。基于这些发现,我们提出了一种基于MLLMs的多智能体框架TSAD-Agents,以实现自动TSAD。该框架包括扫描、规划、检测和检查智能体,它们协同合作,进行推理、规划和自我反思,以实现自动TSAD。这些智能体能够适当地调用传统方法和MLLMs,并动态切换文本和图像模态,以优化检测性能。
Summary / 总结
This study investigates whether multimodal large language models (MLLMs) can perform time series anomaly detection (TSAD), addressing limitations of existing methods. The research builds a VisualTimeAnomaly benchmark to explore zero-shot capabilities of MLLMs for TSAD, covering point-wise, range-wise, and variate-wise anomalies under irregular sampling conditions. Key findings reveal MLLMs' potential in TSAD, leading to the development of a multi-agent framework, TSAD-Agents, which collaboratively uses traditional methods and MLLMs to optimize anomaly detection performance.
研究探讨了多模态大型语言模型(MLLMs)在时间序列异常检测(TSAD)中的能力,解决了现有研究的局限性。构建了VisualTimeAnomaly基准来探索MLLMs在零样本TSAD中的能力,涵盖了点异常、区间异常和变量异常,并在不规则采样条件下进行扩展。研究发现MLLMs能够有效进行TSAD,进而开发了TSAD-Agents多代理框架,该框架通过适应性工具调用和动态模态切换协同合作实现自动TSAD。
Spanning the Visual Analogy Space with a Weight Basis of LoRAs
Authors: Hila Manor, Rinon Gal, Haggai Maron, Tomer Michaeli, Gal Chechik
First: 2026-02-17T17:02:38+00:00 · Latest: 2026-02-17T17:02:38+00:00
Comments: Code and data are in https://research.nvidia.com/labs/par/lorweb
Abstract
Visual analogy learning enables image manipulation through demonstration rather than textual description, allowing users to specify complex transformations difficult to articulate in words. Given a triplet $\{\mathbf{a}$, $\mathbf{a}'$, $\mathbf{b}\}$, the goal is to generate $\mathbf{b}'$ such that $\mathbf{a} : \mathbf{a}' :: \mathbf{b} : \mathbf{b}'$. Recent methods adapt text-to-image models to this task using a single Low-Rank Adaptation (LoRA) module, but they face a fundamental limitation: attempting to capture the diverse space of visual transformations within a fixed adaptation module constrains generalization capabilities. Inspired by recent work showing that LoRAs in constrained domains span meaningful, interpolatable semantic spaces, we propose LoRWeB, a novel approach that specializes the model for each analogy task at inference time through dynamic composition of learned transformation primitives, informally, choosing a point in a "space of LoRAs". We introduce two key components: (1) a learnable basis of LoRA modules, to span the space of different visual transformations, and (2) a lightweight encoder that dynamically selects and weighs these basis LoRAs based on the input analogy pair. Comprehensive evaluations demonstrate our approach achieves state-of-the-art performance and significantly improves generalization to unseen visual transformations. Our findings suggest that LoRA basis decompositions are a promising direction for flexible visual manipulation. Code and data are in https://research.nvidia.com/labs/par/lorweb
Summary / 总结
The paper aims to enhance visual analogy learning by addressing the limitations of single LoRA modules in capturing diverse visual transformations. It introduces LoRWeB, which dynamically composes learned transformation primitives at inference time. Key components include a learnable basis of LoRAs and a lightweight encoder that selects and weights these basis LoRAs based on input analogy pairs. Experimental results show that LoRWeB achieves state-of-the-art performance and better generalization to unseen transformations. This suggests that LoRA basis decompositions are a promising approach for flexible visual manipulation.
该研究旨在通过解决单一LoRA模块在捕捉多样视觉变换方面的局限性,提升视觉类比学习。提出了LoRWeB方法,在推理时动态组合学习到的变换基元。关键组成部分包括一个可学习的LoRA基底和一个轻量级编码器,该编码器根据输入的类比对选择并加权这些基底LoRA。实验结果表明,LoRWeB在性能上达到最新水平,并且在处理未见过的变换时表现出更好的泛化能力。这表明LoRA基底分解是灵活视觉操作的一个有前景的方向。
Recursive Concept Evolution for Compositional Reasoning in Large Language Models
Authors: Sarim Chaudhry
First: 2026-02-17T17:01:42+00:00 · Latest: 2026-02-17T17:01:42+00:00
Abstract
Large language models achieve strong performance on many complex reasoning tasks, yet their accuracy degrades sharply on benchmarks that require compositional reasoning, including ARC-AGI-2, GPQA, MATH, BBH, and HLE. Existing methods improve reasoning by expanding token-level search through chain-of-thought prompting, self-consistency, or reinforcement learning, but they leave the model's latent representation space fixed. When the required abstraction is not already encoded in this space, performance collapses. We propose Recursive Concept Evolution (RCE), a framework that enables pretrained language models to modify their internal representation geometry during inference. RCE introduces dynamically generated low-rank concept subspaces that are spawned when representational inadequacy is detected, selected through a minimum description length criterion, merged when synergistic, and consolidated via constrained optimization to preserve stability. This process allows the model to construct new abstractions rather than recombining existing ones. We integrate RCE with Mistral-7B and evaluate it across compositional reasoning benchmarks. RCE yields 12-18 point gains on ARC-AGI-2, 8-14 point improvements on GPQA and BBH, and consistent reductions in depth-induced error on MATH and HLE.
中文标题/摘要
标题:递归概念进化以实现大型语言模型的组合推理
大型语言模型在许多复杂推理任务中表现出色,但在需要组合推理的基准测试中,如ARC-AGI-2、GPQA、MATH、BBH和HLE,其准确性急剧下降。现有方法通过链式思考提示、自我一致性或强化学习扩展了令牌级搜索,但它们使模型的潜在表示空间保持不变。当所需的抽象尚未编码在该空间中时,性能会崩溃。我们提出了递归概念进化(RCE)框架,该框架使预训练语言模型能够在推理过程中修改其内部表示几何结构。RCE引入了动态生成的低秩概念子空间,当检测到表示不足时生成,通过最小描述长度标准选择,当协同时合并,并通过约束优化巩固以保持稳定性。这一过程使模型能够构建新的抽象,而不是重新组合现有抽象。我们将RCE与Mistral-7B集成,并在组合推理基准测试中对其进行评估。RCE在ARC-AGI-2上获得了12-18分的提升,在GPQA和BBH上获得了8-14分的改进,并在MATH和HLE上一致地减少了深度诱导的错误。
Summary / 总结
The paper addresses the limitation of large language models in compositional reasoning tasks, where their performance significantly drops. It introduces Recursive Concept Evolution (RCE), a framework that allows models to dynamically modify their internal representation during inference. RCE generates low-rank concept subspaces, merges them when synergistic, and optimizes them to maintain stability, enabling the model to construct new abstractions. Experiments show RCE improves performance by 12-18 points on ARC-AGI-2, 8-14 points on GPQA and BBH, and reduces depth-induced errors on MATH and HLE.
论文针对大型语言模型在组合推理任务中的表现显著下降的问题,提出了一种名为递归概念进化(RCE)的框架,该框架允许模型在推理过程中动态修改其内部表示。RCE生成低秩概念子空间,当它们协同工作时进行合并,并通过约束优化来保持稳定性,从而让模型能够构建新的抽象。实验结果显示,RCE在ARC-AGI-2上提高了12-18分,在GPQA和BBH上提高了8-14分,并在MATH和HLE上减少了深度引起的错误。
Learning to Retrieve Navigable Candidates for Efficient Vision-and-Language Navigation
Authors: Shutian Gu, Chengkai Huang, Ruoyu Wang, Lina Yao
First: 2026-02-17T17:00:11+00:00 · Latest: 2026-02-17T17:00:11+00:00
Abstract
Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions and navigate through previously unseen environments. Recent approaches increasingly employ large language models (LLMs) as high-level navigators due to their flexibility and reasoning capability. However, prompt-based LLM navigation often suffers from inefficient decision-making, as the model must repeatedly interpret instructions from scratch and reason over noisy and verbose navigable candidates at each step. In this paper, we propose a retrieval-augmented framework to improve the efficiency and stability of LLM-based VLN without modifying or fine-tuning the underlying language model. Our approach introduces retrieval at two complementary levels. At the episode level, an instruction-level embedding retriever selects semantically similar successful navigation trajectories as in-context exemplars, providing task-specific priors for instruction grounding. At the step level, an imitation-learned candidate retriever prunes irrelevant navigable directions before LLM inference, reducing action ambiguity and prompt complexity. Both retrieval modules are lightweight, modular, and trained independently of the LLM. We evaluate our method on the Room-to-Room (R2R) benchmark. Experimental results demonstrate consistent improvements in Success Rate, Oracle Success Rate, and SPL on both seen and unseen environments. Ablation studies further show that instruction-level exemplar retrieval and candidate pruning contribute complementary benefits to global guidance and step-wise decision efficiency. These results indicate that retrieval-augmented decision support is an effective and scalable strategy for enhancing LLM-based vision-and-language navigation.
中文标题/摘要
标题:学习检索可导航候选对象以提高视觉-语言导航效率
视觉-语言导航(VLN)要求代理遵循自然语言指令并导航通过之前未见过的环境。最近的方法越来越多地采用大型语言模型(LLM)作为高级导航器,因为它们具有灵活性和推理能力。然而,基于提示的LLM导航往往在决策效率方面存在不足,因为模型必须反复从头解释指令,并在每一步中对嘈杂且冗长的可导航候选对象进行推理。在本文中,我们提出了一种检索增强框架,以在不修改或微调底层语言模型的情况下提高基于LLM的VLN的效率和稳定性。我们的方法在两个互补的层次上引入了检索。在回合(episode)层面,一个指令级嵌入检索器选择语义上相似的成功导航轨迹作为上下文示例,为指令定位提供特定任务的先验。在步骤层面,一个模仿学习的候选检索器在LLM推理之前修剪无关的可导航方向,减少动作的模糊性和提示的复杂性。两个检索模块都是轻量级、模块化的,并且独立于LLM进行训练。我们在Room-to-Room(R2R)基准上评估了我们的方法。实验结果表明,在已见和未见环境中,我们的方法在成功率、Oracle成功率和SPL方面都表现出一致的改进。消融研究进一步表明,指令级示例检索和候选修剪分别对全局指导和步骤级决策效率提供了互补的好处。这些结果表明,检索增强的决策支持是提高基于LLM的视觉-语言导航的有效且可扩展的策略。
Summary / 总结
This paper addresses the inefficiency of prompt-based large language models (LLMs) in Vision-and-Language Navigation (VLN) by proposing a retrieval-augmented framework. The method introduces two levels of retrieval: at the episode level, it uses an instruction-level embedding retriever to select similar successful navigation trajectories as in-context exemplars for task-specific priors; at the step level, it employs an imitation-learned candidate retriever to prune irrelevant navigable directions before LLM inference. The experiments show consistent improvements in Success Rate, Oracle Success Rate, and SPL on both seen and unseen environments, indicating that retrieval-augmented decision support enhances LLM-based VLN efficiency and stability.
本文提出了一种检索增强框架,以解决基于提示的大语言模型(LLM)在视觉语言导航(VLN)中的低效问题。该框架包括一个指令级嵌入检索器,用于选择相似的成功导航轨迹作为上下文示例,以及一个步骤级的模仿学习候选检索器,用于在LLM推理前剔除无关的可导航方向,从而减少动作的不确定性并简化提示。该方法在已见和未见环境中均提高了成功率、Oracle成功率和SPL,表明检索增强的决策支持是增强基于LLM的VLN的有效且可扩展的策略。
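The episode-level retrieval described above can be pictured as a nearest-neighbor lookup over instruction embeddings. The sketch below ranks stored trajectories by cosine similarity and returns the top k as in-context exemplars. The names, the two-dimensional embeddings, and the choice of cosine similarity are all assumptions made for illustration; the paper's retriever and its training procedure are not reproduced here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_exemplars(query_embedding, bank, k=2):
    """Return the ids of the k stored trajectories closest to the query.

    `bank` is a list of (trajectory_id, instruction_embedding) pairs.
    """
    ranked = sorted(bank, key=lambda item: cosine(query_embedding, item[1]),
                    reverse=True)
    return [trajectory_id for trajectory_id, _ in ranked[:k]]


# Hypothetical embedding bank of previously successful trajectories.
bank = [("traj_a", [1.0, 0.0]), ("traj_b", [0.0, 1.0]), ("traj_c", [0.9, 0.1])]
exemplars = retrieve_exemplars([1.0, 0.0], bank, k=2)
```

The returned ids would then be used to pull the matching trajectories into the LLM prompt; step-level candidate pruning could follow the same score-and-filter pattern with a learned scorer.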
Non-intrusive data-driven model order reduction for circuits based on Hammerstein architectures
Authors: Joshua Hanson, Paul Kuberry, Biliana Paskaleva, Pavel Bochev
Venue: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 44(6) (2025) 2314-2327
First: 2024-05-30T15:47:48+00:00 · Latest: 2026-02-17T16:51:43+00:00
Comments: 14 pages, 18 figures; accepted to IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Abstract
We demonstrate that system identification techniques can provide a basis for effective, non-intrusive model order reduction (MOR) for common circuits that are key building blocks in microelectronics. Our approach is motivated by the practical operation of these circuits and utilizes a canonical Hammerstein architecture. To demonstrate the approach we develop parsimonious Hammerstein models for a nonlinear CMOS differential amplifier and an operational amplifier circuit. We train these models on a combination of direct current (DC) and transient Spice circuit simulation data using a novel sequential strategy to identify their static nonlinear and linear dynamical parts. Simulation results show that the Hammerstein model is an effective surrogate for these types of circuits that accurately and efficiently reproduces their behavior over a wide range of operating points and input frequencies.
中文标题/摘要
标题:基于Hammerstein架构的无侵入数据驱动模型降阶方法
我们证明了系统辨识技术可以为常见的微电子学中关键构建块电路提供有效且无侵入的模型降阶(MOR)的基础。我们的方法受到这些电路实际操作的启发,并利用了标准的Hammerstein架构。为了展示这种方法,我们为一个非线性CMOS差分放大器和一个运算放大器电路开发了简约的Hammerstein模型。我们使用一种新颖的顺序策略,基于直流(DC)和瞬态Spice电路仿真数据训练这些模型,以识别它们的静态非线性和线性动态部分。仿真结果表明,Hammerstein模型是这些类型电路的有效代理模型,能够在广泛的运行点和输入频率范围内准确且高效地再现其行为。
Summary / 总结
The research develops a non-intrusive model order reduction method for common circuits, using system identification techniques built on a Hammerstein architecture. Parsimonious Hammerstein models are trained on DC and transient Spice simulation data with a sequential strategy that identifies the static nonlinear part and the linear dynamical part of each circuit. The results show that these models are effective surrogates for a nonlinear CMOS differential amplifier and an operational amplifier circuit, reproducing their behavior accurately and efficiently over a wide range of operating points and input frequencies.
该研究旨在利用基于Hammerstein架构的系统识别方法开发一种非侵入式的模型阶次缩减技术,以适用于常见的电路。该方法通过顺序训练策略在直流和瞬态Spice仿真数据上识别电路的静态非线性和线性动态部分。主要发现表明,Hammerstein模型能够有效地替代非线性CMOS差分放大器和运算放大器的行为,准确地在各种工作点和输入频率下重现其行为。
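A Hammerstein model chains a static (memoryless) nonlinearity into a linear dynamical system. The sketch below is a minimal illustration of that structure only, not the authors' circuit models: the `tanh` nonlinearity, the first-order filter, and the parameters `gain` and `pole` are all assumptions chosen for brevity. Because the linear block here has unit DC gain, a constant input settles to the value of the static nonlinearity, which hints at why DC data can isolate the nonlinear part while transient data informs the linear part.

```python
import math


def hammerstein_response(inputs, gain=2.0, pole=0.5):
    """Simulate a toy Hammerstein model on a sequence of input samples.

    Static block: v = tanh(gain * u)           (hypothetical nonlinearity)
    Linear block: y[k] = pole * y[k-1] + (1 - pole) * v[k]   (unit DC gain)
    """
    state = 0.0
    outputs = []
    for u in inputs:
        v = math.tanh(gain * u)                   # memoryless nonlinear stage
        state = pole * state + (1.0 - pole) * v   # first-order linear dynamics
        outputs.append(state)
    return outputs


# A constant (DC) input settles to the static nonlinearity's value.
steady = hammerstein_response([0.1] * 60)[-1]
```

With a constant input of 0.1, `steady` converges to `math.tanh(0.2)`; a transient input such as a step or sinusoid would instead exercise the filter's `pole`.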
On the Role of Iterative Computation in Reinforcement Learning
Authors: Raj Ghugare, Michał Bortkiewicz, Alicja Ziarko, Benjamin Eysenbach
First: 2026-02-05T18:45:57+00:00 · Latest: 2026-02-17T16:47:18+00:00
Abstract
How does the amount of compute available to a reinforcement learning (RL) policy affect its learning? Can policies using a fixed number of parameters still benefit from additional compute? The standard RL framework does not provide a language to answer these questions formally. Empirically, deep RL policies are often parameterized as neural networks with static architectures, conflating the amount of compute and the number of parameters. In this paper, we formalize compute-bounded policies and prove that policies which use more compute can solve problems and generalize to longer-horizon tasks that are outside the scope of policies with less compute. Building on prior work in algorithmic learning and model-free planning, we propose a minimal architecture that can use a variable amount of compute. Our experiments complement our theory. On a set of 31 different tasks spanning online and offline RL, we show that (1) this architecture achieves stronger performance simply by using more compute, and (2) it generalizes better on longer-horizon test tasks than standard feedforward networks or deep residual networks using up to 5 times more parameters.
中文标题/摘要
标题:关于强化学习中迭代计算的作用
可用的计算量对强化学习(RL)策略的学习有何影响?使用固定数量参数的策略是否仍能从额外的计算中受益?标准的RL框架无法正式回答这些问题。从经验上看,深度RL策略通常被参数化为具有静态架构的神经网络,混淆了计算量和参数数量。在本文中,我们形式化了计算受限策略,并证明使用更多计算的策略可以解决计算量较少的策略无法解决的问题,并且在更长时序的任务上具有更好的泛化能力。基于先前的工作,我们提出了一种可以使用可变计算量的最小架构。我们的实验补充了我们的理论。在31个不同的任务集上,包括在线和离线RL,我们展示了(1)这种架构仅通过使用更多的计算量就能实现更强的性能,(2)在更长时序的测试任务上具有更强的泛化能力,与标准的前馈网络或使用多达5倍参数的深度残差网络相比。
Summary / 总结
This paper investigates how the amount of compute available to a reinforcement learning policy affects its learning. It formalizes compute-bounded policies and proves that policies using more compute can solve problems, and generalize to longer-horizon tasks, that are out of reach for policies with less compute. Experiments on 31 tasks spanning online and offline RL show that the proposed minimal architecture, which can use a variable amount of compute, performs better simply by using more compute and generalizes better on longer-horizon test tasks than standard feedforward networks or deep residual networks with up to 5 times more parameters.
该论文研究了额外计算资源如何影响强化学习策略。它形式化了计算受限的策略,并展示了使用更多计算资源的策略能够解决那些计算资源较少的策略无法解决的长期任务。实验表明,所提出的架构在31个任务上优于标准的前馈网络和具有多达5倍参数的深度残差网络,不仅在性能上更优,而且在长期任务上的泛化能力更强。
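The link between iteration count and task horizon can be seen in a toy setting. The sketch below is not the paper's architecture; it is an assumed illustration in which one fixed update rule (the function `reachable_within` and the chain graph are hypothetical) is applied a variable number of times. Each extra application extends by one step the set of states from which the goal is known to be reachable, so the same rule handles longer horizons only when given more iterations.

```python
def reachable_within(adjacency, goal, iterations):
    """Return the states known to reach `goal` after repeating one fixed rule.

    The rule: a state joins the set if any of its successors is already in it.
    The rule never changes; only the number of applications does.
    """
    known = {goal}
    for _ in range(iterations):
        newly_known = {
            state
            for state, successors in adjacency.items()
            if any(nxt in known for nxt in successors)
        }
        known = known | newly_known
    return known


# Hypothetical chain task: 0 -> 1 -> 2 -> 3 -> 4 -> 5, goal at state 5.
chain = {i: [i + 1] for i in range(5)}
short_budget = reachable_within(chain, goal=5, iterations=2)
long_budget = reachable_within(chain, goal=5, iterations=5)
```

Here state 0 is absent from `short_budget` but present in `long_budget`, even though both calls use the identical rule.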
LeafNet: A Large-Scale Dataset and Comprehensive Benchmark for Foundational Vision-Language Understanding of Plant Diseases
Authors: Khang Nguyen Quoc, Phuong D. Dao, Luyl-Da Quach
First: 2026-02-14T08:10:27+00:00 · Latest: 2026-02-17T16:47:13+00:00
Comments: 26 pages, 13 figures and 8 tables
Abstract
Foundation models and vision-language pre-training have significantly advanced Vision-Language Models (VLMs), enabling multimodal processing of visual and linguistic data. However, their application in domain-specific agricultural tasks, such as plant pathology, remains limited due to the lack of large-scale, comprehensive multimodal image-text datasets and benchmarks. To address this gap, we introduce LeafNet, a comprehensive multimodal dataset, and LeafBench, a visual question-answering benchmark developed to systematically evaluate the capabilities of VLMs in understanding plant diseases. The dataset comprises 186,000 leaf digital images spanning 97 disease classes, paired with metadata, generating 13,950 question-answer pairs spanning six critical agricultural tasks. The questions assess various aspects of plant pathology understanding, including visual symptom recognition, taxonomic relationships, and diagnostic reasoning. Benchmarking 12 state-of-the-art VLMs on our LeafBench dataset, we reveal substantial disparity in their disease understanding capabilities. Our study shows performance varies markedly across tasks: binary healthy-diseased classification exceeds 90% accuracy, while fine-grained pathogen and species identification remains below 65%. Direct comparison between vision-only models and VLMs demonstrates the critical advantage of multimodal architectures: fine-tuned VLMs outperform traditional vision models, confirming that integrating linguistic representations significantly enhances diagnostic precision. These findings highlight critical gaps in current VLMs for plant pathology applications and underscore the need for LeafBench as a rigorous framework for methodological advancement and progress evaluation toward reliable AI-assisted plant disease diagnosis. Code is available at https://github.com/EnalisUs/LeafBench.
中文标题/摘要
标题:LeafNet:植物病害基础视觉-语言理解的大规模数据集和全面基准
基础模型和视觉-语言预训练显著推动了视觉-语言模型(VLMs)的发展,使其能够处理视觉和语言数据。然而,由于缺乏大规模、全面的多模态图像-文本数据集和基准,它们在特定农业任务中的应用,如植物病理学,仍然受到限制。为解决这一问题,我们引入了LeafNet,一个全面的多模态数据集,以及LeafBench,一个视觉问答基准,旨在系统评估VLMs在理解植物病害方面的能力。该数据集包含186,000张叶子的数字图像,涵盖97个病害类别,配以元数据,生成了13,950个问题-答案对,覆盖六个关键农业任务。问题评估了植物病理学理解的各个方面,包括视觉症状识别、分类关系和诊断推理。在我们的LeafBench数据集上对12个最先进的VLMs进行基准测试,我们揭示了它们在病害理解能力上的巨大差异。我们的研究表明,任务之间的性能差异显著:二元健康-病害分类的准确率超过90%,而细粒度病原体和物种识别的准确率低于65%。视觉模型与VLMs之间的直接比较表明,多模态架构具有关键优势:微调后的VLMs优于传统的视觉模型,证实了整合语言表示显著提高了诊断精度。这些发现突显了当前VLMs在植物病理学应用中的关键差距,并强调了LeafBench作为严格框架的重要性,用于方法学进步和可靠AI辅助植物病害诊断的进展评估。代码可在https://github.com/EnalisUs/LeafBench获取。
Summary / 总结
The paper introduces LeafNet, a large-scale multimodal dataset for plant disease understanding, and LeafBench, a visual question-answering benchmark for evaluating Vision-Language Models (VLMs). The dataset includes 186,000 leaf images spanning 97 disease classes and 13,950 question-answer pairs covering six agricultural tasks. Benchmarking 12 state-of-the-art VLMs on LeafBench reveals large disparities in disease understanding: binary healthy-diseased classification exceeds 90% accuracy, while fine-grained pathogen and species identification remains below 65%. Fine-tuned VLMs outperform traditional vision-only models, indicating that integrating linguistic representations improves diagnostic precision.
研究引入了LeafNet,这是一个大规模的多模态数据集,用于植物疾病,以及LeafBench,一个用于评估Vision-Language模型在理解植物病理方面能力的基准。该数据集包含186,000张叶片图像和13,950个问题-答案对,涵盖了97种疾病类别。在LeafBench数据集上对12个最先进的VLM进行基准测试,研究揭示了在疾病理解方面的显著差异,二元分类任务的表现优于细粒度识别。研究结果强调了在植物病理应用中,多模态架构比纯视觉模型的优势。
From Prompts to Protection: Large Language Model-Enabled In-Context Learning for Smart Public Safety UAV
Authors: Yousef Emami, Hao Zhou, Miguel Gutierrez Gaitan, Kai Li, Luis Almeida, Zhu Han
First: 2025-06-03T09:01:33+00:00 · Latest: 2026-02-17T16:45:25+00:00
Abstract
A public safety Uncrewed Aerial Vehicle (UAV) enhances situational awareness during emergency response. Its agility, mobility optimization, and ability to establish Line-of-Sight (LoS) communication make it increasingly important for managing emergencies such as disaster response, search and rescue, and wildfire monitoring. Although Deep Reinforcement Learning (DRL) has been used to optimize UAV navigation and control, its high training complexity, low sample efficiency, and the simulation-to-reality gap limit its practicality in public safety applications. Recent advances in Large Language Models (LLMs) present a promising alternative. With strong reasoning and generalization abilities, LLMs can adapt to new tasks through In-Context Learning (ICL), enabling task adaptation via natural language prompts and example-based guidance without retraining. Deploying LLMs at the network edge, rather than in the cloud, further reduces latency and preserves data privacy, making them suitable for real-time, mission-critical public safety UAVs. This paper proposes integrating LLM-assisted ICL with public safety UAVs to address key functions such as path planning and velocity control in emergency response. We present a case study on data collection scheduling, demonstrating that the LLM-assisted ICL framework can significantly reduce packet loss compared to conventional approaches while also mitigating potential jailbreaking vulnerabilities. Finally, we discuss LLM optimizers and outline future research directions. The ICL framework enables adaptive, context-aware decision-making for public safety UAVs, offering a lightweight and efficient solution to enhance UAV autonomy and responsiveness in emergencies.
中文标题/摘要
标题:从提示到保护:大型语言模型赋能的智能公共安全无人机上下文学习
公共安全无人驾驶航空器(UAV)在应急响应期间增强态势感知。其灵活性、移动优化以及建立视距(LoS)通信的能力使其在灾害响应、搜索与救援以及野火监测等紧急管理中变得越来越重要。尽管深度强化学习(DRL)已被用于优化UAV导航和控制,但其高训练复杂性、低样本效率以及模拟与现实之间的差距限制了其在公共安全应用中的实用性。最近在大型语言模型(LLMs)方面的进展提供了一种有前景的替代方案。凭借强大的推理和泛化能力,LLMs可以通过上下文学习(ICL)适应新任务,通过自然语言提示和示例指导实现任务适应,而无需重新训练。将LLMs部署在网络边缘而非云端,进一步减少了延迟并保护了数据隐私,使其适合用于实时、任务关键的公共安全UAV。本文提出将LLM辅助的ICL与公共安全UAV集成,以解决应急响应中的关键功能,如路径规划和速度控制。我们通过数据收集调度案例研究展示了LLM辅助的ICL框架可以显著减少数据包丢失,同时缓解潜在的破解漏洞。最后,我们讨论了LLM优化器并概述了未来的研究方向。ICL框架使公共安全UAV能够进行适应性强、上下文感知的决策,提供了一种轻量级且高效的解决方案,以增强UAV在紧急情况下的自主性和响应性。
Summary / 总结
This paper addresses the need for enhanced situational awareness in public safety UAVs during emergencies by integrating Large Language Model-Enabled In-Context Learning (LLM-ICL). The method leverages LLMs for task adaptation through natural language prompts and example-based guidance, reducing the need for retraining. Key experimental findings show that the LLM-assisted ICL framework can significantly reduce packet loss in data collection scheduling compared to conventional approaches, while also mitigating potential security vulnerabilities.
本文通过将大型语言模型(LLMs)与In-Context Learning(ICL)结合,解决公共安全无人机在应急响应中的优化问题。方法利用LLMs通过自然语言提示和示例指导来执行任务适应,减少重新训练的需要。关键实验发现表明,LLM辅助的ICL框架在数据收集调度中可以显著减少数据包丢失,提高无人机在紧急情况下的响应能力和自主性。
Arbor: A Framework for Reliable Navigation of Critical Conversation Flows
Authors: Luís Silva, Diogo Gonçalves, Catarina Farinha, Clara Matos, Luís Ungaro
First: 2026-02-16T11:09:02+00:00 · Latest: 2026-02-17T16:44:27+00:00
Abstract
Large language models struggle to maintain strict adherence to structured workflows in high-stakes domains such as healthcare triage. Monolithic approaches that encode entire decision structures within a single prompt are prone to instruction-following degradation as prompt length increases, including lost-in-the-middle effects and context window overflow. To address this gap, we present Arbor, a framework that decomposes decision tree navigation into specialized, node-level tasks. Decision trees are standardized into an edge-list representation and stored for dynamic retrieval. At runtime, a directed acyclic graph (DAG)-based orchestration mechanism iteratively retrieves only the outgoing edges of the current node, evaluates valid transitions via a dedicated LLM call, and delegates response generation to a separate inference step. The framework is agnostic to the underlying decision logic and model provider. Evaluated against single-prompt baselines across 10 foundation models using annotated turns from real clinical triage conversations, Arbor improves mean turn accuracy by 29.4 percentage points, reduces per-turn latency by 57.1%, and achieves an average 14.4x reduction in per-turn cost. These results indicate that architectural decomposition reduces dependence on intrinsic model capability, enabling smaller models to match or exceed larger models operating under single-prompt baselines.
中文标题/摘要
标题:Arbor:一种可靠导航关键对话流程的框架
大型语言模型在医疗急救等高风险领域难以严格遵循结构化工作流程。单一提示中编码整个决策结构的 monolithic 方法随着提示长度增加而容易出现指令遵循退化,包括中间迷失效应和上下文窗口溢出。为了解决这一问题,我们提出了 Arbor,一种将决策树导航分解为专门的节点级任务的框架。决策树被标准化为边列表表示并存储以供动态检索。在运行时,基于有向无环图(DAG)的编排机制会迭代地检索当前节点的出边,通过专用的LLM调用评估有效的过渡,并将响应生成委托给单独的推理步骤。该框架对底层决策逻辑和模型提供商是无感知的。在10个基础模型上使用实际临床分诊对话的标注轮次与单提示基线进行评估。Arbor 将平均轮次准确性提高了29.4个百分点,将每轮次延迟减少了57.1%,并实现了平均每轮次成本14.4倍的降低。这些结果表明,架构分解减少了对内在模型能力的依赖,使较小的模型能够匹配或超越在单提示基线下运行的较大模型。
Summary / 总结
The paper introduces Arbor, a framework that improves how reliably large language models follow structured workflows in high-stakes domains such as healthcare triage by decomposing decision tree navigation into specialized, node-level tasks. Decision trees are converted into an edge-list representation for dynamic retrieval, and a DAG-based orchestration mechanism retrieves only the outgoing edges of the current node, evaluates valid transitions with a dedicated LLM call, and delegates response generation to a separate inference step. Evaluated across 10 foundation models against single-prompt baselines, Arbor improves mean turn accuracy by 29.4 percentage points, reduces per-turn latency by 57.1%, and achieves an average 14.4x reduction in per-turn cost.
论文提出了名为Arbor的框架,旨在通过将决策树导航分解为专门任务来提高大型语言模型在高风险领域如医疗急救中的可靠性。该框架将决策树转换为边列表表示,并使用基于有向无环图(DAG)的编排机制,逐迭代检索和评估有效的转换。在10个基础模型上的评估显示,Arbor将平均回合准确率提高了29.4个百分点,将每回合的延迟降低了57.1%,并且将每回合的成本平均降低了14.4倍,相比单指令基线模型。
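The node-level navigation loop described for Arbor can be sketched as follows. This is an illustrative approximation under stated assumptions, not the framework's actual code: the edge list `EDGES`, the node names, and the helper `choose_transition` are hypothetical, and a simple substring match stands in for the dedicated LLM call that evaluates transitions. The point it shows is that each turn only sees the outgoing edges of the current node rather than the whole tree.

```python
from collections import defaultdict

# Hypothetical triage tree stored as an edge list: (source, condition, target).
EDGES = [
    ("start", "chest pain", "cardiac"),
    ("start", "fever", "infection"),
    ("cardiac", "shortness of breath", "emergency"),
    ("cardiac", "mild", "schedule_visit"),
]


def build_index(edges):
    """Group edges by source node so outgoing edges can be retrieved per node."""
    index = defaultdict(list)
    for source, condition, target in edges:
        index[source].append((condition, target))
    return index


def choose_transition(user_turn, outgoing):
    """Stand-in for the LLM transition check: first condition found in the turn."""
    text = user_turn.lower()
    for condition, target in outgoing:
        if condition in text:
            return target
    return None  # no valid transition; stay on the current node


def navigate(index, start, turns):
    """Walk the tree one turn at a time, retrieving only the current node's edges."""
    node = start
    path = [node]
    for turn in turns:
        outgoing = index.get(node, [])
        target = choose_transition(turn, outgoing)
        if target is not None:
            node = target
            path.append(node)
    return path


path = navigate(build_index(EDGES), "start",
                ["I have chest pain", "and shortness of breath"])
```

Swapping `choose_transition` for a model call would keep the prompt size proportional to the number of outgoing edges of one node, which is the property the abstract attributes to the decomposition.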
Proactive Conversational Assistant for a Procedural Manual Task based on Audio and IMU
Authors: Rehana Mahfuz, Yinyi Guo, Erik Visser, Phanidhar Chinchili
First: 2026-02-17T16:41:51+00:00 · Latest: 2026-02-17T16:41:51+00:00
Comments: 3 figures
Abstract
Real-time conversational assistants for procedural tasks often depend on video input, which can be computationally expensive and compromise user privacy. For the first time, we propose a real-time conversational assistant that provides comprehensive guidance for a procedural task using only lightweight privacy-preserving modalities such as audio and IMU inputs from a user's wearable device to understand the context. This assistant proactively communicates step-by-step instructions to a user performing a furniture assembly task, and answers user questions. We construct a dataset containing conversations where the assistant guides the user in performing the task. On observing that an off-the-shelf language model is a very talkative assistant, we design a novel User Whim Agnostic (UWA) LoRA finetuning method which improves the model's ability to suppress less informative dialogues, while maintaining its tendency to communicate important instructions. This leads to >30% improvement in the F-score. Finetuning the model also results in a 16x speedup by eliminating the need to provide in-context examples in the prompt. We further describe how such an assistant is implemented on edge devices with no dependence on the cloud.
中文标题/摘要
标题:基于音频和IMU的程序性任务前瞻对话助手
实时对话助手通常依赖视频输入,这会带来计算成本高昂并损害用户隐私的问题。我们首次提出了一种仅使用用户穿戴设备提供的轻量级隐私保护模态(如音频和IMU输入)来提供程序性任务全面指导的实时对话助手。该助手主动向正在进行家具组装任务的用户提供逐步指令,并回答用户问题。我们构建了一个包含助手指导用户完成任务的对话数据集。观察到现成的语言模型是一个非常健谈的助手后,我们设计了一种名为用户意愿无关(UWA)的LoRA微调方法,该方法提高了模型抑制不相关信息对话的能力,同时保持其传达重要指令的倾向。这导致F分数提高了超过30%。微调模型还导致了16倍的速度提升,因为不再需要在提示中提供上下文示例。我们还描述了如何在不依赖云服务的情况下将此类助手部署在边缘设备上。
Summary / 总结
The research addresses the limitations of video-based conversational assistants by proposing a real-time assistant that relies only on audio and IMU inputs from a wearable device. It constructs a conversation dataset for a furniture assembly task and develops a User Whim Agnostic (UWA) LoRA finetuning method that suppresses less informative dialogue while preserving important instructions, yielding a more than 30% improvement in F-score. Finetuning also gives a 16x speedup by removing the need for in-context examples in the prompt. The assistant is implemented on edge devices without cloud dependency.
本文通过提出一种使用穿戴设备的音频和IMU输入的实时助手,解决了基于视频的对话助手的局限性。该助手在用户进行家具组装任务时提供逐步指导并回答问题。文中设计了一种名为User Whim Agnostic (UWA) 的LoRA微调方法,以提高模型抑制信息量较低对话的能力,同时保持其传达重要指令的倾向,使F分数提高了30%以上,并通过省去提示中的上下文示例实现了16倍的速度提升。该助手可部署在边缘设备上,无需依赖云服务。